v0.6.0 Open Source · Kubernetes Native · vLLM + TGI + llama.cpp + Ollama

Run production LLMs
on your own hardware

We analyzed 200 GitHub issues with a 14B model on two $400 GPUs. Total cost: one cent. LLMKube makes self-hosted inference actually work.

See it in action

Deploy LLMs with any runtime in seconds using the llmkube CLI

terminal
$ llmkube deploy llama-3.1-8b --gpu --runtime vllm
🚀 Deploying LLM inference service
═══════════════════════════════════════════════
Name: llama-3.1-8b
Runtime: vllm
Accelerator: cuda
GPU: 2 x nvidia
📦 Creating Model 'llama-3.1-8b'...
✅ Model created
⚙️ Creating InferenceService 'llama-3.1-8b'...
✅ InferenceService created (runtime: vllm)
Step 1/4: Deploy with vLLM runtime

Why LLMKube?

Local LLMs are great for prototyping. Scaling them for a team is where it gets hard.

The scaling problem

  • × Silent failures with no alerts
  • × Multi-GPU memory math by trial and error
  • × Updates that break your setup
  • × Docker Compose that doesn't scale
  • × One person managing everything
  • × Every machine set up by hand

With LLMKube

  • Pluggable runtimes: vLLM, TGI, llama.cpp, or bring your own
  • HPA autoscaling that responds to real inference metrics
  • GPU layer offloading with custom sharding splits
  • Infrastructure as code, not scripts and duct tape
  • Grafana dashboards for inference metrics out of the box
  • CUDA 13 and NVIDIA Blackwell GPU support
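As a rough illustration of what autoscaling on an inference metric can look like, here is a minimal sketch of a standard Kubernetes `autoscaling/v2` HorizontalPodAutoscaler. It is not taken from the LLMKube docs: the target Deployment name, the metric name `inference_requests_in_flight`, and the replica bounds are all assumptions, and a Pods-type metric like this only works if a custom metrics adapter (for example, Prometheus Adapter) is installed in the cluster.

```yaml
# Sketch only: names and thresholds below are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-3-1-8b-hpa            # hypothetical HPA name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-1-8b              # assumed Deployment serving the model
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_in_flight   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "8"         # scale out above ~8 in-flight requests per pod
```

Scaling on a request-level metric rather than CPU tends to fit GPU-bound inference better, since CPU utilization often stays low while the GPU is saturated.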

vLLM for speed. TGI for flexibility. llama.cpp for efficiency. LLMKube for all of them.

One operator, every runtime. The platform layer your inference stack is missing.

Deploy an LLM in seconds

Simple, declarative YAML that feels native to Kubernetes developers

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: phi-3-mini
spec:
  source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 1
  resources:
    cpu: "2"
    memory: "4Gi"
Supports GGUF models from HuggingFace, with automatic download and caching

Limited to 10 Teams

Early Adopter Program

Help shape the future of LLMKube and get direct access to the maintainer.

What you get

  • Private Discord with other early adopters
  • Direct input on the roadmap
  • Your logo on our website (when ready)
  • Early access to new features

What we need

  • Real-world feedback on your use case
  • 30 minutes monthly for a feedback call
  • Permission to share your story (anonymized if needed)

Apply to join

Ready to deploy your first LLM?

Join the community of developers deploying LLMs on Kubernetes.

Open source and free forever

LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.