Run production LLMs
on your own hardware
A Kubernetes operator for self-hosted LLM inference. Runtimes: vLLM, llama.cpp, TGI. Hardware: NVIDIA, Apple Silicon. Recently a local model on two $400 GPUs wrote its own next feature, merged as PR #283.
See it in action
Deploy LLMs with any runtime in seconds using the llmkube CLI
Recent posts from the lab
Back to Shadowstack: a 35B at 256K context (and 512K with YaRN) on two consumer Blackwell cards
We've been deep on Apple Silicon for a couple of months; this weekend I turned back to Shadowstack (two consumer RTX 5060 Ti) to see what…
Introducing Foreman: a Kubernetes-native orchestrator for your local LLM fleet (LLMKube 0.8.0)
LLMKube 0.8.0 ships Foreman, an opt-in add-on that dispatches agentic workloads (coder, verifier, reviewer) across a heterogeneous fleet of…
What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support
0.7.9 adds mlx-server as a first-class runtime on the metal-agent: an OpenAI-compatible MLX inference server you select with --runtime…
Why LLMKube?
Local LLMs are great for prototyping. Scaling them for a team is where it gets hard.
The scaling problem
- × Silent failures with no alerts
- × Multi-GPU memory math by trial and error
- × Updates that break your setup
- × Docker Compose that doesn't scale
- × One person managing everything
- × Every machine set up by hand
With LLMKube
- Pluggable runtimes: vLLM, TGI, llama.cpp, or bring your own
- HPA autoscaling that responds to real inference metrics
- GPU layer offloading with custom sharding splits
- Infrastructure as code, not scripts and duct tape
- Grafana dashboards for inference metrics out of the box
- CUDA 13 and NVIDIA Blackwell GPU support
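
To make the autoscaling item above concrete, here is a minimal sketch of a standard Kubernetes `autoscaling/v2` HorizontalPodAutoscaler driven by a per-pod custom metric. It is an illustration, not LLMKube's own configuration: the Deployment name `phi-3-mini`, the metric name `inference_requests_in_flight`, and the replica bounds are all hypothetical placeholders, and the resources LLMKube actually creates may be named and wired differently. Custom pod metrics like this also assume a metrics adapter (for example, Prometheus Adapter) is installed in the cluster.

```yaml
# Hypothetical sketch: a plain autoscaling/v2 HPA scaling on a per-pod custom metric.
# Names below are placeholders, not identifiers taken from LLMKube.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: phi-3-mini-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: phi-3-mini                          # assumed name of the workload serving the model
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_in_flight  # hypothetical metric exposed by the inference pods
        target:
          type: AverageValue
          averageValue: "4"                   # scale out when the per-pod average exceeds 4
```

Scaling on an inference-level signal such as in-flight requests, rather than CPU utilization, tends to track GPU-bound load more closely, since a saturated GPU can leave the CPU mostly idle.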
vLLM for speed. TGI for flexibility. llama.cpp for efficiency. LLMKube for all of them.
One operator, every runtime. The platform layer your inference stack is missing.
Deploy an LLM in seconds
Simple, declarative YAML that feels native to Kubernetes developers
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: phi-3-mini
spec:
  source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 1
  resources:
    cpu: "2"
    memory: "4Gi"

Early Adopter Program
Help shape the future of LLMKube and get direct access to the maintainer.
What you get
- Private Discord with other early adopters
- Direct input on the roadmap
- Your logo on our website (when ready)
- Early access to new features
What we need
- Real-world feedback on your use case
- 30 minutes monthly for a feedback call
- Permission to share your story (anonymized if needed)
Apply to join
Ready to deploy your first LLM?
Join the community of developers deploying LLMs on Kubernetes.
Open source and free forever