Run any GGUF, vLLM-compatible, or MLX model on your own Kubernetes clusters, with the same operational primitives as the rest of your fleet.
What LLMKube is for
LLMKube is a Kubernetes operator for self-hosted inference. Install it in your cluster, declare the models you want to serve as Model custom resources, and declare how to serve them as InferenceService custom resources. The operator handles model download, pod scheduling, health checks, metrics, and, on Apple Silicon, native process supervision.
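Declaratively, that workflow might look like the following sketch. The field names and API group are illustrative assumptions, not the authoritative CRD schema; see the CRD reference for exact fields.

```yaml
# Hypothetical manifests -- apiVersion, field names, and the model URL are
# placeholders, not the actual LLMKube schema.
apiVersion: llmkube.io/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source:
    url: https://example.com/models/llama-3-8b.Q4_K_M.gguf  # placeholder URL
    format: gguf
---
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-chat
spec:
  modelRef:
    name: llama-3-8b   # binds this service to the Model above
  runtime: llama.cpp   # backend selection, per the runtime field
  replicas: 1
```

Applying both manifests would have the operator download the model and schedule the serving pods; the InferenceService status conditions report readiness.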
Where it sits relative to other projects in this space:
- Multi-runtime by design. One CRD, four backends. The `runtime` field on InferenceService selects llama.cpp, vLLM, TGI, or PersonaPlex; the metal-agent path adds oMLX and Ollama for Apple Silicon. Mix runtimes on the same cluster without standing up a different operator for each.
- Native Apple Silicon, not just NVIDIA. The metal-agent runs as a native macOS process, manages llama-server / oMLX / Ollama lifecycles directly on the host, and registers Endpoints back into Kubernetes. M-series boxes participate in the same control plane as your GPU nodes.
- Single-tenant boxes and multi-GPU clusters. The same operator deploys to a Mac mini, a Minikube cluster, or a multi-node GPU cluster with layer-sharded models. No split between "edge" and "datacenter" tooling.
- Kubernetes-native, not Kubernetes-adjacent. Everything is a CRD with proper status conditions, events, owner references, and Prometheus metrics. `kubectl get`, HPA, NetworkPolicy, and PriorityClass behave the way you expect.
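Because the operator exposes standard workload objects, stock Kubernetes tooling applies directly. For instance, a plain `autoscaling/v2` HorizontalPodAutoscaler can target the Deployment the operator creates; the target name here assumes the Deployment is named after the InferenceService, which is a naming-convention assumption.

```yaml
# Standard HPA targeting the Deployment assumed to back an InferenceService
# named llama-3-8b-chat. Only the target name is LLMKube-specific (assumed).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-3-8b-chat
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-8b-chat
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```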
What's new in 0.7.6
Released this week.
- Memory-pressure protection. The metal-agent watches host memory, surfaces a `MemoryPressure` condition on every managed InferenceService, and (when enabled) evicts the lowest-priority workload before the kernel does. Includes per-service `evictionProtection` for production-critical traffic. Read the docs. PR #382, #384
- `RuntimeClassName`, `podAnnotations`, `podLabels`. New first-class fields on InferenceService for selecting a Kubernetes `RuntimeClass` (typical use: `nvidia` on clusters where the NVIDIA runtime isn't the default) and for tagging inference pods for cost attribution, service-mesh routing, and custom admission controllers. CRD reference. PR #380, #381
- `modelRef` is now mutable. Switching the model on an existing InferenceService now triggers a rolling update instead of being silently ignored. Closes a long-standing footgun where stale pods kept serving the old model after a `kubectl edit`. PR #385
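Put together, the 0.7.6 additions might appear on an InferenceService spec like this. Exact field paths and values are assumptions for illustration; consult the CRD reference for the authoritative schema.

```yaml
# Sketch of the 0.7.6 fields -- apiVersion, runtime identifier, and field
# placement are assumptions, not the verified schema.
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-chat
spec:
  modelRef:
    name: llama-3-8b        # now mutable: editing this triggers a rolling update
  runtime: vllm
  runtimeClassName: nvidia  # select a non-default container runtime
  evictionProtection: true  # opt this service out of memory-pressure eviction
  podAnnotations:
    cost-center: ml-inference   # e.g. for cost attribution
  podLabels:
    team: platform              # e.g. for service-mesh routing or admission rules
```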