
Documentation

Run any GGUF, vLLM-compatible, or MLX model on your own Kubernetes clusters, with the same operational primitives as the rest of your fleet.

What LLMKube is for

LLMKube is a Kubernetes operator for self-hosted inference. Install it in your cluster, declare the models you want to serve as Model custom resources, and declare how to serve them as InferenceService custom resources. The operator handles model download, pod scheduling, health checks, metrics, and, on Apple Silicon, native process supervision.
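A minimal sketch of that flow. The apiVersion group, the Model's source layout, and the model URL below are assumptions for illustration; Model, InferenceService, modelRef, and the runtime field are the actual surface this page describes:

    # apiVersion group and spec.source shape are assumed for illustration.
    apiVersion: llmkube.io/v1alpha1
    kind: Model
    metadata:
      name: llama-3-8b
    spec:
      source:
        url: https://example.com/llama-3-8b.Q4_K_M.gguf  # any GGUF URL
    ---
    apiVersion: llmkube.io/v1alpha1
    kind: InferenceService
    metadata:
      name: llama-3-8b-svc
    spec:
      modelRef:
        name: llama-3-8b   # the Model declared above
      runtime: llama.cpp   # backend selection, covered below

Apply both with kubectl apply -f and the operator takes it from there.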

Where it sits relative to other projects in this space:

  • Multi-runtime by design. One CRD, four backends. The runtime field on InferenceService selects llama.cpp, vLLM, TGI, or PersonaPlex; the metal-agent path adds oMLX and Ollama for Apple Silicon. Mix runtimes on the same cluster without standing up a different operator for each (see the sketch after this list).
  • Native Apple Silicon, not just NVIDIA. The metal-agent runs as a native macOS process, manages llama-server / oMLX / Ollama lifecycles directly on the host, and registers Endpoints back into Kubernetes. M-series boxes participate in the same control plane as your GPU nodes.
  • Single-tenant boxes and multi-GPU clusters. The same operator deploys to a Mac mini, a Minikube cluster, or a multi-node GPU cluster with layer-sharded models. No split between "edge" and "datacenter" tooling.
  • Kubernetes-native, not Kubernetes-adjacent. Everything is a CRD with proper status conditions, events, owner references, and Prometheus metrics. kubectl get, HPA, NetworkPolicy, and PriorityClass behave the way you expect.
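The sketch the first bullet refers to: two InferenceServices on one cluster, differing only in runtime. The apiVersion group, resource names, and exact spelling of the runtime values are assumptions for illustration:

    apiVersion: llmkube.io/v1alpha1
    kind: InferenceService
    metadata:
      name: chat-gguf
    spec:
      modelRef:
        name: mistral-7b-gguf
      runtime: llama.cpp   # GGUF path
    ---
    apiVersion: llmkube.io/v1alpha1
    kind: InferenceService
    metadata:
      name: chat-vllm
    spec:
      modelRef:
        name: mistral-7b-hf
      runtime: vLLM        # same cluster, different backend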

What's new in 0.7.6

Released this week
Memory-pressure protection
The metal-agent watches host memory, surfaces a MemoryPressure condition on every managed InferenceService, and (when enabled) evicts the lowest-priority workload before the kernel does. Includes per-service evictionProtection for production-critical traffic. Read the docs (PRs #382, #384).
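A sketch of opting a production service out of eviction, then reading the new condition. The field placement, apiVersion group, and resource names are assumptions; evictionProtection and the MemoryPressure condition type are what the release note above describes:

    apiVersion: llmkube.io/v1alpha1   # group assumed
    kind: InferenceService
    metadata:
      name: prod-chat
    spec:
      modelRef:
        name: llama-3-8b
      runtime: oMLX              # metal-agent-managed backend
      evictionProtection: true   # exempt from memory-pressure eviction

    # Check the condition the metal-agent surfaces:
    kubectl get inferenceservice prod-chat \
      -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'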
runtimeClassName, podAnnotations, podLabels
New first-class fields on InferenceService for selecting a Kubernetes RuntimeClass (typical use: nvidia on clusters where the NVIDIA runtime isn't the default) and for tagging inference pods for cost-attribution, service-mesh routing, and custom admission controllers. CRD reference (PRs #380, #381).
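For example (apiVersion group, names, and the Istio annotation are illustrative assumptions; runtimeClassName, podLabels, and podAnnotations are the new fields):

    apiVersion: llmkube.io/v1alpha1
    kind: InferenceService
    metadata:
      name: gpu-chat
    spec:
      modelRef:
        name: llama-3-70b
      runtime: vLLM
      runtimeClassName: nvidia            # NVIDIA runtime isn't the default here
      podLabels:
        team: ml-platform                 # cost attribution
      podAnnotations:
        sidecar.istio.io/inject: "true"   # e.g. service-mesh injection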
modelRef is now mutable
Switching the model on an existing InferenceService now triggers a rolling update instead of being silently ignored. Closes a long-standing footgun where stale pods kept serving the old model after a kubectl edit. (PR #385)
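In practice, an in-place swap like the following now rolls the pods. Names are illustrative, and that the operator backs each service with a same-named Deployment is our assumption:

    kubectl patch inferenceservice chat-gguf --type merge \
      -p '{"spec":{"modelRef":{"name":"mistral-7b-v0.3-gguf"}}}'
    kubectl rollout status deployment/chat-gguf   # watch the rolling update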
