Run any GGUF, vLLM-compatible, or MLX model on your own Kubernetes clusters, with the same operational primitives as the rest of your fleet.
What LLMKube is for
LLMKube is a Kubernetes operator for self-hosted inference. Install it in your cluster, declare the models you want to serve as Model custom resources, and declare how to serve them as InferenceService custom resources. The operator handles model download, pod scheduling, health checks, metrics, and, on Apple Silicon, native process supervision.
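Declaratively, that workflow might look like the following sketch. The field names and API group are illustrative assumptions, not the authoritative CRD schema; see the CRD reference for exact fields.

```yaml
# Hypothetical manifests -- apiVersion, field names, and the model URL are
# placeholders, not the actual LLMKube schema.
apiVersion: llmkube.io/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source:
    url: https://example.com/models/llama-3-8b.Q4_K_M.gguf  # placeholder URL
    format: gguf
---
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-chat
spec:
  modelRef:
    name: llama-3-8b   # binds this service to the Model above
  runtime: llama.cpp   # backend selection, per the runtime field
  replicas: 1
```

Applying both manifests would have the operator download the model and schedule the serving pods; the InferenceService status conditions report readiness.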
Where it sits relative to other projects in this space:
- Multi-runtime by design. One CRD, four backends. The `runtime` field on InferenceService selects llama.cpp, vLLM, TGI, or PersonaPlex; the metal-agent path adds oMLX and Ollama for Apple Silicon. Mix runtimes on the same cluster without standing up a different operator for each.
- Native Apple Silicon, not just NVIDIA. The metal-agent runs as a native macOS process, manages llama-server / oMLX / Ollama lifecycles directly on the host, and registers Endpoints back into Kubernetes. M-series boxes participate in the same control plane as your GPU nodes.
- Single-tenant boxes and multi-GPU clusters. The same operator deploys to a Mac mini, a Minikube cluster, or a multi-node GPU cluster with layer-sharded models. No split between "edge" and "datacenter" tooling.
- Kubernetes-native, not Kubernetes-adjacent. Everything is a CRD with proper status conditions, events, owner references, and Prometheus metrics. `kubectl get`, HPA, NetworkPolicy, and PriorityClass behave the way you expect.
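Because the operator exposes standard workload objects, stock Kubernetes tooling applies directly. For instance, a plain `autoscaling/v2` HorizontalPodAutoscaler can target the Deployment the operator creates; the target name here assumes the Deployment is named after the InferenceService, which is a naming-convention assumption.

```yaml
# Standard HPA targeting the Deployment assumed to back an InferenceService
# named llama-3-8b-chat. Only the target name is LLMKube-specific (assumed).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-3-8b-chat
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-8b-chat
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```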
What's new in 0.7.6
Released this week.
- Memory-pressure protection. The metal-agent watches host memory, surfaces a `MemoryPressure` condition on every managed InferenceService, and (when enabled) evicts the lowest-priority workload before the kernel does. Includes per-service `evictionProtection` for production-critical traffic. Read the docs. PR #382, #384
- `RuntimeClassName`, `podAnnotations`, `podLabels`. New first-class fields on InferenceService for selecting a Kubernetes `RuntimeClass` (typical use: `nvidia` on clusters where the NVIDIA runtime isn't the default) and for tagging inference pods for cost attribution, service-mesh routing, and custom admission controllers. CRD reference. PR #380, #381
- `modelRef` is now mutable. Switching the model on an existing InferenceService now triggers a rolling update instead of being silently ignored. Closes a long-standing footgun where stale pods kept serving the old model after a `kubectl edit`. PR #385
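Put together, the 0.7.6 additions might appear on an InferenceService spec like this. Exact field paths and values are assumptions for illustration; consult the CRD reference for the authoritative schema.

```yaml
# Sketch of the 0.7.6 fields -- apiVersion, runtime identifier, and field
# placement are assumptions, not the verified schema.
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-chat
spec:
  modelRef:
    name: llama-3-8b        # now mutable: editing this triggers a rolling update
  runtime: vllm
  runtimeClassName: nvidia  # select a non-default container runtime
  evictionProtection: true  # opt this service out of memory-pressure eviction
  podAnnotations:
    cost-center: ml-inference   # e.g. for cost attribution
  podLabels:
    team: platform              # e.g. for service-mesh routing or admission rules
```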