Architecture
LLMKube is two cooperating processes: an in-cluster controller that owns the Kubernetes-side desired state, and an out-of-cluster metal-agent that owns OS-level process supervision on Apple Silicon hosts.
Why two layers?
Kubernetes models pod lifecycle in terms of containers, cgroups, and a CRI-compliant runtime. Apple Silicon doesn't expose any of those at the level the kubelet needs, and the high-performance Metal-accelerated paths for inference (llama-server compiled for Metal, oMLX, Ollama) are native macOS processes, not container images.
The split is the simplest thing that works: the controller owns everything that lives inside the cluster (CRDs, Deployments on Linux/GPU nodes, Services, metrics scrape targets), and the agent owns everything that lives on the Mac (process supervision, working-directory ownership, host memory observation, model file management). They communicate through the Kubernetes API: the agent watches Models and InferenceServices, posts status back, and registers Endpoints objects so cluster Services route traffic to the off-cluster process.
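Concretely, the cluster-side wiring for a Metal-backed service is standard Kubernetes plumbing: a Service with no pod to select, plus an `Endpoints` object the agent keeps pointed at the Mac. A minimal sketch, with illustrative name, namespace, IP, and port (the real objects are created by the controller and agent, not by hand):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: llama-3-chat      # must match the Service name so kube-proxy routes to it
  namespace: default
subsets:
  - addresses:
      - ip: 192.168.1.50  # the Mac host's reachable IP
    ports:
      - port: 8080        # the native runtime's listen port
        protocol: TCP
```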
A pure containerized approach would either lose Metal acceleration (running llama.cpp under Linux containers on macOS) or require custom virtualization plumbing. A pure agent-only approach would lose all the K8s-native operational surface (HPA, NetworkPolicy, PodDisruptionBudgets, the entire observability stack). The two-layer model keeps both.
Reconciliation flow
Walk through what happens when you run `kubectl apply -f model.yaml && kubectl apply -f isvc.yaml`.
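For concreteness, the two manifests might look like the sketch below. Only `spec.hardware.accelerator`, `spec.modelRef`, `spec.priority`, and the runtime backend names are taken from this page; the API group/version and every other key are illustrative assumptions, not the real schema (see the CRD reference for that).

```yaml
# model.yaml -- illustrative; group/version and most keys are assumptions
apiVersion: llmkube.io/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: https://example.com/models/llama-3-8b.Q4_K_M.gguf  # GGUF source URL
  hardware:
    accelerator: metal   # no in-cluster download Job; the metal-agent fetches on first use
---
# isvc.yaml -- illustrative
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-chat
spec:
  modelRef: llama-3-8b   # must resolve to a Ready Model
  runtime: llamacpp      # one of: llamacpp, vllm, tgi, personaplex, generic
  priority: 10           # queue/preemption weight when GPUs are scarce
```

Applying both kicks off the following sequence: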
- Model created. The controller sees the new `Model`. If the Model targets a CUDA/CPU backend, the controller creates a Job that downloads the GGUF from the source URL into the namespace's model-cache PVC. If `spec.hardware.accelerator: metal`, no in-cluster Job runs; the metal-agent will fetch on first use.
- Model becomes Ready. Once the file is on disk and the SHA matches, the controller patches `.status.phase = Ready` and parses the GGUF header into `.status.gguf` (architecture, layer count, context length); see the status sketch after this list.
- InferenceService created. The controller resolves `spec.modelRef`, blocks until the Model is Ready, then materializes a Deployment, Service, and PodMonitor. The runtime backend (`llamacpp`, `vllm`, `tgi`, `personaplex`, `generic`) decides the container image, args, and the three-probe health check shape.
- Pod scheduling. If GPUs are scarce, the controller writes a `WaitingForGPU` phase plus a queue position based on `spec.priority`. Higher-priority services preempt lower-priority ones when capacity frees up.
- Metal-agent path (if applicable). If the resolved Model targets Metal, the agent on the assigned host picks up the desired state, downloads the weights if needed, spawns the runtime (llama-server, oMLX, Ollama), and registers an `Endpoints` object pointing to the host's IP. The cluster Service routes through normal kube-proxy paths.
- Ready. The startup probe transitions to passing when llama-server returns `200 OK` on `/health`. The controller patches `.status.phase = Ready` and the Service starts accepting traffic.
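As a reference point, here is roughly what the Model's status looks like after the second step. `.status.phase` and `.status.gguf` come from this page; the individual `gguf` sub-keys are a sketch inferred from the fields the controller parses (architecture, layer count, context length), not the documented schema.

```yaml
# Sketch of a Ready Model's status; sub-key names under gguf are assumptions
status:
  phase: Ready
  gguf:                   # parsed from the GGUF header
    architecture: llama
    layerCount: 32
    contextLength: 8192
```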
Component reference
| Component | Where it runs | What it owns | Talks to |
|---|---|---|---|
| controller | Deployment in `llmkube-system` | Model + InferenceService reconciliation; Deployments, Services, Jobs, PodMonitors; status conditions; controller-side metrics | Kubernetes API server |
| metal-agent | Native macOS process under `launchd` | Native llama-server / oMLX / Ollama process tree; on-host model files; host memory watchdog; per-process eviction | Kubernetes API server (via kubeconfig); local OS |
| runtime pod | Pod on a Linux/GPU node | The actual inference process (llama.cpp, vLLM, TGI); exposes `/v1/chat/completions` and `/metrics` | Inbound: Service traffic. Outbound: model cache PVC |
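The scrape-target wiring follows the Prometheus Operator convention. Here is a sketch of the PodMonitor the controller might create for a runtime pod; the label key and port name are assumptions, not the controller's actual output:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llama-3-chat
spec:
  selector:
    matchLabels:
      llmkube.io/inference-service: llama-3-chat  # hypothetical label
  podMetricsEndpoints:
    - port: http     # name of the container port serving /metrics (assumed)
      path: /metrics
```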
Looking for the field-by-field schema? See the CRD reference.