Architecture

LLMKube is two cooperating processes: an in-cluster controller that owns the Kubernetes-side desired state, and an out-of-cluster metal-agent that owns OS-level process supervision on Apple Silicon hosts.

Two cooperating processes. The controller owns Kubernetes-side desired state; the metal-agent owns OS-level process supervision and registers Endpoints back into the cluster.

Why two layers?

Kubernetes models pod lifecycle in terms of containers, cgroups, and a CRI-compliant runtime. Apple Silicon doesn't expose any of those at the level the kubelet needs, and the high-performance Metal-accelerated paths for inference (llama-server compiled for Metal, oMLX, Ollama) are native macOS processes, not container images.

The split is the simplest thing that works: the controller owns everything that lives inside the cluster (CRDs, Deployments on Linux/GPU nodes, Services, metrics scrape targets), and the agent owns everything that lives on the Mac (process supervision, working-directory ownership, host memory observation, model file management). They communicate through the Kubernetes API: the agent watches Models and InferenceServices, posts status back, and registers Endpoints objects so cluster Services route traffic to the off-cluster process.

A pure containerized approach would either lose Metal acceleration (running llama.cpp under Linux containers on macOS) or require custom virtualization plumbing. A pure agent-only approach would lose all the K8s-native operational surface (HPA, NetworkPolicy, PodDisruptionBudgets, the entire observability stack). The two-layer model keeps both.

Reconciliation flow

Walk through what happens when you kubectl apply -f model.yaml && kubectl apply -f isvc.yaml:

Model created. The controller sees the new Model. If the source URL targets a CUDA/CPU backend, the controller creates a Job that downloads the GGUF into the namespace's model-cache PVC. If spec.hardware.accelerator: metal, no in-cluster Job runs; the metal-agent will fetch on first use.
Model becomes Ready. Once the file is on disk and the SHA matches, the controller patches .status.phase = Ready and parses the GGUF header into .status.gguf (architecture, layer count, context length).
InferenceService created. The controller resolves spec.modelRef, blocks until the Model is Ready, then materializes a Deployment, Service, and PodMonitor. The runtime backend (llamacpp, vllm, tgi, personaplex, generic) decides container image, args, and the three-probe health check shape.
Pod scheduling. If GPUs are scarce, the controller writes a WaitingForGPU phase plus a queue position based on spec.priority. Higher priority services preempt lower priority ones when capacity frees up.
Metal-agent path (if applicable). If the resolved Model targets Metal, the agent on the assigned host picks up the desired state, downloads the weights if needed, spawns the runtime (llama-server, oMLX, Ollama), and registers an Endpoints object pointing to the host's IP. The cluster Service routes through normal kube-proxy paths.
Ready. The startup probe transitions to passing when llama-server returns 200 OK on /health. The controller patches .status.phase = Ready and the Service starts accepting traffic.

Component reference

Component	Where it runs	What it owns	Talks to
controller	Deployment in `llmkube-system`	Model + InferenceService reconciliation; Deployments, Services, Jobs, PodMonitors; status conditions; controller-side metrics	Kubernetes API server
metal-agent	Native macOS process under `launchd`	Native llama-server / oMLX / Ollama process tree; on-host model files; host memory watchdog; per-process eviction	Kubernetes API server (via kubeconfig); local OS
runtime pod	Pod on a Linux/GPU node	The actual inference process (llama.cpp, vLLM, TGI). Exposes `/v1/chat/completions` and `/metrics`	Inbound: Service traffic. Outbound: model cache PVC

Looking for the field-by-field schema? See the CRD reference.