Architecture

LLMKube is two cooperating processes: an in-cluster controller that owns the Kubernetes-side desired state, and an out-of-cluster metal-agent that owns OS-level process supervision on Apple Silicon hosts.

[Architecture diagram: two zones. Top, the Kubernetes cluster: the LLMKube controller (a Deployment in llmkube-system) watches the Model and InferenceService custom resources and creates the owned objects (Job, Deployment, Service, PodMonitor); it reconciles CRDs, schedules pods and sets priority class, patches conditions and status, and exports controller metrics. Bottom, the Apple Silicon host (Mac mini, Mac Studio, MacBook Pro): the launchd-managed metal-agent polls the Kubernetes API server over HTTPS with kubeconfig credentials, supervises the native process tree (llama-server, one per InferenceService; oMLX and Ollama as shared daemons), watches host memory pressure, exports per-process metrics, and registers Endpoints back into the cluster.]
Two cooperating processes. The controller owns Kubernetes-side desired state; the metal-agent owns OS-level process supervision and registers Endpoints back into the cluster.

Why two layers?

Kubernetes models pod lifecycle in terms of containers, cgroups, and a CRI-compliant runtime. Apple Silicon doesn't expose any of those at the level the kubelet needs, and the high-performance Metal-accelerated paths for inference (llama-server compiled for Metal, oMLX, Ollama) are native macOS processes, not container images.

The split is the simplest thing that works: the controller owns everything that lives inside the cluster (CRDs, Deployments on Linux/GPU nodes, Services, metrics scrape targets), and the agent owns everything that lives on the Mac (process supervision, working-directory ownership, host memory observation, model file management). They communicate through the Kubernetes API: the agent watches Models and InferenceServices, posts status back, and registers Endpoints objects so cluster Services route traffic to the off-cluster process.
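To make that concrete, here is a minimal sketch of the Service/Endpoints pair, assuming the controller creates Metal-backed Services without a selector (so the built-in endpoints controller doesn't overwrite the agent's registration); every name, port, and address below is illustrative:

    apiVersion: v1
    kind: Service
    metadata:
      name: llama-3-8b-chat        # illustrative
      namespace: default
    spec:
      ports:                       # no selector: endpoints are managed by hand
        - port: 8080
          targetPort: 8080
    ---
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: llama-3-8b-chat        # must match the Service name
      namespace: default
    subsets:
      - addresses:
          - ip: 192.168.1.50       # the Mac's host IP
        ports:
          - port: 8080

kube-proxy then programs routes for the Service exactly as if the backend were a pod.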

A pure containerized approach would either lose Metal acceleration (running llama.cpp under Linux containers on macOS) or require custom virtualization plumbing. A pure agent-only approach would lose all the K8s-native operational surface (HPA, NetworkPolicy, PodDisruptionBudgets, the entire observability stack). The two-layer model keeps both.

Reconciliation flow

Walk through what happens when you run kubectl apply -f model.yaml && kubectl apply -f isvc.yaml.
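As a reference point, those two files might look roughly like this. The field names (spec.hardware.accelerator, spec.modelRef, spec.priority) come from this page; the API group/version, the source layout, and every value are illustrative assumptions:

    apiVersion: llmkube.io/v1alpha1     # group/version is a guess
    kind: Model
    metadata:
      name: llama-3-8b
    spec:
      source:
        url: https://example.com/llama-3-8b-Q4_K_M.gguf   # illustrative
      hardware:
        accelerator: metal              # skips the in-cluster download Job (step 1)
    ---
    apiVersion: llmkube.io/v1alpha1
    kind: InferenceService
    metadata:
      name: llama-3-8b-chat
    spec:
      modelRef:
        name: llama-3-8b                # resolved in step 3
      runtime: llamacpp                 # one of the backends named in step 3
      priority: 10                      # used for GPU queueing in step 4

With those applied, the flow is: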

  1. Model created. The controller sees the new Model. If the Model targets a CUDA/CPU backend, the controller creates a Job that downloads the GGUF from the source URL into the namespace's model-cache PVC. If spec.hardware.accelerator: metal, no in-cluster Job runs; the metal-agent will fetch on first use.
  2. Model becomes Ready. Once the file is on disk and the SHA matches, the controller patches .status.phase = Ready and parses the GGUF header into .status.gguf (architecture, layer count, context length); a sketch of the resulting status appears after this list.
  3. InferenceService created. The controller resolves spec.modelRef, blocks until the Model is Ready, then materializes a Deployment, Service, and PodMonitor. The runtime backend (llamacpp, vllm, tgi, personaplex, generic) decides container image, args, and the three-probe health check shape.
  4. Pod scheduling. If GPUs are scarce, the controller writes a WaitingForGPU phase plus a queue position based on spec.priority. Higher-priority services preempt lower-priority ones when capacity frees up.
  5. Metal-agent path (if applicable). If the resolved Model targets Metal, the agent on the assigned host picks up the desired state, downloads the weights if needed, spawns the runtime (llama-server, oMLX, Ollama), and registers an Endpoints object pointing to the host's IP. The cluster Service routes through normal kube-proxy paths.
  6. Ready. The startup probe transitions to passing when llama-server returns 200 OK on /health. The controller patches .status.phase = Ready and the Service starts accepting traffic.
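A hedged sketch of where step 2 leaves the Model object. The .status.phase and .status.gguf fields come from this page; the key spelling inside gguf and all values are assumptions:

    status:
      phase: Ready
      gguf:
        architecture: llama         # parsed from the GGUF header
        layerCount: 32              # illustrative
        contextLength: 8192         # illustrative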

Component reference

| Component | Where it runs | What it owns | Talks to |
| --- | --- | --- | --- |
| controller | Deployment in llmkube-system | Model + InferenceService reconciliation; Deployments, Services, Jobs, PodMonitors; status conditions; controller-side metrics | Kubernetes API server |
| metal-agent | Native macOS process under launchd | Native llama-server / oMLX / Ollama process tree; on-host model files; host memory watchdog; per-process eviction | Kubernetes API server (via kubeconfig); local OS |
| runtime pod | Pod on a Linux/GPU node | The actual inference process (llama.cpp, vLLM, TGI); exposes /v1/chat/completions and /metrics | Inbound: Service traffic. Outbound: model-cache PVC |
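The PodMonitor in the controller row is standard Prometheus Operator machinery pointed at the runtime pod's /metrics endpoint. A minimal sketch, assuming the runtime pods carry an llmkube.io/service label and name their metrics port metrics (both assumptions):

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: llama-3-8b-chat
    spec:
      selector:
        matchLabels:
          llmkube.io/service: llama-3-8b-chat   # hypothetical label
      podMetricsEndpoints:
        - port: metrics                         # assumed port name
          path: /metrics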

Looking for the field-by-field schema? See the CRD reference.
