CRD reference
LLMKube exposes two custom resources: Model describes a model file the operator should make available, and InferenceService describes how to serve it. Both live in the inference.llmkube.dev/v1alpha1 API group.
Model
A Model resource describes where to fetch a model from and what hardware acceleration it targets. The controller materializes a download Job (CUDA/CPU path) or hands off to the metal-agent (Metal path), then exposes the file at a stable on-disk location for any InferenceService that references it.
Example
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen3-8b
spec:
  source: https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    memoryBudget: 24Gi
  resources:
    cpu: "1"
    memory: 8Gi
```
Commonly used fields
| Field | Type | Description | Default |
|---|---|---|---|
| spec.source | string | URL or path. Supports http(s)://, file://, pvc://, or absolute paths. For MLX models, point at the directory containing config.json. | required |
| spec.format | enum | gguf, mlx, safetensors, pytorch, custom. gguf pairs with the llama.cpp runtime; mlx with oMLX; the rest with the generic runtime. | gguf |
| spec.quantization | string | Free-form quantization label (Q4_K_M, Q5_K_M, F16). Surfaced in status and metrics; the controller does not validate it. | — |
| spec.hardware.accelerator | enum | cpu, metal, cuda, rocm. metal routes the model through the metal-agent path; everything else stays in-cluster. | cpu |
| spec.hardware.memoryBudget | quantity | Absolute memory cap for the model process (e.g. 24Gi). Wins over memoryFraction and the agent-level default. | — |
| spec.hardware.memoryFraction | float (0–1) | Fraction of host memory to budget for this model. Used by the metal-agent when no absolute budget is set. | agent default |
| spec.resources.cpu | quantity | CPU request for in-cluster runtime pods. Kubernetes scheduling unit, not enforced on metal-agent processes. | — |
| spec.resources.memory | quantity | Memory request for in-cluster runtime pods. Set above the GGUF file size to leave headroom for KV cache. | — |
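As a second illustration, here is a sketch that combines the metal-agent path with a fractional memory budget. The field names come from the table above; the model name and the pvc:// path layout are assumptions for illustration, not documented conventions.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen3-8b-mlx          # hypothetical name
spec:
  # For MLX models, source points at the directory containing config.json.
  # The pvc://models/... layout below is an assumed example path.
  source: pvc://models/qwen3-8b-mlx
  format: mlx
  hardware:
    accelerator: metal        # routes the model through the metal-agent path
    memoryFraction: 0.5       # budget half of host memory; ignored if memoryBudget is set
```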
InferenceService
An InferenceService resource describes how to serve a Model: which runtime, how many replicas, what context window, what GPU and memory shape. The controller creates and updates a Deployment + Service + PodMonitor to match. Short name: isvc.
Example
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-prod
spec:
  modelRef: qwen3-8b
  runtime: llamacpp
  replicas: 2
  priority: high
  evictionProtection: true
  contextSize: 32768
  flashAttention: true
  parallelSlots: 4
  runtimeClassName: nvidia
  podLabels:
    cost-center: ai-platform
  podAnnotations:
    sidecar.istio.io/inject: "true"
  resources:
    gpu: 1
    cpu: "2"
    memory: 16Gi
```
Commonly used fields
| Field | Type | Description | Default |
|---|---|---|---|
| spec.modelRef | string | Name of the Model in the same namespace. Mutable: changing it triggers a rolling update to the new model. | required |
| spec.runtime | enum | llamacpp, vllm, tgi, personaplex, generic. Selects the container image, args, and probe shape. | llamacpp |
| spec.replicas | int (0–10) | Pod count. When spec.autoscaling is set, this becomes the initial count and HPA takes over. | 1 |
| spec.priority | enum | batch, low, normal, high, critical. Drives both GPU scheduling order and memory-pressure eviction order. | normal |
| spec.evictionProtection | bool | Excludes this service from metal-agent memory-pressure eviction. MemoryPressure conditions are still surfaced for visibility. | false |
| spec.contextSize | int | llama.cpp context window (-c). Range 128–2,097,152. Larger contexts cost more KV cache memory. | runtime default |
| spec.flashAttention | bool | Enables llama.cpp --flash-attn. NVIDIA: requires Ampere+. Apple Silicon: defaults to true via the metal-agent path to avoid long-context decode regression. | runtime default |
| spec.mlock | bool | Not yet a first-class field; set spec.extraArgs: ["--mlock"] instead (see the note below). Pins model weights into RAM at the cost of memory headroom. | runtime default |
| spec.parallelSlots | int (1–64) | Concurrent request slots (--parallel). Each slot uses additional KV memory. | 1 |
| spec.runtimeClassName | string | Selects a Kubernetes RuntimeClass. Most common value: nvidia on clusters where the NVIDIA runtime is not the default. | — |
| spec.podAnnotations | map[string]string | Merged into pod metadata. For service-mesh injection, cost attribution, custom admission controllers. | — |
| spec.podLabels | map[string]string | Merged into pod labels. Operator-managed labels (app, inference.llmkube.dev/model, inference.llmkube.dev/service) take precedence on collision. | — |
| spec.resources.gpu | int (0–8) | GPU count per pod. Combine with Model.spec.hardware.gpu.sharding for multi-GPU layer splits. | 0 |
| spec.resources.cpu | quantity | CPU request per pod (e.g. "2", "500m"). | — |
| spec.resources.memory | quantity | Memory request per pod (e.g. 16Gi). For hybrid GPU/CPU offload, prefer resources.hostMemory. | — |
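Putting the resource fields together, the sketch below shows a hybrid GPU/CPU offload shape. resources.hostMemory appears in the table above, but its full semantics are not documented here; treat the values as illustrative.

```yaml
# A hedged sketch of a hybrid GPU/CPU offload shape; values are illustrative.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-hybrid           # hypothetical name
spec:
  modelRef: qwen3-8b
  runtime: llamacpp
  resources:
    gpu: 1                    # one GPU per pod
    cpu: "4"
    hostMemory: 32Gi          # preferred over resources.memory for hybrid offload (see table)
```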
Note on mlock: there is no first-class spec.mlock field today. Set it through spec.extraArgs: ["--mlock"] on the llama.cpp runtime.
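For example, on a minimal service (only the extraArgs line is the documented mechanism; the surrounding fields are context):

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-pinned           # hypothetical name
spec:
  modelRef: qwen3-8b
  runtime: llamacpp           # --mlock is a llama.cpp flag
  extraArgs: ["--mlock"]      # pins model weights into RAM at the cost of headroom
```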
Looking for the full generated CRD YAML, including every field and validation? See the CRD bases on GitHub.