CRD reference
LLMKube exposes two custom resources: Model describes a model file the operator should make available, and InferenceService describes how to serve it. Both live in the inference.llmkube.dev/v1alpha1 API group.
Model
A Model resource describes where to fetch a model from and what hardware acceleration it targets. The controller materializes a download Job (CUDA/CPU path) or hands off to the metal-agent (Metal path), then exposes the file at a stable on-disk location for any InferenceService that references it.
Example
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen3-8b
spec:
  source: https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    memoryBudget: 24Gi
  resources:
    cpu: "1"
    memory: 8Gi
```
Commonly used fields
| Field | Type | Description | Default |
|---|---|---|---|
| spec.source | string | URL or path. Supports http(s)://, file://, pvc://, or absolute paths. For MLX models, point at the directory containing config.json. | required |
| spec.format | enum | gguf, mlx, safetensors, pytorch, custom. gguf pairs with the llama.cpp runtime; mlx with oMLX; the rest with the generic runtime. | gguf |
| spec.quantization | string | Free-form quantization label (Q4_K_M, Q5_K_M, F16). Surfaced in status and metrics; the controller does not validate it. | — |
| spec.hardware.accelerator | enum | cpu, metal, cuda, rocm. metal routes the model through the metal-agent path; everything else stays in-cluster. | cpu |
| spec.hardware.memoryBudget | quantity | Absolute memory cap for the model process (e.g. 24Gi). Wins over memoryFraction and the agent-level default. | — |
| spec.hardware.memoryFraction | float (0–1) | Fraction of host memory to budget for this model. Used by the metal-agent when no absolute budget is set. | agent default |
| spec.resources.cpu | quantity | CPU request for in-cluster runtime pods. Kubernetes scheduling unit, not enforced on metal-agent processes. | — |
| spec.resources.memory | quantity | Memory request for in-cluster runtime pods. Set above the GGUF file size to leave headroom for KV cache. | — |
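As a second illustration, here is a sketch that combines the metal-agent path with a fractional memory budget. The field names come from the table above; the model name and the pvc:// path layout are assumptions for illustration, not documented conventions.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen3-8b-mlx          # hypothetical name
spec:
  # For MLX models, source points at the directory containing config.json.
  # The pvc://models/... layout below is an assumed example path.
  source: pvc://models/qwen3-8b-mlx
  format: mlx
  hardware:
    accelerator: metal        # routes the model through the metal-agent path
    memoryFraction: 0.5       # budget half of host memory; ignored if memoryBudget is set
```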
InferenceService
An InferenceService resource describes how to serve a Model: which runtime, how many replicas, what context window, what GPU and memory shape. The controller creates and updates a Deployment + Service + PodMonitor to match. Short name: isvc.
Example
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-prod
spec:
  modelRef: qwen3-8b
  runtime: llamacpp
  replicas: 2
  priority: high
  evictionProtection: true
  contextSize: 32768
  flashAttention: true
  parallelSlots: 4
  runtimeClassName: nvidia
  podLabels:
    cost-center: ai-platform
  podAnnotations:
    sidecar.istio.io/inject: "true"
  resources:
    gpu: 1
    cpu: "2"
    memory: 16Gi
```
Commonly used fields
| Field | Type | Description | Default |
|---|---|---|---|
| spec.modelRef | string | Name of the Model in the same namespace. Mutable: changing it triggers a rolling update to the new model. | required |
| spec.runtime | enum | llamacpp, vllm, tgi, personaplex, generic. Selects the container image, args, and probe shape. | llamacpp |
| spec.replicas | int (0–10) | Pod count. When spec.autoscaling is set, this becomes the initial count and HPA takes over. | 1 |
| spec.priority | enum | batch, low, normal, high, critical. Drives both GPU scheduling order and memory-pressure eviction order. | normal |
| spec.evictionProtection | bool | Excludes this service from metal-agent memory-pressure eviction. MemoryPressure conditions are still surfaced for visibility. | false |
| spec.contextSize | int | llama.cpp context window (-c). Range 128–2,097,152. Larger contexts cost more KV cache memory. | runtime default |
| spec.flashAttention | bool | Enables llama.cpp --flash-attn. NVIDIA: requires Ampere+. Apple Silicon: defaults to true via the metal-agent path to avoid long-context decode regression. | runtime default |
| spec.mlock | bool | Not yet a first-class field; set spec.extraArgs: ["--mlock"] instead (see the note below). Pins model weights into RAM at the cost of memory headroom. | runtime default |
| spec.parallelSlots | int (1–64) | Concurrent request slots (--parallel). Each slot uses additional KV memory. | 1 |
| spec.runtimeClassName | string | Selects a Kubernetes RuntimeClass. Most common value: nvidia on clusters where the NVIDIA runtime is not the default. | — |
| spec.podAnnotations | map[string]string | Merged into pod metadata. For service-mesh injection, cost attribution, custom admission controllers. | — |
| spec.podLabels | map[string]string | Merged into pod labels. Operator-managed labels (app, inference.llmkube.dev/model, inference.llmkube.dev/service) take precedence on collision. | — |
| spec.resources.gpu | int (0–8) | GPU count per pod. Combine with Model.spec.hardware.gpu.sharding for multi-GPU layer splits. | 0 |
| spec.resources.cpu | quantity | CPU request per pod (e.g. "2", "500m"). | — |
| spec.resources.memory | quantity | Memory request per pod (e.g. 16Gi). For hybrid GPU/CPU offload, prefer resources.hostMemory. | — |
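Putting the resource fields together, the sketch below shows a hybrid GPU/CPU offload shape. resources.hostMemory appears in the table above, but its full semantics are not documented here; treat the values as illustrative.

```yaml
# A hedged sketch of a hybrid GPU/CPU offload shape; values are illustrative.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-hybrid           # hypothetical name
spec:
  modelRef: qwen3-8b
  runtime: llamacpp
  resources:
    gpu: 1                    # one GPU per pod
    cpu: "4"
    hostMemory: 32Gi          # preferred over resources.memory for hybrid offload (see table)
```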
Note on mlock: there is no first-class spec.mlock field today. Set it through spec.extraArgs: ["--mlock"] on the llama.cpp runtime.
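For example, on a minimal service (only the extraArgs line is the documented mechanism; the surrounding fields are context):

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-pinned           # hypothetical name
spec:
  modelRef: qwen3-8b
  runtime: llamacpp           # --mlock is a llama.cpp flag
  extraArgs: ["--mlock"]      # pins model weights into RAM at the cost of headroom
```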
Looking for the full generated CRD YAML, including every field and validation? See the CRD bases on GitHub.