
CRD reference

LLMKube exposes two custom resources: Model describes a model file the operator should make available, and InferenceService describes how to serve it. Both live in the inference.llmkube.dev/v1alpha1 API group.

Model

A Model resource describes where to fetch a model from and what hardware acceleration it targets. The controller materializes a download Job (CUDA/CPU path) or hands off to the metal-agent (Metal path), then exposes the file at a stable on-disk location for any InferenceService that references it.

Example

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen3-8b
spec:
  source: https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    memoryBudget: 24Gi
  resources:
    cpu: "1"
    memory: 8Gi

Commonly used fields

| Field | Type | Description | Default |
|---|---|---|---|
| spec.source | string | URL or path. Supports http(s)://, file://, pvc://, or absolute paths. For MLX models, point at the directory containing config.json. | required |
| spec.format | enum | gguf, mlx, safetensors, pytorch, custom. gguf pairs with the llama.cpp runtime; mlx with oMLX; the rest with the generic runtime. | gguf |
| spec.quantization | string | Free-form quantization label (Q4_K_M, Q5_K_M, F16). Surfaced in status and metrics; the controller does not validate it. | — |
| spec.hardware.accelerator | enum | cpu, metal, cuda, rocm. metal routes the model through the metal-agent path; everything else stays in-cluster. | cpu |
| spec.hardware.memoryBudget | quantity | Absolute memory cap for the model process (e.g. 24Gi). Wins over memoryFraction and the agent-level default. | — |
| spec.hardware.memoryFraction | float (0–1) | Fraction of host memory to budget for this model. Used by the metal-agent when no absolute budget is set. | agent default |
| spec.resources.cpu | quantity | CPU request for in-cluster runtime pods. A Kubernetes scheduling unit; not enforced on metal-agent processes. | — |
| spec.resources.memory | quantity | Memory request for in-cluster runtime pods. Set it above the GGUF file size to leave headroom for the KV cache. | — |
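
The Metal-path fields compose as shown below. This is a sketch rather than a canonical manifest: the name and source directory are illustrative, and it assumes an MLX checkout whose directory contains config.json, per the spec.source note above.

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen3-8b-mlx                      # illustrative name
spec:
  source: file:///models/Qwen3-8B-MLX     # MLX: directory containing config.json (illustrative path)
  format: mlx                             # pairs with the oMLX runtime
  hardware:
    accelerator: metal                    # handled by the metal-agent, not an in-cluster download Job
    memoryFraction: 0.5                   # applies because no absolute memoryBudget is set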

InferenceService

An InferenceService resource describes how to serve a Model: which runtime, how many replicas, what context window, and what GPU and memory shape. The controller creates a matching Deployment, Service, and PodMonitor and keeps them up to date. Short name: isvc.

Example

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-prod
spec:
  modelRef: qwen3-8b
  runtime: llamacpp
  replicas: 2
  priority: high
  evictionProtection: true
  contextSize: 32768
  flashAttention: true
  parallelSlots: 4
  runtimeClassName: nvidia
  podLabels:
    cost-center: ai-platform
  podAnnotations:
    sidecar.istio.io/inject: "true"
  resources:
    gpu: 1
    cpu: "2"
    memory: 16Gi

Commonly used fields

| Field | Type | Description | Default |
|---|---|---|---|
| spec.modelRef | string | Name of the Model in the same namespace. Mutable: changing it triggers a rolling update to the new model. | required |
| spec.runtime | enum | llamacpp, vllm, tgi, personaplex, generic. Selects the container image, args, and probe shape. | llamacpp |
| spec.replicas | int (0–10) | Pod count. When spec.autoscaling is set, this becomes the initial count and the HPA takes over. | 1 |
| spec.priority | enum | batch, low, normal, high, critical. Drives both GPU scheduling order and memory-pressure eviction order. | normal |
| spec.evictionProtection | bool | Excludes this service from metal-agent memory-pressure eviction. MemoryPressure conditions are still surfaced for visibility. | false |
| spec.contextSize | int | llama.cpp context window (-c). Range 128–2,097,152. Larger contexts cost more KV cache memory. | runtime default |
| spec.flashAttention | bool | Enables llama.cpp --flash-attn. NVIDIA: requires Ampere or newer. Apple Silicon: defaults to true via the metal-agent path to avoid a long-context decode regression. | runtime default |
| mlock (via spec.extraArgs) | bool | Not a first-class field; set spec.extraArgs: ["--mlock"]. Pins model weights into RAM at the cost of memory headroom. | runtime default |
| spec.parallelSlots | int (1–64) | Concurrent request slots (--parallel). Each slot uses additional KV cache memory. | 1 |
| spec.runtimeClassName | string | Selects a Kubernetes RuntimeClass. Most common value: nvidia, on clusters where the NVIDIA runtime is not the default. | — |
| spec.podAnnotations | map[string]string | Merged into pod metadata. Useful for service-mesh injection, cost attribution, and custom admission controllers. | — |
| spec.podLabels | map[string]string | Merged into pod labels. Operator-managed labels (app, inference.llmkube.dev/model, inference.llmkube.dev/service) take precedence on collision. | — |
| spec.resources.gpu | int (0–8) | GPU count per pod. Combine with Model.spec.hardware.gpu.sharding for multi-GPU layer splits. | 0 |
| spec.resources.cpu | quantity | CPU request per pod (e.g. "2", "500m"). | — |
| spec.resources.memory | quantity | Memory request per pod (e.g. 16Gi). For hybrid GPU/CPU offload, prefer resources.hostMemory. | — |

Note on mlock: there is no first-class spec.mlock field today. Set it through spec.extraArgs: ["--mlock"] on the llama.cpp runtime.
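
For example, on the chat-prod service above, a minimal sketch:

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-prod
spec:
  modelRef: qwen3-8b
  runtime: llamacpp
  extraArgs: ["--mlock"]   # pins model weights into RAM; the trade-off is less memory headroom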

Looking for the full generated CRD YAML, including every field and validation? See the CRD bases on GitHub.
