
Memory-pressure protection

The metal-agent watches host memory and surfaces conditions on every workload, but only stops a process when LLMKube itself is the cause of the pressure.

Three llama.cpp services on a kind cluster, one with evictionProtection: true. Watch the watchdog hit Critical, surface conditions on every workload, and explicitly hold fire.

Real-time recording on a kind cluster, idle waits compressed.

What the cast shows

The recording layers three behaviors on top of each other:

  1. Pressure detection. A 5-second-tick watchdog samples host memory and per-process RSS. When the available fraction drops below the configured threshold, it transitions to Warning or Critical and updates the llmkube_metal_agent_memory_pressure_level gauge.
  2. Visibility on the workload. Every managed InferenceService receives a MemoryPressure status condition, including services that opted out of eviction. Operators can read it from kubectl get isvc instead of needing Prometheus.
  3. Friendly-fire guard. Even at Critical, the watchdog refuses to evict if LLMKube isn't the dominant memory consumer (default: it must own >50% of system RSS). The decision is recorded in llmkube_metal_agent_evictions_skipped_total{reason="below_guard"} so you can alert on "the system is in distress, but it's not us."
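The three behaviors above reduce to two small decisions: map the available-memory fraction to a pressure level, and check whether LLMKube is the dominant consumer before any eviction. A minimal sketch, assuming the thresholds shown in the Helm example below (0.20 warning, 0.10 critical) and the default 50% guard; the function and field names here are illustrative, not the agent's actual API:

```python
from enum import IntEnum


class Pressure(IntEnum):
    """Values mirror the llmkube_metal_agent_memory_pressure_level gauge."""
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2


def pressure_level(avail_fraction: float,
                   warning: float = 0.20,
                   critical: float = 0.10) -> Pressure:
    """Map the fraction of host memory still available to a pressure level."""
    if avail_fraction <= critical:
        return Pressure.CRITICAL
    if avail_fraction <= warning:
        return Pressure.WARNING
    return Pressure.NORMAL


def llmkube_is_dominant(managed_rss: int, system_rss: int,
                        guard: float = 0.50) -> bool:
    """Friendly-fire guard: eviction is only considered when LLMKube-managed
    processes own more than `guard` of total system RSS."""
    return system_rss > 0 and managed_rss / system_rss > guard
```

With 5% of memory available the level is Critical, but if managed processes hold only 40% of system RSS the guard returns False and the skip is counted under `reason="below_guard"`.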

Eviction selector rules

When the watchdog does decide to evict, the selector applies these rules in order:

  1. Safety floor. If only one managed process exists, refuse to evict. Single-tenant systems shouldn't be killed by their own watchdog.
  2. Skip non-llama-server runtimes. MLX and Ollama use shared daemons; killing them via this code path doesn't free per-model memory.
  3. Skip spec.evictionProtection: true workloads. Per-service opt-out for production-critical inference.
  4. Lowest spec.priority wins. The same enum used for GPU scheduling: batch < low < normal < high < critical.
  5. Tie-break by largest RSS, then by oldest StartedAt for determinism.

Evicted services are stamped into a respawn block until pressure clears, so the controller can't immediately re-spawn the workload the watchdog just killed. The block lifts automatically when the agent observes Normal pressure.
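The respawn block amounts to a small piece of state shared between the watchdog and the controller. A minimal sketch under assumed names (the agent's real data structures are not shown in this doc):

```python
class RespawnBlocker:
    """Tracks evicted services the controller must not respawn
    until memory pressure returns to Normal."""

    def __init__(self) -> None:
        self._blocked: set[str] = set()

    def block(self, service: str) -> None:
        """Stamp an evicted service into the respawn block."""
        self._blocked.add(service)

    def can_spawn(self, service: str) -> bool:
        return service not in self._blocked

    def observe_pressure(self, level: int) -> None:
        """Lift every block automatically when the agent observes
        Normal (0) pressure; Warning/Critical leave blocks in place."""
        if level == 0:
            self._blocked.clear()
```

This closes the loop the paragraph describes: without it, the controller would re-create the workload immediately and the watchdog would kill it again on the next tick.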

Enabling protection

Eviction is opt-in. The watchdog runs and surfaces conditions by default; --eviction-enabled activates the kill path.

helm install llmkube llmkube/llmkube \
  --set metalAgent.evictionEnabled=true \
  --set metalAgent.memoryPressureCritical=0.10 \
  --set metalAgent.memoryPressureWarning=0.20

Mark a specific InferenceService as never-evict:

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-prod
spec:
  modelRef: qwen3-8b
  priority: critical
  evictionProtection: true

Metrics worth alerting on

llmkube_metal_agent_memory_pressure_level
Gauge: 0 = Normal, 1 = Warning, 2 = Critical. Page on Critical; it means host memory is under stress.
llmkube_metal_agent_evictions_total
Counter: workloads stopped by the watchdog. Non-zero rate means the agent is actively pruning under pressure.
llmkube_metal_agent_evictions_skipped_total{reason}
Counter: tracks why an eviction was skipped. Labels include disabled, below_guard, floor, all_protected, runtime_ineligible.
llmkube_metal_agent_process_rss_bytes{name}
Gauge: per-process resident set size. Useful for capacity planning and spotting the model that's growing.

Related: Install in 5 minutes · CRD reference


© 2026 Defilan Technologies LLC
