Memory-pressure protection
The metal-agent watches host memory and surfaces conditions on every workload, but only stops a process when LLMKube itself is the cause of the pressure.
Three llama.cpp services on a kind cluster, one with `evictionProtection: true`. Watch the watchdog hit Critical, surface conditions on every workload, and explicitly hold fire.
What the cast shows
The recording layers three behaviors on top of each other:
- Pressure detection. A watchdog ticks every 5 seconds, sampling host memory and per-process RSS. When the available fraction drops below the configured threshold, it transitions to Warning or Critical and updates the `llmkube_metal_agent_memory_pressure_level` gauge.
- Visibility on the workload. Every managed `InferenceService` receives a `MemoryPressure` status condition, including services that opted out of eviction. Operators can read it from `kubectl get isvc` instead of needing Prometheus.
- Friendly-fire guard. Even at Critical, the watchdog refuses to evict if LLMKube isn't the dominant memory consumer (by default, it must own more than 50% of system RSS). The decision is recorded in `llmkube_metal_agent_evictions_skipped_total{reason="below_guard"}` so you can alert on "the system is in distress, but it's not us."
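The detection-plus-guard flow can be sketched in a few lines. This is a minimal illustration, not LLMKube's actual code: the function names, the `Level` type, and the hard-coded thresholds are assumptions for the example.

```go
package main

import "fmt"

type Level int

const (
	Normal Level = iota
	Warning
	Critical
)

// classify maps the host's available-memory fraction to a pressure level.
// Thresholds are passed in because they are operator-configurable.
func classify(availFrac, warnThresh, critThresh float64) Level {
	switch {
	case availFrac < critThresh:
		return Critical
	case availFrac < warnThresh:
		return Warning
	default:
		return Normal
	}
}

// guardAllowsEviction is the friendly-fire check: even at Critical, only
// evict when LLMKube-managed processes own more than half of system RSS.
func guardAllowsEviction(level Level, managedRSS, systemRSS uint64) bool {
	return level == Critical && float64(managedRSS)/float64(systemRSS) > 0.5
}

func main() {
	lvl := classify(0.08, 0.20, 0.10) // 8% available, below the 10% floor
	fmt.Println(lvl == Critical)                          // true
	fmt.Println(guardAllowsEviction(lvl, 30<<30, 64<<30)) // false: we own ~47% of RSS
}
```

When the guard returns false, the only side effect is the incremented skip counter, which is what makes the "distress, but not us" alert possible.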
Eviction selector rules
When the watchdog does decide to evict, the selector applies these rules in order:
- Safety floor. If only one managed process exists, refuse to evict. Single-tenant systems shouldn't be killed by their own watchdog.
- Skip non-llama-server runtimes. MLX and Ollama use shared daemons; killing them via this code path doesn't free per-model memory.
- Skip `spec.evictionProtection: true` workloads. Per-service opt-out for production-critical inference.
- Lowest `spec.priority` wins. The same enum used for GPU scheduling: `batch < low < normal < high < critical`.
- Tie-break by largest RSS, then by oldest `StartedAt`, for determinism.
Evicted services are stamped into a respawn block until pressure clears, so the controller can't immediately re-spawn the workload the watchdog just killed. The block lifts automatically when the agent observes Normal pressure.
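The respawn block amounts to a small piece of shared state between the watchdog and the controller. A minimal sketch, with hypothetical names, assuming the agent marks evicted services and clears all marks on the Normal transition:

```go
package main

import "fmt"

// RespawnBlocker records services the watchdog evicted so the controller
// won't immediately re-spawn them while pressure persists.
type RespawnBlocker struct {
	blocked map[string]bool
}

func NewRespawnBlocker() *RespawnBlocker {
	return &RespawnBlocker{blocked: map[string]bool{}}
}

func (b *RespawnBlocker) OnEvicted(name string)       { b.blocked[name] = true }
func (b *RespawnBlocker) CanRespawn(name string) bool { return !b.blocked[name] }

// OnPressureNormal lifts every block once the agent observes Normal pressure.
func (b *RespawnBlocker) OnPressureNormal() { b.blocked = map[string]bool{} }

func main() {
	rb := NewRespawnBlocker()
	rb.OnEvicted("dev-chat")
	fmt.Println(rb.CanRespawn("dev-chat")) // false while pressure persists
	rb.OnPressureNormal()
	fmt.Println(rb.CanRespawn("dev-chat")) // true after pressure clears
}
```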
Enabling protection
Eviction is opt-in. The watchdog runs and surfaces conditions by default; `--eviction-enabled` activates the kill path.

```bash
helm install llmkube llmkube/llmkube \
  --set metalAgent.evictionEnabled=true \
  --set metalAgent.memoryPressureCritical=0.10 \
  --set metalAgent.memoryPressureWarning=0.20
```

Mark a specific InferenceService as never-evict:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: chat-prod
spec:
  modelRef: qwen3-8b
  priority: critical
  evictionProtection: true
```

Metrics worth alerting on
- `llmkube_metal_agent_memory_pressure_level`. Gauge: 0 = Normal, 1 = Warning, 2 = Critical. Page on Critical: "host memory is under stress."
- `llmkube_metal_agent_evictions_total`. Counter: workloads stopped by the watchdog. A non-zero rate means the agent is actively pruning under pressure.
- `llmkube_metal_agent_evictions_skipped_total{reason}`. Counter: tracks why an eviction was skipped. Labels include `disabled`, `below_guard`, `floor`, `all_protected`, `runtime_ineligible`.
- `llmkube_metal_agent_process_rss_bytes{name}`. Gauge: per-process resident set size. Useful for capacity planning and spotting the model that's growing.
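These metrics slot directly into Prometheus alerting rules. A sketch, with illustrative alert names, durations, and severities; the `on(instance)` matcher assumes both series carry an `instance` label in your deployment:

```yaml
groups:
  - name: llmkube-memory-pressure
    rules:
      - alert: LLMKubeMemoryCritical
        expr: llmkube_metal_agent_memory_pressure_level >= 2
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Host memory critical on {{ $labels.instance }}"
      - alert: LLMKubePressureNotOurs
        # Critical pressure while the friendly-fire guard keeps skipping:
        # the host is in distress, but LLMKube isn't the dominant consumer.
        expr: >
          llmkube_metal_agent_memory_pressure_level >= 2
          and on(instance)
          rate(llmkube_metal_agent_evictions_skipped_total{reason="below_guard"}[5m]) > 0
        for: 5m
        labels:
          severity: warn
```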
Related: Install in 5 minutes · CRD reference