What we shipped in LLMKube 0.7.6: memory-pressure protection, mutable modelRef, and a community PR worth celebrating

Christopher Maher
8 min read

Spent the weekend shipping. 0.7.6 is the biggest LLMKube release since multi-GPU sharding landed: memory-pressure protection on the metal-agent (with priority-based eviction and a friendly-fire guard), modelRef finally mutable, ParallelSlots extended to vLLM thanks to a polished community PR, three new K8s-native pod fields, and a proper CNCF-style docs site to host all of it. Here's what landed.

Memory-pressure protection (the headline)

Apple Silicon and bare-metal Linux hosts don't expose a Kubernetes-managed memory model for native LLM processes. An oversized model or a runaway warmup can swap-thrash the entire box and pull unrelated workloads down with it. The metal-agent now runs a watchdog that observes host memory, surfaces a MemoryPressure condition on every managed InferenceService, and (when enabled) evicts the lowest-priority workload before the kernel does something nastier.
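For orientation, here's a sketch of how that condition might surface in an InferenceService status. The condition type MemoryPressure and the Critical tier are from the agent; the reason, message, and timestamp below are illustrative placeholders, not the agent's exact strings:

status:
  conditions:
    - type: MemoryPressure
      status: "True"
      reason: HostMemoryCritical        # illustrative reason string
      message: host memory pressure is Critical; evaluating eviction candidates
      lastTransitionTime: "2026-01-17T21:04:00Z"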

The eviction selector applies these rules in order:

  1. Safety floor: refuse to evict if only one workload exists.
  2. Skip non-llama-server runtimes.
  3. Skip workloads with spec.evictionProtection: true.
  4. Lowest priority wins.
  5. Tie-break by largest RSS, then oldest StartedAt.

The friendly-fire guard refuses to evict at all when LLMKube is using less than 50% of system RSS: if pressure comes from a build job or a browser, the watchdog leaves your inference workloads alone and increments llmkube_metal_agent_evictions_skipped_total{reason="below_guard"} so you can alert on "the system is in distress, but it's not us."
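Two practical hooks fall out of this. Opting a workload out of eviction is a single spec field, and the skip counter makes the "in distress, but not us" alert a one-liner. A sketch (the API version, resource names, and alert shape are mine; evictionProtection and the metric name are from above):

apiVersion: inference.llmkube.dev/v1alpha1   # group from the label prefix; version assumed
kind: InferenceService
metadata:
  name: prod-chat                            # illustrative name
spec:
  modelRef: phi-4-mini
  evictionProtection: true                   # the watchdog will never select this workload
---
# Prometheus alerting rule sketch for the friendly-fire guard:
groups:
  - name: llmkube-memory-pressure
    rules:
      - alert: HostPressureNotFromLLMKube
        expr: increase(llmkube_metal_agent_evictions_skipped_total{reason="below_guard"}[10m]) > 0
        annotations:
          summary: Host memory is in distress, but LLMKube is not the cause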

The live demo runs on the docs page: an asciinema recording of three llama.cpp services (one with evictionProtection: true) hitting Critical pressure and the agent explicitly holding fire. Worth the 60 seconds.

/docs/memory-pressure-protection · PRs #382 and #384

A community PR worth celebrating

@Faylixe filed discussion #338 about strict per-replica request serialization on a Raspberry Pi 5 k3s cluster, then went and read our code and pointed out that spec.parallelSlots already existed for the llama.cpp runtime. That scoped the issue down to a real bug they found in the process: setting parallelSlots: 1 silently did nothing, because the operator only emitted the flag behind a > 1 guard, so the renderer fell through to llama.cpp's auto-default of 4 slots. Filed as #339, fixed in their PR.
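For the curious, the failing shape was this simple. A minimal sketch with surrounding fields omitted; parallelSlots is the real CRD field, and --parallel is llama-server's flag:

spec:
  parallelSlots: 1   # pre-fix: swallowed by the > 1 guard, so llama-server defaulted to 4 slots
                     # post-fix: rendered as --parallel 1, one request at a time per replica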

Then they kept going. PR #340 extended ParallelSlots to the vLLM runtime via --max-num-seqs, added arg-collision detection so users supplying --parallel through spec.extraArgs don't end up with the flag emitted twice, extracted shared arg-parsing helpers into a new runtime_args.go, and shipped the documentation alongside the code. They wrote the regression tests, fixed grammar in the docs in the same PR, and caught a stale CRD doc comment that nobody else had noticed.
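Post-#340, the same knob works on vLLM. A hedged sketch: the runtime key is my assumption, while parallelSlots, extraArgs, and the --max-num-seqs mapping are from the PR:

spec:
  runtime: vllm                # assumed field name for runtime selection
  parallelSlots: 4             # rendered as --max-num-seqs 4
  extraArgs:
    - --dtype=float16          # fine; re-supplying the parallelism flag here now trips collision detection instead of emitting it twice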

This is exactly the contribution shape an OSS maintainer hopes for: a real workload, a real bug found in passing, a substantive fix that ships with tests and docs and improves the area beyond the original report. Thank you, Faylixe.

Faylixe joins @PabloCastorino, @matiasinsaurralde, and @xingzihai on the project's contributor list. The full grid is on the repo README.

modelRef is now mutable

This was a long-standing footgun (#301) and probably the single most likely thing a curious visitor would trip over. The Deployment selector included inference.llmkube.dev/model, and Kubernetes treats Deployment selectors as immutable post-creation. Editing spec.modelRef on a running InferenceService therefore accepted the CR change, but every reconcile failed at the apiserver with "field is immutable", and no error surfaced to the user: the failures only showed up in controller logs, and the Pod kept running the old model.

Fix: split labels into two sets. app + inference.llmkube.dev/service stay on the Deployment selector (immutable identity, derived from isvc.Name, never changes). The full label set including inference.llmkube.dev/model still ships on Deployment metadata and the Pod template, so kubectl get pods -l inference.llmkube.dev/model=<name> still works as a filter. modelRef edits now trigger a clean rolling update with the new model label, new init container args, new model path.
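Concretely, a sketch of the rendered Deployment after the split. The app label's value and the names are illustrative; the keys and the selector/metadata split are as described above:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: llmkube                                  # value illustrative
    inference.llmkube.dev/service: my-chat        # immutable identity, from isvc.Name
    inference.llmkube.dev/model: phi-4-mini       # metadata only, free to change
spec:
  selector:
    matchLabels:                                  # never includes the model label
      app: llmkube
      inference.llmkube.dev/service: my-chat
  template:
    metadata:
      labels:
        app: llmkube
        inference.llmkube.dev/service: my-chat
        inference.llmkube.dev/model: phi-4-mini   # kubectl get pods -l ... still works
    # (pod spec with the llama-server container omitted for brevity)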

Two regression tests lock the apiserver-immutability constraint at the source: the selector must not contain the model label, and swapping modelRef on the same InferenceService must produce Deployments whose selectors are deeply equal. PR #385.

Three new K8s-native fields on InferenceService

Three first-class CRD fields that make LLMKube interoperate cleanly with the rest of a real Kubernetes cluster, instead of forcing operators to fall back to extraArgs hacks or sidecar tricks (a combined example follows the list):

  • spec.runtimeClassName — selects a Kubernetes RuntimeClass. Typical use: nvidia on clusters where the NVIDIA container runtime isn't the default, or a custom runtime class for sandboxed inference (gVisor, Kata, Firecracker). PR #380.
  • spec.podAnnotations and spec.podLabels — merged into pod metadata. For Istio sidecar injection (sidecar.istio.io/inject: "true"), cost-attribution labels (cost-center: ai-platform), custom admission controllers, service-mesh routing, anything that reads pod metadata. Operator-managed labels (app, the two inference.llmkube.dev/* keys) take precedence on collision so the selector chain doesn't break. PR #381.
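Here's what all three look like on one service. A sketch: the API version and names are assumptions; the field names and example values (nvidia, the Istio annotation, the cost label) come from the bullets above:

apiVersion: inference.llmkube.dev/v1alpha1   # version assumed
kind: InferenceService
metadata:
  name: sandboxed-chat                       # illustrative
spec:
  modelRef: phi-4-mini
  runtimeClassName: nvidia                   # select the NVIDIA container runtime explicitly
  podAnnotations:
    sidecar.istio.io/inject: "true"          # opt the pod into Istio sidecar injection
  podLabels:
    cost-center: ai-platform                 # cost-attribution label for chargeback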

Catalog phi-4-mini OOM and the docs reference fix

Caught two failure modes in a quickstart timing test on Saturday night that would have hit on minute one of any new visitor's session:

  1. llmkube deploy phi-4-mini from the catalog defaults landed in CrashLoopBackOff: the catalog set context_size: 128000, but at 128K the KV cache for a 3.8B model exceeds the recommended 4Gi memory request, so the Pod OOM'd repeatedly while loading. Default context dropped to 8192; users who actually want 128K can still set it via spec.contextSize or --context (sketch after this list). PR #387.
  2. The quickstart docs page told users to run llmkube deploy phi-3-mini, which doesn't exist in the catalog. Replaced every reference with phi-4-mini and dropped the misleading explicit --cpu 500m --memory 1Gi flags (the CLI defaults to the catalog's recommended resources, which are correct). llmkube-web #67.
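If you actually want the long context back, the override is two lines, or one flag at deploy time. A sketch; just make sure the memory request is sized for the larger KV cache:

spec:
  modelRef: phi-4-mini
  contextSize: 128000        # restores the 128K window
# equivalently at deploy time:
#   llmkube deploy phi-4-mini --context 128000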

The fixes ship together. Verified on a fresh kind cluster: llmkube deploy phi-4-mini reaches Ready in 13 seconds and serves an OpenAI-compatible API on port-forward.

Docs site, rebuilt

The old /docs URL went straight to a single getting-started page. Now there's a real CNCF-style docs site with a sidebar (Getting Started · Concepts · Guides · Operations · Reference), 17 routes, and three concept pages written from scratch:

  • /docs/concepts/architecture — how the controller and metal-agent split responsibility, with an inline SVG diagram and a six-step reconciliation walkthrough.
  • /docs/concepts/crds — field-by-field reference for Model and InferenceService, sourced directly from the Go types.
  • /docs/concepts/comparison — honest comparison vs KubeAI, llm-d, Ollama, and the NVIDIA NIM Operator. Every cell verified against project repos and official docs. Includes a "When to pick something else" section that names where each peer is the better choice.

Plus the memory-pressure-protection page with the live asciinema cast embedded.

Smaller fixes worth knowing

  • Metal accelerator phase fix (#376): InferenceService Phase used to flip to Ready seconds after applying the manifest, before the metal-agent had even fetched the model. The phase now derives from real Endpoints, not desiredReplicas, so Ready means actually serving.
  • Spec-drift detection on the metal-agent (#353): editing an InferenceService spec on a metal-agent-managed workload now triggers a respawn with the new flags. Previously the agent silently kept the old process running.
  • Two metrics removed from llmkube-controller-manager (llmkube_active_models_total, llmkube_active_inferenceservices_total); neither was used by our shipped Grafana dashboards. If you scrape them in a custom dashboard, the gauges go away in 0.7.6. Contact us if this breaks anything for you.

Install

# Helm
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
helm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace

# CLI (macOS / Linux)
brew install defilantech/tap/llmkube

# Or upgrade in place
brew upgrade llmkube
helm upgrade llmkube llmkube/llmkube --namespace llmkube-system

Full changelog on the v0.7.6 release page.

What's next

Memory-pressure protection PR-2C (#383): K8s event emission, single-process safety floor (so the watchdog can't kill your only inference workload on a single-tenant box), and a few recovery metrics. Plus filling in the docs stub pages, and the next round of community PRs we'd love to land before the next release.

If you're trying LLMKube, file issues, ping us on Discord, or follow along on GitHub. Real workloads find real bugs — that's how Faylixe found #339, and how the next few features will probably get scoped too.
