
What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix

Christopher Maher
7 min read

Spent the week on a release that's half mine and half not. 0.7.7 makes OpenShift a first-class deploy target, lands the vllm-swift runtime with TurboQuant KV cache passthrough on Apple Silicon, picks up two community-driven changes (vLLM tuning fields from an engineer in France, plus a Longhorn FSGroup fix from a user who filed the cleanest bug report of the year), and adds enough observability glue to make multi-runtime fleets legible. Here's what landed and the story behind it.

OpenShift goes first-class

Up to 0.7.6, deploying LLMKube on OpenShift, OKD, or MicroShift required tribal knowledge: the restricted-v2 SecurityContextConstraint that admission uses by default injects an fsGroup from each namespace's allocated supplemental-groups range, and any pod template that ships its own explicit fsGroup outside that range gets rejected at admission. LLMKube's controller didn't know to back off, so OpenShift installs needed manual SCC tuning or a pre-flight oc adm policy dance every time.

0.7.7 makes that a one-flag install. The controller takes a new --default-fsgroup argument (default 102, matching the standard curlimages/curl init container's curl_group GID), and the chart ships a values-openshift.yaml preset that flips it to 0, which disables the operator's default and lets restricted-v2 inject its own value. The deployment template is written so an explicit 0 doesn't collapse back to the default, which Go templating's default function would otherwise do, since it treats zero values as empty.
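
The heart of the preset is a single value. Here's a minimal sketch of what it sets, using the controllerManager.initContainer.defaultFSGroup key described in the upgrade notes below; the surrounding layout is illustrative:

# values-openshift.yaml (abridged)
controllerManager:
  initContainer:
    defaultFSGroup: 0   # 0 disables the operator default so restricted-v2 can inject its own fsGroup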

A new MicroShift-backed e2e job (test-e2e-openshift) exercises the preset on every PR against a real OpenShift-compatible API, so SCC-admission regressions are caught at PR time instead of at customer time. PR #422, closes #421.

This is the kind of change that sounds small and isn't. OpenShift is the deployment substrate in a large portion of the regulated industries we hear from: healthcare, defense, financial services, large-format manufacturing. "Does your operator install cleanly on OpenShift?" is usually one of the first three questions in a procurement conversation, and a "yes, with this preset, validated in CI" answer is meaningfully different from "yes, you'll need to tweak the SCC."

helm install llmkube ./charts/llmkube \
  -f charts/llmkube/values-openshift.yaml \
  -n llmkube-system --create-namespace

vllm-swift on Apple Silicon, with TurboQuant passthrough

vllm-swift is TheTom's port of vLLM that runs natively on Apple Silicon via a Swift bridge. It's the engine behind the M5 Max long-context numbers we published two weeks ago. 0.7.7 makes it a first-class runtime in LLMKube: set spec.runtime: vllm-swift on an InferenceService and the metal-agent will resolve the binary, manage the process, and emit the same Prometheus / health-probe surface as the other runtimes.

The interesting part is the TurboQuant passthrough. The existing kvCacheCustomDtype field on vllmConfig previously routed only to upstream vLLM's turbo2 cache type. On the vllm-swift path, the same field now translates to kv_scheme and kv_bits entries inside vllm-swift's --additional-config JSON, so the same CRD shape gets you turbo2 on a CUDA node and a memory-efficient KV cache on an M-series Mac. One config, two engines, comparable behavior. A sample manifest landed at config/samples/inferenceservice_qwen3_4b_vllm_swift_turboquant.yaml.
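
Roughly what that manifest looks like: the apiVersion, names, and model reference below are illustrative, while spec.runtime and vllmConfig.kvCacheCustomDtype are the actual fields, with the value mirroring the turbo2 mapping described above.

apiVersion: llmkube.dev/v1alpha1        # illustrative; check the installed CRD for the real group/version
kind: InferenceService
metadata:
  name: qwen3-4b-turboquant
spec:
  runtime: vllm-swift                   # new runtime value in 0.7.7
  model: qwen3-4b                       # illustrative Model reference
  vllmConfig:
    kvCacheCustomDtype: turbo2          # turbo2 on CUDA; translated to kv_scheme/kv_bits for vllm-swift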

PR #393, closes #391. Worth pairing with the M5 Max long-context post if you haven't seen it yet.

vLLM tuning fields, courtesy of @Faylixe (again)

@Faylixe, who landed the ParallelSlots-to-vLLM extension in 0.7.6, came back with PR #394. Two new optional fields on vllmConfig:

  • gpuMemoryUtilization (float, 0.1 to 0.99): passes --gpu-memory-utilization to vLLM. Default unset (vLLM uses 0.90). Lower it when you want to leave headroom for spec-drift respawns or co-tenancy.
  • cpuOffloadGB (int32, >= 0): passes --cpu-offload-gb to vLLM. Per-rank, so 4 on TP=2 means 4 GB of CPU RAM per GPU. The use case is FP8 weights that don't quite fit VRAM. Throughput drops 2-5x on the offloaded path, but the alternative is "doesn't run."

Both fields validate at the CRD level (numeric bounds enforced by kubebuilder validation), and both refuse to emit their flags in non-GPU contexts to avoid a confusing failure mode where vLLM ignores them silently. The PR shipped with tests, the docstrings explain the tradeoffs, and the code path is sympathetic to operators trying to put a 70B FP8 model on a 48 GiB card. Thank you, Faylixe.
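
Roughly how the two fields sit on a spec (the wrapper is illustrative; the two vllmConfig keys and their bounds are as described above):

spec:
  vllmConfig:
    gpuMemoryUtilization: 0.85   # maps to --gpu-memory-utilization; leaves ~15% of VRAM as headroom
    cpuOffloadGB: 8              # maps to --cpu-offload-gb; per-rank, so 8 GB of CPU RAM per GPU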

A Longhorn bug, a clean reproducer, and a same-release fix

The story for this one is worth telling in full. A community user opened a bug report against a fresh 0.7.6 install on K3s 1.31, with Longhorn as the storage class, on an RTX 3060 home cluster. Their InferenceService kept failing because the init container couldn't write to the persistent volume. The error was unambiguous: mkdir: can't create directory '/models/<hash>': Permission denied.

We hadn't seen this internally because our default kind-based CI used the local-path-provisioner, which mounts volumes in 0777 hostPath mode. Permissive enough to hide the root cause. Longhorn (and OpenShift's restricted-v2 SCC) provision PVCs as root:root 0755, which is fine for inference containers that run as root but breaks the standard non-root curlimages/curl:8.18.0 init image (uid=101, gid=102).

The reporter described themselves as a beginner in the bug body, then delivered one of the cleanest reproducers we've seen this year: the full manifest, the storage provider (Longhorn ext4 on a 100 GiB ReadWriteOnce PVC), the exact error line, the working manual workaround (a kubectl patch that set podSecurityContext.fsGroup: 1000), and a root-cause hypothesis that pointed directly at the fix. The hypothesis named the right mechanism but had the specific GID wrong (curlimages/curl runs as curl_user uid=101 and curl_group gid=102, not 1000); we corrected the value while writing the fix. None of that detracts from the report. The reproducer was self-contained enough that the fix shipped within hours of the bug landing.

The fix in PR #419 defaults fsGroup: 102 on the rendered PodSecurityContext whenever the user hasn't supplied their own. Kubernetes recursively chowns the volume to curl_group and adds it to every container's supplementary groups, which means the volume is writable for the init container and readable by the inference container regardless of its primary UID. User-supplied spec.podSecurityContext values are still respected verbatim.
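
If you need a different GID, the override is the same field the reporter patched by hand; whatever you set in spec.podSecurityContext is passed through verbatim and suppresses the 102 default (the value below is just an example):

spec:
  podSecurityContext:
    fsGroup: 1000   # example GID; a user-supplied value always wins over the operator default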

A new Longhorn-backed e2e CI job exercises the fix against the same root:root 0755 conditions Longhorn produces in production, so any future regression in this code path will hit CI before it hits a contributor's home cluster. Thank you to the reporter, whose bug filing should be the new gold standard.

Observability: runtime label, recording rules, and a starter dashboard

Inference pods now carry a fourth operator-managed label, inference.llmkube.dev/runtime, alongside the existing app, inference.llmkube.dev/service, and inference.llmkube.dev/model. The PodMonitor relabelings promote all four onto every scraped time series, so dashboards and recording rules can group by service or runtime without an expensive join against kube_pod_labels.

A small set of recording rules (llmkube:inference:request_latency:p95_5m, llmkube:inference:ttft:p95_5m, llmkube:inference:llamacpp_request:p95_5m) pre-aggregates the two most-asked latency queries across both runtimes, and a starter Grafana dashboard ships in docs/grafana/llmkube-inference.json to consume them. Drop it into your Grafana, point it at your Prometheus, and you have a per-runtime / per-service latency view in 30 seconds. PR #410.
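
For a sense of what the rules compute, here is a sketch of one of them. The rule name is the real one listed above, but the underlying histogram metric and the promoted label names are assumptions, so check the shipped rule file for the exact expressions:

groups:
  - name: llmkube.inference.recording
    rules:
      - record: llmkube:inference:request_latency:p95_5m
        expr: |
          histogram_quantile(0.95, sum by (le, service, runtime) (
            rate(llmkube_inference_request_duration_seconds_bucket[5m])
          ))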

Metal-agent now emits Kubernetes events

When the metal-agent's memory watchdog transitions pressure levels, evicts a workload, skips an eviction under the friendly-fire guard, or blocks a respawn because pressure is still elevated, it now publishes a Kubernetes Event on the affected InferenceService. kubectl describe isvc <name> surfaces them, and k9s / Lens / ArgoCD pick them up automatically. PR #411, closes #390.

Pairs nicely with the llmkube_metal_agent_evictions_skipped_total{reason} counter from 0.7.6: the metric tells you the rate, the events tell you the why.

A 35-hour CPU pin, and the fix for it

Issue #405 reported a Mac kind cluster that pinned a CPU core for 35 hours. The root cause was a hybrid topology where the metal-agent runs on the macOS host and the controller runs inside a container, with an InferenceService referencing a file:// Model source that pointed at a host path the controller pod couldn't see. The controller's fetch failed with ENOENT immediately, returned the error to controller-runtime, and entered the rate-limited workqueue's tight-retry path. ~5 ms initial backoff, ramping to a cap, doing expensive reconcile work (status updates, GGUF parses, metric writes) hundreds of times per second.

PR #412 treats ENOENT and EACCES on Model fetch as terminal. The reconcile still updates status (Model goes Failed with a clear reason) and still requeues after 5 minutes, but it returns nil to controller-runtime instead of the error, so the rate-limited fast-retry path doesn't engage. The steady-state cost of a misconfigured Model is now one reconcile every 5 minutes until the operator fixes the spec, not a pinned core. A runbook entry landed alongside the fix in docs/operations/runbooks/controller-hot-spin-on-file-source.md for anyone hitting it on an older release.
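
For context, the misconfiguration looks something like this; the apiVersion, metadata, and exact source shape are illustrative, and Model.spec.source is the field whose docstring this release expands:

apiVersion: llmkube.dev/v1alpha1   # illustrative
kind: Model
metadata:
  name: qwen3-4b-local
spec:
  source: file:///Users/me/models/qwen3-4b-q4_k_m.gguf   # resolves on the macOS host, ENOENT inside the controller pod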

Upgrade notes

Nothing in 0.7.7 is a breaking API change. CRD edits are additive (two new optional fields on vllmConfig, plus an expanded docstring on Model.spec.source). The chart's values.yaml gains one new key with a default that does the right thing for vanilla Kubernetes. A few things to expect on upgrade:

  • First reconcile rolls every InferenceService Pod once. The deployment template gains fsGroup: 102 and the new inference.llmkube.dev/runtime label, both of which change the pod template hash. Single-replica services see a brief restart; multi-replica services do a rolling update. The recursive chown that fires on first volume mount can be slow on large model-cache PVCs, so expect first-restart cold starts to take a bit longer than usual.
  • OpenShift / OKD / MicroShift installs must pass the new preset: helm upgrade llmkube ./charts/llmkube -f charts/llmkube/values-openshift.yaml -n llmkube-system. Without it, the restricted-v2 SCC will reject the rendered pod template at admission.
  • If you run a custom --init-container-image whose UID/GID is not the curl default (curl_user=101 / curl_group=102): set spec.podSecurityContext on each InferenceService, or pass --default-fsgroup=<your-gid> to the controller via the chart's controllerManager.initContainer.defaultFSGroup value.

Full rollback instructions are in docs/operations/runbooks/upgrade-rollback.md if you need them.

Install

# Helm (vanilla Kubernetes)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
helm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace

# Helm (OpenShift / OKD / MicroShift)
helm install llmkube llmkube/llmkube \
  -f charts/llmkube/values-openshift.yaml \
  --namespace llmkube-system --create-namespace

# CLI (macOS / Linux)
brew install defilantech/tap/llmkube

# Upgrade in place
brew upgrade llmkube
helm upgrade llmkube llmkube/llmkube --namespace llmkube-system

Full changelog on the v0.7.7 release page.

What's next

The next release window is focused on multi-cluster scenarios: a single LLMKube control plane managing inference fleets across multiple sites, which is what a real edge-AI deployment in manufacturing or retail looks like. Plus a few more community PRs that are already in flight, and the next round of runbooks for the operations index.

If you're running LLMKube, file issues, ping us on Discord, or follow along on GitHub. Real workloads find real bugs. That's how Faylixe found #339 in 0.7.6, that's how someone on Longhorn found #418 in 0.7.7, and it's how the next few features will probably get scoped too.
