What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story
0.7.8 is the release where LLMKube grows a routing layer. The new ModelRouter CRD exposes a single OpenAI-compatible endpoint that dispatches across local InferenceService instances and external providers (Anthropic, OpenAI, LiteLLM, Bedrock, Vertex), enforces fail-closed policy for regulated data, and ships with the per-rule and per-backend timeout knobs that real agentic workloads actually need. Alongside it: the stack of supporting fixes the data plane needed before we could honestly call it live, three new docs guides, and the kind of Phase 1 limitations callout that says "here's what we don't do yet" so you can plan around it.
The story: why ModelRouter
Every team building agentic systems hits the same shape: a fast local model handles most of the work, a frontier cloud model handles the hard steps, and somewhere in the middle there's a compliance question about which kinds of data are allowed to leave the cluster. Up to 0.7.7, LLMKube gave you the local-model piece. The handoff and the policy gate were Python in the agent runtime, scattered across teams, hard to audit.
ModelRouter pulls those concerns into the platform where they belong. Declare the backends and the routing rules in a CRD, point the agent at one URL, and the proxy resolves the rest on every request: classification, capability match, fail-closed enforcement, fallback, audit log. The agent code shrinks to "talk to this endpoint." Routing policy moves to the layer where the security team can actually review it.
What ModelRouter actually does
A ModelRouter resource compiles to a small managed HTTP proxy (a controller-owned Deployment + Service + ConfigMap) that speaks the OpenAI Chat Completions API. The compiled config is mounted from the ConfigMap; a content hash on the pod template auto-rolls the proxy when you edit the spec.
One thing to set expectations on up front: v1alpha1 ModelRouter is a guardrail and audit layer, not a classifier. Routing decisions are deterministic functions of caller-supplied signals (request body, headers) and the rules you declare; the proxy doesn't run NLP, doesn't scan prompts for PII, and doesn't infer task complexity from request size. Classifier sidecars that auto-tag are Phase 2 scope. The contract today is "caller asserts, platform enforces and audits" — more on that in the next section.
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata: { name: coding-router }
spec:
  backends:
    - name: local-qwen
      inferenceServiceRef: { name: qwen3-coder }
      tier: local
      capabilities: [code, tools]
    - name: cloud-opus
      external:
        provider: anthropic
        model: claude-opus-4-7
        credentialsSecretRef: { name: anthropic-key }
      tier: cloud
      capabilities: [vision]
  rules:
    - name: pii-stays-local
      match: { dataClassification: [pii] }
      route: { backends: [local-qwen] }
      failClosed: true
      timeout: 8s
    - name: complex-to-cloud
      match: { taskComplexity: complex }
      route: { backends: [cloud-opus, local-qwen] }
      timeout: 60s
  defaultRoute: local-qwen
```

Six things in that snippet earned their place by hitting real problems we had:
- `tier: local/cloud`. A backend's tier is the policy-level fact: cloud-tier backends are the ones the fail-closed gate can reject. The CRD's apply-time validator rejects rules that match `dataClassification: [pii]` and route to a `tier: cloud` backend, so a policy violation is caught at `kubectl apply` time, not at request time (see the rejected-spec sketch after this list).
- `failClosed: true`. When the rule has matched and none of its backends are healthy, the proxy returns HTTP 503 instead of falling through to `defaultRoute`. For PII this is the only correct behavior: "fall back to a cloud model" is exactly what you don't want when the local one is down.
- `timeout` on the rule. Configurable response-header timeout per rule, per backend, or globally. A `pii-stays-local` route with an 8s budget can fail fast; a `complex-to-cloud` route with a 60s budget gives Anthropic Opus enough room to think. This shipped late in the release window (PR #461, closes #457 and #458) because the original 30s global default was too tight for cold-start Bedrock.
- `capabilities`. Rules can require capabilities the backend advertises (`match.requiredCapabilities: [vision]`); requests asking for capabilities no backend has return 503 instead of silently degrading.
- Multiple backends per rule. `backends: [cloud-opus, local-qwen]` with the default `primary-fallback` strategy means try Opus first, fall back to local on 5xx. A half-open circuit breaker (PR #454) lets a quarantined backend recover instead of staying locked out for the full quarantine duration after a single error.
- External providers without re-marshalling. `provider: anthropic` selects the request-shape adapter; the proxy injects credentials from the `credentialsSecretRef` Secret and forwards the body untouched. No JSON copy, no overhead beyond the TCP handshake (which on the cloud tier is intentionally fresh per request after PR #460, to avoid silent stalls from globally distributed load balancers that aggressively recycle idle connections).
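For concreteness, here's a sketch of the kind of spec the apply-time validator refuses: a rule that matches PII but routes to a cloud-tier backend. The backend names reuse the example above, and the rule name is illustrative; the point is that `kubectl apply` fails before a single request can leak.

```yaml
# Rejected at apply time: a pii rule may not route to a tier: cloud backend.
rules:
  - name: pii-to-cloud                     # validator error; never reaches the data plane
    match: { dataClassification: [pii] }
    route: { backends: [cloud-opus] }      # cloud-opus is tier: cloud in the example above
    failClosed: true
```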
Plus a structured audit log line per request: rule name, backend, tier, status, latency, configured timeout. Compliance audits, "why did this 503", and per-rule budget verification all trace back to those records.
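To make that concrete, here's the shape of one record, sketched in YAML for readability. The field names are illustrative assumptions, not the proxy's actual log schema; the dimensions themselves (rule, backend, tier, status, latency, configured timeout) are the ones listed above.

```yaml
# Illustrative audit record; field names are assumptions, dimensions are from the release notes.
rule: pii-stays-local
backend: local-qwen
tier: local
status: 200
latencyMs: 412
configuredTimeout: 8s
```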
How matching works (caller-asserts, platform enforces)
Worth being explicit about, because it changes the integration shape on the caller side. The proxy extracts five matching dimensions per request and the source for each is fixed in v1alpha1:
match.models: ["gpt-4*"]— read from the OpenAI request body'smodelfield. Glob-matched with Go'spath.Match. Zero glue: every OpenAI-shaped client SDK already sets this field.match.requiredCapabilities: [vision]— matched against each candidate backend's advertisedcapabilitiesarray on the ModelRouter spec, not against the request. The rule matches when at least one candidate backend has every required capability. Zero glue: capabilities are declared in the CRD by the platform team.match.headers: { X-Foo: bar }— case-insensitive equality against the inbound HTTP headers. Caller asserts: agent runtime sets whatever headers the platform team declared.match.dataClassification: [pii]— matched against thex-llmkube-classificationheader (header name configurable viaPolicy.Classification.HeaderKey). Caller asserts: the agent author tags requests that handle sensitive data; the proxy enforces and audits.match.taskComplexity: complex— matched against thex-llmkube-task-complexityheader. Caller asserts: same shape.
Two things follow from that contract. First, the value the proxy delivers without any caller cooperation is real and complete: model-glob routing, capability-aware routing, primary-fallback with the half-open circuit breaker, per-rule timeouts, fail-closed apply-time validation, and the audit log. If your routing logic is "this model goes here, that one goes there, with cloud-then-local failover and a budget on response-header latency," you ship today with one kubectl apply and no agent changes.
Second, the policy-enforcement story (PII never leaves on-prem, complex tasks always go to the big model) requires the agent runtime to set the right header. The proxy is the guardrail: it enforces routing on whatever the caller asserts, blocks misconfigured specs at kubectl apply time, and logs every decision. The proxy is not the source of truth for what a given request actually contains. For a self-built agent that's one line of code per signal. For an off-the-shelf agent runtime (OpenCode, Claude Code, Aider, Cline), it's a small wrapper or a feature request upstream.
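At the HTTP layer, "one line of code per signal" looks like the sketch below. The in-cluster Service hostname and port are placeholders (take them from your ModelRouter's proxy Service), the path follows the OpenAI Chat Completions convention the proxy speaks, and the header name is the default described above.

```bash
# Illustrative request through the router proxy; hostname and port are placeholders.
# The classification header is the caller-asserted signal the proxy enforces and audits.
curl -sS http://coding-router-proxy.llmkube-system.svc:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-llmkube-classification: pii" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "Summarize this patient record"}]}'
```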
Why this design: classifier sidecars sound nice in a release post but are operationally fraught (false negatives are silent compliance violations; false positives break the developer experience). The v1alpha1 contract lets the security team own the policy CRD without owning the classifier accuracy, and the agent team owns the tagging logic where they can ship it iteratively. Pluggable classifier sidecars are explicit Phase 2 scope, and the matching surface above is forward-compatible: a sidecar that sets the header is indistinguishable from an agent that sets it.
What it means for agentic coding
If you're running OpenCode, Claude Code, Cline, or Aider against a local model today, ModelRouter is the piece that lets you burst past on-prem capacity without hard-coding the failover in the agent. Two shapes work cleanly with what these runtimes already send:
- Route by model name. Configure your agent to address its long-context requests to `claude-opus-4-7` and routine completions to `qwen3-coder`. The proxy's `match.models` rules dispatch each to the right backend, with cloud-then-local failover and the half-open circuit breaker handling intermittent cloud outages. Zero agent code changes beyond the model strings the agent already uses (see the rules sketch after this list).
- Tag with one header per signal. If you also want PII enforcement or a hard split between "complex" and "routine" traffic, the agent sets `x-llmkube-classification: pii` or `x-llmkube-task-complexity: complex` on the requests where it applies. The proxy enforces the policy and logs the decision. For OpenCode and Aider that's a small client wrapper or upstream patch; for self-built agents it's one line at the request boundary.
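A minimal sketch of the rules block for the first shape, assuming the backends from the earlier example; only the `match.models` globs and the primary-fallback ordering described above are used, and the rule names are illustrative.

```yaml
rules:
  - name: opus-to-cloud-then-local
    match: { models: ["claude-opus*"] }             # the long-context model string the agent already sends
    route: { backends: [cloud-opus, local-qwen] }   # primary-fallback: Opus first, local on 5xx
    timeout: 60s
  - name: routine-completions-stay-local
    match: { models: ["qwen3-*"] }
    route: { backends: [local-qwen] }
defaultRoute: local-qwen
```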
We took the proxy data plane through a deliberate agentic-suitability audit before tagging the release. Streaming SSE is genuinely unbuffered (8 KiB chunks, flush per write). The request body cap is 32 MiB, so 128K-token prompts pass through cleanly. Client disconnects propagate as `context.Canceled` and explicitly do not quarantine the backend, so routing decisions stay clean when the agent gives up. The connection-pool tuning splits intentionally: local backends share the pool (10s idle timeout), cloud backends opt out with `Connection: close` per request, trading the TCP handshake for robustness against silent LB drops.
Worth being explicit about what ModelRouter Phase 1 doesn't do, because the gaps are real and you're better off planning around them than getting surprised by them:
- Timeouts cap TTFT, not total stream duration. `rule.timeout` bounds how long the proxy waits for the first response header from the upstream. Once headers arrive, the stream runs as long as the upstream keeps producing. For 10-minute refactors that's exactly what you want; for a hung stream where the upstream goes silent mid-response, your client-side read deadline is the safety net. A stream-duration cap is Phase 2.
- The audit log doesn't include token counts or streamed bytes. Per-request token / byte accounting and a proxy-emitted Prometheus histogram are the headline item for the next release (#433).
- Inbound request bodies are buffered, not streamed. Outbound is streamed chunk by chunk; inbound waits for the full body (up to the 32 MiB cap) before dispatch. For 50 concurrent long-context requests that's ~25 MB of resident memory, comfortable on a default 256 MiB proxy pod, but not zero.
All three are documented as Phase 1 limitations on the ModelRouter concept doc, framed as scope rather than apology. Users hitting these are sophisticated enough that honesty earns trust.
Stability fixes that made this release ship-ready
The data plane went through a real workload-shaped audit before we tagged. Four bugs surfaced, four bugs fixed, all with isolated tests pinning the invariant:
- Per-attempt context deadlines no longer quarantine backends (PR #463, closes #462). A strict rule with a 50ms timeout against a slow-but-healthy backend used to mark the backend unhealthy on `context.DeadlineExceeded`. A sibling rule with a 120s timeout pointing at the same backend would then starve until quarantine expired. The fix is surgical: only quarantine on genuine connectivity failures (dial / TLS / 5xx), not on context errors.
- Half-open circuit breaker recovery (PR #454, closes #452 and #453). After a quarantine expires, the proxy now admits one probe request before fully reopening the backend. If the probe succeeds the backend goes healthy; if it fails, quarantine extends without flooding it with traffic.
- Cloud-tier connection lifecycle (PR #460, closes #459). Anthropic / OpenAI / Bedrock load balancers recycle idle connections aggressively and don't always send FIN. Cloud-tier backends now use `Connection: close` per request, trading the TCP handshake cost for predictable response times under sustained load.
- External provider URL defaults plus a cluster-wide LiteLLM URL (PR #451, closes #438). First-party providers (Anthropic, OpenAI, Bedrock, Vertex) now use their published default URLs when `external.url` is unset. Operators can configure a cluster-wide LiteLLM URL via `controllerManager.routerProxy.defaultLiteLLMURL` so application teams declaring `external: { provider: litellm }` don't have to repeat the proxy address on every ModelRouter (values excerpt after this list).
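The operator-side knob, as a hedged values excerpt; the key path comes from the release notes, and the URL is a placeholder for your in-cluster LiteLLM address.

```yaml
# values.yaml excerpt: cluster-wide default URL for external: { provider: litellm } backends
controllerManager:
  routerProxy:
    defaultLiteLLMURL: http://litellm.litellm.svc.cluster.local:4000   # placeholder address
```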
And two more landed in the release-readiness PR itself (PR #468):
- External pod-template annotations survive reconcile (closes #456). The previous reconciler did a wholesale `existing.Spec.Template = desired.Spec.Template`, which stripped every annotation any external actor set on the pod template: Istio / Linkerd sidecar injectors, `kubectl rollout restart`'s `kubectl.kubernetes.io/restartedAt`, GitOps tool sync labels. The visible symptom: `kubectl rollout restart` on the proxy spinning two ReplicaSets that flap as the controller fought kubectl's annotation, and in-flight requests got truncated. The same fix landed in the `InferenceService` reconciler, where the same pattern lived. Coverage: a new e2e step in the kind merge gate now patches an external annotation, forces a reconcile via `spec.proxy.replicas`, and asserts the annotation survives.
- Owner refs no longer set `BlockOwnerDeletion`. The API server's GarbageCollector admission validates `BlockOwnerDeletion` by RESTMapping the owner Kind to check the caller's permission on the `finalizers` subresource. On kind that discovery cache is warm by the time the controller starts; on MicroShift-in-MINC the in-container apiserver populates discovery lazily on first request, and the controller's first reconcile races and loses. Result: weeks of "Run on Ubuntu (MicroShift via MINC, OpenShift SCC)" failures that the improved diagnostics from #466 finally surfaced. We don't actually need `BlockOwnerDeletion`; cascading delete works without it. Cleared it everywhere (router Deployment / Service / ConfigMap, InferenceService Deployment / Service, HPA) and the MicroShift lane is green for the first time in weeks.
Three new docs guides, plus an architecture refresh
Marcel Dempers' "That DevOps Guy" ran a video on LLMKube and the traffic wave that followed surfaced three obvious gaps in the docs: people landed on /docs/guides/air-gapped, /docs/guides/openshift-install, and /docs/guides/macos-metal from the video and hit stubs. So we ported the underlying material from the operator repo's top-level docs/ directory to the public site, with ModelRouter-aware refreshes throughout:
- Air-gapped install: the canonical no-internet path, with four model-weight strategies (`pvc://`, internal HTTPS + SHA256, `file://`, HF repo ID), a custom-CA-cert flow, the image bundle for the operator + runtime + router-proxy, and a ModelRouter section covering air-gapped LiteLLM-only setups for shops that need a "cloud-shaped" backend without public internet egress.
- OpenShift / OKD / MicroShift install: end-to-end walkthrough of the `values-openshift.yaml` preset that 0.7.7 shipped, with the per-InferenceService `podSecurityContext.fsGroup` escape hatch documented and the MicroShift CI lane status called out.
- macOS Metal Agent: native macOS install path for Apple Silicon nodes, with the launchd plist tuning, the `--memory-fraction` defaults by total RAM, the optional `--apple-power-enabled` setup, and a ModelRouter integration section showing how to wire a metal-agent-managed `InferenceService` as a local-tier backend.
- Architecture concept refresh: updated diagram showing ModelRouter as the optional policy-aware layer above `InferenceService`, plus a "where things live in the repo" table for new contributors.
Upgrade notes
Nothing in 0.7.8 is a breaking API change. ModelRouter is a wholly new CRD; users on 0.7.7 who never create a ModelRouter see no behavior change. InferenceService and Model surfaces are backward-compatible. A few things to know on upgrade:
- The router-proxy image isn't in the release pipeline yet. The chart's default `controllerManager.routerProxy.tag` is `dev` rather than the chart appVersion. If you create a ModelRouter without overriding, the proxy pod will fail to pull. Two ways forward: build the image locally with `make docker-build-router-proxy ROUTER_PROXY_IMG=<your-registry>/llmkube-router-proxy:0.7.8` and push it, then override via `--set controllerManager.routerProxy.repository=<your-registry>/llmkube-router-proxy`; or override `spec.proxy.image` per ModelRouter (commands spelled out after this list). Tracking issue: #449. Users who never create a ModelRouter are unaffected.
- Operator-managed children no longer set `BlockOwnerDeletion` on their owner reference. Cascading delete still works; the only behavior change is that the "block" semantics during foreground cascading delete are looser. LLMKube doesn't use finalizer-based cleanup workflows, so this is invisible in practice.
- Pod-template annotations and labels set by external actors now survive reconcile. If you were relying on the previous behavior (operator stomps everything), you'll see external annotations stick around. This is the intentional fix for #456.
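The first path spelled out as commands; the registry name is a placeholder, the make target and `--set` key come from the note above, and the `tag` override is an assumption that follows from the chart's `dev` default.

```bash
# Build and push the router-proxy image yourself, then point the chart at it.
make docker-build-router-proxy ROUTER_PROXY_IMG=registry.example.com/llmkube-router-proxy:0.7.8
docker push registry.example.com/llmkube-router-proxy:0.7.8

helm upgrade llmkube llmkube/llmkube --namespace llmkube-system \
  --set controllerManager.routerProxy.repository=registry.example.com/llmkube-router-proxy \
  --set controllerManager.routerProxy.tag=0.7.8
```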
Install
```bash
# Helm (vanilla Kubernetes)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
helm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace

# Helm (OpenShift / OKD / MicroShift)
helm install llmkube llmkube/llmkube \
  -f charts/llmkube/values-openshift.yaml \
  --namespace llmkube-system --create-namespace

# CLI (macOS / Linux)
brew install defilantech/tap/llmkube

# Upgrade in place
brew upgrade llmkube
helm upgrade llmkube llmkube/llmkube --namespace llmkube-system
```

Full changelog on the v0.7.8 release page.
What's next
Phase 2 of ModelRouter has three concrete items already scoped: a stream-duration timeout field so hung streams get force-closed by the proxy instead of the agent's TCP keepalive; per-request token and byte accounting in the audit log plus a Prometheus histogram (#433); and streaming inbound request bodies so 128K-token prompts dispatch without buffering. Beyond that, the budget-cap and shadow-routing strategies that the v1alpha1 schema already declares but doesn't yet enforce.
The release-pipeline gap on the router-proxy image (#449) is the other thing we want to close before 0.8.0. Once the image ships versioned alongside the controller, the chart default flips off dev and the air-gapped flow gets one step shorter.
If you're running LLMKube, file issues, ping us on Discord, or follow along on GitHub. Real workloads find real bugs. Three of the six fixes in this release came out of dogfooding the proxy under sustained agentic load over the past two weeks, and the docs gaps surfaced when viewers of the Marcel Dempers video clicked into stubs. We'll keep shipping in that direction.