What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support

Christopher Maher

0.7.9 is a release that came out of dogfooding. The headline feature is a new runtime: the metal-agent, LLMKube's native macOS daemon, can now supervise mlx-server, an OpenAI-compatible MLX inference server, as a first-class runtime alongside llama-server. The rest of the release is the honest part. While we were building toward a metrics-driven autoscaling tutorial and running the new runtime as the live backend for a real agentic coding workload, four genuine bugs fell out. All four are fixed here, and with them fixed, spec.autoscaling works end to end. The release also adds a Kubernetes scale subresource on InferenceService, from a community contributor. Here's what landed.

mlx-server: a second Apple Silicon runtime

Up to 0.7.9, the metal-agent had one runtime: llama-server from llama.cpp. That covers a lot of ground, but MLX is Apple's own array framework, and an MLX-native inference server can exploit the M-series unified memory architecture in ways a portable C++ runtime does not. So 0.7.9 teaches the metal-agent to manage mlx-server. You select it with a single field, the same way you select any other runtime.

The metal-agent treats it as a peer of llama-server: it resolves the binary, manages the process lifecycle, runs the same three-probe health surface, and the memory pre-flight described later in this post applies to it too. Install mlx-server on the Apple Silicon node from the Homebrew tap, then point an InferenceService at it by setting spec.runtime: mlx-server.

# Install mlx-server on the Apple Silicon node
brew install defilantech/tap/mlx-server

# Select it on the InferenceService
spec:
  runtime: mlx-server

The dogfood: Qwen3.6-35B-A3B-8bit driving opencode

A runtime that only works in CI is a runtime we don't trust. So 0.7.9's mlx-server support was proven the same way the last few features were: we ran a real workload on it. We deployed Qwen3.6-35B-A3B-8bit through the new runtime on an Apple M5 Max with 128 GB of unified memory and used that endpoint as the live backend for opencode, an agentic coding tool. This was not a smoke test against a toy prompt: the model drove both the plan agent and the build agent, which means real tool-calling, and it wrote files and edited the codebase through opencode.

Qwen3.6-35B-A3B is a Mixture of Experts model: 35B total parameters, roughly 3B active per token. The 8-bit quant is the sweet spot on a 128 GB machine; it leaves enough headroom for a generous context window without spilling. An MoE model is also a good stress test for a new runtime, because the routing layer touches more of the memory subsystem than a dense model of the same active size would.
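
The headroom claim can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration, not part of LLMKube: the function name is hypothetical, and it assumes each parameter costs exactly its quantized bit width, ignoring quantization scales, the KV cache, and runtime overhead, so real usage will be somewhat higher.

```python
def weight_footprint_gb(total_params_billions: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes).

    Assumption: every parameter costs exactly bits_per_param bits. Real
    quantized checkpoints also store scales and other metadata.
    """
    return total_params_billions * bits_per_param / 8


UNIFIED_MEMORY_GB = 128  # the M5 Max used in this post
TOTAL_PARAMS_B = 35      # total parameters, not active ones, set the footprint

for bits in (16, 8, 4):
    weights = weight_footprint_gb(TOTAL_PARAMS_B, bits)
    headroom = UNIFIED_MEMORY_GB - weights
    print(f"{bits:>2}-bit: ~{weights:.1f} GB weights, ~{headroom:.1f} GB left")
```

Under these assumptions the 8-bit weights come to about 35 GB, leaving roughly 93 GB of the 128 GB for context, the OS, and everything else on the machine.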

Benchmark numbers from that machine

The numbers below are from the same M5 Max, the same Qwen3.6-35B-A3B-8bit deployment, running through mlx-server v0.1.0:

  • 102.7 tokens/sec single-stream. That is the steady-state decode rate for a single request, and it was very stable across runs: not a best-case spike, a number you can plan around.
  • 107 ms time-to-first-token. Fast enough that the agent loop feels responsive, which matters when a build agent is doing many small turns rather than one long generation.
  • 4-way concurrency aggregates to only ~1.3x. Worth being explicit about: with four concurrent streams, mlx-server v0.1.0 does not get anywhere near 4x aggregate throughput. It lands around 1.3x. For a single-user coding tool, where one developer is driving one agent, that is fine; the single-stream number is the one that matters for that use case. If your workload is many concurrent users, this is a real ceiling in v0.1.0 and you should size around it rather than be surprised by it.

We are not going to dress up the concurrency number. mlx-server is a young project at v0.1.0, the single-stream story is genuinely good, and the multi-stream story will improve. The honest framing: this runtime is excellent for an individual developer running a local agent today, and the concurrency work is upstream's to do.
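
To make the sizing advice concrete, here is the arithmetic behind the concurrency number as a small sketch. It assumes the ~1.3x figure is measured relative to the 102.7 tokens/sec single-stream rate and that the four streams share throughput evenly; the post states neither, and the function names are made up for this example.

```python
SINGLE_STREAM_TPS = 102.7  # single-stream decode rate reported in this post
AGGREGATE_SCALING = 1.3    # approximate 4-way aggregate factor reported in this post
STREAMS = 4


def aggregate_tps(single_tps: float, scaling: float) -> float:
    """Total tokens/sec across all concurrent streams."""
    return single_tps * scaling


def per_stream_tps(single_tps: float, scaling: float, streams: int) -> float:
    """Tokens/sec each stream sees, assuming an even split (an assumption)."""
    return aggregate_tps(single_tps, scaling) / streams


total = aggregate_tps(SINGLE_STREAM_TPS, AGGREGATE_SCALING)
each = per_stream_tps(SINGLE_STREAM_TPS, AGGREGATE_SCALING, STREAMS)
print(f"aggregate: ~{total:.0f} tok/s, per stream: ~{each:.0f} tok/s")
```

Under those assumptions, four concurrent users would each see roughly a third of the single-stream rate, which is the number to size around.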

Four bugs the autoscaling tutorial flushed out

The bigger story in 0.7.9 is the four bugs. We were building toward a tutorial on metrics-driven autoscaling: deploy a model, put load on it, watch the HorizontalPodAutoscaler add replicas off real inference metrics. Writing a tutorial forces you to walk the exact path a user walks, and walking that path on real hardware surfaced four things that were quietly broken. None of them showed up in CI. All four are fixed in this release.

  • The PodMonitor selector matched no pods. The PodMonitor that is supposed to scrape llama.cpp inference pods had a selector that did not match the labels the operator actually puts on those pods. The result was silent: no error, no warning, just a PodMonitor that selected nothing, so llama.cpp metrics never reached Prometheus. Any autoscaling or dashboard built on those metrics was building on an empty series. With the selector corrected, the inference pods get scraped.
  • The operator fought the HorizontalPodAutoscaler. When an HPA scaled the underlying Deployment, the InferenceService reconciler would reconcile the Deployment back to the replica count in its own spec, and the HPA would scale it again. The two controllers tugged the replica count back and forth, so autoscaling never settled. The fix makes the operator stop owning the replica count when an HPA is in charge of it.
  • The Metal-path InferenceService never updated to Ready. For InferenceServices served by the metal-agent, the service could be healthy and serving traffic while its status sat at not-Ready forever. The operator was not watching the Endpoints object for the Metal path, so it never observed that the backend had come up. Adding the Endpoints watch lets the reconcile fire when the backend is actually ready and the status flips to Ready.
  • The metal-agent memory pre-flight was skipped for local-path models. The metal-agent runs a memory pre-flight check before starting a runtime, so it can refuse to start a model that will not fit rather than thrash. That check was being skipped entirely for models loaded from a local path, which are exactly the models you reach for during local dogfooding. The pre-flight now runs for local-path models too.
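
The first bug is easy to reproduce on paper. A PodMonitor selector with matchLabels selects a pod only if every key/value pair in the selector is present on the pod, so a single mismatched label selects nothing and raises no error. The sketch below models that rule; the label keys and values are invented for illustration and are not the labels LLMKube uses.

```python
def selector_matches(match_labels: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True if every selector pair appears on the pod (matchLabels semantics)."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical labels, for illustration only.
pod_labels = {"app": "demo-model", "example.com/role": "inference"}

stale_selector = {"app": "demo-model", "example.com/role": "server"}
fixed_selector = {"app": "demo-model", "example.com/role": "inference"}

pods = [pod_labels]
print(sum(selector_matches(stale_selector, p) for p in pods))  # pods selected by the stale selector
print(sum(selector_matches(fixed_selector, p) for p in pods))  # pods selected by the fixed selector
```

The stale selector selects zero pods and the corrected one selects the pod. On a live cluster, running kubectl get pods -l with the PodMonitor's labels answers the same question.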

The theme tying these together is the one this blog keeps coming back to: we verify on real hardware, and we do not ship features that only work in theory. A metrics-driven autoscaling tutorial that you cannot actually follow end to end is not a tutorial, it is a wish. With these four fixes in place, spec.autoscaling, the native HPA-based autoscaling on InferenceService, works end to end: the PodMonitor scrapes, the metrics land in Prometheus, the HPA reads them, and the operator lets the HPA drive the replica count without fighting it.
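
The replica tug-of-war and its fix can be sketched in a few lines. This is not the operator's actual code: the function and parameter names are hypothetical, and it shows one common way for a controller to yield the replica count to an HPA, namely keeping whatever value is live on the Deployment instead of re-applying the value from its own spec.

```python
from typing import Optional


def replicas_to_apply(spec_replicas: int,
                      live_replicas: Optional[int],
                      hpa_managed: bool) -> int:
    """Replica count a reconciler writes to the Deployment.

    When an HPA manages the workload and the Deployment already exists,
    keep the live value so the HPA's decision is not overwritten.
    """
    if hpa_managed and live_replicas is not None:
        return live_replicas
    return spec_replicas


# The HPA has scaled the Deployment to 4 while the spec still says 1.
print(replicas_to_apply(spec_replicas=1, live_replicas=4, hpa_managed=False))
print(replicas_to_apply(spec_replicas=1, live_replicas=4, hpa_managed=True))
```

Without the hpa_managed check, the reconciler resets the count to 1 on every pass, which is the back-and-forth described in the second bug. With it, the HPA's value of 4 survives the reconcile.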

kubectl scale support on InferenceService

0.7.9 adds a Kubernetes scale subresource to the InferenceService CRD. In practice that means kubectl scale now works directly on an InferenceService:

kubectl scale inferenceservice/my-service --replicas=3

This is the standard Kubernetes shape for a scalable resource, and it is the manual counterpart to the spec.autoscaling path covered above: use kubectl scale when you want to set the replica count yourself, and spec.autoscaling when you want an HPA to set it from metrics. Between them, scaling an InferenceService now behaves the way Kubernetes users expect.

Welcome, mircea-pavel-anton

The scale subresource is the work of mircea-pavel-anton, who landed it in PR #474. This is their first contribution to LLMKube, and it is a genuinely good one: a small, correct, infrastructure-shaped change that slots cleanly into the autoscaling work the rest of this release is built on. Thank you, Mircea. A well-scoped external PR that lands the standard Kubernetes pattern is exactly the kind of contribution we hope to see more of, and we're glad you chose LLMKube for it.

Upgrade notes

Nothing in 0.7.9 is a breaking API change. The mlx-server runtime is purely additive: if you never set spec.runtime: mlx-server, nothing about your deployments changes. The scale subresource is additive on the InferenceService CRD. One thing to expect on upgrade:

  • The corrected PodMonitor starts scraping inference pods. Because the old PodMonitor selector matched no pods, llama.cpp inference metrics were silently absent from Prometheus. After upgrading, the corrected selector matches, the inference pods get scraped, and those metrics start appearing. If you have dashboards or alerts that looked empty before, they should begin populating once Prometheus picks up the new PodMonitor. This is the fix landing, not a regression.

Install

# Helm (vanilla Kubernetes)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
helm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace

# Helm (OpenShift / OKD / MicroShift)
helm install llmkube llmkube/llmkube \
  -f charts/llmkube/values-openshift.yaml \
  --namespace llmkube-system --create-namespace

# CLI (macOS / Linux)
brew install defilantech/tap/llmkube

# mlx-server runtime (Apple Silicon nodes)
brew install defilantech/tap/mlx-server

# Upgrade in place
brew upgrade llmkube
helm upgrade llmkube llmkube/llmkube --namespace llmkube-system

Full changelog on the v0.7.9 release page.

What's next

The autoscaling tutorial that flushed out these four bugs is the next thing to ship now that the path it walks actually works end to end. mlx-server is at v0.1.0, and its concurrency story is the obvious place for it to grow; we'll keep dogfooding it as a local agentic backend and feeding what we find back upstream.

If you're running LLMKube, file issues, ping us on Discord, or follow along on GitHub. Real workloads find real bugs. All four fixes in this release came out of trying to write a tutorial and running a model on the new runtime under a real coding workload, and the scale subresource came from a community contributor noticing a gap. We'll keep shipping in that direction.

LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.