Blog

Insights, tutorials, and updates from the LLMKube team

The best part of LLMKube 0.9.0 is code I did not write

For two months this blog has argued that the hard part of a self-hosted coding agent is the harness, not the model. LLMKube 0.9.0 is where that harness, Foreman, stopped being something I build alone. The two changes I am proudest of were written by Jory, a contributor I have never met. He found our fleet scheduler funneling every task onto one node while the rest sat idle, found 23 orphaned nodes leaking on a live cluster, and fixed both correctly, including the concurrency subtleties my review caught. This post covers the release, explains why the partnership is the real story, and is an open invitation to help build the parts you care about.

A local model opened 41 of our pull requests in five weeks. The model is the least interesting part.

A 27B model on an AMD mini-PC fixed a bug in our operator. Then it overreached.

Trust the harness, not the model: a weekend of local agents building their own guardrails

Making a fleet of self-hosted LLM agents trustworthy

My Mac kernel-panicked, so I taught my code reviewer to stop trusting itself: 48 hours of local-model review integrity

Back to Shadowstack: a 35B at 256K context (and 512K with YaRN) on two consumer Blackwell cards

Introducing Foreman: a Kubernetes-native orchestrator for your local LLM fleet (LLMKube 0.8.0)

What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support

What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story

What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix

What we shipped in LLMKube 0.7.6: memory-pressure protection, mutable modelRef, and a community PR worth celebrating

vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data

TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max

TurboQuant on a MacBook Pro: two findings the upstream discussion missed

62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.

We ran Qwen3.6-27B on $800 of consumer GPUs, day one: llama.cpp vs vLLM

I Sent the Agents Loose on My Kubernetes Operator. Here's What They Shipped.

Why Qwen 3.6 Doesn't Need --cpu-moe (and Why Qwen3-Coder Does) on Dual 16GB

The Model I Deployed Wrote My Operator's Next Feature

Your Local LLM Can Write Code While You Sleep. Here's What Ours Built.

How We Got Native Metal GPU Performance in Kubernetes (Without Containers)

I Built a Text-to-SVG Pipeline Over a Weekend (And You Can Too)

Introducing CLI Benchmarks: Test Your LLM Deployments Like a Platform Engineer

ShadowStack Stress Test: Running Production 32B Models on Consumer Hardware

Why Ollama Breaks at Scale (And What to Do About It)

Thanksgiving 2025: Gratitude, Benchmarks, and Building in the Open

Multi-GPU Support Ships: First Run on ShadowStack

Building ShadowStack: Our On-Prem LLM Testing Lab

Introducing LLMKube: Kubernetes for Local LLMs