
Blog

Insights, tutorials, and updates from the LLMKube team

Releases 8 min read

What we shipped in LLMKube 0.7.6: memory-pressure protection, mutable modelRef, and a community PR worth celebrating

0.7.6 is the biggest LLMKube release since multi-GPU sharding landed. It brings memory-pressure protection on the metal-agent (priority-based eviction with a friendly-fire guard), a finally mutable modelRef, ParallelSlots extended to vLLM thanks to a polished community PR from @Faylixe, three new K8s-native pod fields (runtimeClassName, podAnnotations, podLabels), a real CNCF-style docs site, and a fix for a quickstart-killer caught Saturday night. Here's what landed.

Christopher Maher
Benchmarks 16 min read

vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data

TheTom asked us to run his vllm-swift TurboQuant+ work through the same kind of sweep we did on the llama.cpp fork. 36 cells, then a deep-context follow-up out to 192K. fp16 wins per-seq decode at every cell where it runs, but hits the memory ceiling at d=128K B=32 and d=192K B=32. turbo4v2 runs both: 1,360 tok/s and 1,024 tok/s aggregate. That is the value-prop confirmation: TurboQuant+ on this engine on this hardware is a memory-ceiling tool, not a throughput accelerator. Honest numbers below.

Christopher Maher
Benchmarks 11 min read

TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max

A follow-up to the M5 Max long-context post. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. The overnight bench delivered all four. q8_0 KV is essentially free at 4k context (KL 0.0016, top-1 token agreement 98.6%). -ctk q8_0 -ctv turbo4 matches symmetric q8_0 throughput and fits 512K where symmetric q8_0 OOM'd. -ctk f16 -ctv turbo4 hits a Metal kernel fallback and craters 78x at 128K.

Christopher Maher
Benchmarks 10 min read

TurboQuant on a MacBook Pro: two findings the upstream discussion missed

Built TheTom's TurboQuant fork of llama.cpp for Metal, ran the bench overnight on M5 Max, and surfaced two findings the upstream community thread didn't have. First: at 128K+ context, turbo3 (3-bit KV) beats q8_0 (8-bit KV) on prompt processing. Second: turbo3 and turbo4 split by phase at long context, with turbo3 winning prefill and turbo4 winning decode. Plus 1M context for batch coding workloads on a MacBook, and two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.

Christopher Maher
Benchmarks 12 min read

62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.

Qwen3.6-35B-A3B Q8 on a MacBook Pro M5 Max scored 62.2% on Aider Polyglot (n=225/225), beating Claude Sonnet 4 with 32k thinking, o1-high, and DeepSeek R1 on the official leaderboard. Then Devstral 2 scored 4% on the same harness but 81.7% on HumanEval+: same model, 20× swing, benchmark numbers don't transfer. Plus the InferCost Apple Silicon collector that landed today, validating live cost-per-token attribution end to end with sub-watt agreement to the agent gauge.

Christopher Maher
Engineering 8 min read

Why Qwen 3.6 Doesn't Need --cpu-moe (and Why Qwen3-Coder Does) on Dual 16GB

The --cpu-moe flag saves VRAM at the price of extra CPU compute per token. On dual RTX 5060 Ti cards that trade is required to run Qwen3-Coder-30B at all, but it is pure overhead for Qwen 3.6-35B-A3B, whose DeltaNet attention keeps the KV cache small enough that the model already fits in VRAM. Same hardware, same flag, opposite correct answers. Plus what shipped in LLMKube 0.7.0 because of the thread that surfaced this.

Christopher Maher
LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.