
TurboQuant on a MacBook Pro: two findings the upstream discussion missed

Christopher Maher

A 7-hour overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.

TL;DR

A TurboQuant-enabled llama-server on Apple Silicon runs Qwen3.6-35B-A3B Q8 at up to 1M-token context on a 128 GB MacBook Pro M5 Max. Standard f16 KV cache OOMs at 256K. Two findings worth quoting:

  1. At 128K+ context, the 3-bit KV cache (turbo3) matches or beats the 8-bit cache (q8_0) on prompt processing. Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.
  2. turbo3 and turbo4 split by workload phase. Long-context prefill favors turbo3 (~27% faster than turbo4 at 256K). Long-context decode favors turbo4 (~11% faster than turbo3 at 256K). They are not interchangeable — different attention bottlenecks dominate during prefill and decode.

We built TheTom's feature/turboquant-kv-cache fork of llama.cpp for Metal, validated on M5 Max, and took two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.

Why KV cache, why now

If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.

Weights you can quantize once, store on disk, and forget. KV cache is generated per token of context at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with flash-attn on uses roughly 256 KB of fp16 KV per token. That sounds small until you do the multiplication:

Context   fp16 KV
32K       ~8 GB
64K       ~16 GB
128K      ~32 GB
256K      ~64 GB
512K      ~128 GB
1M        ~256 GB
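The table falls straight out of that per-token figure. A quick loop to sanity-check it yourself (the ~256 KB/token number is an approximation for this model with flash-attn on):

# fp16 KV footprint at ~256 KB/token, in GB
for ctx in 32768 65536 131072 262144 524288 1048576; do
  printf "%9d tokens -> ~%3d GB of fp16 KV\n" "$ctx" $(( ctx * 256 / 1024 / 1024 ))
done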

A 128 GB MacBook with flash-attn and mlock on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.

Standard q8_0 quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.

TurboQuant (Google Research, ICLR 2026, arxiv:2504.19874) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting ~3.25 bits per value (turbo3) or ~4.25 bits per value (turbo4) with attention-fidelity loss inside the noise floor of normal sampling variance.

Cache type   bits/value   Compression vs fp16   KV at 256K
f16          16.0         1.0×                  ~64 GB
q8_0         8.0          2.0×                  ~32 GB
turbo4       4.25         3.8×                  ~17 GB
turbo3       3.25         4.9×                  ~13 GB

Upstream community discussion at ggml-org/llama.cpp#20969. Not yet in main, landing in forks per backend. TheTom's fork (github.com/TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) is the Metal-supporting variant.

The bench

llama-bench from TheTom's fork build, single Qwen3.6-35B-A3B Q8 model, sweep across cache types and KV-depths.

./build/bin/llama-bench \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 8192 -d 32768 -d 131072 -d 262144 -d 524288 -d 1048576 \
  -p 512 -n 128 -ngl 99 -fa 1 \
  --threads 6 --batch-size 2048 \
  -r 3 -o md

-d N pre-allocates N tokens of KV cache before measuring throughput. Mean of 3 reps. Metal-agent stopped during the run for clean memory budget. The 1M cell on turbo3 alone took several hours wall-clock; full sweep ran ~7 hours overnight.
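If you're replicating the run, one way to watch the unified-memory budget while a long cell grinds, using only macOS built-ins (16384 is the page size vm_stat reports on Apple Silicon; nothing here is specific to the fork):

# Print wired memory once a minute while llama-bench runs in another terminal
while true; do
  vm_stat | awk '/Pages wired down/ {printf "wired: %.1f GB\n", $4 * 16384 / 1024^3}'
  sleep 60
done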

The numbers

Generation throughput (tok/s)

Depth   f16    q8_0   turbo3   turbo4
0       89.4   87.4   79.5     79.7
8K      84.2   79.2   72.2     71.2
32K     72.6   67.8   61.5     61.8
64K     60.7
128K    44.4   40.7   36.0     37.7
256K    OOM    26.6   22.9     25.5
512K    OOM    OOM    13.3     16.0
1M      OOM    OOM    6.51     OOM

Prompt processing throughput (tok/s)

Depth   f16    q8_0   turbo3   turbo4
0       2962   2948   2904     2854
8K      2098   1623   1653     1439
32K     1063   802    784      678
128K    321    245    253 ←    206
256K    OOM    124    128 ←    101
512K    OOM    OOM    66       56
1M      OOM    OOM    30.1     OOM

That's the complete grid. Total wall-clock for the sweep was 8h 20m.

Finding 1: turbo3 beats q8_0 at long context

The framing in the upstream discussion is approximately "turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom." That's true at short context. At long context, the trade flips.

At 128K depth, f16 wins prefill at 321 tok/s, but turbo3 at 253 tok/s edges out q8_0 at 245 tok/s. At 256K (where f16 OOMs), turbo3 at 128 tok/s beats q8_0 at 124 tok/s.

What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.
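A rough way to size that pressure difference: scale the earlier ~256 KB/token fp16 figure by bits per value to get the KV bytes each attention pass has to stream. It ignores dequant scratch space and quantization metadata, so treat it as an order-of-magnitude sketch rather than a model of the Metal kernels:

# Approximate KV cache the attention kernels must read, per full pass over context
for d in 32768 131072 262144; do
  q8=$(echo "scale=1; $d * 256 * 8 / 16 / 1024 / 1024" | bc)
  t3=$(echo "scale=1; $d * 256 * 3.25 / 16 / 1024 / 1024" | bc)
  echo "depth $d: q8_0 ~${q8} GB vs turbo3 ~${t3} GB"
done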

For coding-agent workloads where context grows monotonically across a session, this is the regime that matters. You're spending most of your tokens at 32K+ depth, not at depth 0.

Finding 2: turbo3 and turbo4 split by workload phase

The extra bit per value in turbo4 (4.25 vs 3.25 bits) buys you something specific, and what it buys depends on the phase.

Prefill (prompt processing) at long context

Depth   turbo3 pp   turbo4 pp   turbo3 advantage
8K      1653        1439        +15%
32K     784         678         +16%
128K    253         206         +23%
256K    128         101         +27%
512K    66          56          +18%

Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors turbo3 cleanly.

Decode (generation) at long context

Depth   turbo3 tg   turbo4 tg   turbo4 advantage
128K    36.0        37.7        +5%
256K    22.9        25.5        +11%
512K    13.3        16.0        +20%

During decode the dequantization overhead per access matters more than total bytes read. turbo4's simpler representation (4.25 bits has less complex quantization geometry than 3.25 bits) wins at the per-token attention pass — and the gap widens with depth.

Practical implications by workload

  • Aider/OpenCode coding agents (deep context, lots of generated tokens per turn): turbo4. Wins decode at depth.
  • RAG-heavy / batch question answering (heavy prefill, short answers): turbo3. Wins prefill at depth.
  • Pure context-window maximization (1M context): turbo3. The only cache type that fits at 1M.
  • Short-context interactive (≤32K): f16 if it fits, else q8_0. Both turbos are ~10% slower at short context.

This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.
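To act on that split, the same -ctk/-ctv switches the bench uses also work on llama-server built from the fork. A sketch for the coding-agent row (model path, port, and context size are placeholders; the turbo3/turbo4 cache types exist only in TheTom's fork, not upstream llama.cpp):

# Coding-agent shape: deep context, decode-heavy, so turbo4 KV
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 262144 \
  -ctk turbo4 -ctv turbo4 \
  -ngl 99 -fa 1 \
  --port 8080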

What this enables on a MacBook

Three concrete capabilities:

  1. 256K context for two co-resident coding models. turbo3 KV at 256K (~13 GB) plus 37 GB Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with comparable footprint, totals ~88 GB. Under the 100 GB practical budget (rough arithmetic in the sketch after this list).
  2. 1M context for batch / agentic workloads. turbo3 KV at 1M is ~52 GB. We measured 30 tok/s prefill, 6.5 tok/s decode at 1M on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but it works. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.
  3. More headroom for non-attention buffers. Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.
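A quick arithmetic check on items 1 and 2, using only numbers quoted above (the Devstral figure is just the remainder of the quoted ~88 GB total, not a separate measurement):

# Item 1: co-resident budget at 256K
echo "Qwen3.6 weights + turbo3 KV: $(( 37 + 13 )) GB; ~88 GB total leaves ~$(( 88 - 50 )) GB for Devstral weights + KV"
# Item 2: a 4K-token agent response at 1M context, decoding at 6.5 tok/s
echo "scale=1; 4096 / 6.5 / 60" | bc    # ~10.5 minutes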

Caveats

  • TheTom's fork is research-grade. Pinned to commit 11a241d0d; rebases needed as upstream moves.
  • LLMKube's metal-runtime can't drive turbo3/turbo4 yet because of #349 and #350. PR #353 closes #350; #349 is next.
  • No perplexity numbers in this run. Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.
  • Single hardware sample. M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.

What we contributed back

  • LLMKube PR #351 (merged): cacheTypeCustomK/cacheTypeCustomV on InferenceServiceSpec. Closes #282. (Manifest sketch after this list.)
  • LLMKube PR #353 (open): metal-agent respawns on ISVC spec drift; honors replicas: 0. Closes #350.
  • Issues filed: #349, #350.
  • Comment going to llama.cpp discussion #20969 with the M5 Max numbers and the prefill/decode split.
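For context, here is roughly what PR #351's fields look like on an InferenceService. Only cacheTypeCustomK/cacheTypeCustomV and replicas come from the PRs above; the apiVersion group, model field, and the rest of the spec are illustrative guesses rather than LLMKube's documented schema, so check defilantech/llmkube for the real CRD:

# Hypothetical manifest; fields other than cacheTypeCustomK/V and replicas are guesses
cat <<'EOF' | kubectl apply -f -
apiVersion: llmkube.defilantech.com/v1alpha1   # group/version is an assumption
kind: InferenceService
metadata:
  name: qwen-coder
spec:
  replicas: 1
  model: qwen3.6-35b-a3b-q8     # field name is an assumption
  cacheTypeCustomK: turbo4      # added in PR #351
  cacheTypeCustomV: turbo4      # added in PR #351
EOF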

How to try it yourself

# 1. Build TheTom's fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# 2. Run the bench (turbo3 and turbo4 separately to see the split)
./build/bin/llama-bench \
  -m /path/to/your/model.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 32768 -d 131072 -d 262144 \
  -p 512 -n 128 -ngl 99 -fa 1 -r 3 -o md
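To see the prefill/decode split yourself, repeat the same sweep with the 4-bit cache and compare the two markdown tables (identical flags; only the cache type changes):

# 3. Re-run with turbo4 and compare against the turbo3 numbers
./build/bin/llama-bench \
  -m /path/to/your/model.gguf \
  -ctk turbo4 -ctv turbo4 \
  -d 0 -d 32768 -d 131072 -d 262144 \
  -p 512 -n 128 -ngl 99 -fa 1 -r 3 -o md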

Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. M3 Pro/Max territory is more realistic for 13B models at 128K with turbo3.

For NVIDIA: @spiritbuun's CUDA fork is the equivalent path.

Open invitation

If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, we want your numbers. The crossover point and the prefill/decode split likely shift with memory bandwidth.

Drop results in llama.cpp discussion #20969 or open an issue on defilantech/llmkube.
