TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max
Yesterday's M5 Max KV cache post drew a consistent set of asks in the comments: perplexity numbers, KL divergence, asymmetric K/V combos, and a 64K row to fill the 32K-to-128K gap. I ran them overnight on the same hardware. Numbers below.
TL;DR
- q8_0 KV cache is essentially free at 4K context. PPL delta vs f16 is −0.0005 (well inside the ±0.0355 stderr). KL is 0.0016. Top-1 token agreement is 98.64%.
- turbo3 and turbo4 cost real but small quality. turbo3: ~1% PPL increase, 5 pp top-token disagreement, KL roughly 12× q8_0's. turbo4 sits between, in line with its lower compression ratio.
- -ctk q8_0 -ctv turbo4 is the new winner for long context. Matches symmetric q8_0 throughput at every depth tested and fits 512K, where symmetric q8_0 OOM'd. q8_0-grade prefill, turbo4-grade memory ceiling.
- -ctk f16 -ctv turbo4 is broken on this fork on Metal. The Metal FlashAttention kernel doesn't fast-path that K/V combination, so it falls back to a generic dequant-then-attention path. 34× slower at 8K, 78× slower at 128K. Don't use it.
- The 64K row shows the prefill curves nearly converged. turbo3 at 470 tok/s sits within 2% of q8_0 at 479 tok/s. The bandwidth-bound regime kicks in somewhere between 64K and 128K on this hardware, earlier than the 128K crossover from yesterday's post had me estimating.
Quality eval: perplexity and KL divergence
The original post had no quality numbers. The first comment under it flagged that gap and asked for perplexity, and a follow-on comment added KL divergence to the list. Both are now in the bag.
Setup: llama-perplexity from TheTom's TurboQuant fork build, wikitext-2-raw test set, context size 4096. The canonical 512 doesn't fill enough KV cache to surface cache-quantization effects, so I bumped it to 4096 to let the cache actually fill. The f16 run saves a baseline logits file via --kl-divergence-base. Each subsequent run computes KL against that baseline, which means the comparisons are pinned to the exact same model weights and tokenization.
| Cache type | PPL | KL vs f16 | Top-1 token agreement |
|---|---|---|---|
| f16 | 5.7438 ± 0.0355 | baseline | n/a |
| q8_0 | 5.7433 ± 0.0355 | 0.0016 ± 0.0001 | 98.64% ± 0.03 |
| turbo3 (~4.9×) | 5.8092 ± 0.0360 | 0.0199 ± 0.0002 | 93.93% ± 0.06 |
| turbo4 (~3.8×) | 5.7810 ± 0.0359 | 0.0131 ± 0.0003 | 95.28% ± 0.06 |
q8_0 KV is essentially free at this depth. PPL delta is −0.0005, which is noise inside the ±0.0355 stderr. KL is 0.0016, roughly a twelfth of the turbo3 number. The quantized cache picks the same top-1 token as f16 98.64% of the time. The community worry about q8_0 corroding output quality doesn't bear out here.
turbo3 costs measurable but small quality. ~1% perplexity increase, 5 percentage points of top-token disagreement, KL roughly 12× q8_0's. turbo4 sits between turbo3 and q8_0 on every metric, matching its lower compression ratio. Quality cost scales monotonically with compression, no surprises in the ranking.
One caveat I'd want to underline: PPL was at 4096 context. Quality at deeper contexts, where the cache is more saturated and dequant errors compound across more attention steps, might tell a different story. That's a bench for a future weekend.
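For concreteness, here's a minimal sketch of how the two comparison metrics in the table are computed from per-token logits. This is a toy stdlib-only illustration of the math, not the fork's implementation; the toy logit values are made up:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(p || q): divergence of the quantized-cache distribution q
    # from the f16 baseline distribution p at one token position.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def eval_metrics(base_logits, quant_logits):
    # Mean KL and top-1 agreement across token positions, in the
    # spirit of what llama-perplexity reports with --kl-divergence.
    kls, agree = [], 0
    for bl, ql in zip(base_logits, quant_logits):
        p, q = softmax(bl), softmax(ql)
        kls.append(kl_divergence(p, q))
        agree += (p.index(max(p)) == q.index(max(q)))
    n = len(base_logits)
    return sum(kls) / n, agree / n

# Toy example: two token positions, vocabulary of 3.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2]]
quant = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.3]]
mean_kl, top1 = eval_metrics(base, quant)
```

The key property to notice: KL is per-position and averaged, so a cache type can have near-zero mean KL while still occasionally flipping the top-1 token, which is why the table reports both.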
Asymmetric K/V: which combos work, which don't
One commenter on the original post pointed out that the big issue asymmetric KV tackles is exactly the K-precision problem: compressing the keys hurts quality a great deal more than compressing the values. The original post called this out in its caveats too but didn't bench it. Now we have data.
Three combinations, same llama-bench flags as yesterday's symmetric sweep. Decode tok/s (token generation):
| Depth | q8_0 K / turbo4 V | q8_0 K / turbo3 V | f16 K / turbo4 V |
|---|---|---|---|
| 0 | 82.9 | 81.8 | 72.8 |
| 8K | 75.4 | 75.6 | 16.9 |
| 32K | 66.0 | 63.2 | 8.6 |
| 128K | 41.0 | 38.2 | 2.8 |
| 256K | 27.1 | 25.0 | skipped |
| 512K | 16.5 | 14.8 | skipped |
Prompt processing tells a similar story (skipping the full table for length, the relative ordering matches): q8_0/turbo4 lands within 1-2% of symmetric q8_0 prefill at every shared depth, and q8_0/turbo3 is similarly close.
-ctk q8_0 -ctv turbo4: the new long-context winner
This is the standout combination. At 256K context it puts up 27.1 tok/s decode against yesterday's symmetric q8_0 baseline of 26.6 tok/s. Prefill at 256K hits 128 tok/s versus symmetric q8_0's 124. The throughput is statistically indistinguishable from symmetric q8_0 at every depth they share.
And it fits 512K, where symmetric q8_0 OOM'd in yesterday's post. Decode at 512K is 16.5 tok/s, almost identical to symmetric turbo4 at 16.0. So the asymmetric configuration gets you q8_0-level prefill behavior with turbo4-level context ceiling, on a single MacBook Pro.
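Why the asymmetric combo clears 512K is simple arithmetic. Using the approximate compression ratios quoted in the quality table (q8_0 at ~2× vs f16, turbo4 at ~3.8×, turbo3 at ~4.9× — treating these as exact is an assumption, and I'm assuming K and V take equal halves of the f16 cache):

```python
def combined_ratio(k_ratio, v_ratio):
    # K and V each take half the f16 cache bytes; compress each side
    # independently and report the overall compression vs f16.
    f16_bytes = 2.0  # 1 unit of K + 1 unit of V
    mixed_bytes = 1.0 / k_ratio + 1.0 / v_ratio
    return f16_bytes / mixed_bytes

# q8_0 K (~2x) with turbo4 V (~3.8x): roughly 2.6x overall,
# sitting between symmetric q8_0 (2x) and symmetric turbo4 (3.8x).
q8_turbo4 = combined_ratio(2.0, 3.8)
q8_turbo3 = combined_ratio(2.0, 4.9)
```

At ~2.6× vs f16, the mixed cache holds roughly 30% more context per byte than symmetric q8_0's 2×, which is consistent with this combo fitting 512K where symmetric q8_0 didn't.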
The hypothesis that V compresses cheap and K compresses expensive looks right on the throughput side. Quality side I'd want a PPL run on the asymmetric combos to fully close the loop, since I haven't measured KL or PPL with mixed K/V types yet.
-ctk q8_0 -ctv turbo3: similar trick, worse decode
Same prefill behavior as the q8_0/turbo4 combo (within 1-2% at every depth) but decode is consistently lower. Tighter V quantization taxes the per-token attention pass more, since decode is bottlenecked by dequantization work rather than total bytes read. If you have memory headroom, q8_0/turbo4 dominates.
-ctk f16 -ctv turbo4: kernel fallback, do not use
Putting f16 on K and turbo4 on V breaks the Metal FlashAttention kernel's fast path. The fork falls back to a generic dequant-then-attention implementation that's catastrophically slow:
| Depth | Symmetric f16 pp512 | f16 K / turbo4 V pp512 | Slowdown |
|---|---|---|---|
| 8K | 2098 | 61.0 | 34× |
| 32K | 1063 | 16.4 | 65× |
| 128K | 321 | 4.1 | 78× |
I cut the run before 256K once the trajectory was clear. The slowdown widens with depth, consistent with the fallback path paying dequantization cost on every cache access, so the penalty grows with context length. Don't use this combination on this fork on Metal until kernel coverage lands. If you're on a different backend such as CUDA, verify the same combo before assuming it works.
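The widening trend can be read straight off the table; a quick sketch, with the pp512 numbers copied from the rows above:

```python
# pp512 tok/s from the table: symmetric f16 vs f16 K / turbo4 V.
depths = ["8K", "32K", "128K"]
fast = [2098.0, 1063.0, 321.0]
fallback = [61.0, 16.4, 4.1]

# Slowdown ratio at each depth; it grows monotonically,
# which is the signature of a per-access fallback cost.
slowdowns = [f / s for f, s in zip(fast, fallback)]
```

Rounding gives the 34×, 65×, 78× column in the table.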
The 64K row: filling the gap
One commenter asked for a 64K data point sitting between 32K and 128K, particularly on the prefill side. Reasonable ask: yesterday's prefill curves dropped 3–4× between those two depths, so 64K is exactly the depth where the bandwidth-bound regime is supposed to kick in.
All seven configurations at depth 65536:
| Cache | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|
| f16 (symmetric) | 602.0 | 59.8 |
| q8_0 (symmetric) | 479.2 | 57.9 |
| turbo3 (symmetric) | 469.8 | 49.9 |
| turbo4 (symmetric) | 418.0 | 55.2 |
| q8_0 K / turbo4 V | 468.2 | 55.9 |
| q8_0 K / turbo3 V | 465.6 | 52.6 |
| f16 K / turbo4 V | 8.3 | 4.9 |
Two things stood out. First, the prefill curves are nearly converged at 64K already. turbo3 at 470 tok/s is within 2% of q8_0 at 479 tok/s. Yesterday's data showed turbo3 actually pulling ahead of q8_0 by 128K (253 vs 245), so the bandwidth-bound regime kicks in somewhere in the 64K to 128K range on this hardware. Earlier than I'd estimated when I wrote the original post.
Second, the asymmetric q8_0/turbo* rows track symmetric q8_0 prefill closely at this depth, same as they do at the deeper depths. Same story all the way down the curve: as long as K stays at q8_0, V-side compression is essentially free on prefill.
Updated cache-type recommendations
Same shape as yesterday's recommendations, with the asymmetric data folded in:
| Workload | Cache type | Why |
|---|---|---|
| Coding agents (deep context, lots of generated tokens) | -ctk q8_0 -ctv turbo4 | q8_0-grade quality on K, turbo4 memory savings on V, fits 512K, decode 27 tok/s at 256K |
| RAG or batch QA (heavy prefill, short answers) | -ctk q8_0 -ctv turbo4 or symmetric turbo3 at the deepest depths | Prefill is bandwidth-bound past ~64K, both options work |
| Pure 1M context maxing | Symmetric turbo3 | Only thing that fits 1M on a 128 GB Mac |
| Short interactive (under 32K) | f16 if memory allows, else q8_0 | Quality cost is genuinely zero, throughput is best |
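If you want the recommendation table in scriptable form, here's an illustrative helper. The thresholds encode this post's M5 Max results, not a general rule, and the function name is mine:

```python
def pick_cache_flags(context_tokens, memory_tight):
    # Encodes the recommendation table above for this fork on M5 Max.
    if context_tokens <= 32_768:
        # Short interactive: f16's quality cost is zero; drop to q8_0
        # only if memory is the constraint.
        return "-ctk q8_0 -ctv q8_0" if memory_tight else "-ctk f16 -ctv f16"
    if context_tokens > 524_288:
        # Only symmetric turbo3 fits 1M on a 128 GB Mac.
        return "-ctk turbo3 -ctv turbo3"
    # Deep-context default: q8_0-grade K quality, turbo4-grade V memory.
    return "-ctk q8_0 -ctv turbo4"
```

For example, a 256K coding-agent session maps to `-ctk q8_0 -ctv turbo4`, matching the first row of the table.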
The asymmetric combos are expressible directly in LLMKube's InferenceService spec via the cacheTypeCustomK and cacheTypeCustomV fields that landed in 0.7.3. So if you're running this through the operator, the spec for the new long-context winner is:
```yaml
spec:
  runtime: llamacpp
  cacheTypeCustomK: q8_0
  cacheTypeCustomV: turbo4
  contextSize: 524288
```

Caveats
- Perplexity was measured at 4096 context. Quality at deeper contexts might tell a different story, since the cache fills more and dequant errors have more attention steps to compound through.
- Asymmetric quality numbers (PPL or KL on the q8_0/turbo* combos) are not yet measured. The throughput data argues V-side compression is cheap, but I haven't verified the quality side end-to-end.
- -ctk f16 -ctv turbo* is a kernel fallback on this fork on Metal. Verify before assuming the same combination works on other backends; CUDA may have different kernel coverage.
- Single hardware data point (M5 Max, 128 GB). The crossover depths and the prefill/decode split likely shift with memory bandwidth and GPU core count.
What's still in flight
Three asks from the original thread that this followup didn't fully address. Running them next:
- Aider Polyglot pass for f16, turbo3, turbo4. A commenter asked whether the fast cache types still produce useful code, not just fast tokens. q8_0 scored 62.2% on Polyglot earlier this week (n=225). Each Polyglot run is roughly 6 to 12 hours wall-clock on M5 Max, so this is a few serial overnight runs, landing later this week.
- Wider quant types: q4_0, q4_1, iq4_nl, q5_0, q5_1. Another commenter asked for these to extend the depth sweep with more cache options. After the Aider runs.
- Same sweep on a non-MoE non-DeltaNet model. A third commenter asked whether these results transfer to other architectures. Qwen 3.6 uses DeltaNet hybrid attention, which already shrinks the per-token KV footprint. On a dense GQA model where cache is the dominant bottleneck the splits should be larger, not smaller. After the wider quant types.
Methodology
Hardware: MacBook Pro M5 Max, 128 GB unified memory. Build: TheTom's llama-cpp-turboquant fork, branch feature/turboquant-kv-cache, built with cmake -B build -DGGML_METAL=ON. Model: Qwen3.6-35B-A3B Q8_0 GGUF.
Quality bench: llama-perplexity on wikitext-2-raw test set, -c 4096, full corpus (~60 chunks). f16 baseline saved via --kl-divergence-base; each quant run loaded the same baseline file via --kl-divergence for KL computation against pinned logits. Same model, same tokenization, only the KV cache type varies.
Throughput bench: llama-bench, -p 512 -n 128 -ngl 99 -fa 1 --threads 6 --batch-size 2048 -r 3, depth sweep via -d. Same flags as yesterday's symmetric sweep so rows are directly comparable. Metal-agent stopped during the run for clean memory budget. Total wall-clock for the asymmetric sweep was about 8.5 hours; the 64K supplement added another 80 minutes.
If you want the raw llama-bench tables or the per-cell stderr from llama-perplexity, reach out and I'll send them. If you have non-M5-Max Apple Silicon and want to run a slice of this matrix on your hardware, drop me a line — second data point would help characterize how the crossover shifts with memory bandwidth.