TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max
Yesterday's M5 Max KV cache post drew a consistent set of asks in the comments: perplexity numbers, KL divergence, asymmetric K/V combos, and a 64K row to fill the 32K-to-128K gap. I ran them overnight on the same hardware. Numbers below.
TL;DR
- q8_0 KV cache is essentially free at 4K context. PPL delta vs f16 is −0.0005 (well inside the ±0.0355 stderr). KL is 0.0016. Top-1 token agreement is 98.64%.
- turbo3 and turbo4 cost real but small quality. turbo3: ~1% PPL increase, 5 pp top-token disagreement, KL roughly 12× q8_0's. turbo4 sits between, in line with its lower compression ratio.
- -ctk q8_0 -ctv turbo4 is the new winner for long context. Matches symmetric q8_0 throughput at every depth tested and fits 512K, where symmetric q8_0 OOM'd. q8_0-grade prefill, turbo4-grade memory ceiling.
- -ctk f16 -ctv turbo4 is broken on this fork on Metal. The Metal FlashAttention kernel doesn't fast-path that K/V combination, so it falls back to a generic dequant-then-attention path. 34× slower at 8K, 78× slower at 128K. Don't use it.
- The 64K row shows the prefill curves nearly converged. turbo3 at 470 tok/s sits within 2% of q8_0 at 479 tok/s. The bandwidth-bound regime kicks in somewhere between 64K and 128K on this hardware, earlier than the 128K crossover from yesterday's post had me estimating.
Quality eval: perplexity and KL divergence
The original post had no quality numbers. The first comment under it flagged that gap and asked for perplexity, and a follow-on comment added KL divergence to the list. Both are now in the bag.
Setup: llama-perplexity from TheTom's TurboQuant fork build, wikitext-2-raw test set, context size 4096. The canonical 512 doesn't fill enough KV cache to surface cache-quantization effects, so I bumped it to 4096 to let the cache actually fill. The f16 run saves a baseline logits file via --kl-divergence-base. Each subsequent run computes KL against that baseline, which means the comparisons are pinned to the exact same model weights and tokenization.
| Cache type | PPL | KL vs f16 | Top-1 token agreement |
|---|---|---|---|
| f16 | 5.7438 ± 0.0355 | baseline | n/a |
| q8_0 | 5.7433 ± 0.0355 | 0.0016 ± 0.0001 | 98.64% ± 0.03 |
| turbo3 (~4.9×) | 5.8092 ± 0.0360 | 0.0199 ± 0.0002 | 93.93% ± 0.06 |
| turbo4 (~3.8×) | 5.7810 ± 0.0359 | 0.0131 ± 0.0003 | 95.28% ± 0.06 |
q8_0 KV is essentially free at this depth. PPL delta is −0.0005, which is noise inside the ±0.0355 stderr. KL is 0.0016, roughly a twelfth of the turbo3 number. The quantized cache picks the same top-1 token as f16 98.64% of the time. The community worry about q8_0 corroding output quality doesn't bear out here.
turbo3 costs measurable but small quality. ~1% perplexity increase, 5 percentage points of top-token disagreement, KL roughly 12× q8_0's. turbo4 sits between turbo3 and q8_0 on every metric, matching its lower compression ratio. Quality cost scales monotonically with compression, no surprises in the ranking.
One caveat I'd want to underline: PPL was at 4096 context. Quality at deeper contexts, where the cache is more saturated and dequant errors compound across more attention steps, might tell a different story. That's a bench for a future weekend.
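For concreteness, here's a minimal sketch of how the two comparison metrics in the table are computed from per-token logits. This is a toy stdlib-only illustration of the math, not the fork's implementation; the toy logit values are made up:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(p || q): divergence of the quantized-cache distribution q
    # from the f16 baseline distribution p at one token position.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def eval_metrics(base_logits, quant_logits):
    # Mean KL and top-1 agreement across token positions, in the
    # spirit of what llama-perplexity reports with --kl-divergence.
    kls, agree = [], 0
    for bl, ql in zip(base_logits, quant_logits):
        p, q = softmax(bl), softmax(ql)
        kls.append(kl_divergence(p, q))
        agree += (p.index(max(p)) == q.index(max(q)))
    n = len(base_logits)
    return sum(kls) / n, agree / n

# Toy example: two token positions, vocabulary of 3.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2]]
quant = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.3]]
mean_kl, top1 = eval_metrics(base, quant)
```

The key property to notice: KL is per-position and averaged, so a cache type can have near-zero mean KL while still occasionally flipping the top-1 token, which is why the table reports both.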
Asymmetric K/V: which combos work, which don't
One commenter on the original post pointed out that the big issue asymmetric KV tackles is exactly the K-precision problem: compressing the keys hurts quality a great deal more than compressing the values. The original post called this out in its caveats too but didn't bench it. Now we have data.
Three combinations, same llama-bench flags as yesterday's symmetric sweep. Decode tok/s (token generation):
| Depth | q8_0 K / turbo4 V | q8_0 K / turbo3 V | f16 K / turbo4 V |
|---|---|---|---|
| 0 | 82.9 | 81.8 | 72.8 |
| 8K | 75.4 | 75.6 | 16.9 |
| 32K | 66.0 | 63.2 | 8.6 |
| 128K | 41.0 | 38.2 | 2.8 |
| 256K | 27.1 | 25.0 | skipped |
| 512K | 16.5 | 14.8 | skipped |
Prompt processing tells a similar story (skipping the full table for length, the relative ordering matches): q8_0/turbo4 lands within 1-2% of symmetric q8_0 prefill at every shared depth, and q8_0/turbo3 is similarly close.
-ctk q8_0 -ctv turbo4: the new long-context winner
This is the standout combination. At 256K context it puts up 27.1 tok/s decode against yesterday's symmetric q8_0 baseline of 26.6 tok/s. Prefill at 256K hits 128 tok/s versus symmetric q8_0's 124. The throughput is statistically indistinguishable from symmetric q8_0 at every depth they share.
And it fits 512K, where symmetric q8_0 OOM'd in yesterday's post. Decode at 512K is 16.5 tok/s, almost identical to symmetric turbo4 at 16.0. So the asymmetric configuration gets you q8_0-level prefill behavior with turbo4-level context ceiling, on a single MacBook Pro.
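Why the asymmetric combo clears 512K is simple arithmetic. Using the approximate compression ratios quoted in the quality table (q8_0 at ~2× vs f16, turbo4 at ~3.8×, turbo3 at ~4.9× — treating these as exact is an assumption, and I'm assuming K and V take equal halves of the f16 cache):

```python
def combined_ratio(k_ratio, v_ratio):
    # K and V each take half the f16 cache bytes; compress each side
    # independently and report the overall compression vs f16.
    f16_bytes = 2.0  # 1 unit of K + 1 unit of V
    mixed_bytes = 1.0 / k_ratio + 1.0 / v_ratio
    return f16_bytes / mixed_bytes

# q8_0 K (~2x) with turbo4 V (~3.8x): roughly 2.6x overall,
# sitting between symmetric q8_0 (2x) and symmetric turbo4 (3.8x).
q8_turbo4 = combined_ratio(2.0, 3.8)
q8_turbo3 = combined_ratio(2.0, 4.9)
```

At ~2.6× vs f16, the mixed cache holds roughly 30% more context per byte than symmetric q8_0's 2×, which is consistent with this combo fitting 512K where symmetric q8_0 didn't.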
The hypothesis that V compresses cheap and K compresses expensive looks right on the throughput side. Quality side I'd want a PPL run on the asymmetric combos to fully close the loop, since I haven't measured KL or PPL with mixed K/V types yet.
-ctk q8_0 -ctv turbo3: similar trick, worse decode
Same prefill behavior as the q8_0/turbo4 combo (within 1-2% at every depth) but decode is consistently lower. Tighter V quantization taxes the per-token attention pass more, since decode is bottlenecked by dequantization work rather than total bytes read. If you have memory headroom, q8_0/turbo4 dominates.
-ctk f16 -ctv turbo4: kernel fallback, do not use
Putting f16 on K and turbo4 on V breaks the Metal FlashAttention kernel's fast path. The fork falls back to a generic dequant-then-attention implementation that's catastrophically slow:
| Depth | Symmetric f16 pp512 | f16 K / turbo4 V pp512 | Slowdown |
|---|---|---|---|
| 8K | 2098 | 61.0 | 34× |
| 32K | 1063 | 16.4 | 65× |
| 128K | 321 | 4.1 | 78× |
I cut the run before 256K once the trajectory was clear. The slowdown widens with depth, consistent with the fallback path paying dequantization cost on every cache access, so the penalty grows with context length. Don't use this combination on this fork on Metal until kernel coverage lands. If you're on a different backend such as CUDA, verify the same combo before assuming it works.
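The widening trend can be read straight off the table; a quick sketch, with the pp512 numbers copied from the rows above:

```python
# pp512 tok/s from the table: symmetric f16 vs f16 K / turbo4 V.
depths = ["8K", "32K", "128K"]
fast = [2098.0, 1063.0, 321.0]
fallback = [61.0, 16.4, 4.1]

# Slowdown ratio at each depth; it grows monotonically,
# which is the signature of a per-access fallback cost.
slowdowns = [f / s for f, s in zip(fast, fallback)]
```

Rounding gives the 34×, 65×, 78× column in the table.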
The 64K row: filling the gap
One commenter asked for a 64K data point sitting between 32K and 128K, particularly on the prefill side. Reasonable ask: yesterday's prefill curves dropped 3–4× between those two depths, so 64K is exactly the depth where the bandwidth-bound regime is supposed to kick in.
All seven configurations at depth 65536:
| Cache | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|
| f16 (symmetric) | 602.0 | 59.8 |
| q8_0 (symmetric) | 479.2 | 57.9 |
| turbo3 (symmetric) | 469.8 | 49.9 |
| turbo4 (symmetric) | 418.0 | 55.2 |
| q8_0 K / turbo4 V | 468.2 | 55.9 |
| q8_0 K / turbo3 V | 465.6 | 52.6 |
| f16 K / turbo4 V | 8.3 | 4.9 |
Two things stood out. First, the prefill curves are nearly converged at 64K already. turbo3 at 470 tok/s is within 2% of q8_0 at 479 tok/s. Yesterday's data showed turbo3 actually pulling ahead of q8_0 by 128K (253 vs 245), so the bandwidth-bound regime kicks in somewhere in the 64K to 128K range on this hardware. Earlier than I'd estimated when I wrote the original post.
Second, the asymmetric q8_0/turbo* rows track symmetric q8_0 prefill closely at this depth, same as they do at the deeper depths. Same story all the way down the curve: as long as K stays at q8_0, V-side compression is essentially free on prefill.
Updated cache-type recommendations
Same shape as yesterday's recommendations, with the asymmetric data folded in:
| Workload | Cache type | Why |
|---|---|---|
| Coding agents (deep context, lots of generated tokens) | -ctk q8_0 -ctv turbo4 | q8_0-grade quality on K, turbo4 memory savings on V, fits 512K, decode 27 tok/s at 256K |
| RAG or batch QA (heavy prefill, short answers) | -ctk q8_0 -ctv turbo4 or symmetric turbo3 at the deepest depths | Prefill is bandwidth-bound past ~64K, both options work |
| Pure 1M context maxing | Symmetric turbo3 | Only thing that fits 1M on a 128 GB Mac |
| Short interactive (under 32K) | f16 if memory allows, else q8_0 | Quality cost is genuinely zero, throughput is best |
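If you want the recommendation table in scriptable form, here's an illustrative helper. The thresholds encode this post's M5 Max results, not a general rule, and the function name is mine:

```python
def pick_cache_flags(context_tokens, memory_tight):
    # Encodes the recommendation table above for this fork on M5 Max.
    if context_tokens <= 32_768:
        # Short interactive: f16's quality cost is zero; drop to q8_0
        # only if memory is the constraint.
        return "-ctk q8_0 -ctv q8_0" if memory_tight else "-ctk f16 -ctv f16"
    if context_tokens > 524_288:
        # Only symmetric turbo3 fits 1M on a 128 GB Mac.
        return "-ctk turbo3 -ctv turbo3"
    # Deep-context default: q8_0-grade K quality, turbo4-grade V memory.
    return "-ctk q8_0 -ctv turbo4"
```

For example, a 256K coding-agent session maps to `-ctk q8_0 -ctv turbo4`, matching the first row of the table.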
The asymmetric combos are expressible directly in LLMKube's InferenceService spec via the cacheTypeCustomK and cacheTypeCustomV fields that landed in 0.7.3. So if you're running this through the operator, the spec for the new long-context winner is:
```yaml
spec:
  runtime: llamacpp
  cacheTypeCustomK: q8_0
  cacheTypeCustomV: turbo4
  contextSize: 524288
```

Caveats
- Perplexity was measured at 4096 context. Quality at deeper contexts might tell a different story, since the cache fills more and dequant errors have more attention steps to compound through.
- Asymmetric quality numbers (PPL or KL on the q8_0/turbo* combos) are not yet measured. The throughput data argues V-side compression is cheap, but I haven't verified the quality side end-to-end.
- -ctk f16 -ctv turbo* is a kernel fallback on this fork on Metal. Verify before assuming the same combination works on other backends; CUDA may have different kernel coverage.
- Single hardware data point (M5 Max, 128 GB). The crossover depths and the prefill/decode split likely shift with memory bandwidth and GPU core count.
What's still in flight
Three asks from the original thread that this followup didn't fully address. Running them next:
- Aider Polyglot pass for f16, turbo3, turbo4. A commenter asked whether the fast cache types still produce useful code, not just fast tokens. q8_0 scored 62.2% on Polyglot earlier this week (n=225). Each Polyglot run is roughly 6 to 12 hours wall-clock on M5 Max, so this is a few serial overnight runs, landing later this week.
- Wider quant types: q4_0, q4_1, iq4_nl, q5_0, q5_1. Another commenter asked for these to extend the depth sweep with more cache options. After the Aider runs.
- Same sweep on a non-MoE non-DeltaNet model. A third commenter asked whether these results transfer to other architectures. Qwen 3.6 uses DeltaNet hybrid attention, which already shrinks the per-token KV footprint. On a dense GQA model where cache is the dominant bottleneck the splits should be larger, not smaller. After the wider quant types.
Methodology
Hardware: MacBook Pro M5 Max, 128 GB unified memory. Build: TheTom's llama-cpp-turboquant fork, branch feature/turboquant-kv-cache, built with cmake -B build -DGGML_METAL=ON. Model: Qwen3.6-35B-A3B Q8_0 GGUF.
Quality bench: llama-perplexity on wikitext-2-raw test set, -c 4096, full corpus (~60 chunks). f16 baseline saved via --kl-divergence-base; each quant run loaded the same baseline file via --kl-divergence for KL computation against pinned logits. Same model, same tokenization, only the KV cache type varies.
Throughput bench: llama-bench, -p 512 -n 128 -ngl 99 -fa 1 --threads 6 --batch-size 2048 -r 3, depth sweep via -d. Same flags as yesterday's symmetric sweep so rows are directly comparable. Metal-agent stopped during the run for clean memory budget. Total wall-clock for the asymmetric sweep was about 8.5 hours; the 64K supplement added another 80 minutes.
If you want the raw llama-bench tables or the per-cell stderr from llama-perplexity, reach out and I'll send them. If you have non-M5-Max Apple Silicon and want to run a slice of this matrix on your hardware, drop me a line — second data point would help characterize how the crossover shifts with memory bandwidth.