Back to Shadowstack: a 35B at 256K context (and 512K with YaRN) on two consumer Blackwell cards

We've spent the last couple of months mostly on Apple Silicon. The TurboQuant KV-cache work, the mlx-server runtime, the long-context runs on an M5 Max: all of it has lived on a Mac. This weekend I wanted to turn back to Shadowstack, our two consumer RTX 5060 Ti, and see what thresholds we could push. The part that got me excited: a few months ago, on these exact two cards, a model like this at this context simply would not fit.

Where we've been

Back in December I ran a 32B stress test on Shadowstack: three 32B models, dual RTX 5060 Ti, proving you can serve production-scale weights on hardware under $1,200. That test was about fitting the weights. Since then most of the KV-cache experimentation moved to the M5 Max, where I posted a long-context TurboQuant run and a follow-up with perplexity and KL divergence after the r/LocalLLaMA comments asked for them.

This weekend's question was different. Not "can these cards hold a 35B," but "how much context can we put in front of one." The bottleneck for long context isn't the weights, it's the KV cache, and that is exactly what TurboQuant compresses. The catch: all of that compression had only ever run on Metal. None of it had touched consumer Blackwell.

First threshold: getting the kernels to run on Blackwell

The TurboQuant work lives in TheTom's llama.cpp fork (the feature/turboquant-kv-cache branch). It ships CUDA kernels for the turbo cache types, but nobody had built them for sm_120, the consumer Blackwell target the RTX 50-series uses. So step one was just compiling the fork for sm_120 (CUDA 12.8 is the floor for that arch) and confirming the turbo2, turbo3, and turbo4 flash-attention kernels actually launch on a 5060 Ti. They do. That alone is the thing that wasn't true a few months ago.

The setup

2x NVIDIA RTX 5060 Ti, 16 GB each, 32 GB total (consumer Blackwell, compute capability 12.0)
Qwen3.6-35B-A3B, Q4_K_M GGUF (19.70 GiB, 34.66B params), layer-split across both cards
TheTom's TurboQuant llama.cpp fork, built for CUDA sm_120
llama-server / llama-bench / llama-perplexity for serving, throughput, and quality

Q4_K_M is the quant that fits: at ~21 GB of weights, layer-split across two 16 GB cards leaves real room for KV cache. Everything below is on those Q4_K_M weights, so the quality numbers reflect the config you would actually serve, not a fatter quant run for show. (My M5 Max quality post used Q8_0 weights, so the absolute perplexity differs; the KV-cache deltas are the comparable part.)

The threshold: how much context fits

Qwen3.6-35B-A3B trains to a 262,144-token (256K) context. The question is whether you can actually hold that much KV cache on 32 GB after the weights. Same model, same hardware, only the KV cache type changes:

KV cache (K / V)	Max context on 32 GB	VRAM at max
f16 / f16	~192K (out of memory at 256K)	28.2 GB @ 192K
q8_0 K / turbo4 V	full 256K native	27.3 GB
q8_0 K / turbo2 V	full 256K native	26.9 GB
q8_0 K / turbo2 V + YaRN 2x	512K (2x beyond native)	25.5 GB

This is the headline. With standard f16 KV, the 35B runs out of memory before it reaches its own native context: it caps somewhere around 192K. With turbo KV, the same model on the same two cards holds the full 256K with about 5 GB to spare, and pushing past native with YaRN reaches 512K. That 512K run needed llama.cpp to fall back from pipeline parallelism to fit the compute buffer, but it loads and serves.

One honest note on this model specifically: the absolute VRAM savings between the turbo4 and turbo2 value caches are small (about 0.4 GB at 256K). Part of that is that the auto-asymmetric guard pins the keys at q8_0 in both, so only the value half is actually shrinking; part is that Qwen3.6's DeltaNet hybrid attention already keeps the standard KV path small. The footprint here is weight-dominated. On a dense model with full GQA, where the cache is the real bottleneck, the gap would be larger. For this model the win is "reaches and exceeds native context," not a giant memory multiplier.

The numbers the community asked for: perplexity and KL divergence

When I posted the M5 Max KV runs, the first two comments asked for perplexity and KL divergence, and I said I'd add them. Here they are again, this time on Blackwell. Setup: llama-perplexity from the TurboQuant fork, wikitext-2-raw test set, context 8192, ~131K tokens evaluated. The f16 run saves a baseline logits file via --kl-divergence-base; every other run computes KL against it, so the comparisons are pinned to the same weights and tokenization.

KV cache	PPL vs f16	Mean KL div	Top-1 token agreement
f16 / f16	5.6258 ± 0.0516 (baseline)	baseline	n/a
q8_0 / q8_0	+0.09%	0.0067 ± 0.0004	96.81% ± 0.07
q8_0 K / turbo4 V	+0.53%	0.0117 ± 0.0005	95.46% ± 0.08
q8_0 K / turbo3 V	+0.75%	0.0148 ± 0.0004	94.74% ± 0.09

The quality cost is small and cleanly graded. A turbo4 value cache costs about half a percent of perplexity and keeps the same top-1 token as f16 about 95% of the time; turbo3 is a touch more. The ranking is monotonic with value-cache compression, no surprises. q8_0 is nearly free. These are the prices you pay to fit the context above.

One correction to the first version of this post: I originally labeled those last two rows turbo4 and turbo3 as if both keys and values were compressed. They were not. The TurboQuant fork ships an auto-asymmetric guard, and on this model it fired: auto-asymmetric: GQA ratio 8:1, upgrading K from turbo3 to q8_0. So when I asked for symmetric turbo, the library quietly kept the keys at q8_0 and compressed only the values. That is TheTom's asymmetric KV-compression result built into the runtime: keys feed the attention softmax, where small errors get amplified exponentially, so the library protects them by default. The numbers above are real and reproducible. They are exactly what you get when you set those flags. They just describe a q8_0 key cache, not a compressed one.

Keys, values, and what happens when you turn the guard off

That auto-asymmetric guard raised an obvious question: what does this model actually do if you let the keys get compressed? The conventional result, and TheTom's paper, is that keys are the fragile half. On a dense model with full softmax attention, a small key error rides through the exponential and can reroute attention entirely; Qwen2.5-7B at symmetric turbo3 degrades to a perplexity in the thousands. So I disabled the guard (TURBO_AUTO_ASYMMETRIC=0) and ran the real key types, holding one side fixed to isolate the other. Same wikitext-2 setup, same f16 baseline (PPL 5.6258).

KV cache (K / V)	PPL vs f16	Mean KL div	Top-1 token agreement
q8_0 K / turbo3 V	+0.75%	0.0148	94.74%
turbo3 K / q8_0 V	+0.43%	0.0134	95.29%
turbo3 K / turbo3 V	+1.29%	0.0212	93.82%
turbo2 K / turbo2 V	+2.51%	0.0410	91.54%

Two things stand out, and both point the opposite way from the dense-model intuition. First, there is no catastrophe. Real symmetric turbo3, with the keys genuinely at 3 bits, costs +1.3% perplexity and holds the top token 93.8% of the time. Even turbo2 on both sides, roughly 2 bits each, is only +2.5%. Where Qwen2.5 fell off a cliff, this model walks down a gentle ramp.

Second, the small asymmetry that does exist leans toward protecting the values, not the keys. At an equal bit budget, turbo3 K / q8_0 V (KL 0.0134, 95.29% top token) beats q8_0 K / turbo3 V (KL 0.0148, 94.74%). Compressing the values hurt slightly more than compressing the keys. That is the inverse of the dense-model rule, and the likely reason is mechanistic: most of Qwen3.6's layers are DeltaNet linear attention, which has no softmax over keys, so the exponential amplification that makes keys fragile elsewhere mostly is not present here. No softmax, no key amplification.

I want to be careful not to over-read this. It is one model, on wikitext-2, at 8K context over sixteen chunks, measured by perplexity and KL divergence rather than a downstream task. The effect is real (it is several times the error bars) but small, and the honest headline is "this architecture stays healthy under aggressive symmetric compression," not "values dominate." It does suggest the GQA-ratio heuristic that triggers the auto-upgrade may be conservative for hybrid and linear-attention models, where the keys are not the liability the softmax case assumes. That is a fun thread to pull on next.

Throughput, honestly

Here is where I have to correct myself. My first pass used quick single-shot timings at low context and I told myself decode was basically free across cache types. That was an artifact. Run it properly with llama-bench (warmup plus three repeats) and measure decode at depth, and a real tradeoff shows up:

KV cache (K / V)	decode tok/s @ 0	decode tok/s @ 64K
f16 / f16	117.0	84.2
q8_0 / q8_0	115.0	42.1
turbo4 / turbo4	114.4	51.6
turbo3 / turbo3	113.9	53.2
f16 K / turbo4 V	115.2	55.0
q8_0 K / turbo4 V	114.4	51.6

At no context everything decodes around 115 tok/s. At 64K of depth the quantized caches pay a dequantization tax on every token: f16 holds 84 tok/s while turbo drops to about 52. So turbo KV is not free: it trades roughly 35-40% of decode-at-depth to fit context that f16 can't fit at all on this hardware. Prefill, by contrast, is nearly identical across cache types (about 2,450 tok/s at no context, 1,500 at 64K); it's compute-bound, not cache-read-bound.

Two things worth flagging. q8_0 is the slowest at depth (42 tok/s), behind both turbo levels; its dequant path is less optimized in this build. And f16 K / turbo4 V is the quiet winner among the compressed-value configs: fastest at depth (55 tok/s) and best quality of the turbo-value set. That's notable because on Metal that exact combination was broken in my M5 Max testing, where the FlashAttention kernel didn't fast-path it and fell back to a path tens of times slower. On CUDA Blackwell it just works.

One label note, same as in the quality table: the turbo4 / turbo4 and turbo3 / turbo3 rows here ran with the keys auto-upgraded to q8_0 by the guard, which is exactly why turbo4 / turbo4 and q8_0 K / turbo4 V come out identical. The decode-at-depth tax is paid by the compressed value cache, not the keys.

Does the long context actually work, or just allocate?

Fitting a context is not the same as using it. So I ran a needle-in-a-haystack: bury a passphrase deep in filler text and ask the model to retrieve it, on the most aggressive config (q8_0 K / turbo2 V with 2x YaRN, keys pinned by the auto-asymmetric guard).

Prompt size	Needle depth	Retrieved?
41K tokens	~31K	Yes
264K tokens	~210K (beyond native)	Yes
481K tokens	~409K (deep in the YaRN zone)	Yes

It pulled the passphrase out exactly at every depth, including ~409K tokens into a 481K-token context, through 2-bit value cache and YaRN extension. The extended context is usable for recall, not just allocatable.

What I'm not claiming yet

Needle retrieval proves recall, not reasoning. I have not yet run a functional quality eval (a coding benchmark across the cache types) on this hardware; that's the next thing, and the honest open item that the original thread also asked for. I also haven't measured coherence at the aggressive 512K corner beyond retrieval, and I haven't swept the wider quant types (q4_0, q5_0, and friends). Those are on the list, not in this post.

Why this matters

The point of LLMKube has always been that you don't need a datacenter to run serious local inference. A few months ago "serious" meant fitting a 32B's weights on two consumer cards. This weekend it meant putting a 35B in front of a quarter-million tokens of context, on the same two cards, and reaching half a million with YaRN. The tradeoff is real and I'd rather state it plainly: you give up decode speed at depth, and you spend a little perplexity, to buy context that otherwise wouldn't fit. For long-document and long-session work on hardware a small team can actually afford, that is a trade worth having on the table.

Numbers, configs, and the honest caveats are tracked in issue #601. If you've pushed long context on consumer Blackwell, I'd love to compare notes.