Benchmarks 10 min read

Back to Shadowstack: a 35B at 256K context (and 512K with YaRN) on two consumer Blackwell cards

Christopher Maher
May 31, 2026

We've spent the last couple of months mostly on Apple Silicon. The TurboQuant KV-cache work, the mlx-server runtime, the long-context runs on an M5 Max: all of it has lived on a Mac. This weekend I wanted to turn back to Shadowstack, our dual consumer RTX 5060 Ti box, and see what thresholds we could push. The part that got me excited: a few months ago, on these exact two cards, a model like this at this context simply would not fit.

Where we've been

Back in December I ran a 32B stress test on Shadowstack: three 32B models, dual RTX 5060 Ti, proving you can serve production-scale weights on hardware under $1,200. That test was about fitting the weights. Since then most of the KV-cache experimentation moved to the M5 Max, where I posted a long-context TurboQuant run and a follow-up with perplexity and KL divergence after the r/LocalLLaMA comments asked for them.

This weekend's question was different. Not "can these cards hold a 35B," but "how much context can we put in front of one." The bottleneck for long context isn't the weights; it's the KV cache, and that is exactly what TurboQuant compresses. The catch: all of that compression had only ever run on Metal. None of it had touched consumer Blackwell.

First threshold: getting the kernels to run on Blackwell

The TurboQuant work lives in TheTom's llama.cpp fork (the feature/turboquant-kv-cache branch). It ships CUDA kernels for the turbo cache types, but nobody had built them for sm_120, the consumer Blackwell target the RTX 50-series uses. So step one was just compiling the fork for sm_120 (CUDA 12.8 is the floor for that arch) and confirming the turbo2, turbo3, and turbo4 flash-attention kernels actually launch on a 5060 Ti. They do. That alone is the thing that wasn't true a few months ago.
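For anyone reproducing the build step, here is a minimal sketch of what a CUDA build targeting sm_120 typically looks like, written as a small Python helper that assembles the CMake commands. `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` are the standard llama.cpp and CMake options; the `llama.cpp` checkout path and `build` directory are placeholder assumptions, and this is not necessarily the exact invocation used for these runs.

```python
import shlex


def cuda_build_commands(src_dir="llama.cpp", build_dir="build", arch="120"):
    """Return (configure, build) CMake commands for a single-arch CUDA build.

    src_dir and build_dir are placeholder paths (assumptions). arch="120"
    corresponds to sm_120, the consumer Blackwell target named in the post.
    """
    configure = [
        "cmake", "-S", src_dir, "-B", build_dir,
        "-DGGML_CUDA=ON",                      # enable the CUDA backend
        f"-DCMAKE_CUDA_ARCHITECTURES={arch}",  # compile kernels for sm_120 only
        "-DCMAKE_BUILD_TYPE=Release",
    ]
    build = ["cmake", "--build", build_dir, "--config", "Release", "-j"]
    return configure, build


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected.
    for cmd in cuda_build_commands():
        print(shlex.join(cmd))
```

Pinning a single architecture keeps compile time down; the toolkit itself has to be CUDA 12.8 or newer for sm_120 to be accepted.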

The setup

  • 2x NVIDIA RTX 5060 Ti, 16 GB each, 32 GB total (consumer Blackwell, compute capability 12.0)
  • Qwen3.6-35B-A3B, Q4_K_M GGUF (19.70 GiB, 34.66B params), layer-split across both cards
  • TheTom's TurboQuant llama.cpp fork, built for CUDA sm_120
  • llama-server / llama-bench / llama-perplexity for serving, throughput, and quality

Q4_K_M is the quant that fits: the weights come to ~21 GB (19.70 GiB), so a layer split across two 16 GB cards leaves real room for KV cache. Everything below is on those Q4_K_M weights, so the quality numbers reflect the config you would actually serve, not a fatter quant run for show. (My M5 Max quality post used Q8_0 weights, so the absolute perplexity differs; the KV-cache deltas are the comparable part.)
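The GiB-versus-GB distinction matters when budgeting headroom, so here is the arithmetic as a short sketch. It takes the two cards at their nominal 16 GB each, which is an assumption; the usable figure is lower once the CUDA context and compute buffers are allocated.

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one gigabyte


def gib_to_gb(gib):
    """Convert a size in GiB to decimal GB."""
    return gib * GIB / GB


weights_gb = gib_to_gb(19.70)   # Q4_K_M file size from the setup list
total_vram_gb = 2 * 16          # two cards at nominal size (assumption)
headroom_gb = total_vram_gb - weights_gb

print(f"weights: {weights_gb:.2f} GB")
print(f"nominal headroom for KV cache and buffers: {headroom_gb:.2f} GB")
```

That headroom is an upper bound, not a measured number; the measured VRAM totals are in the next section.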

The threshold: how much context fits

Qwen3.6-35B-A3B trains to a 262,144-token (256K) context. The question is whether you can actually hold that much KV cache on 32 GB after the weights. Same model, same hardware, only the KV cache type changes:

| KV cache (K / V) | Max context on 32 GB | VRAM at max |
|---|---|---|
| f16 / f16 | ~192K (out of memory at 256K) | 28.2 GB @ 192K |
| q8_0 K / turbo4 V | full 256K native | 27.3 GB |
| turbo3 K / turbo2 V | full 256K native | 26.9 GB |
| turbo3 K / turbo2 V + YaRN 2x | 512K (2x beyond native) | 25.5 GB |

This is the headline. With standard f16 KV, the 35B runs out of memory before it reaches its own native context: it caps somewhere around 192K. With turbo KV, the same model on the same two cards holds the full 256K with about 5 GB to spare, and pushing past native with YaRN reaches 512K. That 512K run needed llama.cpp to fall back from pipeline parallelism to fit the compute buffer, but it loads and serves.
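As a rough illustration of the 512K configuration, the sketch below assembles a `llama-server` argument list. The context, split-mode, cache-type, and RoPE/YaRN flags are upstream llama.cpp options; the `turbo3`/`turbo2` type names come from the fork as described above, the model filename is a placeholder, and the exact command used for these runs may differ. The flash-attention switch is left out on purpose because its spelling varies between llama.cpp versions, even though quantized V caches need it enabled.

```python
def long_context_server_args(model_path, native_ctx=262_144, yarn_factor=2,
                             cache_k="turbo3", cache_v="turbo2"):
    """Build a llama-server argv for a YaRN-extended, quantized-KV run.

    model_path is a placeholder (assumption). yarn_factor=1 disables the
    YaRN flags and serves at the native context length.
    """
    args = [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(native_ctx * yarn_factor),
        "--n-gpu-layers", "99",      # offload all layers
        "--split-mode", "layer",     # layer-split across both cards
        "--cache-type-k", cache_k,
        "--cache-type-v", cache_v,
    ]
    if yarn_factor > 1:
        args += [
            "--rope-scaling", "yarn",
            "--rope-scale", str(yarn_factor),
            "--yarn-orig-ctx", str(native_ctx),
        ]
    return args


if __name__ == "__main__":
    print(" ".join(long_context_server_args("model-Q4_K_M.gguf")))
```

With the defaults this requests a 524,288-token context, which is the 2x YaRN row in the table.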

One honest note on this model specifically: the absolute VRAM savings between the turbo4-V and turbo2-V configs are small (about 0.4 GB at 256K), because Qwen3.6's DeltaNet hybrid attention already keeps the standard KV path small. The footprint here is weight-dominated. On a dense model with full GQA, where the cache is the real bottleneck, the gap would be larger. For this model the win is "reaches and exceeds native context," not a giant memory multiplier.
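To make the hybrid-versus-dense point concrete, here is a back-of-the-envelope KV-cache estimate. The formula is the usual one (one K and one V vector per token, per KV head, per attention layer). Both model shapes below are hypothetical placeholders chosen for illustration, not the real Qwen3.6 dimensions, and the turbo bit widths are the nominal 4/3/2-bit figures with block overhead ignored.

```python
# Nominal bits per cached element. f16 is exact; q8_0 stores 32 int8 values
# plus one fp16 scale per block (8.5 bits/element). The turbo widths are
# nominal figures only; their real block overhead is not modeled.
BITS = {"f16": 16, "q8_0": 8.5, "turbo4": 4, "turbo3": 3, "turbo2": 2}


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, k_type, v_type):
    """Approximate KV-cache size in bytes for the standard attention layers."""
    elems = n_attn_layers * n_kv_heads * head_dim * n_ctx
    return elems * (BITS[k_type] + BITS[v_type]) / 8


# HYPOTHETICAL shapes: a hybrid model with few full-attention layers versus
# a dense GQA model where every layer keeps a KV cache.
HYBRID = dict(n_attn_layers=10, n_kv_heads=2, head_dim=256)
DENSE = dict(n_attn_layers=64, n_kv_heads=8, head_dim=128)

for name, shape in (("hybrid", HYBRID), ("dense", DENSE)):
    for k, v in (("f16", "f16"), ("q8_0", "turbo4"), ("turbo3", "turbo2")):
        gb = kv_cache_bytes(n_ctx=262_144, k_type=k, v_type=v, **shape) / 1e9
        print(f"{name:6s} {k:>6s} / {v:<6s} {gb:7.2f} GB")
```

The absolute numbers mean nothing for this model; the shape of the result is the point. When only a handful of layers carry a KV cache, shaving bits off each element moves the total by fractions of a gigabyte, while on the dense shape the same change moves it by tens of gigabytes.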

The numbers the community asked for: perplexity and KL divergence

When I posted the M5 Max KV runs, the first two comments asked for perplexity and KL divergence, and I said I'd add them. Here they are again, this time on Blackwell. Setup: llama-perplexity from the TurboQuant fork, wikitext-2-raw test set, context 8192, ~131K tokens evaluated. The f16 run saves a baseline logits file via --kl-divergence-base; every other run computes KL against it, so the comparisons are pinned to the same weights and tokenization.
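The two-pass workflow described here can be sketched as a pair of `llama-perplexity` argument lists: one f16 run that writes the baseline logits, then one run per cache type that compares against that file. `--kl-divergence-base` and `--kl-divergence` are upstream llama-perplexity flags. The file names are placeholders, and using the same type for both K and V in each comparison run is an assumption, not something confirmed by the setup description.

```python
def kl_run_commands(model, text_file, logits_file, cache_types, ctx=8192):
    """Return (baseline_argv, {cache_type: compare_argv}).

    model, text_file and logits_file are placeholder paths (assumptions).
    """
    base = ["llama-perplexity", "-m", model, "-c", str(ctx),
            "--n-gpu-layers", "99"]
    # Pass 1: f16 KV cache, evaluate the text and save logits to logits_file.
    baseline = base + [
        "-f", text_file,
        "--cache-type-k", "f16", "--cache-type-v", "f16",
        "--kl-divergence-base", logits_file,
    ]
    # Pass 2: same weights, different KV cache type, compare to saved logits.
    compare = {
        ct: base + [
            "--cache-type-k", ct, "--cache-type-v", ct,
            "--kl-divergence-base", logits_file,
            "--kl-divergence",
        ]
        for ct in cache_types
    }
    return baseline, compare


if __name__ == "__main__":
    baseline, compare = kl_run_commands(
        "model-Q4_K_M.gguf", "wiki.test.raw", "f16-baseline.logits",
        ["q8_0", "turbo4", "turbo3"],
    )
    print(" ".join(baseline))
    for argv in compare.values():
        print(" ".join(argv))
```

Because the comparison runs read their tokens from the saved logits file, every cache type is scored on exactly the same token stream as the baseline.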

| KV cache | PPL vs f16 | Mean KL div | Top-1 token agreement |
|---|---|---|---|
| f16 | 5.6258 ± 0.0516 (baseline) | baseline | n/a |
| q8_0 | +0.09% | 0.0067 ± 0.0004 | 96.81% ± 0.07 |
| turbo4 | +0.53% | 0.0117 ± 0.0005 | 95.46% ± 0.08 |
| turbo3 | +0.75% | 0.0148 ± 0.0004 | 94.74% ± 0.09 |

The quality cost is small and cleanly graded. turbo4 (4-bit values) costs about half a percent of perplexity and keeps the same top-1 token as f16 about 95% of the time; turbo3 (3-bit) is a touch more. The ranking is monotonic with compression, no surprises. q8_0 is nearly free. These are the prices you pay to fit the context above.
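For readers who want the three metrics spelled out, here is a minimal pure-Python sketch of the textbook definitions: perplexity as the exponential of mean negative log-likelihood, mean KL divergence of the test distribution from the baseline, and top-1 agreement as the fraction of positions where both runs pick the same argmax token. It operates on toy logits and does not reproduce llama-perplexity's chunking or its choice of scored positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def perplexity(logit_rows, targets):
    """exp(mean negative log-likelihood of the target tokens)."""
    nll = -sum(log_softmax(row)[t] for row, t in zip(logit_rows, targets))
    return math.exp(nll / len(targets))


def mean_kl(base_rows, test_rows):
    """Mean KL(base || test) across positions."""
    total = 0.0
    for b, t in zip(base_rows, test_rows):
        lp, lq = log_softmax(b), log_softmax(t)
        total += sum(math.exp(p) * (p - q) for p, q in zip(lp, lq))
    return total / len(base_rows)


def top1_agreement(base_rows, test_rows):
    """Fraction of positions where both runs choose the same argmax token."""
    same = sum(b.index(max(b)) == t.index(max(t))
               for b, t in zip(base_rows, test_rows))
    return same / len(base_rows)


if __name__ == "__main__":
    # Toy logits: the "test" run is a slightly perturbed copy of the baseline.
    base = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5]]
    test = [[1.9, 1.1, 0.1], [0.3, 2.8, 0.6]]
    print(perplexity(base, [0, 1]), mean_kl(base, test), top1_agreement(base, test))
```

KL divergence is the more sensitive of the three: it registers any shift in the output distribution, whereas perplexity only looks at the probability assigned to the reference token.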

Throughput, honestly

Here is where I have to correct myself. My first pass used quick single-shot timings at low context and I told myself decode was basically free across cache types. That was an artifact. Run it properly with llama-bench (warmup plus three repeats) and measure decode at depth, and a real tradeoff shows up:

| KV cache (K / V) | decode tok/s @ 0 | decode tok/s @ 64K |
|---|---|---|
| f16 / f16 | 117.0 | 84.2 |
| q8_0 / q8_0 | 115.0 | 42.1 |
| turbo4 / turbo4 | 114.4 | 51.6 |
| turbo3 / turbo3 | 113.9 | 53.2 |
| f16 K / turbo4 V | 115.2 | 55.0 |
| q8_0 K / turbo4 V | 114.4 | 51.6 |

At zero depth everything decodes around 115 tok/s. At 64K of depth the quantized caches pay a dequantization tax on every token: f16 holds 84 tok/s while the turbo configs drop to 52-55. So turbo KV is not free: it trades roughly 35-40% of decode speed at depth for context that f16 can't fit at all on this hardware. Prefill, by contrast, is nearly identical across cache types (about 2,450 tok/s at zero depth, 1,500 at 64K); it's compute-bound, not cache-read-bound.
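The slowdown range comes straight from the 64K column of the table. A short sketch of that arithmetic, which also converts throughput to per-token latency:

```python
# Decode throughput at 64K depth, copied from the table above (tok/s).
DECODE_AT_64K = {
    "f16 / f16": 84.2,
    "q8_0 / q8_0": 42.1,
    "turbo4 / turbo4": 51.6,
    "turbo3 / turbo3": 53.2,
    "f16 K / turbo4 V": 55.0,
    "q8_0 K / turbo4 V": 51.6,
}


def slowdown_vs(baseline_tps, tps):
    """Fraction of decode throughput lost relative to the baseline."""
    return 1 - tps / baseline_tps


def ms_per_token(tps):
    """Per-token decode latency in milliseconds."""
    return 1000 / tps


baseline = DECODE_AT_64K["f16 / f16"]
for name, tps in DECODE_AT_64K.items():
    print(f"{name:18s} {tps:5.1f} tok/s  "
          f"{ms_per_token(tps):5.1f} ms/tok  "
          f"{slowdown_vs(baseline, tps):6.1%} slower than f16")
```

In latency terms, f16 at depth is about 12 ms per token and the turbo configs are about 18 to 19 ms per token.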

Two things worth flagging. q8_0 is the slowest at depth (42 tok/s), behind both turbo levels; its dequant path is less optimized in this build. And f16 K / turbo4 V is the quiet winner among the compressed-value configs: fastest at depth (55 tok/s) and best quality of the turbo-value set. That's notable because on Metal that exact combination was broken in my M5 Max testing, where the FlashAttention kernel didn't fast-path it and fell back to a path tens of times slower. On CUDA Blackwell it just works.
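A sweep like the one behind the throughput table could be scripted along these lines. The `-ctk`, `-ctv`, `-d` (depth), `-r` (repetitions), `-p`, and `-n` flags are upstream llama-bench options, and warmup is on by default. The model path, the 128-token generation length, and treating 64K as 65,536 tokens are assumptions; the flash-attention flag is omitted because its accepted values differ between llama.cpp versions.

```python
# K/V cache-type pairs from the throughput table.
CONFIGS = [
    ("f16", "f16"),
    ("q8_0", "q8_0"),
    ("turbo4", "turbo4"),
    ("turbo3", "turbo3"),
    ("f16", "turbo4"),
    ("q8_0", "turbo4"),
]


def bench_args(model, cache_k, cache_v, depths=(0, 65536), n_gen=128, reps=3):
    """Build a llama-bench argv measuring decode speed at several depths.

    model is a placeholder path; n_gen and the depth values are assumptions.
    """
    return [
        "llama-bench",
        "-m", model,
        "-ngl", "99",
        "-sm", "layer",
        "-ctk", cache_k,
        "-ctv", cache_v,
        "-p", "0",                           # skip the standalone prefill test
        "-n", str(n_gen),                    # tokens generated per repetition
        "-d", ",".join(map(str, depths)),    # context depth before generating
        "-r", str(reps),
    ]


if __name__ == "__main__":
    for k, v in CONFIGS:
        print(" ".join(bench_args("model-Q4_K_M.gguf", k, v)))
```

Measuring at depth is what exposes the tradeoff: at depth 0 the cache is nearly empty, so cache-read cost barely registers in the timing.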

Does the long context actually work, or just allocate?

Fitting a context is not the same as using it. So I ran a needle-in-a-haystack: bury a passphrase deep in filler text and ask the model to retrieve it, on the most aggressive config (turbo3 K / turbo2 V with 2x YaRN).

| Prompt size | Needle depth | Retrieved? |
|---|---|---|
| 41K tokens | ~31K | Yes |
| 264K tokens (beyond native) | ~210K | Yes |
| 481K tokens | ~409K (deep in the YaRN zone) | Yes |

It pulled the passphrase out exactly at every depth, including ~409K tokens into a 481K-token context, through 2-bit value cache and YaRN extension. The extended context is usable for recall, not just allocatable.
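A needle-in-a-haystack prompt of this kind is simple to construct. The sketch below is a generic version, not the harness used for these runs: the filler sentence, the passphrase, and the question wording are all made-up placeholders, and depth is expressed as a fraction of filler sentences rather than an exact token count.

```python
def build_haystack(filler_sentence, needle, n_sentences, depth_fraction):
    """Return filler text with the needle inserted at a relative depth.

    depth_fraction=0.0 puts the needle first, 1.0 puts it last.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    sentences = [filler_sentence] * n_sentences
    sentences.insert(round(n_sentences * depth_fraction), needle)
    return " ".join(sentences)


def retrieved(response, passphrase):
    """Exact-match check: did the model's answer contain the passphrase?"""
    return passphrase in response


if __name__ == "__main__":
    passphrase = "amber-falcon-42"  # hypothetical passphrase
    prompt = build_haystack(
        "The quarterly report was filed without incident.",
        f"The secret passphrase is {passphrase}.",
        n_sentences=1000,
        depth_fraction=0.85,        # roughly 409K into 481K
    )
    prompt += "\n\nWhat is the secret passphrase?"
    print(len(prompt.split()), "words in prompt")
```

To hit a specific token depth, tokenize the filler with the model's own tokenizer and size `n_sentences` from the measured tokens per sentence.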

What I'm not claiming yet

Needle retrieval proves recall, not reasoning. I have not yet run a functional quality eval (a coding benchmark across the cache types) on this hardware; that's the next thing, and the honest open item that the original thread also asked for. I also haven't measured coherence at the aggressive 512K corner beyond retrieval, and I haven't swept the wider quant types (q4_0, q5_0, and friends). Those are on the list, not in this post.

Why this matters

The point of LLMKube has always been that you don't need a datacenter to run serious local inference. A few months ago "serious" meant fitting a 32B's weights on two consumer cards. This weekend it meant putting a 35B in front of a quarter-million tokens of context, on the same two cards, and reaching half a million with YaRN. The tradeoff is real and I'd rather state it plainly: you give up decode speed at depth, and you spend a little perplexity, to buy context that otherwise wouldn't fit. For long-document and long-session work on hardware a small team can actually afford, that is a trade worth having on the table.

Numbers, configs, and the honest caveats are tracked in issue #601. If you've pushed long context on consumer Blackwell, I'd love to compare notes.
