vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data
After the Part 2 KV cache post, TheTom asked whether we'd run his vllm-swift through the same kind of sweep. Different engine class entirely — vLLM scheduler with a Swift/Metal hot path replacing the Python/MLX worker, MLX-format weights instead of GGUF, and a BatchedKVCache built for concurrent decode. We had cycles, so we ran it. 36 cells, three KV schemes, four depths, three concurrency levels. Honest data below.
TL;DR
- fp16 wins decode at every cell we tested. turbo4v2 lands 10–17% slower per sequence; turbo3 lands 19–39% slower. The bandwidth-bound crossover that powered turbo on the llama.cpp fork past 256K doesn't appear here at ≤32K. Either it crosses over later (we couldn't push past 32K cleanly — see caveats), or vllm-swift's contiguous KV access pattern changes the regime entirely.
- fp16 at d=32K, B=32 runs at 3,078 tok/s aggregate for a 35B-A3B MoE on a single MacBook Pro. Working set is ~56 GB (35 GB weights + ~21 GB KV) thanks to the model's hybrid attention pattern — only 10 of 40 layers carry full-attention KV. Comfortably within 128 GB UMA at this depth.
- Per-sequence decode is essentially flat across concurrency. fp16 at d=8K: 102.2 / 110.6 / 110.9 tok/s at B=1 / 8 / 32. That's the BatchedKVCache promise delivered — aggregate scales near-linearly in batch size.
- Prefill is uniform across KV schemes. Peak prefill ~3,800 tok/s for all three (fp16, turbo4v2, turbo3). KV compression doesn't hurt prefill in this regime, since prefill writes the cache once rather than reading per-step.
- The honest TurboQuant+ pitch on this engine at this depth range is memory-ceiling, not throughput. turbo3's 4.6× compression buys longer context or larger batches that don't fit otherwise. It does not buy faster decode at any cell we measured.
Why we ran this
Two of TheTom's recent releases sit on different engine substrates. The llama-cpp-turboquant fork we benched in Part 1 and Part 2 integrates TurboQuant directly into llama.cpp's GGML kernels. vllm-swift is something different: a vLLM platform plugin that swaps vLLM's MLX worker for a Swift/Metal one talking through a C bridge, with TurboQuant+ exposed via `--additional-config '{"kv_scheme": ..., "kv_bits": ...}'`. Same compression family, different engine context, different scheduler.
TheTom landed v0.3.0 a few days back and pinged us on the upstream llama.cpp discussion thread asking if we'd run vllm-swift through the same kind of A/B. Reasonable ask — the published vllm-swift README has short-context concurrency numbers, but nobody had pushed turbo3 / turbo4v2 across depth and batch size on the same hardware our llama.cpp data came from. We had the M5 Max free this morning. So we ran it.
Setup
- Hardware: MacBook Pro M5 Max, 128 GB unified memory.
- Engine: `vllm-swift` v0.3.0 from TheTom's Homebrew tap (`brew install TheTom/tap/vllm-swift`). Bottle ships `libVLLMBridge.dylib` at `/opt/homebrew/Cellar/vllm-swift/0.3.0/lib`.
- Model: `mlx-community/Qwen3.6-35B-A3B-8bit` (35 GB on disk). Picked to match the `Qwen3.6-35B-A3B-Q8_0.gguf` baseline from Parts 1 and 2 as closely as MLX format allows.
- Matrix: KV scheme × depth × concurrency = 3 × 4 × 3 = 36 cells. KV: `fp16`, `turbo4v2`, `turbo3`. Depths: 1024, 8192, 16384, 32768. Concurrency: 1, 8, 32. Subprocess-per-cell isolation per TheTom's published bench methodology; greedy decoding (temp=0); 50 generation tokens; unique prompts per slot (no prefix-cache hits).
- Implementation: routed through the C bridge (`vsm_engine_create`) directly rather than vLLM's `LLM` class, because vllm 0.19.1 in the bottled venv has a model-resolution bug that mis-routed our path-style model argument to a default `Qwen/Qwen3-0.6B` from the HuggingFace cache. The bridge accepts `kv_scheme` and `kv_bits` as direct C API parameters, which makes the matrix straightforward to drive in subprocesses (a rough driver sketch follows this list).
- Total wall-clock: 53 minutes for the full 36-cell matrix. No OOMs, no failed cells.
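For the curious, here's roughly what the matrix driver looks like. This is a minimal sketch, not the actual harness: the per-cell worker name (`bench_cell.py`) and its flags are placeholders, but the matrix dimensions, subprocess-per-cell isolation, greedy decoding, and 50-token generation mirror the setup above.

```python
import itertools
import json
import subprocess

# Matrix from the setup above: 3 KV schemes x 4 depths x 3 concurrency levels = 36 cells.
KV_SCHEMES = ["fp16", "turbo4v2", "turbo3"]
DEPTHS = [1024, 8192, 16384, 32768]
BATCH_SIZES = [1, 8, 32]

MODEL = "mlx-community/Qwen3.6-35B-A3B-8bit"

results = []
for kv_scheme, depth, batch in itertools.product(KV_SCHEMES, DEPTHS, BATCH_SIZES):
    # One subprocess per cell so engine/Metal state never leaks between cells.
    # `bench_cell.py` is a placeholder for the per-cell worker that talks to the
    # C bridge (vsm_engine_create) and prints one JSON result on stdout.
    proc = subprocess.run(
        [
            "python", "bench_cell.py",
            "--model", MODEL,
            "--kv-scheme", kv_scheme,
            "--depth", str(depth),
            "--batch", str(batch),
            "--gen-tokens", "50",   # 50 generation tokens per sequence
            "--temperature", "0",   # greedy decoding
        ],
        capture_output=True, text=True, check=True,
    )
    results.append(json.loads(proc.stdout))

with open("matrix_results.json", "w") as f:
    json.dump(results, f, indent=2)
```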
Per-sequence decode tok/s, B=32
This is the cell that matters most for practical workloads — multiple concurrent users, full depth. Decode tok/s per sequence (multiply by B for aggregate engine throughput):
| Depth | fp16 | turbo4v2 | turbo3 |
|---|---|---|---|
| 1024 | 115.4 | 104.0 (−10%) | 93.9 (−19%) |
| 8192 | 110.9 | 97.4 (−12%) | 70.8 (−36%) |
| 16384 | 105.9 | 91.0 (−14%) | 67.2 (−37%) |
| 32768 | 96.2 | 79.7 (−17%) | 58.3 (−39%) |
Aggregate engine throughput at B=32 (per-seq × 32):
| Depth | fp16 | turbo4v2 | turbo3 |
|---|---|---|---|
| 1024 | 3,693 | 3,328 | 3,005 |
| 8192 | 3,549 | 3,117 | 2,266 |
| 16384 | 3,389 | 2,912 | 2,150 |
| 32768 | 3,078 | 2,550 | 1,866 |
fp16 aggregate at d=32K, B=32 lands at 3,078 tok/s. That's a useful production number for short-prompt, long-context workloads on a 35B-A3B MoE on a single MacBook Pro — with no compression at all, and no OOM.
The expected crossover never showed up
In Part 1 on the llama.cpp fork, turbo3 prefill caught up with q8_0 around 128K and turbo4 decode beat fp16 past 256K. The story was that once contexts grow large enough, dequantization work becomes cheaper than the bandwidth saved by reading a smaller cache, and the compressed schemes pull ahead. That's the bandwidth-bound regime crossover.
Across all 36 cells we ran on vllm-swift, fp16 wins decode every single time. There's no crossover at 1K, 8K, 16K, or 32K. The gap to turbo4v2 grows with depth (10% at 1K, 17% at 32K), but it doesn't close, much less invert. turbo3 sits even further behind, and the gap also widens with depth.
Two non-exclusive explanations for the difference vs the llama.cpp data:
- The crossover is past where we tested. vllm-swift's README sets `--max-model-len 40960` in its examples (with a `# max 40960` comment), so we capped the matrix at 32K depth. The model itself supports `262144` natively per its `text_config.max_position_embeddings`, so the 40K cap is a vllm-swift convention rather than a model constraint. The llama.cpp data didn't show turbo4 winning decode until 256K, so the crossover may live at 64K, 128K, or beyond; we just didn't push past 32K in this run.
- The KV access pattern is different. vllm-swift uses a contiguous `BatchedKVCache` that avoids paged attention's block-table indirection. That's exactly what makes its short-context decode fast in the README's Qwen3-4B numbers, but it also means the regime where reading a smaller cache wins on bandwidth might never trigger the same way it does in llama.cpp's GGML kernels. Different access pattern, different bottleneck.
We can't distinguish those two from the data we have. The honest framing is: at ≤32K depth on this engine on this hardware, fp16 wins decode and TurboQuant+ buys you context-ceiling rather than speed.
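For a sense of the bandwidth stakes, here's a rough back-of-envelope on what decode reads per generated token, assuming each step re-reads the full per-sequence KV for the 10 full-attention layers (the ~20 KB/token figure is derived in the memory-budget section below) and taking the 3× / 4.6× compression ratios quoted for turbo4v2 / turbo3 at face value:

```python
# KV bytes read per generated token, per sequence, assuming decode re-reads
# the full per-sequence cache each step (only the 10 full-attention layers grow KV).
KV_BYTES_PER_TOKEN_FP16 = 10 * 2 * 256 * 2 * 2  # layers x KV heads x head_dim x (K+V) x fp16 bytes = 20,480

# Compression ratios quoted in this post for the TurboQuant+ schemes (assumed exact here).
COMPRESSION = {"fp16": 1.0, "turbo4v2": 3.0, "turbo3": 4.6}

for depth in (1024, 8192, 16384, 32768):
    per_step = {s: depth * KV_BYTES_PER_TOKEN_FP16 / r / 2**20 for s, r in COMPRESSION.items()}
    print(depth, {s: f"{mib:.0f} MiB" for s, mib in per_step.items()})
# At d=32K: ~640 MiB (fp16) vs ~213 MiB (turbo4v2) vs ~139 MiB (turbo3) read per
# decode step per sequence: a real bandwidth saving, just not one that outweighed
# the dequantization cost at these depths on this engine.
```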
Per-sequence decode is flat across concurrency
The non-obvious win in this data is the concurrency dimension. Per-sequence decode tok/s for fp16, across all four depths and all three concurrency levels:
| Depth | B=1 | B=8 | B=32 |
|---|---|---|---|
| 1024 | 107.3 | 116.7 | 115.4 |
| 8192 | 102.2 | 110.6 | 110.9 |
| 16384 | 97.3 | 106.0 | 105.9 |
| 32768 | 90.5 | 96.7 | 96.2 |
The numbers in each row hardly move across B=1, B=8, B=32. Per-sequence throughput at B=8 even slightly beats B=1 (because batched matmul is more efficient than single-sequence at this scale). Aggregate scales nearly linearly with B. That's the BatchedKVCache promise delivered — on a 35B-A3B MoE, the engine handles 32 concurrent decoders with no per-seq slowdown to speak of.
This is the cell where vllm-swift earns its keep relative to llama.cpp's llama-server path: the underlying engine is built for concurrent decode where llama.cpp is built for single-sequence depth.
Prefill: uniform across KV schemes
Prefill tok/s at d=8K, B=8 (the cell where prefill peaks for all three schemes):
| KV scheme | Prefill tok/s |
|---|---|
| fp16 | 3,857 |
| turbo4v2 | 3,836 |
| turbo3 | 3,766 |
Within 2.5% across all three. Consistent with theory — prefill writes the KV cache once and reads it once per token of new context, so the per-step compression cost is small relative to the matmul-heavy work. This matches the regime described in TheTom's vllm-swift performance docs: short-context wins are dominated by the engine's reduced Python overhead, not the KV scheme.
Memory budget at d=32K, B=32
A useful sanity check on this cell. Pulling the 35B-A3B's `text_config` out of `config.json`:
- `num_hidden_layers`: 40
- `layer_types`: 10 entries are `full_attention`, 30 are `linear_attention` (every fourth layer is full attention; `full_attention_interval`: 4)
- `num_key_value_heads`: 2 (heavy GQA)
- `head_dim`: 256
- `max_position_embeddings`: 262144
Only the 10 full-attention layers carry a growing KV cache; the linear-attention layers don't accumulate KV per token. That gives us:
- Weights at 8-bit MLX: ~35 GB
- fp16 KV per token: 10 layers × 2 KV heads × 256 head_dim × 2 (K+V) × 2 bytes = ~20 KB
- Per sequence at d=32K: 32K × 20 KB ~= 640 MB
- B=32 at d=32K: 32 × 640 MB = ~20 GB just for KV
- Total working set: ~56 GB. Plenty of headroom in 128 GB UMA.
So fp16 d=32K B=32 fitting isn't a surprise — it's exactly what the architecture supports. The practical implication still stands: you can run 32 concurrent 32K-token contexts on a 35B-A3B MoE on a single MacBook Pro without KV compression, at 3,078 tok/s aggregate decode. An earlier draft of this post had wrong KV math (carrying over Qwen3-4B numbers: 36 layers × 8 KV heads × 128 head_dim) and called the result "surprising." It isn't. The architecture-aware math gets you there in one line.
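If you want to rerun that arithmetic for a different depth or batch size, here it is as a few lines of Python, with the figures taken straight from the `config.json` fields listed above:

```python
# Working-set estimate for Qwen3.6-35B-A3B-8bit at d=32K, B=32 with fp16 KV.
full_attn_layers = 10        # of 40 layers; the 30 linear-attention layers don't accumulate KV
kv_heads, head_dim = 2, 256  # num_key_value_heads, head_dim from config.json
bytes_per_elem = 2           # fp16

kv_per_token = full_attn_layers * kv_heads * head_dim * 2 * bytes_per_elem  # K + V -> 20,480 bytes
depth, batch = 32_768, 32
kv_total_gb = kv_per_token * depth * batch / 1e9   # ~21 GB of KV
weights_gb = 35                                     # 8-bit MLX weights on disk
print(f"KV: {kv_total_gb:.1f} GB, working set ~{weights_gb + kv_total_gb:.0f} GB")  # ~56 GB
```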
Updated recommendation
| Workload on vllm-swift, M5 Max, ≤32K depth | KV scheme | Why |
|---|---|---|
| High-concurrency batched serving (chat, agent fleets) | fp16 | Best decode at every depth and concurrency we measured. Fits 32 sequences at 32K on 128 GB UMA. |
| Memory-pressured: more concurrent users than fp16 fits | turbo4v2 | 10–17% per-seq decode cost, 3× KV memory savings. Use when you need batch beyond what fp16 fits. |
| Memory-pressured: largest possible context on this engine | turbo3 | 19–39% per-seq decode cost, 4.6× KV memory savings. Only choice if turbo4v2 still doesn't fit. |
| Single-stream long-context (1M) | turbo3 on the llama.cpp fork | vllm-swift caps at 40K context per its published examples; llama.cpp + turbo3 is the path past that. |
Caveats
- Capped at d=32K. vllm-swift's README example sets `--max-model-len 40960` with a `# max 40960` comment, so we capped the matrix at 32K. The model itself supports `262144` (256K) natively per its `text_config.max_position_embeddings`, so the 40K limit reads like a vllm-swift convention rather than a model constraint — we just didn't find a knob in the published examples to push past it. The llama.cpp turbo crossover only appeared past 256K, so the regime where compression starts beating fp16 may simply live past where we tested. This is the single biggest reason to read the data with care.
- vllm 0.19.1 in the bottled venv has a model-resolution bug that breaks the standard `LLM(model="/local/path")` path for MLX-format directories — it falls back to a default `Qwen/Qwen3-0.6B` from the HF cache. We routed through the C bridge directly to bypass it. Same numerical engine path TheTom's `bench_throughput.py` uses, just extended to take `kv_scheme` and `kv_bits`.
- turbo4v2 in vllm-swift is not the same kernel as turbo4 in the llama.cpp fork. Same compression family, different generation. We did not run a quality smoke (PPL or KL) to confirm turbo4v2 reproduces turbo4's quality numbers from Part 2; that's a follow-up.
- Single hardware data point. M5 Max, 128 GB UMA. Memory bandwidth and GPU core count differ enough across Apple Silicon that the regime may shift on M2 Pro / M3 Ultra / M4 Max. If you have non-M5 Apple Silicon and want to run a slice, drop the numbers and we'll fold them into a comparison.
- vllm-metal baseline column not included. TheTom's published comparison runs vllm-swift against vllm-metal (Python/MLX) at every cell. We skipped that column to ship today. The published vllm-swift performance doc already has that comparison at short context, and our cross-engine baseline of choice is the llama.cpp fork data from Part 1.
Methodology
Hardware: MacBook Pro M5 Max, 128 GB unified memory. Engine: vllm-swift v0.3.0 from TheTom's Homebrew tap, bottled. Bridge dylib at /opt/homebrew/Cellar/vllm-swift/0.3.0/lib/libVLLMBridge.dylib. Model: mlx-community/Qwen3.6-35B-A3B-8bit downloaded via vllm-swift download.
Bench routed through the C bridge directly via vsm_engine_create(model_path, "float16", max_kv, kv_scheme, kv_bits, 0.9). Subprocess-per-cell to dodge the EngineCore zombie process issue documented in vllm-swift's performance doc. Greedy decode (temp=0), 50 generation tokens, unique prompts per slot, two-step warmup before timed decode. Three KV schemes × four depths × three concurrency levels = 36 cells, one rep per cell. Reported numbers are: per-sequence decode tok/s = total_decoded_tokens / decode_window; aggregate engine tok/s = per_seq × B; prefill tok/s = (prompt_tokens × B) / prefill_seconds.
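For clarity on the reduction step, here's a minimal sketch of how those three reported numbers fall out of raw per-cell timings. Field names are illustrative, not the harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CellTimings:
    """Raw measurements for one cell; field names are illustrative."""
    batch: int              # concurrency B
    prompt_tokens: int      # prompt length per sequence (the depth)
    decoded_per_seq: int    # generation tokens per sequence (50 in this run)
    prefill_seconds: float  # wall-clock of the prefill phase
    decode_seconds: float   # wall-clock of the timed decode window

def reduce_cell(t: CellTimings) -> dict:
    per_seq = t.decoded_per_seq / t.decode_seconds  # per-sequence decode tok/s
    return {
        "per_seq_decode_tps": per_seq,
        # e.g. 96.2 tok/s per sequence at B=32 -> 3,078 tok/s aggregate, the d=32K row above
        "aggregate_decode_tps": per_seq * t.batch,
        "prefill_tps": t.prompt_tokens * t.batch / t.prefill_seconds,  # (prompt_tokens x B) / prefill_seconds
    }
```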
Phase 0 sanity: ran TheTom's published bench_throughput.py on Qwen3-4B-4bit with default flags. Reproduced his README short-context numbers within 5% (B=8 ours 480.2 vs published 477; B=32 ours 1185 vs published 1194; B=64 ours 1482 vs published 1518). Bench rig validated, then we moved to the 35B matrix.
Total wall-clock: 53 minutes for the 36-cell matrix. The memory-agent was stopped during the run to keep the memory budget clean. Raw per-cell JSON and the bench harness will be in the LLMKube benchmarks repo; reach out if you want them ahead of that.
What we'd run next
- Quality smoke for turbo4v2 vs turbo4. Confirm the v2 kernel reproduces turbo4's PPL and KL numbers from Part 2. Quick run; we just didn't have time today.
- Push past 40K. Find out whether the bandwidth-bound crossover lives at 64K, 128K, or never, on this engine. The model's `text_config.max_position_embeddings` is 262144, so the path may be a runtime override of vllm-swift's example cap rather than a full RoPE-scaling exercise — needs a poke at the engine config to confirm.
- vllm-metal baseline column. Drop in the missing comparison so the table fully matches TheTom's published methodology rather than relying on cross-engine reference to llama.cpp.
- q4_0 / q5_0 / q4_1 / q5_1 / iq4_nl on the llama.cpp fork. The wider standard quants commenters asked for in Part 2. That's a separate post (Part 3b in the series); the data is already in hand.
Thanks again to TheTom for shipping vllm-swift publicly and being open to a head-to-head bench against the data we already had on the llama.cpp side. The honest framing here doesn't make TurboQuant+ on vllm-swift look like a straight throughput win at ≤32K, but the engine's concurrency story is genuinely good and the memory-ceiling argument for turbo3 still holds for batched workloads that don't fit at fp16. If the crossover lives past 32K on this engine, we'd love to see the data; happy to run a sweep if RoPE scaling lands cleanly.