Skip to content
Back to blog
Benchmarks 16 min read

vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data

Christopher Maher
Christopher Maher

After the Part 2 KV cache post, TheTom asked whether we'd run his vllm-swift through the same kind of sweep. Different engine class entirely — vLLM scheduler with a Swift/Metal hot path replacing the Python/MLX worker, MLX-format weights instead of GGUF, and a BatchedKVCache built for concurrent decode. We had cycles, so we ran it. 36 cells, three KV schemes, four depths, three concurrency levels. Honest data below.

TL;DR

  1. fp16 wins decode at every cell we tested. turbo4v2 lands 10–17% slower per sequence; turbo3 lands 19–39% slower. The bandwidth-bound crossover that powered turbo on the llama.cpp fork past 256K doesn't appear here at ≤32K. Either it crosses over later (we couldn't push past 32K cleanly — see caveats), or vllm-swift's contiguous KV access pattern changes the regime entirely.
  2. fp16 at d=32K, B=32 runs at 3,078 tok/s aggregate for a 35B-A3B MoE on a single MacBook Pro. Working set is ~56 GB (35 GB weights + ~21 GB KV) thanks to the model's hybrid attention pattern — only 10 of 40 layers carry full-attention KV. Comfortably within 128 GB UMA at this depth.
  3. Per-sequence decode is essentially flat across concurrency. fp16 at d=8K: 102.2 / 110.6 / 110.9 tok/s at B=1 / 8 / 32. That's the BatchedKVCache promise delivered — aggregate scales near-linearly in batch size.
  4. Prefill is uniform across KV schemes. Peak prefill ~3,800 tok/s for all three (fp16, turbo4v2, turbo3). KV compression doesn't hurt prefill in this regime, since prefill writes the cache once rather than reading per-step.
  5. The honest TurboQuant+ pitch on this engine at this depth range is memory-ceiling, not throughput. turbo3's 4.6× compression buys longer context or larger batches that don't fit otherwise. It does not buy faster decode at any cell we measured.
  6. Update 2026-05-02: we extended the matrix to d=64K / 128K / 192K. fp16 d=128K B=32 and fp16 d=192K B=32 both hit the memory ceiling and SKIPPED. turbo4v2 ran both: 1,360 tok/s aggregate at d=128K B=32, 1,024 tok/s aggregate at d=192K B=32. That is the value-prop confirmation in two cells. Per-seq decode degrades smoothly with depth, no bandwidth-bound crossover ever appears. Deep-context section below.

Why we ran this

Two of TheTom's recent releases sit on different engine substrates. The llama-cpp-turboquant fork we benched in Part 1 and Part 2 integrates TurboQuant directly into llama.cpp's GGML kernels. vllm-swift is something different: a vLLM platform plugin that swaps vLLM's MLX worker for a Swift/Metal one talking through a C bridge, with TurboQuant+ exposed via --additional-config '{"kv_scheme": ..., "kv_bits": ...}'. Same compression family, different engine context, different scheduler.

TheTom landed v0.3.0 a few days back and pinged us on the upstream llama.cpp discussion thread asking if we'd run vllm-swift through the same kind of A/B. Reasonable ask — the published vllm-swift README has short-context concurrency numbers, but nobody had pushed turbo3 / turbo4v2 across depth and batch size on the same hardware our llama.cpp data came from. We had the M5 Max free this morning. So we ran it.

Setup

  • Hardware: MacBook Pro M5 Max, 128 GB unified memory.
  • Engine: vllm-swift v0.3.0 from TheTom's Homebrew tap (brew install TheTom/tap/vllm-swift). Bottle ships libVLLMBridge.dylib at /opt/homebrew/Cellar/vllm-swift/0.3.0/lib.
  • Model: mlx-community/Qwen3.6-35B-A3B-8bit (35 GB on disk). Picked to match the Qwen3.6-35B-A3B-Q8_0.gguf baseline from Parts 1 and 2 as closely as MLX format allows.
  • Matrix: KV scheme × depth × concurrency = 3 × 4 × 3 = 36 cells. KV: fp16, turbo4v2, turbo3. Depths: 1024, 8192, 16384, 32768. Concurrency: 1, 8, 32. Subprocess-per-cell isolation per TheTom's published bench methodology; greedy decoding (temp=0); 50 generation tokens; unique prompts per slot (no prefix-cache hits).
  • Implementation: routed through the C bridge (vsm_engine_create) directly rather than vLLM's LLM class because vllm 0.19.1 in the bottled venv has a model-resolution bug that mis-routed our path-style model argument to a default Qwen/Qwen3-0.6B from the HuggingFace cache. The bridge accepts kv_scheme and kv_bits as direct C API parameters, which makes the matrix straightforward to drive in subprocesses.
  • Total wall-clock: 53 minutes for the full 36-cell matrix. No OOMs, no failed cells.

Per-sequence decode tok/s, B=32

This is the cell that matters most for practical workloads — multiple concurrent users, full depth. Decode tok/s per sequence (multiply by B for aggregate engine throughput):

Depthfp16turbo4v2turbo3
1024115.4104.0 (−10%)93.9 (−19%)
8192110.997.4 (−12%)70.8 (−36%)
16384105.991.0 (−14%)67.2 (−37%)
3276896.279.7 (−17%)58.3 (−39%)

Aggregate engine throughput at B=32 (per-seq × 32):

Depthfp16turbo4v2turbo3
10243,6933,3283,005
81923,5493,1172,266
163843,3892,9122,150
327683,0782,5501,866

fp16 aggregate at d=32K, B=32 lands at 3,078 tok/s. That's a useful production number for short-prompt, long-context workloads on a 35B-A3B MoE on a single MacBook Pro — with no compression at all, and no OOM.

The expected crossover never showed up

In Part 1 on the llama.cpp fork, turbo3 prefill caught up with q8_0 around 128K and turbo4 decode beat fp16 past 256K. The story was that once contexts grow large enough, dequantization work becomes cheaper than the bandwidth saved by reading a smaller cache, and the compressed schemes pull ahead. That's the bandwidth-bound regime crossover.

Across all 36 cells we ran on vllm-swift, fp16 wins decode every single time. There's no crossover at 1K, 8K, 16K, or 32K. The gap to turbo4v2 grows with depth (10% at 1K, 17% at 32K), but it doesn't close, much less invert. turbo3 sits even further behind, and the gap also widens with depth.

Two non-exclusive explanations for the difference vs the llama.cpp data:

  • The crossover is past where we tested. vllm-swift's README sets --max-model-len 40960 in its examples (with a # max 40960 comment), so we capped the matrix at 32K depth. The model itself supports 262144 natively per its text_config.max_position_embeddings, so the 40K cap is a vllm-swift convention rather than a model constraint. The llama.cpp data didn't show turbo4 winning decode until 256K, so the crossover may live at 64K, 128K, or beyond; we just didn't push past 32K in this run.
  • The KV access pattern is different. vllm-swift uses a contiguous BatchedKVCache that avoids paged attention's block-table indirection. That's exactly what makes its short-context decode fast in the README's Qwen3-4B numbers, but it also means the regime where reading a smaller cache wins on bandwidth might never trigger the same way it does in llama.cpp's GGML kernels. Different access pattern, different bottleneck.

We can't distinguish those two from the data we have. The honest framing is: at ≤32K depth on this engine on this hardware, fp16 wins decode and TurboQuant+ buys you context-ceiling rather than speed.

Per-sequence decode is flat across concurrency

The non-obvious win in this data is the concurrency dimension. Per-sequence decode tok/s for fp16, across all four depths and all three concurrency levels:

DepthB=1B=8B=32
1024107.3116.7115.4
8192102.2110.6110.9
1638497.3106.0105.9
3276890.596.796.2

The numbers in each row hardly move across B=1, B=8, B=32. Per-sequence throughput at B=8 even slightly beats B=1 (because batched matmul is more efficient than single-sequence at this scale). Aggregate scales nearly linearly with B. That's the BatchedKVCache promise delivered — on a 35B-A3B MoE, the engine handles 32 concurrent decoders with no per-seq slowdown to speak of.

This is the cell where vllm-swift earns its keep relative to llama.cpp's llama-server path: the underlying engine is built for concurrent decode where llama.cpp is built for single-sequence depth.

Prefill: uniform across KV schemes

Prefill tok/s at d=8K, B=8 (the cell where prefill peaks for all three schemes):

KV schemePrefill tok/s
fp163,857
turbo4v23,836
turbo33,766

Within 2.5% across all three. Consistent with theory — prefill writes the KV cache once and reads it once per token of new context, so the per-step compression cost is small relative to the matmul-heavy work. This matches the regime described in TheTom's vllm-swift performance docs: short-context wins are dominated by the engine's reduced Python overhead, not the KV scheme.

Memory budget at d=32K, B=32

A useful sanity check on this cell. Pulling the 35B-A3B's text_config out of config.json:

  • num_hidden_layers: 40
  • layer_types: 10 entries are full_attention, 30 are linear_attention (every fourth layer is full attention; full_attention_interval: 4)
  • num_key_value_heads: 2 (heavy GQA)
  • head_dim: 256
  • max_position_embeddings: 262144

Only the 10 full-attention layers carry a growing KV cache; the linear-attention layers don't accumulate KV per token. That gives us:

  • Weights at 8-bit MLX: ~35 GB
  • fp16 KV per token: 10 layers × 2 KV heads × 256 head_dim × 2 (K+V) × 2 bytes = ~20 KB
  • Per sequence at d=32K: 32K × 20 KB ~= 640 MB
  • B=32 at d=32K: 32 × 640 MB = ~20 GB just for KV
  • Total working set: ~56 GB. Plenty of headroom in 128 GB UMA.

So fp16 d=32K B=32 fitting isn't a surprise — it's exactly what the architecture supports. The practical implication still stands: you can run 32 concurrent 32K-token contexts on a 35B-A3B MoE on a single MacBook Pro without KV compression, at 3,078 tok/s aggregate decode. An earlier draft of this post had wrong KV math (carrying over Qwen3-4B numbers: 36 layers × 8 KV heads × 128 head_dim) and called the result "surprising." It isn't. The architecture-aware math gets you there in one line.

Update 2026-05-02: deep-context cells out to 192K

After the original post landed, TheTom replied confirming that the 40K cap is a README convention only, not a code limit. --max-model-len N plumbs straight to the model's native max (262144 for Qwen3.6-35B-A3B). He also sketched the cells worth running (64K / 128K / 192K) along with rough working-set math, predicting that fp16 would OOM at d=192K B=32 and that turbo would be the only KV scheme that fit there. So we ran the extension matrix the next night.

Same hardware, same model, same harness. Depths: 65536, 131072, 196608. KV: fp16, turbo4v2, turbo3. Concurrency: 1, 8, 32. We ran 17 cells out of the 27-cell matrix before stopping (full fp16 sweep, full turbo4v2 sweep, one turbo3 sanity cell; the remaining 8 turbo3 cells are deferred to a separate overnight slot).

The empirical memory ceiling

Two fp16 cells hit the memory ceiling and SKIPPED:

  • fp16 d=128K B=32: swap-thrashed for 4 hours then SKIPPED at the harness timeout. macOS's compressed-memory layer kept it from outright OOM, but the system was paging at 1+ TB cumulative swapouts and the worker was making token-by-token progress through swapped pages.
  • fp16 d=192K B=32: same pattern, killed early to save wall time once the swap-thrash signature was clear.

These are the cells where TurboQuant+ earns its keep. Same cells under turbo4v2:

Cellfp16turbo4v2
d=128K B=32 per-seq decodeSKIP42.5 tok/s
d=128K B=32 aggregateSKIP1,360 tok/s
d=128K B=32 wall4 hr (timeout)79 min
d=192K B=32 per-seq decodeSKIP32.0 tok/s
d=192K B=32 aggregateSKIP1,024 tok/s
d=192K B=32 wallkilled (would have timed out)2 hr 51 min

That is the value prop in one table. At cells where fp16 cannot run, turbo4v2 produces real production throughput. 1,024 tok/s aggregate at d=192K B=32 means a single MacBook Pro serving 32 concurrent users at 192K context. That is what TurboQuant+ is for on this engine on this hardware.

Per-sequence decode degradation curve

Combining the original ≤32K cells with the deep extension, the full per-seq decode curve at B=8 (the cleanest cell to chart since both schemes ran everywhere):

Depthfp16turbo4v2
8K110.696.8
16K106.090.9
32K96.779.7
65K83.761.7
128K66.540.0
192K55.432.5

Smooth degradation in both schemes. fp16 wins per-seq decode at every depth where both ran. The gap widens with depth: 12% at 8K, 26% at 65K, 40% at 128K, 41% at 192K. The bandwidth-bound crossover that powers TurboQuant on llama.cpp past 256K still does not appear here, even at 6× the depth of the original matrix.

Why no crossover, and why TheTom predicted this

TheTom's reply on the discussion thread spelled it out: M5 Max sustained memory bandwidth (~400 GB/s) against the per-lane decode rates we measured leaves headroom at every depth tested. The bandwidth-bound regime that drives the llama.cpp crossover (where reading a smaller cache is genuinely cheaper than reading the full one) does not trigger here. On vllm-swift's contiguous BatchedKVCache on M5 Max, compression is mostly about staying under the memory ceiling, not about raw decode throughput. The deep extension confirms this empirically.

One thing the deep run surfaced: a memory-pressure trap

The fp16 d=128K B=32 cell did not OOM cleanly. It swap-thrashed. The system stayed alive thanks to macOS compressed memory and the harness's per-cell subprocess isolation, but each token decode was waiting on disk-paging I/O, and the cell ran ~10x slower than its compute budget would suggest. We had bumped gpu_memory_utilization to 0.95 between the original 32K matrix and the deep extension; in retrospect that was too aggressive at 128K+ depths, where the engine over-committed against macOS's UMA file cache and started compressing/swapping. For deep-context vllm-swift workloads on Apple Silicon, leave gpu_memory_utilization at 0.9 or lower. The headline cells still landed (we surgically killed the swap-thrash worker once the signature was clear), but the wall time cost was real.

Updated recommendation

Based on the combined ≤32K matrix and the deep-context extension. Cells past d=128K B=32 are where the recommendation matters most, because that is where fp16 stops fitting.

WorkloadKV schemeWhy (with empirical evidence)
High-concurrency batched serving up to ~64K context per sequencefp16Best per-seq decode at every cell where it ran. fp16 d=65K B=32 = 2,672 tok/s aggregate. Fits comfortably under 128 GB UMA.
High concurrency at ≥128K context (where fp16 hits the memory ceiling)turbo4v2Empirically the only KV scheme that ran d=128K B=32 and d=192K B=32. 1,360 tok/s and 1,024 tok/s aggregate respectively. fp16 SKIPPED both cells.
Need even more memory headroom than turbo4v2 providesturbo320–52% per-seq decode cost vs fp16 depending on depth (range across Suite A ≤32K and one deep sanity cell), 4.6× KV memory savings. Single deep sanity cell at d=65K B=1 ran at 37.5 tok/s; full deep-depth sweep is queued for a separate overnight slot.
Single-stream long-context past 256K (1M-class)turbo3 on the llama.cpp forkvllm-swift handles up to the model's native max (262K for this model) but is built for concurrent decode rather than single-stream depth-pushing. llama.cpp + turbo3 is the path past 256K on M5 Max.

Caveats

  • Originally capped at d=32K, since extended to 192K (see Update 2026-05-02 above). TheTom confirmed the README's 40K limit is convention only; --max-model-len N accepts up to the model's native max (262144). Within the depths we tested, no bandwidth-bound crossover ever appears. We have not yet pushed to 256K (the depth at which the llama.cpp Part 1 data showed turbo winning decode). That extension is open; if you have the cycles to run it, we would gladly fold the data in.
  • Bug filing was half-real. We filed TheTom/vllm-swift#11 after hitting what looked like a vllm 0.19.1 model-resolution bug. TheTom's diagnosis was sharper than ours: the CLI half (vllm-swift serve <path> forwarding the positional path to vLLM's deprecated model_tag slot) was a real bug, now fixed and merged in PR #12. The Python LLM(model=...) half was our script missing if __name__ == "__main__":, which on macOS makes multiprocessing.spawn recursively re-import and degrades the EngineCore child's config to defaults. With the main-guard added, the unpatched wrapper handles additional_config cleanly. Our C-bridge workaround was correct (and the right path for the matrix anyway), but framing it as "a vllm-swift bug" was overstated.
  • turbo4v2 in vllm-swift is not the same kernel as turbo4 in the llama.cpp fork. Same compression family, different generation. We did not run a quality smoke (PPL or KL) to confirm turbo4v2 reproduces turbo4's quality numbers from Part 2; that's a follow-up.
  • Single hardware data point. M5 Max, 128 GB UMA. Memory bandwidth and GPU core count differ enough across Apple Silicon that the regime may shift on M2 Pro / M3 Ultra / M4 Max. If you have non-M5 Apple Silicon and want to run a slice, drop the numbers and we'll fold them into a comparison.
  • vllm-metal baseline column not included. TheTom's published comparison runs vllm-swift against vllm-metal (Python/MLX) at every cell. We skipped that column to ship today. The published vllm-swift performance doc already has that comparison at short context, and our cross-engine baseline of choice is the llama.cpp fork data from Part 1.

Methodology

Hardware: MacBook Pro M5 Max, 128 GB unified memory. Engine: vllm-swift v0.3.0 from TheTom's Homebrew tap, bottled. Bridge dylib at /opt/homebrew/Cellar/vllm-swift/0.3.0/lib/libVLLMBridge.dylib. Model: mlx-community/Qwen3.6-35B-A3B-8bit downloaded via vllm-swift download.

Bench routed through the C bridge directly via vsm_engine_create(model_path, "float16", max_kv, kv_scheme, kv_bits, 0.9). Subprocess-per-cell to dodge the EngineCore zombie process issue documented in vllm-swift's performance doc. Greedy decode (temp=0), 50 generation tokens, unique prompts per slot, two-step warmup before timed decode. Three KV schemes × four depths × three concurrency levels = 36 cells, one rep per cell. Reported numbers are: per-sequence decode tok/s = total_decoded_tokens / decode_window; aggregate engine tok/s = per_seq × B; prefill tok/s = (prompt_tokens × B) / prefill_seconds.

Phase 0 sanity: ran TheTom's published bench_throughput.py on Qwen3-4B-4bit with default flags. Reproduced his README short-context numbers within 5% (B=8 ours 480.2 vs published 477; B=32 ours 1185 vs published 1194; B=64 ours 1482 vs published 1518). Bench rig validated, then we moved to the 35B matrix.

Total wall-clock: 53 minutes for the 36-cell matrix. Memory-agent stopped during the run for clean memory budget. Raw per-cell JSON and the bench harness will be in the LLMKube benchmarks repo; reach out if you want them ahead of that.

What we'd run next

  • Complete the turbo3 deep sweep. The 8 missing turbo3 cells at d=65K B=8 / B=32 and at d=128K / d=192K across all B levels. Adds the third row to the recommendation table. Estimated 5-6 hr as its own overnight slot; conclusion is not expected to change but the table will be complete.
  • Push to 256K. The depth where llama.cpp Part 1 first showed turbo winning decode. We did not reach it on vllm-swift this round; the model itself supports it. If the bandwidth-bound crossover exists on this engine, that is the cell to find it in. fp16 will OOM well before that depth, so this becomes an exclusively turbo run.
  • Quality smoke for turbo4v2 vs turbo4. Confirm the v2 kernel reproduces turbo4's PPL and KL numbers from Part 2. Quick run.
  • vllm-metal baseline column. Drop in the missing comparison so the table fully matches TheTom's published methodology rather than relying on cross-engine reference to llama.cpp.
  • q4_0 / q5_0 / q4_1 / q5_1 / iq4_nl on the llama.cpp fork. The wider standard quants commenters asked for in Part 2. Separate post (Part 3b in the series); data is already in hand from a 8K PPL run on the llama.cpp fork.

Thanks again to TheTom for shipping vllm-swift publicly, for the responsiveness on the bug report, and for the on-the-fly architecture corrections. The deep-context extension confirms the framing: TurboQuant+ on vllm-swift on M5 Max is a memory-ceiling tool. fp16 wins per-seq decode where it can run; turbo wins by being the only option where fp16 cannot. 1,024 tok/s aggregate at d=192K B=32 on a single MacBook Pro is a useful production data point, and exactly the regime where the engine + the compression scheme + this hardware all earn their keep together.

LLMKube LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.