
We ran Qwen3.6-27B on $800 of consumer GPUs, day one. Here's how llama.cpp and vLLM compared, and what each token actually costs.

Christopher Maher

A Kubernetes-native bake-off on 2× RTX 5060 Ti, with reproducible manifests and a cost-per-token number neither cloud nor OSS FinOps tools will tell you.

This is a runtime comparison, not a model evaluation. Both llama.cpp and vLLM serve the same Qwen3.6-27B in every cell; we're measuring how the two serving stacks differ on identical work. Where cloud APIs enter in §8, it's on cost, not capability — this post makes no claim about whether Qwen3.6-27B "beats" GPT-4o or Claude on task quality.

TL;DR

  • Qwen3.6-27B (Tongyi Lab, released 2026-04-21, Apache 2.0) runs on a pair of RTX 5060 Ti 16 GB consumer cards via Kubernetes and LLMKube. Total hardware: about $800 street.
  • vLLM wins throughput by 3 to 4× at high concurrency thanks to NVFP4 and PagedAttention. llama.cpp plus TurboQuant wins context. We served one 43K-token prompt end-to-end (a single captured sample; higher-concurrency cells timed out on our 300 s harness budget) on hardware where vLLM's in-memory cap is 16K.
  • Cost per million tokens is two numbers, not one: $0.13 amortized (full cost of ownership) and $0.010 marginal (electricity during active serving). At 32.7% utilization over the bench window, the 13× gap between them is the real FinOps conversation.
  • Everything is reproducible. Manifests, harness, and summary.csv at github.com/defilantech/llmkube-bench.

1. Why we did this

Two days ago, Tongyi Lab dropped Qwen3.6-27B with the claim it matches frontier agentic-coding models at the 27B parameter count. The community response was predictable: does this actually work locally, or is it another model that benchmarks well but nobody can run? (Note for readers comparing against Qwen3.6-35B-A3B: the 27B is the non-MoE sibling. None of the MoE-specific flags like --cpu-moe apply here.)

The ecosystem has a harder time answering "how should I serve it?" There are two dominant open-source inference runtimes for models like this, and they optimize for different things.

llama.cpp is ubiquitous, GGUF-based, with broad quantization support. It runs on almost anything with a GPU, and is adopted by the hobbyist and homelab crowd. It recently grew TurboQuant KV-cache compression (ggml-org/llama.cpp#20969), pushing achievable context windows on small VRAM into territory nobody else touches.

vLLM is throughput-focused, with PagedAttention, continuous batching, and FP8/NVFP4 on recent NVIDIA hardware. It is the production serving runtime for teams running real traffic, targeting data-center GPUs.

The ecosystem answers "which should I use" with vibes and forum posts. We wanted numbers, from the same hardware, same model, same day the model dropped. If a 27B-class model can genuinely run on a pair of $400 GPUs, the practical question for anyone thinking about on-prem inference is which runtime makes that hardware actually worth something.

So we benchmarked both, published every configuration, and then turned the token counts into dollars using our companion tool InferCost, so the "is it cheaper than the cloud?" question has an honest answer rather than the usual founder-math.

2. Hardware and the constraint

The node running this bench is shadowstack, a microk8s cluster on a single box.

  • GPUs: 2× NVIDIA GeForce RTX 5060 Ti 16 GB (Blackwell GB206)
  • GPU memory: 15.48 GiB usable per card after driver reserve, 30.96 GiB aggregate
  • OS: Ubuntu 24.04.3 LTS, kernel 6.17.0-oem
  • Kubernetes: MicroK8s v1.32.13
  • Orchestration: LLMKube operator (chart 0.7.0), NVIDIA GPU Operator, DCGM exporter
  • Street price: about $400 per card × 2

The 5060 Ti is a Blackwell consumer GPU with native FP4 hardware, and that support is load-bearing. Without NVFP4, the 27B class is out of reach. At BF16 the model would need about 55 GB, at FP8 about 28 GB, at NVFP4 about 14 GB. Only the last fits 2× 16 GB with room for activations and KV cache.
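The arithmetic is just bits per parameter. A quick sketch of raw weight bytes only (quantization scales, embeddings, and runtime overhead push real checkpoints slightly higher, to the ~55/28/14 GB figures above):

```python
# Raw weight footprint of a 27B-parameter dense model at each precision.
# Excludes quantization scales and embeddings, so actual checkpoints land
# a little above these numbers.
PARAMS = 27e9

def weights_gb(bits_per_param: int) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name:5s} ~{weights_gb(bits):.1f} GB")
# BF16 ~54.0 GB, FP8 ~27.0 GB, NVFP4 ~13.5 GB
```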

The VRAM budget is the whole story. On enterprise hardware (H100, A100, even the 3090 that the community's "qwen 27B on a 3090" discourse is built on), most of this bake-off's complexity disappears. On 2× 16 GB consumer cards you are constantly one configuration flag away from an out-of-memory crash, and the runtime that lets you navigate that wins real users.

3. The first attempt that didn't work

Our original target was Qwen/Qwen3.5-27B-FP8 (Qwen's official FP8 safetensors, the model everyone was excited about). On paper: 28 GB weights, TP=2, about 14 GB per shard. Should fit.

It doesn't. Qwen's 27B-class FP8 release is a VLM. The checkpoint includes a vision encoder that stays resident in VRAM whether or not you ever send an image. Three successive mitigations on vLLM, each measured against the crash logs:

1. Default config. OOM during profile_run on the vision encoder:

CUDA out of memory. Tried to allocate 576.00 MiB.
GPU 0 has a total capacity of 15.48 GiB of which 175.19 MiB is free.
This process has 15.30 GiB memory in use.

2. --limit-mm-per-prompt image=0,video=0, maxModelLen 16K, max-num-batched-tokens 4K. Skipped multimodal dummy inputs during profile. The vision encoder weights stay resident. OOM now at determine_available_memory:

Tried to allocate 1.19 GiB.
GPU 0 has 1.02 GiB free.
This process has 14.45 GiB in use.

3. --gpu-memory-utilization 0.95, PYTORCH_ALLOC_CONF=expandable_segments:True. Pushed against the wall:

Tried to allocate 32.00 MiB.
GPU 0 has 3.19 MiB free.
This process has 15.47 GiB in use.

15.47 of 15.48 GiB. No knob left. Qwen3.5-27B-FP8 cannot be served via vLLM on 2× 16 GB consumer cards in any configuration we found. A 3090 or 4090 (24 GB) would have considerably more headroom for the vision encoder plus KV cache (we didn't reproduce on one, but it's plausible the default config would fit there). That's a real hardware-sizing footnote to the "run 27B locally" discourse, since not every pair of 16 GB cards is enough.

Then Qwen3.6-27B dropped, and within 24 hours the community had published NVFP4 quants that halve the weight footprint again. That is the pivot that made this bench possible.

4. Method

Both runtimes run Qwen3.6-27B, served via LLMKube as a Kubernetes Deployment with OpenAI-compatible endpoints, and are benchmarked against each other on identical workloads. All manifests live in the public repo.

llama.cpp candidate

  • Source: unsloth/Qwen3.6-27B-GGUF Q4_K_M (about 17 GB)
  • Parallelism: split-mode=layer across both GPUs
  • KV cache: TurboQuant tbqp3 (keys) + tbq3 (values), about 3 bits per element
  • Max context: 65,536
  • Image: AmesianX's TurboQuant fork v1.5.2, built from source and pushed to a private registry (Kaniko build manifest in the bench repo; retarget to your own registry to reproduce)
  • Flash attention: on
  • Parallel slots: 16 for short patterns (chat, coding, agentic), 1 for long-context patterns (long_context, long_context_extreme)

TurboQuant is AmesianX's llama.cpp fork implementing the KV-cache compression algorithm from Google Research's TurboQuant paper. Asymmetric: QJL correction (tbqp*) on keys only because keys feed Q·K inner products while values go through a softmax-weighted sum. Our own internal benchmarks show about 60% KV cache reduction vs f16 at the same context, the table stakes for pushing context on small VRAM.
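For intuition on what ~3 bits per element buys, here's a back-of-envelope KV-size model. The layer/head figures are hypothetical placeholders (the real config ships in the GGUF metadata), and the 6.4 effective bits is back-derived from the ~60% reduction quoted above, since scales and QJL metadata cost more than the nominal 3 bits:

```python
# Back-of-envelope KV-cache sizing. N_LAYERS / N_KV_HEADS / HEAD_DIM are
# HYPOTHETICAL stand-ins for illustration, not Qwen3.6-27B's real config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

def kv_cache_gib(ctx_tokens: int, bits_per_elem: float) -> float:
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx_tokens   # K and V
    return elems * bits_per_elem / 8 / 2**30

for label, bits in [("f16", 16), ("FP8 E4M3", 8), ("TurboQuant (effective)", 6.4)]:
    print(f"{label:24s} {kv_cache_gib(65536, bits):5.2f} GiB @ 65K context")
# Under these assumed dims: f16 12.00 GiB, FP8 6.00 GiB, TurboQuant 4.80 GiB
```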

The slot count asymmetry matters and we want to be upfront about it: llama.cpp divides --ctx-size by --parallel to get per-slot context. With parallelSlots=16 and 65K total context, each slot gets 4 K tokens, which is enough for chat/coding/agentic prompts but rejects 5 K+ long-context requests. Dropping to parallelSlots=1 gives every request the full 65 K, at the cost of serving concurrent long-context requests from a queue. Readers should treat llama.cpp's long_context c=16/c=64 numbers as queue-behavior measurements, not throughput measurements.
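The division is literal and worth sketching, because it explains the rejected requests:

```python
# llama.cpp carves --ctx-size into equal per-slot contexts; a request whose
# prompt plus max output exceeds its slot's share is rejected, not truncated.
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    return ctx_size // parallel

def fits(prompt_tokens: int, max_out: int, ctx_size: int, parallel: int) -> bool:
    return prompt_tokens + max_out <= per_slot_ctx(ctx_size, parallel)

print(per_slot_ctx(65536, 16))        # 4096 tokens per slot at 16 slots
print(fits(5000, 1024, 65536, 16))    # False: a 5K review prompt is rejected
print(fits(43000, 1024, 65536, 1))    # True: one slot gets the full 65K
```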

vLLM candidate

  • Source: sakamakismile/Qwen3.6-27B-NVFP4 (about 14 GB)
  • Parallelism: tensor-parallel (TP=2)
  • Quantization: compressed-tensors wrapping NVFP4 (Blackwell-native 4-bit float)
  • KV cache: FP8 E4M3 (8 bits)
  • Max context: 16,384
  • Attention backend: FLASHINFER
  • CUDA graphs: disabled (--enforce-eager)
  • Prefix caching: on
  • Chunked prefill: on
  • Image: vllm/vllm-openai:latest

Two forced choices here deserve a note.

--enforce-eager is required because CUDA graph capture for NVFP4 plus VLM weights plus KV cache exhausts the 15.48 GiB budget before KV init even starts. Skipping graph capture costs about 10 to 15% throughput, which becomes part of the fair comparison: on this hardware class vLLM gives up one of its own optimizations.

maxModelLen: 16384 is not "the model's ceiling". It is what fits after NVFP4 weights (14 GB / 2 = 7 GB/shard), vision encoder (about 2 GB), KV cache at FP8, and activations. 32K OOMs during profile; 16K fits with about 1 GiB headroom.
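To make the 16K-vs-32K line concrete, here's a hedged budget sketch. The vision-encoder split and activation figures are assumptions for illustration; the ground truth from the text is simply "32K OOMs during profile, 16K leaves about 1 GiB":

```python
# Per-GPU VRAM budget behind maxModelLen=16384 (all GiB, TP=2). VISION and
# ACTIVATIONS are ASSUMPTIONS for illustration, not measured values.
BUDGET      = 15.48          # usable per card
WEIGHTS     = 14.0 / 2       # NVFP4 shard per card (from the text)
VISION      = 2.0            # vision encoder resident per card (assumption)
ACTIVATIONS = 2.0            # profile-run working set (assumption)

def headroom(kv_gib: float) -> float:
    return BUDGET - (WEIGHTS + VISION + ACTIVATIONS + kv_gib)

# FP8 KV scales linearly with context; pick the 16K KV size so ~1 GiB remains:
kv_16k = BUDGET - WEIGHTS - VISION - ACTIVATIONS - 1.0   # ~3.5 GiB implied
print(round(headroom(kv_16k), 2))       # ~1.0 GiB free at 16K
print(round(headroom(2 * kv_16k), 2))   # negative at 32K → OOM during profile
```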

Workloads

Five patterns by four concurrency levels per runtime.

  Pattern                Shape                                                Purpose
  chat                   128-in / 256-out, 20 prompts                         Interactive baseline
  coding                 1K-in / 1K-out, 20 prompts                           Typical code-gen turn
  long_context           ~5K-in / 1K-out, 10 prompts                          Code review, RAG-heavy
  long_context_extreme   ~43K-in / 1K-out, 10 prompts                         vLLM's 16K cap cannot attempt this
  agentic                4K shared prefix + 512 delta / 512-out, 20 prompts   Stresses prefix caching

Concurrency 1, 4, 16, 64. Per cell: 2 min warmup (discarded), 5 min measurement. Temperature 0, seed 42, streaming on.

The full workload matrix is 40 cells (5 × 4 × 2 runtimes). We run 36 of them. long_context_extreme is not attempted on vLLM because its 16K cap would reject every prompt before submission. That asymmetry is one of the bake-off's findings, not a methodology gap.
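The cell loop is conceptually simple; here's a minimal asyncio sketch of it, with a stand-in `fake_send` where the real harness (in the repo above) streams against the OpenAI-compatible endpoint:

```python
import asyncio, time, statistics

# One (pattern, concurrency) cell: fire N prompts through `send` with at most
# `concurrency` in flight and a per-request timeout, collecting wall-clock
# latency per request.
async def run_cell(send, prompts, concurrency, timeout_s=300):
    sem = asyncio.Semaphore(concurrency)
    async def guarded(prompt):
        async with sem:
            t0 = time.perf_counter()
            await asyncio.wait_for(send(prompt), timeout_s)
            return time.perf_counter() - t0
    return await asyncio.gather(*(guarded(p) for p in prompts))

async def fake_send(prompt):
    await asyncio.sleep(0.01)   # stand-in for a real streamed completion

latencies = asyncio.run(run_cell(fake_send, ["p"] * 20, concurrency=4))
print(len(latencies), f"{statistics.median(latencies):.3f}s")
```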

5. Results: throughput and latency

Single-request latency (c=1)

  Pattern             llama.cpp TTFT p50   vLLM TTFT p50   Winner
  chat                208 ms               157 ms          vLLM
  coding              413 ms               106 ms          vLLM
  agentic             911 ms               409 ms          vLLM
  long_context (5K)   2,279 ms             581 ms          vLLM

vLLM is faster at single-request latency across the board, typically 2 to 4× on prefill-heavy patterns. llama.cpp plus TurboQuant pays a prefill tax: compressing the KV cache to about 3 bits per element is memory-cheap and compute-expensive. On short prompts the gap is narrow; on long prompts it opens up.

Quantization caveat: these numbers compare Q4_K_M (llama.cpp) against NVFP4 (vLLM). They are not the same quantization, and on this hardware there is no apples-to-apples option: llama.cpp doesn't ship an NVFP4 runtime, and Q4_K_M has no vLLM implementation. We've included a side-by-side output-quality check in QUALITY-GATE.md so readers can judge whether the two quants produce comparable answers at this parameter count. Read the speed numbers as "at each runtime's native quant on this hardware," not "at identical model quality."

Throughput under load (c=64)

  Pattern   llama.cpp tok/s     vLLM tok/s   Ratio
  chat      94                  345          3.7×
  coding    133 (60% success)   377          2.8×
  agentic   72                  262          3.6×

This is vLLM's home turf. PagedAttention plus continuous batching turn 64 concurrent requests into about 90% GPU utilization; llama.cpp's slot-based scheduling (even with 16 parallel slots) serializes far more aggressively. The coding c=64 drop to 60% success on llama.cpp is KV-cache saturation: with 16 slots and roughly 2K of effective per-slot context, heavy coding prompts overflow.

Inter-token latency

Stable and tight on both runtimes. Median ITL:

  • llama.cpp: 49 to 175 ms/token across patterns and concurrencies
  • vLLM: 64 to 67 ms/token across patterns and concurrencies (remarkably flat, because continuous batching amortizes decode across the batch)

The llama.cpp ITL spread widens at high concurrency as slot contention kicks in. vLLM's is basically a constant, which is what makes it good for conversational workloads where you care about per-token cadence.
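For readers implementing their own harness: ITL here means the gap between consecutive streamed tokens, and p50 is taken over those gaps. A minimal version, with synthetic timestamps:

```python
import statistics

# Median inter-token latency in milliseconds from a list of per-token
# arrival timestamps (seconds). Timestamps below are synthetic.
def itl_ms(token_timestamps):
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return statistics.median(gaps) * 1000

# A stream whose gaps widen toward the end, as under slot contention:
ts = [0.0, 0.05, 0.10, 0.16, 0.30, 0.48]
print(round(itl_ms(ts)))  # → 60 (ms)
```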

The honest version

vLLM wins the throughput axis. That's a real result, not a function of tuning. On 2× 16 GB consumer hardware with Qwen3.6-27B, if you're trying to maximize requests per second, vLLM is the answer, and it wins while giving up about 10 to 15% of its own throughput to --enforce-eager (disabled CUDA graphs were required to fit VRAM). The NVFP4 kernels on Blackwell, PagedAttention's batching, and continuous prefill scheduling all compound even with that handicap.

Except…

6. Results: context

The 5K baseline

Both runtimes serve long_context (about 5K input tokens, 1K output) at c=1 in about 13 seconds end-to-end. llama.cpp measures 20 tok/s, vLLM 19 tok/s. Near parity at this context size.

At higher concurrency the story differs because we configured llama.cpp with parallelSlots=1 to give every request the full 65K context (required for the extreme pattern, see below). Concurrency c=16 and c=64 on llama.cpp show queue saturation: the harness sends 16 or 64 concurrent requests, but the server processes them serially. That's not a throughput measurement, it's a queue measurement. On production llama.cpp with parallelSlots=16 and a smaller per-request context, short-prompt throughput would match our earlier numbers, but then you can't serve 43K prompts.
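The queue behavior is easy to model. With one slot, requests are served FIFO, so wall time scales roughly linearly with offered concurrency (an idealized model that ignores any prefill/decode overlap), using the ~13 s per-request figure from the c=1 baseline:

```python
# Idealized FIFO model for a single-slot server: total wall time for a burst
# of `concurrency` simultaneous requests is per-request time × queue depth.
def queue_wall_time_s(per_request_s: float, concurrency: int) -> float:
    return per_request_s * concurrency

print(queue_wall_time_s(13, 16))   # 208 s for the c=16 long_context burst
print(queue_wall_time_s(13, 64))   # 832 s at c=64 — a queue measurement,
                                   # not a throughput measurement
```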

Which brings us to the real test

long_context_extreme: a roughly 43,000-token prompt in, 1024 tokens out.

vLLM, as configured here, can't attempt this. Its maxModelLen is 16K, set that way because 32K OOMs during graph capture on this hardware. A 43K-token request is rejected before it reaches inference. We did not explore --swap-space CPU offload, which in principle could trade a lot of latency for more context; that's a follow-up. Out of the box on 2× 16 GB consumer cards with Qwen3.6-27B NVFP4, we did not find an in-memory configuration that serves 43K.

llama.cpp plus TurboQuant served it. One sample captured at c=16 end-to-end:

  • Prompt tokens: about 43,000
  • Prefill time (TTFT): 186 seconds (3.1 min)
  • Decode rate: 171 ms/token
  • Output: 1024 tokens in about 175 seconds
  • Total wall time: about 6 minutes per request

This is not fast. It's not meant to be fast. What it is, is possible. TurboQuant's roughly 3-bit KV cache makes the memory math work where FP16 or FP8 KV can't. On the same hardware, at the same moment, one runtime cannot attempt the workload and the other completes it.

The higher-concurrency cells for this pattern hit our harness's 300s per-request timeout because decode plus prefill combined exceeds 300s. Bumping the harness timeout to 600s would capture all four c-levels cleanly; that's a follow-up. The c=1 and c=16 samples are enough to prove the capability.

The real tradeoff

Throughput versus context is the tradeoff, not "vLLM is better" or "llama.cpp is better". On this hardware:

  • Production chat, interactive coding, short agentic loops (≤ 8K context): vLLM. 3 to 4× throughput, lower TTFT, better ITL stability.
  • Long-document review, RAG with full-file context, overnight batch agentic on 40K+ codebases (> 16K context): llama.cpp plus TurboQuant. Slower per token, but it's the only runtime that serves the workload at all.

For many real workloads the answer is "run both". vLLM for the chat endpoint, llama.cpp for the batch endpoint that processes whole PRs overnight.

7. What it costs

Throughput numbers are interesting. Dollars per token are what actually get budgets approved.

InferCost is our companion tool: a Kubernetes operator that reads real-time GPU power draw from DCGM, combines it with hardware amortization and electricity rates declared on a CostProfile CR, and computes the real cost of inference. It discovers inference pods by the inference.llmkube.dev/model label LLMKube stamps on each Deployment, scrapes each pod's /metrics endpoint directly (no Prometheus required), and writes cost attribution into a UsageReport custom resource.

Here's a live UsageReport status from shadowstack, captured after a 10-minute mixed workload:

$ kubectl -n bench get usagereport bench-window -o yaml
...
status:
  period: "2026-04-23"
  periodStart: "2026-04-23T00:00:00Z"
  periodEnd:   "2026-04-23T21:21:42Z"
  inputTokens:  638
  outputTokens: 12400
  activeEnergyKWh:     0.645
  activeHoursInPeriod: 4.53
  totalHoursInPeriod:  21.36
  utilizationPercent:  21.20
  estimatedCostUSD:             0.83
  costPerMillionTokens:         63.79   # amortized
  marginalCostPerMillionTokens:  3.96   # electricity during active serving
  byModel:
  - model:     qwen36-27b-llamacpp
    namespace: bench
    inputTokens:  638
    outputTokens: 12400
    costPerMillionTokens: 63.79
    estimatedCostUSD: 0.83
  byNamespace:
  - namespace: bench
    tokenCount: 13038
    estimatedCostUSD: 0.83

The numbers look alarming at first: $63.79/MTok amortized for a tiny workload against a day's worth of hardware amortization. That's the point. At 21.2% utilization over this window, amortized is 16× higher than marginal. Scale up the utilization and the amortized number drops toward the marginal one. That's what the bench window numbers below capture.

The full bench window (Apr 23, 2026, 00:00 UTC to 10:07 UTC, about 10 hours), from summary.csv cross-referenced with the CostProfile spec:

  Metric                                           Value
  Total input tokens                               2,518,242
  Total output tokens                              1,233,143
  Total tokens                                     3,751,385
  Active GPU energy                                0.459 kWh
  Utilization (active hours / wall-clock hours)    32.7%
  Total dollar cost (amortization + electricity)   $0.50

Hardware amortization on the CostProfile spec: 2× RTX 5060 Ti at $480 each = $960, 3-year useful life, 5% annual maintenance. Electricity $0.08/kWh, PUE 1.0.
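Those terms are enough to roughly reproduce both headline numbers in code. Idle electricity isn't modeled here, so the amortized figure lands a cent or so under the reported $0.13:

```python
# Reproduce the two cost-per-MTok numbers from the CostProfile terms and the
# bench-window table above. Idle-power draw is omitted, which is why the
# amortized result comes out slightly below the reported $0.13.
HW_COST, LIFE_YEARS, MAINT = 960.0, 3.0, 0.05   # 2× $480, 5%/yr maintenance
KWH_PRICE, PUE = 0.08, 1.0
HOURS_PER_YEAR = 24 * 365

window_h   = 10.12          # Apr 23, 00:00–10:07 UTC
tokens     = 3_751_385
active_kwh = 0.459

ownership_per_h = (HW_COST / (LIFE_YEARS * HOURS_PER_YEAR)
                   + HW_COST * MAINT / HOURS_PER_YEAR)
electricity = active_kwh * KWH_PRICE * PUE

amortized = (ownership_per_h * window_h + electricity) / (tokens / 1e6)
marginal  = electricity / (tokens / 1e6)
print(f"${amortized:.2f}/MTok amortized, ${marginal:.3f}/MTok marginal")
# → $0.12/MTok amortized, $0.010/MTok marginal
```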

The two numbers

  Metric                             Value    Which question it answers
  costPerMillionTokens (amortized)   $0.13    "What did my hardware cost per token I served today?"
  marginalCostPerMillionTokens       $0.010   "What did the electricity actually cost to generate those tokens?"

Both numbers are correct. They answer different questions.

Amortized $0.13/MTok spreads the full cost of hardware ownership (amortization, idle electricity, active electricity) across whatever tokens you served today. It tells you the answer to "was today's inference worth what we paid for the hardware?" At 32.7% utilization, you're leaving about two-thirds of the compute capacity you already bought idle, and the amortized rate reflects that.

Marginal $0.010/MTok includes only the electricity drawn during active serving. It answers "what did these specific tokens cost me beyond what I'd be paying anyway?", the relevant comparison when cloud APIs only bill marginally.

The 13× gap between them is the entire FinOps conversation. At 100% utilization the two numbers converge; at low utilization they diverge by more than an order of magnitude. Neither is the "right" number. They describe different things.

8. Cloud comparison

Cloud APIs bill marginally. That's how they work: no inference, no invoice. So the fair comparison against on-prem is marginal versus marginal. Cloud prices below are output token pricing on public pricing pages as of April 2026; check each provider for current rates and input-vs-output splits.

  Provider / Model            Output $/MTok   On-prem ratio (marginal)
  shadowstack marginal        $0.010          —
  OpenAI GPT-4o               $10.00          1,000× cheaper on-prem
  Google Gemini 2.5 Pro       $10.00          1,000× cheaper on-prem
  Anthropic Claude Opus 4.5   $25.00          2,500× cheaper on-prem

Those ratios are almost offensive. They're also the upper bound, the ceiling of savings if you saturated this hardware.

The floor, at the bench window's 32.7% utilization (i.e., our actual mixed-workload cost over ten hours), uses the amortized number:

  Provider / Model            Output $/MTok   On-prem ratio (amortized at 32.7%)
  shadowstack amortized       $0.13           —
  OpenAI GPT-4o               $10.00          77× cheaper on-prem
  Google Gemini 2.5 Pro       $10.00          77× cheaper on-prem
  Anthropic Claude Opus 4.5   $25.00          192× cheaper on-prem

Even the worst case, amortized cost at 32.7% utilization, is 77× cheaper than GPT-4o or Gemini 2.5 Pro on output tokens. Against Claude Opus 4.5 (Anthropic's flagship large-frontier model), on-prem is 192× cheaper dollar for dollar. Those ratios do narrow on a blended input-plus-output basis, but the direction doesn't change.

For context on the hardware investment: $960 of GPUs pays for itself in Opus 4.5 output tokens at roughly 38.4 million tokens of traffic. At a modest 100K output tokens a day that's about a year; at 1M output tokens a day (a small agentic coding team), it's under six weeks. Against GPT-4o or Gemini 2.5 Pro the break-even point is 96M output tokens: ~2.6 years at 100K/day, ~3 months at 1M/day. Input tokens are cheaper on every cloud model, so a realistic blended workload stretches those numbers modestly, but not by an order of magnitude.
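The break-even arithmetic, for anyone who wants to plug in their own traffic:

```python
# Days until the hardware pays for itself in avoided cloud output-token spend.
# The on-prem marginal cost (~$0.010/MTok) is small enough to ignore here.
def break_even_days(hw_cost: float, cloud_per_mtok: float,
                    tokens_per_day: float) -> float:
    tokens_needed = hw_cost / cloud_per_mtok * 1e6
    return tokens_needed / tokens_per_day

print(round(break_even_days(960, 25.00, 100_000)))    # Opus 4.5 @ 100K/day → 384
print(round(break_even_days(960, 25.00, 1_000_000)))  # Opus 4.5 @ 1M/day  → 38
print(round(break_even_days(960, 10.00, 1_000_000)))  # GPT-4o   @ 1M/day  → 96
```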

This math is why enterprises with serious inference budgets are re-examining on-prem. It's not about paranoia or data residency (though those help). It's that the marginal economics on modern consumer GPUs, with the right runtime, genuinely work.

9. Reproduce it yourself

Everything is in the public repo: github.com/defilantech/llmkube-bench.

# Requires: K8s cluster with LLMKube v0.7+, 2× NVIDIA 16+ GB, DCGM exporter,
# hf-token Secret in the bench namespace.
git clone https://github.com/defilantech/llmkube-bench.git
cd llmkube-bench
make install                                      # Python deps via uv
make bench RESULTS_DIR=results/$(date +%F)-myhw   # ~3-4 hours for full matrix

That's the workstation path. The bench also runs fully in-cluster. A Kaniko Job builds the harness image, a bench-runner Job with a scoped ServiceAccount orchestrates the runtime swaps, results land on a hostPath volume. See manifests/bench-runner/README.md.

Every number in this post traces to a row in results/2026-04-23-shadowstack/summary.csv. Every manifest, every image digest, every Prometheus snapshot is committed.

10. What's next

A few things we'd do differently on the next bench:

  • Raise the harness per-request timeout from 300s to 600s so long_context_extreme at higher concurrencies captures cleanly. The one sample we got is defensible; four clean samples would be better.
  • Test with Qwen's own FP4 release once they ship one. The sakamakismile community NVFP4 has been solid for the throughput measurements, but an official Qwen FP4 would remove a variable from the methodology.
  • Multi-node llama.cpp would close the long-context throughput gap. Splitting layers across 4 GPUs instead of 2 gives per-shard VRAM headroom for higher --parallel settings and cuts the TurboQuant prefill time roughly in half.

But the big-picture answer is already here. On $800 of consumer GPUs, you can serve the same day's flagship open-source model, at either throughput that crushes cloud APIs or context lengths that no cloud provider offers at any price. And InferCost shows you the honest dollar math instead of the misleading single-number dashboards you'd get from every "AI observability" tool on the market.


If this was useful, star the repos. If it was wrong about something, open an issue; the goal is accurate numbers, not winning arguments.

— Chris

LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC
