Why Qwen 3.6 Doesn't Need --cpu-moe (and Why Qwen3-Coder Does) on Dual 16GB
A subtle detail of Qwen 3.6-35B-A3B's architecture makes one of the standard hybrid GPU/CPU offloading flags — --cpu-moe — actively harmful for generation throughput. On the same dual RTX 5060 Ti setup, with and without that flag: 21.7 tok/s vs 107.8 tok/s. Same GPUs, same GGUF, same context window. A 5x swing on one flag.
This post is about why that happens, what it says about classifying MoE models for offloading, and what shipped in LLMKube 0.7.0 as a direct consequence.
What --cpu-moe actually does
The flag keeps MoE expert weights in system RAM rather than VRAM. At decode time, llama.cpp does not move the weights across PCIe for each token. Instead, the small activation vector travels from VRAM to DRAM, the CPU performs the expert matmul using weights already resident in DRAM, and the result returns to VRAM. The GPU still handles attention and everything that is not an MoE expert.
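To see why shuttling the activation instead of the weights is the only sane direction, here is a rough per-layer, per-token traffic comparison. The dimensions are illustrative placeholders (a hypothetical small-expert MoE), not measured from any particular GGUF:

```python
def pcie_traffic_per_moe_layer(hidden, active_experts, intermediate,
                               weight_bits=4.5):
    """Rough per-token PCIe traffic for one MoE layer under --cpu-moe.

    Only the activation vector crosses the bus (out and back, fp16).
    For contrast, also estimate what moving the active experts' weights
    across PCIe would cost: three hidden x intermediate matrices per
    expert at a Q4_K_M-ish ~4.5 bits per weight. All dims illustrative.
    """
    activation_bytes = 2 * hidden * 2  # round trip, 2 bytes per element
    weight_bytes = active_experts * 3 * hidden * intermediate * weight_bits / 8
    return activation_bytes, weight_bytes

act, wts = pcie_traffic_per_moe_layer(hidden=2048, active_experts=9,
                                      intermediate=512)
print(f"activations: {act / 1024:.0f} KiB vs weights: {wts / 2**20:.0f} MiB")
```

Even with tiny experts, the weight traffic would be three orders of magnitude larger than the activation round trip, which is why llama.cpp keeps the weights stationary and moves the math to the CPU instead.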
The tradeoff the flag makes is not a bandwidth trade. It is VRAM savings for CPU compute cost. You free up GPU memory by moving expert weight storage off-card, and in exchange, every token pays for the CPU to run the expert math rather than the GPU. For sparsely routed MoE models (Mixtral: 32 layers, 2 of 8 experts per token, 64 CPU expert ops per token) on a capable CPU, this is a favorable trade when the alternative is not fitting the model in VRAM at all.
The flag's usefulness therefore depends on two things at once: do you actually need the VRAM savings? and can your CPU keep up with the per-token expert workload? If the answer to the first question is "no" — if the model already fits in VRAM for your target context — then the CPU cost buys you nothing.
Qwen 3.6's architecture, and why the flag is unnecessary here
Qwen 3.6-35B-A3B has 40 layers in a repeating pattern: ten blocks of 3 × (Gated DeltaNet → MoE) + 1 × (Gated Attention → MoE). That works out to 30 DeltaNet layers and 10 Gated Attention layers, each followed by an MoE feed-forward with 256 experts. Nine experts activate per token (8 routed + 1 shared), intermediate dim 512 per expert.
The relevant detail for offloading strategy is where the KV cache comes from. DeltaNet is a variant of linear attention: rather than building a KV cache that grows with context, it maintains a compact recurrent state across tokens. Only the 10 Gated Attention layers contribute to the KV cache. At 90K context with q8_0 cache quantization, the total KV footprint is modest — small enough that Qwen 3.6 at Q4_K_M plus its KV cache fits entirely on two 16 GB cards.
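A back-of-envelope estimate makes the KV savings concrete. The head count and head dimension below are illustrative placeholders, not Qwen 3.6's published config; only the layer counts, context size, and cache type come from the text:

```python
def kv_cache_gib(kv_layers, n_kv_heads, head_dim, ctx_len,
                 bytes_per_elt=1.0625):
    """Approximate KV cache size in GiB.

    bytes_per_elt=1.0625 models llama.cpp's q8_0 cache
    (34 bytes per 32-element block). K and V each store
    n_kv_heads * head_dim values per layer per token.
    """
    return (2 * kv_layers * n_kv_heads * head_dim
            * ctx_len * bytes_per_elt / 2**30)

# Qwen 3.6: only the 10 Gated Attention layers build a KV cache.
hybrid = kv_cache_gib(kv_layers=10, n_kv_heads=8, head_dim=128, ctx_len=90112)
# Counterfactual: the same depth with full attention in all 40 layers.
full = kv_cache_gib(kv_layers=40, n_kv_heads=8, head_dim=128, ctx_len=90112)
print(f"hybrid: {hybrid:.1f} GiB  full-attention: {full:.1f} GiB")
```

Whatever the exact head dimensions, the ratio is fixed by the architecture: the hybrid stack pays for one quarter of the KV cache a full-attention 40-layer model would need at the same context.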
Because the model fits in VRAM without any offloading, the "VRAM savings" that --cpu-moe normally trades for is not needed. Enabling it anyway pays the CPU-compute cost (40 layers × 9 active experts ≈ 360 CPU expert matmuls per token) without any memory benefit to balance it. On a consumer-class CPU that ratio is severe enough to dominate decode throughput.
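The per-token dispatch counts for the two routing patterns discussed here, as a sanity check on the 64 vs 360 figures:

```python
def cpu_expert_ops_per_token(moe_layers, active_experts):
    """Number of expert evaluations the CPU performs per generated token
    when --cpu-moe keeps all expert weights in DRAM."""
    return moe_layers * active_experts

# Mixtral-style sparse routing: 32 layers, 2 of 8 experts active.
print(cpu_expert_ops_per_token(32, 2))   # 64
# Qwen 3.6: 40 MoE layers, 9 active experts (8 routed + 1 shared).
print(cpu_expert_ops_per_token(40, 9))   # 360
```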
Why the same flag was right for Qwen3-Coder on the same box
Qwen3-Coder-30B-A3B on this same hardware ran at 31 tok/s with --cpu-moe enabled. That model uses a classic architecture: 48 standard GQA layers and 128 experts, 8 active per token. At 90K context every layer contributes to the KV cache, which makes the total VRAM requirement substantially larger than Qwen 3.6's. On dual 16 GB cards Qwen3-Coder does not fit without offloading; --cpu-moe is the only way to run it at that context at all.
So the flag is doing its job there. The VRAM savings are real and necessary, and paying the CPU cost per token is the price of running the model on a consumer rig. The lesson is not that --cpu-moe is bad — it is that whether it helps depends on whether your model-plus-context combination actually overflows VRAM on your target hardware.
The numbers
| Config | Generation (single-request) |
|---|---|
| Qwen 3.6-35B-A3B with --cpu-moe | 21.7 tok/s |
| Qwen 3.6-35B-A3B without --cpu-moe | 107.8 tok/s |
Dual RTX 5060 Ti, 16 GB each. Q4_K_M quantization. 90K context window. q8_0 KV cache. Identical in every other respect. The only change between rows is whether the expert weights sit in VRAM or in DRAM, and therefore whether the expert matmuls run on the GPU or on the CPU. Nothing about memory pressure forces the choice: Qwen 3.6 fits in VRAM either way.
Heuristic:
--cpu-moe is a memory trade: VRAM savings in exchange for CPU compute cost per token. Only enable it when your model-plus-context combination overflows the VRAM you have. If it fits in VRAM already, the CPU cost is pure overhead. For hybrid-attention models with small KV caches (DeltaNet, Mamba-SSM), you will often fit in VRAM where a full-KV-cache model of the same parameter count would not — which flips the right default on the same hardware.
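The heuristic can be sketched as a function. All sizes are caller-supplied estimates, and the 1 GiB headroom default is an assumption covering activations, CUDA context, and fragmentation, not a measured figure:

```python
def should_enable_cpu_moe(weights_gib, kv_cache_gib, vram_gib,
                          headroom_gib=1.0):
    """--cpu-moe as a memory-pressure valve: enable it only when the
    model-plus-context footprint overflows available VRAM.

    headroom_gib reserves room for activations, CUDA context, and
    fragmentation (an assumed default, tune it for your stack).
    """
    fits = weights_gib + kv_cache_gib + headroom_gib <= vram_gib
    return not fits  # if it fits, the CPU cost buys nothing

# Qwen 3.6-ish on dual 16 GB: rough Q4_K_M weights plus a small hybrid KV cache.
print(should_enable_cpu_moe(20.0, 2.0, 32.0))   # False: fits, flag is overhead
# A full-KV-cache model at the same context can overflow the same cards.
print(should_enable_cpu_moe(20.0, 14.0, 32.0))  # True: offloading required
```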
How the investigation started
I did not catch this on my own. The day before this post I published an LLMKube writeup of Qwen 3.6 with hybrid offloading at 21.7 tok/s and cross-posted it to r/LocalLLaMA. The thread pushed back: multiple commenters reported 3–5x higher throughput on comparable hardware. One flagged ~67 tok/s as achievable. Another shared a Q5_K_M configuration on a 4060 Ti + 5060 Ti pairing at roughly 60 tok/s. Another argued for vLLM with prefix caching as the right backend for agentic workloads.
Every data point said the same thing: my config was the bottleneck, not the hardware. --cpu-moe was the first flag I pulled, and throughput jumped 5x on the next run. The architecture reasoning above is the post-hoc explanation; the community thread was the prompt.
What shipped in LLMKube 0.7.0
The same thread surfaced adjacent gaps in LLMKube's runtime coverage that were unrelated to --cpu-moe but worth closing while the feedback was fresh. Seven new runtime controls landed in PR #291, bundled into the 0.7.0 release:
- `extraArgs` passthrough for vLLM. The llama.cpp runtime already honored it as an escape hatch; vLLM silently ignored it. Both runtimes now accept it.
- Sharding strategy now actually applies. The Model CRD exposed `sharding.strategy` (`layer`, `tensor`, `pipeline`) for months, but the llama.cpp runtime hardcoded `--split-mode layer`. Now `tensor` correctly maps to `--split-mode row`, with `none` and `row` added as explicit enum values.
- `enablePrefixCaching` for vLLM. Agentic workloads share a system prompt across every request. Prefix caching is a real time-to-first-token win there.
- `attentionBackend` for vLLM. Pick `flashinfer`, `flash_attn`, `xformers`, or `torch_sdpa` per InferenceService.
- `noWarmup` for llama.cpp. Trade pod-startup latency for first-request latency. Useful for scale-to-zero deployments.
- Reasoning budget controls. Thinking models like Qwen 3.6 and GLM-5 can run away on reasoning tokens. `reasoningBudget` caps the count; `reasoningBudgetMessage` nudges the model to conclude when the budget is exhausted.
- GGUF metadata overrides. `metadataOverrides` is a typed array field for llama.cpp's `--override-kv`, which is how you extend context windows beyond native training limits (e.g., Qwen 3.6's 262K to 1M via YaRN) by overriding the embedded `context_length`.
Two of those are technically behavior changes for existing configs (sharding strategy semantics and vLLM extraArgs forwarding). Both are called out in the 0.7.0 changelog's ⚠ BREAKING CHANGES section.
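As a sketch of what the llama.cpp-side additions look like in a spec (field names come from the list above; the exact placement within `spec`, the metadata key, and the override's value shape are assumptions, so check the 0.7.0 sample manifests for the authoritative schema):

```yaml
# Hypothetical sketch, not copied from the published CRD.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen36-long-context
spec:
  modelRef: qwen36-35b-hybrid
  noWarmup: true             # trade first-request latency for faster pod startup
  reasoningBudget: 4096      # cap runaway thinking tokens
  metadataOverrides:         # forwarded to llama.cpp --override-kv
    - key: qwen3next.context_length   # assumed key name for illustration
      type: int
      value: "1000000"
```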
Running Qwen 3.6 on LLMKube 0.7.0
The corrected config for dual 16 GB cards:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen36-hybrid
spec:
  modelRef: qwen36-35b-hybrid
  contextSize: 90112
  cacheTypeK: q8_0
  cacheTypeV: q8_0
  # moeCPUOffload: false  <-- do NOT enable for hybrid DeltaNet MoE
  flashAttention: true
  jinja: true
  resources:
    gpu: 2
    cpu: "8"
    memory: "16Gi"
```

Drop the `moeCPUOffload` and `hostMemory` lines that appeared in the original writeup. They were the wrong answer for this specific model. The rest of the configuration — GPU count, context window, flash attention, Jinja templates, KV cache quantization — stays the same.
Two takeaways
Size the VRAM savings before paying for them. --cpu-moe is a memory-pressure valve, not a throughput feature. Check whether your model-plus-context combination fits in VRAM first. If it does, the flag is pure overhead. Hybrid-attention architectures (DeltaNet, Mamba-SSM, recurrent layers mixed into MoE stacks) systematically shift the "does it fit" answer because their KV cache is smaller per layer than a full-GQA model of the same parameter count.
Public iteration beats private polish. The quickest path from a suboptimal config to a 5x speedup was publishing the suboptimal config. The seven runtime additions in 0.7.0 got there the same way. For an operator that has to work well across a lot of different hardware and model shapes, the edge cases live in the community's deployments, not in my test rig.
Resources: The LLMKube repository has the operator, sample manifests, and Helm chart. PR #291 is the 0.7.0 runtime controls bundle. The original Qwen 3.6 post carries a banner pointing to this analysis.
Run Your Own Local Inference Stack
LLMKube handles model downloads, GPU scheduling, hybrid offloading (where it fits), and an OpenAI-compatible endpoint. Point any agentic tool at it and keep every token on your network.