In progress: this page is being written.
KV cache types
Long-context inference is bandwidth-bound on the KV cache. Picking the right cache type buys you either more concurrency at the same context, or longer context on the same hardware.
What this page will cover
- How KV cache memory scales with context length and model size, and why it matters for agentic workloads.
- The standard llama.cpp cache types: f16, q8_0, q4_0, q5_0, iq4_nl.
- The vLLM equivalents (auto, fp8_e5m2, fp8_e4m3) and how they differ from the llama.cpp types.
- The custom-string escape hatch (cacheTypeCustomK/V, kvCacheCustomDtype) for fork-only types like TurboQuant turbo3/turbo4/turbo2.
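To make the scaling concrete before the full page lands, here is a minimal sketch of how KV cache size grows with context length and how the cache type changes the constant factor. The model dimensions (32 layers, 8 KV heads, head dim 128, a Llama-3-8B-like shape) are illustrative assumptions, not tied to any specific model here; the per-element byte costs are approximate, based on the ggml block layouts (e.g. q8_0 stores 34 bytes per 32-element block) and vLLM's flat 1-byte fp8 formats.

```python
# Hypothetical model shape for illustration (Llama-3-8B-like):
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

# Approximate bytes per cached element. f16 is exact; the ggml block
# quants amortize a per-block scale over 32 elements.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,    # 32 int8 values + fp16 scale per block
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
    "iq4_nl": 18 / 32,  # non-linear 4-bit, same footprint as q4_0
    "fp8_e5m2": 1.0,    # vLLM fp8: one byte per element
}

def kv_cache_bytes(ctx_len: int, dtype: str) -> float:
    """Total K+V cache size in bytes for one sequence of ctx_len tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # 2 = K and V
    return per_token * ctx_len * BYTES_PER_ELEM[dtype]

for dtype in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(128_000, dtype) / 2**30
    print(f"{dtype:>5} @ 128k ctx: {gib:.1f} GiB per sequence")
```

The takeaway is the trade named above: at a fixed context, dropping from f16 to a 4-bit cache type leaves room for roughly 3.5x as many concurrent sequences; at a fixed memory budget, it buys roughly 3.5x the context.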