In progress: this page is being written.
KV cache types
Long-context inference is bandwidth-bound on the KV cache. Picking the right cache type buys you either more concurrency at the same context, or longer context on the same hardware.
What this page will cover
- How KV cache memory scales with context length and model size, and why it matters for agentic workloads.
- The standard llama.cpp cache types: f16, q8_0, q4_0, q5_0, iq4_nl.
- The vLLM equivalents (auto, fp8_e5m2, fp8_e4m3) and how they differ from the llama.cpp types.
- The custom-string escape hatch (cacheTypeCustomK/V, kvCacheCustomDtype) for fork-only types like TurboQuant turbo3/turbo4/turbo2.
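To make the scaling concrete before the full page lands, here is a minimal sketch of how KV cache size grows with context length and how the cache type changes the constant factor. The model dimensions (32 layers, 8 KV heads, head dim 128, a Llama-3-8B-like shape) are illustrative assumptions, not tied to any specific model here; the per-element byte costs are approximate, based on the ggml block layouts (e.g. q8_0 stores 34 bytes per 32-element block) and vLLM's flat 1-byte fp8 formats.

```python
# Hypothetical model shape for illustration (Llama-3-8B-like):
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

# Approximate bytes per cached element. f16 is exact; the ggml block
# quants amortize a per-block scale over 32 elements.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,    # 32 int8 values + fp16 scale per block
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
    "iq4_nl": 18 / 32,  # non-linear 4-bit, same footprint as q4_0
    "fp8_e5m2": 1.0,    # vLLM fp8: one byte per element
}

def kv_cache_bytes(ctx_len: int, dtype: str) -> float:
    """Total K+V cache size in bytes for one sequence of ctx_len tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # 2 = K and V
    return per_token * ctx_len * BYTES_PER_ELEM[dtype]

for dtype in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(128_000, dtype) / 2**30
    print(f"{dtype:>5} @ 128k ctx: {gib:.1f} GiB per sequence")
```

The takeaway is the trade named above: at a fixed context, dropping from f16 to a 4-bit cache type leaves room for roughly 3.5x as many concurrent sequences; at a fixed memory budget, it buys roughly 3.5x the context.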