In progress: this page is being written.

KV cache types

Long-context inference is bandwidth-bound on the KV cache. Picking the right cache type buys you either more concurrency at the same context length, or a longer context on the same hardware.
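To make the scaling concrete, here is a minimal sketch of the KV cache arithmetic. The model dimensions are assumptions for illustration (a Llama-3-8B-style layout: 32 layers, 8 KV heads via GQA, head dim 128), and the per-element bit costs for the quantized types are the usual ggml block figures:

```python
# Per-token KV cache cost: one K vector and one V vector per layer.
# Dimensions below are assumptions (Llama-3-8B-style); substitute
# your model's layer count, KV head count, and head dimension.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

# Effective bits per element. f16 is exact; the quantized figures are
# the standard ggml block costs (e.g. q8_0 stores 32 int8 values plus
# one f16 scale per block: (32*8 + 16) / 32 = 8.5 bits).
BITS = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5, "iq4_nl": 4.5}

def kv_bytes(ctx_len: int, cache_type: str = "f16") -> int:
    """Total KV cache bytes for a single sequence of ctx_len tokens."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return int(ctx_len * elems_per_token * BITS[cache_type] / 8)

for t in ("f16", "q8_0", "q4_0"):
    print(f"{t:>5}: {kv_bytes(32_768, t) / 2**30:.2f} GiB at 32k context")
```

With these dimensions the cache costs 128 KiB per token at f16, so a single 32k-token sequence needs 4 GiB; q8_0 roughly halves that, and q4_0 cuts it to just over a quarter, which is exactly where the extra concurrency or context headroom comes from.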

What this page will cover

  • How KV cache memory scales with context length and model size, and why it matters for agentic workloads.
  • The standard llama.cpp cache types: f16, q8_0, q4_0, q5_0, iq4_nl.
  • The vLLM equivalents: auto, fp8_e5m2, fp8_e4m3, and how they differ from llama.cpp.
  • The custom-string escape hatch (cacheTypeCustomK/V, kvCacheCustomDtype) for fork-only types like TurboQuant turbo3/turbo4/turbo2.
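As a preview of how the runtimes discussed above select these types, both expose them as launch flags. A sketch, assuming current llama.cpp and vLLM flag names; the model paths and names are placeholders:

```shell
# llama.cpp: the K and V caches can be quantized independently
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0

# vLLM: FP8 KV cache; "auto" keeps the model's activation dtype
vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-cache-dtype fp8_e5m2
```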

© 2026 Defilan Technologies LLC


LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.