Topic Guide
KV Cache Quantization for Local LLMs
KV cache quantization is the long-context memory lever. Once the weights fit in VRAM, prompt length and the number of concurrent sessions are usually limited by K/V bytes per token, not by the size of the static weight file.
Practical Answer
For local long-context inference, FP16 KV cache is a debug-friendly baseline, not the right long-term default. INT8 K/V, asymmetric Q4 K plus higher-precision V, and TurboQuant-style compression all attack the same bottleneck: K/V memory grows linearly with token count. On 16 GB and 32 GB GPUs, that growth decides whether 128k context is real or just a model-card number.
When This Page Helps
This page is for readers trying to understand why a model that fits in VRAM still fails at long context. The focus is practical memory math: bytes per token, which precision is safe, and what an engine must implement.
Status Snapshot
- Best current answer: KV cache precision is the lever that turns model-card context into usable local context after the weights already fit.
- Reader problem: Readers need to compute bytes per token, choose a precision strategy, and understand the quality and bandwidth tradeoff.
- Main bottleneck: At long context, attention reads can become the dominant decode traffic. A smaller cache only helps when kernels read the compressed layout directly.
Action Plan
- Compute bytes per token: Start from layers, KV heads, head dimension, and precision. That number explains context limits better than model size alone.
- Prefer asymmetric precision: K often tolerates lower precision than V. INT8 or Q4 K with higher-precision V is usually a better first target than uniformly shrinking both.
- Read compressed KV directly: A long-context speedup disappears if the attention path dequantizes the whole cache into FP16 scratch before every read.
- Validate at long context: Short prompts rarely reveal KV quantization regressions. Test perplexity, retrieval, and decode throughput at 16k, 32k, and beyond.
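A quality check for a chosen K/V precision pair can be prototyped before touching any kernel. The sketch below is a minimal harness under stated assumptions: `quantize`, `dequantize`, `attention`, and `roundtrip` are hypothetical helper names, the quantizer is plain symmetric per-vector rounding (not any specific engine's format), and the inputs are random Gaussian vectors rather than real model activations, which typically have outliers that random data does not reproduce.

```python
import math
import random

def quantize(vec, bits):
    """Symmetric per-vector quantization: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def roundtrip(rows, bits):
    """Quantize then dequantize every cached vector at the given bit width."""
    return [dequantize(*quantize(row, bits)) for row in rows]

def attention(query, keys, values):
    """Single-head softmax attention over a list of cached tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)                            # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, value in zip(weights, values):
        for i, v in enumerate(value):
            out[i] += (w / total) * v
    return out

# Assumption: random data stands in for real K/V tensors.
rng = random.Random(0)
head_dim, tokens = 64, 256
query = [rng.gauss(0, 1) for _ in range(head_dim)]
keys = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(tokens)]
values = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(tokens)]

reference = attention(query, keys, values)
for k_bits, v_bits in [(8, 8), (4, 8), (4, 4)]:
    approx = attention(query, roundtrip(keys, k_bits), roundtrip(values, v_bits))
    err = max(abs(a - b) for a, b in zip(reference, approx))
    print(f"K={k_bits}-bit V={v_bits}-bit  max |output error| = {err:.4f}")
```

The numbers only become meaningful when the same comparison is run on K/V tensors captured from the target model at the target context length, alongside retrieval and perplexity tests.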
Operator Checklist
- Write the formula down: KV bytes equal layers times KV heads times head dimension times (K bytes per element plus V bytes per element) times token count. Compare the result with the actual VRAM budget.
- Choose K and V separately: Do not assume symmetric quantization. K and V often have different error tolerance and bandwidth value.
- Include allocator behavior: Paging, contiguous arenas, prefix reuse, and fragmentation determine whether the theoretical saving is usable.
- Test retrieval, not only perplexity: Long-context failures often show up as missed facts or degraded attention over distant tokens.
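The formula in the checklist is short enough to keep as a script. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) and a hypothetical 8 GiB cache budget; neither comes from a specific model card. It also ignores per-group scale metadata and allocator overhead, so the token counts are upper bounds.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, k_bits, v_bits):
    """Bytes of K plus V stored for one token across all layers."""
    elements = layers * kv_heads * head_dim       # element count for K; V has the same count
    return elements * (k_bits + v_bits) / 8

def max_tokens(budget_bytes, bytes_per_token):
    """Largest token count whose cache fits in the budget."""
    return int(budget_bytes // bytes_per_token)

# Hypothetical shape and budget, for illustration only.
layers, kv_heads, head_dim = 32, 8, 128
budget = 8 * 1024**3                              # 8 GiB left for KV after weights

for label, k_bits, v_bits in [
    ("FP16 K/V", 16, 16),
    ("INT8 K/V", 8, 8),
    ("Q4 K + INT8 V", 4, 8),
]:
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, k_bits, v_bits)
    print(f"{label:14s} {per_token / 1024:6.1f} KiB/token  "
          f"max tokens in budget: {max_tokens(budget, per_token):,}")
```

For this assumed shape, FP16 costs 128 KiB per token, so a 128k-token context needs 16 GiB for the cache alone before any weights are counted.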
What Matters
- Weights are static after load; KV cache grows with every prompt token and generated token.
- Grouped-query attention reduces KV cost, but long context still makes KV memory large.
- K vectors usually tolerate lower precision than V vectors, so asymmetric formats are attractive.
- A good attention kernel should read quantized KV directly rather than dequantizing into a full FP16 scratch copy.
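The difference between reading quantized KV directly and dequantizing into scratch can be shown on a single key vector. This is a sketch only: `quantize_q4`, `dot_packed`, and `dot_via_scratch` are hypothetical names, the packing (two 4-bit codes per byte, one scale per vector) is an assumed layout rather than any engine's real format, and plain Python stands in for what would be a GPU kernel.

```python
import random

def quantize_q4(vec):
    """Symmetric 4-bit quantization; returns (packed bytes, scale)."""
    assert len(vec) % 2 == 0, "two codes are packed per byte"
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Codes in [-7, 7] are stored with a +8 offset so each fits an unsigned nibble.
    codes = [max(-7, min(7, round(x / scale))) + 8 for x in vec]
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale

def dot_packed(query, packed, scale):
    """q . k computed straight from packed nibbles; no dequantized copy of k is built."""
    acc = 0.0
    for i, byte in enumerate(packed):
        acc += query[2 * i] * ((byte & 0x0F) - 8)
        acc += query[2 * i + 1] * ((byte >> 4) - 8)
    return acc * scale                            # one scale multiply per vector

def dot_via_scratch(query, packed, scale):
    """Reference path: materialize the dequantized key, then take the dot product."""
    scratch = []
    for byte in packed:
        scratch.append(((byte & 0x0F) - 8) * scale)
        scratch.append(((byte >> 4) - 8) * scale)
    return sum(q * k for q, k in zip(query, scratch))

rng = random.Random(1)
head_dim = 64
query = [rng.gauss(0, 1) for _ in range(head_dim)]
key = [rng.gauss(0, 1) for _ in range(head_dim)]

packed, scale = quantize_q4(key)
print("packed bytes per key:", len(packed))
print("direct read :", dot_packed(query, packed, scale))
print("via scratch :", dot_via_scratch(query, packed, scale))
```

Both paths give the same score up to floating-point rounding. The direct path touches only the packed bytes and one scale, while the scratch path writes and rereads a full-width copy, which is the traffic a fused kernel is meant to avoid.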
What To Measure
- Bytes per token: The most useful KV number, because it directly predicts maximum context and session capacity.
- Attention bandwidth: Measure whether the compressed format actually reduces memory traffic in decode.
- Quality delta: Compare retrieval, perplexity, and answer stability at the target context length.
- Cache residency: Show reserved versus active KV bytes and whether pages are shared, evicted, or compacted.
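Reserved versus active bytes is the easiest of these measurements to model. The sketch below assumes a simple fixed-size page allocator; `PagedKVAccounting` is a hypothetical class, not an existing engine API, and it does not model page sharing, eviction, or compaction. The 49,152 bytes per token is an assumed figure for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PagedKVAccounting:
    """Tracks how many KV bytes are reserved in pages versus actually holding tokens."""
    page_tokens: int                              # tokens per page
    bytes_per_token: int
    session_tokens: dict = field(default_factory=dict)   # session id -> token count

    def append(self, session_id, new_tokens):
        self.session_tokens[session_id] = self.session_tokens.get(session_id, 0) + new_tokens

    def pages(self, session_id):
        # Ceiling division: a partially filled page is still fully reserved.
        return -(-self.session_tokens[session_id] // self.page_tokens)

    def active_bytes(self):
        return sum(self.session_tokens.values()) * self.bytes_per_token

    def reserved_bytes(self):
        total_pages = sum(self.pages(s) for s in self.session_tokens)
        return total_pages * self.page_tokens * self.bytes_per_token

# Assumed page size and bytes per token, for illustration only.
cache = PagedKVAccounting(page_tokens=256, bytes_per_token=49_152)
cache.append("chat-a", 1_000)
cache.append("chat-b", 130)

reserved, active = cache.reserved_bytes(), cache.active_bytes()
print(f"reserved {reserved / 2**20:.1f} MiB, active {active / 2**20:.1f} MiB, "
      f"utilization {active / reserved:.0%}")
```

Many short sessions with large pages push utilization down, which is one way a theoretical quantization saving fails to show up as usable capacity.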
Common Traps
- Calling FP16 the default answer: FP16 is useful for correctness bring-up, but it is too expensive for serious 128k local context on 16 GB and 32 GB cards.
- Ignoring page layout: Quantization, paging, and prefix reuse interact. A smaller cache still needs an allocation layout that serves the workload.
- Optimizing memory but losing bandwidth: The format only helps if the kernel spends less time reading memory after scale loads, unpacking, and correction are included.
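Scale loads are easy to leave out of the bandwidth estimate. The sketch below amortizes per-group metadata into an effective bit width and turns it into a lower bound on decode read time. The group size of 32 with a 16-bit scale, the model shape, and the 500 GB/s bandwidth are all assumptions chosen for illustration; unpacking and correction compute are not modeled.

```python
def effective_bits(code_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Bits actually read per element once per-group metadata is amortized."""
    return code_bits + (scale_bits + zero_point_bits) / group_size

def decode_read_bytes(context_tokens, elements_per_token, k_bits, v_bits):
    """Bytes of K and V one decode step must read to attend over the whole context."""
    return context_tokens * elements_per_token * (k_bits + v_bits) / 8

# Hypothetical shape, context, and memory bandwidth.
elements = 32 * 8 * 128                           # layers * KV heads * head dim
context = 32_768
bandwidth = 500e9                                 # bytes per second

formats = {
    "FP16 K/V": (16.0, 16.0),
    "INT8 K/V, group 32": (effective_bits(8, 32), effective_bits(8, 32)),
    "Q4 K + INT8 V, group 32": (effective_bits(4, 32), effective_bits(8, 32)),
}
for label, (k_bits, v_bits) in formats.items():
    read = decode_read_bytes(context, elements, k_bits, v_bits)
    print(f"{label:24s} {read / 2**30:5.2f} GiB per step, "
          f"at least {1e3 * read / bandwidth:5.2f} ms")
```

A measured decode step that does not improve roughly in line with these read sizes suggests the kernel is spending the saving on unpacking or on a dequantized scratch copy.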
Useful Next Posts
- Long-context budgets by GPU memory: A table for 16 GB, 24 GB, 32 GB, and Apple Silicon using common Qwen and Gemma shapes.
- FP8 versus Q4 KV cache: A practical comparison of decode bandwidth, implementation complexity, and quality risk.
- Prefix cache plus KV quantization: Explain where prompt caching changes memory pressure and where it only moves the cost.
Common Questions
Why does KV cache matter for local LLMs?
KV cache stores the keys and values needed for attention over prior tokens. It grows linearly with context length and can become the largest variable memory cost once the model weights fit.
Is FP16 KV cache still useful?
Yes, as a simple correctness baseline and debug mode. For long context on 16 GB or 32 GB cards, it is usually too expensive to be the practical default.
What is TurboQuant doing differently?
TurboQuant combines vector quantization with residual correction for keys so attention inner products stay low-bias at very low bit widths. It is a design target for future compressed KV paths in ZINC.