# Prefix Caching

Prefix caching reuses compatible prefix KV cache state across requests. It is a capacity and latency optimization for workloads with shared prompts or repeated prefixes.

## Configuration

`enable_prefix_caching` controls whether compatible strategies may use prefix caching. The value is boolean and defaults to `True`. Strategy normalization still has the final say: `decoding_strategy="d2f"` forces prefix caching off, while `multi_bd` and `dmax` leave it enabled when the rest of the cache layout is compatible.

| Surface | How to set it | Notes |
| --- | --- | --- |
| Server CLI | Prefix caching is enabled by default; add `--disable-prefix-caching` to turn it off. | Useful when debugging cache layout or request state. |
| Benchmark CLI | Use `--enable-prefix-caching` or `--no-enable-prefix-caching`. | Makes prefix-cache behavior explicit in experiment commands. |

## When to Disable It

Disable prefix caching while debugging cache layout, request state, or strategy changes. Once correctness is stable, re-enable it for throughput and latency checks.

## Related Arguments

| Surface | Names | Notes |
| --- | --- | --- |
| Python/config | `enable_prefix_caching` | Main config field before strategy normalization. |
| CLI | `--disable-prefix-caching`, `--enable-prefix-caching`, `--no-enable-prefix-caching` | Server and benchmark CLIs expose the toggle with different flag names. |
| Related config | `decoding_strategy`, `multi_block_prefix_full`, `kv_cache_layout` | Strategy and cache layout decide whether prefix reuse is actually compatible. |
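
The strategy normalization described under Configuration can be illustrated with a minimal sketch. It is not the project's implementation: `CacheConfig`, `normalize_prefix_caching`, and the `layout_compatible` argument are hypothetical names introduced here, and the layout check is reduced to a single boolean. Only the field names `decoding_strategy` and `enable_prefix_caching` and the strategy names `d2f`, `multi_bd`, and `dmax` come from this page.

```python
from dataclasses import dataclass, replace

# Strategies that this page says force prefix caching off.
STRATEGIES_WITHOUT_PREFIX_CACHING = {"d2f"}


@dataclass(frozen=True)
class CacheConfig:
    # Hypothetical container; only the two field names come from the page.
    decoding_strategy: str = "multi_bd"
    enable_prefix_caching: bool = True  # documented default


def normalize_prefix_caching(
    config: CacheConfig, layout_compatible: bool = True
) -> CacheConfig:
    """Return a copy of config holding the effective prefix-caching value.

    layout_compatible is an assumed stand-in for whatever check the real
    code performs on the rest of the cache layout.
    """
    effective = (
        config.enable_prefix_caching
        and config.decoding_strategy not in STRATEGIES_WITHOUT_PREFIX_CACHING
        and layout_compatible
    )
    return replace(config, enable_prefix_caching=effective)


if __name__ == "__main__":
    for strategy in ("d2f", "multi_bd", "dmax"):
        requested = CacheConfig(decoding_strategy=strategy)
        print(strategy, normalize_prefix_caching(requested).enable_prefix_caching)
```

In this sketch the requested value is only an upper bound: normalization can turn prefix caching off, but never turns it on when the caller disabled it.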