Prefix Caching
Prefix caching reuses the KV cache state of a prompt prefix across requests whenever that cached state is compatible with the new request. It is a capacity and latency optimization for workloads with shared prompts or repeated prefixes.
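To make the idea of reusing a shared prefix concrete, here is a toy sketch of block-aligned prefix matching. It is an illustration only and does not describe how this project stores or looks up KV state: `ToyPrefixCache`, `BLOCK_SIZE`, and the string placeholders standing in for KV tensors are all hypothetical.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, chosen only for illustration


class ToyPrefixCache:
    """Toy model: maps block-aligned token prefixes to a placeholder for KV state."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already covered by cached blocks."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) not in self._blocks:
                break  # the first missing block ends the reusable prefix
            matched = end
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record every full block-aligned prefix of a processed request."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            # A real cache would hold KV tensors; a string stands in here.
            self._blocks.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")
```

In this sketch, a second request that shares its first eight tokens with an earlier one would only need to compute KV state from the ninth token onward, which is where the latency saving comes from.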
Configuration
enable_prefix_caching controls whether compatible strategies may use prefix
caching.
The value is boolean and defaults to True. Strategy normalization still has
the final say: decoding_strategy="d2f" forces prefix caching off, while
multi_bd and dmax leave it enabled when the rest of the cache layout is
compatible.
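The normalization rule described above can be sketched as a small function. This is a hedged illustration, not the project's code: the function name `normalize_prefix_caching` and the `cache_layout_compatible` parameter are assumptions, and the behavior for strategies other than `d2f`, `multi_bd`, and `dmax` is not specified in this section.

```python
def normalize_prefix_caching(
    decoding_strategy: str,
    enable_prefix_caching: bool = True,
    cache_layout_compatible: bool = True,  # hypothetical stand-in for layout checks
) -> bool:
    """Return the effective prefix-caching setting after strategy normalization."""
    if decoding_strategy == "d2f":
        # d2f forces prefix caching off, whatever the user requested.
        return False
    if decoding_strategy in ("multi_bd", "dmax"):
        # These strategies keep the requested value if the cache layout allows it.
        return enable_prefix_caching and cache_layout_compatible
    # Assumption: strategies not covered by this section are treated conservatively.
    return False
```

The point of the sketch is that `enable_prefix_caching=True` is a request, not a guarantee: the value that takes effect depends on the decoding strategy and the cache layout.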
| Surface | How to set it | Notes |
|---|---|---|
| Server CLI | Prefix caching is enabled by default; add | Useful when debugging cache layout or request state. |
| Benchmark CLI | Use | Makes prefix-cache behavior explicit in experiment commands. |
When to Disable It
Disable prefix caching while debugging cache layout, request state, or strategy changes. Once correctness is stable, re-enable it for throughput and latency checks.
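One way to decide that correctness is stable is to run the same prompts with prefix caching off and on and compare the results. The helper below is a minimal sketch of that check. It assumes you supply a `run` callable that wraps your own generation call and accepts the `enable_prefix_caching` value as its second argument; both `run` and `outputs_match_with_and_without_cache` are hypothetical names, not part of the project.

```python
from typing import Callable, List, Sequence


def outputs_match_with_and_without_cache(
    run: Callable[[Sequence[str], bool], List[str]],
    prompts: Sequence[str],
) -> bool:
    """Run the same prompts with prefix caching off, then on, and compare outputs."""
    baseline = run(prompts, False)  # enable_prefix_caching=False
    cached = run(prompts, True)     # enable_prefix_caching=True
    return baseline == cached
```

The comparison is only meaningful with deterministic decoding settings. With sampling enabled, outputs can differ between runs for reasons unrelated to prefix caching.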