Prefix Caching

Prefix caching reuses the KV cache state of a shared prefix across requests whenever that state is compatible. It is a capacity and latency optimization for workloads with shared prompts or repeated prefixes.
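To make the idea concrete, here is a toy sketch of prefix reuse. It is not this project's implementation: the class name ToyPrefixCache, the block size, and the string "state handles" are all hypothetical, and a real cache would store KV tensors and check layout compatibility.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, in tokens


class ToyPrefixCache:
    """Toy model: maps each full-block token prefix to a cached-state handle."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def _full_block_prefixes(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        # Only whole blocks are cacheable; a trailing partial block is skipped.
        n_full = len(tokens) // BLOCK_SIZE
        return [tuple(tokens[: (i + 1) * BLOCK_SIZE]) for i in range(n_full)]

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached state."""
        hit = 0
        for prefix in self._full_block_prefixes(tokens):
            if prefix not in self._blocks:
                break
            hit = len(prefix)
        return hit

    def insert(self, tokens: List[int]) -> None:
        """Record state handles for every full-block prefix of this request."""
        for prefix in self._full_block_prefixes(tokens):
            self._blocks.setdefault(prefix, f"kv-state-{len(self._blocks)}")


cache = ToyPrefixCache()
shared_prompt = [1, 2, 3, 4, 5, 6, 7, 8]      # e.g. a common system prompt
cache.insert(shared_prompt + [9, 10])          # first request fills the cache
reused = cache.match(shared_prompt + [11, 12, 13])  # second request, same prefix
unrelated = cache.match([9, 2, 3, 4, 5, 6, 7, 8])   # differs in the first token
```

In this sketch the second request can skip recomputing the two blocks covered by the shared prompt, while a request that diverges in its first block reuses nothing.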

Configuration

enable_prefix_caching controls whether compatible strategies may use prefix caching.

The value is boolean and defaults to True. Strategy normalization still has the final say: decoding_strategy="d2f" forces prefix caching off, while multi_bd and dmax leave it enabled when the rest of the cache layout is compatible.
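The normalization rule above can be sketched as a small function. This is an illustration under stated assumptions, not the project's actual code: CacheSettings, layout_compatible, and effective_prefix_caching are hypothetical names; only enable_prefix_caching, decoding_strategy, and the strategy values d2f, multi_bd, and dmax come from the text.

```python
from dataclasses import dataclass

# Strategies that, per the text, force prefix caching off.
FORCED_OFF = {"d2f"}


@dataclass
class CacheSettings:  # hypothetical container for illustration
    decoding_strategy: str
    enable_prefix_caching: bool = True  # documented default
    # Hypothetical stand-in for "the rest of the cache layout is compatible".
    layout_compatible: bool = True


def effective_prefix_caching(settings: CacheSettings) -> bool:
    """Return whether prefix caching would stay on after normalization."""
    if not settings.enable_prefix_caching:
        return False  # the user turned it off explicitly
    if settings.decoding_strategy in FORCED_OFF:
        return False  # the strategy overrides the requested value
    return settings.layout_compatible


d2f_result = effective_prefix_caching(CacheSettings("d2f"))
multi_bd_result = effective_prefix_caching(CacheSettings("multi_bd"))
dmax_bad_layout = effective_prefix_caching(
    CacheSettings("dmax", layout_compatible=False)
)
```

The point of the sketch is the precedence order: the requested flag is only an upper bound, and the strategy and layout checks can still turn caching off.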

Surface: Server CLI
How to set it: Prefix caching is enabled by default; add --disable-prefix-caching to turn it off.
Notes: Useful when debugging cache layout or request state.

Surface: Benchmark CLI
How to set it: Use --enable-prefix-caching or --no-enable-prefix-caching.
Notes: Makes prefix-cache behavior explicit in experiment commands.
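The two flag styles described above map naturally onto Python's argparse. The following is a minimal sketch of how such flags can be parsed, not the project's actual CLI code: the parser names and the build_* helpers are hypothetical, and the benchmark default of True is assumed from the documented default of enable_prefix_caching.

```python
import argparse


def build_server_parser() -> argparse.ArgumentParser:
    """Server-style flag: on by default, one switch to turn it off."""
    parser = argparse.ArgumentParser(prog="server-sketch")  # hypothetical name
    parser.add_argument(
        "--disable-prefix-caching",
        dest="enable_prefix_caching",
        action="store_false",  # the destination defaults to True
    )
    return parser


def build_bench_parser() -> argparse.ArgumentParser:
    """Benchmark-style flag pair: --enable-... and --no-enable-..."""
    parser = argparse.ArgumentParser(prog="bench-sketch")  # hypothetical name
    parser.add_argument(
        "--enable-prefix-caching",
        action=argparse.BooleanOptionalAction,  # requires Python 3.9+
        default=True,  # assumed to mirror the documented default
    )
    return parser


server_default = build_server_parser().parse_args([])
server_off = build_server_parser().parse_args(["--disable-prefix-caching"])
bench_on = build_bench_parser().parse_args(["--enable-prefix-caching"])
bench_off = build_bench_parser().parse_args(["--no-enable-prefix-caching"])
```

Spelling the flag out in benchmark commands, even when it matches the default, keeps the cache setting visible in the recorded experiment command.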

When to Disable It

Disable prefix caching while debugging cache layout, request state, or strategy changes. Once correctness is stable, re-enable it for throughput and latency checks.