Optimized Attention

Optimized attention paths reduce memory movement and improve throughput by combining strategy metadata, a paged KV cache layout, and specialized kernels.

Attention Implementation

attn_impl selects the attention backend.

The core config accepts triton, triton_grouped, and naive.

The server CLI and the benchmark CLI expose the same three choices.

Use triton_grouped for normal serving, benchmarking, and performance reports. The older triton path and the naive path are kept for compatibility and debugging; they are not recommended for optimized runs.
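As a rough illustration of the guidance above, the sketch below validates an attn_impl value and flags the two paths that are kept only for compatibility and debugging. Only the three backend names come from this page; the helper name check_attn_impl, the warning text, and the use of Python warnings are assumptions made for the example and are not part of the project's API.

```python
import warnings

# Backend names taken from the text above. Everything else in this sketch
# (function name, warning wording) is hypothetical.
ATTN_IMPLS = ("triton", "triton_grouped", "naive")
RECOMMENDED_ATTN_IMPL = "triton_grouped"


def check_attn_impl(attn_impl: str) -> str:
    """Return attn_impl unchanged if it is a known backend name.

    Raises ValueError for unknown names and emits a warning for the
    backends that are not recommended for optimized runs.
    """
    if attn_impl not in ATTN_IMPLS:
        raise ValueError(
            f"attn_impl must be one of {ATTN_IMPLS}, got {attn_impl!r}"
        )
    if attn_impl != RECOMMENDED_ATTN_IMPL:
        warnings.warn(
            f"attn_impl={attn_impl!r} is intended for compatibility or "
            f"debugging; prefer {RECOMMENDED_ATTN_IMPL!r} for optimized runs."
        )
    return attn_impl
```

A launcher script could call check_attn_impl on the value it is about to pass through, so that a benchmark accidentally run with naive is easy to spot in the logs.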

Paged Attention

Paged attention stores KV cache in pages instead of requiring a single contiguous region per request. The scheduler and KV cache manager use page tables to map request positions to cache storage.
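The mapping from request positions to cache storage can be pictured with a toy page table. This is a minimal sketch of the general technique, not the engine's KV cache manager: the class PageTable, its method names, and the free-list allocation policy are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


class PageTable:
    """Toy page table (hypothetical): maps a request's token position
    to a (physical page, offset within page) pair."""

    def __init__(self, page_size: int, num_pages: int) -> None:
        self.page_size = page_size
        # Physical pages that are not assigned to any request yet.
        self.free_pages: List[int] = list(range(num_pages))
        # request id -> physical pages, in logical order.
        self.tables: Dict[str, List[int]] = {}

    def ensure_capacity(self, request_id: str, num_tokens: int) -> None:
        """Allocate pages until the request can hold num_tokens tokens."""
        table = self.tables.setdefault(request_id, [])
        needed = -(-num_tokens // self.page_size)  # ceiling division
        while len(table) < needed:
            if not self.free_pages:
                raise RuntimeError("out of KV cache pages")
            table.append(self.free_pages.pop())

    def lookup(self, request_id: str, position: int) -> Tuple[int, int]:
        """Translate a token position into (physical page, offset)."""
        table = self.tables[request_id]
        return table[position // self.page_size], position % self.page_size

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.tables.pop(request_id, []))
```

Because a request only ever holds whole pages, its pages do not need to be adjacent in memory, and at most page_size - 1 slots per request are left unused at the tail.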

| Key | How to set it | What it does |
| --- | --- | --- |
| page_size | Use 4, 8, 16, or 32 for most models; diffusion_gemma uses 256. | Sets the KV cache page size used by paged attention. |
| block_size | Keep it less than or equal to page_size. | Keeps diffusion block layout compatible with KV cache pages. |
| kv_cache_layout | Use unified unless a strategy or experiment needs distinct. | Chooses how cache storage is organized for attention. |
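The constraints listed above can be checked before launching a run. The following is a hedged sketch: the function validate_kv_settings and its error messages are hypothetical, and it enforces only what this page states (block_size no larger than page_size, and a layout of unified or distinct). It deliberately does not restrict page_size to the suggested values, since those are recommendations rather than stated limits.

```python
# Layout names taken from the settings above; the helper itself is hypothetical.
KV_CACHE_LAYOUTS = ("unified", "distinct")


def validate_kv_settings(
    page_size: int,
    block_size: int,
    kv_cache_layout: str = "unified",
) -> None:
    """Raise ValueError if the settings contradict the guidance on this page."""
    if page_size <= 0 or block_size <= 0:
        raise ValueError("page_size and block_size must be positive")
    if block_size > page_size:
        raise ValueError(
            f"block_size ({block_size}) must be <= page_size ({page_size})"
        )
    if kv_cache_layout not in KV_CACHE_LAYOUTS:
        raise ValueError(
            f"kv_cache_layout must be one of {KV_CACHE_LAYOUTS}, "
            f"got {kv_cache_layout!r}"
        )


# Example: a combination that passes the checks above.
validate_kv_settings(page_size=32, block_size=32, kv_cache_layout="unified")
```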

Chunked Prefill

Chunked prefill splits long prefill work into smaller chunks so the engine can respect token budgets and cache constraints. Strategy-specific model runners prepare chunked prefill tensors and attention metadata.
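The splitting step can be sketched as a small generator. This is an illustration of the idea under simple assumptions, not the engine's scheduler: the name prefill_chunks and the single token_budget parameter are hypothetical, and a real runner would also build the tensors and attention metadata for each chunk.

```python
from typing import Iterator, Tuple


def prefill_chunks(prompt_len: int, token_budget: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) token ranges that cover the prompt,
    each at most token_budget tokens long."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    start = 0
    while start < prompt_len:
        end = min(start + token_budget, prompt_len)
        yield start, end
        start = end


# A 10-token prompt with a budget of 4 tokens per step.
print(list(prefill_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Each chunk attends to the KV entries written by the chunks before it, which is why chunked prefill pairs naturally with a paged cache: earlier chunks are already in place when the next one runs.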