Optimized Attention¶

Optimized attention paths reduce memory movement and improve throughput by combining strategy metadata, paged KV cache layout, and specialized kernels.

Attention Implementation¶

attn_impl selects the attention backend.

The core config accepts triton, triton_grouped, and naive.

The server CLI and benchmark CLI expose all core choices.

Use triton_grouped for normal serving, benchmarking, and performance reports. The older triton path and the naive path are kept for compatibility and debugging; they are not recommended for optimized runs.

Paged Attention¶

Paged attention stores KV cache in pages instead of requiring a single contiguous region per request. The scheduler and KV cache manager use page tables to map request positions to cache storage.

Key	How to set it	What it does
`page_size`	Use `4`, `8`, `16`, or `32` for most models; `diffusion_gemma` uses `256`.	Sets the KV cache page size used by paged attention.
`block_size`	Keep it less than or equal to `page_size`.	Keeps diffusion block layout compatible with KV cache pages.
`kv_cache_layout`	Use `unified` unless a strategy or experiment needs `distinct`.	Chooses how cache storage is organized for attention.

Chunked Prefill¶

Chunked prefill splits long prefill work into smaller chunks so the engine can respect token budgets and cache constraints. Strategy-specific model runners prepare chunked prefill tensors and attention metadata.

Related Arguments¶

Surface	Names	Notes
Python/config	`attn_impl`, `page_size`, `kv_cache_layout`	Primary configuration fields for attention and cache layout.
CLI	`--attn-impl`, `--page-size`, `--kv-cache-layout`	Use for serving or benchmark overrides.
Kernel package	`diffulex_kernel`	Provides the lower-level optimized attention and cache helpers.