# Optimized Attention

Optimized attention paths reduce memory movement and improve throughput by combining strategy metadata, paged KV cache layout, and specialized kernels.

## Attention Implementation

`attn_impl` selects the attention backend. The core config accepts `triton`, `triton_grouped`, and `naive`. The server CLI and benchmark CLI expose all core choices.

Use `triton_grouped` for normal serving, benchmarking, and performance reports. The older `triton` path and the `naive` path are kept for compatibility and debugging; they are not recommended for optimized runs.

## Paged Attention

Paged attention stores KV cache in pages instead of requiring a single contiguous region per request. The scheduler and KV cache manager use page tables to map request positions to cache storage.

| Key | How to set it | What it does |
| --- | --- | --- |
| `page_size` | Use `4`, `8`, `16`, or `32` for most models; `diffusion_gemma` uses `256`. | Sets the KV cache page size used by paged attention. |
| `block_size` | Keep it less than or equal to `page_size`. | Keeps diffusion block layout compatible with KV cache pages. |
| `kv_cache_layout` | Use `unified` unless a strategy or experiment needs `distinct`. | Chooses how cache storage is organized for attention. |

## Chunked Prefill

Chunked prefill splits long prefill work into smaller chunks so the engine can respect token budgets and cache constraints. Strategy-specific model runners prepare chunked prefill tensors and attention metadata.

## Related Arguments

| Surface | Names | Notes |
| --- | --- | --- |
| Python/config | `attn_impl`, `page_size`, `kv_cache_layout` | Primary configuration fields for attention and cache layout. |
| CLI | `--attn-impl`, `--page-size`, `--kv-cache-layout` | Use for serving or benchmark overrides. |
| Kernel package | `diffulex_kernel` | Provides the lower-level optimized attention and cache helpers. |
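
## Illustrative Sketch

The constraints and mappings described on this page can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation. `AttentionConfig`, `locate`, `split_prefill`, and the `chunk_size` parameter are hypothetical names introduced here; only the field names (`attn_impl`, `page_size`, `block_size`, `kv_cache_layout`) and their allowed values come from the tables above. The sketch assumes a page table is a simple list in which logical page `i` of a request maps to a physical page index.

```python
from dataclasses import dataclass

# Backend and layout choices named on this page.
ATTN_IMPLS = ("triton", "triton_grouped", "naive")
KV_CACHE_LAYOUTS = ("unified", "distinct")


@dataclass
class AttentionConfig:
    """Hypothetical stand-in for the real config; only the field names follow this page."""

    attn_impl: str = "triton_grouped"
    page_size: int = 32
    block_size: int = 32
    kv_cache_layout: str = "unified"

    def __post_init__(self) -> None:
        if self.attn_impl not in ATTN_IMPLS:
            raise ValueError(f"unknown attn_impl: {self.attn_impl!r}")
        if self.kv_cache_layout not in KV_CACHE_LAYOUTS:
            raise ValueError(f"unknown kv_cache_layout: {self.kv_cache_layout!r}")
        # The table above asks for block_size <= page_size.
        if self.block_size > self.page_size:
            raise ValueError("block_size must be less than or equal to page_size")


def locate(page_table: list[int], position: int, page_size: int) -> tuple[int, int]:
    """Map a request's token position to (physical_page, offset_in_page)."""
    logical_page, offset = divmod(position, page_size)
    return page_table[logical_page], offset


def split_prefill(num_tokens: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a prefill of num_tokens into half-open (start, end) ranges of at most chunk_size."""
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


cfg = AttentionConfig(page_size=16, block_size=8)
page_table = [7, 2, 9]  # assumed: logical pages 0, 1, 2 live in physical pages 7, 2, 9
print(locate(page_table, 37, cfg.page_size))  # (9, 5)
print(split_prefill(100, 32))  # [(0, 32), (32, 64), (64, 96), (96, 100)]
```

Because the page table adds one level of indirection, a request's pages do not need to be adjacent in cache storage. The real scheduler, KV cache manager, and model runners track additional state (attention metadata, token budgets, cache availability) that this sketch leaves out.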