# diffulex.attention

`diffulex.attention` is the boundary between strategy-specific attention
metadata and the attention kernels used by model layers. Strategy model runners
prepare metadata for each engine step, then the attention layer reads that
metadata through the package-level fetch hook.

This package should stay small. New decoding strategies usually add their own
metadata subclasses under `diffulex.strategy.*.attention`; shared backend
selection and metadata plumbing belong here.

| Module | Role |
| --- | --- |
| `diffulex.attention.attn_impl` | Implements the `Attention` module and dispatches to reference, Triton, or grouped Triton attention paths. |
| `diffulex.attention.metadata` | Defines the shared metadata base class and global fetch/warmup helpers used by attention layers. |

## diffulex.attention.attn_impl

This module owns the common attention layer interface used by model
implementations. It keeps backend-specific calls behind a single `Attention`
module so model code can pass hidden states, QKV projections, and cache tensors
without directly choosing a kernel.

| Symbol | Purpose |
| --- | --- |
| `Attention` | PyTorch module that selects the configured attention implementation and consumes the current attention metadata. |
| `reference_torch_attention` | Debug/reference implementation for correctness checks. |
| `triton_attention` | Optimized attention path for the standard metadata layout. |
| `triton_grouped_attention` | Optimized grouped attention path for grouped metadata layouts. |

Use `triton_grouped` when measuring throughput or reporting performance. The
plain `triton` and reference paths are retained for compatibility and debugging.

## diffulex.attention.metadata

This module defines the metadata contract shared by attention layers and
strategy model runners. A strategy-specific runner installs a fetch function
before execution; the attention layer reads the current metadata through that
function during forward passes.

| Symbol | Purpose |
| --- | --- |
| `AttnMetaDataBase` | Base dataclass for prefill/decode lengths, page tables, slot mapping, context lengths, page size, block size, and cache layout. |
| `set_fetch_fn_for_attn_metadata` | Installs the strategy-specific metadata fetch function. |
| `set_warming_up` / `is_warming_up` / `reset_warming_up` | Track CUDA graph warmup state so attention code can distinguish warmup from normal execution. |

When adding a new strategy, define the strategy-specific metadata subclass in
the strategy package and keep only shared metadata mechanics in this module.