# The Design

This section describes the major internal boundaries in Diffulex. The design is
registry-driven: the selected decoding strategy determines which request-state,
scheduling, KV cache management, model execution, and attention-metadata
implementations are used, without changing the public engine API.

## Engine Architecture

The public `Diffulex` symbol constructs `DiffulexEngine`. Engine startup follows
this sequence:

1. Build and validate `diffulex.config.Config`.
2. Compute the requested parallel world size.
3. Spawn model runner workers for nonzero ranks.
4. Load the tokenizer and synchronize tokenizer-derived fields.
5. Construct the rank-0 model runner.
6. Construct the strategy-specific scheduler.

The engine owns request submission, stepping, output recording, worker cleanup,
and profiling lifecycle. It delegates strategy-specific behavior to registered
components.

## Request Flow

`add_request` tokenizes string prompts, creates a request object through
`AutoReq`, assigns page size metadata, and adds the request to the scheduler.

`step` runs one scheduler/model/sampler iteration:

1. the scheduler returns executable requests and whether the step is prefill;
2. requests are prepared for execution;
3. the model runner executes and returns sample output;
4. the scheduler postprocesses request state;
5. finished request IDs are evicted from sampler state.

`generate` repeats this loop until the scheduler reports completion.
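
The loop described above can be sketched with a toy engine. This is a minimal illustration under stated assumptions, not the Diffulex implementation: `ToyRequest`, `ToyScheduler`, and `ToyEngine` are hypothetical names, the "model" is a stand-in that emits the previous token plus one, and the prefill-first policy is assumed for simplicity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:  # hypothetical stand-in for a strategy-specific request
    req_id: int
    prompt_ids: list
    max_new_tokens: int
    output_ids: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_ids) >= self.max_new_tokens


class ToyScheduler:  # hypothetical; real policies are strategy-defined
    def __init__(self):
        self.waiting = deque()
        self.running = []

    def add(self, req):
        self.waiting.append(req)

    def schedule(self):
        # Assumed policy: prefill all waiting requests first, else decode.
        if self.waiting:
            batch = list(self.waiting)
            self.waiting.clear()
            return batch, True
        return list(self.running), False

    def postprocess(self, batch, sampled, is_prefill):
        finished_ids = []
        for req, token in zip(batch, sampled):
            if is_prefill:
                self.running.append(req)
            req.output_ids.append(token)
            if req.finished:
                self.running.remove(req)
                finished_ids.append(req.req_id)
        return finished_ids

    def is_finished(self):
        return not self.waiting and not self.running


class ToyEngine:
    def __init__(self):
        self.scheduler = ToyScheduler()
        self.sampler_state = {}  # per-request sampler bookkeeping
        self._next_id = 0

    def add_request(self, prompt_ids, max_new_tokens):
        req = ToyRequest(self._next_id, list(prompt_ids), max_new_tokens)
        self._next_id += 1
        self.sampler_state[req.req_id] = {}
        self.scheduler.add(req)
        return req

    def step(self):
        batch, is_prefill = self.scheduler.schedule()
        # Stand-in for model execution and sampling: last token + 1.
        sampled = [(r.output_ids or r.prompt_ids)[-1] + 1 for r in batch]
        finished_ids = self.scheduler.postprocess(batch, sampled, is_prefill)
        for req_id in finished_ids:
            self.sampler_state.pop(req_id, None)  # evict finished requests
        return finished_ids

    def generate(self):
        while not self.scheduler.is_finished():
            self.step()


engine = ToyEngine()
first = engine.add_request([1, 2, 3], max_new_tokens=3)
second = engine.add_request([10], max_new_tokens=1)
engine.generate()
print(first.output_ids, second.output_ids)
```

Keeping `step` as the only place where scheduling, execution, and postprocessing meet means `generate` needs no knowledge of any particular strategy.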

## Scheduler

During each engine step, the scheduler decides which requests prefill, decode,
or append blocks, and which are preempted, aborted, or finished. Strategy
templates provide common lifecycle operations, while concrete strategies define
the exact policy.

Data-parallel scheduling is handled separately from model-parallel execution.
The scheduler must respect memory and token budgets such as `max_num_reqs`,
`max_num_batched_tokens`, and `max_model_len`.
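
As a rough illustration of how such budgets interact, the sketch below admits requests in arrival order until a limit is hit. The function `select_batch` and its first-come, first-served policy are assumptions made for this example; only the three budget names come from the text above.

```python
def select_batch(pending_lens, max_num_reqs, max_num_batched_tokens, max_model_len):
    """Pick request indices for one step under the three budgets.

    pending_lens: token counts of pending requests, in arrival order.
    Returns (admitted indices, rejected indices, tokens used).
    """
    admitted, rejected, used = [], [], 0
    for idx, num_tokens in enumerate(pending_lens):
        if num_tokens > max_model_len:
            # The request can never fit the model context; reject it outright.
            rejected.append(idx)
            continue
        if len(admitted) >= max_num_reqs or used + num_tokens > max_num_batched_tokens:
            # Assumed FIFO policy: stop at the first request that does not fit.
            break
        admitted.append(idx)
        used += num_tokens
    return admitted, rejected, used


admitted, rejected, used = select_batch(
    [4, 100, 6, 8], max_num_reqs=3, max_num_batched_tokens=12, max_model_len=32
)
print(admitted, rejected, used)
```

Here the second request exceeds `max_model_len` and is rejected, while the fourth is deferred because admitting it would exceed `max_num_batched_tokens`.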

## KV Cache and Paged Attention

KV cache managers track page allocation, append rules, prefix reuse, and layout
metadata consumed by attention kernels. Diffulex supports `unified` and
`distinct` KV cache layouts; the layout must match attention metadata and kernel
expectations.

Paged attention lets the scheduler manage cache blocks without requiring every
request to occupy one contiguous memory region.
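
The idea can be illustrated with a toy page allocator that maps each request's logical token positions to physical cache slots through a per-request page table. `ToyPageAllocator` and its methods are hypothetical and omit prefix reuse and layout metadata; this is a sketch of the general paging technique, not of the Diffulex KV cache managers.

```python
class ToyPageAllocator:
    """Hypothetical fixed-size page allocator for a KV cache."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        # Reversed so that pop() hands out page 0 first.
        self.free_pages = list(range(num_pages - 1, -1, -1))
        self.page_tables = {}  # req_id -> list of physical page ids
        self.num_tokens = {}   # req_id -> tokens stored so far

    def append_tokens(self, req_id, n):
        table = self.page_tables.setdefault(req_id, [])
        total = self.num_tokens.get(req_id, 0) + n
        # Ceiling division gives the pages required for `total` tokens.
        needed = -(-total // self.page_size) - len(table)
        if needed > len(self.free_pages):
            raise MemoryError("out of KV cache pages")
        for _ in range(needed):
            table.append(self.free_pages.pop())
        self.num_tokens[req_id] = total

    def slot(self, req_id, token_idx):
        # Translate a logical token position into a physical cache slot.
        page = self.page_tables[req_id][token_idx // self.page_size]
        return page * self.page_size + token_idx % self.page_size

    def free(self, req_id):
        self.free_pages.extend(reversed(self.page_tables.pop(req_id, [])))
        self.num_tokens.pop(req_id, None)


alloc = ToyPageAllocator(num_pages=4, page_size=4)
alloc.append_tokens("a", 5)  # request "a" takes pages 0 and 1
alloc.append_tokens("b", 3)  # request "b" takes page 2
alloc.append_tokens("a", 4)  # "a" grows into page 3, which is not adjacent
print(alloc.page_tables["a"], alloc.slot("a", 8))
```

Request `"a"` ends up on pages 0, 1, and 3, so its tokens are not contiguous in physical memory, yet `slot` still resolves every logical position.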

## Model Runner

Model runners prepare tensors, set attention metadata, execute model forward
passes, invoke samplers, and optionally capture CUDA graphs. Strategy-specific
model runners are the main boundary for changing tensor layout or attention
semantics.

## Registries

Diffulex uses registries for extensibility:

- `AutoReq`
- `AutoScheduler`
- `AutoKVCacheManager`
- `AutoModelRunner`
- `AutoModelForDiffusionLM`
- `AutoSampler`

Importing strategy, model, or sampler modules triggers decorators that populate
these registries.
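
A decorator-populated registry of this kind can be sketched as follows. `ToyRegistry`, `TOY_SCHEDULERS`, and the strategy name `"toy_strategy"` are hypothetical; the actual registry classes listed above may expose different method names and signatures.

```python
class ToyRegistry:
    """Hypothetical name-to-class registry populated by decorators."""

    def __init__(self, kind):
        self.kind = kind
        self._entries = {}

    def register(self, name):
        def decorator(cls):
            if name in self._entries:
                raise ValueError(f"{self.kind} {name!r} is already registered")
            self._entries[name] = cls
            return cls  # return the class unchanged so it stays usable directly
        return decorator

    def create(self, name, *args, **kwargs):
        try:
            cls = self._entries[name]
        except KeyError:
            known = sorted(self._entries)
            raise KeyError(f"no {self.kind} registered for {name!r}; known: {known}") from None
        return cls(*args, **kwargs)


TOY_SCHEDULERS = ToyRegistry("scheduler")


# Registration runs when the module defining the class is imported.
@TOY_SCHEDULERS.register("toy_strategy")
class ToyStrategyScheduler:
    def __init__(self, max_num_reqs):
        self.max_num_reqs = max_num_reqs


scheduler = TOY_SCHEDULERS.create("toy_strategy", max_num_reqs=8)
print(type(scheduler).__name__)
```

Because registration happens as an import side effect, a strategy that is never imported is never registered, which is why a lookup failure that lists the known names is useful when debugging.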

## Kernels

Kernel modules provide optimized attention, KV cache, top-k, and MoE operations
used by the engine and model layers. Kernel changes should be covered by focused
numerical tests before they are wired into a strategy path.
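
One common shape for such a test is to compare the optimized routine against a simple reference within a tolerance. The sketch below uses pure-Python stand-ins: `reference_softmax` is a two-pass softmax and `online_softmax` is a single-pass running-max variant of the kind used inside fused attention kernels. Both functions and the tolerances are assumptions for illustration; a real kernel test would compare device output against a trusted reference in the same way.

```python
import math
import random


def reference_softmax(xs):
    # Two-pass reference: subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax(xs):
    # Single pass over the inputs: rescale the running sum whenever
    # the running maximum increases.
    m, s = float("-inf"), 0.0
    for x in xs:
        new_m = max(m, x)
        s = s * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return [math.exp(x - m) / s for x in xs]


def all_close(a, b, rel_tol=1e-9, abs_tol=1e-12):
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


rng = random.Random(0)  # fixed seed keeps the test reproducible
cases = [[rng.uniform(-50.0, 50.0) for _ in range(64)] for _ in range(20)]
cases.append([1000.0, 1001.0, 999.0])  # large inputs that overflow a naive exp
results = [all_close(online_softmax(c), reference_softmax(c)) for c in cases]
print(all(results))
```

Including inputs with large magnitude matters because an implementation that omits the maximum subtraction would overflow on them while still passing on small random values.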