The Design

This section describes the major internal boundaries in Diffulex. The design is registry-driven: each decoding strategy selects its own implementations of request state, scheduling, KV cache management, model execution, and attention metadata without changing the public engine API.

Engine Architecture

The public Diffulex symbol constructs a DiffulexEngine. Engine startup follows this sequence:

  1. Build and validate diffulex.config.Config.

  2. Compute the requested parallel world size.

  3. Spawn model runner workers for nonzero ranks.

  4. Load the tokenizer and synchronize tokenizer-derived fields.

  5. Construct the rank-0 model runner.

  6. Construct the strategy-specific scheduler.

The engine owns request submission, stepping, output recording, worker cleanup, and profiling lifecycle. It delegates strategy-specific behavior to registered components.
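
The startup sequence above can be sketched as a small, self-contained model that only records the order of operations. SketchConfig, SketchEngine, the field names, and the world-size formula (a product of two parallel degrees) are all hypothetical and chosen for illustration; they are not the actual Diffulex classes or signatures.

```python
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    # Hypothetical fields; the real diffulex.config.Config may differ.
    tensor_parallel_size: int = 1
    data_parallel_size: int = 1
    eos_token_id: int = -1  # filled in from the tokenizer during startup

    def validate(self) -> None:
        if min(self.tensor_parallel_size, self.data_parallel_size) < 1:
            raise ValueError("parallel sizes must be >= 1")


@dataclass
class SketchEngine:
    config: SketchConfig
    events: list = field(default_factory=list)

    def start(self, tokenizer_eos_id: int) -> None:
        # 1. Build and validate the config.
        self.config.validate()
        self.events.append("config")
        # 2. Compute the world size (assumed here to be a simple product).
        world_size = (self.config.tensor_parallel_size
                      * self.config.data_parallel_size)
        self.events.append(f"world_size={world_size}")
        # 3. Spawn workers for nonzero ranks only.
        for rank in range(1, world_size):
            self.events.append(f"spawn_worker:{rank}")
        # 4. Load the tokenizer and copy tokenizer-derived fields.
        self.config.eos_token_id = tokenizer_eos_id
        self.events.append("tokenizer")
        # 5. Rank 0 runs inside the engine process.
        self.events.append("runner:0")
        # 6. The scheduler is constructed last.
        self.events.append("scheduler")


engine = SketchEngine(SketchConfig(tensor_parallel_size=2))
engine.start(tokenizer_eos_id=7)
print(engine.events)
```

In this sketch, workers for nonzero ranks are started before the tokenizer is loaded, and the rank-0 runner is built in the engine process, mirroring the order listed above.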

Request Flow

add_request tokenizes string prompts, creates a request object through AutoReq, attaches page-size metadata to it, and adds the request to the scheduler.

step runs one scheduler/model/sampler iteration:

  1. the scheduler returns executable requests and whether the step is prefill;

  2. requests are prepared for execution;

  3. the model runner executes and returns sample output;

  4. the scheduler postprocesses request state;

  5. finished request IDs are evicted from sampler state.

generate repeats this loop until the scheduler reports completion.
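
The loop described above can be illustrated with a toy model. Everything in this sketch is a stand-in: SketchRequest, SketchScheduler, run_model, and the prefill-before-decode policy are assumptions made for illustration, not the real Diffulex scheduler or model runner interfaces.

```python
from dataclasses import dataclass


@dataclass
class SketchRequest:
    req_id: int
    remaining: int  # units of work left before the request finishes


class SketchScheduler:
    def __init__(self):
        self.waiting = []
        self.running = []

    def add(self, req):
        self.waiting.append(req)

    def is_finished(self):
        return not self.waiting and not self.running

    def schedule(self):
        # Toy policy: prefill all waiting requests first, otherwise decode.
        if self.waiting:
            batch, self.waiting = self.waiting, []
            self.running.extend(batch)
            return batch, True
        return list(self.running), False

    def postprocess(self, reqs, sample_output):
        finished = []
        for req in reqs:
            req.remaining -= sample_output[req.req_id]
            if req.remaining <= 0:
                finished.append(req.req_id)
        self.running = [r for r in self.running if r.req_id not in finished]
        return finished


def run_model(reqs, is_prefill):
    # Stand-in for the model runner: every request advances by one unit.
    return {req.req_id: 1 for req in reqs}


def step(scheduler, sampler_state):
    reqs, is_prefill = scheduler.schedule()                 # 1. schedule
    for req in reqs:                                        # 2. prepare
        sampler_state.setdefault(req.req_id, {})
    sample_output = run_model(reqs, is_prefill)             # 3. execute
    finished = scheduler.postprocess(reqs, sample_output)   # 4. postprocess
    for req_id in finished:                                 # 5. evict
        sampler_state.pop(req_id, None)
    return finished, is_prefill


def generate(scheduler):
    sampler_state, order = {}, []
    while not scheduler.is_finished():
        finished, _ = step(scheduler, sampler_state)
        order.extend(finished)
    return order, sampler_state
```

A short usage example: adding SketchRequest(0, remaining=2) and SketchRequest(1, remaining=1) and calling generate finishes request 1 on the prefill step and request 0 on the following decode step, leaving the sampler state empty.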

Scheduler

The scheduler decides which requests can prefill, decode, append blocks, preempt, abort, or finish during each engine step. Strategy templates provide common lifecycle operations, while concrete strategies define the exact policy.

Data-parallel scheduling is handled separately from model-parallel execution. The scheduler must respect memory and token budgets such as max_num_reqs, max_num_batched_tokens, and max_model_len.
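
As a rough illustration of how such budgets can interact, the sketch below admits prompts into one prefill batch. The budget names come from the text above, but their exact semantics here are assumptions: max_num_reqs caps the batch size, max_num_batched_tokens caps the total tokens in the batch, and max_model_len rejects prompts that could never fit. The admit function itself is hypothetical.

```python
def admit(prompt_lens, max_num_reqs, max_num_batched_tokens, max_model_len):
    """Return (admitted, rejected) indices for one prefill batch."""
    admitted, rejected = [], []
    used_tokens = 0
    for idx, num_tokens in enumerate(prompt_lens):
        if num_tokens > max_model_len:
            # Too long for the model regardless of batch state.
            rejected.append(idx)
            continue
        if len(admitted) >= max_num_reqs:
            break  # request budget exhausted
        if used_tokens + num_tokens > max_num_batched_tokens:
            break  # token budget exhausted; later requests wait
        admitted.append(idx)
        used_tokens += num_tokens
    return admitted, rejected


print(admit([4, 100, 5, 6], max_num_reqs=3,
            max_num_batched_tokens=10, max_model_len=16))
# → ([0, 2], [1])
```

The sketch stops at the first request that does not fit, which preserves arrival order; a real policy may instead skip ahead, preempt, or account for KV cache pages as well.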

KV Cache and Paged Attention

KV cache managers track page allocation, append rules, prefix reuse, and layout metadata consumed by attention kernels. Diffulex supports unified and distinct KV cache layouts; the chosen layout must match what the attention metadata and kernels expect.

Paged attention lets the scheduler manage cache blocks without requiring every request to occupy one contiguous memory region.
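
The idea can be shown with a minimal page table. PagePool and its methods are hypothetical names used only for this sketch; the actual KV cache managers also handle prefix reuse and layout metadata, which are omitted here.

```python
from collections import deque


class PagePool:
    """Toy page allocator: each request owns a list of page ids."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free = deque(range(num_pages))
        self.tables = {}   # req_id -> list of page ids, in logical order
        self.lengths = {}  # req_id -> number of cached tokens

    def append_tokens(self, req_id, num_tokens):
        table = self.tables.get(req_id, [])
        length = self.lengths.get(req_id, 0) + num_tokens
        # Ceiling division gives the pages required for the new length.
        needed = -(-length // self.page_size) - len(table)
        if needed > len(self.free):
            raise MemoryError("not enough free pages")
        for _ in range(needed):
            table.append(self.free.popleft())
        self.tables[req_id] = table
        self.lengths[req_id] = length

    def slot(self, req_id, pos):
        # Map a logical token position to a physical cache slot.
        page = self.tables[req_id][pos // self.page_size]
        return page * self.page_size + pos % self.page_size

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = PagePool(num_pages=4, page_size=4)
pool.append_tokens("a", 5)   # takes pages 0 and 1
pool.append_tokens("b", 3)   # takes page 2
pool.append_tokens("a", 4)   # grows "a" with page 3, not adjacent to page 1
print(pool.tables["a"], pool.slot("a", 8))
# → [0, 1, 3] 12
```

Request "a" ends up on pages 0, 1, and 3, so its cache is not contiguous, yet slot still resolves every logical position through the page table.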

Model Runner

Model runners prepare tensors, set attention metadata, execute model forward passes, invoke samplers, and optionally capture CUDA graphs. Strategy-specific model runners are the main boundary for changing tensor layout or attention semantics.

Registries

Diffulex uses registries for extensibility:

  • AutoReq

  • AutoScheduler

  • AutoKVCacheManager

  • AutoModelRunner

  • AutoModelForDiffusionLM

  • AutoSampler

Importing strategy, model, or sampler modules triggers decorators that populate these registries.
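
A decorator-populated registry of this kind can be sketched in a few lines. The Registry class, the register and create method names, and the "toy_strategy" key are assumptions for illustration; the real AutoScheduler and related registries may expose a different interface.

```python
class Registry:
    """Toy name-to-class registry filled in by a decorator at import time."""

    def __init__(self, kind):
        self.kind = kind
        self._entries = {}

    def register(self, name):
        def decorator(cls):
            if name in self._entries:
                raise ValueError(f"{self.kind} {name!r} is already registered")
            self._entries[name] = cls
            return cls  # leave the decorated class unchanged
        return decorator

    def create(self, name, *args, **kwargs):
        try:
            cls = self._entries[name]
        except KeyError:
            raise KeyError(
                f"unknown {self.kind} {name!r}; was its module imported?"
            ) from None
        return cls(*args, **kwargs)


# Hypothetical registry instance standing in for something like AutoScheduler.
SketchAutoScheduler = Registry("scheduler")


@SketchAutoScheduler.register("toy_strategy")
class ToyScheduler:
    def __init__(self, max_num_reqs):
        self.max_num_reqs = max_num_reqs


scheduler = SketchAutoScheduler.create("toy_strategy", max_num_reqs=8)
print(type(scheduler).__name__)
# → ToyScheduler
```

Because registration happens as a side effect of executing the decorator, a strategy whose module was never imported is simply absent from the registry, which is why the lookup error in this sketch hints at a missing import.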

Kernels

Kernel modules provide optimized attention, KV cache, top-k, and MoE operations used by the engine and model layers. Kernel changes should be covered by focused numerical tests before they are wired into a strategy path.
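
A focused numerical test usually compares an optimized path against a simple reference on random inputs. The sketch below does this in pure Python for single-query softmax attention, using a blockwise online-softmax routine as a stand-in for an optimized kernel. The function names and the tolerance are assumptions for illustration; a real Diffulex kernel test would compare GPU tensors and likely use dtype-dependent tolerances.

```python
import math
import random


def reference_attention(q, keys, values):
    """Plain softmax attention for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, keys, values, block_size):
    """Same result computed block by block with an online softmax."""
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums; exp(-inf) is 0.0 for the first block.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


def max_abs_error(seed, num_keys=37, dim=8, block_size=16):
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(dim)]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_keys)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_keys)]
    ref = reference_attention(q, keys, values)
    out = blockwise_attention(q, keys, values, block_size)
    return max(abs(x - y) for x, y in zip(ref, out))


for seed in range(5):
    assert max_abs_error(seed) < 1e-9
```

The key count of 37 is deliberately not a multiple of the block size, so the final partial block is exercised as well.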