Model Loading and Configuration Walkthrough¶

This tutorial walks through the path from user configuration to a constructed Diffulex engine. It focuses on what happens when you call Diffulex(...) and which configuration fields affect the early load path.

Starting Point¶

The public API is intentionally small:

from diffulex import Diffulex, SamplingParams

llm = Diffulex(
    model="/path/to/LLaDA2.0-mini",
    model_name="llada2_mini",
    decoding_strategy="multi_bd",
    sampling_mode="naive",
    tensor_parallel_size=1,
    data_parallel_size=1,
)

Diffulex returns a DiffulexEngine instance. The constructor separates keyword arguments that match diffulex.config.Config fields and ignores unrelated keywords. This keeps the public constructor aligned with the engine config without requiring a separate wrapper class.

Configuration Creation¶

The first major step is building Config(model, **config_kwargs). The model path must be an existing local directory. Diffulex then validates model family, decoding strategy, sampling mode, page and block sizes, cache layout, parallel topology, LoRA settings, and runtime optimization flags.

Important model-specific behavior happens here:

Condition	Normalized behavior
`decoding_strategy="d2f"`	Forces `multi_block_prefix_full=True` and disables prefix caching.
`decoding_strategy="multi_bd"`	Forces `multi_block_prefix_full=False`.
`decoding_strategy="dmax"`	Forces `multi_block_prefix_full=False` and requires an edit-sampling model with `sampling_mode="edit"`.
`model_name="diffusion_gemma"`	Uses the native `diffusion_gemma` strategy defaults, `block_size=256`, `page_size=256`, and `buffer_size=1`.

If a validation error is raised, fix the configuration before debugging model weights or kernels.

Tokenizer and HF Config¶

After config validation, the engine loads the tokenizer with auto_tokenizer_from_pretrained. It records the tokenizer vocabulary size and EOS token ID on the config. If the tokenizer exposes mask_token_id, Diffulex uses that value instead of the default mask token.

Config also loads the Hugging Face config through AutoConfig.from_pretrained with trust_remote_code=True. The effective max_model_len is clamped to the model config’s maximum sequence length.

Worker Processes¶

Diffulex computes the model-parallel world size from tensor, expert, and data parallel sizes. Rank 0 runs in the main process. Additional ranks are spawned as worker processes with Python multiprocessing. Each worker constructs a model runner from the same validated config.

If startup fails, the engine calls exit() to clean up worker processes before re-raising the exception.

Strategy Components¶

The decoding strategy selects several registered components:

request state through AutoReq;
scheduler through AutoScheduler;
KV cache manager through AutoKVCacheManager;
model runner through AutoModelRunner;
attention metadata functions through the strategy model runner.

Built-in strategies are imported from diffulex.strategy, which triggers their registry decorators.

First Generation¶

Once the engine is constructed, use generate for normal offline inference:

outputs = llm.generate(
    ["Solve: 12 + 30."],
    SamplingParams(temperature=0.0, max_tokens=32),
)

for output in outputs.trajectories:
    print(output.text)

Call llm.exit() when the process is done with the engine.

Practical Debugging¶

Use a tiny prompt, tensor_parallel_size=1, and data_parallel_size=1 while validating a new model family. Once the model loads and one request completes, increase parallelism and batch limits.