Model Loading and Configuration Walkthrough

This tutorial walks through the path from user configuration to a constructed Diffulex engine. It focuses on what happens when you call Diffulex(...) and which configuration fields affect the early load path.

Starting Point

The public API is intentionally small:

from diffulex import Diffulex, SamplingParams

llm = Diffulex(
    model="/path/to/LLaDA2.0-mini",
    model_name="llada2_mini",
    decoding_strategy="multi_bd",
    sampling_mode="naive",
    tensor_parallel_size=1,
    data_parallel_size=1,
)

Diffulex(...) returns a DiffulexEngine instance. The constructor forwards the keyword arguments that match diffulex.config.Config fields to the config and ignores the rest. This keeps the public constructor aligned with the engine config without requiring a separate wrapper class.
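The keyword split can be pictured with a small sketch. It assumes Config behaves like a dataclass, which this walkthrough does not confirm; ToyConfig and split_config_kwargs are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyConfig:
    # Hypothetical stand-in for diffulex.config.Config; the real class has more fields.
    model: str
    decoding_strategy: str = "multi_bd"
    tensor_parallel_size: int = 1


def split_config_kwargs(kwargs):
    """Return (matching, ignored) keyword arguments for ToyConfig."""
    # "model" is passed positionally, so it is excluded from the keyword set.
    known = {f.name for f in fields(ToyConfig)} - {"model"}
    matching = {k: v for k, v in kwargs.items() if k in known}
    ignored = {k: v for k, v in kwargs.items() if k not in known}
    return matching, ignored


config_kwargs, ignored = split_config_kwargs(
    {"decoding_strategy": "d2f", "tensor_parallel_size": 2, "not_a_config_field": True}
)
config = ToyConfig("/path/to/model", **config_kwargs)
```

If unknown keywords are indeed dropped without a warning, a misspelled option name would not raise an error, so it is worth double-checking field names when a setting appears to have no effect.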

Configuration Creation

The first major step is building Config(model, **config_kwargs). The model path must be an existing local directory. Diffulex then validates model family, decoding strategy, sampling mode, page and block sizes, cache layout, parallel topology, LoRA settings, and runtime optimization flags.
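The local-directory requirement means a hub-style identifier is not enough on its own; the weights need to be on disk first. A minimal sketch of such a check follows. The helper name check_model_dir and the use of ValueError are assumptions, not Diffulex's actual code.

```python
import os
import tempfile


def check_model_dir(model: str) -> str:
    # Hypothetical helper: reject anything that is not an existing local directory.
    if not os.path.isdir(model):
        raise ValueError(f"model must be an existing local directory, got {model!r}")
    return model


# An existing directory passes the check.
with tempfile.TemporaryDirectory() as tmp:
    checked = check_model_dir(tmp)

# A path that does not exist is rejected before any weights are touched.
try:
    check_model_dir("/path/that/does/not/exist")
    rejected = False
except ValueError:
    rejected = True
```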

Important model-specific behavior happens here:

| Condition | Normalized behavior |
| --- | --- |
| decoding_strategy="d2f" | Forces multi_block_prefix_full=True and disables prefix caching. |
| decoding_strategy="multi_bd" | Forces multi_block_prefix_full=False. |
| decoding_strategy="dmax" | Forces multi_block_prefix_full=False and requires an edit-sampling model with sampling_mode="edit". |
| model_name="diffusion_gemma" | Uses the native diffusion_gemma strategy defaults, block_size=256, page_size=256, and buffer_size=1. |

If a validation error is raised, fix the configuration before debugging model weights or kernels.
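The normalization rules listed above can be sketched as a function over a plain dict. This is an illustration under stated assumptions, not the real Config logic: enable_prefix_caching is a hypothetical field name, the check that the model itself supports edit sampling is omitted, and the diffusion_gemma values are treated as defaults that explicit user values override, which the source does not state.

```python
def normalize_strategy(cfg: dict) -> dict:
    """Apply the strategy rules to a copy of cfg (illustrative only)."""
    out = dict(cfg)
    strategy = out.get("decoding_strategy")

    if strategy == "d2f":
        out["multi_block_prefix_full"] = True
        out["enable_prefix_caching"] = False  # hypothetical field name
    elif strategy == "multi_bd":
        out["multi_block_prefix_full"] = False
    elif strategy == "dmax":
        out["multi_block_prefix_full"] = False
        if out.get("sampling_mode") != "edit":
            raise ValueError('decoding_strategy="dmax" requires sampling_mode="edit"')

    if out.get("model_name") == "diffusion_gemma":
        # Assumption: these act as defaults, so user-supplied values are kept.
        out.setdefault("block_size", 256)
        out.setdefault("page_size", 256)
        out.setdefault("buffer_size", 1)
    return out


d2f_cfg = normalize_strategy({"decoding_strategy": "d2f"})
gemma_cfg = normalize_strategy({"model_name": "diffusion_gemma"})
```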

Tokenizer and HF Config

After config validation, the engine loads the tokenizer with auto_tokenizer_from_pretrained. It records the tokenizer vocabulary size and EOS token ID on the config. If the tokenizer exposes mask_token_id, Diffulex uses that value instead of the default mask token.
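The mask-token fallback might look roughly like the sketch below. The helper name and the placeholder default are assumptions, and SimpleNamespace objects stand in for real tokenizers so the example runs without any model files.

```python
from types import SimpleNamespace

DEFAULT_MASK_TOKEN_ID = 0  # placeholder; the real default is not stated in this walkthrough


def resolve_mask_token_id(tokenizer, default=DEFAULT_MASK_TOKEN_ID):
    # Prefer the tokenizer's own mask token when it defines one.
    mask_id = getattr(tokenizer, "mask_token_id", None)
    return default if mask_id is None else mask_id


with_mask = SimpleNamespace(mask_token_id=12345)   # arbitrary illustrative ID
without_mask = SimpleNamespace(mask_token_id=None)  # no mask token defined

resolved_a = resolve_mask_token_id(with_mask)
resolved_b = resolve_mask_token_id(without_mask)
```

The explicit None check matters because a token ID of 0 is falsy in Python, so a plain `or` fallback would wrongly discard it.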

Config also loads the Hugging Face config through AutoConfig.from_pretrained with trust_remote_code=True. The effective max_model_len is clamped to the model config’s maximum sequence length.
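The clamp itself is a simple minimum. The sketch below assumes the model's limit is exposed as max_position_embeddings, which is common for Hugging Face configs but not universal; the function name is hypothetical.

```python
from types import SimpleNamespace


def effective_max_model_len(requested: int, hf_config) -> int:
    # Assumption: the limit lives in max_position_embeddings; some model
    # configs use a different attribute name.
    limit = getattr(hf_config, "max_position_embeddings", None)
    return requested if limit is None else min(requested, limit)


hf_config = SimpleNamespace(max_position_embeddings=4096)  # stand-in for a loaded HF config

clamped = effective_max_model_len(8192, hf_config)
unchanged = effective_max_model_len(2048, hf_config)
```

In practice this means a requested max_model_len larger than the model supports is reduced rather than rejected.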

Worker Processes

Diffulex computes the model-parallel world size from tensor, expert, and data parallel sizes. Rank 0 runs in the main process. Additional ranks are spawned as worker processes with Python multiprocessing. Each worker constructs a model runner from the same validated config.
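The rank layout described above can be sketched with the standard library. The world size is taken as an input here because the exact way it is derived from the three parallel sizes is not spelled out; run_worker and launch are hypothetical names, and a plain dict stands in for the validated config.

```python
import multiprocessing as mp


def worker_ranks(world_size: int) -> list[int]:
    # Rank 0 stays in the main process; every other rank gets its own process.
    return list(range(1, world_size))


def run_worker(rank: int, config: dict) -> None:
    # Stand-in for constructing a model runner from the shared validated config.
    print(f"rank {rank} starting with strategy {config['decoding_strategy']}")


def launch(world_size: int, config: dict) -> list[int]:
    procs = [
        mp.Process(target=run_worker, args=(rank, config))
        for rank in worker_ranks(world_size)
    ]
    for p in procs:
        p.start()
    run_worker(0, config)  # rank 0 runs in the main process
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]  # one exit code per spawned rank


if __name__ == "__main__":
    print(launch(3, {"decoding_strategy": "multi_bd"}))
```

Because every rank receives the same config, a value that only works on rank 0 (for example a path that exists on one machine only) tends to surface as a worker-side failure.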

If startup fails, the engine calls exit() to clean up worker processes before re-raising the exception.
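The cleanup-then-re-raise pattern is sketched below with a toy class. ToyEngine and the events list are illustrative only; the real engine would terminate worker processes inside exit().

```python
events = []  # records the order of operations for illustration


class ToyEngine:
    """Hypothetical engine used only to show cleanup on failed startup."""

    def __init__(self, fail: bool = False):
        events.append("start")
        try:
            if fail:
                raise RuntimeError("model runner failed to start")
        except Exception:
            self.exit()  # release whatever was already started
            raise        # re-raise the original exception unchanged

    def exit(self) -> None:
        events.append("exit")


try:
    ToyEngine(fail=True)
except RuntimeError as err:
    events.append(f"caught: {err}")
```

The caller still sees the original exception, so the traceback points at the real startup failure rather than at the cleanup code.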

Strategy Components

The decoding strategy selects several registered components:

  • request state through AutoReq;

  • scheduler through AutoScheduler;

  • KV cache manager through AutoKVCacheManager;

  • model runner through AutoModelRunner;

  • attention metadata functions through the strategy model runner.

Built-in strategies are imported from diffulex.strategy, which triggers their registry decorators.
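Registration by decorator at import time generally follows the shape below. This is a generic sketch of the pattern, not the Diffulex registry: SCHEDULERS, register_scheduler, and scheduler_for are hypothetical names.

```python
SCHEDULERS = {}  # maps a strategy name to its registered class


def register_scheduler(name):
    # Decorator that records the class under a strategy name when the module is imported.
    def decorator(cls):
        SCHEDULERS[name] = cls
        return cls
    return decorator


@register_scheduler("multi_bd")
class MultiBDScheduler:
    pass


def scheduler_for(strategy):
    # Look up the class registered for a strategy, with a readable error otherwise.
    try:
        return SCHEDULERS[strategy]
    except KeyError:
        known = sorted(SCHEDULERS)
        raise ValueError(f"no scheduler registered for {strategy!r}; known: {known}") from None


selected = scheduler_for("multi_bd")
```

A consequence of this design is that a strategy whose module is never imported is never registered, which is why the import of the strategy package matters.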

First Generation

Once the engine is constructed, use generate for normal offline inference:

outputs = llm.generate(
    ["Solve: 12 + 30."],
    SamplingParams(temperature=0.0, max_tokens=32),
)

for output in outputs.trajectories:
    print(output.text)

Call llm.exit() when the process is done with the engine.
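One way to make sure exit() runs even when generation raises is a small context manager. The helper engine_session is not part of Diffulex, and FakeEngine is a stand-in so the sketch runs without the library installed.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    # Yield the engine and always call exit(), even if the body raises.
    try:
        yield engine
    finally:
        engine.exit()


class FakeEngine:
    # Stand-in with the same two method names used in this walkthrough.
    def __init__(self):
        self.exited = False

    def generate(self, prompts, sampling_params=None):
        return [p.upper() for p in prompts]

    def exit(self):
        self.exited = True


engine = FakeEngine()
with engine_session(engine) as llm:
    outputs = llm.generate(["hello"])
```

With a real engine, the same wrapper would be used as `with engine_session(Diffulex(...)) as llm:`, assuming exit() is safe to call once after generation finishes.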

Practical Debugging

Use a tiny prompt, tensor_parallel_size=1, and data_parallel_size=1 while validating a new model family. Once the model loads and one request completes, increase parallelism and batch limits.