Model Loading and Configuration Walkthrough¶
This tutorial walks through the path from user configuration to a constructed
Diffulex engine. It focuses on what happens when you call Diffulex(...) and
which configuration fields affect the early load path.
Starting Point¶
The public API is intentionally small:
from diffulex import Diffulex, SamplingParams
llm = Diffulex(
model="/path/to/LLaDA2.0-mini",
model_name="llada2_mini",
decoding_strategy="multi_bd",
sampling_mode="naive",
tensor_parallel_size=1,
data_parallel_size=1,
)
Diffulex returns a DiffulexEngine instance. The constructor separates keyword
arguments that match diffulex.config.Config fields and ignores unrelated
keywords. This keeps the public constructor aligned with the engine config
without requiring a separate wrapper class.
Configuration Creation¶
The first major step is building Config(model, **config_kwargs). The model
path must be an existing local directory. Diffulex then validates model family,
decoding strategy, sampling mode, page and block sizes, cache layout, parallel
topology, LoRA settings, and runtime optimization flags.
Important model-specific behavior happens here:
Condition |
Normalized behavior |
|---|---|
|
Forces |
|
Forces |
|
Forces |
|
Uses the native |
If a validation error is raised, fix the configuration before debugging model weights or kernels.
Tokenizer and HF Config¶
After config validation, the engine loads the tokenizer with
auto_tokenizer_from_pretrained. It records the tokenizer vocabulary size and
EOS token ID on the config. If the tokenizer exposes mask_token_id, Diffulex
uses that value instead of the default mask token.
Config also loads the Hugging Face config through AutoConfig.from_pretrained
with trust_remote_code=True. The effective max_model_len is clamped to the
model config’s maximum sequence length.
Worker Processes¶
Diffulex computes the model-parallel world size from tensor, expert, and data parallel sizes. Rank 0 runs in the main process. Additional ranks are spawned as worker processes with Python multiprocessing. Each worker constructs a model runner from the same validated config.
If startup fails, the engine calls exit() to clean up worker processes before
re-raising the exception.
Strategy Components¶
The decoding strategy selects several registered components:
request state through
AutoReq;scheduler through
AutoScheduler;KV cache manager through
AutoKVCacheManager;model runner through
AutoModelRunner;attention metadata functions through the strategy model runner.
Built-in strategies are imported from diffulex.strategy, which triggers their
registry decorators.
First Generation¶
Once the engine is constructed, use generate for normal offline inference:
outputs = llm.generate(
["Solve: 12 + 30."],
SamplingParams(temperature=0.0, max_tokens=32),
)
for output in outputs.trajectories:
print(output.text)
Call llm.exit() when the process is done with the engine.
Practical Debugging¶
Use a tiny prompt, tensor_parallel_size=1, and data_parallel_size=1 while
validating a new model family. Once the model loads and one request completes,
increase parallelism and batch limits.