# Model Loading and Configuration Walkthrough

This tutorial walks through the path from user configuration to a constructed
`Diffulex` engine. It focuses on what happens when you call `Diffulex(...)` and
which configuration fields affect the early load path.

## Starting Point

The public API is intentionally small:

```python
from diffulex import Diffulex, SamplingParams

llm = Diffulex(
    model="/path/to/LLaDA2.0-mini",
    model_name="llada2_mini",
    decoding_strategy="multi_bd",
    sampling_mode="naive",
    tensor_parallel_size=1,
    data_parallel_size=1,
)
```

`Diffulex` returns a `DiffulexEngine` instance. The constructor separates keyword
arguments that match `diffulex.config.Config` fields and ignores unrelated
keywords. This keeps the public constructor aligned with the engine config
without requiring a separate wrapper class.

## Configuration Creation

The first major step is building `Config(model, **config_kwargs)`. The `model`
path must be an existing local directory. Diffulex then validates model family,
decoding strategy, sampling mode, page and block sizes, cache layout, parallel
topology, LoRA settings, and runtime optimization flags.

Important model-specific behavior happens here:

| Condition | Normalized behavior |
| --- | --- |
| `decoding_strategy="d2f"` | Forces `multi_block_prefix_full=True` and disables prefix caching. |
| `decoding_strategy="multi_bd"` | Forces `multi_block_prefix_full=False`. |
| `decoding_strategy="dmax"` | Forces `multi_block_prefix_full=False` and requires an edit-sampling model with `sampling_mode="edit"`. |
| `model_name="diffusion_gemma"` | Uses the native `diffusion_gemma` strategy defaults, `block_size=256`, `page_size=256`, and `buffer_size=1`. |

If a validation error is raised, fix the configuration before debugging model
weights or kernels.

## Tokenizer and HF Config

After config validation, the engine loads the tokenizer with
`auto_tokenizer_from_pretrained`. It records the tokenizer vocabulary size and
EOS token ID on the config. If the tokenizer exposes `mask_token_id`, Diffulex
uses that value instead of the default mask token.

`Config` also loads the Hugging Face config through `AutoConfig.from_pretrained`
with `trust_remote_code=True`. The effective `max_model_len` is clamped to the
model config's maximum sequence length.

## Worker Processes

Diffulex computes the model-parallel world size from tensor, expert, and data
parallel sizes. Rank 0 runs in the main process. Additional ranks are spawned as
worker processes with Python multiprocessing. Each worker constructs a model
runner from the same validated config.

If startup fails, the engine calls `exit()` to clean up worker processes before
re-raising the exception.

## Strategy Components

The decoding strategy selects several registered components:

- request state through `AutoReq`;
- scheduler through `AutoScheduler`;
- KV cache manager through `AutoKVCacheManager`;
- model runner through `AutoModelRunner`;
- attention metadata functions through the strategy model runner.

Built-in strategies are imported from `diffulex.strategy`, which triggers their
registry decorators.

## First Generation

Once the engine is constructed, use `generate` for normal offline inference:

```python
outputs = llm.generate(
    ["Solve: 12 + 30."],
    SamplingParams(temperature=0.0, max_tokens=32),
)

for output in outputs.trajectories:
    print(output.text)
```

Call `llm.exit()` when the process is done with the engine.

## Practical Debugging

Use a tiny prompt, `tensor_parallel_size=1`, and `data_parallel_size=1` while
validating a new model family. Once the model loads and one request completes,
increase parallelism and batch limits.