Configuration
Diffulex uses the same core engine fields across Python inference, HTTP serving, and benchmark execution. Each entry point exposes those fields in a slightly different form, but the validation rules and runtime effects are the same.
Engine Arguments
Engine arguments control model loading, decoding strategy, parallelism, memory limits, KV cache layout, LoRA behavior, and runtime optimizations.
The config validator enforces relationships between these values. For example,
block_size must be less than or equal to page_size, and both must be one of
the supported page/block sizes for the selected model family.
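The relationship above can be sketched as a small check. This is an illustrative sketch, not the Diffulex validator: the function name `check_block_and_page_size` and the `SUPPORTED_SIZES` set are assumptions made for this example, and the real supported sizes depend on the selected model family.

```python
# Hypothetical set of supported sizes; the real values depend on the model family.
SUPPORTED_SIZES = frozenset({4, 8, 16, 32, 64, 128, 256})


def check_block_and_page_size(block_size, page_size, supported=SUPPORTED_SIZES):
    """Raise ValueError if the pair violates the rules described above."""
    for name, value in (("block_size", block_size), ("page_size", page_size)):
        if value not in supported:
            raise ValueError(f"{name}={value} is not one of {sorted(supported)}")
    if block_size > page_size:
        raise ValueError(
            f"block_size ({block_size}) must be <= page_size ({page_size})"
        )


check_block_and_page_size(block_size=32, page_size=64)  # passes silently
```

Running a check like this before engine construction surfaces a bad pair early, instead of during KV cache allocation.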
Engine Parameter Reference
| Key | How to set it | What it does |
|---|---|---|
| | Point to the local model checkpoint directory. This is required. | Loads the base model. Python config uses |
| | Point to a tokenizer directory, or leave it | Lets benchmark flows use a tokenizer stored separately from the weights. |
| | Choose one of the registered model keys: | Selects the model adapter and the sampler defaults that go with it. |
| | Use | Chooses the request type, scheduler, KV cache manager, model runner, and attention metadata path. |
| | Use | Selects sampler behavior. |
| | Use the tokenizer’s mask token ID. The default is | Tells diffusion samplers which token represents a masked position. |
| | Use | Splits one model replica across multiple GPUs. |
| | Use | Runs independent serving or evaluation groups for higher throughput. |
| | Use a fractional target such as | Guides engine memory planning so it does not reserve the entire GPU. |
| | Use a positive sequence length. The default is | Sets the requested maximum prompt-plus-output length. |
| | Use a positive token budget. The default is | Limits how many tokens the scheduler can place in one batch. |
| | Use a positive request count. The default is | Caps the number of active requests the engine can track. |
| | Use | Sets the token span of one diffusion block. |
| | Use a positive block count. The default is | Controls how many diffusion blocks can be active for one request. |
| | Use | Sets the KV cache page size. |
| | Use | Chooses how KV cache storage is organized internally. |
| | Use | Selects the attention backend. |
| | Leave it | Reuses compatible prefix KV cache state across requests. |
| | Set | Disables graph-style optimized execution paths. |
| | Leave | Enables full-static CUDA graph runner paths. |
| | Leave | Enables |
| | Defaults to | Passes the compile mode through to PyTorch. |
| | Leave | Uses optional vLLM-backed common layers. |
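To make the interaction between tensor parallelism and data parallelism concrete, the sketch below lays out GPU ranks for a given pair of sizes. It assumes each data-parallel group holds one full tensor-parallel replica and that ranks are assigned contiguously; both are assumptions for illustration, and `rank_layout` is a hypothetical helper, not a Diffulex function.

```python
def rank_layout(tensor_parallel_size, data_parallel_size):
    """Return one list of global ranks per data-parallel group.

    Assumes contiguous rank assignment: group g owns ranks
    [g * tp, (g + 1) * tp).
    """
    if tensor_parallel_size < 1 or data_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return [
        list(range(g * tensor_parallel_size, (g + 1) * tensor_parallel_size))
        for g in range(data_parallel_size)
    ]


groups = rank_layout(tensor_parallel_size=2, data_parallel_size=2)
print(groups)                 # [[0, 1], [2, 3]]
print(sum(map(len, groups)))  # 4
```

Under these assumptions the total GPU count is the product of the two sizes, which is a quick sanity check before launching a multi-GPU run.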
MoE and Token Merge Parameters
| Key | How to set it | What it does |
|---|---|---|
| | Use | Chooses how token-merge metadata is built. |
| | Use a positive integer. The default is | Keeps this many candidate tokens in token-merge metadata. |
| | Leave | Renormalizes token-merge probabilities after candidate filtering. |
| | Use a non-negative float. The default is | Weights the token-merge interpolation. |
| | Use | Selects the MoE GEMM implementation. |
DMax and Edit Sampling Parameters
| Key | How to set it | What it does |
|---|---|---|
| | Use a positive integer. The default is | Caps the number of edit steps per block in DMax/edit-sampling runs. |
| | Use a non-negative float. The default is | Penalty weight for token-to-token edit selection. |
| | Leave | Enables the fast argmax path for DMax confidence computation. |
| | Leave | Forces DMax active blocks to remain on the prefill execution path. |
| | Leave | Enables the vectorized greedy sampler for batched mask-filling. |
| | Leave | Applies |
Fast-dLLM-v2 Parameters
| Key | How to set it | What it does |
|---|---|---|
| | Leave | When enabled, already-refined sub-blocks within the buffer reuse cached KV; only the active sub-block is recomputed. Set |
DiffusionGemma Sampler Controls
DiffusionGemma uses a canvas-denoising sampler with its own hyperparameters.
These are only active when model_name=diffusion_gemma:
| Key | How to set it | What it does |
|---|---|---|
| | Use a positive integer. The default is | Caps the number of denoising steps per block. |
| | Use a positive integer. The default is | Number of consecutive argmax matches required for convergence. |
| | Use a float from | Mean entropy threshold for accepting convergence. |
| | Use a float. The default is | Minimum temperature in the denoising schedule. |
| | Use a float. The default is | Maximum (initial) temperature in the denoising schedule. |
| | Use a float from | Entropy threshold for stochastic re-masking during denoising. |
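The two convergence controls in the table (consecutive argmax matches and a mean entropy threshold) can be illustrated with a rough sketch. How DiffusionGemma actually combines them is not specified here; the sketch assumes both conditions must hold, measures entropy in nats, and uses hypothetical names (`mean_entropy`, `has_converged`, `patience`, `entropy_threshold`).

```python
import math


def mean_entropy(distributions):
    """Mean Shannon entropy (in nats) over per-position probability lists."""
    if not distributions:
        raise ValueError("need at least one distribution")
    total = 0.0
    for probs in distributions:
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(distributions)


def has_converged(argmax_history, distributions, patience, entropy_threshold):
    """Assumed rule: the argmax tokens matched for `patience` consecutive
    steps AND the current mean entropy is at or below the threshold."""
    if len(argmax_history) < patience + 1:
        return False
    # `patience` matches require `patience + 1` identical snapshots.
    recent = argmax_history[-(patience + 1):]
    stable = all(step == recent[0] for step in recent[1:])
    return stable and mean_entropy(distributions) <= entropy_threshold


history = [[5, 9], [5, 9], [5, 9]]   # argmax token IDs per denoising step
confident = [[1.0, 0.0], [1.0, 0.0]]  # per-position probabilities
print(has_converged(history, confident, patience=2, entropy_threshold=0.1))  # True
```

A lower entropy threshold or a larger match count makes the sampler spend more steps per block before accepting a result.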
MoE Dispatching
| Key | How to set it | What it does |
|---|---|---|
| | Use | Number of expert-parallel groups for MoE models. |
| | Use | Selects the MoE token dispatcher. |
| | Use | Controls the DeepEP dispatch mode when |
| | Use a positive integer. The default is | Caps tokens per rank in DeepEP dispatch. |
Distributed and Deployment
| Key | How to set it | What it does |
|---|---|---|
| | Use an IP or hostname. The default is | PyTorch distributed master address. |
| | Use an integer. The default is | PyTorch distributed master port. |
| | Use | PyTorch distributed communication backend. |
| | Use a positive integer. The default is | Timeout for distributed operations. |
| | Use | Starting CUDA device index for local ranks. |
Runtime Tuning
| Key | How to set it | What it does |
|---|---|---|
| | Leave | Skips model and CUDA graph warmup at engine startup. |
| | Leave | Applies |
| | Leave | Enables CUDA graph capture for prefill steps. |
| | Use | Maximum sequence length for prefill CUDA graph capture. |
| | Use a positive integer. The default is | Number of warmup steps when auto-computing max NFE. |
| | Use a positive float. The default is | Minimum TPF used when auto-computing max NFE. |
| | Use | Overrides automatic KV cache page pool sizing. |
| | Leave at the default | Splits the K cache head dimension for memory layout optimization. |
Profiler
| Key | How to set it | What it does |
|---|---|---|
| | Use | PyTorch profiler configuration dict. Set to a dict with |
Strategy Defaults
Some settings are normalized based on the selected strategy:
| Strategy | Normalized behavior |
|---|---|
| | Forces |
| | Enables block-causal Multi-Block Diffusion and forces |
| | Forces |
| | Uses DiffusionGemma request/sampler/runtime defaults. |
Model-specific defaults may also apply. diffusion_gemma uses block and page
size 256, uses buffer_size=1, and enables DiffusionGemma sampler controls.
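The model-specific defaults described above can be pictured as a normalization pass over the engine fields. The sketch below is illustrative only: `apply_model_defaults` is a hypothetical helper, and it assumes the defaults overwrite user-supplied values rather than only filling in missing ones.

```python
def apply_model_defaults(config):
    """Return a copy of `config` with diffusion_gemma defaults applied.

    Assumption: the model-specific values replace whatever the caller set.
    """
    normalized = dict(config)
    if normalized.get("model_name") == "diffusion_gemma":
        normalized["block_size"] = 256
        normalized["page_size"] = 256
        normalized["buffer_size"] = 1
    return normalized


print(apply_model_defaults({"model_name": "diffusion_gemma", "block_size": 32}))
# {'model_name': 'diffusion_gemma', 'block_size': 256, 'page_size': 256, 'buffer_size': 1}
```

Other model names pass through unchanged in this sketch, so the same config file can be reused across models while comparing strategies.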
Sampling Parameters
Sampling parameters are passed through diffulex.SamplingParams for Python
inference and through matching CLI/config fields for benchmark and server paths.
They are request-level settings rather than engine construction settings.
| Key | How to set it | What it does |
|---|---|---|
| | Use | Controls generation randomness. |
| | Use a positive output-token limit. | Caps generated tokens for each request. |
| | Use a positive integer, or leave it unset. | Caps forward evaluations when the strategy supports that limit. |
| | Leave | Controls whether EOS ends generation. |
| | Use a positive integer, or leave it unset. | Stops generation after a long repeated-token run. |
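The repeated-token stop condition in the last row can be sketched as a check on the tail of the generated sequence. The exact rule Diffulex applies is not spelled out here; this sketch assumes generation stops once the trailing run of identical tokens reaches the configured limit, and `has_repeated_run` and `run_limit` are hypothetical names.

```python
def has_repeated_run(token_ids, run_limit):
    """True if `token_ids` ends with at least `run_limit` identical tokens."""
    if run_limit < 1:
        raise ValueError("run_limit must be a positive integer")
    if not token_ids:
        return False
    last = token_ids[-1]
    run = 0
    # Walk backwards until the token changes.
    for token in reversed(token_ids):
        if token != last:
            break
        run += 1
    return run >= run_limit


print(has_repeated_run([1, 2, 3, 3, 3], run_limit=3))  # True
print(has_repeated_run([1, 2, 3, 3], run_limit=3))     # False
```

Leaving the limit unset corresponds to never calling such a check, so degenerate loops are then bounded only by the output-token cap.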
Decoding Thresholds
Diffulex groups decoding thresholds in DecodingThresholds:
| Key | How to set it | What it does |
|---|---|---|
| | Start from the default | Controls when a strategy may add another decoding block. |
| | Start from the default | Controls when semi-complete block state can move forward. |
| | Use a confidence value from | Accepts mask-to-token updates once confidence is high enough. |
| | Use a confidence value from | Accepts token-to-token edits in edit-style decoding. |
| | Use a confidence value from | Remasks filled tokens that fall below the confidence threshold. |
| | Use a stability ratio from | Requires enough token stability before DMax-style edit blocks advance. |
The flat CLI flags are folded into the threshold object during config construction. Keep thresholds in the config file when comparing strategies so command lines stay readable.
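Folding flat flags into a threshold object can be sketched with a dataclass. This is not the real `DecodingThresholds` class: `ThresholdsSketch`, its field names, and its default values are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class ThresholdsSketch:
    # Hypothetical field names and defaults, chosen only for this example.
    accept_threshold: float = 0.9
    remask_threshold: float = 0.1


def fold_thresholds(flat_config):
    """Split a flat config into (remaining_fields, ThresholdsSketch)."""
    names = {f.name for f in fields(ThresholdsSketch)}
    picked = {k: v for k, v in flat_config.items() if k in names}
    rest = {k: v for k, v in flat_config.items() if k not in names}
    return rest, ThresholdsSketch(**picked)


rest, thresholds = fold_thresholds({"model_name": "dream", "accept_threshold": 0.95})
print(rest)        # {'model_name': 'dream'}
print(thresholds)  # ThresholdsSketch(accept_threshold=0.95, remask_threshold=0.1)
```

Grouping the values this way keeps unspecified thresholds at their defaults, which is why a config file is easier to diff between strategies than a long command line.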
LoRA
| Key | How to set it | What it does |
|---|---|---|
| | Set | Enables LoRA adapter loading. |
| | Point to the adapter checkpoint directory. Required when | Provides the adapter weights. |
| | Set | Avoids per-forward adapter compute when the model and adapter support merging. |
When use_lora=True, lora_path must be provided. If pre_merge_lora=True,
Diffulex attempts to merge adapter weights into the base model before inference
when the model path and runtime support it.
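The `use_lora` and `lora_path` dependency can be checked up front with a few lines. The sketch below only illustrates the rule stated above; `check_lora_settings` is a hypothetical helper, not part of Diffulex.

```python
def check_lora_settings(use_lora, lora_path=None):
    """Raise ValueError when LoRA is enabled without an adapter path."""
    if use_lora and not lora_path:
        raise ValueError("lora_path must be provided when use_lora=True")


check_lora_settings(use_lora=False)                               # fine: LoRA is off
check_lora_settings(use_lora=True, lora_path="/path/to/adapter")  # fine: path given
```

Whether a requested pre-merge actually happens still depends on the model path and runtime, so a check like this cannot confirm that merging succeeded.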
Runtime Optimizations
| Key | How to set it | What it does |
|---|---|---|
| | Set | Bypasses graph-style execution paths. |
| | Leave | Enables the full-static runner where available. |
| | Leave | Enables |
| | Defaults to | Passes the compile mode to PyTorch. |
| | Leave | Uses optional vLLM-backed common layers. |
Use eager mode while debugging. Enable CUDA graph and compile paths for throughput measurements after the model and strategy are already validated.
Benchmark YAML Structure
Benchmark YAML files use nested engine and eval sections:
engine:
  model_path: /path/to/model
  tokenizer_path: null
  model_name: dream
  decoding_strategy: multi_bd
  tensor_parallel_size: 1
  data_parallel_size: 1
eval:
  dataset_name: gsm8k_diffulex
  dataset_limit: 10
  temperature: 0.0
  max_tokens: 512
engine fields are forwarded to the Diffulex engine after compatibility
normalization. eval fields control lm-eval task selection, sampling limits,
and output behavior.
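Once such a file is parsed (for example with a YAML loader such as PyYAML's `yaml.safe_load`), the two sections can be separated before they are handed to the engine and to lm-eval. The sketch below uses a plain dict that mirrors the YAML above so that it runs without extra dependencies; `split_sections` is a hypothetical helper, and rejecting unknown top-level sections is an assumption.

```python
# Dict equivalent of the benchmark YAML shown above.
benchmark_config = {
    "engine": {
        "model_path": "/path/to/model",
        "tokenizer_path": None,
        "model_name": "dream",
        "decoding_strategy": "multi_bd",
        "tensor_parallel_size": 1,
        "data_parallel_size": 1,
    },
    "eval": {
        "dataset_name": "gsm8k_diffulex",
        "dataset_limit": 10,
        "temperature": 0.0,
        "max_tokens": 512,
    },
}


def split_sections(config):
    """Return (engine_fields, eval_fields) as independent dict copies."""
    unknown = set(config) - {"engine", "eval"}
    if unknown:
        raise ValueError(f"unexpected top-level sections: {sorted(unknown)}")
    return dict(config.get("engine", {})), dict(config.get("eval", {}))


engine_fields, eval_fields = split_sections(benchmark_config)
print(engine_fields["decoding_strategy"])  # multi_bd
print(eval_fields["max_tokens"])           # 512
```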
Evaluation Parameter Reference
| Key | How to set it | What it does |
|---|---|---|
| | Use an lm-eval task name. The default is | Selects the benchmark task. |
| | Use the dataset split name expected by the task. The default is | Selects the dataset split passed to lm-eval. |
| | Use a positive integer for smoke tests, or | Limits how many examples are evaluated. |
| | Point to a directory of task YAMLs, leave it | Controls where lm-eval looks for task definitions. |
| | Point to a local data file, or leave it | Overrides |
| | Use a sampling temperature. The default is | Sets generation randomness during benchmark requests. |
| | Use a positive output-token limit. The config class defaults to | Caps generated tokens per example. |
| | Use a positive integer, or leave it | Caps forward evaluations when a strategy supports that limit. |
| | Use a positive integer, or leave it | Stops generation after a long repeated-token run. |
| | Leave | Controls whether EOS terminates generation. |
| | Point to an output directory. The default is | Sets where benchmark artifacts are written. |
| | Leave | Writes each run under a timestamped task directory. |
| | Leave | Saves lm-eval results and sample outputs. |
| | Leave | Allows lm-eval code-execution tasks to run. |
Environment Variables
Diffulex uses normal Python, PyTorch, CUDA, and distributed runtime environment variables. Set variables such as CUDA visibility and library paths before starting the Python process.
When CUDA_VISIBLE_DEVICES is set, device_ids should refer to PyTorch logical
device IDs, not physical GPU IDs.
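The difference between logical and physical IDs can be seen by resolving a logical index against the `CUDA_VISIBLE_DEVICES` list. The helper below is a hypothetical illustration (it is not a Diffulex or PyTorch function); it returns a string because entries in the variable may be device UUIDs rather than integers.

```python
import os


def physical_gpu_for(logical_id, visible=None):
    """Map a logical device ID to its CUDA_VISIBLE_DEVICES entry."""
    if logical_id < 0:
        raise ValueError("logical_id must be non-negative")
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # Variable unset: no remapping from CUDA_VISIBLE_DEVICES applies.
        return str(logical_id)
    entries = [entry.strip() for entry in visible.split(",") if entry.strip()]
    return entries[logical_id]  # IndexError if the logical ID is not visible


# With CUDA_VISIBLE_DEVICES=2,3 the process sees two devices, numbered 0 and 1.
print(physical_gpu_for(0, visible="2,3"))  # 2
print(physical_gpu_for(1, visible="2,3"))  # 3
```

In that example a `device_ids` value of `[0, 1]` addresses physical GPUs 2 and 3, while `[2, 3]` would refer to devices the process cannot see.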