Configuration

Diffulex uses the same core engine fields across Python inference, HTTP serving, and benchmark execution. Each entry point exposes those fields in a slightly different form, but the validation rules and runtime effects are the same.

Engine Arguments

Engine arguments control model loading, decoding strategy, parallelism, memory limits, KV cache layout, LoRA behavior, and runtime optimizations.

The config validator enforces relationships between these values. For example, `block_size` must be less than or equal to `page_size`, and each must be one of the sizes supported for the selected model family.
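The relationship described above can be sketched as a small check. This is an illustration only, not the Diffulex validator: the function name `check_block_and_page_size` and the two size tables are hypothetical, with the values taken from the parameter reference on this page.

```python
# Hypothetical sketch of the block/page size rule; not the Diffulex validator.
DEFAULT_SIZES = frozenset({4, 8, 16, 32})
# Assumption: model-specific size sets are looked up by model_name.
SIZES_BY_MODEL = {"diffusion_gemma": frozenset({256})}


def check_block_and_page_size(model_name: str, block_size: int, page_size: int) -> None:
    """Raise ValueError when the two sizes break the documented relationship."""
    allowed = SIZES_BY_MODEL.get(model_name, DEFAULT_SIZES)
    for label, value in (("block_size", block_size), ("page_size", page_size)):
        if value not in allowed:
            raise ValueError(
                f"{label}={value} is not one of {sorted(allowed)} for {model_name}"
            )
    if block_size > page_size:
        raise ValueError(f"block_size={block_size} must be <= page_size={page_size}")


# A valid combination passes silently.
check_block_and_page_size("dream", block_size=16, page_size=32)
```

Checking membership before the ordering rule means an unsupported size is reported first, which tends to be the more useful message.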

Engine Parameter Reference

Key	How to set it	What it does
`model` / `model_path`	Point to the local model checkpoint directory. This is required.	Loads the base model. Python config uses `model`; benchmark YAML uses `engine.model_path`.
`tokenizer_path`	Point to a tokenizer directory, or leave it `null` to reuse the model path.	Lets benchmark flows use a tokenizer stored separately from the weights.
`model_name`	Choose one of the registered model keys: `dream`, `sdar`, `sdar_moe`, `fast_dllm_v2`, `llada`, `llada2`, `llada2_moe`, `llada2_mini`, `llada2dot1_mini`, `llada2_mini_dmax`, or `diffusion_gemma`. The default is `dream`.	Selects the model adapter and the sampler defaults that go with it.
`decoding_strategy`	Use `d2f`, `multi_bd`, `dmax`, or `diffusion_gemma`. The default is `d2f`.	Chooses the request type, scheduler, KV cache manager, model runner, and attention metadata path.
`sampling_mode`	Use `naive` for the standard sampler or `edit` for edit-style decoding. The default is `naive`.	Selects sampler behavior. `edit` is only valid for compatible LLaDA2-family models.
`mask_token_id`	Use the tokenizer’s mask token ID. The default is `151666`, and tokenizer metadata can override it.	Tells diffusion samplers which token represents a masked position.
`tensor_parallel_size`	Use `1` to `8` tensor-parallel ranks. Core config defaults to `2`; CLI examples usually start at `1`.	Splits one model replica across multiple GPUs.
`data_parallel_size`	Use `1` to `1024` data-parallel groups. The default is `1`.	Runs independent serving or evaluation groups for higher throughput.
`gpu_memory_utilization`	Use a fractional target such as `0.9`.	Guides engine memory planning so it does not reserve the entire GPU.
`max_model_len`	Use a positive sequence length. The default is `2048`, and the HF model config may clamp it lower.	Sets the requested maximum prompt-plus-output length.
`max_num_batched_tokens`	Use a positive token budget. The default is `4096`, and it must be at least the effective `max_model_len`.	Limits how many tokens the scheduler can place in one batch.
`max_num_reqs`	Use a positive request count. The default is `128`.	Caps the number of active requests the engine can track.
`block_size`	Use `4`, `8`, `16`, or `32` for most models; `diffusion_gemma` uses `256`. The default is `32`.	Sets the token span of one diffusion block.
`buffer_size`	Use a positive block count. The default is `4`; `diffusion_gemma` is forced to `1`.	Controls how many diffusion blocks can be active for one request.
`page_size`	Use `4`, `8`, `16`, or `32` for most models; `diffusion_gemma` uses `256`. Keep it greater than or equal to `block_size`.	Sets the KV cache page size.
`kv_cache_layout`	Use `unified` unless a strategy or experiment needs `distinct`.	Chooses how KV cache storage is organized internally.
`attn_impl`	Use `triton_grouped` for normal serving and benchmark runs. `triton` and `naive` are compatibility/debug fallbacks and are not recommended for performance reporting. The default is `triton_grouped`.	Selects the attention backend.
`enable_prefix_caching`	Leave it `True` for compatible strategies. `d2f` forces it off during normalization.	Reuses compatible prefix KV cache state across requests.
`enforce_eager`	Set `True` while debugging; leave `False` for optimized runs.	Disables graph-style optimized execution paths.
`enable_full_static_runner`	Leave `True` for supported optimized multi-block paths.	Enables full-static CUDA graph runner paths.
`enable_torch_compile`	Leave `True` when the model path supports compile; turn it off for debugging.	Enables `torch.compile` where Diffulex can use it safely.
`torch_compile_mode`	Defaults to `reduce-overhead`; use another PyTorch compile mode only when profiling justifies it.	Passes the compile mode through to PyTorch.
`enable_vllm_layers`	Leave `True` unless isolating a layer implementation issue.	Uses optional vLLM-backed common layers.
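The interaction between `max_model_len` and `max_num_batched_tokens` in the table above can be sketched as follows. The helper names and the `hf_max_len` argument are hypothetical; the sketch assumes the Hugging Face model config exposes a single maximum length that acts as an upper bound.

```python
from typing import Optional


def effective_max_model_len(requested: int, hf_max_len: Optional[int]) -> int:
    """Clamp the requested length to the model's own limit when one is known."""
    if requested <= 0:
        raise ValueError("max_model_len must be positive")
    return requested if hf_max_len is None else min(requested, hf_max_len)


def check_token_budget(max_num_batched_tokens: int, max_model_len: int) -> None:
    """The batch token budget must cover at least one full-length request."""
    if max_num_batched_tokens < max_model_len:
        raise ValueError(
            f"max_num_batched_tokens={max_num_batched_tokens} must be >= "
            f"effective max_model_len={max_model_len}"
        )


# Documented defaults: 2048 requested, 4096 token budget.
length = effective_max_model_len(2048, hf_max_len=None)
check_token_budget(4096, length)
```

The budget is compared against the effective (clamped) length rather than the requested one, so lowering the model limit never invalidates an otherwise valid budget.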

MoE and Token Merge Parameters

Key	How to set it	What it does
`token_merge_mode`	Use `dmax_topk` or `iter_smooth_topk`. The default is `dmax_topk`.	Chooses how token-merge metadata is built.
`token_merge_top_k`	Use a positive integer. The default is `1`.	Keeps this many candidate tokens in token-merge metadata.
`token_merge_renormalize`	Leave `True` unless an experiment needs raw probabilities.	Renormalizes token-merge probabilities after candidate filtering.
`token_merge_weight`	Use a non-negative float. The default is `1.0`.	Weights the token-merge interpolation.
`moe_gemm_impl`	Use `triton`, `vllm`, `vllm_modular`, or `naive`. The default is `triton`.	Selects the MoE GEMM implementation.

DMax and Edit Sampling Parameters

Key	How to set it	What it does
`max_post_edit_steps`	Use a positive integer. The default is `16`.	Caps the number of edit steps per block in DMax/edit-sampling runs.
`penalty_lambda`	Use a non-negative float. The default is `0.0`.	Penalty weight for token-to-token edit selection.
`dmax_sampler_fast_path`	Leave `True` unless debugging. The default is `True`.	Enables the fast argmax path for DMax confidence computation.
`dmax_force_prefill_active`	Leave `False` unless profiling a specific DMax prefill pattern.	Forces DMax active blocks to remain on the prefill execution path.
`enable_vectorized_sampler`	Leave `False` unless extending the vectorized sampler.	Enables the vectorized greedy sampler for batched mask-filling.
`enable_vectorized_sampler_compile`	Leave `False` unless paired with `enable_vectorized_sampler`.	Applies `torch.compile` to the vectorized sampler path.

Fast-dLLM-v2 Parameters

Key	How to set it	What it does
`fdv2_use_block_cache`	Leave `True` for the standard dual-cache path. The default is `True`.	When enabled, already-refined sub-blocks within the buffer reuse cached KV, so only the active sub-block is recomputed. Set `False` to recompute the full buffer at every sub-block step.

DiffusionGemma Sampler Controls

DiffusionGemma uses a canvas-denoising sampler with its own hyperparameters. They take effect only when `model_name=diffusion_gemma`:

Key	How to set it	What it does
`diffusion_gemma_max_denoising_steps`	Use a positive integer. The default is `48`.	Caps the number of denoising steps per block.
`diffusion_gemma_stability_threshold`	Use a positive integer. The default is `2`.	Number of consecutive argmax matches required for convergence.
`diffusion_gemma_confidence_threshold`	Use a float from `0` to `1`. The default is `0.1`.	Mean entropy threshold for accepting convergence.
`diffusion_gemma_t_min`	Use a float. The default is `0.0`.	Minimum temperature in the denoising schedule.
`diffusion_gemma_t_max`	Use a float. The default is `1.0`.	Maximum (initial) temperature in the denoising schedule.
`diffusion_gemma_entropy_bound`	Use a float from `0` to `1`. The default is `1.0`.	Entropy threshold for stochastic re-masking during denoising.
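One way to read the stability and confidence thresholds above is as a joint stopping test. The sketch below is an assumption about how the two could combine, not the DiffusionGemma sampler itself: it assumes both conditions must hold, that a "match" compares one step's argmax tokens with the previous step's, and that lower mean entropy means higher confidence. All function names are hypothetical.

```python
from typing import Sequence


def trailing_matches(argmax_history: Sequence[Sequence[int]]) -> int:
    """Count how many of the most recent steps repeated the previous step's argmax."""
    matches = 0
    newer_steps = reversed(argmax_history)
    older_steps = reversed(argmax_history[:-1])
    for newer, older in zip(newer_steps, older_steps):
        if list(newer) != list(older):
            break
        matches += 1
    return matches


def has_converged(
    argmax_history: Sequence[Sequence[int]],
    mean_entropy: float,
    stability_threshold: int = 2,      # documented default
    confidence_threshold: float = 0.1,  # documented default
) -> bool:
    """Assumed rule: stable for enough consecutive steps AND low mean entropy."""
    stable = trailing_matches(argmax_history) >= stability_threshold
    confident = mean_entropy <= confidence_threshold
    return stable and confident


history = [[1, 2], [3, 4], [3, 4], [3, 4]]  # last two steps repeat their predecessor
print(has_converged(history, mean_entropy=0.05))
```

Under this reading, `diffusion_gemma_max_denoising_steps` remains the hard cap when the test never succeeds.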

MoE Dispatching

Key	How to set it	What it does
`expert_parallel_size`	Use `1` for standard MoE; increase only when extending the runtime. The default is `1`.	Number of expert-parallel groups for MoE models.
`moe_dispatcher_backend`	Use `standard`, `naive`, or `deepep`. The default is `standard`.	Selects the MoE token dispatcher.
`deepep_mode`	Use `auto` unless debugging DeepEP-specific behavior.	Controls the DeepEP dispatch mode when `moe_dispatcher_backend=deepep`.
`deepep_num_max_dispatch_tokens_per_rank`	Use a positive integer. The default is `256`.	Caps tokens per rank in DeepEP dispatch.

Distributed and Deployment

Key	How to set it	What it does
`master_addr`	Use an IP or hostname. The default is `"localhost"`.	PyTorch distributed master address.
`master_port`	Use an integer. The default is `2333`.	PyTorch distributed master port.
`distributed_backend`	Use `nccl` unless debugging with `gloo`.	PyTorch distributed communication backend.
`distributed_timeout_seconds`	Use a positive integer. The default is `600`.	Timeout for distributed operations.
`device_start`	Use `0` unless mapping specific GPUs.	Starting CUDA device index for local ranks.

Runtime Tuning

Key	How to set it	What it does
`skip_warmup`	Leave `False` unless iterating on config changes. The default is `False`.	Skips model and CUDA graph warmup at engine startup.
`enable_cudagraph_torch_compile`	Leave `False` unless profiling shows a measurable gain.	Applies `torch.compile` to CUDA graph captured regions.
`enable_prefill_cudagraph`	Leave `False` unless prefill latency is the bottleneck.	Enables CUDA graph capture for prefill steps.
`prefill_cudagraph_max_len`	Use `0` (auto-detect) or a positive token count.	Maximum sequence length for prefill CUDA graph capture.
`auto_max_nfe_warmup_steps`	Use a positive integer. The default is `8`.	Number of warmup steps when auto-computing max NFE.
`auto_max_nfe_tpf_floor`	Use a positive float. The default is `1.0`.	Minimum TPF used when auto-computing max NFE.
`num_pages`	Use `-1` for auto-sizing, or a specific page count.	Overrides automatic KV cache page pool sizing.
`k_cache_hdim_split_factor_x`	Leave at the default `8` unless tuning the KV cache layout.	Splits the K cache head dimension for memory layout optimization.

Profiler

Key	How to set it	What it does
`profiler_config`	Use `null` unless collecting a profiling trace.	PyTorch profiler configuration dict. Set to a dict with `activities`, `schedule`, etc. to enable profiling.

Strategy Defaults

Some settings are normalized based on the selected strategy:

Strategy	Normalized behavior
`d2f`	Forces `multi_block_prefix_full=True` and disables prefix caching.
`multi_bd`	Enables block-causal Multi-Block Diffusion and forces `multi_block_prefix_full=False`.
`dmax`	Forces `multi_block_prefix_full=False` and requires edit sampling.
`diffusion_gemma`	Uses DiffusionGemma request/sampler/runtime defaults.

Model-specific defaults may also apply. `diffusion_gemma` sets both `block_size` and `page_size` to `256`, forces `buffer_size=1`, and enables the DiffusionGemma sampler controls.
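The normalization rules above can be summarized as a function over a plain dict. This is a sketch under stated assumptions, not the Diffulex normalization code: the function name is hypothetical, and raising an error when `dmax` is paired with a non-edit sampler is one possible reading of "requires edit sampling".

```python
def normalize_engine_config(config: dict) -> dict:
    """Return a copy of config with the documented strategy/model defaults applied."""
    out = dict(config)
    strategy = out.get("decoding_strategy", "d2f")  # documented default

    if strategy == "d2f":
        out["multi_block_prefix_full"] = True
        out["enable_prefix_caching"] = False
    elif strategy in ("multi_bd", "dmax"):
        out["multi_block_prefix_full"] = False

    # Assumption: an incompatible sampling_mode is rejected rather than overwritten.
    if strategy == "dmax" and out.get("sampling_mode", "naive") != "edit":
        raise ValueError("decoding_strategy='dmax' requires sampling_mode='edit'")

    if out.get("model_name") == "diffusion_gemma":
        out.update(block_size=256, page_size=256, buffer_size=1)
    return out


normalized = normalize_engine_config(
    {"decoding_strategy": "d2f", "enable_prefix_caching": True}
)
print(normalized["enable_prefix_caching"])  # False: d2f disables prefix caching
```

Returning a copy keeps the caller's original settings intact, which makes it easier to log what normalization changed.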

Sampling Parameters

Sampling parameters are passed through `diffulex.SamplingParams` for Python inference and through matching CLI/config fields for benchmark and server paths. They are request-level settings rather than engine construction settings.

Key	How to set it	What it does
`temperature`	Use `0.0` for deterministic evaluation, or a higher float when sampling is desired.	Controls generation randomness.
`max_tokens`	Use a positive output-token limit.	Caps generated tokens for each request.
`max_nfe`	Use a positive integer, or leave it unset.	Caps forward evaluations when the strategy supports that limit.
`ignore_eos`	Leave `False` for normal generation; set `True` only when a task should continue after EOS.	Controls whether EOS ends generation.
`max_repetition_run`	Use a positive integer, or leave it unset.	Stops generation after a long repeated-token run.
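To make `max_repetition_run` concrete, the sketch below shows one plausible stopping rule. It assumes the limit applies to the run of identical tokens at the end of the generated sequence and that reaching the limit is enough to stop; the exact comparison Diffulex uses is not stated here, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def trailing_run_length(token_ids: Sequence[int]) -> int:
    """Length of the run of identical tokens at the end of the sequence."""
    if not token_ids:
        return 0
    last = token_ids[-1]
    run = 0
    for token in reversed(token_ids):
        if token != last:
            break
        run += 1
    return run


def should_stop_for_repetition(
    token_ids: Sequence[int], max_repetition_run: Optional[int]
) -> bool:
    """An unset limit (None) never stops generation."""
    if max_repetition_run is None:
        return False
    return trailing_run_length(token_ids) >= max_repetition_run


print(should_stop_for_repetition([5, 7, 7, 7], max_repetition_run=3))
```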

Decoding Thresholds

Diffulex groups decoding thresholds in `DecodingThresholds`:

Key	How to set it	What it does
`add_block_threshold`	Start from the default `0.1`; tune as a float for block-add behavior.	Controls when a strategy may add another decoding block.
`semi_complete_threshold`	Start from the default `0.9`; tune as a float for block advancement.	Controls when semi-complete block state can move forward.
`accept_threshold`	Use a confidence value from `0` to `1`. The default is `0.9`.	Accepts mask-to-token updates once confidence is high enough.
`edit_threshold`	Use a confidence value from `0` to `1`. The default is `0.0`.	Accepts token-to-token edits in edit-style decoding.
`remask_threshold`	Use a confidence value from `0` to `1`. The default is `0.4`.	Remasks filled tokens that fall below the confidence threshold.
`token_stability_threshold`	Use a stability ratio from `0` to `1`. The default is `0.0`.	Requires enough token stability before DMax-style edit blocks advance.

The flat CLI flags are folded into the threshold object during config construction. Keep thresholds in the config file when comparing strategies so command lines stay readable.
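The folding step described above can be sketched with a dataclass. `ThresholdsSketch` is a stand-in defined here for illustration and is not the real `DecodingThresholds` class; its field names and defaults are copied from the table, while the `fold_thresholds` helper is hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Tuple


@dataclass
class ThresholdsSketch:
    """Stand-in for the threshold object, using the documented defaults."""
    add_block_threshold: float = 0.1
    semi_complete_threshold: float = 0.9
    accept_threshold: float = 0.9
    edit_threshold: float = 0.0
    remask_threshold: float = 0.4
    token_stability_threshold: float = 0.0


def fold_thresholds(flat: Dict[str, Any]) -> Tuple[Dict[str, Any], ThresholdsSketch]:
    """Split flat options into (remaining options, grouped thresholds)."""
    names = {f.name for f in fields(ThresholdsSketch)}
    picked = {key: value for key, value in flat.items() if key in names}
    rest = {key: value for key, value in flat.items() if key not in names}
    return rest, ThresholdsSketch(**picked)


rest, thresholds = fold_thresholds({"model_name": "dream", "accept_threshold": 0.95})
print(rest)                         # {'model_name': 'dream'}
print(thresholds.accept_threshold)  # 0.95
print(thresholds.remask_threshold)  # 0.4 (default kept)
```

Any threshold not supplied on the command line keeps its default, so the grouped object is always fully populated.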

LoRA

Key	How to set it	What it does
`use_lora`	Set `True` when an adapter should be loaded.	Enables LoRA adapter loading.
`lora_path`	Point to the adapter checkpoint directory. Required when `use_lora=True`.	Provides the adapter weights.
`pre_merge_lora`	Set `True` when the adapter should be merged into the base model at load time.	Avoids per-forward adapter compute when the model and adapter support merging.

When `use_lora=True`, `lora_path` must be provided. If `pre_merge_lora=True`, Diffulex attempts to merge adapter weights into the base model before inference when the model path and runtime support it.
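The sketch below illustrates both points: the `lora_path` requirement and why pre-merging removes per-forward adapter work. It uses the standard LoRA formulation, where the merged weight is `W + scale * (B @ A)`, on tiny pure-Python matrices. It is not Diffulex's merge code, and every function name is hypothetical.

```python
from typing import List, Optional

Matrix = List[List[float]]


def check_lora_settings(use_lora: bool, lora_path: Optional[str]) -> None:
    """Documented rule: an adapter path is required once LoRA is enabled."""
    if use_lora and not lora_path:
        raise ValueError("lora_path is required when use_lora=True")


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float) -> Matrix:
    """Standard LoRA merge: W' = W + scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1, 2], [3, 4]]   # base weight, shape (out=2, in=2)
A = [[1, 2]]           # LoRA down-projection, shape (rank=1, in=2)
B = [[1], [2]]         # LoRA up-projection, shape (out=2, rank=1)
x = [[1], [1]]         # one input column vector

merged_out = matmul(merge_lora(W, A, B, scale=2), x)
adapter_out = matmul(B, matmul(A, x))
unmerged_out = [
    [base[0] + 2 * extra[0]] for base, extra in zip(matmul(W, x), adapter_out)
]
print(merged_out == unmerged_out)  # the merged weight gives the same output
```

After merging, each forward pass is a single matrix product; without it, the adapter adds two extra products per layer.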

Runtime Optimizations

Key	How to set it	What it does
`enforce_eager`	Set `True` while debugging; keep `False` for optimized runs.	Bypasses graph-style execution paths.
`enable_full_static_runner`	Leave `True` for supported multi-block optimized paths.	Enables the full-static runner where available.
`enable_torch_compile`	Leave `True` when the model supports compile; disable it to isolate compile issues.	Enables `torch.compile` where supported.
`torch_compile_mode`	Defaults to `reduce-overhead`. Change it only for a measured profiling reason.	Passes the compile mode to PyTorch.
`enable_vllm_layers`	Leave `True` unless comparing layer implementations.	Uses optional vLLM-backed common layers.

Use eager mode while debugging. Enable CUDA graph and compile paths for throughput measurements after the model and strategy are already validated.

Benchmark YAML Structure

Benchmark YAML files use nested `engine` and `eval` sections:

engine:
  model_path: /path/to/model
  tokenizer_path: null
  model_name: dream
  decoding_strategy: multi_bd
  tensor_parallel_size: 1
  data_parallel_size: 1
eval:
  dataset_name: gsm8k_diffulex
  dataset_limit: 10
  temperature: 0.0
  max_tokens: 512

`engine` fields are forwarded to the Diffulex engine after compatibility normalization. `eval` fields control lm-eval task selection, sampling limits, and output behavior.
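As a rough illustration of that split, the sketch below takes the already-parsed form of the YAML above (a nested dict, as any YAML loader would produce) and separates the two sections. The function name is hypothetical and the real forwarding code may differ; the key rename follows the parameter reference, where benchmark YAML uses `engine.model_path` and Python config uses `model`.

```python
from typing import Any, Dict, Tuple


def split_benchmark_config(raw: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (engine settings in Python-config form, eval settings)."""
    engine = dict(raw.get("engine", {}))
    eval_cfg = dict(raw.get("eval", {}))

    if "model_path" not in engine:
        raise ValueError("engine.model_path is required")
    # Benchmark YAML says model_path; Python config says model.
    engine["model"] = engine.pop("model_path")

    # Assumption: a null tokenizer_path is resolved to the model directory here.
    if engine.get("tokenizer_path") is None:
        engine["tokenizer_path"] = engine["model"]
    return engine, eval_cfg


parsed = {
    "engine": {
        "model_path": "/path/to/model",
        "tokenizer_path": None,
        "model_name": "dream",
        "decoding_strategy": "multi_bd",
        "tensor_parallel_size": 1,
        "data_parallel_size": 1,
    },
    "eval": {
        "dataset_name": "gsm8k_diffulex",
        "dataset_limit": 10,
        "temperature": 0.0,
        "max_tokens": 512,
    },
}
engine_cfg, eval_cfg = split_benchmark_config(parsed)
print(engine_cfg["model"], eval_cfg["dataset_name"])
```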

Evaluation Parameter Reference

Key	How to set it	What it does
`dataset_name`	Use an lm-eval task name. The default is `gsm8k_diffulex`.	Selects the benchmark task.
`dataset_split`	Use the dataset split name expected by the task. The default is `test`.	Selects the dataset split passed to lm-eval.
`dataset_limit`	Use a positive integer for smoke tests, or `null` for the full task.	Limits how many examples are evaluated.
`include_path`	Point to a directory of task YAMLs, leave it `null` for bundled tasks, or use an empty string to disable the bundled include path.	Controls where lm-eval looks for task definitions.
`dataset_data_files`	Point to a local data file, or leave it `null`.	Overrides `dataset_kwargs.data_files` in the task YAML.
`temperature`	Use a sampling temperature. The default is `0.0` for deterministic evaluation.	Sets generation randomness during benchmark requests.
`max_tokens`	Use a positive output-token limit. The config class defaults to `256`; the sample YAML uses `512`.	Caps generated tokens per example.
`max_nfe`	Use a positive integer, or leave it `null`.	Caps forward evaluations when a strategy supports that limit.
`max_repetition_run`	Use a positive integer, or leave it `null`.	Stops generation after a long repeated-token run.
`ignore_eos`	Leave `False` unless a task needs generation to continue past EOS.	Controls whether EOS terminates generation.
`output_dir`	Point to an output directory. The default is `benchmark_results`.	Sets where benchmark artifacts are written.
`use_run_subdirectory`	Leave `True` for normal runs.	Writes each run under a timestamped task directory.
`save_results`	Leave `True` unless only logs are needed.	Saves lm-eval results and sample outputs.
`confirm_run_unsafe_code`	Set `True` only for tasks where executing generated code is expected and acceptable.	Allows lm-eval code-execution tasks to run.

Environment Variables

Diffulex uses normal Python, PyTorch, CUDA, and distributed runtime environment variables. Set variables such as CUDA visibility and library paths before starting the Python process.

When `CUDA_VISIBLE_DEVICES` is set, `device_ids` should refer to PyTorch logical device IDs, not physical GPU IDs.
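The logical-versus-physical distinction comes from how CUDA renumbers devices: the entries listed in `CUDA_VISIBLE_DEVICES` become logical devices `0`, `1`, and so on, in the order given. The helper below is a hypothetical illustration of that mapping and is not part of Diffulex.

```python
from typing import List


def visible_devices(cuda_visible_devices: str) -> List[str]:
    """Parse a CUDA_VISIBLE_DEVICES value into its ordered entries."""
    return [entry.strip() for entry in cuda_visible_devices.split(",") if entry.strip()]


def physical_for_logical(logical_id: int, cuda_visible_devices: str) -> str:
    """Return the physical entry behind a logical device index.

    Entries are kept as strings because the variable may also hold GPU UUIDs.
    """
    devices = visible_devices(cuda_visible_devices)
    if not 0 <= logical_id < len(devices):
        raise IndexError(
            f"logical device {logical_id} is not visible; "
            f"only {len(devices)} device(s) are exposed"
        )
    return devices[logical_id]


# With CUDA_VISIBLE_DEVICES="2,3", logical device 0 is physical GPU 2.
print(physical_for_logical(0, "2,3"))  # 2
print(physical_for_logical(1, "2,3"))  # 3
```

In this example a process that wants physical GPUs 2 and 3 should refer to logical IDs `0` and `1`.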