# Configuration

Diffulex uses the same core engine fields across Python inference, HTTP serving, and benchmark execution. Each entry point exposes those fields in a slightly different form, but the validation rules and runtime effects are the same.

## Engine Arguments

Engine arguments control model loading, decoding strategy, parallelism, memory limits, KV cache layout, LoRA behavior, and runtime optimizations.

The config validator enforces relationships between these values. For example, `block_size` must be less than or equal to `page_size`, and both must be one of the supported page/block sizes for the selected model family.

## Engine Parameter Reference

| Key | How to set it | What it does |
| --- | --- | --- |
| `model` / `model_path` | Point to the local model checkpoint directory. This is required. | Loads the base model. Python config uses `model`; benchmark YAML uses `engine.model_path`. |
| `tokenizer_path` | Point to a tokenizer directory, or leave it `null` to reuse the model path. | Lets benchmark flows use a tokenizer stored separately from the weights. |
| `model_name` | Choose one of the registered model keys: `dream`, `sdar`, `sdar_moe`, `fast_dllm_v2`, `llada`, `llada2`, `llada2_moe`, `llada2_mini`, `llada2dot1_mini`, `llada2_mini_dmax`, or `diffusion_gemma`. The default is `dream`. | Selects the model adapter and the sampler defaults that go with it. |
| `decoding_strategy` | Use `d2f`, `multi_bd`, `dmax`, or `diffusion_gemma`. The default is `d2f`. | Chooses the request type, scheduler, KV cache manager, model runner, and attention metadata path. |
| `sampling_mode` | Use `naive` for the standard sampler or `edit` for edit-style decoding. The default is `naive`. | Selects sampler behavior. `edit` is only valid for compatible LLaDA2-family models. |
| `mask_token_id` | Use the tokenizer's mask token ID. The default is `151666`, and tokenizer metadata can override it. | Tells diffusion samplers which token represents a masked position. |
| `tensor_parallel_size` | Use `1` to `8` tensor-parallel ranks. Core config defaults to `2`; CLI examples usually start at `1`. | Splits one model replica across multiple GPUs. |
| `data_parallel_size` | Use `1` to `1024` data-parallel groups. The default is `1`. | Runs independent serving or evaluation groups for higher throughput. |
| `gpu_memory_utilization` | Use a fractional target such as `0.9`. | Guides engine memory planning so it does not reserve the entire GPU. |
| `max_model_len` | Use a positive sequence length. The default is `2048`, and the HF model config may clamp it lower. | Sets the requested maximum prompt-plus-output length. |
| `max_num_batched_tokens` | Use a positive token budget. The default is `4096`, and it must be at least the effective `max_model_len`. | Limits how many tokens the scheduler can place in one batch. |
| `max_num_reqs` | Use a positive request count. The default is `128`. | Caps the number of active requests the engine can track. |
| `block_size` | Use `4`, `8`, `16`, or `32` for most models; `diffusion_gemma` uses `256`. The default is `32`. | Sets the token span of one diffusion block. |
| `buffer_size` | Use a positive block count. The default is `4`; `diffusion_gemma` is forced to `1`. | Controls how many diffusion blocks can be active for one request. |
| `page_size` | Use `4`, `8`, `16`, or `32` for most models; `diffusion_gemma` uses `256`. Keep it greater than or equal to `block_size`. | Sets the KV cache page size. |
| `kv_cache_layout` | Use `unified` unless a strategy or experiment needs `distinct`. | Chooses how KV cache storage is organized internally. |
| `attn_impl` | Use `triton_grouped` for normal serving and benchmark runs. `triton` and `naive` are compatibility/debug fallbacks and are not recommended for performance reporting. The default is `triton_grouped`. | Selects the attention backend. |
| `enable_prefix_caching` | Leave it `True` for compatible strategies. `d2f` forces it off during normalization. | Reuses compatible prefix KV cache state across requests. |
| `enforce_eager` | Set `True` while debugging; leave `False` for optimized runs. | Disables graph-style optimized execution paths. |
| `enable_full_static_runner` | Leave `True` for supported optimized multi-block paths. | Enables full-static CUDA graph runner paths. |
| `enable_torch_compile` | Leave `True` when the model path supports compile; turn it off for debugging. | Enables `torch.compile` where Diffulex can use it safely. |
| `torch_compile_mode` | Defaults to `reduce-overhead`; use another PyTorch compile mode only when profiling justifies it. | Passes the compile mode through to PyTorch. |
| `enable_vllm_layers` | Leave `True` unless isolating a layer implementation issue. | Uses optional vLLM-backed common layers. |

## MoE and Token Merge Parameters

| Key | How to set it | What it does |
| --- | --- | --- |
| `token_merge_mode` | Use `dmax_topk` or `iter_smooth_topk`. The default is `dmax_topk`. | Chooses how token-merge metadata is built. |
| `token_merge_top_k` | Use a positive integer. The default is `1`. | Keeps this many candidate tokens in token-merge metadata. |
| `token_merge_renormalize` | Leave `True` unless an experiment needs raw probabilities. | Renormalizes token-merge probabilities after candidate filtering. |
| `token_merge_weight` | Use a non-negative float. The default is `1.0`. | Weights the token-merge interpolation. |
| `moe_gemm_impl` | Use `triton`, `vllm`, `vllm_modular`, or `naive`. The default is `triton`. | Selects the MoE GEMM implementation. |

## DMax and Edit Sampling Parameters

| Key | How to set it | What it does |
| --- | --- | --- |
| `max_post_edit_steps` | Use a positive integer. The default is `16`. | Caps the number of edit steps per block in DMax/edit-sampling runs. |
| `penalty_lambda` | Use a non-negative float. The default is `0.0`. | Penalty weight for token-to-token edit selection. |
| `dmax_sampler_fast_path` | Leave `True` unless debugging. The default is `True`. | Enables the fast argmax path for DMax confidence computation. |
| `dmax_force_prefill_active` | Leave `False` unless profiling a specific DMax prefill pattern. | Forces DMax active blocks to remain on the prefill execution path. |
| `enable_vectorized_sampler` | Leave `False` unless extending the vectorized sampler. | Enables the vectorized greedy sampler for batched mask-filling. |
| `enable_vectorized_sampler_compile` | Leave `False` unless paired with `enable_vectorized_sampler`. | Applies `torch.compile` to the vectorized sampler path. |

## Fast-dLLM-v2 Parameters

| Key | How to set it | What it does |
| --- | --- | --- |
| `fdv2_use_block_cache` | Leave `True` for the standard dual-cache path. The default is `True`. | When enabled, already-refined sub-blocks within the buffer reuse cached KV; only the active sub-block is recomputed. Set `False` to recompute the full buffer every sub-block step. |

## DiffusionGemma Sampler Controls

DiffusionGemma uses a canvas-denoising sampler with its own hyperparameters. These are only active when `model_name=diffusion_gemma`:

| Key | How to set it | What it does |
| --- | --- | --- |
| `diffusion_gemma_max_denoising_steps` | Use a positive integer. The default is `48`. | Caps the number of denoising steps per block. |
| `diffusion_gemma_stability_threshold` | Use a positive integer. The default is `2`. | Number of consecutive argmax matches required for convergence. |
| `diffusion_gemma_confidence_threshold` | Use a float from `0` to `1`. The default is `0.1`. | Mean entropy threshold for accepting convergence. |
| `diffusion_gemma_t_min` | Use a float. The default is `0.0`. | Minimum temperature in the denoising schedule. |
| `diffusion_gemma_t_max` | Use a float. The default is `1.0`. | Maximum (initial) temperature in the denoising schedule. |
| `diffusion_gemma_entropy_bound` | Use a float from `0` to `1`. The default is `1.0`. | Entropy threshold for stochastic re-masking during denoising. |

## MoE Dispatching

| Key | How to set it | What it does |
| --- | --- | --- |
| `expert_parallel_size` | Use `1` for standard MoE; increase only when extending the runtime. The default is `1`. | Number of expert-parallel groups for MoE models. |
| `moe_dispatcher_backend` | Use `standard`, `naive`, or `deepep`. The default is `standard`. | Selects the MoE token dispatcher. |
| `deepep_mode` | Use `auto` unless debugging DeepEP-specific behavior. | Controls the DeepEP dispatch mode when `moe_dispatcher_backend=deepep`. |
| `deepep_num_max_dispatch_tokens_per_rank` | Use a positive integer. The default is `256`. | Caps tokens per rank in DeepEP dispatch. |

## Distributed and Deployment

| Key | How to set it | What it does |
| --- | --- | --- |
| `master_addr` | Use an IP or hostname. The default is `"localhost"`. | PyTorch distributed master address. |
| `master_port` | Use an integer. The default is `2333`. | PyTorch distributed master port. |
| `distributed_backend` | Use `nccl` unless debugging with `gloo`. | PyTorch distributed communication backend. |
| `distributed_timeout_seconds` | Use a positive integer. The default is `600`. | Timeout for distributed operations. |
| `device_start` | Use `0` unless mapping specific GPUs. | Starting CUDA device index for local ranks. |

## Runtime Tuning

| Key | How to set it | What it does |
| --- | --- | --- |
| `skip_warmup` | Leave `False` unless iterating on config changes. The default is `False`. | Skips model and CUDA graph warmup at engine startup. |
| `enable_cudagraph_torch_compile` | Leave `False` unless profiling shows a measurable gain. | Applies `torch.compile` to CUDA graph captured regions. |
| `enable_prefill_cudagraph` | Leave `False` unless prefill latency is the bottleneck. | Enables CUDA graph capture for prefill steps. |
| `prefill_cudagraph_max_len` | Use `0` (auto-detect) or a positive token count. | Maximum sequence length for prefill CUDA graph capture. |
| `auto_max_nfe_warmup_steps` | Use a positive integer. The default is `8`. | Number of warmup steps when auto-computing max NFE. |
| `auto_max_nfe_tpf_floor` | Use a positive float. The default is `1.0`. | Minimum TPF used when auto-computing max NFE. |
| `num_pages` | Use `-1` for auto-sizing, or a specific page count. | Overrides automatic KV cache page pool sizing. |
| `k_cache_hdim_split_factor_x` | Leave at the default `8` unless tuning the KV cache layout. | Splits the K cache head dimension for memory layout optimization. |

## Profiler

| Key | How to set it | What it does |
| --- | --- | --- |
| `profiler_config` | Use `null` unless collecting a profiling trace. | PyTorch profiler configuration dict. Set to a dict with `activities`, `schedule`, etc. to enable profiling. |

## Strategy Defaults

Some settings are normalized based on the selected strategy:

| Strategy | Normalized behavior |
| --- | --- |
| `d2f` | Forces `multi_block_prefix_full=True` and disables prefix caching. |
| `multi_bd` | Enables block-causal Multi-Block Diffusion and forces `multi_block_prefix_full=False`. |
| `dmax` | Forces `multi_block_prefix_full=False` and requires edit sampling. |
| `diffusion_gemma` | Uses DiffusionGemma request/sampler/runtime defaults. |

Model-specific defaults may also apply. `diffusion_gemma` uses block and page size `256`, uses `buffer_size=1`, and enables DiffusionGemma sampler controls.

## Sampling Parameters

Sampling parameters are passed through `diffulex.SamplingParams` for Python inference and through matching CLI/config fields for benchmark and server paths. They are request-level settings rather than engine construction settings.

| Key | How to set it | What it does |
| --- | --- | --- |
| `temperature` | Use `0.0` for deterministic evaluation, or a higher float when sampling is desired. | Controls generation randomness. |
| `max_tokens` | Use a positive output-token limit. | Caps generated tokens for each request. |
| `max_nfe` | Use a positive integer, or leave it unset. | Caps forward evaluations when the strategy supports that limit. |
| `ignore_eos` | Leave `False` for normal generation; set `True` only when a task should continue after EOS. | Controls whether EOS ends generation. |
| `max_repetition_run` | Use a positive integer, or leave it unset. | Stops generation after a long repeated-token run. |

## Decoding Thresholds

Diffulex groups decoding thresholds in `DecodingThresholds`:

| Key | How to set it | What it does |
| --- | --- | --- |
| `add_block_threshold` | Start from the default `0.1`; tune as a float for block-add behavior. | Controls when a strategy may add another decoding block. |
| `semi_complete_threshold` | Start from the default `0.9`; tune as a float for block advancement. | Controls when semi-complete block state can move forward. |
| `accept_threshold` | Use a confidence value from `0` to `1`. The default is `0.9`. | Accepts mask-to-token updates once confidence is high enough. |
| `edit_threshold` | Use a confidence value from `0` to `1`. The default is `0.0`. | Accepts token-to-token edits in edit-style decoding. |
| `remask_threshold` | Use a confidence value from `0` to `1`. The default is `0.4`. | Remasks filled tokens that fall below the confidence threshold. |
| `token_stability_threshold` | Use a stability ratio from `0` to `1`. The default is `0.0`. | Requires enough token stability before DMax-style edit blocks advance. |

The flat CLI flags are folded into the threshold object during config construction. Keep thresholds in the config file when comparing strategies so command lines stay readable.

## LoRA

| Key | How to set it | What it does |
| --- | --- | --- |
| `use_lora` | Set `True` when an adapter should be loaded. | Enables LoRA adapter loading. |
| `lora_path` | Point to the adapter checkpoint directory. Required when `use_lora=True`. | Provides the adapter weights. |
| `pre_merge_lora` | Set `True` when the adapter should be merged into the base model at load time. | Avoids per-forward adapter compute when the model and adapter support merging. |

When `use_lora=True`, `lora_path` must be provided. If `pre_merge_lora=True`, Diffulex attempts to merge adapter weights into the base model before inference when the model path and runtime support it.

## Runtime Optimizations

| Key | How to set it | What it does |
| --- | --- | --- |
| `enforce_eager` | Set `True` while debugging; keep `False` for optimized runs. | Bypasses graph-style execution paths. |
| `enable_full_static_runner` | Leave `True` for supported multi-block optimized paths. | Enables the full-static runner where available. |
| `enable_torch_compile` | Leave `True` when the model supports compile; disable it to isolate compile issues. | Enables `torch.compile` where supported. |
| `torch_compile_mode` | Defaults to `reduce-overhead`. Change it only for a measured profiling reason. | Passes the compile mode to PyTorch. |
| `enable_vllm_layers` | Leave `True` unless comparing layer implementations. | Uses optional vLLM-backed common layers. |

Use eager mode while debugging. Enable CUDA graph and compile paths for throughput measurements after the model and strategy are already validated.

## Benchmark YAML Structure

Benchmark YAML files use nested `engine` and `eval` sections:

```yaml
engine:
  model_path: /path/to/model
  tokenizer_path: null
  model_name: dream
  decoding_strategy: multi_bd
  tensor_parallel_size: 1
  data_parallel_size: 1

eval:
  dataset_name: gsm8k_diffulex
  dataset_limit: 10
  temperature: 0.0
  max_tokens: 512
```

`engine` fields are forwarded to the Diffulex engine after compatibility normalization. `eval` fields control lm-eval task selection, sampling limits, and output behavior.

## Evaluation Parameter Reference

| Key | How to set it | What it does |
| --- | --- | --- |
| `dataset_name` | Use an lm-eval task name. The default is `gsm8k_diffulex`. | Selects the benchmark task. |
| `dataset_split` | Use the dataset split name expected by the task. The default is `test`. | Selects the dataset split passed to lm-eval. |
| `dataset_limit` | Use a positive integer for smoke tests, or `null` for the full task. | Limits how many examples are evaluated. |
| `include_path` | Point to a directory of task YAMLs, leave it `null` for bundled tasks, or use an empty string to disable the bundled include path. | Controls where lm-eval looks for task definitions. |
| `dataset_data_files` | Point to a local data file, or leave it `null`. | Overrides `dataset_kwargs.data_files` in the task YAML. |
| `temperature` | Use a sampling temperature. The default is `0.0` for deterministic evaluation. | Sets generation randomness during benchmark requests. |
| `max_tokens` | Use a positive output-token limit. The config class defaults to `256`; the sample YAML uses `512`. | Caps generated tokens per example. |
| `max_nfe` | Use a positive integer, or leave it `null`. | Caps forward evaluations when a strategy supports that limit. |
| `max_repetition_run` | Use a positive integer, or leave it `null`. | Stops generation after a long repeated-token run. |
| `ignore_eos` | Leave `False` unless a task needs generation to continue past EOS. | Controls whether EOS terminates generation. |
| `output_dir` | Point to an output directory. The default is `benchmark_results`. | Sets where benchmark artifacts are written. |
| `use_run_subdirectory` | Leave `True` for normal runs. | Writes each run under a timestamped task directory. |
| `save_results` | Leave `True` unless only logs are needed. | Saves lm-eval results and sample outputs. |
| `confirm_run_unsafe_code` | Leave `True` only for tasks where executing generated code is expected and acceptable. | Allows lm-eval code-execution tasks to run. |

## Environment Variables

Diffulex uses normal Python, PyTorch, CUDA, and distributed runtime environment variables. Set variables such as CUDA visibility and library paths before starting the Python process. When `CUDA_VISIBLE_DEVICES` is set, `device_ids` should refer to PyTorch logical device IDs, not physical GPU IDs.
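
The logical-versus-physical distinction can be illustrated with a small standalone sketch. It is not part of Diffulex; the helper names `visible_physical_devices` and `logical_to_physical` are hypothetical and exist only for this example. The sketch assumes the usual CUDA behavior in which logical device `i` corresponds to the `i`-th entry of `CUDA_VISIBLE_DEVICES`, and it ignores edge cases such as invalid entries in the variable.

```python
import os


def visible_physical_devices(env=None):
    """Return the CUDA_VISIBLE_DEVICES entries in logical order, or None if unset.

    Entries are kept as strings because the variable may hold GPU UUIDs
    as well as integer indices.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def logical_to_physical(logical_id, env=None):
    """Map a logical device ID to the physical entry it refers to (hypothetical helper)."""
    visible = visible_physical_devices(env)
    if visible is None:
        # Without the variable, logical and physical numbering coincide.
        return str(logical_id)
    if not 0 <= logical_id < len(visible):
        raise IndexError(
            f"logical device {logical_id} is out of range for {len(visible)} visible device(s)"
        )
    return visible[logical_id]


if __name__ == "__main__":
    # With physical GPUs 2 and 3 exposed, the process sees them as logical 0 and 1.
    example_env = {"CUDA_VISIBLE_DEVICES": "2,3"}
    for logical in range(2):
        print(f"logical {logical} -> physical {logical_to_physical(logical, example_env)}")
```

In this example, a process started with `CUDA_VISIBLE_DEVICES=2,3` should be configured with logical IDs `0` and `1`; asking for device `2` would be out of range even though a physical GPU 2 exists.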