diffulex.bench

The benchmark CLI runs Diffulex through lm-evaluation-harness tasks. It is the main command-line interface for GSM8K, MATH500, HumanEval, MBPP, and custom lm-eval-compatible task YAMLs.

The entry point is:

python -m diffulex_bench.main --help

Config-First Usage

Use a YAML config when you want repeatable settings:

python -m diffulex_bench.main \
  --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
  --model-path /path/to/LLaDA2.0-mini \
  --dataset-limit 10

Command-line flags override matching config fields. This makes it practical to keep one config per model family and vary paths, limits, and output directories per run.
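
The override rule can be pictured with a minimal Python sketch. It assumes a flat dictionary for the config and treats `None` as "flag not passed"; the helper name `apply_cli_overrides` is hypothetical, and the real benchmark config is nested (for example `engine.model_path`), so this only illustrates the precedence, not the actual loader.

```python
from typing import Any, Dict, Optional


def apply_cli_overrides(
    config: Dict[str, Any], cli_args: Dict[str, Optional[Any]]
) -> Dict[str, Any]:
    """Return a copy of `config` in which every CLI value that was set wins."""
    merged = dict(config)
    for key, value in cli_args.items():
        if value is not None:  # assumption: None means the flag was omitted
            merged[key] = value
    return merged


# Values a YAML config might carry (illustrative only).
yaml_config = {"model_path": "/models/from-yaml", "dataset_limit": None, "max_tokens": 256}
# Values parsed from the command line; max_tokens was not passed.
cli_args = {"model_path": "/path/to/LLaDA2.0-mini", "dataset_limit": 10, "max_tokens": None}

effective = apply_cli_overrides(yaml_config, cli_args)
print(effective)
```

Here the model path and dataset limit come from the command line, while `max_tokens` keeps its config value because the flag was omitted.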

Model and Strategy Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--model-path` | Point to the local base-model checkpoint directory. Required unless the YAML already sets `engine.model_path`. | Passes the model weights path into Diffulex. |
| `--tokenizer-path` | Point to a tokenizer directory, or omit it and the benchmark config flow falls back to the model path. | Lets lm-eval use a tokenizer stored separately from the weights. |
| `--model-name` | Use a registered model key: `dream`, `sdar`, `sdar_moe`, `fast_dllm_v2`, `llada`, `llada2`, `llada2_moe`, `llada2_mini`, `llada2dot1_mini`, `llada2_mini_dmax`, or `diffusion_gemma`. The default is `dream`. | Selects the model adapter and sampler defaults. |
| `--decoding-strategy` | Use `d2f`, `multi_bd`, `dmax`, or `diffusion_gemma` where supported by the selected model/config. | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| `--sampling-mode` | Use `naive`, `edit`, or omit it and let config/model defaults apply. | Selects sampler behavior. `edit` is restricted to compatible LLaDA2-family names. |
| `--mask-token-id` | Use the tokenizer’s mask token ID. The default is `151666`. | Supplies the mask token when tokenizer metadata does not override it. |
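
As a rough illustration of two rules from this table (the tokenizer fallback and the registered model keys), here is a small Python sketch. The function names are hypothetical and are not part of `diffulex_bench`; only the key list and the `dream` default are taken from the table.

```python
from typing import Optional

# Model keys copied from the `--model-name` row above.
REGISTERED_MODEL_NAMES = {
    "dream", "sdar", "sdar_moe", "fast_dllm_v2", "llada", "llada2",
    "llada2_moe", "llada2_mini", "llada2dot1_mini", "llada2_mini_dmax",
    "diffusion_gemma",
}


def resolve_tokenizer_path(model_path: str, tokenizer_path: Optional[str] = None) -> str:
    """Use the explicit tokenizer directory, else fall back to the model path."""
    return tokenizer_path or model_path


def check_model_name(name: str = "dream") -> str:
    """Reject keys that are not in the documented list."""
    if name not in REGISTERED_MODEL_NAMES:
        raise ValueError(f"unknown --model-name {name!r}")
    return name
```

A pre-flight check like this can catch a mistyped model key before any weights are loaded.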

Parallelism and Capacity Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--tensor-parallel-size` | Use `1` to `8` ranks. The CLI default is `1`. | Splits one model replica across multiple GPUs. |
| `--data-parallel-size` | Use `1` to `1024` groups. The CLI default is `1`. | Runs independent evaluation groups when enough devices are available. |
| `--gpu-memory-utilization` | Use a fraction such as `0.9`. | Guides GPU memory planning for engine allocation. |
| `--max-model-len` | Use a positive sequence length. The default is `2048`, and the HF config may clamp it. | Sets the requested prompt-plus-output length limit. |
| `--max-num-batched-tokens` | Use a positive token budget. The default is `4096`; it must cover the effective model length. | Limits scheduler batch size by token count. |
| `--max-num-reqs` | Use a positive request count, or omit it to let config defaults apply. | Caps active requests and replaces deprecated `--max-num-seqs`. |
| `--max-num-seqs` | Deprecated; use `--max-num-reqs` for new commands. | Keeps older benchmark commands working when `--max-num-reqs` is unset. |
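
The constraints in this table can be checked up front. The sketch below is a hypothetical pre-flight helper, not Diffulex code; it assumes the requested `max_model_len` stands in for the effective model length (the HF config may clamp the real value) and that the deprecated flag is consulted only when the new one is unset.

```python
from typing import Optional


def check_capacity(
    tensor_parallel_size: int = 1,
    data_parallel_size: int = 1,
    max_model_len: int = 2048,
    max_num_batched_tokens: int = 4096,
    max_num_reqs: Optional[int] = None,
    max_num_seqs: Optional[int] = None,  # deprecated spelling
) -> Optional[int]:
    """Validate the documented ranges and return the effective request cap."""
    if not 1 <= tensor_parallel_size <= 8:
        raise ValueError("--tensor-parallel-size must be between 1 and 8")
    if not 1 <= data_parallel_size <= 1024:
        raise ValueError("--data-parallel-size must be between 1 and 1024")
    if max_model_len <= 0:
        raise ValueError("--max-model-len must be positive")
    if max_num_batched_tokens < max_model_len:
        raise ValueError("--max-num-batched-tokens must cover the model length")
    # Prefer the new flag; fall back to the deprecated one only when unset.
    return max_num_reqs if max_num_reqs is not None else max_num_seqs
```

With the documented defaults the check passes, since a budget of `4096` batched tokens covers a model length of `2048`.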

Sampling Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--temperature` | Use `0.0` for deterministic evaluation or a higher float for sampling. | Sets generation randomness for benchmark requests. |
| `--max-tokens` | Use a positive output-token limit. The default is `256`. | Caps generated tokens per request. |
| `--max-nfe` | Use a positive number of forward evaluations, or omit it. | Adds a hard evaluation-step bound when the strategy supports it. |
| `--max-repetition-run` | Use a positive run length, or omit it. | Stops a request after the generated suffix repeats one token for too long. |
| `--ignore-eos` | Add the flag only when a task should continue after EOS. | Prevents EOS from ending generation. |
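
To make the `--max-repetition-run` description concrete, here is a minimal sketch of a suffix-repetition check. It is an illustration under assumptions, not the Diffulex stopping rule: the function name is hypothetical, and whether the real cutoff triggers at a run equal to the limit or only above it is not stated in this page (the sketch uses "equal to or above").

```python
from typing import Optional, Sequence


def exceeds_repetition_run(
    generated: Sequence[int], max_repetition_run: Optional[int]
) -> bool:
    """True when the generated suffix repeats a single token long enough."""
    if max_repetition_run is None or not generated:
        return False  # flag omitted, or nothing generated yet
    last = generated[-1]
    run = 0
    for token in reversed(generated):
        if token != last:
            break
        run += 1
    # Assumption: a run that reaches the limit already counts as too long.
    return run >= max_repetition_run
```

Only the trailing run matters here: a repeated token earlier in the output does not stop the request once generation has moved on.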

Dataset Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--dataset` | Use an lm-eval task name. The default is `gsm8k_diffulex`. | Selects the benchmark task; bundled tasks live under `diffulex_bench/tasks`. |
| `--include-path` | Point to a directory of task YAMLs, omit it for bundled tasks, or pass an empty string to disable the bundled include path. | Controls where lm-eval looks for task definitions. |
| `--dataset-split` | Use the split name expected by the dataset. The default is `test`. | Passes the dataset split through config. |
| `--dataset-limit` | Use a positive number for smoke tests or partial runs, and omit it for the full task. | Limits how many examples are evaluated. |
| `--dataset-data-files` | Point to a local JSON file, or omit it. | Overrides `dataset_kwargs.data_files` in the task YAML. |
| `--confirm-run-unsafe-code` | Enable only for lm-eval tasks where executing generated code is expected and acceptable. | Allows code-execution tasks to run. |
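
The three states of `--include-path` (omitted, empty string, explicit directory) are easy to confuse, so the following sketch spells them out. The helper `resolve_include_path` is hypothetical, and returning `None` to mean "no include path" is an assumption made for illustration.

```python
from pathlib import Path
from typing import Optional

# Bundled task location, as stated in the `--dataset` row above.
BUNDLED_TASK_DIR = Path("diffulex_bench/tasks")


def resolve_include_path(include_path: Optional[str]) -> Optional[Path]:
    """Map the CLI value onto the directory lm-eval should search."""
    if include_path is None:
        return BUNDLED_TASK_DIR  # flag omitted: use the bundled tasks
    if include_path == "":
        return None  # empty string: bundled include path disabled
    return Path(include_path)  # explicit directory of task YAMLs
```

Distinguishing `None` from the empty string is what lets an empty value act as an explicit opt-out rather than a missing flag.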

Output and Logging

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--output-dir` | Point to an output directory. The default is `benchmark_results`. | Sets the base directory for benchmark artifacts. |
| `--use-run-subdirectory` / `--no-use-run-subdirectory` | Leave run subdirectories on for normal runs, or disable them when a fixed output path is needed. | Writes each run under `run_<timestamp>_<task>/`. |
| `--save-results` / `--no-save-results` | Leave saving on unless only logs are needed. | Controls result and sample output files. |
| `--log-file` | Point to a log file, or omit it for console logging. | Adds persistent benchmark logs. |
| `--log-level` | Use `DEBUG`, `INFO`, `WARNING`, or `ERROR`. The default is `INFO`. | Controls benchmark log verbosity. |
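
The effect of the run-subdirectory switch on the output location might look like the sketch below. The `run_<timestamp>_<task>/` pattern comes from the table; the timestamp format `%Y%m%d_%H%M%S` and the helper name `resolve_run_dir` are assumptions, so the directory names a real run produces may differ.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def resolve_run_dir(
    output_dir: str = "benchmark_results",
    task: str = "gsm8k_diffulex",
    use_run_subdirectory: bool = True,
    now: Optional[datetime] = None,
) -> Path:
    """Return the directory a run would write its artifacts to."""
    base = Path(output_dir)
    if not use_run_subdirectory:
        return base  # fixed output path, later runs reuse it
    # Assumed timestamp format; the real one is not documented here.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return base / f"run_{stamp}_{task}"
```

Disabling the subdirectory gives a stable path for scripts that read results afterwards, at the cost of each run sharing one directory.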

LoRA and Runtime Controls

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--use-lora` | Add the flag when benchmarking with an adapter. | Enables LoRA adapter loading. |
| `--lora-path` | Point to the adapter checkpoint directory. Required with `--use-lora`. | Loads the adapter weights. |
| `--pre-merge-lora` | Add when the adapter should be merged into the base model at load time. Benchmark YAML defaults to pre-merge on. | Avoids per-forward adapter compute when merging is supported. |
| `--enforce-eager` / `--no-enforce-eager` | Use eager mode for debugging, or explicitly allow optimized graph paths for measurement. | Overrides config-driven eager/optimized execution behavior. |
| `--kv-cache-layout` | Use `unified` for the default cache layout or `distinct` for strategy experiments. | Chooses KV cache storage layout. |
| `--attn-impl` | Use `triton_grouped` for benchmark runs and performance reports; `triton` and `naive` are compatibility/debug fallbacks. Omit the flag to keep config defaults. | Overrides the attention backend. |
| `--page-size` | Use `4`, `8`, `16`, or `32` for most models; DiffusionGemma uses `256`. | Sets the KV cache page size. |
| `--block-size` | Use `4`, `8`, `16`, or `32` for most models; DiffusionGemma uses `256`. Keep it no larger than `--page-size`. | Sets the token span of one diffusion block. |
| `--buffer-size` | Use a positive block count, or omit it and keep the config default of `4`. | Controls how many diffusion blocks can remain active. |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | Use the pair to make prefix-cache behavior explicit in an experiment command. | Enables or disables compatible prefix cache reuse. |
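
Several rows in this table depend on each other. The sketch below collects those dependencies into one hypothetical check, using only the constraints stated above; the function name is an assumption, and the page and block sizes are required arguments because their defaults are not given here.

```python
from typing import Optional

KV_CACHE_LAYOUTS = {"unified", "distinct"}
ATTN_IMPLS = {"triton_grouped", "triton", "naive"}


def check_runtime_args(
    *,
    page_size: int,
    block_size: int,
    use_lora: bool = False,
    lora_path: Optional[str] = None,
    kv_cache_layout: str = "unified",
    attn_impl: Optional[str] = None,  # None keeps the config default
) -> None:
    """Raise ValueError when the documented constraints are violated."""
    if use_lora and not lora_path:
        raise ValueError("--lora-path is required with --use-lora")
    if block_size > page_size:
        raise ValueError("--block-size must be no larger than --page-size")
    if kv_cache_layout not in KV_CACHE_LAYOUTS:
        raise ValueError(f"unknown --kv-cache-layout {kv_cache_layout!r}")
    if attn_impl is not None and attn_impl not in ATTN_IMPLS:
        raise ValueError(f"unknown --attn-impl {attn_impl!r}")
```

For example, `check_runtime_args(page_size=32, block_size=32)` passes, while a block size of `32` with a page size of `16` is rejected.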

Threshold and Token-Merge Controls

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--add-block-threshold` | Omit it to use `0.1`, or pass a float for block-add tuning. | Controls when another decoding block can be added. |
| `--semi-complete-threshold` | Omit it to use `0.9`, or pass a float for block advancement tuning. | Controls when semi-complete block state can advance. |
| `--accept-threshold` | Use a confidence value from `0` to `1`. The default is `0.9`. | Accepts mask-to-token updates once confidence is high enough. |
| `--edit-threshold` | Use a confidence value from `0` to `1`. The default is `0.0`. | Accepts token-to-token edits in edit-style decoding. |
| `--remask-threshold` | Use a confidence value from `0` to `1`. The default is `0.4`. | Remasks filled tokens that fall below the confidence threshold. |
| `--token-stability-threshold` | Use a stability ratio from `0` to `1`. The default is `0.0`. | Controls when the next DMax edit block can be added. |
| `--token-merge-mode` | Use `dmax_topk`, `iter_smooth_topk`, or omit it and keep config defaults. | Selects token-merge metadata behavior for DMax-style strategies. |
| `--token-merge-top-k` | Use a positive candidate count, or omit it and keep the config default of `1`. | Keeps this many candidates in token-merge metadata. |
| `--token-merge-renormalize` | Add when the experiment should explicitly renormalize token-merge probabilities. | Controls probability normalization after token candidate filtering. |
| `--token-merge-weight` | Use a non-negative interpolation weight, or omit it and keep the config default of `1.0`. | Weights token-merge interpolation. |
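
To give a feel for how the accept and remask thresholds interact, here is a toy Python sketch over plain lists. It is not the Diffulex sampler: the function name, the single-pass structure, and the choice to apply both thresholds in the same step are assumptions made for illustration. Only the default values and the default mask token ID come from this page.

```python
from typing import List

MASK_TOKEN_ID = 151666  # default `--mask-token-id` documented above


def threshold_step(
    tokens: List[int],
    proposals: List[int],
    confidences: List[float],
    accept_threshold: float = 0.9,
    remask_threshold: float = 0.4,
    mask_id: int = MASK_TOKEN_ID,
) -> List[int]:
    """Apply one accept/remask pass to a block of token positions."""
    updated = []
    for token, proposal, confidence in zip(tokens, proposals, confidences):
        if token == mask_id:
            # Masked position: fill it only when the model is confident enough.
            updated.append(proposal if confidence >= accept_threshold else mask_id)
        else:
            # Filled position: send it back to mask when confidence drops too low.
            updated.append(mask_id if confidence < remask_threshold else token)
    return updated
```

Raising `--accept-threshold` in this picture fills fewer positions per step, while raising `--remask-threshold` sends more filled positions back to the mask state.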

For long-lived values, prefer YAML config. Use CLI overrides for experiment-specific changes that should be visible in the command history.