diffulex.bench

The benchmark CLI runs Diffulex through lm-evaluation-harness tasks. It is the main command-line interface for GSM8K, MATH500, HumanEval, MBPP, and custom lm-eval-compatible task YAMLs.

The entry point is:

python -m diffulex_bench.main --help

Config-First Usage

Use a YAML config when you want repeatable settings:

python -m diffulex_bench.main \
  --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
  --model-path /path/to/LLaDA2.0-mini \
  --dataset-limit 10

Command-line flags override matching config fields. This makes it practical to keep one config per model family and vary paths, limits, and output directories per run.
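The precedence rule can be pictured with a minimal Python sketch. It is not the actual diffulex_bench merge code: the helper name apply_cli_overrides and the flat field names are hypothetical, and real configs may be nested (for example engine.model_path). It only illustrates the idea that a flag given on the command line replaces the matching config value, while an omitted flag leaves the config value alone.

```python
from typing import Any, Optional


def apply_cli_overrides(config: dict[str, Any], cli: dict[str, Optional[Any]]) -> dict[str, Any]:
    """Return a copy of config in which every CLI value that was set wins."""
    merged = dict(config)
    for key, value in cli.items():
        if value is not None:  # None stands for "flag was omitted"
            merged[key] = value
    return merged


# Values as they might be loaded from a YAML file (illustrative, flattened).
yaml_config = {"model_path": "/models/from-yaml", "dataset_limit": None, "max_tokens": 256}
# Values as they might be parsed from the command line; omitted flags are None.
cli_args = {"model_path": "/path/to/LLaDA2.0-mini", "dataset_limit": 10, "max_tokens": None}

effective = apply_cli_overrides(yaml_config, cli_args)
print(effective)
```

Here model_path and dataset_limit come from the command line, and max_tokens keeps its config value because the flag was not passed.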

Model and Strategy Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --model-path | Point to the local base-model checkpoint directory. Required unless the YAML already sets engine.model_path. | Passes the model weights path into Diffulex. |
| --tokenizer-path | Point to a tokenizer directory, or omit it to fall back to the model path in the benchmark config flow. | Lets lm-eval use a tokenizer stored separately from the weights. |
| --model-name | Use a registered model key: dream, sdar, sdar_moe, fast_dllm_v2, llada, llada2, llada2_moe, llada2_mini, llada2dot1_mini, llada2_mini_dmax, or diffusion_gemma. The default is dream. | Selects the model adapter and sampler defaults. |
| --decoding-strategy | Use d2f, multi_bd, dmax, or diffusion_gemma where supported by the selected model/config. | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| --sampling-mode | Use naive, edit, or omit it and let config/model defaults apply. | Selects sampler behavior. edit is restricted to compatible LLaDA2-family names. |
| --mask-token-id | Use the tokenizer’s mask token ID. The default is 151666. | Supplies the mask token when tokenizer metadata does not override it. |
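As a rough illustration of the accepted values and defaults listed above, the following sketch mirrors these six flags with argparse. It is a hypothetical stand-in, not the real diffulex_bench parser: the function name build_model_parser is invented, and leaving --decoding-strategy and --sampling-mode at None when omitted is an assumption standing in for "config/model defaults apply".

```python
import argparse

MODEL_NAMES = [
    "dream", "sdar", "sdar_moe", "fast_dllm_v2", "llada", "llada2", "llada2_moe",
    "llada2_mini", "llada2dot1_mini", "llada2_mini_dmax", "diffusion_gemma",
]
STRATEGIES = ["d2f", "multi_bd", "dmax", "diffusion_gemma"]


def build_model_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented model and strategy flags."""
    parser = argparse.ArgumentParser(prog="model-flags-sketch")
    parser.add_argument("--model-path")
    parser.add_argument("--tokenizer-path")
    parser.add_argument("--model-name", choices=MODEL_NAMES, default="dream")
    parser.add_argument("--decoding-strategy", choices=STRATEGIES)
    parser.add_argument("--sampling-mode", choices=["naive", "edit"])
    parser.add_argument("--mask-token-id", type=int, default=151666)
    return parser


args = build_model_parser().parse_args(
    ["--model-path", "/path/to/LLaDA2.0-mini", "--model-name", "llada2_mini"]
)
# Documented fallback: with no --tokenizer-path, the model path is used.
tokenizer_path = args.tokenizer_path or args.model_path
print(args.model_name, tokenizer_path, args.mask_token_id)
```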

Parallelism and Capacity Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --tensor-parallel-size | Use 1 to 8 ranks. The CLI default is 1. | Splits one model replica across multiple GPUs. |
| --data-parallel-size | Use 1 to 1024 groups. The CLI default is 1. | Runs independent evaluation groups when enough devices are available. |
| --gpu-memory-utilization | Use a fraction such as 0.9. | Guides GPU memory planning for engine allocation. |
| --max-model-len | Use a positive sequence length. The default is 2048, and the HF config may clamp it. | Sets the requested prompt-plus-output length limit. |
| --max-num-batched-tokens | Use a positive token budget. The default is 4096; it must cover the effective model length. | Limits scheduler batch size by token count. |
| --max-num-reqs | Use a positive request count, or omit it to let config defaults apply. | Caps active requests and replaces deprecated --max-num-seqs. |
| --max-num-seqs | Deprecated; use --max-num-reqs for new commands. | Keeps older benchmark commands working when --max-num-reqs is unset. |
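The constraints in this table can be checked before launching a run. The sketch below is a hedged illustration, not Diffulex code: both helper names are hypothetical, the passed max_model_len is treated as the effective length (the real value may be clamped by the HF config), and the returned device count assumes each data-parallel group holds one full tensor-parallel replica.

```python
from typing import Optional


def resolve_max_num_reqs(max_num_reqs: Optional[int], max_num_seqs: Optional[int]) -> Optional[int]:
    """Prefer --max-num-reqs; use deprecated --max-num-seqs only when it is unset."""
    if max_num_reqs is not None:
        return max_num_reqs
    return max_num_seqs


def check_capacity(tensor_parallel_size: int, data_parallel_size: int,
                   max_model_len: int = 2048, max_num_batched_tokens: int = 4096) -> int:
    """Validate the documented ranges and return the assumed GPU count."""
    if not 1 <= tensor_parallel_size <= 8:
        raise ValueError("--tensor-parallel-size must be between 1 and 8")
    if not 1 <= data_parallel_size <= 1024:
        raise ValueError("--data-parallel-size must be between 1 and 1024")
    if max_model_len <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("length and token budget must be positive")
    if max_num_batched_tokens < max_model_len:
        raise ValueError("--max-num-batched-tokens must cover the model length")
    # Assumption: one tensor-parallel replica per data-parallel group.
    return tensor_parallel_size * data_parallel_size


print(check_capacity(tensor_parallel_size=2, data_parallel_size=4))
print(resolve_max_num_reqs(None, 64))
```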

Sampling Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --temperature | Use 0.0 for deterministic evaluation or a higher float for sampling. | Sets generation randomness for benchmark requests. |
| --max-tokens | Use a positive output-token limit. The default is 256. | Caps generated tokens per request. |
| --max-nfe | Use a positive number of forward evaluations, or omit it. | Adds a hard evaluation-step bound when the strategy supports it. |
| --max-repetition-run | Use a positive run length, or omit it. | Stops a request after the generated suffix repeats one token for too long. |
| --ignore-eos | Add the flag only when a task should continue after EOS. | Prevents EOS from ending generation. |
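To make the --max-repetition-run description concrete, here is a small sketch of a trailing-run check. It is an illustration only, not the Diffulex implementation: the function names are hypothetical, and stopping when the run length reaches the limit (rather than exceeds it) is an assumption.

```python
from typing import Optional


def trailing_run_length(token_ids: list[int]) -> int:
    """Count how many times the last token repeats at the end of the sequence."""
    if not token_ids:
        return 0
    last = token_ids[-1]
    run = 0
    for token in reversed(token_ids):
        if token != last:
            break
        run += 1
    return run


def should_stop(token_ids: list[int], max_repetition_run: Optional[int]) -> bool:
    """Stop once the generated suffix repeats one token max_repetition_run times."""
    if max_repetition_run is None:  # flag omitted: no repetition-based stop
        return False
    return trailing_run_length(token_ids) >= max_repetition_run


generated = [5, 7, 7, 7]
print(trailing_run_length(generated), should_stop(generated, 3))
```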

Dataset Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --dataset | Use an lm-eval task name. The default is gsm8k_diffulex. | Selects the benchmark task; bundled tasks live under diffulex_bench/tasks. |
| --include-path | Point to a directory of task YAMLs, omit it for bundled tasks, or pass an empty string to disable the bundled include path. | Controls where lm-eval looks for task definitions. |
| --dataset-split | Use the split name expected by the dataset. The default is test. | Passes the dataset split through config. |
| --dataset-limit | Use a positive number for smoke tests or partial runs, and omit it for the full task. | Limits how many examples are evaluated. |
| --dataset-data-files | Point to a local JSON file, or omit it. | Overrides dataset_kwargs.data_files in the task YAML. |
| --confirm-run-unsafe-code | Enable only for lm-eval tasks where executing generated code is expected and acceptable. | Allows code-execution tasks to run. |
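The three cases for --include-path (omitted, empty string, explicit directory) are easy to confuse, so the sketch below spells them out. The helper resolve_include_path is hypothetical and the relative location of the bundled tasks directory is taken from the table above; how the real CLI resolves that directory on disk is not shown here.

```python
from pathlib import Path
from typing import Optional

# Bundled task directory as named in the table; the real lookup may differ.
BUNDLED_TASKS = Path("diffulex_bench") / "tasks"


def resolve_include_path(include_path: Optional[str]) -> Optional[Path]:
    """Map the flag value to the directory lm-eval should search, if any."""
    if include_path is None:   # flag omitted: use the bundled tasks
        return BUNDLED_TASKS
    if include_path == "":     # empty string: disable the bundled include path
        return None
    return Path(include_path)  # explicit directory of task YAMLs


for value in (None, "", "/my/tasks"):
    print(repr(value), "->", resolve_include_path(value))
```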

Output and Logging

| Flag | How to set it | What it does |
| --- | --- | --- |
| --output-dir | Point to an output directory. The default is benchmark_results. | Sets the base directory for benchmark artifacts. |
| --use-run-subdirectory / --no-use-run-subdirectory | Leave run subdirectories on for normal runs, or disable them when a fixed output path is needed. | Writes each run under run_<timestamp>_<task>/. |
| --save-results / --no-save-results | Leave saving on unless only logs are needed. | Controls result and sample output files. |
| --log-file | Point to a log file, or omit it for console logging. | Adds persistent benchmark logs. |
| --log-level | Use DEBUG, INFO, WARNING, or ERROR. The default is INFO. | Controls benchmark log verbosity. |
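The effect of the run-subdirectory switch on the output location can be sketched as follows. The helper run_directory is hypothetical, and the timestamp format is an assumption: the source only gives the pattern run_<timestamp>_<task>/, so the actual directory names produced by the CLI may look different.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def run_directory(output_dir: str, task: str, use_run_subdirectory: bool = True,
                  now: Optional[datetime] = None) -> Path:
    """Return the directory where artifacts for one run would be written."""
    base = Path(output_dir)
    if not use_run_subdirectory:
        return base  # fixed output path
    # Assumed timestamp format; only the run_<timestamp>_<task> pattern is documented.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return base / f"run_{stamp}_{task}"


fixed_time = datetime(2025, 1, 2, 3, 4, 5)
print(run_directory("benchmark_results", "gsm8k_diffulex", now=fixed_time))
print(run_directory("benchmark_results", "gsm8k_diffulex", use_run_subdirectory=False))
```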

LoRA and Runtime Controls

| Flag | How to set it | What it does |
| --- | --- | --- |
| --use-lora | Add the flag when benchmarking with an adapter. | Enables LoRA adapter loading. |
| --lora-path | Point to the adapter checkpoint directory. Required with --use-lora. | Loads the adapter weights. |
| --pre-merge-lora | Add it when the adapter should be merged into the base model at load time. The benchmark YAML defaults to pre-merge on. | Avoids per-forward adapter compute when merging is supported. |
| --enforce-eager / --no-enforce-eager | Use eager mode for debugging, or explicitly allow optimized graph paths for measurement. | Overrides config-driven eager/optimized execution behavior. |
| --kv-cache-layout | Use unified for the default cache layout or distinct for strategy experiments. | Chooses KV cache storage layout. |
| --attn-impl | Use triton_grouped for benchmark runs and performance reports; triton and naive are compatibility/debug fallbacks. Omit it to keep config defaults. | Overrides the attention backend. |
| --page-size | Use 4, 8, 16, or 32 for most models; DiffusionGemma uses 256. | Sets the KV cache page size. |
| --block-size | Use 4, 8, 16, or 32 for most models; DiffusionGemma uses 256. Keep it no larger than --page-size. | Sets the token span of one diffusion block. |
| --buffer-size | Use a positive block count, or omit it and keep the config default of 4. | Controls how many diffusion blocks can remain active. |
| --enable-prefix-caching / --no-enable-prefix-caching | Use the pair to make prefix-cache behavior explicit in an experiment command. | Enables or disables compatible prefix cache reuse. |
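Two of the rules above are cross-flag constraints: --lora-path is required with --use-lora, and --block-size should not exceed --page-size. A pre-flight check along these lines could catch both before a run starts. The function check_runtime_flags is a hypothetical helper, not part of Diffulex, and the real CLI may report these problems differently or at a different stage.

```python
from typing import Optional


def check_runtime_flags(use_lora: bool, lora_path: Optional[str],
                        page_size: int, block_size: int) -> list[str]:
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    if use_lora and not lora_path:
        problems.append("--use-lora requires --lora-path")
    if block_size > page_size:
        problems.append("--block-size must be no larger than --page-size")
    return problems


# A LoRA run without an adapter path, with a block larger than the page.
print(check_runtime_flags(use_lora=True, lora_path=None, page_size=16, block_size=32))
# A DiffusionGemma-style setting from the table: both sizes at 256.
print(check_runtime_flags(use_lora=False, lora_path=None, page_size=256, block_size=256))
```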

Threshold and Token-Merge Controls

| Flag | How to set it | What it does |
| --- | --- | --- |
| --add-block-threshold | Omit it to use 0.1, or pass a float for block-add tuning. | Controls when another decoding block can be added. |
| --semi-complete-threshold | Omit it to use 0.9, or pass a float for block advancement tuning. | Controls when semi-complete block state can advance. |
| --accept-threshold | Use a confidence value from 0 to 1. The default is 0.9. | Accepts mask-to-token updates once confidence is high enough. |
| --edit-threshold | Use a confidence value from 0 to 1. The default is 0.0. | Accepts token-to-token edits in edit-style decoding. |
| --remask-threshold | Use a confidence value from 0 to 1. The default is 0.4. | Remasks filled tokens that fall below the confidence threshold. |
| --token-stability-threshold | Use a stability ratio from 0 to 1. The default is 0.0. | Controls when the next DMax edit block can be added. |
| --token-merge-mode | Use dmax_topk, iter_smooth_topk, or omit it and keep config defaults. | Selects token-merge metadata behavior for DMax-style strategies. |
| --token-merge-top-k | Use a positive candidate count, or omit it and keep the config default of 1. | Keeps this many candidates in token-merge metadata. |
| --token-merge-renormalize | Add it when the experiment should explicitly renormalize token-merge probabilities. | Controls probability normalization after token candidate filtering. |
| --token-merge-weight | Use a non-negative interpolation weight, or omit it and keep the config default of 1.0. | Weights token-merge interpolation. |
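A toy example may help build intuition for the accept and remask thresholds and for top-k filtering with renormalization. Everything in it is a simplified assumption rather than the Diffulex sampler: the function names are hypothetical, the comparison directions (at or above the accept threshold, strictly below the remask threshold) are guesses consistent with the descriptions above, and the real strategies operate on tensors and per-block state.

```python
def classify_positions(confidences: list[float], is_masked: list[bool],
                       accept_threshold: float = 0.9,
                       remask_threshold: float = 0.4) -> tuple[list[int], list[int]]:
    """Return (positions to unmask, positions to remask) for one step."""
    accept, remask = [], []
    for i, (conf, masked) in enumerate(zip(confidences, is_masked)):
        if masked and conf >= accept_threshold:
            accept.append(i)   # mask-to-token update is confident enough
        elif not masked and conf < remask_threshold:
            remask.append(i)   # filled token fell below the remask threshold
    return accept, remask


def top_k_candidates(probs: dict[int, float], k: int = 1,
                     renormalize: bool = True) -> dict[int, float]:
    """Keep the k most likely token ids, optionally rescaled to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    if not renormalize:
        return dict(kept)
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


print(classify_positions([0.95, 0.5, 0.3, 0.8], [True, True, False, False]))
print(top_k_candidates({11: 0.5, 22: 0.3, 33: 0.2}, k=2))
```

With the defaults, only position 0 is accepted and only position 2 is remasked; the top-2 candidates keep token ids 11 and 22 with their probabilities rescaled.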

For long-lived values, prefer YAML config. Use CLI overrides for experiment-specific changes that should be visible in the command history.