# diffulex.bench

The benchmark CLI runs Diffulex through lm-evaluation-harness tasks. It is the main command-line interface for GSM8K, MATH500, HumanEval, MBPP, and custom lm-eval compatible task YAMLs.

The entry point is:

```bash
python -m diffulex_bench.main --help
```

## Config-First Usage

Use a YAML config when you want repeatable settings:

```bash
python -m diffulex_bench.main \
  --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
  --model-path /path/to/LLaDA2.0-mini \
  --dataset-limit 10
```

Command line flags override matching config fields. This makes it practical to keep one config per model family and vary paths, limits, and output directories per run.

## Model and Strategy Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--model-path` | Point to the local base-model checkpoint directory. Required unless the YAML already sets `engine.model_path`. | Passes the model weights path into Diffulex. |
| `--tokenizer-path` | Point to a tokenizer directory, or omit it to fall back to the model path in the benchmark config flow. | Lets lm-eval use a tokenizer stored separately from the weights. |
| `--model-name` | Use a registered model key: `dream`, `sdar`, `sdar_moe`, `fast_dllm_v2`, `llada`, `llada2`, `llada2_moe`, `llada2_mini`, `llada2dot1_mini`, `llada2_mini_dmax`, or `diffusion_gemma`. The default is `dream`. | Selects the model adapter and sampler defaults. |
| `--decoding-strategy` | Use `d2f`, `multi_bd`, `dmax`, or `diffusion_gemma` where supported by the selected model/config. | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| `--sampling-mode` | Use `naive`, `edit`, or omit it and let config/model defaults apply. | Selects sampler behavior. `edit` is restricted to compatible LLaDA2-family names. |
| `--mask-token-id` | Use the tokenizer's mask token ID. The default is `151666`. | Supplies the mask token when tokenizer metadata does not override it. |

## Parallelism and Capacity Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--tensor-parallel-size` | Use `1` to `8` ranks. The CLI default is `1`. | Splits one model replica across multiple GPUs. |
| `--data-parallel-size` | Use `1` to `1024` groups. The CLI default is `1`. | Runs independent evaluation groups when enough devices are available. |
| `--gpu-memory-utilization` | Use a fraction such as `0.9`. | Guides GPU memory planning for engine allocation. |
| `--max-model-len` | Use a positive sequence length. The default is `2048`, and the HF config may clamp it. | Sets the requested prompt-plus-output length limit. |
| `--max-num-batched-tokens` | Use a positive token budget. The default is `4096`; it must cover the effective model length. | Limits scheduler batch size by token count. |
| `--max-num-reqs` | Use a positive request count, or omit it to let config defaults apply. | Caps active requests and replaces deprecated `--max-num-seqs`. |
| `--max-num-seqs` | Deprecated; use `--max-num-reqs` for new commands. | Keeps older benchmark commands working when `--max-num-reqs` is unset. |

## Sampling Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--temperature` | Use `0.0` for deterministic evaluation or a higher float for sampling. | Sets generation randomness for benchmark requests. |
| `--max-tokens` | Use a positive output-token limit. The default is `256`. | Caps generated tokens per request. |
| `--max-nfe` | Use a positive number of forward evaluations, or omit it. | Adds a hard evaluation-step bound when the strategy supports it. |
| `--max-repetition-run` | Use a positive run length, or omit it. | Stops a request after the generated suffix repeats one token for too long. |
| `--ignore-eos` | Add the flag only when a task should continue after EOS. | Prevents EOS from ending generation. |

## Dataset Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--dataset` | Use an lm-eval task name. The default is `gsm8k_diffulex`. | Selects the benchmark task; bundled tasks live under `diffulex_bench/tasks`. |
| `--include-path` | Point to a directory of task YAMLs, omit it for bundled tasks, or pass an empty string to disable the bundled include path. | Controls where lm-eval looks for task definitions. |
| `--dataset-split` | Use the split name expected by the dataset. The default is `test`. | Passes the dataset split through config. |
| `--dataset-limit` | Use a positive number for smoke tests or partial runs, and omit it for the full task. | Limits how many examples are evaluated. |
| `--dataset-data-files` | Point to a local JSON file, or omit it. | Overrides `dataset_kwargs.data_files` in the task YAML. |
| `--confirm-run-unsafe-code` | Enable only for lm-eval tasks where executing generated code is expected and acceptable. | Allows code-execution tasks to run. |

## Output and Logging

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--output-dir` | Point to an output directory. The default is `benchmark_results`. | Sets the base directory for benchmark artifacts. |
| `--use-run-subdirectory` / `--no-use-run-subdirectory` | Leave run subdirectories on for normal runs, or disable them when a fixed output path is needed. | Writes each run under `run__/`. |
| `--save-results` / `--no-save-results` | Leave saving on unless only logs are needed. | Controls result and sample output files. |
| `--log-file` | Point to a log file, or omit it for console logging. | Adds persistent benchmark logs. |
| `--log-level` | Use `DEBUG`, `INFO`, `WARNING`, or `ERROR`. The default is `INFO`. | Controls benchmark log verbosity. |

## LoRA and Runtime Controls

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--use-lora` | Add the flag when benchmarking with an adapter. | Enables LoRA adapter loading. |
| `--lora-path` | Point to the adapter checkpoint directory. Required with `--use-lora`. | Loads the adapter weights. |
| `--pre-merge-lora` | Add when the adapter should be merged into the base model at load time. Benchmark YAML defaults to pre-merge on. | Avoids per-forward adapter compute when merging is supported. |
| `--enforce-eager` / `--no-enforce-eager` | Use eager mode for debugging, or explicitly allow optimized graph paths for measurement. | Overrides config-driven eager/optimized execution behavior. |
| `--kv-cache-layout` | Use `unified` for the default cache layout or `distinct` for strategy experiments. | Chooses KV cache storage layout. |
| `--attn-impl` | Use `triton_grouped` for benchmark runs and performance reports. `triton` and `naive` are compatibility/debug fallbacks, or omit to keep config defaults. | Overrides the attention backend. |
| `--page-size` | Use `4`, `8`, `16`, or `32` for most models; DiffusionGemma uses `256`. | Sets the KV cache page size. |
| `--block-size` | Use `4`, `8`, `16`, or `32` for most models; DiffusionGemma uses `256`. Keep it no larger than `--page-size`. | Sets the token span of one diffusion block. |
| `--buffer-size` | Use a positive block count, or omit it and keep the config default of `4`. | Controls how many diffusion blocks can remain active. |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | Use the pair to make prefix-cache behavior explicit in an experiment command. | Enables or disables compatible prefix cache reuse. |

## Threshold and Token-Merge Controls

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--add-block-threshold` | Omit it to use `0.1`, or pass a float for block-add tuning. | Controls when another decoding block can be added. |
| `--semi-complete-threshold` | Omit it to use `0.9`, or pass a float for block advancement tuning. | Controls when semi-complete block state can advance. |
| `--accept-threshold` | Use a confidence value from `0` to `1`. The default is `0.9`. | Accepts mask-to-token updates once confidence is high enough. |
| `--edit-threshold` | Use a confidence value from `0` to `1`. The default is `0.0`. | Accepts token-to-token edits in edit-style decoding. |
| `--remask-threshold` | Use a confidence value from `0` to `1`. The default is `0.4`. | Remasks filled tokens that fall below the confidence threshold. |
| `--token-stability-threshold` | Use a stability ratio from `0` to `1`. The default is `0.0`. | Controls when the next DMax edit block can be added. |
| `--token-merge-mode` | Use `dmax_topk`, `iter_smooth_topk`, or omit it and keep config defaults. | Selects token-merge metadata behavior for DMax-style strategies. |
| `--token-merge-top-k` | Use a positive candidate count, or omit it and keep the config default of `1`. | Keeps this many candidates in token-merge metadata. |
| `--token-merge-renormalize` | Add when the experiment should explicitly renormalize token-merge probabilities. | Controls probability normalization after token candidate filtering. |
| `--token-merge-weight` | Use a non-negative interpolation weight, or omit it and keep the config default of `1.0`. | Weights token-merge interpolation. |

For long-lived values, prefer YAML config. Use CLI overrides for experiment specific changes that should be visible in the command history.
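
One way to follow this split is a small wrapper script that keeps the YAML config fixed and spells out every per-run override as a flag. The sketch below is illustrative only: the variable names (`CONFIG`, `MODEL_PATH`, `LIMIT`, `DIFFULEX_RUN`) are hypothetical and not part of Diffulex, the model path is a placeholder, and only flags listed in the tables above are used. By default it prints the assembled command instead of running it.

```bash
#!/bin/sh
set -eu

# Long-lived settings stay in the YAML config (path taken from the usage example).
CONFIG="diffulex_bench/configs/llada2_mini_gsm8k.yml"

# Per-run values; these variable names are hypothetical helpers for this sketch.
MODEL_PATH="${MODEL_PATH:-/path/to/LLaDA2.0-mini}"  # placeholder checkpoint path
LIMIT="${LIMIT:-10}"                                # small limit for a smoke test

# Assemble the command as positional parameters so each argument stays intact.
set -- python -m diffulex_bench.main \
  --config "$CONFIG" \
  --model-path "$MODEL_PATH" \
  --dataset-limit "$LIMIT" \
  --temperature 0.0 \
  --output-dir benchmark_results \
  --log-level INFO

# Print the command for review; set DIFFULEX_RUN=1 to execute it.
echo "$*"
if [ "${DIFFULEX_RUN:-0}" = "1" ]; then
  "$@"
fi
```

Because the overrides appear as explicit flags, the printed line doubles as a record of what differed from the config in a given run.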