diffulex.bench
The benchmark CLI runs Diffulex through lm-evaluation-harness tasks. It is the main command-line interface for GSM8K, MATH500, HumanEval, MBPP, and custom lm-eval-compatible task YAMLs.
The entry point is:
python -m diffulex_bench.main --help
Config-First Usage
Use a YAML config when you want repeatable settings:
python -m diffulex_bench.main \
    --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
    --model-path /path/to/LLaDA2.0-mini \
    --dataset-limit 10
Command-line flags override matching config fields. This makes it practical to keep one config per model family and vary paths, limits, and output directories per run.
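The override rule can be pictured with a minimal sketch. It assumes that a flag left unset arrives as `None` and that config fields share names with the flags; `merge_overrides` and the field names `model_path`, `dataset_limit`, and `temperature` are hypothetical and are not taken from the Diffulex code base.

```python
from typing import Any, Dict, Optional


def merge_overrides(config: Dict[str, Any], cli: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Return a copy of `config` where every CLI value that was set wins."""
    merged = dict(config)
    for key, value in cli.items():
        # Assumption: None means "flag not given", so the YAML value survives.
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical values loaded from a YAML config.
yaml_config = {"model_path": "/path/from/yaml", "dataset_limit": None, "temperature": 0.0}
# Hypothetical parsed flags; temperature was not passed on the command line.
cli_args = {"model_path": "/path/to/LLaDA2.0-mini", "dataset_limit": 10, "temperature": None}

merged = merge_overrides(yaml_config, cli_args)
print(merged)
```

In this sketch the path and limit come from the command line, while the temperature stays at the config value because no flag replaced it.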
Model and Strategy Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Point to the local base-model checkpoint directory. Required unless the YAML already sets | Passes the model weights path into Diffulex. |
| | Point to a tokenizer directory, or omit it to fall back to the model path in the benchmark config flow. | Lets lm-eval use a tokenizer stored separately from the weights. |
| | Use a registered model key: | Selects the model adapter and sampler defaults. |
| | Use | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| | Use | Selects sampler behavior. |
| | Use the tokenizer’s mask token ID. The default is | Supplies the mask token when tokenizer metadata does not override it. |
Parallelism and Capacity Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Use | Splits one model replica across multiple GPUs. |
| | Use | Runs independent evaluation groups when enough devices are available. |
| | Use a fraction such as | Guides GPU memory planning for engine allocation. |
| | Use a positive sequence length. The default is | Sets the requested prompt-plus-output length limit. |
| | Use a positive token budget. The default is | Limits scheduler batch size by token count. |
| | Use a positive request count, or omit it to let config defaults apply. | Caps active requests and replaces deprecated |
| | Deprecated; use | Keeps older benchmark commands working when |
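As a rough planning aid, the two parallelism settings can be combined into a device count. The sketch below assumes each independent evaluation group holds one full model replica, so the total is the product of the two sizes; that is a common convention and is not confirmed by this page. The function names are hypothetical.

```python
def devices_required(tensor_parallel: int, data_parallel: int) -> int:
    """GPUs needed if every data-parallel group holds one tensor-parallel replica."""
    if tensor_parallel < 1 or data_parallel < 1:
        raise ValueError("parallel sizes must be positive integers")
    # Assumption: groups do not share devices, so the sizes multiply.
    return tensor_parallel * data_parallel


def fits(tensor_parallel: int, data_parallel: int, available_devices: int) -> bool:
    """True when the requested layout fits on the available GPUs."""
    return devices_required(tensor_parallel, data_parallel) <= available_devices


print(devices_required(2, 4))
print(fits(2, 4, available_devices=4))
```

Checking the product before launching a run is cheaper than discovering a shortfall after the engine has started allocating memory.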
Sampling Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Use | Sets generation randomness for benchmark requests. |
| | Use a positive output-token limit. The default is | Caps generated tokens per request. |
| | Use a positive number of forward evaluations, or omit it. | Adds a hard evaluation-step bound when the strategy supports it. |
| | Use a positive run length, or omit it. | Stops a request after the generated suffix repeats one token for too long. |
| | Add the flag only when a task should continue after EOS. | Prevents EOS from ending generation. |
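The repeated-token stop condition in the table can be illustrated with a small sketch. It assumes the check counts identical tokens at the end of the generated suffix and stops once that count reaches the configured run length; whether Diffulex compares with "at least" or "more than" is not stated here, and the names `trailing_run_length` and `should_stop` are hypothetical.

```python
from typing import Optional, Sequence


def trailing_run_length(tokens: Sequence[int]) -> int:
    """Length of the run of identical tokens at the end of `tokens`."""
    if not tokens:
        return 0
    last = tokens[-1]
    run = 0
    for token in reversed(tokens):
        if token != last:
            break
        run += 1
    return run


def should_stop(tokens: Sequence[int], max_run: Optional[int]) -> bool:
    """Stop when the trailing run reaches `max_run`; None disables the check."""
    if max_run is None:
        return False
    # Assumption: the limit is inclusive.
    return trailing_run_length(tokens) >= max_run


generated = [5, 7, 7, 7]
print(trailing_run_length(generated), should_stop(generated, max_run=3))
```

Omitting the run length corresponds to `max_run=None` in this sketch, which leaves generation bounded only by the other limits.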
Dataset Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Use an lm-eval task name. The default is | Selects the benchmark task; bundled tasks live under |
| | Point to a directory of task YAMLs, omit it for bundled tasks, or pass an empty string to disable the bundled include path. | Controls where lm-eval looks for task definitions. |
| | Use the split name expected by the dataset. The default is | Passes the dataset split through config. |
| | Use a positive number for smoke tests or partial runs, and omit it for the full task. | Limits how many examples are evaluated. |
| | Point to a local JSON file, or omit it. | Overrides |
| | Enable only for lm-eval tasks where executing generated code is expected and acceptable. | Allows code-execution tasks to run. |
Output and Logging
| Flag | How to set it | What it does |
|---|---|---|
| | Point to an output directory. The default is | Sets the base directory for benchmark artifacts. |
| | Leave run subdirectories on for normal runs, or disable them when a fixed output path is needed. | Writes each run under |
| | Leave saving on unless only logs are needed. | Controls result and sample output files. |
| | Point to a log file, or omit it for console logging. | Adds persistent benchmark logs. |
| | Use | Controls benchmark log verbosity. |
LoRA and Runtime Controls
| Flag | How to set it | What it does |
|---|---|---|
| | Add the flag when benchmarking with an adapter. | Enables LoRA adapter loading. |
| | Point to the adapter checkpoint directory. Required with | Loads the adapter weights. |
| | Add when the adapter should be merged into the base model at load time. Benchmark YAML defaults to pre-merge on. | Avoids per-forward adapter compute when merging is supported. |
| | Use eager mode for debugging, or explicitly allow optimized graph paths for measurement. | Overrides config-driven eager/optimized execution behavior. |
| | Use | Chooses KV cache storage layout. |
| | Use | Overrides the attention backend. |
| | Use | Sets the KV cache page size. |
| | Use | Sets the token span of one diffusion block. |
| | Use a positive block count, or omit it and keep the config default of | Controls how many diffusion blocks can remain active. |
| | Use the pair to make prefix-cache behavior explicit in an experiment command. | Enables or disables compatible prefix cache reuse. |
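When choosing a page size and a block size, it can help to estimate how many units a request spans. The sketch below is plain ceiling division under the assumption that a sequence occupies whole KV cache pages and that output is produced in whole diffusion blocks; the helper names are hypothetical and the engine's real accounting may differ.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division for non-negative numerators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return -(-numerator // denominator)


def pages_needed(sequence_length: int, page_size: int) -> int:
    """KV cache pages a sequence would occupy if pages are never shared."""
    return ceil_div(sequence_length, page_size)


def blocks_needed(max_new_tokens: int, block_size: int) -> int:
    """Diffusion blocks required to cover the output-token limit."""
    return ceil_div(max_new_tokens, block_size)


print(pages_needed(100, 32), blocks_needed(256, 32))
```

A sequence that ends partway through a page still reserves the whole page in this model, which is why small page sizes waste less memory on short requests.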
Threshold and Token-Merge Controls
| Flag | How to set it | What it does |
|---|---|---|
| | Omit it to use | Controls when another decoding block can be added. |
| | Omit it to use | Controls when semi-complete block state can advance. |
| | Use a confidence value from | Accepts mask-to-token updates once confidence is high enough. |
| | Use a confidence value from | Accepts token-to-token edits in edit-style decoding. |
| | Use a confidence value from | Remasks filled tokens that fall below the confidence threshold. |
| | Use a stability ratio from | Controls when the next DMax edit block can be added. |
| | Use | Selects token-merge metadata behavior for DMax-style strategies. |
| | Use a positive candidate count, or omit it and keep the config default of | Keeps this many candidates in token-merge metadata. |
| | Add when the experiment should explicitly renormalize token-merge probabilities. | Controls probability normalization after token candidate filtering. |
| | Use a non-negative interpolation weight, or omit it and keep the config default of | Weights token-merge interpolation. |
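The threshold rows describe three small rules: accept a masked position once its confidence is high enough, remask a filled position whose confidence drops too low, and optionally renormalize after keeping the top candidates. The sketch below shows each rule on plain Python lists. It is an illustration under stated assumptions and not the Diffulex sampler: the mask ID of `-1`, the inclusive comparison, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence

MASK = -1  # hypothetical mask token ID, used only in this sketch


def accept_above_threshold(current: Sequence[int], proposals: Sequence[int],
                           confidences: Sequence[float], threshold: float) -> List[int]:
    """Fill masked positions whose proposal confidence reaches the threshold."""
    out = list(current)
    for i, token in enumerate(current):
        if token == MASK and confidences[i] >= threshold:
            out[i] = proposals[i]
    return out


def remask_below_threshold(current: Sequence[int], confidences: Sequence[float],
                           threshold: float) -> List[int]:
    """Turn filled positions back into masks when confidence is too low."""
    return [MASK if token != MASK and conf < threshold else token
            for token, conf in zip(current, confidences)]


def top_k_renormalize(probs: Sequence[float], k: int) -> Dict[int, float]:
    """Keep the k most likely candidates and rescale them to sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    if total == 0:
        return {i: 0.0 for i in keep}
    return {i: probs[i] / total for i in keep}


filled = accept_above_threshold([MASK, 4, MASK], [9, 0, 8], [0.95, 0.1, 0.5], 0.9)
print(filled)
print(remask_below_threshold(filled, [0.95, 0.1, 0.5], 0.9))
print(top_k_renormalize([0.5, 0.3, 0.2], k=2))
```

Raising the acceptance threshold in this model fills fewer positions per step, trading speed for fewer low-confidence commitments.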
For long-lived values, prefer the YAML config. Use CLI overrides for experiment-specific changes that should be visible in the command history.