Benchmark¶
Use diffulex_bench for dataset-backed evaluation workloads such as GSM8K,
HumanEval, MBPP, and MATH500. The benchmark path wraps Diffulex as an
lm-evaluation-harness model and writes logs, samples, trajectories, and metrics
under the configured output directory.
Recommended Workflow¶
Pick a config under
diffulex_bench/configs/.Override the local model path from the command line.
Start with
--dataset-limit.Inspect generated text and metrics.
Remove the limit only after the limited run is correct.
LLaDA2-Mini GSM8K¶
The maintained LLaDA2-mini path is:
CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \
--config diffulex_bench/configs/llada2_mini_gsm8k.yml \
--model-path /path/to/LLaDA2.0-mini \
--dataset-limit 10 \
--output-dir benchmark_results/llada2_mini_gsm8k
The convenience wrapper is equivalent for most runs:
CUDA_VISIBLE_DEVICES=0 \
MODEL_PATH=/path/to/LLaDA2.0-mini \
DATASET_LIMIT=10 \
script/run_llada2_mini_gsm8k.sh
The config defaults to single-request TP1 execution. Keep that path for
single-sample speed comparisons. Increase max_num_reqs or data parallelism
only when measuring aggregate throughput.
DiffusionGemma¶
Native Diffulex DiffusionGemma benchmark:
CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \
--config diffulex_bench/configs/diffusion_gemma_gsm8k.yml \
--model-path /path/to/diffusiongemma-26B-A4B-it \
--dataset-limit 10
DiffusionGemma uses a 256-token block/page size and model-specific entropy-bound sampling controls. Keep the provided config as the baseline unless you are profiling a specific runtime change.
vLLM DiffusionGemma Baseline¶
The vLLM runner is a baseline for comparison, not a Diffulex engine launcher:
CUDA_VISIBLE_DEVICES=0 \
MODEL=/path/to/diffusiongemma-26B-A4B-it \
CONFIG_PATH=examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_smoke.yml \
script/run_vllm_diffusion_gemma_gsm8k.sh
Use examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_full.yml for
the full run after the smoke run passes.
Other Configs¶
Config |
Typical use |
|---|---|
|
DiffuCoder D2F-style GSM8K. |
|
Dream D2F-style GSM8K. |
|
Fast-dLLM-v2 multi-block GSM8K. |
|
SDAR dense GSM8K. |
|
SDAR-MoE GSM8K. |
|
LLaDA2-mini DMax/edit sampling. |
Most configs include development-cluster paths. Override them with
--model-path, or copy the YAML and edit the path for repeated runs.
Output Files¶
A normal run writes to output_dir/run_<timestamp>_<task>/ unless
use_run_subdirectory is disabled. Inspect:
Artifact |
Use |
|---|---|
Benchmark log |
Engine args, progress logs, startup errors. |
|
Token counts, NFE counts, aggregate throughput, per-sample throughput. |
Response JSON files |
Full, truncated, and extracted responses. |
Decode trajectory |
Per-step output trajectory when enabled by the benchmark path. |
lm-eval result files |
Task metrics and per-sample records. |
Diffulex reports both aggregate throughput and per-sample mean throughput. For engine comparison, aggregate throughput is usually the better capacity metric; per-sample mean remains useful for single-request latency behavior.
Common Overrides¶
Override |
Example |
Notes |
|---|---|---|
Model path |
|
Prefer CLI override over editing checked-in configs. |
Dataset limit |
|
Use for smoke tests. Omit for full evaluation. |
Token cap |
|
Raise only if outputs are truncated. |
NFE cap |
|
Use when comparing diffusion step budgets. |
Output dir |
|
Keeps runs grouped by experiment. |
Progress bars |
|
Avoids terminal stalls during long runs. |
For exact CLI options, run:
python -m diffulex_bench.main --help