Benchmark

Use diffulex_bench for dataset-backed evaluation workloads such as GSM8K, HumanEval, MBPP, and MATH500. The benchmark path wraps Diffulex as an lm-evaluation-harness model and writes logs, samples, trajectories, and metrics under the configured output directory.

LLaDA2-Mini GSM8K

The maintained LLaDA2-mini path is:

CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \
  --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
  --model-path /path/to/LLaDA2.0-mini \
  --dataset-limit 10 \
  --output-dir benchmark_results/llada2_mini_gsm8k

The convenience wrapper is equivalent for most runs:

CUDA_VISIBLE_DEVICES=0 \
MODEL_PATH=/path/to/LLaDA2.0-mini \
DATASET_LIMIT=10 \
script/run_llada2_mini_gsm8k.sh

The config defaults to single-request execution with tensor parallel size 1 (TP1). Keep that setup for single-sample speed comparisons. Increase max_num_reqs or data parallelism only when measuring aggregate throughput.

DiffusionGemma

To run the native Diffulex DiffusionGemma benchmark:

CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \
  --config diffulex_bench/configs/diffusion_gemma_gsm8k.yml \
  --model-path /path/to/diffusiongemma-26B-A4B-it \
  --dataset-limit 10

DiffusionGemma uses a 256-token block/page size and model-specific entropy-bound sampling controls. Keep the provided config as the baseline unless you are profiling a specific runtime change.

vLLM DiffusionGemma Baseline

The vLLM runner is a baseline for comparison, not a Diffulex engine launcher:

CUDA_VISIBLE_DEVICES=0 \
MODEL=/path/to/diffusiongemma-26B-A4B-it \
CONFIG_PATH=examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_smoke.yml \
script/run_vllm_diffusion_gemma_gsm8k.sh

Use examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_full.yml for the full run after the smoke run passes.
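
The smoke-then-full order can be scripted. The following is a minimal sketch, not part of the repository: it assumes the wrapper script exits with a non-zero status when a run fails, and the helper names run_config and smoke_then_full are hypothetical. It uses only the environment variables and config paths shown above.

```python
import os
import subprocess

SCRIPT = "script/run_vllm_diffusion_gemma_gsm8k.sh"
CONFIG_DIR = "examples/engine_lm_eval/configs"
SMOKE_CONFIG = f"{CONFIG_DIR}/vllm_diffusion_gemma_gsm8k_smoke.yml"
FULL_CONFIG = f"{CONFIG_DIR}/vllm_diffusion_gemma_gsm8k_full.yml"


def run_config(config_path, model, gpu="0", runner=subprocess.run):
    """Run the baseline wrapper once and return its exit code (hypothetical helper)."""
    env = dict(os.environ)
    env.update({"CUDA_VISIBLE_DEVICES": gpu, "MODEL": model, "CONFIG_PATH": config_path})
    return runner([SCRIPT], env=env).returncode


def smoke_then_full(model, gpu="0", runner=subprocess.run):
    """Run the smoke config first; start the full config only if it succeeded."""
    # Assumption: the wrapper script signals failure with a non-zero exit code.
    if run_config(SMOKE_CONFIG, model, gpu, runner) != 0:
        return "smoke failed"
    if run_config(FULL_CONFIG, model, gpu, runner) != 0:
        return "full failed"
    return "ok"
```

Call smoke_then_full("/path/to/diffusiongemma-26B-A4B-it") from the repository root. The runner parameter exists so the sequence can be dry-run with a stand-in function before any GPU time is spent.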

Other Configs

| Config | Typical use |
| --- | --- |
| diffucoder_instruct_gsm8k.yml | DiffuCoder D2F-style GSM8K. |
| dream_base_gsm8k.yml | Dream D2F-style GSM8K. |
| fast_dllm_v2_gsm8k.yml | Fast-dLLM-v2 multi-block GSM8K. |
| sdar_chat_gsm8k.yml | SDAR dense GSM8K. |
| sdar_moe_chat_gsm8k.yml | SDAR-MoE GSM8K. |
| llada2_mini_dmax_gsm8k.yml | LLaDA2-mini DMax/edit sampling. |

Most configs include development-cluster paths. Override them with --model-path, or copy the YAML and edit the path for repeated runs.

Output Files

A normal run writes to output_dir/run_<timestamp>_<task>/ unless use_run_subdirectory is disabled. Inspect the following artifacts:

| Artifact | Use |
| --- | --- |
| Benchmark log | Engine args, progress logs, startup errors. |
| diffulex_stats.json | Token counts, NFE counts, aggregate throughput, per-sample throughput. |
| Response JSON files | Full, truncated, and extracted responses. |
| Decode trajectory | Per-step output trajectory when enabled by the benchmark path. |
| lm-eval result files | Task metrics and per-sample records. |
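
To inspect the newest run without hunting for the timestamped directory, something like the following may help. It is a sketch under stated assumptions: run directories are direct children of the output directory named run_*, and diffulex_stats.json sits somewhere under the run directory. The helper names latest_run_dir and load_stats are hypothetical, and no JSON keys are assumed.

```python
import json
from pathlib import Path


def latest_run_dir(output_dir):
    """Return the most recently modified run_* subdirectory, or None."""
    runs = [p for p in Path(output_dir).glob("run_*") if p.is_dir()]
    # Sort by modification time because the timestamp format is not assumed.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def load_stats(run_dir):
    """Load diffulex_stats.json from a run directory, searching nested folders."""
    matches = sorted(Path(run_dir).rglob("diffulex_stats.json"))
    if not matches:
        raise FileNotFoundError(f"no diffulex_stats.json under {run_dir}")
    with matches[0].open() as fh:
        return json.load(fh)
```

For example, load_stats(latest_run_dir("benchmark_results/llada2_mini_gsm8k")) returns the parsed stats for the latest run, provided at least one run directory exists. Print the result's top-level keys first to see which fields your version writes.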

Diffulex reports both aggregate throughput and per-sample mean throughput. For engine comparison, aggregate throughput is usually the better capacity metric; the per-sample mean remains useful for assessing single-request latency.
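
A toy calculation shows how the two metrics can diverge. This sketch assumes aggregate throughput means total generated tokens divided by wall-clock time for the run, and per-sample mean throughput means the average of each sample's own tokens per second; the exact definitions used in diffulex_stats.json may differ. All numbers and names are made up for illustration.

```python
from statistics import mean

# Hypothetical records: (generated_tokens, seconds_for_that_sample).
samples = [(256, 2.0), (256, 4.0), (128, 1.0), (128, 2.0)]
# Hypothetical wall-clock time for the whole batch when requests overlap.
wall_clock_seconds = 4.0


def aggregate_throughput(samples, wall_clock_seconds):
    """Total tokens produced per second of wall-clock time (capacity view)."""
    return sum(tokens for tokens, _ in samples) / wall_clock_seconds


def per_sample_mean_throughput(samples):
    """Mean of each sample's own tokens-per-second (single-request view)."""
    return mean(tokens / seconds for tokens, seconds in samples)


print(aggregate_throughput(samples, wall_clock_seconds))  # 192.0
print(per_sample_mean_throughput(samples))                # 96.0
```

Because the four requests overlap in time, the aggregate figure is twice the per-sample mean here. Under these assumed definitions, concurrency can raise capacity without making any individual request faster.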

Common Overrides

| Override | Example | Notes |
| --- | --- | --- |
| Model path | --model-path /path/to/model | Prefer CLI override over editing checked-in configs. |
| Dataset limit | --dataset-limit 10 | Use for smoke tests. Omit for full evaluation. |
| Token cap | --max-tokens 2048 | Raise only if outputs are truncated. |
| NFE cap | --max-nfe 1024 | Use when comparing diffusion step budgets. |
| Output dir | --output-dir benchmark_results/my_run | Keeps runs grouped by experiment. |
| Progress bars | eval.use_tqdm: false in YAML | Avoids terminal stalls during long runs. |

For exact CLI options, run:

python -m diffulex_bench.main --help
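
For repeated experiments, the CLI overrides can be assembled programmatically. The following is a minimal sketch: build_command is a hypothetical helper, and it uses only the flags listed in this section. Confirm the flags against the --help output for your version before relying on it.

```python
import shlex
import sys


def build_command(config, model_path=None, dataset_limit=None,
                  max_tokens=None, max_nfe=None, output_dir=None):
    """Assemble a diffulex_bench.main invocation from optional overrides."""
    cmd = [sys.executable, "-m", "diffulex_bench.main", "--config", config]
    overrides = [
        ("--model-path", model_path),
        ("--dataset-limit", dataset_limit),
        ("--max-tokens", max_tokens),
        ("--max-nfe", max_nfe),
        ("--output-dir", output_dir),
    ]
    for flag, value in overrides:
        # Skip unset overrides so the value from the YAML config stays in effect.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command(
    "diffulex_bench/configs/llada2_mini_gsm8k.yml",
    model_path="/path/to/LLaDA2.0-mini",
    dataset_limit=10,
    output_dir="benchmark_results/my_run",
)
# Print the command for review; pass cmd to subprocess.run(cmd, check=True) to execute.
print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids shell-quoting problems when a model path contains spaces.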