# Benchmark Use `diffulex_bench` for dataset-backed evaluation workloads such as GSM8K, HumanEval, MBPP, and MATH500. The benchmark path wraps Diffulex as an lm-evaluation-harness model and writes logs, samples, trajectories, and metrics under the configured output directory. ## Recommended Workflow 1. Pick a config under `diffulex_bench/configs/`. 2. Override the local model path from the command line. 3. Start with `--dataset-limit`. 4. Inspect generated text and metrics. 5. Remove the limit only after the limited run is correct. ## LLaDA2-Mini GSM8K The maintained LLaDA2-mini path is: ```bash CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \ --config diffulex_bench/configs/llada2_mini_gsm8k.yml \ --model-path /path/to/LLaDA2.0-mini \ --dataset-limit 10 \ --output-dir benchmark_results/llada2_mini_gsm8k ``` The convenience wrapper is equivalent for most runs: ```bash CUDA_VISIBLE_DEVICES=0 \ MODEL_PATH=/path/to/LLaDA2.0-mini \ DATASET_LIMIT=10 \ script/run_llada2_mini_gsm8k.sh ``` The config defaults to single-request TP1 execution. Keep that path for single-sample speed comparisons. Increase `max_num_reqs` or data parallelism only when measuring aggregate throughput. ## DiffusionGemma Native Diffulex DiffusionGemma benchmark: ```bash CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \ --config diffulex_bench/configs/diffusion_gemma_gsm8k.yml \ --model-path /path/to/diffusiongemma-26B-A4B-it \ --dataset-limit 10 ``` DiffusionGemma uses a 256-token block/page size and model-specific entropy-bound sampling controls. Keep the provided config as the baseline unless you are profiling a specific runtime change. ## vLLM DiffusionGemma Baseline The vLLM runner is a baseline for comparison, not a Diffulex engine launcher: ```bash CUDA_VISIBLE_DEVICES=0 \ MODEL=/path/to/diffusiongemma-26B-A4B-it \ CONFIG_PATH=examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_smoke.yml \ script/run_vllm_diffusion_gemma_gsm8k.sh ``` Use `examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_full.yml` for the full run after the smoke run passes. ## Other Configs | Config | Typical use | | --- | --- | | `diffucoder_instruct_gsm8k.yml` | DiffuCoder D2F-style GSM8K. | | `dream_base_gsm8k.yml` | Dream D2F-style GSM8K. | | `fast_dllm_v2_gsm8k.yml` | Fast-dLLM-v2 multi-block GSM8K. | | `sdar_chat_gsm8k.yml` | SDAR dense GSM8K. | | `sdar_moe_chat_gsm8k.yml` | SDAR-MoE GSM8K. | | `llada2_mini_dmax_gsm8k.yml` | LLaDA2-mini DMax/edit sampling. | Most configs include development-cluster paths. Override them with `--model-path`, or copy the YAML and edit the path for repeated runs. ## Output Files A normal run writes to `output_dir/run__/` unless `use_run_subdirectory` is disabled. Inspect: | Artifact | Use | | --- | --- | | Benchmark log | Engine args, progress logs, startup errors. | | `diffulex_stats.json` | Token counts, NFE counts, aggregate throughput, per-sample throughput. | | Response JSON files | Full, truncated, and extracted responses. | | Decode trajectory | Per-step output trajectory when enabled by the benchmark path. | | lm-eval result files | Task metrics and per-sample records. | Diffulex reports both aggregate throughput and per-sample mean throughput. For engine comparison, aggregate throughput is usually the better capacity metric; per-sample mean remains useful for single-request latency behavior. ## Common Overrides | Override | Example | Notes | | --- | --- | --- | | Model path | `--model-path /path/to/model` | Prefer CLI override over editing checked-in configs. | | Dataset limit | `--dataset-limit 10` | Use for smoke tests. Omit for full evaluation. | | Token cap | `--max-tokens 2048` | Raise only if outputs are truncated. | | NFE cap | `--max-nfe 1024` | Use when comparing diffusion step budgets. | | Output dir | `--output-dir benchmark_results/my_run` | Keeps runs grouped by experiment. | | Progress bars | `eval.use_tqdm: false` in YAML | Avoids terminal stalls during long runs. | For exact CLI options, run: ```bash python -m diffulex_bench.main --help ```