Benchmark Workflow Deep Dive¶
Use diffulex_bench for evaluation workloads such as GSM8K, HumanEval, MBPP,
and MATH500. The benchmark path wraps Diffulex as an lm-evaluation-harness
model, so task definitions and result files follow lm-eval conventions while
engine settings remain Diffulex-specific.
Workflow¶
Select a benchmark config under
diffulex_bench/configs/.Override model paths and engine settings from the command line.
Run
python -m diffulex_bench.main.Inspect the generated run directory and log file.
Repeat with one controlled change at a time.
Config Files¶
Configs group engine and evaluation settings. Use them for values that should stay stable across runs, such as decoding strategy, model family, cache layout, thresholds, and default dataset.
Command line flags override matching config values. This lets you reuse a config while changing paths or dataset limits:
python -m diffulex_bench.main \
--config diffulex_bench/configs/llada2_mini_gsm8k.yml \
--model-path /path/to/LLaDA2.0-mini \
--dataset-limit 10 \
--output-dir outputs/bench
Start with --dataset-limit while validating a new model or strategy. Remove
the limit only after model loading, generation, and result writing are stable.
Task Resolution¶
By default, Diffulex uses bundled lm-eval task YAML files under
diffulex_bench/tasks. Use --include-path when running task definitions from
another directory. Use --dataset-data-files to override a task YAML’s
data_files field with a local JSON file for one run.
Code-generation tasks may require --confirm-run-unsafe-code because lm-eval
executes generated code for scoring. Keep this explicit when running untrusted
models or unreviewed prompts.
Output Layout¶
The benchmark runner writes results under --output-dir. By default, each run
uses a timestamped subdirectory with the task name in the path. This avoids
overwriting prior runs and keeps trajectory, sample, and metric files together.
For ad hoc testing, keep the default run subdirectory behavior. Disable it only when an external script expects a stable output directory.
Reading Results¶
Inspect these artifacts first:
the benchmark log for model args, task names, and engine startup errors;
lm-eval result JSON for aggregate scores;
logged samples when
--save-resultsis enabled;Diffulex trajectory or stats output when a run saves per-request data.
If scores look wrong, check that the task name, tokenizer path, max_tokens,
temperature, and decoding strategy match the intended experiment.
Common Iteration Pattern¶
For a new configuration, use a gradual run sequence:
Run one or two examples with
--dataset-limit.Increase
--max-tokensonly if outputs are truncated.Run a larger but still limited sample.
Only then run the full dataset.
This keeps model load errors, task wiring errors, and quality regressions separate from long-running evaluation cost.