# Benchmark Workflow Deep Dive

Use `diffulex_bench` for evaluation workloads such as GSM8K, HumanEval, MBPP, and MATH500. The benchmark path wraps Diffulex as an lm-evaluation-harness model, so task definitions and result files follow lm-eval conventions while engine settings remain Diffulex-specific.

## Workflow

1. Select a benchmark config under `diffulex_bench/configs/`.
2. Override model paths and engine settings from the command line.
3. Run `python -m diffulex_bench.main`.
4. Inspect the generated run directory and log file.
5. Repeat with one controlled change at a time.

## Config Files

Configs group engine and evaluation settings. Use them for values that should stay stable across runs, such as decoding strategy, model family, cache layout, thresholds, and default dataset.

Command line flags override matching config values. This lets you reuse a config while changing paths or dataset limits:

```bash
python -m diffulex_bench.main \
  --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
  --model-path /path/to/LLaDA2.0-mini \
  --dataset-limit 10 \
  --output-dir outputs/bench
```

Start with `--dataset-limit` while validating a new model or strategy. Remove the limit only after model loading, generation, and result writing are stable.

## Task Resolution

By default, Diffulex uses bundled lm-eval task YAML files under `diffulex_bench/tasks`. Use `--include-path` when running task definitions from another directory. Use `--dataset-data-files` to override a task YAML's `data_files` field with a local JSON file for one run.

Code-generation tasks may require `--confirm-run-unsafe-code` because lm-eval executes generated code for scoring. Keep this explicit when running untrusted models or unreviewed prompts.

## Output Layout

The benchmark runner writes results under `--output-dir`. By default, each run uses a timestamped subdirectory with the task name in the path. This avoids overwriting prior runs and keeps trajectory, sample, and metric files together.

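
Because each run lands in its own subdirectory, a small helper can save time when looking for the most recent results. The following is a minimal sketch, not part of `diffulex_bench`: the helper names `newest_json` and `load_scores` are hypothetical, the `*.json` search pattern is an assumption about how result files are named, and the top-level `results` key is assumed from lm-eval's usual output format. Adjust all three to match what your runs actually write.

```python
import json
from pathlib import Path
from typing import Optional


def newest_json(output_dir: str, pattern: str = "*.json") -> Optional[Path]:
    """Return the most recently modified JSON file under output_dir, or None.

    The search is recursive, so it does not depend on how deeply the
    timestamped run directory is nested (an assumption-free choice, since
    the exact layout is not specified here).
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = [p for p in root.rglob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_scores(result_file: Path) -> dict:
    """Load a result file and return its per-task scores.

    Assumes the aggregate scores sit under a top-level "results" key;
    returns an empty dict if that key is absent.
    """
    with open(result_file, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    return data.get("results", {})
```

Calling `newest_json("outputs/bench")` and passing the returned path to `load_scores` gives a quick look at the latest aggregate scores. The parent of the returned path is the run directory that holds the other artifacts.
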
For ad hoc testing, keep the default run subdirectory behavior. Disable it only when an external script expects a stable output directory.

## Reading Results

Inspect these artifacts first:

- the benchmark log for model args, task names, and engine startup errors;
- lm-eval result JSON for aggregate scores;
- logged samples when `--save-results` is enabled;
- Diffulex trajectory or stats output when a run saves per-request data.

If scores look wrong, check that the task name, tokenizer path, `max_tokens`, temperature, and decoding strategy match the intended experiment.

## Common Iteration Pattern

For a new configuration, use a gradual run sequence:

1. Run one or two examples with `--dataset-limit`.
2. Increase `--max-tokens` only if outputs are truncated.
3. Run a larger but still limited sample.
4. Only then run the full dataset.

This keeps model load errors, task wiring errors, and quality regressions separate from long-running evaluation cost.
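
The gradual sequence above can be scripted so that each stage runs only if the previous one succeeded. This is a rough sketch under stated assumptions: the helpers `build_command`, `staged_commands`, and `run_stages` are hypothetical and not part of `diffulex_bench`, and the stage sizes `(2, 50, None)` are illustrative placeholders, where `None` means no `--dataset-limit` flag is passed. Only flags that appear earlier in this guide are used. Step 2, raising `--max-tokens`, is left as a manual decision because it depends on inspecting the outputs.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def build_command(config: str, model_path: str, output_dir: str,
                  dataset_limit: Optional[int] = None) -> List[str]:
    """Build one benchmark invocation as an argument list."""
    cmd = [
        sys.executable, "-m", "diffulex_bench.main",
        "--config", config,
        "--model-path", model_path,
        "--output-dir", output_dir,
    ]
    # Omit the flag entirely for the final, full-dataset stage.
    if dataset_limit is not None:
        cmd += ["--dataset-limit", str(dataset_limit)]
    return cmd


def staged_commands(config: str, model_path: str, output_dir: str,
                    limits: Sequence[Optional[int]] = (2, 50, None)) -> List[List[str]]:
    """Return one command per stage, from smallest sample to full dataset."""
    return [build_command(config, model_path, output_dir, limit) for limit in limits]


def run_stages(commands: Sequence[List[str]]) -> None:
    """Run the stages in order; check=True raises on the first failing stage."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

For example, `run_stages(staged_commands("diffulex_bench/configs/llada2_mini_gsm8k.yml", "/path/to/LLaDA2.0-mini", "outputs/bench"))` would run the three stages in order. Because a failure stops the sequence, a model load or task wiring error surfaces on the two-example run rather than partway through the full dataset.
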