# diffulex_bench The `diffulex_bench` package wraps Diffulex engine configuration and lm-evaluation-harness execution. It is the preferred path for repeatable evaluation runs because it keeps engine arguments, dataset settings, output paths, and logging in one configuration flow. ## Public Symbols - `BenchmarkRunner` - `load_benchmark_dataset` - `compute_metrics` - `BenchmarkConfig` - `EngineConfig` - `EvalConfig` - `DiffulexLM` ## Configuration Classes `BenchmarkConfig` groups two main sections: | Config class | What it contains | | --- | --- | | `EngineConfig` | Model path, tokenizer path, model family, decoding strategy, parallelism, LoRA settings, cache layout, thresholds, and runtime options. | | `EvalConfig` | Dataset name, split, limits, generation limits, output directory, result saving, and lm-eval include path. | Configs can be loaded from YAML or JSON and then overridden from the command line. This is useful for keeping a reusable base config while changing model paths or dataset limits per run. ```python from diffulex_bench.config import BenchmarkConfig config = BenchmarkConfig.from_yaml("diffulex_bench/configs/example.yml") ``` ## lm-eval Integration `diffulex_bench.main` converts `BenchmarkConfig` into lm-eval model arguments and then invokes lm-evaluation-harness with the `diffulex` model adapter. The adapter is registered by importing `diffulex_bench.lm_eval_model`. The benchmark entry point also handles: - encoding and decoding complex engine values for lm-eval model args; - default task include paths under `diffulex_bench/tasks`; - optional task data file overrides for local JSON datasets; - per-run output directories with timestamped names; - sample logging when result saving is enabled. ## Dataset Helpers `load_benchmark_dataset` and `compute_metrics` support the older local runner path. New evaluation workflows should prefer the lm-eval based CLI unless a custom runner needs direct access to dataset records or metrics. ## Typical Usage Run from the command line for normal evaluations: ```bash python -m diffulex_bench.main \ --config diffulex_bench/configs/example.yml \ --model-path /path/to/model \ --dataset gsm8k_diffulex \ --dataset-limit 100 ``` Use the Python API when writing tests or custom scripts that need to construct or inspect benchmark configuration before launching a run.