diffulex_bench
The diffulex_bench package wraps Diffulex engine configuration and
lm-evaluation-harness execution. It is the preferred path for repeatable
evaluation runs because it keeps engine arguments, dataset settings, output
paths, and logging in one configuration flow.
Public Symbols
- BenchmarkRunner
- load_benchmark_dataset
- compute_metrics
- BenchmarkConfig
- EngineConfig
- EvalConfig
- DiffulexLM
Configuration Classes
BenchmarkConfig groups two main sections:

| Config class | What it contains |
|---|---|
| EngineConfig | Model path, tokenizer path, model family, decoding strategy, parallelism, LoRA settings, cache layout, thresholds, and runtime options. |
| EvalConfig | Dataset name, split, limits, generation limits, output directory, result saving, and lm-eval include path. |

Configs can be loaded from YAML or JSON and then overridden from the command line. This is useful for keeping a reusable base config while changing model paths or dataset limits per run.
from diffulex_bench.config import BenchmarkConfig
config = BenchmarkConfig.from_yaml("diffulex_bench/configs/example.yml")
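The "reusable base config plus per-run overrides" pattern can be sketched independently of the package. The example below is only an illustration: `SketchConfig`, `load_base`, and `apply_overrides` are hypothetical stand-ins, the field names are assumptions, and none of it is taken from the actual `BenchmarkConfig` implementation.

```python
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a benchmark config; not the real BenchmarkConfig."""
    model_path: str
    dataset_name: str
    dataset_limit: Optional[int] = None


def load_base(text: str) -> SketchConfig:
    """Parse a JSON document into the stand-in config."""
    return SketchConfig(**json.loads(text))


def apply_overrides(base: SketchConfig, **overrides) -> SketchConfig:
    """Return a copy of `base` with every non-None override applied.

    A value of None means "flag not given on the command line",
    so the base value is kept.
    """
    changes = {key: value for key, value in overrides.items() if value is not None}
    return replace(base, **changes)


# Base settings that would normally live in a YAML or JSON file.
base = load_base('{"model_path": "/models/base", "dataset_name": "gsm8k_diffulex"}')

# Per-run overrides, as they might arrive from parsed command-line flags.
run = apply_overrides(
    base,
    model_path="/path/to/model",
    dataset_name=None,   # not overridden, base value survives
    dataset_limit=100,
)
print(run)
```

Because the stand-in config is frozen and `replace` returns a copy, the base object stays unchanged and can be reused across runs.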
lm-eval Integration
diffulex_bench.main converts BenchmarkConfig into lm-eval model arguments
and then invokes lm-evaluation-harness with the diffulex model adapter. The
adapter is registered by importing diffulex_bench.lm_eval_model.
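Registration by import usually relies on a decorator that runs when the module body is executed. The following is a minimal, self-contained sketch of that pattern; `MODEL_REGISTRY`, `register_model`, and `SketchAdapter` are local stand-ins defined here for illustration and do not reproduce lm-evaluation-harness or `diffulex_bench.lm_eval_model` code.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a model name to the class that implements it.
MODEL_REGISTRY: Dict[str, type] = {}


def register_model(name: str) -> Callable[[type], type]:
    """Return a class decorator that records the class under `name`."""
    def decorator(cls: type) -> type:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


# The decorator runs as soon as this class statement is executed, which is
# why importing the module that contains it is enough to register the adapter.
@register_model("diffulex")
class SketchAdapter:
    """Stand-in adapter; a real adapter would implement the harness interface."""


def get_model(name: str) -> type:
    """Look up a registered adapter class by name."""
    return MODEL_REGISTRY[name]


print(get_model("diffulex").__name__)
```

The side effect happens at import time, so a script that selects the `diffulex` model by name has to import the adapter module first, even if it never references the class directly.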
The benchmark entry point also handles:
- encoding and decoding complex engine values for lm-eval model args;
- default task include paths under diffulex_bench/tasks;
- optional task data file overrides for local JSON datasets;
- per-run output directories with timestamped names;
- sample logging when result saving is enabled.
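Two of the items above lend themselves to a short illustration. lm-eval model arguments are passed as a comma-separated `key=value` string, so nested values such as a dict of LoRA settings need some encoding that avoids commas and equals signs. The sketch below uses unpadded URL-safe base64 over JSON with a made-up `b64json:` prefix; the encoding scheme, the directory naming pattern, and every function name are assumptions for illustration, not the scheme `diffulex_bench.main` actually uses.

```python
import base64
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict

PREFIX = "b64json:"  # hypothetical marker for encoded values


def encode_value(value: Any) -> str:
    """Encode dicts and lists so they contain no ',' or '='; stringify the rest."""
    if isinstance(value, (dict, list)):
        raw = json.dumps(value, sort_keys=True).encode("utf-8")
        token = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
        return PREFIX + token
    return str(value)


def decode_value(text: str) -> Any:
    """Reverse encode_value; plain values come back as strings."""
    if text.startswith(PREFIX):
        token = text[len(PREFIX):]
        token += "=" * (-len(token) % 4)  # restore stripped base64 padding
        return json.loads(base64.urlsafe_b64decode(token).decode("utf-8"))
    return text


def build_model_args(args: Dict[str, Any]) -> str:
    """Join arguments into a 'key=value,key=value' string."""
    return ",".join(f"{key}={encode_value(value)}" for key, value in args.items())


def parse_model_args(text: str) -> Dict[str, Any]:
    """Split a model-args string and decode each value."""
    pairs = (item.split("=", 1) for item in text.split(","))
    return {key: decode_value(value) for key, value in pairs}


def run_output_dir(root: str, label: str, now: datetime) -> Path:
    """Build a per-run directory path with a timestamp suffix."""
    return Path(root) / f"{label}_{now.strftime('%Y%m%d_%H%M%S')}"


# Field names below are illustrative only.
engine_args = {
    "model_path": "/path/to/model",
    "lora": {"rank": 8, "targets": ["q", "v"]},
}
model_args = build_model_args(engine_args)
restored = parse_model_args(model_args)
out_dir = run_output_dir("results", "gsm8k_diffulex", datetime(2024, 1, 2, 3, 4, 5))
print(model_args)
print(out_dir)
```

Passing the timestamp in as a parameter, rather than calling `datetime.now()` inside the helper, keeps the directory name deterministic in tests.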
Dataset Helpers
load_benchmark_dataset and compute_metrics support the older local runner
path. New evaluation workflows should prefer the lm-eval-based CLI unless a
custom runner needs direct access to dataset records or metrics.
Typical Usage
Run from the command line for normal evaluations:
python -m diffulex_bench.main \
--config diffulex_bench/configs/example.yml \
--model-path /path/to/model \
--dataset gsm8k_diffulex \
--dataset-limit 100
Use the Python API when writing tests or custom scripts that need to construct or inspect benchmark configuration before launching a run.
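When several runs are scripted, the command line shown above can also be assembled in Python. The helper below is a small sketch: `build_command` is a hypothetical name, and it uses only the flags that appear in the example command (`--config`, `--model-path`, `--dataset`, `--dataset-limit`). It builds the argument list without launching anything.

```python
import sys
from typing import List, Optional


def build_command(
    config_path: str,
    model_path: Optional[str] = None,
    dataset: Optional[str] = None,
    dataset_limit: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for `python -m diffulex_bench.main`.

    Optional flags are added only when a value is given, so the
    settings in the config file apply otherwise.
    """
    cmd = [sys.executable, "-m", "diffulex_bench.main", "--config", config_path]
    if model_path is not None:
        cmd += ["--model-path", model_path]
    if dataset is not None:
        cmd += ["--dataset", dataset]
    if dataset_limit is not None:
        cmd += ["--dataset-limit", str(dataset_limit)]
    return cmd


cmd = build_command(
    "diffulex_bench/configs/example.yml",
    model_path="/path/to/model",
    dataset="gsm8k_diffulex",
    dataset_limit=100,
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; keeping it as a list rather than a single string avoids shell quoting issues with paths that contain spaces.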