diffulex_bench¶

The diffulex_bench package wraps Diffulex engine configuration and lm-evaluation-harness execution. It is the preferred path for repeatable evaluation runs because it keeps engine arguments, dataset settings, output paths, and logging in one configuration flow.

Public Symbols¶

BenchmarkRunner
load_benchmark_dataset
compute_metrics
BenchmarkConfig
EngineConfig
EvalConfig
DiffulexLM

Configuration Classes¶

BenchmarkConfig groups two main sections:

Config class	What it contains
`EngineConfig`	Model path, tokenizer path, model family, decoding strategy, parallelism, LoRA settings, cache layout, thresholds, and runtime options.
`EvalConfig`	Dataset name, split, limits, generation limits, output directory, result saving, and lm-eval include path.

Configs can be loaded from YAML or JSON and then overridden from the command line. This is useful for keeping a reusable base config while changing model paths or dataset limits per run.

from diffulex_bench.config import BenchmarkConfig

config = BenchmarkConfig.from_yaml("diffulex_bench/configs/example.yml")

lm-eval Integration¶

diffulex_bench.main converts BenchmarkConfig into lm-eval model arguments and then invokes lm-evaluation-harness with the diffulex model adapter. The adapter is registered by importing diffulex_bench.lm_eval_model.

The benchmark entry point also handles:

encoding and decoding complex engine values for lm-eval model args;
default task include paths under diffulex_bench/tasks;
optional task data file overrides for local JSON datasets;
per-run output directories with timestamped names;
sample logging when result saving is enabled.

Dataset Helpers¶

load_benchmark_dataset and compute_metrics support the older local runner path. New evaluation workflows should prefer the lm-eval based CLI unless a custom runner needs direct access to dataset records or metrics.

Typical Usage¶

Run from the command line for normal evaluations:

python -m diffulex_bench.main \
  --config diffulex_bench/configs/example.yml \
  --model-path /path/to/model \
  --dataset gsm8k_diffulex \
  --dataset-limit 100

Use the Python API when writing tests or custom scripts that need to construct or inspect benchmark configuration before launching a run.