diffulex_bench

The diffulex_bench package wraps Diffulex engine configuration and lm-evaluation-harness execution. It is the preferred path for repeatable evaluation runs because it keeps engine arguments, dataset settings, output paths, and logging in one configuration flow.

Public Symbols

  • BenchmarkRunner

  • load_benchmark_dataset

  • compute_metrics

  • BenchmarkConfig

  • EngineConfig

  • EvalConfig

  • DiffulexLM

Configuration Classes

BenchmarkConfig groups two main sections:

Config class    What it contains

EngineConfig    Model path, tokenizer path, model family, decoding strategy, parallelism, LoRA settings, cache layout, thresholds, and runtime options.

EvalConfig      Dataset name, split, limits, generation limits, output directory, result saving, and lm-eval include path.

Configs can be loaded from YAML or JSON and then overridden from the command line. This is useful for keeping a reusable base config while changing model paths or dataset limits per run.

from diffulex_bench.config import BenchmarkConfig

# Load the reusable base configuration; per-run values can then be overridden.
config = BenchmarkConfig.from_yaml("diffulex_bench/configs/example.yml")
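The base-config-plus-overrides flow described above can be sketched with the standard library alone. This is a minimal illustration, not the diffulex_bench implementation: the classes EngineSettings, EvalSettings, and RunSettings, their field names, and the JSON key layout are hypothetical stand-ins. Only the three command-line flags are taken from the usage example in this document.

```python
import argparse
import json
from dataclasses import dataclass, field, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class EngineSettings:
    # Hypothetical stand-in for a single EngineConfig field.
    model_path: str = ""


@dataclass(frozen=True)
class EvalSettings:
    # Hypothetical stand-ins for two EvalConfig fields.
    dataset_name: str = ""
    dataset_limit: Optional[int] = None


@dataclass(frozen=True)
class RunSettings:
    engine: EngineSettings = field(default_factory=EngineSettings)
    evaluation: EvalSettings = field(default_factory=EvalSettings)


def from_json_text(text: str) -> RunSettings:
    """Build the base settings from a JSON document (assumed key layout)."""
    raw = json.loads(text)
    return RunSettings(
        engine=EngineSettings(**raw.get("engine", {})),
        evaluation=EvalSettings(**raw.get("evaluation", {})),
    )


def apply_cli_overrides(base: RunSettings, argv: Sequence[str]) -> RunSettings:
    """Return a copy of base in which every flag present in argv wins."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path")
    parser.add_argument("--dataset")
    parser.add_argument("--dataset-limit", type=int)
    args = parser.parse_args(argv)

    engine = base.engine
    if args.model_path is not None:
        engine = replace(engine, model_path=args.model_path)

    evaluation = base.evaluation
    if args.dataset is not None:
        evaluation = replace(evaluation, dataset_name=args.dataset)
    if args.dataset_limit is not None:
        evaluation = replace(evaluation, dataset_limit=args.dataset_limit)

    return RunSettings(engine=engine, evaluation=evaluation)


BASE_JSON = """
{
  "engine": {"model_path": "/base/model"},
  "evaluation": {"dataset_name": "gsm8k_diffulex", "dataset_limit": 10}
}
"""

base = from_json_text(BASE_JSON)
merged = apply_cli_overrides(
    base, ["--model-path", "/path/to/model", "--dataset-limit", "100"]
)
```

Flags that are absent from the command line leave the base value untouched, so a single base file can serve many runs. The frozen dataclasses keep the base configuration unchanged when overrides are applied.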

lm-eval Integration

diffulex_bench.main converts BenchmarkConfig into lm-eval model arguments and then invokes lm-evaluation-harness with the diffulex model adapter. The adapter is registered by importing diffulex_bench.lm_eval_model.
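Registration by import usually means that a decorator runs when the module is loaded and records the class under a name. The sketch below shows that general pattern in a self-contained form. MODEL_REGISTRY, register_model, get_model, and ToyAdapter are hypothetical names written for this illustration; they are not the code in diffulex_bench.lm_eval_model or in lm-evaluation-harness.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a model name to the class that implements it.
MODEL_REGISTRY: Dict[str, type] = {}


def register_model(name: str) -> Callable[[type], type]:
    """Return a class decorator that records the class under `name`."""

    def decorator(cls: type) -> type:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = cls
        return cls

    return decorator


def get_model(name: str) -> type:
    """Look up a registered class, failing loudly on unknown names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        raise KeyError(f"no model registered as {name!r}") from None


# In a real package this class would live in its own module, so the
# decorator would only run once that module has been imported.
@register_model("diffulex")
class ToyAdapter:
    def generate(self, prompt: str) -> str:
        # Placeholder behaviour for the sketch; a real adapter calls an engine.
        return prompt.upper()


adapter_cls = get_model("diffulex")
adapter = adapter_cls()
```

Under this pattern, a lookup by name fails until the defining module has been imported, which is why the import of the adapter module matters.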

The benchmark entry point also handles:

  • encoding and decoding complex engine values for lm-eval model args;

  • default task include paths under diffulex_bench/tasks;

  • optional task data file overrides for local JSON datasets;

  • per-run output directories with timestamped names;

  • sample logging when result saving is enabled.
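Two of the items above lend themselves to a short sketch: passing structured values through a comma-separated key=value argument string, and naming a per-run output directory with a timestamp. The code below is one possible approach under stated assumptions, not the scheme diffulex_bench.main actually uses. The "b64json:" prefix, the helper names, the "lora" example value, and the timestamp format are all hypothetical.

```python
import base64
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict

# Hypothetical marker for values that were JSON-encoded and base64-wrapped.
PREFIX = "b64json:"


def encode_value(value: Any) -> str:
    """Wrap dicts and lists so they contain no commas; stringify the rest."""
    if isinstance(value, (dict, list)):
        payload = json.dumps(value, sort_keys=True).encode("utf-8")
        return PREFIX + base64.urlsafe_b64encode(payload).decode("ascii")
    return str(value)


def decode_value(text: str) -> Any:
    """Reverse encode_value; unwrapped values come back as plain strings."""
    if text.startswith(PREFIX):
        payload = base64.urlsafe_b64decode(text[len(PREFIX):])
        return json.loads(payload)
    return text


def to_model_args(values: Dict[str, Any]) -> str:
    """Join encoded values into a single key=value,key=value string."""
    return ",".join(f"{key}={encode_value(val)}" for key, val in values.items())


def from_model_args(arg_string: str) -> Dict[str, Any]:
    """Split the string back into a dict, decoding wrapped values."""
    result: Dict[str, Any] = {}
    for item in arg_string.split(","):
        # Split on the first "=" only, since base64 padding also uses "=".
        key, _, raw = item.partition("=")
        result[key] = decode_value(raw)
    return result


def run_output_dir(base_dir: str, run_name: str, now: datetime) -> Path:
    """Build a per-run directory path with a sortable timestamp suffix."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(base_dir) / f"{run_name}_{stamp}"


engine_values = {
    "model_path": "/path/to/model",
    "lora": {"rank": 8, "targets": ["q", "v"]},  # hypothetical structured value
}
arg_string = to_model_args(engine_values)
decoded = from_model_args(arg_string)
out_dir = run_output_dir("results", "gsm8k_diffulex", datetime(2024, 1, 2, 3, 4, 5))
```

The URL-safe base64 alphabet contains no commas, so a wrapped value cannot be split apart by the argument parser. Plain scalars are passed through as strings in this sketch, so numeric values would need their own conversion on the receiving side.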

Dataset Helpers

load_benchmark_dataset and compute_metrics support the older local runner path. New evaluation workflows should prefer the lm-eval based CLI unless a custom runner needs direct access to dataset records or metrics.

Typical Usage

Run from the command line for normal evaluations:

python -m diffulex_bench.main \
  --config diffulex_bench/configs/example.yml \
  --model-path /path/to/model \
  --dataset gsm8k_diffulex \
  --dataset-limit 100

Use the Python API when writing tests or custom scripts that need to construct or inspect benchmark configuration before launching a run.