Quickstart

This page gives the shortest working path for the current codebase:

  • install Diffulex;

  • run a small LLaDA2-mini benchmark;

  • run one in-process Python generation;

  • start the HTTP server;

  • optionally run the vLLM DiffusionGemma baseline.

For reproducing the MBD-LMs experiments, use the Diffulex mbd-lms branch. For engine development, open-source contributions, or exploring new decoding algorithms and turning them into runnable systems, use the main branch. The current main branch contains ongoing runtime and model-specific optimizations.

Prerequisites

  • Diffulex is installed in a Python environment. See Installation.

  • At least one NVIDIA GPU is visible to PyTorch.

  • The model checkpoint exists locally.

The examples below use LLaDA2-mini:

export MODEL_PATH=/data/ckpts/inclusionAI/LLaDA2.0-mini

Replace this path with the location of the checkpoint on your machine.

1. Install

From the repository root:

uv venv --python 3.11 --seed
source .venv/bin/activate
uv pip install -e .
uv pip install vllm==0.23.0

Verify the install:

python -c "from diffulex import Diffulex, SamplingParams; print('ok')"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import vllm; print(vllm.__version__)"

2. Run a Small Benchmark

Use the maintained LLaDA2-mini GSM8K runner. Start with a small limit:

CUDA_VISIBLE_DEVICES=0 \
MODEL_PATH="$MODEL_PATH" \
DATASET_LIMIT=10 \
script/run_llada2_mini_gsm8k.sh

The script wraps:

python -m diffulex_bench.main \
  --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
  --model-path "$MODEL_PATH" \
  --dataset-limit 10

Results are written under benchmark_results/llada2_mini_gsm8k/ by default. Remove DATASET_LIMIT only after the limited run loads the model, generates answers, and writes results correctly.

3. Run Python Inference

For a direct in-process call:

from diffulex import Diffulex, SamplingParams

model_path = "/data/ckpts/inclusionAI/LLaDA2.0-mini"

llm = Diffulex(
    model=model_path,
    model_name="llada2_mini",
    decoding_strategy="multi_bd",
    sampling_mode="naive",
    mask_token_id=156895,
    tensor_parallel_size=1,
    data_parallel_size=1,
    gpu_memory_utilization=0.45,
    max_model_len=4096,
    max_num_batched_tokens=4096,
    max_num_reqs=1,
    block_size=32,
    buffer_size=1,
    page_size=32,
    attn_impl="triton_grouped",
    enable_prefix_caching=True,
    enable_full_static_runner=True,
    enable_vllm_layers=True,
)

outputs = llm.generate(
    ["Solve: Natalia sold clips to 48 friends in April, and half as many in May. How many clips did she sell in May?"],
    SamplingParams(temperature=0.0, max_tokens=256, max_nfe=1024),
)

for item in outputs.trajectories:
    print(item.text)

llm.exit()

Use attn_impl="naive" and enforce_eager=True only when debugging correctness. Use the optimized settings when measuring throughput.

4. Start the HTTP Server

The server uses the same engine configuration, exposed as CLI flags. A minimal single-GPU LLaDA2-mini command is:

CUDA_VISIBLE_DEVICES=0 python -m diffulex.server \
  --model "$MODEL_PATH" \
  --model-name llada2_mini \
  --decoding-strategy multi_bd \
  --sampling-mode naive \
  --tensor-parallel-size 1 \
  --data-parallel-size 1 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-reqs 1 \
  --block-size 32 \
  --buffer-size 1 \
  --page-size 32 \
  --gpu-memory-utilization 0.45 \
  --attn-impl triton_grouped \
  --host 127.0.0.1 \
  --port 8000

Send a request:

curl -s http://127.0.0.1:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"max_nfe":256}' \
  | python -m json.tool

For local server demo visualization:

streamlit run examples/streamlit_block_append_chat.py -- --base-url http://127.0.0.1:8000

5. Run DiffusionGemma or vLLM Baselines

Diffulex has a native DiffusionGemma benchmark config:

CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \
  --config diffulex_bench/configs/diffusion_gemma_gsm8k.yml \
  --model-path /data/ckpts/google/diffusiongemma-26B-A4B-it \
  --dataset-limit 10

The repository also keeps a vLLM DiffusionGemma baseline runner. This is for comparison, not for starting Diffulex:

CUDA_VISIBLE_DEVICES=0 \
CONFIG_PATH=examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_smoke.yml \
script/run_vllm_diffusion_gemma_gsm8k.sh

Use the *_full.yml config only after the smoke run succeeds.

Decoding Strategy Cheatsheet

Strategy

Typical models

Notes

d2f

D2F LoRA-style LLaDA, Dream, DiffuCoder paths

Full-prefix block decoding; disables prefix caching.

multi_bd

LLaDA2-mini, SDAR, Fast-dLLM-v2, stable DiffuCoder/Dream reasoner paths

Multi-Block Diffusion: block-causal multi-block decoding with prefix caching.

dmax

Supported LLaDA2 edit-sampling experiments

Requires sampling_mode="edit".

diffusion_gemma

DiffusionGemma

Native DiffusionGemma canvas/block decoder.

For more detail, read Configuration and Benchmark.