Quickstart
This page gives the shortest working path for the current codebase:
install Diffulex;
run a small LLaDA2-mini benchmark;
run one in-process Python generation;
start the HTTP server;
optionally run the vLLM DiffusionGemma baseline.
To reproduce the MBD-LMs experiments, use the Diffulex mbd-lms branch.
For engine development, open-source contributions, or exploring new decoding
algorithms and turning them into runnable systems, use the main branch.
The current main branch contains ongoing runtime and model-specific optimizations.
Prerequisites
Diffulex is installed in a Python environment. See Installation.
At least one NVIDIA GPU is visible to PyTorch.
The model checkpoint exists locally.
The examples below use LLaDA2-mini:
export MODEL_PATH=/data/ckpts/inclusionAI/LLaDA2.0-mini
Replace this path with the location of the checkpoint on your machine.
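Before running the steps below, it can help to confirm that the checkpoint path actually resolves. The following is a minimal sketch using only the standard library. The helper name `check_model_path` is hypothetical and not part of Diffulex, and it only checks that the directory exists and is non-empty, not that the checkpoint itself is valid:

```python
from pathlib import Path


def check_model_path(path):
    """Return a list of human-readable problems with a checkpoint path.

    Hypothetical helper (not part of Diffulex). It only checks that the
    path is set, is a directory, and contains at least one entry.
    """
    problems = []
    if not path:
        problems.append("MODEL_PATH is not set")
        return problems
    ckpt = Path(path)
    if not ckpt.is_dir():
        problems.append(f"{ckpt} is not a directory")
    elif not any(ckpt.iterdir()):
        problems.append(f"{ckpt} is empty")
    return problems
```

Calling `check_model_path(os.environ.get("MODEL_PATH"))` after `import os` returns an empty list when the path looks usable.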
1. Install
From the repository root:
uv venv --python 3.11 --seed
source .venv/bin/activate
uv pip install -e .
uv pip install vllm==0.23.0
Verify the install:
python -c "from diffulex import Diffulex, SamplingParams; print('ok')"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import vllm; print(vllm.__version__)"
2. Run a Small Benchmark
Use the maintained LLaDA2-mini GSM8K runner. Start with a small limit:
CUDA_VISIBLE_DEVICES=0 \
MODEL_PATH="$MODEL_PATH" \
DATASET_LIMIT=10 \
script/run_llada2_mini_gsm8k.sh
The script wraps:
python -m diffulex_bench.main \
    --config diffulex_bench/configs/llada2_mini_gsm8k.yml \
    --model-path "$MODEL_PATH" \
    --dataset-limit 10
Results are written under benchmark_results/llada2_mini_gsm8k/ by default.
Remove DATASET_LIMIT only after the limited run loads the model, generates
answers, and writes results correctly.
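The relationship between the two environment variables and the wrapped command can be sketched in Python. This assumes the wrapper does nothing more than forward MODEL_PATH and DATASET_LIMIT to the flags shown above; the real script may set further options. `build_bench_command` is a hypothetical name, not part of the repository:

```python
def build_bench_command(model_path, dataset_limit=None,
                        config="diffulex_bench/configs/llada2_mini_gsm8k.yml"):
    """Assemble the benchmark argv shown above (hypothetical helper)."""
    cmd = [
        "python", "-m", "diffulex_bench.main",
        "--config", config,
        "--model-path", model_path,
    ]
    # Assumption: omitting the limit runs the full dataset.
    if dataset_limit is not None:
        cmd += ["--dataset-limit", str(dataset_limit)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which makes it easy to run the limited pass first and the full pass only if it succeeds.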
3. Run Python Inference
For a direct in-process call:
from diffulex import Diffulex, SamplingParams
# Replace with the location of the checkpoint on your machine.
model_path = "/data/ckpts/inclusionAI/LLaDA2.0-mini"
# Build the engine with the optimized single-GPU settings.
llm = Diffulex(
    model=model_path,
    model_name="llada2_mini",
    decoding_strategy="multi_bd",
    sampling_mode="naive",
    mask_token_id=156895,
    tensor_parallel_size=1,
    data_parallel_size=1,
    gpu_memory_utilization=0.45,
    max_model_len=4096,
    max_num_batched_tokens=4096,
    max_num_reqs=1,
    block_size=32,
    buffer_size=1,
    page_size=32,
    attn_impl="triton_grouped",
    enable_prefix_caching=True,
    enable_full_static_runner=True,
    enable_vllm_layers=True,
)
# Generate for a single prompt with greedy sampling.
outputs = llm.generate(
    ["Solve: Natalia sold clips to 48 friends in April, and half as many in May. How many clips did she sell in May?"],
    SamplingParams(temperature=0.0, max_tokens=256, max_nfe=1024),
)
for item in outputs.trajectories:
    print(item.text)
# Shut the engine down and release the GPU.
llm.exit()
Use attn_impl="naive" and enforce_eager=True only when debugging
correctness. When measuring throughput, keep the optimized settings from the
example above, such as attn_impl="triton_grouped".
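One way to keep the debugging and throughput settings from drifting apart is to store them as small override dictionaries and merge them into a shared base. The sketch below uses only the option names that appear on this page; `engine_kwargs` is a hypothetical helper, and whether enforce_eager=True also requires changing enable_full_static_runner is not stated here, so check Configuration before relying on it:

```python
# Overrides taken from the text above; all other options stay in the base dict.
OPTIMIZED = {"attn_impl": "triton_grouped"}
DEBUG = {"attn_impl": "naive", "enforce_eager": True}


def engine_kwargs(base, debug=False):
    """Return a copy of `base` with debug or optimized overrides applied.

    Hypothetical helper; the result is meant to be passed as Diffulex(**kwargs).
    """
    merged = dict(base)  # copy so the caller's dict is not mutated
    merged.update(DEBUG if debug else OPTIMIZED)
    return merged
```

With this in place, a correctness run and a throughput run differ only in the `debug` argument.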
4. Start the HTTP Server
The server uses the same engine configuration, exposed as CLI flags. A minimal single-GPU LLaDA2-mini command is:
CUDA_VISIBLE_DEVICES=0 python -m diffulex.server \
    --model "$MODEL_PATH" \
    --model-name llada2_mini \
    --decoding-strategy multi_bd \
    --sampling-mode naive \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096 \
    --max-num-reqs 1 \
    --block-size 32 \
    --buffer-size 1 \
    --page-size 32 \
    --gpu-memory-utilization 0.45 \
    --attn-impl triton_grouped \
    --host 127.0.0.1 \
    --port 8000
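The flag names in this command match the Python keyword arguments from the previous section with underscores replaced by hyphens. A minimal sketch of that mapping follows. `to_cli_flags` is a hypothetical helper, and it deliberately skips boolean options such as enable_prefix_caching, because how those are exposed on the command line is not shown on this page:

```python
def to_cli_flags(kwargs):
    """Turn engine keyword arguments into a flat argv list (hypothetical helper)."""
    argv = []
    for name, value in kwargs.items():
        if isinstance(value, bool):
            # Assumption: boolean switches need case-by-case handling, so skip them.
            continue
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv
```

For example, `{"model_name": "llada2_mini", "block_size": 32}` becomes `["--model-name", "llada2_mini", "--block-size", "32"]`.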
Send a request:
curl -s http://127.0.0.1:8000/generate \
    -H 'Content-Type: application/json' \
    -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"max_nfe":256}' \
    | python -m json.tool
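The same request can be sent from Python with the standard library. This sketch reuses the endpoint and the JSON fields from the curl call; the function names are illustrative, and because the response schema is not documented on this page, the decoded body is returned unchanged:

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://127.0.0.1:8000",
                           temperature=0.0, max_tokens=64, max_nfe=256):
    """Build the POST request for /generate with the same fields as the curl call."""
    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "max_nfe": max_nfe,
    }
    return urllib.request.Request(
        base_url + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120, **kwargs):
    """Send the request and return the decoded JSON body as-is."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `print(json.dumps(generate("Solve: 12 + 30."), indent=2))` prints the same JSON that the curl pipeline shows.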
To visualize the running server with the local Streamlit demo:
streamlit run examples/streamlit_block_append_chat.py -- --base-url http://127.0.0.1:8000
5. Run DiffusionGemma or vLLM Baselines
Diffulex has a native DiffusionGemma benchmark config:
CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \
    --config diffulex_bench/configs/diffusion_gemma_gsm8k.yml \
    --model-path /data/ckpts/google/diffusiongemma-26B-A4B-it \
    --dataset-limit 10
The repository also keeps a vLLM DiffusionGemma baseline runner. It runs vLLM as a comparison point and does not start Diffulex:
CUDA_VISIBLE_DEVICES=0 \
CONFIG_PATH=examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_smoke.yml \
script/run_vllm_diffusion_gemma_gsm8k.sh
Use the *_full.yml config only after the smoke run succeeds.
Decoding Strategy Cheatsheet
| Strategy | Typical models | Notes |
|---|---|---|
| | D2F LoRA-style LLaDA, Dream, DiffuCoder paths | Full-prefix block decoding; disables prefix caching. |
| | LLaDA2-mini, SDAR, Fast-dLLM-v2, stable DiffuCoder/Dream reasoner paths | Multi-Block Diffusion: block-causal multi-block decoding with prefix caching. |
| | Supported LLaDA2 edit-sampling experiments | Requires |
| | DiffusionGemma | Native DiffusionGemma canvas/block decoder. |
For more detail, read Configuration and Benchmark.