Diffulex Engine — Multi-Block Diffusion Language Models

Inference Engine

1. Diffulex Is Where MBD-LMs Become Runnable

Train and define the method in mbd-lms; reproduce, serve, profile, and extend MultiBD systems through Diffulex.

2. Diffulex Is Built for Research-Grade dLLM Inference

Diffulex is a flexible and extensible inference engine for block-style diffusion language models. It unifies several decoding paradigms under a single runtime: MultiBD (BufSz=1 reduces to SingleBD; a larger value such as BufSz=4 keeps a bounded set of blocks decoding concurrently), Token Merge, Edit Sampling, D2F MultiBD, and native DiffusionGemma inference. Each strategy composes with model-specific samplers, KV cache managers, and schedulers, and is selected via decoding_strategy (or sampling_mode for Edit Sampling).

Core Inference Strategies

Strategy	`decoding_strategy`	`sampling_mode`	Description
Multi-Block Diffusion	`multi_bd`	—	BufSz=1 reduces to SingleBD; BufSz≥2 enables concurrent decoding of a bounded running-set.
Token Merge + Edit	`dmax`	—	Token merge with top-k descriptors plus iterative edit refinement (M2T + T2T).
Edit Sampling	—	`edit`	Iterative refinement via edit-based decoding, re-denoising selected spans while keeping the rest fixed.
D2F	`d2f`	—	Discrete diffusion forcing for the Dream, DiffuCoder, and LLaDA model families.
Fast-dLLM Dual Cache	`fast_dllm_v2`	—	Dual-cache inference for Fast-dLLM-v2, overlapping KV-cache updates with block decoding.
DiffusionGemma	`diffusion_gemma`	—	Native uniform DLM inference with full-sequence denoising for the DiffusionGemma model family.

Supported Models & Strategies

Diffulex ships with first-class support for the following model families. The Decoding Strategy column lists the decoding_strategy values each family accepts; as above, each strategy composes with model-specific samplers, KV cache managers, and schedulers.

Model family	`model_name`	Decoding Strategy	Status
Dream / D2F-Dream	`dream`	`d2f`	Supported
DiffuCoder / D2F-DiffuCoder	`diffucoder`	`d2f`	Supported
Dream reasoner	`dream_reasoner`	`multi_bd`	Supported
Stable-DiffCoder	`stable_diffcoder`	`multi_bd`	Supported
LLaDA / D2F-LLaDA	`llada`	`d2f`	Supported
Fast-dLLM-v2	`fast_dllm_v2`	`multi_bd` or `fast_dllm_v2`	Supported
SDAR	`sdar`	`multi_bd`	Supported
SDAR-MoE	`sdar_moe`	`multi_bd`	Supported
LLaDA2 family	`llada2`, `llada2_mini`, `llada2_moe`, `llada2dot1_mini`	`multi_bd` or `dmax`	Supported
DiffusionGemma	`diffusion_gemma`	`diffusion_gemma`	Supported
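
As a quick sanity check before launching a run, the table above can be encoded as a lookup. This is a minimal sketch, not part of Diffulex: the `SUPPORTED` dictionary and the `check_pairing` helper are hypothetical names introduced here, and only the model and strategy identifiers come from the table.

```python
# Hypothetical lookup transcribed from the support table; not a Diffulex API.
SUPPORTED = {
    "dream": {"d2f"},
    "diffucoder": {"d2f"},
    "dream_reasoner": {"multi_bd"},
    "stable_diffcoder": {"multi_bd"},
    "llada": {"d2f"},
    "fast_dllm_v2": {"multi_bd", "fast_dllm_v2"},
    "sdar": {"multi_bd"},
    "sdar_moe": {"multi_bd"},
    "llada2": {"multi_bd", "dmax"},
    "llada2_mini": {"multi_bd", "dmax"},
    "llada2_moe": {"multi_bd", "dmax"},
    "llada2dot1_mini": {"multi_bd", "dmax"},
    "diffusion_gemma": {"diffusion_gemma"},
}


def check_pairing(model_name: str, decoding_strategy: str) -> bool:
    """Return True if the table lists this model/strategy pairing.

    Raises ValueError for a model name that is not in the table.
    """
    if model_name not in SUPPORTED:
        raise ValueError(f"unknown model_name: {model_name!r}")
    return decoding_strategy in SUPPORTED[model_name]


if __name__ == "__main__":
    print(check_pairing("llada2_mini", "dmax"))
    print(check_pairing("dream", "multi_bd"))
```

A check like this only mirrors the table; the engine's own validation, if any, is the authority.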

Extension-friendly engine

The engine separates algorithm semantics from systems concerns such as prefix caching, paged attention, CUDA Graph-friendly execution, batching, benchmarking, and HTTP serving.

Agent-assisted research

With the existing strategy implementations as references, researchers can efficiently use coding agents such as Claude Code or Codex to add new algorithms and quickly turn them into runnable, measurable systems.

For Reproduction

Use the Diffulex mbd-lms branch to reproduce the reported MBD-LMs experiments. This branch keeps configs and runtime assumptions aligned with the paper setup.

For New Systems Work

Use Diffulex main for engine development, open-source contributions, model support, kernel optimization, and new decoding algorithms that need to become real runnable systems.

Full GSM8K test split, single active request, 1x A100-SXM4-80GB.
Run	Agg e2e TPS	Agg decode TPS
LLaDA2-mini / Diffulex	181.12	193.66
LLaDA2-mini / SGLang	177.48	194.78
DiffusionGemma / Diffulex	468.77	797.48
DiffusionGemma / vLLM	611.66	658.79

Single A100 GSM8K Stats

Diffulex Runs at the Same Throughput Class as Mainstream dLLM Engines

We ran the full GSM8K test split (1,319 samples) on a single NVIDIA A100-SXM4-80GB. The runs below process one sample at a time with a single active request, and we report aggregate TPS: total tokens divided by total time. We prefer this number because it weights every token equally, whereas an average of per-request TPS values gives a short request the same weight as a long one.
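
The difference between the two throughput definitions can be illustrated with a small sketch. The request data below is made up for illustration and is not measured; the function names are ours. Because only one request is active at a time, total time is taken to be the sum of per-request times.

```python
def aggregate_tps(requests):
    """Total tokens divided by total time over (tokens, seconds) pairs."""
    total_tokens = sum(tokens for tokens, _ in requests)
    total_seconds = sum(seconds for _, seconds in requests)
    return total_tokens / total_seconds


def mean_per_request_tps(requests):
    """Unweighted average of each request's own tokens-per-second."""
    return sum(tokens / seconds for tokens, seconds in requests) / len(requests)


if __name__ == "__main__":
    # Illustrative only: one short fast request, one long slower request.
    requests = [(50, 0.25), (1000, 10.0)]
    print(f"aggregate TPS:        {aggregate_tps(requests):.2f}")
    print(f"mean per-request TPS: {mean_per_request_tps(requests):.2f}")
```

In this toy case the short request pulls the per-request mean well above the aggregate figure, even though it contributes under 5% of the tokens.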

We include LLaDA2-mini and DiffusionGemma because they are the dLLM families most directly supported by SGLang and vLLM respectively. Under configurations aligned as closely as possible, Diffulex lands in the same performance range as these mainstream engines.

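On aggregate e2e TPS, Diffulex reaches 181.12 on LLaDA2-mini, about 2% above SGLang's 177.48, while SGLang is marginally ahead on aggregate decode TPS (194.78 vs. 193.66). On DiffusionGemma, Diffulex reaches 797.48 aggregate decode TPS, about 21% ahead of vLLM's 658.79, while vLLM leads on aggregate e2e TPS (611.66 vs. 468.77).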
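
The relative gaps can be recomputed directly from the table. The sketch below only transcribes the four reported rows; the `RESULTS` layout and the `diffulex_ratio` helper are illustrative names, not part of any benchmark tooling.

```python
# (aggregate e2e TPS, aggregate decode TPS), copied from the results table.
RESULTS = {
    ("LLaDA2-mini", "Diffulex"): (181.12, 193.66),
    ("LLaDA2-mini", "SGLang"): (177.48, 194.78),
    ("DiffusionGemma", "Diffulex"): (468.77, 797.48),
    ("DiffusionGemma", "vLLM"): (611.66, 658.79),
}
METRICS = {"e2e": 0, "decode": 1}


def diffulex_ratio(model: str, baseline: str, metric: str) -> float:
    """Diffulex TPS divided by the baseline engine's TPS for one metric."""
    idx = METRICS[metric]
    return RESULTS[(model, "Diffulex")][idx] / RESULTS[(model, baseline)][idx]


if __name__ == "__main__":
    for model, baseline in [("LLaDA2-mini", "SGLang"), ("DiffusionGemma", "vLLM")]:
        for metric in METRICS:
            ratio = diffulex_ratio(model, baseline, metric)
            print(f"{model} vs {baseline}, {metric}: {ratio:.2f}x")
```

A ratio above 1 means Diffulex is ahead on that metric; below 1 means the baseline engine is ahead.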


3. Single Core Backend, Multiple Main Strategies

The MultiBD block buffer is a single core backend that supports SingleBD, MultiBD, and DualCache-style inference through the same fixed-shape, CUDA Graph-friendly pipeline. In Diffulex, the most complex part of adding a new strategy is modifying the request state machine. All three of these strategies involve non-trivial state machine work, yet each fits within the existing framework.

Figure: Prefix | Buffer (sz=1): B₁ denoising

BufSz=1 → SingleBD

Buffer encloses one block. Sequential decoding with static input shape — CUDA Graph replay works out of the box. The buffer always has the same physical layout regardless of how many blocks have completed.

Figure: Prefix | Buffer (sz=4): B₁ → cache, B₂ refining, B₃ refining, B₄ dummy

BufSz>1 → MultiBD

Buffer encloses a bounded running-set of consecutive blocks. Earlier blocks complete and wait to enter KV cache while later blocks are already refining. Same static shape, same CUDA Graph path — just a larger buffer_size.
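
The fixed-shape property can be sketched with a toy buffer whose slot count never changes, only the slot states. This is an illustration under our own assumptions: the `BlockBuffer` class, its method names, and the state labels are hypothetical and do not reflect Diffulex's internal data structures.

```python
from dataclasses import dataclass, field

DUMMY, REFINING, TO_CACHE = "dummy", "refining", "to_cache"


@dataclass
class BlockBuffer:
    """Toy fixed-shape buffer: always buffer_size slots of block_size tokens."""
    buffer_size: int
    block_size: int
    next_block_id: int = 0
    slots: list = field(init=False)  # list of (block_id or None, state)

    def __post_init__(self):
        self.slots = [(None, DUMMY)] * self.buffer_size

    def shape(self):
        # The physical layout is independent of how many blocks are live.
        return (self.buffer_size, self.block_size)

    def activate(self) -> bool:
        """Place the next block into the first dummy slot, if one exists."""
        for i, (_, state) in enumerate(self.slots):
            if state == DUMMY:
                self.slots[i] = (self.next_block_id, REFINING)
                self.next_block_id += 1
                return True
        return False

    def complete(self, block_id: int) -> None:
        """Mark a refining block as finished and waiting to enter KV cache."""
        for i, (bid, state) in enumerate(self.slots):
            if bid == block_id and state == REFINING:
                self.slots[i] = (bid, TO_CACHE)

    def flush_prefix(self) -> list:
        """Evict leading finished blocks in order, padding with dummies."""
        flushed = []
        while self.slots[0][1] == TO_CACHE:
            flushed.append(self.slots.pop(0)[0])
            self.slots.append((None, DUMMY))
        return flushed


if __name__ == "__main__":
    buf = BlockBuffer(buffer_size=4, block_size=8)
    for _ in range(3):
        buf.activate()
    buf.complete(0)
    print(buf.flush_prefix(), buf.slots, buf.shape())
```

With `buffer_size=1` the same class degenerates to one block at a time, which mirrors how BufSz=1 reduces to SingleBD.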

Figure: Prefix | Buffer = FDv2 "block" (sz=4, blksz=8): SubB₀ cached, SubB₁ cached, SubB₂ cached, SubB₃ active → next FDv2 block: SubB₄, SubB₅, SubB₆, SubB₇ dummy

Legend: an FDv2 "block" (32 tokens) maps to the Block Buffer (4 blocks × 8 tokens); an FDv2 "sub-block" (8 tokens) maps to one Diffulex block inside the buffer.

DualCache via Buffer Mapping

The original Fast-dLLM-v2 algorithm splits each 32-token block into four 8-token sub-blocks. Diffulex maps the FDv2 block to a Block Buffer and each sub-block to a block inside it. Already-refined SubBs are KV-cached within the buffer; only the active SubB is recomputed. When the buffer is done, it slides to the next FDv2 block. Three birds, one stone.
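
The index arithmetic behind this mapping is small enough to sketch. The block sizes come from the description above; the function names, the choice of a 0-based offset counted from the first generated token, and the "pending" label for sub-blocks not yet reached are assumptions made for illustration.

```python
FDV2_BLOCK_TOKENS = 32  # one FDv2 block, i.e. one full Block Buffer
SUB_BLOCK_TOKENS = 8    # one FDv2 sub-block, i.e. one Diffulex block
BUFFER_SIZE = FDV2_BLOCK_TOKENS // SUB_BLOCK_TOKENS  # 4 slots per buffer


def locate(token_offset: int):
    """Map a generated-token offset to (fdv2_block, buffer_slot, position)."""
    fdv2_block, within_block = divmod(token_offset, FDV2_BLOCK_TOKENS)
    buffer_slot, position = divmod(within_block, SUB_BLOCK_TOKENS)
    return fdv2_block, buffer_slot, position


def slot_states(active_slot: int):
    """Earlier sub-blocks are cached, one is active, later ones are pending."""
    return (
        ["cached"] * active_slot
        + ["active"]
        + ["pending"] * (BUFFER_SIZE - active_slot - 1)
    )


if __name__ == "__main__":
    print(locate(45))
    print(slot_states(3))
```

Once the last slot finishes, the buffer slides forward by `FDV2_BLOCK_TOKENS` and the slot index restarts at zero.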

How Strategies Map to the Engine

The hardest part of engine development is modifying the request state machine — the block lifecycle, buffer management, and step/postprocess transitions. Strategies fall into three tiers based on how deeply they touch this core.

Tier	What changes	Strategies	Effort
State machine	Request FSM, scheduler, model runner, CUDA graphs	MultiBD / SingleBD, DualCache (Fast-dLLM-v2), DiffusionGemma	Heavy — new block lifecycle, multi-mode graphs
Sampler only	Sampler logic; request FSM and scheduler unchanged	Token Merge + Edit (DMax), Edit Sampling / T2T (LLaDA2.1)	Light — no state machine changes
Static parameters	Config flags, attention metadata; request FSM unchanged	D2F MultiBD, Dream token-shift, SDAR	Minimal — a few config fields + sampler override

State machine tier. MultiBD defines the baseline request state machine shared by all strategies: block activation, dummy-slot management, and the decode-store overlap cycle. DualCache (Fast-dLLM-v2) extends this with a 3-mode FSM — full-buffer init, sub-block refine, and final commit — each requiring its own CUDA graph capture and attention metadata. DiffusionGemma replaces the mask-filling lifecycle entirely with a canvas-denoising loop: random-token initialization, entropy-bound stability tracking, and self-conditioning. All three remain within the MultiBD buffer framework despite their complexity.
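
To make the 3-mode idea concrete, here is a toy state machine over the three modes named above. Only the mode names come from the text; the transition rules, the `step` function, and its arguments are our assumptions about one plausible ordering, not the engine's actual FSM.

```python
from enum import Enum, auto


class Mode(Enum):
    FULL_BUFFER_INIT = auto()
    SUB_BLOCK_REFINE = auto()
    FINAL_COMMIT = auto()


def step(mode: Mode, slot: int, buffer_size: int, sub_block_done: bool):
    """Return the next (mode, active_slot) under the assumed transition rules."""
    if mode is Mode.FULL_BUFFER_INIT:
        # After initialising the whole buffer, start refining the first slot.
        return Mode.SUB_BLOCK_REFINE, 0
    if mode is Mode.SUB_BLOCK_REFINE:
        if not sub_block_done:
            return mode, slot          # keep refining the same sub-block
        if slot + 1 < buffer_size:
            return mode, slot + 1      # advance to the next sub-block
        return Mode.FINAL_COMMIT, slot  # last sub-block finished
    # FINAL_COMMIT: the buffer slides on and the cycle restarts.
    return Mode.FULL_BUFFER_INIT, 0


if __name__ == "__main__":
    state = (Mode.FULL_BUFFER_INIT, 0)
    for _ in range(7):
        print(state[0].name, state[1])
        state = step(state[0], state[1], buffer_size=4, sub_block_done=True)
```

Keeping the mode explicit is what would let a runner hold one captured graph per mode, for example in a dictionary keyed by `Mode`.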

Sampler-only tier. DMax (Token Merge + Edit) and LLaDA2.1 (Edit Sampling / T2T) require no changes to the request state machine or scheduler. DMax operates entirely within the sampler: full-block argmax, top-k merge descriptors, and confidence-gated commit — all computed from logits without touching block lifecycle code. DMax's sampler inherits LLaDA2.1's mask-to-token and token-to-token edit transfers, adding merge descriptors on top. The engine pipeline treats these as opaque block_writes.
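
A sampler that stays out of the state machine can be sketched as a pure function from logits to a set of writes. This is a generic confidence-gated commit, not DMax's actual algorithm: the `MASK` sentinel, the `sample_block` name, the threshold rule, and the fallback that always commits at least one token are all assumptions for illustration.

```python
import math

MASK = -1  # hypothetical mask-token id for this sketch


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_block(tokens, logits, threshold):
    """Return block_writes {position: token_id} for masked positions.

    A masked position is committed when its top-1 probability reaches the
    threshold. If none qualifies, the single most confident masked position
    is committed so that decoding always makes progress.
    """
    candidates = {}
    for pos, tok in enumerate(tokens):
        if tok != MASK:
            continue
        probs = softmax(logits[pos])
        best = max(range(len(probs)), key=probs.__getitem__)
        candidates[pos] = (probs[best], best)
    writes = {p: t for p, (conf, t) in candidates.items() if conf >= threshold}
    if not writes and candidates:
        pos = max(candidates, key=lambda p: candidates[p][0])
        writes[pos] = candidates[pos][1]
    return writes


if __name__ == "__main__":
    tokens = [5, MASK, MASK]
    logits = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
    print(sample_block(tokens, logits, threshold=0.9))
```

Because the function only returns a position-to-token mapping, the surrounding pipeline can apply it without knowing how the decision was made.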

Static-parameter tier. D2F MultiBD requires only two static flags: multi_block_prefix_full=True and prefix caching disabled. These control the attention kernel's visibility window — the rest of the MultiBD backend runs unchanged. Dream and SDAR's token-shift sampling involves a ~30-line sampler subclass with a one-line logit-shift override. DiffusionGemma's attention changes are similarly localized to the model runner's metadata preparation.
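
A one-line logit-shift override might look like the sketch below. The class and method names are hypothetical, and the shift direction is an assumption: it follows the autoregressive convention in which the logits emitted at position i-1 score the token at position i, with the first row kept as a placeholder.

```python
class BaseSampler:
    """Toy sampler: greedy argmax over per-position logits."""

    def align_logits(self, logits):
        # Default: the row at position i scores the token at position i.
        return logits

    def pick(self, logits):
        aligned = self.align_logits(logits)
        return [max(range(len(row)), key=row.__getitem__) for row in aligned]


class TokenShiftSampler(BaseSampler):
    """Same sampler, with rows shifted right by one position."""

    def align_logits(self, logits):
        # Assumed convention: row i-1 scores position i; row 0 is reused.
        return logits[:1] + logits[:-1]


if __name__ == "__main__":
    logits = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print(BaseSampler().pick(logits))
    print(TokenShiftSampler().pick(logits))
```

Everything else in the sampler is inherited unchanged, which is the sense in which such a strategy needs only a small subclass.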