1. Diffulex Is Where MBD-LMs Become Runnable
Train and define the method in mbd-lms; reproduce, serve, profile, and extend MultiBD systems through Diffulex.
2. Diffulex Is Built for Research-Grade dLLM Inference
Diffulex is a flexible, extensible inference engine for block-style diffusion language models. It unifies a wide range of decoding paradigms under a single runtime: MultiBD (BufSz=1 reduces to SingleBD; BufSz=4 enables full multi-block concurrency), Token Merge, Edit Sampling, D2F MultiBD, and native DiffusionGemma inference. Each strategy composes with model-specific samplers, KV cache managers, and schedulers, and is selected via decoding_strategy (Edit Sampling is selected via sampling_mode instead).
Core Inference Strategies
| Strategy | decoding_strategy | sampling_mode | Description |
|---|---|---|---|
| Multi-Block Diffusion | multi_bd | — | BufSz=1 reduces to SingleBD; BufSz≥2 enables concurrent decoding of a bounded running-set. |
| Token Merge + Edit | dmax | — | Token merge with top-k descriptors plus iterative edit refinement (M2T + T2T). |
| Edit Sampling | — | edit | Iterative refinement via edit-based decoding, re-denoising selected spans while keeping the rest fixed. |
| D2F | d2f | — | Discrete diffusion forcing for the Dream, DiffuCoder, and LLaDA model families. |
| Fast-dLLM Dual Cache | fast_dllm_v2 | — | Dual-cache inference for Fast-dLLM-v2, overlapping KV-cache updates with block decoding. |
| DiffusionGemma | diffusion_gemma | — | Native uniform DLM inference with full-sequence denoising for the DiffusionGemma model family. |
Supported Models & Strategies
Diffulex ships with first-class support for the following model families and inference strategies. Strategies are selected via decoding_strategy and compose with model-specific samplers, KV cache managers, and schedulers.
| Model family | model_name | Decoding Strategy | Status |
|---|---|---|---|
| Dream / D2F-Dream | dream | d2f | Supported |
| DiffuCoder / D2F-DiffuCoder | diffucoder | d2f | Supported |
| Dream reasoner | dream_reasoner | multi_bd | Supported |
| Stable-DiffCoder | stable_diffcoder | multi_bd | Supported |
| LLaDA / D2F-LLaDA | llada | d2f | Supported |
| Fast-dLLM-v2 | fast_dllm_v2 | multi_bd or fast_dllm_v2 | Supported |
| SDAR | sdar | multi_bd | Supported |
| SDAR-MoE | sdar_moe | multi_bd | Supported |
| LLaDA2 family | llada2 / llada2_mini / llada2_moe / llada2dot1_mini | multi_bd or dmax | Supported |
| DiffusionGemma | diffusion_gemma | diffusion_gemma | Supported |
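The support table above can be read as a compatibility map from model_name to the decoding_strategy values it accepts. The sketch below encodes it as a lookup. Only the model_name and decoding_strategy strings come from the table; the `SUPPORTED` dict and the `check_strategy` helper are hypothetical names used for illustration and are not part of the Diffulex API.

```python
# Hypothetical helper: the support table encoded as a lookup.
# Only the string values are taken from the table; the names
# SUPPORTED and check_strategy are illustrative, not Diffulex API.
LLADA2_STRATEGIES = {"multi_bd", "dmax"}

SUPPORTED = {
    "dream": {"d2f"},
    "diffucoder": {"d2f"},
    "dream_reasoner": {"multi_bd"},
    "stable_diffcoder": {"multi_bd"},
    "llada": {"d2f"},
    "fast_dllm_v2": {"multi_bd", "fast_dllm_v2"},
    "sdar": {"multi_bd"},
    "sdar_moe": {"multi_bd"},
    "llada2": LLADA2_STRATEGIES,
    "llada2_mini": LLADA2_STRATEGIES,
    "llada2_moe": LLADA2_STRATEGIES,
    "llada2dot1_mini": LLADA2_STRATEGIES,
    "diffusion_gemma": {"diffusion_gemma"},
}


def check_strategy(model_name: str, decoding_strategy: str) -> bool:
    """Return True if the table lists this strategy for this model family."""
    if model_name not in SUPPORTED:
        raise KeyError(f"unknown model_name: {model_name!r}")
    return decoding_strategy in SUPPORTED[model_name]
```

A check like this could run before engine start-up so that an unsupported pairing fails early, for example `check_strategy("dream", "multi_bd")` returns `False`.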
Extension-friendly engine
The engine separates algorithm semantics from systems concerns such as prefix caching, paged attention, CUDA Graph-friendly execution, batching, benchmarking, and HTTP serving.
Agent-assisted research
With the existing strategy implementations as references, researchers can efficiently use coding agents such as Claude Code or Codex to add new algorithms and quickly turn them into runnable, measurable systems.
For Reproduction
Use the Diffulex mbd-lms branch to reproduce the reported MBD-LMs experiments. This branch keeps configs and runtime assumptions aligned with the paper setup.
For New Systems Work
Use Diffulex main for engine development, open-source contributions, model support, kernel optimization, and new decoding algorithms that need to become real runnable systems.
Diffulex Runs at the Same Throughput Class as Mainstream dLLM Engines
We ran the full GSM8K test split with 1,319 samples on a single NVIDIA A100-SXM4-80GB. The strict single-sample, single-active-request runs below focus on aggregate TPS: total tokens divided by total time. This is the more convincing throughput number because it is token/time weighted, instead of an average of per-request TPS values.
| Run | Agg e2e TPS | Agg decode TPS |
|---|---|---|
| LLaDA2-mini / Diffulex | 181.12 | 193.66 |
| LLaDA2-mini / SGLang | 177.48 | 194.78 |
| DiffusionGemma / Diffulex | 468.77 | 797.48 |
| DiffusionGemma / vLLM | 611.66 | 658.79 |
We include LLaDA2-mini and DiffusionGemma because they are the dLLM families most directly supported by SGLang and vLLM respectively. Under configurations aligned as closely as possible, Diffulex lands in the same performance range as these mainstream engines.
On aggregate e2e TPS, Diffulex reaches 181.12 on LLaDA2-mini, about 2% above SGLang's 177.48, while SGLang is marginally ahead on aggregate decode TPS (194.78 vs. 193.66). On DiffusionGemma, Diffulex reaches 797.48 aggregate decode TPS, ahead of vLLM's 658.79, while vLLM leads on aggregate e2e TPS (611.66 vs. 468.77).
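To make the distinction between aggregate TPS and an average of per-request TPS concrete, here is a small sketch. The request data is made up for illustration and is unrelated to the benchmark numbers; the function names are also illustrative. It assumes requests run one at a time, so total time is the sum of per-request times.

```python
def aggregate_tps(requests):
    """Total tokens divided by total time (token/time weighted)."""
    total_tokens = sum(tokens for tokens, _ in requests)
    total_seconds = sum(seconds for _, seconds in requests)
    return total_tokens / total_seconds


def mean_per_request_tps(requests):
    """Unweighted average of each request's own tokens/second."""
    return sum(tokens / seconds for tokens, seconds in requests) / len(requests)


# Made-up example: one short, fast request and one long, slow request,
# given as (generated_tokens, seconds).
requests = [(100, 0.5), (900, 9.5)]

agg = aggregate_tps(requests)          # 1000 tokens / 10 s
mean = mean_per_request_tps(requests)  # average of 200 and ~94.7
```

In this example the short request inflates the per-request average well above what the system sustains over the whole run, which is why the aggregate figure is the one reported.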
3. Single Core Backend, Multiple Main Strategies
The MultiBD block buffer is a single core backend that naturally supports SingleBD, MultiBD, and DualCache-style inference — all through the same fixed-shape, CUDA Graph-friendly pipeline. In Diffulex, the most complex part of adding a new strategy is modifying the request state machine. The three core strategies above all involve non-trivial state machine work, yet each fits cleanly within the existing framework.
[Diagram label: denoising]
BufSz=1 → SingleBD
Buffer encloses one block. Sequential decoding with static input shape — CUDA Graph replay works out of the box. The buffer always has the same physical layout regardless of how many blocks have completed.
[Diagram labels, in extraction order: → cache B₂ · refining B₃ · refining B₄ · dummy]
BufSz>1 → MultiBD
Buffer encloses a bounded running-set of consecutive blocks. Earlier blocks complete and wait to enter KV cache while later blocks are already refining. Same static shape, same CUDA Graph path — just a larger buffer_size.
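The fixed-shape property described for BufSz=1 and BufSz>1 can be sketched as a small slot buffer. This is an illustrative model only: the `BlockBuffer` class, its method names, the three slot states, and the block length of 32 are assumptions made for the example, not Diffulex's actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical slot states for illustration.
DUMMY, REFINING, DONE = "dummy", "refining", "done"


@dataclass
class BlockBuffer:
    buffer_size: int  # BufSz: 1 gives SingleBD, >1 gives MultiBD
    block_len: int    # tokens per block (illustrative value below)
    slots: list = field(default_factory=list)

    def __post_init__(self):
        # Every slot starts as a dummy so the layout is full from step one.
        self.slots = [DUMMY] * self.buffer_size

    def input_shape(self):
        # Depends only on static config, never on how many slots are active.
        return (self.buffer_size * self.block_len,)

    def activate(self):
        """Turn the first dummy slot into a refining block, if one exists."""
        for i, state in enumerate(self.slots):
            if state == DUMMY:
                self.slots[i] = REFINING
                return i
        return None

    def finish(self, index):
        """Mark a block as complete; it now waits to enter the KV cache."""
        self.slots[index] = DONE

    def evict_done_prefix(self):
        """Hand leading completed blocks to the cache and refill with dummies."""
        evicted = 0
        while self.slots[0] == DONE:
            self.slots.pop(0)
            self.slots.append(DUMMY)
            evicted += 1
        return evicted
```

Because `input_shape()` is a function of `buffer_size` and `block_len` alone, the same captured graph could in principle be replayed at every step; switching from SingleBD to MultiBD in this model only changes `buffer_size`.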
[Diagram labels, in extraction order: cached SubB₁ · cached SubB₂ · cached SubB₃ · active → Next FDv2 block SubB₄ · dummy SubB₅ · dummy SubB₆ · dummy SubB₇ · dummy]
DualCache via Buffer Mapping
The original Fast-dLLM-v2 algorithm splits each 32-token block into four 8-token sub-blocks. Diffulex maps the FDv2 block to a Block Buffer and each sub-block to a block inside it. Already-refined SubBs are KV-cached within the buffer; only the active SubB is recomputed. When the buffer is done, it slides to the next FDv2 block. Three birds, one stone.
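The buffer mapping above is index arithmetic: a 32-token FDv2 block holds four 8-token sub-blocks. The sketch below shows that arithmetic. The two sizes come from the text; the function names are hypothetical, and token indices are assumed to be counted from the start of generation.

```python
FDV2_BLOCK = 32                            # tokens per Fast-dLLM-v2 block (from the text)
SUB_BLOCK = 8                              # tokens per sub-block (from the text)
SUBS_PER_BLOCK = FDV2_BLOCK // SUB_BLOCK   # four buffer slots per FDv2 block


def locate(token_idx):
    """Map a generated-token index to (FDv2 block, sub-block slot, offset)."""
    block, within_block = divmod(token_idx, FDV2_BLOCK)
    slot, offset = divmod(within_block, SUB_BLOCK)
    return block, slot, offset


def plan_step(active_slot):
    """Earlier sub-blocks are served from cache; only the active one is recomputed."""
    return {
        "cached": list(range(active_slot)),
        "recompute": [active_slot],
        "dummy": list(range(active_slot + 1, SUBS_PER_BLOCK)),
    }
```

For instance, `locate(45)` falls in the second FDv2 block (index 1), second sub-block slot (index 1), at offset 5, and `plan_step(3)` marks slots 0 to 2 as cached with slot 3 as the only one recomputed.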
How Strategies Map to the Engine
The hardest part of engine development is modifying the request state machine — the block lifecycle, buffer management, and step/postprocess transitions. Strategies fall into three tiers based on how deeply they touch this core.
| Tier | What changes | Strategies | Effort |
|---|---|---|---|
| State machine | Request FSM, scheduler, model runner, CUDA graphs | MultiBD / SingleBD, DualCache (Fast-dLLM-v2), DiffusionGemma | Heavy — new block lifecycle, multi-mode graphs |
| Sampler only | Sampler logic; request FSM and scheduler unchanged | Token Merge + Edit (DMax), Edit Sampling / T2T (LLaDA2.1) | Light — no state machine changes |
| Static parameters | Config flags, attention metadata; request FSM unchanged | D2F MultiBD, Dream token-shift, SDAR | Minimal — a few config fields + sampler override |
State machine tier. MultiBD defines the baseline request state machine shared by all strategies: block activation, dummy-slot management, and the decode-store overlap cycle. DualCache (Fast-dLLM-v2) extends this with a 3-mode FSM — full-buffer init, sub-block refine, and final commit — each requiring its own CUDA graph capture and attention metadata. DiffusionGemma replaces the mask-filling lifecycle entirely with a canvas-denoising loop: random-token initialization, entropy-bound stability tracking, and self-conditioning. All three remain within the MultiBD buffer framework despite their complexity.
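As a rough illustration of the 3-mode FSM mentioned for DualCache, the sketch below names the three modes from the text and wires them together. The transition conditions, the `next_mode` function, and the placeholder graph registry are assumptions made for the example; the source names only the modes, not how Diffulex moves between them.

```python
from enum import Enum


class Mode(Enum):
    FULL_BUFFER_INIT = "full_buffer_init"
    SUB_BLOCK_REFINE = "sub_block_refine"
    FINAL_COMMIT = "final_commit"


# One placeholder entry per mode, standing in for a per-mode graph capture.
# The string values are dummies, not real CUDA graph handles.
GRAPHS = {mode: f"graph[{mode.value}]" for mode in Mode}


def next_mode(mode, sub_block_done=False, is_last_sub_block=False):
    """Assumed transitions: init -> refine -> ... -> commit -> init."""
    if mode is Mode.FULL_BUFFER_INIT:
        return Mode.SUB_BLOCK_REFINE
    if mode is Mode.SUB_BLOCK_REFINE:
        if sub_block_done and is_last_sub_block:
            return Mode.FINAL_COMMIT
        # Either keep refining this sub-block or move to the next one.
        return Mode.SUB_BLOCK_REFINE
    # After the final commit the buffer slides to the next FDv2 block.
    return Mode.FULL_BUFFER_INIT
```

Keeping the mode explicit makes it straightforward to select the matching captured graph and attention metadata at each step, for example via `GRAPHS[mode]`.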
Sampler-only tier. DMax (Token Merge + Edit) and LLaDA2.1 (Edit Sampling / T2T) require no changes to the request state machine or scheduler. DMax operates entirely within the sampler: full-block argmax, top-k merge descriptors, and confidence-gated commit — all computed from logits without touching block lifecycle code. DMax's sampler inherits LLaDA2.1's mask-to-token and token-to-token edit transfers, adding merge descriptors on top. The engine pipeline treats these as opaque block_writes.
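The sampler-only pattern can be sketched in plain Python: everything is computed from logits, and the result is a list of writes that the rest of the pipeline does not need to interpret. This is not DMax's actual sampler. The `sample_block` function, the write format, the `k` and `threshold` values, and the reading of a "merge descriptor" as top-k (token, probability) pairs are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one position's logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_block(logits, k=2, threshold=0.9):
    """Return one write per position, computed from logits alone.

    logits: list of per-position lists over the vocabulary.
    """
    writes = []
    for pos, row in enumerate(logits):
        probs = softmax(row)
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        top_k = [(t, probs[t]) for t in ranked[:k]]   # assumed descriptor form
        token, confidence = top_k[0]                  # full-block argmax
        writes.append({
            "pos": pos,
            "token": token,
            "commit": confidence >= threshold,        # confidence-gated commit
            "top_k": top_k,
        })
    return writes
```

Nothing here reads or mutates block lifecycle state, which is the property that keeps this tier light: swapping the sampler changes the writes, not the request state machine.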
Static-parameter tier. D2F MultiBD requires only two static flags: multi_block_prefix_full=True and prefix caching disabled. These control the attention kernel's visibility window — the rest of the MultiBD backend runs unchanged. Dream and SDAR's token-shift sampling involves a ~30-line sampler subclass with a one-line logit-shift override. DiffusionGemma's attention changes are similarly localized to the model runner's metadata preparation.
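A sampler subclass with a one-line logit-shift override might look like the sketch below. The class and method names are hypothetical, and the shift direction (position i reads the logits produced at position i-1, with the first row kept in place) is an assumption based on the usual next-token convention; the source states only that the override is a single line.

```python
class BaseSampler:
    """Hypothetical base sampler: greedy argmax over per-position logits."""

    def prepare_logits(self, logits):
        # Default: logits at position i already describe position i.
        return logits

    def sample(self, logits):
        logits = self.prepare_logits(logits)
        return [max(range(len(row)), key=row.__getitem__) for row in logits]


class TokenShiftSampler(BaseSampler):
    """Same sampling logic; only the logit alignment differs."""

    def prepare_logits(self, logits):
        # The one-line override: shift rows right by one, keeping the first row.
        return logits[:1] + logits[:-1]
```

With `logits = [[0, 1], [1, 0], [0, 1]]`, the base sampler picks `[1, 0, 1]` while the shifted sampler picks `[1, 1, 0]`; the scheduler and request state machine would see no difference between the two.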