Inference Engine

1. Diffulex Is Where MBD-LMs Become Runnable

Train and define the method in mbd-lms; reproduce, serve, profile, and extend MultiBD systems through Diffulex.

2. Diffulex Is Built for Research-Grade dLLM Inference

Diffulex is a flexible and extensible inference engine for block-style diffusion language models. It unifies a wide range of decoding paradigms under a single runtime: MultiBD (BufSz=1 reduces to SingleBD; BufSz>1, for example BufSz=4, enables multi-block concurrency), Token Merge, Edit Sampling, D2F MultiBD, and native DiffusionGemma inference. Each strategy composes with model-specific samplers, KV cache managers, and schedulers, and all are selectable via decoding_strategy.

Core Inference Strategies

| Strategy | decoding_strategy / sampling_mode | Description |
|---|---|---|
| Multi-Block Diffusion | multi_bd | BufSz=1 reduces to SingleBD; BufSz≥2 enables concurrent decoding of a bounded running-set. |
| Token Merge + Edit | dmax | Token merge with top-k descriptors plus iterative edit refinement (M2T + T2T). |
| Edit Sampling | edit | Iterative refinement via edit-based decoding, re-denoising selected spans while keeping the rest fixed. |
| D2F | d2f | Discrete diffusion forcing for Dream and DiffuCoder model families. |
| Fast-dLLM Dual Cache | fast_dllm_v2 | Dual-cache inference for Fast-dLLM-v2, overlapping KV-cache updates with block decoding. |
| DiffusionGemma | diffusion_gemma | Native uniform DLM inference with full-sequence denoising for the DiffusionGemma model family. |

Supported Models & Strategies

Diffulex ships with first-class support for the following model families and inference strategies. Strategies are selected via decoding_strategy and compose with model-specific samplers, KV cache managers, and schedulers.

| Model family | model_name | Decoding Strategy | Status |
|---|---|---|---|
| Dream / D2F-Dream | dream | d2f | Supported |
| DiffuCoder / D2F-DiffuCoder | diffucoder | d2f | Supported |
| Dream reasoner | dream_reasoner | multi_bd | Supported |
| Stable-DiffCoder | stable_diffcoder | multi_bd | Supported |
| LLaDA / D2F-LLaDA | llada | d2f | Supported |
| Fast-dLLM-v2 | fast_dllm_v2 | multi_bd or fast_dllm_v2 | Supported |
| SDAR | sdar | multi_bd | Supported |
| SDAR-MoE | sdar_moe | multi_bd | Supported |
| LLaDA2 family | llada2 / llada2_mini / llada2_moe / llada2dot1_mini | multi_bd or dmax | Supported |
| DiffusionGemma | diffusion_gemma | diffusion_gemma | Supported |
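
The support table can be read as a compatibility map from model_name to the decoding_strategy values it accepts. The sketch below transcribes it into a plain dictionary so a configuration can be checked before a run. The names SUPPORTED and is_supported are hypothetical and are not part of the Diffulex API; only the model_name and decoding_strategy values come from the table.

```python
# Compatibility map transcribed from the support table.
# SUPPORTED and is_supported are illustrative names, not Diffulex APIs.
SUPPORTED = {
    "dream": {"d2f"},
    "diffucoder": {"d2f"},
    "dream_reasoner": {"multi_bd"},
    "stable_diffcoder": {"multi_bd"},
    "llada": {"d2f"},
    "fast_dllm_v2": {"multi_bd", "fast_dllm_v2"},
    "sdar": {"multi_bd"},
    "sdar_moe": {"multi_bd"},
    "llada2": {"multi_bd", "dmax"},
    "llada2_mini": {"multi_bd", "dmax"},
    "llada2_moe": {"multi_bd", "dmax"},
    "llada2dot1_mini": {"multi_bd", "dmax"},
    "diffusion_gemma": {"diffusion_gemma"},
}


def is_supported(model_name: str, decoding_strategy: str) -> bool:
    """Return True if the table lists this model/strategy pair."""
    return decoding_strategy in SUPPORTED.get(model_name, set())


print(is_supported("llada2_mini", "dmax"))   # pair listed in the table
print(is_supported("dream", "multi_bd"))     # pair not listed in the table
```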

Extension-friendly engine

The engine separates algorithm semantics from systems concerns such as prefix caching, paged attention, CUDA Graph-friendly execution, batching, benchmarking, and HTTP serving.

Agent-assisted research

With the existing strategy implementations as references, researchers can efficiently use coding agents such as Claude Code or Codex to add new algorithms and quickly turn them into runnable, measurable systems.

For Reproduction

Use the Diffulex mbd-lms branch to reproduce the reported MBD-LMs experiments. This branch keeps configs and runtime assumptions aligned with the paper setup.

For New Systems Work

Use Diffulex main for engine development, open-source contributions, model support, kernel optimization, and new decoding algorithms that need to become real runnable systems.

Full GSM8K test split, single active request, 1x A100-SXM4-80GB.
| Run | Agg e2e TPS | Agg decode TPS |
|---|---|---|
| LLaDA2-mini / Diffulex | 181.12 | 193.66 |
| LLaDA2-mini / SGLang | 177.48 | 194.78 |
| DiffusionGemma / Diffulex | 468.77 | 797.48 |
| DiffusionGemma / vLLM | 611.66 | 658.79 |
Single A100 GSM8K Stats

Diffulex Runs at the Same Throughput Class as Mainstream dLLM Engines

We ran the full GSM8K test split (1,319 samples) on a single NVIDIA A100-SXM4-80GB. The runs below are strict single-sample, single-active-request runs, and we report aggregate TPS: total tokens divided by total time. We prefer this number because it weights throughput by tokens and time, whereas an average of per-request TPS values gives every request equal weight regardless of its length.
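
To make the difference between the two metrics concrete, here is a small sketch that computes both from per-request (tokens, seconds) pairs. The numbers are invented for illustration and are not measurements. The sketch also assumes requests run one at a time, so total time is the sum of per-request times.

```python
def aggregate_tps(requests):
    """Total tokens divided by total time over all requests."""
    total_tokens = sum(tokens for tokens, _ in requests)
    total_seconds = sum(seconds for _, seconds in requests)
    return total_tokens / total_seconds


def mean_per_request_tps(requests):
    """Unweighted average of each request's own tokens/second."""
    return sum(tokens / seconds for tokens, seconds in requests) / len(requests)


# Illustrative values only: one short fast request, one long slow request.
requests = [(100, 0.5), (1000, 10.0)]

agg = aggregate_tps(requests)          # 1100 tokens / 10.5 s
mean = mean_per_request_tps(requests)  # (200 + 100) / 2

print(f"aggregate TPS: {agg:.2f}")
print(f"mean per-request TPS: {mean:.2f}")
```

In this example the short request pulls the per-request average well above the aggregate figure, even though it contributes under a tenth of the tokens.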

We include LLaDA2-mini and DiffusionGemma because they are the dLLM families most directly supported by SGLang and vLLM respectively. Under configurations aligned as closely as possible, Diffulex lands in the same performance range as these mainstream engines.

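On LLaDA2-mini, Diffulex and SGLang are effectively tied: Diffulex reaches 181.12 aggregate e2e TPS against SGLang's 177.48, and 193.66 aggregate decode TPS against SGLang's 194.78. On DiffusionGemma, Diffulex reaches 797.48 aggregate decode TPS, ahead of vLLM's 658.79, while vLLM leads on aggregate e2e TPS (611.66 versus 468.77).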

3. Single Core Backend, Multiple Main Strategies

The MultiBD block buffer is a single core backend that naturally supports SingleBD, MultiBD, and DualCache-style inference — all through the same fixed-shape, CUDA Graph-friendly pipeline. In Diffulex, the most complex part of adding a new strategy is modifying the request state machine. The three core strategies above all involve non-trivial state machine work, yet each fits cleanly within the existing framework.

[Diagram: Prefix | Buffer (sz=1): B₁ (denoising)]

BufSz=1 → SingleBD

Buffer encloses one block. Sequential decoding with static input shape — CUDA Graph replay works out of the box. The buffer always has the same physical layout regardless of how many blocks have completed.

[Diagram: Prefix | Buffer (sz=4): B₁ (→ cache), B₂ (refining), B₃ (refining), B₄ (dummy)]

BufSz>1 → MultiBD

Buffer encloses a bounded running-set of consecutive blocks. Earlier blocks complete and wait to enter KV cache while later blocks are already refining. Same static shape, same CUDA Graph path — just a larger buffer_size.
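
The fixed-shape property can be illustrated with a small sketch: the buffer always exposes buffer_size slots, and slots with no real block are filled with a dummy marker so the input shape never changes. The class BlockBuffer, its methods, and the DUMMY marker are hypothetical names chosen for this example. They are not Diffulex's actual data structures, which the text does not show.

```python
from dataclasses import dataclass

DUMMY = -1  # assumed marker for a slot that holds no real block


@dataclass
class BlockBuffer:
    """Illustrative fixed-shape buffer; not the Diffulex implementation."""
    buffer_size: int  # number of block slots (BufSz)
    block_size: int   # tokens per block

    def layout(self, first_block: int, total_blocks: int) -> list:
        """Block ids held by the buffer, padded with DUMMY to buffer_size."""
        ids = range(first_block, first_block + self.buffer_size)
        return [b if b < total_blocks else DUMMY for b in ids]

    def input_shape(self) -> tuple:
        """Token-level input shape; independent of how many slots are real."""
        return (self.buffer_size * self.block_size,)


single = BlockBuffer(buffer_size=1, block_size=8)  # SingleBD
multi = BlockBuffer(buffer_size=4, block_size=8)   # MultiBD

print(single.layout(first_block=2, total_blocks=3))  # one real block
print(multi.layout(first_block=0, total_blocks=3))   # three real, one dummy
print(multi.input_shape())                           # same shape every step
```

Because the layout is always padded to buffer_size, the same captured graph can be replayed whether the buffer holds one real block or four.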

[Diagram: Prefix | Buffer = FDv2 "block" (sz=4, blksz=8): SubB₀ (cached), SubB₁ (cached), SubB₂ (cached), SubB₃ (active) | Next FDv2 block: SubB₄ to SubB₇ (dummy)]
Top: FDv2 "block" (32 tokens) → Block Buffer (4 blocks × 8 tokens). Inside buffer: FDv2 "sub-block" (8 tokens) → Diffulex block.

DualCache via Buffer Mapping

The original Fast-dLLM-v2 algorithm splits each 32-token block into four 8-token sub-blocks. Diffulex maps the FDv2 block to a Block Buffer and each sub-block to a block inside it. Already-refined SubBs are KV-cached within the buffer; only the active SubB is recomputed. When the buffer is done, it slides to the next FDv2 block. Three birds, one stone.
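
The mapping is plain index arithmetic. The sketch below uses the sizes stated in the text (32-token FDv2 blocks, 8-token sub-blocks) to locate a token offset within the generated region. The function names and the "pending" label for sub-blocks that have not yet become active are assumptions made for this example, not identifiers from Diffulex or Fast-dLLM-v2.

```python
FDV2_BLOCK_TOKENS = 32  # FDv2 block size, from the text
SUB_BLOCK_TOKENS = 8    # FDv2 sub-block size = Diffulex block size
BUFFER_SIZE = FDV2_BLOCK_TOKENS // SUB_BLOCK_TOKENS  # 4 slots per buffer


def locate(token_pos: int) -> tuple:
    """Map a token offset to (fdv2_block, buffer_slot, offset_in_slot)."""
    diffulex_block = token_pos // SUB_BLOCK_TOKENS
    fdv2_block = diffulex_block // BUFFER_SIZE
    buffer_slot = diffulex_block % BUFFER_SIZE
    offset = token_pos % SUB_BLOCK_TOKENS
    return fdv2_block, buffer_slot, offset


def slot_states(active_slot: int) -> list:
    """Slots before the active one are cached; later ones are 'pending' (assumed label)."""
    states = []
    for slot in range(BUFFER_SIZE):
        if slot < active_slot:
            states.append("cached")
        elif slot == active_slot:
            states.append("active")
        else:
            states.append("pending")
    return states


print(locate(45))      # a token in the second FDv2 block
print(slot_states(3))  # the state described in the text: three cached, one active
```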

How Strategies Map to the Engine

The hardest part of engine development is modifying the request state machine — the block lifecycle, buffer management, and step/postprocess transitions. Strategies fall into three tiers based on how deeply they touch this core.

| Tier | What changes | Strategies | Effort |
|---|---|---|---|
| State machine | Request FSM, scheduler, model runner, CUDA graphs | MultiBD / SingleBD, DualCache (Fast-dLLM-v2), DiffusionGemma | Heavy: new block lifecycle, multi-mode graphs |
| Sampler only | Sampler logic; request FSM and scheduler unchanged | Token Merge + Edit (DMax), Edit Sampling / T2T (LLaDA2.1) | Light: no state machine changes |
| Static parameters | Config flags, attention metadata; request FSM unchanged | D2F MultiBD, Dream token-shift, SDAR | Minimal: a few config fields + sampler override |

State machine tier. MultiBD defines the baseline request state machine shared by all strategies: block activation, dummy-slot management, and the decode-store overlap cycle. DualCache (Fast-dLLM-v2) extends this with a 3-mode FSM — full-buffer init, sub-block refine, and final commit — each requiring its own CUDA graph capture and attention metadata. DiffusionGemma replaces the mask-filling lifecycle entirely with a canvas-denoising loop: random-token initialization, entropy-bound stability tracking, and self-conditioning. All three remain within the MultiBD buffer framework despite their complexity.

Sampler-only tier. DMax (Token Merge + Edit) and LLaDA2.1 (Edit Sampling / T2T) require no changes to the request state machine or scheduler. DMax operates entirely within the sampler: full-block argmax, top-k merge descriptors, and confidence-gated commit — all computed from logits without touching block lifecycle code. DMax's sampler inherits LLaDA2.1's mask-to-token and token-to-token edit transfers, adding merge descriptors on top. The engine pipeline treats these as opaque block_writes.
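
To show why a sampler-only strategy needs no state machine changes, here is a deliberately simplified sketch of confidence-gated commit: the sampler takes a block's logits, takes the argmax at each still-masked position, and emits writes only where the top probability clears a threshold. This is not the DMax sampler. It omits the top-k merge descriptors and the token-to-token edits, and the function names, the (position, token) write format, and the threshold value are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one position's logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_gated_writes(block_logits, is_masked, threshold):
    """Return (position, token) writes for masked positions above threshold."""
    writes = []
    for pos, (row, masked) in enumerate(zip(block_logits, is_masked)):
        if not masked:
            continue  # already committed; this simplified sampler leaves it alone
        probs = softmax(row)
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold:
            writes.append((pos, token))
    return writes


# Toy block of three positions over a three-token vocabulary.
block_logits = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.0, 9.0, 0.0]]
is_masked = [True, True, False]

writes = confidence_gated_writes(block_logits, is_masked, threshold=0.9)
print(writes)  # only the confident masked position is committed
```

Everything here is a pure function of logits and the current mask, which is why the rest of the pipeline can treat the result as an opaque list of writes.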

Static-parameter tier. D2F MultiBD requires only two static flags: multi_block_prefix_full=True and prefix caching disabled. These control the attention kernel's visibility window — the rest of the MultiBD backend runs unchanged. Dream and SDAR's token-shift sampling involves a ~30-line sampler subclass with a one-line logit-shift override. DiffusionGemma's attention changes are similarly localized to the model runner's metadata preparation.
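
As an illustration of how small a token-shift override can be, the sketch below subclasses a toy sampler and changes only the logit preparation step. It assumes the shift is the usual right-shift used with next-token-style heads, where the logits produced at position i-1 are used to predict the token at position i and the first position keeps its own logits. The class and method names are hypothetical, and the real Diffulex sampler interface may differ.

```python
class BaseSampler:
    """Toy sampler: greedy argmax per position. Not the Diffulex base class."""

    def prepare_logits(self, logits):
        return logits

    def sample(self, logits):
        logits = self.prepare_logits(logits)
        return [max(range(len(row)), key=row.__getitem__) for row in logits]


class TokenShiftSampler(BaseSampler):
    """Same sampler, with logits shifted right by one position."""

    def prepare_logits(self, logits):
        # Position i reads the logits produced at position i-1;
        # position 0 has no predecessor and keeps its own row.
        return logits[:1] + logits[:-1]


# Three positions over a two-token vocabulary.
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]

print(BaseSampler().sample(logits))        # argmax of each row as given
print(TokenShiftSampler().sample(logits))  # argmax after the right-shift
```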