MBD-LMs Paradigm — Multi-Block Diffusion Language Models

1. TL;DR

Block Diffusion Language Models (BD-LMs) make diffusion-based text generation more practical by supporting KV caching and flexible-length generation. However, native BD-LMs usually perform Single-Block Diffusion (SingleBD): each forward pass refines one noisy block conditioned on a clean cached prefix. This preserves the serving benefits of BD-LMs, but blocks are still processed sequentially.

We propose Multi-Block Diffusion Language Models (MBD-LMs), a formulation and post-training recipe for reliable Multi-Block Diffusion (MultiBD). On the model side, MBD-LMs are BD-LMs post-trained with Multi-block Teacher Forcing (MultiTF) so they can handle practical MultiBD running-set states. On the inference side, MBD-LMs decode a bounded running-set of consecutive blocks through an optimized Block Buffer runtime.

MBD-LMs definition: BD-LMs trained for MultiBD states, then executed by a MultiBD runtime.

Model focus

MultiTF post-training exposes the model to bounded noisy block groups, heterogeneous slot-wise mask ratios, and block-causal visibility patterns.

Inference focus

MultiBD keeps a bounded running-set, uses Block Buffer execution, preserves prefix KV caching, commits completed blocks, and runs through Diffulex.

The training and method code lives in SJTU-DENG-Lab/mbd-lms. The executable inference runtime is Diffulex: use the Diffulex mbd-lms branch for experiment reproduction, and Diffulex main for engine development and new decoding algorithms.

Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%. When combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02 percentage-point average accuracy drop on math and code benchmarks.

**Figure 1.** SingleBD versus MultiBD. SingleBD decodes blocks sequentially and creates KV-cache storing bubbles. MultiBD overlaps future-block refinement with KV-cache storing of completed blocks, enabling inter-block parallelism.

2. Contributions

The useful takeaway from MBD-LMs is not just "decode more blocks." The project ties together a model-side training distribution, an inference-time running-set abstraction, and a runtime path that keeps the system executable.

MBD-LMs

Reframes BD-LM generation as a bounded running-set of consecutive blocks, making inter-block parallelism explicit while preserving clean prefix KV semantics.

MultiTF post-training

Constructs inference-like noisy block groups with systematic/random layouts, heterogeneous slot-wise mask ratios, and group-aware dual-stream masking.

Block Buffer runtime

Executes MultiBD with fixed physical block slots, dummy-slot activation, prefix-cache reuse, decode-store overlap, and CUDA Graph-friendly shapes.

Train-to-engine split

Keeps method training in mbd-lms and runnable inference in Diffulex, so reproduction and new dLLM serving work have clear entry points.

Interactive Decode Trace

3. Why MultiBD Removes the Store Bubble

The traces below follow the same request under SingleBD and MultiBD. SingleBD can only refine one noisy block at a time; after the block is complete, it still spends a KV-store forward pass that produces no new output. MultiBD keeps a bounded running-set, admits the next block before the front block fully leaves the buffer, and overlaps KV storing with later-block decoding.

BD-LM / SingleBD

Sequence layout: cached prefix KV, followed by Blocks 1 to 4.

Step sequence: Decode B1, Decode B1, Finalize B1, Store B1 only, Start B2, Decode B2, Finalize B2, Store B2 only, Start B3, Decode B3.

First step (Decode B1 only): the forward pass denoises B1, the output is B1 token updates, and the KV action is reading the prefix KV.

Later blocks wait until block 1 is stored into KV.

MBD-LM / MultiBD

Sequence layout: cached prefix KV, followed by Blocks 1 to 4.

Step sequence: Decode B1, Stabilize B1, Admit B2, Decode B1+B2, Store B1 + Decode B2, Slide buffer, Decode B2+B3, Admit B4, Store B2 + Decode tail, Steady pipeline.

First step (Decode B1 in a buffer): the forward pass denoises B1, the output is B1 token updates, and the KV action is reading the prefix KV.

The next block is already inside the running-set before block 1 leaves the buffer.

Takeaway: MultiBD turns the store-only bubble into useful later-block decoding work.

4. From SingleBD to MultiBD

Diffusion Language Models (DLMs) generate text through iterative denoising and naturally support parallel token refinement. Fully bidirectional DLMs, however, are difficult to serve efficiently because they do not naturally support KV caching or dynamic-length generation. BD-LMs address this issue by generating text in block-causal form: completed blocks become a clean cached prefix, and the current block is denoised under block-causal attention.

This design gives native BD-LMs efficient intra-block parallelism, but not inter-block parallelism. In SingleBD, a later block cannot begin refinement until the current block has finished decoding and has been committed to the KV cache. The result is a storing bubble: during cache storing, no new token is generated and no decode-store overlap is exploited.

MultiBD removes this bottleneck by maintaining a small running-set of consecutive blocks. Earlier blocks in the running-set may be completed and waiting to enter the cache, while later blocks can already be active noisy blocks. This enables the model to refine future blocks while completed blocks are being committed to the KV cache.
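A toy count of forward passes makes the storing bubble concrete. The sketch below assumes every block needs the same number of denoising passes (`decode_steps`, a hypothetical parameter) and that MultiBD overlaps only the store of one block with the first decode pass of the next. It deliberately ignores the additional gain from refining several noisy blocks in a single pass, so it understates what MultiBD can achieve.

```python
def single_bd_trace(num_blocks: int, decode_steps: int) -> list[str]:
    """SingleBD: each block is decoded, then pays one store-only forward pass."""
    trace = []
    for b in range(1, num_blocks + 1):
        trace += [f"decode B{b}"] * decode_steps
        trace.append(f"store B{b} only")
    return trace


def multi_bd_trace(num_blocks: int, decode_steps: int) -> list[str]:
    """Toy MultiBD: the store of block b-1 rides along with the first decode pass of block b."""
    trace = []
    for b in range(1, num_blocks + 1):
        for step in range(decode_steps):
            if b > 1 and step == 0:
                trace.append(f"store B{b - 1} + decode B{b}")
            else:
                trace.append(f"decode B{b}")
    if num_blocks > 0:
        # Only the final block has nothing left to overlap its store with.
        trace.append(f"store B{num_blocks} only")
    return trace


single = single_bd_trace(num_blocks=4, decode_steps=3)
multi = multi_bd_trace(num_blocks=4, decode_steps=3)
print(len(single), len(multi))
```

For four blocks with three denoising passes each, the SingleBD trace has 16 forward passes and the toy MultiBD trace has 13; the three saved passes are the store-only steps that now carry decoding work.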

Why training-free MultiBD is not enough

A natural question is whether existing BD-LMs can simply run MultiBD at inference time. The paper shows that this is only partially effective. Direct MultiBD inference increases TPF, confirming that multi-block decoding relaxes the single-block bottleneck, but it can degrade accuracy because the model was not trained on practical MultiBD states.

The mismatch has two components. First, practical MultiBD does not decode an unbounded noisy suffix. It uses a bounded running-set, typically with about two active blocks and occasional expansion to three or four. Second, active slots can have heterogeneous mask-ratio patterns: adjacent slots may differ substantially in noise level. Reliable MultiBD therefore requires training states that match both the bounded running-set structure and the slot-wise noise patterns observed during inference.
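To illustrate what "heterogeneous slot-wise mask ratios" means, the following sketch measures the fraction of masked positions in each slot of a running-set and the gap between adjacent slots. The `MASK` string and the example tokens are placeholders chosen for illustration, not values from the paper or the repositories.

```python
MASK = "<mask>"  # hypothetical mask token, for illustration only


def slot_mask_ratios(running_set: list[list[str]]) -> list[float]:
    """Fraction of masked positions in each block slot."""
    return [sum(tok == MASK for tok in block) / len(block) for block in running_set]


def adjacent_gaps(ratios: list[float]) -> list[float]:
    """Difference in mask ratio between each slot and the one before it."""
    return [later - earlier for earlier, later in zip(ratios, ratios[1:])]


running_set = [
    ["The", "cat", MASK, "on"],   # front block, mostly decoded
    [MASK, MASK, "mat", MASK],    # later block, still mostly noisy
]
ratios = slot_mask_ratios(running_set)
gaps = adjacent_gaps(ratios)
print(ratios, gaps)
```

In this example the two slots sit at mask ratios 0.25 and 0.75, a gap of 0.5 between neighbours, which is the kind of state a model trained only on single noisy blocks never observes.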

**Figure 2.** Train-inference statistics for MultiBD. D2F-style schedules, chain-uniform MultiTF schedules, inference-time mask ratios, and active-block trajectories reveal the bounded and heterogeneous nature of practical MultiBD inference.

5. MBD-LMs: A Running-Set View of BD-LMs

MBD-LMs formulate BD-LM generation around a running-set of consecutive blocks. At decoding step s, the running-set contains the blocks that have not yet entered the prefix KV cache. It includes active noisy blocks and completed preceding blocks waiting to be cached. Blocks before the running-set form the clean cached prefix.
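The partition described above can be written down directly. This is a minimal sketch with hypothetical names (`BlockStatus`, `split_prefix_and_running_set`); it only encodes the definition that cached blocks form a contiguous prefix and everything after them is the running-set.

```python
from enum import Enum


class BlockStatus(Enum):
    CACHED = "cached"        # already in the prefix KV cache
    COMPLETED = "completed"  # fully decoded, waiting to be cached
    ACTIVE = "active"        # still noisy, being refined


def split_prefix_and_running_set(
    statuses: list[BlockStatus],
) -> tuple[list[int], list[int]]:
    """Return (prefix block indices, running-set block indices)."""
    first_uncached = next(
        (i for i, s in enumerate(statuses) if s is not BlockStatus.CACHED),
        len(statuses),
    )
    if any(s is BlockStatus.CACHED for s in statuses[first_uncached:]):
        raise ValueError("cached blocks must form a contiguous prefix")
    prefix = list(range(first_uncached))
    running_set = list(range(first_uncached, len(statuses)))
    return prefix, running_set


statuses = [
    BlockStatus.CACHED,
    BlockStatus.CACHED,
    BlockStatus.COMPLETED,
    BlockStatus.ACTIVE,
    BlockStatus.ACTIVE,
]
print(split_prefix_and_running_set(statuses))
```

Here blocks 0 and 1 form the clean cached prefix, and blocks 2 to 4 form the running-set: one completed block waiting for its KV store plus two active noisy blocks.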

This view unifies several regimes. Teacher-Forcing-trained BD-LMs correspond to the SingleBD extreme, where the model observes one noisy block conditioned on a clean cached prefix. D2F introduces visibility among multiple noisy blocks, but its training states still differ from practical MultiBD in running-set size and slot-wise noise patterns. Practical MultiBD is the bounded intermediate regime: the running-set should be larger than one to expose inter-block parallelism, but small enough to keep each forward pass efficient.

**Figure 3.** Train-inference alignment across paradigms. TF and D2F provide existing BD-LM training states, but neither directly matches practical MultiBD. MultiTF builds inference-like noise-groups with heterogeneous slot-wise noise patterns.

6. MultiTF: Post-Training BD-LMs for MultiBD

Multi-block Teacher Forcing (MultiTF) turns BD-LMs into MBD-LMs by constructing training states that resemble practical MultiBD inference. Instead of corrupting only one block as in standard teacher forcing, MultiTF corrupts a bounded group of consecutive blocks, called a noise-group, while conditioning later groups on clean earlier groups.

Noise-group layouts

Systematic layouts enumerate group sizes and shifts so that blocks appear at different group-relative positions. Random layouts add non-regular group-size combinations and boundary patterns.
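One possible reading of these layouts, sketched under stated assumptions: a layout partitions the block indices into consecutive noise-groups of bounded size. In the systematic variant below, a `shift` shortens the first group so that the same block lands at a different group-relative position; in the random variant, each group size is drawn independently. The function names, the `shift` convention, and the uniform draw of group sizes are illustrative assumptions, not the exact construction used in mbd-lms.

```python
import random


def systematic_layout(num_blocks: int, group_size: int, shift: int = 0) -> list[list[int]]:
    """Consecutive groups of `group_size` blocks; the first group has `shift` blocks if shift > 0."""
    if group_size < 1 or not 0 <= shift < group_size:
        raise ValueError("need group_size >= 1 and 0 <= shift < group_size")
    groups, start = [], 0
    if shift and num_blocks > 0:
        end = min(shift, num_blocks)
        groups.append(list(range(start, end)))
        start = end
    while start < num_blocks:
        end = min(start + group_size, num_blocks)
        groups.append(list(range(start, end)))
        start = end
    return groups


def random_layout(num_blocks: int, max_group_size: int, rng: random.Random) -> list[list[int]]:
    """Consecutive groups whose sizes are drawn uniformly from 1..max_group_size."""
    groups, start = [], 0
    while start < num_blocks:
        size = rng.randint(1, max_group_size)  # inclusive on both ends
        end = min(start + size, num_blocks)
        groups.append(list(range(start, end)))
        start = end
    return groups


print(systematic_layout(6, group_size=2))
print(systematic_layout(6, group_size=2, shift=1))
print(random_layout(6, max_group_size=3, rng=random.Random(0)))
```

With six blocks and group size 2, shift 0 gives groups [[0, 1], [2, 3], [4, 5]] and shift 1 gives [[0], [1, 2], [3, 4], [5]], so block 2 appears once as the first slot of a group and once as the second.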

Chain-uniform scheduling

Within each noise-group, mask ratios are sampled monotonically but randomly, producing larger and more diverse slot-wise noise gaps than a fixed D2F-style monotonic schedule.
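The text does not spell out the sampler, so the following is only one plausible interpretation of "monotonic but random": each slot's mask ratio is drawn uniformly between the previous slot's ratio and 1, which yields a non-decreasing chain with randomly sized gaps. Both the chaining rule and the direction (later slots noisier) are assumptions made for this sketch.

```python
import random


def chain_uniform_ratios(group_size: int, rng: random.Random) -> list[float]:
    """Non-decreasing mask ratios: slot k is drawn from U(ratio of slot k-1, 1)."""
    ratios = []
    lower = 0.0
    for _ in range(group_size):
        lower = rng.uniform(lower, 1.0)  # never below the previous slot's ratio
        ratios.append(lower)
    return ratios


def mask_positions(block_size: int, ratio: float, rng: random.Random) -> list[int]:
    """Pick round(ratio * block_size) distinct positions of a block to mask."""
    count = round(ratio * block_size)
    return sorted(rng.sample(range(block_size), count))


rng = random.Random(0)
ratios = chain_uniform_ratios(group_size=3, rng=rng)
masked = [mask_positions(block_size=32, ratio=r, rng=rng) for r in ratios]
print([round(r, 3) for r in ratios], [len(m) for m in masked])
```

Compared with a fixed monotonic schedule, the gaps between adjacent slots change from sample to sample, which is the property the paragraph above attributes to chain-uniform scheduling.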

Dual-stream masking

Noisy blocks inside the same noise-group can attend to each other under block-causal visibility, each noise-group can condition on its clean prefix, and clean tokens are prevented from attending to noisy tokens.
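The three visibility rules can be expressed as a block-level attention mask. This sketch assumes that "dual-stream" means a clean copy of the sequence concatenated with a noisy copy, and that clean blocks attend to each other block-causally; both points are assumptions, as is the function name. A token-level mask would expand each entry into a block-size by block-size tile.

```python
def dual_stream_block_mask(groups: list[list[int]]) -> list[list[bool]]:
    """Block-level mask; rows are queries, columns are keys.

    Indices 0..n-1 are clean blocks, n..2n-1 are the noisy copies.
    `groups` lists consecutive block indices per noise-group and must cover 0..n-1.
    """
    n = sum(len(g) for g in groups)
    group_start = {block: g[0] for g in groups for block in g}
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            # clean -> clean: block-causal
            mask[i][j] = j <= i
            # clean -> noisy: left False, clean tokens never see noisy tokens
            # noisy -> clean: only the clean prefix before the query's noise-group
            mask[n + i][j] = j < group_start[i]
            # noisy -> noisy: same noise-group, block-causal
            mask[n + i][n + j] = group_start[j] == group_start[i] and j <= i
    return mask


mask = dual_stream_block_mask([[0, 1], [2]])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With groups [[0, 1], [2]], the noisy copy of block 1 sees the noisy copies of blocks 0 and 1 and no clean block, while the noisy copy of block 2 sees clean blocks 0 and 1 plus itself.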

The resulting inputs are used for masked-token cross-entropy, and model-specific objectives such as DMax OPUT can be applied on top of the same MultiTF input construction.

**Figure 4.** Overview of MultiTF. MultiTF constructs systematic and random noise-group layouts, applies a Group-Aware Dual-Stream Mask, and post-trains BD-LMs into MBD-LMs.

7. Optimized MultiBD with Block Buffer

MultiBD is useful only if the additional parallelism can be translated into wall-clock speedup. A naive implementation directly materializes the current running-set as the physical input to each forward pass. This exposes inter-block parallelism, but the number of processed tokens changes over time and across requests, making CUDA Graph capture and replay difficult.

To make MultiBD practically executable, the paper introduces the Block Buffer mechanism. A Block Buffer contains a fixed number of physical block slots. Real resident blocks inside the buffer form the logical running-set, while trailing dummy slots reserve capacity for future blocks. A future block enters decoding by activating an existing dummy slot instead of extending the physical input sequence. When the front block is completed, it is committed to the KV cache and the buffer slides forward by appending a new dummy slot at the tail.

Each slot follows the state transition dummy → active → to-cache → in-cache. This design preserves prefix-cache reuse, keeps input shapes static, overlaps decoding with KV-cache storing, and supports CUDA Graph replay.
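The slot lifecycle can be sketched as a small state machine over a fixed-length list. The class and method names here are hypothetical and are not the Diffulex API; the point is only that admitting a block and sliding the buffer never change the number of physical slots.

```python
from enum import Enum


class SlotState(Enum):
    DUMMY = "dummy"        # reserved capacity, no real block yet
    ACTIVE = "active"      # noisy block being refined
    TO_CACHE = "to-cache"  # completed, waiting for its KV to be stored
    IN_CACHE = "in-cache"  # committed to the prefix KV cache


class BlockBuffer:
    def __init__(self, num_slots: int) -> None:
        if num_slots < 1:
            raise ValueError("need at least one slot")
        self.slots = [SlotState.DUMMY] * num_slots
        self.prefix: list[SlotState] = []  # blocks that have left the buffer

    def admit(self) -> bool:
        """Activate the first dummy slot; returns False if the buffer is full."""
        for i, state in enumerate(self.slots):
            if state is SlotState.DUMMY:
                self.slots[i] = SlotState.ACTIVE
                return True
        return False

    def complete_front(self) -> None:
        """Mark the front block as finished and waiting for its KV store."""
        if self.slots[0] is not SlotState.ACTIVE:
            raise ValueError("front slot is not active")
        self.slots[0] = SlotState.TO_CACHE

    def commit_and_slide(self) -> None:
        """Commit the front block and append a fresh dummy slot at the tail."""
        if self.slots[0] is not SlotState.TO_CACHE:
            raise ValueError("front slot is not ready to be cached")
        self.prefix.append(SlotState.IN_CACHE)
        self.slots = self.slots[1:] + [SlotState.DUMMY]


buf = BlockBuffer(num_slots=3)
buf.admit()
buf.admit()
buf.complete_front()
buf.commit_and_slide()
print([s.value for s in buf.slots], len(buf.prefix))
```

After two admissions, one completion, and one slide, the buffer still holds three slots (one active, two dummy) and one block has moved into the cached prefix, so the input shape seen by a captured graph would stay constant.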

**Figure 5.** Block Buffer inference pipeline for MultiBD. A fixed block-buffer hierarchy enables parallel block refinement while preserving prefix-cache semantics and static-shape execution.

Training Repository

8. mbd-lms Defines and Trains MBD-LMs

The SJTU-DENG-Lab/mbd-lms repository is the home for the method-side work. It is where Multi-block Teacher Forcing is implemented, where training configs live, and where checkpoints are prepared before they are evaluated through the Diffulex runtime.

MultiTF training

The repository contains the post-training path that constructs bounded noisy block groups, heterogeneous slot-wise mask ratios, and group-aware attention masks for practical MultiBD states.

Training assets

Use this repo for environment setup, dataset preparation, model-specific training configs, multi-node launch scripts, and checkpoint conversion utilities.

Method documentation

The project page, guidelines, and figures define the SingleBD-to-MultiBD transition, MultiTF, Block Buffer inference, and the reported training/evaluation setup.

Train and Prepare

Start here when working on MBD-LM training, reproducing MultiTF data construction, or converting trained checkpoints into usable model artifacts.

Run and Serve

Move to Diffulex when you need benchmark execution, HTTP serving, optimized kernels, prefix caching, and system-level MultiBD runtime behavior.

9. Main Results

The experiments evaluate mathematical reasoning on GSM8K and MATH500, and code generation on MBPP+ and HumanEval+. The paper reports Accuracy, Tokens Per Forward pass (TPF), and Accuracy Under Parallelism (AUP), where TPF measures decoding parallelism and AUP summarizes the accuracy-parallelism trade-off.

The main trend is consistent across models: MBD-LMs substantially improve TPF over native SingleBD, and MultiTF often recovers or improves the quality lost by training-free MultiBD. On LLaDA2-Mini, MultiTF raises average accuracy from 78.59% under training-free MultiBD to 81.03%, while further increasing average TPF from 4.41 to 6.19. On SDAR-8B-Chat-b32, MBD-SDAR-8B-Chat-b32 increases average TPF from 2.54 to 4.46 and improves average accuracy from 69.00% to 69.74%.

Training-free MultiBD: 3.47 -> 4.41 TPF

LLaDA2-Mini gains parallelism immediately, but accuracy drops before alignment training.

MultiTF aligned: 4.41 -> 6.19 TPF

MBD-LLaDA2-Mini recovers quality and raises average accuracy to 81.03%.

DMax compatible: 9.34 TPF

MBD-LLaDA2-Mini-DMax reaches the highest reported average parallelism.

Selected aggregate results from the reported math and code evaluations.
Model / Setting	Avg. Accuracy	Avg. TPF	Interpretation
LLaDA2-Mini SingleBD	79.95%	3.47	Native one-block baseline.
LLaDA2-Mini training-free MultiBD	78.59%	4.41	Parallelism improves, but train-inference mismatch hurts quality.
MBD-LLaDA2-Mini	81.03%	6.19	MultiTF aligns the model with practical MultiBD states.
SDAR-8B-Chat-b32 SingleBD	69.00%	2.54	Second-model baseline for transfer.
MBD-SDAR-8B-Chat-b32	69.74%	4.46	Shows the same parallelism-quality trend beyond LLaDA2.

**Table 1.** Evaluation results across math and code benchmarks. MBD-LMs consistently improve TPF over SingleBD and improve the accuracy-parallelism trade-off in most settings.
Model	GSM8K Acc	GSM8K TPF	MATH500 Acc	MATH500 TPF	MBPP+ Acc	MBPP+ TPF	HumanEval+ Acc	HumanEval+ TPF	Avg. Acc	Avg. TPF	AUP
LLaDA2-Mini-DMax (bufsz=2, blksz=32)
SingleBD (Native)	91.89	5.70	76.80	6.13	72.22	6.14	77.44	7.44	79.59	6.35	459.54
MultiBD (training-free)	89.84	8.76	73.80	9.08	72.22	8.44	76.83	10.96	78.17	9.31	651.98
MBD-LLaDA2-Mini-DMax	91.74	8.95	75.00	9.31	70.11	8.34	77.44	10.78	78.57	9.34	661.28
LLaDA2-Mini (bufsz=2, blksz=32)
SingleBD (Native)	91.89	2.27	74.20	2.83	75.66	3.25	78.05	5.53	79.95	3.47	247.41
MultiBD (training-free)	92.65	2.76	73.60	3.53	72.49	3.97	75.61	7.37	78.59	4.41	301.81
MBD-LLaDA2-Mini	91.96	5.55	79.20	6.02	72.49	5.35	80.49	7.85	81.03	6.19	449.18
SDAR-8B-Chat-b32 (bufsz=4, blksz=32)
SingleBD (Native)	90.07	2.52	65.60	3.81	52.65	1.83	67.68	2.00	69.00	2.54	141.64
MultiBD (training-free)	89.01	2.78	60.60	5.06	52.12	1.97	65.85	2.24	66.89	3.01	156.35
MBD-SDAR-8B-Chat-b32	89.16	3.08	68.00	5.08	58.99	4.87	62.80	4.82	69.74	4.46	210.42
SDAR-8B-Chat-b4 (bufsz=4, blksz=4)
SingleBD (Native)	91.05	1.33	72.80	1.46	64.80	1.13	73.70	1.07	75.59	1.25	85.46
MultiBD (training-free)	90.45	2.39	70.60	2.68	65.80	1.55	74.39	1.47	75.31	2.00	129.59
MBD-SDAR-8B-Chat-b4	91.81	2.28	72.40	2.52	64.29	2.62	72.56	2.24	75.27	2.42	148.65

Ablations further support the training-state alignment story. Combining systematic and random layouts gives the best AUP among the layout variants. Replacing the chain-uniform scheduler with other schedulers worsens the accuracy-parallelism trade-off (lower AUP in Table 2b-ii); in particular, the D2F-style monotonic scheduler causes a large accuracy drop in the reported ablation, indicating that noisy-block visibility alone is not sufficient when slot-wise noise patterns are mismatched.

**Table 2a.** Training-free MultiBD transfers to additional model variants.
Model	GSM8K Acc	GSM8K TPF	MATH500 Acc	MATH500 TPF	Avg. Acc	Avg. TPF	AUP
LLaDA2-Mini-CAP (bufsz=2, blksz=32)
SingleBD (Native)	91.74	3.08	77.80	3.71	84.77	3.40	247.30
MultiBD (training-free)	91.21	4.00	77.20	4.94	84.21	4.47	319.17
LLaDA2.1-Mini (bufsz=2, blksz=32)
SingleBD (Native)	93.03	4.12	81.40	4.87	87.22	4.50	390.64
MultiBD (training-free)	92.27	5.80	81.00	7.20	86.63	6.50	558.52

**Table 2b-i.** Noise-group layout construction ablation.
Configuration	Acc	TPF	AUP
SingleBD (Native)	84.67	6.57	536.89
+ systematic layouts	83.22	9.71	774.03
+ random layouts	82.72	9.42	747.46
systematic + random layouts (ours)	84.59	9.87	805.34

**Table 2b-ii.** Block-level noise-scheduler ablation.
Configuration	Acc	TPF	AUP
SingleBD (Native)	84.67	6.57	536.89
D2F-style monotonic scheduler	—	—	—
random scheduler	83.14	9.70	771.74
sorted-uniform scheduler	81.28	9.73	748.73
chain-uniform scheduler (ours)	84.59	9.87	805.34

10. Throughput: From TPF to TPS

Higher TPF does not automatically imply proportional wall-clock speedup because MultiBD processes a larger static Block Buffer at each forward pass. The paper therefore separates useful committed tokens from the per-step computational workload. Increasing the buffer size can improve throughput when the useful-token gain outweighs the extra per-step cost introduced by resident blocks and dummy slots.

Engine version note. Table 3 was measured with the older Diffulex release used by the public mbd-lms reproduction branch. Current Diffulex is faster, but we have not refreshed this exact H100 TP=2 result because we do not currently have access to an H100 machine with CUDA 13 under the original setting.

On the legacy H100 TP=2 setup reported in Table 3, MBD-LLaDA2-Mini increases average TPF from 3.47 to 6.19 while step latency rises from 7.07 ms to 8.78 ms. The measured average TPS increases from 517.16 to 745.92. With DMax, MBD-LLaDA2-Mini-DMax increases average TPF from 6.35 to 9.34, and average TPS rises from 779.49 to 926.67 in the same table.
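As a rough sanity check on these numbers, a first-order estimate treats throughput as proportional to TPF divided by step latency. That proportionality is an assumption of this sketch (it ignores scheduling overhead and per-benchmark variation), and the function name is illustrative; the inputs are the averages quoted from Table 3.

```python
def estimated_tps_gain(
    tpf_base: float, tpf_new: float, lat_base_ms: float, lat_new_ms: float
) -> float:
    """First-order estimate: TPS gain = TPF gain / step-latency cost."""
    return (tpf_new / tpf_base) / (lat_new_ms / lat_base_ms)


# LLaDA2-Mini -> MBD-LLaDA2-Mini
est = estimated_tps_gain(3.47, 6.19, 7.07, 8.78)
measured = 745.92 / 517.16

# LLaDA2-Mini-DMax -> MBD-LLaDA2-Mini-DMax
est_dmax = estimated_tps_gain(6.35, 9.34, 9.02, 11.20)
measured_dmax = 926.67 / 779.49

print(f"estimated {est:.3f}x vs measured {measured:.3f}x")
print(f"estimated {est_dmax:.3f}x vs measured {measured_dmax:.3f}x")
```

The estimate lands within about one percent of the measured ratio in both cases (roughly 1.44x without DMax and 1.19x with DMax), which suggests that the extra per-step cost of the larger buffer accounts for most of the difference between the TPF gain and the realized TPS gain.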

MBD-LLaDA2-Mini: 517.16 -> 745.92 TPS

MBD-LLaDA2-Mini-DMax: 779.49 -> 926.67 TPS

Hardware: 2x H100, TP=2

**Table 3.** Throughput and single-step latency comparison on two H100 GPUs with TP=2, measured with the older Diffulex `mbd-lms` branch used for reproduction. TPF Gain, Lat. Cost, and TPS Gain are all relative to the LLaDA2-Mini row. MultiBD improves realized TPS despite increasing per-step latency.
Model	Forward-step statistics				Realized throughput
Model	Avg. TPF	TPF Gain	Step Lat. (ms)	Lat. Cost	GSM8K TPS	MATH500 TPS	MBPP+ TPS	HumanEval+ TPS	Avg. TPS	TPS Gain
LLaDA2-Mini	3.47	—	7.07	1.00x	344.05	403.45	496.19	824.94	517.16	—
MBD-LLaDA2-Mini	6.19	+78.39%	8.78	1.24x	687.87	707.89	646.73	941.18	745.92	+44.24%
LLaDA2-Mini-DMax	6.35	+83.00%	9.02	1.28x	700.82	730.60	754.97	931.55	779.49	+50.73%
MBD-LLaDA2-Mini-DMax	9.34	+169.16%	11.20	1.58x	834.52	851.07	896.65	1124.43	926.67	+79.19%

11. Citation

A formal arXiv record is on the way. Until then, please cite MBD-LMs with the temporary BibTeX entry below.

Temporary BibTeX

@misc{jin2026mbdlms,
  title        = {Multi-Block Diffusion Language Models},
  author       = {Yijie Jin and Jiajun Xu and Yuxuan Liu and Chenkai Xu and Yi Tu and Jiajun Li and Dandan Tu and Xiaohui Ye and Kai Yu and Pengfei Liu and Zhijie Deng},
  year         = {2026},
  note         = {arXiv on the way}
}

References

Marianne Arriola et al. Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. ICLR, 2025.
Tiwei Bie et al. LLaDA2.0: Scaling Up Diffusion Language Models to 100B. arXiv preprint, 2025.
Xu Wang et al. Diffusion LLMs Can Do Faster-than-AR Inference via Discrete Diffusion Forcing. arXiv preprint, 2025.
Zigeng Chen et al. DMax: Aggressive Parallel Decoding for dLLMs. arXiv preprint, 2026.
Shuang Cheng et al. SDAR: A Synergistic Diffusion-Autoregression Paradigm for Scalable Sequence Generation. arXiv preprint, 2025.