TL;DR
Block Diffusion Language Models (BD-LMs) make diffusion-based text generation more practical by supporting KV caching and flexible-length generation. However, native BD-LMs usually perform Single-Block Diffusion (SingleBD): each forward pass refines one noisy block conditioned on a clean cached prefix. This preserves the serving benefits of BD-LMs, but blocks are still processed sequentially.
We propose Multi-Block Diffusion Language Models (MBD-LMs), a formulation and post-training recipe for reliable Multi-Block Diffusion (MultiBD). The key idea is to decode a bounded running-set of consecutive blocks concurrently, while training the model on states that resemble this practical inference regime. MBD-LMs are obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF) and are served with an optimized Block Buffer inference engine.
Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%. When combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02 percentage-point average accuracy drop on math and code benchmarks.

From SingleBD to MultiBD
Diffusion Language Models (DLMs) generate text through iterative denoising and naturally support parallel token refinement. Fully bidirectional DLMs, however, are difficult to serve efficiently because they do not naturally support KV caching or dynamic-length generation. BD-LMs address this issue by generating text in block-causal form: completed blocks become a clean cached prefix, and the current block is denoised under block-causal attention.
This design gives native BD-LMs efficient intra-block parallelism, but not inter-block parallelism. In SingleBD, a later block cannot begin refinement until the current block has finished decoding and has been committed to the KV cache. The result is a storing bubble: while a completed block is being written to the cache, no new tokens are generated, so decoding and cache storing never overlap.
MultiBD removes this bottleneck by maintaining a small running-set of consecutive blocks. Earlier blocks in the running-set may be completed and waiting to enter the cache, while later blocks can already be active noisy blocks. This enables the model to refine future blocks while completed blocks are being committed to the KV cache.
Why Training-Free MultiBD Is Not Enough
A natural question is whether existing BD-LMs can simply run MultiBD at inference time. The paper shows that this is only partially effective. Direct MultiBD inference increases TPF, confirming that multi-block decoding relaxes the single-block bottleneck, but it can degrade accuracy because the model was not trained on practical MultiBD states.
The mismatch has two components. First, practical MultiBD does not decode an unbounded noisy suffix. It uses a bounded running-set, typically with about two active blocks and occasional expansion to three or four. Second, active slots can have heterogeneous mask-ratio patterns: adjacent slots may differ substantially in noise level. Reliable MultiBD therefore requires training states that match both the bounded running-set structure and the slot-wise noise patterns observed during inference.

MBD-LMs: A Running-Set View of BD-LMs
MBD-LMs formulate BD-LM generation around a running-set of consecutive blocks. At decoding step $s$, the running-set contains the blocks that have not yet entered the prefix KV cache. It includes active noisy blocks and completed preceding blocks waiting to be cached. Blocks before the running-set form the clean cached prefix.
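To make the running-set view concrete, here is a minimal Python sketch of how a generation state could be split into the clean cached prefix and the running-set. The names `BlockState`, `Block`, and `split_generation_state` are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from enum import Enum


class BlockState(Enum):
    CACHED = "cached"        # already stored in the prefix KV cache
    COMPLETED = "completed"  # fully decoded, waiting to be cached
    ACTIVE = "active"        # still contains masked (noisy) tokens


@dataclass
class Block:
    index: int
    state: BlockState


def split_generation_state(blocks):
    """Split blocks into (cached prefix, running-set).

    The running-set is every block that has not yet entered the
    prefix KV cache: completed-but-uncached blocks plus active ones.
    """
    prefix = [b for b in blocks if b.state is BlockState.CACHED]
    running = [b for b in blocks if b.state is not BlockState.CACHED]
    return prefix, running


if __name__ == "__main__":
    state = [
        Block(0, BlockState.CACHED),
        Block(1, BlockState.CACHED),
        Block(2, BlockState.COMPLETED),
        Block(3, BlockState.ACTIVE),
        Block(4, BlockState.ACTIVE),
    ]
    prefix, running = split_generation_state(state)
    print([b.index for b in prefix], [b.index for b in running])
```

In this toy state, SingleBD would correspond to a running-set that never holds more than one active block, while MultiBD allows several.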
This view unifies several regimes. Teacher-Forcing-trained BD-LMs correspond to the SingleBD extreme, where the model observes one noisy block conditioned on a clean cached prefix. D2F introduces visibility among multiple noisy blocks, but its training states still differ from practical MultiBD in running-set size and slot-wise noise patterns. Practical MultiBD is the bounded intermediate regime: the running-set should be larger than one to expose inter-block parallelism, but small enough to keep each forward pass efficient.

MultiTF: Post-Training BD-LMs for MultiBD
Multi-block Teacher Forcing (MultiTF) turns BD-LMs into MBD-LMs by constructing training states that resemble practical MultiBD inference. Instead of corrupting only one block as in standard teacher forcing, MultiTF corrupts a bounded group of consecutive blocks, called a noise-group, while conditioning later groups on clean earlier groups.
MultiTF has three main ingredients:
- Noise-group layouts. Systematic layouts enumerate group sizes and shifts so that blocks appear at different group-relative positions. Random layouts add non-regular group-size combinations and boundary patterns.
- Chain-uniform noise-scheduling. Within each noise-group, mask ratios are sampled monotonically but randomly, producing larger and more diverse slot-wise noise gaps than a fixed D2F-style monotonic schedule.
- Group-Aware Dual-Stream Mask. Noisy blocks inside the same noise-group can attend to each other under block-causal visibility, each noise-group can condition on its clean prefix, and clean tokens are prevented from attending to noisy tokens.
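The first two ingredients can be sketched in a few lines of Python. This summary does not give the paper's exact sampler or layout enumeration, so everything below is an assumption: `systematic_layout` treats a shift as a shortened first group, and `chain_uniform_ratios` reads "monotonic but random" as a chain of uniform draws in which each ratio is sampled between the previous ratio and 1, so later blocks in a group are at least as noisy as earlier ones. All function names are hypothetical.

```python
import random


def systematic_layout(num_blocks, group_size, shift=0):
    """Assign a noise-group id to each block.

    Assumption: a positive `shift` shortens the first group to `shift`
    blocks, so the same block lands at different group-relative
    positions under different shifts.
    """
    group_ids = []
    gid = 0
    remaining = shift if shift > 0 else group_size
    for _ in range(num_blocks):
        if remaining == 0:
            gid += 1
            remaining = group_size
        group_ids.append(gid)
        remaining -= 1
    return group_ids


def chain_uniform_ratios(group_size, rng):
    """Sample non-decreasing mask ratios for the blocks of one group.

    Assumption: ratio k is drawn uniformly from [ratio k-1, 1].
    """
    ratios = []
    low = 0.0
    for _ in range(group_size):
        ratio = rng.uniform(low, 1.0)
        ratios.append(ratio)
        low = ratio
    return ratios


def mask_group(blocks, ratios, rng, mask_token="[MASK]"):
    """Mask each token of each block independently with that block's ratio."""
    return [
        [mask_token if rng.random() < ratio else tok for tok in block]
        for block, ratio in zip(blocks, ratios)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(systematic_layout(7, group_size=3, shift=1))
    ratios = chain_uniform_ratios(3, rng)
    group = [["a", "b", "c", "d"], ["e", "f", "g", "h"], ["i", "j", "k", "l"]]
    print(ratios)
    print(mask_group(group, ratios, rng))
```

Because each draw is conditioned on the previous one, the gaps between adjacent slots vary from sample to sample, which is the property the chain-uniform schedule is described as providing over a fixed monotonic schedule.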
The resulting inputs are used for masked-token cross-entropy, and model-specific objectives such as DMax OPUT can be applied on top of the same MultiTF input construction.
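The visibility rules of the Group-Aware Dual-Stream Mask can be illustrated at block granularity. The sketch below assumes the training input is laid out as a noisy stream followed by a clean stream over the same blocks, and that a noise-group's clean prefix consists of the clean copies of all blocks in earlier groups. That layout, the block-causal rule for the clean stream, and the name `dual_stream_block_mask` are assumptions for illustration, not the paper's implementation.

```python
def dual_stream_block_mask(group_ids):
    """Build a block-level attention mask for [noisy stream | clean stream].

    group_ids[i] is the noise-group of block i (non-decreasing).
    mask[q][k] is True when query block q may attend to key block k.
    Rows/columns 0..n-1 are noisy blocks, n..2n-1 are clean blocks.
    """
    n = len(group_ids)
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            # Noisy -> noisy: same noise-group, block-causal order.
            mask[i][j] = group_ids[i] == group_ids[j] and j <= i
            # Noisy -> clean: only the clean prefix from earlier groups.
            mask[i][n + j] = group_ids[j] < group_ids[i]
            # Clean -> clean: block-causal (assumed).
            mask[n + i][n + j] = j <= i
            # Clean -> noisy stays False: clean tokens never see noise.
    return mask


if __name__ == "__main__":
    mask = dual_stream_block_mask([0, 0, 1, 1, 1])
    for row in mask:
        print("".join("1" if v else "." for v in row))
```

A token-level mask would follow by expanding each entry to a block-size by block-size tile.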

Optimized MultiBD with Block Buffer
MultiBD is useful only if the additional parallelism can be translated into wall-clock speedup. A naive implementation directly materializes the current running-set as the physical input to each forward pass. This exposes inter-block parallelism, but the number of processed tokens changes over time and across requests, making CUDA Graph capture and replay difficult.
To make MultiBD practically executable, the paper introduces the Block Buffer mechanism. A Block Buffer contains a fixed number of physical block slots. Real resident blocks inside the buffer form the logical running-set, while trailing dummy slots reserve capacity for future blocks. A future block enters decoding by activating an existing dummy slot instead of extending the physical input sequence. When the front block is completed, it is committed to the KV cache and the buffer slides forward by appending a new dummy slot at the tail.
Each slot follows the state transition dummy -> active -> to-cache -> in-cache. This design preserves prefix-cache reuse, keeps input shapes static, overlaps decoding with KV-cache storing, and supports CUDA Graph replay.
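The slot lifecycle can be sketched as a small state machine. The class below is a toy model under stated assumptions: it tracks only slot states, not tokens or KV tensors, and the names `SlotState`, `BlockBuffer`, `activate_next`, `complete`, and `commit_front` are hypothetical rather than the engine's actual API.

```python
from collections import deque
from enum import Enum


class SlotState(Enum):
    DUMMY = "dummy"
    ACTIVE = "active"
    TO_CACHE = "to-cache"
    IN_CACHE = "in-cache"


class BlockBuffer:
    """Fixed number of physical slots; the logical running-set lives inside."""

    def __init__(self, num_slots):
        self.slots = deque([SlotState.DUMMY] * num_slots)
        self.front_block = 0   # logical index of the block in slot 0
        self.committed = {}    # logical block index -> SlotState.IN_CACHE

    def activate_next(self):
        """Admit a future block by activating the first dummy slot."""
        for i, state in enumerate(self.slots):
            if state is SlotState.DUMMY:
                self.slots[i] = SlotState.ACTIVE
                return True
        return False  # buffer is full; input shape is never extended

    def complete(self, slot):
        """Mark an active slot as fully decoded and ready to be cached."""
        if self.slots[slot] is not SlotState.ACTIVE:
            raise ValueError("only active slots can be completed")
        self.slots[slot] = SlotState.TO_CACHE

    def commit_front(self):
        """Commit the front block and slide the buffer forward by one slot."""
        if self.slots[0] is not SlotState.TO_CACHE:
            return False
        self.slots.popleft()
        self.committed[self.front_block] = SlotState.IN_CACHE
        self.front_block += 1
        self.slots.append(SlotState.DUMMY)  # keep the slot count constant
        return True

    def running_set(self):
        """Logical indices of real resident blocks (non-dummy slots)."""
        return [
            self.front_block + i
            for i, state in enumerate(self.slots)
            if state is not SlotState.DUMMY
        ]


if __name__ == "__main__":
    buf = BlockBuffer(num_slots=4)
    buf.activate_next()
    buf.activate_next()
    buf.complete(0)
    buf.commit_front()
    print(buf.running_set(), len(buf.slots), sorted(buf.committed))
```

The point of the sketch is that `len(buf.slots)` never changes: admitting and retiring blocks only flips slot states, which is what keeps the physical input shape static.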

Main Results
The experiments evaluate mathematical reasoning on GSM8K and MATH500, and code generation on MBPP+ and HumanEval+. The paper reports Accuracy, Tokens Per Forward pass (TPF), and Accuracy Under Parallelism (AUP), where TPF measures decoding parallelism and AUP summarizes the accuracy-parallelism trade-off.
The main trend is consistent across models: MBD-LMs substantially improve TPF over native SingleBD, and MultiTF often recovers or improves the quality lost by training-free MultiBD. On LLaDA2-Mini, MultiTF raises average accuracy from 78.59% under training-free MultiBD to 81.03%, while further increasing average TPF from 4.41 to 6.19. On SDAR-8B-Chat-b32, MBD-SDAR-8B-Chat-b32 increases average TPF from 2.54 to 4.46 and improves average accuracy from 69.00% to 69.74%.


Throughput: From TPF to TPS
Higher TPF does not automatically imply proportional wall-clock speedup because MultiBD processes a larger static Block Buffer at each forward pass. The paper therefore separates useful committed tokens from the per-step computational workload. Increasing the buffer size can improve throughput when the useful-token gain outweighs the extra per-step cost introduced by resident blocks and dummy slots.
On the H100 TP=2 setup reported in Table 3, MBD-LLaDA2-Mini increases average TPF from 3.47 to 6.19 while step latency rises from 7.07 ms to 8.78 ms. The measured average TPS increases from 517.16 to 745.92. With DMax, MBD-LLaDA2-Mini-DMax increases average TPF from 6.35 to 9.34, and average TPS rises from 779.49 to 926.67 in the same table.
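As a rough sanity check on these numbers, tokens per second can be approximated as TPF divided by step latency. The short Python sketch below applies that back-of-the-envelope formula to the averages quoted above. The helper `estimate_tps` is hypothetical, and the estimate is not expected to reproduce the reported TPS exactly, since a ratio of averaged TPF and latency generally differs from an average of measured per-run throughputs.

```python
def estimate_tps(tpf, step_latency_ms):
    """Approximate tokens per second as tokens per forward / seconds per forward."""
    return tpf / (step_latency_ms / 1000.0)


# Averages quoted in the text for LLaDA2-Mini (SingleBD) and MBD-LLaDA2-Mini.
baseline_est = estimate_tps(3.47, 7.07)
mbd_est = estimate_tps(6.19, 8.78)

estimated_speedup = mbd_est / baseline_est
reported_speedup = 745.92 / 517.16

if __name__ == "__main__":
    print(f"estimated TPS: {baseline_est:.1f} -> {mbd_est:.1f}")
    print(f"estimated speedup: {estimated_speedup:.3f}x")
    print(f"reported speedup:  {reported_speedup:.3f}x")
```

Both the estimate and the reported figures give a speedup of roughly 1.44x, well below the roughly 1.78x gain in TPF, because each step over the larger Block Buffer is about 24% slower.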

Conclusion
Citation
The official BibTeX entry will be added after the paper is publicly released.
References
[1] Marianne Arriola et al. Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. ICLR, 2025.
[2] Tiwei Bie et al. LLaDA2.0: Scaling Up Diffusion Language Models to 100B. arXiv preprint, 2025.
[3] Xu Wang et al. Diffusion LLMs Can Do Faster-than-AR Inference via Discrete Diffusion Forcing. arXiv preprint, 2025.
[4] Zigeng Chen et al. DMax: Aggressive Parallel Decoding for dLLMs. arXiv preprint, 2026.
[5] Shuang Cheng et al. SDAR: A Synergistic Diffusion-Autoregression Paradigm for Scalable Sequence Generation. arXiv preprint, 2025.
