Multi-Block Diffusion¶
Multi-Block Diffusion (MultiBD) is the inference formulation used by Multi-Block Diffusion Language Models (MBD-LMs). Native block diffusion language models often run Single-Block Diffusion (SingleBD): one noisy block is iteratively denoised, committed into the KV cache, and only then can the next block start. SingleBD preserves KV caching, but it leaves inter-block parallelism unused.
MultiBD keeps a bounded running-set of consecutive blocks active at the same time. Earlier blocks may be complete and waiting to enter the KV cache while later blocks are already being refined. This exposes inter-block parallelism without giving up the clean cached prefix that makes block diffusion models servable.
In Diffulex, the runtime option for this method is:
decoding_strategy: multi_bd
This page describes the engine-side implementation. For reproducing the
MBD-LMs experiments and their training recipe, use the Diffulex
mbd-lms branch.
For new runtime development and open-source contributions, use
main.
Method Terms¶
Term |
Meaning in Diffulex |
|---|---|
SingleBD |
One active noisy block is decoded at a time; later blocks wait for the current block to finish and enter KV cache. |
MultiBD |
A bounded active block set is decoded concurrently with block-causal visibility. |
Running-set |
The consecutive blocks that have not yet become part of the clean cached prefix. |
Block Buffer |
A fixed-size physical buffer for resident blocks. It keeps shapes stable while logical blocks enter, complete, and commit to KV cache. |
MultiTF |
The MBD-LMs post-training recipe that aligns training states with practical MultiBD inference states. This is part of the experiment branch, not a server flag. |
Runtime Mapping¶
decoding_strategy="multi_bd" selects the block-aware request state,
scheduler, KV cache manager, model runner, and attention metadata path. The
core implementation lives in diffulex.engine and diffulex.mixin.multi_block.
At a high level:
The request state tracks block-level progress and the active running-set.
The scheduler decides when another block can enter the active set.
Completed front blocks are committed into the KV cache.
Prefix caching reuses the clean cached prefix for later steps and requests.
Static-shape execution can run over the configured block/buffer layout.
For d2f, config normalization forces multi_block_prefix_full=True, which
uses full-prefix multi-block behavior and disables prefix caching. For
multi_bd and dmax, config normalization forces
multi_block_prefix_full=False, which is the block-causal path needed by
prefix caching.
When It Applies¶
Use multi_bd for model families that are configured for block-causal
multi-block decoding, such as LLaDA2, SDAR, SDAR-MoE, Fast-dLLM-v2, Stable-
DiffCoder, and Dream reasoner paths.
There are two common usage modes:
Mode |
What it means |
|---|---|
Training-free MultiBD |
Run an existing compatible BD-LM with |
MBD-LM reproduction |
Use checkpoints/configs from the MBD-LMs experiment setup, where MultiTF was used to align training with MultiBD inference. Reproduce these through the |
DMax-style token merging composes with the same block-causal runtime ideas, but
uses decoding_strategy="dmax" because it has additional edit-sampling and
token-merge requirements.
Block Size¶
block_size controls the token span managed as one diffusion block.
For most model families, choose one of 4, 8, 16, or 32. The general
default is 32. model_name="diffusion_gemma" uses 256, and config
normalization forces that value for both block and page size.
block_size must not exceed page_size. If you change one, check the other at
the same time so the KV cache layout still matches the decoding block layout.
Larger block sizes can reduce block-management overhead but increase the amount of work tied to one block. Smaller block sizes expose more scheduling granularity but can increase bookkeeping pressure.
Buffer Size¶
buffer_size controls how many active diffusion blocks the request can keep in
the multi-block buffer.
The general default is 4. model_name="diffusion_gemma" is normalized to
1.
Increasing the buffer can expose more inter-block parallelism and improve overlap between block progress, scheduling, and KV-cache commits. It also increases active state and per-step work. When debugging, use the strategy default or a small value before tuning throughput.