# Multi-Block Diffusion

**Multi-Block Diffusion (MultiBD)** is the inference formulation used by
Multi-Block Diffusion Language Models (MBD-LMs). Native block diffusion language
models often run **Single-Block Diffusion (SingleBD)**: one noisy block is
iteratively denoised, committed into the KV cache, and only then can the next
block start. SingleBD preserves KV caching, but it leaves inter-block
parallelism unused.

MultiBD keeps a bounded running-set of consecutive blocks active at the same
time. Earlier blocks may be complete and waiting to enter the KV cache while
later blocks are already being refined. This exposes inter-block parallelism
without giving up the clean cached prefix that makes block diffusion models
servable.

In Diffulex, the runtime option for this method is:

```yaml
decoding_strategy: multi_bd
```

This page describes the engine-side implementation. For reproducing the
MBD-LMs experiments and their training recipe, use the Diffulex
[`mbd-lms`](https://github.com/SJTU-DENG-Lab/Diffulex/tree/mbd-lms) branch.
For new runtime development and open-source contributions, use
[`main`](https://github.com/SJTU-DENG-Lab/Diffulex/tree/main).

## Method Terms

| Term | Meaning in Diffulex |
| --- | --- |
| SingleBD | One active noisy block is decoded at a time; later blocks wait for the current block to finish and enter KV cache. |
| MultiBD | A bounded active block set is decoded concurrently with block-causal visibility. |
| Running-set | The consecutive blocks that have not yet become part of the clean cached prefix. |
| Block Buffer | A fixed-size physical buffer for resident blocks. It keeps shapes stable while logical blocks enter, complete, and commit to KV cache. |
| MultiTF | The MBD-LMs post-training recipe that aligns training states with practical MultiBD inference states. This is part of the experiment branch, not a server flag. |
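
The table refers to block-causal visibility without spelling it out. The sketch below assumes the usual reading: a token attends to every token in its own block and in all earlier blocks, and to nothing in later blocks. `block_causal_mask` is a hypothetical helper for illustration, not a Diffulex function.

```python
def block_causal_mask(num_tokens: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Assumption: visibility is decided per block, so a token sees its own
    block (in both directions) and every earlier block.
    """
    block_of = [i // block_size for i in range(num_tokens)]
    return [
        [block_of[k] <= block_of[q] for k in range(num_tokens)]
        for q in range(num_tokens)
    ]


mask = block_causal_mask(num_tokens=6, block_size=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Under this assumption, the keys and values of a committed block never depend on later blocks, which is consistent with the page's point that a clean cached prefix can be reused.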

## Runtime Mapping

`decoding_strategy="multi_bd"` selects the block-aware request state,
scheduler, KV cache manager, model runner, and attention metadata path. The
core implementation lives in `diffulex.engine` and `diffulex.mixin.multi_block`.

At a high level:

1. The request state tracks block-level progress and the active running-set.
2. The scheduler decides when another block can enter the active set.
3. Completed front blocks are committed into the KV cache.
4. Prefix caching reuses the clean cached prefix for later steps and requests.
5. Because the Block Buffer has a fixed size, execution can use static shapes over the configured block/buffer layout.
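
The admit, refine, and commit steps above can be sketched as a toy simulation. This is not the Diffulex scheduler. It assumes every block needs the same number of refinement steps, at most one block is admitted per engine step, and every resident block is refined once per step. The function name and parameters are hypothetical.

```python
from collections import deque


def simulate(num_blocks: int, steps_per_block: int, buffer_size: int) -> int:
    """Return the number of engine steps needed to commit every block.

    Toy assumptions (not taken from Diffulex): fixed refinement cost per
    block, one admission per engine step, in-order commits only.
    """
    running = deque()  # entries are [block_id, remaining_steps], oldest first
    next_block = 0
    committed = 0
    engine_steps = 0
    while committed < num_blocks:
        # Scheduler: admit one more block if the running-set has room.
        if next_block < num_blocks and len(running) < buffer_size:
            running.append([next_block, steps_per_block])
            next_block += 1
        # Model step: refine every unfinished resident block once.
        for entry in running:
            if entry[1] > 0:
                entry[1] -= 1
        engine_steps += 1
        # Commit: only completed blocks at the front may enter the KV cache.
        while running and running[0][1] == 0:
            running.popleft()
            committed += 1
    return engine_steps


print(simulate(num_blocks=8, steps_per_block=4, buffer_size=1))  # 32
print(simulate(num_blocks=8, steps_per_block=4, buffer_size=4))  # 11
```

With `buffer_size=1` the toy model behaves like SingleBD and pays the full per-block cost serially. With a larger buffer, later blocks overlap with earlier ones, so the step count drops. Real step counts depend on the model, the sampler, and the scheduler policy.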

For `d2f`, config normalization forces `multi_block_prefix_full=True`, which
uses full-prefix multi-block behavior and disables prefix caching. For
`multi_bd` and `dmax`, config normalization forces
`multi_block_prefix_full=False`, which is the block-causal path needed by
prefix caching.
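
The normalization rule in this paragraph can be written as a small lookup. The function below is a hypothetical illustration of the rule as stated here, not the actual Diffulex config code. The `prefix_caching_possible` key is an invented name for what the paragraph describes.

```python
def normalize_prefix_mode(decoding_strategy: str) -> dict[str, bool]:
    """Map a decoding strategy to the prefix settings described above.

    Hypothetical helper: only the three strategies named on this page are
    covered, and anything else is rejected rather than guessed.
    """
    if decoding_strategy == "d2f":
        prefix_full = True  # full-prefix multi-block behavior
    elif decoding_strategy in ("multi_bd", "dmax"):
        prefix_full = False  # block-causal path
    else:
        raise ValueError(f"strategy not covered by this sketch: {decoding_strategy!r}")
    return {
        "multi_block_prefix_full": prefix_full,
        # Prefix caching needs the block-causal path.
        "prefix_caching_possible": not prefix_full,
    }


for strategy in ("d2f", "multi_bd", "dmax"):
    print(strategy, normalize_prefix_mode(strategy))
```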

## When It Applies

Use `multi_bd` for model families that are configured for block-causal
multi-block decoding, such as LLaDA2, SDAR, SDAR-MoE, Fast-dLLM-v2,
Stable-DiffCoder, and Dream reasoner paths.

There are two common usage modes:

| Mode | What it means |
| --- | --- |
| Training-free MultiBD | Run an existing compatible BD-LM with `decoding_strategy="multi_bd"`. This can improve parallelism, but quality depends on how well the checkpoint tolerates MultiBD states. |
| MBD-LM reproduction | Use checkpoints/configs from the MBD-LMs experiment setup, where MultiTF was used to align training with MultiBD inference. Reproduce these through the `mbd-lms` branch. |

DMax-style token merging composes with the same block-causal runtime ideas, but
uses `decoding_strategy="dmax"` because it has additional edit-sampling and
token-merge requirements.

## Block Size

`block_size` controls the token span managed as one diffusion block.

For most model families, choose one of `4`, `8`, `16`, or `32`. The general
default is `32`. `model_name="diffusion_gemma"` uses `256`, and config
normalization forces that value for both block and page size.

`block_size` must not exceed `page_size`. If you change one, check the other at
the same time so the KV cache layout still matches the decoding block layout.

Larger block sizes can reduce block-management overhead but increase the amount
of work tied to one block. Smaller block sizes expose more scheduling
granularity but can increase bookkeeping pressure.
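
The constraints in this section can be checked before launching a run. The helper below is a minimal sketch of those rules, not the Diffulex normalization code. `normalize_block_layout` and the `"some_model"` placeholder are hypothetical.

```python
def normalize_block_layout(
    model_name: str, block_size: int, page_size: int
) -> tuple[int, int]:
    """Return (block_size, page_size) after applying the rules on this page.

    Hypothetical helper: `diffusion_gemma` is forced to 256 for both values,
    and every other model must satisfy block_size <= page_size.
    """
    if model_name == "diffusion_gemma":
        return 256, 256
    if block_size > page_size:
        raise ValueError(
            f"block_size ({block_size}) must not exceed page_size ({page_size})"
        )
    return block_size, page_size


print(normalize_block_layout("some_model", block_size=16, page_size=32))
print(normalize_block_layout("diffusion_gemma", block_size=32, page_size=32))
```

Raising an error for an oversized block, rather than silently clamping it, is a choice made for this sketch. The page only states that the constraint exists.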

## Buffer Size

`buffer_size` controls how many active diffusion blocks the request can keep in
the multi-block buffer.

The general default is `4`. `model_name="diffusion_gemma"` is normalized to
`1`.

Increasing the buffer can expose more inter-block parallelism and improve
overlap between block progress, scheduling, and KV-cache commits. It also
increases active state and per-step work. When debugging, use the strategy
default or a small value before tuning throughput.
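
As a rough guide to the cost side of this trade-off, the arithmetic below estimates an upper bound on resident noisy tokens per request. It assumes every buffer slot holds one full block of `block_size` tokens. The helper name is hypothetical, and actual memory and compute also depend on the model and the cached prefix.

```python
def max_resident_tokens(block_size: int, buffer_size: int) -> int:
    """Upper bound on tokens held in the block buffer for one request.

    Assumption: each of the `buffer_size` slots holds one full block.
    """
    return block_size * buffer_size


# General defaults from this page: block_size=32, buffer_size=4.
print(max_resident_tokens(block_size=32, buffer_size=4))  # 128

# Effect of growing the buffer at a fixed block size.
for buffer_size in (1, 2, 4, 8):
    print(buffer_size, max_resident_tokens(32, buffer_size))
```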

## Related Arguments

| Surface | Names | Notes |
| --- | --- | --- |
| Python/config | `block_size`, `buffer_size` | Primary knobs for block span and active block count. |
| CLI | `--block-size`, `--buffer-size` | Use for quick serving or benchmark overrides. |
| Related config | `page_size`, `multi_block_prefix_full` | `page_size` must stay compatible with `block_size`; strategy normalization controls `multi_block_prefix_full`. |