Demo Videos — Multi-Block Diffusion Language Models

Decoding Demo Videos

LLaDA2 + MBD: Same Prompts, Massive Speedup

Each group below runs the same set of prompts across three stages of the same LLaDA2 model: vanilla LLaDA2-Mini (native single-block decoding), MBD-LLaDA2-Mini (our MultiBD post-training), and MBD-LLaDA2-Mini-DMax (MultiBD with aggressive DMax parallel decoding). All demos run on a single NVIDIA A100-SXM4-80GB through the Diffulex engine, so the speedup from left to right comes purely from our method. On newer hardware the gap widens further.

How It Works

Featured Diffulex trace

MBD-LLaDA2-Mini-DMax Demo

This selected trace uses MBD-LLaDA2-Mini-DMax, the fastest model we trained.

Playback note. The demo videos are rendered through a Streamlit frontend, whose overhead can hide much of the engine-side throughput advantage. To judge the actual engine path, use the aggregate TPS numbers in the Runtime Engine section.
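As a rough illustration of what an aggregate TPS (tokens per second) figure means, the sketch below divides total generated tokens by total time. The `RequestStats` type and `aggregate_tps` function are hypothetical names made up for this example; they are not part of Diffulex, and the engine may aggregate its numbers differently (for instance over a concurrent batch, which the optional `total_wall_seconds` argument is meant to approximate).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RequestStats:
    """Hypothetical per-request record; not a Diffulex type."""
    generated_tokens: int
    wall_seconds: float


def aggregate_tps(stats: Iterable[RequestStats],
                  total_wall_seconds: Optional[float] = None) -> float:
    """Total generated tokens divided by total elapsed time.

    If total_wall_seconds is omitted, per-request times are summed,
    which assumes the requests ran one after another. For requests
    served concurrently, pass the measured wall-clock time of the
    whole batch instead.
    """
    stats = list(stats)
    total_tokens = sum(s.generated_tokens for s in stats)
    if total_wall_seconds is None:
        total_wall_seconds = sum(s.wall_seconds for s in stats)
    if total_wall_seconds <= 0:
        raise ValueError("total wall time must be positive")
    return total_tokens / total_wall_seconds


runs = [RequestStats(100, 1.0), RequestStats(300, 3.0)]
sequential_tps = aggregate_tps(runs)                       # 400 tokens / 4.0 s
batched_tps = aggregate_tps(runs, total_wall_seconds=2.0)  # 400 tokens / 2.0 s
```

Measuring this way on the engine side, rather than timing what appears in the browser, keeps frontend rendering cost out of the number.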

Vanilla LLaDA2-Mini

Native single-block decoding baseline. Same prompts, sequential block refinement.

MBD-LLaDA2-Mini

Our MultiBD post-training applied to the same model. Inter-block parallelism via a bounded running set.
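To give an intuition for why a bounded running set can reduce the number of decoding rounds, here is a toy scheduling sketch. It is an assumption-laden simplification and not the MultiBD algorithm: `count_decode_steps` is a hypothetical helper, each block is assumed to need a fixed number of refinement steps, and dependencies between blocks are ignored entirely.

```python
from collections import deque
from typing import Sequence


def count_decode_steps(steps_needed: Sequence[int], max_running: int) -> int:
    """Count global rounds when at most max_running blocks refine at once.

    steps_needed[i] is the (assumed) number of refinement steps block i
    requires. Blocks are admitted in order whenever a slot is free, and
    every running block advances by one step per global round.
    """
    if max_running < 1:
        raise ValueError("max_running must be at least 1")
    pending = deque(enumerate(steps_needed))
    running = {}  # block index -> remaining steps
    global_steps = 0
    while pending or running:
        # Fill free slots in the running set, preserving block order.
        while pending and len(running) < max_running:
            idx, need = pending.popleft()
            if need > 0:
                running[idx] = need
        if not running:
            break
        global_steps += 1
        # Advance every running block; retire the ones that finish.
        for idx in list(running):
            running[idx] -= 1
            if running[idx] == 0:
                del running[idx]
    return global_steps


single_block = count_decode_steps([3, 1, 2], max_running=1)  # sequential baseline
two_blocks = count_decode_steps([3, 1, 2], max_running=2)    # bounded running set of 2
```

With `max_running=1` the toy reduces to sequential block refinement; raising the bound lets later blocks start while earlier ones are still being refined, which is the effect the videos are meant to show.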

MBD-LLaDA2-Mini-DMax

MultiBD + DMax aggressive parallel decoding. Highest throughput, same model backbone.

DiffusionGemma

Same method applied to a second diffusion model family for cross-architecture comparison.