LLaDA2 + MBD: Same Prompts, Massive Speedup
Each group below runs the same set of prompts across three stages of the same LLaDA2 model: vanilla LLaDA2-Mini (native single-block decoding), MBD-LLaDA2-Mini (our MultiBD post-training), and MBD-LLaDA2-Mini-DMax (with aggressive DMax parallel decoding). The speed progression from left to right comes purely from our method: every demo runs on the same single NVIDIA A100-SXM4-80GB through the Diffulex engine. On newer hardware the gap widens further.
MBD-LLaDA2-Mini-DMax Demo
This selected trace uses MBD-LLaDA2-Mini-DMax, the fastest model we trained.
Playback note. The demo videos pass through a Streamlit frontend, which can consume much of the engine-side throughput advantage. Use the aggregate TPS numbers in the Runtime Engine section to judge the actual engine path.
Vanilla LLaDA2-Mini
Native single-block decoding baseline. Same prompts, sequential block refinement.
MBD-LLaDA2-Mini
Our MultiBD post-training applied to the same model. Inter-block parallelism via a bounded running set.
MBD-LLaDA2-Mini-DMax
MultiBD + DMax aggressive parallel decoding. Highest throughput, same model backbone.
DiffusionGemma
Same method applied to a second diffusion model family for cross-architecture comparison.