Home
Diffusion Language Model Serving Engine
Diffulex is a Diffusion Language Model Serving Engine built on PagedAttention-style runtime primitives. It provides a unified engine for KV cache management, block scheduling, prefix reuse, MoE execution, CUDA graph replay, and model-specific diffusion samplers.
Diffulex is also the runtime engine behind the Multi-Block Diffusion Language
Models (MBD-LMs) line of work. Native Block Diffusion LMs perform
Single-Block Diffusion (SingleBD): each forward pass refines one noisy block
conditioned on a clean cached prefix. This preserves KV caching but leaves
blocks sequential, creating a store bubble: a forward pass that the GPU runs
without producing any new output. Multi-Block Diffusion (MultiBD) removes this
bottleneck by maintaining a bounded running set of consecutive blocks, enabling
decode-store overlap and inter-block parallelism. MBD-LMs are BD-LMs
post-trained with Multi-block Teacher Forcing (MultiTF) so that the model can
handle the running-set states that arise in practical MultiBD decoding. Diffulex
executes them with an optimized Block Buffer runtime that preserves static input
shapes for CUDA Graph replay. In the engine, MultiBD is exposed as
decoding_strategy=multi_bd.
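The effect of decode-store overlap can be illustrated with a toy cost model. The sketch below is not Diffulex's scheduler. It assumes, purely for illustration, that every block needs a fixed number of denoising forwards followed by one store forward (the pass that produces no new output), and that under MultiBD the store of a finished block shares a forward pass with the first denoising step of the next block. The function name `simulate`, the fixed step count, and the two-block running set are all hypothetical.

```python
from typing import List, Tuple

Forward = List[Tuple[int, str]]  # work done in one forward pass: (block index, action)


def simulate(num_blocks: int, denoise_steps: int, overlap: bool) -> List[Forward]:
    """Return the list of forward passes for a toy block-diffusion schedule.

    overlap=False models SingleBD: each block is denoised, then stored in a
    separate forward that yields no new tokens (the store bubble).
    overlap=True models decode-store overlap only: the store of block b-1 is
    folded into the first denoising forward of block b (hypothetical rule).
    """
    if denoise_steps < 1:
        raise ValueError("each block needs at least one denoising step")
    timeline: List[Forward] = []
    for b in range(num_blocks):
        for step in range(denoise_steps):
            work: Forward = [(b, "denoise")]
            if overlap and step == 0 and b > 0:
                # The previous block is stored in the same pass, for free.
                work.insert(0, (b - 1, "store"))
            timeline.append(work)
        # A standalone store pass is needed for every block in SingleBD,
        # but only for the final block when stores are overlapped.
        if not overlap or b == num_blocks - 1:
            timeline.append([(b, "store")])
    return timeline


single = simulate(num_blocks=4, denoise_steps=3, overlap=False)
multi = simulate(num_blocks=4, denoise_steps=3, overlap=True)
print(len(single), len(multi))
```

Under these assumptions, 4 blocks with 3 denoising steps each take 16 forwards sequentially and 13 with overlap; the three saved forwards are the store bubbles of the non-final blocks. Inter-block parallelism, where several blocks are denoised in the same pass, is not modelled here.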
To reproduce the MBD-LMs experiments, use the Diffulex mbd-lms branch
(CUDA 12). For engine development, open-source contributions, or exploring new
decoding algorithms and turning them into runnable systems, use the main branch
(CUDA 13). The main branch contains ongoing runtime and model-specific
optimizations, so its behavior and performance profile may differ from the
experiment reproduction branch.
Where to Start
| Goal | Start here |
|---|---|
| Understand MultiBD in the engine | |
| Install Diffulex and run one command | |
| Set up Python, CUDA, and vLLM dependencies | |
| Run GSM8K or other lm-eval benchmarks | |
| Start the HTTP server | |
| Tune engine or YAML parameters | |
| Use Diffulex as a research backend | |
| Add a model, strategy, or kernel | |
Current Scope
Diffulex focuses on cache-aware block-wise dLLM decoding. The main supported runtime pieces are:
- PagedAttention-style KV cache management for diffusion decoding.
- Strategy-specific schedulers and request state.
- Prefix caching for block-causal Multi-Block Diffusion.
- Tensor and data parallel inference paths.
- Optional vLLM-backed common layers and MoE kernels.
- Benchmark and HTTP serving entry points.
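To make the paged KV cache and prefix caching items above more concrete, here is a minimal sketch of how fixed-size pages can be shared between requests that start with the same tokens. It illustrates the general technique only and is not Diffulex's implementation: the class `ToyPagedCache`, its `allocate` method, and the page size of 4 tokens are hypothetical, and no real KV tensors are stored.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4  # tokens per KV page (illustrative value)


class ToyPagedCache:
    """Maps the token prefix covered by each full page to a physical page id."""

    def __init__(self) -> None:
        self._page_of_prefix: Dict[Tuple[int, ...], int] = {}
        self._next_page_id = 0

    def _new_page(self) -> int:
        page_id = self._next_page_id
        self._next_page_id += 1
        return page_id

    def allocate(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (page table, number of reused pages) for one request."""
        page_table: List[int] = []
        reused = 0
        full_len = len(tokens) - len(tokens) % PAGE_SIZE
        for end in range(PAGE_SIZE, full_len + 1, PAGE_SIZE):
            # Key on the whole prefix: cached KV for a page is only valid if
            # every earlier token matches too. Real systems typically chain
            # hashes instead of storing the full tuple.
            key = tuple(tokens[:end])
            if key in self._page_of_prefix:
                reused += 1
            else:
                self._page_of_prefix[key] = self._new_page()
            page_table.append(self._page_of_prefix[key])
        if full_len < len(tokens):
            # A partially filled page is still being written, so it stays private.
            page_table.append(self._new_page())
        return page_table, reused


cache = ToyPagedCache()
first_table, first_reused = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])
second_table, second_reused = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 10, 11])
print(first_table, first_reused)
print(second_table, second_reused)
```

In this toy run the second request shares its first two pages with the first request and allocates a fresh page only for its differing tail. Keying on the full prefix rather than on a page's own tokens is what keeps reuse correct when two prompts contain the same page of tokens at different positions or after different histories.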
For new algorithms, Diffulex main is intended to be a research backend rather
than only a benchmark runner. Its Block Buffer, paged KV cache, scheduler,
sampler, and Triton kernel boundaries are designed so block-level generation
ideas can be implemented as strategy components. See
Research Engine for the implementation
map.
Model Families
| Model family | | Typical strategy | Status |
|---|---|---|---|
| Dream / D2F-Dream | | | Supported |
| DiffuCoder / D2F-DiffuCoder | | | Supported |
| Dream reasoner | | | Supported |
| Stable-DiffCoder | | | Supported |
| LLaDA / D2F-LLaDA | | | Supported |
| Fast-dLLM-v2 | | | Supported |
| SDAR | | | Supported |
| SDAR-MoE | | | Supported |
| LLaDA2 family | | | Supported |
| DiffusionGemma | | | Supported |
See Models for compatibility details before mixing model names, strategies, and sampling modes.