Diffulex

Diffusion Language Model Serving Engine

Diffulex is a Diffusion Language Model Serving Engine built on PagedAttention-style runtime primitives. It provides a unified engine for KV cache management, block scheduling, prefix reuse, MoE execution, CUDA graph replay, and model-specific diffusion samplers.

Diffulex is also the runtime engine behind the Multi-Block Diffusion Language Models (MBD-LMs) line of work. Native Block Diffusion LMs (BD-LMs) perform Single-Block Diffusion (SingleBD): each forward pass refines one noisy block conditioned on a clean cached prefix. This preserves KV caching but keeps blocks strictly sequential, which creates a store bubble: a forward pass that the GPU runs but that produces no new output. Multi-Block Diffusion (MultiBD) removes this bottleneck by maintaining a bounded running set of consecutive blocks, enabling decode-store overlap and inter-block parallelism. MBD-LMs are BD-LMs post-trained with Multi-block Teacher Forcing (MultiTF) so that the model can handle the running-set states that arise in practical MultiBD decoding. Diffulex executes them with an optimized Block Buffer runtime that preserves static input shapes for CUDA Graph replay. In the engine, MultiBD is exposed as decoding_strategy=multi_bd.
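The effect of the store bubble on forward-pass count can be illustrated with a toy model. The sketch below is not Diffulex code: the function names, the fixed number of denoising passes per block, and the rule that every block in the running set advances on every pass are simplifying assumptions made only for illustration. Real step counts depend on the model and sampler.

```python
from collections import deque


def single_bd_passes(num_blocks: int, steps_per_block: int) -> int:
    """Toy SingleBD count: each block takes its denoising passes,
    followed by one store-only pass that produces no new output."""
    return num_blocks * (steps_per_block + 1)


def multi_bd_passes(num_blocks: int, steps_per_block: int,
                    running_set_size: int) -> int:
    """Toy MultiBD count (assumed scheduling rule, not the real engine):
    each pass stores a clean head block and refines the noisy blocks
    that remain in the running set."""
    if running_set_size < 1:
        raise ValueError("running_set_size must be at least 1")
    running = deque()  # remaining denoising steps per block, oldest first
    admitted = 0
    passes = 0
    while admitted < num_blocks or running:
        # Admit consecutive blocks until the bounded running set is full.
        while admitted < num_blocks and len(running) < running_set_size:
            running.append(steps_per_block)
            admitted += 1
        passes += 1
        # A head block that is already clean is stored during this pass.
        if running[0] == 0:
            running.popleft()
        # Noisy blocks still in the running set are refined in the same pass.
        for i in range(len(running)):
            if running[i] > 0:
                running[i] -= 1
    return passes
```

In this toy, a running set of size 1 gives the same count as SingleBD, while a larger running set needs fewer passes because store steps share a forward pass with the refinement of later blocks.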

To reproduce the MBD-LMs experiments, use the Diffulex mbd-lms branch (CUDA 12). For engine development, open-source contributions, or exploring new decoding algorithms and turning them into runnable systems, use the main branch, which requires CUDA 13. The main branch contains ongoing runtime and model-specific optimizations, so its behavior and performance profile may differ from the experiment reproduction branch.

Where to Start

| Goal | Start here |
| --- | --- |
| Understand MultiBD in the engine | Multi-Block Diffusion |
| Install Diffulex and run one command | Quickstart |
| Set up Python, CUDA, and vLLM dependencies | Installation |
| Run GSM8K or other lm-eval benchmarks | Benchmark |
| Start the HTTP server | Server |
| Tune engine or YAML parameters | Configuration |
| Use Diffulex as a research backend | Research Engine |
| Add a model, strategy, or kernel | Developer Guide |

Current Scope

Diffulex focuses on cache-aware block-wise dLLM decoding. The main supported runtime pieces are:

  • PagedAttention-style KV cache management for diffusion decoding.

  • Strategy-specific schedulers and request state.

  • Prefix caching for block-causal Multi-Block Diffusion.

  • Tensor and data parallel inference paths.

  • Optional vLLM-backed common layers and MoE kernels.

  • Benchmark and HTTP serving entry points.
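To make "PagedAttention-style" concrete: the general idea is that each request keeps a block table mapping its logical KV blocks to physical cache pages drawn from a shared free pool. The following is a minimal sketch of that idea only. The class and method names are hypothetical and do not reflect Diffulex's internal API.

```python
class PagedKVCache:
    """Toy paged KV cache bookkeeping (hypothetical, for illustration)."""

    def __init__(self, num_pages: int):
        # Physical pages not currently owned by any request.
        self.free_pages = list(range(num_pages))
        # request_id -> list of physical page ids, in logical block order.
        self.block_tables = {}

    def append_block(self, request_id: str) -> int:
        """Give the request one more logical block backed by a free page."""
        if not self.free_pages:
            raise RuntimeError("KV cache is full")
        page = self.free_pages.pop()
        self.block_tables.setdefault(request_id, []).append(page)
        return page

    def release(self, request_id: str) -> None:
        """Return all pages owned by a finished request to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
```

Because pages are fixed-size and allocated on demand, a request's cache does not need to be contiguous in memory, and pages freed by one request can be reused by another.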

For new algorithms, Diffulex main is intended to be a research backend rather than only a benchmark runner. Its Block Buffer, paged KV cache, scheduler, sampler, and Triton kernel boundaries are designed so block-level generation ideas can be implemented as strategy components. See Research Engine for the implementation map.

Model Families

| Model family | model_name | Typical strategy | Status |
| --- | --- | --- | --- |
| Dream / D2F-Dream | dream | d2f | Supported |
| DiffuCoder / D2F-DiffuCoder | diffucoder | d2f | Supported |
| Dream reasoner | dream_reasoner | multi_bd | Supported |
| Stable-DiffCoder | stable_diffcoder | multi_bd | Supported |
| LLaDA / D2F-LLaDA | llada | d2f | Supported |
| Fast-dLLM-v2 | fast_dllm_v2 | multi_bd or fast_dllm_v2 | Supported |
| SDAR | sdar | multi_bd | Supported |
| SDAR-MoE | sdar_moe | multi_bd | Supported |
| LLaDA2 family | llada2, llada2_mini, llada2_moe, llada2dot1_mini, llada2_mini_dmax | multi_bd, dmax, or fast_dllm_v2 | Supported |
| DiffusionGemma | diffusion_gemma | diffusion_gemma | Supported |

See Models for compatibility details before mixing model names, strategies, and sampling modes.
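As a quick sanity check before launching a run, the model_name and strategy pairings listed above can be captured in a small lookup. This is a hypothetical helper, not part of Diffulex. It also assumes that every LLaDA2 model_name accepts every strategy listed for the family, which the table does not state per model.

```python
# Typical strategies per model_name, transcribed from the Model Families table.
TYPICAL_STRATEGIES = {
    "dream": {"d2f"},
    "diffucoder": {"d2f"},
    "dream_reasoner": {"multi_bd"},
    "stable_diffcoder": {"multi_bd"},
    "llada": {"d2f"},
    "fast_dllm_v2": {"multi_bd", "fast_dllm_v2"},
    "sdar": {"multi_bd"},
    "sdar_moe": {"multi_bd"},
    "diffusion_gemma": {"diffusion_gemma"},
}

# The table lists strategies for the LLaDA2 family as a whole; applying the
# full set to each member is an assumption of this sketch.
for _name in ("llada2", "llada2_mini", "llada2_moe",
              "llada2dot1_mini", "llada2_mini_dmax"):
    TYPICAL_STRATEGIES[_name] = {"multi_bd", "dmax", "fast_dllm_v2"}


def is_typical_pairing(model_name: str, decoding_strategy: str) -> bool:
    """Return True if the pairing appears in the table; unknown names give False."""
    return decoding_strategy in TYPICAL_STRATEGIES.get(model_name, set())
```

A False result only means the pairing is not listed as typical here. The Models page remains the reference for what is actually supported.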