Research Engine

Diffulex main is the active engine branch for researchers who want to turn a diffusion language model idea into a runnable, profiled, and servable system. Use the Diffulex mbd-lms branch to reproduce the reported MBD-LMs experiments; use main to build new decoding algorithms, cache behavior, model support, kernels, or serving features.

Why This Backend Fits dLLM Research

Most block-level dLLM inference algorithms can be expressed as a small set of runtime decisions:

  • what block state each request owns;

  • which blocks are active in the current running set;

  • when a block can be appended, committed, rewritten, or evicted;

  • how the active block view maps to prefix KV cache and paged attention;

  • how logits are converted back into masked-token updates.

Diffulex keeps these concerns separated. This makes it practical to implement new algorithms on top of the same serving backend instead of rebuilding scheduler, cache, attention, CUDA graph, benchmark, and HTTP serving paths for each idea.
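
As an illustration of the first and third decisions above, the following is a minimal sketch of a request-local block lifecycle. Everything in it (BlockStatus, Block, MASK_ID, the transition table) is a hypothetical name chosen for this example and is not taken from diffulex/engine/dllm_block.py; it only shows one way append, rewrite, commit, and evict rules could be enforced in a single place.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

MASK_ID = -1  # hypothetical sentinel for a still-masked position


class BlockStatus(Enum):
    PENDING = auto()    # allocated, not yet in the running set
    ACTIVE = auto()     # currently being denoised
    COMMITTED = auto()  # all tokens fixed; eligible for the prefix cache
    EVICTED = auto()    # dropped, for example on preemption


# Assumed transition rules for this sketch only.
ALLOWED = {
    BlockStatus.PENDING: {BlockStatus.ACTIVE, BlockStatus.EVICTED},
    BlockStatus.ACTIVE: {BlockStatus.COMMITTED, BlockStatus.EVICTED},
    BlockStatus.COMMITTED: {BlockStatus.EVICTED},
    BlockStatus.EVICTED: set(),
}


@dataclass
class Block:
    block_id: int
    tokens: List[int]
    status: BlockStatus = BlockStatus.PENDING

    def transition(self, new_status: BlockStatus) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.name} -> {new_status.name} is not allowed")
        self.status = new_status

    def rewrite(self, position: int, token_id: int) -> None:
        # Only blocks in the running set may have tokens changed.
        if self.status is not BlockStatus.ACTIVE:
            raise ValueError("only active blocks can be rewritten")
        self.tokens[position] = token_id

    def commit(self) -> None:
        # A block is commit-ready once no masked position remains.
        if MASK_ID in self.tokens:
            raise ValueError("cannot commit a block that still has masked tokens")
        self.transition(BlockStatus.COMMITTED)
```

Keeping the transition table separate from the methods means a new algorithm can, under these assumptions, relax or tighten the rules without touching the code that mutates tokens.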

The engine is also friendly to code-agent-assisted research. A coding agent such as Codex or Claude Code can usually produce a bounded patch when it is given a reference strategy to copy from and the file map below. The key point is that algorithm semantics live in registered strategy components, while the systems machinery stays reusable.

Block Buffer Backend

Diffulex treats Block Buffer execution as a three-level backend:

| Level | What it owns | Main files |
| --- | --- | --- |
| Logical block state | Prompt blocks, noisy blocks, dummy slots, commit readiness, block progress, and request-local generation state. | diffulex/engine/dllm_block.py, diffulex/engine/request.py, diffulex/strategy/<name>/engine/request.py |
| Running-set policy | Which blocks are active, when new blocks enter the buffer, when completed blocks are committed, and when requests prefill, decode, preempt, or finish. | diffulex/engine/scheduler.py, diffulex/engine/kv_cache_manager.py, diffulex/strategy/<name>/engine/{scheduler,kv_cache_manager}.py |
| Paged KV and kernel view | Prefix reuse, page tables, cache append rules, attention metadata, and dLLM-oriented Triton kernels. | diffulex/attention/, diffulex/mixin/multi_block/, diffulex_kernel/python/, diffulex/strategy/<name>/attention/metadata.py, diffulex/strategy/<name>/engine/model_runner.py |

This layering is why SingleBD, MultiBD, TokenMerge/DMax, edit refinement, DiffusionGemma-style uniform diffusion, and future DualCache-style designs can share one backend. New algorithms usually change the running-set and sampling semantics, not the whole engine.
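
To make the running-set level concrete, here is a toy policy step written under explicit assumptions: blocks are plain lists of token ids, MASK_ID marks an undecided position, blocks are committed strictly in order (a block-causal assumption), and the buffer holds at most max_active blocks. The function name and signature are hypothetical and do not mirror diffulex/engine/scheduler.py.

```python
from collections import deque

MASK_ID = -1  # hypothetical sentinel for a still-masked position


def step_running_set(active, pending, committed, max_active):
    """Commit finished leading blocks, then refill the buffer.

    active, pending: deques of blocks (each block is a list of token ids)
    committed: list that receives blocks in commit order
    """
    # Commit only from the front so committed blocks always form a prefix.
    while active and MASK_ID not in active[0]:
        committed.append(active.popleft())
    # Admit new blocks until the buffer is full or nothing is waiting.
    while pending and len(active) < max_active:
        active.append(pending.popleft())


active = deque([[11, 12], [MASK_ID, 13]])
pending = deque([[MASK_ID, MASK_ID], [MASK_ID, MASK_ID]])
committed = []
step_running_set(active, pending, committed, max_active=2)
```

After this call the first block has moved to committed, the partially decoded block stays at the front of the buffer, and one pending block has been admitted. A different algorithm would mostly change the two loop conditions, which is the sense in which new designs alter the running-set policy and not the whole engine.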

Implementation Map

Use an existing strategy as the reference before adding a new one:

| Reference | Use it when |
| --- | --- |
| diffulex/strategy/multi_bd | The algorithm is a standard block-causal running-set decoder. |
| diffulex/strategy/d2f | The algorithm is closest to native SingleBD or prefix-full block decoding. |
| diffulex/strategy/dmax and diffulex/strategy/templates/token_merge | The algorithm changes token acceptance, merge metadata, or edit-style sampling. |
| diffulex/strategy/diffusion_gemma | The algorithm is model-specific and has non-standard canvas, sampler, or block semantics. |
| diffulex/strategy/templates/dual_cache | The algorithm changes cache ownership or needs a DualCache-style design. |

A strategy-level algorithm typically covers these files:

| File | Required when | What to implement |
| --- | --- | --- |
| diffulex/strategy/<name>/__init__.py | Always. | Export registered request, scheduler, cache manager, and model runner classes. Strategy packages are auto-imported from diffulex/strategy/__init__.py. |
| diffulex/strategy/<name>/config.py | The strategy needs normalized defaults. | Register a StrategyConfigRegistry normalizer and force only the invariants required by the algorithm. |
| diffulex/strategy/<name>/engine/request.py | Always. | Register AutoReq; add per-request state such as block progress, edit windows, token-merge metadata, or custom finish conditions. |
| diffulex/strategy/<name>/engine/scheduler.py | Always. | Register AutoScheduler; define add-block, commit, prefill/decode, preemption, and finish policy. |
| diffulex/strategy/<name>/engine/kv_cache_manager.py | Always for cache-aware strategies. | Register AutoKVCacheManager; define page allocation, append, prefix reuse, and cache-commit behavior. |
| diffulex/strategy/<name>/attention/metadata.py | The attention mask or page interpretation differs from an existing strategy. | Define metadata consumed by diffulex.attention.Attention; start from MultiBlockAttnMetaDataMixin when possible. |
| diffulex/strategy/<name>/engine/model_runner.py | Always. | Register AutoModelRunner; prepare tensors, set attention metadata, call the model, invoke the sampler, and connect CUDA graph/full-static runner paths. |
| diffulex/sampler/<model_or_strategy>.py | Sampling semantics change. | Implement mask-to-token updates, edit updates, token merge, confidence thresholds, or model-specific output postprocessing. |
| diffulex/mixin/<feature>/ | The behavior should be reused across strategies. | Put shared scheduler, sampler, request, or runner helpers here instead of duplicating strategy code. |
| diffulex_kernel/python/*.py | The algorithm needs a new fused operation. | Add a Triton or Python reference path for attention, KV cache, sampler, top-k, layernorm, or other dLLM-specific kernels. |
| diffulex/config.py and CLI/benchmark config files | A new public option is genuinely needed. | Add validation and user-facing flags only for options that users should tune. Keep compatibility-only fields out of docs and help text. |
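
Several rows above ask a strategy to register its classes. The exact registration API of AutoReq, AutoScheduler, AutoKVCacheManager, and AutoModelRunner is not shown in this document, so the sketch below does not reproduce it. It is a generic name-to-class registry, with hypothetical names (Registry, toy_schedulers, MyScheduler), that illustrates the general pattern of looking up strategy components by name.

```python
class Registry:
    """Generic name -> class registry (illustrative, not the Diffulex API)."""

    def __init__(self, kind):
        self.kind = kind
        self._classes = {}

    def register(self, name):
        def decorator(cls):
            # Refuse silent overrides so two strategies cannot share a name.
            if name in self._classes:
                raise KeyError(f"{self.kind} '{name}' is already registered")
            self._classes[name] = cls
            return cls
        return decorator

    def create(self, name, *args, **kwargs):
        if name not in self._classes:
            raise KeyError(f"unknown {self.kind} '{name}'")
        return self._classes[name](*args, **kwargs)


toy_schedulers = Registry("scheduler")


@toy_schedulers.register("my_strategy")
class MyScheduler:
    def __init__(self, max_active_blocks=4):
        self.max_active_blocks = max_active_blocks


scheduler = toy_schedulers.create("my_strategy", max_active_blocks=2)
```

With a pattern like this, the engine core only needs the strategy name from the config; the class that implements the policy is resolved at runtime. Check an existing strategy package for the real decorator names and arguments before copying this shape.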

As a planning rule, a strategy that reuses Block Buffer, grouped attention, and an existing sampler is often a small patch across 6 to 8 files. A strategy with new sampler semantics usually adds one or two sampler/mixin files. A strategy with new kernel semantics needs extra reference code, Triton code, and profiling work. The goal is to spend code on the new algorithm, not on rebuilding serving infrastructure.
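
As an example of what "new sampler semantics" can mean, the sketch below implements a generic confidence-threshold unmasking step in pure Python. It is not the Diffulex sampler: the function names, the MASK_ID sentinel, and the fallback rule (always fill the single most confident position when nothing clears the threshold, so decoding makes progress) are assumptions made for illustration.

```python
import math

MASK_ID = -1  # hypothetical sentinel for a still-masked position


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unmask_step(tokens, logits, threshold):
    """Fill masked positions whose top probability is >= threshold.

    tokens: list of token ids, MASK_ID where undecided
    logits: one list of vocabulary logits per position
    Returns a new token list; the input is left unchanged.
    """
    new_tokens = list(tokens)
    best = None  # (confidence, position, token_id) over masked positions
    accepted = 0
    for pos, tok in enumerate(tokens):
        if tok != MASK_ID:
            continue  # already decided positions are kept as they are
        probs = softmax(logits[pos])
        token_id = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[token_id]
        if conf >= threshold:
            new_tokens[pos] = token_id
            accepted += 1
        if best is None or conf > best[0]:
            best = (conf, pos, token_id)
    # Fallback: guarantee at least one update per step.
    if accepted == 0 and best is not None:
        new_tokens[best[1]] = best[2]
    return new_tokens


logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
step1 = unmask_step([7, MASK_ID, MASK_ID], logits, threshold=0.9)
step2 = unmask_step(step1, logits, threshold=0.9)
```

In the first step only the confident middle position is filled; in the second step nothing clears the threshold, so the fallback fills the remaining position. An edit-style or token-merge sampler would replace the acceptance rule while keeping the same input and output shape.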

Code-Agent Workflow

For agent-assisted implementation, give the agent a narrow instruction:

  1. Name the closest reference strategy.

  2. State which algorithm semantics differ.

  3. Ask it to create or modify only the files in the implementation map.

  4. Require triton_grouped attention unless the task is explicitly debugging a fallback path.

  5. Ask for a tiny generation smoke run, then a limited benchmark run, then profiling only after correctness is stable.

This workflow is effective because Diffulex keeps strategy registration, attention metadata, sampler logic, and dLLM-oriented Triton kernels in explicit locations, so researchers can explore high-performance algorithm variants with much less systems boilerplate.
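
The five steps can be turned into a reusable brief for the agent. The helper below is a small sketch; build_agent_brief and its parameters are hypothetical names, and the example file list is only a placeholder drawn from the implementation map.

```python
def build_agent_brief(reference, differences, allowed_files,
                      attention="triton_grouped"):
    """Assemble a narrow instruction for a coding agent as plain text."""
    lines = [
        f"Reference strategy: {reference}",
        "Algorithm semantics that differ:",
        *[f"  - {item}" for item in differences],
        "Create or modify only these files:",
        *[f"  - {path}" for path in allowed_files],
        f"Attention implementation: {attention}",
        "Validation order: tiny generation smoke run, then a limited "
        "benchmark run, then profiling once correctness is stable.",
    ]
    return "\n".join(lines)


brief = build_agent_brief(
    reference="diffulex/strategy/multi_bd",
    differences=["commit a block only after an extra refinement pass"],
    allowed_files=[
        "diffulex/strategy/<name>/engine/scheduler.py",
        "diffulex/strategy/<name>/engine/request.py",
    ],
)
print(brief)
```

Keeping the brief as structured fields makes it easy to reuse across experiments and to review what the agent was allowed to touch.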