# Research Engine

Diffulex `main` is the active engine branch for researchers who want to turn a diffusion language model idea into a runnable, profiled, and serveable system. Use the Diffulex `mbd-lms` branch when reproducing the reported MBD-LMs experiments; use `main` when building new decoding algorithms, cache behavior, model support, kernels, or serving features.

## Why This Backend Fits dLLM Research

Most block-level dLLM inference algorithms can be expressed as a small set of runtime decisions:

- what block state each request owns;
- which blocks are active in the current running set;
- when a block can be appended, committed, rewritten, or evicted;
- how the active block view maps to prefix KV cache and paged attention;
- how logits are converted back into masked-token updates.

Diffulex keeps these concerns separated. This makes it practical to implement new algorithms on top of the same serving backend instead of rebuilding scheduler, cache, attention, CUDA graph, benchmark, and HTTP serving paths for each idea.

The engine is also friendly to code-agent assisted research. A coding agent such as Codex or Claude Code can usually make a bounded patch when it is given the strategy reference to copy from and the file map below. The important point is that the algorithm semantics live in registered strategy components, while the systems machinery remains reusable.

## Block Buffer Backend

Diffulex treats Block Buffer execution as a three-level backend:

| Level | What it owns | Main files |
| --- | --- | --- |
| Logical block state | Prompt blocks, noisy blocks, dummy slots, commit readiness, block progress, and request-local generation state. | `diffulex/engine/dllm_block.py`, `diffulex/engine/request.py`, `diffulex/strategy//engine/request.py` |
| Running-set policy | Which blocks are active, when new blocks enter the buffer, when completed blocks are committed, and when requests prefill, decode, preempt, or finish. | `diffulex/engine/scheduler.py`, `diffulex/engine/kv_cache_manager.py`, `diffulex/strategy//engine/{scheduler,kv_cache_manager}.py` |
| Paged KV and kernel view | Prefix reuse, page tables, cache append rules, attention metadata, and dLLM-oriented Triton kernels. | `diffulex/attention/`, `diffulex/mixin/multi_block/`, `diffulex_kernel/python/`, `diffulex/strategy//attention/metadata.py`, `diffulex/strategy//engine/model_runner.py` |

This layering is why SingleBD, MultiBD, TokenMerge/DMax, edit refinement, DiffusionGemma-style uniform diffusion, and future DualCache-style designs can share one backend. New algorithms usually change the running-set and sampling semantics, not the whole engine.

## Implementation Map

Use an existing strategy as the reference before adding a new one:

| Reference | Use it when |
| --- | --- |
| `diffulex/strategy/multi_bd` | The algorithm is a standard block-causal running-set decoder. |
| `diffulex/strategy/d2f` | The algorithm is closest to native SingleBD or prefix-full block decoding. |
| `diffulex/strategy/dmax` and `diffulex/strategy/templates/token_merge` | The algorithm changes token acceptance, merge metadata, or edit-style sampling. |
| `diffulex/strategy/diffusion_gemma` | The algorithm is model-specific and has non-standard canvas, sampler, or block semantics. |
| `diffulex/strategy/templates/dual_cache` | The algorithm changes cache ownership or needs a DualCache-style design. |

A strategy-level algorithm typically covers these files:

| File | Required when | What to implement |
| --- | --- | --- |
| `diffulex/strategy//__init__.py` | Always. | Export registered request, scheduler, cache manager, and model runner classes. Strategy packages are auto-imported from `diffulex/strategy/__init__.py`. |
| `diffulex/strategy//config.py` | The strategy needs normalized defaults. | Register a `StrategyConfigRegistry` normalizer and force only the invariants required by the algorithm. |
| `diffulex/strategy//engine/request.py` | Always. | Register `AutoReq`; add per-request state such as block progress, edit windows, token-merge metadata, or custom finish conditions. |
| `diffulex/strategy//engine/scheduler.py` | Always. | Register `AutoScheduler`; define add-block, commit, prefill/decode, preemption, and finish policy. |
| `diffulex/strategy//engine/kv_cache_manager.py` | Always for cache-aware strategies. | Register `AutoKVCacheManager`; define page allocation, append, prefix reuse, and cache-commit behavior. |
| `diffulex/strategy//attention/metadata.py` | The attention mask or page interpretation differs from an existing strategy. | Define metadata consumed by `diffulex.attention.Attention`; start from `MultiBlockAttnMetaDataMixin` when possible. |
| `diffulex/strategy//engine/model_runner.py` | Always. | Register `AutoModelRunner`; prepare tensors, set attention metadata, call the model, invoke the sampler, and connect CUDA graph/full-static runner paths. |
| `diffulex/sampler/.py` | Sampling semantics change. | Implement mask-to-token updates, edit updates, token merge, confidence thresholds, or model-specific output postprocessing. |
| `diffulex/mixin//` | The behavior should be reused across strategies. | Put shared scheduler, sampler, request, or runner helpers here instead of duplicating strategy code. |
| `diffulex_kernel/python/*.py` | The algorithm needs a new fused operation. | Add a Triton or Python reference path for attention, KV cache, sampler, top-k, layernorm, or other dLLM-specific kernels. |
| `diffulex/config.py` and CLI/benchmark config files | A new public option is genuinely needed. | Add validation and user-facing flags only for options that users should tune. Keep compatibility-only fields out of docs and help text. |

As a planning rule, a strategy that reuses Block Buffer, grouped attention, and an existing sampler is often a small patch across 6 to 8 files. A strategy with new sampler semantics usually adds one or two sampler/mixin files. A strategy with new kernel semantics needs extra reference code, Triton code, and profiling work. The goal is to spend code on the new algorithm, not on rebuilding serving infrastructure.

## Code-Agent Workflow

For agent-assisted implementation, give the agent a narrow instruction:

1. Name the closest reference strategy.
2. State which algorithm semantics differ.
3. Ask it to create or modify only the files in the implementation map.
4. Require `triton_grouped` attention unless the task is explicitly debugging a fallback path.
5. Ask for a tiny generation smoke run, then a limited benchmark run, then profiling only after correctness is stable.

This workflow works well because Diffulex keeps strategy registration, attention metadata, sampler logic, and dLLM-oriented Triton kernels in explicit locations. It lets researchers explore real high-performance algorithm variants with much less systems boilerplate.
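
To make the idea of registered strategy components with a running-set policy more concrete, here is a minimal, self-contained Python sketch. It is not the Diffulex API: `Registry`, `SCHEDULERS`, `Block`, `ToyRequest`, `ToyMultiBlockScheduler`, and `run_demo` are hypothetical names invented for illustration, and the real `AutoScheduler` registration and scheduler interfaces may look quite different. The commit rule (only the oldest active block may be committed) is an assumption modeled on block-causal decoding, and the sampler is replaced by a fixed number of unmasked tokens per step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class Registry:
    """Maps a strategy name to a class (generic stand-in, not Diffulex code)."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._classes: Dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            if name in self._classes:
                raise ValueError(f"{self.kind} '{name}' is already registered")
            self._classes[name] = cls
            return cls
        return decorator

    def create(self, name: str, *args, **kwargs):
        if name not in self._classes:
            raise KeyError(f"unknown {self.kind} strategy: {name}")
        return self._classes[name](*args, **kwargs)


# Hypothetical registry; it only plays the role that AutoScheduler plays above.
SCHEDULERS = Registry("scheduler")


@dataclass
class Block:
    block_id: int
    masked: int  # tokens in this block that are still masked


@dataclass
class ToyRequest:
    total_blocks: int
    block_size: int
    active: List[Block] = field(default_factory=list)    # running set
    committed: List[int] = field(default_factory=list)   # committed block ids
    next_block_id: int = 0

    @property
    def finished(self) -> bool:
        return len(self.committed) == self.total_blocks


@SCHEDULERS.register("toy_multi_block")
class ToyMultiBlockScheduler:
    def __init__(self, max_active_blocks: int = 2) -> None:
        self.max_active_blocks = max_active_blocks

    def step(self, req: ToyRequest, unmasked_per_block: int) -> None:
        # 1. Add-block policy: admit new blocks while the buffer has room.
        while (len(req.active) < self.max_active_blocks
               and req.next_block_id < req.total_blocks):
            req.active.append(Block(req.next_block_id, req.block_size))
            req.next_block_id += 1
        # 2. Stand-in for the sampler: unmask a fixed number of tokens per block.
        for block in req.active:
            block.masked = max(0, block.masked - unmasked_per_block)
        # 3. Commit policy (assumed): commit only from the front, in order.
        while req.active and req.active[0].masked == 0:
            req.committed.append(req.active.pop(0).block_id)


def run_demo() -> Tuple[int, List[int]]:
    scheduler = SCHEDULERS.create("toy_multi_block", max_active_blocks=2)
    req = ToyRequest(total_blocks=3, block_size=4)
    steps = 0
    while not req.finished:
        scheduler.step(req, unmasked_per_block=2)
        steps += 1
    return steps, req.committed


if __name__ == "__main__":
    print(run_demo())  # (4, [0, 1, 2])
```

In this toy setup, swapping in a different decoding policy means registering another class under a new name; the request state and the driver loop stay unchanged, which mirrors the separation between algorithm semantics and reusable machinery described above.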