# Quickstart This page gives the shortest working path for the current codebase: - install Diffulex; - run a small LLaDA2-mini benchmark; - run one in-process Python generation; - start the HTTP server; - optionally run the vLLM DiffusionGemma baseline. For reproducing the MBD-LMs experiments, use the Diffulex [`mbd-lms`](https://github.com/SJTU-DENG-Lab/Diffulex/tree/mbd-lms) branch. For engine development, open-source contributions, or exploring new decoding algorithms and turning them into runnable systems, use the [`main`](https://github.com/SJTU-DENG-Lab/Diffulex/tree/main) branch. The current main branch contains ongoing runtime and model-specific optimizations. ## Prerequisites - Diffulex is installed in a Python environment. See [Installation](installation.md). - At least one NVIDIA GPU is visible to PyTorch. - The model checkpoint exists locally. The examples below use LLaDA2-mini: ```bash export MODEL_PATH=/data/ckpts/inclusionAI/LLaDA2.0-mini ``` Replace this path with the location of the checkpoint on your machine. ## 1. Install From the repository root: ```bash uv venv --python 3.11 --seed source .venv/bin/activate uv pip install -e . uv pip install vllm==0.23.0 ``` Verify the install: ```bash python -c "from diffulex import Diffulex, SamplingParams; print('ok')" python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())" python -c "import vllm; print(vllm.__version__)" ``` ## 2. Run a Small Benchmark Use the maintained LLaDA2-mini GSM8K runner. Start with a small limit: ```bash CUDA_VISIBLE_DEVICES=0 \ MODEL_PATH="$MODEL_PATH" \ DATASET_LIMIT=10 \ script/run_llada2_mini_gsm8k.sh ``` The script wraps: ```bash python -m diffulex_bench.main \ --config diffulex_bench/configs/llada2_mini_gsm8k.yml \ --model-path "$MODEL_PATH" \ --dataset-limit 10 ``` Results are written under `benchmark_results/llada2_mini_gsm8k/` by default. Remove `DATASET_LIMIT` only after the limited run loads the model, generates answers, and writes results correctly. ## 3. Run Python Inference For a direct in-process call: ```python from diffulex import Diffulex, SamplingParams model_path = "/data/ckpts/inclusionAI/LLaDA2.0-mini" llm = Diffulex( model=model_path, model_name="llada2_mini", decoding_strategy="multi_bd", sampling_mode="naive", mask_token_id=156895, tensor_parallel_size=1, data_parallel_size=1, gpu_memory_utilization=0.45, max_model_len=4096, max_num_batched_tokens=4096, max_num_reqs=1, block_size=32, buffer_size=1, page_size=32, attn_impl="triton_grouped", enable_prefix_caching=True, enable_full_static_runner=True, enable_vllm_layers=True, ) outputs = llm.generate( ["Solve: Natalia sold clips to 48 friends in April, and half as many in May. How many clips did she sell in May?"], SamplingParams(temperature=0.0, max_tokens=256, max_nfe=1024), ) for item in outputs.trajectories: print(item.text) llm.exit() ``` Use `attn_impl="naive"` and `enforce_eager=True` only when debugging correctness. Use the optimized settings when measuring throughput. ## 4. Start the HTTP Server The server uses the same engine configuration, exposed as CLI flags. A minimal single-GPU LLaDA2-mini command is: ```bash CUDA_VISIBLE_DEVICES=0 python -m diffulex.server \ --model "$MODEL_PATH" \ --model-name llada2_mini \ --decoding-strategy multi_bd \ --sampling-mode naive \ --tensor-parallel-size 1 \ --data-parallel-size 1 \ --max-model-len 4096 \ --max-num-batched-tokens 4096 \ --max-num-reqs 1 \ --block-size 32 \ --buffer-size 1 \ --page-size 32 \ --gpu-memory-utilization 0.45 \ --attn-impl triton_grouped \ --host 127.0.0.1 \ --port 8000 ``` Send a request: ```bash curl -s http://127.0.0.1:8000/generate \ -H 'Content-Type: application/json' \ -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"max_nfe":256}' \ | python -m json.tool ``` For local server demo visualization: ```bash streamlit run examples/streamlit_block_append_chat.py -- --base-url http://127.0.0.1:8000 ``` ## 5. Run DiffusionGemma or vLLM Baselines Diffulex has a native DiffusionGemma benchmark config: ```bash CUDA_VISIBLE_DEVICES=0 python -m diffulex_bench.main \ --config diffulex_bench/configs/diffusion_gemma_gsm8k.yml \ --model-path /data/ckpts/google/diffusiongemma-26B-A4B-it \ --dataset-limit 10 ``` The repository also keeps a vLLM DiffusionGemma baseline runner. This is for comparison, not for starting Diffulex: ```bash CUDA_VISIBLE_DEVICES=0 \ CONFIG_PATH=examples/engine_lm_eval/configs/vllm_diffusion_gemma_gsm8k_smoke.yml \ script/run_vllm_diffusion_gemma_gsm8k.sh ``` Use the `*_full.yml` config only after the smoke run succeeds. ## Decoding Strategy Cheatsheet | Strategy | Typical models | Notes | | --- | --- | --- | | `d2f` | D2F LoRA-style LLaDA, Dream, DiffuCoder paths | Full-prefix block decoding; disables prefix caching. | | `multi_bd` | LLaDA2-mini, SDAR, Fast-dLLM-v2, stable DiffuCoder/Dream reasoner paths | Multi-Block Diffusion: block-causal multi-block decoding with prefix caching. | | `dmax` | Supported LLaDA2 edit-sampling experiments | Requires `sampling_mode="edit"`. | | `diffusion_gemma` | DiffusionGemma | Native DiffusionGemma canvas/block decoder. | For more detail, read [Configuration](../user_guide/configuration.md) and [Benchmark](../user_guide/benchmark.md).