Server

Use the HTTP server when an application, UI, or integration test needs to send requests to Diffulex over HTTP instead of calling the Python API in-process.

The server starts a FastAPI frontend and a synchronous backend worker that owns the Diffulex engine. ZMQ addresses are generated automatically for local runs.

Start a LLaDA2-Mini Server

export MODEL_PATH=/path/to/LLaDA2.0-mini

CUDA_VISIBLE_DEVICES=0 python -m diffulex.server \
  --model "$MODEL_PATH" \
  --model-name llada2_mini \
  --decoding-strategy multi_bd \
  --sampling-mode naive \
  --tensor-parallel-size 1 \
  --data-parallel-size 1 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-reqs 1 \
  --block-size 32 \
  --buffer-size 1 \
  --page-size 32 \
  --gpu-memory-utilization 0.45 \
  --attn-impl triton_grouped \
  --host 127.0.0.1 \
  --port 8000

Use --attn-impl triton_grouped for normal serving and demos. Other attention backends are compatibility/debug fallbacks and are not recommended for performance reporting.

Generate Endpoint

Non-streaming request:

curl -s http://127.0.0.1:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"max_nfe":256}' \
  | python -m json.tool

The response contains:

| Field | Meaning |
| --- | --- |
| text | Generated completion text. |
| token_ids | Generated token IDs. |
| nfe | Number of forward evaluations used by the request. |
| finish_reason | Stop reason when available. |
| full_text | Prompt plus generated text when available. |
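
The same non-streaming call can be made from Python. The following is a minimal sketch that uses only the standard library and the request fields from the curl example above. The helper names (build_generate_request, generate, summarize) are illustrative and not part of Diffulex, and optional response fields are read with .get() because finish_reason and full_text are only present when available.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, temperature=0.0, max_tokens=64, max_nfe=256):
    """Build the POST request for /generate (same fields as the curl example)."""
    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "max_nfe": max_nfe,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, timeout_s=120.0, **sampling):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_generate_request(base_url, prompt, **sampling)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


def summarize(result):
    """Pick out the documented fields; missing optional ones become None."""
    return {
        "text": result.get("text"),
        "num_tokens": len(result.get("token_ids") or []),
        "nfe": result.get("nfe"),
        "finish_reason": result.get("finish_reason"),
    }


# Example call against a running server:
#   summarize(generate("http://127.0.0.1:8000", "Solve: 12 + 30."))
```

Building the request separately from sending it keeps the payload easy to inspect in an integration test without a live server.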

Streaming request:

curl -N http://127.0.0.1:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"stream":true,"stream_mode":"denoise"}'

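With stream_mode="denoise", each event is a snapshot of the editable buffer, so a later snapshot can revise text from an earlier one. With stream_mode="block_append", the stream emits only stable text that is appended to what came before.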
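
The two stream modes call for different client-side handling, which can be sketched as follows. This is an illustration under stated assumptions, not the server's documented wire format: it assumes events have already been decoded into dicts, that each dict carries its text under a hypothetical "text" key, and that a block_append event holds only the newly appended text. The function name fold_stream is also illustrative.

```python
def fold_stream(events, stream_mode):
    """Return the text a client would display after consuming all events.

    Assumptions (not confirmed by the server docs): each event is a dict
    with a "text" key, and block_append events contain only new text.
    """
    if stream_mode not in ("denoise", "block_append"):
        raise ValueError(f"unknown stream_mode: {stream_mode!r}")

    shown = ""
    for event in events:
        chunk = event.get("text", "")
        if stream_mode == "denoise":
            # A snapshot replaces whatever was displayed before.
            shown = chunk
        else:
            # Stable text is concatenated and never rewritten.
            shown += chunk
    return shown
```

A UI that renders denoise events has to redraw the whole buffer on every event, while a block_append consumer can write to an append-only log or terminal.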

Chat Endpoint

The server also exposes an OpenAI-style chat path:

curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "Solve: 12 + 30."}],
    "temperature": 0.0,
    "max_tokens": 64,
    "stream": true,
    "stream_mode": "block_append"
  }'
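
A Python client for the chat path might look like the sketch below. It assumes the stream follows the usual OpenAI conventions: server-sent event lines of the form "data: {json}", a final "data: [DONE]" line, and text under choices[0].delta.content. The source only says the path is OpenAI-style, so verify these details against real server output. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, user_message, max_tokens=64, stream_mode="block_append"):
    """Build the POST request for /v1/chat/completions (fields from the curl example)."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.0,
        "max_tokens": max_tokens,
        "stream": True,
        "stream_mode": stream_mode,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_payloads(lines):
    """Yield decoded JSON objects from 'data: ...' lines (assumed SSE framing)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed OpenAI-style end-of-stream marker
        yield json.loads(data)


def delta_text(chunk):
    """Extract streamed text, assuming the OpenAI chunk layout."""
    choices = chunk.get("choices") or [{}]
    return (choices[0].get("delta") or {}).get("content") or ""


def stream_chat(base_url, user_message):
    """Print streamed text as it arrives from a running server."""
    with urllib.request.urlopen(build_chat_request(base_url, user_message)) as response:
        for chunk in iter_sse_payloads(response):
            print(delta_text(chunk), end="", flush=True)
```

Calling stream_chat("http://127.0.0.1:8000", "Solve: 12 + 30.") would print text incrementally if the assumptions above hold.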

Demo Visualization

The repository includes a local Streamlit demo for visualizing server responses. Start it after the HTTP server is ready:

streamlit run examples/streamlit_block_append_chat.py -- --base-url http://127.0.0.1:8000

The demo talks to the server API and is intended for local validation and video capture, not production serving or throughput measurement.
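
Because the demo should start only after the HTTP server is ready, a launch script can poll the port first. The sketch below checks only that the TCP port accepts connections; whether that also means the model has finished loading is an assumption to confirm for your setup. The function name wait_for_port is illustrative.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=120.0, interval_s=0.5):
    """Return True once a TCP connection succeeds, or False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval_s)
    return False


# Example: block until the server from this page is reachable.
#   if not wait_for_port("127.0.0.1", 8000):
#       raise SystemExit("server did not come up in time")
```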

Important Server Flags

| Flag | Notes |
| --- | --- |
| --model | Required local checkpoint path. |
| --model-name | Registered Diffulex model name such as llada2_mini, sdar, or diffusion_gemma. |
| --decoding-strategy | Use a strategy compatible with the model. |
| --sampling-mode | Usually naive; use edit only for compatible LLaDA2 edit/DMax paths. |
| --tensor-parallel-size, --data-parallel-size | Must fit the visible CUDA devices. |
| --device-ids | Logical CUDA IDs after CUDA_VISIBLE_DEVICES is applied. |
| --max-model-len, --max-num-batched-tokens, --max-num-reqs | Capacity controls. Start small. |
| --block-size, --buffer-size, --page-size | Strategy/model layout controls. DiffusionGemma uses 256/1/256. |
| --disable-full-static-runner, --disable-torch-compile, --enforce-eager | Debugging toggles for optimized paths. |
| --use-lora, --lora-path, --pre-merge-lora | LoRA loading and merge controls. |

Run python -m diffulex.server --help for the complete current option list.
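
When the server is launched from a test harness, the flags above can be assembled programmatically. The sketch below makes two assumptions that are not confirmed by this page: that the toggle flags (for example --enforce-eager) take no value, and that "fit the visible CUDA devices" means tensor-parallel size times data-parallel size must not exceed the number of entries in CUDA_VISIBLE_DEVICES. The helper names are illustrative; check python -m diffulex.server --help before relying on them.

```python
import os
import sys


def build_server_argv(model_path, **flags):
    """Turn keyword arguments into a diffulex.server command line.

    port=8000 becomes "--port 8000"; True becomes a bare flag (assumed
    toggle behavior); False and None are omitted.
    """
    argv = [sys.executable, "-m", "diffulex.server", "--model", model_path]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv += [flag, str(value)]
    return argv


def visible_device_count(env=None):
    """Count entries in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    raw = (os.environ if env is None else env).get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unknown without querying the driver
    return len([d for d in raw.split(",") if d.strip()])


def fits_visible_devices(tensor_parallel, data_parallel, env=None):
    """Assumed rule: TP x DP must not exceed the visible device count."""
    count = visible_device_count(env)
    return count is None or tensor_parallel * data_parallel <= count


# Example: pass the result to subprocess.Popen to start the server.
#   argv = build_server_argv("/path/to/LLaDA2.0-mini", model_name="llada2_mini",
#                            tensor_parallel_size=1, data_parallel_size=1, port=8000)
```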