Server¶
Use the HTTP server when an application, UI, or integration test needs to send requests to Diffulex over HTTP instead of calling the Python API in-process.
The server starts a FastAPI frontend and a synchronous backend worker that owns the Diffulex engine. ZMQ addresses are generated automatically for local runs.
Start a LLaDA2-Mini Server¶
export MODEL_PATH=/path/to/LLaDA2.0-mini
CUDA_VISIBLE_DEVICES=0 python -m diffulex.server \
--model "$MODEL_PATH" \
--model-name llada2_mini \
--decoding-strategy multi_bd \
--sampling-mode naive \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 4096 \
--max-num-batched-tokens 4096 \
--max-num-reqs 1 \
--block-size 32 \
--buffer-size 1 \
--page-size 32 \
--gpu-memory-utilization 0.45 \
--attn-impl triton_grouped \
--host 127.0.0.1 \
--port 8000
Use --attn-impl triton_grouped for normal serving and demos. Other attention
backends are compatibility/debug fallbacks and are not recommended for
performance reporting.
Generate Endpoint¶
Non-streaming request:
curl -s http://127.0.0.1:8000/generate \
-H 'Content-Type: application/json' \
-d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"max_nfe":256}' \
| python -m json.tool
The response contains:
Field |
Meaning |
|---|---|
|
Generated completion text. |
|
Generated token IDs. |
|
Number of forward evaluations used by the request. |
|
Stop reason when available. |
|
Prompt plus generated text when available. |
Streaming request:
curl -N http://127.0.0.1:8000/generate \
-H 'Content-Type: application/json' \
-d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"stream":true,"stream_mode":"denoise"}'
stream_mode="denoise" emits editable buffer snapshots.
stream_mode="block_append" emits stable appended text.
Chat Endpoint¶
The server also exposes an OpenAI-style chat path:
curl -N http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "user", "content": "Solve: 12 + 30."}],
"temperature": 0.0,
"max_tokens": 64,
"stream": true,
"stream_mode": "block_append"
}'
Demo Visualization¶
The repository includes a local Streamlit demo for visualizing server responses. Start it after the HTTP server is ready:
streamlit run examples/streamlit_block_append_chat.py -- --base-url http://127.0.0.1:8000
The demo talks to the server API and is intended for local validation and video capture, not production serving or throughput measurement.
Important Server Flags¶
Flag |
Notes |
|---|---|
|
Required local checkpoint path. |
|
Registered Diffulex model name such as |
|
Use a strategy compatible with the model. |
|
Usually |
|
Must fit the visible CUDA devices. |
|
Logical CUDA IDs after |
|
Capacity controls. Start small. |
|
Strategy/model layout controls. DiffusionGemma uses |
|
Debugging toggles for optimized paths. |
|
LoRA loading and merge controls. |
Run python -m diffulex.server --help for the complete current option
list.