# Server Use the HTTP server when an application, UI, or integration test needs to send requests to Diffulex over HTTP instead of calling the Python API in-process. The server starts a FastAPI frontend and a synchronous backend worker that owns the Diffulex engine. ZMQ addresses are generated automatically for local runs. ## Start a LLaDA2-Mini Server ```bash export MODEL_PATH=/path/to/LLaDA2.0-mini CUDA_VISIBLE_DEVICES=0 python -m diffulex.server \ --model "$MODEL_PATH" \ --model-name llada2_mini \ --decoding-strategy multi_bd \ --sampling-mode naive \ --tensor-parallel-size 1 \ --data-parallel-size 1 \ --max-model-len 4096 \ --max-num-batched-tokens 4096 \ --max-num-reqs 1 \ --block-size 32 \ --buffer-size 1 \ --page-size 32 \ --gpu-memory-utilization 0.45 \ --attn-impl triton_grouped \ --host 127.0.0.1 \ --port 8000 ``` Use `--attn-impl triton_grouped` for normal serving and demos. Other attention backends are compatibility/debug fallbacks and are not recommended for performance reporting. ## Generate Endpoint Non-streaming request: ```bash curl -s http://127.0.0.1:8000/generate \ -H 'Content-Type: application/json' \ -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"max_nfe":256}' \ | python -m json.tool ``` The response contains: | Field | Meaning | | --- | --- | | `text` | Generated completion text. | | `token_ids` | Generated token IDs. | | `nfe` | Number of forward evaluations used by the request. | | `finish_reason` | Stop reason when available. | | `full_text` | Prompt plus generated text when available. | Streaming request: ```bash curl -N http://127.0.0.1:8000/generate \ -H 'Content-Type: application/json' \ -d '{"prompt":"Solve: 12 + 30.","temperature":0.0,"max_tokens":64,"stream":true,"stream_mode":"denoise"}' ``` `stream_mode="denoise"` emits editable buffer snapshots. `stream_mode="block_append"` emits stable appended text. ## Chat Endpoint The server also exposes an OpenAI-style chat path: ```bash curl -N http://127.0.0.1:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{ "messages": [{"role": "user", "content": "Solve: 12 + 30."}], "temperature": 0.0, "max_tokens": 64, "stream": true, "stream_mode": "block_append" }' ``` ## Demo Visualization The repository includes a local Streamlit demo for visualizing server responses. Start it after the HTTP server is ready: ```bash streamlit run examples/streamlit_block_append_chat.py -- --base-url http://127.0.0.1:8000 ``` The demo talks to the server API and is intended for local validation and video capture, not production serving or throughput measurement. ## Important Server Flags | Flag | Notes | | --- | --- | | `--model` | Required local checkpoint path. | | `--model-name` | Registered Diffulex model name such as `llada2_mini`, `sdar`, or `diffusion_gemma`. | | `--decoding-strategy` | Use a strategy compatible with the model. | | `--sampling-mode` | Usually `naive`; use `edit` only for compatible LLaDA2 edit/DMax paths. | | `--tensor-parallel-size`, `--data-parallel-size` | Must fit the visible CUDA devices. | | `--device-ids` | Logical CUDA IDs after `CUDA_VISIBLE_DEVICES` is applied. | | `--max-model-len`, `--max-num-batched-tokens`, `--max-num-reqs` | Capacity controls. Start small. | | `--block-size`, `--buffer-size`, `--page-size` | Strategy/model layout controls. DiffusionGemma uses `256/1/256`. | | `--disable-full-static-runner`, `--disable-torch-compile`, `--enforce-eager` | Debugging toggles for optimized paths. | | `--use-lora`, `--lora-path`, `--pre-merge-lora` | LoRA loading and merge controls. | Run `python -m diffulex.server --help` for the complete current option list.