Server
Use the HTTP server entry point when you want an interactive service instead of a benchmark run.
Generic form:
python -m diffulex.server.launch \
--model /path/to/model \
--model-name <model_name> \
--decoding-strategy <strategy> \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 2048 \
--max-num-batched-tokens 4096 \
--max-num-reqs 128 \
--gpu-memory-utilization 0.9
The server process accepts the same core engine arguments as the benchmark path, plus HTTP-specific flags:
--host
--port
--log-level
--device-ids
--zmq-command-addr
--zmq-event-addr
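As a rough illustration of how the HTTP flags combine with the core engine arguments, the sketch below assembles a launch command and prints it instead of running it. The `HOST`, `PORT`, and `LAUNCH_CMD` variable names are hypothetical, and the host and port values simply mirror the SDAR example on this page. The other HTTP-specific flags are left out because their accepted values are not documented here.

```shell
#!/bin/sh
# Dry-run sketch: build the launch command and print it for inspection.
# HOST and PORT are illustrative values, not documented defaults.
HOST="0.0.0.0"
PORT="8000"

# Store the command as positional parameters so each argument stays intact.
# The model path, model name, and strategy are placeholders to replace.
set -- python -m diffulex.server.launch \
  --model /path/to/model \
  --model-name "<model_name>" \
  --decoding-strategy "<strategy>" \
  --host "$HOST" \
  --port "$PORT"

# Join the arguments with single spaces for display.
LAUNCH_CMD="$*"
echo "$LAUNCH_CMD"

# To start the server for real, fill in the placeholders and run: "$@"
```

Printing the command first makes it easy to check the bind address and port before a long model load begins.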
Supported models
Fast-dLLM-v2
python -m diffulex.server.launch \
--model /YOUR-CKPT-PATH/Efficient-Large-Model/Fast_dLLM_v2_7B \
--model-name fast_dllm_v2 \
--decoding-strategy multi_bd \
--sampling-mode naive \
--tensor-parallel-size 2 \
--data-parallel-size 1 \
--max-model-len 1024 \
--max-num-batched-tokens 1024 \
--max-num-reqs 24 \
--gpu-memory-utilization 0.4 \
--block-size 32 \
--buffer-size 1 \
--accept-threshold 0.95 \
--semi-complete-threshold 0.9 \
--add-block-threshold 0.1 \
--enforce-eager
D2F-LLaDA
python -m diffulex.server.launch \
--model /YOUR-CKPT-PATH/GSAI-ML/LLaDA-8B-Instruct \
--model-name llada \
--decoding-strategy d2f \
--tensor-parallel-size 2 \
--data-parallel-size 1 \
--use-lora \
--lora-path /YOUR-CKPT-PATH/SJTU-DENG-Lab/D2F_LLaDA_Instruct_8B_Lora \
--pre-merge-lora \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--max-num-reqs 32 \
--accept-threshold 0.95 \
--semi-complete-threshold 0.9 \
--add-block-threshold 0.1 \
--enforce-eager
SDAR
python -m diffulex.server.launch \
--model /YOUR-CKPT-PATH/JetLM/SDAR-1.7B-Chat-b32 \
--host 0.0.0.0 \
--port 8000 \
--model-name sdar \
--decoding-strategy multi_bd \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--device-ids 1 \
--block-size 32 \
--buffer-size 4 \
--page-size 32 \
--max-num-batched-tokens 4096 \
--max-num-reqs 128 \
--max-model-len 2048 \
--gpu-memory-utilization 0.5 \
--kv-cache-layout unified \
--add-block-threshold 0.1 \
--semi-complete-threshold 0.9 \
--accept-threshold 0.95