diffulex.server
diffulex.server launches an HTTP service backed by a Diffulex engine process.
Use it when an application, UI, or test client needs request/response access
instead of in-process Python generation.
The entry point is:
python -m diffulex.server --help
Minimal Command
python -m diffulex.server \
    --model /path/to/LLaDA2.0-mini \
    --model-name llada2_mini \
    --decoding-strategy multi_bd \
    --sampling-mode naive \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096 \
    --max-num-reqs 1 \
    --block-size 32 \
    --buffer-size 1 \
    --page-size 32 \
    --host 0.0.0.0 \
    --port 8000
The server starts a frontend process and a synchronous backend worker. The frontend exposes HTTP routes; the backend owns the Diffulex engine and model execution. ZMQ addresses are generated automatically unless explicitly provided.
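For test clients that need to start the server themselves, the launch can be sketched in Python as follows. This is a minimal sketch, not part of Diffulex: `build_server_command` and `wait_for_port` are hypothetical helper names, the flags are copied from the minimal command above, and the readiness check only confirms that something accepts TCP connections on the bind port. It does not prove that the backend has finished loading the model.

```python
import socket
import sys
import time


def build_server_command(model_path, host="127.0.0.1", port=8000):
    """Return an argv list mirroring the minimal command (hypothetical helper)."""
    return [
        sys.executable, "-m", "diffulex.server",
        "--model", model_path,
        "--model-name", "llada2_mini",
        "--decoding-strategy", "multi_bd",
        "--sampling-mode", "naive",
        "--max-model-len", "4096",
        "--max-num-batched-tokens", "4096",
        "--max-num-reqs", "1",
        "--block-size", "32",
        "--buffer-size", "1",
        "--page-size", "32",
        "--host", host,
        "--port", str(port),
    ]


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and retry.
            time.sleep(interval)
    return False
```

A caller would typically pass the argv list to `subprocess.Popen`, call `wait_for_port` with the same host and port, and terminate the process if the wait returns `False`.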
Network Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Use an interface such as | Sets the HTTP bind host. |
| | Use an available TCP port. The default is | Sets the HTTP bind port. |
| | Use a uvicorn log level such as | Controls server log verbosity. |
| | Leave empty for automatic setup, or provide a ZMQ address. | Sets the optional frontend-to-backend command channel. |
| | Leave empty for automatic setup, or provide a ZMQ address. | Sets the optional backend-to-frontend event channel. |
Most local runs should leave the ZMQ addresses unset.
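When a test harness needs an available TCP port for the HTTP bind port, one common approach is to let the operating system pick one. The sketch below uses only the Python standard library; `find_free_port` is a hypothetical helper name, not a Diffulex API.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on the given interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]
```

The port is released before the server binds it, so another process could claim it in between. For local test runs that race is usually acceptable; retry on a bind failure if it is not.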
Model and Strategy Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Point to the local base-model checkpoint directory. This flag is required. | Loads model weights for the engine backend. |
| | Use a registered model key. The default is | Selects model adapter and sampler defaults. |
| | Use | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| | Use | Selects sampler behavior. |
Parallelism and Device Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Use | Splits one model replica across multiple GPUs. |
| | Use | Runs independent worker groups for serving throughput. |
| | Use the host address for distributed initialization. The default is | Tells distributed workers where to rendezvous. |
| | Use an available port from | Sets the distributed rendezvous port. |
| | Use a positive timeout in seconds. The default is | Bounds how long distributed setup may wait. |
| | Provide comma-separated logical CUDA IDs, or leave empty. | Limits the server to selected PyTorch-visible devices. |
Capacity and Layout Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Use | Sets the token span of one diffusion block. |
| | Use a positive block count. The default is | Controls how many diffusion blocks can remain active for one request. |
| | Use | Sets the KV cache page size. |
| | Use a positive sequence length. The default is | Sets the requested prompt-plus-output length limit. |
| | Use a positive token budget. The default is | Limits scheduler batch size by token count. |
| | Use a positive request count. The default is | Caps active requests tracked by the server. |
| | Use a fraction such as | Guides GPU memory planning. |
| | Use | Chooses KV cache storage layout. |
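To get a feel for how the block size, page size, and sequence length interact, the arithmetic can be sketched as below. This assumes that a sequence is divided into fixed-size diffusion blocks and that each KV cache page holds exactly one page size of tokens. The actual planner inside Diffulex may account for more than this, and `layout_summary` is a hypothetical helper name.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def layout_summary(max_model_len, block_size, page_size):
    """Derive block and page counts for one maximum-length sequence."""
    return {
        "blocks_per_sequence": ceil_div(max_model_len, block_size),
        "pages_per_sequence": ceil_div(max_model_len, page_size),
    }


# Values taken from the minimal command: 4096-token limit, 32-token blocks and pages.
print(layout_summary(4096, 32, 32))
# {'blocks_per_sequence': 128, 'pages_per_sequence': 128}
```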
Runtime Toggles
| Flag | How to set it | What it does |
|---|---|---|
| | Add the flag when isolating full-static runner issues. | Disables the supported full-static CUDA Graph runner path. |
| | Add the flag while debugging compile-related behavior. | Disables |
| | Defaults to | Passes compile mode through to PyTorch. |
| | Add during correctness debugging. Leave it off for optimized throughput checks. | Forces eager execution and bypasses graph-style optimizations. |
| | Use | Selects the server attention backend. |
| | Add only when debugging cache behavior or comparing without reuse. | Turns off compatible prefix cache reuse. |
Use --enforce-eager while debugging model or scheduler behavior. Remove it
when measuring optimized throughput.
LoRA Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Add the flag when serving with an adapter. | Enables LoRA adapter loading. |
| | Point to the adapter checkpoint directory. Required with | Loads the adapter weights. |
| | Add when the adapter should be merged into the base model at load time. | Avoids per-forward adapter compute when the model path supports merging. |
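The reason merging avoids per-forward adapter compute follows from the usual LoRA formulation: the adapter adds a low-rank update `scale * B @ A` to a base weight `W`, so the update can be folded into `W` once at load time. The toy sketch below illustrates that identity with plain Python lists. It is not Diffulex's loader, and `matmul`, `merge_lora`, and the `scale` value are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_b @ lora_a)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out=2, in=2)
lora_a = [[1.0, 2.0]]              # down projection, shape (rank=1, in=2)
lora_b = [[0.5], [1.0]]            # up projection, shape (out=2, rank=1)
x = [[3.0], [4.0]]                 # one input column vector

# Merged path: a single matmul per forward pass.
merged = merge_lora(weight, lora_a, lora_b, scale=2.0)
merged_out = matmul(merged, x)

# Unmerged path: base matmul plus two adapter matmuls per forward pass.
adapter_out = matmul(lora_b, matmul(lora_a, x))
unmerged_out = [
    [base[0] + 2.0 * extra[0]]
    for base, extra in zip(matmul(weight, x), adapter_out)
]

print(merged_out)    # [[14.0], [26.0]]
print(unmerged_out)  # [[14.0], [26.0]]
```

Both paths produce the same output; the merged path simply pays the adapter cost once instead of on every forward call.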
Threshold Arguments
| Flag | How to set it | What it does |
|---|---|---|
| | Omit it to use | Controls when another decoding block can be added. |
| | Omit it to use | Controls when semi-complete block state can advance. |
| | Use a confidence value from | Accepts mask-to-token updates once confidence is high enough. |
| | Use a confidence value from | Remasks filled tokens that fall below the confidence threshold. |
| | Use a stability ratio from | Controls edit-block progress for DMax-style decoding. |
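The accept and remask rows describe two opposite transitions on a block of positions. A toy sketch of one such pass is shown below. It is an illustration of the idea only, not Diffulex's sampler: `apply_thresholds`, `MASK`, and the example values are all hypothetical.

```python
MASK = None  # placeholder for a still-masked position


def apply_thresholds(tokens, proposals, confidences, accept_threshold, remask_threshold):
    """Return a new token list after one accept/remask pass."""
    updated = []
    for token, proposal, conf in zip(tokens, proposals, confidences):
        if token is MASK:
            # Fill a masked slot only when the proposal is confident enough.
            updated.append(proposal if conf >= accept_threshold else MASK)
        else:
            # Re-open a filled slot whose confidence fell below the remask threshold.
            updated.append(MASK if conf < remask_threshold else token)
    return updated


tokens = [MASK, MASK, 17, 42]
proposals = [5, 9, 17, 42]
confidences = [0.95, 0.40, 0.80, 0.10]
print(apply_thresholds(tokens, proposals, confidences, 0.9, 0.3))
# [5, None, 17, None]
```

Raising the accept threshold fills fewer positions per pass, and raising the remask threshold re-opens more of them, so both settings trade decoding speed against confidence in the committed tokens.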
Validation Tips¶
Start with conservative limits: low values for --max-num-reqs,
--max-num-batched-tokens, and --max-model-len. Once the server starts and a
small request succeeds, raise the limits toward the target workload.
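One way to follow this advice is to raise a single limit in steps, restarting the server and sending a small request at each step. The sketch below generates such a sequence of values. Doubling is an arbitrary choice, and `limit_ladder` is a hypothetical helper, not something Diffulex provides.

```python
def limit_ladder(start, target):
    """Yield values doubling from start, ending exactly at target."""
    if start < 1 or target < start:
        raise ValueError("need 1 <= start <= target")
    value = start
    while value < target:
        yield value
        value *= 2
    yield target


print(list(limit_ladder(1, 8)))        # [1, 2, 4, 8]
print(list(limit_ladder(1024, 5000)))  # [1024, 2048, 4096, 5000]
```

The first failing step then brackets the usable range for that limit on the current hardware.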