# diffulex.server

`diffulex.server` launches an HTTP service backed by a Diffulex engine process. Use it when an application, UI, or test client needs request/response access instead of in-process Python generation.

The entry point is:

```bash
python -m diffulex.server --help
```

## Minimal Command

```bash
python -m diffulex.server \
  --model /path/to/LLaDA2.0-mini \
  --model-name llada2_mini \
  --decoding-strategy multi_bd \
  --sampling-mode naive \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-reqs 1 \
  --block-size 32 \
  --buffer-size 1 \
  --page-size 32 \
  --host 0.0.0.0 \
  --port 8000
```

The server starts a frontend process and a synchronous backend worker. The frontend exposes HTTP routes; the backend owns the Diffulex engine and model execution. ZMQ addresses are generated automatically unless explicitly provided.

## Network Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--host` | Use an interface such as `127.0.0.1` for local access or `0.0.0.0` to listen on all interfaces. The default is `0.0.0.0`. | Sets the HTTP bind host. |
| `--port` | Use an available TCP port. The default is `8000`. | Sets the HTTP bind port. |
| `--log-level` | Use a uvicorn log level such as `info`, `warning`, or `debug`. The default is `info`. | Controls server log verbosity. |
| `--zmq-command-addr` | Leave empty for automatic setup, or provide a ZMQ address. | Sets the optional frontend-to-backend command channel. |
| `--zmq-event-addr` | Leave empty for automatic setup, or provide a ZMQ address. | Sets the optional backend-to-frontend event channel. |

Most local runs should leave the ZMQ addresses unset.

## Model and Strategy Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--model` | Point to the local base-model checkpoint directory. This flag is required. | Loads model weights for the engine backend. |
| `--model-name` | Use a registered model key. The default is `dream`. | Selects model adapter and sampler defaults. |
| `--decoding-strategy` | Use `d2f`, `multi_bd`, `dmax`, or `diffusion_gemma` where supported by the selected model/config. | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| `--sampling-mode` | Use `naive` for the standard sampler or `edit` for compatible edit-sampling models. The default is `naive`. | Selects sampler behavior. |

## Parallelism and Device Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--tensor-parallel-size` | Use `1` to `8` ranks. The server default is `1`. | Splits one model replica across multiple GPUs. |
| `--data-parallel-size` | Use `1` to `1024` groups. The default is `1`. | Runs independent worker groups for serving throughput. |
| `--master-addr` | Use the host address for distributed initialization. The default is `localhost`. | Tells distributed workers where to rendezvous. |
| `--master-port` | Use an available port from `1` to `65535`. The default is `2333`. | Sets the distributed rendezvous port. |
| `--distributed-timeout-seconds` | Use a positive timeout in seconds. The default is `600`. | Bounds how long distributed setup may wait. |
| `--device-ids` | Provide comma-separated logical CUDA IDs, or leave empty. | Limits the server to selected PyTorch-visible devices. |

## Capacity and Layout Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--block-size` | Use `4`, `8`, `16`, or `32` for most models; DiffusionGemma uses `256`. The default is `32`. | Sets the token span of one diffusion block. |
| `--buffer-size` | Use a positive block count. The default is `4`. | Controls how many diffusion blocks can remain active for one request. |
| `--page-size` | Use `4`, `8`, `16`, or `32` for most models; DiffusionGemma uses `256`. Keep it at least as large as `--block-size`. | Sets the KV cache page size. |
| `--max-model-len` | Use a positive sequence length. The default is `2048`, and the HF config may clamp it. | Sets the requested prompt-plus-output length limit. |
| `--max-num-batched-tokens` | Use a positive token budget. The default is `4096`; it must cover the effective model length. | Limits scheduler batch size by token count. |
| `--max-num-reqs` | Use a positive request count. The default is `128`. | Caps active requests tracked by the server. |
| `--gpu-memory-utilization` | Use a fraction such as `0.9`. | Guides GPU memory planning. |
| `--kv-cache-layout` | Use `unified` for the default layout or `distinct` for strategy experiments. | Chooses KV cache storage layout. |

## Runtime Toggles

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--disable-full-static-runner` | Add the flag when isolating full-static runner issues. | Disables the supported full-static CUDA Graph runner path. |
| `--disable-torch-compile` | Add the flag while debugging compile-related behavior. | Disables `torch.compile` where it would otherwise be used. |
| `--torch-compile-mode` | Defaults to `reduce-overhead`; use another PyTorch compile mode only when profiling calls for it. | Passes compile mode through to PyTorch. |
| `--enforce-eager` | Add during correctness debugging. Leave it off for optimized throughput checks. | Forces eager execution and bypasses graph-style optimizations. |
| `--attn-impl` | Use `triton_grouped` for normal serving and performance reports. `triton` and `naive` are compatibility/debug fallbacks. The default is `triton_grouped`. | Selects the server attention backend. |
| `--disable-prefix-caching` | Add only when debugging cache behavior or comparing without reuse. | Turns off compatible prefix cache reuse. |

Use `--enforce-eager` while debugging model or scheduler behavior. Remove it when measuring optimized throughput.

## LoRA Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--use-lora` | Add the flag when serving with an adapter. | Enables LoRA adapter loading. |
| `--lora-path` | Point to the adapter checkpoint directory. Required with `--use-lora`. | Loads the adapter weights. |
| `--pre-merge-lora` | Add when the adapter should be merged into the base model at load time. | Avoids per-forward adapter compute when the model path supports merging. |

## Threshold Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| `--add-block-threshold` | Omit it to use `0.1`, or pass a float for block-add tuning. | Controls when another decoding block can be added. |
| `--semi-complete-threshold` | Omit it to use `0.9`, or pass a float for block advancement tuning. | Controls when semi-complete block state can advance. |
| `--accept-threshold` | Use a confidence value from `0` to `1`. The default is `0.9`. | Accepts mask-to-token updates once confidence is high enough. |
| `--remask-threshold` | Use a confidence value from `0` to `1`. The default is `0.4`. | Remasks filled tokens that fall below the confidence threshold. |
| `--token-stability-threshold` | Use a stability ratio from `0` to `1`. The default is `0.0`. | Controls edit-block progress for DMax-style decoding. |

## Validation Tips

Start with conservative limits such as low `--max-num-reqs`, `--max-num-batched-tokens`, and `--max-model-len`. Once the server starts and a small request succeeds, increase limits for the target workload.
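
One way to apply the conservative-start advice is to build the launch command in a small script that checks the documented constraints before anything starts. The sketch below is illustrative only: `check_limits` and `build_server_command` are hypothetical helpers, not part of Diffulex, and the default values (`1024` tokens, one request, a loopback host) are assumed starting points rather than recommendations from the project. It checks only the requested `--max-model-len`; the effective length may be lower if the HF config clamps it.

```python
import shlex


def check_limits(block_size, page_size, max_model_len,
                 max_num_batched_tokens, max_num_reqs):
    """Return a list of violations of the constraints stated in the flag tables."""
    problems = []
    if page_size < block_size:
        problems.append("--page-size should be at least --block-size")
    if max_num_batched_tokens < max_model_len:
        problems.append("--max-num-batched-tokens must cover --max-model-len")
    if max_num_reqs < 1:
        problems.append("--max-num-reqs must be positive")
    return problems


def build_server_command(model_path, model_name, decoding_strategy, *,
                         block_size=32, page_size=32,
                         max_model_len=1024, max_num_batched_tokens=1024,
                         max_num_reqs=1, host="127.0.0.1", port=8000,
                         enforce_eager=True):
    """Build an argv list for a first, conservative launch (hypothetical helper)."""
    problems = check_limits(block_size, page_size, max_model_len,
                            max_num_batched_tokens, max_num_reqs)
    if problems:
        raise ValueError("; ".join(problems))

    argv = [
        "python", "-m", "diffulex.server",
        "--model", model_path,
        "--model-name", model_name,
        "--decoding-strategy", decoding_strategy,
        "--max-model-len", str(max_model_len),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-num-reqs", str(max_num_reqs),
        "--block-size", str(block_size),
        "--page-size", str(page_size),
        "--host", host,
        "--port", str(port),
    ]
    if enforce_eager:
        # Eager mode is suggested for debugging; drop it for throughput runs.
        argv.append("--enforce-eager")
    return argv


if __name__ == "__main__":
    argv = build_server_command("/path/to/LLaDA2.0-mini", "llada2_mini", "multi_bd")
    # Print a copy-pasteable shell command instead of launching the server.
    print(shlex.join(argv))
```

After a small request succeeds against this configuration, raise `max_model_len`, `max_num_batched_tokens`, and `max_num_reqs` toward the target workload, and pass `enforce_eager=False` when measuring optimized throughput.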