diffulex.server

diffulex.server launches an HTTP service backed by a Diffulex engine process. Use it when an application, UI, or test client needs request/response access instead of in-process Python generation.

The entry point is:

python -m diffulex.server --help

Minimal Command

python -m diffulex.server \
  --model /path/to/LLaDA2.0-mini \
  --model-name llada2_mini \
  --decoding-strategy multi_bd \
  --sampling-mode naive \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --max-num-reqs 1 \
  --block-size 32 \
  --buffer-size 1 \
  --page-size 32 \
  --host 0.0.0.0 \
  --port 8000

The server starts a frontend process and a synchronous backend worker. The frontend exposes HTTP routes; the backend owns the Diffulex engine and model execution. ZMQ addresses are generated automatically unless explicitly provided.
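When the server is started from a test harness rather than a shell, the minimal command can be assembled programmatically. The following is a minimal sketch, not part of Diffulex: `build_server_command`, `launch_server`, and `MINIMAL_OPTIONS` are hypothetical helper names, and the sketch only handles flags that take a value (boolean toggles such as --enforce-eager would need separate handling).

```python
import shlex
import subprocess
import sys


def build_server_command(model_path: str, **options) -> list[str]:
    """Build an argv list for `python -m diffulex.server`.

    Keyword names map to flags by replacing underscores with hyphens,
    so max_model_len=4096 becomes `--max-model-len 4096`.
    Only value-taking flags are supported in this sketch.
    """
    cmd = [sys.executable, "-m", "diffulex.server", "--model", model_path]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


def launch_server(cmd: list[str]) -> subprocess.Popen:
    """Start the server as a child process and return its handle."""
    print("launching:", shlex.join(cmd))
    return subprocess.Popen(cmd)


# Values copied from the minimal command above.
MINIMAL_OPTIONS = {
    "model_name": "llada2_mini",
    "decoding_strategy": "multi_bd",
    "sampling_mode": "naive",
    "max_model_len": 4096,
    "max_num_batched_tokens": 4096,
    "max_num_reqs": 1,
    "block_size": 32,
    "buffer_size": 1,
    "page_size": 32,
    "host": "0.0.0.0",
    "port": 8000,
}
```

A caller would run `launch_server(build_server_command("/path/to/LLaDA2.0-mini", **MINIMAL_OPTIONS))` and later call `terminate()` on the returned process handle to stop the server.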

Network Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --host | Use an interface such as 127.0.0.1 for local access or 0.0.0.0 to listen on all interfaces. The default is 0.0.0.0. | Sets the HTTP bind host. |
| --port | Use an available TCP port. The default is 8000. | Sets the HTTP bind port. |
| --log-level | Use a uvicorn log level such as info, warning, or debug. The default is info. | Controls server log verbosity. |
| --zmq-command-addr | Leave empty for automatic setup, or provide a ZMQ address. | Sets the optional frontend-to-backend command channel. |
| --zmq-event-addr | Leave empty for automatic setup, or provide a ZMQ address. | Sets the optional backend-to-frontend event channel. |

Most local runs should leave the ZMQ addresses unset.

Model and Strategy Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --model | Point to the local base-model checkpoint directory. This flag is required. | Loads model weights for the engine backend. |
| --model-name | Use a registered model key. The default is dream. | Selects model adapter and sampler defaults. |
| --decoding-strategy | Use d2f, multi_bd, dmax, or diffusion_gemma where supported by the selected model/config. | Chooses the strategy-specific request, scheduler, cache, runner, and attention metadata path. |
| --sampling-mode | Use naive for the standard sampler or edit for compatible edit-sampling models. The default is naive. | Selects sampler behavior. |

Parallelism and Device Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --tensor-parallel-size | Use 1 to 8 ranks. The server default is 1. | Splits one model replica across multiple GPUs. |
| --data-parallel-size | Use 1 to 1024 groups. The default is 1. | Runs independent worker groups for serving throughput. |
| --master-addr | Use the host address for distributed initialization. The default is localhost. | Tells distributed workers where to rendezvous. |
| --master-port | Use an available port from 1 to 65535. The default is 2333. | Sets the distributed rendezvous port. |
| --distributed-timeout-seconds | Use a positive timeout in seconds. The default is 600. | Bounds how long distributed setup may wait. |
| --device-ids | Provide comma-separated logical CUDA IDs, or leave empty. | Limits the server to selected PyTorch-visible devices. |
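The documented ranges can be checked before launch. The sketch below is a hypothetical pre-flight helper (`check_parallelism` is not a Diffulex function). It also assumes that each data-parallel group holds one full tensor-parallel replica, so the number of devices needed is the product of the two sizes; that assumption is common for this kind of layout but is not stated in this document.

```python
def check_parallelism(tensor_parallel_size: int = 1,
                      data_parallel_size: int = 1,
                      master_port: int = 2333,
                      device_ids: str = "") -> list[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    if not 1 <= tensor_parallel_size <= 8:
        problems.append("--tensor-parallel-size must be between 1 and 8")
    if not 1 <= data_parallel_size <= 1024:
        problems.append("--data-parallel-size must be between 1 and 1024")
    if not 1 <= master_port <= 65535:
        problems.append("--master-port must be between 1 and 65535")

    # --device-ids is a comma-separated list of logical CUDA IDs, or empty.
    ids = [int(part) for part in device_ids.split(",") if part.strip()]
    if len(set(ids)) != len(ids):
        problems.append("--device-ids contains duplicates")

    # ASSUMPTION: devices needed = tensor-parallel ranks x data-parallel groups.
    needed = tensor_parallel_size * data_parallel_size
    if ids and len(ids) < needed:
        problems.append(
            f"--device-ids lists {len(ids)} devices but {needed} appear to be needed"
        )
    return problems
```

For example, `check_parallelism(tensor_parallel_size=4, device_ids="0,1")` reports one problem, because two listed devices cannot hold four tensor-parallel ranks under the stated assumption.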

Capacity and Layout Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --block-size | Use 4, 8, 16, or 32 for most models; DiffusionGemma uses 256. The default is 32. | Sets the token span of one diffusion block. |
| --buffer-size | Use a positive block count. The default is 4. | Controls how many diffusion blocks can remain active for one request. |
| --page-size | Use 4, 8, 16, or 32 for most models; DiffusionGemma uses 256. Keep it at least as large as --block-size. | Sets the KV cache page size. |
| --max-model-len | Use a positive sequence length. The default is 2048, and the HF config may clamp it. | Sets the requested prompt-plus-output length limit. |
| --max-num-batched-tokens | Use a positive token budget. The default is 4096; it must cover the effective model length. | Limits scheduler batch size by token count. |
| --max-num-reqs | Use a positive request count. The default is 128. | Caps active requests tracked by the server. |
| --gpu-memory-utilization | Use a fraction such as 0.9. | Guides GPU memory planning. |
| --kv-cache-layout | Use unified for the default layout or distinct for strategy experiments. | Chooses KV cache storage layout. |
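Two of the rules above relate flags to each other: the page size should be at least the block size, and the batched-token budget must cover the effective model length. The sketch below checks both. `check_capacity` and its `hf_max_len` parameter are hypothetical, and treating the HF clamp as a simple minimum is an assumption about how the effective length is derived.

```python
from typing import Optional


def check_capacity(block_size: int,
                   page_size: int,
                   max_model_len: int = 2048,
                   max_num_batched_tokens: int = 4096,
                   hf_max_len: Optional[int] = None) -> tuple[int, list[str]]:
    """Return (effective_model_len, problems) for a set of capacity flags."""
    problems = []
    for name, value in [("block-size", block_size),
                        ("page-size", page_size),
                        ("max-model-len", max_model_len),
                        ("max-num-batched-tokens", max_num_batched_tokens)]:
        if value <= 0:
            problems.append(f"--{name} must be positive")

    if page_size < block_size:
        problems.append("--page-size should be at least as large as --block-size")

    # ASSUMPTION: "the HF config may clamp it" is modeled as taking the minimum
    # of the requested length and the model's own limit, when that limit is known.
    effective_len = max_model_len if hf_max_len is None else min(max_model_len, hf_max_len)

    if max_num_batched_tokens < effective_len:
        problems.append(
            "--max-num-batched-tokens must cover the effective model length "
            f"({max_num_batched_tokens} < {effective_len})"
        )
    return effective_len, problems
```

With the values from the minimal command, `check_capacity(32, 32, 4096, 4096)` returns an effective length of 4096 and no problems.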

Runtime Toggles

| Flag | How to set it | What it does |
| --- | --- | --- |
| --disable-full-static-runner | Add the flag when isolating full-static runner issues. | Disables the supported full-static CUDA Graph runner path. |
| --disable-torch-compile | Add the flag while debugging compile-related behavior. | Disables torch.compile where it would otherwise be used. |
| --torch-compile-mode | Defaults to reduce-overhead; use another PyTorch compile mode only when profiling calls for it. | Passes compile mode through to PyTorch. |
| --enforce-eager | Add during correctness debugging. Leave it off for optimized throughput checks. | Forces eager execution and bypasses graph-style optimizations. |
| --attn-impl | Use triton_grouped for normal serving and performance reports. triton and naive are compatibility/debug fallbacks. The default is triton_grouped. | Selects the server attention backend. |
| --disable-prefix-caching | Add only when debugging cache behavior or comparing without reuse. | Turns off compatible prefix cache reuse. |

Use --enforce-eager while debugging model or scheduler behavior. Remove it when measuring optimized throughput.

LoRA Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --use-lora | Add the flag when serving with an adapter. | Enables LoRA adapter loading. |
| --lora-path | Point to the adapter checkpoint directory. Required with --use-lora. | Loads the adapter weights. |
| --pre-merge-lora | Add when the adapter should be merged into the base model at load time. | Avoids per-forward adapter compute when the model path supports merging. |
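Because --lora-path is required whenever --use-lora is set, a launcher script can enforce the pairing before the server starts. This is a small sketch with a hypothetical helper name (`lora_args`); the adapter path in the usage note is a placeholder.

```python
from typing import Optional


def lora_args(use_lora: bool = False,
              lora_path: Optional[str] = None,
              pre_merge: bool = False) -> list[str]:
    """Return the LoRA-related argv fragment, or [] when LoRA is not used."""
    if not use_lora:
        return []
    if not lora_path:
        # The flag table states that --lora-path is required with --use-lora.
        raise ValueError("--lora-path is required when --use-lora is set")
    args = ["--use-lora", "--lora-path", lora_path]
    if pre_merge:
        args.append("--pre-merge-lora")
    return args
```

For instance, `lora_args(True, "/path/to/adapter", pre_merge=True)` yields `["--use-lora", "--lora-path", "/path/to/adapter", "--pre-merge-lora"]`, which can be appended to the server command line.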

Threshold Arguments

| Flag | How to set it | What it does |
| --- | --- | --- |
| --add-block-threshold | Omit it to use 0.1, or pass a float for block-add tuning. | Controls when another decoding block can be added. |
| --semi-complete-threshold | Omit it to use 0.9, or pass a float for block advancement tuning. | Controls when semi-complete block state can advance. |
| --accept-threshold | Use a confidence value from 0 to 1. The default is 0.9. | Accepts mask-to-token updates once confidence is high enough. |
| --remask-threshold | Use a confidence value from 0 to 1. The default is 0.4. | Remasks filled tokens that fall below the confidence threshold. |
| --token-stability-threshold | Use a stability ratio from 0 to 1. The default is 0.0. | Controls edit-block progress for DMax-style decoding. |
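To build intuition for how the accept and remask thresholds interact, the toy function below applies both to a short token list. It is an illustration only and does not reproduce Diffulex's sampler: the `MASK` marker, the function name, and the exact comparison operators (`>=` for accept, `<` for remask) are assumptions made for the example.

```python
MASK = "<mask>"  # hypothetical placeholder for a still-masked position


def apply_thresholds(tokens, proposals, confidences,
                     accept_threshold=0.9, remask_threshold=0.4):
    """Toy single step: accept confident proposals, remask low-confidence tokens."""
    updated = []
    for token, proposal, confidence in zip(tokens, proposals, confidences):
        if token == MASK:
            # Masked position: fill it only when the proposal is confident enough.
            updated.append(proposal if confidence >= accept_threshold else MASK)
        else:
            # Filled position: remask it when confidence has dropped too low.
            updated.append(MASK if confidence < remask_threshold else token)
    return updated


tokens = ["The", MASK, MASK, "dog"]
proposals = ["The", "quick", "brown", "dog"]
confidences = [0.99, 0.95, 0.50, 0.30]
print(apply_thresholds(tokens, proposals, confidences))
# ['The', 'quick', '<mask>', '<mask>']
```

In this toy step, "quick" is accepted at 0.95, "brown" stays masked at 0.50, and the previously filled "dog" is remasked at 0.30. Raising the accept threshold makes filling more conservative; raising the remask threshold makes already-filled tokens easier to revisit.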

Validation Tips

Start with conservative limits: low values for --max-num-reqs, --max-num-batched-tokens, and --max-model-len, as in the minimal command above. Once the server starts and a small request succeeds, raise the limits toward the target workload.
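A first check that the server has started is to wait until the HTTP port accepts TCP connections. The sketch below uses only the Python standard library; `wait_for_port` is a hypothetical helper. It confirms only that something is listening on the port. Whether the engine has finished loading at that point is not covered by this document, so a small real request is still the meaningful validation.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a listener is bound to the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

When the server is bound to 0.0.0.0, connect to 127.0.0.1 from the same machine, for example `wait_for_port("127.0.0.1", 8000, timeout=300)`.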