diffulex.server

diffulex.server is the online serving layer. It turns HTTP requests into engine commands, forwards them to the backend worker, and converts backend events back into plain JSON or server-sent events.

The package has two serving paths:

| Path | When it is used | Main modules |
| --- | --- | --- |
| HTTP + ZMQ backend | The normal diffulex.server CLI path. The FastAPI process and the engine process communicate through ZMQ queues. | api_server, frontend, backend_worker, launch, protocol, zmq_queue |
| In-process async loop | Useful for tests or embedders that want to own a single Python process. Engine calls still run through a one-worker executor. | engine_loop |

The HTTP surface exposes POST /generate, GET /v1/models, and POST /v1/chat/completions. Streaming responses are sent as SSE frames and can use either block append events or denoise buffer snapshots.
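As a rough illustration of the SSE framing mentioned above, the sketch below encodes one event as a `data:` frame. The helper name `sse_frame` and the payload fields are hypothetical; the exact frames that diffulex.server emits may differ.

```python
import json


def sse_frame(payload: dict) -> str:
    """Encode one event as a server-sent-events frame (hypothetical helper)."""
    # An SSE frame is a "data:" line followed by a blank line.
    body = json.dumps(payload, separators=(",", ":"))
    return f"data: {body}\n\n"


# Hypothetical payload; the field names are assumptions for illustration only.
frame = sse_frame({"text": "Hello", "offset": 0})
```

A client reads the stream frame by frame, splitting on the blank line and parsing the JSON that follows the `data: ` prefix.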

diffulex.server.api_server

api_server builds the FastAPI application. It owns request validation, OpenAI-compatible chat response shaping, SSE formatting, and translation from HTTP payloads to ServingGenerate commands.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| GenerateRequest | Use it as the request model for /generate. | Accepts a string prompt or token IDs, plus sampling and streaming fields. |
| ChatMessage | Use it inside ChatCompletionRequest.messages. | Stores one chat message with role and content. |
| ChatCompletionRequest | Use it as the request model for /v1/chat/completions. | Accepts chat messages and the same sampling/streaming controls as /generate. |
| sampling_params_from_request | Pass a generate or chat request model. | Builds SamplingParams from request fields such as temperature, max_tokens, max_nfe, and ignore_eos. |
| create_app | Pass a FrontendManager. | Returns the FastAPI app and wires startup/shutdown hooks to the frontend. |
| chat_delta_chunk | Pass a ServingDelta, request, and model id. | Builds an OpenAI-style streaming chat delta chunk. |
| chat_finish_chunk | Pass a final ServingReply. | Builds the final OpenAI-style streaming chat chunk and usage shell. |
| denoise_chat_event | Pass a ServingDelta, ServingBufferSnapshot, or ServingReply. | Wraps denoise events in chat-completion metadata while preserving Diffulex-specific fields. |

stream_mode="block_append" is closest to normal token streaming: only appended text deltas are emitted. stream_mode="denoise" keeps the diffusion decoding state visible by sending buffer snapshots as the model edits a block.
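The difference between the two stream modes can be sketched from the client's point of view. The example below is a simplified model, not the server's actual event schema: the function names, the placeholder `___` for positions that are not yet decoded, and the idea of a committed prefix plus an editable buffer are all assumptions made for illustration.

```python
def apply_append(text: str, delta: str) -> str:
    # block_append: every event only extends the text seen so far.
    return text + delta


def apply_snapshot(committed: str, buffer_text: str) -> str:
    # denoise: every event replaces the whole in-progress buffer,
    # so earlier positions in the block may still change.
    return committed + buffer_text


# block_append: three deltas accumulate into the final text.
append_view = ""
for delta in ["The ", "cat ", "sat"]:
    append_view = apply_append(append_view, delta)

# denoise: three snapshots of the same block, shown after a committed prefix.
committed = "Once: "
snapshots = ["___ cat ___", "The cat ___", "The cat sat"]
denoise_views = [apply_snapshot(committed, snap) for snap in snapshots]
```

In this model a block_append client never rewrites what it has already displayed, while a denoise client redraws the current block on every event.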

diffulex.server.args

args defines the CLI schema used by diffulex.server. ServerArgs keeps web-server options and engine options in one dataclass, then exposes engine_kwargs() for constructing DiffulexEngine.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| parse_device_ids | Pass a comma-separated string such as 0,1,2,3. | Converts it to logical CUDA device IDs; empty input becomes an empty list. |
| ServerArgs | Construct directly in tests, or receive it from parse_args. | Stores host/port, ZMQ addresses, model identity, parallelism, cache, attention, MoE, threshold, and LoRA options. |
| ServerArgs.engine_kwargs | Call before creating the engine. | Returns only the engine-facing subset and fills threshold defaults when a CLI value is omitted. |
| build_arg_parser | Use when extending the server CLI. | Creates the argparse.ArgumentParser and declares allowed values for fields such as sampling_mode and attn_impl. |
| parse_args | Pass an optional argv sequence. | Parses CLI flags and returns ServerArgs. |

When adding a new serving flag, add it to ServerArgs, build_arg_parser, and parse_args, then decide whether it belongs in engine_kwargs(). Web-only fields such as host and port should stay out of the engine kwargs.
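The three places a flag touches can be sketched with a cut-down stand-in. Everything below is an assumption made for illustration: `SketchServerArgs` carries only a handful of fields, the flag names are guesses, and the fallback threshold of 0.9 is a placeholder rather than the real default.

```python
import argparse
from dataclasses import dataclass, field
from typing import List, Optional


def parse_device_ids(value: str) -> List[int]:
    # "0,1,2,3" -> [0, 1, 2, 3]; empty input -> []
    return [int(part) for part in value.split(",") if part.strip()]


@dataclass
class SketchServerArgs:  # hypothetical stand-in for ServerArgs
    host: str = "127.0.0.1"  # web-only
    port: int = 8000  # web-only
    device_ids: List[int] = field(default_factory=list)
    threshold: Optional[float] = None  # None means "omitted on the CLI"

    def engine_kwargs(self) -> dict:
        # Only engine-facing fields; host and port stay out.
        return {
            "device_ids": self.device_ids,
            # 0.9 is a placeholder default, not the real one.
            "threshold": 0.9 if self.threshold is None else self.threshold,
        }


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sketch-server")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--device-ids", type=parse_device_ids, default=[])
    parser.add_argument("--threshold", type=float, default=None)
    return parser


def parse_args(argv=None) -> SketchServerArgs:
    ns = build_arg_parser().parse_args(argv)
    return SketchServerArgs(
        host=ns.host, port=ns.port,
        device_ids=ns.device_ids, threshold=ns.threshold,
    )


args = parse_args(["--port", "9000", "--device-ids", "0,1"])
kwargs = args.engine_kwargs()
```

Keeping the web-only fields out of `engine_kwargs()` means the engine constructor never sees arguments it does not understand.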

diffulex.server.backend_worker

backend_worker runs the synchronous engine process used by the CLI server. It receives serialized ServingCommand objects, advances the engine, and pushes serialized ServingEvent objects back to the frontend.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| default_engine_factory | Leave as the default unless tests inject a fake engine. | Imports built-in strategies and constructs DiffulexEngine. |
| SyncBackendWorker | Create through from_zmq for the normal server path. | Owns one engine instance and one blocking receive/send loop. |
| SyncBackendWorker.from_zmq | Pass model, engine kwargs, command address, and event address. | Builds the worker with ZMQ pull/push queues and protocol serializers. |
| SyncBackendWorker.run_forever | Call inside the backend process. | Initializes the engine, serves commands until shutdown, and closes resources. |
| run_sync_backend_worker | Use as the multiprocessing target. | Constructs the worker, reports startup errors through ready_queue, and enters run_forever(). |

The backend treats ServingShutdown as a control command and forwards all other commands to DiffulexEngine.run_serving_tick.
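The control-versus-work split described above can be sketched with in-memory queues and a fake engine. This is only a model of the dispatch rule: the stand-in classes, the `run_worker` function, and in particular the shape of `run_serving_tick` (a list of commands in, a list of events out) are assumptions, not the real signatures.

```python
import queue
from dataclasses import dataclass


@dataclass
class Generate:  # stand-in for ServingGenerate
    rid: str
    prompt: str


@dataclass
class Shutdown:  # stand-in for ServingShutdown
    pass


class FakeEngine:
    def run_serving_tick(self, commands):
        # Assumed shape: pretend every command finishes in one tick.
        return [f"reply:{command.rid}" for command in commands]


def run_worker(commands: queue.Queue, events: queue.Queue, engine) -> None:
    while True:
        command = commands.get()  # blocking receive
        if isinstance(command, Shutdown):
            break  # control command: stop the loop, do not touch the engine
        for event in engine.run_serving_tick([command]):
            events.put(event)  # everything else goes through the engine


commands, events = queue.Queue(), queue.Queue()
for item in (Generate("r1", "hi"), Generate("r2", "yo"), Shutdown()):
    commands.put(item)
run_worker(commands, events, FakeEngine())

received = []
while not events.empty():
    received.append(events.get())
```

Injecting a fake engine like this is also the pattern the table suggests for tests that replace default_engine_factory.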

diffulex.server.engine_loop

engine_loop is an in-process alternative to the ZMQ worker. It is useful when an application wants async admission and streaming, but does not want to start a separate backend process.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| QueuedCommand | Internal queue item. | Wraps a ServingCommand before the loop admits it. |
| default_engine_factory | Default constructor hook. | Imports strategies and creates DiffulexEngine. |
| EngineLoop | Construct with a model path and engine kwargs, then await start(). | Owns one engine in a one-worker executor and serializes all engine mutations. |
| EngineLoop.generate | Pass prompt text or token IDs and SamplingParams. | Queues a non-streaming request and awaits the final ServingReply. |
| EngineLoop.generate_stream | Pass prompt, sampling params, and optional disconnect callback. | Yields ServingDelta, ServingBufferSnapshot, ServingReply, or ServingError events. |
| EngineLoop.render_chat_prompt | Pass chat messages. | Delegates chat-template rendering to the loaded engine. |
| EngineLoop.stop | Call during shutdown. | Stops the loop, fails pending waiters, exits the engine, and closes the executor. |

All engine calls run through call_engine, so FastAPI-style async request handling does not mutate DiffulexEngine concurrently.
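The serialization guarantee comes from the one-worker executor: concurrent coroutines may submit work at the same time, but only one call runs against the engine at any moment. The sketch below demonstrates that property with a fake engine. `SketchLoop` and `FakeEngine` are hypothetical, and the real `call_engine` may take different arguments.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


class FakeEngine:
    def __init__(self):
        self.active = 0
        self.max_active = 0

    def step(self, n: int) -> int:
        # Track how many calls overlap; with one worker this never exceeds 1.
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)
        self.active -= 1
        return n * 2


class SketchLoop:
    def __init__(self, engine):
        self.engine = engine
        self._executor = ThreadPoolExecutor(max_workers=1)

    async def call_engine(self, fn, *args):
        # Every engine call is funneled through the single worker thread.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    def close(self):
        self._executor.shutdown(wait=True)


async def demo():
    sketch = SketchLoop(FakeEngine())
    try:
        calls = (sketch.call_engine(sketch.engine.step, i) for i in range(4))
        results = await asyncio.gather(*calls)
    finally:
        sketch.close()
    return results, sketch.engine.max_active


results, max_active = asyncio.run(demo())
```

The event loop stays responsive while the engine works, because the blocking call runs on the worker thread rather than on the loop itself.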

diffulex.server.frontend

frontend is the async bridge between HTTP handlers and the backend process. It tracks per-request event queues and aborts backend work when the client disconnects before completion.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| ClientDisconnected | Catch it at the HTTP layer. | Signals that the client went away while the request was still active. |
| FrontendReqState | Internal request state. | Stores buffered events and an asyncio.Event used to wake waiters. |
| FrontendManager | Create directly with queues or via from_zmq. | Sends commands, listens for backend events, and maps events back to request IDs. |
| FrontendManager.from_zmq | Pass model id and ZMQ addresses. | Creates async push/pull queues using the protocol serializers. |
| FrontendManager.generate | Pass a ServingGenerate. | Waits until a final reply or error arrives. |
| FrontendManager.generate_stream | Pass a ServingGenerate. | Yields backend events as they arrive and aborts incomplete requests on exit. |
| FrontendManager.abort_request | Pass a request ID. | Sends a ServingAbort command to the backend. |

The frontend creates request IDs with the diffulex- prefix and keeps request state only while the request is active.
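A minimal sketch of that lifecycle, assuming a per-request asyncio.Queue in place of the buffered events and asyncio.Event that FrontendReqState actually stores. `SketchFrontend`, its methods, the use of a UUID after the prefix, and the `final` flag on events are all hypothetical.

```python
import asyncio
import uuid


class SketchFrontend:
    def __init__(self):
        self._states = {}  # request ID -> queue of pending events

    def new_request_id(self) -> str:
        # The diffulex- prefix comes from the text; the UUID suffix is a guess.
        return f"diffulex-{uuid.uuid4().hex}"

    def dispatch(self, rid: str, event: dict) -> None:
        # Events for unknown or finished requests are dropped.
        state = self._states.get(rid)
        if state is not None:
            state.put_nowait(event)

    def active_count(self) -> int:
        return len(self._states)

    async def stream(self, rid: str):
        self._states[rid] = asyncio.Queue()
        try:
            while True:
                event = await self._states[rid].get()
                yield event
                if event["final"]:
                    break
        finally:
            # State lives only while the request is active.
            del self._states[rid]


async def demo():
    frontend = SketchFrontend()
    rid = frontend.new_request_id()

    async def backend():
        frontend.dispatch(rid, {"text": "Hel", "final": False})
        frontend.dispatch(rid, {"text": "Hello", "final": True})

    task = asyncio.create_task(backend())
    events = [event async for event in frontend.stream(rid)]
    await task
    return rid, events, frontend.active_count()


rid, events, remaining = asyncio.run(demo())
```

The `finally` block is where a real implementation would also decide whether the request finished or needs an abort sent to the backend.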

diffulex.server.launch

launch is the CLI entry point. It parses ServerArgs, resolves IPC addresses, starts the backend process, builds the FastAPI app, and runs Uvicorn.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| default_ipc_addr | Pass parsed server args and a name such as commands. | Creates an ipc:// address under the system temporary directory. |
| resolve_zmq_addrs | Pass ServerArgs. | Uses explicit ZMQ addresses when provided, otherwise creates default IPC addresses. |
| start_backend | Pass args and resolved ZMQ addresses. | Starts the synchronous backend in a spawned process and waits for readiness. |
| main | Used by the CLI module. | Runs the complete HTTP server lifecycle. |

The default address scheme is local IPC. Use --zmq-command-addr and --zmq-event-addr only when the frontend and backend need explicit custom transport addresses.
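The fallback rule can be sketched as follows. The function names, the `tag` argument, and the socket file naming are assumptions; only the ipc:// scheme and the use of the system temporary directory come from the description above.

```python
import os
import tempfile
from typing import Optional, Tuple


def sketch_default_ipc_addr(tag: str, name: str) -> str:
    # Hypothetical file naming; the real layout may differ.
    path = os.path.join(tempfile.gettempdir(), f"diffulex-{tag}-{name}.ipc")
    return f"ipc://{path}"


def sketch_resolve_addrs(
    command_addr: Optional[str], event_addr: Optional[str], tag: str
) -> Tuple[str, str]:
    # Explicit addresses win; otherwise fall back to local IPC defaults.
    return (
        command_addr or sketch_default_ipc_addr(tag, "commands"),
        event_addr or sketch_default_ipc_addr(tag, "events"),
    )


defaults = sketch_resolve_addrs(None, None, tag="8000")
custom = sketch_resolve_addrs("tcp://127.0.0.1:5555", None, tag="8000")
```

Deriving the default from something unique to the server instance, such as the port used for `tag` here, keeps two servers on one machine from sharing a socket file.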

diffulex.server.protocol

protocol defines the typed messages that cross the frontend/backend boundary. The dataclasses are intentionally simple because they are serialized to dicts and packed with msgpack before being sent over ZMQ.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| PromptInput | Store a string prompt or token ID list. | Represents /generate input. |
| ChatInput | Store a list of {role, content} messages. | Represents chat-completion input before template rendering. |
| ServingGenerate | Send from frontend to backend. | Starts a generation request with sampling params, streaming mode, user, and timestamp metadata. |
| ServingAbort | Send from frontend to backend. | Requests cancellation for one request ID. |
| ServingShutdown | Send during frontend shutdown. | Asks the backend worker to stop. |
| ServingReply | Send from backend to frontend. | Represents the final generated text, token IDs, NFE count, and finish reason. |
| ServingDelta | Send during block_append streaming. | Represents newly appended text and token IDs with an offset. |
| ServingBufferSnapshot | Send during denoise streaming. | Represents the current editable buffer span and text after a denoising update. |
| ServingError | Send when backend work fails. | Carries an error message for one request ID. |
| serving_command_to_dict / serving_command_from_dict | Use at queue boundaries. | Serialize and restore frontend-to-backend commands. |
| serving_event_to_dict / serving_event_from_dict | Use at queue boundaries. | Serialize and restore backend-to-frontend events. |

Each command and event stores its identifier in an rid field and exposes it as request_id, which keeps the HTTP, frontend, backend, and engine layers aligned on one identifier.
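A dict round trip for typed messages might look like the sketch below. The class names, the `type` tag, and the field layout are hypothetical, and JSON stands in for msgpack so the example needs only the standard library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SketchAbort:  # stand-in for ServingAbort
    rid: str

    @property
    def request_id(self) -> str:
        return self.rid


@dataclass
class SketchShutdown:  # stand-in for ServingShutdown
    rid: str = ""

    @property
    def request_id(self) -> str:
        return self.rid


# The "type" tag tells the receiver which dataclass to rebuild.
_COMMANDS = {"abort": SketchAbort, "shutdown": SketchShutdown}


def command_to_dict(command) -> dict:
    for tag, cls in _COMMANDS.items():
        if type(command) is cls:
            return {"type": tag, **asdict(command)}
    raise TypeError(f"unknown command: {command!r}")


def command_from_dict(data: dict):
    fields = dict(data)  # copy so the caller's dict is not mutated
    cls = _COMMANDS[fields.pop("type")]
    return cls(**fields)


original = SketchAbort(rid="diffulex-123")
wire = json.dumps(command_to_dict(original))  # JSON stands in for msgpack
restored = command_from_dict(json.loads(wire))
```

Keeping the dataclasses flat, with only strings, numbers, and lists as field values, is what makes this kind of round trip trivial.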

diffulex.server.zmq_queue

zmq_queue wraps pyzmq PUSH/PULL sockets with msgpack serialization. The module has synchronous classes for the backend worker and async classes for the FastAPI frontend.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| ZmqPushQueue | Use in synchronous code that sends objects. | Encodes an object to a dict, packs it with msgpack, and sends it through a PUSH socket. |
| ZmqPullQueue | Use in synchronous code that receives objects. | Receives a msgpack payload from a PULL socket and decodes it back to a typed object. |
| ZmqAsyncPushQueue | Use in async code that sends objects. | Async PUSH equivalent of ZmqPushQueue. |
| ZmqAsyncPullQueue | Use in async code that receives objects. | Async PULL equivalent of ZmqPullQueue. |

All queue classes accept a create flag. When create=True, the socket binds the address; when create=False, it connects to an address owned by another process.