diffulex.server
diffulex.server is the online serving layer. It turns HTTP requests into
engine commands, forwards them to the backend worker, and converts backend
events back into plain JSON or server-sent events.
The package has two serving paths:

| Path | When it is used | Main modules |
|---|---|---|
| HTTP + ZMQ backend | The normal |  |
| In-process async loop | Useful for tests or embedders that want to own a single Python process. Engine calls still run through a one-worker executor. |  |

The HTTP surface exposes POST /generate, GET /v1/models, and
POST /v1/chat/completions. Streaming responses are sent as SSE frames and can
use either block append events or denoise buffer snapshots.
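As a rough illustration of the client side of streaming, the sketch below decodes SSE frames into JSON payloads. It assumes each frame carries one JSON object in its `data:` lines and that frames are separated by blank lines, as in standard SSE. The helper name `iter_sse_payloads` and the payload fields used in the usage note are hypothetical, not part of Diffulex.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each SSE frame found in a sequence of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current frame.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            # Lines starting with a colon are SSE comments (often keep-alives).
            continue
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # Flush a trailing frame that was not followed by a blank line.
    if data_parts:
        yield json.loads("\n".join(data_parts))
```

Feeding it the decoded lines of a streaming response body, for example `for payload in iter_sse_payloads(response_lines): ...`, gives one dict per event. What those dicts contain depends on the endpoint and the stream mode.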
diffulex.server.api_server
api_server builds the FastAPI application. It owns request validation,
OpenAI-compatible chat response shaping, SSE formatting, and translation from
HTTP payloads to ServingGenerate commands.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Use it as the request model for | Accepts a string prompt or token IDs, plus sampling and streaming fields. |
|  | Use it inside | Stores one chat message with |
|  | Use it as the request model for | Accepts chat messages and the same sampling/streaming controls as |
|  | Pass a generate or chat request model. | Builds |
|  | Pass a | Returns the FastAPI app and wires startup/shutdown hooks to the frontend. |
|  | Pass a | Builds an OpenAI-style streaming chat delta chunk. |
|  | Pass a final | Builds the final OpenAI-style streaming chat chunk and usage shell. |
|  | Pass a | Wraps denoise events in chat-completion metadata while preserving Diffulex-specific fields. |

stream_mode="block_append" is closest to normal token streaming: only appended
text deltas are emitted. stream_mode="denoise" keeps the diffusion decoding
state visible by sending buffer snapshots as the model edits a block.
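The difference between the two modes can be sketched from the point of view of a client that rebuilds the visible text. The event shapes here are assumptions made for illustration: an append event is taken to carry a `text` delta, and a denoise snapshot is taken to carry a hypothetical `buffer_start` character offset plus the current `text` of the editable buffer. The real field names may differ.

```python
def apply_block_append(current: str, event: dict) -> str:
    """block_append: text only ever grows, so the delta is concatenated."""
    return current + event["text"]


def apply_denoise_snapshot(current: str, event: dict) -> str:
    """denoise: the editable tail is replaced by the latest buffer snapshot."""
    # Everything before buffer_start is treated as settled; the rest is overwritten.
    return current[: event["buffer_start"]] + event["text"]


# Usage: the same final text reached through either mode.
appended = ""
for ev in [{"text": "Hello"}, {"text": " world"}]:
    appended = apply_block_append(appended, ev)

denoised = "Hello"
for ev in [
    {"buffer_start": 5, "text": " w_r_d"},   # partially denoised block
    {"buffer_start": 5, "text": " world"},   # same block after another update
]:
    denoised = apply_denoise_snapshot(denoised, ev)
```

In this sketch a `block_append` consumer never has to retract text it has already shown, while a `denoise` consumer must be prepared to redraw the buffer region on every event.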
diffulex.server.args
args defines the CLI schema used by diffulex.server. ServerArgs
keeps web-server options and engine options in one dataclass, then exposes
engine_kwargs() for constructing DiffulexEngine.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Pass a comma-separated string such as | Converts it to logical CUDA device IDs; empty input becomes an empty list. |
|  | Construct directly in tests, or receive it from | Stores host/port, ZMQ addresses, model identity, parallelism, cache, attention, MoE, threshold, and LoRA options. |
|  | Call before creating the engine. | Returns only the engine-facing subset and fills threshold defaults when a CLI value is omitted. |
|  | Use when extending the server CLI. | Creates the |
|  | Pass an optional argv sequence. | Parses CLI flags and returns |

When adding a new serving flag, add it to ServerArgs, build_arg_parser, and
parse_args, then decide whether it belongs in engine_kwargs(). Web-only
fields such as host and port should stay out of the engine kwargs.
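The split between web-only and engine-facing fields might look roughly like the sketch below. `ExampleServerArgs`, `parse_device_ids`, the `tensor_parallel_size` and `threshold` fields, and the placeholder default value are all hypothetical stand-ins; the real `ServerArgs` has many more fields, and whether the model identity is part of the engine kwargs is an assumption here.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Fields that configure the web server only and must not reach the engine.
WEB_ONLY_FIELDS = {"host", "port"}
# Placeholder default for illustration; not the value Diffulex uses.
EXAMPLE_DEFAULT_THRESHOLD = 0.9


def parse_device_ids(value: str) -> List[int]:
    """Turn "0,1,3" into [0, 1, 3]; empty input yields an empty list."""
    return [int(part) for part in value.split(",") if part.strip()]


@dataclass
class ExampleServerArgs:
    model: str
    host: str = "127.0.0.1"
    port: int = 8000
    tensor_parallel_size: int = 1        # hypothetical engine option
    threshold: Optional[float] = None    # None means the CLI flag was omitted

    def engine_kwargs(self) -> dict:
        """Return the engine-facing subset, filling defaults for omitted values."""
        kwargs = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.name not in WEB_ONLY_FIELDS
        }
        if kwargs["threshold"] is None:
            kwargs["threshold"] = EXAMPLE_DEFAULT_THRESHOLD
        return kwargs
```

Keeping the exclusion list in one place means a new web-only flag needs a single extra entry rather than a change to every call site that builds the engine.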
diffulex.server.backend_worker
backend_worker runs the synchronous engine process used by the CLI server. It
receives serialized ServingCommand objects, advances the engine, and pushes
serialized ServingEvent objects back to the frontend.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Leave as the default unless tests inject a fake engine. | Imports built-in strategies and constructs |
|  | Create through | Owns one engine instance and one blocking receive/send loop. |
|  | Pass model, engine kwargs, command address, and event address. | Builds the worker with ZMQ pull/push queues and protocol serializers. |
|  | Call inside the backend process. | Initializes the engine, serves commands until shutdown, and closes resources. |
|  | Use as the multiprocessing target. | Constructs the worker, reports startup errors through |

The backend treats ServingShutdown as a control command and forwards all other
commands to DiffulexEngine.run_serving_tick.
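The control flow described above can be sketched with in-memory queues standing in for the ZMQ sockets. `Generate`, `Shutdown`, and `EchoEngine` are hypothetical stand-ins, and the one-command-in, list-of-events-out shape given to `run_serving_tick` is an assumption; the real method signature is not documented here.

```python
import queue
from dataclasses import dataclass


@dataclass
class Generate:
    """Stand-in for a generation command."""
    rid: str
    prompt: str


@dataclass
class Shutdown:
    """Stand-in for the shutdown control command."""


class EchoEngine:
    """Fake engine used only for this sketch."""

    def run_serving_tick(self, command):
        # Assumed shape: one command in, a list of events out.
        return [{"rid": command.rid, "text": command.prompt.upper()}]


def serve(engine, commands: queue.Queue, events: queue.Queue) -> None:
    """Blocking receive/send loop that stops on the shutdown command."""
    while True:
        command = commands.get()              # blocking receive
        if isinstance(command, Shutdown):
            break                             # control command: never reaches the engine
        for event in engine.run_serving_tick(command):
            events.put(event)                 # push results back to the frontend
```

Injecting a fake engine like this is also a convenient way to test the loop without loading a model.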
diffulex.server.engine_loop
engine_loop is an in-process alternative to the ZMQ worker. It is useful when
an application wants async admission and streaming, but does not want to start a
separate backend process.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Internal queue item. | Wraps a |
|  | Default constructor hook. | Imports strategies and creates |
|  | Construct with a model path and engine kwargs, then | Owns one engine in a one-worker executor and serializes all engine mutations. |
|  | Pass prompt text or token IDs and | Queues a non-streaming request and awaits the final |
|  | Pass prompt, sampling params, and optional disconnect callback. | Yields |
|  | Pass chat messages. | Delegates chat-template rendering to the loaded engine. |
|  | Call during shutdown. | Stops the loop, fails pending waiters, exits the engine, and closes the executor. |

All engine calls run through call_engine, so FastAPI-style async request
handling does not mutate DiffulexEngine concurrently.
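The serialization guarantee comes from funnelling every call through an executor with exactly one worker thread. The sketch below shows that pattern in isolation; `SerializedEngine`, `CountingEngine`, and the `call_engine(method_name, *args)` signature are hypothetical and only illustrate the idea.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedEngine:
    """Runs every engine call on a single worker thread."""

    def __init__(self, engine):
        self._engine = engine
        self._executor = ThreadPoolExecutor(max_workers=1)

    async def call_engine(self, method_name, *args):
        loop = asyncio.get_running_loop()
        method = getattr(self._engine, method_name)
        # With one worker, submitted calls execute strictly one after another.
        return await loop.run_in_executor(self._executor, method, *args)

    def close(self):
        self._executor.shutdown(wait=True)


class CountingEngine:
    """Fake engine that records how many calls overlap."""

    def __init__(self):
        self.active = 0
        self.max_active = 0

    def step(self, n):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)      # simulate blocking engine work
        self.active -= 1
        return n * 2


async def demo():
    engine = CountingEngine()
    wrapper = SerializedEngine(engine)
    try:
        # Five concurrent awaiters, but only one call runs at a time.
        results = await asyncio.gather(
            *(wrapper.call_engine("step", i) for i in range(5))
        )
    finally:
        wrapper.close()
    return results, engine.max_active
```

Running `asyncio.run(demo())` keeps the event loop responsive while the blocking calls execute, and the overlap counter never exceeds one.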
diffulex.server.frontend
frontend is the async bridge between HTTP handlers and the backend process. It
tracks per-request event queues and aborts backend work when the client
disconnects before completion.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Catch it at the HTTP layer. | Signals that the client went away while the request was still active. |
|  | Internal request state. | Stores buffered events and an |
|  | Create directly with queues or via | Sends commands, listens for backend events, and maps events back to request IDs. |
|  | Pass model id and ZMQ addresses. | Creates async push/pull queues using the protocol serializers. |
|  | Pass a | Waits until a final reply or error arrives. |
|  | Pass a | Yields backend events as they arrive and aborts incomplete requests on exit. |
|  | Pass a request ID. | Sends a |

The frontend creates request IDs with the diffulex- prefix and keeps request
state only while the request is active.
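A minimal sketch of that bookkeeping is shown below, assuming one `asyncio.Queue` per request and dict-shaped events with hypothetical `rid` and `final` keys. `RequestRegistry` is an illustrative name, the UUID suffix is an assumption about how IDs are made unique, and recording the ID in `aborted` stands in for sending an abort command to the backend.

```python
import asyncio
import uuid


class RequestRegistry:
    """Tracks one event queue per active request."""

    def __init__(self):
        self._queues = {}
        self.aborted = []   # stand-in for abort commands sent to the backend

    def open(self) -> str:
        rid = f"diffulex-{uuid.uuid4().hex}"
        self._queues[rid] = asyncio.Queue()
        return rid

    def dispatch(self, event: dict) -> None:
        """Route a backend event to its request; drop it if the request is gone."""
        q = self._queues.get(event["rid"])
        if q is not None:
            q.put_nowait(event)

    async def stream(self, rid: str):
        finished = False
        try:
            while True:
                event = await self._queues[rid].get()
                if event.get("final"):
                    finished = True
                yield event
                if finished:
                    return
        finally:
            # Runs on normal completion and when the consumer closes the generator.
            if not finished:
                self.aborted.append(rid)
            self._queues.pop(rid, None)   # state lives only while the request is active
```

In this sketch, closing the generator early (as a handler would when the client disconnects) triggers the `finally` block, which both requests the abort and discards the per-request state.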
diffulex.server.launch
launch is the CLI entry point. It parses ServerArgs, resolves IPC
addresses, starts the backend process, builds the FastAPI app, and runs Uvicorn.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Pass parsed server args and a name such as | Creates an |
|  | Pass | Uses explicit ZMQ addresses when provided, otherwise creates default IPC addresses. |
|  | Pass args and resolved ZMQ addresses. | Starts the synchronous backend in a spawned process and waits for readiness. |
|  | Used by the CLI module. | Runs the complete HTTP server lifecycle. |

The default address scheme is local IPC. Use --zmq-command-addr and
--zmq-event-addr only when the frontend and backend need explicit custom
transport addresses.
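The "explicit address wins, otherwise fall back to local IPC" rule could be sketched as follows. The function name `resolve_zmq_addresses`, the socket file names, and the use of the system temporary directory are all assumptions for illustration; the actual default paths are chosen by the launcher.

```python
import os
import tempfile
import uuid
from typing import Optional, Tuple


def resolve_zmq_addresses(
    command_addr: Optional[str] = None,
    event_addr: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (command_addr, event_addr), generating IPC defaults when unset."""
    # A per-launch suffix keeps two servers on one machine from colliding.
    suffix = uuid.uuid4().hex[:8]
    base = tempfile.gettempdir()
    if command_addr is None:
        command_addr = "ipc://" + os.path.join(base, f"example-commands-{suffix}")
    if event_addr is None:
        event_addr = "ipc://" + os.path.join(base, f"example-events-{suffix}")
    return command_addr, event_addr
```

Calling it with no arguments yields two distinct `ipc://` endpoints, while passing values such as `tcp://127.0.0.1:5555` returns them unchanged.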
diffulex.server.protocol
protocol defines the typed messages that cross the frontend/backend boundary.
The dataclasses are intentionally simple because they are serialized to dicts
and packed with msgpack before being sent over ZMQ.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Store a string prompt or token ID list. | Represents |
|  | Store a list of | Represents chat-completion input before template rendering. |
|  | Send from frontend to backend. | Starts a generation request with sampling params, streaming mode, user, and timestamp metadata. |
|  | Send from frontend to backend. | Requests cancellation for one request ID. |
|  | Send during frontend shutdown. | Asks the backend worker to stop. |
|  | Send from backend to frontend. | Represents the final generated text, token IDs, NFE count, and finish reason. |
|  | Send during | Represents newly appended text and token IDs with an offset. |
|  | Send during | Represents the current editable buffer span and text after a denoising update. |
|  | Send when backend work fails. | Carries an error message for one request ID. |
|  | Use at queue boundaries. | Serialize and restore frontend-to-backend commands. |
|  | Use at queue boundaries. | Serialize and restore backend-to-frontend events. |

Each command and event exposes request_id through the underlying rid, which
keeps the HTTP, frontend, backend, and engine layers aligned on one identifier.
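The pattern of simple dataclasses, a `request_id` view over `rid`, and a dict round trip can be sketched as below. `ExampleGenerate`, `ExampleAbort`, the `kind` discriminator key, and the encode/decode helper names are hypothetical. The real messages carry more fields, and the resulting dict would additionally be packed with msgpack before crossing ZMQ; that step is omitted here to keep the sketch standard-library only.

```python
from dataclasses import asdict, dataclass


@dataclass
class ExampleGenerate:
    rid: str
    prompt: str
    stream_mode: str = "block_append"

    @property
    def request_id(self) -> str:
        # Read-only alias so every layer can refer to the same identifier.
        return self.rid


@dataclass
class ExampleAbort:
    rid: str

    @property
    def request_id(self) -> str:
        return self.rid


# Registry used to restore the right dataclass from a plain dict.
COMMAND_TYPES = {cls.__name__: cls for cls in (ExampleGenerate, ExampleAbort)}


def encode_command(command) -> dict:
    """Flatten a command into a dict tagged with its type name."""
    return {"kind": type(command).__name__, **asdict(command)}


def decode_command(payload: dict):
    """Rebuild the typed command from a tagged dict."""
    data = dict(payload)            # copy so the caller's dict is not mutated
    kind = data.pop("kind")
    return COMMAND_TYPES[kind](**data)
```

Because properties are not dataclass fields, `request_id` does not appear in the encoded dict, so only `rid` travels over the wire and the alias is recomputed on the receiving side.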
diffulex.server.zmq_queue
zmq_queue wraps pyzmq PUSH/PULL sockets with msgpack serialization. The
module has synchronous classes for the backend worker and async classes for the
FastAPI frontend.

| Symbol | How to use it | What it does |
|---|---|---|
|  | Use in synchronous code that sends objects. | Encodes an object to a dict, packs it with msgpack, and sends it through a PUSH socket. |
|  | Use in synchronous code that receives objects. | Receives a msgpack payload from a PULL socket and decodes it back to a typed object. |
|  | Use in async code that sends objects. | Async PUSH equivalent of |
|  | Use in async code that receives objects. | Async PULL equivalent of |

All queue classes accept a create flag. When create=True, the socket binds
the address; when create=False, it connects to an address owned by another
process.