diffulex.server

diffulex.server is the online serving layer. It turns HTTP requests into engine commands, forwards them to the backend worker, and converts backend events back into plain JSON or server-sent events.

The package has two serving paths:

Path	When it is used	Main modules
HTTP + ZMQ backend	The normal `diffulex.server` CLI path. The FastAPI process and the engine process communicate through ZMQ queues.	`api_server`, `frontend`, `backend_worker`, `launch`, `protocol`, `zmq_queue`
In-process async loop	Useful for tests or embedders that want to own a single Python process. Engine calls still run through a one-worker executor.	`engine_loop`

The HTTP surface exposes POST /generate, GET /v1/models, and POST /v1/chat/completions. Streaming responses are sent as SSE frames and can use either block append events or denoise buffer snapshots.
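As a rough illustration of the `/generate` endpoint described above, the sketch below builds a request with the standard library but does not send it. The base URL and the `stream` field name are assumptions made for illustration. `prompt`, `max_tokens`, `temperature`, and `stream_mode` follow names that appear elsewhere on this page, though the exact request schema should be checked against `GenerateRequest`.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST /generate request."""
    payload = {
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.0,
        "stream": True,                 # assumed field name for enabling SSE
        "stream_mode": "block_append",  # or "denoise"
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The host and port are placeholders; use whatever the server was started with.
req = build_generate_request("http://127.0.0.1:8000", "Hello")
# urllib.request.urlopen(req) would send it to a running server.
```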

diffulex.server.api_server

api_server builds the FastAPI application. It owns request validation, OpenAI-compatible chat response shaping, SSE formatting, and translation from HTTP payloads to ServingGenerate commands.

Symbol	How to use it	What it does
`GenerateRequest`	Use it as the request model for `/generate`.	Accepts a string prompt or token IDs, plus sampling and streaming fields.
`ChatMessage`	Use it inside `ChatCompletionRequest.messages`.	Stores one chat message with `role` and `content`.
`ChatCompletionRequest`	Use it as the request model for `/v1/chat/completions`.	Accepts chat messages and the same sampling/streaming controls as `/generate`.
`sampling_params_from_request`	Pass a generate or chat request model.	Builds `SamplingParams` from request fields such as `temperature`, `max_tokens`, `max_nfe`, and `ignore_eos`.
`create_app`	Pass a `FrontendManager`.	Returns the FastAPI app and wires startup/shutdown hooks to the frontend.
`chat_delta_chunk`	Pass a `ServingDelta`, request, and model id.	Builds an OpenAI-style streaming chat delta chunk.
`chat_finish_chunk`	Pass a final `ServingReply`.	Builds the final OpenAI-style streaming chat chunk and usage shell.
`denoise_chat_event`	Pass a `ServingDelta`, `ServingBufferSnapshot`, or `ServingReply`.	Wraps denoise events in chat-completion metadata while preserving Diffulex-specific fields.

stream_mode="block_append" is closest to normal token streaming: only appended text deltas are emitted. stream_mode="denoise" keeps the diffusion decoding state visible by sending buffer snapshots as the model edits a block.
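The difference between the two modes can be sketched from the client side. The example below is a minimal illustration, not the server's actual wire format: the event keys (`type`, `text`, `start`) are hypothetical, and it assumes a snapshot replaces everything from its `start` offset onward. Only the `data: ...` framing is standard SSE.

```python
import json


def sse_frame(event: dict) -> str:
    """Format one event as a server-sent-events data frame."""
    return f"data: {json.dumps(event)}\n\n"


def parse_sse(stream: str) -> list:
    """Split a buffered SSE stream into decoded JSON events."""
    events = []
    for frame in stream.split("\n\n"):
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events


def apply_event(text: str, event: dict) -> str:
    """Update the client-side text for one event (hypothetical event shape)."""
    if event["type"] == "delta":
        # block_append: text only ever grows.
        return text + event["text"]
    if event["type"] == "snapshot":
        # denoise: keep the committed prefix, overwrite the editable buffer.
        return text[: event["start"]] + event["text"]
    return text


def replay(events: list) -> str:
    """Apply a sequence of events to an initially empty text."""
    text = ""
    for event in events:
        text = apply_event(text, event)
    return text


block_append_events = [
    {"type": "delta", "text": "The "},
    {"type": "delta", "text": "cat "},
    {"type": "delta", "text": "sat"},
]
denoise_events = [
    {"type": "snapshot", "start": 0, "text": "The [MASK] [MASK]"},
    {"type": "snapshot", "start": 4, "text": "cat [MASK]"},
    {"type": "snapshot", "start": 8, "text": "sat"},
]
```

Both sequences converge on the same final text. The denoise stream additionally shows intermediate masked states, which is why a client must overwrite the buffer instead of appending.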

diffulex.server.args

args defines the CLI schema used by diffulex.server. ServerArgs keeps web-server options and engine options in one dataclass, then exposes engine_kwargs() for constructing DiffulexEngine.

Symbol	How to use it	What it does
`parse_device_ids`	Pass a comma-separated string such as `0,1,2,3`.	Converts it to logical CUDA device IDs; empty input becomes an empty list.
`ServerArgs`	Construct directly in tests, or receive it from `parse_args`.	Stores host/port, ZMQ addresses, model identity, parallelism, cache, attention, MoE, threshold, and LoRA options.
`ServerArgs.engine_kwargs`	Call before creating the engine.	Returns only the engine-facing subset and fills threshold defaults when a CLI value is omitted.
`build_arg_parser`	Use when extending the server CLI.	Creates the `argparse.ArgumentParser` and declares allowed values for fields such as `sampling_mode` and `attn_impl`.
`parse_args`	Pass an optional argv sequence.	Parses CLI flags and returns `ServerArgs`.

When adding a new serving flag, add it to ServerArgs, build_arg_parser, and parse_args, then decide whether it belongs in engine_kwargs(). Web-only fields such as host and port should stay out of the engine kwargs.
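The pattern above can be sketched with a reduced stand-in. `MiniServerArgs` and its fields are hypothetical and much smaller than the real `ServerArgs`, and the `0.9` threshold default is a placeholder, not the library's value. The point is only how a flag travels from the parser to the dataclass and then, selectively, into the engine kwargs.

```python
import argparse
from dataclasses import dataclass, field
from typing import List, Optional


def parse_device_ids(value: str) -> List[int]:
    """Convert "0,1,2" to [0, 1, 2]; empty input becomes []."""
    return [int(part) for part in value.split(",") if part.strip()]


@dataclass
class MiniServerArgs:
    model: str
    host: str = "127.0.0.1"          # web-only
    port: int = 8000                 # web-only
    device_ids: List[int] = field(default_factory=list)
    threshold: Optional[float] = None

    def engine_kwargs(self) -> dict:
        """Return only the engine-facing subset, filling omitted defaults."""
        return {
            "device_ids": self.device_ids,
            # Placeholder default, used only when the CLI value is omitted.
            "threshold": 0.9 if self.threshold is None else self.threshold,
        }


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mini-server")
    parser.add_argument("--model", required=True)
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    # A string default is passed through `type`, so "" becomes [].
    parser.add_argument("--device-ids", type=parse_device_ids, default="")
    parser.add_argument("--threshold", type=float, default=None)
    return parser


def parse_args(argv=None) -> MiniServerArgs:
    ns = build_arg_parser().parse_args(argv)
    return MiniServerArgs(
        model=ns.model,
        host=ns.host,
        port=ns.port,
        device_ids=ns.device_ids,
        threshold=ns.threshold,
    )


args = parse_args(["--model", "demo-model", "--device-ids", "0,1"])
```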

diffulex.server.backend_worker

backend_worker runs the synchronous engine process used by the CLI server. It receives serialized ServingCommand objects, advances the engine, and pushes serialized ServingEvent objects back to the frontend.

Symbol	How to use it	What it does
`default_engine_factory`	Leave as the default unless tests inject a fake engine.	Imports built-in strategies and constructs `DiffulexEngine`.
`SyncBackendWorker`	Create through `from_zmq` for the normal server path.	Owns one engine instance and one blocking receive/send loop.
`SyncBackendWorker.from_zmq`	Pass model, engine kwargs, command address, and event address.	Builds the worker with ZMQ pull/push queues and protocol serializers.
`SyncBackendWorker.run_forever`	Call inside the backend process.	Initializes the engine, serves commands until shutdown, and closes resources.
`run_sync_backend_worker`	Use as the multiprocessing target.	Constructs the worker, reports startup errors through `ready_queue`, and enters `run_forever()`.

The backend treats ServingShutdown as a control command and forwards all other commands to DiffulexEngine.run_serving_tick.
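The control-versus-work split can be sketched with in-memory queues standing in for the ZMQ sockets. Everything here is illustrative: the `Generate` and `Shutdown` classes, the `FakeEngine`, and in particular the argument and return shape of `run_serving_tick` are assumptions, since this page does not document that signature.

```python
import queue
from dataclasses import dataclass


@dataclass
class Generate:
    rid: str
    prompt: str


@dataclass
class Shutdown:
    pass


class FakeEngine:
    """Test double; the real engine's tick signature may differ."""

    def run_serving_tick(self, commands):
        return [f"reply:{command.rid}" for command in commands]


def run_worker(engine, commands: queue.Queue, events: queue.Queue) -> None:
    """Blocking receive/send loop: stop on Shutdown, forward everything else."""
    while True:
        command = commands.get()  # blocks until a command arrives
        if isinstance(command, Shutdown):
            break  # control command: never reaches the engine
        for event in engine.run_serving_tick([command]):
            events.put(event)


commands, events = queue.Queue(), queue.Queue()
commands.put(Generate(rid="diffulex-a", prompt="hi"))
commands.put(Shutdown())
run_worker(FakeEngine(), commands, events)
```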

diffulex.server.engine_loop

engine_loop is an in-process alternative to the ZMQ worker. It is useful when an application wants async admission and streaming, but does not want to start a separate backend process.

Symbol	How to use it	What it does
`QueuedCommand`	Internal queue item.	Wraps a `ServingCommand` before the loop admits it.
`default_engine_factory`	Default constructor hook.	Imports strategies and creates `DiffulexEngine`.
`EngineLoop`	Construct with a model path and engine kwargs, then `await start()`.	Owns one engine in a one-worker executor and serializes all engine mutations.
`EngineLoop.generate`	Pass prompt text or token IDs and `SamplingParams`.	Queues a non-streaming request and awaits the final `ServingReply`.
`EngineLoop.generate_stream`	Pass prompt, sampling params, and optional disconnect callback.	Yields `ServingDelta`, `ServingBufferSnapshot`, `ServingReply`, or `ServingError` events.
`EngineLoop.render_chat_prompt`	Pass chat messages.	Delegates chat-template rendering to the loaded engine.
`EngineLoop.stop`	Call during shutdown.	Stops the loop, fails pending waiters, exits the engine, and closes the executor.

All engine calls run through call_engine, so FastAPI-style async request handling does not mutate DiffulexEngine concurrently.
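A one-worker executor is a common way to get this guarantee. The sketch below shows the general technique with a hypothetical `LoopSketch` and a counting stand-in engine. It is not the `EngineLoop` implementation, only an illustration of why concurrent coroutines end up running engine code one call at a time on a single thread.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingEngine:
    """Stand-in engine whose state is not safe to mutate concurrently."""

    def __init__(self):
        self.steps = 0
        self.thread_names = set()

    def step(self) -> int:
        self.thread_names.add(threading.current_thread().name)
        self.steps += 1
        return self.steps


class LoopSketch:
    def __init__(self, engine):
        self._engine = engine
        # max_workers=1 means submitted calls run strictly one after another.
        self._executor = ThreadPoolExecutor(max_workers=1)

    async def call_engine(self, fn, *args):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    def stop(self) -> None:
        self._executor.shutdown(wait=True)


async def demo():
    engine = CountingEngine()
    loop_sketch = LoopSketch(engine)
    try:
        # Five coroutines "concurrently" ask for an engine step.
        results = await asyncio.gather(
            *(loop_sketch.call_engine(engine.step) for _ in range(5))
        )
    finally:
        loop_sketch.stop()
    return sorted(results), engine.thread_names
```

Running `asyncio.run(demo())` should show every step executed on the same worker thread, with no lost increments.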

diffulex.server.frontend

frontend is the async bridge between HTTP handlers and the backend process. It tracks per-request event queues and aborts backend work when the client disconnects before completion.

Symbol	How to use it	What it does
`ClientDisconnected`	Catch it at the HTTP layer.	Signals that the client went away while the request was still active.
`FrontendReqState`	Internal request state.	Stores buffered events and an `asyncio.Event` used to wake waiters.
`FrontendManager`	Create directly with queues or via `from_zmq`.	Sends commands, listens for backend events, and maps events back to request IDs.
`FrontendManager.from_zmq`	Pass model id and ZMQ addresses.	Creates async push/pull queues using the protocol serializers.
`FrontendManager.generate`	Pass a `ServingGenerate`.	Waits until a final reply or error arrives.
`FrontendManager.generate_stream`	Pass a `ServingGenerate`.	Yields backend events as they arrive and aborts incomplete requests on exit.
`FrontendManager.abort_request`	Pass a request ID.	Sends a `ServingAbort` command to the backend.

The frontend creates request IDs with the diffulex- prefix and keeps request state only while the request is active.
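The per-request bookkeeping can be sketched as follows. `FrontendSketch`, `ReqState`, the dict-shaped events, and the UUID suffix are all hypothetical; only the `diffulex-` prefix, the buffered events plus `asyncio.Event` pairing, and the "state exists only while active" rule come from the description above. Abort-on-disconnect is left out to keep the sketch short.

```python
import asyncio
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReqState:
    events: deque = field(default_factory=deque)
    wakeup: asyncio.Event = field(default_factory=asyncio.Event)


class FrontendSketch:
    def __init__(self):
        self._states = {}

    def new_request_id(self) -> str:
        # Only the prefix is documented; the suffix format is an assumption.
        return f"diffulex-{uuid.uuid4().hex}"

    def dispatch(self, rid: str, event: dict) -> None:
        """Route one backend event to the waiting request, if still active."""
        state = self._states.get(rid)
        if state is None:
            return  # request already finished; drop the event
        state.events.append(event)
        state.wakeup.set()

    async def stream(self, rid: str):
        state = self._states[rid] = ReqState()
        try:
            while True:
                await state.wakeup.wait()
                state.wakeup.clear()
                while state.events:
                    event = state.events.popleft()
                    yield event
                    if event.get("final"):
                        return
        finally:
            # State is kept only while the request is active.
            self._states.pop(rid, None)


async def demo():
    frontend = FrontendSketch()
    rid = frontend.new_request_id()

    async def backend():
        frontend.dispatch(rid, {"text": "Hel"})
        frontend.dispatch(rid, {"text": "lo", "final": True})

    task = asyncio.create_task(backend())
    received = [event async for event in frontend.stream(rid)]
    await task
    return rid, received, dict(frontend._states)
```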

diffulex.server.launch

launch is the CLI entry point. It parses ServerArgs, resolves IPC addresses, starts the backend process, builds the FastAPI app, and runs Uvicorn.

Symbol	How to use it	What it does
`default_ipc_addr`	Pass parsed server args and a name such as `commands`.	Creates an `ipc://` address under the system temporary directory.
`resolve_zmq_addrs`	Pass `ServerArgs`.	Uses explicit ZMQ addresses when provided, otherwise creates default IPC addresses.
`start_backend`	Pass args and resolved ZMQ addresses.	Starts the synchronous backend in a spawned process and waits for readiness.
`main`	Used by the CLI module.	Runs the complete HTTP server lifecycle.

The default address scheme is local IPC. Pass --zmq-command-addr and --zmq-event-addr only to override those defaults with explicit transport addresses for the frontend and backend.
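The "explicit address, else default IPC" resolution could look roughly like this. The function signatures and the socket file naming are assumptions made for illustration (the real `default_ipc_addr` takes the parsed server args). The `ipc://` scheme and the system temporary directory are the only parts taken from the table above.

```python
import os
import tempfile
from typing import Optional, Tuple


def default_ipc_addr(port: int, name: str) -> str:
    """Build an ipc:// address under the system temp dir (naming is hypothetical)."""
    path = os.path.join(tempfile.gettempdir(), f"diffulex-{port}-{name}.ipc")
    return f"ipc://{path}"


def resolve_zmq_addrs(
    port: int,
    command_addr: Optional[str] = None,
    event_addr: Optional[str] = None,
) -> Tuple[str, str]:
    """Prefer explicit addresses; otherwise fall back to default IPC ones."""
    return (
        command_addr or default_ipc_addr(port, "commands"),
        event_addr or default_ipc_addr(port, "events"),
    )


default_cmd, default_evt = resolve_zmq_addrs(8000)
custom_cmd, custom_evt = resolve_zmq_addrs(8000, command_addr="tcp://127.0.0.1:5555")
```

Keying the default path on something unique per server instance (here the port) avoids two servers on one machine sharing a socket file.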

diffulex.server.protocol

protocol defines the typed messages that cross the frontend/backend boundary. The dataclasses are intentionally simple because they are serialized to dicts and packed with msgpack before being sent over ZMQ.

Symbol	How to use it	What it does
`PromptInput`	Store a string prompt or token ID list.	Represents `/generate` input.
`ChatInput`	Store a list of `{role, content}` messages.	Represents chat-completion input before template rendering.
`ServingGenerate`	Send from frontend to backend.	Starts a generation request with sampling params, streaming mode, user, and timestamp metadata.
`ServingAbort`	Send from frontend to backend.	Requests cancellation for one request ID.
`ServingShutdown`	Send during frontend shutdown.	Asks the backend worker to stop.
`ServingReply`	Send from backend to frontend.	Represents the final generated text, token IDs, NFE count, and finish reason.
`ServingDelta`	Send during `block_append` streaming.	Represents newly appended text and token IDs with an offset.
`ServingBufferSnapshot`	Send during `denoise` streaming.	Represents the current editable buffer span and text after a denoising update.
`ServingError`	Send when backend work fails.	Carries an error message for one request ID.
`serving_command_to_dict` / `serving_command_from_dict`	Use at queue boundaries.	Serialize and restore frontend-to-backend commands.
`serving_event_to_dict` / `serving_event_from_dict`	Use at queue boundaries.	Serialize and restore backend-to-frontend events.

Each command and event stores its identifier in an rid field and exposes it as request_id, so the HTTP, frontend, backend, and engine layers all refer to a request by the same identifier.
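A tagged-dict round trip is one plausible way to serialize such dataclasses. The sketch below uses two reduced, hypothetical command classes and a `type` key as the discriminator. The real field sets and tag format are not documented here, so treat both as assumptions. The resulting dict is what would then be handed to msgpack.

```python
from dataclasses import asdict, dataclass


@dataclass
class Generate:
    rid: str
    prompt: str
    stream_mode: str = "block_append"

    @property
    def request_id(self) -> str:
        return self.rid


@dataclass
class Abort:
    rid: str

    @property
    def request_id(self) -> str:
        return self.rid


# Registry used to restore the right class from the "type" tag.
_COMMANDS = {cls.__name__: cls for cls in (Generate, Abort)}


def command_to_dict(command) -> dict:
    """Flatten a command to a plain dict with a type tag."""
    return {"type": type(command).__name__, **asdict(command)}


def command_from_dict(data: dict):
    """Rebuild the typed command from its tagged dict."""
    fields = dict(data)  # copy so the caller's dict is not mutated
    cls = _COMMANDS[fields.pop("type")]
    return cls(**fields)


original = Generate(rid="diffulex-1", prompt="hi")
restored = command_from_dict(command_to_dict(original))
```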

diffulex.server.zmq_queue

zmq_queue wraps pyzmq PUSH/PULL sockets with msgpack serialization. The module has synchronous classes for the backend worker and async classes for the FastAPI frontend.

Symbol	How to use it	What it does
`ZmqPushQueue`	Use in synchronous code that sends objects.	Encodes an object to a dict, packs it with msgpack, and sends it through a PUSH socket.
`ZmqPullQueue`	Use in synchronous code that receives objects.	Receives a msgpack payload from a PULL socket and decodes it back to a typed object.
`ZmqAsyncPushQueue`	Use in async code that sends objects.	Async PUSH equivalent of `ZmqPushQueue`.
`ZmqAsyncPullQueue`	Use in async code that receives objects.	Async PULL equivalent of `ZmqPullQueue`.

All queue classes accept a create flag. When create=True, the socket binds the address; when create=False, it connects to an address owned by another process.
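The bind-or-connect decision can be illustrated without a live socket. In this sketch `RecordingSocket` is a stand-in that only records calls, `PushQueueSketch` is a hypothetical simplification of the push queue, and `json.dumps` stands in for msgpack packing so the example needs no third-party packages. With pyzmq, a real PUSH socket exposing the same `bind`, `connect`, and `send` methods would take the stand-in's place.

```python
import json


class RecordingSocket:
    """Stand-in for a pyzmq socket; records calls instead of doing I/O."""

    def __init__(self):
        self.calls = []
        self.sent = []

    def bind(self, addr: str) -> None:
        self.calls.append(("bind", addr))

    def connect(self, addr: str) -> None:
        self.calls.append(("connect", addr))

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


class PushQueueSketch:
    def __init__(self, sock, addr: str, create: bool, encoder):
        self._sock = sock
        self._encoder = encoder
        if create:
            sock.bind(addr)      # this side owns the address
        else:
            sock.connect(addr)   # the address is owned by another process

    def put(self, obj) -> None:
        # Encode to a dict, then pack; json is a stand-in for msgpack here.
        self._sock.send(json.dumps(self._encoder(obj)).encode("utf-8"))


owner_sock, peer_sock = RecordingSocket(), RecordingSocket()
owner = PushQueueSketch(owner_sock, "ipc:///tmp/demo", create=True, encoder=dict)
peer = PushQueueSketch(peer_sock, "ipc:///tmp/demo", create=False, encoder=dict)
owner.put({"rid": "diffulex-1"})
```

Exactly one side of each address should pass create=True. Which side binds is a deployment choice that this page does not specify.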