# diffulex.server

`diffulex.server` is the online serving layer. It turns HTTP requests into
engine commands, forwards them to the backend worker, and converts backend
events back into plain JSON or server-sent events.

The package has two serving paths:

| Path | When it is used | Main modules |
| --- | --- | --- |
| HTTP + ZMQ backend | The normal `diffulex.server` CLI path. The FastAPI process and the engine process communicate through ZMQ queues. | `api_server`, `frontend`, `backend_worker`, `launch`, `protocol`, `zmq_queue` |
| In-process async loop | Useful for tests or embedders that want to own a single Python process. Engine calls still run through a one-worker executor. | `engine_loop` |

The HTTP surface exposes `POST /generate`, `GET /v1/models`, and
`POST /v1/chat/completions`. Streaming responses are sent as SSE frames and can
use either block append events or denoise buffer snapshots.

## diffulex.server.api_server

`api_server` builds the FastAPI application. It owns request validation,
OpenAI-compatible chat response shaping, SSE formatting, and translation from
HTTP payloads to `ServingGenerate` commands.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `GenerateRequest` | Use it as the request model for `/generate`. | Accepts a string prompt or token IDs, plus sampling and streaming fields. |
| `ChatMessage` | Use it inside `ChatCompletionRequest.messages`. | Stores one chat message with `role` and `content`. |
| `ChatCompletionRequest` | Use it as the request model for `/v1/chat/completions`. | Accepts chat messages and the same sampling/streaming controls as `/generate`. |
| `sampling_params_from_request` | Pass a generate or chat request model. | Builds `SamplingParams` from request fields such as `temperature`, `max_tokens`, `max_nfe`, and `ignore_eos`. |
| `create_app` | Pass a `FrontendManager`. | Returns the FastAPI app and wires startup/shutdown hooks to the frontend. |
| `chat_delta_chunk` | Pass a `ServingDelta`, request, and model id. | Builds an OpenAI-style streaming chat delta chunk. |
| `chat_finish_chunk` | Pass a final `ServingReply`. | Builds the final OpenAI-style streaming chat chunk and usage shell. |
| `denoise_chat_event` | Pass a `ServingDelta`, `ServingBufferSnapshot`, or `ServingReply`. | Wraps denoise events in chat-completion metadata while preserving Diffulex-specific fields. |

`stream_mode="block_append"` is closest to normal token streaming: only appended
text deltas are emitted. `stream_mode="denoise"` keeps the diffusion decoding
state visible by sending buffer snapshots as the model edits a block.

## diffulex.server.args

`args` defines the CLI schema used by `diffulex.server`. `ServerArgs`
keeps web-server options and engine options in one dataclass, then exposes
`engine_kwargs()` for constructing `DiffulexEngine`.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `parse_device_ids` | Pass a comma-separated string such as `0,1,2,3`. | Converts it to logical CUDA device IDs; empty input becomes an empty list. |
| `ServerArgs` | Construct directly in tests, or receive it from `parse_args`. | Stores host/port, ZMQ addresses, model identity, parallelism, cache, attention, MoE, threshold, and LoRA options. |
| `ServerArgs.engine_kwargs` | Call before creating the engine. | Returns only the engine-facing subset and fills threshold defaults when a CLI value is omitted. |
| `build_arg_parser` | Use when extending the server CLI. | Creates the `argparse.ArgumentParser` and declares allowed values for fields such as `sampling_mode` and `attn_impl`. |
| `parse_args` | Pass an optional argv sequence. | Parses CLI flags and returns `ServerArgs`. |

When adding a new serving flag, add it to `ServerArgs`, `build_arg_parser`, and
`parse_args`, then decide whether it belongs in `engine_kwargs()`. Web-only
fields such as `host` and `port` should stay out of the engine kwargs.

## diffulex.server.backend_worker

`backend_worker` runs the synchronous engine process used by the CLI server. It
receives serialized `ServingCommand` objects, advances the engine, and pushes
serialized `ServingEvent` objects back to the frontend.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `default_engine_factory` | Leave as the default unless tests inject a fake engine. | Imports built-in strategies and constructs `DiffulexEngine`. |
| `SyncBackendWorker` | Create through `from_zmq` for the normal server path. | Owns one engine instance and one blocking receive/send loop. |
| `SyncBackendWorker.from_zmq` | Pass model, engine kwargs, command address, and event address. | Builds the worker with ZMQ pull/push queues and protocol serializers. |
| `SyncBackendWorker.run_forever` | Call inside the backend process. | Initializes the engine, serves commands until shutdown, and closes resources. |
| `run_sync_backend_worker` | Use as the multiprocessing target. | Constructs the worker, reports startup errors through `ready_queue`, and enters `run_forever()`. |

The backend treats `ServingShutdown` as a control command and forwards all other
commands to `DiffulexEngine.run_serving_tick`.

## diffulex.server.engine_loop

`engine_loop` is an in-process alternative to the ZMQ worker. It is useful when
an application wants async admission and streaming, but does not want to start a
separate backend process.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `QueuedCommand` | Internal queue item. | Wraps a `ServingCommand` before the loop admits it. |
| `default_engine_factory` | Default constructor hook. | Imports strategies and creates `DiffulexEngine`. |
| `EngineLoop` | Construct with a model path and engine kwargs, then `await start()`. | Owns one engine in a one-worker executor and serializes all engine mutations. |
| `EngineLoop.generate` | Pass prompt text or token IDs and `SamplingParams`. | Queues a non-streaming request and awaits the final `ServingReply`. |
| `EngineLoop.generate_stream` | Pass prompt, sampling params, and optional disconnect callback. | Yields `ServingDelta`, `ServingBufferSnapshot`, `ServingReply`, or `ServingError` events. |
| `EngineLoop.render_chat_prompt` | Pass chat messages. | Delegates chat-template rendering to the loaded engine. |
| `EngineLoop.stop` | Call during shutdown. | Stops the loop, fails pending waiters, exits the engine, and closes the executor. |

All engine calls run through `call_engine`, so FastAPI-style async request
handling does not mutate `DiffulexEngine` concurrently.

## diffulex.server.frontend

`frontend` is the async bridge between HTTP handlers and the backend process. It
tracks per-request event queues and aborts backend work when the client
disconnects before completion.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `ClientDisconnected` | Catch it at the HTTP layer. | Signals that the client went away while the request was still active. |
| `FrontendReqState` | Internal request state. | Stores buffered events and an `asyncio.Event` used to wake waiters. |
| `FrontendManager` | Create directly with queues or via `from_zmq`. | Sends commands, listens for backend events, and maps events back to request IDs. |
| `FrontendManager.from_zmq` | Pass model id and ZMQ addresses. | Creates async push/pull queues using the protocol serializers. |
| `FrontendManager.generate` | Pass a `ServingGenerate`. | Waits until a final reply or error arrives. |
| `FrontendManager.generate_stream` | Pass a `ServingGenerate`. | Yields backend events as they arrive and aborts incomplete requests on exit. |
| `FrontendManager.abort_request` | Pass a request ID. | Sends a `ServingAbort` command to the backend. |

The frontend creates request IDs with the `diffulex-` prefix and keeps request
state only while the request is active.

## diffulex.server.launch

`launch` is the CLI entry point. It parses `ServerArgs`, resolves IPC
addresses, starts the backend process, builds the FastAPI app, and runs Uvicorn.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `default_ipc_addr` | Pass parsed server args and a name such as `commands`. | Creates an `ipc://` address under the system temporary directory. |
| `resolve_zmq_addrs` | Pass `ServerArgs`. | Uses explicit ZMQ addresses when provided, otherwise creates default IPC addresses. |
| `start_backend` | Pass args and resolved ZMQ addresses. | Starts the synchronous backend in a spawned process and waits for readiness. |
| `main` | Used by the CLI module. | Runs the complete HTTP server lifecycle. |

The default address scheme is local IPC. Use `--zmq-command-addr` and
`--zmq-event-addr` only when the frontend and backend need explicit custom
transport addresses.

## diffulex.server.protocol

`protocol` defines the typed messages that cross the frontend/backend boundary.
The dataclasses are intentionally simple because they are serialized to dicts
and packed with msgpack before being sent over ZMQ.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `PromptInput` | Store a string prompt or token ID list. | Represents `/generate` input. |
| `ChatInput` | Store a list of `{role, content}` messages. | Represents chat-completion input before template rendering. |
| `ServingGenerate` | Send from frontend to backend. | Starts a generation request with sampling params, streaming mode, user, and timestamp metadata. |
| `ServingAbort` | Send from frontend to backend. | Requests cancellation for one request ID. |
| `ServingShutdown` | Send during frontend shutdown. | Asks the backend worker to stop. |
| `ServingReply` | Send from backend to frontend. | Represents the final generated text, token IDs, NFE count, and finish reason. |
| `ServingDelta` | Send during `block_append` streaming. | Represents newly appended text and token IDs with an offset. |
| `ServingBufferSnapshot` | Send during `denoise` streaming. | Represents the current editable buffer span and text after a denoising update. |
| `ServingError` | Send when backend work fails. | Carries an error message for one request ID. |
| `serving_command_to_dict` / `serving_command_from_dict` | Use at queue boundaries. | Serialize and restore frontend-to-backend commands. |
| `serving_event_to_dict` / `serving_event_from_dict` | Use at queue boundaries. | Serialize and restore backend-to-frontend events. |

Each command and event exposes `request_id` through the underlying `rid`, which
keeps the HTTP, frontend, backend, and engine layers aligned on one identifier.

## diffulex.server.zmq_queue

`zmq_queue` wraps pyzmq PUSH/PULL sockets with msgpack serialization. The
module has synchronous classes for the backend worker and async classes for the
FastAPI frontend.

| Symbol | How to use it | What it does |
| --- | --- | --- |
| `ZmqPushQueue` | Use in synchronous code that sends objects. | Encodes an object to a dict, packs it with msgpack, and sends it through a PUSH socket. |
| `ZmqPullQueue` | Use in synchronous code that receives objects. | Receives a msgpack payload from a PULL socket and decodes it back to a typed object. |
| `ZmqAsyncPushQueue` | Use in async code that sends objects. | Async PUSH equivalent of `ZmqPushQueue`. |
| `ZmqAsyncPullQueue` | Use in async code that receives objects. | Async PULL equivalent of `ZmqPullQueue`. |

All queue classes accept a `create` flag. When `create=True`, the socket binds
the address; when `create=False`, it connects to an address owned by another
process.