# diffulex.moe

`diffulex.moe` contains Mixture-of-Experts configuration helpers, router
metadata, token dispatchers, top-k routing, and fused expert execution layers.

Current public support is conservative: model code should use the package-level
builders and the documented MoE GEMM options instead of selecting MoE
implementations or dispatcher internals directly.

| Module | Role |
| --- | --- |
| `diffulex.moe.config` | Reads MoE-related attributes from model configs. |
| `diffulex.moe.dispatcher` | Token dispatcher implementations and dispatcher factory. |
| `diffulex.moe.layer` | Fused MoE layer implementations and layer factory. |
| `diffulex.moe.metadata` | Router, dispatcher, and expert execution metadata dataclasses. |
| `diffulex.moe.topk` | Top-k router implementations and router factory. |

## diffulex.moe.config

This module normalizes model-config differences so MoE code can ask for common
concepts such as expert count, experts-per-token, sparse-layer placement, and
intermediate size.

| Symbol | Purpose |
| --- | --- |
| `get_num_experts` | Reads total expert count. |
| `get_num_experts_per_tok` | Reads top-k experts per token. |
| `get_moe_intermediate_size` | Reads the intermediate size of the MoE expert MLPs. |
| `is_moe_layer` | Determines whether a layer index should use MoE. |

Use these helpers rather than reading raw HF config attributes directly.
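
The kind of normalization described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function `sketch_get_num_experts`, the helper `read_first_attr`, and the attribute aliases it checks are assumptions chosen for the example, and the real `get_num_experts` may look at different attributes or accept different arguments.

```python
from types import SimpleNamespace

# Assumed attribute aliases for "total expert count". Real model configs
# may use other names; this tuple is illustrative only.
_NUM_EXPERTS_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")


def read_first_attr(config, names, default=None):
    """Return the first attribute in `names` that is set on `config`."""
    for name in names:
        value = getattr(config, name, None)
        if value is not None:
            return value
    return default


def sketch_get_num_experts(config):
    """Hypothetical stand-in for a helper such as `get_num_experts`."""
    return read_first_attr(config, _NUM_EXPERTS_KEYS, default=0)


# Three configs that spell the same concept differently (or not at all).
config_a = SimpleNamespace(num_local_experts=8)
config_b = SimpleNamespace(n_routed_experts=64)
dense_config = SimpleNamespace(hidden_size=1024)

counts = [sketch_get_num_experts(c) for c in (config_a, config_b, dense_config)]
```

Keeping the alias list in one place means callers never branch on model family, which is the reason to prefer a helper over raw attribute access.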

## diffulex.moe.dispatcher

This package moves tokens between ranks for expert execution. The dispatcher
factory chooses an implementation based on config.

| Symbol | Purpose |
| --- | --- |
| `TokenDispatcher` | Abstract dispatcher contract. |
| `DispatcherOutput` | Output structure returned by dispatchers. |
| `build_token_dispatcher` | Factory for the configured dispatcher backend. |
| `NaiveA2ADispatcher` | Reference all-to-all dispatcher used by internal experiments. |

Use the dispatcher factory from model code. Direct dispatcher selection is not a
public tuning surface for normal serving or benchmark runs.
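
To make the dispatcher's job concrete, here is a single-process sketch of the two steps a token dispatcher typically performs: grouping token assignments by expert, then combining weighted expert outputs back into token order. The function names and data layout are hypothetical, and no cross-rank communication is modeled; a real all-to-all dispatcher would exchange these groups between ranks.

```python
from collections import defaultdict


def group_tokens_by_expert(topk_ids):
    """Map expert id -> list of (token_index, slot) pairs routed to it."""
    groups = defaultdict(list)
    for token_idx, expert_ids in enumerate(topk_ids):
        for slot, expert_id in enumerate(expert_ids):
            groups[expert_id].append((token_idx, slot))
    return dict(groups)


def combine(num_tokens, groups, expert_outputs, topk_weights):
    """Sum weighted per-expert outputs back into the original token order."""
    combined = [0.0] * num_tokens
    for expert_id, assignments in groups.items():
        for (token_idx, slot), value in zip(assignments, expert_outputs[expert_id]):
            combined[token_idx] += topk_weights[token_idx][slot] * value
    return combined


# Three scalar "tokens", three toy experts, two experts per token.
tokens = [1.0, 2.0, 3.0]
topk_ids = [[0, 2], [1, 2], [0, 1]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

groups = group_tokens_by_expert(topk_ids)

# Toy expert e multiplies its input by (e + 1).
expert_outputs = {
    expert_id: [(expert_id + 1) * tokens[token_idx] for token_idx, _ in assignments]
    for expert_id, assignments in groups.items()
}

result = combine(len(tokens), groups, expert_outputs, topk_weights)
```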

## diffulex.moe.layer

This package executes expert MLPs after routing. It provides naive,
tensor-parallel, expert-parallel, and optional vLLM-backed implementations
behind a factory function.

| Symbol | Purpose |
| --- | --- |
| `build_moe_block` | Factory for MoE blocks. |
| `FusedMoE` | Base fused MoE layer contract. |
| `SharedExpertMLP` | Shared expert MLP helper. |
| `NaiveFusedMoE` | Reference fused MoE implementation. |
| `TPFusedMoE` | Tensor-parallel fused MoE implementation. |
| `EPFusedMoE` | Expert-parallel fused MoE implementation. |

Model layers should call the factory rather than instantiate implementation
classes directly.
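
The factory pattern recommended here can be sketched as below. Everything in this block is hypothetical: the class names, the `sketch_build_block` signature, and especially the selection rule (expert parallel first, then tensor parallel, otherwise naive) are assumptions made for illustration and are not taken from `build_moe_block`.

```python
from dataclasses import dataclass


@dataclass
class SketchMoEBlock:
    """Hypothetical base type standing in for a fused MoE layer contract."""
    num_experts: int
    top_k: int


class SketchNaiveBlock(SketchMoEBlock):
    """Stand-in for a single-device reference implementation."""


class SketchTPBlock(SketchMoEBlock):
    """Stand-in for a tensor-parallel implementation."""


class SketchEPBlock(SketchMoEBlock):
    """Stand-in for an expert-parallel implementation."""


def sketch_build_block(num_experts, top_k, tp_size=1, ep_size=1):
    """Pick an implementation from parallelism settings (assumed rule)."""
    if top_k > num_experts:
        raise ValueError("top_k cannot exceed num_experts")
    if ep_size > 1:
        return SketchEPBlock(num_experts, top_k)
    if tp_size > 1:
        return SketchTPBlock(num_experts, top_k)
    return SketchNaiveBlock(num_experts, top_k)


# Model code only sees the factory, never the concrete classes.
block = sketch_build_block(num_experts=8, top_k=2, tp_size=2)
```

Because callers depend only on the factory, a backend can be added or swapped without touching model definitions.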

## diffulex.moe.metadata

This module defines structured metadata passed between routers, dispatchers, and
expert execution layers.

| Symbol | Purpose |
| --- | --- |
| `RouterMetadata` | Router output metadata. |
| `DispatchMetadata` | Base dispatcher metadata. |
| `ExpertExecutionMetadata` | Metadata needed while executing experts. |
| `DispatcherStage` | Dispatcher lifecycle stage enum. |

Use these dataclasses to keep dispatcher and expert-layer contracts explicit.
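
As an illustration of what an explicit contract buys, the sketch below defines a router-metadata dataclass that validates its own shape, plus a small stage enum. The field names, the validation rules, and the enum members are assumptions; the actual `RouterMetadata` and `DispatcherStage` definitions may differ.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class SketchStage(Enum):
    """Hypothetical lifecycle stages; real enum members may differ."""
    DISPATCH = auto()
    COMBINE = auto()


@dataclass(frozen=True)
class SketchRouterMetadata:
    """Hypothetical router output: per-token expert ids and weights."""
    topk_ids: List[List[int]]
    topk_weights: List[List[float]]

    def __post_init__(self):
        # Fail early if ids and weights disagree in shape.
        if len(self.topk_ids) != len(self.topk_weights):
            raise ValueError("topk_ids and topk_weights must cover the same tokens")
        for ids, weights in zip(self.topk_ids, self.topk_weights):
            if len(ids) != len(weights):
                raise ValueError("each token needs one weight per selected expert")

    @property
    def num_tokens(self):
        return len(self.topk_ids)

    @property
    def top_k(self):
        return len(self.topk_ids[0]) if self.topk_ids else 0


meta = SketchRouterMetadata(
    topk_ids=[[0, 2], [1, 3]],
    topk_weights=[[0.7, 0.3], [0.5, 0.5]],
)
```

Validating in `__post_init__` surfaces a malformed router output at construction time rather than deep inside a dispatcher or expert kernel.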

## diffulex.moe.topk

This package selects experts for each token. It provides top-k router
implementations and a factory used by MoE layers.

| Symbol | Purpose |
| --- | --- |
| `TopKRouter` | Base router contract. |
| `TopKOutput` | Router output dataclass. |
| `build_topk_router` | Factory for configured router behavior. |
| `NaiveTopKRouter` | Standard top-k router. |
| `GroupLimitedTopKRouter` | Router with group-limited expert selection. |
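
The difference between standard and group-limited selection can be shown with a small pure-Python sketch. The function names and parameters are hypothetical, and two details are assumptions rather than documented behavior: weights are renormalized over the selected experts, and a group's score is taken as the maximum probability among its experts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def _select(probs, candidates, k):
    """Take the k most probable candidates and renormalize their weights."""
    ids = sorted(candidates, key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ids)
    return ids, [probs[i] / total for i in ids]


def naive_topk(logits, k):
    """Standard top-k: every expert is a candidate."""
    probs = softmax(logits)
    return _select(probs, range(len(probs)), k)


def group_limited_topk(logits, k, num_groups, topk_groups):
    """Restrict candidates to the best `topk_groups` groups, then take top-k."""
    if len(logits) % num_groups != 0:
        raise ValueError("expert count must be divisible by num_groups")
    probs = softmax(logits)
    size = len(probs) // num_groups
    # Assumed scoring rule: a group's score is its best expert's probability.
    scores = [max(probs[g * size:(g + 1) * size]) for g in range(num_groups)]
    kept = sorted(range(num_groups), key=lambda g: scores[g], reverse=True)[:topk_groups]
    allowed = [i for g in kept for i in range(g * size, (g + 1) * size)]
    return _select(probs, allowed, k)


# Four experts in two groups: {0, 1} and {2, 3}.
logits = [3.0, 0.0, 2.5, 2.4]
naive_ids, naive_weights = naive_topk(logits, k=2)
limited_ids, limited_weights = group_limited_topk(logits, k=2, num_groups=2, topk_groups=1)
```

With these logits the standard router picks experts 0 and 2, while the group-limited router keeps only the group containing expert 0 and therefore picks experts 0 and 1. Confining each token to fewer groups is typically used to limit how many devices a token must be sent to.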