# Add a New Kernel

Add a new kernel when plain Python or the existing Triton/vLLM paths cannot
deliver the performance or behavior that a specific operation needs. Keep the
kernel isolated until its output matches a reference implementation, then
integrate it through the narrowest engine boundary.

## Choose the Location

Use `diffulex_kernel/python/` for Diffulex-owned Python/Triton kernel entry
points. Use the relevant third-party kernel package only when extending code
that already lives there.

Expose public kernel symbols through `diffulex_kernel/__init__.py` only when
other packages need to import them directly.

## Keep a Reference Path

Before optimizing, write or identify a reference implementation. The reference
can be a simple PyTorch implementation or an existing slower kernel path.

A good kernel test checks:

- output values against the reference;
- supported dtypes;
- boundary shapes;
- layout assumptions;
- device placement.
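
The comparison against a reference can be sketched without any framework. This is a minimal illustration, not Diffulex code: `reference_scale_add`, `candidate_scale_add`, and the tolerance are hypothetical, and plain Python lists stand in for tensors. In a real kernel test the inputs would be tensors on the target device and the comparison would typically use a helper such as `torch.testing.assert_close`.

```python
from __future__ import annotations

from typing import Callable, Iterable, Sequence, Tuple

Vector = Sequence[float]
Case = Tuple[Vector, Vector, float]


def reference_scale_add(x: Vector, y: Vector, alpha: float) -> list[float]:
    """Hypothetical reference path: simple and obviously correct."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * a + b for a, b in zip(x, y)]


def candidate_scale_add(x: Vector, y: Vector, alpha: float) -> list[float]:
    """Hypothetical stand-in for the new kernel entry point."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    out = list(y)
    for i, a in enumerate(x):
        out[i] += alpha * a
    return out


def max_abs_error(got: Vector, want: Vector) -> float:
    """Largest elementwise absolute difference; 0.0 for empty inputs."""
    if len(got) != len(want):
        raise AssertionError(f"length mismatch: {len(got)} != {len(want)}")
    return max((abs(g - w) for g, w in zip(got, want)), default=0.0)


def check_against_reference(
    candidate: Callable[[Vector, Vector, float], Vector],
    reference: Callable[[Vector, Vector, float], Vector],
    cases: Iterable[Case],
    atol: float = 1e-6,
) -> None:
    """Raise AssertionError on the first case where the outputs diverge."""
    for x, y, alpha in cases:
        err = max_abs_error(candidate(x, y, alpha), reference(x, y, alpha))
        if err > atol:
            raise AssertionError(f"len={len(x)} alpha={alpha}: error {err} > {atol}")


# Boundary shapes: empty input, a single element, and an odd length.
CASES: list[Case] = [
    ([], [], 2.0),
    ([1.0], [2.0], 0.5),
    ([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], -1.0),
]
check_against_reference(candidate_scale_add, reference_scale_add, CASES)
```

Keeping the reference function in the test file means the check still works after the optimized path replaces the slow one in the engine.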

## Integrate Through Engine Boundaries

Most kernels should be called through one of these layers:

- attention implementation;
- KV cache helper;
- model layer;
- MoE routing or GEMM helper;
- strategy-specific model runner.

Avoid calling a new kernel directly from scheduler code. The scheduler should
decide what runs, not own tensor-level implementation details.
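
One way to picture this separation is the sketch below. Every name in it (`ModelLayer`, `Scheduler`, `use_new_kernel`, and the two `*_double` functions) is hypothetical and only illustrates the boundary; it does not mirror the actual Diffulex classes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


def reference_double(values: Sequence[float]) -> list[float]:
    """Hypothetical existing path, kept as the fallback."""
    return [2.0 * v for v in values]


def new_kernel_double(values: Sequence[float]) -> list[float]:
    """Hypothetical stand-in for the new kernel entry point."""
    return [v + v for v in values]


@dataclass
class ModelLayer:
    """The layer is the only place that knows which kernel runs."""

    use_new_kernel: bool = False

    def forward(self, values: Sequence[float]) -> list[float]:
        kernel = new_kernel_double if self.use_new_kernel else reference_double
        return kernel(values)


@dataclass
class Scheduler:
    """The scheduler picks which requests run and never touches tensors."""

    max_batch: int

    def select(self, pending: Sequence[str]) -> list[str]:
        return list(pending[: self.max_batch])


scheduler = Scheduler(max_batch=2)
layer = ModelLayer(use_new_kernel=True)
batch = scheduler.select(["req-0", "req-1", "req-2"])
outputs = {request_id: layer.forward([1.0, 2.5]) for request_id in batch}
```

Because the switch lives in the layer, the new kernel can be turned off without touching scheduling logic.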

## Validate Layout Assumptions

Document the expected tensor layout near the call site. For KV cache and
attention kernels, verify these values together:

| Field | What to verify |
| --- | --- |
| `page_size` | Matches the KV cache paging expected by the kernel. |
| `block_size` | Fits within the selected page size and strategy layout. |
| `kv_cache_layout` | Matches the memory interpretation used by the kernel. |
| Attention metadata fields | Describe the same shape, page table, and context layout that the kernel reads. |
| dtype and device | Match the kernel's supported execution path. |

Mismatched metadata often looks like a kernel bug, so inspect metadata before
changing low-level code.
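
A pre-launch check along these lines can surface a mismatch before the kernel runs. This is a sketch under stated assumptions: `KVLayoutConfig`, `KernelRequirements`, and the layout names are hypothetical, "fits within" is interpreted as `block_size <= page_size`, and attention metadata fields are left out because their structure depends on the strategy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayoutConfig:
    """Values the engine is about to pass to the kernel (hypothetical)."""

    page_size: int
    block_size: int
    kv_cache_layout: str
    dtype: str
    device: str  # for example "cuda:0"


@dataclass(frozen=True)
class KernelRequirements:
    """What the kernel was written to accept (hypothetical)."""

    page_size: int
    kv_cache_layout: str
    dtypes: frozenset[str]
    device_type: str


def layout_problems(cfg: KVLayoutConfig, req: KernelRequirements) -> list[str]:
    """Collect every mismatch instead of stopping at the first one."""
    problems: list[str] = []
    if cfg.page_size != req.page_size:
        problems.append(f"page_size {cfg.page_size} != expected {req.page_size}")
    # Assumption: a block fits when it is positive and no larger than a page.
    if cfg.block_size <= 0 or cfg.block_size > cfg.page_size:
        problems.append(
            f"block_size {cfg.block_size} does not fit page_size {cfg.page_size}"
        )
    if cfg.kv_cache_layout != req.kv_cache_layout:
        problems.append(
            f"kv_cache_layout {cfg.kv_cache_layout!r} != {req.kv_cache_layout!r}"
        )
    if cfg.dtype not in req.dtypes:
        problems.append(f"dtype {cfg.dtype!r} not in {sorted(req.dtypes)}")
    if cfg.device.split(":")[0] != req.device_type:
        problems.append(f"device {cfg.device!r} is not a {req.device_type} device")
    return problems


requirements = KernelRequirements(
    page_size=32,
    kv_cache_layout="layout_a",  # placeholder name, not a real Diffulex value
    dtypes=frozenset({"float16", "bfloat16"}),
    device_type="cuda",
)
config = KVLayoutConfig(
    page_size=32,
    block_size=32,
    kv_cache_layout="layout_a",
    dtype="bfloat16",
    device="cuda:0",
)
assert layout_problems(config, requirements) == []
```

Returning the full list of problems makes it easier to see when several fields drift together, which is the common case after a configuration change.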

## Profiling

Profile the kernel in isolation before measuring full engine throughput. Once
the kernel is integrated, compare against the previous path with the same model,
prompt set, token limits, batch limits, and parallelism, so that any difference
in throughput can be attributed to the kernel.
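
For the isolated measurement, a small timing helper is often enough. The one below is a generic sketch, not a Diffulex utility: `time_call` and its parameters are hypothetical. GPU kernel launches are asynchronous, so when timing a CUDA kernel a synchronization function such as `torch.cuda.synchronize` would be passed as `sync`; otherwise the timer only measures the launch.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable


def time_call(
    fn: Callable[[], object],
    warmup: int = 3,
    iters: int = 20,
    sync: Callable[[], None] | None = None,
) -> float:
    """Return the median wall-clock seconds of `fn` over `iters` runs."""
    if iters < 1:
        raise ValueError("iters must be at least 1")
    # Warm-up runs absorb one-time costs such as compilation and caching.
    for _ in range(warmup):
        fn()
    samples: list[float] = []
    for _ in range(iters):
        if sync is not None:
            sync()  # drain earlier work so it is not charged to this sample
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn to finish
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Usage with a CPU-only placeholder workload.
median_seconds = time_call(lambda: sum(range(10_000)), warmup=2, iters=5)
```

The median is used rather than the mean so that a single slow run does not dominate the result.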

## Verification Checklist

1. Add a focused correctness test.
2. Add shape or dtype coverage for supported variants.
3. Run the focused kernel test.
4. Run the smallest engine path that reaches the kernel.
5. Profile only after correctness is stable.
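
For step 2, the coverage matrix can be generated rather than written by hand. The dtype names and the tile size below are assumptions chosen for illustration; the real supported set depends on the kernel.

```python
from __future__ import annotations

from itertools import product
from typing import Sequence

SUPPORTED_DTYPES = ("float16", "bfloat16", "float32")  # hypothetical set
TILE = 32  # hypothetical tile size used by the kernel


def boundary_lengths(tile: int) -> list[int]:
    """Lengths around one tile: smallest input, tile edges, and two tiles."""
    candidates = {1, tile - 1, tile, tile + 1, 2 * tile}
    return sorted(n for n in candidates if n >= 1)


def case_matrix(dtypes: Sequence[str], tile: int) -> list[tuple[str, int]]:
    """Every (dtype, length) pair the focused test should cover."""
    return list(product(dtypes, boundary_lengths(tile)))


cases = case_matrix(SUPPORTED_DTYPES, TILE)
```

A list like `cases` can be fed to `pytest.mark.parametrize` so that each variant is reported as its own test.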