Profiling¶
Use profiling when correctness is already established and you need to understand where time or memory is spent. Keep profiling runs narrow; changing many runtime settings at once makes results difficult to interpret.
Pytorch Profiler¶
Use PyTorch Profiler when you need CPU and CUDA timing for a focused inference path. Diffulex records named regions around major engine operations such as scheduler work, request preparation, model runner execution, and output recording.
Keep profiling runs small:
use a short prompt set;
limit generated tokens;
record one model and strategy configuration at a time;
save traces outside the source tree when they are large.
What to Measure¶
Choose the metric before profiling:
end-to-end latency for one request;
throughput for a fixed prompt set;
scheduler overhead;
model runner execution time;
prefill cost versus decode cost;
kernel time for attention, top-k, or MoE operations.
Use the smallest workload that still exhibits the behavior being measured.
Runtime Toggles¶
Compare optimized and debug paths deliberately:
Setting |
What it compares |
|---|---|
|
Debug-friendly eager execution against optimized paths. |
CUDA Graph paths |
Launch-overhead reduction against eager execution. |
|
Supported compiled execution against uncompiled execution. |
|
Optional vLLM-backed layers against local layer implementations. |
Run a baseline before changing any toggle. Keep model, prompts, token limits, and parallelism fixed across comparisons.
Existing Scripts¶
Diffulex includes profiling and benchmark scripts under script/. Use these
when they match the target workload because they capture local conventions for
model paths, task names, and profiler output.
If you create a new profiling script, keep large trace files out of the source tree and document the exact command used to generate a result.