Profiling¶

Use profiling when correctness is already established and you need to understand where time or memory is spent. Keep profiling runs narrow; changing many runtime settings at once makes results difficult to interpret.

Pytorch Profiler¶

Use PyTorch Profiler when you need CPU and CUDA timing for a focused inference path. Diffulex records named regions around major engine operations such as scheduler work, request preparation, model runner execution, and output recording.

Keep profiling runs small:

use a short prompt set;
limit generated tokens;
record one model and strategy configuration at a time;
save traces outside the source tree when they are large.

What to Measure¶

Choose the metric before profiling:

end-to-end latency for one request;
throughput for a fixed prompt set;
scheduler overhead;
model runner execution time;
prefill cost versus decode cost;
kernel time for attention, top-k, or MoE operations.

Use the smallest workload that still exhibits the behavior being measured.

Runtime Toggles¶

Compare optimized and debug paths deliberately:

Setting	What it compares
`enforce_eager=True`	Debug-friendly eager execution against optimized paths.
CUDA Graph paths	Launch-overhead reduction against eager execution.
`enable_torch_compile`	Supported compiled execution against uncompiled execution.
`enable_vllm_layers`	Optional vLLM-backed layers against local layer implementations.

Run a baseline before changing any toggle. Keep model, prompts, token limits, and parallelism fixed across comparisons.

Existing Scripts¶

Diffulex includes profiling and benchmark scripts under script/. Use these when they match the target workload because they capture local conventions for model paths, task names, and profiler output.

If you create a new profiling script, keep large trace files out of the source tree and document the exact command used to generate a result.