Profiling

Use profiling when correctness is already established and you need to understand where time or memory is spent. Keep profiling runs narrow; changing many runtime settings at once makes results difficult to interpret.

PyTorch Profiler

Use PyTorch Profiler when you need CPU and CUDA timing for a focused inference path. Diffulex records named regions around major engine operations such as scheduler work, request preparation, model runner execution, and output recording.

Keep profiling runs small:

  • use a short prompt set;

  • limit generated tokens;

  • record one model and strategy configuration at a time;

  • save traces outside the source tree when they are large.
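A minimal sketch of a focused profiling run with `torch.profiler` is shown here. The `torch.nn.Linear` layer stands in for a real model, and the region names `example/model_runner` and `example/output_recording` are hypothetical labels chosen for illustration; they are not the names Diffulex records. The trace is written to a temporary directory so it stays outside the source tree.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def profile_toy_step(trace_dir):
    """Profile a stand-in workload and export a Chrome trace into trace_dir."""
    # Always record CPU activity; add CUDA only when a GPU is present.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Stand-in for a model forward pass (assumption, not a Diffulex component).
    layer = torch.nn.Linear(64, 64).to(device)
    x = torch.randn(8, 64, device=device)

    with profile(activities=activities) as prof:
        with torch.no_grad():
            # Hypothetical region names; real runs would use the engine's own labels.
            with record_function("example/model_runner"):
                y = layer(x)
            with record_function("example/output_recording"):
                y.sum().item()

    trace_path = os.path.join(trace_dir, "trace.json")
    prof.export_chrome_trace(trace_path)
    return prof, trace_path


# Write the trace to a temporary directory outside the source tree.
trace_dir = tempfile.mkdtemp(prefix="profile_traces_")
prof, trace_path = profile_toy_step(trace_dir)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print("trace written to", trace_path)
```

The exported `trace.json` can be opened in a Chrome-trace viewer such as Perfetto. The summary table is usually enough to see which named region dominates.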

What to Measure

Choose the metric before profiling:

  • end-to-end latency for one request;

  • throughput for a fixed prompt set;

  • scheduler overhead;

  • model runner execution time;

  • prefill cost versus decode cost;

  • kernel time for attention, top-k, or MoE operations.

Use the smallest workload that still exhibits the behavior being measured.
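For the first two metrics, a small timing harness is often enough before reaching for a profiler. The sketch below assumes a callable `generate(prompt)` that runs one request to completion; `measure` and `fake_generate` are hypothetical names used only for illustration and are not part of Diffulex.

```python
import statistics
import time


def measure(generate, prompts, warmup=1, repeats=3):
    """Time generate(prompt) over a fixed prompt set.

    Returns the request count, median per-request latency in seconds,
    and overall throughput in requests per second.
    """
    # Warm-up calls are excluded from timing so one-time setup cost
    # does not distort the result.
    for prompt in prompts[:warmup]:
        generate(prompt)

    latencies = []
    start = time.perf_counter()
    for _ in range(repeats):
        for prompt in prompts:
            t0 = time.perf_counter()
            generate(prompt)
            latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    return {
        "requests": len(latencies),
        "median_latency_s": statistics.median(latencies),
        "throughput_rps": len(latencies) / elapsed,
    }


def fake_generate(prompt):
    # Stand-in workload; replace with the real inference call.
    return sum(i * i for i in range(10_000 + len(prompt)))


result = measure(fake_generate, ["short prompt", "another short prompt"])
print(result)
```

When the work runs on a GPU, the timed call must block until the result is ready (for example by synchronizing the device), otherwise the harness measures only launch time.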

Runtime Toggles

Compare optimized and debug paths one toggle at a time:

Setting                 What it compares
----------------------  ----------------------------------------------------------------
enforce_eager=True      Debug-friendly eager execution against optimized paths.
CUDA Graph paths        Launch-overhead reduction against eager execution.
enable_torch_compile    Supported compiled execution against uncompiled execution.
enable_vllm_layers      Optional vLLM-backed layers against local layer implementations.

Run a baseline before changing any toggle. Keep model, prompts, token limits, and parallelism fixed across comparisons.
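One way to enforce this discipline is to generate the run list from the baseline, changing exactly one toggle per run. The sketch below is illustrative only: the baseline values are assumptions rather than Diffulex defaults, and `one_at_a_time` is a hypothetical helper.

```python
def one_at_a_time(baseline, toggles):
    """Return (label, config) pairs: the baseline first, then one config
    per toggle in which only that toggle differs from the baseline."""
    runs = [("baseline", dict(baseline))]
    for name, value in toggles.items():
        if baseline.get(name) == value:
            continue  # already covered by the baseline run
        config = dict(baseline)
        config[name] = value
        runs.append((f"{name}={value}", config))
    return runs


# Assumed baseline values for illustration; check the real defaults first.
baseline = {
    "enforce_eager": False,
    "enable_torch_compile": False,
    "enable_vllm_layers": False,
}
toggles = {
    "enforce_eager": True,
    "enable_torch_compile": True,
    "enable_vllm_layers": True,
}

runs = one_at_a_time(baseline, toggles)
for label, config in runs:
    print(label, config)
```

Each config would then be run with the same model, prompts, token limits, and parallelism, so any difference in the measurement can be attributed to the single changed toggle.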

Existing Scripts

Diffulex includes profiling and benchmark scripts under script/. Use these when they match the target workload because they capture local conventions for model paths, task names, and profiler output.

If you create a new profiling script, keep large trace files out of the source tree and document the exact command used to generate a result.
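A new script can handle both points at startup by creating its output directory outside the repository and saving the command line next to the traces. This is a sketch under stated assumptions: `prepare_run_dir` and the `command.txt` file name are hypothetical, not an existing Diffulex convention.

```python
import shlex
import tempfile
import time
from pathlib import Path


def prepare_run_dir(root, argv):
    """Create a timestamped run directory under root and record the command.

    root should point outside the source tree; argv is the command line
    that produced the run (typically sys.argv).
    """
    run_dir = Path(root) / time.strftime("run_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    # shlex.join quotes arguments so the line can be pasted back into a shell.
    (run_dir / "command.txt").write_text(shlex.join(argv) + "\n")
    return run_dir


# Example with a made-up command line; a real script would pass sys.argv.
example_argv = ["profile_example.py", "--max-tokens", "32", "--prompt", "hello world"]
run_dir = prepare_run_dir(tempfile.mkdtemp(prefix="profiles_"), example_argv)
print((run_dir / "command.txt").read_text())
```

Trace files written into `run_dir` then sit next to the exact command that generated them.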