# Profiling

Use profiling when correctness is already established and you need to understand where time or memory is spent. Keep profiling runs narrow; changing many runtime settings at once makes results difficult to interpret.

## PyTorch Profiler

Use PyTorch Profiler when you need CPU and CUDA timing for a focused inference path. Diffulex records named regions around major engine operations such as scheduler work, request preparation, model runner execution, and output recording.

Keep profiling runs small:

- use a short prompt set;
- limit generated tokens;
- record one model and strategy configuration at a time;
- save traces outside the source tree when they are large.

## What to Measure

Choose the metric before profiling:

- end-to-end latency for one request;
- throughput for a fixed prompt set;
- scheduler overhead;
- model runner execution time;
- prefill cost versus decode cost;
- kernel time for attention, top-k, or MoE operations.

Use the smallest workload that still exhibits the behavior being measured.

## Runtime Toggles

Compare optimized and debug paths deliberately:

| Setting | What it compares |
| --- | --- |
| `enforce_eager=True` | Debug-friendly eager execution against optimized paths. |
| CUDA Graph paths | Launch-overhead reduction against eager execution. |
| `enable_torch_compile` | Supported compiled execution against uncompiled execution. |
| `enable_vllm_layers` | Optional vLLM-backed layers against local layer implementations. |

Run a baseline before changing any toggle. Keep model, prompts, token limits, and parallelism fixed across comparisons.

## Existing Scripts

Diffulex includes profiling and benchmark scripts under `script/`. Use these when they match the target workload because they capture local conventions for model paths, task names, and profiler output.

If you create a new profiling script, keep large trace files out of the source tree and document the exact command used to generate a result.
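
As a rough illustration of those two habits (results outside the source tree, exact command recorded), the sketch below times a workload and writes the latencies together with `sys.argv` to a JSON file in a temporary directory. It uses only the Python standard library and is not taken from the Diffulex scripts: `time_workload`, `save_result`, and `placeholder_workload` are hypothetical names, and the placeholder stands in for whatever inference call you actually want to measure.

```python
import json
import statistics
import sys
import tempfile
import time
from pathlib import Path


def time_workload(run, warmup=1, repeats=5):
    """Return per-run wall-clock latencies in seconds for a zero-argument callable."""
    for _ in range(warmup):
        run()  # warm-up runs are discarded (caches, lazy initialization)
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        latencies.append(time.perf_counter() - start)
    return latencies


def save_result(label, latencies, out_dir):
    """Write latencies plus the exact command line to <out_dir>/<label>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "label": label,
        "command": sys.argv,  # the exact command used to produce this result
        "latencies_s": latencies,
        "median_s": statistics.median(latencies),
    }
    path = out_dir / f"{label}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def placeholder_workload():
    # Hypothetical stand-in for a real inference call; replace with the path under test.
    sum(i * i for i in range(10_000))


# A temporary directory keeps results outside the source tree.
results_dir = Path(tempfile.mkdtemp(prefix="profile_results_"))
baseline_path = save_result("baseline", time_workload(placeholder_workload), results_dir)
```

To compare a toggle, run the same harness again with a different label and only that one setting changed, then compare the two JSON files. When the workload launches CUDA kernels, wall-clock timing like this is only meaningful if the device is synchronized (for example with `torch.cuda.synchronize()`) before each timer read, since kernel launches are asynchronous.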