The perf tools changes were merged today for the Linux 6.15 kernel. Most notable this cycle for the wonderful perf tooling is the introduction of latency profiling that leverages kernel scheduler information. This latency data will be useful for Linux software engineers working to optimize system latency and performance.
The perf tools pull request for Linux 6.15 explains this new “--latency” option for the “perf record” command:
“Introduce latency profiling using scheduler information. The latency profiling is to show impacts on wall-time rather than cpu-time. By tracking context switches, it can weight samples and find which part of the code contributed more to the execution latency.
The value (period) of the sample is weighted by dividing it by the number of parallel executions at the moment. The parallelism is tracked in perf report with sched-switch records. This will reduce the portion that is run in parallel and in turn increase the portion of serial executions.
For now, it’s limited to profiling processes, IOW system-wide profiling is not supported. You can add the --latency option to enable this.”
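To make that weighting concrete, here is a small Python sketch of the arithmetic. It is a toy model with made-up numbers rather than anything from the perf code: a workload with a 2-second serial phase and 8 CPU-seconds of work spread across 4 cores looks 80% parallel in a CPU profile, yet both phases take 2 seconds of wall-clock time, which is what dividing each sample by its parallelism recovers.

```python
# Toy model of the weighting described in the pull request above; the numbers
# are made up for illustration and this is not code from perf itself.
# A workload spends 2 s in a serial phase, then 8 CPU-seconds in a phase
# that runs perfectly in parallel on 4 cores (2 s of wall-clock time).

samples = [
    # (phase, cpu_seconds, parallelism at the moment the samples were taken)
    ("serial phase",   2.0, 1),
    ("parallel phase", 8.0, 4),
]

cpu_total = sum(cpu for _, cpu, _ in samples)
lat_total = sum(cpu / par for _, cpu, par in samples)

for phase, cpu, par in samples:
    cpu_pct = 100.0 * cpu / cpu_total           # classic CPU-time overhead
    lat_pct = 100.0 * (cpu / par) / lat_total   # latency (wall-clock) overhead
    print(f"{phase:15s} CPU: {cpu_pct:4.1f}%   latency: {lat_pct:4.1f}%")

# Prints roughly:
#   serial phase    CPU: 20.0%   latency: 50.0%
#   parallel phase  CPU: 80.0%   latency: 50.0%
```

The CPU column buries the serial bottleneck at 20%, while the latency column shows that the serial phase is responsible for half of the wall-clock time.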
Dmitry Vyukov of Google worked on this latency reporting for perf as well as a new parallelism key. In an earlier patch series he further elaborated on the focus and purpose of this latency profiling:
“There are two notions of time: wall-clock time and CPU time. For a single-threaded program, or a program running on a single-core machine, these notions are the same. However, for a multi-threaded/multi-process program running on a multi-core machine, these notions are significantly different. Each second of wall-clock time we have number-of-cores seconds of CPU time.
Currently perf only allows profiling CPU time. Perf (and, to the best of my knowledge, all other existing profilers) does not allow profiling wall-clock time.
Optimizing CPU overhead is useful to improve ‘throughput’, while optimizing wall-clock overhead is useful to improve ‘latency’. These profiles are complementary and are not interchangeable. Examples of where a latency profile is needed:
- optimizing build latency
- optimizing server request latency
- optimizing ML training/inference latency
- optimizing running time of any command line program
A CPU profile is useless for these use cases at best (if a user understands the difference), or misleading at worst (if a user tries to use the wrong profile for the job).
…
Brief outline of the implementation:
- add context switch collection during record
- calculate number of threads running on CPUs (parallelism level) during report
- divide each sample weight by the parallelism level
This effectively models that we were taking 1 sample per unit of wall-clock time. We still default to the CPU profile, so it’s up to users to learn about the second profiling mode and use it when appropriate.”
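As a rough sketch of those three steps, here is a small Python model. The event and sample structures are hypothetical stand-ins for illustration, not the actual perf.data format or the code merged into perf report:

```python
from bisect import bisect_right

# Hypothetical in-memory events standing in for recorded context switches:
# (timestamp, cpu, switched_in), where switched_in is True when a thread of
# the profiled process starts running on that CPU and False when it stops.
context_switches = [
    (0.0, 0, True),    # main thread starts on CPU 0
    (2.0, 1, True),    # three workers are switched in
    (2.0, 2, True),
    (2.0, 3, True),
    (4.0, 1, False),   # workers are switched out again
    (4.0, 2, False),
    (4.0, 3, False),
]

# Step 2: turn the switch stream into (timestamp, parallelism) points.
timeline, level = [], 0
for ts, _cpu, switched_in in sorted(context_switches):
    level += 1 if switched_in else -1
    timeline.append((ts, max(level, 1)))   # clamp so a weight is always defined
times = [ts for ts, _ in timeline]

def parallelism_at(ts):
    """Parallelism level in effect at a given sample timestamp."""
    i = bisect_right(times, ts) - 1
    return timeline[i][1] if i >= 0 else 1

# Step 3: divide each sample's weight (its period) by the parallelism level.
samples = [(1.0, "serial_code"), (3.0, "parallel_code"), (3.5, "parallel_code")]
for ts, symbol in samples:
    weight = 1.0 / parallelism_at(ts)
    print(f"sample at {ts:3.1f}s in {symbol:14s} -> latency weight {weight:.2f}")
```

Summed per symbol, those weighted values give the latency overhead, while the unweighted sample counts give the familiar CPU overhead.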
The code is merged and ready to go with Linux 6.15. This new documentation goes into more detail on the CPU and latency overhead reporting for perf. Can’t wait to see what improvements will be uncovered by Google and others leveraging perf record --latency.