Statistical Profiling

The HPCToolkit Performance Tools


Version 2021.05.15-master

hpcrun is profiling tool that collects call path profiles of program executions using statistical sampling of hardware counters, software counters, or timers.

See hpctoolkit(1) for an overview of HPCToolkit.

Table of Contents


hpcrun [profiling-options] command [command-arguments]

hpcrun [info-options]


hpcrun profiles the execution of an arbitrary command command using statistical sampling. hpcrun can profile an execution using multiple sample sources simultaneously, supports measurement of applications with multiple processes and/or multiple threads, and handles complex runtime behaviors including fork, exec, and/or dynamic loading of shared libraries. hpcrun can be used in conjunction with program launchers such as mpiexec, SLURM's srun, LSF's jsrun, and Cray's aprun.

To profile a statically-linked executable, make sure to link with hpclink(1) .

To configure hpcrun's sampling sources, specify events and periods using the -e/--event option. For an event e and period p, after every p instances of e, a sample is generated that causes hpcrun to inspect the current calling context and augment its execution measurements of the monitored command.

If no sample source is specified, by default hpcrun profile using the timer CPUTIME on Linux at a frequency of 200 samples per second per thread.

When command terminates, a profile measurement database will be written to the directory:


where jobid is a parallel job launcher id associated with the execution, if available.

hpcrun allows you to abort an execution and write the partial profiling data to disk by sending a signal such as SIGHUP or SIGINT (which is often bound to Control-c). This can be extremely useful to collect data for long-running or misbehaving applications.


Default values for optional arguments are shown in {}.


The command to profile.

Arguments to the command to profile.

Options: Informational

-l, -L, --list-events
List available events. (N.B.: some may not be profilable)

-V, --version
Print version information.

-h, --help
Print help.

Options: Profiling

-ds, --delay-sampling
Don't start sampling until the application enables sampling under program control. Use this option to measure specific intervals in an application's execution by bracketing code regions where measurement is desired with calls to hpctoolkit_sampling_start() and hpctoolkit_sampling_stop(). Sampling may be started and stopped any number of times during an execution; measurements from all measurement intervals are aggregated.

Override and disable gprof instrumentation This option is only useful when using hpcrun to add HPCToolkit's measurement subsystem to a dynamically-linked application that has been compiled with -pg. One can't measure performance with HPCToolkit when gprof instrumentation is active in the same execution.

-e event[@howoften], --event event[@howoften]
event may be an architecture-independent hardware or software event supported by Linux perf, a native hardware counter event, a hardware counter event supported by the PAPI library, or a Linux system timer (CPUTIME and REALTIME). This option may be given multiple times to profile several events at once. While events measured using the Linux perf monitoring infrastructure will be transparently multiplexed if necessary, for other sampling sources or on operating systems such as the Blue Gene Compute Node Kernel, there may be system-dependent limits on how many events can be profiled simultaneously and on which events may be combined for profiling. If the value for howoften is a number, it will be interpreted as a sample period. For Linux perf events, one may specify a sampling frequency for howoften by writing f before a number. For instance, to sample an event 100 times per second, specify howoften as '@f100'. For Linux perf events, if no value for howoften is specified, hpcrun will monitor the event using frequency-based sampling at 300 samples/second.

-e gpu=nvidia[,pc], --event gpu=nvidia[,pc]
Collect comprehensive operation-level measurements for CUDA programs on NVIDIA GPUs, including timing of GPU kernel invocations, memory copies (implicit and explicit), driver and runtime activity, and overhead. If the optional argument pc is used, the GPU will collect instruction-level measurements of GPU kernels using Program Counter (PC) sampling in addition to the operation-level data. PC sampling attributes STALL reasons to individual GPU instructions.

-e gpu=amd, --event gpu=amd
Collect comprehensive operation-level measurements for HIP programs on AMD GPUs, including timing of GPU kernel invocations, memory copies (implicit and explicit), driver and runtime activity, and overhead.

-e gpu=opencl, --event gpu=opencl
Collect comprehensive operation-level measurements for OpenCL programs on AMD, Intel, or NVIDIA GPUs, including timing of GPU kernel invocations, memory copies (implicit and explicit), driver and runtime activity, and overhead. The opencl measurement mode may also be used to measure executions of DPC++ programs compiled to Intel's OpenCL backend.

-c howoften, --count howoften
Only available for events managed by Linux perf. This option specifies a default value for how often to sample. The value for howoften may be a number that will be used as a default event period or an f followed by a number, e.g. f100, to specify a default sampling frequency in samples/second.
By default, hpcrun will allow attribution of hardware counter events to have arbitrary skid. Some processor architectures, e.g., ARM, don't support attribution with any higher level of precision. If a processor does not support the specified level of attribution precision for a hardware counter event, hpcrun may record 0 occurrences of the event without reporting an error.

-f frac, -fp frac, --process-fraction frac
Measure only a fraction frac of the execution's processes. For each process, enable measurement of each thread with probability frac, a real number or a fraction (1/10) between 0 and 1. To minimize perturbations, when measurement for a process is disabled all threads in a process still receive sampling interrupts but they are ignored.

-lm size, --low-memsize size
Allocate an additional segment to store measurement data whenever free space in the current segment is less than the specified size. If not given, the default for size is 80K.

-m switch, --merge-threads switch
Merge non-overlapped threads into one virtual thread. This option is to reduce the number of generated profile and trace files as each thread generates its own profile and trace data. The options are:

-ms size, --memsize size
Use the specified size as segment size when allocating memory for measurement data. The specified value is rounded up to a multiple of the `system page size. If not given, the default for size is 4M.

-mp prob, --memleak-prob prob
Monitor a subset of memory allocations performed by the application to detect leaks. An allocation is a call to one of malloc, calloc, realloc, etc and its matching call to free. At each allocation HPCToolkit generates a pseudo-random number in the range [0.0, 1.0) and monitors the allocation if the number is less than the value prob specified here, The value may be written as a a floating point number or as a fraction. If not given, the default for prob is 0.1.

-o outpath, --output outpath
Directory to receive output data. If not given, the default directory ia hpctoolkit--measurements[-].

-r, --retain-recursion
Do not collapse simple recursive call chains. Normally as hpcrun monitors an application that employs simple recursion, it collapses call chains of recursive calls to a single level. This design enables a user to see how the aggregate costs of recursion are associated with each recursive call yet saves space and time during post-mortem analysis by collapsing long chains of recursive calls. If this option is given, hpcrun will record all elements of a recursive call chain. Note: When you use the RETCNT sample source this option is enabled automatically to gather accurate counts.

-nu, --no-unwind
Suppress unwinding of callstacks. Normally as hpcrun attributes performance metrics to full calling contexts. If this option is given, hpcrun collect only flat profiles, attributing metrics directly to functions without any information about the contexts in which they are called.

-t, --trace
Generate a call path trace in addition to a call path profile. This option will enable tracing for CPUs if a time-based metric, such as CPUTIME, REALTIME, or cycles is used. This option will enable tracing for GPU operations if a '-e gpu=...' option is used to enable measurement of GPU activities.

Options: HPCToolkit Development

These options are intended for use by the HPCToolkit team, but could be helpful to others interested in HPCToolkit's implementation. .

-d, --debug
After initialization, spin wait until you attach a debugger to one or more of the application's processes. After attaching you can set breakpoints or watchpoints in your application's code or in HPCToolkit's hpcrun code before beginning application execution. To continue after attaching, use the debugger to call hpcrun_continue() and then resume execution.

-dd flag, --dynamic-debug flag
Enable the flag flag, causing hpcrun to log debug messages guarded with that flag during execution. A list of dynamic debug flags can be found in HPCToolkit's source code in the file src/tool/hpcrun/messages/messages.flag-defns. Note that not all flags are meaningful on all architectures. The special value ALL enables all debug flags.
Caution: turning on debug flags produces many log messages, often dramatically slowing the application and potentially distorting the measured profile.

-md, --monitor-debug
Enable debug tracing of libmonitor, the hpcrun subsystem which implements process/thread control.

Terminate the hpcfnbounds server when it goes idle. By default, it is kept alive during the entire run.

Environment Variables

To function correctly, hpcrun must know the location of the HPCToolkit top-level installation directory so that it can access toolkit components located in its lib and libexec subdirectories. Under most circumstances, hpcrun requires no special environment variable settings.

There are two situations, however, where hpcrun must consult the HPCTOOLKIT environment variable to determine the location of the top-level installation directory:


When sampling with native events, by default hpcrun will profile using perf events. To force HPCToolkit to use PAPI (assuming it's available) instead of perf events, one must prefix the event with 'papi::' as follows:

hpcrun -e papi::CYCLES

For PAPI presets, there is no need to prefix the event with 'papi::'. For instance it is sufficient to specify PAPI_TOT_CYC event without any prefix to profile using PAPI.

To sample an execution 100 times per second (frequency-based sampling) counting CYCLES and 100 times a second counting INSTRUCTIONS:

hpcrun -e CYCLES@f100 -e INSTRUCTIONS@f100 ...

To sample an execution every 1,000,000 cycles and every 1,000,000 instructions using period-based sampling:

hpcrun -e CYCLES@1000000 -e INSTRUCTIONS@1000000 ...
By default, hpcrun will use frequency-based sampling with the rate 300 samples per second per event type. Hence the following command will cause HPCToolkit to sample CYCLES at 300 samples per second and INSTRUCTIONS at 300 samples per second:
hpcrun -e CYCLES -e INSTRUCTIONS ...
One can a different default rate using the -c option. The command below will sample CYCLES at 200 samples per second and INSTRUCTIONS at 200 samples per second:
hpcrun -c f200 -e CYCLES -e INSTRUCTIONS ...


Assume we wish to profile the application zoo. The following examples lists some useful events for different processor architectures.


Sample sources

hpcrun uses Linux perf_events (default on Linux platform) and optionally the PAPI library to provide access to hardware performance counter events. It is important to note that on most out-of-order pipelined architectures, a hardware counter interrupt is not precisely attributed to the instruction that induced the counter to overflow. The gap is commonly 50-70 instructions. This means that one should not assume that aggregation at the source line level is fully precise. (E.g., if a L1 D-cache miss is attributed to a statement that has been compiled to register-only operations, assume the miss is attributed to a nearby load.) However, aggregation at the procedure and loop level is reliable.

Linux perf_events Interface

Linux perf_events provides a powerful interface that supports measurement of both application execution and kernel activity. Using perf_events, one can measure both hardware and software events. Using a processor's hardware performance monitoring unit (PMU), the perf_events interface can measure an execution using any hardware counter supported by the PMU. Examples of hardware events include cycles, instructions completed, cache misses, and stall cycles. Using instrumentation built in to the Linux kernel, the perf_events interface can measure software events. Examples of software events include page faults, context switches, and CPU migrations.

HPCToolkit uses libpfm4 to translate from an event name string to an event code recognized by the kernel. An event name is case insensitive and is defined as followed:


Some capabilities of HPCToolkit's perf_events Interface include:

PAPI Interface (optional)
The PAPI library supports a large collection of hardware counter events. Some events have standard names across all platforms, e.g. PAPI_TOT_CYC, the event that measures total cycles. In addition to events whose names begin with the PAPI_ prefix, platforms also provide access to a set of native events with names that are specific to the platform's processor. A complete list of events supported by the PAPI library for your platform may be obtained by using the --list-events option. Any event whose name begins with the PAPI_ prefix that is listed as "Profilable" can be used as an event in a sampling source provided it does not conflict with another event.

The rules of thumb for selecting an appropriate set of events and their associated periods are complex.

System itimer (WALLCLOCK).
On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds; see the itimer man page. One can configure the kernel to use a value as small as 1 millisecond, but it is unlikely the kernel will actually deliver itimer signals at that rate when a period of 1000 microseconds is requested.

However, on Linux one can get quite close to the kernel Hz rate by setting the itimer interval to something less than the Hz rate. For example, if the Hz rate is 1000 microseconds, one can use 500 microseconds (or just 1) and obtain about 999 interrupts per second.

Platform-specific notes

Cray Systems
When using dynamically linked binaries on Cray systems, you should add the HPCTOOLKIT environment variable to your launch script. Set HPCTOOLKIT to the top-level HPCToolkit install prefix (the directory containing the bin, lib and libexec subdirectories) and export it to the environment. This is only needed for running dynamically linked binaries. For example:

#PBS -l mppwidth=#nodes
#PBS -l walltime=00:30:00

export HPCTOOLKIT=/path/to/hpctoolkit/install/directory

    ...Rest of Script...

If HPCTOOLKIT is not set, you may see errors such as the following in your job's error log.

/var/spool/alps/103526/hpcrun: Unable to find HPCTOOLKIT root directory.
Please set HPCTOOLKIT to the install prefix, either in this script,
or in your environment, and try again.

The problem is that the Cray ALPS job launcher copies the hpcrun script to a directory somewhere below /var/spool/alps/ and runs it from there. By moving hpcrun to a different directory, this breaks hpcrun's default method for finding HPCToolkit's top-level installation directory. The solution is to add HPCTOOLKIT to your environment so that hpcrun can find HPCToolkit's top-level installation directory.


See Also

hpctoolkit(1) .
hpclink(1) .


Version: 2021.05.15-master

License and Copyright

© 2002-2021, Rice University.
See README.License.


Rice University's HPCToolkit Research Group
Email: hpctoolkit-forum =at= rice.edu
WWW: http://hpctoolkit.org.