hpcrun:
Statistical Profiling

The HPCToolkit Performance Tools

2018/06/28

Version 2017.11

hpcrun is a call path profiler based on statistical sampling. It supports multiple sample sources during one execution. hpcrun profiles complex applications (forks, execs, threads and dynamic linking) and may be used in conjunction with parallel process launchers such as MPICH's mpiexec and SLURM's srun.

See hpctoolkit(1) for an overview of HPCToolkit.

Synopsis

hpcrun [profiling-options] command [command-arguments]

hpcrun [info-options]

Description

hpcrun profiles the execution of an arbitrary command using statistical sampling (rather than instrumentation). It collects per-thread call path profiles that represent the full calling context of sample points. Sample points may be generated from multiple simultaneous sampling sources. hpcrun profiles complex applications that use forks, execs, threads, and dynamic linking/unlinking; it may be used in conjunction with parallel process launchers such as MPICH's mpiexec and SLURM's srun.

To profile a statically linked executable, make sure to link with hpclink(1).

To configure hpcrun's sampling sources, specify events and periods using the -e/--event option. For an event e and period p, after every p instances of e, a sample is generated that causes hpcrun to inspect and record information about the monitored command.

When command terminates, a profile measurement database will be written to the directory:

      hpctoolkit-command-measurements[-pid]

where pid is an operating system process id, if available.

hpcrun enables you to abort a process and write the partial profiling data to disk by sending the Interrupt signal (INT or Ctrl-C). This can be extremely useful on long-running or misbehaving applications.

Arguments

Default values for optional arguments are shown in {}.

Command

command
The command to profile.

command-arguments
Arguments to the command to profile.

Options: Informational

-l, -L, --list-events
List available events. (N.B.: some may not be profilable)

-V, --version
Print version information.

-h, --help
Print help.

Options: Profiling

-ds, --delay-sampling
Don't start sampling until the application chooses to turn it on. Use this option to measure only a subset of the application's execution by bracketing interesting regions with calls to hpctoolkit_sampling_start() and hpctoolkit_sampling_stop(). Sampling may be turned on and off any number of times during an execution, and measurements from all sampled regions are aggregated for attribution and display.

-e event[@period], --event event[@period]
Profile the event with the corresponding sample period. event may be a PAPI event, a perf event, a native processor event, or the special event WALLCLOCK. This option may be given multiple times to profile several events at once; there may be system-dependent limits on how many events can be profiled simultaneously and on which events may be combined for profiling. If no events are given, the default is to profile WALLCLOCK@5000. For perf events, a sampling frequency may be given in place of a period by prefixing the number with f. For instance, to request 100 samples per second, specify the period as @f100.
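
As a rough guide to choosing a period, the sketch below converts a WALLCLOCK period to an approximate sampling rate. It assumes the WALLCLOCK period is expressed in microseconds, consistent with the itimer discussion in the Notes; the helper name samples_per_second is hypothetical and is not part of hpcrun.

```shell
# Hypothetical helper (not part of hpcrun): convert a sample period in
# microseconds to an approximate number of samples per second.
# Assumes WALLCLOCK periods are given in microseconds.
samples_per_second() {
    echo $(( 1000000 / $1 ))
}

samples_per_second 5000   # default WALLCLOCK@5000; prints 200
```

Under that assumption, the default WALLCLOCK@5000 corresponds to roughly 200 samples per second, whereas a perf event given as @f100 requests 100 samples per second directly.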

-c number, --count number
Available only for perf events. Specifies the sample period for events given without one. It uses the same period format as the -e option above.

-p level, --precise-ip level
Specify the precise IP level (used only with perf events). NOTE: Some architectures support a precise IP with 0 skid. An incorrect level will prevent hpcrun from sampling the events.

-f frac, -fp frac, --process-fraction frac
Measure only a fraction frac of the execution's processes. For each process, enable measurement of each thread with probability frac, a real number or a fraction (e.g., 1/10) between 0 and 1. To minimize perturbations, when measurement for a process is disabled, all threads in the process still receive sampling interrupts, but the interrupts are ignored.

-lm size, --low-memsize size
Allocate an additional segment to store measurement data whenever free space in the current segment is less than the specified size. If not given, the default for size is 80K.

-m size, --memsize size
Use the specified size as the segment size when allocating memory for measurement data. The specified value is rounded up to a multiple of the system page size. If not given, the default for size is 4M.

-mp prob, --memleak-prob prob
Monitor a subset of memory allocations performed by the application to detect leaks. An allocation is a call to one of malloc, calloc, realloc, etc., together with its matching call to free. At each allocation, HPCToolkit generates a pseudo-random number in the range [0.0, 1.0) and monitors the allocation if the number is less than the value prob specified here. The value may be written as a floating point number or as a fraction. If not given, the default for prob is 0.1.
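
To get a feel for how prob affects the number of monitored allocations, the decision rule described above can be simulated. The sketch below is an illustration only, not HPCToolkit code: the function name count_monitored is hypothetical, awk's rand() stands in for HPCToolkit's pseudo-random generator, and prob must be given as a decimal number here rather than as a fraction.

```shell
# Illustration only (not HPCToolkit code): simulate the per-allocation
# decision "monitor if a pseudo-random number in [0,1) is less than prob".
# Arguments: prob (decimal), number of allocations, random seed.
count_monitored() {
    awk -v prob="$1" -v n="$2" -v seed="$3" 'BEGIN {
        srand(seed)
        monitored = 0
        for (i = 0; i < n; i++)
            if (rand() < prob) monitored++
        print monitored
    }'
}

count_monitored 0.1 10000 1   # roughly 1000 of 10000 allocations
```

With the default prob of 0.1, about one allocation in ten is monitored; the exact count varies with the seed and the awk implementation.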

-o outpath, --output outpath
Directory to receive output data. If not given, the default directory is hpctoolkit-command-measurements[-pid].

-r, --retain-recursion
Do not collapse simple recursive call chains to a single node. Normally hpcrun does collapse such chains to present a more useful attribution of costs. If this option is given, all elements of a recursive call chain are recorded. Note: When you use the RETCNT sample source, this option is enabled automatically in order to gather accurate counts.

-t, --trace
Generate a call path trace (in addition to a call path profile).

Options: HPCToolkit Development

These options are intended for use by the HPCToolkit team, but could be helpful to others interested in HPCToolkit's implementation.

-d, --debug
After initialization, spin wait until you attach a debugger to one or more of the application's processes. After attaching, you can set breakpoints or watchpoints in your application's code or in HPCToolkit’s hpcrun code before beginning application execution. To continue after attaching, use the debugger to set program variable DEBUGGER_WAIT to zero and then resume execution. Note: You can only set HPCRUN_WAIT if your HPCToolkit was built with debugging symbols. To build HPCToolkit with debugging symbols, include the option --enable-develop when configuring HPCToolkit.

-dd flag, --dynamic-debug flag
Enable the flag flag, causing hpcrun to log debug messages guarded with that flag during execution. A list of dynamic debug flags can be found in HPCToolkit’s source code in the file src/tool/hpcrun/messages/messages.flag-defns. Note that not all flags are meaningful on all architectures. The special value ALL enables all debug flags.
Caution: turning on debug flags produces many log messages, often dramatically slowing the application and potentially distorting the measured profile.

-q, --quiet
Turn on a default set of dynamic debugging variables to log information about HPCToolkit’s stack unwinding based on on-the-fly binary analysis. See the HPCToolkit User Manual for more details.
Bug: this option is unfortunately named.

-md, --monitor-debug
Enable debug tracing of libmonitor, the hpcrun subsystem which implements process/thread control. See the HPCToolkit User Manual for more details.

Environment Variables

For most systems, hpcrun requires no special environment variable settings. There are two situations, however, in which hpcrun must refer to environment variables to function correctly. These environment variables and the corresponding situations are:

[HPCTOOLKIT] To function correctly, hpcrun must know the location of the HPCToolkit top-level installation directory. The hpcrun script uses elements of the installation lib and libexec subdirectories. For most systems, the installation procedure ensures that hpcrun can find the requisite components. Some parallel job launchers, however, will copy the hpcrun script to a different location from the installed base. If your system uses this copying mechanism, you must set the HPCTOOLKIT environment variable to the top-level installation directory.

[hpcrun] If you refer to the hpcrun script via a file system link you must also set HPCTOOLKIT, for the same reason.
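
When hpcrun is reached through a symbolic link, one way to derive a value for HPCTOOLKIT is to resolve the link and strip the trailing bin/hpcrun. The sketch below assumes that hpcrun lives in the bin subdirectory of the install prefix and that readlink -f (as provided by GNU coreutils) is available; the function name install_prefix_of is hypothetical.

```shell
# Hypothetical helper: given a path to hpcrun (possibly a symlink), print
# the install prefix, assuming the layout <prefix>/bin/hpcrun.
install_prefix_of() {
    real=$(readlink -f "$1") || return 1
    dirname "$(dirname "$real")"
}

# Set HPCTOOLKIT only if hpcrun is found on the PATH.
if command -v hpcrun >/dev/null 2>&1; then
    HPCTOOLKIT=$(install_prefix_of "$(command -v hpcrun)")
    export HPCTOOLKIT
fi
```

This does not help when a job launcher copies the script, since a copy carries no link back to the install directory; in that case, set HPCTOOLKIT explicitly.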

Launching

When sampling with native events, by default hpcrun will profile using perf events. To force HPCToolkit to use PAPI (assuming it's available) instead of perf events, one must prefix the event with ‘papi::’ as follows:

hpcrun -e papi::CYCLES

For PAPI presets, there is no need to prefix the event with ‘papi::’. For instance, it is sufficient to specify the PAPI_TOT_CYC event without any prefix to profile using PAPI.

To sample an execution 100 times per second (frequency-based sampling) counting CYCLES and 100 times a second counting INSTRUCTIONS:

hpcrun -e CYCLES@f100 -e INSTRUCTIONS@f100 ...

To sample an execution every 1,000,000 cycles and every 1,000,000 instructions using period-based sampling:

hpcrun -e CYCLES@1000000 -e INSTRUCTIONS@1000000 ...

By default, hpcrun will use frequency-based sampling at a rate of 300 samples per second per event type. Hence the following command will cause HPCToolkit to sample CYCLES at 300 samples per second and INSTRUCTIONS at 300 samples per second:

hpcrun -e CYCLES -e INSTRUCTIONS ...

One can set a different default rate using the -c option. The command below will sample CYCLES at 200 samples per second and INSTRUCTIONS at 200 samples per second:

hpcrun -c f200 -e CYCLES -e INSTRUCTIONS ...

Examples

Assume we wish to profile the application zoo. The following examples list some useful events for different processor architectures. In each case, the special option -- is used to clearly demarcate the end of hpcrun options.
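
A few representative invocations are sketched below. The event choices are illustrative and reuse events mentioned elsewhere in this page (WALLCLOCK, CYCLES, INSTRUCTIONS, PAPI_TOT_CYC); which of them are available depends on the system, so check the output of hpcrun -l first. The run wrapper and the DRY_RUN variable are hypothetical conveniences, not part of hpcrun: by default the commands are only printed.

```shell
# Hypothetical wrapper: print the command unless DRY_RUN=0, in which case run it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Wall-clock sampling with the default period.
run hpcrun -e WALLCLOCK@5000 -- zoo
# Frequency-based sampling of two perf events, plus a call path trace.
run hpcrun -t -e CYCLES@f100 -e INSTRUCTIONS@f100 -- zoo
# Period-based sampling of a PAPI preset event.
run hpcrun -e PAPI_TOT_CYC@1000000 -- zoo
```

Setting DRY_RUN=0 in the environment executes the commands instead of printing them.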

Notes

Sample sources

hpcrun uses Linux perf_events (the default on Linux) and optionally the PAPI library to provide access to hardware performance counter events. It is important to note that on most out-of-order pipelined architectures, a hardware counter interrupt is not precisely attributed to the instruction that induced the counter to overflow. The gap is commonly 50-70 instructions. This means that one should not assume that aggregation at the source line level is fully precise. (E.g., if an L1 D-cache miss is attributed to a statement that has been compiled to register-only operations, assume the miss is attributed to a nearby load.) However, aggregation at the procedure and loop level is reliable.

Linux perf_events Interface

Linux perf_events provides a powerful interface that supports measurement of both application execution and kernel activity. Using perf_events, one can measure both hardware and software events. Using a processor's hardware performance monitoring unit (PMU), the perf_events interface can measure an execution using any hardware counter supported by the PMU. Examples of hardware events include cycles, instructions completed, cache misses, and stall cycles. Using instrumentation built in to the Linux kernel, the perf_events interface can measure software events. Examples of software events include page faults, context switches, and CPU migrations.

HPCToolkit uses libpfm4 to translate from an event name string to an event code recognized by the kernel. An event name is case insensitive and is defined as follows:

[pmu::][event_name][:unit_mask][:modifier|:modifier=val] 

Some capabilities of HPCToolkit's perf_events Interface include:

PAPI Interface (optional)
The PAPI library supports a large collection of hardware counter events. Some events have standard names across all platforms, e.g. PAPI_TOT_CYC, the event that measures total cycles. In addition to events whose names begin with the PAPI_ prefix, platforms also provide access to a set of native events with names that are specific to the platform's processor. A complete list of events supported by the PAPI library for your platform may be obtained by using the --list-events option. Any event whose name begins with the PAPI_ prefix that is listed as "Profilable" can be used as an event in a sampling source, provided it does not conflict with another event.

The precise rules for selecting good events and periods are complex.

System itimer (WALLCLOCK).
On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds; see the itimer man page. One can configure the kernel to use a value as small as 1 millisecond, but it is unlikely the kernel will actually deliver itimer signals at that rate when a period of 1000 microseconds is requested.

However, on Linux one can get quite close to the kernel Hz rate by setting the itimer interval to something less than the kernel timer period. For example, if the kernel timer period is 1000 microseconds, one can use 500 microseconds (or even 1) and obtain about 999 interrupts per second.

Platform-specific notes

Cray XE and XK
When using dynamically linked binaries on Cray XE and XK systems, you should add the HPCTOOLKIT environment variable to your launch script. Set HPCTOOLKIT to the top-level HPCToolkit install prefix (the directory containing the bin, lib and libexec subdirectories) and export it to the environment. This is only needed for running dynamically linked binaries. For example:

#!/bin/sh
#PBS -l mppwidth=#nodes
#PBS -l walltime=00:30:00
#PBS -V

export HPCTOOLKIT=/path/to/hpctoolkit/install/directory

    ...Rest of Script...

If HPCTOOLKIT is not set, you may see errors such as the following in your job's error log.

/var/spool/alps/103526/hpcrun: Unable to find HPCTOOLKIT root directory.
Please set HPCTOOLKIT to the install prefix, either in this script,
or in your environment, and try again.

The problem is that the Cray job launcher copies the hpcrun script to a directory somewhere below /var/spool/alps/ and runs it from there. Moving hpcrun to a different directory breaks hpcrun's method for finding its own install directory. The solution is to add HPCTOOLKIT to your environment so that hpcrun can find its install directory.

Miscellaneous

See Also

hpctoolkit(1).
hpclink(1).

Version

Version: 2017.11

License and Copyright

Copyright
© 2002-2018, Rice University.
License
See README.License.

Authors

Rice University's HPCToolkit Research Group
Email: hpctoolkit-forum =at= rice.edu
WWW: http://hpctoolkit.org.