VtuneAmplifier
Introduction
Intel VTune Profiler is a commercial application for software performance analysis of 32- and 64-bit x86-based machines. It is among a range of Intel tools that are installed and available at CERN through CVMFS. Since the compilers and performance tools are installed on CVMFS, they are available from any CVMFS-enabled Linux machine at CERN.
Starting VTune
The basic usage requires sourcing the necessary setup script and passing your executable to vtune (historically amplxe-cl). In practice, this means:
# Setup Intel Tools
# Do this before running asetup to avoid python clashes
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
# Setup the latest 24.0 Athena nightly for Reco_tf.py
lsetup "asetup Athena,24.0,latest"
# Run a simple q445 job with 1 event to generate runargs.HITtoRDO.py
Reco_tf.py --AMI q445 --perfmon none --outputRDOFile myRDO.pool.root --maxEvents 1
# change the number of events in the generated runargs.HITtoRDO.py to what you need
# Run profiling
vtune -mrte-mode=native -collect hotspots $(which athena.py) -- preloadlib=$ATLASMKLLIBDIR_PRELOAD/libintlc.so.5:$ATLASMKLLIBDIR_PRELOAD/libimf.so runargs.HITtoRDO.py
With VTune 2021.2+ and Python 3.7+, trying to profile a Python job that uses cppyy (so all athena.py jobs...) makes the process hang. (See ATLINFR-4105 for a more detailed description of the problem.) To circumvent this issue, the -mrte-mode flag, as shown above, is needed.
This should produce a folder called r000hs and a file called r000hs.vtune therein, which contains the profiling output. The results can be visualized using the GUI via:
# Invoke the GUI for visualization
vtune-gui r000hs/r000hs.vtune
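If you prefer to stay on the command line, the same result directory can also be summarized as a text report. The following is a minimal sketch using the generic vtune reporting options (the exact report types available may depend on your VTune version):
# Print a text summary of the collected data without opening the GUI
vtune -report summary -result-dir r000hs -format text -report-output summary.txt
# List the top functions by CPU time
vtune -report hotspots -result-dir r000hs -format text -report-output hotspots.txt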
It is also possible to create call-graphs using VTune output. First, you need to obtain a Python script called gprof2dot from this GitHub project and then:
vtune -report gprof-cc -result-dir output -format text -report-output output.txt
gprof2dot.py -f axe output.txt | dot -Tpng -o output.png
Two useful gprof2dot options are --strip and --color-nodes-by-selftime. The latter colors the nodes by self time, which is what GPerfTools normally does, and the former strips the full argument and template names from the function labels.
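For example, a sketch of the full pipeline with both options enabled (reusing the output.txt report generated above) could look like:
# Same call-graph as above, but with stripped names and self-time coloring
gprof2dot.py -f axe --strip --color-nodes-by-selftime output.txt | dot -Tpng -o output_selftime.png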
More useful information can be obtained at:
- 3rd CERN OpenLab/Intel hands-on workshop on code optimization
- Openlab Workshop 3
- TriggerProfiling twiki
Software vs Hardware Event-based sampling
If you're running a Hotspots analysis, vtune uses software-based sampling by default. If you have an Intel chip and install sep drivers (as root) as described on this webpage, you can enable hardware-based sampling by:
$ vtune -mrte-mode=native -collect hotspots -knob sampling-mode=hw ...;
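Before switching to hardware sampling, it can be useful to verify that the sampling drivers are actually loaded; a quick check (assuming the standard module names shipped with the driver package) is:
# Check that the Intel sampling (sep) kernel modules are loaded
$ lsmod | grep -i sep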
Profiling a list of specific algorithms or a range of events in Athena jobs
We have a service called PerfMonVTune that allows users to profile either a list of specific algorithms or a range of events in Athena jobs. The current implementation doesn't allow mixing these two (i.e. profiling a specific algorithm in a range of events), but that can easily be provided if there is demand.
PerfMonVTune isn't built as part of the nightlies since it relies on VTune, which is not provided by default. However, the user can clone the package and easily compile it on top of VTune and a main Athena nightly as:
$ source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
$ lsetup "asetup Athena,main,latest" "git";
$ git atlas init-workdir https://:@gitlab.cern.ch:8443/atlas/athena.git -p PerfMonVTune;
$ mkdir build; cd build;
$ cmake ../athena/Projects/WorkDir; cmake --build .;
$ source x86_64-*/setup.sh; cd ..;
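As a quick sanity check that the locally built package is picked up by the runtime environment, you can (for example) try importing its configuration module:
# Verify that the locally built PerfMonVTune configuration is importable
$ python -c 'from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg; print("PerfMonVTune OK")'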
To profile a range of events, add the following to your jO:
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
cfg.merge(VTuneProfilerServiceCfg(configFlags, ResumeEvent = 5, PauseEvent = 15))
which will profile the entire job between the 5th (inclusive) and 15th (exclusive) events. Of course, this makes sense in either a serial or a single-thread Athena job. For multi-thread jobs, depending on the configuration, you might get contributions from other parallel events in flight (VTune profiles the entire process).
If the user wants to profile a list of specific algorithms, they can add:
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
configFlags.PerfMon.VTune.ProfiledAlgs = ["foo", "bar"]
cfg.merge(VTuneProfilerServiceCfg(configFlags))
foo and bar (exact match) are the algorithms to be profiled (profiling starts before and stops after calling the execute). Again, this makes the most sense if it's used in conjunction with either a serial or a single-thread Athena job. For multi-thread jobs, depending on the configuration, you might get contributions from algorithms in other parallel events in flight (VTune profiles the entire process).
Then the sampling should be started in "paused state" as:
$ vtune -mrte-mode=native -collect hotspots -start-paused -- athena --threads 1 my_job_options.py
Running VTune through the job transform
It is also possible to run VTune through the job transform. Three main flags control the behavior:
- vtune: A boolean flag that toggles on/off the job execution under VTune
- vtuneDefaultOpts: A boolean flag that toggles on/off the default (hardcoded) VTune options
- vtuneExtraOpts: A comma-separated list of additional VTune arguments
By default, running your favorite transform job with the --vtune="True" flag gives you a hotspots analysis result. It is possible to collect a different analysis type with the extra options flag, e.g., --vtuneExtraOpts="-collect=threading". By default, we use the following VTune options: -run-pass-thr=--no-altstack and -mrte-mode=native. If you do not want them, you can simply pass --vtuneDefaultOpts="False".
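Putting the flags together, a minimal sketch of a transform invocation (reusing the q445 example from above; the event count is purely illustrative) could be:
# Run the transform under VTune, collecting a threading analysis instead of the default hotspots
Reco_tf.py --AMI q445 --maxEvents 10 --outputRDOFile myRDO.pool.root --vtune="True" --vtuneExtraOpts="-collect=threading"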
Note that you still have to set up VTune (before Athena) and might need to preload your favorite libraries, e.g., tcmalloc, etc., by hand when you run VTune with this method (which is typically as simple as setting the right environment variable before running the job, e.g., export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so").
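For completeness, a hedged end-to-end sketch of this method (assuming the same release, AMI tag, and preload variables as in the earlier examples) might look like:
# Set up VTune before Athena to avoid python clashes
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
lsetup "asetup Athena,24.0,latest"
# Preload tcmalloc and the Intel math library by hand
export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so"
# Run the transform under VTune with the default (hotspots) options
Reco_tf.py --AMI q445 --maxEvents 10 --outputRDOFile myRDO.pool.root --vtune="True"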