VTune Profiler
Introduction
VTune Profiler is a commercial application for software performance analysis on 32- and 64-bit x86-based machines. It is one of a range of Intel tools installed at CERN and distributed through CVMFS, so the compilers and performance tools are available from any CVMFS-enabled Linux machine at CERN.
Starting VTune
The basic usage requires sourcing the necessary setup script and passing your executable to vtune (historically amplxe-cl). In practice, this means:
# Setup Intel Tools
# Do this before running asetup to avoid python clashes
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
# Setup the latest 24.0 Athena nightly for Reco_tf.py
lsetup "asetup Athena,24.0,latest"
# Run a q445 job to generate the pickle file (to avoid profiling the configuration step)
ATHENA_CORE_NUMBER=1 Reco_tf.py \
--CA \
--AMI q445 \
--maxEvents -1 \
--multithreaded="True" \
--conditionsTag="$(python -c 'from AthenaConfiguration.TestDefaults import defaultConditionsTags; print(defaultConditionsTags.RUN3_MC)')" \
--outputAODFile="myAOD.MT.pool.root" \
--perfmon=none \
--athenaopts="--config-only=reco_q445.pkl"
# Run profiling
vtune -no-follow-child -mrte-mode=native -collect hotspots -- \
"$(which python)" "$(which athena.py)" \
--preloadlib="$ATLASMKLLIBDIR_PRELOAD/libintlc.so.5:$ATLASMKLLIBDIR_PRELOAD/libimf.so" \
reco_q445.pkl
Note
May 20, 2021: With VTune 2021.2+ and Python 3.7+, profiling a Python job that uses cppyy (that is, all athena.py jobs) makes the process hang (see ATLINFR-4105 for a more detailed description of the problem). To work around this issue, the -mrte-mode=native flag, as shown above, is needed.
This should produce a folder called r000hs and a file called r000hs.vtune therein, which contains the profiling output. The results can be visualized using the GUI via:
# Invoke the GUI for visualization
vtune-gui r000hs/r000hs.vtune
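If the profiling step is driven from a script, the command line shown above can be assembled programmatically. The following is a minimal sketch, not part of the ATLAS tooling: the helper name build_vtune_command is hypothetical, it uses only the vtune options that appear in the command above, and it leaves out the --preloadlib argument for brevity.

```python
import shlex
import shutil


def build_vtune_command(payload, analysis="hotspots", follow_child=False):
    """Return the vtune command line as a list of arguments.

    Hypothetical helper: only options shown in the text above are used.
    """
    cmd = ["vtune"]
    if not follow_child:
        cmd.append("-no-follow-child")
    # Needed for Python jobs that use cppyy (see the note above)
    cmd.append("-mrte-mode=native")
    cmd += ["-collect", analysis]
    # Everything after "--" is the application to be profiled
    cmd.append("--")
    cmd += list(payload)
    return cmd


if __name__ == "__main__":
    # Fall back to the bare names if the executables are not on PATH
    python = shutil.which("python") or "python"
    athena = shutil.which("athena.py") or "athena.py"
    cmd = build_vtune_command([python, athena, "reco_q445.pkl"])
    # Print the command; pass it to subprocess.run(cmd, check=True) to execute
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems when paths contain spaces, and printing it first makes it easy to check before launching a long job.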

It is also possible to create call-graphs using VTune output. First, you need to obtain a Python script called gprof2dot from this GitHub project and then:
vtune -report gprof-cc -result-dir output -format text -report-output output.txt
gprof2dot.py -f axe output.txt | dot -Tpng -o output.png
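Two useful gprof2dot options are --strip and --color-nodes-by-selftime. The first strips function parameters and template arguments from the function names, which keeps the node labels short. The second colors the nodes by self time, which is what GPerfTool normally does.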
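The two-step call-graph recipe can also be wrapped in a small script. This is a sketch under stated assumptions: the function names are hypothetical, gprof2dot.py and dot are assumed to be on PATH, and only the options mentioned above (-f axe, --strip, --color-nodes-by-selftime) are used.

```python
import subprocess


def callgraph_commands(result_dir, report_txt, image_png,
                       strip=True, color_by_selftime=True):
    """Build the three commands of the call-graph pipeline (hypothetical helper)."""
    report = ["vtune", "-report", "gprof-cc", "-result-dir", result_dir,
              "-format", "text", "-report-output", report_txt]
    gprof2dot = ["gprof2dot.py", "-f", "axe"]
    if strip:
        gprof2dot.append("--strip")
    if color_by_selftime:
        gprof2dot.append("--color-nodes-by-selftime")
    gprof2dot.append(report_txt)
    dot = ["dot", "-Tpng", "-o", image_png]
    return report, gprof2dot, dot


def make_callgraph(result_dir, report_txt="output.txt", image_png="output.png"):
    """Run the pipeline; requires vtune, gprof2dot.py and dot to be installed."""
    report, gprof2dot, dot = callgraph_commands(result_dir, report_txt, image_png)
    subprocess.run(report, check=True)
    # Equivalent of the shell pipe "gprof2dot.py ... | dot ..."
    graph = subprocess.run(gprof2dot, check=True, capture_output=True)
    subprocess.run(dot, input=graph.stdout, check=True)
```

For example, make_callgraph("r000hs") would turn the result directory from the previous step into output.png.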
More useful information can be obtained at:
Software vs Hardware Event-based sampling
If you're running a Hotspots analysis, vtune uses software-based sampling by default. If you have an Intel chip and have installed the sep drivers (as root) as described on this webpage, you can enable hardware-based sampling with:
vtune -mrte-mode=native -collect hotspots -knob sampling-mode=hw ...;
Profiling a list of specific algorithms or a range of events in Athena jobs
We have a service called PerfMonVTune that allows users to profile either a list of specific algorithms or a range of events in Athena jobs. The current implementation doesn't allow mixing these two (i.e. profiling a specific algorithm in a range of events) but that can be easily provided if there is demand.
PerfMonVTune isn't built as part of the nightlies since it relies on VTune, which is not provided by default. However, users can check out the package and compile it on top of VTune and a main Athena nightly as follows:
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
lsetup "asetup Athena,main,latest" "git";
git atlas init-workdir https://:@gitlab.cern.ch:8443/atlas/athena.git -p PerfMonVTune;
mkdir build; cd build;
cmake ../athena/Projects/WorkDir; cmake --build .;
source x86_64-*/setup.sh; cd ..;
Then, to profile a range of events, add the following to the job configuration:
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
cfg.merge(VTuneProfilerServiceCfg(flags, ResumeEvent=5, PauseEvent=15))
which will profile the entire job between the 5th (inclusive) and 15th (exclusive) events. This makes the most sense in a serial or single-threaded Athena job. In multi-threaded jobs, depending on the configuration, the profile may include contributions from other events in flight, because VTune profiles the entire process.
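To make the window boundaries concrete, the sketch below mimics the inclusive/exclusive semantics described above in plain Python. It is an illustration only and not the PerfMonVTune implementation: the function name is hypothetical, and whether the service counts the first event as 0 or 1 is not stated here, so the event counters are passed in explicitly.

```python
def profiled_events(event_counters, resume_event, pause_event):
    """Return the counters inside the half-open window [resume_event, pause_event).

    Hypothetical illustration of the ResumeEvent/PauseEvent semantics.
    """
    return [evt for evt in event_counters if resume_event <= evt < pause_event]


if __name__ == "__main__":
    # A 20-event job with ResumeEvent=5 and PauseEvent=15
    window = profiled_events(range(1, 21), resume_event=5, pause_event=15)
    print(window)
```

With ResumeEvent=5 and PauseEvent=15, ten events are covered: event 5 is included and event 15 is not.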
To profile specific algorithms instead, the user can add
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
flags.PerfMon.VTune.ProfiledAlgs = ["foo", "bar"]
cfg.merge(VTuneProfilerServiceCfg(flags))
Here, foo and bar (exact match) are the names of the algorithms to be profiled. Profiling starts before and stops after each call to execute. Again, this makes the most sense in a serial or single-threaded Athena job. In multi-threaded jobs, depending on the configuration, the profile may include contributions from algorithms running in other events in flight, because VTune profiles the entire process.
The collection should then be started in the paused state:
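The exact-match rule means that an algorithm whose name merely contains one of the listed strings is not profiled. The following sketch illustrates that behavior in plain Python. It is not the service's code: the function name and the example algorithm names are hypothetical, and case-sensitive comparison is an assumption.

```python
def is_profiled(alg_name, profiled_algs):
    """Exact, case-sensitive name match (assumed semantics, hypothetical helper)."""
    return alg_name in set(profiled_algs)


if __name__ == "__main__":
    profiled_algs = ["foo", "bar"]
    for name in ["foo", "bar", "fooAlg", "Foo"]:
        print(name, is_profiled(name, profiled_algs))
```

In practice this means the entries in ProfiledAlgs should be the full algorithm instance names as they appear in the job.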
vtune -mrte-mode=native -collect hotspots -start-paused -- athena --threads 1 my_job_options.py
Running VTune through the job transform
It is also possible to run VTune through the job transform. Three main flags control the behavior:
vtune: A boolean flag that toggles on/off the job execution under VTune
vtuneDefaultOpts: A boolean flag that toggles on/off the default (hardcoded) VTune options
vtuneExtraOpts: A comma-separated list of additional VTune arguments
By default, running your favorite transform job with the --vtune="True" flag will produce a hotspots analysis result. It is possible to collect a different analysis type with the extra options flag, e.g., --vtuneExtraOpts="-collect=threading". By default, we use the following VTune options: -run-pass-thr=--no-altstack and -mrte-mode=native. If you do not want them, you can simply do --vtuneDefaultOpts="False".
Note that you still have to set up VTune (before Athena) when you run it with this method. You might also need to preload your favorite libraries, e.g., tcmalloc, by hand. This is typically as simple as setting the right environment variable before running the job, e.g., export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so".
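When the transform is launched from a script, the three flags and the preload variable can be put together as in the sketch below. Both helper names are hypothetical, and the sketch only formats strings using the flag names given above. It does not check them against the transform itself.

```python
import os


def vtune_transform_args(extra_opts=(), use_default_opts=True):
    """Build the VTune-related transform arguments (hypothetical helper)."""
    args = ["--vtune=True"]
    if not use_default_opts:
        # Drops the hardcoded defaults mentioned in the text above
        args.append("--vtuneDefaultOpts=False")
    if extra_opts:
        # vtuneExtraOpts takes a comma-separated list of VTune arguments
        args.append("--vtuneExtraOpts=" + ",".join(extra_opts))
    return args


def preload_value(*libraries):
    """Join library paths into an LD_PRELOAD value (colon-separated)."""
    return ":".join(libraries)


if __name__ == "__main__":
    print(vtune_transform_args(extra_opts=["-collect=threading"]))
    # Directory variables are read from the environment; empty if unset
    tcmalloc_dir = os.environ.get("TCMALLOCDIR", "")
    mkl_dir = os.environ.get("ATLASMKLLIBDIR_PRELOAD", "")
    print(preload_value(os.path.join(tcmalloc_dir, "libtcmalloc_minimal.so"),
                        os.path.join(mkl_dir, "libimf.so")))
```

The resulting argument list can be appended to the usual transform command, and the preload string exported as LD_PRELOAD in the environment of the job.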