Valgrind

Introduction

Valgrind is an extremely useful code-checking tool. It works by tracking every bit of memory a program uses and checking, among other things, that each one is properly initialized before it is used. This takes a lot of CPU power and memory, so expect programs running under Valgrind to be much slower.

But, to quote the Valgrind website:

"With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting, and making your programs more stable. You can also perform detailed profiling, to speed up and reduce memory use of your programs."

As mentioned, Valgrind comes with several tools (also called skins). The default is Memcheck, which checks all reads and writes of memory and intercepts all calls to malloc/new/free/delete.

As a result, it can catch the following errors:

Use of uninitialized memory
Reading/writing memory after it has been freed
Reading/writing off the end of malloc'd blocks
Reading/writing inappropriate areas on the stack
Memory leaks, where pointers to malloc'd blocks are lost forever
Mismatched use of malloc/new/new[] vs free/delete/delete[]
Overlapping src and dst pointers in memcpy() and related functions
Some misuses of the POSIX pthreads API
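
To see a couple of these reports in practice, the following minimal sketch writes a small C program with two deliberate bugs (a write past the end of a heap block and a leaked block) and, if a C compiler and Valgrind happen to be installed, runs it under Memcheck. The file name memcheck_demo.c is a hypothetical choice for illustration and is not part of the ATLAS software.

```shell
# Write a tiny C program containing two deliberate memory errors.
# The file name is a hypothetical choice for this illustration.
cat > memcheck_demo.c <<'EOF'
#include <stdlib.h>

int main(void) {
    int *block = malloc(4 * sizeof(int));
    if (block == NULL) {
        return 1;
    }
    block[4] = 1;   /* invalid write: one element past the end of the block */
    free(block);

    block = malloc(16);   /* this block is never freed */
    block = NULL;         /* last pointer lost: a "definitely lost" leak */
    return 0;
}
EOF

# Compile without optimization and with debug symbols, then run under Memcheck.
# Both steps are skipped when the tools are not available.
if command -v cc >/dev/null 2>&1 && command -v valgrind >/dev/null 2>&1; then
    if cc -g -O0 -o memcheck_demo memcheck_demo.c; then
        valgrind --leak-check=full ./memcheck_demo || echo "valgrind run failed"
    fi
else
    echo "cc or valgrind not found; memcheck_demo.c was written but not run"
fi
```

When the tools are present, Memcheck should report an invalid write of size 4 and one block as definitely lost; the exact wording and addresses depend on the Valgrind version and platform.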

Other tools are also available, such as:

Addrcheck - a lightweight (faster) version of Memcheck (no longer shipped with current Valgrind releases)
Cachegrind - a cache profiler
Callgrind - a call graph profiler (extended version of Cachegrind)
- You can use this to profile only your algorithm, using Valkyrie
Massif - a heap memory profiler
Helgrind - a detector of data races in multithreaded programs

For more information, read the official Valgrind documentation.

Starting Valgrind

Valgrind is shipped with the LCG releases and is available in the Athena environment.

Alma9 Configuration Issue

There is a known issue with running Valgrind on Alma9 with the default config. Add --enable-debuginfod=no to your valgrind options to resolve this.

To use Valgrind with Athena and job options, you need to first generate a pickle file from your Python job options and then call valgrind as shown here:

athena.py --config-only=rec.pkl --stdcmalloc jobOptions.py
valgrind $valgrindOpts $(which python) $(which athena.py) --stdcmalloc rec.pkl

where valgrindOpts is the configuration for a particular tool. For example, to check for memory leaks/violations:

valgrindOpts="--show-possibly-lost=no --smc-check=all --tool=memcheck --leak-check=full --num-callers=30 --log-file=valgrind.%p.%n.out --track-origins=yes --enable-debuginfod=no"
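
The same option string can be assembled one flag at a time, which makes it easier to comment out or swap individual options. This is only a sketch of one way to organize the settings; the flags themselves are exactly those from the example above, and the final command is printed rather than executed.

```shell
# Build the Memcheck option string from the example above, one flag per line.
valgrindOpts="--tool=memcheck"
valgrindOpts="$valgrindOpts --leak-check=full"             # detailed report for each leak
valgrindOpts="$valgrindOpts --show-possibly-lost=no"       # hide "possibly lost" blocks
valgrindOpts="$valgrindOpts --track-origins=yes"           # trace uninitialized values to their origin
valgrindOpts="$valgrindOpts --num-callers=30"              # stack trace depth
valgrindOpts="$valgrindOpts --smc-check=all"               # needed for JIT-compiled code (ROOT 6)
valgrindOpts="$valgrindOpts --log-file=valgrind.%p.%n.out" # %p expands to the process ID
valgrindOpts="$valgrindOpts --enable-debuginfod=no"        # Alma9 workaround

# Print the command that would be run, without executing it.
echo "valgrind $valgrindOpts \$(which python) \$(which athena.py) --stdcmalloc rec.pkl"
```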

Using a pickle file is an optimization

Using a pickle file lets Valgrind skip the Python configuration stage, so that only the actual job is checked. You can also run valgrind directly on your original command line without creating a pickle file first.

Common valgrind options:

--leak-check=yes enables detailed leak checking with a full printout of each leak (equivalent to full; the default, summary, only prints totals)
--trace-children=yes you likely need this, as athena.py by default runs the job in a subprocess; without it your log file would be empty
--num-callers=25 sets the depth of the stack trace; depending on the problem you might need to increase this number even further
--show-reachable=yes will also show leaks that are still reachable, see the manual for an explanation
--track-origins=yes will tell you where uninitialized values that are later used came from
- the last two options (--show-reachable and --track-origins) will increase memory usage
--smc-check=all tells Valgrind to check for self-modifying code, that is, code that is modified at run time. This is needed for JIT with ROOT 6. It is set by default in VALGRIND_OPTS in all recent releases
--enable-debuginfod=no needed on Alma9 to avoid configuration issues
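
With --log-file=valgrind.%p.%n.out and --trace-children=yes, a run can leave several log files behind, one per process. A minimal sketch for pulling the headline numbers out of them follows. The function name summarize_valgrind_logs is hypothetical, and the sketch assumes the usual Memcheck summary lines ("ERROR SUMMARY" and "definitely lost:") are present in the logs.

```shell
# Hypothetical helper: print the Memcheck summary lines from each log file.
# Passing /dev/null as an extra file makes grep prefix every match with its file name.
summarize_valgrind_logs() {
    grep -E 'ERROR SUMMARY|definitely lost:' "$@" /dev/null
}

# Usage: summarize all matching logs in the current directory, if there are any.
for log in valgrind.*.out; do
    [ -e "$log" ] || continue
    summarize_valgrind_logs "$log" || echo "$log: no summary lines found"
done
```

Log files from child processes that exited early may legitimately contain no summary, which is why the loop reports that case instead of stopping.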

You can also check heap memory utilization with Massif:

valgrindOpts="--tool=massif --pages-as-heap=yes --threshold=0.01 --detailed-freq=1 --log-file=valgrind.log --enable-debuginfod=no"

Massif-specific options:

--pages-as-heap=yes tells Massif to profile memory at the page level
--threshold=0.01 is the significance threshold for heap allocations, as a percentage of total memory size
--detailed-freq=1 is the frequency of detailed snapshots. 1 means every snapshot is detailed

Other Massif options can be found in the Massif manual.

With the options above, Valgrind's log messages are written to valgrind.log. The Massif profile itself goes to a separate file, massif.out.<pid> by default.
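
The profile data can be turned into a readable report with ms_print, which ships with Valgrind. The following is a minimal sketch, assuming Massif's default output name massif.out.<pid> was not changed; the function name show_latest_massif is hypothetical.

```shell
# Hypothetical helper: print a report for the most recent Massif output file.
show_latest_massif() {
    # ls -t sorts newest first; errors are silenced when no file matches.
    latest=$(ls -t massif.out.* 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "no massif.out.* file found"
        return 1
    fi
    if command -v ms_print >/dev/null 2>&1; then
        ms_print "$latest"
    else
        echo "ms_print not found; profile data is in $latest"
    fi
}
```

Calling show_latest_massif in the directory where the job ran prints the report to the terminal; piping it through a pager such as less is usually more comfortable because the report can be long.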

Performance monitoring (i.e., PerfMon) needs to be turned off when running Valgrind to avoid a crash. The following fragment should be included in your job options file or in the preExec argument for a transform (e.g., Reco_tf.py) command:

from RecExConfig.RecFlags import rec
rec.doPerfMon.set_Value_and_Lock(False)
rec.doDetailedPerfMon.set_Value_and_Lock(False)
rec.doSemiDetailedPerfMon.set_Value_and_Lock(False)
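
For a transform, the same fragment can be collapsed into a single string and passed via preExec. The sketch below only assembles and prints the command. It assumes the transform's --preExec argument, the [...] stands for the rest of your transform arguments, and the variable name perfmonOff is a hypothetical choice.

```shell
# Join the three PerfMon settings into one semicolon-separated list of Python statements.
perfmonOff="from RecExConfig.RecFlags import rec"
perfmonOff="$perfmonOff; rec.doPerfMon.set_Value_and_Lock(False)"
perfmonOff="$perfmonOff; rec.doDetailedPerfMon.set_Value_and_Lock(False)"
perfmonOff="$perfmonOff; rec.doSemiDetailedPerfMon.set_Value_and_Lock(False)"

# Print the command rather than run it; fill in the remaining arguments yourself.
echo "Reco_tf.py --preExec '$perfmonOff' [...]"
```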

Running Valgrind on debug builds is quite slow, so you may want to run on optimized builds first. This will at least give you a rough idea of where the problems lie (for instance, it can tell you which methods or functions of which classes have problems). For more details, you can run in debug mode. Running the entirety of ATLAS reconstruction in debug mode is slow (and may take too much memory), so it's probably best to rebuild in debug mode only the particular packages you're interested in.

CA-based Configuration

To produce a pickled configuration with ComponentAccumulator, run something like:

python -m AthExHelloWorld.HelloWorldConfig --config-only=myConfig.pkl --evtMax=10

To execute it using Valgrind:

valgrind --leak-check=yes --trace-children=yes --num-callers=25 --show-reachable=yes --track-origins=yes --smc-check=all --enable-debuginfod=no $(which python) $(which athena.py) myConfig.pkl

Note that you can use the --evtMax parameter when creating the pickle file to limit the number of events.

It is also possible to run Valgrind directly on your configuration script, which means that the Python configuration stage is also processed by Valgrind:

valgrind --tool=memcheck --leak-check=full --smc-check=all --num-callers=30 --enable-debuginfod=no $(which python) -m AthExHelloWorld.HelloWorldConfig --evtMax=10

Running Through Job Transforms

Most of the transform jobs have been migrated to CA, which means each job basically boils down to running an auto-generated runargs file, e.g., runargs.JOBNAME.py, and then executing the corresponding runwrapper.JOBNAME.sh. You can simply add Valgrind inside the transform job:

Reco_tf.py \
  --perfmon 'none' \
  [...] \
  --athenaopts="--stdcmalloc" \
  --valgrind "True" \
  --valgrindDefaultOpts "False" \
  --valgrindExtraOpts="${valgrindOpts}"

Suppression Files

There will normally be a lot of errors reported from external packages over which we have no control. These can be suppressed by using suppression files as follows:

source $(which valgrind-atlas-opts.sh)

This script sets up all relevant suppression files in $VALGRIND_OPTS, so there is no need to specify them on the command line, and you can run the valgrind command as listed above. If you want to use additional suppression files, specify them directly on the command line via --suppressions.
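
As an illustration of adding your own suppression file, the sketch below writes a small one and builds the matching --suppressions option. The file name my.supp, the suppression name, and the library pattern libExternal are all hypothetical placeholders. The entry follows Valgrind's suppression-file format: a name, then Tool:Kind, then the stack frames to match.

```shell
# Write a hypothetical suppression entry for a leak reported inside an external library.
cat > my.supp <<'EOF'
{
   hypothetical_external_leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   ...
   obj:*/libExternal.so*
}
EOF

# Pass the file in addition to whatever $VALGRIND_OPTS already provides.
extraOpts="--suppressions=$PWD/my.supp"

# Print the command that would be run, without executing it.
echo "valgrind $extraOpts \$valgrindOpts \$(which python) \$(which athena.py) --stdcmalloc rec.pkl"
```

Rather than writing entries by hand, you can run Valgrind with --gen-suppressions=all, which prints a ready-made suppression block after each reported error that you can copy into such a file.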