Testing and debugging¶
Please follow the tutorials at the following links:
GDB¶
We have prepared an example Athena algorithm that you can use to play with the the debugger. It is available in this gitlab repository. After checking it out, build it against a release of main that has debugging enabled.
You can do this in whatever way you're comfortable with. However, if you're unsure, the commands below should work.
It's recommended that you use LXPLUS for these exercises.
mkdir DebugTutorial
cd DebugTutorial
setupATLAS
git clone https://gitlab.cern.ch/ssnyder/DebugTutorial.git
mkdir build
cd build
asetup main,Athena,latest,dbg
cmake ../DebugTutorial
. ./x86_64-el9-gcc14-dbg/setup.sh
make
This package defines an Athena algorithm TestAlg that is intended
to retrieve a collection of jets and make a histogram of the invariant
masses of all jet pairs for up to the first six jets. It does this
by first building (during initialization) a table of the indices
of the jets for each pair. It then uses this table during execute
to form the jet pairs.
If you look at the code, it's an interesting mix of good practice
and silly mistakes.
If you run it:
$ athena.py ../DebugTutorial/DebugTutorial/python/DebugTutorial.py 2>&1|tee log
or
$ python -m DebugTutorial.DebugTutorial 2>&1|tee log
you'll find that it crashes with a C++ exception:
TestAlg FATAL jet 0 xverse pt Standard std::exception is caught in sysExecute
TestAlg ERROR SG::ExcBadAuxVar: Attempt to retrieve nonexistent aux data item `::Pt' (371).
We'll try to catch this exception with the debugger.
First, set things up so that the debugger won't always be trying to load
all symbols. Add this to ~/.gdbinit:
set auto-solib 0
Next add the following at the end of your ~/.gdbinit file:
python import GdbUtils
gdb so that it can find the source files associated with a
release. Also provides commands like findlib to load a shared library by
address instead of name.
Then rerun athena like this, telling it to stop after initialization but before the first event is processed:
$ athena.py --debug=exec ../DebugTutorial/DebugTutorial/python/DebugTutorial.py
...
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f69e26868ba in __futex_abstimed_wait_common () from /lib64/libc.so.6
(gdb)
Then tell gdb to stop when an exception is thrown and continue:
(gdb) sha libstdc++
Reading symbols from /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/lib64/libstdc++.so.6...
...
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) c
Continuing.
AthenaEventLoopMgr INFO ===>>> start of run 189751 <<<===
AthenaEventLoopMgr INFO ===>>> start processing event #35007, run #189751 0 events processed so far <<<===
Thread 1 "athena.py" hit Catchpoint 1 (exception thrown), 0x00007f69e22b4350 in __cxa_throw ()
from /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/lib64/libstdc++.so.6
Ask for a backtrace. It's incomplete because we haven't yet loaded symbols for most libraries:
(gdb) bt
#0 0x00007f69e22b4350 in __cxa_throw ()
from /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/lib64/libstdc++.so.6
#1 0x00007f69adc09d65 in ?? ()
from /cvmfs/atlas-nightlies.cern.ch/repo/sw/main_Athena_x86_64-el9-gcc14-dbg/2026-01-17T2101/Athena/25.0.24/InstallArea/x86_64-el9-gcc14-dbg/lib/libAthContainers.so
#2 0x0000000000000000 in ?? ()
(gdb)
Load libraries and look at it again:
(gdb) sha AthCont
...
(gdb) sha Debug
...
(gdb) sha xAODCore
...
(gdb) sha xAODBase
...
(gdb) sha xAODJet
...
(gdb) bt
#0 0x00007f69e22b4350 in __cxa_throw ()
from /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/lib64/libstdc++.so.6
#1 0x00007f69adc09d65 in SG::AuxVectorData::getDataOol (this=0x10169400,
auxid=371, allowMissing=false)
at /build/atnight/localbuilds/nightlies/Athena/main/athena/Control/AthContainers/Root/AuxVectorData.cxx:377
#2 0x00007f69ae4369e4 in SG::AuxVectorData::Cache::getDataArray (
this=0x10169438, auxid=371, parent=...)
at /build/atnight/localbuilds/nightlies/Athena/main/athena/Control/AthContainers/AthContainers/AuxVectorData.icc:462
#3 0x00007f69ae4363e3 in SG::AuxVectorData::getDataArray (this=0x10169400,
auxid=371)
at /build/atnight/localbuilds/nightlies/Athena/main/athena/Control/AthContainers/AthContainers/AuxVectorData.icc:254
#4 0x00007f69ae44281e in SG::AuxVectorData::getData<float> (this=0x10169400,
auxid=371, ndx=0)
at /build/atnight/localbuilds/nightlies/Athena/main/athena/Control/AthContainers/AthContainers/AuxVectorData.icc:193
#5 0x00007f697e39e2ad in SG::ConstAccessor<float, std::allocator<float> >::operator()<xAOD::Jet_v1> (
this=0x7f694f5ff8c0 <Tutorial::TestAlg::execute()::PtAcc>, e=...)
at /build/atnight/localbuilds/nightlies/Athena/main/athena/Control/AthContainers/AthContainers/ConstAccessor.icc:113
#6 0x00007fa2ca88787b in Tutorial::TestAlg::execute (this=0x137ea000)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:67
...
So the exception originated at line 67 of our algorithm. We can move up the stack to examine it:
(gdb) up
#1 0x00007f69adc09d65 in SG::AuxVectorData::getDataOol (this=0x10169400,
auxid=371, allowMissing=false)
at /build/atnight/localbuilds/nightlies/Athena/main/athena/Control/AthContainers/Root/AuxVectorData.cxx:377
377 throw SG::ExcBadAuxVar (auxid);
Repeatedly press enter until you get to:
(gdb)
#6 0x00007fa2ca88787b in Tutorial::TestAlg::execute (this=0x137ea000)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:67
67 msg(MSG::VERBOSE) << jetlab(i) << " " << PtAcc(*jet) << endmsg;
The AthContainers
library contains a number of functions that may
be called from the debugger to inspect xAOD objects;
see AthContainers/debug.h.
The function SGdebug::print_aux_vars will dump
out the list of aux variables associated with an object. So let's call it:
(gdb) p jet
$1 = (const xAOD::Jet *) 0x839a8a0
(gdb) p SGdebug::print_aux_vars(jet)
204 ::pt [float]
205 ::eta [float]
...
So there's no Pt --- it's instead pt. Actually, we shouldn't really
be using an explicit ConstAccessor here at all; we should instead just call
the provided pt() method. So change this line to:
msg(MSG::VERBOSE) << jetlab(i) << " " << jet->pt() << endmsg;
(and remove the PtAcc declaration).
and rebuild. Now it gets further, but still crashes. I get this, though what you get may differ slightly:
EventSelector INFO Disconnecting input so
urceID: 76376094-4CF4-084F-B321-C5E2CEE3F29D
/eos/atlas/atlascerngroupdisk/data-art/grid...root INFO Database being retired...
-------------------------------------------------------------------------------------
Producing (fast) stack trace...
-------------------------------------------------------------------------------------
(pid=2510009 ppid=1047100) received fatal signal 11 (Segmentation fault)
...
Run it in the debugger again, and just continue after it stops:
$ athena.py --debug=exec ../DebugTutorial/DebugTutorial/python/DebugTutorial.py
...
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fee74cd4175 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb) c
...
Thread 1 "athena.py" received signal SIGSEGV, Segmentation fault.
0x00007fee75633fec in ?? ()
from /cvmfs/atlas-nightlies.cern.ch/repo/sw/main_Athena_x86_64-el9-gcc14-dbg/sw/lcg/releases/LCG_106b_ATLAS_1/gperftools/2.15/x86_64-el9-gcc14-dbg/lib64/libtcmalloc_minimal.so
When I load the appropriate libraries and go back up the stack, I find that it crashed inside ROOT:
#0 0x00007f20f04a5fec in tcmalloc::SLL_Next (t=0x0)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/linked_list.h:45
#1 0x00007f20f04b5dfa in tcmalloc::SLL_PopRange (head=0x115e080, N=21,
start=0x7fff885954e0, end=0x7fff885954e8)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/linked_list.h:93
#2 0x00007f20f04b5f86 in tcmalloc::ThreadCache::FreeList::PopRange (
this=0x115e080, N=21, start=0x7fff885954e0, end=0x7fff885954e8)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/thread_cache.h:239
#3 0x00007f20f04b51bf in tcmalloc::ThreadCache::ReleaseToCentralCache (
this=0x115e040, src=0x115e080, cl=2, N=21)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/thread_cache.cc:205
#4 0x00007f20f04b529b in tcmalloc::ThreadCache::Scavenge (this=0x115e040)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/thread_cache.cc:223
#5 0x00007f20f04bb923 in tcmalloc::ThreadCache::Deallocate (cl=32,
ptr=0x12571a80, this=0x115e040)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/thread_cache.h:405
#6 (anonymous namespace)::do_free_with_callback (size_hint=0, use_hint=false,
invalid_free_fn=0x7f20f04a2ad9 <(anonymous namespace)::InvalidFree(void*)>, ptr=0x12571a80)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/tcmalloc.cc:1541
#7 (anonymous namespace)::do_free (ptr=0x12571a80)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/tcmalloc.cc:1563
#8 free_fast_path (ptr=0x12571a80)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/tcmalloc.cc:1948
#9 tc_free (ptr=0x12571a80)
at /build/jenkins/workspace/lcg_release_pipeline/build/externals/gperftools-2.15/src/gperftools/2.15/src/tcmalloc.cc:1953
#10 0x00007f20eed338ee in TStorage::ObjectDealloc (vp=0x12571a80)
at /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.32.08/src/ROOT/6.32.08/core/base/src/TStorage.cxx:325
#11 0x00007f20eed07ca0 in TObject::operator delete (ptr=0x12571a80)
at /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.32.08/src/ROOT/6.32.08/core/base/src/TObject.cxx:1095
#12 0x00007f20d0d5a6de in TBranchElement::~TBranchElement (this=0x12571a80,
__in_chrg=<optimized out>)
at /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.32.08/src/ROOT/6.32.08/tree/tree/src/TBranchElement.cxx:1003
...
You may however see something a bit different. This is in fact a common feature of heap problems: often, a crash doesn't occur until some time after the operation that actually screwed things up. So when you look the backtrace, you see a crash from an operation that looks perfectly fine.
Something that can often help is to use the standard malloc (not tcmalloc)
with extra checking enabled. This often makes the error show up earlier.
If you set the environment variable MALLOC_CHECK_ to 2, then malloc
will abort as soon as it detects a problem. For this to work,
you also need to preload the debugging version of malloc. Let's try it:
ATHENA_ADD_PRELOAD=/usr/lib64/libc_malloc_debug.so.0 MALLOC_CHECK_=2 athena.py --stdcmalloc --debug=exec ../DebugTutorial/DebugTutorial/python/DebugTutorial.py
...
(gdb) c
...
AthenaEventLoopMgr INFO ===>>> start processing event #35007, run #189751 0 events processed so far <<<===
...
free(): invalid pointer
Thread 1 "athena.py" received signal SIGABRT, Aborted.
0x00007f094e88ba6c in __pthread_kill_implementation () from /lib64/libc.so.6
If you look at the stack now, we see:
(gdb) bt
#0 0x00007f094e88ba6c in __pthread_kill_implementation ()
from /lib64/libc.so.6
#1 0x00007f094e83e686 in raise () from /lib64/libc.so.6
#2 0x00007f094e828833 in abort () from /lib64/libc.so.6
#3 0x00007f094e829170 in __libc_message.cold () from /lib64/libc.so.6
#4 0x00007f094e87faab in __libc_fatal () from /lib64/libc.so.6
#5 0x00007f094f1d701a in free_check.part ()
from /usr/lib64/libc_malloc_debug.so.0
#6 0x00007f094f1d8271 in free () from /usr/lib64/libc_malloc_debug.so.0
#7 0x00007fc0406c53da in Tutorial::jetlab[abi:cxx11](int) (i=10)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:54
#8 0x00007fc0406c5871 in Tutorial::TestAlg::execute (this=0xd862620)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:66
(gdb) up 7
#7 0x00007fc0406c53da in Tutorial::jetlab[abi:cxx11](int) (i=10)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:54
54 delete [] buf;
(gdb) list
49 {
50 char* buf = new char[16];
51 // 15 chars + null byte
52 sprintf (buf, "jet %1d xverse pt", i);
53 std::string s(buf);
54 delete [] buf;
55 return s;
56 }
57
58
(gdb) p buf
$1 = 0xef1c5c0 "jet 10 xverse pt"
Hmm, is it a little suspicious that it crashed for i=10? The code here is:
char* buf = new char[16];
// 15 chars + null byte
sprintf (buf, "jet %1d xverse pt", i);
The %1d was counted as taking one character, but if the argument
is larger than 9, then sprintf will actually output more than one character,
leading to a buffer overrun!
It's probably worth pointing out at this juncture that allocating a buffer
explicitly and doing a sprintf to it is a completely idiotic way of doing
this. It's probably best just to write directly to the message stream.
Also, the output is written as
msg(MSG::VERBOSE) << jetlab(i) << " " << jet->pt() << endreq;
which is bad since it always evaluates its arguments, even if the message isn't actually printed. Better to write this like
ATH_MSG_VERBOSE( "jet " << i << " xverse pt " << jet->pt() );
and get rid of the jetlab function. If we make that change and rerun,
then we no longer get a crash, but the histogram at the end is nearly empty
(again, what you get may differ slightly):
TH1.Print Name = hist, Entries= 5, Total sum= 5
fSumw[0]=0, x=-5
fSumw[1]=0, x=5
fSumw[2]=3, x=15
fSumw[3]=0, x=25
fSumw[4]=1, x=35
fSumw[5]=1, x=45
fSumw[6]=0, x=55
fSumw[7]=0, x=65
fSumw[8]=0, x=75
fSumw[9]=0, x=85
fSumw[10]=0, x=95
fSumw[11]=0, x=105
fSumw[12]=0, x=115
fSumw[13]=0, x=125
fSumw[14]=0, x=135
fSumw[15]=0, x=145
fSumw[16]=0, x=155
fSumw[17]=0, x=165
fSumw[18]=0, x=175
fSumw[19]=0, x=185
fSumw[20]=0, x=195
fSumw[21]=0, x=205
Let's stop in the execute() method and take a look.
$ athena.py --debug=exec ../DebugTutorial/DebugTutorial/python/DebugTutorial.py
...
(gdb) sha Debug
...
(gdb) br Tutorial::TestAlg::execute
Breakpoint 1 at 0x7f2e282c51c7: file /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx, line 51.
(gdb) c
Continuing.
...
Thread 1 "athena.py" hit Breakpoint 1, Tutorial::TestAlg::execute (
this=0x120fc000)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:51
51 int i = 0;
(gdb) list 59
54 ATH_MSG_VERBOSE( "jet " << i << " xverse pt " << jet->pt() );
55 ++i;
56 }
57
58 // Fill the histogram with invariant masses of jet pairs from the table.
59 unsigned int njet = jets->size();
60 if (njet > 1) {
61 if (njet > m_end.size())
62 njet = m_end.size();
63 for (int i=0; i < m_end[njet]; ++i) {
(It's possible that your line numbers may be slightly different, depending on exactly how you edited the file. Adjust the line numbers in the commands below appropriately.)
Let's go past the loop:
(gdb) br 59
Breakpoint 2 at 0x7efe146c2c62: file /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx, line 59.
(gdb) c
Continuing.
...
Thread 1 "athena.py" hit Breakpoint 2, Tutorial::TestAlg::execute (
this=0x120fc000)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:59
59 unsigned int njet = jets->size();
and then step through from here:
(gdb) n
60 if (njet > 1) {
(gdb) p njet
$1 = 18
(gdb) n
61 if (njet > m_end.size())
(gdb)
62 njet = m_end.size();
(gdb)
63 for (int i=0; i < m_end[njet]; ++i) {
(gdb) p njet
$3 = 6
(gdb) p m_end[6]
$3 = (__gnu_cxx::__alloc_traits<std::allocator<int>, int>::value_type &) @0xf903b58: 2
(gdb) p m_end.size()
$5 = 6
(gdb) p *m_end._M_impl._M_start@6
$3 = {2, 6, 12, 20, 30, 42}
The intent of the code is to look only at the top 6 jets. The array m_end
gives the ending position in the m_pairtable array for a given number of jets.
But if we look at how it's filled, m_end[i] is for i+1 jets, not i!
In fact, we're accessing the array out-of-bounds here, and getting garbage,
here 2.
Let's change the line
for (int i=0; i < m_end[njet]; ++i) {
to
for (int i=0; i < m_end[njet-1]; ++i) {
Now the histogram shows some entries:
TH1.Print Name = hist, Entries= 105, Total sum= 99
fSumw[0]=0, x=-5
fSumw[1]=10, x=5
fSumw[2]=14, x=15
fSumw[3]=8, x=25
fSumw[4]=5, x=35
fSumw[5]=9, x=45
fSumw[6]=7, x=55
fSumw[7]=10, x=65
fSumw[8]=6, x=75
fSumw[9]=8, x=85
fSumw[10]=6, x=95
fSumw[11]=3, x=105
fSumw[12]=3, x=115
fSumw[13]=1, x=125
fSumw[14]=2, x=135
fSumw[15]=0, x=145
fSumw[16]=1, x=155
fSumw[17]=2, x=165
fSumw[18]=2, x=175
fSumw[19]=1, x=185
fSumw[20]=1, x=195
fSumw[21]=6, x=205
... but we don't really expect the big peak near 0.
Let's try stepping through the initialize() method.
Using --debug=init on the command line will stop before initialization.
$ athena.py --debug=init ../DebugTutorial/DebugTutorial/python/DebugTutorial.py
...
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fa1a08d4175 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb)
But it's hard to set breakpoints at this point since most shared libraries haven't been loaded yet:
(gdb) break Tutorial::TestAlg::initialize
Function "Tutorial::TestAlg::initialize" not defined.
Make breakpoint pending on future shared library load? (y or [n])
(gdb)
We can tell the debugger to stop once a library has been loaded, and then set the breakpoint:
(gdb) catch load DebugTutorial
Catchpoint 1 (load)
(gdb) c
Continuing.
...
Thread 1 "athena.py" hit Catchpoint 1
Inferior loaded /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/build/x86_64-el9-gcc14-dbg/lib/libDebugTutorial.so
0x00007fa1a1490670 in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb) sha Debug
Reading symbols from /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/build/x86_64-el9-gcc14-dbg/lib/libDebugTutorial.so...
(gdb) break Tutorial::TestAlg::initialize
Breakpoint 2 at 0x7fa10d6c462b: file /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx, line 26.
(gdb) c
...
Thread 1 "athena.py" hit Breakpoint 2, Tutorial::TestAlg::initialize (
this=0x13654000)
at /afs/cern.ch/user/s/ssnyder/work/DebugTutorial/DebugTutorial/DebugTutorial/src/TestAlg.cxx:26
26 ATH_CHECK( m_jetsKey.initialize() );
(gdb) n
29 for (int i=0; i<6; i++) {
(gdb)
30 for (int j=0; j<=i; j++) {
(gdb)
31 m_pairtable.emplace_back(i, j);
(gdb) p i
$1 = 0
(gdb) p j
$2 = 0
Oops -- we wanted to take the invariant mass between distinct jets, not between a jet and itself.
The loop over j should perhaps be
for (int j=0; j<i; j++) {
With this change, we get a histogram that looks more reasonable:
TH1.Print Name = hist, Entries= 75, Total sum= 69
fSumw[0]=0, x=-5
fSumw[1]=0, x=5
fSumw[2]=1, x=15
fSumw[3]=5, x=25
fSumw[4]=4, x=35
fSumw[5]=7, x=45
fSumw[6]=6, x=55
fSumw[7]=10, x=65
fSumw[8]=6, x=75
fSumw[9]=8, x=85
fSumw[10]=6, x=95
fSumw[11]=3, x=105
fSumw[12]=3, x=115
fSumw[13]=1, x=125
fSumw[14]=2, x=135
fSumw[15]=0, x=145
fSumw[16]=1, x=155
fSumw[17]=2, x=165
fSumw[18]=2, x=175
fSumw[19]=1, x=185
fSumw[20]=1, x=195
fSumw[21]=6, x=205
That's all the bugs --- that I added deliberately, anyway. Can you find any more?
Running GDB with Job Transforms¶
To debug job transforms with gdb, use the --execOnly flag combined with --athenaopts="--debug=exec":
HelloWorld_tf.py --maxEvents=5 --execOnly --athenaopts="--debug=exec"
This will:
--execOnly: Execute only the athena step (skip pre/post-processing)--athenaopts="--debug=exec": Pass the--debug=execflag to athena, which launches the job under GDB
Once gdb starts, you can set breakpoints and debug your code as usual.