Working with production workflows¶
So far we have mostly run Athena by invoking python directly on simple job options. Production workflows such as simulation and reconstruction are far more complicated: they are defined in dozens or hundreds of configuration fragments and run thousands of different Athena algorithms and tools. To make running such complex workflows tractable, an additional layer of python wraps them up into easily understood one-line commands that are used both by the production system and for local development. These are known as job transforms (or job transformations). These python scripts assemble the job configuration based on the command-line inputs and then execute the job using Athena.
This part of the tutorial demonstrates where and how to get the information needed to reproduce official production jobs locally, how to manipulate job transforms to make software development more straightforward, and how to submit limited production jobs to the grid in order to validate changes with high statistics.
Running complex Athena workflows locally¶
Each production workflow has a corresponding job transform script, the main ones being:
- Gen_tf.py: event generation
- Sim_tf.py: detector simulation
- Reco_tf.py: reconstruction
- Derivation_tf.py: production of DAODs (PHYS, PHYSLITE etc.)
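Each of these is an executable python script available once a release is set up. As a quick sanity check before constructing a full command (and assuming an Athena release has already been set up with asetup), the transforms accept the standard --help option, which lists the available arguments:
Reco_tf.py --help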
When you are doing software development that impinges on one of these production workflows, you'll have to test that workflow locally before you make a merge request, at least to check that it runs. You may also want to run it on the grid yourself, which we'll look at later.
Getting the right commands¶
The transform commands themselves can be quite complicated and have to be updated frequently to keep up with the demands of data taking and simulated data production. So if you need to run one of the production workflows locally, it is best to look up the relevant command from a central source. This will generally come from one of two places.
- If you are asked to debug a crashing or otherwise problematic production job, you will be pointed to a specific job link on BigPanda, from which you can copy and paste a reproducer script that recreates the conditions of the job exactly, including the release, the input files and the exact command used.
- If you are developing part of a production workflow against the latest nightlies, it is usually best to get the commands from the ART tests that we run every day, since these are up to date and also provide ready-made test input files. Having heard the talk this morning you should now be familiar with ART, in particular how to find tests and how to locate the test scripts in GitLab (by clicking the fox symbol in the ART results table).
For example, the script for the test called test_bulkProcessing_data23 (in Tier0ChainTests) can be found in GitLab as described above - all of the tests have similar scripts, and from them you can infer the correct command and input files. So when testing a production chain locally, you need to figure out which test your work impinges on and then use the command from that test for your local development tests. This has the added advantage that the test is run on the grid automatically every night using the latest builds, so once your changes are in, you can inspect the results without having to run the jobs yourself. Most of the relevant tests are in Tier0ChainTests and TrfTestsART.
If you take the example given above from test_bulkProcessing_data23 you will infer that the command to test full reconstruction on 2023 data is (dropping a bunch of special formats made at Tier0):
Reco_tf.py \
--AMIConfig f1350 \
--inputBSFile "/cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/CampaignInputs/data23/RAW/data23_13p6TeV.00452463.physics_Main.daq.RAW/540events.data23_13p6TeV.00452463.physics_Main.daq.RAW._lb0514._SFO-16._0004.data" \
--outputAODFile "AOD.pool.root" \
--conditionsTag "CONDBR2-BLKPA-2023-05" \
--imf False
Before running anything you should make sure that the test itself ran in ART with the latest build. If it didn't, there's clearly no point in running it locally with that build; instead, select an earlier build that worked.
Having verified that the test ran in ART, to run the test locally you just do:
setupATLAS
asetup Athena,main,latest
mkdir mytest
cd mytest
Then run the Reco_tf.py command given above. To save time you can limit the number of events processed (e.g. by adding --maxEvents 10). Verify that the AOD file was produced using the command checkxAOD.py AOD.pool.root.
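Putting the pieces together, a complete local test might look like the following sketch (the --maxEvents value is just a suggestion to keep the runtime short):
# reconstruct 10 events from the ART input file
Reco_tf.py \
--AMIConfig f1350 \
--inputBSFile "/cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/CampaignInputs/data23/RAW/data23_13p6TeV.00452463.physics_Main.daq.RAW/540events.data23_13p6TeV.00452463.physics_Main.daq.RAW._lb0514._SFO-16._0004.data" \
--outputAODFile "AOD.pool.root" \
--conditionsTag "CONDBR2-BLKPA-2023-05" \
--imf False \
--maxEvents 10
# list the containers in the resulting AOD
checkxAOD.py AOD.pool.root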
Similar workflows (e.g. for simulation, derivation) can be found by browsing the ART tests in the different categories. Take a few minutes to see what tests are being run and which match your area of activity, and try to run a few of them - again remembering to reduce the number of events to save time.
You can also run ART tests locally, which means you don't have to copy and paste the commands, but it also means you can't make simple edits (e.g. reducing the number of outputs as above). See the talk this morning for more information.
Submitting test jobs to the grid with pathena¶
Assuming the above job ran, set up a grid proxy certificate to authenticate your job:
lsetup panda
voms-proxy-init -voms atlas
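You can check that the proxy was created and see how long it remains valid using the standard VOMS client tool (nothing ATLAS-specific here):
voms-proxy-info --all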
Then create a bash script submit.sh that contains the pathena and Reco_tf configuration as follows (change the name of the output to match your username and the date):
INPUT="data23_13p6TeV:data23_13p6TeV.00452463.physics_Main.daq.RAW"
OUTPUT="user.jcatmore.test300125_1"
pathena --trf 'Reco_tf.py \
--AMIConfig f1350 \
--inputBSFile "%IN" \
--outputAODFile "%OUT.AOD.pool.root" \
--conditionsTag "CONDBR2-BLKPA-2023-05" \
--imf False' \
--inDS $INPUT --outDS $OUTPUT \
--nFilesPerJob 1 --nFiles 5 \
--nCore 8 \
--noBuild --respectLB --maxCpuCount 43200
Note that we are now providing a dataset name (data23_13p6TeV:data23_13p6TeV.00452463.physics_Main.daq.RAW) rather than a file on CVMFS. This will be accessed by the grid. You can inspect the dataset (e.g. the number of files and the location of the dataset replicas) using Rucio. In a different terminal session:
setupATLAS
lsetup rucio
rucio list-files data23_13p6TeV:data23_13p6TeV.00452463.physics_Main.daq.RAW
rucio list-dataset-replicas data23_13p6TeV:data23_13p6TeV.00452463.physics_Main.daq.RAW
The latter verifies that the dataset is on disk as well as tape.
Important:
- Quotes: be careful how you construct the pathena command. The quotes inside --trf '<command>' should be different from the outermost ones (or escaped). In this example single quotes are used as the outermost ones and double quotes are used for the actual transform arguments.
- Adjust INPUT and especially OUTPUT accordingly before using the script.
- The pathena parameter --nFiles 5 is used to process only 5 of the 16335 files in the dataset. This should be done for testing or in this tutorial.
- Also adjust the Reco_tf parameters as needed.
- The %IN and %OUT parameters are the placeholders for pathena to inject the correct per-job configuration (see the sketch after this list).
- Always submit from the subdirectory mytest, since pathena packs all files of the current directory into the job input sandbox. Don't submit from your home directory, since you would ship your whole home directory content.
- To be safe we're using --nFilesPerJob 1.
- Looping jobs: use the option --maxCpuCount 43200 for an estimated maximum of 12h job duration, or better --maxCpuCount 86400 for a larger safety margin, so that jobs are not killed too early or too late by the PanDA pilot's looping-job detection.
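To illustrate what the placeholders do, the sketch below shows roughly how the transform command might look on a worker node after pathena has substituted them for one sub-job (with --nFilesPerJob 1, %IN becomes a single input file). The file names here are purely illustrative assumptions; the real ones are generated by PanDA.
# hypothetical expansion of the --trf command for a single sub-job (file names illustrative only)
Reco_tf.py \
--AMIConfig f1350 \
--inputBSFile "data23_13p6TeV.00452463.physics_Main.daq.RAW._lb0514._SFO-16._0001.data" \
--outputAODFile "user.jcatmore.test300125_1.AOD.pool.root" \
--conditionsTag "CONDBR2-BLKPA-2023-05" \
--imf False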
Run the job with
source submit.sh
Note that the task may take longer to finish than the tutorial session, depending on how busy the sites hosting the data are, so submit the job and then focus on running local tests or inspecting ART results (see below).
Monitoring¶
Monitor the progress of the task using the BigPanda monitoring page, pasting the task ID given by the above command into the search box, changing the search to "task by ID" and then clicking the search button.
Troubleshooting¶
If there was a mistake in the setup or a task is misbehaving use pbook to abort the task:
setupATLAS
lsetup panda
pbook
and then on the pbook command line:
show()
kill(12345678)
Looking at the log file¶
After the task has finished, download its log file dataset for a deeper look (substitute your own output name):
setupATLAS
lsetup rucio
rucio get user.jcatmore.test300125_1.log/
If your job doesn't finish in time, instead find an ART test and download the tarball as demonstrated earlier. Unpack the individual log tar files with:
find . -name "user.*.log.tgz" -exec tar xf {} \;
In the case of an ART tarball the log archives carry the group scope rather than user, so substitute group for user in the pattern.
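That is, for an ART tarball you would run:
find . -name "group.*.log.tgz" -exec tar xf {} \;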
Get a summary of all FPEs:
grep "WARNING FPE" tarball_PandaJob_*/log.RAWtoALL | awk '{print $11}' | sed 's/\[//' | sed 's/\]//' | sed -r '/^\s*$/d' | sort | uniq -c
Get a summary of the average Pss memory leaks (in units of KB, note there are also MB memory leaks!):
grep "Leak estimate per event Pss" tarball_PandaJob_*/log.RAWtoALL | grep KB | awk '{print $9}' | awk '{s+=$1}END{print "ave:",s/NR}'
Other text in the log files can be searched for in a similar way.
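For example, a purely illustrative variation on the pipelines above counts the ERROR lines in each job's log and lists the noisiest jobs first (the pattern is an assumption; adjust to whatever you are hunting for):
# count ERROR lines per log file and sort by the count (second colon-separated field)
grep -c " ERROR " tarball_PandaJob_*/log.RAWtoALL | sort -t: -k2,2 -rn | head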
Debugging failed jobs¶
Here are some hints on looking at failed jobs.
- There are three different pages for getting an error overview once a task is in "done" status (since by then the "failed" jobs have disappeared):
  - Switch to nodrop mode in the "Task extra info" pull-down menu
  - Or go to the "Error summary" page from the "Task extra info" pull-down menu
  - Or go to the "All (including retries)" entry from the "Show jobs" pull-down menu
For the last option, scroll down to the table "Overall error summary". Then largely ignore anything that is not exe:65 or exe:68; almost everything else is a Grid-infrastructure-related job failure.
exe:65 and exe:68 are Athena crashes. Click on exe:65 in the table and you will get a summary webpage.
From there you can drill down to the log.RAWtoALL logfile and update existing or file new JIRA tickets. If operating the webpages is too tedious, you can also do the same on the full downloaded log dataset:
grep "exit code 65" tarball_PandaJob_*/payload.stdout
That returns something like
tarball_PandaJob_4995851596_CERN/payload.stdout:PyJobTransforms.transform.execute
2021-03-10 11:11:10,546 WARNING Transform now exiting early with exit
code 65 (RAWtoALL got a SIGALRM signal (exit code 142); Logfile error in
log.RAWtoALL: "CoreDumpSvc 0
0 FATAL Caught fatal signal. Printing details to stdout.")
[..]
And then look into tarball_PandaJob_4995851596_CERN/log.RAWtoALL at the stack trace at the very end of the file, or grep for "FATAL" in log.RAWtoALL:
grep "FATAL" tarball_PandaJob_*/log.RAWtoALL
This procedure is fine for up to 30-50 crashes in total. If you have more, use grep to search for stack trace patterns in log.RAWtoALL rather than drilling down into each crash individually, because the differing memory addresses and not-quite-identical stack traces make pattern recognition by eye difficult.
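As a sketch of that kind of pattern search (the exact message format will vary), you can count how often each distinct FATAL message appears across all jobs, which usually makes the repeated failure modes stand out:
# strip the file name prefix that grep adds, then count identical FATAL messages
grep "FATAL" tarball_PandaJob_*/log.RAWtoALL | sed 's/^[^:]*://' | sort | uniq -c | sort -rn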
Geant4 and DAOD production tests¶
If you didn't already do so, try to find a simulation (Geant4) test or a DAOD production test in ART, and then run it locally and on the grid.