CPRun Script¶
CPRun is the primary command-line tool for running CP Algorithm analyses. It provides a unified interface for executing YAML-based configurations in both EventLoop and Athena environments, with automatic backend detection and comprehensive options for controlling analysis execution.
Source: PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/scripts/CPRun.py
This page serves as a comprehensive reference for all CPRun features, command-line arguments, and implementation details. For a hands-on introduction, see the tutorial pages.
If you use a framework layered on top of the CP Algorithms, like TopCPToolkit or Easyjet, it may come with its own script for this purpose. Those scripts serve a similar function in a (slightly) different form, potentially with some extra functionality.
Basic Usage¶
Minimal Example¶
The simplest CPRun invocation requires only an input file and a configuration:
```
CPRun.py -i input.root -t MyAnalysis/config.yaml
```
This will:

- Process all events in `input.root`
- Use the configuration from `MyAnalysis/config.yaml`
- Run all configured systematics
- Produce output as `output.root`
Command-Line Arguments¶
Core Arguments¶
These arguments are available in both EventLoop and Athena modes (from `CPBaseRunner`):
Input/Output:

- `-i, --input-list`: Path to input file(s). Accepts:
    - Single ROOT file: `-i file.root`
    - Text file with list: `-i files.txt`
    - Environment variable: `-i $ASG_TEST_FILE_MC`
- `-t, --text-config`: Path to YAML configuration file. The script searches:
    - Absolute paths
    - Relative to the current directory
    - Package-installed locations (e.g., `PackageName/config.yaml`)
- `-o, --output-name`: Output file name (default: `output`). The `.root` extension is added automatically.
Event Control:

- `-e, --max-events`: Number of events to process (default: `-1` for all events)
- `--skip-n-events`: Skip the first N events (debugging only). In EventLoop, this disables cutbookkeeper algorithms and requires `--direct-driver`.
Systematics:

- `--no-systematics`: Disable systematic variations. Runs only the nominal configuration, significantly reducing processing time and output size.
EventLoop-Specific Arguments¶
Additional arguments available when running in EventLoop mode (from `EventLoopCPRunScript`):
Execution Control:

- `--direct-driver`: Use `DirectDriver` instead of `ExecDriver`. Useful for debugging as it runs in the same process.
- `--work-dir`: Custom work directory for the EventLoop job (default: `workDir`). Pass an explicit path to preserve the working directory after execution.
- `--merge-output-files`: Merge histogram and n-tuple files into a single output file.
Expert Options:

- `--dump-full-config`: Save the complete configuration (algorithms + tools) to `full_config.json`. Requires a `PrintConfiguration` block in the YAML.
- `--run-perf-stat`: Enable `xAOD::PerfStats` for input branch access analysis.
- `--algorithm-timers`: Enable per-algorithm timing measurements.
- `--algorithm-memory-monitoring`: Enable per-algorithm memory monitoring. This is very approximate and really only useful for spotting very large effects.
Athena-Specific Arguments¶
Athena mode (`AthenaCPRunScript`) currently uses only the core arguments. Additional Athena-specific options may be added in future releases.
Input File Handling¶
Single ROOT File¶
Direct path to a ROOT file:

```
CPRun.py -i /path/to/data.root -t config.yaml
```
Text File List¶
Text file with one file path per line. Comments (`#`) and empty lines are ignored:

```
# Run 2 MC samples
/data/mc16a/sample1.root
/data/mc16a/sample2.root

# Additional samples
/data/mc16d/sample3.root
```
Usage:

```
CPRun.py -i sample_list.txt -t config.yaml
```
The script also supports:

- Comma-separated files on a single line
- Directory paths (all `.root` files in the directory are added)
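Put together, the input-handling rules above could be sketched roughly like this. This is an illustrative sketch, not the actual CPRun code, and `expand_input_list` is a made-up name:

```python
import os

def expand_input_list(arg):
    """Illustrative sketch of expanding an input argument into a flat
    list of ROOT files (NOT the actual CPRun implementation)."""
    files = []
    for token in arg.split(","):
        # environment variables like $ASG_TEST_FILE_MC are expanded
        token = os.path.expandvars(token.strip())
        if not token:
            continue
        if os.path.isdir(token):
            # directory: pick up all .root files inside
            files += sorted(
                os.path.join(token, f) for f in os.listdir(token)
                if f.endswith(".root"))
        elif token.endswith(".root"):
            files.append(token)
        else:
            # assume a text file: one path per line, '#' starts a comment,
            # comma-separated paths on one line are allowed
            with open(token) as fh:
                for line in fh:
                    line = line.split("#", 1)[0].strip()
                    if line:
                        files += [p.strip() for p in line.split(",") if p.strip()]
    return files
```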
YAML Configuration Files¶
Configuration File Search¶
CPRun searches for configuration files in this order:

1. Absolute paths: `/full/path/to/config.yaml`
2. Relative paths: `./config.yaml` or `../configs/config.yaml`
3. Package-relative paths: `MyPackage/config.yaml`
For package-relative paths, CPRun searches:

- `$DATAPATH` environment variable entries
- Installed data directories from `atlas_install_data()` in `CMakeLists.txt`
- `AthenaCommon.Utils.unixtools.find_datafile()` as a fallback
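The search order can be sketched as follows. This is a hedged illustration with a made-up function name; the real logic lives in AnalysisAlgorithmsConfig and additionally falls back to `find_datafile()`:

```python
import os

def find_config(path):
    """Illustrative sketch of the configuration search order described
    above (NOT the actual CPRun implementation)."""
    # 1. absolute path, used as-is if it exists
    if os.path.isabs(path) and os.path.isfile(path):
        return path
    # 2. relative to the current working directory
    if os.path.isfile(path):
        return os.path.abspath(path)
    # 3. package-relative: search each entry of $DATAPATH
    #    (the real code also falls back to find_datafile())
    for base in os.environ.get("DATAPATH", "").split(os.pathsep):
        candidate = os.path.join(base, path)
        if base and os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"Configuration file not found: {path}")
```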
Installing Configuration Files¶
For grid usage and package distribution, install YAML files via `CMakeLists.txt`:

```
atlas_install_data( data/*.yaml )
```

Then reference them with package-relative naming:

```
CPRun.py -t MyPackage/analysis_config.yaml
```
Configuration Merging¶
YAML files can include other configuration fragments using the `include` directive. The main configuration file can reference shared configurations:

```yaml
# Main configuration
include:
    - common/object_definitions.yaml
    - common/systematics.yaml

# Analysis-specific settings
Output:
    treeName: my_analysis
```

CPRun automatically merges included fragments. If merging occurs, the combined configuration is saved to `merged_config.yaml` for inspection.
See User Configuration for full details on configuration structure and the include mechanism.
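Conceptually, merging an included fragment amounts to a recursive dictionary merge. The sketch below is illustrative only; the actual merging code in AnalysisAlgorithmsConfig may differ, for instance in how lists and conflicting keys are resolved:

```python
def merge_config(base, fragment):
    """Illustrative deep merge of an included fragment into the main
    configuration (NOT the actual CPRun implementation)."""
    for key, value in fragment.items():
        if isinstance(base.get(key), dict) and isinstance(value, dict):
            # nested blocks are merged recursively
            merge_config(base[key], value)
        else:
            # everything else: the later value wins (an assumption of
            # this sketch, not a statement about CPRun's semantics)
            base[key] = value
    return base
```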
Output Files in EventLoop¶
In Athena, all outputs are written to a single combined file, `output.root`. To get the same behavior in EventLoop, specify the option `--merge-output-files` (which is recommended).

Otherwise EventLoop produces separate output files:

- n-tuple output: `output.root`
- histogram output: `hist-output.root`

Both of these are created in the work directory (`workDir`) and moved out at the end of the job. You can change the name of the work directory with the `--work-dir` option, in which case the files above are not moved out of the work directory.
Extending CPRun¶
Custom run scripts can inherit from `CPBaseRunner` or the backend-specific classes:

```python
from AnalysisAlgorithmsConfig.EventLoopCPRunScript import EventLoopCPRunScript

class MyCustomScript(EventLoopCPRunScript):
    def addCustomArguments(self):
        super().addCustomArguments()
        customGroup = self.parser.add_argument_group('Custom Options')
        customGroup.add_argument('--my-option', help='Custom option')

    def makeAlgSequence(self):
        algSeq = super().makeAlgSequence()
        # Add custom algorithms
        return algSeq
```
Grid Submission with CPGridRun¶
For hands-on introduction, see the Grid Submission Tutorial.
CPGridRun is a wrapper around PanDA's `prun` command that simplifies submitting CPRun jobs to the grid. It handles tarball creation, dataset validation, output naming, and configuration file management.
Source: PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/scripts/CPGridRun.py
Overview¶
Key Features:
- Automatic source code tarball creation with change detection
- Dataset name parsing and validation via AMI
- Automatic output dataset naming from input conventions
- YAML configuration validation for grid usage
- Integration with CPRun.py command strings
- Support for custom analysis scripts
Prerequisites¶
Before using CPGridRun, set up the required tools:
```
# Set up the PanDA client
lsetup panda

# Set up AMI for dataset queries (optional but recommended)
lsetup pyami

# Initialize the grid certificate
voms-proxy-init -voms atlas
```

Ensure your grid certificate is valid (check with `voms-proxy-info`).
Basic Usage¶
Single Dataset:

```
CPGridRun.py -i mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855 \
    --exec "CPRun.py -t MyPackage/config.yaml --no-systematics"
```

Multiple Datasets from File:

Create `datasets.txt`:

```
# ttbar samples
mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855
mc20_13TeV.410471.PhPy8EG_ttbar_hdamp.DAOD_PHYS.e6337_s3681_r13167_p5855

# Single top
mc20_13TeV.410648.PhPy8EG_singletop.DAOD_PHYS.e6337_s3681_r13167_p5855
```

Submit:

```
CPGridRun.py -i datasets.txt \
    --exec "CPRun.py -t MyPackage/config.yaml"
```
Each dataset is submitted as a separate grid job with automatic output naming.
Command-Line Arguments¶
Input/Output File Configuration:

- `-i, --input-list`: Input dataset name(s). Accepts:
    - Single dataset name (ATLAS production format)
    - Text file with dataset list (one per line, `#` for comments)
- `--output-files`: Output file specifications (default: `output.root`). Format: `name.root` creates `name/name.root` in the output dataset.
- `--destSE`: Destination storage element (e.g., `CERN-PROD_SCRATCHDISK`)
- `--mergeType`: Output merging strategy:
    - `Default`: Standard ROOT file merging
    - `xAOD`: Use `xAODMerge` for xAOD format
    - `None`: No merging (separate files per worker)
Naming Configuration:

- `--gridUsername`: Username or group name for output datasets (default: `$USER`)
- `--prefix`: Output directory prefix. If not provided, it is dynamically extracted from the input dataset name.
- `--suffix`: Output directory suffix for versioning (e.g., `v1`, `v2`)
- `--outDS`: Explicit output dataset name. Overrides automatic naming.
CPGrid Configuration:

- `--exec`: CPRun.py command string to execute on the grid (required). Encapsulate it in quotes:

    ```
    --exec "CPRun.py -t config.yaml --no-systematics -e 1000"
    ```

    CPGridRun automatically:

    - sets `--input-list` to `in.txt` (the grid input list)
    - enables `--merge-output-files` by default
    - validates the YAML file for grid usage

- `--groupProduction`: Enable official production mode (restricted to production groups)
Submission Control:

- `-y, --agreeAll`: Auto-confirm all submissions without prompts
- `--noSubmit`: Dry-run mode: print `prun` commands without submitting
- `--testRun`: Submit with limited scope:
    - 10 files per job
    - 300 events per file
    - Automatic test suffix in the output name
- `--checkInputDS`: Validate input datasets in AMI and check for newer versions
- `--recreateTar`: Force tarball recreation even if unchanged
Additional prun Arguments:

Any unrecognized arguments are passed through to `prun`. Example:

```
CPGridRun.py -i datasets.txt \
    --exec "CPRun.py -t config.yaml" \
    --nGBPerJob 8 \
    --site CERN
```
Exec String Formatting¶
CPRun.py Commands (Automatic):
When --exec starts with CPRun.py or -, automatic formatting applies:
# User provides
--exec "CPRun.py -t config.yaml --no-systematics"
# CPGridRun formats to
--exec "CPRun.py -t config.yaml --no-systematics --input-list in.txt --merge-output-files"
Short Form:
# These are equivalent:
--exec "CPRun.py -t config.yaml"
--exec "-t config.yaml" # CPRun.py implied
Custom Scripts:
For non-CPRun scripts, provide complete command:
--exec "myScript.py -i inputs -o output -t config.yaml"
No automatic formatting is applied. Warning is issued on first use.
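The formatting rules described above could be sketched like this. The function name and exact argument order are illustrative assumptions, not the CPGridRun API:

```python
def format_exec(exec_str):
    """Illustrative sketch of the automatic --exec formatting described
    above (NOT the actual CPGridRun implementation)."""
    if exec_str.startswith("-"):
        # short form: CPRun.py is implied
        exec_str = "CPRun.py " + exec_str
    if exec_str.startswith("CPRun.py"):
        # grid jobs read their input files from in.txt, and merged
        # output files are enabled by default
        exec_str += " --input-list in.txt --merge-output-files"
    # custom scripts pass through unchanged
    return exec_str
```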
Output Dataset Naming¶
ATLAS Production Format:

```
Input:  mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855
        {data/mc}.{DSID}.{generator_physics}.{format}.{tags}

Output: user.username.PhPy8EG.410470.DAOD_PHYS.e6337_s3681_r13167_p5855
        {user/group}.{username}.{prefix}.{DSID}.{format}.{tags}.{suffix}
```

User Format:

```
Input:  user.jane.my_analysis.v1
        {user/group}.{username}.{main}.{suffix}

Output: user.username.my_analysis.outputDS.v1
        {user/group}.{username}.{main}.outputDS.{suffix}
```
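The production-format convention can be sketched as below. This is illustrative only; `make_output_ds` is a made-up name, and the real CPGridRun parser handles more cases (user-format inputs, group production, etc.):

```python
def make_output_ds(input_ds, username, prefix=None, suffix=None):
    """Illustrative sketch of output dataset naming for ATLAS
    production-format inputs (NOT the actual CPGridRun code)."""
    # {data/mc}.{DSID}.{generator_physics}.{format}.{tags}
    scope, dsid, physics, fmt, tags = input_ds.split(".", 4)
    if prefix is None:
        # derive the prefix from the generator part of the physics short name
        prefix = physics.split("_", 1)[0]
    # {user/group}.{username}.{prefix}.{DSID}.{format}.{tags}.{suffix}
    parts = ["user", username, prefix, dsid, fmt, tags]
    if suffix:
        parts.append(suffix)
    return ".".join(parts)
```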
Manual Override:

You can manually override the output dataset name via `--outDS`. No pattern substitution is applied, so this is not suitable for submitting multiple datasets at once.
Tarball Management¶
CPGridRun automatically creates a source code tarball (`cpgrid.tar.gz`), which is reused for subsequent submissions to speed them up. The tarball is recreated if:

- the tarball doesn't exist
- any file in `build`/`source` is newer than the tarball
- the `--recreateTar` flag is used
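The change-detection logic amounts to a modification-time comparison, roughly as sketched below (an assumed illustration, not the CPGridRun code; the function name is made up):

```python
import os

def needs_new_tarball(tarball, source_dirs, force=False):
    """Illustrative sketch of tarball change detection: rebuild when the
    tarball is missing, any tracked file is newer, or rebuild is forced."""
    if force or not os.path.isfile(tarball):
        return True
    tar_mtime = os.path.getmtime(tarball)
    for top in source_dirs:
        for root, _dirs, files in os.walk(top):
            for name in files:
                if os.path.getmtime(os.path.join(root, name)) > tar_mtime:
                    return True
    return False
```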
Dataset Validation with AMI¶
Use `--checkInputDS` to validate datasets before submission:

```
CPGridRun.py -i datasets.txt \
    --exec "CPRun.py -t config.yaml" \
    --checkInputDS
```

Validation checks:

- Dataset existence: Confirms the dataset is in AMI
- Newer versions: Detects if a newer `ptag` exists
- Availability: Warns about missing datasets
Example output:

```
INFO: Newer version of datasets found in AMI:
    mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855 -> ptag: p5999
ERROR: Some input datasets are not available in AMI:
    mc20_13TeV.999999.InvalidDS.DAOD_PHYS.e0000_s0000_r0000_p0000
```
Notes from the Developers¶
Originally we left it to individual analyzers and frameworks to provide their own run scripts. We decided to provide one for several reasons:
- The way the PrunDriver/GridDriver in EventLoop work conflicts with how we configure based on in-file meta-data: to configure the CP Algorithms correctly you have to open the first file and read its meta-data, but PrunDriver/GridDriver want to configure the algorithms before job submission, i.e. before any files are opened.
- For most users there is probably more value in having a standardized script with a lot of standard options than in writing their own script from scratch. Most of the analysis-specific customization lives in the YAML file, and the run script does roughly the same task for most users.
- Not having to teach new students how to write their own run script cuts out a section from the tutorial and that time can be used for other topics.
- Having a common run script makes it a lot easier to build CP Algorithms based unit tests, meaning we can run more configurations and test more options in our test suite. And all those tests will be dual-use. Before we had this script we mostly relied on a single full sequence test, and adding any additional test ended up quite painful.
- Having a run script is actually a fairly high level interface for running jobs, meaning it is likely to be even more stable than the python interfaces for composing jobs and it requires users to learn quite a bit less to use it effectively.