CPRun Script¶
CPRun is the primary command-line tool for running CP Algorithm analyses. It provides a unified interface for executing YAML-based configurations in both EventLoop and Athena environments, with automatic backend detection and comprehensive options for controlling analysis execution.
Source: PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/scripts/CPRun.py
This page serves as a comprehensive reference for all CPRun features, command-line arguments, and implementation details. For a hands-on introduction, see the tutorial pages.
If you use a framework layered on top of the CP Algorithms, like TopCPToolkit or Easyjet, it may come with its own script for this purpose. Those scripts serve a similar function in a (slightly) different form, potentially with some extra functionality.
Basic Usage¶
Minimal Example¶
The simplest CPRun invocation requires only an input file and a configuration:
```
CPRun.py -i input.root -t MyAnalysis/config.yaml
```
This will:

- Process all events in `input.root`
- Use the configuration from `MyAnalysis/config.yaml`
- Run all configured systematics
- Produce output as `output.root`
Command-Line Arguments¶
Core Arguments¶
These arguments are available in both EventLoop and Athena modes (from `CPBaseRunner`):
Input/Output:

- `-i, --input-list`: Path to input file(s). Accepts:
    - Single ROOT file: `-i file.root`
    - Text file with list: `-i files.txt`
    - Environment variable: `-i $ASG_TEST_FILE_MC`
- `-t, --text-config`: Path to YAML configuration file. The script searches:
    - Absolute paths
    - Relative to the current directory
    - Package-installed locations (e.g., `PackageName/config.yaml`)
- `-o, --output-name`: Output file name (default: `output`). The `.root` extension is added automatically.
Event Control:

- `-e, --max-events`: Number of events to process (default: `-1` for all events)
- `--skip-n-events`: Skip the first N events (debugging only). In EventLoop, this disables cutbookkeeper algorithms and requires `--direct-driver`.
Systematics:

- `--no-systematics`: Disable systematic variations. Runs only the nominal configuration, significantly reducing processing time and output size.
EventLoop-Specific Arguments¶
Additional arguments available when running in EventLoop mode (from `EventLoopCPRunScript`):
Execution Control:

- `--direct-driver`: Use `DirectDriver` instead of `ExecDriver`. Useful for debugging as it runs in the same process.
- `--work-dir`: Custom work directory for the EventLoop job (default: `workDir`). Pass an explicit path to preserve the working directory after execution.
- `--merge-output-files`: Merge histogram and n-tuple files into a single output file.
Expert Options:

- `--dump-full-config`: Save the complete configuration (algorithms + tools) to `full_config.json`. Requires a `PrintConfiguration` block in the YAML.
- `--run-perf-stat`: Enable `xAOD::PerfStats` for input branch access analysis.
- `--algorithm-timers`: Enable per-algorithm timing measurements.
- `--algorithm-memory-monitoring`: Enable per-algorithm memory monitoring. This is very approximate and really only useful for spotting very large effects.
Athena-Specific Arguments¶
Athena mode (`AthenaCPRunScript`) currently uses only the core arguments. Additional Athena-specific options may be added in future releases.
Input File Handling¶
Single ROOT File¶
Direct path to a ROOT file:

```
CPRun.py -i /path/to/data.root -t config.yaml
```
Text File List¶
Text file with one file path per line. Comments (`#`) and empty lines are ignored:

```
# Run 2 MC samples
/data/mc16a/sample1.root
/data/mc16a/sample2.root

# Additional samples
/data/mc16d/sample3.root
```
Usage:

```
CPRun.py -i sample_list.txt -t config.yaml
```
The script also supports:

- Comma-separated files on a single line
- Directory paths (all `.root` files in the directory are added)
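Put together, the input-handling rules above could be sketched roughly like this. This is an illustrative sketch, not the actual CPRun code, and `expand_input_list` is a made-up name:

```python
import os

def expand_input_list(arg):
    """Illustrative sketch of expanding an input argument into a flat
    list of ROOT files (NOT the actual CPRun implementation)."""
    files = []
    for token in arg.split(","):
        # environment variables like $ASG_TEST_FILE_MC are expanded
        token = os.path.expandvars(token.strip())
        if not token:
            continue
        if os.path.isdir(token):
            # directory: pick up all .root files inside
            files += sorted(
                os.path.join(token, f) for f in os.listdir(token)
                if f.endswith(".root"))
        elif token.endswith(".root"):
            files.append(token)
        else:
            # assume a text file: one path per line, '#' starts a comment,
            # comma-separated paths on one line are allowed
            with open(token) as fh:
                for line in fh:
                    line = line.split("#", 1)[0].strip()
                    if line:
                        files += [p.strip() for p in line.split(",") if p.strip()]
    return files
```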
YAML Configuration Files¶
Configuration File Search¶
CPRun searches for configuration files in this order:

1. Absolute paths: `/full/path/to/config.yaml`
2. Relative paths: `./config.yaml` or `../configs/config.yaml`
3. Package-relative paths: `MyPackage/config.yaml`
For package-relative paths, CPRun searches:

- `$DATAPATH` environment variable entries
- Installed data directories from `atlas_install_data()` in `CMakeLists.txt`
- `AthenaCommon.Utils.unixtools.find_datafile()` as a fallback
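The search order can be sketched as follows. This is a hedged illustration with a made-up function name; the real logic lives in AnalysisAlgorithmsConfig and additionally falls back to `find_datafile()`:

```python
import os

def find_config(path):
    """Illustrative sketch of the configuration search order described
    above (NOT the actual CPRun implementation)."""
    # 1. absolute path, used as-is if it exists
    if os.path.isabs(path) and os.path.isfile(path):
        return path
    # 2. relative to the current working directory
    if os.path.isfile(path):
        return os.path.abspath(path)
    # 3. package-relative: search each entry of $DATAPATH
    #    (the real code also falls back to find_datafile())
    for base in os.environ.get("DATAPATH", "").split(os.pathsep):
        candidate = os.path.join(base, path)
        if base and os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"Configuration file not found: {path}")
```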
Installing Configuration Files¶
For grid usage and package distribution, install YAML files via `CMakeLists.txt`:

```
atlas_install_data( data/*.yaml )
```

Then reference them with package-relative naming:

```
CPRun.py -t MyPackage/analysis_config.yaml
```
Configuration Merging¶
YAML files can include other configuration fragments using the `include` directive. The main configuration file can reference shared configurations:

```yaml
# Main configuration
include:
    - common/object_definitions.yaml
    - common/systematics.yaml

# Analysis-specific settings
Output:
    treeName: my_analysis
```

CPRun automatically merges included fragments. If merging occurs, the combined configuration is saved to `merged_config.yaml` for inspection.
See User Configuration for full details on configuration structure and the include mechanism.
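Conceptually, merging an included fragment amounts to a recursive dictionary merge. The sketch below is illustrative only; the actual merging code in AnalysisAlgorithmsConfig may differ, for instance in how lists and conflicting keys are resolved:

```python
def merge_config(base, fragment):
    """Illustrative deep merge of an included fragment into the main
    configuration (NOT the actual CPRun implementation)."""
    for key, value in fragment.items():
        if isinstance(base.get(key), dict) and isinstance(value, dict):
            # nested blocks are merged recursively
            merge_config(base[key], value)
        else:
            # everything else: the later value wins (an assumption of
            # this sketch, not a statement about CPRun's semantics)
            base[key] = value
    return base
```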
Output Files in EventLoop¶
In Athena, all outputs are written to a single combined file, `output.root`. To get the same behavior in EventLoop, specify the option `--merge-output-files` (which is recommended).

Otherwise EventLoop produces separate output files:

- n-tuple output: `output.root`
- histogram output: `hist-output.root`

Both of these are created in the work directory (`workDir`) and moved out at the end of the job. You can change the name of the work directory with the `--work-dir` option, in which case the files above are not moved out of the work directory.
Extending CPRun¶
Custom run scripts can inherit from `CPBaseRunner` or the backend-specific classes:

```python
from AnalysisAlgorithmsConfig.EventLoopCPRunScript import EventLoopCPRunScript

class MyCustomScript(EventLoopCPRunScript):
    def addCustomArguments(self):
        super().addCustomArguments()
        customGroup = self.parser.add_argument_group('Custom Options')
        customGroup.add_argument('--my-option', help='Custom option')

    def makeAlgSequence(self):
        algSeq = super().makeAlgSequence()
        # Add custom algorithms
        return algSeq
```
Grid Submission with CPGridRun¶
For hands-on introduction, see the Grid Submission Tutorial.
CPGridRun is a wrapper around PanDA's `prun` command that simplifies submitting CPRun jobs to the grid. It handles tarball creation, dataset validation, output naming, and configuration file management.
Source: PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/scripts/CPGridRun.py
Overview¶
Key Features:
- Automatic source code tarball creation with change detection
- Dataset name parsing and validation via AMI
- Automatic output dataset naming from input conventions
- YAML configuration validation for grid usage
- Integration with CPRun.py command strings
- Support for custom analysis scripts
Prerequisites¶
Before using CPGridRun, set up the required tools:
```
# Set up the PanDA client
lsetup panda

# Set up AMI for dataset queries (optional but recommended)
lsetup pyami

# Initialize the grid certificate
voms-proxy-init -voms atlas
```

Ensure your grid certificate is valid (check with `voms-proxy-info`).
Basic Usage¶
Single Dataset:

```
CPGridRun.py -i mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855 \
    --exec "CPRun.py -t MyPackage/config.yaml --no-systematics"
```

Multiple Datasets from File:

Create `datasets.txt`:

```
# ttbar samples
mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855
mc20_13TeV.410471.PhPy8EG_ttbar_hdamp.DAOD_PHYS.e6337_s3681_r13167_p5855

# Single top
mc20_13TeV.410648.PhPy8EG_singletop.DAOD_PHYS.e6337_s3681_r13167_p5855
```

Submit:

```
CPGridRun.py -i datasets.txt \
    --exec "CPRun.py -t MyPackage/config.yaml"
```
Each dataset is submitted as a separate grid job with automatic output naming.
Command-Line Arguments¶
Input/Output File Configuration:

- `-i, --input-list`: Input dataset name(s). Accepts:
    - Single dataset name (ATLAS production format)
    - Text file with dataset list (one per line, `#` for comments)
- `--output-files`: Output file specifications (default: `output.root`). Format: `name.root` creates `name/name.root` in the output dataset.
- `--destSE`: Destination storage element (e.g., `CERN-PROD_SCRATCHDISK`)
- `--mergeType`: Output merging strategy:
    - `Default`: Standard ROOT file merging
    - `xAOD`: Use `xAODMerge` for xAOD format
    - `None`: No merging (separate files per worker)
Naming Configuration:

- `--gridUsername`: Username or group name for output datasets (default: `$USER`)
- `--prefix`: Output directory prefix. If not provided, it is dynamically extracted from the input dataset name.
- `--suffix`: Output directory suffix for versioning (e.g., `v1`, `v2`)
- `--outDS`: Explicit output dataset name. Overrides automatic naming.
CPGrid Configuration:

- `--exec`: CPRun.py command string to execute on the grid (required). Encapsulate it in quotes:

    ```
    --exec "CPRun.py -t config.yaml --no-systematics -e 1000"
    ```

    CPGridRun automatically:

    - sets `--input-list` to `in.txt` (the grid input list)
    - enables `--merge-output-files` by default
    - validates the YAML file for grid usage

- `--groupProduction`: Enable official production mode (restricted to production groups)
Submission Control:

- `-y, --agreeAll`: Auto-confirm all submissions without prompts
- `--noSubmit`: Dry-run mode: print `prun` commands without submitting
- `--testRun`: Submit with limited scope:
    - 10 files per job
    - 300 events per file
    - Automatic test suffix in the output name
- `--checkInputDS`: Validate input datasets in AMI and check for newer versions
- `--recreateTar`: Force tarball recreation even if unchanged
Additional prun Arguments:

Any unrecognized arguments are passed through to `prun`. Example:

```
CPGridRun.py -i datasets.txt \
    --exec "CPRun.py -t config.yaml" \
    --nGBPerJob 8 \
    --site CERN
```
Exec String Formatting¶
CPRun.py Commands (Automatic):
When --exec starts with CPRun.py or -, automatic formatting applies:
# User provides
--exec "CPRun.py -t config.yaml --no-systematics"
# CPGridRun formats to
--exec "CPRun.py -t config.yaml --no-systematics --input-list in.txt --merge-output-files"
Short Form:
# These are equivalent:
--exec "CPRun.py -t config.yaml"
--exec "-t config.yaml" # CPRun.py implied
Custom Scripts:
For non-CPRun scripts, provide complete command:
--exec "myScript.py -i inputs -o output -t config.yaml"
No automatic formatting is applied. Warning is issued on first use.
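The formatting rules described above could be sketched like this. The function name and exact argument order are illustrative assumptions, not the CPGridRun API:

```python
def format_exec(exec_str):
    """Illustrative sketch of the automatic --exec formatting described
    above (NOT the actual CPGridRun implementation)."""
    if exec_str.startswith("-"):
        # short form: CPRun.py is implied
        exec_str = "CPRun.py " + exec_str
    if exec_str.startswith("CPRun.py"):
        # grid jobs read their input files from in.txt, and merged
        # output files are enabled by default
        exec_str += " --input-list in.txt --merge-output-files"
    # custom scripts pass through unchanged
    return exec_str
```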
Output Dataset Naming¶
ATLAS Production Format:

```
Input:  mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855
        {data/mc}.{DSID}.{generator_physics}.{format}.{tags}

Output: user.username.PhPy8EG.410470.DAOD_PHYS.e6337_s3681_r13167_p5855
        {user/group}.{username}.{prefix}.{DSID}.{format}.{tags}.{suffix}
```

User Format:

```
Input:  user.jane.my_analysis.v1
        {user/group}.{username}.{main}.{suffix}

Output: user.username.my_analysis.outputDS.v1
        {user/group}.{username}.{main}.outputDS.{suffix}
```
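The production-format convention can be sketched as below. This is illustrative only; `make_output_ds` is a made-up name, and the real CPGridRun parser handles more cases (user-format inputs, group production, etc.):

```python
def make_output_ds(input_ds, username, prefix=None, suffix=None):
    """Illustrative sketch of output dataset naming for ATLAS
    production-format inputs (NOT the actual CPGridRun code)."""
    # {data/mc}.{DSID}.{generator_physics}.{format}.{tags}
    scope, dsid, physics, fmt, tags = input_ds.split(".", 4)
    if prefix is None:
        # derive the prefix from the generator part of the physics short name
        prefix = physics.split("_", 1)[0]
    # {user/group}.{username}.{prefix}.{DSID}.{format}.{tags}.{suffix}
    parts = ["user", username, prefix, dsid, fmt, tags]
    if suffix:
        parts.append(suffix)
    return ".".join(parts)
```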
Manual Override:

You can manually override the output dataset name via `--outDS`. No pattern substitution is applied, so this is not suitable for submitting multiple datasets at once.
Tarball Management¶
CPGridRun automatically creates a source code tarball (`cpgrid.tar.gz`), which is reused for subsequent submissions to speed them up. The tarball is recreated if:

- the tarball doesn't exist
- any file in `build`/`source` is newer than the tarball
- the `--recreateTar` flag is used
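The change-detection logic amounts to a modification-time comparison, roughly as sketched below (an assumed illustration, not the CPGridRun code; the function name is made up):

```python
import os

def needs_new_tarball(tarball, source_dirs, force=False):
    """Illustrative sketch of tarball change detection: rebuild when the
    tarball is missing, any tracked file is newer, or rebuild is forced."""
    if force or not os.path.isfile(tarball):
        return True
    tar_mtime = os.path.getmtime(tarball)
    for top in source_dirs:
        for root, _dirs, files in os.walk(top):
            for name in files:
                if os.path.getmtime(os.path.join(root, name)) > tar_mtime:
                    return True
    return False
```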
Dataset Validation with AMI¶
Use `--checkInputDS` to validate datasets before submission:

```
CPGridRun.py -i datasets.txt \
    --exec "CPRun.py -t config.yaml" \
    --checkInputDS
```

Validation checks:

- Dataset existence: Confirms the dataset is in AMI
- Newer versions: Detects if a newer `ptag` exists
- Availability: Warns about missing datasets
Example output:

```
INFO: Newer version of datasets found in AMI:
    mc20_13TeV.410470.PhPy8EG_ttbar.DAOD_PHYS.e6337_s3681_r13167_p5855 -> ptag: p5999
ERROR: Some input datasets are not available in AMI:
    mc20_13TeV.999999.InvalidDS.DAOD_PHYS.e0000_s0000_r0000_p0000
```
Notes from the Developers¶
Originally we left it to individual analyzers and frameworks to provide their own run scripts. We decided to provide one for several reasons:
- The way the PrunDriver/GridDriver in EventLoop work conflicts with how we configure based on in-file meta-data: to configure the CP Algorithms correctly you have to open the first file and read its meta-data, but PrunDriver/GridDriver want to configure the algorithms before job submission, i.e. before any files are opened.
- For most users there is probably more value in having a standardized script with a lot of standard options than in writing their own script from scratch. Most of the analysis-specific customization lives in the YAML file, and the run script does roughly the same task for most users.
- Not having to teach new students how to write their own run script cuts out a section from the tutorial and that time can be used for other topics.
- Having a common run script makes it a lot easier to build CP Algorithms based unit tests, meaning we can run more configurations and test more options in our test suite. And all those tests will be dual-use. Before we had this script we mostly relied on a single full sequence test, and adding any additional test ended up quite painful.
- Having a run script is actually a fairly high level interface for running jobs, meaning it is likely to be even more stable than the python interfaces for composing jobs and it requires users to learn quite a bit less to use it effectively.