# Columnar Infrastructure
This section introduces the columnar infrastructure that ATLAS built to support the use of CP tools in columnar analysis environments.
The term "columnar" can refer to several things; for our purposes it means that we want to support running in these two common analysis environments (sketched in code after this list):
- uproot/awkward: This provides a NumPy-like interface to event data. In many cases it gets combined with the coffea framework to add schema support.
- RDataFrame: This is part of ROOT and supports a more declarative style of analysis, with extensive support for schemas, systematics, etc.
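As a rough illustration of the two styles, the sketch below computes a per-event leading-electron pT in each environment. The file, tree, and branch names (`ntuple.root`, `analysis`, `el_pt`) are placeholders, not actual PHYSLITE content:

```python
import uproot
import awkward as ak
import ROOT

# uproot/awkward: event data as jagged arrays with a NumPy-like interface
with uproot.open("ntuple.root") as f:
    el_pt = f["analysis"]["el_pt"].array()  # one sub-list of pt values per event
leading_pt = ak.max(el_pt, axis=1)          # per-event maximum (None if no electrons)

# RDataFrame: the same quantity in a declarative style
df = ROOT.RDataFrame("analysis", "ntuple.root")
hist = (
    df.Filter("el_pt.size() > 0")                        # keep events with electrons
      .Define("leading_pt", "ROOT::VecOps::Max(el_pt)")  # per-event maximum
      .Histo1D("leading_pt")
)
```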
## Motivation
As we move to Run 4, we will run up against the limits of available disk space and will no longer be able to support the Run 2/3 n-tuple strategy. There are several things we can do to reduce the needed disk space:
- calculate systematics, scale factors and other variables on the fly, instead of reading them from disk
- share n-tuples between analyzers and analyses, instead of each having their own set of n-tuples
- forgo n-tuples entirely and do analysis directly on PHYSLITE, making it a widely shared data format for analyzers
This has implications for how we distribute and apply CP recommendations. For Run 2/3 the model was to apply the CP Tools/Algorithms in the n-tuple creation step and store their output in the n-tuple. For Run 4 we will need to either incorporate them directly into PHYSLITE (which already generally happens where easily possible) or update them to run in the common analysis environments.
In general we will need columnar versions of tools if they:
- provide systematics
- calculate variables not present in PHYSLITE, e.g. scale factors, MET, overlap removal (OR)
- need analysis-specific configuration, e.g. muon calibration mode, MET, OR
- take the outputs of other columnar tools as inputs, e.g. MET, OR
This is the minimal set of tools we will need, but based on past experience it seems likely we will end up with more columnar tools than that. Once the basic dual-use infrastructure and tools were in place, developers started converting more tools to dual-use to make them easier to use.
## Implications
In general we try to keep the required changes to tools minimal. This shouldn't prevent CP developers from making bigger changes, but only a small set of changes should be mandatory.
One big implication is that we need tools to run a lot faster. A full n-tuple job typically runs at tens to hundreds of Hz; for an end-user analysis job, ones to tens of kHz would be preferable. A lot of this speedup comes for free with the change to columnar processing, but if a tool does something computationally expensive it can suddenly become the bottleneck.
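Much of that free speedup comes from replacing a Python-level loop over entries with a single operation on a whole array. A toy sketch of the difference, using plain NumPy and a hypothetical calibration that just applies a constant scale factor:

```python
import timeit
import numpy as np

pt = np.random.default_rng(1).exponential(scale=30.0, size=1_000_000)

def calibrate_looped(values, scale=1.02):
    # per-entry style: one Python-level operation per value
    out = np.empty_like(values)
    for i, v in enumerate(values):
        out[i] = v * scale
    return out

def calibrate_columnar(values, scale=1.02):
    # columnar style: one vectorized operation over the whole array
    return values * scale

print("looped:   %.3f s" % timeit.timeit(lambda: calibrate_looped(pt), number=1))
print("columnar: %.3f s" % timeit.timeit(lambda: calibrate_columnar(pt), number=1))
```

Timing the two shows the vectorized version is typically orders of magnitude faster; per-object work that cannot be expressed this way gets no such boost.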
Another big implication is that we need to keep the size of the input data down. The obvious aspect is that we are trying to save disk space: while we are currently within the allotted budget, further savings would be helpful. The less obvious aspect is that our jobs are very I/O heavy, so the less data we read from PHYSLITE, the less load we put on analysis facility resources.
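On the reading side, the main lever is to touch only the branches a job actually needs. In uproot, for example, the `filter_name` argument restricts which branches are read and decompressed; the file and branch names below are again placeholders:

```python
import uproot

with uproot.open("physlite.root") as f:
    tree = f["CollectionTree"]
    # only the listed branches are read and decompressed;
    # everything else in the file is never touched
    arrays = tree.arrays(filter_name=["el_pt", "el_eta"])
```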
There are also many smaller implications for the implementation of a tool, ranging from minor syntax updates to some things being outright impossible. These will be noted in the documentation where the relevant aspect is discussed.