Array Mode Implementation

The columnar infrastructure can be switched between different environments by selecting the appropriate columnar mode for the environment. For running in columnar environments, i.e. uproot and (in the future) RDataFrame, we provide the array mode. The goal of this mode is to translate the interface presented to the tool into an underlying implementation that is as close to native columnar code as possible, ideally giving the compiler as much opportunity to optimize both tool and infrastructure code as it would have for native columnar code.

There are some limitations to how "columnar" we can make our actual tool code, given that it was written for a very different paradigm (i.e. xAODs) and that we need to maintain compatibility with xAOD mode. The biggest difference between the two environments is that a columnar kernel generally gets all needed columns passed in (a push model, effectively modeling a function call), while xAOD tools and algorithms will generally retrieve all decorations from the xAOD objects they use (a pull model). This is further complicated by most tools having some decorations/columns they only use in some configurations, which is easy to handle in a pull model but a lot harder in a push model.
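The contrast between the two models can be sketched as follows. This is only an illustration: `PullObject`, `scaledPtPull`, `scaledPtPush`, and the `pt`/`scale` columns are hypothetical names, not part of the xAOD or columnar interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Pull model: the object carries its decorations and the tool asks for
// whatever it needs, when it needs it (hypothetical stand-in for an xAOD object).
struct PullObject {
  std::map<std::string, float> decorations;
  float get(const std::string& name) const { return decorations.at(name); }
};

// The optional decoration is only looked up if the configuration uses it.
inline float scaledPtPull(const PullObject& obj, bool useScale) {
  float pt = obj.get("pt");
  if (useScale) pt *= obj.get("scale");
  return pt;
}

// Push model: the caller has to hand over every column up front, so the
// optional column shows up in the signature (here as a pointer that may be null).
inline std::vector<float> scaledPtPush(const std::vector<float>& pt,
                                       const std::vector<float>* scale) {
  assert(!scale || scale->size() == pt.size());
  std::vector<float> out(pt.size());
  for (std::size_t i = 0; i < pt.size(); ++i)
    out[i] = scale ? pt[i] * (*scale)[i] : pt[i];
  return out;
}
```

In the pull version the caller never needs to know whether `scale` is used. In the push version every optional column leaks into the call signature, which is what makes configuration-dependent columns awkward.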

The solution to this push-pull conundrum is to create a data area into which all columns get loaded and from which columnar tools can retrieve them. In practice this is a simple array of pointers to columns, and each column is identified by its index in the array. This requires that, as the tool gets set up, each column is assigned its own index and this index is propagated to all the columnar accessors. On the upside, this provides a very low-overhead lookup for each tool as well as the option to chain tools very efficiently (the latter is currently unused).
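A minimal sketch of such a data area, assuming untyped pointers and a typed accessor that stores only its index. The names `DataArea`, `ColumnAccessor`, and `scalePt` are hypothetical and do not reflect the actual class names in the infrastructure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical data area: one untyped pointer per column, addressed by index.
struct DataArea {
  std::vector<void*> columns;
};

// Hypothetical accessor: it only remembers the index assigned at setup time.
template <typename T>
class ColumnAccessor {
 public:
  void setIndex(std::size_t index) { m_index = index; }
  std::size_t index() const { return m_index; }

  // The lookup is a single array read plus a cast.
  T* data(const DataArea& area) const {
    assert(m_index < area.columns.size());
    return static_cast<T*>(area.columns[m_index]);
  }

 private:
  std::size_t m_index = 0;
};

// Example kernel: reads one column and writes a scaled copy into another.
inline void scalePt(const DataArea& area, const ColumnAccessor<float>& in,
                    const ColumnAccessor<float>& out, std::size_t n,
                    float factor) {
  const float* src = in.data(area);
  float* dst = out.data(area);
  for (std::size_t i = 0; i < n; ++i) dst[i] = factor * src[i];
}
```

Because the tool only sees indices, a second tool could in principle read the output column of the first one from the same data area, which is one way to picture the chaining mentioned above.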

To facilitate this "reconfiguration" of columns at setup time, all of the columns are connected to their tool, which then allows setting the index as well as some other operations. As an additional upside, this setup mechanism also gives an exact accounting of all data dependencies, which is needed for integrating with the columnar environment.
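One way to picture this connection is an accessor that registers itself with its tool on construction, so the tool can later list its columns and assign indices. The sketch below is an assumption about the general shape, not the actual implementation; `ToolBase`, `AccessorBase`, `ExampleTool`, and their methods are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class ToolBase;

// Hypothetical accessor base: registers itself with the owning tool.
class AccessorBase {
 public:
  AccessorBase(ToolBase& tool, std::string name);
  AccessorBase(const AccessorBase&) = delete;  // the tool keeps a pointer to us
  AccessorBase& operator=(const AccessorBase&) = delete;

  const std::string& name() const { return m_name; }
  void setIndex(std::size_t index) { m_index = index; m_assigned = true; }
  std::size_t index() const {
    assert(m_assigned);  // only valid once setup has run
    return m_index;
  }

 private:
  std::string m_name;
  std::size_t m_index = 0;
  bool m_assigned = false;
};

class ToolBase {
 public:
  void registerAccessor(AccessorBase& acc) { m_accessors.push_back(&acc); }

  // Unique column names in registration order: the tool's data dependencies.
  std::vector<std::string> columnNames() const {
    std::vector<std::string> names;
    for (const AccessorBase* acc : m_accessors)
      if (std::find(names.begin(), names.end(), acc->name()) == names.end())
        names.push_back(acc->name());
    return names;
  }

  // Assign one index per column name; accessors sharing a name share an index.
  // Returns the number of distinct columns.
  std::size_t assignIndices() {
    std::unordered_map<std::string, std::size_t> indexOf;
    for (AccessorBase* acc : m_accessors) {
      auto it = indexOf.emplace(acc->name(), indexOf.size()).first;
      acc->setIndex(it->second);
    }
    return indexOf.size();
  }

 private:
  std::vector<AccessorBase*> m_accessors;
};

inline AccessorBase::AccessorBase(ToolBase& tool, std::string name)
    : m_name(std::move(name)) {
  tool.registerAccessor(*this);
}

// Hypothetical tool with three accessors, two of which refer to the same column.
struct ExampleTool : ToolBase {
  AccessorBase pt{*this, "pt"};
  AccessorBase eta{*this, "eta"};
  AccessorBase ptAgain{*this, "pt"};
};
```

In this sketch `columnNames()` plays the role of the dependency accounting: the environment can ask the tool which columns to load before any data is processed.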

Another under-appreciated complication is that, except for simple types, the columns/decorations have different memory layouts in different columnar modes. Even between uproot and RDataFrame there is a different layout for array data (nested vectors vs offset maps). There are a number of ways this can be addressed, but a safe fallback solution is to have a separate columnar mode for each environment.
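To make the layout difference concrete, the sketch below shows the same per-entry array data in a nested-vector layout and in a flat-values-plus-offsets layout. The types `Nested` and `Jagged` and the helper functions are hypothetical, and the example makes no claim about which environment uses which layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Layout A: one inner vector per entry.
using Nested = std::vector<std::vector<float>>;

// Layout B: all values in one flat buffer, plus offsets marking entry bounds.
// Entry i spans values[offsets[i]] up to (not including) values[offsets[i + 1]].
struct Jagged {
  std::vector<float> values;
  std::vector<std::size_t> offsets;  // size is number of entries + 1
};

// Convert layout A into layout B.
inline Jagged toJagged(const Nested& nested) {
  Jagged out;
  out.offsets.push_back(0);
  for (const auto& entry : nested) {
    out.values.insert(out.values.end(), entry.begin(), entry.end());
    out.offsets.push_back(out.values.size());
  }
  return out;
}

// The same operation needs different code for each layout.
inline float sumEntry(const Nested& col, std::size_t i) {
  float sum = 0.f;
  for (float v : col.at(i)) sum += v;
  return sum;
}

inline float sumEntry(const Jagged& col, std::size_t i) {
  assert(i + 1 < col.offsets.size());
  float sum = 0.f;
  for (std::size_t j = col.offsets[i]; j < col.offsets[i + 1]; ++j)
    sum += col.values[j];
  return sum;
}
```

Even for this trivial sum the access code differs between the layouts, which illustrates why a separate columnar mode per environment is a safe fallback.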