Offset Maps and Array Handling

The handling and layout of (nested) arrays in uproot is fundamentally different from how they are handled and laid out in ROOT proper. In ROOT, if you have e.g. an std::vector<std::vector<float>>, then that is exactly what you get: an std::vector in which each element is in turn an std::vector. In uproot this is instead represented by one column of type float, which contains all the actual float values back to back, and a second column of an IndexType, which contains the starting point of each sub-vector in the main column (plus the total number of entries as a last entry).

So to get the sub-vector at position index in ROOT mode, the underlying code would do:

std::span<const float> subvector = branchData[index];

For columnar code on the other hand this would be:

std::span<const float> subvector {mainColumn + indexColumn[index], mainColumn + indexColumn[index+1]};

Note that in both cases we use an std::span. This is because the std::vector object exists only in ROOT mode, and also because not all operations defined on an std::vector may be available for general columnar data. For nested vectors you will get objects that behave like an std::span, but have a custom implementation that encapsulates the above logic.

For deeper nested vectors (e.g. std::vector<std::vector<std::vector<float>>>) extra offset maps will be used, i.e. there will be an inner and an outer offset map, with the outer offset map pointing into the inner offset map and from that (with a second index) into the actual data column. In principle any level of nesting can be supported, but in practice each level needs its own custom implementation.

Implicit Offset Maps and Container Definitions

In practice even a simple variable like Jet.pt will already involve an std::vector<float> underneath, as it needs to store one pt value per jet. This offset map is shared between all jet variables, as they should all have the same number of jets (hopefully).

This offset map is not accessed in the column accessor itself, but instead is accessed via a dedicated ObjectColumn accessor. This is then used for looking up the ObjectRange for a given EventId, and the ObjectId is then defined within the ObjectRange. As such the ObjectId is already guaranteed to point to a valid entry in the outermost column the column accessor uses.

This offset map is also used to identify a given container, i.e. all columns belonging to the same container share the same offset map and the assumption is that all columns sharing an offset map belong to the same container. This is also effectively how shallow copies work in columnar mode: Any columns that would be defined on the shallow copy will share the same offset map, so they can be treated as if they are on the same container.

Since in columnar mode a single call can also process multiple events, the number of events (and with that the size of the ObjectColumn columns) also needs to be passed in. The convention is to pass it in as a separate offset column of length 2 named EventInfo (which is also the offset map for EventInfo).

Output Vector Columns

While in principle it would be possible to support vector columns for output, it is technically difficult and there would have to be a compelling use case to support it.

The main issue is that in the general case this means that we will have to allocate memory for the output columns in our code, whereas all existing columns can be preallocated on the user side, meaning we can leave memory management completely out of the columnar infrastructure. This could be worked around if really needed, but it would come with a lot of implications.

Technical Notes

The index type is currently ColumnarOffsetType which is centrally defined. This should probably be changed to a type that is defined as part of the columnar mode itself, and is solely used as type for offset maps.

In RDataFrame the vectors are generally accessible via RVector, i.e. not using offset maps. This is one of the reasons why RDataFrame will likely need to have its own columnar mode that provides the correct behavior.

In performance terms the offset maps are likely slightly more efficient than the nested vectors, particularly in tight loops. Creating them will generally involve fewer memory operations than for nested vectors. And since the values are all contiguous in memory, the chance of cache misses is reduced.