Offset Maps and Array Handling
The handling and layout of (nested) arrays in uproot is fundamentally
different from how they are handled and laid out in ROOT proper. In ROOT,
if you have e.g. an std::vector<std::vector<float>> then that is
exactly what you are getting: an std::vector in which each element is in
turn an std::vector. In uproot this would instead be represented by one
column of type float which contains all the actual float values stored
contiguously, and a second column of an IndexType which contains the
starting point of each sub-vector in the main column (plus the total
number of values as a last entry).
So to get the sub-vector at position index in ROOT mode, the underlying code would do:
std::span<const float> subvector = branchData[index];
For columnar code on the other hand this would be:
std::span<const float> subvector {mainColumn + indexColumn[index], mainColumn + indexColumn[index+1]};
Note that in both cases we use an std::span. This is because the
std::vector object exists only in ROOT mode, and also because not all
operations defined on an std::vector may be available for general
columnar data. For nested vectors you will get objects that behave like
an std::span, but have a custom implementation that encapsulates the
above logic.
For deeper nested vectors (e.g.
std::vector
Implicit Offset Maps and Container Definitions
In practice even a simple variable like Jet.pt will already involve an
std::vector<float> underneath, and therefore an offset map, as it needs
to store one pt value per jet. This offset map is shared between all jet
variables, as they should all have the same number of jets (hopefully).
This offset map is not accessed in the column accessor itself, but
instead is accessed via a dedicated ObjectColumn accessor. This is
then used for looking up the ObjectRange for a given EventId, and
the ObjectId is then defined within the ObjectRange. As such the
ObjectId is already guaranteed to point to a valid entry in the
outermost column the column accessor uses.
This offset map is also used to identify a given container: all columns belonging to the same container share the same offset map, and conversely all columns sharing an offset map are assumed to belong to the same container. This is also effectively how shallow copies work in columnar mode: any columns that would be defined on the shallow copy will share the same offset map, so they can be treated as if they are on the same container.
Since a single call in columnar mode can cover multiple events, the
number of events (and with that the size of the ObjectColumn columns)
also needs to be passed in. The convention is to pass it in as a
separate offset column of length 2 named EventInfo (which is also
the offset map for EventInfo).
Output Vector Columns
While in principle it would be possible to support vector columns for output, it is technically difficult and there would have to be a compelling use case to support it.
The main issue is that in the general case this means that we will have to allocate memory for the output columns in our code, whereas all existing columns can be preallocated on the user side, meaning we can leave memory management completely out of the columnar infrastructure. This could be worked around if really needed, but it would come with a lot of implications.
Technical Notes
The index type is currently ColumnarOffsetType which is centrally
defined. This should probably be changed to a type that is defined as
part of the columnar mode itself, and is solely used as type for offset
maps.
In RDataFrame the vectors are generally accessible via RVector, i.e.
not using offset maps. This is one of the reasons why RDataFrame will
likely need to have its own columnar mode that provides the correct
behavior.
In performance terms the offset maps are likely slightly more efficient than the nested vectors, particularly in tight loops. Creating them will generally involve fewer memory operations than for nested vectors. And since the values are all contiguous in memory, the chance of cache misses is reduced.