pipeline.Processing package#
This package handles the processing of the preprocessed files.
pipeline.Processing.compute module#
A module that defines how to compute certain columns.
- pipeline.Processing.compute.column_to_computation_fct: Dict[str, Callable[[pandas.DataFrame], ndarray[Any, dtype[_ScalarType_co]]]] = {'eta': <function <lambda>>, 'phi': <function <lambda>>, 'r': <function <lambda>>, 'theta': <function <lambda>>}#
Associates a column name with a lambda function that takes the dataframe of hits as input and returns the computed column
- pipeline.Processing.compute.column_to_required_columns = {'eta': ['theta'], 'theta': ['r']}#
Associates a column name with the list of columns needed to compute it.
x and y are already assumed to belong to the dataframe, so they are not included in this dictionary.
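For illustration, the two dictionaries can be combined to compute a missing column together with its prerequisites. The lambda bodies are not shown above, so the formulas below (cylindrical radius, azimuth, polar angle, pseudorapidity) are an assumption based on the usual tracking conventions, and the compute_column helper is hypothetical; it only sketches how the two mappings fit together.

```python
import numpy as np
import pandas as pd

# Sketch of the two mappings, assuming standard tracking conventions;
# the real lambdas live in pipeline.Processing.compute.
column_to_computation_fct = {
    "r": lambda hits: np.sqrt(hits["x"] ** 2 + hits["y"] ** 2),
    "phi": lambda hits: np.arctan2(hits["y"], hits["x"]),
    "theta": lambda hits: np.arctan2(hits["r"], hits["z"]),
    "eta": lambda hits: -np.log(np.tan(hits["theta"] / 2)),
}
column_to_required_columns = {"eta": ["theta"], "theta": ["r"]}


def compute_column(hits: pd.DataFrame, column: str) -> pd.DataFrame:
    """Recursively compute `column` and its prerequisites, then add them to `hits`."""
    for required in column_to_required_columns.get(column, []):
        if required not in hits.columns:
            hits = compute_column(hits, required)
    hits[column] = column_to_computation_fct[column](hits)
    return hits


hits = pd.DataFrame({"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [2.0, -2.0]})
hits = compute_column(hits, "eta")  # also fills in "theta" and "r" along the way
```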
pipeline.Processing.modulewise_edges module#
This module defines functions to build the true edges between the hits, using a module-wise approach based on the distance from the origin (production) vertex.
- pipeline.Processing.modulewise_edges.get_modulewise_edges(hits)[source]#
Build the edges using a module-wise approach, from the production vertex.
- Parameters:
hits (DataFrame) – dataframe of hits for a given event, with columns vx, vy, vz, x, y, z, particle_id
- Return type:
ndarray
- Returns:
Array of all the edges, of shape (2, m), with m the number of edges.
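A minimal usage sketch, with a small hand-built event containing the columns listed above (all values are purely illustrative):

```python
import pandas as pd

from pipeline.Processing.modulewise_edges import get_modulewise_edges

# Hypothetical two-particle event: three hits for particle 1, two hits for particle 2,
# both produced at the origin.
hits = pd.DataFrame({
    "particle_id": [1, 1, 1, 2, 2],
    "vx": [0.0] * 5,
    "vy": [0.0] * 5,
    "vz": [0.0] * 5,
    "x": [1.0, 2.0, 3.0, 1.5, 2.5],
    "y": [0.1, 0.2, 0.3, -0.1, -0.2],
    "z": [10.0, 20.0, 30.0, 12.0, 24.0],
})

edges = get_modulewise_edges(hits)  # array of shape (2, m): one column per true edge
```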
pipeline.Processing.planewise_edges module#
A module that defines a way of building the edges by sorting the hits by plane (instead of by distance from the origin vertex).
This way, the edge orientation follows a left-to-right convention. However, if a plane has multiple hits for the same particle, the edges may not be well defined. A simplified sketch of the idea is shown below.
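The sketch below is not the actual implementation (which relies on the helper functions documented in this module); it only illustrates the plane-wise convention: sort the hits by particle and then by plane, and link consecutive hits of the same particle.

```python
import numpy as np


def planewise_edges_sketch(particle_ids: np.ndarray, plane_ids: np.ndarray) -> np.ndarray:
    """Link hits of the same particle that sit on consecutive planes (left to right)."""
    order = np.lexsort((plane_ids, particle_ids))  # sort by particle, then by plane
    sources, targets = [], []
    for i, j in zip(order[:-1], order[1:]):
        if particle_ids[i] == particle_ids[j]:  # only link hits of the same particle
            sources.append(i)
            targets.append(j)
    return np.array([sources, targets])


# Two particles (IDs 7 and 8), five hits spread over planes 0-2.
particle_ids = np.array([7, 8, 7, 8, 7])
plane_ids = np.array([0, 1, 1, 0, 2])
print(planewise_edges_sketch(particle_ids, plane_ids))  # columns are (source, target) hit indices
```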
- pipeline.Processing.planewise_edges.get_edges_from_sorted_impl(hit_ids, particle_ids, plane_ids)#
Fill the array of plane-wise edges by grouping hits by particle ID (already sorted by plane), and forming edges by linking “adjacent” planes.
- Parameters:
edges – Pre-allocated empty array of edges to fill
hit_ids (ndarray[Any, dtype[_ScalarType_co]]) – List of hit IDs, sorted by particle IDs and planes
particle_group_indices – Start and end indices in hit_ids that delimit hits that have the same particle ID.
- Return type:
List[ndarray[Any, dtype[_ScalarType_co]]]
- pipeline.Processing.planewise_edges.get_planewise_custom_edges(hits, grouped_planes=None)[source]#
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- pipeline.Processing.planewise_edges.get_planewise_edges(hits, drop_duplicates=False)[source]#
Get edges by sorting the hits by plane number for every particle in the event, and linking the adjacent hits by edges.
- Parameters:
hits (DataFrame) – dataframe of hits, with columns particle_id and plane
drop_duplicates (bool) – whether to drop hits of a particle that belong to the same plane
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
- pipeline.Processing.planewise_edges.get_planewise_edges_impl(hit_ids, particle_ids, plane_ids)[source]#
Get the plane-wise edges
- Parameters:
hit_ids (ndarray[Any, dtype[_ScalarType_co]]) – array of hit IDs, sorted by particle IDs
particle_ids (ndarray[Any, dtype[_ScalarType_co]]) – Sorted array of particle IDs for every hit
plane_ids (ndarray[Any, dtype[_ScalarType_co]]) – Sorted array of plane IDs for every hit
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
pipeline.Processing.processing module#
A module that defines the processing at the event-level.
- pipeline.Processing.processing.build_event(truncated_path, event_str, features, feature_means, feature_scales, kept_hits_columns, kept_particles_columns, true_edges_column)[source]#
Load the event, compute the necessary columns.
- Parameters:
truncated_path – path without the suffixes -particles.csv and hits_particles.csv
feature_means (List[float]) – Array of the means to subtract from the feature values, in order to “centralise” them
feature_scales (List[float]) – Array of the scales to divide the “centralised” feature values by, so that their scale is around 1
kept_hits_columns (List[Union[str, Dict[str, str]]]) – Columns to keep, initially stored in the dataframe of hits
kept_particles_columns (List[str]) – Columns to keep, initially stored in the dataframe of particles, but merged into the dataframe of hits
- Return type:
Data
- Returns:
PyTorch data object, which will be saved for training or inference.
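As a rough orientation, a function of this shape typically follows the steps below. This is only a hedged sketch: the file suffixes come from the prepare_event() documentation, the use of PyTorch Geometric's Data is an assumption for the Data return type, and build_event_sketch itself is hypothetical (the real build_event() takes more arguments and may differ in details).

```python
import pandas as pd
import torch
from torch_geometric.data import Data  # assumption: "Data" is PyTorch Geometric's Data

from pipeline.Processing.modulewise_edges import get_modulewise_edges


def build_event_sketch(truncated_path, features, feature_means, feature_scales):
    # 1. Load the two dataframes written by the pre-processing step.
    hits = pd.read_parquet(f"{truncated_path}-hit_particles.parquet")
    particles = pd.read_parquet(f"{truncated_path}-particles.parquet")

    # 2. Merge particle-level columns onto the hits and compute any derived columns.
    hits = hits.merge(particles, on="particle_id", how="left")

    # 3. Normalise the features and build the true edges (here, module-wise).
    x = (hits[features].to_numpy() - feature_means) / feature_scales
    edges = get_modulewise_edges(hits)

    # 4. Pack everything into a single data object, ready to be saved.
    return Data(
        x=torch.as_tensor(x, dtype=torch.float),
        true_edges=torch.as_tensor(edges, dtype=torch.long),
    )
```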
- pipeline.Processing.processing.get_normalised_features(hits, features, feature_means, feature_scales)[source]#
Get the normalised features from the dataframe of hits.
- Parameters:
hits (DataFrame) – Dataframe of hits that contains the features
features (List[str]) – list of the columns in the dataframe of hits, which correspond to the features
feature_means (ArrayLike) – Array of the means to subtract from the feature values, in order to “centralise” them
feature_scales (ArrayLike) – Array of the scales to divide the “centralised” feature values by, so that their scale is around 1
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
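The normalisation itself is a plain centre-and-scale operation. A minimal sketch (the helper name is hypothetical; it assumes feature_means and feature_scales are given per feature, in the same order as features):

```python
import numpy as np
import pandas as pd


def normalise_features_sketch(hits, features, feature_means, feature_scales):
    """Centre each feature column by its mean, then divide by its scale."""
    raw = hits[features].to_numpy(dtype=float)
    return (raw - np.asarray(feature_means)) / np.asarray(feature_scales)


hits = pd.DataFrame({"r": [10.0, 30.0], "z": [-100.0, 100.0]})
print(normalise_features_sketch(hits, ["r", "z"], [20.0, 0.0], [10.0, 100.0]))
# [[-1. -1.]
#  [ 1.  1.]]
```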
- pipeline.Processing.processing.prepare_event(truncated_path, output_dir, *args, **kwargs)[source]#
Load one event saved during pre-processing and save it in the right format in a PyTorch data object.
- Parameters:
truncated_path (str) – path of the input files, excluding -hit_particles.parquet and -particles.parquet
output_dir (str) – directory where to save all the processed PyTorch data
overwrite – whether to overwrite an existing PyTorch data pickle file
args – passed to build_event()
kwargs – passed to build_event()
pipeline.Processing.run_processing module#
A module that defines a function to run the processing from a configuration file.
- pipeline.Processing.run_processing.run_processing_from_config(path_or_config, reproduce=True, test_dataset_name=None)[source]#
Loop over the events saved during the pre-processing step, and transform them into the relevant format for training.
- Parameters:
path_or_config (str | dict) – the overall configuration, or the path to the configuration file
reproduce (bool) – whether to reproduce an existing processing
test_dataset_name (Optional[str]) – Name of the test dataset to produce. If None (default), the train and val datasets are produced instead.
- pipeline.Processing.run_processing.run_processing_in_parallel(truncated_paths, output_dir, max_workers, reproduce=True, **processing_config)[source]#
Run the processing step in parallel.
- Parameters:
truncated_paths (List[str]) – List of the truncated paths of the input files (which correspond to the hits-particles and particles dataframes)
output_dir (str) – directory where to write the output files
max_workers (int) – maximal number of processes to run in parallel
reproduce (bool) – whether to delete the output directory before writing to it
**processing_config – Other keyword arguments passed to prepare_event()
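Functionally, this amounts to mapping prepare_event() over the truncated paths with a bounded pool of worker processes. A hedged sketch of that pattern (not the actual implementation, which may handle errors, logging and the reproduce flag differently):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

from pipeline.Processing.processing import prepare_event


def run_in_parallel_sketch(truncated_paths, output_dir, max_workers, **processing_config):
    """Map prepare_event over the events with at most max_workers processes."""
    task = partial(prepare_event, output_dir=output_dir, **processing_config)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(task, truncated_paths))
```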
- pipeline.Processing.run_processing.run_processing_test_dataset(truncated_paths, output_dir, n_workers, reproduce=True, **processing_config)[source]#
Run the processing for the test dataset. There is no train-val splitting for a test sample.
- Parameters:
truncated_paths (List[str]) – List of the truncated paths of the input files (which correspond to the hits-particles and particles dataframes)
output_dir (str) – directory where to write the output files
n_workers (int) – maximal number of processes to run in parallel
reproduce (bool) – whether to delete the output directory before writing to it
**processing_config – other keyword arguments passed to processing.prepare_event()
pipeline.Processing.sortedwise_edges module#
A module that defines a way of building the edges by sorting the hits by their z-coordinate (instead of by distance from the origin vertex).
This way, the edge orientation follows a left-to-right convention.
- pipeline.Processing.sortedwise_edges.get_edges_from_sorted_impl(edges, hit_ids, particle_group_indices)#
Fill the array of sorted-wise edges by grouping hits belonging to the same particle, already sorted by z, and forming edges by linking “adjacent” hit IDs.
- Parameters:
edges (ndarray) – Pre-allocated empty array of edges to fill
hit_ids (ndarray) – List of hit IDs, sorted by particle IDs and z-coordinates.
particle_group_indices (ndarray) – Start and end indices in hit_ids that delimit hits that have the same particle ID.
- Return type:
None
- pipeline.Processing.sortedwise_edges.get_sortedwise_edges(hits, drop_duplicates=False)[source]#
Get edges by sorting the hits by z for every particle in the event, and linking the adjacent hits by edges.
- Parameters:
hits (DataFrame) – dataframe of hits, with columns particle_id and z
drop_duplicates (bool) – whether to drop hits of a particle that have the same z
- Return type:
ndarray
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
- pipeline.Processing.sortedwise_edges.get_sortedwise_edges_impl(hit_ids, particle_ids)[source]#
Get the sorted-wise edges
- Parameters:
hit_ids (ndarray) – array of hit IDs, sorted by particle IDs
particle_ids (ndarray) – z-sorted array of particle IDs for every hit
- Return type:
ndarray
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
pipeline.Processing.splitting module#
A module that handles the splitting of the overall dataset into a train and a validation set.
- pipeline.Processing.splitting.randomly_split_list(list_values, sizes, seed=None)[source]#
Split a list into sub-lists of given sizes, without repetition. The total size may be smaller than the size of the original list.
- Parameters:
list_values (list) – list to split
sizes (List[int]) – list of the sizes of the sub-lists to produce
seed (Optional[int]) – random seed
- Return type:
List[list]
- Returns:
The split sub-lists
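As an illustration, a splitting of this kind can be sketched as a single shuffle followed by consecutive slices of the requested sizes; the helper below is hypothetical and the actual implementation may differ.

```python
import random
from typing import List, Optional


def randomly_split_list_sketch(list_values: list, sizes: List[int], seed: Optional[int] = None) -> List[list]:
    """Shuffle the values once, then cut consecutive chunks of the requested sizes."""
    shuffled = list(list_values)
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits


# e.g. split 10 events into a train list of 7 and a validation list of 3
train, val = randomly_split_list_sketch(list(range(10)), [7, 3], seed=42)
```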