pipeline.Processing package#

This package handles the processing of the preprocessed files.

pipeline.Processing.compute module#

A module that defines how to compute certain columns.

pipeline.Processing.compute.column_to_computation_fct: Dict[str, Callable[[pandas.DataFrame], ndarray]] = {'eta': <function <lambda>>, 'phi': <function <lambda>>, 'r': <function <lambda>>, 'theta': <function <lambda>>}#

Associates a column name with a lambda function that takes the dataframe of hits as input and returns the computed column

pipeline.Processing.compute.column_to_required_columns = {'eta': ['theta'], 'theta': ['r']}#

Associates a column name with the list of columns needed to compute it. The x and y columns are assumed to already be in the dataframe, so they are not included in this dictionary.

pipeline.Processing.compute.compute_column(hits, column)[source]#

Compute a column and store it in the dataframe of hits.

Parameters:
  • hits (DataFrame) – dataframe of hits

  • column (str) – column to compute
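
A minimal usage sketch (the numerical values are illustrative), assuming the dataframe already provides the x and y columns the computation relies on:

    import pandas as pd

    from pipeline.Processing.compute import compute_column

    hits = pd.DataFrame({"x": [3.0, 0.0], "y": [4.0, 1.0]})
    compute_column(hits, "r")  # the new 'r' column is stored in `hits`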

pipeline.Processing.compute.compute_columns(hits, columns)[source]#

Compute the required columns and store them in the dataframe of hits.

Parameters:
  • hits (DataFrame) – dataframe of hits

  • columns (List[str]) – columns to compute

Notes

If the column is already in the dataframe, it will not be computed.
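
A minimal usage sketch, requesting the columns in dependency order ('theta' needs 'r', 'eta' needs 'theta'); columns already present in the dataframe are skipped:

    import pandas as pd

    from pipeline.Processing.compute import compute_columns

    hits = pd.DataFrame({"x": [1.0, 0.5], "y": [2.0, -0.3]})
    compute_columns(hits, ["r", "phi", "theta", "eta"])
    # `hits` now also contains the r, phi, theta and eta columns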

pipeline.Processing.modulewise_edges module#

This module defines functions to build the true edges between the hits, using a module-wise approach starting from the origin vertex.

pipeline.Processing.modulewise_edges.get_modulewise_edges(hits)[source]#

Build the edges using a module-wise approach, starting from the production vertex.

Parameters:

hits (DataFrame) – dataframe of hits for a given event, with columns vx, vy, vz, x, y, z, particle_id

Return type:

ndarray

Returns:

Array of all the edges, of shape (2, m), with m the number of edges.
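
A hedged usage sketch with a toy single-particle event (the coordinate values are illustrative placeholders), assuming the columns listed above are sufficient:

    import pandas as pd

    from pipeline.Processing.modulewise_edges import get_modulewise_edges

    hits = pd.DataFrame({
        "particle_id": [42, 42, 42],
        "x": [1.0, 2.0, 3.0], "y": [0.1, 0.2, 0.3], "z": [10.0, 20.0, 30.0],
        "vx": [0.0, 0.0, 0.0], "vy": [0.0, 0.0, 0.0], "vz": [0.0, 0.0, 0.0],
    })
    true_edges = get_modulewise_edges(hits)  # array of shape (2, m)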

pipeline.Processing.planewise_edges module#

A module that defines a way of building the edges by sorting the hits by plane (instead of by distance from the origin vertex).

This way, we define the edge orientation using a left-to-right convention. However, if a plane has multiple hits for the same particle, the edges may not be well defined.

pipeline.Processing.planewise_edges.get_edges_from_sorted_impl(hit_ids, particle_ids, plane_ids)#

Fill the array of plane-wise edges by grouping hits by particle ID (already sorted by plane) and forming an edge between hits on “adjacent” planes.

Parameters:
  • edges – Pre-allocated empty array of edges to fill

  • hit_ids (ndarray) – List of hit IDs, sorted by particle IDs and planes

  • particle_group_indices – Start and end indices in hit_ids that delimit hits that have the same particle ID.

Return type:

List[ndarray]

pipeline.Processing.planewise_edges.get_planewise_custom_edges(hits, grouped_planes=None)[source]#

Return type:

ndarray

pipeline.Processing.planewise_edges.get_planewise_edges(hits, drop_duplicates=False)[source]#

Get edges by sorting the hits by plane number for every particle in the event, and linking the adjacent hits by edges.

Parameters:
  • hits (DataFrame) – dataframe of hits, with columns particle_id and plane

  • drop_duplicates (bool) – whether to drop hits of a particle that belong to the same plane

Return type:

ndarray

Returns:

Two-dimensional array where every column represents an edge. In this array, every hit is referred to by its index in the dataframe of hits.
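
A minimal usage sketch with two particles crossing consecutive planes (the values are illustrative):

    import pandas as pd

    from pipeline.Processing.planewise_edges import get_planewise_edges

    hits = pd.DataFrame({
        "particle_id": [1, 1, 1, 2, 2],
        "plane": [0, 1, 2, 0, 1],
    })
    edges = get_planewise_edges(hits, drop_duplicates=True)
    # each column of `edges` links two hits of the same particle on adjacent
    # planes, referred to by their indices in `hits`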

pipeline.Processing.planewise_edges.get_planewise_edges_impl(hit_ids, particle_ids, plane_ids)[source]#

Get the plane-wise edges.

Parameters:
  • hit_ids (ndarray) – array of hit IDs, sorted by particle IDs

  • particle_ids (ndarray) – Sorted array of particle IDs for every hit

  • plane_ids (ndarray) – Sorted array of plane IDs for every hit

Return type:

ndarray

Returns:

Two-dimensional array where every column represents an edge. In this array, every hit is referred to by its index in the dataframe of hits.

pipeline.Processing.processing module#

A module that defines the processing at the event-level.

pipeline.Processing.processing.build_event(truncated_path, event_str, features, feature_means, feature_scales, kept_hits_columns, kept_particles_columns, true_edges_column)[source]#

Load the event and compute the necessary columns.

Parameters:
  • truncated_path (str) – path without the suffixes -particles.csv and -hits_particles.csv

  • feature_means (List[float]) – Array of the means to subtract from the feature values, in order to “centralise” them

  • feature_scales (List[float]) – Array of the scales to divide the “centralised” feature values by, so that their scale is around 1

  • kept_hits_columns (List[Union[str, Dict[str, str]]]) – Columns to keep, initially stored in the dataframe of hits

  • kept_particles_columns (List[str]) – Columns to keep, initially stored in the dataframe of particles but merged into the dataframe of hits

Return type:

Data

Returns:

PyTorch data object, which will be saved for training or inference.
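
A hedged, illustrative call; the paths, feature names and column names below are hypothetical placeholders, and the pre-processed event files are expected to exist on disk:

    from pipeline.Processing.processing import build_event

    data = build_event(
        "output/preprocessing/event000001",   # hypothetical truncated path
        "event000001",                        # hypothetical event string
        features=["r", "phi", "z"],
        feature_means=[0.0, 0.0, 0.0],
        feature_scales=[1.0, 1.0, 1.0],
        kept_hits_columns=["particle_id", "plane"],
        kept_particles_columns=["vz"],
        true_edges_column="modulewise_true_edges",
    )
    # `data` is the PyTorch data object that is saved for training or inference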

pipeline.Processing.processing.get_normalised_features(hits, features, feature_means, feature_scales)[source]#

Get the normalised features from the dataframe of hits.

Parameters:
  • hits (DataFrame) – Dataframe of hits that contains the features

  • features (List[str]) – list of the columns in the dataframe of hits, which correspond to the features

  • feature_means (ArrayLike) – Array of the means to subtract from the feature values, in order to “centralise” them

  • feature_scales (ArrayLike) – Array of the scales to divide the “centralised” feature values by, so that their scale is around 1

Return type:

ndarray
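
A minimal usage sketch; each feature column is centred with its mean and divided by its scale, roughly (value - mean) / scale (the values below are illustrative):

    import pandas as pd

    from pipeline.Processing.processing import get_normalised_features

    hits = pd.DataFrame({"r": [10.0, 20.0], "z": [-100.0, 50.0]})
    x = get_normalised_features(
        hits,
        features=["r", "z"],
        feature_means=[15.0, 0.0],
        feature_scales=[5.0, 100.0],
    )
    # `x` holds the normalised feature values, one row per hit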

pipeline.Processing.processing.prepare_event(truncated_path, output_dir, *args, **kwargs)[source]#

Load one event saved during pre-processing and save it in the right format in a PyTorch data object.

Parameters:
  • truncated_path (str) – path of the input files, excluding -hit_particles.parquet and -particles.parquet

  • output_dir (str) – directory where to save all the processed PyTorch data

  • overwrite – whether to overwrite an existing PyTorch data pickle file

  • args – passed to build_event()

  • kwargs – passed to build_event()

pipeline.Processing.run_processing module#

A module that defines a function to run the processing from a configuration file.

pipeline.Processing.run_processing.run_processing_from_config(path_or_config, reproduce=True, test_dataset_name=None)[source]#

Loop over the events saved during the pre-processing step, and transform them into the relevant format for training.

Parameters:
  • path_or_config (str | dict) – the overall configuration, either as a dictionary or as the path to a configuration file

  • reproduce (bool) – whether to reproduce an existing processing

  • test_dataset_name (Optional[str]) – Name of the test dataset to produce. If None (default), the train and val datasets are produced instead.
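
A minimal sketch; the configuration path is a hypothetical placeholder:

    from pipeline.Processing.run_processing import run_processing_from_config

    # produce the train and val datasets
    run_processing_from_config("configs/processing.yaml", reproduce=False)

    # produce a named test dataset instead
    run_processing_from_config("configs/processing.yaml", test_dataset_name="test")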

pipeline.Processing.run_processing.run_processing_in_parallel(truncated_paths, output_dir, max_workers, reproduce=True, **processing_config)[source]#

Run the processing step in parallel.

Parameters:
  • truncated_paths (List[str]) – List of the truncated paths of the input files (which correspond to the hits-particles and particles dataframes)

  • output_dir (str) – directory where to write the output files

  • max_workers (int) – maximal number of processes to run in parallel

  • reproduce (bool) – whether to delete the output directory before writing to it

  • **processing_config – Other keyword arguments passed to prepare_event()

pipeline.Processing.run_processing.run_processing_test_dataset(truncated_paths, output_dir, n_workers, reproduce=True, **processing_config)[source]#

Run the processing for the test dataset. There is no train-val splitting for a test sample.

Parameters:
  • truncated_paths (List[str]) – List of the truncated paths of the input files (which correspond to the hits-particles and particles dataframes)

  • output_dir (str) – directory where to write the output files

  • n_workers (int) – maximal number of processes to run in parallel

  • reproduce (bool) – whether to delete the output directory before writing to it

  • **processing_config – other keyword arguments passed to processing.prepare_event()

pipeline.Processing.sortedwise_edges module#

A module that defines a way of building the edges by sorting the hits by their z-coordinate (instead of by distance from the origin vertex).

This way, we define the edge orientation using a left-to-right convention.

pipeline.Processing.sortedwise_edges.get_edges_from_sorted_impl(edges, hit_ids, particle_group_indices)#

Fill the array of sorted-wise edges by grouping hits belonging to the same particle (already sorted by z) and forming an edge between “adjacent” hit IDs.

Parameters:
  • edges (ndarray) – Pre-allocated empty array of edges to fill

  • hit_ids (ndarray) – List of hit IDs, sorted by particle IDs and z-coordinates.

  • particle_group_indices (ndarray) – Start and end indices in hit_ids that delimit hits that have the same particle ID.

Return type:

None

pipeline.Processing.sortedwise_edges.get_sortedwise_edges(hits, drop_duplicates=False)[source]#

Get edges by sorting the hits by z for every particle in the event, and linking the adjacent hits by edges.

Parameters:
  • hits (DataFrame) – dataframe of hits, with columns particle_id and z

  • drop_duplicates (bool) – whether to drop hits of a particle that lie at the same z

Return type:

ndarray

Returns:

Two-dimensional array where every column represents an edge. In this array, every hit is referred to by its index in the dataframe of hits.
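
A minimal usage sketch: for each particle, the hits are ordered by z and linked into consecutive edges (the values are illustrative):

    import pandas as pd

    from pipeline.Processing.sortedwise_edges import get_sortedwise_edges

    hits = pd.DataFrame({
        "particle_id": [7, 7, 7, 8, 8],
        "z": [30.0, 10.0, 20.0, 5.0, 15.0],
    })
    edges = get_sortedwise_edges(hits)
    # each column of `edges` links two z-adjacent hits of the same particle,
    # referred to by their indices in `hits`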

pipeline.Processing.sortedwise_edges.get_sortedwise_edges_impl(hit_ids, particle_ids)[source]#

Get the sorted-wise edges.

Parameters:
  • hit_ids (ndarray) – array of hit IDs, sorted by particle IDs

  • particle_ids (ndarray) – z-sorted array of particle IDs for every hit

Return type:

ndarray

Returns:

Two-dimensional array where every column represents an edge. In this array, every hit is referred to by its index in the dataframe of hits.

pipeline.Processing.splitting module#

A module that handles the splitting of the overall dataset into a train set and a validation set.

pipeline.Processing.splitting.randomly_split_list(list_values, sizes, seed=None)[source]#

Split a list into sub-lists of given sizes, without repetition. The total size may be smaller than the size of the original list.

Parameters:
  • list_values (list) – list to split

  • sizes (List[int]) – list of the sizes of the sub-lists to produce

  • seed (Optional[int]) – random seed

Return type:

List[list]

Returns:

The split sub-lists
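
A minimal usage sketch, splitting ten values into a train and a val subset (the remaining two values are left out):

    from pipeline.Processing.splitting import randomly_split_list

    values = [f"event{i:03d}" for i in range(10)]
    train, val = randomly_split_list(values, sizes=[6, 2], seed=42)
    # len(train) == 6 and len(val) == 2, with no element repeated across the two lists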