pipeline.Processing package#
This package handles the processing of the preprocessed files.
pipeline.Processing.compute module#
A module that defines how to compute certain columns.
- pipeline.Processing.compute.column_to_computation_fct: Dict[str, Callable[[pandas.DataFrame], ndarray[Any, dtype[_ScalarType_co]]]] = {'eta': <function <lambda>>, 'phi': <function <lambda>>, 'r': <function <lambda>>, 'theta': <function <lambda>>}#
Associates a column name with a lambda function that takes the dataframe of hits as input and returns the computed column
- pipeline.Processing.compute.column_to_required_columns = {'eta': ['theta'], 'theta': ['r']}#
Associates a column name with the list of columns needed to compute it.
x and y are already assumed to belong to the dataframe, so they are not included in this dictionary.
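For illustration, the two dictionaries can be combined to compute a missing column together with its prerequisites. The lambda bodies are not shown above, so the formulas below (cylindrical radius, azimuth, polar angle, pseudorapidity) are an assumption based on the usual tracking conventions, and the compute_column helper is hypothetical; it only sketches how the two mappings fit together.

```python
import numpy as np
import pandas as pd

# Sketch of the two mappings, assuming standard tracking conventions;
# the real lambdas live in pipeline.Processing.compute.
column_to_computation_fct = {
    "r": lambda hits: np.sqrt(hits["x"] ** 2 + hits["y"] ** 2),
    "phi": lambda hits: np.arctan2(hits["y"], hits["x"]),
    "theta": lambda hits: np.arctan2(hits["r"], hits["z"]),
    "eta": lambda hits: -np.log(np.tan(hits["theta"] / 2)),
}
column_to_required_columns = {"eta": ["theta"], "theta": ["r"]}


def compute_column(hits: pd.DataFrame, column: str) -> pd.DataFrame:
    """Recursively compute `column` and its prerequisites, then add them to `hits`."""
    for required in column_to_required_columns.get(column, []):
        if required not in hits.columns:
            hits = compute_column(hits, required)
    hits[column] = column_to_computation_fct[column](hits)
    return hits


hits = pd.DataFrame({"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [2.0, -2.0]})
hits = compute_column(hits, "eta")  # also fills in "theta" and "r" along the way
```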
pipeline.Processing.modulewise_edges module#
This module defines functions to build the true edges between the hits, using a module-wise approach based on the distance from the origin (production) vertex.
- pipeline.Processing.modulewise_edges.get_modulewise_edges(hits)[source]#
Build the edges using a module-wise approach, from the production vertex.
- Parameters:
hits (DataFrame) – dataframe of hits for a given event, with columns vx, vy, vz, x, y, z, particle_id
- Return type:
ndarray
- Returns:
Array of all the edges, of shape (2, m), with m the number of edges.
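A minimal usage sketch, with a small hand-built event containing the columns listed above (all values are purely illustrative):

```python
import pandas as pd

from pipeline.Processing.modulewise_edges import get_modulewise_edges

# Hypothetical two-particle event: three hits for particle 1, two hits for particle 2,
# both produced at the origin.
hits = pd.DataFrame({
    "particle_id": [1, 1, 1, 2, 2],
    "vx": [0.0] * 5,
    "vy": [0.0] * 5,
    "vz": [0.0] * 5,
    "x": [1.0, 2.0, 3.0, 1.5, 2.5],
    "y": [0.1, 0.2, 0.3, -0.1, -0.2],
    "z": [10.0, 20.0, 30.0, 12.0, 24.0],
})

edges = get_modulewise_edges(hits)  # array of shape (2, m): one column per true edge
```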
pipeline.Processing.planewise_edges module#
A module that defines a way of building the edges by sorting the hits by plane (instead of by distance from the origin vertex).
This way, the edge orientation follows a left-to-right convention. However, if a plane has multiple hits for the same particle, the edges may not be well defined. A simplified sketch of the idea is shown below.
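The sketch below is not the actual implementation (which relies on the helper functions documented in this module); it only illustrates the plane-wise convention: sort the hits by particle and then by plane, and link consecutive hits of the same particle.

```python
import numpy as np


def planewise_edges_sketch(particle_ids: np.ndarray, plane_ids: np.ndarray) -> np.ndarray:
    """Link hits of the same particle that sit on consecutive planes (left to right)."""
    order = np.lexsort((plane_ids, particle_ids))  # sort by particle, then by plane
    sources, targets = [], []
    for i, j in zip(order[:-1], order[1:]):
        if particle_ids[i] == particle_ids[j]:  # only link hits of the same particle
            sources.append(i)
            targets.append(j)
    return np.array([sources, targets])


# Two particles (IDs 7 and 8), five hits spread over planes 0-2.
particle_ids = np.array([7, 8, 7, 8, 7])
plane_ids = np.array([0, 1, 1, 0, 2])
print(planewise_edges_sketch(particle_ids, plane_ids))  # columns are (source, target) hit indices
```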
- pipeline.Processing.planewise_edges.get_edges_from_sorted_impl(hit_ids, particle_ids, plane_ids)#
Fill the array of plane-wise edges by grouping hits by particle ID (already sorted by plane), and forming edges by linking “adjacent” planes.
- Parameters:
edges – Pre-allocated empty array of edges to fill
hit_ids (ndarray[Any, dtype[_ScalarType_co]]) – List of hit IDs, sorted by particle IDs and planes
particle_group_indices – Start and end indices in hit_ids that delimit hits that have the same particle ID.
- Return type:
List[ndarray[Any, dtype[_ScalarType_co]]]
- pipeline.Processing.planewise_edges.get_planewise_custom_edges(hits, grouped_planes=None)[source]#
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- pipeline.Processing.planewise_edges.get_planewise_edges(hits, drop_duplicates=False)[source]#
Get edges by sorting the hits by plane number for every particle in the event, and linking the adjacent hits by edges.
- Parameters:
hits (DataFrame) – dataframe of hits, with columns particle_id and plane
drop_duplicates (bool) – whether to drop hits of a particle that belong to the same plane
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
- pipeline.Processing.planewise_edges.get_planewise_edges_impl(hit_ids, particle_ids, plane_ids)[source]#
Get the plane-wise edges
- Parameters:
hit_ids (ndarray[Any, dtype[_ScalarType_co]]) – array of hit IDs, sorted by particle IDs
particle_ids (ndarray[Any, dtype[_ScalarType_co]]) – Sorted array of particle IDs for every hit
plane_ids (ndarray[Any, dtype[_ScalarType_co]]) – Sorted array of plane IDs for every hit
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
pipeline.Processing.processing module#
A module that defines the processing at the event-level.
- pipeline.Processing.processing.build_event(truncated_path, event_str, features, feature_means, feature_scales, kept_hits_columns, kept_particles_columns, true_edges_column)[source]#
Load the event, compute the necessary columns.
- Parameters:
truncated_path – path without the suffixes -particles.csv and hits_particles.csv
feature_means (List[float]) – Array of the means to subtract from the feature values, in order to “centralise” them
feature_scales (List[float]) – Array of the scales to divide the “centralised” feature values by, so that their scale is around 1
kept_hits_columns (List[Union[str, Dict[str, str]]]) – Columns to keep, initially stored in the dataframe of hits
kept_particles_columns (List[str]) – Columns to keep, initially stored in the dataframe of particles, but merged into the dataframe of hits
- Return type:
Data
- Returns:
PyTorch data object, which will be saved for training or inference.
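As a rough orientation, a function of this shape typically follows the steps below. This is only a hedged sketch: the file suffixes come from the prepare_event() documentation, the use of PyTorch Geometric's Data is an assumption for the Data return type, and build_event_sketch itself is hypothetical (the real build_event() takes more arguments and may differ in details).

```python
import pandas as pd
import torch
from torch_geometric.data import Data  # assumption: "Data" is PyTorch Geometric's Data

from pipeline.Processing.modulewise_edges import get_modulewise_edges


def build_event_sketch(truncated_path, features, feature_means, feature_scales):
    # 1. Load the two dataframes written by the pre-processing step.
    hits = pd.read_parquet(f"{truncated_path}-hit_particles.parquet")
    particles = pd.read_parquet(f"{truncated_path}-particles.parquet")

    # 2. Merge particle-level columns onto the hits and compute any derived columns.
    hits = hits.merge(particles, on="particle_id", how="left")

    # 3. Normalise the features and build the true edges (here, module-wise).
    x = (hits[features].to_numpy() - feature_means) / feature_scales
    edges = get_modulewise_edges(hits)

    # 4. Pack everything into a single data object, ready to be saved.
    return Data(
        x=torch.as_tensor(x, dtype=torch.float),
        true_edges=torch.as_tensor(edges, dtype=torch.long),
    )
```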
- pipeline.Processing.processing.get_normalised_features(hits, features, feature_means, feature_scales)[source]#
Get the normalised features from the dataframe of hits.
- Parameters:
hits (DataFrame) – Dataframe of hits that contains the features
features (List[str]) – list of the columns in the dataframe of hits, which correspond to the features
feature_means (ArrayLike) – Array of the means to subtract from the feature values, in order to “centralise” them
feature_scales (ArrayLike) – Array of the scales to divide the “centralised” feature values by, so that their scale is around 1
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
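The normalisation itself is a plain centre-and-scale operation. A minimal sketch (the helper name is hypothetical; it assumes feature_means and feature_scales are given per feature, in the same order as features):

```python
import numpy as np
import pandas as pd


def normalise_features_sketch(hits, features, feature_means, feature_scales):
    """Centre each feature column by its mean, then divide by its scale."""
    raw = hits[features].to_numpy(dtype=float)
    return (raw - np.asarray(feature_means)) / np.asarray(feature_scales)


hits = pd.DataFrame({"r": [10.0, 30.0], "z": [-100.0, 100.0]})
print(normalise_features_sketch(hits, ["r", "z"], [20.0, 0.0], [10.0, 100.0]))
# [[-1. -1.]
#  [ 1.  1.]]
```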
- pipeline.Processing.processing.prepare_event(truncated_path, output_dir, *args, **kwargs)[source]#
Load one event saved during pre-processing and save it in the right format in a PyTorch data object.
- Parameters:
truncated_path (str) – path of the input files, excluding -hit_particles.parquet and -particles.parquet
output_dir (str) – directory where to save all the processed PyTorch data
overwrite – whether to overwrite an existing PyTorch data pickle file
args – passed to build_event()
kwargs – passed to build_event()
pipeline.Processing.run_processing module#
A module that defines a function to run the processing from a configuration file.
- pipeline.Processing.run_processing.run_processing_from_config(path_or_config, reproduce=True, test_dataset_name=None)[source]#
Loop over the events saved during the pre-processing step, and transform them into the relevant format for training.
- Parameters:
path_or_config (str | dict) – the overall configuration, or the path to the configuration file
reproduce (bool) – whether to reproduce an existing processing
test_dataset_name (Optional[str]) – Name of the test dataset to produce. If None (default), the train and val datasets are produced instead.
- pipeline.Processing.run_processing.run_processing_in_parallel(truncated_paths, output_dir, max_workers, reproduce=True, **processing_config)[source]#
Run the processing step in parallel.
- Parameters:
truncated_paths (List[str]) – List of the truncated paths of the input files (which correspond to the hits-particles and particles dataframes)
output_dir (str) – directory where to write the output files
max_workers (int) – maximal number of processes to run in parallel
reproduce (bool) – whether to delete the output directory before writing to it
**processing_config – Other keyword arguments passed to prepare_event()
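Functionally, this amounts to mapping prepare_event() over the truncated paths with a bounded pool of worker processes. A hedged sketch of that pattern (not the actual implementation, which may handle errors, logging and the reproduce flag differently):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

from pipeline.Processing.processing import prepare_event


def run_in_parallel_sketch(truncated_paths, output_dir, max_workers, **processing_config):
    """Map prepare_event over the events with at most max_workers processes."""
    task = partial(prepare_event, output_dir=output_dir, **processing_config)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(task, truncated_paths))
```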
- pipeline.Processing.run_processing.run_processing_test_dataset(truncated_paths, output_dir, n_workers, reproduce=True, **processing_config)[source]#
Run the processing for the test dataset. There is no train-val splitting for a test sample.
- Parameters:
truncated_paths (List[str]) – List of the truncated paths of the input files (which correspond to the hits-particles and particles dataframes)
output_dir (str) – directory where to write the output files
n_workers (int) – maximal number of processes to run in parallel
reproduce (bool) – whether to delete the output directory before writing to it
**processing_config – other keyword arguments passed to processing.prepare_event()
pipeline.Processing.sortedwise_edges module#
A module that defines a way of building the edges by sorting the hits by their z-coordinate (instead of by distance from the origin vertex).
This way, the edge orientation follows a left-to-right convention.
- pipeline.Processing.sortedwise_edges.get_edges_from_sorted_impl(edges, hit_ids, particle_group_indices)#
Fill the array of sorted-wise edges by grouping hits belonging to the same particle, already sorted by z, and forming edges by linking “adjacent” hit IDs.
- Parameters:
edges (ndarray) – Pre-allocated empty array of edges to fill
hit_ids (ndarray) – List of hit IDs, sorted by particle IDs and z-coordinates.
particle_group_indices (ndarray) – Start and end indices in hit_ids that delimit hits that have the same particle ID.
- Return type:
None
- pipeline.Processing.sortedwise_edges.get_sortedwise_edges(hits, drop_duplicates=False)[source]#
Get edges by sorting the hits by z for every particle in the event, and linking the adjacent hits by edges.
- Parameters:
hits (DataFrame) – dataframe of hits, with columns particle_id and z
drop_duplicates (bool) – whether to drop hits of a particle that have the same z
- Return type:
ndarray
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
- pipeline.Processing.sortedwise_edges.get_sortedwise_edges_impl(hit_ids, particle_ids)[source]#
Get the sorted-wise edges
- Parameters:
hit_ids (ndarray) – array of hit IDs, sorted by particle IDs
particle_ids (ndarray) – z-sorted array of particle IDs for every hit
- Return type:
ndarray
- Returns:
Two-dimensional array where every column represents an edge. In this array, for every edge, a hit is referred to by its index in the dataframe of hits.
pipeline.Processing.splitting module#
A module that handles the splitting of the overall dataset into a train and a validation set.
- pipeline.Processing.splitting.randomly_split_list(list_values, sizes, seed=None)[source]#
Split a list into sub-lists of given sizes, without repetition. The total size may be smaller than the size of the original list.
- Parameters:
list_values (list) – list to split
sizes (List[int]) – list of the sizes of the sub-lists to produce
seed (Optional[int]) – random seed
- Return type:
List[list]
- Returns:
The split sub-lists
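As an illustration, a splitting of this kind can be sketched as a single shuffle followed by consecutive slices of the requested sizes; the helper below is hypothetical and the actual implementation may differ.

```python
import random
from typing import List, Optional


def randomly_split_list_sketch(list_values: list, sizes: List[int], seed: Optional[int] = None) -> List[list]:
    """Shuffle the values once, then cut consecutive chunks of the requested sizes."""
    shuffled = list(list_values)
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits


# e.g. split 10 events into a train list of 7 and a validation list of 3
train, val = randomly_split_list_sketch(list(range(10)), [7, 3], seed=42)
```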