pipeline.Preprocessing package#
This package handles the preprocessing.
pipeline.Preprocessing.balancing module#
A module that implements functions to balance a dataset using weights.
- pipeline.Preprocessing.balancing.compute_balancing_weights(array, nbins=None)[source]#
Compute balancing weights such that, once weighted, the histogram bins of array all have the same content.
- Parameters:
array (ArrayLike) – array of values of interest, which will be histogrammed
nbins (Optional[int]) – number of bins in the histogram
- Return type:
ndarray
- Returns:
Array of weights for every value in array
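Example (a minimal sketch; the skewed sample below is hypothetical, and only the documented signature is assumed):

import numpy as np
from pipeline.Preprocessing.balancing import compute_balancing_weights

# Hypothetical skewed sample: most values cluster around 0
values = np.concatenate(
    [np.random.normal(0, 1, 900), np.random.normal(5, 1, 100)]
)

weights = compute_balancing_weights(values, nbins=20)

# The weighted histogram bins should now have roughly equal content
counts, _ = np.histogram(values, bins=20, weights=weights)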
pipeline.Preprocessing.hit_filtering module#
A module that implements the filtering of hits, grouping them by particle.
- pipeline.Preprocessing.hit_filtering.cut_long_tracks_impl(array_mask, event_ids, particle_ids, track_sizes, proportions, rng)#
Cut long tracks to get smaller tracks. The first hits are removed.
- Parameters:
array_mask (ndarray[Any, dtype[bool_]]) – boolean mask array that indicates which hits are kept
event_ids (ndarray[Any, dtype[int64]]) – array of event IDs
particle_ids (ndarray[Any, dtype[int64]]) – array of particle IDs
track_sizes (ndarray[Any, dtype[int64]]) – array of track sizes
proportions (ndarray[Any, dtype[float64]]) – array of proportions of track sizes, corresponding to track_sizes
rng (Generator) – random generator that decides which track is cut to which size
- Return type:
None
- pipeline.Preprocessing.hit_filtering.mask_long_into_small_tracks(hits_particles, track_size_proportions, seed=None)[source]#
Create a mask to remove the first hits of long tracks to match the proportions of track sizes given as input.
- Parameters:
hits_particles (DataFrame) – dataframe of hits-particles
track_size_proportions (dict) – dictionary that associates a track size with its expected proportion after the cut
seed (Optional[int]) – random seed that decides which track is cut to which size
- Return type:
Series
- Returns:
Pandas Series indexed by event, particle_id and hit_id, which indicates which hits are kept
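Example (a minimal sketch; the dataframe and the proportion dictionary are hypothetical, following the parameter descriptions above):

from pipeline.Preprocessing.hit_filtering import mask_long_into_small_tracks

# hits_particles: hypothetical dataframe of hits-particles, e.g. loaded
# from a preprocessed parquet file.
# Request that, after the cut, half of the tracks have 3 hits and half have 5
mask = mask_long_into_small_tracks(
    hits_particles,
    track_size_proportions={3: 0.5, 5: 0.5},
    seed=42,
)
# Assuming the mask index aligns with the dataframe index
hits_particles_cut = hits_particles[mask]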
pipeline.Preprocessing.inputloader module#
A module that defines the input loader, which allows looping over events scattered across different parquet or CSV files.
- pipeline.Preprocessing.inputloader.get_indirs(input_dir=None, subdirs=None, condition=None)[source]#
Get the input directories that can be used as input of the preprocessing.
- Parameters:
input_dir (Optional[str]) – a single input directory if subdirs is None, or the main directory where the sub-directories are
subdirs (Union[int, str, List[str], Dict[str, int], None]) –
If subdirs is None, there is a single input directory, input_dir.
If subdirs is a string or a list of strings, they specify the sub-directories with respect to input_dir. If input_dir is None, they are the (list of) input directories directly, which can be useful if the input directories are not at the same location (even though this is discouraged).
If subdirs is an integer, it corresponds to the name of the last sub-directory to consider (i.e., from 0 to subdirs). If subdirs is -1, all the sub-directories are considered as input.
If subdirs is a dictionary, the keys start and stop specify the first and last sub-directories to consider as input.
- Returns:
List of input directories that can be considered.
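Example showing the four accepted subdirs forms (the directory names are hypothetical):

from pipeline.Preprocessing.inputloader import get_indirs

# Single directory
get_indirs(input_dir="data/raw")

# Explicit sub-directories under the main directory
get_indirs(input_dir="data/raw", subdirs=["0", "1"])

# Integer form: sub-directories 0 to 3
get_indirs(input_dir="data/raw", subdirs=3)

# Dictionary form: sub-directories 2 to 5
get_indirs(input_dir="data/raw", subdirs={"start": 2, "stop": 5})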
pipeline.Preprocessing.line_metrics module#
- pipeline.Preprocessing.line_metrics.compute_angle_between_line_and_plane(line_direction_vector, plane_normal_vector)[source]#
Compute the angle between a line and a plane.
- Parameters:
line_direction_vector (ndarray) – normalised direction vector of the straight line
plane_normal_vector (ndarray) – normalised normal vector of the plane
- Return type:
float
- Returns:
Angle between a line and a plane, in degrees.
- pipeline.Preprocessing.line_metrics.compute_distance_to_line(coords, point, direction_vector)#
Compute the normalised distance between coordinates and a straight line.
- Parameters:
coords (ndarray) – array of coordinates, whose columns are the cartesian coordinates
point (ndarray) – one point belonging to the straight line
direction_vector (ndarray) – normalised direction vector of the straight line
- Return type:
float
- Returns:
Square root of the average of the squared distances from a straight line
- pipeline.Preprocessing.line_metrics.compute_distance_to_z_axis(coords, point, direction_vector)#
Compute the distance of a straight line to the z-axis.
- Parameters:
coords (ndarray) – array of coordinates. This is not used.
point (ndarray) – one point belonging to the straight line
direction_vector (ndarray) – normalised direction vector of the straight line
- Return type:
floating
- Returns:
Shortest distance between a line and the z-axis.
- pipeline.Preprocessing.line_metrics.compute_distances_between_two_lines(coords1, direction_vector1, coords2, direction_vector2)#
Compute the distance between two lines.
- Parameters:
coords1 (ndarray) – cartesian coordinates of a point belonging to the first straight line
direction_vector1 (ndarray) – normalised direction vector of the first straight line
coords2 (ndarray) – cartesian coordinates of a point belonging to the second straight line
direction_vector2 (ndarray) – normalised direction vector of the second straight line
- Return type:
floating
- Returns:
Distance between the two lines.
- pipeline.Preprocessing.line_metrics.compute_xz_angle(coords, point, direction_vector)[source]#
Compute the angle between a line and the \(x\)-\(z\) plane.
- Parameters:
coords (ndarray) – array of coordinates. This is not used.
point (ndarray) – one point belonging to the straight line. This is not used.
direction_vector (ndarray) – normalised direction vector of the straight line
- Return type:
float
- Returns:
Angle between a line and the \(x\)-\(z\) plane, in degrees.
- pipeline.Preprocessing.line_metrics.compute_yz_angle(coords, point, direction_vector)[source]#
Compute the angle between a line and the \(y\)-\(z\) plane.
- Parameters:
coords (ndarray) – array of coordinates. This is not used.
point (ndarray) – one point belonging to the straight line. This is not used.
direction_vector (ndarray) – normalised direction vector of the straight line
- Return type:
float
- Returns:
Angle between a line and the \(y\)-\(z\) plane, in degrees.
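Example (a minimal sketch of the two angle helpers; since coords and point are unused, placeholders suffice, and the direction vector below is hypothetical):

import numpy as np
from pipeline.Preprocessing.line_metrics import compute_xz_angle, compute_yz_angle

# Unit direction vector lying in the y-z plane, at 45 degrees to the x-z plane
direction = np.array([0.0, 1.0, 1.0])
direction /= np.linalg.norm(direction)

dummy = np.zeros((1, 3))  # coords and point are ignored by these metrics
angle_xz = compute_xz_angle(dummy, dummy[0], direction)  # expected to be about 45
angle_yz = compute_yz_angle(dummy, dummy[0], direction)  # expected to be about 0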
- pipeline.Preprocessing.line_metrics.fit_line(coords)#
Fit coordinates to a straight line.
- Parameters:
coords (ndarray) – array of coordinates, whose columns are the cartesian coordinates
- Return type:
Tuple[ndarray, ndarray]
- Returns:
A tuple of two one-dimensional arrays:
one point on the straight line,
the direction vector.
Notes
See https://math.stackexchange.com/questions/1611308/best-fit-line-with-3d-points
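Example (a minimal sketch combining fit_line with compute_distance_to_line; the synthetic track below is hypothetical):

import numpy as np
from pipeline.Preprocessing.line_metrics import fit_line, compute_distance_to_line

# Hypothetical noisy hits along a straight 3D track
t = np.linspace(0.0, 1.0, 10)
coords = np.column_stack([1.0 + 2.0 * t, -1.0 + 0.5 * t, 10.0 * t])
coords += np.random.normal(scale=0.01, size=coords.shape)

point, direction = fit_line(coords)

# RMS distance of the hits to the fitted line; small for straight tracks
rms = compute_distance_to_line(coords, point, direction)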
pipeline.Preprocessing.particle_fitting_metrics module#
A module that implements the fitting of particle trajectories to straight lines, in order to evaluate whether the particle tracks are compatible with straight lines.
This can be used, for instance, to keep only straight tracks in the training sample, so as to avoid training on oddly-shaped tracks.
- pipeline.Preprocessing.particle_fitting_metrics.compute_particle_line_metrics(particle_coords, metric_names, line_type='line_3d')#
- Return type:
ndarray
- pipeline.Preprocessing.particle_fitting_metrics.compute_particle_line_metrics_dataframe(hits, metric_names, coord_names=['x', 'y', 'z'], line_type='line_3d', event_id_column='event_id')[source]#
Compute the distances from particle hits to straight lines fitted to those hits. The “distance” actually corresponds to the square root of the average of the squared distances between every hit and the straight line.
- Parameters:
hits (DataFrame) – pandas DataFrame of hits-particles associations, with columns event, particle_id and the cartesian coordinates x, y and z
metric_names (List[str]) – list of the metric names to compute, which can be
distance_to_line
distance_to_z_axis
xz_angle
yz_angle
line_type (str) – type of line fit to use; default is 'line_3d'
event_id_column (str) – name of the event ID column
- Return type:
DataFrame
- Returns:
A pandas DataFrame with index event and particle_id, containing, for every particle that has hits, the metrics specified in metric_names
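Example (a minimal sketch; the hits dataframe and the selection threshold are hypothetical):

from pipeline.Preprocessing.particle_fitting_metrics import (
    compute_particle_line_metrics_dataframe,
)

# hits: hypothetical dataframe with columns event, particle_id, x, y, z
metrics = compute_particle_line_metrics_dataframe(
    hits,
    metric_names=["distance_to_line", "xz_angle"],
    event_id_column="event",
)

# e.g. keep only particles whose hits lie close to a straight line
straight = metrics[metrics["distance_to_line"] < 0.1]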
- pipeline.Preprocessing.particle_fitting_metrics.compute_particle_line_metrics_events_impl(array_metric_values, coords_events_particles, event_ids, particle_ids, metric_names, line_type='line_3d')#
Compute the distances between the particle hits and a straight line fitted to these hit coordinates, for all the events. Fill the pre-allocated array array_metric_values.
- Parameters:
array_metric_values (ndarray) – empty array of the metric values to compute, for every unique particle ID in particle_ids. This way, the memory space is pre-allocated. The array has as many columns as there are metrics to compute.
coords_events_particles (ndarray) – cartesian coordinates of the hits, sorted by event IDs and particle IDs
event_ids (ndarray) – sorted array of event IDs, one for every hit coordinate in coords_events_particles
particle_ids (ndarray) – sorted array of particle IDs grouped by event ID (in event_ids), one for every hit coordinate in coords_events_particles
metric_names (List[str]) – list of the names of the metrics to compute
- pipeline.Preprocessing.particle_fitting_metrics.compute_particle_metric(coords_particles, particle_ids, array_metric_values, metric_names, line_type='line_3d', indices_groupby_particles_=None)#
Compute the distances between the particle hits and a straight line fitted to these hit coordinates. Fill the pre-allocated array array_metric_values.
- Parameters:
coords_particles (ndarray) – array of hit cartesian coordinates, sorted by particle IDs
particle_ids (ndarray) – sorted array of particle IDs, one for every hit whose coordinates are given in coords_particles. There are as many elements in particle_ids as there are hits in coords_particles.
array_metric_values (ndarray) – empty array of the metric values to compute, for every unique particle ID in particle_ids. This way, the memory space is pre-allocated. The array has as many columns as there are metrics to compute.
metric_names (List[str]) – list of the names of the metrics to compute
indices_groupby_particles – array that contains the starting index of every group of values in particle_ids. If not given, it is computed internally.
Notes
This function is essentially called once per event.
pipeline.Preprocessing.poly_metrics module#
Module that implements the fitting of a trajectory to a 2D polynomial.
Adapted from https://gist.github.com/kadereub/9eae9cff356bb62cdbd672931e8e5ec4
- pipeline.Preprocessing.poly_metrics.compute_distance_to_poly(coords, coeffs)#
- Return type:
float
- pipeline.Preprocessing.poly_metrics.compute_quadratic_coeff(coords, coeffs)#
- Return type:
float
- pipeline.Preprocessing.poly_metrics.eval_polynomial(P, x)#
Compute the polynomial P(x), where P is a vector of coefficients with the highest-order coefficient at P[0]. Uses Horner’s method.
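Example (a minimal sketch; the coefficients are hypothetical, and a NumPy array is used in case the implementation is compiled):

import numpy as np
from pipeline.Preprocessing.poly_metrics import eval_polynomial

# P = [2, 0, 1] encodes 2*x**2 + 0*x + 1, highest order first.
# Horner's method evaluates it as (2*x + 0)*x + 1.
value = eval_polynomial(np.array([2.0, 0.0, 1.0]), 3.0)  # 19.0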
- pipeline.Preprocessing.poly_metrics.fit_poly(x, y, deg)#
pipeline.Preprocessing.preprocessing module#
- pipeline.Preprocessing.preprocessing.apply_custom_processing(hits_particles, particles, processing=None)[source]#
Apply custom processing to the dataframes of hits-particles and particles. The custom processing functions are defined in pipeline.Preprocessing.process_custom.
- Parameters:
hits_particles (DataFrame) – dataframe of hits-particles
particles (DataFrame) – dataframe of particles
processing (Union[str, Sequence[str], None]) – name(s) of the processing function(s) to apply to the dataframes, as defined in pipeline.Preprocessing.process_custom
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
Processed dataframes of hits-particles and particles
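Example (a minimal sketch chaining two of the selection functions documented below in pipeline.Preprocessing.process_custom; the input dataframes are hypothetical):

from pipeline.Preprocessing.preprocessing import apply_custom_processing

hits_particles, particles = apply_custom_processing(
    hits_particles,
    particles,
    processing=["only_keep_hits_on_particles", "cut_long_tracks"],
)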
- pipeline.Preprocessing.preprocessing.dump_if_enough_hits(event_hits_particles, event_particles, event_id, output_dir, num_true_hits_threshold=None, pbar=None)[source]#
- Return type:
bool
- pipeline.Preprocessing.preprocessing.enough_true_hits(event_hits_particles, num_true_hits_threshold, event_id_str=None, num_events=None, required_num_events=None)[source]#
Check whether an event has enough true hits to be saved.
- Parameters:
event_hits_particles (DataFrame) – DataFrame of all hits for an event
num_true_hits_threshold (int) – minimum number of true hits required for the event to be saved
event_id_str (Optional[str]) – string representation of the event ID
num_events (Optional[int]) – the current number of saved events
required_num_events (Optional[int]) – the desired number of saved events
- Return type:
bool
- Returns:
True if the event has enough true hits to be saved, False otherwise.
Notes
The function compares the number of true hits in an event with the given threshold. If the number of true hits is 0, the event is discarded as it contains only fake hits. If it is below the threshold, the event is discarded as it does not contain enough true hits. Otherwise, the function returns True and the event is saved.
- pipeline.Preprocessing.preprocessing.preprocess(input_dir, output_dir, hits_particles_filename=None, particles_filename=None, subdirs=None, n_events=-1, processing=None, num_true_hits_threshold=None, hits_particles_columns=None, particles_columns=None, n_workers=1, raise_enough_events=True)[source]#
Preprocess the first n_events events in the input files and save each event in separate parquet files called event{event_id}-hits_particles.parquet and event{event_id}-particles.parquet.
- Parameters:
input_dir (str) – a single input directory if subdirs is None, or the main directory where the sub-directories are
output_dir (str) – the output directory where the parquet files are saved
hits_particles_filename (Optional[str]) – name of the hits-particles file (without the .parquet.lz4 extension). Default is hits_velo.
particles_filename (Optional[str]) – name of the particles file (without the .parquet.lz4 extension). Default is mc_particles.
subdirs (Union[int, str, List[str], None]) –
If subdirs is None, there is a single input directory, input_dir.
If subdirs is a string or a list of strings, they specify the sub-directories with respect to input_dir. If input_dir is None, they are the (list of) input directories directly, which can be useful if the input directories are not at the same location (even though this is discouraged).
If subdirs is an integer, it corresponds to the name of the last sub-directory to consider (i.e., from 0 to subdirs). If subdirs is -1, all the sub-directories are considered as input.
If subdirs is a dictionary, the keys start and stop specify the first and last sub-directories to consider as input.
n_events (int) – number of events to save. For n_workers greater than 1, more events may be produced.
processing (Union[str, List[str], None]) – name(s) of the processing function(s) to apply to the dataframes, as defined in pipeline.Preprocessing.process_custom
num_true_hits_threshold (Optional[int]) – minimum number of true hits required for an event to be saved
hits_particles_columns (Optional[List[str]]) – columns to load for the dataframe of hits and the hits-particles association information
particles_columns (Optional[List[str]]) – columns to load for the dataframe of particles
n_workers (int) – if greater than 1, the input dataframes are all loaded and processed in parallel
raise_enough_events (bool) – whether to raise an error if no events were generated
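Example (a minimal invocation sketch; all paths and values are hypothetical, and the processing name is one of the functions from pipeline.Preprocessing.process_custom):

from pipeline.Preprocessing.preprocessing import preprocess

preprocess(
    input_dir="data/raw",
    output_dir="data/preprocessed",
    subdirs=-1,                        # consider all sub-directories
    n_events=1000,
    processing="remove_curved_particles",
    num_true_hits_threshold=100,
    n_workers=4,
)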
pipeline.Preprocessing.preprocessing_paths module#
Module to handle the output path of the preprocessing.
- pipeline.Preprocessing.preprocessing_paths.get_truncated_paths(input_dir)[source]#
Get the list of the truncated paths in a given preprocessing folder.
- Parameters:
input_dir (str) – directory that contains the preprocessed files
- Return type:
List[str]
- Returns:
List of the paths, truncated so that they contain neither the extension nor the hits_particles or particles suffix
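Example (a short sketch; the directory and the truncated form shown in the comment are illustrative):

from pipeline.Preprocessing.preprocessing_paths import get_truncated_paths

paths = get_truncated_paths("data/preprocessed")
# A truncated path such as "data/preprocessed/event42-" can be expanded back
# into the pair of files by appending "hits_particles.parquet" or
# "particles.parquet"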
- pipeline.Preprocessing.preprocessing_paths.get_truncated_paths_for_partition(path_or_config, partition)[source]#
Get the list of truncated paths for a given partition.
- Parameters:
path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration
partition (str) – dataset partition: train, val, or the name of a test dataset
- Return type:
List[str]
- Returns:
List of the truncated paths of the pre-processed parquet files for this partition.
pipeline.Preprocessing.process_custom module#
- class pipeline.Preprocessing.process_custom.SelectionFunction(*args, **kwargs)[source]#
Bases: Protocol
- pipeline.Preprocessing.process_custom.apply_mask(particles_mask, particles, hits_particles)[source]#
Apply a mask that selects the particles to keep to both the dataframe of particles and the dataframe of hits-particles.
- Parameters:
particles_mask (Series) – the mask to apply, which corresponds to the particles to keep in the dataframe particles
particles (DataFrame) – dataframe of particles
hits_particles (DataFrame) – dataframe of hits-particles association
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
Dataframes of particles and hits_particles that only contain the particles and hits to keep.
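Example (a minimal sketch; the nhits column used to build the mask is a hypothetical example, not a documented column):

from pipeline.Preprocessing.process_custom import apply_mask

# Hypothetical selection: keep particles with at least 3 hits
particles_mask = particles["nhits"] >= 3
particles, hits_particles = apply_mask(particles_mask, particles, hits_particles)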
- pipeline.Preprocessing.process_custom.at_least_7_planes(hits_particles, particles)[source]#
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.compute_n_particles_per_hit(hits_particles, particles)[source]#
Compute the number of particles associated with each hit.
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.compute_n_unique_planes(hits_particles, particles)[source]#
Compute number of unique planes for each particle.
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.cut_long_tracks(hits_particles, particles)[source]#
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.default_old_training_for_rta_presentation(hits_particles, particles)[source]#
Selection that was used in the training presented in the RTA meeting.
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.everything_but_electrons(hits_particles, particles)[source]#
Remove all electrons.
- Parameters:
hits_particles (DataFrame) – dataframe of hits-particles association
particles (DataFrame) – dataframe of particles
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
Dataframes of hits-particles association and particles, filtered so that the electrons are removed.
- pipeline.Preprocessing.process_custom.everything_but_long_electrons(hits_particles, particles)[source]#
Remove long electrons.
- Parameters:
hits_particles (DataFrame) – dataframe of hits-particles association
particles (DataFrame) – dataframe of particles
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
Dataframes of hits-particles association and particles, filtered so that the long electrons are removed.
- pipeline.Preprocessing.process_custom.less_than_3_hits_on_same_plane(hits_particles, particles)[source]#
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.only_keep_hits_on_particles(hits_particles, particles)[source]#
- pipeline.Preprocessing.process_custom.only_long_electrons(hits_particles, particles)[source]#
Only keep long electrons.
- Parameters:
hits_particles (DataFrame) – dataframe of hits-particles association
particles (DataFrame) – dataframe of particles
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
Dataframes of hits-particles association and particles, filtered so that only long electrons are left.
- pipeline.Preprocessing.process_custom.remove_curved_particles(hits_particles, particles)[source]#
Remove curved particles.
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.remove_particle_not_poly_enough(hits_particles, particles, max_distance=70.0)[source]#
- Return type:
Tuple[DataFrame, DataFrame]
- pipeline.Preprocessing.process_custom.remove_particles_too_scattered_on_plane(hits_particles, particles, max_xdiffs=2.5)[source]#
- Return type:
Tuple[DataFrame, DataFrame]
pipeline.Preprocessing.run_preprocessing module#
This module defines how to run the pre-processing from a configuration file.
- pipeline.Preprocessing.run_preprocessing.run_preprocessing(path_or_config, reproduce=True, raise_enough_events=True)[source]#
Run the pre-processing step.
- Parameters:
path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration
reproduce (bool) – whether to reproduce an existing preprocessing
raise_enough_events (bool) – whether to raise an error if no events were generated
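Example (a minimal invocation sketch; the configuration path is hypothetical):

from pipeline.Preprocessing.run_preprocessing import run_preprocessing

# From a YAML configuration file
run_preprocessing("configs/preprocessing.yaml")

# Or directly from an already-loaded configuration dictionary
# run_preprocessing(config, reproduce=False)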
- pipeline.Preprocessing.run_preprocessing.run_preprocessing_test_dataset(test_dataset_name, path_or_config_test=None, detector=None, reproduce=False, raise_enough_events=False)[source]#
Run the pre-processing of a test dataset.
- Parameters:
test_dataset_name (str) – name of the test dataset to pre-process
path_or_config_test (Union[str, dict, None]) – test dataset configuration dictionary, or path to the YAML file that contains it
reproduce (bool) – whether to reproduce an existing preprocessing
raise_enough_events (bool) – whether to raise an error if no events were generated