pipeline.Preprocessing package#

This package handles the preprocessing of the input data: loading events, filtering hits and particles, and saving them as per-event parquet files.

pipeline.Preprocessing.balancing module#

A module that implements a function to balance the dataset using weights.

pipeline.Preprocessing.balancing.compute_balancing_weights(array, nbins=None)[source]#

Compute balancing weights so that, once weighted, the histogram bins of array all have the same size.

Parameters:
  • array (ArrayLike) – array of values of interest, which will be histogrammed

  • nbins (Optional[int]) – number of bins in the histogram

Return type:

ndarray

Returns:

Array of weights for every value in array
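
A minimal sketch of the idea, assuming inverse-bin-count weighting (the actual implementation may differ):

import numpy as np

def compute_balancing_weights_sketch(array, nbins=10):
    """Weight each value by the inverse of its histogram bin count,
    so that the weighted histogram becomes flat."""
    array = np.asarray(array, dtype=float)
    counts, edges = np.histogram(array, bins=nbins)
    # Bin index of every value; clip so the right edge falls into the last bin
    indices = np.clip(np.digitize(array, edges) - 1, 0, nbins - 1)
    weights = 1.0 / counts[indices]
    # Normalise the weights so that they sum to the number of values
    return weights * array.size / weights.sum()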

pipeline.Preprocessing.hit_filtering module#

A module that implements the ability to filter hits, grouping them by particle.

pipeline.Preprocessing.hit_filtering.cut_long_tracks_impl(array_mask, event_ids, particle_ids, track_sizes, proportions, rng)#

Cut long tracks into smaller tracks by removing their first hits.

Parameters:
  • array_mask (ndarray[Any, dtype[bool_]]) – boolean mask array indicating which hits are kept

  • event_ids (ndarray[Any, dtype[int64]]) – array of event IDs

  • particle_ids (ndarray[Any, dtype[int64]]) – array of particle IDs

  • track_sizes (ndarray[Any, dtype[int64]]) – array of track sizes

  • proportions (ndarray[Any, dtype[float64]]) – array of target proportions for each track size, aligned with track_sizes

  • rng (Generator) – random generator used to decide which track is cut to which size

Return type:

None

pipeline.Preprocessing.hit_filtering.mask_long_into_small_tracks(hits_particles, track_size_proportions, seed=None)[source]#

Create a mask to remove the first hits of long tracks to match the proportions of track sizes given as input.

Parameters:
  • hits_particles (DataFrame) – dataframe of hits-particles

  • track_size_proportions – dictionary that maps a track size to its expected proportion after the cut.

  • seed (Optional[int]) – random seed used to decide which track is cut to which size

Return type:

Series

Returns:

Pandas series indexed by event, particle_id and hit_id, which indicates which hits are kept
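
A hedged usage sketch (the proportions and the row alignment between the mask and the dataframe are assumptions):

# Keep half of the tracks at size 5 and half at size 7 (illustrative values)
mask = mask_long_into_small_tracks(hits_particles, {5: 0.5, 7: 0.5}, seed=42)
hits_particles_cut = hits_particles[mask.to_numpy()]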

pipeline.Preprocessing.inputloader module#

A module that defines the input loader, which allows looping over events scattered across different parquet or CSV files.

pipeline.Preprocessing.inputloader.get_indirs(input_dir=None, subdirs=None, condition=None)[source]#

Get the input directories that can be used as input of the preprocessing.

Parameters:
  • input_dir (Optional[str]) – A single input directory if subdirs is None, or the main directory where sub-directories are

  • subdirs (Union[int, str, List[str], Dict[str, int], None]) –

    • If subdirs is None, there is a single input directory, input_dir

    • If subdirs is a string or a list of strings, they specify the sub-directories with respect to input_dir. If input_dir is None, then they are the (list of) input directories directly, which can be useful if the input directories are not at the same location (even though it is discouraged)

    • If subdirs is an integer, it corresponds to the name of the last sub-directory to consider (i.e., from 0 to subdirs). If subdirs is -1, all the sub-directories are considered as input.

    • If subdirs is a dictionary, the keys start and stop specify the first and last sub-directories to consider as input.

Returns:

List of input directories that can be considered.
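
Illustrative calls for each form of subdirs (paths are hypothetical):

get_indirs("/data/main")                                   # ["/data/main"]
get_indirs("/data/main", subdirs=["a", "b"])               # ["/data/main/a", "/data/main/b"]
get_indirs("/data/main", subdirs=2)                        # sub-directories 0, 1 and 2
get_indirs("/data/main", subdirs={"start": 1, "stop": 3})  # sub-directories 1 to 3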

pipeline.Preprocessing.line_metrics module#

pipeline.Preprocessing.line_metrics.compute_angle_between_line_and_plane(line_direction_vector, plane_normal_vector)[source]#

Compute the angle between a line and a plane.

Parameters:
  • line_direction_vector (ndarray) – Normalised direction vector of the straight line

  • plane_normal_vector (ndarray) – Normalised normal vector of the plane

Return type:

float

Returns:

Angle between the line and the plane, in degrees.
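
A minimal sketch of the geometry, assuming both vectors are normalised: the angle to the plane is the complement of the angle to the plane's normal.

import numpy as np

def angle_between_line_and_plane_sketch(line_direction, plane_normal):
    # sin(angle to plane) = |cos(angle to normal)| = |d . n| for unit vectors
    return np.degrees(np.arcsin(np.abs(np.dot(line_direction, plane_normal))))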

pipeline.Preprocessing.line_metrics.compute_distance_to_line(coords, point, direction_vector)#

Compute the root-mean-square distance between coordinates and a straight line.

Parameters:
  • coords (ndarray) – Array of coordinates; the columns are the cartesian coordinates

  • point (ndarray) – one point belonging to the straight line

  • direction_vector (ndarray) – Normalised direction vector of the straight line

Return type:

float

Returns:

Square root of the average of the squared distances from a straight line
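
A minimal sketch of the returned quantity, assuming coords has one row per hit and direction_vector is normalised:

import numpy as np

def rms_distance_to_line_sketch(coords, point, direction_vector):
    diffs = coords - point
    # Subtract the component along the line; the remainder is perpendicular
    residuals = diffs - np.outer(diffs @ direction_vector, direction_vector)
    return np.sqrt(np.mean(np.sum(residuals**2, axis=1)))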

pipeline.Preprocessing.line_metrics.compute_distance_to_z_axis(coords, point, direction_vector)#

Compute the distance of a straight line to the z-axis.

Parameters:
  • coords (ndarray) – Array of coordinates. This is not used.

  • point (ndarray) – one point belonging to the straight line

  • direction_vector (ndarray) – Normalised direction vector of the straight line

Return type:

floating

Returns:

Shortest distance between a line and the z-axis.

pipeline.Preprocessing.line_metrics.compute_distances_between_two_lines(coords1, direction_vector1, coords2, direction_vector2)#

Compute the distance between two lines.

Parameters:
  • coords1 (ndarray) – cartesian coordinates of a point belonging to the first straight line

  • direction_vector1 (ndarray) – normalised direction vector of the first straight line

  • coords2 (ndarray) – cartesian coordinates of a point belonging to the second straight line

  • direction_vector2 (ndarray) – normalised direction vector of the second straight line

Return type:

floating

Returns:

Distance between the two lines.
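
A minimal sketch of the standard skew-line formula, with a parallel-line fallback (assumption: unit direction vectors):

import numpy as np

def distance_between_lines_sketch(p1, d1, p2, d2):
    cross = np.cross(d1, d2)
    norm = np.linalg.norm(cross)
    if norm < 1e-12:
        # Parallel lines: distance from p2 to the first line
        diff = p2 - p1
        return np.linalg.norm(diff - np.dot(diff, d1) * d1)
    # Skew lines: |(p2 - p1) . (d1 x d2)| / |d1 x d2|
    return np.abs(np.dot(p2 - p1, cross)) / norm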

pipeline.Preprocessing.line_metrics.compute_xz_angle(coords, point, direction_vector)[source]#

Compute the angle between a line and the x-z plane.

Parameters:
  • coords (ndarray) – Array of coordinates. This is not used.

  • point (ndarray) – one point belonging to the straight line. This is not used.

  • direction_vector (ndarray) – Normalised direction vector of the straight line

Return type:

float

Returns:

Angle between the line and the x-z plane, in degrees.

pipeline.Preprocessing.line_metrics.compute_yz_angle(coords, point, direction_vector)[source]#

Compute the angle between a line and the y-z plane.

Parameters:
  • coords (ndarray) – Array of coordinates. This is not used.

  • point (ndarray) – one point belonging to the straight line. This is not used.

  • direction_vector (ndarray) – Normalised direction vector of the straight line

Return type:

float

Returns:

Angle between the line and the y-z plane, in degrees.
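
A minimal sketch of both angles, assuming the coordinate order (x, y, z) and a normalised direction vector: the x-z plane has its normal along y, and the y-z plane along x.

import numpy as np

def xz_angle_sketch(direction_vector):
    return np.degrees(np.arcsin(np.abs(direction_vector[1])))  # normal along y

def yz_angle_sketch(direction_vector):
    return np.degrees(np.arcsin(np.abs(direction_vector[0])))  # normal along x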

pipeline.Preprocessing.line_metrics.fit_line(coords)#

Fit coordinates to a straight line.

Parameters:

coords (ndarray) – Array of coordinates; the columns are the cartesian coordinates

Return type:

Tuple[ndarray, ndarray]

Returns:

A tuple of two one-dimensional arrays.

  • One point on the straight line

  • Direction vector

Notes

See https://math.stackexchange.com/questions/1611308/best-fit-line-with-3d-points
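
A minimal sketch of the fit described in the link above: centre the points and take the first right-singular vector as the direction.

import numpy as np

def fit_line_sketch(coords):
    centroid = coords.mean(axis=0)        # a point on the fitted line
    _, _, vh = np.linalg.svd(coords - centroid)
    return centroid, vh[0]                # point, normalised direction vector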

pipeline.Preprocessing.particle_fitting_metrics module#

A module that implements the fitting of particle trajectories to straight lines, in order to evaluate whether the particle tracks are compatible with straight lines.

This can be used, for instance, to keep only straight tracks in the training sample and avoid training on ill-formed tracks.

pipeline.Preprocessing.particle_fitting_metrics.compute_particle_line_metrics(particle_coords, metric_names, line_type='line_3d')#
Return type:

ndarray

pipeline.Preprocessing.particle_fitting_metrics.compute_particle_line_metrics_dataframe(hits, metric_names, coord_names=['x', 'y', 'z'], line_type='line_3d', event_id_column='event_id')[source]#

Compute the pandas DataFrame of the distances from particle hits to the straight lines fitted to these hits. The “distance” actually corresponds to the square root of the average of the squared distances between every hit and the straight line.

Parameters:
  • hits (DataFrame) – Pandas DataFrame of hits-particles associations, with columns event, particle_id and the cartesian coordinates x, y and z.

  • metric_names (List[str]) –

    List of the metric names to compute, which can be

    • distance_to_line

    • distance_to_z_axis

    • xz_angle

    • yz_angle

  • line_type (str) – type of the fitted line. Default is line_3d.

  • event_id_column (str) – name of the event ID column

Return type:

DataFrame

Returns:

A pandas DataFrame with index event and particle_id, containing, for every particle that has hits, the metrics specified in metric_names
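
A hedged usage sketch, assuming hits follows the column layout documented above (the metric names come from the list above; the cut value is illustrative):

metrics = compute_particle_line_metrics_dataframe(
    hits, metric_names=["distance_to_line", "xz_angle"]
)
# e.g. keep only particles whose hits lie close to their fitted line
straight_particles = metrics[metrics["distance_to_line"] < 1.0]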

pipeline.Preprocessing.particle_fitting_metrics.compute_particle_line_metrics_events_impl(array_metric_values, coords_events_particles, event_ids, particle_ids, metric_names, line_type='line_3d')#

Compute the distances between the particle hits and a straight line fitted to these hit coordinates, for all the events. Fill the pre-allocated array array_metric_values.

Parameters:
  • array_metric_values (ndarray) – Empty array for the metric values to compute, one row per unique particle ID in particle_ids; this way, the memory is pre-allocated. The array has as many columns as there are metrics to compute.

  • coords_events_particles (ndarray) – Cartesian coordinates of the hits, sorted by event IDs and particle IDs

  • event_ids (ndarray) – sorted array of event IDs, one for every hit in coords_events_particles

  • particle_ids (ndarray) – sorted array of particle IDs, grouped by event ID (in event_ids), one for every hit in coords_events_particles

  • metric_names (List[str]) – List of the names of the metrics to compute.

pipeline.Preprocessing.particle_fitting_metrics.compute_particle_metric(coords_particles, particle_ids, array_metric_values, metric_names, line_type='line_3d', indices_groupby_particles_=None)#

Compute the distances between the particle hits and a straight line fitted to these hit coordinates. Fill the pre-allocated array array_metric_values.

Parameters:
  • coords_particles (ndarray) – array of hit cartesian coordinates, sorted by particle IDs.

  • particle_ids (ndarray) – Sorted array of particle IDs, one for every hit whose coordinates are given in coords_particles. There are as many elements in particle_ids as there are hits in coords_particles

  • array_metric_values (ndarray) – Empty array for the metric values to compute, one row per unique particle ID in particle_ids; this way, the memory is pre-allocated. The array has as many columns as there are metrics to compute.

  • metric_names (List[str]) – List of the names of the metrics to compute.

  • indices_groupby_particles – Array that contains the starting index of every group of values in particle_ids. If not given, it is computed internally.

Notes

This function is typically called once per event.

pipeline.Preprocessing.poly_metrics module#

Module that implements the fitting of a trajectory to a 2D polynomial.

Adapted from https://gist.github.com/kadereub/9eae9cff356bb62cdbd672931e8e5ec4

pipeline.Preprocessing.poly_metrics.compute_distance_to_poly(coords, coeffs)#
Return type:

float

pipeline.Preprocessing.poly_metrics.compute_quadratic_coeff(coords, coeffs)#
Return type:

float

pipeline.Preprocessing.poly_metrics.eval_polynomial(P, x)#

Compute polynomial P(x) where P is a vector of coefficients, highest order coefficient at P[0]. Uses Horner’s Method.
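
A minimal sketch of Horner's method with the highest-order coefficient first, as described:

def eval_polynomial_sketch(P, x):
    result = 0.0
    for coeff in P:  # P[0] is the highest-order coefficient
        result = result * x + coeff
    return result

# eval_polynomial_sketch([2, -3, 1], 4) == 2*4**2 - 3*4 + 1 == 21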

pipeline.Preprocessing.poly_metrics.fit_poly(x, y, deg)#

pipeline.Preprocessing.preprocessing module#

pipeline.Preprocessing.preprocessing.apply_custom_processing(hits_particles, particles, processing=None)[source]#

Apply custom processing to the dataframe of hits-particles and particles. The custom processing functions are defined in pipeline.Preprocessing.process_custom.

Parameters:
  • hits_particles (DataFrame) – dataframe of hits-particles

  • particles (DataFrame) – dataframe of particles

  • processing (Union[str, Sequence[str], None]) – Name(s) of the processing function(s) to apply to the dataframes. The processing functions are defined in pipeline.Preprocessing.process_custom

Return type:

Tuple[DataFrame, DataFrame]

Returns:

Processed dataframe of hits-particles and particles
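
A hedged usage sketch, chaining two selection names that appear in pipeline.Preprocessing.process_custom below:

hits_particles, particles = apply_custom_processing(
    hits_particles,
    particles,
    processing=["only_keep_hits_on_particles", "cut_long_tracks"],
)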

pipeline.Preprocessing.preprocessing.dump_if_enough_hits(event_hits_particles, event_particles, event_id, output_dir, num_true_hits_threshold=None, pbar=None)[source]#
Return type:

bool

pipeline.Preprocessing.preprocessing.enough_true_hits(event_hits_particles, num_true_hits_threshold, event_id_str=None, num_events=None, required_num_events=None)[source]#

Check whether an event has enough true hits to be saved.

Parameters:
  • event_hits_particles (DataFrame) – DataFrame of all hits for an event.

  • num_true_hits_threshold (int) – Minimum number of true hits required for the event to be saved.

  • event_id_str (Optional[str]) – String representation of the event ID.

  • num_events (Optional[int]) – The current number of saved events.

  • required_num_events (Optional[int]) – The desired number of saved events.

Return type:

bool

Returns:

True if the event has enough true hits to be saved, False otherwise.

Notes

The function compares the number of true hits in the event with the given threshold. If the number of true hits is 0, the event is discarded, as it contains only fake hits. If it is below the threshold, the event is discarded, as it does not contain enough true hits. Otherwise, the function returns True and the event is saved.

pipeline.Preprocessing.preprocessing.preprocess(input_dir, output_dir, hits_particles_filename=None, particles_filename=None, subdirs=None, n_events=-1, processing=None, num_true_hits_threshold=None, hits_particles_columns=None, particles_columns=None, n_workers=1, raise_enough_events=True)[source]#

Preprocess the first n_events events in the input files and save the events in separate parquet files called event{event_id}-hits_particles.parquet and event{event_id}-hits.parquet.

Parameters:
  • input_dir (str) – A single input directory if subdirs is None, or the main directory where sub-directories are

  • output_dir (str) – the output directory where the parquet files are saved.

  • hits_particles_filename (Optional[str]) – Name of the hits-particles file (without the .parquet.lz4 extension). Default is hits_velo.

  • particles_filename (Optional[str]) – Name of the particles file (without the .parquet.lz4 extension). Default is mc_particles.

  • subdirs (Union[int, str, List[str], None]) –

    • If subdirs is None, there is a single input directory, input_dir

    • If subdirs is a string or a list of strings, they specify the sub-directories with respect to input_dir. If input_dir is None, then they are the (list of) input directories directly, which can be useful if the input directories are not at the same location (even though it is discouraged)

    • If subdirs is an integer, it corresponds to the name of the last sub-directory to consider (i.e., from 0 to subdirs). If subdirs is -1, all the sub-directories are considered as input.

    • If subdirs is a dictionary, the keys start and stop specify the first and last sub-directories to consider as input.

  • n_events (int) – Number of events to save. For n_workers higher than 1, more events may be produced.

  • processing (Union[str, List[str], None]) – Name(s) of the processing function(s) to apply to the dataframes. The processing functions are defined in pipeline.Preprocessing.process_custom

  • num_true_hits_threshold (Optional[int]) – Minimum number of true hits required for the event to be saved.

  • hits_particles_columns (Optional[List[str]]) – columns to load for the dataframe of hits and the hits-particles association information

  • particles_columns (Optional[List[str]]) – columns to load for the dataframe of particles

  • n_workers (int) – If greater than 1, the input dataframes are all loaded and processed in parallel.

  • raise_enough_events (bool) – whether to raise an error if no events were generated.
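
A hedged usage sketch with hypothetical paths (all keyword values are illustrative):

preprocess(
    input_dir="/data/raw",
    output_dir="/data/preprocessed",
    subdirs=-1,                          # use all sub-directories
    n_events=1000,
    processing="reconstructible_scifi",  # defined in process_custom
    num_true_hits_threshold=10,
    n_workers=4,
)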

pipeline.Preprocessing.preprocessing_paths module#

Module to handle the output paths of the preprocessing.

pipeline.Preprocessing.preprocessing_paths.get_truncated_paths(input_dir)[source]#

Get the list of the truncated paths in a given preprocessing folder.

Parameters:

input_dir (str) – directory that contains preprocessed files

Return type:

List[str]

Returns:

List of the paths, truncated so that they contain neither the file extension nor the hits_particles or particles suffix
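
A hedged sketch of the truncation, assuming the per-event file naming described for preprocess above (the exact suffix handling is an assumption):

paths = get_truncated_paths("/out")
# e.g. the pair /out/event42-hits_particles.parquet and /out/event42-particles.parquet
# would contribute the single truncated path "/out/event42"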

pipeline.Preprocessing.preprocessing_paths.get_truncated_paths_for_partition(path_or_config, partition)[source]#

Get the list of truncated paths for a given partition.

Parameters:
  • path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration

  • partition (str) – Dataset partition: train, val or name of a test dataset

Return type:

List[str]

Returns:

List of the truncated paths of the pre-processed parquet files for this partition.

pipeline.Preprocessing.process_custom module#

class pipeline.Preprocessing.process_custom.SelectionFunction(*args, **kwargs)[source]#

Bases: Protocol
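
A hedged sketch of the interface this protocol appears to describe, based on the selection functions below (the exact signature is an assumption):

from typing import Protocol, Tuple

import pandas as pd

class SelectionFunctionSketch(Protocol):
    def __call__(
        self, hits_particles: pd.DataFrame, particles: pd.DataFrame
    ) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Filter the hits-particles and particles dataframes."""
        ...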

pipeline.Preprocessing.process_custom.apply_mask(particles_mask, particles, hits_particles)[source]#

Apply a mask of particles to keep to both the dataframe of particles and the dataframe of hits-particles.

Parameters:
  • particles_mask (Series) – The mask to apply, which corresponds to the particles to keep in the dataframe particles

  • particles (DataFrame) – Dataframe of particles

  • hits_particles (DataFrame) – Dataframe of hits-particles association

Return type:

Tuple[DataFrame, DataFrame]

Returns:

Dataframe of particles and hits_particles that only contain the particles and hits to keep.

pipeline.Preprocessing.process_custom.at_least_1_hit_on_scifi(hits_particles, particles)[source]#
pipeline.Preprocessing.process_custom.at_least_7_planes(hits_particles, particles)[source]#
Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.compute_n_particles_per_hit(hits_particles, particles)[source]#

Compute the number of particles associated with each hit.

Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.compute_n_unique_planes(hits_particles, particles)[source]#

Compute number of unique planes for each particle.

Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.cut_long_tracks(hits_particles, particles)[source]#
Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.default_old_training_for_rta_presentation(hits_particles, particles)[source]#

Selection that was used for the training presented at the RTA meeting.

Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.everything_but_electrons(hits_particles, particles)[source]#

Remove electrons.

Parameters:
  • hits_particles (DataFrame) – Dataframe of hits-particles association

  • particles (DataFrame) – Dataframe of particles

Return type:

Tuple[DataFrame, DataFrame]

Returns:

Dataframe of hits-particles association and particles, filtered so that electrons are removed.

pipeline.Preprocessing.process_custom.everything_but_long_electrons(hits_particles, particles)[source]#

Remove long electrons.

Parameters:
  • hits_particles (DataFrame) – Dataframe of hits-particles association

  • particles (DataFrame) – Dataframe of particles

Return type:

Tuple[DataFrame, DataFrame]

Returns:

Dataframe of hits-particles association and particles, filtered so that long electrons are removed.

pipeline.Preprocessing.process_custom.less_than_3_hits_on_same_plane(hits_particles, particles)[source]#
Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.only_keep_hits_on_particles(hits_particles, particles)[source]#
pipeline.Preprocessing.process_custom.only_long_electrons(hits_particles, particles)[source]#

Only keep long electrons.

Parameters:
  • hits_particles (DataFrame) – Dataframe of hits-particles association

  • particles (DataFrame) – Dataframe of particles

Return type:

Tuple[DataFrame, DataFrame]

Returns:

Dataframe of hits-particles association and particles, filtered so that only long electrons are left.

pipeline.Preprocessing.process_custom.reconstructible_scifi(hits_particles, particles)[source]#
pipeline.Preprocessing.process_custom.remove_curved_particles(hits_particles, particles)[source]#

Remove curved particles.

Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.remove_particle_not_poly_enough(hits_particles, particles, max_distance=70.0)[source]#
Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.remove_particles_too_scattered_on_plane(hits_particles, particles, max_xdiffs=2.5)[source]#
Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.track_weighting_selection(hits_particles, particles)[source]#

The selection performed in the track-weighting experiment.

Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.process_custom.triplets_first_selection(hits_particles, particles)[source]#

The selection performed in the triplets-edge experiment.

Return type:

Tuple[DataFrame, DataFrame]

pipeline.Preprocessing.run_preprocessing module#

This module defines how to run the pre-processing from a configuration file.

pipeline.Preprocessing.run_preprocessing.run_preprocessing(path_or_config, reproduce=True, raise_enough_events=True)[source]#

Run the pre-processing step.

Parameters:
  • path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration

  • reproduce (bool) – whether to reproduce an existing preprocessing

  • raise_enough_events (bool) – whether to raise an error if no events were generated.
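
A hedged usage sketch with a hypothetical configuration path:

run_preprocessing("configs/preprocessing.yaml", reproduce=False)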

pipeline.Preprocessing.run_preprocessing.run_preprocessing_test_dataset(test_dataset_name, path_or_config_test=None, detector=None, reproduce=False, raise_enough_events=False)[source]#

Run the pre-processing of a test dataset.

Parameters:
  • test_dataset_name (str) – name of the test dataset to pre-process

  • path_or_config_test (UnionType[str, dict, None]) – test dataset configuration dictionary, or path to the YAML file that contains it

  • reproduce (bool) – whether to reproduce an existing preprocessing

  • raise_enough_events (bool) – whether to raise an error if no events were generated.