pipeline.utils.loaderutils package#

A package that contains utilities to load files for training, testing and validation.

pipeline.utils.loaderutils.dataiterator module#

Implement a general data loader that does not load all the data into memory at once, in order to handle large datasets.

class pipeline.utils.loaderutils.dataiterator.LazyDatasetBase(input_dir, n_events=None, shuffle=False, seed=None, **kwargs)[source]#

Bases: Dataset

fetch_dataset(input_path, map_location='cpu', **kwargs)[source]#

Load and process one PyTorch Dataset.

Parameters:
  • input_path (str) – path to the PyTorch dataset

  • map_location (str) – location onto which the dataset is loaded (passed to torch.load())

  • **kwargs – Other keyword arguments passed to torch.load()

Returns:

The loaded PyTorch data object
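
Examples

A minimal usage sketch; the directory and file names below are illustrative, not part of the package:

    from pipeline.utils.loaderutils.dataiterator import LazyDatasetBase

    dataset = LazyDatasetBase("data/train", n_events=100, shuffle=True, seed=42)
    # Load one stored PyTorch data object onto the CPU; any extra keyword
    # arguments are forwarded to torch.load().
    data = dataset.fetch_dataset("data/train/event_000000001.pt", map_location="cpu")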

pipeline.utils.loaderutils.pathandling module#

Utilities to handle datasets without loading them.

pipeline.utils.loaderutils.pathandling.get_input_paths(input_dir, n_events=None, shuffle=False, seed=None)[source]#

Get the paths of the datasets located in a given directory.

Parameters:
  • input_dir (str) – input directory

  • n_events (Optional[int]) – number of events to load

  • shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)

  • seed (Optional[int]) – seed for the shuffling

Return type:

List[str]

Returns:

List of paths to the PyTorch Data objects
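
Examples

A sketch of the documented behavior; the .pt extension and the sorted glob are assumptions, while the shuffle-before-truncation order follows the parameter description above:

    import glob
    import os
    import random
    from typing import List, Optional

    def get_input_paths_sketch(
        input_dir: str,
        n_events: Optional[int] = None,
        shuffle: bool = False,
        seed: Optional[int] = None,
    ) -> List[str]:
        # Collect the stored datasets (the file extension is an assumption).
        paths = sorted(glob.glob(os.path.join(input_dir, "*.pt")))
        if shuffle:
            # Shuffle with the given seed before truncating, as documented.
            random.Random(seed).shuffle(paths)
        if n_events is not None:
            paths = paths[:n_events]
        return paths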

pipeline.utils.loaderutils.preprocessing module#

A module that defines utilities used to handle the pandas DataFrames loaded from CSV-like files.

pipeline.utils.loaderutils.preprocessing.cast_boolean_columns(particles)[source]#

Cast the appropriate columns of the particles dataframe to boolean, in place.

Parameters:

particles (DataFrame) – dataframe of particles
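
Examples

A minimal sketch of the in-place cast; the actual set of boolean columns is not documented, so the names below are placeholders:

    import pandas as pd

    def cast_boolean_columns_sketch(particles: pd.DataFrame) -> None:
        boolean_columns = ["has_velo", "has_ut"]  # placeholder column names
        for column in boolean_columns:
            if column in particles.columns:
                # Assigning the cast column back mutates the caller's dataframe.
                particles[column] = particles[column].astype(bool)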

pipeline.utils.loaderutils.preprocessing.combine_run_event_into_event_id(dataframe)[source]#

Combine the run and event numbers of the dataframe into a single event_id column.
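
Examples

A plausible sketch; the run and event column names are assumptions, and the formula mirrors the use_run_number description of load_dataframes() below:

    import pandas as pd

    def combine_run_event_into_event_id_sketch(dataframe: pd.DataFrame) -> None:
        # event_id = event + (10**9) * run, per the load_dataframes() docs.
        dataframe["event_id"] = dataframe["event"] + 10**9 * dataframe["run"]
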
pipeline.utils.loaderutils.preprocessing.load_dataframes(indir, hits_particles_filename=None, particles_filename=None, hits_particles_columns=None, particles_columns=None, use_run_number=True, **kwargs)[source]#

Load the dataframes of hits_particles and particles that are stored in a folder. This function is also used in the validation step.

Parameters:
  • indir (str) – directory where the dataframes are saved

  • hits_particles_filename (Optional[str]) – name of the hits-particles file (without the .parquet.lz4 extension). Defaults to hits_velo.

  • particles_filename (Optional[str]) – name of the particles file (without the .parquet.lz4 extension). Defaults to mc_particles.

  • hits_particles_columns (Optional[List[str]]) – columns to load for the dataframe of hits and the hits-particles association information

  • particles_columns (Optional[List[str]]) – columns to load for the dataframe of particles

  • use_run_number (bool) – whether to define the event ID (event_id column) as event + (10**9) * run instead of just event.

  • **kwargs – other keyword arguments passed to the function that loads the files

Return type:

Tuple[DataFrame, DataFrame]

Returns:

A 2-tuple containing the dataframe of hits-particles and the dataframe of particles

Notes

The function also defines the column particle_id = mcid + 1 in both dataframes.
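
Examples

A usage sketch; the directory name is illustrative, and the default file names hits_velo and mc_particles come from the parameter descriptions above:

    from pipeline.utils.loaderutils.preprocessing import load_dataframes

    # By default this reads {indir}/hits_velo.parquet.lz4 and
    # {indir}/mc_particles.parquet.lz4.
    hits_particles, particles = load_dataframes(
        "data/dataframes",    # illustrative directory
        use_run_number=True,  # event_id = event + (10**9) * run
    )
    # Both frames carry particle_id = mcid + 1, per the Notes above.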

pipeline.utils.loaderutils.preprocessing.load_preprocessed_dataframes(truncated_paths, ending, **kwargs)[source]#

Load dataframes stored in parquet files whose paths have the form {truncated_path}{ending}.parquet, where the truncated path ends with 9 digits corresponding to the event ID.

Parameters:
  • truncated_paths (List[str]) – list of truncated paths, without the ending and the .parquet extension

  • ending (str) – ending of the file name, excluding the .parquet extension

  • **kwargs – passed to pandas.read_parquet()

Return type:

DataFrame

Returns:

Concatenated dataframe, with the event_id column added.
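
Examples

A sketch of the documented path convention; concatenating the per-event frames with pandas.concat is an assumption:

    from typing import List

    import pandas as pd

    def load_preprocessed_dataframes_sketch(
        truncated_paths: List[str], ending: str, **kwargs
    ) -> pd.DataFrame:
        dataframes = []
        for truncated_path in truncated_paths:
            # Paths have the form {truncated_path}{ending}.parquet.
            df = pd.read_parquet(f"{truncated_path}{ending}.parquet", **kwargs)
            # The truncated path ends with 9 digits encoding the event ID.
            df["event_id"] = int(truncated_path[-9:])
            dataframes.append(df)
        return pd.concat(dataframes, ignore_index=True)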

pipeline.utils.loaderutils.tracks module#

pipeline.utils.loaderutils.tracks.get_tracks_from_batch(batch)[source]#

Extract the dataframe of tracks from a loaded batch of data objects.

Return type:

DataFrame

pipeline.utils.loaderutils.tracks.get_tracks_input_directory(path_or_config, partition, suffix=None)[source]#

Get the input directory where the tracks are stored for the given partition.

Return type:

str

pipeline.utils.loaderutils.tracks.load_tracks(input_dir)[source]#

Load the reconstructed tracks from the stored graph objects.

Parameters:

input_dir (str) – input directory where the PyTorch Data objects containing the reconstructed tracks are saved.

Return type:

DataFrame

Returns:

Dataframe with columns event_id, hit_id, track_id, for all the events in input_dir.
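
Examples

One plausible composition of the utilities in this package; whether load_tracks() is implemented exactly this way is an assumption:

    import pandas as pd

    from pipeline.utils.loaderutils.pathandling import get_input_paths
    from pipeline.utils.loaderutils.tracks import load_tracks_event

    def load_tracks_sketch(input_dir: str) -> pd.DataFrame:
        # Concatenate the per-event (event_id, hit_id, track_id) dataframes.
        paths = get_input_paths(input_dir)
        return pd.concat(
            [load_tracks_event(path) for path in paths], ignore_index=True
        )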

pipeline.utils.loaderutils.tracks.load_tracks_event(input_path)[source]#

Load the dataframe of tracks produced by the track-building step.

Parameters:

input_path (str) – Path to the PyTorch Geometric data pickle file that contains the graph together with the reconstructed tracks

Return type:

DataFrame

Returns:

Dataframe with columns event_id, hit_id, track_id
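
Examples

A minimal sketch; the attribute names on the loaded Data object are assumptions, since only the output columns are documented:

    import pandas as pd
    import torch

    def load_tracks_event_sketch(input_path: str) -> pd.DataFrame:
        # The pickle file holds the graph together with the reconstructed tracks.
        event = torch.load(input_path, map_location="cpu")
        return pd.DataFrame(
            {
                "event_id": event.event_id,  # assumed attribute names
                "hit_id": event.hit_id.numpy(),
                "track_id": event.track_id.numpy(),
            }
        )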

pipeline.utils.loaderutils.tracks.load_tracks_preprocessed_dataframes_given_partition(path_or_config, partition, suffix='')[source]#
Return type:

Tuple[DataFrame, DataFrame, DataFrame]