pipeline.utils.loaderutils package#
A package that contains utilities to load files for training, testing, and validation.
pipeline.utils.loaderutils.dataiterator module#
Implement a general data loader that does not load all the data into memory, in order to deal with large datasets.
- class pipeline.utils.loaderutils.dataiterator.LazyDatasetBase(input_dir, n_events=None, shuffle=False, seed=None, **kwargs)[source]#
Bases: Dataset
- fetch_dataset(input_path, map_location='cpu', **kwargs)[source]#
Load and process one PyTorch dataset.
- Parameters:
  - input_path (str) – path to the PyTorch dataset
  - map_location (str) – location where to load the dataset
  - **kwargs – other keyword arguments passed to torch.load()
- Returns: Loaded PyTorch data object
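For illustration, a minimal sketch of how fetch_dataset can back a lazy dataset; the torch.load() call and the map_location forwarding follow the documentation above, while the surrounding class (its name and indexing scheme) is an assumption, not the library's implementation:

```python
import torch
from torch.utils.data import Dataset


class LazyDatasetSketch(Dataset):
    """Illustrative stand-in for LazyDatasetBase: keeps only file paths
    in memory and loads one event from disk per access."""

    def __init__(self, input_paths):
        self.input_paths = input_paths  # paths to saved PyTorch Data objects

    def fetch_dataset(self, input_path, map_location="cpu", **kwargs):
        # Extra keyword arguments are forwarded to torch.load(),
        # as documented above.
        return torch.load(input_path, map_location=map_location, **kwargs)

    def __len__(self):
        return len(self.input_paths)

    def __getitem__(self, idx):
        return self.fetch_dataset(self.input_paths[idx])
```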
pipeline.utils.loaderutils.pathandling module#
Utilities to handle datasets without loading them.
- pipeline.utils.loaderutils.pathandling.get_input_paths(input_dir, n_events=None, shuffle=False, seed=None)[source]#
Get the paths of the datasets located in a given directory.
- Parameters:
  - input_dir (str) – input directory
  - n_events (Optional[int]) – number of events to load
  - shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed (Optional[int]) – seed for the shuffling
- Return type: List[str]
- Returns: List of paths to the PyTorch Data objects
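A hedged sketch of the behaviour documented above: gather the paths, shuffle, then truncate to the first n_events. The *.pt file pattern is an assumption:

```python
import glob
import os
import random
from typing import List, Optional


def get_input_paths_sketch(
    input_dir: str,
    n_events: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> List[str]:
    # '*.pt' is an assumed extension for the saved PyTorch Data objects.
    paths = sorted(glob.glob(os.path.join(input_dir, "*.pt")))
    if shuffle:
        # Shuffle first, then truncate, matching the documented order.
        random.Random(seed).shuffle(paths)
    return paths if n_events is None else paths[:n_events]
```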
pipeline.utils.loaderutils.preprocessing module#
A module that defines utilities used to handle the pandas DataFrames loaded from CSV-like files.
- pipeline.utils.loaderutils.preprocessing.cast_boolean_columns(particles)[source]#
Cast the columns of the particles dataframe as boolean columns. In-place.
- Parameters:
  - particles (DataFrame) – dataframe of particles
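A small hedged example of the in-place cast; the column names are hypothetical, since the docstring does not list which columns are affected:

```python
import pandas as pd

# 'has_velo' and 'from_signal' are hypothetical boolean-like columns.
particles = pd.DataFrame({"has_velo": [0, 1, 1], "from_signal": [1, 0, 1]})

for column in ["has_velo", "from_signal"]:
    # Reassigning each column as bool updates the dataframe object
    # itself, mirroring the in-place behaviour documented above.
    particles[column] = particles[column].astype(bool)
```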
- pipeline.utils.loaderutils.preprocessing.load_dataframes(indir, hits_particles_filename=None, particles_filename=None, hits_particles_columns=None, particles_columns=None, use_run_number=True, **kwargs)[source]#
Load the dataframes of hits_particles and particles that are stored in a folder. This function is also used in the validation step.
- Parameters:
  - indir (str) – directory where the dataframes are saved
  - hits_particles_filename (Optional[str]) – name of the hits-particles file (without the .parquet.lz4 extension). Default is hits_velo.
  - particles_filename (Optional[str]) – name of the particles file (without the .parquet.lz4 extension). Default is mc_particles.
  - hits_particles_columns (Optional[List[str]]) – columns to load for the dataframe of hits and the hits-particles association information
  - particles_columns (Optional[List[str]]) – columns to load for the dataframe of particles
  - use_run_number (bool) – whether to define the event ID (event_id column) as event + (10**9) * run instead of just event
  - **kwargs – other keyword arguments passed to the function that loads the files
- Return type: Tuple[DataFrame, DataFrame]
- Returns: A 2-tuple containing the dataframe of hits-particles and the dataframe of particles
Notes
The function also defines the column particle_id = mcid + 1 in both dataframes.
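A hedged usage example; the keyword values rely on the documented defaults and the event-ID note above, while the directory and column selections are illustrative only:

```python
from pipeline.utils.loaderutils.preprocessing import load_dataframes

# 'data/events' is a hypothetical directory; the column lists are
# illustrative, not a required set.
hits_particles, particles = load_dataframes(
    indir="data/events",
    hits_particles_columns=["hit_id", "mcid"],
    particles_columns=["mcid"],
    use_run_number=True,  # event_id = event + (10**9) * run
)
```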
- pipeline.utils.loaderutils.preprocessing.load_preprocessed_dataframes(truncated_paths, ending, **kwargs)[source]#
Load dataframes stored in parquet files, whose paths have the form {truncated_path}{ending}.parquet, where the truncated path ends with 9 digits corresponding to the event ID.
- Parameters:
  - truncated_paths (List[str]) – list of truncated paths, without ending and the .parquet extension
  - ending (str) – ending of the file, excluding the .parquet extension
  - **kwargs – passed to pandas.read_parquet()
- Return type: DataFrame
- Returns: Dataframe, where the event_id column was also added
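The path convention above implies a reconstruction along these lines; the 9-digit event-ID extraction follows the docstring, the rest is a sketch rather than the library's implementation:

```python
from typing import List

import pandas as pd


def load_preprocessed_dataframes_sketch(
    truncated_paths: List[str], ending: str, **kwargs
) -> pd.DataFrame:
    frames = []
    for truncated_path in truncated_paths:
        frame = pd.read_parquet(f"{truncated_path}{ending}.parquet", **kwargs)
        # The truncated path ends with 9 digits giving the event ID,
        # as documented above.
        frame["event_id"] = int(truncated_path[-9:])
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```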
pipeline.utils.loaderutils.tracks module#
- pipeline.utils.loaderutils.tracks.get_tracks_input_directory(path_or_config, partition, suffix=None)[source]#
Get the input directory where the tracks are stored for the given partition.
- Return type: str
- pipeline.utils.loaderutils.tracks.load_tracks(input_dir)[source]#
Load the tracks from graphs.
- Parameters:
  - input_dir (str) – input directory where the PyTorch Data objects that contain the reconstructed tracks are saved
- Return type: DataFrame
- Returns: Dataframe with columns event_id, hit_id, track_id, for all the events in input_dir
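A hedged sketch of how this directory-level loader can be expressed in terms of load_tracks_event below; the *.pt file pattern is an assumption:

```python
import glob
import os

import pandas as pd

from pipeline.utils.loaderutils.tracks import load_tracks_event


def load_tracks_sketch(input_dir: str) -> pd.DataFrame:
    # '*.pt' is an assumed extension for the saved PyTorch Data objects.
    paths = sorted(glob.glob(os.path.join(input_dir, "*.pt")))
    # Concatenate the per-event dataframes into a single dataframe
    # with columns event_id, hit_id, track_id.
    return pd.concat(
        [load_tracks_event(path) for path in paths], ignore_index=True
    )
```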
- pipeline.utils.loaderutils.tracks.load_tracks_event(input_path)[source]#
Load the dataframe of tracks produced by the track-building step.
- Parameters:
  - input_path (str) – path to the PyTorch Geometric data pickle file that contains the graph together with the reconstructed tracks
- Return type: DataFrame
- Returns: Dataframe with columns event_id, hit_id, track_id
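A minimal sketch of such a loader, assuming the pickled Data object exposes hit_id and track_id tensors and a scalar event_id; only the output columns are fixed by the documentation above, so the attribute names are assumptions:

```python
import pandas as pd
import torch


def load_tracks_event_sketch(input_path: str) -> pd.DataFrame:
    # The graph plus reconstructed tracks is stored as a pickled
    # PyTorch (Geometric) object, per the docstring above.
    data = torch.load(input_path, map_location="cpu")
    # Attribute names are assumptions; the real object may store the
    # hit-to-track assignment differently.
    return pd.DataFrame(
        {
            "event_id": data.event_id,      # scalar, broadcast by pandas
            "hit_id": data.hit_id.numpy(),
            "track_id": data.track_id.numpy(),
        }
    )
```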