pipeline.utils.loaderutils package#
A package that contains utilities to load files for training, validation and testing.
pipeline.utils.loaderutils.dataiterator module#
Implement a general data loader that does not load all the data into memory, in order to deal with large datasets.
- class pipeline.utils.loaderutils.dataiterator.LazyDatasetBase(input_dir, n_events=None, shuffle=False, seed=None, **kwargs)[source]#
Bases: Dataset
- fetch_dataset(input_path, map_location='cpu', **kwargs)[source]#
Load and process one PyTorch Dataset.
- Parameters:
  - input_path (str) – path to the PyTorch dataset
  - map_location (str) – location where to load the dataset
  - **kwargs – other keyword arguments passed to torch.load()
- Returns:
  The loaded PyTorch data object
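A minimal usage sketch (the directory and file names below are placeholders; the constructor and fetch_dataset() arguments follow the signatures documented above):

```python
from pipeline.utils.loaderutils.dataiterator import LazyDatasetBase

# Build a lazy dataset over a directory of saved PyTorch data objects,
# keeping at most 100 events chosen after a seeded shuffle.
dataset = LazyDatasetBase("data/train", n_events=100, shuffle=True, seed=42)

# Load a single event file; extra keyword arguments are forwarded to
# torch.load(). The file name and extension are placeholders.
event = dataset.fetch_dataset("data/train/event_000000001.pyg", map_location="cpu")
```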
pipeline.utils.loaderutils.pathandling module#
Utilities to handle datasets without loading them.
- pipeline.utils.loaderutils.pathandling.get_input_paths(input_dir, n_events=None, shuffle=False, seed=None)[source]#
Get the paths of the datasets located in a given directory.
- Parameters:
  - input_dir (str) – input directory
  - n_events (Optional[int]) – number of events to load
  - shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed (Optional[int]) – seed for the shuffling
  - **kwargs – other keyword arguments passed to ModelBase.fetch_dataset()
- Return type:
  List[str]
- Returns:
  List of paths to the PyTorch Data objects
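A minimal usage sketch (the directory name is a placeholder):

```python
from pipeline.utils.loaderutils.pathandling import get_input_paths

# Collect at most 100 dataset paths, shuffled reproducibly before the selection.
paths = get_input_paths("data/train", n_events=100, shuffle=True, seed=42)
print(len(paths), paths[:3])
```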
pipeline.utils.loaderutils.preprocessing module#
A module that defines utilities used to handle the pandas DataFrames loaded from CSV-like files.
- pipeline.utils.loaderutils.preprocessing.cast_boolean_columns(particles)[source]#
Cast the columns of the particles dataframe as boolean columns. In-place.
- Parameters:
  - particles (DataFrame) – dataframe of particles
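A minimal usage sketch, assuming the particles dataframe was obtained from load_dataframes() (documented below); the directory is a placeholder and the set of columns that get cast is defined inside the pipeline:

```python
from pipeline.utils.loaderutils.preprocessing import cast_boolean_columns, load_dataframes

# Placeholder input directory; load_dataframes() returns (hits_particles, particles).
hits_particles, particles = load_dataframes("data/csv/dump")

# The cast is applied in place, so the call is made for its side effect only.
cast_boolean_columns(particles)
```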
- pipeline.utils.loaderutils.preprocessing.load_dataframes(indir, hits_particles_filename=None, particles_filename=None, hits_particles_columns=None, particles_columns=None, use_run_number=True, **kwargs)[source]#
Load the dataframes of hits_particles and particles that are stored in a folder. This function is also used in the validation step.
- Parameters:
  - indir (str) – directory where the dataframes are saved
  - hits_particles_filename (Optional[str]) – name of the hits-particles file (without the .parquet.lz4 extension). Default is hits_velo.
  - particles_filename (Optional[str]) – name of the particles file (without the .parquet.lz4 extension). Default is mc_particles.
  - hits_particles_columns (Optional[List[str]]) – columns to load for the dataframe of hits and the hits-particles association information
  - particles_columns (Optional[List[str]]) – columns to load for the dataframe of particles
  - use_run_number – whether to define the event ID (event_id column) as event + (10**9) * run instead of just event
  - **kwargs – other keyword arguments passed to the function that loads the files
- Return type:
  Tuple[DataFrame, DataFrame]
- Returns:
  A 2-tuple containing the dataframe of hits-particles and the dataframe of particles
Notes
The function also defines the column particle_id = mcid + 1 in both dataframes.
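A minimal usage sketch (the directory and the column selections are placeholders; the default file names hits_velo and mc_particles come from the parameter descriptions above):

```python
from pipeline.utils.loaderutils.preprocessing import load_dataframes

# Load the two dataframes, restricting the columns read from disk.
hits_particles, particles = load_dataframes(
    "data/csv/dump",                            # placeholder directory
    hits_particles_columns=["hit_id", "mcid"],  # placeholder column selection
    particles_columns=["mcid"],                 # placeholder column selection
    use_run_number=True,
)
```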
- pipeline.utils.loaderutils.preprocessing.load_preprocessed_dataframes(truncated_paths, ending, **kwargs)[source]#
Load dataframes stored in parquet files, whose paths have the form {truncated_path}{ending}.parquet, where the truncated path ends with 9 digits corresponding to the event ID.
- Parameters:
  - truncated_paths (List[str]) – list of truncated paths, without ending and the .parquet extension
  - ending (str) – ending of the file, excluding the .parquet extension
  - **kwargs – passed to pandas.read_parquet()
- Return type:
  DataFrame
- Returns:
  Dataframe, to which the event_id column was also added
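A minimal usage sketch (the paths and the ending are placeholders; extra keyword arguments are forwarded to pandas.read_parquet()):

```python
from pipeline.utils.loaderutils.preprocessing import load_preprocessed_dataframes

# Truncated paths end with the 9-digit event ID; the full file names are
# reconstructed as {truncated_path}{ending}.parquet.
truncated_paths = [
    "data/preprocessed/000000001",
    "data/preprocessed/000000002",
]
df = load_preprocessed_dataframes(truncated_paths, ending="_particles")
```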
pipeline.utils.loaderutils.tracks module#
- pipeline.utils.loaderutils.tracks.get_tracks_input_directory(path_or_config, partition, suffix=None)[source]#
Get the input directory where the tracks are stored for the given partition.
- Return type:
str
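A minimal usage sketch (the configuration path, partition name and suffix are placeholders):

```python
from pipeline.utils.loaderutils.tracks import get_tracks_input_directory

# Resolve the directory that stores the reconstructed tracks for the "test" partition.
tracks_dir = get_tracks_input_directory("configs/pipeline.yaml", "test", suffix=None)
```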
- pipeline.utils.loaderutils.tracks.load_tracks(input_dir)[source]#
Load the tracks from graphs.
- Parameters:
  - input_dir – input directory containing the saved PyTorch Data objects with the reconstructed tracks
- Return type:
  DataFrame
- Returns:
  Dataframe with columns event_id, hit_id, track_id, for all the events in input_dir.
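A minimal usage sketch (the directory name is a placeholder):

```python
from pipeline.utils.loaderutils.tracks import load_tracks

# Load the reconstructed tracks of every event saved in the directory.
tracks = load_tracks("output/track_building/test")
print(tracks[["event_id", "hit_id", "track_id"]].head())
```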
- pipeline.utils.loaderutils.tracks.load_tracks_event(input_path)[source]#
Load the dataframe of tracks produced by the track-building step.
- Parameters:
  - input_path (str) – path to the PyTorch Geometric data pickle file that contains the graph together with the reconstructed tracks
- Return type:
  DataFrame
- Returns:
  Dataframe with columns event_id, hit_id, track_id
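A minimal usage sketch (the file path and extension are placeholders):

```python
from pipeline.utils.loaderutils.tracks import load_tracks_event

# Load the tracks of a single event from its saved PyTorch Geometric data file.
event_tracks = load_tracks_event("output/track_building/test/event_000000001.pyg")
```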