pipeline.utils.commonutils package#

A small package that defines some helper functions used for different steps of the machine-learning pipeline.

pipeline.utils.commonutils.cdetector module#

A module to handle different detectors.

pipeline.utils.commonutils.cdetector.get_coordinate_names(detector)[source]#

Get the list of coordinate names for the given detector.

Parameters:

detector (str) – Detector name

Return type:

List[str]

Returns:

List of coordinate names
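
A minimal usage sketch; the detector name below is hypothetical and the available names depend on the common configuration:

    from pipeline.utils.commonutils.cdetector import get_coordinate_names

    # Hypothetical detector name; use one listed in the common configuration.
    coordinate_names = get_coordinate_names("my_detector")
    print(coordinate_names)  # e.g. a list of coordinate names such as ["x", "y", "z"]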

pipeline.utils.commonutils.cfeatures module#

A module that defines common utilities for data handling.

pipeline.utils.commonutils.cfeatures.get_input_features(all_features, feature_indices)[source]#

Extract the features used for training from the full array of features (e.g. the x attribute of a PyTorch Geometric data object).

Parameters:
  • all_features – array containing all the features

  • feature_indices (Union[List[int], int, None]) – if it is an integer, it corresponds to the number of features to include in the array of features. If it is a list of integers, it corresponds to the indices of the features to include from all_features

Return type:

Tensor

Returns:

Array of features
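
The two behaviours of feature_indices described above can be illustrated with a small sketch on a plain tensor; the interpretation of the integer case as "the first n features" is an assumption:

    import torch

    from pipeline.utils.commonutils.cfeatures import get_input_features

    all_features = torch.randn(10, 5)  # 10 nodes, 5 features

    # An integer gives the number of features to include (presumably the first n) ...
    first_three = get_input_features(all_features, 3)       # expected shape: (10, 3)

    # ... while a list of integers selects features by index.
    selected = get_input_features(all_features, [0, 2, 4])  # expected shape: (10, 3)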

pipeline.utils.commonutils.cfeatures.get_number_input_features(feature_indices)[source]#

Get the number of input features.

Parameters:

feature_indices (Union[int, List[int]]) – if it is an integer, it corresponds to the number of features to include in the array of features. If it is a list of integers, it corresponds to the indices of the features to include from batch.x

Return type:

int

Returns:

Number of input features
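
Following the parameter description above, a sketch of the presumed behaviour:

    from pipeline.utils.commonutils.cfeatures import get_number_input_features

    # An integer is already the number of features ...
    assert get_number_input_features(4) == 4
    # ... while a list of indices presumably yields its length.
    assert get_number_input_features([0, 2, 4]) == 3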

pipeline.utils.commonutils.cfeatures.get_unnormalised_features(batch, path_or_config, feature_names)[source]#

Get the unnormalised features from the PyTorch Geometric data object, according to the configuration.

Parameters:
  • batch (Data) – PyTorch Geometric data object containing the x attribute, which corresponds to the array of features

  • path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration

  • feature_names (List[str]) – list of the names of the features to extract the unnormalised values of

Return type:

List[Tensor]

Returns:

List of PyTorch tensors, corresponding to the arrays of values of the features whose names are given by feature_names
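
A usage sketch; the configuration path and the feature names are hypothetical and must match the pipeline configuration:

    import torch
    from torch_geometric.data import Data

    from pipeline.utils.commonutils.cfeatures import get_unnormalised_features

    # A minimal Data object standing in for a real preprocessed batch.
    batch = Data(x=torch.randn(10, 3))

    unnormalised = get_unnormalised_features(
        batch,
        "configs/pipeline.yaml",        # hypothetical configuration path
        feature_names=["x", "y", "z"],  # hypothetical feature names
    )
    assert len(unnormalised) == 3  # one tensor per requested feature name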

pipeline.utils.commonutils.config module#

A module that helps to handle the YAML configuration.

class pipeline.utils.commonutils.config.CommonDirs[source]#

Bases: object

A class that handles the common configuration in setup/common_config.yaml.

property common_config: Dict[str, Any]#

Common configuration dictionary, in setup/common_config.yaml.

property detectors: List[str]#

List of available detectors.

get_filenames_from_detector(detector)[source]#

Get the .parquet filenames for a given detector.

Return type:

Dict[str, str]

property repository: str#

Path to the repository.

property test_config_path#

Path to the test configuration file.
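
A sketch of how the class can be used, assuming setup/common_config.yaml is present and the constructor takes no arguments (as suggested by the signature above):

    from pipeline.utils.commonutils.config import CommonDirs

    dirs = CommonDirs()
    print(dirs.repository)   # path to the repository
    print(dirs.detectors)    # list of available detectors

    # Parquet filenames associated with one of the available detectors.
    filenames = dirs.get_filenames_from_detector(dirs.detectors[0])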

class pipeline.utils.commonutils.config.PipelineConfig(path_or_config, common_config=None, dir_path=None)[source]#

Bases: MutableMapping

add_config(path_or_config, steps=None)[source]#

Add a configuration to the current configuration.

Parameters:
  • path_or_config – configuration dictionary, or path to the configuration to add

  • steps (Optional[Sequence[str]]) – list of steps to add from this configuration. If not specified, all the steps are added.

Raises:

ValueError – raised if a step that already exists in the current configuration is added.

Return type:

None

property common_config: Dict[str, Any]#

Common configuration dictionary

property data_experiment_dir: str#

Path to the directory that contains all the data of the given experiment.

property detector: str#

Detector the pipeline is applied to.

dict()[source]#

Turn the experiment configuration dictionary into a regular dictionary of dictionaries.

Return type:

Dict[str, Dict[str, Any]]

property dir_path: str | None#

Path to the directory with respect to which the input paths are expressed.

property experiment_name: str#

Name of the experiment

get_test_batch_dir(step, test_dataset_name)[source]#
Return type:

str

property performance_dir: str#

Directory where performance metric plots and reports are saved.

property required_test_dataset_names: List[str]#
property steps: List[str]#
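
Since PipelineConfig is a MutableMapping, it can be used like a dictionary of step configurations. A sketch with a hypothetical configuration path and step name:

    from pipeline.utils.commonutils.config import PipelineConfig

    config = PipelineConfig("configs/pipeline.yaml")  # hypothetical path

    print(config.experiment_name)  # name of the experiment
    print(config.detector)         # detector the pipeline is applied to
    print(config.steps)            # list of configured steps

    # Mapping-style access to the configuration of a single step
    # ("preprocessing" is a hypothetical step name).
    preprocessing_config = config["preprocessing"]

    # Plain dictionary of dictionaries, e.g. for serialisation.
    as_dict = config.dict()
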
pipeline.utils.commonutils.config.get_detector_from_experiment_name(experiment_name)[source]#

Get the detector of an experiment.

Parameters:

experiment_name (str) – Name of an experiment

Return type:

str

Returns:

Detector used in this experiment

pipeline.utils.commonutils.config.get_detector_from_pipeline_config(path_or_config)[source]#
Return type:

str

pipeline.utils.commonutils.config.get_performance_directory_experiment(path_or_config)[source]#

Helper function to get the directory in which to save plots and reports of performance metrics.

Parameters:

path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration

Return type:

str

Returns:

Path to the directory where performance metric plots and reports are saved.

pipeline.utils.commonutils.config.get_pipeline_config_path(experiment_name)[source]#

Get the path to the pipeline config YAML file.

Parameters:

experiment_name (str) – name of the experiment

Return type:

str

Returns:

Path where the YAML file that contains the configuration of experiment_name is stored.

pipeline.utils.commonutils.config.load_config(path_or_config)[source]#

Load the configuration if it has not been loaded already.

It also replaces input_subdirectory with input_dir and output_subdirectory with output_dir in the loaded configuration. For this reason, always load the configuration using this function.

Return type:

Dict[str, Dict[str, Any]]
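
A minimal sketch of loading a pipeline configuration through this function; the path is hypothetical:

    from pipeline.utils.commonutils.config import load_config

    # Load from a YAML path (hypothetical); the function also accepts an
    # already-loaded configuration dictionary.
    config = load_config("configs/pipeline.yaml")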

pipeline.utils.commonutils.config.load_dict(path_or_config)[source]#

Load the dictionary stored in a YAML file, or pass the input through unchanged if it is already a dictionary.

Parameters:

path_or_config (Union[str, Dict[Any, Any]]) – dictionary or path to a YAML file containing a dictionary

Return type:

Dict[Any, Any]

Returns:

Dictionary contained in the YAML file, or the input dictionary itself
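
Both accepted input types, sketched with a hypothetical YAML path:

    from pipeline.utils.commonutils.config import load_dict

    # From a YAML file path (hypothetical) ...
    config = load_dict("configs/test_config.yaml")

    # ... or from a dictionary, which is passed through unchanged.
    same_config = load_dict(config)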

pipeline.utils.commonutils.config.resolve_relative_path(path, folder_name='')[source]#
Return type:

str

pipeline.utils.commonutils.crun module#

class pipeline.utils.commonutils.crun.InOutFunction(*args, **kwargs)[source]#

Bases: Protocol

pipeline.utils.commonutils.crun.run_for_different_partitions(func, input_dir, output_dir, partitions=['train', 'val', 'test'], test_dataset_names=None, reproduce=True, list_kwargs=None, **kwargs)[source]#

Run a function for different dataset “partitions”.

Parameters:
  • func (InOutFunction) – Function to run, taking input_dir, output_dir, reproduce and possibly additional keyword arguments.

  • input_dir (str) – input directory

  • output_dir (str) – output directory

  • partitions (List[str]) –

    Partitions to run func on:

    • train: train dataset

    • val: validation dataset

    • test: all the test datasets

    • A specific test dataset name

  • test_dataset_names (Optional[List[str]]) – list of possible test dataset names

  • reproduce (bool) – whether to reproduce the output; if so, the existing output directory is removed first.

  • **kwargs – keyword arguments passed to func
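
A sketch of a function matching the InOutFunction protocol, as described by the func parameter above, and of running it over the partitions; the directory paths and the extra keyword argument are hypothetical:

    from pipeline.utils.commonutils.crun import run_for_different_partitions

    def preprocess(input_dir, output_dir, reproduce=True, **kwargs):
        """Hypothetical step: read files from input_dir and write results to output_dir."""
        ...

    run_for_different_partitions(
        preprocess,
        input_dir="data/raw",         # hypothetical
        output_dir="data/processed",  # hypothetical
        partitions=["train", "val", "test"],
        reproduce=True,
        n_workers=4,                  # hypothetical extra keyword argument forwarded to `preprocess`
    )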

pipeline.utils.commonutils.ctests module#

A module that defines utilities to handle test datasets.

pipeline.utils.commonutils.ctests.collect_test_samples(reference_directory=None, output_path=None, n_events=1000, supplementary_test_config_path=None)[source]#
pipeline.utils.commonutils.ctests.get_available_test_dataset_names(path_or_config_test=None)[source]#

Get the list of available test dataset names from the test dataset configuration file.

Parameters:

path_or_config_test (Union[str, Dict[str, Any], None]) – YAML test dataset configuration dictionary or path to it

Return type:

List[str]

Returns:

List of test dataset names that can be produced and/or used.
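
A sketch combining this helper with get_required_test_dataset_names (documented below) to check that a pipeline configuration only requires test datasets that are actually available; the configuration path is hypothetical:

    from pipeline.utils.commonutils.ctests import (
        get_available_test_dataset_names,
        get_required_test_dataset_names,
    )

    # With no argument, the default test dataset configuration is presumably used.
    available = get_available_test_dataset_names()
    required = get_required_test_dataset_names("configs/pipeline.yaml")  # hypothetical path

    missing = set(required) - set(available)
    assert not missing, f"Test datasets not available: {missing}"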

pipeline.utils.commonutils.ctests.get_preprocessed_test_dataset_dir(test_dataset_name, detector)[source]#

Get the path to the directory that contains the preprocessed files of a given test dataset.

Parameters:

  • test_dataset_name (str) – name of the test dataset

  • detector (str) – detector name

Return type:

str

pipeline.utils.commonutils.ctests.get_required_test_dataset_names(path_or_config)[source]#

Get the list of the dataset names required by the configuration.

Return type:

List[str]

pipeline.utils.commonutils.ctests.get_test_batch_dir(experiment_name, stage, test_dataset_name)[source]#

Get the directory where the batches of a particular experiment and a given test dataset are saved.

Parameters:
  • experiment_name (str) – name of the experiment

  • stage (str) – name of the pipeline stage

  • test_dataset_name (str) – name of the test dataset

Returns:

Path to the directory where the torch batch files are saved.

pipeline.utils.commonutils.ctests.get_test_batch_paths(experiment_name, stage, test_dataset_name)[source]#

Get the list of paths of test batches of a given stage and experiment.

Parameters:
  • experiment_name (str) – name of the experiment

  • stage (str) – name of the pipeline stage

  • test_dataset_name (str) – name of the test dataset

Return type:

List[str]

Returns:

List of paths of the test batches.

Notes

If stage contains embedding and the test batch directory does not exist, the function tries to replace embedding with metric_learning for backward compatibility.
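
A sketch of loading the returned batches with PyTorch; the experiment, stage and dataset names are hypothetical:

    import torch

    from pipeline.utils.commonutils.ctests import get_test_batch_paths

    paths = get_test_batch_paths(
        experiment_name="my_experiment",  # hypothetical
        stage="embedding",                # hypothetical stage name
        test_dataset_name="my_test_set",  # hypothetical
    )

    # The files are expected to be torch batch files (see get_test_batch_dir above).
    batches = [torch.load(path) for path in paths]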

pipeline.utils.commonutils.ctests.get_test_config_for_preprocessing(test_dataset_name, path_or_config_test, detector)[source]#

Get the configuration used for the pre-processing of a given test dataset.

Parameters:
  • test_dataset_name (str) – name of the test dataset to pre-process

  • path_or_config_test (str | dict) – YAML test dataset configuration dictionary or path to it

  • detector (str) – detector name

Return type:

dict

pipeline.utils.commonutils.ctests.load_preprocessing_test_config(test_dataset_name, reference_directory=None)[source]#
Return type:

Dict[str, Any]