pipeline.utils.modelutils package#

A package that defines common utilities for building models in PyTorch Lightning.

pipeline.utils.modelutils.basemodel module#

Defines a base model shared by the GNN and embedding models, to avoid duplicating functions.

class pipeline.utils.modelutils.basemodel.ModelBase(*args: Any, **kwargs: Any)[source]#

Bases: LightningModule

configure_optimizers()[source]#: Use the Adam optimizer and a step learning-rate scheduler.

property feature_indices: List[int] | int | None#: I want to deprecate this…

property feature_means: List[float] | None#: List of feature means corresponding to the feature names listed in features. They are used for normalising the node features.

property feature_names: List[str] | None#: List of node feature names.

property feature_scales: List[float] | None#: List of feature scales corresponding to the feature names listed in features. They are used for normalising the node features.

fetch_datasets(lazy_dataset)[source]#

Get the datasets located in a given directory.

Parameters:

input_dir – input directory
n_events – number of events to load
shuffle – whether to shuffle the input paths (applied before selecting the first n_events)
seed – seed for the shuffling
**kwargs – Other keyword arguments passed to ModelBase.get_lazy_dataset()

Return type:

List[Data]

Returns:

List of loaded PyTorch Geometric Data objects

fetch_partition(partition, n_events=None, shuffle=False, seed=None, **kwargs)[source]#

Load a partition.

Parameters:

partition (str) – train, val or name of the test dataset
n_events (Optional[int]) – number of events to load for this partition
shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
seed (Optional[int]) – seed for the shuffling
**kwargs – Other keyword arguments passed to ModelBase.fetch_datasets()

Return type:

Union[List[Data], LazyDatasetBase]

get_features(batch)[source]#

Get the features of a batch, used as input for inference.

If feature_names is provided, they are used to build the tensor of node features, normalising them using feature_means and feature_scales.

Otherwise, batch["x"] is returned.

Parameters:: batch (Data) – batch of nodes (typically an event)
Return type:: Tensor
Returns:: tensor of node features

Notes

No gradient is recorded.
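
The normalisation described above can be sketched in plain Python. This is only an illustration: it assumes the convention (value - mean) / scale, which this page does not spell out, and it represents the batch as a dictionary of per-node lists instead of a PyTorch Geometric Data object. The helper name build_features is hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def build_features(
    batch: Dict[str, list],
    feature_names: Optional[Sequence[str]] = None,
    feature_means: Optional[Sequence[float]] = None,
    feature_scales: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Return one row of features per node (hypothetical helper)."""
    if feature_names is None:
        # No feature names: fall back to the pre-built feature matrix batch["x"].
        return batch["x"]

    n_features = len(feature_names)
    # Assumption: missing means/scales leave the feature unchanged.
    means = list(feature_means) if feature_means is not None else [0.0] * n_features
    scales = list(feature_scales) if feature_scales is not None else [1.0] * n_features

    n_nodes = len(batch[feature_names[0]])
    return [
        [
            (batch[name][i] - mean) / scale
            for name, mean, scale in zip(feature_names, means, scales)
        ]
        for i in range(n_nodes)
    ]
```

Each column is shifted and scaled independently, so the order of feature_means and feature_scales has to match the order of feature_names.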

get_lazy_dataset(input_dir, n_events=None, shuffle=False, seed=None, **kwargs)[source]#

Get the lazy dataset object.

Parameters:

input_dir (str) – input directory
n_events (Optional[int]) – number of events to load
shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
seed (Optional[int]) – seed for the shuffling
**kwargs – Other keyword arguments passed to the utils.loaderutils.dataiterator.LazyDatasetBase constructor.

Return type:

LazyDatasetBase

Returns:

utils.loaderutils.dataiterator.LazyDatasetBase object

get_lazy_dataset_partition(partition, n_events=None, shuffle=False, seed=None, **kwargs)[source]#

Get the lazy dataset of a partition.

Parameters:

partition (str) – train, val or name of the test dataset
n_events (Optional[int]) – number of events to load
shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
seed (Optional[int]) – seed for the shuffling
**kwargs – Other keyword arguments passed to ModelBase.get_lazy_dataset()

Return type:

LazyDatasetBase

Returns:

Lazy dataset of the partition

classmethod get_model_from_checkpoint(checkpoint, default_checkpoint=None, **kwargs)[source]#

Helper function to get a model at inference step.

Parameters:

checkpoint (LightningModule | str | None) – the model already loaded, or path to it
Model – Model class
default_checkpoint (str | None) – path to fall back to if checkpoint is None.
**kwargs – other parameters passed to Model.load_from_checkpoint()

Returns:

Loaded model
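
The fallback logic described by the parameters above could look like the following sketch. The function name resolve_model and the load_from_checkpoint argument (a stand-in for Model.load_from_checkpoint()) are hypothetical, and raising ValueError when both paths are missing is an assumption, not documented behaviour.

```python
from typing import Any, Callable, Optional, Union


def resolve_model(
    checkpoint: Union[Any, str, None],
    load_from_checkpoint: Callable[..., Any],
    default_checkpoint: Optional[str] = None,
    **kwargs: Any,
) -> Any:
    """Return a loaded model from a model object, a path, or a default path."""
    if checkpoint is None:
        if default_checkpoint is None:
            # Assumption: nothing to load is treated as an error.
            raise ValueError("Neither checkpoint nor default_checkpoint was provided.")
        checkpoint = default_checkpoint

    if isinstance(checkpoint, str):
        # A path was given: delegate the loading to the model class.
        return load_from_checkpoint(checkpoint, **kwargs)

    # Anything else is assumed to be an already loaded model.
    return checkpoint
```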

get_n_features()[source]#

Number of input features of the network.

Return type:: int

get_subnetwork_inputs(subnetwork)[source]#

Find the input names of a subnetwork by looking at the signature of its ONNX forward method _onnx_{subnetwork}.

Parameters:: subnetwork (str) – subnetwork name
Return type:: List[str]
Returns:: List of the input names of the subnetwork.
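
Reading input names from a method signature can be done with the standard inspect module. The sketch below uses a toy class; the subnetwork name edge_classifier and its argument names are hypothetical.

```python
import inspect
from typing import List


class ToyModel:
    """Stand-in for a model exposing one ONNX forward method per subnetwork."""

    def _onnx_edge_classifier(self, x, edge_indices):
        return x


def get_subnetwork_inputs(model: object, subnetwork: str) -> List[str]:
    """List the argument names of the method _onnx_{subnetwork}."""
    method = getattr(model, f"_onnx_{subnetwork}")
    # The signature of a bound method omits `self`, so only real inputs remain.
    return list(inspect.signature(method).parameters)
```

With this approach, renaming an argument of the ONNX forward method also renames the corresponding exported input.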

get_subnetwork_outputs(subnetwork)[source]#

Get the outputs of a subnetwork, as configured by the subnetwork_to_outputs property.

Parameters:: subnetwork (str) – subnetwork name
Return type:: List[str]
Returns:: List of the output names of the subnetwork.
Raises:: KeyError – if the outputs of the subnetwork were not specified in the subnetwork_to_outputs property.

property input_kwargs: Dict[str, Any]#: Associates an input name with a dictionary corresponding to the keyword arguments used to build a dummy tensor representing the input. This dictionary basically gives the size and dtype of the tensor.

property input_to_dynamic_axes#: A dictionary that associates an input name with the dynamic axis specification.

property lazy: bool#: Whether to load the training set and val set into memory only when needed.

load_partition(partition, n_events=None, shuffle=False, seed=None)[source]#

Load datasets of a partition.

Parameters:

partition (str) – train, val or name of the test dataset
n_events (Optional[int]) – number of events to load for this partition
shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
seed (Optional[int]) – seed for the shuffling

load_testset_from_directory(input_dir, **kwargs)[source]#

Load a test dataset from a path to a directory.

Parameters:: input_dir (str) – path to the directory that contains the PyTorch Geometric Data pickle files.

load_trainset_split_indices(trainset_split)[source]#

property n_trainable_params: int#: Number of trainable parameters.

property on_step: bool#: Whether to log on step.

optimizer_step(epoch, batch_idx, optimizer, optimizer_closure)[source]#: Modified version of the optimizer step that implements warm-up and properly enforces the learning rate.
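
As an illustration of learning-rate warm-up, here is a minimal sketch that assumes a linear ramp. The actual schedule used by optimizer_step() is not described on this page, and the names warmup_lr and warmup_steps are hypothetical.

```python
def warmup_lr(base_lr: float, step: int, warmup_steps: int) -> float:
    """Linearly ramp the learning rate up to base_lr over warmup_steps steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        # Warm-up disabled or finished: use the nominal learning rate.
        return base_lr
    # Steps are counted from 0, so the first step uses base_lr / warmup_steps.
    return base_lr * (step + 1) / warmup_steps
```

In a Lightning module, a value computed this way would typically be written into each entry of optimizer.param_groups before the optimizer is stepped.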

setup(stage)[source]#

property subnetwork_groups: Dict[str, List[str]]#: A dictionary that associates the name of a subnetwork group with the list of subnetworks that make up this group.

property subnetwork_to_outputs: Dict[str, List[str]]#: A dictionary that associates a subnetwork name with the list of its output names.

property subnetworks: List[str]#: List of subnetworks available. It is derived from subnetwork_to_outputs.
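
The three subnetwork properties can be pictured with plain dictionaries. The subnetwork and output names below are hypothetical, and the expand helper is only an assumption about how a group name might be resolved.

```python
from typing import Dict, List

# Hypothetical configuration, for illustration only.
subnetwork_to_outputs: Dict[str, List[str]] = {
    "encoder": ["h_nodes", "h_edges"],
    "edge_classifier": ["edge_score"],
}
subnetwork_groups: Dict[str, List[str]] = {
    "full": ["encoder", "edge_classifier"],
}

# The list of available subnetworks is derived from the outputs mapping.
subnetworks: List[str] = list(subnetwork_to_outputs)


def get_subnetwork_outputs(subnetwork: str) -> List[str]:
    """Return the configured outputs; raises KeyError if none were specified."""
    return subnetwork_to_outputs[subnetwork]


def expand(name: str) -> List[str]:
    """Resolve a group name to its subnetworks; a plain name maps to itself."""
    return subnetwork_groups.get(name, [name])
```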

test_dataloader()[source]#: Test dataloader.

property testset: List[torch_geometric.data.Data]#

to_onnx(outpath, mode=None, options=None)[source]#

Export a model to ONNX.

Parameters:

outpath (str) – where to save the ONNX file
options (Optional[Iterable[str]]) – ONNX export options

Return type:

None

train_dataloader()[source]#: Train dataloader, with random splitting of epochs.

property trainset: List[torch_geometric.data.Data] | LazyDatasetBase#

val_dataloader()[source]#: Validation dataloader.

property valset: List[torch_geometric.data.Data]#

class pipeline.utils.modelutils.basemodel.ModelONNXExport(model, subnetwork)[source]#

Bases: Module

Class used to export the forward pass of a subnetwork within a TripletGNNBase model.

model#: triplet GNN model

subnetwork#: name of the subnetwork to export

forward(*args)[source]#

Forward pass to use when the model is exported to ONNX.

Return type:: Any

pipeline.utils.modelutils.basemodel.check_and_discard(s, element)[source]#

Return type:: bool

pipeline.utils.modelutils.basemodel.feature_to_compute_fct: Dict[str, Callable[[torch_geometric.data.Data], torch.Tensor]] = {'phi': <function <lambda>>, 'r': <function <lambda>>}#: Associates a column name with a lambda function that takes the batch object as input and returns the computed column
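
A mapping of this kind can be sketched without PyTorch. The example assumes that r and phi are the usual cylindrical coordinates computed from Cartesian hit positions; the actual lambdas are not shown on this page, and the batch keys hit_x and hit_y are hypothetical.

```python
import math
from typing import Callable, Dict, List

Batch = Dict[str, List[float]]

# Assumed definitions: r = sqrt(x^2 + y^2) and phi = atan2(y, x).
feature_to_compute_fct: Dict[str, Callable[[Batch], List[float]]] = {
    "r": lambda batch: [
        math.hypot(x, y) for x, y in zip(batch["hit_x"], batch["hit_y"])
    ],
    "phi": lambda batch: [
        math.atan2(y, x) for x, y in zip(batch["hit_x"], batch["hit_y"])
    ],
}


def compute_column(batch: Batch, name: str) -> List[float]:
    """Use the stored column if present, otherwise compute it on the fly."""
    if name in batch:
        return batch[name]
    return feature_to_compute_fct[name](batch)
```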

pipeline.utils.modelutils.batches module#

A module used to handle the lists of batches stored in a model.

pipeline.utils.modelutils.batches.get_batches(model, partition)[source]#

Get the list of batches for the given model.

Parameters:

model (ModelBase) – PyTorch model inheriting from ModelBase
partition (str) – train, val, test (for the current already loaded test sample) or the name of a test dataset

Return type:

List[Data]

Returns:

List of PyTorch Geometric data objects

Notes

The input directories are saved as hyperparameters in the model. This is why it is possible to get the data input directories from a model.

pipeline.utils.modelutils.batches.select_subset(batches, n_events=None, seed=None)[source]#

Randomly select a subset of batches.

Parameters:

batches (List[Data]) – overall list of batches
n_events (Optional[int]) – Maximal number of events to select
seed (Optional[int]) – Seed for reproducible randomness

Return type:

List[Data]

Returns:

List of PyTorch Data objects
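
A reproducible random selection like the one described above can be sketched with the standard library. Returning every batch when n_events is None or exceeds the number of batches is an assumption.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def select_subset(
    batches: List[T],
    n_events: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[T]:
    """Randomly select at most n_events batches, reproducibly if seed is set."""
    if n_events is None or n_events >= len(batches):
        # Assumption: nothing to subsample, so return a copy of the full list.
        return list(batches)
    # A dedicated generator keeps the global random state untouched.
    return random.Random(seed).sample(batches, n_events)
```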

pipeline.utils.modelutils.build module#

Defines the base class used to run inference on data.

class pipeline.utils.modelutils.build.BuilderBase[source]#

Bases: ABC

Base class for looping over input files located in a directory, processing them and saving the output in a different directory.

build_features(batch)[source]#

Return type:: Data

build_weights(batch)[source]#

Build weights in the batch for training. This should only be needed in the train and val sets.

Parameters:: batch (Data) – PyTorch Geometric Data object
Return type:: Data
Returns:: filtered batch

abstract construct_downstream(batch)[source]#: Run the inference on a PyTorch Geometric Data object, in place.

filter_batch(batch)[source]#

Filter the batch. This should only be performed in the train and val sets.

Parameters:: batch (Data) – PyTorch Geometric Data object
Return type:: Data
Returns:: filtered batch

infer(input_dir, output_dir, reproduce=True, processing=None, file_names=None, n_workers=1)[source]#

Load the torch datasets located in input_dir, run the model inference and save the output in output_dir.

Parameters:

input_dir (str) – input directory path
output_dir (str) – output directory path
reproduce (bool) – whether to delete the output directory if it exists, and run the inference again
processing (Union[str, List[str], None]) – name(s) of supplementary function(s) that process the event after BuilderBase.construct_downstream().
file_names (Optional[List[str]]) – list of file names to run the inference on. If not specified, the inference is run on all the datasets located in the input directory.
parallel – Whether to run the inference in parallel. This seems quite unstable…
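
The directory loop performed by infer() can be sketched with the standard library. Text files stand in for the PyTorch datasets, process is a hypothetical per-file callable, and skipping the run when the output directory exists and reproduce is False is an assumption about the behaviour.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def infer_directory(
    input_dir: str,
    output_dir: str,
    process: Callable[[str], str],
    reproduce: bool = True,
    file_names: Optional[List[str]] = None,
) -> List[str]:
    """Process every file of input_dir and write the results to output_dir."""
    in_path, out_path = Path(input_dir), Path(output_dir)
    if out_path.exists():
        if not reproduce:
            # Assumption: keep the existing outputs and skip the inference.
            return []
        shutil.rmtree(out_path)
    out_path.mkdir(parents=True)

    if file_names is None:
        # Default: run on all files located in the input directory.
        file_names = sorted(p.name for p in in_path.iterdir() if p.is_file())

    for name in file_names:
        content = (in_path / name).read_text()
        (out_path / name).write_text(process(content))
    return file_names
```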

infer_one_step(file_name, input_dir, output_dir, processing=None)[source]#

Run the inference on a single file and save the output in another file.

Parameters:

file_name (str) – input file name
input_dir (str) – input directory path
output_dir (str) – output directory path
processing (Union[str, List[str], None]) – name(s) of supplementary function(s) that process the event after BuilderBase.construct_downstream().

load_batch(input_path)[source]#

Load a PyTorch Data object from its path. Might apply necessary pre-processing.

Return type:: Data

process_one_step(batch, processing=None)[source]#

Process one event.

Parameters:

batch (Data) – event stored in a PyTorch Geometric data object
processing (Union[str, List[str], None]) – name(s) of supplementary function(s) that process the event after BuilderBase.construct_downstream().

Return type:

Data

Returns:

Processed event, first by BuilderBase.construct_downstream(), then by the filtering and building functions provided as inputs.
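
One way to apply processing functions given by name is to look them up on the builder with getattr. The sketch below uses a toy builder with dictionary batches; the real lookup mechanism is not documented here, so treat the dispatch as an assumption.

```python
from typing import Any, Dict, List, Optional, Union

Batch = Dict[str, Any]


class ToyBuilder:
    """Toy stand-in for a BuilderBase subclass, operating on dictionaries."""

    def construct_downstream(self, batch: Batch) -> Batch:
        batch["scores"] = [0.9, 0.2, 0.7]
        return batch

    def filter_batch(self, batch: Batch) -> Batch:
        batch["scores"] = [s for s in batch["scores"] if s > 0.5]
        return batch

    def build_weights(self, batch: Batch) -> Batch:
        batch["weights"] = [1.0] * len(batch["scores"])
        return batch

    def process_one_step(
        self, batch: Batch, processing: Union[str, List[str], None] = None
    ) -> Batch:
        # The inference always runs first.
        batch = self.construct_downstream(batch)
        if processing is None:
            names: List[str] = []
        elif isinstance(processing, str):
            names = [processing]
        else:
            names = list(processing)
        # Then each named function is applied, in the order given.
        for name in names:
            batch = getattr(self, name)(batch)
        return batch
```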

save_downstream(batch, output_path)[source]#: Save the PyTorch data object data in output_path.

class pipeline.utils.modelutils.build.ModelBuilderBase(model)[source]#

Bases: BuilderBase

Base class for model inference.

load_batch(input_path)[source]#

Load a PyTorch Data object from its path. Might apply necessary pre-processing.

Return type:: Data

pipeline.utils.modelutils.checkpoint_utils module#

A module that defines helper functions for checkpointing.

pipeline.utils.modelutils.checkpoint_utils.get_last_artifact(version_dir, ckpt_dirname='checkpoints')[source]#

Get the last artifact stored in a given version directory. The last artifact is the one that has the largest number of epochs, the largest number of steps and the last version.

Parameters:: version_dir (str) – path to a directory that stores the training outcomes of a given experiment
Return type:: str
Returns:: Path to the last PyTorch artifact file
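
The ordering by epoch, step and version can be sketched as a sort key. The sketch assumes file names that follow the default PyTorch Lightning pattern epoch=N-step=M.ckpt, with an optional -vK suffix; the pattern actually parsed by get_last_artifact() is not shown on this page.

```python
import re
from typing import List, Tuple

# Assumed file name pattern, e.g. "epoch=10-step=2000-v1.ckpt".
_CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)(?:-v(\d+))?\.ckpt$")


def artifact_sort_key(file_name: str) -> Tuple[int, int, int]:
    """Return (epoch, step, version); a missing version counts as 0."""
    match = _CKPT_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"Unrecognised checkpoint name: {file_name}")
    epoch, step, version = match.groups()
    return int(epoch), int(step), int(version or 0)


def pick_last_artifact(file_names: List[str]) -> str:
    """Pick the name with the largest epoch, then step, then version."""
    return max(file_names, key=artifact_sort_key)
```

Comparing integers instead of strings matters here: as text, "epoch=9" would sort after "epoch=10".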

pipeline.utils.modelutils.checkpoint_utils.get_last_version_dir(experiment_dir)[source]#

Get the path of the last “version directory” given the path to the directory of the given training experiment. This directory must have a file “metrics.csv”.

Parameters:: experiment_dir (str) – path to the training experiment of interest
Return type:: str
Returns:: Path to the last version directory
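
A search for the last version directory might look like the sketch below. It assumes the version_N directory naming used by PyTorch Lightning loggers; the function name find_last_version_dir is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def find_last_version_dir(experiment_dir: str) -> str:
    """Return the version_N directory with the largest N that has metrics.csv."""
    best: Optional[Tuple[int, Path]] = None
    for child in Path(experiment_dir).iterdir():
        match = re.fullmatch(r"version_(\d+)", child.name)
        if match is None or not child.is_dir():
            continue
        # Directories without a metrics file are ignored.
        if not (child / "metrics.csv").is_file():
            continue
        number = int(match.group(1))
        if best is None or number > best[0]:
            best = (number, child)
    if best is None:
        raise FileNotFoundError(f"No version directory found in {experiment_dir}")
    return str(best[1])
```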

pipeline.utils.modelutils.checkpoint_utils.get_last_version_dir_from_config(step, path_or_config)[source]#

Get the path to the last version directory given the configuration.

Parameters:

step (str) – embedding or gnn
path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration

Return type:

str

Returns:

Path to the last version directory

Notes

For backward compatibility, if the embedding does not exist, it is replaced by metric_learning.

pipeline.utils.modelutils.checkpoint_utils.get_training_metrics(trainer, suffix='')[source]#

Get the dataframe of the training metrics.

Parameters:

trainer (Trainer | str | List[str]) – either a PyTorch Lightning Trainer object, or the path(s) to the metric file(s) to load directly.
suffix (str) – suffix to add to the name of the columns in the CSV file

Return type:

pd.DataFrame

Returns:

Dataframe of the training metrics (one row per epoch).

pipeline.utils.modelutils.exploration module#

A module that defines ParamExplorer, a class that varies a parameter and checks the efficiency obtained for each choice.

class pipeline.utils.modelutils.exploration.ParamExplorer(model, varname, varlabel=None)[source]#

Bases: ABC

A class for exploring the track-matching performance for various choices of a given parameter of a trained model (e.g., best efficiency as a function of the squared maximal distance of the kNN).

add_lhcb_text(ax, metric_name)[source]#

compute_performance_metrics(values, partition, metric_names, categories, n_events=None, seed=None, track_metric_names=None, with_err=True, **kwargs)[source]#

Compute the performance metrics for different values of a hyperparameter.

Parameters:

values (Sequence[float]) – list of values for the hyperparameter of interest
partition (str) – train, val or the name of a test dataset
n_events (Optional[int]) – Maximal number of events for the evaluation
seed (Optional[int]) – Random seed for randomly selecting n_events
metric_names (List[str]) – List of metric names to compute
categories (List[Category]) – list of categories to compute the performance in.

Return type:

Dict[float, Dict[Tuple[Optional[str], str], Dict[str, float]]]

Returns:

Dictionary of metric values for every tuple (value, category.name, metric_name)
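
The nested dictionary given as the return type can be navigated as in the sketch below. The numbers are made up, and the innermost keys "value" and "err" are an assumption, since this page only gives that level's type as Dict[str, float].

```python
from typing import Dict, Optional, Tuple

Performance = Dict[float, Dict[Tuple[Optional[str], str], Dict[str, float]]]

# Made-up numbers, for illustration only.
performance: Performance = {
    0.010: {("Velo Without Electrons", "efficiency"): {"value": 0.98, "err": 0.01}},
    0.015: {("Velo Without Electrons", "efficiency"): {"value": 0.99, "err": 0.01}},
}


def best_value(perf: Performance, category: Optional[str], metric: str) -> float:
    """Return the hyperparameter value that maximises the given metric."""
    return max(perf, key=lambda value: perf[value][(category, metric)]["value"])
```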

property default_step: str#: Name of the step to fall back to if not provided.

get_output_dir(path_or_config, step)[source]#

get_performance_from_tracks(df_tracks, df_hits_particles, df_particles, metric_names, categories, track_metric_names=None, with_err=True)[source]#

Get performance dictionary for given tracks.

Parameters:

df_tracks (DataFrame) – dataframe of tracks
df_hits_particles (DataFrame) – dataframe of hits-particles
df_particles (DataFrame) – dataframe of particles
metric_names (List[str]) – List of metric names to compute
categories (List[Category]) – list of categories to compute the performance in.

Return type:

Dict[Tuple[Optional[str], str], Dict[str, float]]

Returns:

Dictionary that associates the 2-tuple (category.name, metric_name) with the metric value for the given category

abstract get_tracks(value, batches, **kwargs)[source]#

Get the dataframe of tracks from the inferred batches.

Parameters:

value (float) – current value of the parameter that is explored
batches (Union[List[Data], LazyDatasetBase]) – list of inferred batches

Return type:

DataFrame

Returns:

Dataframe of tracks, with columns track_id and hit_id

load_preprocessed_dataframes(batches)[source]#

Load the preprocessed dataframes of hits-particles and particles associated with the PyTorch DataSets given as input.

Parameters:: batches (Union[List[Data], LazyDatasetBase]) – list of PyTorch Geometric Data objects
Return type:: Tuple[DataFrame, DataFrame]
Returns:: Tuple of dataframes of hits-particles and particles

plot(partition, values, n_events=None, seed=None, metric_names=None, categories=None, track_metric_names=None, identifier=None, path_or_config=None, output_path=None, same_fig=True, lhcb=False, category_name_to_color=None, step=None, with_err=True, legend_inside=None, **kwargs)[source]#

Plot metrics in different categories for different hyperparameter values.

Parameters:

path_or_config (str | dict | None) – pipeline configuration
partition (str) – train, val or the name of a test dataset
values (Sequence[float]) – list of values for the hyperparameter of interest
n_events (int | None) – Maximal number of events for the evaluation
seed (int | None) – Random seed for randomly selecting n_events
metric_names (List[str] | None) – List of metric names to compute. If not set, efficiency, clone_rate and hit_efficiency_per_candidate are computed and plotted.
categories (List[mt.requirement.Category] | None) – list of categories to compute the performance in. By default, this is “Velo Without Electrons” and “Long Electrons”.
track_metric_names (List[str] | None) – list of track-related metrics (that do not depend on any category) to plot
identifier (str | None) – Identifier for the figure name. Only used if output_path is not provided
output_path (str | None) – Output path where the figure is saved. If same_fig is set to False, the string should contain the placeholder {metric_name}.
same_fig (bool) – in the case where several metrics are plotted, whether to plot them in the same matplotlib figure object
**kwargs – Other keyword arguments passed to ParamExplorer.compute_performance_metrics()
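
The {metric_name} placeholder of output_path can be expanded with str.format. The helper below is a hypothetical sketch, and raising an error when the placeholder is missing is an assumption.

```python
from typing import List


def resolve_output_paths(
    output_path: str, metric_names: List[str], same_fig: bool = True
) -> List[str]:
    """Return one output path per figure to save."""
    if same_fig:
        # All the metrics share a single figure, hence a single file.
        return [output_path]
    if "{metric_name}" not in output_path:
        # Assumption: without the placeholder, the figures would overwrite each other.
        raise ValueError("output_path must contain the placeholder {metric_name}")
    return [output_path.format(metric_name=name) for name in metric_names]
```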

Return type:

Tuple[Figure | npt.NDArray, List[Axes], Dict[float, Dict[Tuple[str | None, str], Dict[str, float]]]]

Returns:

3-tuple of the Matplotlib Figures and Axes, and the dictionary of metric values for every tuple (value, category.name, metric_name)

run_inference(batches)[source]#

Run the inference on a list of batches.

Parameters:: batches (Union[List[Data], LazyDatasetBase]) – List of batches
Return type:: List[Data]
Returns:: List of inferred batches

pipeline.utils.modelutils.export module#

A Python module that defines utilities to export a PyTorch model to ONNX.

class pipeline.utils.modelutils.export.TRTScatterAddOp(*args, **kwargs)[source]#

Bases: Function

A fake scatter add operator for ONNX export, used with a custom TensorRT plugin that implements the scatter add operation.

Notes

For reference: https://leimao.github.io/blog/PyTorch-Custom-ONNX-Operator-Export/

static forward(ctx, source, index, h)[source]#

Return type:: Tensor

static symbolic(g, source, index, h)[source]#

TensorRT exportable scatter add

Parameters:

g – populated graph
source – Source input tensor for the scattering
index – Index input tensor for the scattering
dim_size – Number of elements in the output tensor
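
For readers unfamiliar with the operation, a scatter add along the first dimension can be written in plain Python as follows. This is a reference sketch of the usual semantics, assumed to match what the TensorRT plugin implements; it is not the plugin code.

```python
from typing import List


def scatter_add(
    source: List[List[float]], index: List[int], dim_size: int
) -> List[List[float]]:
    """Sum the rows of source into dim_size output rows selected by index."""
    n_cols = len(source[0]) if source else 0
    out = [[0.0] * n_cols for _ in range(dim_size)]
    for row, target in zip(source, index):
        # Rows that share the same target index are accumulated.
        for j, value in enumerate(row):
            out[target][j] += value
    return out
```

In a GNN, this is the aggregation step that sums the messages of all the edges arriving at the same node.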

pipeline.utils.modelutils.export.change_input_index_types(inpath, target_type=onnx.TensorProto.INT32, outpath=None)[source]#

In PyTorch, indices must be INT64. This function loops over the input nodes of an ONNX model and converts those of type onnx.TensorProto.INT64 to target_type.

Parameters:

inpath (str) – path to the ONNX file
target_type (int) – type to assign to the index input nodes
outpath (Optional[str]) – path where to save the altered ONNX model. If not provided, the model is saved to inpath.

Return type:

None

pipeline.utils.modelutils.export.check_onnx_integrity(inpath)[source]#

Check the integrity of an ONNX model stored in inpath.

Return type:: None

pipeline.utils.modelutils.export.convert_model_to_fp16(inpath, outpath=None)[source]#

Convert an ONNX model to fp16.

Return type:: None

Notes

See https://onnxruntime.ai/docs/performance/model-optimizations/float16.html.

pipeline.utils.modelutils.metrics module#

Module to compute metrics to evaluate the classification performance.

pipeline.utils.modelutils.metrics.compute_classification_efficiency_purity(predictions, truths)[source]#

Compute the efficiency and purity of predictions.

Parameters:

predictions (Tensor) – tensor of predictions indicating whether each example is genuine (True) or fake (False)
truths (Tensor) – ground-truth labels that the predictions should match

Return type:

Tuple[float, float]

Returns:

efficiency and purity of the predictions.

pipeline.utils.modelutils.metrics.compute_efficiency_purity(n_true_positives, n_truths, n_positives)[source]#
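
Both functions can be sketched with the usual definitions, which are assumed here: efficiency is the fraction of genuine examples that are predicted genuine, and purity is the fraction of positive predictions that are genuine. Returning 0.0 for an empty denominator is also an assumption, and boolean lists stand in for tensors.

```python
from typing import List, Tuple


def compute_efficiency_purity(
    n_true_positives: int, n_truths: int, n_positives: int
) -> Tuple[float, float]:
    """Return (efficiency, purity) from the three counts."""
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    efficiency = n_true_positives / n_truths if n_truths else 0.0
    purity = n_true_positives / n_positives if n_positives else 0.0
    return efficiency, purity


def compute_classification_efficiency_purity(
    predictions: List[bool], truths: List[bool]
) -> Tuple[float, float]:
    """Count the positives in boolean lists and derive efficiency and purity."""
    n_true_positives = sum(p and t for p, t in zip(predictions, truths))
    return compute_efficiency_purity(n_true_positives, sum(truths), sum(predictions))
```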

pipeline.utils.modelutils.mlp module#

A module that defines utilities for building multi-layer perceptrons.

pipeline.utils.modelutils.mlp.make_mlp(input_size, sizes, hidden_activation='ReLU', output_activation='ReLU', layer_norm=False)[source]#

Construct an MLP with specified fully-connected layers.

Return type:: Sequential
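
To show how the arguments could translate into layers, the sketch below builds a layer plan as plain tuples instead of torch.nn modules. The ordering (linear layer, optional layer normalisation, activation) and the handling of output_activation=None are assumptions; make_mlp() itself returns a torch.nn.Sequential.

```python
from typing import List, Optional, Sequence


def plan_mlp(
    input_size: int,
    sizes: Sequence[int],
    hidden_activation: str = "ReLU",
    output_activation: Optional[str] = "ReLU",
    layer_norm: bool = False,
) -> List[tuple]:
    """Describe the layers of an MLP as (layer name, *arguments) tuples."""
    layers: List[tuple] = []
    widths = [input_size] + list(sizes)
    last = len(sizes) - 1
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        layers.append(("Linear", n_in, n_out))
        activation = output_activation if i == last else hidden_activation
        if activation is None:
            # Assumption: no normalisation or activation after a raw output layer.
            continue
        if layer_norm:
            layers.append(("LayerNorm", n_out))
        layers.append((activation,))
    return layers
```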
