pipeline.utils.modelutils package#
A package that defines some common utilities for defining models in PyTorch Lightning.
pipeline.utils.modelutils.basemodel module#
Define a base model shared by the GNN and Embedding models, to avoid duplicating functions.
- class pipeline.utils.modelutils.basemodel.ModelBase(*args: Any, **kwargs: Any)[source]#
Bases: LightningModule
- property feature_indices: List[int] | int | None#
I want to deprecate this…
- property feature_means: List[float] | None#
List of feature means corresponding to the feature names listed in features. They are used for normalising the node features.
- property feature_names: List[str] | None#
List of node feature names.
- property feature_scales: List[float] | None#
List of feature scales corresponding to the feature names listed in features. They are used for normalising the node features.
- fetch_datasets(lazy_dataset)[source]#
Get the datasets located in a given directory.
- Parameters:
  - input_dir – input directory
  - n_events – number of events to load
  - shuffle – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed – seed for the shuffling
  - **kwargs – Other keyword arguments passed to ModelBase.get_lazy_dataset()
- Return type: List[Data]
- Returns: List of loaded PyTorch Geometric Data objects
- fetch_partition(partition, n_events=None, shuffle=False, seed=None, **kwargs)[source]#
Load a partition.
- Parameters:
  - partition (str) – train, val or the name of the test dataset
  - n_events (Optional[int]) – number of events to load for this partition
  - shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed (Optional[int]) – seed for the shuffling
  - **kwargs – Other keyword arguments passed to ModelBase.fetch_dataset()
- Return type: Union[List[Data], LazyDatasetBase]
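A minimal usage sketch; MyModel stands in for any concrete ModelBase subclass, and the checkpoint path is made up:

```python
# Hypothetical usage: MyModel is a placeholder for a concrete ModelBase subclass.
model = MyModel.load_from_checkpoint("checkpoints/last.ckpt")

# Load the first 100 validation events, shuffling the input paths first.
valset = model.fetch_partition("val", n_events=100, shuffle=True, seed=42)
```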
- get_features(batch)[source]#
Get the features of a batch, used as input for inference.
If feature_names is provided, they are used to build the tensor of node features, normalising them using feature_means and feature_scales. Otherwise, batch["x"] is returned.
- Parameters:
  - batch (Data) – batch of nodes (typically an event)
- Return type: Tensor
- Returns: tensor of node features
Notes
No gradient is recorded.
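A minimal sketch of the documented behaviour, assuming each feature name indexes an attribute of the batch (the exact column layout is an assumption):

```python
import torch
from torch_geometric.data import Data

def get_features_sketch(batch: Data, names, means, scales) -> torch.Tensor:
    """Sketch: stack the named node features, normalised as (value - mean) / scale."""
    with torch.no_grad():  # the docstring states that no gradient is recorded
        if names is None:
            return batch["x"]
        columns = [
            (batch[name] - mean) / scale
            for name, mean, scale in zip(names, means, scales)
        ]
        return torch.stack(columns, dim=-1)  # shape: (n_nodes, n_features)
```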
- get_lazy_dataset(input_dir, n_events=None, shuffle=False, seed=None, **kwargs)[source]#
Get the lazy dataset object.
- Parameters:
  - input_dir (str) – input directory
  - n_events (Optional[int]) – number of events to load
  - shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed (Optional[int]) – seed for the shuffling
  - **kwargs – Other keyword arguments passed to the utils.loaderutils.dataiterator.LazyDatasetBase constructor
- Return type: LazyDatasetBase
- Returns: utils.loaderutils.dataiterator.LazyDatasetBase object
- get_lazy_dataset_partition(partition, n_events=None, shuffle=False, seed=None, **kwargs)[source]#
Get the lazy dataset of a partition.
- Parameters:
  - partition (str) – train, val or the name of the test dataset
  - n_events (Optional[int]) – number of events to load
  - shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed (Optional[int]) – seed for the shuffling
  - **kwargs – Other keyword arguments passed to ModelBase.get_lazy_dataset()
- Return type: LazyDatasetBase
- Returns: Lazy dataset of the partition
- classmethod get_model_from_checkpoint(checkpoint, default_checkpoint=None, **kwargs)[source]#
Helper function to get a model at the inference step.
- Parameters:
  - checkpoint (LightningModule | str | None) – the model already loaded, or the path to it
  - Model – Model class
  - default_checkpoint (str | None) – path to fall back to if checkpoint is None
  - **kwargs – other parameters passed to Model.load_from_checkpoint()
- Returns: Loaded model
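A hedged usage example; MyModel and the checkpoint path are made up, and map_location is simply forwarded to Model.load_from_checkpoint():

```python
# Prefer an explicitly given checkpoint; otherwise fall back to the default one.
model = MyModel.get_model_from_checkpoint(
    checkpoint=None,
    default_checkpoint="experiments/baseline/checkpoints/last.ckpt",
    map_location="cpu",  # forwarded to Model.load_from_checkpoint()
)
```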
- get_subnetwork_inputs(subnetwork)[source]#
Find the input names of a subnetwork by looking at the signature of its ONNX forward method _onnx_{subnetwork}.
- Parameters:
  - subnetwork (str) – subnetwork name
- Return type: List[str]
- Returns: List of the input names of the subnetwork.
- get_subnetwork_outputs(subnetwork)[source]#
Get the outputs of a subnetwork, as configured by the subnetwork_to_outputs property.
- Parameters:
  - subnetwork (str) – subnetwork name
- Return type: List[str]
- Returns: List of the output names of the subnetwork.
- Raises: KeyError – if the outputs of the subnetwork were not specified in the subnetwork_to_outputs property.
- property input_kwargs: Dict[str, Any]#
Associates an input name with a dictionary corresponding to the keyword arguments used to build a dummy tensor representing the input. This dictionary basically gives the size and dtype of the tensor.
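An illustrative value of this property (all names, sizes and dtypes are made up), together with how dummy tensors can be derived from it:

```python
import torch

# Made-up example of what input_kwargs could look like:
input_kwargs = {
    "x": {"size": (1000, 3), "dtype": torch.float32},
    "edge_index": {"size": (2, 5000), "dtype": torch.int64},
}

# Dummy tensors for the ONNX export can then be built directly:
dummy_inputs = {name: torch.zeros(**kwargs) for name, kwargs in input_kwargs.items()}
```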
- property input_to_dynamic_axes#
A dictionary that associates an input name with the dynamic axis specification.
- property lazy: bool#
Whether to load the training set and val set into memory only when needed.
- load_partition(partition, n_events=None, shuffle=False, seed=None)[source]#
Load datasets of a partition.
- Parameters:
  - partition (str) – train, val or the name of the test dataset
  - n_events (Optional[int]) – number of events to load for this partition
  - shuffle (bool) – whether to shuffle the input paths (applied before selecting the first n_events)
  - seed (Optional[int]) – seed for the shuffling
- load_testset_from_directory(input_dir, **kwargs)[source]#
Load a test dataset from a path to a directory.
- Parameters:
  - input_dir (str) – path to the directory that contains the PyTorch Geometric Data pickle files
- property n_trainable_params: int#
Number of trainable parameters.
- property on_step: bool#
Whether to log on step.
- optimizer_step(epoch, batch_idx, optimizer, optimizer_closure)[source]#
Modified version of the optimizer step that implements warm-up and properly enforces the learning rate.
- property subnetwork_groups: Dict[str, List[str]]#
A dictionary that associates the name of a composite subnetwork (one that actually corresponds to a list of subnetworks) with that list of subnetworks.
- property subnetwork_to_outputs: Dict[str, List[str]]#
A dictionary that associates a subnetwork name with the list of its output names.
- property subnetworks: List[str]#
List of the available subnetworks. It is derived from subnetwork_to_outputs.
- property testset: List[torch_geometric.data.Data]#
- to_onnx(outpath, mode=None, options=None)[source]#
Export a model to ONNX.
- Parameters:
  - outpath (str) – where to save the ONNX file
  - options (Optional[Iterable[str]]) – ONNX export options
- Return type: None
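A minimal call sketch; the output path is made up, and the available export options depend on the concrete model:

```python
# Hypothetical export call; pass export options via `options` if needed.
model.to_onnx("export/model.onnx", options=None)
```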
- property trainset: List[torch_geometric.data.Data] | LazyDatasetBase#
- property valset: List[torch_geometric.data.Data]#
- class pipeline.utils.modelutils.basemodel.ModelONNXExport(model, subnetwork)[source]#
Bases: Module
Class used to export the forward pass of a subnetwork within a TripletGNNBase model.
- model#
triplet GNN model
- subnetwork#
name of the subnetwork to export
- pipeline.utils.modelutils.basemodel.feature_to_compute_fct: Dict[str, Callable[[torch_geometric.data.Data], torch.Tensor]] = {'phi': <function <lambda>>, 'r': <function <lambda>>}#
Associates a column name with a lambda function that takes the batch object as input and returns the computed column.
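A sketch of what the two registered lambdas plausibly compute, assuming the batch carries Cartesian hit coordinates in batch.x and batch.y (both the attribute names and the formulas are assumptions):

```python
import torch

feature_to_compute_fct = {
    "r": lambda batch: torch.sqrt(batch.x**2 + batch.y**2),  # cylindrical radius
    "phi": lambda batch: torch.atan2(batch.y, batch.x),      # azimuthal angle
}
```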
pipeline.utils.modelutils.batches module#
A module used to handle lists of batches stored in a model.
- pipeline.utils.modelutils.batches.get_batches(model, partition)[source]#
Get the list of batches for the given model.
- Parameters:
  - model (ModelBase) – PyTorch model inheriting from ModelBase
  - partition (str) – train, val, test (for the currently loaded test sample) or the name of a test dataset
- Return type: List[Data]
- Returns: List of PyTorch Geometric data objects
Notes
The input directories are saved as hyperparameters in the model. This is why it is possible to get the data input directories from a model.
- pipeline.utils.modelutils.batches.select_subset(batches, n_events=None, seed=None)[source]#
Randomly select a subset of batches.
- Parameters:
  - batches (List[Data]) – overall list of batches
  - n_events (Optional[int]) – maximal number of events to select
  - seed (Optional[int]) – seed for reproducible randomness
- Return type: List[Data]
- Returns: List of PyTorch Geometric Data objects
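A minimal sketch of the documented behaviour:

```python
import torch
from typing import List, Optional
from torch_geometric.data import Data

def select_subset_sketch(
    batches: List[Data], n_events: Optional[int] = None, seed: Optional[int] = None
) -> List[Data]:
    """Pick at most n_events batches at random, reproducibly when a seed is given."""
    if n_events is None or n_events >= len(batches):
        return batches
    generator = torch.Generator()
    if seed is not None:
        generator.manual_seed(seed)
    indices = torch.randperm(len(batches), generator=generator)[:n_events]
    return [batches[i] for i in indices]
```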
pipeline.utils.modelutils.build module#
Define the base class used to run inference on data.
- class pipeline.utils.modelutils.build.BuilderBase[source]#
Bases: ABC
Base class for looping over input files located in a directory, processing them and saving the output in a different directory.
- build_weights(batch)[source]#
Build weights in the batch for training. This should only be needed for the train and val sets.
- Parameters:
  - batch (Data) – PyTorch Geometric Data object
- Return type: Data
- Returns: batch with the built weights
- filter_batch(batch)[source]#
Filter the batch. This should only be performed on the train and val sets.
- Parameters:
  - batch (Data) – PyTorch Geometric Data object
- Return type: Data
- Returns: filtered batch
- infer(input_dir, output_dir, reproduce=True, processing=None, file_names=None, n_workers=1)[source]#
Load the torch datasets located in input_dir, run the model inference and save the output in output_dir.
- Parameters:
  - input_dir (str) – input directory path
  - output_dir (str) – output directory path
  - reproduce (bool) – whether to delete the output directory if it exists, and run the inference again
  - processing (Union[str, List[str], None]) – name(s) of supplementary function(s) that process the event after ModelBase.construct_downstream()
  - file_names (Optional[List[str]]) – list of file names to run the inference on. If not specified, the inference is run on all the datasets located in the input directory.
  - parallel – whether to run the inference in parallel. This seems quite unstable…
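A hedged usage example; the directories and the processing-function name are made up, and builder stands for any concrete BuilderBase subclass instance:

```python
builder.infer(
    input_dir="data/processed/test",  # where the PyTorch Geometric datasets live
    output_dir="data/inferred/test",  # where the processed events are written
    reproduce=True,                   # wipe the output directory and rerun
    processing="add_track_lengths",   # hypothetical supplementary function name
)
```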
- infer_one_step(file_name, input_dir, output_dir, processing=None)[source]#
Run the inference on a single file and save the output in another file.
- Parameters:
  - file_name (str) – input file name
  - input_dir (str) – input directory path
  - output_dir (str) – output directory path
  - processing (Union[str, List[str], None]) – name(s) of supplementary function(s) that process the event after ModelBase.construct_downstream()
- load_batch(input_path)[source]#
Load a PyTorch Data object from its path. Might apply necessary pre-processing.
- Return type: Data
- process_one_step(batch, processing=None)[source]#
Process one event.
- Parameters:
  - batch (Data) – event stored in a PyTorch Geometric Data object
  - processing (Union[str, List[str], None]) – name(s) of supplementary function(s) that process the event after ModelBase.construct_downstream()
- Return type: Data
- Returns: Processed event, first by BuilderBase.construct_downstream(), then by the filtering and building functions provided as inputs.
- class pipeline.utils.modelutils.build.ModelBuilderBase(model)[source]#
Bases: BuilderBase
Base class for model inference.
pipeline.utils.modelutils.checkpoint_utils module#
A module that defines helper functions for checkpointing.
- pipeline.utils.modelutils.checkpoint_utils.get_last_artifact(version_dir, ckpt_dirname='checkpoints')[source]#
Get the last artifact stored in a given version directory. The last artifact is the one with the largest epoch number, the largest step number and the latest version.
- Parameters:
  - version_dir (str) – path to a directory that stores the training outcomes of a given experiment
- Return type: str
- Returns: Path to the last PyTorch artifact file
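A sketch of the documented selection rule, assuming Lightning-style checkpoint names such as epoch=9-step=1234-v1.ckpt (the naming scheme is an assumption):

```python
import glob
import os
import re

def get_last_artifact_sketch(version_dir: str, ckpt_dirname: str = "checkpoints") -> str:
    """Return the checkpoint with the largest (epoch, step, version) triple."""
    paths = glob.glob(os.path.join(version_dir, ckpt_dirname, "*.ckpt"))

    def sort_key(path: str):
        name = os.path.basename(path)
        epoch = int(re.search(r"epoch=(\d+)", name).group(1))
        step = int(re.search(r"step=(\d+)", name).group(1))
        version = re.search(r"-v(\d+)", name)
        return (epoch, step, int(version.group(1)) if version else 0)

    return max(paths, key=sort_key)
```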
- pipeline.utils.modelutils.checkpoint_utils.get_last_version_dir(experiment_dir)[source]#
Get the path of the last “version directory” given the path to the directory of a training experiment. This directory must contain a “metrics.csv” file.
- Parameters:
  - experiment_dir (str) – path to the training experiment of interest
- Return type: str
- Returns: Path to the last version directory
- pipeline.utils.modelutils.checkpoint_utils.get_last_version_dir_from_config(step, path_or_config)[source]#
Get the path to the last version directory given the configuration.
- Parameters:
  - step (str) – embedding or gnn
  - path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration
- Return type: str
- Returns: Path to the last version directory
Notes
For backward compatibility, if embedding does not exist, it is replaced by metric_learning.
- pipeline.utils.modelutils.checkpoint_utils.get_training_metrics(trainer, suffix='')[source]#
Get the dataframe of the training metrics.
- Parameters:
  - trainer (Trainer | str | List[str]) – either a PyTorch Lightning Trainer object, or the path(s) to the metric file(s) to load directly
  - suffix (str) – suffix to add to the names of the columns in the CSV file
- Return type: pd.DataFrame
- Returns: Dataframe of the training metrics (one row per epoch).
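A hedged usage example; the paths and the suffix are made up:

```python
# From a finished PyTorch Lightning Trainer:
df_metrics = get_training_metrics(trainer)

# Or directly from one or several metrics.csv files, tagging the columns:
df_metrics = get_training_metrics("lightning_logs/version_0/metrics.csv", suffix="_run0")
print(df_metrics.tail(1))  # metrics of the last epoch
```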
pipeline.utils.modelutils.exploration module#
A module that defines ParamExplorer, a class that allows one to vary a parameter and check the efficiency obtained for that choice.
- class pipeline.utils.modelutils.exploration.ParamExplorer(model, varname, varlabel=None)[source]#
Bases: ABC
A class that allows one to explore the track-matching performance for various choices of a given parameter of a trained model (e.g., the best efficiency as a function of the squared maximal distance of the kNN).
- compute_performance_metrics(values, partition, metric_names, categories, n_events=None, seed=None, track_metric_names=None, with_err=True, **kwargs)[source]#
Compute the performance metrics for different values of a hyperparameter.
- Parameters:
  - values (Sequence[float]) – list of values for the hyperparameter of interest
  - partition (str) – train, val or the name of a test dataset
  - n_events (Optional[int]) – maximal number of events for the evaluation
  - seed (Optional[int]) – random seed for randomly selecting n_events
  - metric_names (List[str]) – list of metric names to compute
  - categories (List[Category]) – list of categories to compute the performance in
- Return type: Dict[float, Dict[Tuple[Optional[str], str], Dict[str, float]]]
- Returns: Dictionary of metric values for every tuple (value, category.name, metric_name)
- property default_step: str#
Name of the step to fall back to if not provided.
- get_performance_from_tracks(df_tracks, df_hits_particles, df_particles, metric_names, categories, track_metric_names=None, with_err=True)[source]#
Get performance dictionary for given tracks.
- Parameters:
  - df_tracks (DataFrame) – dataframe of tracks
  - df_hits_particles (DataFrame) – dataframe of hits-particles
  - df_particles (DataFrame) – dataframe of particles
  - metric_names (List[str]) – list of metric names to compute
  - categories (List[Category]) – list of categories to compute the performance in
- Return type: Dict[Tuple[Optional[str], str], Dict[str, float]]
- Returns: Dictionary that associates the 2-tuple (category.name, metric_name) with the metric value for the given category
- abstract get_tracks(value, batches, **kwargs)[source]#
Get the dataframe of tracks from the inferred batches.
- Parameters:
  - value (float) – current value of the parameter that is explored
  - batches (Union[List[Data], LazyDatasetBase]) – list of inferred batches
- Return type: DataFrame
- Returns: Dataframe of tracks, with columns track_id and hit_id
- load_preprocessed_dataframes(batches)[source]#
Load the preprocessed dataframes of hits-particles and particles associated with the PyTorch datasets given as input.
- Parameters:
  - batches (Union[List[Data], LazyDatasetBase]) – list of PyTorch Geometric Data objects
- Return type: Tuple[DataFrame, DataFrame]
- Returns: Tuple of the dataframes of hits-particles and particles
- plot(partition, values, n_events=None, seed=None, metric_names=None, categories=None, track_metric_names=None, identifier=None, path_or_config=None, output_path=None, same_fig=True, lhcb=False, category_name_to_color=None, step=None, with_err=True, legend_inside=None, **kwargs)[source]#
Plot metrics in different categories for different hyperparameter values.
- Parameters:
  - path_or_config (str | dict | None) – pipeline configuration
  - partition (str) – train, val or the name of a test dataset
  - values (Sequence[float]) – list of values for the hyperparameter of interest
  - n_events (int | None) – maximal number of events for the evaluation
  - seed (int | None) – random seed for randomly selecting n_events
  - metric_names (List[str] | None) – list of metric names to compute. If not set, efficiency, clone_rate and hit_efficiency_per_candidate are computed and plotted.
  - categories (List[mt.requirement.Category] | None) – list of categories to compute the performance in. By default, these are “Velo Without Electrons” and “Long Electrons”.
  - track_metric_names (List[str] | None) – list of track-related metrics (that do not depend on any category) to plot
  - identifier (str | None) – identifier for the figure name. Only used if output_path is not provided.
  - output_path (str | None) – output path where the figure is saved. If same_fig is set to False, the string should contain the placeholder {metric_name}.
  - same_fig (bool) – when several metrics are plotted, whether to plot them in the same Matplotlib figure
  - **kwargs – Other keyword arguments passed to ParamExplorer.compute_performance_metrics()
- Return type: Tuple[Figure | npt.NDArray, List[Axes], Dict[float, Dict[Tuple[str | None, str], Dict[str, float]]]]
- Returns: 3-tuple of the Matplotlib Figures and Axes, and the dictionary of metric values for every tuple (value, category.name, metric_name)
- run_inference(batches)[source]#
Run the inference on a list of batches.
- Parameters:
  - batches (Union[List[Data], LazyDatasetBase]) – list of batches
- Return type: List[Data]
- Returns: List of inferred batches
pipeline.utils.modelutils.export module#
A Python module that defines utilities to export a PyTorch model to ONNX.
- class pipeline.utils.modelutils.export.TRTScatterAddOp(*args, **kwargs)[source]#
Bases: Function
A fake scatter-add operator for ONNX export, used with a custom TensorRT plugin that implements the scatter-add operation.
Notes
For reference: https://leimao.github.io/blog/PyTorch-Custom-ONNX-Operator-Export/
- pipeline.utils.modelutils.export.change_input_index_types(inpath, target_type=onnx.TensorProto.INT32, outpath=None)[source]#
In PyTorch, indices must be INT64. This function loops over the input nodes of an ONNX model and converts the input nodes of type onnx.TensorProto.INT64 to the target_type.
- Parameters:
  - inpath (str) – path to the ONNX file
  - target_type (int) – type to assign to the index input nodes
  - outpath (Optional[str]) – path where to save the altered ONNX model. If not provided, the model is saved to inpath.
- Return type: None
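A minimal sketch of the described behaviour using the onnx Python API:

```python
import onnx
from typing import Optional

def change_input_index_types_sketch(
    inpath: str,
    target_type: int = onnx.TensorProto.INT32,
    outpath: Optional[str] = None,
) -> None:
    """Retype every INT64 graph input of an ONNX model to target_type."""
    model = onnx.load(inpath)
    for graph_input in model.graph.input:
        tensor_type = graph_input.type.tensor_type
        if tensor_type.elem_type == onnx.TensorProto.INT64:
            tensor_type.elem_type = target_type
    onnx.save(model, outpath if outpath is not None else inpath)
```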
- pipeline.utils.modelutils.export.check_onnx_integrity(inpath)[source]#
Check the integrity of an ONNX model stored in inpath.
- Return type: None
- pipeline.utils.modelutils.export.convert_model_to_fp16(inpath, outpath=None)[source]#
Convert an ONNX model to fp16.
- Return type: None
Notes
See https://onnxruntime.ai/docs/performance/model-optimizations/float16.html.
pipeline.utils.modelutils.metrics module#
Module to compute metrics to evaluate the classification performance.
- pipeline.utils.modelutils.metrics.compute_classification_efficiency_purity(predictions, truths)[source]#
Compute the efficiency and purity of predictions.
- Parameters:
  - predictions (Tensor) – tensor of predictions indicating whether each example is genuine (True) or fake (False)
  - truths (Tensor) – what the predictions should be in order to be exact
- Return type: Tuple[float, float]
- Returns: Efficiency and purity of the predictions.
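A minimal sketch assuming the usual definitions, efficiency = TP / (TP + FN) and purity = TP / (TP + FP), applied to boolean tensors:

```python
import torch
from typing import Tuple

def efficiency_purity_sketch(
    predictions: torch.Tensor, truths: torch.Tensor
) -> Tuple[float, float]:
    true_positives = (predictions & truths).sum().item()
    efficiency = true_positives / truths.sum().item()   # fraction of genuine examples found
    purity = true_positives / predictions.sum().item()  # fraction of predictions that are genuine
    return efficiency, purity

# Example: 2 true positives out of 3 genuine examples and 3 positive predictions.
preds = torch.tensor([True, True, True, False])
truth = torch.tensor([True, True, False, True])
print(efficiency_purity_sketch(preds, truth))  # (0.666..., 0.666...)
```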
pipeline.utils.modelutils.mlp module#
A module that defines utilities for building multi-layer perceptrons.