pipeline.Embedding package#

A package that handles the embedding stage of the pipeline. This stage consists of creating a rough graph by embedding the hit coordinates into an embedding space. The embedding network is trained in such a way that hits that are likely to be connected by an edge are brought close to one another, while disconnected hits are pushed apart.

Then, a \(k\)-nearest neighbour algorithm is applied to create the rough graph.
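As an illustration of this second step, here is a minimal sketch of a k-nearest-neighbour graph construction with a cut on the squared distance. It uses NumPy arrays in place of torch tensors and a brute-force distance matrix; the function name `knn_edges` is hypothetical and this is not the package's implementation.

```python
import numpy as np

def knn_edges(embeddings, k_max, squared_distance_max):
    """Connect each point to at most k_max nearest neighbours within the distance cut."""
    # Pairwise squared L2 distances, shape (n_points, n_points)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbour
    k = min(k_max, len(embeddings) - 1)
    # Indices of the k closest candidates for every point
    nearest = np.argsort(d2, axis=1)[:, :k]
    sources = np.repeat(np.arange(len(embeddings)), k)
    targets = nearest.ravel()
    # Drop candidates that are further away than the cut
    keep = d2[sources, targets] <= squared_distance_max
    return np.stack([sources[keep], targets[keep]])

embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]])
edges = knn_edges(embeddings, k_max=2, squared_distance_max=0.1)
```

In this toy example the isolated point at (5, 5) gets no edge, because its nearest neighbours lie beyond the squared-distance cut.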

Subpackages#

pipeline.Embedding.embedding_base module#

A module that defines the embedding training and inference.

class pipeline.Embedding.embedding_base.EmbeddingBase(*args: Any, **kwargs: Any)[source]#

Bases: ModelBase

A class that implements the metric learning model.

append_true_pairs(training_edge_indices, y_truth, true_edge_indices, planes)[source]#

Append the true edges to the tensor of training edges.

Parameters:
  • training_edge_indices (Tensor) – training sample of edge indices

  • y_truth (Tensor) – whether the edges in training_edge_indices are genuine or fake

  • true_edge_indices (Tensor) – all the genuine edge indices

Return type:

Tuple[Tensor, Tensor]

Returns:

Training edge indices with the true edge indices added, and updated y_truth.

build_edges(embeddings, planes, k_max, squared_distance_max, query_embeddings=None, query_indices=None)[source]#

Build edges by applying kNNs.

Edges are built by looping over the planes, and drawing neighbours between a plane and the next plane_range planes, where plane_range is a hyperparameter.

Parameters:
  • embeddings (torch.Tensor) – embeddings of all the points

  • planes (torch.Tensor) – planes of all the points.

  • k_max (int) – maximum number of neighbours for the kNN

  • squared_distance_max (float) – maximum squared distance (in the embedding space) for 2 points to be considered neighbours

  • query_embeddings (torch.Tensor | None) – embeddings of the query points

  • query_indices (torch.Tensor | None) – indices of the query points (in embeddings)

Return type:

torch.Tensor

Returns:

Edges built by the kNN.
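The plane-by-plane loop described for build_edges can be sketched as follows. This is an assumption-laden illustration, not the actual method: it uses NumPy instead of torch, a brute-force search instead of a dedicated kNN library, and the name `build_edges_by_plane` and the explicit `plane_range` argument are hypothetical.

```python
import numpy as np

def build_edges_by_plane(embeddings, planes, k_max, squared_distance_max, plane_range=1):
    """Draw kNN edges from each plane to the next `plane_range` planes."""
    edges = []
    for plane in np.unique(planes):
        src = np.flatnonzero(planes == plane)
        dst = np.flatnonzero((planes > plane) & (planes <= plane + plane_range))
        if len(src) == 0 or len(dst) == 0:
            continue
        # Squared L2 distances between source hits and candidate hits
        diff = embeddings[src][:, None, :] - embeddings[dst][None, :, :]
        d2 = (diff ** 2).sum(axis=-1)
        k = min(k_max, len(dst))
        nearest = np.argsort(d2, axis=1)[:, :k]
        rows = np.repeat(np.arange(len(src)), k)
        cols = nearest.ravel()
        keep = d2[rows, cols] <= squared_distance_max
        # Convert local positions back to global hit indices
        edges.append(np.stack([src[rows[keep]], dst[cols[keep]]]))
    if not edges:
        return np.empty((2, 0), dtype=int)
    return np.concatenate(edges, axis=1)

embeddings = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [1.0, 1.1], [0.1, 0.1]])
planes = np.array([0, 0, 1, 1, 2])
plane_edges = build_edges_by_plane(embeddings, planes, k_max=1, squared_distance_max=0.05)
```

Every edge produced this way goes from a lower plane index to a higher one, which is one way to obtain a directed graph.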

property edgedir: str#

Edge direction: left or right.

get_hnm_pairs(query_embeddings, query_indices, embeddings, planes)[source]#

Get the edges from hard-negative mining.

Parameters:
  • query_embeddings (Tensor) – Embeddings of the query points

  • query_indices (Tensor) – Corresponding indices of the query points

  • embeddings (Tensor) – Embeddings of all the points

  • planes (Tensor) – planes of all the points

Return type:

Tensor

Returns:

Edge indices of the hard-negative mined edges

get_lazy_dataset(*args, **kwargs)[source]#

Get the lazy dataset object.

Parameters:
  • input_dir – input directory

  • n_events – number of events to load

  • shuffle – whether to shuffle the input paths (applied before selecting the first n_events)

  • seed – seed for the shuffling

  • **kwargs – Other keyword arguments passed to the utils.loaderutils.dataiterator.LazyDatasetBase constructor.

Return type:

EmbeddingLazyDataSet

Returns:

utils.loaderutils.dataiterator.LazyDatasetBase object

get_lazy_dataset_partition(partition, *args, **kwargs)[source]#

Get the lazy dataset of a partition.

Parameters:
  • partition (str) – train, val or name of the test dataset

  • n_events – number of events to load

  • shuffle – whether to shuffle the input paths (applied before selecting the first n_events)

  • seed – seed for the shuffling

  • **kwargs – Other keyword arguments passed to ModelBase.get_lazy_dataset()

Return type:

LazyDatasetBase

Returns:

Lazy dataset of the partition

get_loss(embeddings, edge_indices, y_truth, weights=None)[source]#

Compute the loss for the given embeddings and edges.

Parameters:
  • embeddings (torch.Tensor) – embeddings of all the points

  • edge_indices (torch.Tensor) – edge indices

  • y_truth (torch.Tensor) – for each edge (column) in edge_indices, whether this edge is genuine (True) or fake (False)

  • weights (torch.Tensor | None) – edge weights

Return type:

torch.Tensor

Returns:

Value of the siamese-like loss
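The exact form of the loss is not given here. As an illustration only, a common siamese-style choice is a contrastive hinge loss, sketched below with NumPy. The function name, the `margin` parameter and the use of a plain mean are assumptions, not a description of what get_loss actually computes.

```python
import numpy as np

def contrastive_hinge_loss(embeddings, edge_indices, y_truth, margin=1.0, weights=None):
    """Hypothetical siamese-like loss: pull genuine pairs together, push fake pairs apart."""
    src, dst = edge_indices
    distances = np.sqrt(((embeddings[src] - embeddings[dst]) ** 2).sum(axis=-1))
    # Genuine edges are penalised by their squared distance,
    # fake edges only when they are closer than the margin
    per_edge = np.where(y_truth, distances ** 2, np.maximum(0.0, margin - distances) ** 2)
    if weights is not None:
        per_edge = per_edge * weights
    return per_edge.mean()

embeddings = np.array([[0.0, 0.0], [0.3, 0.4], [3.0, 4.0]])
edge_indices = np.array([[0, 0], [1, 2]])  # one edge per column
y_truth = np.array([True, False])
loss = contrastive_hinge_loss(embeddings, edge_indices, y_truth)
```

Here the genuine edge (distance 0.5) contributes 0.25 and the fake edge (distance 5, beyond the margin) contributes nothing, so the mean is 0.125.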

get_query_points(embeddings, true_edge_indices, planes=None, query_mask=None)[source]#

Get the points the edges will be drawn from to generate the training set.

Parameters:
  • embeddings (torch.Tensor) – point embeddings

  • true_edge_indices (torch.Tensor) – true edge indices

  • particle_ids – particle IDs for each point in embeddings

Return type:

Tuple[torch.Tensor, torch.Tensor]

Returns:

1D tensor of query indices and 2D tensor of query embeddings

get_random_pairs(query_indices, planes)[source]#

Get random edges drawn from the query points.

Parameters:
  • query_indices (Tensor) – indices of the query points

  • embeddings – Embeddings of all the points

  • planes (Tensor) – planes of all the points. Only used for non-directional graphs, as random pairs are only drawn from one plane to one of the next plane_range planes (where plane_range is a hyperparameter).

Return type:

Tensor

Returns:

Edge indices of random edges drawn from the query points

get_squared_distances(embeddings, edge_indices)[source]#

Get the squared L2 distances between the embeddings of the two hits of every edge.

Parameters:
  • embeddings (Tensor) – Embeddings of all the points

  • edge_indices (Tensor) – edge indices

Return type:

Tensor

Returns:

squared_distances tensor corresponding to the squared L2 distance between the embeddings of the hits of every edge.

get_training_edges(embeddings, true_edge_indices, planes, query_mask=None)[source]#

Get the edges used for the training.

Parameters:
  • embeddings (torch.Tensor) – Embeddings of all the points

  • true_edge_indices (torch.Tensor) – 2D tensor of genuine edge indices

  • particle_ids – tensor of particle IDs for every point. Only used in the query_noise_points regime

  • planes (torch.Tensor) – tensor of planes for every point. Only used for one-directional graphs.

Return type:

Tuple[torch.Tensor, torch.Tensor]

Returns:

2D tensor of training edge indices and 1D tensor indicating whether the corresponding edge is genuine or fake.

get_truth(edge_indices, true_edge_indices)[source]#

Get the true label of each edge (whether it’s genuine or fake).

Parameters:
  • edge_indices (Tensor) – edge indices

  • true_edge_indices (Tensor) – the true edge indices

Return type:

Tuple[Tensor, Tensor]

Returns:

2 one-dimensional torch tensors. The first tensor is the tensor of edge indices, which may be reordered. The second tensor contains, for each edge (column) in edge_indices, whether this edge is genuine (1) or fake (0).
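One simple way to obtain such labels is to test each candidate edge for membership in the set of true edges. The sketch below does this with NumPy by encoding every (source, target) pair as a single integer. The name `label_edges` and the `n_hits` argument are hypothetical, and the real method may proceed differently (for instance, it may also reorder the edges).

```python
import numpy as np

def label_edges(edge_indices, true_edge_indices, n_hits):
    """Return a boolean array: True where the edge (column) is a genuine edge."""
    # Encode each (source, target) pair as one integer so pairs can be compared directly
    edge_keys = edge_indices[0] * n_hits + edge_indices[1]
    true_keys = true_edge_indices[0] * n_hits + true_edge_indices[1]
    return np.isin(edge_keys, true_keys)

edge_indices = np.array([[0, 0, 1], [1, 2, 2]])
true_edge_indices = np.array([[0, 1], [1, 2]])
labels = label_edges(edge_indices, true_edge_indices, n_hits=3)
```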

inference(batch, squared_distance_max, k_max, evaluate=False, overall=False, log=False)[source]#

Run the embedding inference + kNN to build edges of an event.

Parameters:
  • batch (Data) – event PyTorch data object

  • squared_distance_max (float) – squared maximal distance in the embedding space

  • k_max (int) – maximal number of neighbours

  • evaluate (bool) – whether to also output the loss, efficiency and purity

  • overall (bool) – if batch already contains edge_index, whether to concatenate the new edges to the old edge indices instead of replacing them.

  • log (bool) – whether to add an entry to the log

Return type:

Dict[str, Tensor]

property input_kwargs: Dict[str, Any]#

Associates an input name with a dictionary corresponding to the keyword arguments used to build a dummy tensor representing the input. This dictionary basically gives the size and dtype of the tensor.

property input_to_dynamic_axes#

A dictionary that associates an input name with the dynamic axis specification.

property last_plane: int#

Index of the last plane.

property n_planes: int#

Number of unremoved planes (e.g., xz-scifi).

property n_total_planes: int#

Total number of planes.

property query_planes: torch.Tensor | None#

Planes that can be queried.

remove_planes(features, planes, true_edge_index=None)[source]#

Remove hits belonging to planes given by the hyperparameter removed_planes.

Parameters:
  • features (torch.Tensor) – hit features

  • planes (torch.Tensor) – hit plane indices

  • true_edge_index – Optionally, tensor of true edge indices

Return type:

Tuple[torch.Tensor, TensorOrNone, torch.Tensor, torch.Tensor | None]

Returns:

Reindexed hit features, true edge indices, planes and original hit indices. If no plane is removed, this is indicated by original hit indices being None.
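The reindexing this implies can be sketched as follows, with NumPy arrays in place of torch tensors. The function name `drop_planes`, the explicit `removed_planes` argument and the choice to discard true edges that touch a removed hit are assumptions made for the illustration.

```python
import numpy as np

def drop_planes(features, planes, removed_planes, true_edge_index=None):
    """Remove hits in `removed_planes` and renumber the remaining hits from 0."""
    keep = ~np.isin(planes, removed_planes)
    if keep.all():
        # Nothing removed: signalled by original indices being None
        return features, true_edge_index, planes, None
    original_indices = np.flatnonzero(keep)
    # Lookup table: old hit index -> new hit index (-1 for removed hits)
    new_index = np.full(len(planes), -1)
    new_index[original_indices] = np.arange(len(original_indices))
    if true_edge_index is not None:
        # Assumption: an edge survives only if both of its hits survive
        edge_kept = keep[true_edge_index[0]] & keep[true_edge_index[1]]
        true_edge_index = new_index[true_edge_index[:, edge_kept]]
    return features[keep], true_edge_index, planes[keep], original_indices

features = np.array([[10.0], [11.0], [12.0], [13.0]])
planes = np.array([0, 1, 2, 2])
true_edges = np.array([[0, 1, 0], [1, 2, 3]])
new_features, new_edges, new_planes, original = drop_planes(features, planes, [1], true_edges)
```

The returned original indices make it possible to map edges built on the reduced hit collection back to the full event.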

property subnetwork_to_outputs: Dict[str, List[str]]#

A dictionary that associates a subnetwork name with the list of its output names.

to_onnx(outpath, mode=None, options=None)[source]#

Save model to an ONNX file.

Parameters:
  • outpath (str) – Path where to save the ONNX model.

  • mode (Optional[Literal['default']]) – only default mode is supported.

Return type:

None

training_step(batch, batch_idx)[source]#
training_validation_step(batch, with_grad=False)[source]#

Common step for the training and validation steps. This encompasses selecting query hits, drawing edges from them, running the embedding inference and computing the loss.

Return type:

Dict[str, Tensor]

validate_edges(edge_index, planes)[source]#

Check whether non-bidirectional edges all have the correct directions.

Parameters:
  • edge_index (Tensor) – edge indices

  • planes (Tensor) – plane index of each hit
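A check of this kind can be as small as the sketch below. It assumes that the "correct" direction means every edge goes from a lower plane index to a strictly higher one; the function name is hypothetical and the actual convention depends on edgedir.

```python
import numpy as np

def edges_point_forward(edge_index, planes):
    """True if every edge goes from a lower plane index to a strictly higher one."""
    src, dst = edge_index
    return bool(np.all(planes[src] < planes[dst]))

planes = np.array([0, 1, 2])
forward_ok = edges_point_forward(np.array([[0, 1], [1, 2]]), planes)
backward_found = edges_point_forward(np.array([[0, 2], [1, 1]]), planes)
```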

validation_step(batch, batch_idx)[source]#
Return type:

Tensor

class pipeline.Embedding.embedding_base.EmbeddingLazyDataSet(*args, particle_requirement=None, query_particle_requirement=None, target_particle_requirement=None, particles_from_parquet=False, **kwargs)[source]#

Bases: LazyDatasetBase

apply_particle_requirement(batch)[source]#
Return type:

Data

fetch_dataset(input_path, map_location='cpu', **kwargs)[source]#

Load and process one PyTorch DataSet.

Parameters:
  • input_path (str) – path to the PyTorch dataset

  • map_location (str) – location where to load the dataset

  • **kwargs – Other keyword arguments passed to torch.load()

Returns:

Loaded PyTorch data object

pipeline.Embedding.embedding_base.get_example_data(path_or_config, idx=0)[source]#
Return type:

Tuple[DataFrame, Data]

pipeline.Embedding.build_embedding module#

class pipeline.Embedding.build_embedding.EmbeddingInferenceBuilder(model, k_max=1000, squared_distance_max=0.1, max_plane_diff=None)[source]#

Bases: ModelBuilderBase

construct_downstream(batch)[source]#

Run embedding inference and kNN. Add the edges and their targets to the event data object.

pipeline.Embedding.build_embedding.get_squared_distance_max_from_config(path_or_config)[source]#

Get the value of squared_distance_max_inference, or fall back to squared_distance_max if squared_distance_max_inference was set to None.

Parameters:

path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration

Return type:

float

Returns:

Squared maximal distance used for the embedding inference
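The fallback logic described above amounts to the following sketch. It assumes the two hyperparameters sit in a flat dictionary; where they actually live inside the YAML configuration is not specified here, and the function name is hypothetical.

```python
def squared_distance_max_for_inference(embedding_config):
    """Prefer squared_distance_max_inference, fall back to squared_distance_max."""
    value = embedding_config.get("squared_distance_max_inference")
    if value is None:
        value = embedding_config["squared_distance_max"]
    return float(value)

fallback = squared_distance_max_for_inference(
    {"squared_distance_max": 0.1, "squared_distance_max_inference": None}
)
explicit = squared_distance_max_for_inference(
    {"squared_distance_max": 0.1, "squared_distance_max_inference": 0.02}
)
```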

pipeline.Embedding.embedding_plots module#

A module that handles the validation plots for the embedding phase specifically.

pipeline.Embedding.embedding_plots.plot_best_performances_squared_distance_max(model, path_or_config, partition, list_squared_distance_max, k_max=None, n_events=None, seed=None, builder='default', step='embedding', identifier=None, **kwargs)[source]#

Plot best performance for perfect inference as a function of the squared maximal distance.

Parameters:
  • model (EmbeddingBase) – Embedding model

  • path_or_config (str | dict) – YAML configuration. Only needed if output_wpath is not provided.

  • list_squared_distance_max (Sequence[float]) – list of squared maximal distances to try

  • k_max (int | None) – Maximal number of neighbours. If not provided, the one stored in the model is used.

  • n_events (int | None) – Maximal number of events to use for each partition for performance evaluation

  • seed (int | None) – Random seed for the random choice of n_events events

  • show_err – whether to show the error bars

  • builder (str) – Builder to use to build the tracks after the GNN. It can be default (build the tracks by applying a connected component algorithm on the hits) or triplets (build triplets and form the tracks from these triplets).

Return type:

Tuple[Figure | npt.NDArray, List[Axes], Dict[float, Dict[Tuple[str | None, str], Dict[str, float]]]]

pipeline.Embedding.embedding_plots.plot_embedding_performance_given_squared_distance_max_k_max(model, path_or_config=None, partitions=['train', 'val'], n_events=10, squared_distance_max=None, k_max=None, show_err=True, output_wpath=None, lhcb=False, step='embedding', overall=False)[source]#

Plot edge efficiency, purity and graph size as a function of the maximal squared_distance_max or maximal number of neighbours in the k-nearest neighbour algorithm.

Parameters:
  • model (EmbeddingBase) – Embedding model

  • path_or_config (UnionType[str, dict, None]) – YAML configuration. Only needed if output_wpath is not provided.

  • partitions (Union[List[str], Dict[str, Optional[str]]]) – List of partitions to plot

  • n_events (int) – Maximal number of events to use for each partition for performance evaluation

  • squared_distance_max (Union[List[float], float, None]) – Squared maximal distance in the embedding space

  • k_max (Union[List[int], int, None]) – Maximal number of neighbours

  • show_err (bool) – whether to show the error bars

  • output_wpath (Optional[str]) – wildcard path where the plots are saved, with placeholder {metric_name}

Return type:

Tuple[Dict[str, Tuple[Figure, Axes]], Dict[str, Dict[str, matrix]]]

Returns:

Tuple of 2 dictionaries. The first dictionary associates a metric name with the tuple of matplotlib Figure and Axes. The second dictionary associates a metric name with another dictionary that associates a partition with the list of metric values, for the different squared_distance_max or k_max given as input.

pipeline.Embedding.embedding_validation module#

A module that defines tools to perform the validation step of the embedding step.

class pipeline.Embedding.embedding_validation.EmbeddingDistanceMaxExplorer(model, builder='default')[source]#

Bases: ParamExplorer

A class that varies the maximal squared distance and compares the best achievable track-finding performance, in the case where all the fake edges are filtered out.

add_lhcb_text(ax, metric_name)[source]#
property default_step: str#

Name of the step to fall back to if not provided.

get_tracks(value, batches, k_max=None, processing=None)[source]#

Get the dataframe of tracks from the inferred batches.

Parameters:
  • value (float) – current value of the parameter that is explored

  • batches (List[Data]) – list of inferred batches

Return type:

DataFrame

Returns:

Dataframe of tracks, with columns track_id and hit_id

pipeline.Embedding.embedding_validation.evaluate_embedding_performance(model, batches, squared_distance_max=None, k_max=None, overall=False)[source]#

Compute the edge efficiency and edge purity of a given model, on a subset of the train, val or test dataset.

Parameters:
  • model (EmbeddingBase) – PyTorch model inheriting from utils.modelutils.basemodel.ModelBase

  • partition – train, val, test (for the currently loaded test sample) or the name of a test dataset

  • squared_distance_max (Optional[float]) – Maximal distance squared for the KNN. If not given, taken from the hyperparameter in the model.

  • k_max (Optional[int]) – Maximal number of neighbours for the KNN. If not given, taken from the hyperparameter in the model.

  • n_events – Number of events to compute the performance metrics on

  • seed – Seed used to randomly select the n_events

Return type:

Tuple[Variable, Variable, Variable]

Returns:

A tuple of 3 ufloat numbers corresponding to the event-based average of the edge efficiency and edge purity, and the graph size
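For a single event, these quantities can be sketched as below, using the usual definitions: efficiency is the fraction of true edges present in the graph, and purity is the fraction of graph edges that are true. The function name and the `n_hits` argument are hypothetical, and the event averaging and uncertainties returned by the real function are left out.

```python
import numpy as np

def edge_efficiency_purity(edge_indices, true_edge_indices, n_hits):
    """Return (efficiency, purity, graph size) for one event."""
    # Encode each (source, target) pair as one integer and remove duplicates
    edge_keys = np.unique(edge_indices[0] * n_hits + edge_indices[1])
    true_keys = np.unique(true_edge_indices[0] * n_hits + true_edge_indices[1])
    n_genuine = np.isin(edge_keys, true_keys).sum()
    efficiency = n_genuine / len(true_keys)  # share of true edges that were built
    purity = n_genuine / len(edge_keys)      # share of built edges that are true
    return efficiency, purity, len(edge_keys)

edge_indices = np.array([[0, 0, 1, 2], [1, 2, 2, 3]])
true_edge_indices = np.array([[0, 1, 3], [1, 2, 4]])
efficiency, purity, graph_size = edge_efficiency_purity(edge_indices, true_edge_indices, n_hits=5)
```

In this example two of the three true edges are found among four built edges, giving an efficiency of 2/3 and a purity of 1/2.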

pipeline.Embedding.embedding_validation.evaluate_embedding_performances_given_squared_distance_max_k_max(model, partitions=['train', 'val'], n_events=10, squared_distance_max=None, k_max=None, seed=None, overall=False)[source]#

Compute edge efficiency, purity and graph size as a function of the maximal squared_distance_max or maximal number of neighbours in the k-nearest neighbour algorithm.

Parameters:
  • model (EmbeddingBase) – Embedding model

  • path_or_config – YAML configuration. Only needed if output_wpath is not provided.

  • partitions (List[str]) – List of partitions to plot

  • n_events (int) – Maximal number of events to use for each partition for performance evaluation

  • squared_distance_max (Union[List[float], float, None]) – Squared maximal distance in the embedding space

  • k_max (Union[List[int], int, None]) – Maximal number of neighbours

  • seed (Optional[int]) – Random seed for the random choice of n_events events

Return type:

Dict[str, Dict[str, matrix]]

Returns:

Dictionary that associates a metric name with another dictionary that associates a partition with the list of metric values, for the different squared_distance_max or k_max given as input.

pipeline.Embedding.embedding_validation.get_default_squared_distance_max(model, squared_distance_max=None)[source]#

Get the default squared distance max for inference, for a given model.

Return type:

float

pipeline.Embedding.process_custom module#

Custom functions for filtering and altering an event.

pipeline.Embedding.process_custom.edge_features_as_slope(batch)[source]#

Build edge features that correspond to the slope.

Return type:

Data

pipeline.Embedding.process_custom.edges_at_least_3_hits(batch)[source]#
Return type:

Data

pipeline.Embedding.process_custom.edges_at_least_3_planes(batch)[source]#
Return type:

Data

pipeline.Embedding.process_custom.remove_edges_in_same_plane(batch)[source]#
Return type:

Data
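This function has no description, so the following is only a guess at its behaviour based on its name: keep the edges whose two hits lie on different planes. The sketch works on bare NumPy arrays rather than on a Data object, and the function name is hypothetical.

```python
import numpy as np

def drop_same_plane_edges(edge_index, planes):
    """Keep only the edges (columns) whose two hits are on different planes."""
    src, dst = edge_index
    return edge_index[:, planes[src] != planes[dst]]

planes = np.array([0, 0, 1])
edge_index = np.array([[0, 0, 1], [1, 2, 2]])
filtered = drop_same_plane_edges(edge_index, planes)
```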

pipeline.Embedding.process_custom.weights_inversely_proportional_to_nhits(batch)[source]#

Define edge weights that are inversely proportional to the number of hits.

Return type:

Data