pipeline.Embedding package#
A package that handles the embedding stage of the pipeline. This stage consists of creating a rough graph by embedding the hit coordinates into an embedding space. The embedding network is trained so that hits that are likely to be connected by an edge are brought close to one another, while disconnected hits are pushed apart.
Then, a \(k\)-nearest neighbour algorithm is applied to create the rough graph.
Subpackages#
pipeline.Embedding.embedding_base module#
A module that defines the embedding training and inference.
- class pipeline.Embedding.embedding_base.EmbeddingBase(*args: Any, **kwargs: Any)[source]#
- Bases: - ModelBase
- A class that implements the metric learning model.
 - append_true_pairs(training_edge_indices, y_truth, true_edge_indices, planes)[source]#
- Append the true edges to the tensor of training edges. - Parameters:
- training_edge_indices ( - Tensor) – training sample of edge indices
- y_truth ( - Tensor) – whether the edges in training_edge_indices are genuine or fake
- true_edge_indices ( - Tensor) – all the genuine edge indices
 
- Return type:
- Tuple[- Tensor,- Tensor]
- Returns:
- Training edge indices with the true edge indices added, and the updated y_truth.
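The appending step described above can be sketched as follows. This is a minimal illustration in plain Python lists rather than `torch.Tensor`, assuming edge indices are stored as two rows (sources and targets) with one column per edge. The name `append_true_pairs_sketch` is hypothetical, and the `planes` argument as well as any deduplication the real method may perform are not modelled.

```python
from typing import List, Tuple

# Assumed layout: [[sources...], [targets...]], one column per edge.
EdgeIndex = List[List[int]]


def append_true_pairs_sketch(
    training_edge_indices: EdgeIndex,
    y_truth: List[bool],
    true_edge_indices: EdgeIndex,
) -> Tuple[EdgeIndex, List[bool]]:
    """Append every genuine edge as an extra column, labelled True."""
    sources = training_edge_indices[0] + true_edge_indices[0]
    targets = training_edge_indices[1] + true_edge_indices[1]
    labels = list(y_truth) + [True] * len(true_edge_indices[0])
    return [sources, targets], labels


# Two training edges (one fake, one genuine) and two genuine edges to append.
edges, labels = append_true_pairs_sketch(
    [[0, 1], [2, 3]], [False, True], [[1, 4], [3, 5]]
)
```

In this sketch a genuine edge that was already part of the training sample appears twice after the call.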
 
 - build_edges(embeddings, planes, k_max, squared_distance_max, query_embeddings=None, query_indices=None)[source]#
- Build edges by applying kNNs. - Edges are built by looping over the planes and drawing neighbours between a plane and the next plane_range planes, where plane_range is a hyperparameter. - Parameters:
- embeddings (torch.Tensor) – embeddings of all the points 
- planes (torch.Tensor) – planes of all the points. 
- k_max (int) – maximum number of neighbours for the kNN 
- squared_distance_max (float) – maximum (embedded) distance for 2 points to be considered as neighbours 
- query_embeddings (torch.Tensor | None) – embeddings of the query points 
- query_indices (torch.Tensor | None) – indices of the query points (in - embeddings)
 
- Return type:
- torch.Tensor 
- Returns:
- Edges built by the kNN. 
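A brute-force version of this plane-restricted kNN could look like the sketch below. It is written with plain Python lists so that it runs without PyTorch, and it ignores the optional query arguments. The function name, the `plane_range` argument (a hyperparameter in the real model) and the choice of an inclusive distance cut are assumptions made for illustration.

```python
from typing import List, Sequence


def build_edges_sketch(
    embeddings: Sequence[Sequence[float]],
    planes: Sequence[int],
    k_max: int,
    squared_distance_max: float,
    plane_range: int = 1,  # assumption: passed explicitly here
) -> List[List[int]]:
    """Connect each hit to its nearest hits in the next `plane_range` planes."""
    sources: List[int] = []
    targets: List[int] = []
    for i, (emb_i, plane_i) in enumerate(zip(embeddings, planes)):
        candidates = []
        for j, (emb_j, plane_j) in enumerate(zip(embeddings, planes)):
            # Only look forward, at most `plane_range` planes ahead.
            if not 0 < plane_j - plane_i <= plane_range:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(emb_i, emb_j))
            if d2 <= squared_distance_max:
                candidates.append((d2, j))
        # Keep at most k_max neighbours, closest first.
        for _, j in sorted(candidates)[:k_max]:
            sources.append(i)
            targets.append(j)
    return [sources, targets]


embeddings = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]]
planes = [0, 1, 1, 1]
edges_k1 = build_edges_sketch(embeddings, planes, k_max=1, squared_distance_max=0.1)
edges_k2 = build_edges_sketch(embeddings, planes, k_max=2, squared_distance_max=0.1)
```

With `k_max=1` only the closest hit of the next plane is kept. With `k_max=2` the second hit is added as well, while the far-away hit is rejected by the distance cut in both cases.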
 
 - property edgedir: str#
- Edge direction: left or right. 
 
 - get_hnm_pairs(query_embeddings, query_indices, embeddings, planes)[source]#
- Get the edges from hard-negative mining. - Parameters:
- query_embeddings ( - Tensor) – Embeddings of the query points
- query_indices ( - Tensor) – Corresponding indices of the query points
- embeddings ( - Tensor) – Embeddings of all the points
- planes ( - Tensor) – planes of all the points
 
- Return type:
- Tensor
- Returns:
- Edge indices of the hard-negative mined edges 
 
 - get_lazy_dataset(*args, **kwargs)[source]#
- Get the lazy dataset object. - Parameters:
- input_dir – input directory 
- n_events – number of events to load 
- shuffle – whether to shuffle the input paths (applied before selecting the first n_events)
- seed – seed for the shuffling 
- **kwargs – Other keyword arguments passed to the utils.loaderutils.dataiterator.LazyDatasetBase constructor.
 
- Return type:
- Returns:
- utils.loaderutils.dataiterator.LazyDatasetBase object
 
 - get_lazy_dataset_partition(partition, *args, **kwargs)[source]#
- Get the lazy dataset of a partition. - Parameters:
- partition ( - str) – train, val or the name of a test dataset
- n_events – number of events to load 
- shuffle – whether to shuffle the input paths (applied before selecting the first n_events)
- seed – seed for the shuffling 
- **kwargs – Other keyword arguments passed to - ModelBase.get_lazy_dataset()
 
- Return type:
- LazyDatasetBase
- Returns:
- Lazy dataset of the - partition
 
 - get_loss(embeddings, edge_indices, y_truth, weights=None)[source]#
- Compute the loss for the given embeddings and edges. - Parameters:
- embeddings (torch.Tensor) – embeddings of all the points 
- edge_indices (torch.Tensor) – edge indices 
- y_truth (torch.Tensor) – for each edge (column) in - edge_indices, whether this edge is genuine (- True) or fake (- False)
- weights (torch.Tensor | None) – edge weights 
 
- Return type:
- torch.Tensor 
- Returns:
- Value of the siamese-like loss 
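The exact form of the siamese-like loss is not given here. As an illustration only, the sketch below implements the classic contrastive loss, in which genuine edges are penalised by their squared distance and fake edges are penalised only when closer than a margin. The function name, the `margin` argument and the weighted-mean reduction are assumptions and may differ from what this class actually computes. Plain Python lists stand in for tensors.

```python
import math
from typing import List, Optional, Sequence


def contrastive_loss_sketch(
    embeddings: Sequence[Sequence[float]],
    edge_indices: List[List[int]],
    y_truth: Sequence[bool],
    weights: Optional[Sequence[float]] = None,
    margin: float = 1.0,  # hypothetical hyperparameter
) -> float:
    """Weighted mean of a contrastive loss over the given edges."""
    if weights is None:
        weights = [1.0] * len(y_truth)
    total = 0.0
    for src, dst, is_true, w in zip(edge_indices[0], edge_indices[1], y_truth, weights):
        d2 = sum((a - b) ** 2 for a, b in zip(embeddings[src], embeddings[dst]))
        if is_true:
            # Genuine edge: pull the two hits together.
            term = d2
        else:
            # Fake edge: push apart until the distance exceeds the margin.
            term = max(0.0, margin - math.sqrt(d2)) ** 2
        total += w * term
    return total / sum(weights)


embeddings = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
# Edge 0-1 is genuine (distance 0.5); edge 0-2 is fake and beyond the margin.
loss_mixed = contrastive_loss_sketch(embeddings, [[0, 0], [1, 2]], [True, False])
# The same close pair labelled as fake is penalised for being inside the margin.
loss_close_fake = contrastive_loss_sketch(embeddings, [[0], [1]], [False])
```

A fake edge that is already farther apart than the margin contributes nothing, so training effort concentrates on the fake pairs that are still close in the embedding space.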
 
 - get_query_points(embeddings, true_edge_indices, planes=None, query_mask=None)[source]#
- Get the points the edges will be drawn from to generate the training set. - Parameters:
- embeddings (torch.Tensor) – point embeddings 
- true_edge_indices (torch.Tensor) – true edge indices 
- particle_ids – particle IDs for each point in - embeddings
 
- Return type:
- Tuple[torch.Tensor, torch.Tensor] 
- Returns:
- 1D tensor of query indices and 2D tensor of query embeddings 
 
 - get_random_pairs(query_indices, planes)[source]#
- Get random edges drawn from the query points. - Parameters:
- query_indices ( - Tensor) – indices of the query points
- embeddings – Embeddings of all the points 
- planes ( - Tensor) – planes of all the points. Only used for non-directional graphs, as random pairs are only drawn from one plane to one of the next plane_range planes (where plane_range is a hyperparameter).
 
- Return type:
- Tensor
- Returns:
- Edge indices of random edges drawn from the query points 
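Drawing random pairs from the query points might be sketched as below. The function name and the `n_pairs_per_query`, `plane_range` and `seed` arguments are hypothetical; in the real method `plane_range` is a hyperparameter and the number of pairs drawn per query point is not documented here. Plain Python lists replace tensors.

```python
import random
from typing import List, Optional, Sequence


def random_pairs_sketch(
    query_indices: Sequence[int],
    planes: Sequence[int],
    n_pairs_per_query: int = 1,
    plane_range: Optional[int] = None,
    seed: int = 0,
) -> List[List[int]]:
    """Pair each query hit with randomly chosen target hits."""
    rng = random.Random(seed)
    sources: List[int] = []
    targets: List[int] = []
    for q in query_indices:
        if plane_range is None:
            # No plane restriction: any other hit is a valid target.
            candidates = [j for j in range(len(planes)) if j != q]
        else:
            # Only hits in one of the next `plane_range` planes.
            candidates = [
                j for j, p in enumerate(planes) if 0 < p - planes[q] <= plane_range
            ]
        if not candidates:
            continue
        for _ in range(n_pairs_per_query):
            sources.append(q)
            targets.append(rng.choice(candidates))
    return [sources, targets]


planes = [0, 0, 1, 1, 2]
pairs = random_pairs_sketch([0, 1], planes, n_pairs_per_query=3, plane_range=1)
```

Here both query hits sit in plane 0, so every drawn target is one of the two hits in plane 1.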
 
 - get_squared_distances(embeddings, edge_indices)[source]#
- Get the squared distances - Parameters:
- embeddings ( - Tensor) – Embeddings of all the points
- edge_indices ( - Tensor) – edge indices
 
- Return type:
- Tensor
- Returns:
- squared_distances tensor corresponding to the squared L2 distance between the embeddings of the hits of every edge.
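For reference, the squared L2 distance of an edge is the sum of squared differences between the embeddings of its two hits. A minimal sketch with plain Python lists, assuming the two-row edge layout (the function name is hypothetical):

```python
from typing import List, Sequence


def squared_distances_sketch(
    embeddings: Sequence[Sequence[float]],
    edge_indices: List[List[int]],
) -> List[float]:
    """Squared L2 distance between the two hits of every edge."""
    return [
        sum((a - b) ** 2 for a, b in zip(embeddings[src], embeddings[dst]))
        for src, dst in zip(edge_indices[0], edge_indices[1])
    ]


embeddings = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
squared_distances = squared_distances_sketch(embeddings, [[0, 0], [1, 2]])
```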
 
 - get_training_edges(embeddings, true_edge_indices, planes, query_mask=None)[source]#
- Get the edges used for the training. - Parameters:
- embeddings (torch.Tensor) – Embeddings of all the points 
- true_edge_indices (torch.Tensor) – 2D tensor of genuine edge indices 
- particle_ids – tensor of particle IDs for every point. Only used in the query_noise_points regime
- planes (torch.Tensor) – tensor of planes for every point. Only used for one-directional graphs. 
 
- Return type:
- Tuple[torch.Tensor, torch.Tensor] 
- Returns:
- 2D tensor of training edge indices and 1D tensor indicating whether the corresponding edge is genuine or fake. 
 
 - get_truth(edge_indices, true_edge_indices)[source]#
- Get the true label of each edge (whether it’s genuine or fake). - Parameters:
- edge_indices ( - Tensor) – edge indices
- true_edge_indices ( - Tensor) – the true edge indices
 
- Return type:
- Tuple[- Tensor,- Tensor]
- Returns:
- 2 one-dimensional torch tensors. The first tensor is the tensor of edge indices, which may be slightly reordered. The second tensor contains, for each edge (column) in edge_indices, whether this edge is genuine (1) or fake (0).
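Labelling edges amounts to testing whether each candidate edge belongs to the set of true edges. The sketch below uses plain Python lists, returns only the labels and keeps the input order. It treats edges as directed, which is an assumption; the real method may handle direction and ordering differently. The function name is hypothetical.

```python
from typing import List


def get_truth_sketch(
    edge_indices: List[List[int]],
    true_edge_indices: List[List[int]],
) -> List[bool]:
    """For each edge (column), tell whether it is among the true edges."""
    true_pairs = set(zip(true_edge_indices[0], true_edge_indices[1]))
    return [
        pair in true_pairs for pair in zip(edge_indices[0], edge_indices[1])
    ]


# Candidate edges 0-1, 1-2 and 2-0; only 0-1 is among the true edges.
y_truth = get_truth_sketch([[0, 1, 2], [1, 2, 0]], [[0, 2], [1, 3]])
```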
 
 - inference(batch, squared_distance_max, k_max, evaluate=False, overall=False, log=False)[source]#
- Run the embedding inference + kNN to build edges of an event. - Parameters:
- batch ( - Data) – event PyTorch data object
- squared_distance_max ( - float) – squared maximal distance in the embedding space
- k_max ( - int) – maximal number of neighbours
- evaluate ( - bool) – whether to also output the loss, efficiency and purity
- overall ( - bool) – if batch already contains edge_index, whether to concatenate the new edges to the old edge indices instead of replacing them.
- log ( - bool) – whether to add an entry to the log
 
- Return type:
- Dict[- str,- Tensor]
 
 - property input_kwargs: Dict[str, Any]#
- Associates an input name with a dictionary corresponding to the keyword arguments used to build a dummy tensor representing the input. This dictionary basically gives the size and dtype of the tensor.
 - property input_to_dynamic_axes#
- A dictionary that associates an input name with the dynamic axis specification. 
 - property last_plane: int#
- Index of the last plane. 
 - property n_planes: int#
- Number of unremoved planes (e.g., xz-scifi). 
 - property n_total_planes: int#
- Total number of planes. 
 - property query_planes: torch.Tensor | None#
- Planes that can be queried. 
 - remove_planes(features, planes, true_edge_index=None)[source]#
- Remove hits belonging to planes given by the hyperparameter removed_planes. - Parameters:
- features (torch.Tensor) – hit features 
- planes (torch.Tensor) – hit plane indices 
- true_edge_index – Optionally, tensor of true edge indices 
 
- Return type:
- Tuple[torch.Tensor, TensorOrNone, torch.Tensor, torch.Tensor | None] 
- Returns:
- Reindexed hit features, true edge indices, planes and original hit indices. If no plane is removed, this is indicated by original hit indices being None. 
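The removal and reindexing can be illustrated with the sketch below, which follows the documented return order. It uses plain Python lists, takes `removed_planes` as an explicit argument instead of a hyperparameter, and simply drops true edges that touch a removed hit. That last choice is an assumption, since the real method might instead reconnect the neighbouring hits. The function name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def remove_planes_sketch(
    features: List[List[float]],
    planes: List[int],
    removed_planes: Sequence[int],
    true_edge_index: Optional[List[List[int]]] = None,
) -> Tuple[List[List[float]], Optional[List[List[int]]], List[int], Optional[List[int]]]:
    """Drop hits in `removed_planes` and renumber the remaining hits."""
    removed = set(removed_planes)
    kept = [i for i, p in enumerate(planes) if p not in removed]
    if len(kept) == len(planes):
        # Nothing removed: signalled by original hit indices being None.
        return features, true_edge_index, planes, None
    new_index = {old: new for new, old in enumerate(kept)}
    new_edges = None
    if true_edge_index is not None:
        pairs = [
            (new_index[s], new_index[t])
            for s, t in zip(true_edge_index[0], true_edge_index[1])
            if s in new_index and t in new_index
        ]
        new_edges = [[s for s, _ in pairs], [t for _, t in pairs]]
    new_features = [features[i] for i in kept]
    new_planes = [planes[i] for i in kept]
    return new_features, new_edges, new_planes, kept


features = [[0.0], [1.0], [2.0], [3.0]]
new_features, new_edges, new_planes, original_indices = remove_planes_sketch(
    features, [0, 1, 2, 3], removed_planes=[1], true_edge_index=[[0, 1, 2], [1, 2, 3]]
)
```

The returned original hit indices make it possible to map edges built on the reduced hit set back to the full event.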
 
 - property subnetwork_to_outputs: Dict[str, List[str]]#
- A dictionary that associates a subnetwork name with the list of its output names. 
 - to_onnx(outpath, mode=None, options=None)[source]#
- Save model to an ONNX file. - Parameters:
- outpath ( - str) – Path where to save the ONNX model.
- mode ( - Optional[- Literal[- 'default']]) – only default mode is supported.
 
- Return type:
- None
 
 - training_validation_step(batch, with_grad=False)[source]#
- Common step for the training and validation steps. This encompasses selecting query hits, drawing edges from them, running the embedding inference and computing the loss. - Return type:
- Dict[- str,- Tensor]
 
 
- class pipeline.Embedding.embedding_base.EmbeddingLazyDataSet(*args, particle_requirement=None, query_particle_requirement=None, target_particle_requirement=None, particles_from_parquet=False, **kwargs)[source]#
- Bases: - LazyDatasetBase
 - fetch_dataset(input_path, map_location='cpu', **kwargs)[source]#
- Load and process one PyTorch DataSet. - Parameters:
- input_path ( - str) – path to the PyTorch dataset
- map_location ( - str) – location where to load the dataset
- **kwargs – Other keyword arguments passed to - torch.load()
 
- Returns:
- Loaded PyTorch data object 
 
 
pipeline.Embedding.build_embedding module#
- class pipeline.Embedding.build_embedding.EmbeddingInferenceBuilder(model, k_max=1000, squared_distance_max=0.1, max_plane_diff=None)[source]#
- Bases: - ModelBuilderBase
- pipeline.Embedding.build_embedding.get_squared_distance_max_from_config(path_or_config)[source]#
- Get the value of squared_distance_max_inference, or fall back to squared_distance_max if squared_distance_max_inference was set to None. - Parameters:
- path_or_config ( - str|- dict) – configuration dictionary, or path to the YAML file that contains the configuration
- Return type:
- float
- Returns:
- Squared maximal distance used for the embedding inference 
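The fallback rule can be sketched for an already-loaded configuration dictionary as follows. The two key names come from the description above, but the `"embedding"` section that holds them is a hypothetical layout, and loading the configuration from a YAML path is left out.

```python
from typing import Any, Dict


def squared_distance_max_from_config_sketch(config: Dict[str, Any]) -> float:
    """Prefer the inference-specific cut, else fall back to the training one."""
    embedding_config = config["embedding"]  # hypothetical section name
    value = embedding_config.get("squared_distance_max_inference")
    if value is None:
        value = embedding_config["squared_distance_max"]
    return float(value)


fallback_value = squared_distance_max_from_config_sketch(
    {"embedding": {"squared_distance_max": 0.04, "squared_distance_max_inference": None}}
)
explicit_value = squared_distance_max_from_config_sketch(
    {"embedding": {"squared_distance_max": 0.04, "squared_distance_max_inference": 0.02}}
)
```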
 
pipeline.Embedding.embedding_plots module#
A module that handles the validation plots for the embedding phase specifically.
- pipeline.Embedding.embedding_plots.plot_best_performances_squared_distance_max(model, path_or_config, partition, list_squared_distance_max, k_max=None, n_events=None, seed=None, builder='default', step='embedding', identifier=None, **kwargs)[source]#
- Plot best performance for perfect inference as a function of the squared maximal distance. - Parameters:
- model (EmbeddingBase) – Embedding model 
- path_or_config (str | dict) – YAML configuration. Only needed if output_wpath is not provided.
- list_squared_distance_max (Sequence[float]) – list of squared maximal distances to try 
- k_max (int | None) – Maximal number of neighbours. If not provided, the one stored in the model is used. 
- n_events (int | None) – Maximal number of events to use for each partition for performance evaluation 
- seed (int | None) – Random seed for the random choice of n_events events
- show_err – whether to show the error bars 
- builder (str) – Builder to use to build the tracks after the GNN. It can be default (build the tracks by applying a connected component algorithm on the hits) or triplets (build triplets and form the tracks from these triplets).
 
- Return type:
- Tuple[Figure | npt.NDArray, List[Axes], Dict[float, Dict[Tuple[str | None, str], Dict[str, float]]]] 
 
- pipeline.Embedding.embedding_plots.plot_embedding_performance_given_squared_distance_max_k_max(model, path_or_config=None, partitions=['train', 'val'], n_events=10, squared_distance_max=None, k_max=None, show_err=True, output_wpath=None, lhcb=False, step='embedding', overall=False)[source]#
- Plot edge efficiency, purity and graph size as a function of the maximal squared distance or the maximal number of neighbours in the k-nearest neighbour algorithm. - Parameters:
- model ( - EmbeddingBase) – Embedding model
- path_or_config ( - UnionType[- str,- dict,- None]) – YAML configuration. Only needed if output_wpath is not provided.
- partitions ( - Union[- List[- str],- Dict[- str,- Optional[- str]]]) – List of partitions to plot
- n_events ( - int) – Maximal number of events to use for each partition for performance evaluation
- squared_distance_max ( - Union[- List[- float],- float,- None]) – Squared maximal distance in the embedding space
- k_max ( - Union[- List[- int],- int,- None]) – Maximal number of neighbours
- show_err ( - bool) – whether to show the error bars
- output_wpath ( - Optional[- str]) – wildcard path where the plots are saved, with placeholder {metric_name}
 
- Return type:
- Tuple[- Dict[- str,- Tuple[- Figure,- Axes]],- Dict[- str,- Dict[- str,- matrix]]]
- Returns:
- Tuple of 2 dictionaries. The first dictionary associates a metric name with the tuple of matplotlib Figure and Axes. The second dictionary associates a metric name with another dictionary that associates a partition with the list of metric values, for the different squared_distance_max or k_max values given as input.
 
pipeline.Embedding.embedding_validation module#
A module that defines tools to validate the embedding step.
- class pipeline.Embedding.embedding_validation.EmbeddingDistanceMaxExplorer(model, builder='default')[source]#
- Bases: - ParamExplorer
- A class that varies the maximal squared distance and compares the best achievable track-finding performance, in the case where all the fake edges are filtered out.
 - property default_step: str#
- Name of the step to fall back to if not provided. 
 - get_tracks(value, batches, k_max=None, processing=None)[source]#
- Get the dataframe of tracks from the inferred batches. - Parameters:
- value ( - float) – current value of the parameter that is explored
- batches ( - List[- Data]) – list of inferred batches
 
- Return type:
- DataFrame
- Returns:
- Dataframe of tracks, with columns track_id and hit_id
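With the default builder, tracks are obtained by applying a connected component algorithm on the hits. The sketch below shows one way to do this with a small union-find, returning a list of rows with the `track_id` and `hit_id` columns instead of a pandas DataFrame. The function name is hypothetical, and keeping isolated hits as single-hit tracks is an assumption.

```python
from typing import Dict, List


def connected_component_tracks(
    n_hits: int, edge_indices: List[List[int]]
) -> List[Dict[str, int]]:
    """Group hits linked by edges into tracks using union-find."""
    parent = list(range(n_hits))

    def find(i: int) -> int:
        # Follow parents up to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for src, dst in zip(edge_indices[0], edge_indices[1]):
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Attach the larger root to the smaller one.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    roots = [find(i) for i in range(n_hits)]
    # Number the tracks consecutively, in order of their smallest hit.
    track_ids = {root: tid for tid, root in enumerate(sorted(set(roots)))}
    return [
        {"track_id": track_ids[root], "hit_id": hit}
        for hit, root in enumerate(roots)
    ]


# Hits 0-1-2 form one chain and hits 3-4 another.
rows = connected_component_tracks(5, [[0, 1, 3], [1, 2, 4]])
```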
 
 
- pipeline.Embedding.embedding_validation.evaluate_embedding_performance(model, batches, squared_distance_max=None, k_max=None, overall=False)[source]#
- Compute the edge efficiency and edge purity of a given model, on a subset of the train, val or test dataset. - Parameters:
- model ( - EmbeddingBase) – PyTorch model inheriting from utils.modelutils.basemodel.ModelBase
- partition – train, val, test (for the current already loaded test sample) or the name of a test dataset
- squared_distance_max ( - Optional[- float]) – Squared maximal distance for the kNN. If not given, taken from the hyperparameter in the model.
- k_max ( - Optional[- int]) – Maximal number of neighbours for the kNN. If not given, taken from the hyperparameter in the model.
- n_events – Number of events to compute the performance metrics on 
- seed – Seed used to randomly select the n_events events
 
- Return type:
- Tuple[- Variable,- Variable,- Variable]
- Returns:
- A tuple of 3 ufloat numbers corresponding to the event-based averages of the edge efficiency, the edge purity and the graph size 
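The metrics are not defined on this page. The sketch below assumes the usual definitions for a single event: efficiency is the fraction of true edges that were built, purity is the fraction of built edges that are true, and graph size is the number of built edges. The function name is hypothetical, and the averaging over events with uncertainties (the ufloat values returned by the real function) is not modelled.

```python
from typing import List, Tuple


def edge_metrics_sketch(
    edge_indices: List[List[int]],
    true_edge_indices: List[List[int]],
) -> Tuple[float, float, int]:
    """Edge efficiency, edge purity and graph size for one event."""
    built = set(zip(edge_indices[0], edge_indices[1]))
    true = set(zip(true_edge_indices[0], true_edge_indices[1]))
    n_matched = len(built & true)
    # Undefined ratios are reported as NaN rather than raising.
    efficiency = n_matched / len(true) if true else float("nan")
    purity = n_matched / len(built) if built else float("nan")
    return efficiency, purity, len(built)


# 4 built edges, 3 true edges, 2 of which were built.
efficiency, purity, graph_size = edge_metrics_sketch(
    [[0, 1, 2, 3], [1, 2, 3, 0]], [[0, 1, 4], [1, 2, 5]]
)
```

Loosening `squared_distance_max` or `k_max` typically raises the efficiency and the graph size while lowering the purity, which is the trade-off the scans in this module are meant to explore.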
 
 
- pipeline.Embedding.embedding_validation.evaluate_embedding_performances_given_squared_distance_max_k_max(model, partitions=['train', 'val'], n_events=10, squared_distance_max=None, k_max=None, seed=None, overall=False)[source]#
- Compute edge efficiency, purity and graph size as a function of the maximal squared distance or the maximal number of neighbours in the k-nearest neighbour algorithm. - Parameters:
- model ( - EmbeddingBase) – Embedding model
- path_or_config – YAML configuration. Only needed if output_wpath is not provided.
- partitions ( - List[- str]) – List of partitions to plot
- n_events ( - int) – Maximal number of events to use for each partition for performance evaluation
- squared_distance_max ( - Union[- List[- float],- float,- None]) – Squared maximal distance in the embedding space
- k_max ( - Union[- List[- int],- int,- None]) – Maximal number of neighbours
- seed ( - Optional[- int]) – Random seed for the random choice of n_events events
 
- Return type:
- Dict[- str,- Dict[- str,- matrix]]
- Returns:
- Dictionary that associates a metric name with another dictionary that associates a partition with the list of metric values, for the different squared_distance_max or k_max values given as input.
 
pipeline.Embedding.process_custom module#
Custom functions for filtering and altering an event.
