pipeline.Embedding package#
A package that handles the embedding stage of the pipeline. This stage consists of creating a rough graph by embedding the hit coordinates into an embedding space. The embedding network is trained so that hits that are likely to be connected by an edge are brought close to one another, while disconnected hits are pushed apart.
Then, a \(k\)-nearest neighbour algorithm is applied to create the rough graph.
Subpackages#
pipeline.Embedding.embedding_base module#
A module that defines the embedding training and inference.
- class pipeline.Embedding.embedding_base.EmbeddingBase(*args: Any, **kwargs: Any)[source]#
Bases: ModelBase
A class that implements the metric learning model.
- append_true_pairs(training_edge_indices, y_truth, true_edge_indices, planes)[source]#
Append the true edges to the tensor of training edges.
- Parameters:
training_edge_indices (Tensor) – training sample of edge indices
y_truth (Tensor) – whether the edges in training_edge_indices are genuine or fake
true_edge_indices (Tensor) – all the genuine edge indices
- Return type:
Tuple[Tensor, Tensor]
- Returns:
Training edge indices with the true edge indices added, and updated y_truth.
- build_edges(embeddings, planes, k_max, squared_distance_max, query_embeddings=None, query_indices=None)[source]#
Build edges by applying kNNs.
Edges are built by looping over the planes and drawing neighbours between a plane and the next plane_range planes, where plane_range is a hyperparameter.
- Parameters:
embeddings (torch.Tensor) – embeddings of all the points
planes (torch.Tensor) – planes of all the points
k_max (int) – maximum number of neighbours for the kNN
squared_distance_max (float) – maximum squared distance in the embedding space for two points to be considered neighbours
query_embeddings (torch.Tensor | None) – embeddings of the query points
query_indices (torch.Tensor | None) – indices of the query points (in embeddings)
- Return type:
torch.Tensor
- Returns:
Edges built by the kNN.
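The plane-by-plane neighbour search described for build_edges can be illustrated with a brute-force sketch in plain Python. The function name `build_edges_sketch`, the list-based inputs, the strict `<` distance cut and the representation of edges as (source, target) pairs are assumptions made for illustration; the actual method works on torch tensors and its internals are not documented here.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedded points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_edges_sketch(
    embeddings: List[Sequence[float]],
    planes: List[int],
    k_max: int,
    squared_distance_max: float,
    plane_range: int = 1,
) -> List[Tuple[int, int]]:
    """Brute-force kNN from each hit to hits in the next `plane_range` planes."""
    edges = []
    for i, (emb_i, plane_i) in enumerate(zip(embeddings, planes)):
        # Candidate neighbours: hits located 1..plane_range planes further on.
        candidates = [
            (squared_l2(emb_i, embeddings[j]), j)
            for j, plane_j in enumerate(planes)
            if 0 < plane_j - plane_i <= plane_range
        ]
        # Keep candidates within the distance cut, closest first, at most k_max.
        close = sorted(c for c in candidates if c[0] < squared_distance_max)
        edges.extend((i, j) for _, j in close[:k_max])
    return edges


embeddings = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (0.2, 0.0)]
planes = [0, 1, 1, 2]
edges = build_edges_sketch(embeddings, planes, k_max=1, squared_distance_max=0.05)
print(edges)
```

In this toy event, the distant hit on plane 1 is rejected by the distance cut, so only the two short edges between consecutive planes survive.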
- property edgedir: str#
Edge direction: left or right.
- get_hnm_pairs(query_embeddings, query_indices, embeddings, planes)[source]#
Get the edges from hard-negative mining.
- Parameters:
query_embeddings (Tensor) – embeddings of the query points
query_indices (Tensor) – corresponding indices of the query points
embeddings (Tensor) – embeddings of all the points
planes (Tensor) – planes of all the points
- Return type:
Tensor
- Returns:
Edge indices of the hard-negative mined edges
- get_lazy_dataset(*args, **kwargs)[source]#
Get the lazy dataset object.
- Parameters:
input_dir – input directory
n_events – number of events to load
shuffle – whether to shuffle the input paths (applied before selecting the first n_events)
seed – seed for the shuffling
**kwargs – other keyword arguments passed to the utils.loaderutils.dataiterator.LazyDatasetBase constructor
- Return type:
- Returns:
utils.loaderutils.dataiterator.LazyDatasetBase object
- get_lazy_dataset_partition(partition, *args, **kwargs)[source]#
Get the lazy dataset of a partition.
- Parameters:
partition (str) – train, val or name of the test dataset
n_events – number of events to load
shuffle – whether to shuffle the input paths (applied before selecting the first n_events)
seed – seed for the shuffling
**kwargs – other keyword arguments passed to ModelBase.get_lazy_dataset()
- Return type:
LazyDatasetBase
- Returns:
Lazy dataset of the partition
- get_loss(embeddings, edge_indices, y_truth, weights=None)[source]#
Compute the loss for the given embeddings and edges.
- Parameters:
embeddings (torch.Tensor) – embeddings of all the points
edge_indices (torch.Tensor) – edge indices
y_truth (torch.Tensor) – for each edge (column) in edge_indices, whether this edge is genuine (True) or fake (False)
weights (torch.Tensor | None) – edge weights
- Return type:
torch.Tensor
- Returns:
Value of the siamese-like loss
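As an illustration of what a siamese-like loss can look like, here is a sketch of a common hinge-style contrastive loss in plain Python. It is not necessarily the formula used by get_loss: the function name `contrastive_loss_sketch`, the `margin` parameter, the use of squared distances in the hinge and the (source, target) pair layout are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def contrastive_loss_sketch(
    embeddings: List[Sequence[float]],
    edge_indices: List[Tuple[int, int]],
    y_truth: List[bool],
    weights: Optional[List[float]] = None,
    margin: float = 1.0,  # hypothetical hyperparameter, not taken from the source
) -> float:
    """Mean hinge-style contrastive loss over the given edges."""
    if weights is None:
        weights = [1.0] * len(edge_indices)
    total = 0.0
    for (i, j), genuine, weight in zip(edge_indices, y_truth, weights):
        d2 = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
        # Genuine edges are pulled together; fake edges are pushed beyond the margin.
        term = d2 if genuine else max(0.0, margin - d2)
        total += weight * term
    return total / len(edge_indices)


embeddings = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)]
loss = contrastive_loss_sketch(
    embeddings,
    edge_indices=[(0, 1), (0, 2), (1, 2)],
    y_truth=[True, False, False],
)
print(loss)
```

Only the genuine edge contributes here, because both fake edges are already further apart than the margin.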
- get_query_points(embeddings, true_edge_indices, planes=None, query_mask=None)[source]#
Get the points the edges will be drawn from to generate the training set.
- Parameters:
embeddings (torch.Tensor) – point embeddings
true_edge_indices (torch.Tensor) – true edge indices
particle_ids – particle IDs for each point in embeddings
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- Returns:
1D tensor of query indices and 2D tensor of query embeddings
- get_random_pairs(query_indices, planes)[source]#
Get random edges drawn from the query points.
- Parameters:
query_indices (Tensor) – indices of the query points
embeddings – embeddings of all the points
planes (Tensor) – planes of all the points. Only used for non-directional graphs, as random pairs are only drawn from one plane to one of the next plane_range planes (where plane_range is a hyperparameter).
- Return type:
Tensor
- Returns:
Edge indices of random edges drawn from the query points
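The plane constraint on random pairs can be sketched as follows in plain Python. The function name `random_pairs_sketch`, the `n_per_query` and `seed` parameters and the pair layout are hypothetical; the sketch only shows the idea of drawing targets from the next plane_range planes.

```python
import random
from typing import List, Tuple


def random_pairs_sketch(
    query_indices: List[int],
    planes: List[int],
    plane_range: int = 1,
    n_per_query: int = 2,  # hypothetical: number of random targets per query hit
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Draw random (query, target) pairs towards the next `plane_range` planes."""
    rng = random.Random(seed)
    pairs = []
    for q in query_indices:
        # Allowed targets lie 1..plane_range planes after the query hit.
        candidates = [
            j for j, plane in enumerate(planes)
            if 0 < plane - planes[q] <= plane_range
        ]
        if not candidates:
            continue  # e.g. query hit on the last plane
        for _ in range(n_per_query):
            pairs.append((q, rng.choice(candidates)))
    return pairs


planes = [0, 0, 1, 1, 2]
pairs = random_pairs_sketch(query_indices=[0, 1, 4], planes=planes)
print(pairs)
```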
- get_squared_distances(embeddings, edge_indices)[source]#
Get the squared distances
- Parameters:
embeddings (Tensor) – embeddings of all the points
edge_indices (Tensor) – edge indices
- Return type:
Tensor
- Returns:
squared_distances tensor corresponding to the squared L2 distance between the embeddings of the hits of every edge.
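A minimal sketch of the per-edge squared L2 distance, written with plain lists instead of tensors. The function name `squared_distances_sketch` and the (source, target) pair layout are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distances_sketch(
    embeddings: List[Sequence[float]],
    edge_indices: List[Tuple[int, int]],
) -> List[float]:
    """Squared L2 distance between the two embedded hits of every edge."""
    return [
        sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
        for i, j in edge_indices
    ]


embeddings = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
squared_distances = squared_distances_sketch(embeddings, [(0, 1), (0, 2)])
print(squared_distances)
```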
- get_training_edges(embeddings, true_edge_indices, planes, query_mask=None)[source]#
Get the edges used for the training.
- Parameters:
embeddings (torch.Tensor) – Embeddings of all the points
true_edge_indices (torch.Tensor) – 2D tensor of genuine edge indices
particle_ids – tensor of particle IDs for every point. Only used in the query_noise_points regime
planes (torch.Tensor) – tensor of planes for every point. Only used for one-directional graphs.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- Returns:
2D tensor of training edge indices and 1D tensor indicating whether the corresponding edge is genuine or fake.
- get_truth(edge_indices, true_edge_indices)[source]#
Get the true label of each edge (whether it’s genuine or fake).
- Parameters:
edge_indices (Tensor) – edge indices
true_edge_indices (Tensor) – the true edge indices
- Return type:
Tuple[Tensor, Tensor]
- Returns:
2 one-dimensional torch tensors. The first tensor is the tensor of edge indices, which may be slightly reordered. The second tensor contains, for each edge (column) in edge_indices, whether this edge is genuine (1) or fake (0).
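The labelling step can be sketched as a set-membership test in plain Python. The function name `get_truth_sketch` is hypothetical, edges are (source, target) pairs rather than tensor columns, and the sketch assumes that edge direction matters; it returns only the labels and leaves the edge order unchanged.

```python
from typing import List, Tuple


def get_truth_sketch(
    edge_indices: List[Tuple[int, int]],
    true_edge_indices: List[Tuple[int, int]],
) -> List[int]:
    """Label each edge as genuine (1) or fake (0)."""
    true_set = set(true_edge_indices)
    return [1 if edge in true_set else 0 for edge in edge_indices]


y_truth = get_truth_sketch(
    edge_indices=[(0, 1), (0, 2), (1, 3)],
    true_edge_indices=[(0, 1), (1, 3), (2, 4)],
)
print(y_truth)
```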
- inference(batch, squared_distance_max, k_max, evaluate=False, overall=False, log=False)[source]#
Run the embedding inference + kNN to build edges of an event.
- Parameters:
batch (Data) – event PyTorch data object
squared_distance_max (float) – squared maximal distance in the embedding space
k_max (int) – maximal number of neighbours
evaluate (bool) – whether to also output the loss, efficiency and purity
overall (bool) – if batch already contains edge_index, whether to concatenate the new edges to the old edge indices instead of replacing them
log (bool) – whether to add an entry to the log
- Return type:
Dict[str, Tensor]
- property input_kwargs: Dict[str, Any]#
Associates an input name with a dictionary corresponding to the keyword arguments used to build a dummy tensor representing the input. This dictionary basically gives the size and dtype of the tensor.
- property input_to_dynamic_axes#
A dictionary that associates an input name with the dynamic axis specification.
- property last_plane: int#
Index of the last plane.
- property n_planes: int#
Number of unremoved planes (e.g., xz-scifi).
- property n_total_planes: int#
Total number of planes.
- property query_planes: torch.Tensor | None#
Planes that can be queried.
- remove_planes(features, planes, true_edge_index=None)[source]#
Remove hits belonging to the planes given by the hyperparameter removed_planes.
- Parameters:
features (torch.Tensor) – hit features
planes (torch.Tensor) – hit plane indices
true_edge_index – optionally, tensor of true edge indices
- Return type:
Tuple[torch.Tensor, TensorOrNone, torch.Tensor, torch.Tensor | None]
- Returns:
Reindexed hit features, true edge indices, planes and original hit indices. If no plane is removed, this is indicated by original hit indices being None.
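The reindexing implied by removing planes can be sketched in plain Python as below. The function name `remove_planes_sketch`, passing `removed_planes` as an argument instead of reading a hyperparameter, and dropping true edges that touch a removed hit are assumptions; the return order follows the description above.

```python
from typing import List, Optional, Set, Tuple


def remove_planes_sketch(
    features: List[object],
    planes: List[int],
    removed_planes: Set[int],
    true_edges: Optional[List[Tuple[int, int]]] = None,
):
    """Drop hits on removed planes and reindex the remaining hits and edges."""
    keep = [i for i, plane in enumerate(planes) if plane not in removed_planes]
    if len(keep) == len(planes):
        # Nothing removed: signalled by original hit indices being None.
        return features, true_edges, planes, None
    new_index = {old: new for new, old in enumerate(keep)}
    kept_edges = None
    if true_edges is not None:
        # Assumption: edges touching a removed hit are discarded.
        kept_edges = [
            (new_index[s], new_index[t])
            for s, t in true_edges
            if s in new_index and t in new_index
        ]
    kept_features = [features[i] for i in keep]
    kept_planes = [planes[i] for i in keep]
    return kept_features, kept_edges, kept_planes, keep


result = remove_planes_sketch(
    features=["a", "b", "c", "d"],
    planes=[0, 1, 2, 3],
    removed_planes={1},
    true_edges=[(0, 1), (1, 2), (2, 3)],
)
print(result)
```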
- property subnetwork_to_outputs: Dict[str, List[str]]#
A dictionary that associates a subnetwork name with the list of its output names.
- to_onnx(outpath, mode=None, options=None)[source]#
Save model to an ONNX file.
- Parameters:
outpath (str) – path where to save the ONNX model
mode (Optional[Literal['default']]) – only default mode is supported
- Return type:
None
- training_validation_step(batch, with_grad=False)[source]#
Common step for the training and validation steps. This encompasses selecting query hits, drawing edges from them, running the embedding inference and computing the loss.
- Return type:
Dict[str, Tensor]
- class pipeline.Embedding.embedding_base.EmbeddingLazyDataSet(*args, particle_requirement=None, query_particle_requirement=None, target_particle_requirement=None, particles_from_parquet=False, **kwargs)[source]#
Bases: LazyDatasetBase
- fetch_dataset(input_path, map_location='cpu', **kwargs)[source]#
Load and process one PyTorch DataSet.
- Parameters:
input_path (str) – path to the PyTorch dataset
map_location (str) – location where to load the dataset
**kwargs – other keyword arguments passed to torch.load()
- Returns:
Loaded PyTorch data object
pipeline.Embedding.build_embedding module#
- class pipeline.Embedding.build_embedding.EmbeddingInferenceBuilder(model, k_max=1000, squared_distance_max=0.1, max_plane_diff=None)[source]#
Bases: ModelBuilderBase
- pipeline.Embedding.build_embedding.get_squared_distance_max_from_config(path_or_config)[source]#
Get the value of squared_distance_max_inference, or fall back to squared_distance_max if squared_distance_max_inference was set to None.
- Parameters:
path_or_config (str | dict) – configuration dictionary, or path to the YAML file that contains the configuration
- Return type:
float
- Returns:
Squared maximal distance used for the embedding inference
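The fallback logic can be sketched as follows. This sketch assumes a flat, already-loaded configuration dictionary; the real function also accepts a path to a YAML file, and the actual location of these keys inside the configuration is not specified here. The function name `squared_distance_max_sketch` is hypothetical.

```python
from typing import Any, Dict


def squared_distance_max_sketch(config: Dict[str, Any]) -> float:
    """Prefer the inference-specific value, else fall back to the training one."""
    value = config.get("squared_distance_max_inference")
    if value is None:
        value = config["squared_distance_max"]
    return float(value)


fallback = squared_distance_max_sketch(
    {"squared_distance_max": 0.04, "squared_distance_max_inference": None}
)
explicit = squared_distance_max_sketch(
    {"squared_distance_max": 0.04, "squared_distance_max_inference": 0.09}
)
print(fallback, explicit)
```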
pipeline.Embedding.embedding_plots module#
A module that handles the validation plots for the embedding phase specifically.
- pipeline.Embedding.embedding_plots.plot_best_performances_squared_distance_max(model, path_or_config, partition, list_squared_distance_max, k_max=None, n_events=None, seed=None, builder='default', step='embedding', identifier=None, **kwargs)[source]#
Plot best performance for perfect inference as a function of the squared maximal distance.
- Parameters:
model (EmbeddingBase) – Embedding model
path_or_config (str | dict) – YAML configuration. Only needed if output_wpath is not provided.
list_squared_distance_max (Sequence[float]) – list of squared maximal distances to try
k_max (int | None) – Maximal number of neighbours. If not provided, the one stored in the model is used.
n_events (int | None) – Maximal number of events to use for each partition for performance evaluation
seed (int | None) – Random seed for the random choice of n_events events
show_err – whether to show the error bars
builder (str) – Builder to use to build the tracks after the GNN. It can be default (build the tracks by applying a connected component algorithm on the hits) or triplets (build triplets and form the tracks from these triplets).
- Return type:
Tuple[Figure | npt.NDArray, List[Axes], Dict[float, Dict[Tuple[str | None, str], Dict[str, float]]]]
- pipeline.Embedding.embedding_plots.plot_embedding_performance_given_squared_distance_max_k_max(model, path_or_config=None, partitions=['train', 'val'], n_events=10, squared_distance_max=None, k_max=None, show_err=True, output_wpath=None, lhcb=False, step='embedding', overall=False)[source]#
Plot edge efficiency, purity and graph size as a function of the maximal squared distance (squared_distance_max) or the maximal number of neighbours (k_max) in the k-nearest neighbour algorithm.
- Parameters:
model (EmbeddingBase) – Embedding model
path_or_config (UnionType[str, dict, None]) – YAML configuration. Only needed if output_wpath is not provided.
partitions (Union[List[str], Dict[str, Optional[str]]]) – List of partitions to plot
n_events (int) – Maximal number of events to use for each partition for performance evaluation
squared_distance_max (Union[List[float], float, None]) – Squared maximal distance in the embedding space
k_max (Union[List[int], int, None]) – Maximal number of neighbours
show_err (bool) – whether to show the error bars
output_wpath (Optional[str]) – wildcard path where the plots are saved, with placeholder {metric_name}
- Return type:
Tuple[Dict[str, Tuple[Figure, Axes]], Dict[str, Dict[str, matrix]]]
- Returns:
Tuple of 2 dictionaries. The first dictionary associates a metric name with the tuple of matplotlib Figure and Axes. The second dictionary associates a metric name with another dictionary that associates a partition with the list of metric values, for the different squared_distance_max or k_max given as input.
pipeline.Embedding.embedding_validation module#
A module that defines tools to validate the embedding step.
- class pipeline.Embedding.embedding_validation.EmbeddingDistanceMaxExplorer(model, builder='default')[source]#
Bases: ParamExplorer
A class that varies the maximal squared distance and compares the best track-finding performance, in the case where all the fake edges are filtered out.
- property default_step: str#
Name of the step to fall back to if not provided.
- get_tracks(value, batches, k_max=None, processing=None)[source]#
Get the dataframe of tracks from the inferred batches.
- Parameters:
value (float) – current value of the parameter that is explored
batches (List[Data]) – list of inferred batches
- Return type:
DataFrame
- Returns:
Dataframe of tracks, with columns track_id and hit_id
- pipeline.Embedding.embedding_validation.evaluate_embedding_performance(model, batches, squared_distance_max=None, k_max=None, overall=False)[source]#
Compute the edge efficiency and edge purity of a given model, on a subset of the train, val or test dataset.
- Parameters:
model (EmbeddingBase) – PyTorch model inheriting from utils.modelutils.basemodel.ModelBase
partition – train, val, test (for the current already loaded test sample) or the name of a test dataset
squared_distance_max (Optional[float]) – Maximal squared distance for the kNN. If not given, taken from the hyperparameter in the model.
k_max (Optional[int]) – Maximal number of neighbours for the kNN. If not given, taken from the hyperparameter in the model.
n_events – Number of events to compute the performance metrics on
seed – Seed used to randomly select the n_events events
- Return type:
Tuple[Variable, Variable, Variable]
- Returns:
A tuple of 3 ufloat numbers corresponding to the event-based average of the edge efficiency, the edge purity and the graph size
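For a single event, the three metrics can be sketched as below using their usual definitions: efficiency as the fraction of true edges that were built, purity as the fraction of built edges that are true, and graph size as the number of built edges. The function name `edge_metrics_sketch` and the pair-based edge layout are assumptions; the real function averages over events and returns values with uncertainties, which this sketch does not attempt.

```python
from typing import List, Tuple


def edge_metrics_sketch(
    edges: List[Tuple[int, int]],
    true_edges: List[Tuple[int, int]],
) -> Tuple[float, float, int]:
    """Edge efficiency, edge purity and graph size for one event."""
    built = set(edges)
    truth = set(true_edges)
    n_true_built = len(built & truth)
    efficiency = n_true_built / len(truth)  # true edges recovered
    purity = n_true_built / len(built)      # built edges that are genuine
    return efficiency, purity, len(built)


efficiency, purity, graph_size = edge_metrics_sketch(
    edges=[(0, 1), (1, 2), (0, 2), (2, 3)],
    true_edges=[(0, 1), (1, 2), (3, 4)],
)
print(efficiency, purity, graph_size)
```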
- pipeline.Embedding.embedding_validation.evaluate_embedding_performances_given_squared_distance_max_k_max(model, partitions=['train', 'val'], n_events=10, squared_distance_max=None, k_max=None, seed=None, overall=False)[source]#
Compute edge efficiency, purity and graph size as a function of the maximal squared distance (squared_distance_max) or the maximal number of neighbours (k_max) in the k-nearest neighbour algorithm.
- Parameters:
model (EmbeddingBase) – Embedding model
path_or_config – YAML configuration. Only needed if output_wpath is not provided.
partitions (List[str]) – List of partitions to plot
n_events (int) – Maximal number of events to use for each partition for performance evaluation
squared_distance_max (Union[List[float], float, None]) – Squared maximal distance in the embedding space
k_max (Union[List[int], int, None]) – Maximal number of neighbours
seed (Optional[int]) – Random seed for the random choice of n_events events
- Return type:
Dict[str, Dict[str, matrix]]
- Returns:
Dictionary that associates a metric name with another dictionary that associates a partition with the list of metric values, for the different squared_distance_max or k_max given as input.
pipeline.Embedding.process_custom module#
Custom functions for filtering and altering an event.