Mar 1

Metric Learning and Siamese Networks

MT
Mindli Team

AI-Generated Content

In a world awash with data, simply classifying items into fixed categories is often insufficient. What if you need to determine if two faces belong to the same person, if a signature is genuine, or if a new product is similar to past bestsellers? This is the realm of metric learning, the process of learning a distance function that measures semantic similarity. At the heart of modern metric learning are Siamese networks, elegant architectures that learn to map inputs into an embedding space where distance corresponds to similarity. Mastering these concepts unlocks powerful capabilities in verification, matching, and learning from very few examples.

The Foundation: Siamese Networks and Shared Weights

A Siamese network is not a single network but a twin architecture comprising two or more identical subnetworks. The term "identical" is crucial: these subnetworks share the exact same weights and parameters. This design is the core innovation. Imagine training two identical twins to describe photographs; because they share the same "brain," they will describe similar images in a similar way and dissimilar images in different ways. The shared weights ensure that the two input samples are processed by the same function.

Technically, you take a pair of inputs (e.g., two images). Each input is fed into one of the twin networks. Each network outputs a feature vector, or an embedding, which is a dense, lower-dimensional representation of the input. The magic happens in the embedding space. The network is trained so that embeddings for similar items are close together (e.g., small Euclidean distance), while embeddings for dissimilar items are far apart. The shared weights guarantee that the same transformation is applied to both inputs, making the distance between their embeddings meaningful. This is fundamentally different from a classification network, which learns to separate categories by drawing boundaries. A Siamese network learns to pull things together or push them apart.
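
To make the shared-weight idea concrete, here is a minimal NumPy sketch in which a single linear map stands in for the full twin subnetworks; the 8-dimensional inputs and 3-dimensional embeddings are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# One weight matrix, shared by both "twins": a single linear embedding
# standing in for a full subnetwork (8-dim inputs -> 3-dim embeddings).
W = rng.standard_normal((8, 3))

def embed(x):
    """Map an input vector into the embedding space using the shared weights."""
    return x @ W

x1 = rng.standard_normal(8)
x2 = rng.standard_normal(8)

# Both inputs pass through the *same* function, so the distance between
# their embeddings is directly comparable across any pair of inputs.
distance = np.linalg.norm(embed(x1) - embed(x2))
```

Because `embed` is the same function for both inputs, swapping the order of the pair leaves the distance unchanged, which is exactly what makes it usable as a similarity measure.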

Core Loss Functions: Contrastive and Triplet Loss

Training a Siamese network requires a specialized loss function that operates on pairs or triplets of data. The most common objectives are contrastive loss and triplet loss.

Contrastive loss works directly on pairs. Each training sample is a pair of inputs (x₁, x₂) and a label y: y = 1 if the pair is similar (same class), and y = 0 if the pair is dissimilar. Let d be the Euclidean distance between the embeddings of x₁ and x₂. The contrastive loss function is defined as:

L = y · d² + (1 − y) · max(0, m − d)²

Here, m is a margin hyperparameter. The logic is intuitive: for similar pairs (y = 1), the loss is simply d², so we minimize the distance. For dissimilar pairs (y = 0), the loss is max(0, m − d)². This term is zero if the distance is already greater than the margin m. If the distance is less than m, it incurs a cost, pushing the embeddings of the dissimilar pair at least m units apart.
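
The per-pair loss can be written as a small function. This is an illustrative sketch of the formula itself, not a drop-in training objective; the default margin of 1.0 is an assumption:

```python
def contrastive_loss(d, y, margin=1.0):
    """Per-pair contrastive loss.

    d: Euclidean distance between the two embeddings
    y: 1 if the pair is similar, 0 if dissimilar
    """
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

loss_similar = contrastive_loss(0.5, y=1)    # similar pair: just d² = 0.25
loss_separated = contrastive_loss(1.5, y=0)  # already past the margin: 0.0
loss_too_close = contrastive_loss(0.4, y=0)  # inside the margin: penalized
```

Note the asymmetry: similar pairs are always pulled together, while dissimilar pairs stop contributing once they are at least `margin` apart.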

Triplet loss uses relative distance and operates on triplets: an anchor (a), a positive example similar to the anchor (p), and a negative example dissimilar to the anchor (n). The goal is to make the distance from the anchor to the positive, d(a, p), smaller than the distance from the anchor to the negative, d(a, n), by at least a margin α. The loss for a single triplet is:

L = max(0, d(a, p) − d(a, n) + α)

The network learns to satisfy the inequality d(a, p) + α ≤ d(a, n). This relative constraint often leads to a more robust embedding space than pair-based contrastive loss because it explicitly enforces a relative ordering of distances within each triplet.
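
Likewise, the per-triplet loss can be sketched directly from the formula. The margin of 0.2 and the 2-D example embeddings are illustrative assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Per-triplet loss on embedding vectors."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n_easy = np.array([1.0, 0.0])   # negative well beyond the margin: zero loss
n_hard = np.array([0.05, 0.0])  # negative closer than the positive: positive loss
```

The `n_easy` case is exactly the "easy triplet" problem discussed in the next section: it satisfies the constraint already and contributes no gradient.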

The Challenge of Training: Hard Negative Mining

A major practical challenge in training with triplet loss (and to a lesser extent, contrastive loss) is data sampling. If triplets are chosen randomly, most will easily satisfy the triplet constraint (d(a, p) + α ≤ d(a, n)), resulting in a loss of zero. Such "easy" triplets don't provide a useful gradient for learning. The network makes no progress.

This is where hard negative mining becomes critical. The strategy is to actively seek out informative triplets that violate the constraint and provide a strong learning signal. A hard negative is a negative example that is closer to the anchor than the positive example is, or is within the margin α. Mining these hard negatives can be done offline (by periodically scanning the dataset with the current model) or online (within each mini-batch during training). By focusing on these challenging cases, the model learns finer distinctions and creates a much tighter, well-separated embedding space. Without hard mining, training converges quickly but produces a weak model.
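
Online mining within a mini-batch can be sketched as follows: for each anchor, select the farthest same-label sample and the closest different-label sample (a "batch-hard" strategy). The helper name and toy batch are illustrative, and the embeddings are assumed to come from the current model:

```python
import numpy as np

def batch_hard_triplets(embeddings, labels):
    """For each anchor, pick the hardest positive (the farthest same-label
    sample) and the hardest negative (the closest different-label sample)."""
    # Pairwise Euclidean distance matrix for the whole mini-batch.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    same = labels[:, None] == labels[None, :]
    triplets = []
    for i in range(len(labels)):
        pos_mask = same[i].copy()
        pos_mask[i] = False          # the anchor is not its own positive
        neg_mask = ~same[i]
        if not pos_mask.any() or not neg_mask.any():
            continue                 # need at least one positive and one negative
        hardest_pos = np.where(pos_mask)[0][np.argmax(dist[i][pos_mask])]
        hardest_neg = np.where(neg_mask)[0][np.argmin(dist[i][neg_mask])]
        triplets.append((i, int(hardest_pos), int(hardest_neg)))
    return triplets

# Toy batch: samples 0 and 1 share a label; sample 2 is the only one of its class.
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 0, 1])
triplets = batch_hard_triplets(embeddings, labels)
```

In practice this runs on the embeddings of each mini-batch, so the "hardest" examples are refreshed as the model improves.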

Real-World Applications

The power of metric learning with Siamese networks is best illustrated through its transformative applications.

  • Face Verification and Recognition: This is the canonical example. A Siamese network is trained so that embeddings of different images of the same person have a small distance. During inference, a new face image is compared to a stored embedding; if the distance is below a threshold, access is granted. This technology powers smartphone unlocks and photo tagging.
  • Signature Matching: Banks and legal entities use this for biometric authentication. The network learns to create embeddings where genuine signature pairs are close, while forged or mismatched signatures are far apart, enabling automated verification.
  • Few-Shot Classification: This addresses the problem of learning new concepts from only one or a few examples. A Siamese network can be used as a similarity comparator. For a "5-way, 1-shot" task, you have one example image from each of 5 new classes. A query image is passed through the Siamese twin alongside each of the 5 support examples. The network predicts the class whose support image embedding is closest to the query embedding. It learns a general notion of similarity that transfers to novel categories.
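
The nearest-support rule used in few-shot classification can be sketched in a few lines. The function name, the 2-D embeddings, and the toy labels are illustrative assumptions:

```python
import numpy as np

def predict_few_shot(query_emb, support_embs, support_labels):
    """Assign the query the label of its nearest support embedding.
    All embeddings are assumed to come from the same shared network."""
    dists = np.linalg.norm(support_embs - query_emb, axis=1)
    return support_labels[int(np.argmin(dists))]

# Toy 2-way, 1-shot task: one support embedding per class.
support_embs = np.array([[0.0, 0.0], [5.0, 5.0]])
support_labels = ["cat", "dog"]
```

Nothing about the classes themselves is baked into the network; only the learned notion of distance is, which is why this transfers to categories never seen in training.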

Common Pitfalls

  1. Poor Triplet Sampling: As discussed, using random triplets leads to trivial learning and poor model performance. Always implement a form of hard or semi-hard negative mining. A simple starting strategy is to select the hardest negative within each mini-batch for every anchor-positive pair.
  2. Choosing the Wrong Margin: The margin hyperparameter (α in triplet loss, m in contrastive loss) is critical. A margin that is too small allows embeddings to collapse (everything is close together). A margin that is too large makes the optimization problem too difficult and can lead to unstable training. It must be tuned for your specific dataset and embedding dimensionality.
  3. Ignoring Data Imbalance: In verification tasks, the number of dissimilar pairs can vastly outnumber similar pairs. If not accounted for (e.g., through careful batch construction or loss weighting), the model may become biased toward simply pushing all embeddings apart. Ensure your training batches have a balanced mix of positive and negative pairs or triplets.
  4. Overfitting on Small Datasets: Metric learning often requires large and varied datasets to learn a generalizable notion of similarity. With insufficient data, the network may memorize training relationships rather than learn a robust distance function. Techniques like strong data augmentation (for images) and dropout within the embedding network are essential.
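
One simple way to address the imbalance described in pitfall 3 is to construct each batch from equal numbers of positive and negative pairs. A sketch, assuming the data is a dict mapping class names to item lists with at least two classes and two items per class:

```python
import random

def balanced_pair_batch(items_by_class, batch_pairs=32, seed=0):
    """Draw equal numbers of positive (same-class, label 1) and negative
    (different-class, label 0) pairs."""
    rng = random.Random(seed)
    classes = list(items_by_class)
    batch = []
    for _ in range(batch_pairs // 2):
        # Positive pair: two distinct items from one randomly chosen class.
        c = rng.choice(classes)
        a, b = rng.sample(items_by_class[c], 2)
        batch.append((a, b, 1))
        # Negative pair: one item from each of two different classes.
        c1, c2 = rng.sample(classes, 2)
        batch.append((rng.choice(items_by_class[c1]),
                      rng.choice(items_by_class[c2]), 0))
    rng.shuffle(batch)
    return batch

pairs = balanced_pair_batch({"genuine": [1, 2, 3], "forged": [4, 5]},
                            batch_pairs=8)
```

The resulting batch is exactly half positive and half negative regardless of how skewed the underlying class sizes are.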

Summary

  • Metric learning focuses on learning a distance function in an embedding space, where smaller distances indicate greater semantic similarity.
  • Siamese networks use twin subnetworks with shared weights to generate comparable embeddings for input pairs or triplets, making distance calculations meaningful.
  • Contrastive loss trains on labeled pairs to minimize distance for similar pairs and maximize it (beyond a margin) for dissimilar pairs.
  • Triplet loss uses an anchor, positive, and negative example to learn relative distances, enforcing that the positive is closer to the anchor than the negative by a specified margin.
  • Effective training requires hard negative mining to find challenging examples that provide meaningful gradient signals, preventing the model from stagnating on easy cases.
  • Key applications include face verification, signature matching, and few-shot classification, where determining similarity is more valuable than assigning a fixed label.
