The previous pages examined how to select which training instances to keep (CNN, ENN). But there's a fundamentally different question we've been ignoring: Are we measuring distance correctly in the first place?
K-Nearest Neighbors treats all features equally—or, with weighted variants, uses hand-tuned weights. But this assumption is almost always wrong. Consider these scenarios:
Scenario 1: Irrelevant Features
A dataset contains 100 features, but only 10 are actually predictive. Euclidean distance treats all 100 equally, letting the 90 irrelevant features dominate and obscure true similarity.
Scenario 2: Correlated Features
Two features measure nearly the same thing (e.g., height in inches and height in centimeters). Euclidean distance double-counts this dimension, distorting the neighborhood structure.
Scenario 3: Different Scales of Relevance
Feature A has a correlation of 0.9 with the target; Feature B has 0.1. They should not contribute equally to distance.
The fundamental insight: The optimal distance metric is data-dependent. We can learn it from the training data.
Large Margin Nearest Neighbors (LMNN), introduced by Weinberger and Saul in 2009, does exactly this. It learns a Mahalanobis distance metric that pulls same-class points (target neighbors) closer while pushing different-class points (imposters) away—creating large margins analogous to Support Vector Machines.
By completing this page, you will understand the Mahalanobis distance and its parameterization, grasp the LMNN objective function and its margin-based motivation, learn the optimization procedure including semidefinite programming relaxation, implement metric learning for KNN, and understand connections to dimensionality reduction and feature learning.
Before diving into LMNN, we must understand the family of distances it learns: Mahalanobis distances.
The familiar Euclidean distance between two points $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^d$ is:
$$d_E(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)} = \|\mathbf{x}_i - \mathbf{x}_j\|_2$$
This treats all dimensions equally with weight 1.
A simple extension weights dimensions differently:
$$d_W(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T W (\mathbf{x}_i - \mathbf{x}_j)}$$
where $W = \text{diag}(w_1, \ldots, w_d)$ is a diagonal weight matrix. Dimension $j$ contributes $w_j$ times more to the distance.
The generalized Mahalanobis distance uses a full positive semi-definite (PSD) matrix $M$:
$$d_M(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T M (\mathbf{x}_i - \mathbf{x}_j)}$$
Key properties: because $M$ is PSD, $d_M$ is symmetric, non-negative, and satisfies the triangle inequality (it is a pseudo-metric; distinct points can have zero distance if $M$ is rank-deficient). Setting $M = I$ recovers plain Euclidean distance, and a diagonal $M$ recovers the weighted Euclidean distance above.
The Mahalanobis distance has a beautiful geometric interpretation. Since $M$ is PSD, we can write:
$$M = L^T L$$
for some matrix $L \in \mathbb{R}^{r \times d}$ where $r = \text{rank}(M)$.
Then:
$$d_M(\mathbf{x}_i, \mathbf{x}_j)^2 = (\mathbf{x}_i - \mathbf{x}_j)^T L^T L (\mathbf{x}_i - \mathbf{x}_j) = ||L\mathbf{x}_i - L\mathbf{x}_j||_2^2$$
Insight: The Mahalanobis distance with matrix $M = L^T L$ is equivalent to Euclidean distance after linearly transforming the data by $L$.
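A small numerical check of this equivalence (the vectors and the 2×3 matrix `L` below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
L = rng.normal(size=(2, 3))        # low-rank transform: 3 features -> 2 dimensions
M = L.T @ L                        # induced Mahalanobis matrix (PSD by construction)

diff = x_i - x_j
d_mahalanobis = np.sqrt(diff @ M @ diff)             # sqrt((x_i - x_j)^T M (x_i - x_j))
d_transformed = np.linalg.norm(L @ x_i - L @ x_j)    # Euclidean distance after applying L

print(d_mahalanobis, d_transformed)   # identical up to floating-point error
```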
This means learning a Mahalanobis metric is the same as learning a linear transformation of the input space, with ordinary Euclidean distance applied afterward. The table below compares the metric families:
| Metric | Parameters | Degrees of Freedom | Captures |
|---|---|---|---|
| Euclidean | None | 0 | Nothing (fixed) |
| Weighted Euclidean | d diagonal weights | d | Feature importance |
| Mahalanobis (diagonal M) | d diagonal entries | d | Feature importance |
| Mahalanobis (full M) | d×d symmetric PSD matrix | d(d+1)/2 | Importance + correlations |
| Low-rank Mahalanobis | L ∈ ℝ^(r×d) | r × d | Importance + correlations + reduction |
With d features, a full Mahalanobis matrix has d(d+1)/2 parameters. For d=100, that's 5,050 parameters. For d=1000, it's 500,500 parameters. Low-rank factorizations (L ∈ ℝ^(r×d) with r << d) dramatically reduce parameters while still capturing the most important transformations.
Large Margin Nearest Neighbors (LMNN) learns a Mahalanobis distance matrix $M$ by optimizing an objective that mirrors SVM's margin-maximization philosophy. The goal: make k-NN classification as reliable as possible.
Target Neighbors: For each training point $\mathbf{x}_i$, its target neighbors are the k points of the same class that should be closest to it. These are typically determined using the original Euclidean distance or can be pre-specified.
Imposters: An imposter for $\mathbf{x}_i$ is any point of a different class that intrudes into $\mathbf{x}_i$'s neighborhood, potentially causing misclassification.
LMNN's objective has two competing terms:
1. Pull Term (Attraction): Pull target neighbors closer
$$\epsilon_{\text{pull}}(M) = \sum_{i} \sum_{j \in \mathcal{N}_i} d_M(\mathbf{x}_i, \mathbf{x}_j)^2$$
where $\mathcal{N}_i$ is the set of k target neighbors of $\mathbf{x}_i$.
This term alone would collapse everything to a point!
2. Push Term (Repulsion): Push imposters away with a margin
$$\epsilon_{\text{push}}(M) = \sum_{i} \sum_{j \in \mathcal{N}_i} \sum_{l: y_l \neq y_i} \max\left(0,\; 1 + d_M(\mathbf{x}_i, \mathbf{x}_j)^2 - d_M(\mathbf{x}_i, \mathbf{x}_l)^2\right)$$
This hinge loss activates when an imposter $\mathbf{x}_l$ is closer to $\mathbf{x}_i$ than a target neighbor $\mathbf{x}_j$ by less than margin 1.
Combined Objective:
$$\mathcal{L}(M) = (1-\mu) \cdot \epsilon_{\text{pull}}(M) + \mu \cdot \epsilon_{\text{push}}(M)$$
where $\mu \in (0, 1)$ balances attraction and repulsion (typically $\mu = 0.5$).
The '+1' in the hinge loss creates a unit margin. We want imposters to be at least 1 unit further from xᵢ than any target neighbor xⱼ. This mirrors SVM's margin concept: we're not just trying to classify correctly, we're trying to classify with confidence.
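As a quick worked example of when the hinge activates, suppose (hypothetical numbers) a target neighbor sits at squared distance 2.0 from $\mathbf{x}_i$ and an imposter at squared distance 2.5:

$$\max\left(0,\; 1 + d_M(\mathbf{x}_i, \mathbf{x}_j)^2 - d_M(\mathbf{x}_i, \mathbf{x}_l)^2\right) = \max(0,\; 1 + 2.0 - 2.5) = 0.5$$

The imposter is already farther away than the target neighbor, yet it still incurs a penalty of 0.5 because it sits inside the unit margin; only once its squared distance exceeds $2.0 + 1 = 3.0$ does the loss drop to zero.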
Consider a small 2D example with three points: an anchor $\mathbf{x}_i$, a target neighbor $\mathbf{x}_j$ of the same class, and an imposter $\mathbf{x}_l$ of a different class. Before learning $M$ (i.e., under Euclidean distance), the imposter lies inside the margin, closer to $\mathbf{x}_i$ than it should be. The objective pushes for $d_M(\mathbf{x}_i, \mathbf{x}_l)^2 \geq d_M(\mathbf{x}_i, \mathbf{x}_j)^2 + 1$. After learning $M$, the space is compressed along within-class directions and stretched along class-separating directions, so the target neighbor stays close while the imposter falls outside the unit margin.
LMNN can be reformulated as a constrained optimization problem:
$$\min_{M,\,\xi}\; (1-\mu) \sum_i \sum_{j \in \mathcal{N}_i} d_M(\mathbf{x}_i, \mathbf{x}_j)^2 + \mu \sum_{i,j,l} \xi_{ijl}$$
subject to:

$$d_M(\mathbf{x}_i, \mathbf{x}_l)^2 - d_M(\mathbf{x}_i, \mathbf{x}_j)^2 \geq 1 - \xi_{ijl} \quad \text{for all } j \in \mathcal{N}_i,\; y_l \neq y_i$$

$$\xi_{ijl} \geq 0, \qquad M \succeq 0$$
This is a semidefinite program (SDP), a convex optimization problem with known efficient solvers.
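To make the formulation concrete, here is a minimal sketch of that SDP using the cvxpy modeling library; it assumes a small dataset and a precomputed dictionary of target neighbors (such as the one built by `find_target_neighbors` in the implementation later on), and the variable names are illustrative:

```python
import numpy as np
import cvxpy as cp

def lmnn_sdp(X, y, target_neighbors, mu=0.5):
    """Solve the LMNN SDP directly. Only feasible for small n and d."""
    n, d = X.shape
    M = cp.Variable((d, d), PSD=True)   # Mahalanobis matrix, constrained to the PSD cone

    pull_terms, constraints, slacks = [], [], []
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            C_ij = np.outer(X[i] - X[j], X[i] - X[j])
            d_ij = cp.sum(cp.multiply(C_ij, M))      # <C_ij, M> = d_M(x_i, x_j)^2, affine in M
            pull_terms.append(d_ij)
            for l in np.where(y != y[i])[0]:         # all imposter candidates for this pair
                C_il = np.outer(X[i] - X[l], X[i] - X[l])
                d_il = cp.sum(cp.multiply(C_il, M))
                xi = cp.Variable(nonneg=True)        # slack variable for this triplet
                constraints.append(d_il - d_ij >= 1 - xi)
                slacks.append(xi)

    objective = cp.Minimize((1 - mu) * sum(pull_terms) + mu * sum(slacks))
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return M.value
```

Even for modest $d$, the number of triplet constraints grows quickly, which is why the projected-gradient and stochastic approaches below are what get used in practice.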
LMNN's objective is convex in $M$ (a crucial property!), which means gradient-based methods will find the global optimum. However, the constraint $M \succeq 0$ (positive semidefinite) and the scale of the problem require specialized optimization techniques.
The mathematically cleanest approach reformulates LMNN as an SDP:
Pros: Provably finds global optimum; mature SDP solvers exist
Cons: Scales as $O(d^6)$ or worse; impractical for $d > 100$
A more practical approach directly optimizes in $M$ space using gradient descent with projection onto the PSD cone.
```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def lmnn_projected_gradient(X, y, k=3, mu=0.5, learning_rate=1e-4, max_iter=100, reg=1e-3):
    """
    LMNN via Projected Gradient Descent

    Learns a Mahalanobis distance matrix M by gradient descent
    with projection onto the PSD cone.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    k : int, number of target neighbors
    mu : float, push/pull trade-off (typically 0.5)
    learning_rate : float
    max_iter : int
    reg : float, regularization strength

    Returns:
    --------
    M : ndarray of shape (n_features, n_features), learned metric
    L : ndarray, factorization where M = L.T @ L
    """
    n, d = X.shape

    # Initialize M as identity
    M = np.eye(d)

    # Find target neighbors (same class, k nearest in Euclidean)
    target_neighbors = find_target_neighbors(X, y, k)

    for iteration in range(max_iter):
        # Compute gradient
        grad = compute_lmnn_gradient(X, y, M, target_neighbors, mu)

        # Add regularization (trace penalty keeps M from growing without bound)
        grad += reg * np.eye(d)

        # Gradient step
        M_new = M - learning_rate * grad

        # Project onto PSD cone
        M = project_psd(M_new)

        if iteration % 10 == 0:
            loss = compute_lmnn_loss(X, y, M, target_neighbors, mu)
            print(f"Iteration {iteration}: loss = {loss:.4f}")

    # Factorize M = L^T L for efficient transformed distances
    eigenvalues, eigenvectors = eigh(M)
    eigenvalues = np.maximum(eigenvalues, 0)  # Ensure non-negative
    L = np.diag(np.sqrt(eigenvalues)) @ eigenvectors.T

    return M, L


def project_psd(M):
    """Project a symmetric matrix onto the positive semidefinite cone"""
    eigenvalues, eigenvectors = eigh(M)
    eigenvalues = np.maximum(eigenvalues, 0)  # Clip negative eigenvalues
    return eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T


def find_target_neighbors(X, y, k):
    """Find k same-class nearest neighbors for each point"""
    n = len(X)
    target_neighbors = {}
    for i in range(n):
        same_class = np.where(y == y[i])[0]
        same_class = same_class[same_class != i]  # Exclude self
        if len(same_class) >= k:
            dists = cdist(X[i:i+1], X[same_class])[0]
            nearest_idx = np.argsort(dists)[:k]
            target_neighbors[i] = same_class[nearest_idx].tolist()
        else:
            target_neighbors[i] = same_class.tolist()
    return target_neighbors


def compute_lmnn_gradient(X, y, M, target_neighbors, mu):
    """Compute gradient of LMNN objective w.r.t. M"""
    n, d = X.shape
    grad = np.zeros((d, d))

    # Pull gradient: sum over (i, j in N_i)
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            diff = X[i] - X[j]
            grad += (1 - mu) * np.outer(diff, diff)

    # Push gradient: triplet terms with active constraints
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            diff_ij = X[i] - X[j]
            d_ij_sq = diff_ij @ M @ diff_ij
            for l in range(n):
                if y[l] == y[i]:
                    continue
                diff_il = X[i] - X[l]
                d_il_sq = diff_il @ M @ diff_il
                # Check if margin constraint is violated
                margin = 1 + d_ij_sq - d_il_sq
                if margin > 0:  # Hinge is active
                    grad += mu * (np.outer(diff_ij, diff_ij) - np.outer(diff_il, diff_il))

    return grad


def compute_lmnn_loss(X, y, M, target_neighbors, mu):
    """Compute LMNN objective value"""
    pull_loss = 0
    push_loss = 0
    n = len(X)

    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            diff_ij = X[i] - X[j]
            d_ij_sq = diff_ij @ M @ diff_ij
            pull_loss += d_ij_sq
            for l in range(n):
                if y[l] == y[i]:
                    continue
                diff_il = X[i] - X[l]
                d_il_sq = diff_il @ M @ diff_il
                push_loss += max(0, 1 + d_ij_sq - d_il_sq)

    return (1 - mu) * pull_loss + mu * push_loss
```

For large datasets, computing the full gradient over every triplet is prohibitive.
Stochastic variants instead sample a mini-batch of triplets $(i, j, l)$ at each update rather than enumerating all of them. Common triplet sampling strategies include uniform random sampling, semi-hard mining (imposters that violate the margin but are still farther than the target neighbor), and hard mining (the closest imposters for each anchor); a minimal stochastic update is sketched below.
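The sketch below shows one way such an update could look, reusing the `find_target_neighbors` and `project_psd` helpers from the listing above; the uniform sampling, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

def lmnn_stochastic_step(X, y, M, target_neighbors, mu=0.5,
                         batch_size=256, learning_rate=1e-4, rng=None):
    """One projected stochastic-gradient step on randomly sampled triplets."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    grad = np.zeros((d, d))

    anchors = rng.integers(0, n, size=batch_size)
    for i in anchors:
        j = rng.choice(target_neighbors[i])            # random target neighbor
        l = rng.choice(np.where(y != y[i])[0])         # random imposter candidate
        diff_ij = X[i] - X[j]
        diff_il = X[i] - X[l]
        grad += (1 - mu) * np.outer(diff_ij, diff_ij)  # pull term
        # push term contributes only when the margin is violated
        if 1 + diff_ij @ M @ diff_ij - diff_il @ M @ diff_il > 0:
            grad += mu * (np.outer(diff_ij, diff_ij) - np.outer(diff_il, diff_il))

    return project_psd(M - learning_rate * grad / batch_size)
```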
Instead of learning full $M = L^T L$, directly parameterize with $L \in \mathbb{R}^{r \times d}$ where $r << d$:
$$d_M(\mathbf{x}_i, \mathbf{x}_j)^2 = ||L\mathbf{x}_i - L\mathbf{x}_j||_2^2$$
Advantages: only $r \times d$ parameters instead of $d(d+1)/2$; $M = L^T L$ is PSD by construction, so no projection step is needed; and the learned $L$ doubles as a supervised dimensionality reduction to $r$ dimensions. The trade-off is that the objective is no longer convex in $L$, so gradient methods may only find a local optimum.
For production use, the 'metric-learn' Python library provides optimized LMNN implementations. For custom needs, low-rank factorization with Adam optimizer is typically the best balance of flexibility, speed, and ease of implementation.
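As an illustration of that recommendation, here is a minimal sketch of low-rank, LMNN-style training with Adam in PyTorch; the random triplet sampling and the hyperparameters are assumptions for the example, not part of the original algorithm:

```python
import torch

def train_lowrank_metric(X, y, target_neighbors, r=10, mu=0.5,
                         epochs=200, batch_size=256, lr=1e-3, seed=0):
    """Learn L (r x d) so that ||L x_i - L x_j|| respects LMNN-style margins."""
    torch.manual_seed(seed)
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y)
    n, d = X.shape
    L = torch.nn.Parameter(0.1 * torch.randn(r, d))    # low-rank transform, M = L^T L implicitly
    opt = torch.optim.Adam([L], lr=lr)

    for _ in range(epochs):
        idx = torch.randint(0, n, (batch_size,)).tolist()
        loss = 0.0
        for i in idx:
            nbrs = target_neighbors[i]
            j = nbrs[torch.randint(len(nbrs), (1,)).item()]          # random target neighbor
            imposters = torch.where(y != y[i])[0]
            l = imposters[torch.randint(len(imposters), (1,)).item()].item()
            d_ij = torch.sum((L @ (X[i] - X[j])) ** 2)  # squared distance to target neighbor
            d_il = torch.sum((L @ (X[i] - X[l])) ** 2)  # squared distance to imposter
            loss = loss + (1 - mu) * d_ij + mu * torch.relu(1 + d_ij - d_il)
        opt.zero_grad()
        (loss / batch_size).backward()
        opt.step()

    return L.detach().numpy()
```

The learned `L` can then be used to transform the data before a standard k-NN classifier, exactly as in the $M = L^T L$ factorization above.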
LMNN sits at an intersection of several important machine learning concepts. Understanding these connections deepens comprehension and suggests generalizations.
LMNN's margin-based formulation directly parallels SVM:
| SVM | LMNN |
|---|---|
| Maximize margin between classes | Maximize margin between target neighbor and imposter |
| Hinge loss on misclassified points | Hinge loss on violated triplets |
| Slack variables for soft margin | Slack variables for triplet violations |
| Linear kernel: $\mathbf{w}^T\mathbf{x}$ | Linear transformation: $L\mathbf{x}$ |
Both methods learn a linear transformation of the feature space. SVM optimizes for a separating hyperplane; LMNN optimizes for local neighborhoods.
LDA finds a projection maximizing between-class scatter / within-class scatter:
$$L_{\text{LDA}} = \arg\max_L \frac{||L\mu_1 - L\mu_2||^2}{\text{tr}(L\Sigma_W L^T)}$$
LMNN's pull term minimizes within-class distances (similar to $\Sigma_W$), and the push term implicitly increases between-class separation.
Key difference: LDA assumes Gaussian classes and uses class means; LMNN operates on local neighborhoods and makes no distributional assumptions.
Modern deep learning uses triplet loss for metric learning:
$$\mathcal{L}_{\text{triplet}} = \sum_{(a,p,n)} \max\left(0,\; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\right)$$
where $f$ is a neural network, $a$ is anchor, $p$ is positive (same class), $n$ is negative (different class).
This is exactly LMNN's push term with margin $\alpha = 1$. The connection: LMNN can be viewed as triplet learning with a linear embedding $f(\mathbf{x}) = L\mathbf{x}$, while deep triplet methods replace $L$ with a neural network.
| Method | Shared Concept | Key Difference |
|---|---|---|
| SVM | Margin maximization, hinge loss | SVM: global hyperplane; LMNN: local neighborhoods |
| LDA | Maximize class separation | LDA: assumes Gaussians; LMNN: nonparametric |
| Triplet Loss | Same loss function | Triplet: nonlinear NN; LMNN: linear transform |
| PCA | Linear dimensionality reduction | PCA: unsupervised variance; LMNN: supervised margins |
| NCA | Stochastic neighbor probability | NCA: softmax probs; LMNN: hard margins |
NCA is a closely related metric learning method. It maximizes the probability of correct k-NN classification using soft probabilities:
$$p_{ij} = \frac{\exp(-\|L\mathbf{x}_i - L\mathbf{x}_j\|^2)}{\sum_{k \neq i} \exp(-\|L\mathbf{x}_i - L\mathbf{x}_k\|^2)}$$
$$\mathcal{L}_{\text{NCA}} = \sum_i \sum_{j: y_j = y_i} p_{ij}$$
LMNN vs NCA: LMNN's objective is convex in $M$ and involves only the triplets whose hinge is active (a sparse set), whereas NCA maximizes a non-convex softmax objective that couples every pair of points.
For large-scale problems, LMNN is generally preferred for its convexity and sparsity.
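For comparison, scikit-learn ships NCA as `NeighborhoodComponentsAnalysis`; a minimal usage sketch (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# NCA learns a linear transformation; k-NN then classifies in the transformed space
nca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
print(cross_val_score(nca_knn, X, y, cv=5).mean())
```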
LMNN bridges classical statistics (LDA, Mahalanobis distance), kernel methods (SVM margins), and modern deep learning (triplet loss). Understanding LMNN provides insight into all these areas. It's a linear special case of deep metric learning, benefiting from convexity while capturing the essential margin-based intuition.
Applying LMNN effectively requires attention to several practical details. Here we address the most common implementation decisions and pitfalls.
The k used for target neighbor selection affects the quality of the learned metric: too small a k makes the pull term sensitive to noise and outliers, while too large a k forces distant same-class points together and over-constrains the transformation.
Matching k: Use the same k for learning and classification when possible. If you'll classify with 5-NN, train with k=5.
```python
from metric_learn import LMNN
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np


def lmnn_with_cv(X, y, k_candidates=[3, 5, 7], n_components=None):
    """
    Apply LMNN with cross-validated k selection.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
    y : array of shape (n_samples,)
    k_candidates : list of k values to try
    n_components : int or None, for low-rank LMNN (dimensionality)

    Returns:
    --------
    best_lmnn : fitted LMNN instance
    best_k : optimal k value
    """
    best_score = -1
    best_k = None
    best_lmnn = None

    for k in k_candidates:
        # Create LMNN instance
        lmnn = LMNN(
            k=k,
            learn_rate=1e-5,
            max_iter=100,
            n_components=n_components,
            convergence_tol=1e-5
        )

        try:
            # Fit LMNN
            lmnn.fit(X, y)

            # Transform data
            X_transformed = lmnn.transform(X)

            # Evaluate with k-NN on transformed data
            knn = KNeighborsClassifier(n_neighbors=k)
            scores = cross_val_score(knn, X_transformed, y, cv=5)
            mean_score = np.mean(scores)

            print(f"LMNN k={k}: CV accuracy = {mean_score:.3f} (+/- {np.std(scores):.3f})")

            if mean_score > best_score:
                best_score = mean_score
                best_k = k
                best_lmnn = lmnn
        except Exception as e:
            print(f"LMNN k={k} failed: {e}")

    print(f"Best: k={best_k} with accuracy {best_score:.3f}")
    return best_lmnn, best_k


def compare_before_after_lmnn(X_train, y_train, X_test, y_test, k=5):
    """Compare k-NN performance before and after LMNN"""
    from metric_learn import LMNN

    # Before LMNN (Euclidean distance)
    knn_euclidean = KNeighborsClassifier(n_neighbors=k)
    knn_euclidean.fit(X_train, y_train)
    acc_before = knn_euclidean.score(X_test, y_test)

    # After LMNN (learned Mahalanobis distance)
    lmnn = LMNN(k=k, max_iter=100)
    lmnn.fit(X_train, y_train)
    X_train_tf = lmnn.transform(X_train)
    X_test_tf = lmnn.transform(X_test)

    knn_lmnn = KNeighborsClassifier(n_neighbors=k)
    knn_lmnn.fit(X_train_tf, y_train)
    acc_after = knn_lmnn.score(X_test_tf, y_test)

    print(f"Before LMNN: {acc_before:.3f}")
    print(f"After LMNN:  {acc_after:.3f}")
    print(f"Improvement: {acc_after - acc_before:.3f}")

    return acc_before, acc_after
```

LMNN performance depends heavily on input feature scale:
Required: Center and scale features before LMNN
Without preprocessing, features with large values dominate the initial Euclidean distances used to find target neighbors, and gradients become unbalanced.
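A minimal sketch of the recommended preprocessing, assuming the metric-learn package (note that the LMNN constructor's parameter names vary between releases, e.g. `k` vs. `n_neighbors`):

```python
from metric_learn import LMNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Center and scale first so the initial Euclidean neighborhoods are meaningful,
# then learn the metric, then classify in the transformed space.
lmnn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lmnn", LMNN(k=5, max_iter=100)),   # may be n_neighbors=5 in newer metric-learn versions
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# lmnn_pipeline.fit(X_train, y_train)
# accuracy = lmnn_pipeline.score(X_test, y_test)
```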
Several regularization strategies help prevent overfitting when samples are scarce relative to the d(d+1)/2 parameters: penalize the trace or Frobenius norm of M (or shrink it toward the identity), restrict M to be diagonal, use a low-rank factorization, or stop the optimization early based on validation accuracy.
| Dataset Size | Full Rank M | Low Rank (r=d/2) | Recommendations |
|---|---|---|---|
| n < 1K, d < 50 | ~seconds | ~seconds | Full rank, any method |
| n ~ 10K, d ~ 100 | ~minutes | ~1 minute | Low rank, projected gradient |
| n ~ 100K, d ~ 500 | ~hours | ~30 min | Low rank, mini-batch, GPU |
| n > 1M, d > 1K | Impractical | ~hours | Sampled triplets, deep alternative |
LMNN assumes that a global linear transformation improves all local neighborhoods. This fails when: (1) optimal metrics differ in different regions, (2) relationships are highly nonlinear, or (3) spurious features exist. For such cases, consider local metric learning or deep metric learning alternatives.
LMNN offers significant improvements in the right scenarios but isn't universally applicable. Here's a framework for deciding when metric learning is worthwhile.
Is baseline k-NN performance satisfactory? If plain k-NN (after scaling) already performs well, metric learning has little room to help; focus effort elsewhere.
Is the issue irrelevant or correlated features? These are exactly the failure modes a learned Mahalanobis metric corrects, so LMNN is a strong candidate.
Do you have n >> d² samples? A full-rank M has d(d+1)/2 parameters; without enough data, use a diagonal or low-rank parameterization or skip metric learning.
Is the class boundary approximately linear? LMNN applies one global linear transformation; highly nonlinear or region-dependent structure calls for local or deep metric learning instead.
When LMNN isn't ideal, consider: local metric learning (a different metric per region or per class), kernelized metric learning for nonlinear structure, deep metric learning with triplet or contrastive losses, or NCA when soft neighbor probabilities are preferable.
Run a quick test: Apply PCA to retain 90% variance, then run k-NN. If accuracy improves substantially, metric learning will likely help (irrelevant dimensions were hurting). If accuracy drops, the original features are already well-suited, and LMNN may not provide much benefit.
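A sketch of that quick diagnostic (the 90% variance threshold and k=5 are the assumptions from the paragraph above):

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def pca_knn_diagnostic(X, y, k=5):
    """Compare k-NN accuracy on raw (scaled) features vs. PCA retaining 90% variance."""
    baseline = Pipeline([("scale", StandardScaler()),
                         ("knn", KNeighborsClassifier(n_neighbors=k))])
    reduced = Pipeline([("scale", StandardScaler()),
                        ("pca", PCA(n_components=0.9)),   # keep 90% of the variance
                        ("knn", KNeighborsClassifier(n_neighbors=k))])
    acc_base = cross_val_score(baseline, X, y, cv=5).mean()
    acc_pca = cross_val_score(reduced, X, y, cv=5).mean()
    print(f"k-NN on raw features: {acc_base:.3f}")
    print(f"k-NN after PCA(90%):  {acc_pca:.3f}")
    return acc_base, acc_pca
```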
We've explored Large Margin Nearest Neighbors in depth, from the mathematics of Mahalanobis distance to practical implementation. The key insights: a Mahalanobis metric $M = L^T L$ is just Euclidean distance after a linear transformation; LMNN pulls target neighbors close and pushes imposters beyond a unit margin via a hinge loss; the objective is convex in $M$ and can be solved by SDP or, more practically, projected or stochastic gradient methods; the same margin idea reappears in SVMs, LDA, and deep triplet losses; and in practice, standardized features, a sensible k, and regularization matter as much as the optimizer.
LMNN learns a global distance metric—one transformation applied everywhere. But what if different regions of feature space need different metrics? And how can we combine multiple k-NN classifiers for robustness?
The next page explores Metric Learning more broadly, including local metric learning approaches and kernel methods that extend beyond LMNN's linear framework. Following that, we'll examine KNN Ensembles that combine multiple k-NN classifiers with different distance metrics, feature subsets, or parameter settings.
You now understand Large Margin Nearest Neighbors comprehensively—the Mahalanobis distance framework, the margin-based objective, optimization techniques, connections to other methods, and practical application guidance. You can apply LMNN to improve k-NN classification when features are suboptimal. Next, we explore the broader landscape of metric learning.