The previous pages examined how to select which training instances to keep (CNN, ENN). But there's a fundamentally different question we've been ignoring: Are we measuring distance correctly in the first place?
K-Nearest Neighbors treats all features equally—or, with weighted variants, uses hand-tuned weights. But this assumption is almost always wrong. Consider these scenarios:
Scenario 1: Irrelevant Features
A dataset contains 100 features, but only 10 are actually predictive. Euclidean distance treats all 100 equally, letting the 90 irrelevant features dominate and obscure true similarity.
Scenario 2: Correlated Features
Two features measure nearly the same thing (e.g., height in inches and height in centimeters). Euclidean distance double-counts this dimension, distorting the neighborhood structure.
Scenario 3: Different Scales of Relevance
Feature A has a correlation of 0.9 with the target; Feature B has 0.1. They should not contribute equally to distance.
The fundamental insight: The optimal distance metric is data-dependent. We can learn it from the training data.
Large Margin Nearest Neighbors (LMNN), introduced by Weinberger and Saul in 2009, does exactly this. It learns a Mahalanobis distance metric that pulls same-class points (target neighbors) closer while pushing different-class points (imposters) away—creating large margins analogous to Support Vector Machines.
By completing this page, you will understand the Mahalanobis distance and its parameterization, grasp the LMNN objective function and its margin-based motivation, learn the optimization procedure including semidefinite programming relaxation, implement metric learning for KNN, and understand connections to dimensionality reduction and feature learning.
Before diving into LMNN, we must understand the family of distances it learns: Mahalanobis distances.
The familiar Euclidean distance between two points $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^d$ is:
$$d_E(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)} = \|\mathbf{x}_i - \mathbf{x}_j\|_2$$
This treats all dimensions equally with weight 1.
A simple extension weights dimensions differently:
$$d_W(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T W (\mathbf{x}_i - \mathbf{x}_j)}$$
where $W = \text{diag}(w_1, \ldots, w_d)$ is a diagonal weight matrix. Dimension $j$ contributes $w_j$ times more to the distance.
The generalized Mahalanobis distance uses a full positive semi-definite (PSD) matrix $M$:
$$d_M(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T M (\mathbf{x}_i - \mathbf{x}_j)}$$
Key properties: because $M$ is PSD, $d_M$ is symmetric, non-negative, and satisfies the triangle inequality (it is a pseudo-metric; distinct points can have zero distance if $M$ is rank-deficient). Setting $M = I$ recovers plain Euclidean distance, and a diagonal $M$ recovers the weighted Euclidean distance above.
The Mahalanobis distance has a beautiful geometric interpretation. Since $M$ is PSD, we can write:
$$M = L^T L$$
for some matrix $L \in \mathbb{R}^{r \times d}$ where $r = \text{rank}(M)$.
Then:
$$d_M(\mathbf{x}_i, \mathbf{x}_j)^2 = (\mathbf{x}_i - \mathbf{x}_j)^T L^T L (\mathbf{x}_i - \mathbf{x}_j) = ||L\mathbf{x}_i - L\mathbf{x}_j||_2^2$$
Insight: The Mahalanobis distance with matrix $M = L^T L$ is equivalent to Euclidean distance after linearly transforming the data by $L$.
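A small numerical check of this equivalence (the vectors and the 2×3 matrix `L` below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
L = rng.normal(size=(2, 3))        # low-rank transform: 3 features -> 2 dimensions
M = L.T @ L                        # induced Mahalanobis matrix (PSD by construction)

diff = x_i - x_j
d_mahalanobis = np.sqrt(diff @ M @ diff)             # sqrt((x_i - x_j)^T M (x_i - x_j))
d_transformed = np.linalg.norm(L @ x_i - L @ x_j)    # Euclidean distance after applying L

print(d_mahalanobis, d_transformed)   # identical up to floating-point error
```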
This means learning a Mahalanobis metric is the same as learning a linear transformation of the input space, with ordinary Euclidean distance applied afterward. The table below compares the metric families:
| Metric | Parameters | Degrees of Freedom | Captures |
|---|---|---|---|
| Euclidean | None | 0 | Nothing (fixed) |
| Weighted Euclidean | d diagonal weights | d | Feature importance |
| Mahalanobis (diagonal M) | d diagonal entries | d | Feature importance |
| Mahalanobis (full M) | d×d symmetric PSD matrix | d(d+1)/2 | Importance + correlations |
| Low-rank Mahalanobis | L ∈ ℝ^(r×d) | r × d | Importance + correlations + reduction |
With d features, a full Mahalanobis matrix has d(d+1)/2 parameters. For d=100, that's 5,050 parameters. For d=1000, it's 500,500 parameters. Low-rank factorizations (L ∈ ℝ^(r×d) with r << d) dramatically reduce parameters while still capturing the most important transformations.
Large Margin Nearest Neighbors (LMNN) learns a Mahalanobis distance matrix $M$ by optimizing an objective that mirrors SVM's margin-maximization philosophy. The goal: make k-NN classification as reliable as possible.
Target Neighbors: For each training point $\mathbf{x}_i$, its target neighbors are the k points of the same class that should be closest to it. These are typically determined using the original Euclidean distance or can be pre-specified.
Imposters: An imposter for $\mathbf{x}_i$ is any point of a different class that intrudes into $\mathbf{x}_i$'s neighborhood, potentially causing misclassification.
LMNN's objective has two competing terms:
1. Pull Term (Attraction): Pull target neighbors closer
$$\epsilon_{\text{pull}}(M) = \sum_{i} \sum_{j \in \mathcal{N}_i} d_M(\mathbf{x}_i, \mathbf{x}_j)^2$$
where $\mathcal{N}_i$ is the set of k target neighbors of $\mathbf{x}_i$.
This term alone would collapse everything to a point!
2. Push Term (Repulsion): Push imposters away with a margin
$$\epsilon_{\text{push}}(M) = \sum_{i} \sum_{j \in \mathcal{N}_i} \sum_{l: y_l \neq y_i} \max\left(0,\; 1 + d_M(\mathbf{x}_i, \mathbf{x}_j)^2 - d_M(\mathbf{x}_i, \mathbf{x}_l)^2\right)$$
This hinge loss activates when an imposter $\mathbf{x}_l$ is closer to $\mathbf{x}_i$ than a target neighbor $\mathbf{x}_j$ by less than margin 1.
Combined Objective:
$$\mathcal{L}(M) = (1-\mu) \cdot \epsilon_{\text{pull}}(M) + \mu \cdot \epsilon_{\text{push}}(M)$$
where $\mu \in (0, 1)$ balances attraction and repulsion (typically $\mu = 0.5$).
The '+1' in the hinge loss creates a unit margin. We want imposters to be at least 1 unit further from xᵢ than any target neighbor xⱼ. This mirrors SVM's margin concept: we're not just trying to classify correctly, we're trying to classify with confidence.
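As a quick worked example of when the hinge activates, suppose (hypothetical numbers) a target neighbor sits at squared distance 2.0 from $\mathbf{x}_i$ and an imposter at squared distance 2.5:

$$\max\left(0,\; 1 + d_M(\mathbf{x}_i, \mathbf{x}_j)^2 - d_M(\mathbf{x}_i, \mathbf{x}_l)^2\right) = \max(0,\; 1 + 2.0 - 2.5) = 0.5$$

The imposter is already farther away than the target neighbor, yet it still incurs a penalty of 0.5 because it sits inside the unit margin; only once its squared distance exceeds $2.0 + 1 = 3.0$ does the loss drop to zero.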
Consider a small 2D example with three points: an anchor $\mathbf{x}_i$, a target neighbor $\mathbf{x}_j$ of the same class, and an imposter $\mathbf{x}_l$ of a different class. Before learning $M$ (i.e., under Euclidean distance), the imposter lies inside the margin, closer to $\mathbf{x}_i$ than it should be. The objective pushes for $d_M(\mathbf{x}_i, \mathbf{x}_l)^2 \geq d_M(\mathbf{x}_i, \mathbf{x}_j)^2 + 1$. After learning $M$, the space is compressed along within-class directions and stretched along class-separating directions, so the target neighbor stays close while the imposter falls outside the unit margin.
LMNN can be reformulated as a constrained optimization problem:
$$\min_{M,\,\xi}\; (1-\mu) \sum_i \sum_{j \in \mathcal{N}_i} d_M(\mathbf{x}_i, \mathbf{x}_j)^2 + \mu \sum_{i,j,l} \xi_{ijl}$$
subject to:

$$d_M(\mathbf{x}_i, \mathbf{x}_l)^2 - d_M(\mathbf{x}_i, \mathbf{x}_j)^2 \geq 1 - \xi_{ijl} \quad \text{for all } j \in \mathcal{N}_i,\; y_l \neq y_i$$

$$\xi_{ijl} \geq 0, \qquad M \succeq 0$$
This is a semidefinite program (SDP), a convex optimization problem with known efficient solvers.
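To make the formulation concrete, here is a minimal sketch of that SDP using the cvxpy modeling library; it assumes a small dataset and a precomputed dictionary of target neighbors (such as the one built by `find_target_neighbors` in the implementation later on), and the variable names are illustrative:

```python
import numpy as np
import cvxpy as cp

def lmnn_sdp(X, y, target_neighbors, mu=0.5):
    """Solve the LMNN SDP directly. Only feasible for small n and d."""
    n, d = X.shape
    M = cp.Variable((d, d), PSD=True)   # Mahalanobis matrix, constrained to the PSD cone

    pull_terms, constraints, slacks = [], [], []
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            C_ij = np.outer(X[i] - X[j], X[i] - X[j])
            d_ij = cp.sum(cp.multiply(C_ij, M))      # <C_ij, M> = d_M(x_i, x_j)^2, affine in M
            pull_terms.append(d_ij)
            for l in np.where(y != y[i])[0]:         # all imposter candidates for this pair
                C_il = np.outer(X[i] - X[l], X[i] - X[l])
                d_il = cp.sum(cp.multiply(C_il, M))
                xi = cp.Variable(nonneg=True)        # slack variable for this triplet
                constraints.append(d_il - d_ij >= 1 - xi)
                slacks.append(xi)

    objective = cp.Minimize((1 - mu) * sum(pull_terms) + mu * sum(slacks))
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return M.value
```

Even for modest $d$, the number of triplet constraints grows quickly, which is why the projected-gradient and stochastic approaches below are what get used in practice.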
LMNN's objective is convex in $M$ (a crucial property!), which means gradient-based methods will find the global optimum. However, the constraint $M \succeq 0$ (positive semidefinite) and the scale of the problem require specialized optimization techniques.
The mathematically cleanest approach reformulates LMNN as an SDP:
Pros: Provably finds global optimum; mature SDP solvers exist
Cons: Scales as $O(d^6)$ or worse; impractical for $d > 100$
A more practical approach directly optimizes in $M$ space using gradient descent with projection onto the PSD cone.
```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def lmnn_projected_gradient(X, y, k=3, mu=0.5, learning_rate=1e-4, max_iter=100, reg=1e-3):
    """
    LMNN via Projected Gradient Descent

    Learns a Mahalanobis distance matrix M by gradient descent
    with projection onto the PSD cone.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    k : int, number of target neighbors
    mu : float, push/pull trade-off (typically 0.5)
    learning_rate : float
    max_iter : int
    reg : float, regularization strength

    Returns:
    --------
    M : ndarray of shape (n_features, n_features), learned metric
    L : ndarray, factorization where M = L.T @ L
    """
    n, d = X.shape

    # Initialize M as identity
    M = np.eye(d)

    # Find target neighbors (same class, k nearest in Euclidean)
    target_neighbors = find_target_neighbors(X, y, k)

    for iteration in range(max_iter):
        # Compute gradient
        grad = compute_lmnn_gradient(X, y, M, target_neighbors, mu)

        # Add regularization (trace penalty keeps M from growing without bound)
        grad += reg * np.eye(d)

        # Gradient step
        M_new = M - learning_rate * grad

        # Project onto PSD cone
        M = project_psd(M_new)

        if iteration % 10 == 0:
            loss = compute_lmnn_loss(X, y, M, target_neighbors, mu)
            print(f"Iteration {iteration}: loss = {loss:.4f}")

    # Factorize M = L^T L for efficient transformed distances
    eigenvalues, eigenvectors = eigh(M)
    eigenvalues = np.maximum(eigenvalues, 0)  # Ensure non-negative
    L = np.diag(np.sqrt(eigenvalues)) @ eigenvectors.T

    return M, L


def project_psd(M):
    """Project a symmetric matrix onto the positive semidefinite cone"""
    eigenvalues, eigenvectors = eigh(M)
    eigenvalues = np.maximum(eigenvalues, 0)  # Clip negative eigenvalues
    return eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T


def find_target_neighbors(X, y, k):
    """Find k same-class nearest neighbors for each point"""
    n = len(X)
    target_neighbors = {}
    for i in range(n):
        same_class = np.where(y == y[i])[0]
        same_class = same_class[same_class != i]  # Exclude self
        if len(same_class) >= k:
            dists = cdist(X[i:i+1], X[same_class])[0]
            nearest_idx = np.argsort(dists)[:k]
            target_neighbors[i] = same_class[nearest_idx].tolist()
        else:
            target_neighbors[i] = same_class.tolist()
    return target_neighbors


def compute_lmnn_gradient(X, y, M, target_neighbors, mu):
    """Compute gradient of LMNN objective w.r.t. M"""
    n, d = X.shape
    grad = np.zeros((d, d))

    # Pull gradient: sum over (i, j in N_i)
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            diff = X[i] - X[j]
            grad += (1 - mu) * np.outer(diff, diff)

    # Push gradient: triplet terms with active constraints
    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            diff_ij = X[i] - X[j]
            d_ij_sq = diff_ij @ M @ diff_ij
            for l in range(n):
                if y[l] == y[i]:
                    continue
                diff_il = X[i] - X[l]
                d_il_sq = diff_il @ M @ diff_il
                # Check if margin constraint is violated
                margin = 1 + d_ij_sq - d_il_sq
                if margin > 0:  # Hinge is active
                    grad += mu * (np.outer(diff_ij, diff_ij) - np.outer(diff_il, diff_il))

    return grad


def compute_lmnn_loss(X, y, M, target_neighbors, mu):
    """Compute LMNN objective value"""
    pull_loss = 0
    push_loss = 0
    n = len(X)

    for i, neighbors in target_neighbors.items():
        for j in neighbors:
            diff_ij = X[i] - X[j]
            d_ij_sq = diff_ij @ M @ diff_ij
            pull_loss += d_ij_sq
            for l in range(n):
                if y[l] == y[i]:
                    continue
                diff_il = X[i] - X[l]
                d_il_sq = diff_il @ M @ diff_il
                push_loss += max(0, 1 + d_ij_sq - d_il_sq)

    return (1 - mu) * pull_loss + mu * push_loss
```

For large datasets, computing the full gradient over every triplet is prohibitive.
Stochastic variants instead sample a mini-batch of triplets $(i, j, l)$ at each update rather than enumerating all of them. Common triplet sampling strategies include uniform random sampling, semi-hard mining (imposters that violate the margin but are still farther than the target neighbor), and hard mining (the closest imposters for each anchor); a minimal stochastic update is sketched below.
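The sketch below shows one way such an update could look, reusing the `find_target_neighbors` and `project_psd` helpers from the listing above; the uniform sampling, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

def lmnn_stochastic_step(X, y, M, target_neighbors, mu=0.5,
                         batch_size=256, learning_rate=1e-4, rng=None):
    """One projected stochastic-gradient step on randomly sampled triplets."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    grad = np.zeros((d, d))

    anchors = rng.integers(0, n, size=batch_size)
    for i in anchors:
        j = rng.choice(target_neighbors[i])            # random target neighbor
        l = rng.choice(np.where(y != y[i])[0])         # random imposter candidate
        diff_ij = X[i] - X[j]
        diff_il = X[i] - X[l]
        grad += (1 - mu) * np.outer(diff_ij, diff_ij)  # pull term
        # push term contributes only when the margin is violated
        if 1 + diff_ij @ M @ diff_ij - diff_il @ M @ diff_il > 0:
            grad += mu * (np.outer(diff_ij, diff_ij) - np.outer(diff_il, diff_il))

    return project_psd(M - learning_rate * grad / batch_size)
```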
Instead of learning full $M = L^T L$, directly parameterize with $L \in \mathbb{R}^{r \times d}$ where $r << d$:
$$d_M(\mathbf{x}_i, \mathbf{x}_j)^2 = ||L\mathbf{x}_i - L\mathbf{x}_j||_2^2$$
Advantages: only $r \times d$ parameters instead of $d(d+1)/2$; $M = L^T L$ is PSD by construction, so no projection step is needed; and the learned $L$ doubles as a supervised dimensionality reduction to $r$ dimensions. The trade-off is that the objective is no longer convex in $L$, so gradient methods may only find a local optimum.
For production use, the 'metric-learn' Python library provides optimized LMNN implementations. For custom needs, low-rank factorization with Adam optimizer is typically the best balance of flexibility, speed, and ease of implementation.
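As an illustration of that recommendation, here is a minimal sketch of low-rank, LMNN-style training with Adam in PyTorch; the random triplet sampling and the hyperparameters are assumptions for the example, not part of the original algorithm:

```python
import torch

def train_lowrank_metric(X, y, target_neighbors, r=10, mu=0.5,
                         epochs=200, batch_size=256, lr=1e-3, seed=0):
    """Learn L (r x d) so that ||L x_i - L x_j|| respects LMNN-style margins."""
    torch.manual_seed(seed)
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y)
    n, d = X.shape
    L = torch.nn.Parameter(0.1 * torch.randn(r, d))    # low-rank transform, M = L^T L implicitly
    opt = torch.optim.Adam([L], lr=lr)

    for _ in range(epochs):
        idx = torch.randint(0, n, (batch_size,)).tolist()
        loss = 0.0
        for i in idx:
            nbrs = target_neighbors[i]
            j = nbrs[torch.randint(len(nbrs), (1,)).item()]          # random target neighbor
            imposters = torch.where(y != y[i])[0]
            l = imposters[torch.randint(len(imposters), (1,)).item()].item()
            d_ij = torch.sum((L @ (X[i] - X[j])) ** 2)  # squared distance to target neighbor
            d_il = torch.sum((L @ (X[i] - X[l])) ** 2)  # squared distance to imposter
            loss = loss + (1 - mu) * d_ij + mu * torch.relu(1 + d_ij - d_il)
        opt.zero_grad()
        (loss / batch_size).backward()
        opt.step()

    return L.detach().numpy()
```

The learned `L` can then be used to transform the data before a standard k-NN classifier, exactly as in the $M = L^T L$ factorization above.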
LMNN sits at an intersection of several important machine learning concepts. Understanding these connections deepens comprehension and suggests generalizations.
LMNN's margin-based formulation directly parallels SVM:
| SVM | LMNN |
|---|---|
| Maximize margin between classes | Maximize margin between target neighbor and imposter |
| Hinge loss on misclassified points | Hinge loss on violated triplets |
| Slack variables for soft margin | Slack variables for triplet violations |
| Linear kernel: $\mathbf{w}^T\mathbf{x}$ | Linear transformation: $L\mathbf{x}$ |
Both methods learn a linear transformation of the feature space. SVM optimizes for a separating hyperplane; LMNN optimizes for local neighborhoods.
LDA finds a projection maximizing between-class scatter / within-class scatter:
$$L_{\text{LDA}} = \arg\max_L \frac{||L\mu_1 - L\mu_2||^2}{\text{tr}(L\Sigma_W L^T)}$$
LMNN's pull term minimizes within-class distances (similar to $\Sigma_W$), and the push term implicitly increases between-class separation.
Key difference: LDA assumes Gaussian classes and uses class means; LMNN operates on local neighborhoods and makes no distributional assumptions.
Modern deep learning uses triplet loss for metric learning:
$$\mathcal{L}_{\text{triplet}} = \sum_{(a,p,n)} \max\left(0,\; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\right)$$
where $f$ is a neural network, $a$ is anchor, $p$ is positive (same class), $n$ is negative (different class).
This is exactly LMNN's push term with margin $\alpha = 1$. The connection: LMNN can be viewed as triplet learning with a linear embedding $f(\mathbf{x}) = L\mathbf{x}$, while deep triplet methods replace $L$ with a neural network.
| Method | Shared Concept | Key Difference |
|---|---|---|
| SVM | Margin maximization, hinge loss | SVM: global hyperplane; LMNN: local neighborhoods |
| LDA | Maximize class separation | LDA: assumes Gaussians; LMNN: nonparametric |
| Triplet Loss | Same loss function | Triplet: nonlinear NN; LMNN: linear transform |
| PCA | Linear dimensionality reduction | PCA: unsupervised variance; LMNN: supervised margins |
| NCA | Stochastic neighbor probability | NCA: softmax probs; LMNN: hard margins |
NCA is a closely related metric learning method. It maximizes the probability of correct k-NN classification using soft probabilities:
$$p_{ij} = \frac{\exp(-\|L\mathbf{x}_i - L\mathbf{x}_j\|^2)}{\sum_{k \neq i} \exp(-\|L\mathbf{x}_i - L\mathbf{x}_k\|^2)}$$
$$\mathcal{L}_{\text{NCA}} = \sum_i \sum_{j: y_j = y_i} p_{ij}$$
LMNN vs NCA: LMNN's objective is convex in $M$ and involves only the triplets whose hinge is active (a sparse set), whereas NCA maximizes a non-convex softmax objective that couples every pair of points.
For large-scale problems, LMNN is generally preferred for its convexity and sparsity.
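For comparison, scikit-learn ships NCA as `NeighborhoodComponentsAnalysis`; a minimal usage sketch (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# NCA learns a linear transformation; k-NN then classifies in the transformed space
nca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
print(cross_val_score(nca_knn, X, y, cv=5).mean())
```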
LMNN bridges classical statistics (LDA, Mahalanobis distance), kernel methods (SVM margins), and modern deep learning (triplet loss). Understanding LMNN provides insight into all these areas. It's a linear special case of deep metric learning, benefiting from convexity while capturing the essential margin-based intuition.
Applying LMNN effectively requires attention to several practical details. Here we address the most common implementation decisions and pitfalls.
The k used for target neighbor selection affects the quality of the learned metric: too small a k makes the pull term sensitive to noise and outliers, while too large a k forces distant same-class points together and over-constrains the transformation.
Matching k: Use the same k for learning and classification when possible. If you'll classify with 5-NN, train with k=5.
```python
from metric_learn import LMNN
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np


def lmnn_with_cv(X, y, k_candidates=[3, 5, 7], n_components=None):
    """
    Apply LMNN with cross-validated k selection.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
    y : array of shape (n_samples,)
    k_candidates : list of k values to try
    n_components : int or None, for low-rank LMNN (dimensionality)

    Returns:
    --------
    best_lmnn : fitted LMNN instance
    best_k : optimal k value
    """
    best_score = -1
    best_k = None
    best_lmnn = None

    for k in k_candidates:
        # Create LMNN instance
        lmnn = LMNN(
            k=k,
            learn_rate=1e-5,
            max_iter=100,
            n_components=n_components,
            convergence_tol=1e-5
        )

        try:
            # Fit LMNN
            lmnn.fit(X, y)

            # Transform data
            X_transformed = lmnn.transform(X)

            # Evaluate with k-NN on transformed data
            knn = KNeighborsClassifier(n_neighbors=k)
            scores = cross_val_score(knn, X_transformed, y, cv=5)
            mean_score = np.mean(scores)

            print(f"LMNN k={k}: CV accuracy = {mean_score:.3f} (+/- {np.std(scores):.3f})")

            if mean_score > best_score:
                best_score = mean_score
                best_k = k
                best_lmnn = lmnn
        except Exception as e:
            print(f"LMNN k={k} failed: {e}")

    print(f"Best: k={best_k} with accuracy {best_score:.3f}")
    return best_lmnn, best_k


def compare_before_after_lmnn(X_train, y_train, X_test, y_test, k=5):
    """Compare k-NN performance before and after LMNN"""
    from metric_learn import LMNN

    # Before LMNN (Euclidean distance)
    knn_euclidean = KNeighborsClassifier(n_neighbors=k)
    knn_euclidean.fit(X_train, y_train)
    acc_before = knn_euclidean.score(X_test, y_test)

    # After LMNN (learned Mahalanobis distance)
    lmnn = LMNN(k=k, max_iter=100)
    lmnn.fit(X_train, y_train)
    X_train_tf = lmnn.transform(X_train)
    X_test_tf = lmnn.transform(X_test)

    knn_lmnn = KNeighborsClassifier(n_neighbors=k)
    knn_lmnn.fit(X_train_tf, y_train)
    acc_after = knn_lmnn.score(X_test_tf, y_test)

    print(f"Before LMNN: {acc_before:.3f}")
    print(f"After LMNN:  {acc_after:.3f}")
    print(f"Improvement: {acc_after - acc_before:.3f}")

    return acc_before, acc_after
```

LMNN performance depends heavily on input feature scale:
Required: Center and scale features before LMNN
Without preprocessing, features with large values dominate the initial Euclidean distances used to find target neighbors, and gradients become unbalanced.
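A minimal sketch of the recommended preprocessing, assuming the metric-learn package (note that the LMNN constructor's parameter names vary between releases, e.g. `k` vs. `n_neighbors`):

```python
from metric_learn import LMNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Center and scale first so the initial Euclidean neighborhoods are meaningful,
# then learn the metric, then classify in the transformed space.
lmnn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lmnn", LMNN(k=5, max_iter=100)),   # may be n_neighbors=5 in newer metric-learn versions
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# lmnn_pipeline.fit(X_train, y_train)
# accuracy = lmnn_pipeline.score(X_test, y_test)
```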
Several regularization strategies help prevent overfitting when samples are scarce relative to the d(d+1)/2 parameters: penalize the trace or Frobenius norm of M (or shrink it toward the identity), restrict M to be diagonal, use a low-rank factorization, or stop the optimization early based on validation accuracy.
| Dataset Size | Full Rank M | Low Rank (r=d/2) | Recommendations |
|---|---|---|---|
| n < 1K, d < 50 | ~seconds | ~seconds | Full rank, any method |
| n ~ 10K, d ~ 100 | ~minutes | ~1 minute | Low rank, projected gradient |
| n ~ 100K, d ~ 500 | ~hours | ~30 min | Low rank, mini-batch, GPU |
| n > 1M, d > 1K | Impractical | ~hours | Sampled triplets, deep alternative |
LMNN assumes that a global linear transformation improves all local neighborhoods. This fails when: (1) optimal metrics differ in different regions, (2) relationships are highly nonlinear, or (3) spurious features exist. For such cases, consider local metric learning or deep metric learning alternatives.
LMNN offers significant improvements in the right scenarios but isn't universally applicable. Here's a framework for deciding when metric learning is worthwhile.
Is baseline k-NN performance satisfactory? If plain k-NN (after scaling) already performs well, metric learning has little room to help; focus effort elsewhere.
Is the issue irrelevant or correlated features? These are exactly the failure modes a learned Mahalanobis metric corrects, so LMNN is a strong candidate.
Do you have n >> d² samples? A full-rank M has d(d+1)/2 parameters; without enough data, use a diagonal or low-rank parameterization or skip metric learning.
Is the class boundary approximately linear? LMNN applies one global linear transformation; highly nonlinear or region-dependent structure calls for local or deep metric learning instead.
When LMNN isn't ideal, consider: local metric learning (a different metric per region or per class), kernelized metric learning for nonlinear structure, deep metric learning with triplet or contrastive losses, or NCA when soft neighbor probabilities are preferable.
Run a quick test: Apply PCA to retain 90% variance, then run k-NN. If accuracy improves substantially, metric learning will likely help (irrelevant dimensions were hurting). If accuracy drops, the original features are already well-suited, and LMNN may not provide much benefit.
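A sketch of that quick diagnostic (the 90% variance threshold and k=5 are the assumptions from the paragraph above):

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def pca_knn_diagnostic(X, y, k=5):
    """Compare k-NN accuracy on raw (scaled) features vs. PCA retaining 90% variance."""
    baseline = Pipeline([("scale", StandardScaler()),
                         ("knn", KNeighborsClassifier(n_neighbors=k))])
    reduced = Pipeline([("scale", StandardScaler()),
                        ("pca", PCA(n_components=0.9)),   # keep 90% of the variance
                        ("knn", KNeighborsClassifier(n_neighbors=k))])
    acc_base = cross_val_score(baseline, X, y, cv=5).mean()
    acc_pca = cross_val_score(reduced, X, y, cv=5).mean()
    print(f"k-NN on raw features: {acc_base:.3f}")
    print(f"k-NN after PCA(90%):  {acc_pca:.3f}")
    return acc_base, acc_pca
```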
We've explored Large Margin Nearest Neighbors in depth, from the mathematics of Mahalanobis distance to practical implementation. The key insights: a Mahalanobis metric $M = L^T L$ is just Euclidean distance after a linear transformation; LMNN pulls target neighbors close and pushes imposters beyond a unit margin via a hinge loss; the objective is convex in $M$ and can be solved by SDP or, more practically, projected or stochastic gradient methods; the same margin idea reappears in SVMs, LDA, and deep triplet losses; and in practice, standardized features, a sensible k, and regularization matter as much as the optimizer.
LMNN learns a global distance metric—one transformation applied everywhere. But what if different regions of feature space need different metrics? And how can we combine multiple k-NN classifiers for robustness?
The next page explores Metric Learning more broadly, including local metric learning approaches and kernel methods that extend beyond LMNN's linear framework. Following that, we'll examine KNN Ensembles that combine multiple k-NN classifiers with different distance metrics, feature subsets, or parameter settings.
You now understand Large Margin Nearest Neighbors comprehensively—the Mahalanobis distance framework, the margin-based objective, optimization techniques, connections to other methods, and practical application guidance. You can apply LMNN to improve k-NN classification when features are suboptimal. Next, we explore the broader landscape of metric learning.