Standard metrics—Euclidean, Manhattan, cosine—serve as excellent defaults, but real-world data often has structure that generic metrics cannot capture. Custom distance functions allow us to encode domain knowledge directly into the similarity measure.
Consider comparing DNA sequences: Euclidean distance on one-hot encodings misses that adenine (A) and guanine (G) are both purines, making A→G mutations biologically less significant than A→C mutations. A custom distance can encode this biochemical knowledge.
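As a purely illustrative sketch (the 0.5 transition cost is a hypothetical value, not a standard biological constant), this idea could be encoded as a position-wise distance that charges less for purine-to-purine or pyrimidine-to-pyrimidine substitutions than for transversions:

```python
def dna_distance(seq1: str, seq2: str) -> float:
    """Position-wise distance for equal-length DNA sequences with biochemistry-aware costs."""
    purines, pyrimidines = {'A', 'G'}, {'C', 'T'}

    def cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        # Transition (within purines or within pyrimidines) is charged less than a transversion
        same_class = ({a, b} <= purines) or ({a, b} <= pyrimidines)
        return 0.5 if same_class else 1.0  # hypothetical cost values

    return sum(cost(a, b) for a, b in zip(seq1, seq2))

print(dna_distance("ACGT", "GCGT"))  # A -> G transition: 0.5
print(dna_distance("ACGT", "CCGT"))  # A -> C transversion: 1.0
```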
This page explores techniques for designing, implementing, and validating domain-specific distance functions.
By mastering this page, you will:

• Design weighted distance functions with feature-specific importance
• Implement Mahalanobis distance for correlated features
• Apply edit distances (Levenshtein, Hamming) for sequences
• Handle mixed data types with Gower distance
• Understand metric learning basics
• Validate custom metrics for KNN effectiveness
The simplest customization assigns different weights to features based on their importance.
$$d_w(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} w_i (p_i - q_i)^2}$$
where $w_i \geq 0$ is the weight for feature $i$. Higher weights make that feature more influential in determining similarity.
$$d_w(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} w_i |p_i - q_i|$$
| Method | Description |
|---|---|
| Domain expertise | Expert assigns importance based on domain knowledge |
| Feature importance | Use RF/XGBoost importance scores as weights |
| Correlation with target | Weight by the absolute correlation between feature $i$ and the target |
| Inverse variance | Weight by 1/variance (standardization effect) |
| Learned weights | Optimize weights via metric learning |
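As a brief illustration of two of these strategies (the dataset and model choices here are arbitrary examples), weights can be derived from inverse feature variances or from a tree ensemble's importance scores:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inverse-variance weights (similar in effect to standardizing each feature)
w_invvar = 1.0 / X.var(axis=0)

# Feature-importance weights from a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
w_rf = rf.feature_importances_

print(w_invvar[:3], w_rf[:3])
```

The block below then shows how to plug such weights into KNN by rescaling features before using the standard Euclidean metric.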
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def weighted_euclidean(p, q, weights):
    """Weighted Euclidean distance."""
    return np.sqrt(np.sum(weights * (p - q)**2))

def weighted_manhattan(p, q, weights):
    """Weighted Manhattan distance."""
    return np.sum(weights * np.abs(p - q))

# Using with sklearn: scale features by sqrt(weight)
def apply_feature_weights(X, weights):
    """Transform features to apply weights in Euclidean space."""
    return X * np.sqrt(weights)

# Example: Medical diagnosis with expert-defined importance
weights = np.array([0.1, 0.3, 0.8, 1.0])  # [age, BMI, blood_pressure, cholesterol]
X_train = np.random.randn(100, 4)
X_weighted = apply_feature_weights(X_train, weights)

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_weighted, np.random.randint(0, 2, 100))
```

When features are correlated, Euclidean distance can be misleading. Mahalanobis distance accounts for the covariance structure of the data.
$$d_M(\mathbf{p}, \mathbf{q}) = \sqrt{(\mathbf{p} - \mathbf{q})^T \Sigma^{-1} (\mathbf{p} - \mathbf{q})}$$
where $\Sigma$ is the covariance matrix of the data.
| Property | Euclidean | Mahalanobis |
|---|---|---|
| Accounts for correlation | No | Yes |
| Scale-invariant | No | Yes |
| Requires covariance estimation | No | Yes |
| Fails with singular covariance | N/A | Yes |
```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import EmpiricalCovariance

def compute_mahalanobis(X, p, q):
    """Compute Mahalanobis distance using data covariance."""
    cov = EmpiricalCovariance().fit(X)
    cov_inv = np.linalg.inv(cov.covariance_)
    return mahalanobis(p, q, cov_inv)

# Example: Correlated features
np.random.seed(42)
# Create correlated data (height and weight)
mean = [170, 70]
cov = [[100, 35], [35, 25]]  # Positive correlation
X = np.random.multivariate_normal(mean, cov, 100)

p = np.array([180, 75])
q1 = np.array([180, 80])  # 5 units away, along the correlation
q2 = np.array([175, 75])  # 5 units away, perpendicular to it

cov_inv = np.linalg.inv(np.cov(X.T))
print(f"Euclidean d(p, q1): {np.linalg.norm(p - q1):.2f}")
print(f"Euclidean d(p, q2): {np.linalg.norm(p - q2):.2f}")
print(f"Mahalanobis d(p, q1): {mahalanobis(p, q1, cov_inv):.2f}")
print(f"Mahalanobis d(p, q2): {mahalanobis(p, q2, cov_inv):.2f}")
```

Mahalanobis distance requires estimating the covariance matrix, which can be problematic:

• High dimensions: need more samples than features (n > d)
• Singular covariance: use regularized estimators (Ledoit-Wolf shrinkage)
• Computational cost: O(d³) for matrix inversion

For d > 100, consider PCA first or use regularized covariance.
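When the empirical covariance is ill-conditioned or singular, a shrinkage estimator can be substituted; a minimal sketch using scikit-learn's `LedoitWolf`, whose `precision_` attribute provides the estimated inverse covariance directly (the sample size and distribution here are illustrative):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
# Deliberately few samples, where the empirical covariance is unreliable
X = rng.multivariate_normal([170, 70], [[100, 35], [35, 25]], size=30)

lw = LedoitWolf().fit(X)                     # shrinkage-regularized covariance estimate
d = mahalanobis(X[0], X[1], lw.precision_)   # precision_ = estimated inverse covariance
print(f"Regularized Mahalanobis: {d:.2f}")
```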
For string and sequence data, edit distances measure the minimum operations to transform one sequence into another.
Hamming distance: for equal-length strings, count the positions where characters differ:
$$d_H(s_1, s_2) = \sum_{i=1}^{n} \mathbb{1}[s_1[i] \neq s_2[i]]$$
Example: d_H("KAROLIN", "KATHRIN") = 3
Levenshtein distance: the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into the other:
Example: d_L("kitten", "sitting") = 3
Dynamic time warping (DTW): for time series of possibly different lengths, finds the optimal alignment:
$$DTW(X, Y) = \min_\pi \sum_{(i,j) \in \pi} d(x_i, y_j)$$
where $\pi$ is a warping path.
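The Hamming and Levenshtein implementations appear in the code block below; as a complementary sketch, here is a naive O(nm) dynamic-programming DTW for two 1-D series, assuming absolute difference as the local cost (optimized library versions exist, e.g. in tslearn or dtaidistance):

```python
import numpy as np

def dtw_distance(x, y):
    """Naive O(n*m) dynamic time warping with absolute-difference local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Example: two series with similar shape but different lengths
print(dtw_distance([1, 2, 3, 4, 3, 2], [1, 1, 2, 3, 4, 4, 3, 2]))
```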
```python
import numpy as np

def hamming_distance(s1: str, s2: str) -> int:
    """Hamming distance for equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("Strings must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def levenshtein_distance(s1: str, s2: str) -> int:
    """Levenshtein (edit) distance using dynamic programming."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    return dp[m][n]

# Examples
print(f"Hamming('KAROLIN', 'KATHRIN'): {hamming_distance('KAROLIN', 'KATHRIN')}")
print(f"Levenshtein('kitten', 'sitting'): {levenshtein_distance('kitten', 'sitting')}")
```

Real datasets often contain mixed types: numerical, categorical, binary, and ordinal features. Gower distance provides a principled way to combine them.
$$d_G(\mathbf{p}, \mathbf{q}) = \frac{\sum_{i=1}^{n} w_i \cdot d_i(p_i, q_i)}{\sum_{i=1}^{n} w_i}$$
where $d_i$ is defined per feature type:
| Feature Type | Distance Formula |
|---|---|
| Numerical | $\frac{|p_i - q_i|}{\text{range}_i}$ (normalized) |
| Binary | 0 if match, 1 if mismatch |
| Categorical | 0 if match, 1 if mismatch |
| Ordinal | Rank difference divided by $(k - 1)$, where $k$ is the number of levels |
```python
import numpy as np

def gower_distance(p, q, feature_types, ranges=None):
    """
    Compute Gower distance for mixed data types.

    feature_types: list of 'numerical', 'categorical', 'binary'
    ranges: dict of {feature_idx: (min, max)} for numerical features
    """
    distances = []
    for i, ftype in enumerate(feature_types):
        if ftype == 'numerical':
            if ranges and i in ranges:
                r = ranges[i][1] - ranges[i][0]
                d = abs(p[i] - q[i]) / r if r > 0 else 0
            else:
                d = abs(p[i] - q[i])
        elif ftype in ['categorical', 'binary']:
            d = 0 if p[i] == q[i] else 1
        distances.append(d)
    return np.mean(distances)

# Example: Person comparison
# [age, income, gender, education]
person1 = [25, 50000, 'M', 'Bachelor']
person2 = [30, 55000, 'M', 'Master']
feature_types = ['numerical', 'numerical', 'categorical', 'categorical']
ranges = {0: (18, 80), 1: (20000, 200000)}

dist = gower_distance(person1, person2, feature_types, ranges)
print(f"Gower distance: {dist:.3f}")
```

Use the gower package for production: `pip install gower`

```python
import gower
import pandas as pd

df = pd.DataFrame({'age': [25, 30], 'gender': ['M', 'M']})
dist_matrix = gower.gower_matrix(df)
```
Rather than hand-crafting distance functions, metric learning algorithms learn an optimal distance metric from labeled data.
Learn a transformation matrix $M$ such that:
$$d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T M (\mathbf{x} - \mathbf{y})}$$
minimizes distance between same-class points and maximizes distance between different-class points.
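Since any positive semi-definite $M$ factors as $M = L^T L$, the learned metric is equivalent to ordinary Euclidean distance after the linear map $L$. A small numpy check of this identity, using an arbitrary illustrative $L$ rather than one produced by a metric-learning algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 4))   # illustrative linear map (stand-in for a learned one)
M = L.T @ L                   # induced positive semi-definite metric matrix

x, y = rng.normal(size=4), rng.normal(size=4)
d_metric = np.sqrt((x - y) @ M @ (x - y))   # d_M(x, y)
d_mapped = np.linalg.norm(L @ x - L @ y)    # Euclidean distance after mapping with L
print(np.isclose(d_metric, d_mapped))       # True
```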
| Algorithm | Approach |
|---|---|
| LMNN (Large Margin NN) | Learn M to maximize margin between classes |
| NCA (Neighborhood Component Analysis) | Maximize expected leave-one-out accuracy |
| Siamese Networks | Deep learning with contrastive loss |
| Triplet Loss | Learn embeddings where similar closer than dissimilar |
```python
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Load data
X, y = load_iris(return_X_y=True)

# Standard KNN
knn = KNeighborsClassifier(n_neighbors=3)
score_standard = cross_val_score(knn, X, y, cv=5).mean()

# KNN with learned metric (NCA)
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=42)
knn_nca = Pipeline([('nca', nca), ('knn', KNeighborsClassifier(n_neighbors=3))])
score_nca = cross_val_score(knn_nca, X, y, cv=5).mean()

print(f"Standard KNN accuracy: {score_standard:.3f}")
print(f"NCA + KNN accuracy: {score_nca:.3f}")
```

Custom distances must be validated to ensure they improve KNN performance.
• Metric properties: verify non-negativity, symmetry, identity, and the triangle inequality (required if using tree-based indexes) — see the sketch after this list
• Cross-validation: compare KNN accuracy with the custom metric versus standard metrics
• Neighbor inspection: manually examine retrieved neighbors for semantic validity
• Sensitivity analysis: test robustness to noise and missing values
• Computational cost: ensure distance computation is fast enough for your data size
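A minimal sketch combining the first two checks, assuming a hypothetical weighted-Euclidean custom metric on the iris data (the weight values are arbitrary examples):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical custom metric: weighted Euclidean with hand-picked weights
weights = np.array([1.0, 0.5, 2.0, 2.0])
def custom_metric(p, q):
    return np.sqrt(np.sum(weights * (p - q) ** 2))

# 1. Spot-check metric properties on random triples of points
rng = np.random.default_rng(0)
for _ in range(100):
    a, b, c = X[rng.choice(len(X), 3, replace=False)]
    assert custom_metric(a, b) >= 0                               # non-negativity
    assert np.isclose(custom_metric(a, b), custom_metric(b, a))   # symmetry
    assert custom_metric(a, c) <= custom_metric(a, b) + custom_metric(b, c) + 1e-9  # triangle inequality

# 2. Compare cross-validated accuracy against a standard metric
knn_standard = KNeighborsClassifier(n_neighbors=5)
knn_custom = KNeighborsClassifier(n_neighbors=5, metric=custom_metric, algorithm='brute')
print(f"Euclidean: {cross_val_score(knn_standard, X, y, cv=5).mean():.3f}")
print(f"Custom:    {cross_val_score(knn_custom, X, y, cv=5).mean():.3f}")
```

Passing random spot-checks does not prove the triangle inequality in general, but it catches gross violations cheaply before committing to tree-based neighbor search.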
This completes our exploration of distance metrics for KNN. You now have a comprehensive toolkit for measuring similarity across diverse data types and domains.
You can now select and implement appropriate distance functions for any KNN application. The next module explores the Curse of Dimensionality—why distances behave strangely in high dimensions and how to mitigate these effects.