We have explored Euclidean distance (L²) and Manhattan distance (L¹) as two distinct approaches to measuring similarity. A natural question emerges: Is there a unified framework that connects these metrics?
The answer is the Minkowski distance, named after the German mathematician Hermann Minkowski (1864-1909), who made foundational contributions to the geometry of numbers and later to the mathematical formulation of special relativity.
Minkowski distance generalizes Euclidean and Manhattan distances through a continuous parameter $p$, revealing them as special cases of a broader family. This framework provides both theoretical elegance and practical flexibility—by tuning $p$, we can interpolate between geometries, adapting the distance metric to the structure of our data.
By mastering this page, you will:

• Derive the Minkowski distance formula as the Lp norm of the difference vector
• Understand how p parameterizes the geometry from L¹ through L∞
• Visualize Lp unit balls and their shape transformations as p varies
• Analyze special cases: p=1 (Manhattan), p=2 (Euclidean), p=∞ (Chebyshev)
• Explore fractional p values (0 < p < 1) and their quasi-metric properties
• Implement Minkowski distance efficiently for arbitrary p
• Develop intuition for selecting appropriate p values in practice
For two points in $n$-dimensional space:
$$\mathbf{p} = (p_1, p_2, \ldots, p_n) \quad \text{and} \quad \mathbf{q} = (q_1, q_2, \ldots, q_n)$$
The Minkowski distance of order $p$ (where $p \geq 1$) is defined as:
$$d_p(\mathbf{p}, \mathbf{q}) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{\frac{1}{p}} = |\mathbf{p} - \mathbf{q}|_p$$
This is the Lp norm of the difference vector. The parameter $p$ controls how individual dimension differences are aggregated.
| p | Name | Formula | Aggregation Style |
|---|---|---|---|
| 1 | Manhattan (L¹) | $\sum_i |p_i - q_i|$ | Linear sum |
| 2 | Euclidean (L²) | $\sqrt{\sum_i (p_i - q_i)^2}$ | Root of sum of squares |
| ∞ | Chebyshev (L∞) | $\max_i |p_i - q_i|$ | Maximum difference |
The Chebyshev distance (L∞ norm) is the limit as $p \to \infty$:
$$|\mathbf{x}|_\infty = \lim_{p \to \infty} |\mathbf{x}|_p = \max_i |x_i|$$
Proof sketch: Let $M = \max_i |x_i|$. Then:
$$|\mathbf{x}|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \leq (n \cdot M^p)^{1/p} = n^{1/p} \cdot M$$
As $p \to \infty$, $n^{1/p} \to 1$, so the upper bound $n^{1/p} \cdot M$ tends to $M$.
For the lower bound, the term with maximum absolute value dominates:
$$|\mathbf{x}|_p \geq (M^p)^{1/p} = M$$
By the squeeze theorem, $|\mathbf{x}|_p \to M$ as $p \to \infty$. ∎
```python
import numpy as np
from typing import Union


def minkowski_distance(p_vec: np.ndarray, q_vec: np.ndarray,
                       p: Union[int, float] = 2) -> float:
    """
    Compute the Minkowski distance of order p between two vectors.

    d_p(p, q) = (Σ |p_i - q_i|^p)^(1/p)

    Parameters:
        p_vec: First point as numpy array
        q_vec: Second point as numpy array
        p: Order of the Minkowski distance (p >= 1, or np.inf for Chebyshev)

    Returns:
        The Minkowski distance of order p

    Special cases:
        p=1: Manhattan distance
        p=2: Euclidean distance
        p=np.inf: Chebyshev distance (max absolute difference)
    """
    diff = np.abs(p_vec - q_vec)

    if p == np.inf:
        # Chebyshev distance: maximum absolute difference
        return np.max(diff)
    elif p == 1:
        # Manhattan distance: sum of absolute differences
        return np.sum(diff)
    elif p == 2:
        # Euclidean distance: use optimized sqrt(dot product)
        return np.sqrt(np.dot(diff, diff))
    else:
        # General case
        return np.power(np.sum(np.power(diff, p)), 1/p)


def minkowski_distance_builtin(p_vec: np.ndarray, q_vec: np.ndarray,
                               p: Union[int, float] = 2) -> float:
    """
    Compute Minkowski distance using NumPy's built-in norm function.

    np.linalg.norm with ord=p computes the Lp norm.
    """
    return np.linalg.norm(p_vec - q_vec, ord=p)


# Demonstrate convergence as p → ∞
if __name__ == "__main__":
    p_vec = np.array([0, 0, 0])
    q_vec = np.array([3, 4, 1])  # Differences: 3, 4, 1

    print("Minkowski distances for different p values:")
    print(f"  p=1 (Manhattan):  {minkowski_distance(p_vec, q_vec, 1):.4f}")
    print(f"  p=2 (Euclidean):  {minkowski_distance(p_vec, q_vec, 2):.4f}")
    print(f"  p=3:              {minkowski_distance(p_vec, q_vec, 3):.4f}")
    print(f"  p=5:              {minkowski_distance(p_vec, q_vec, 5):.4f}")
    print(f"  p=10:             {minkowski_distance(p_vec, q_vec, 10):.4f}")
    print(f"  p=100:            {minkowski_distance(p_vec, q_vec, 100):.4f}")
    print(f"  p=∞ (Chebyshev):  {minkowski_distance(p_vec, q_vec, np.inf):.4f}")

    # The maximum difference is 4, so L∞ = 4
    # As p increases, the distance approaches 4
```

The geometry induced by Minkowski distance is beautifully illustrated by the evolution of unit balls as $p$ varies. The unit ball $B_p$ contains all points with $|\mathbf{x}|_p \leq 1$.
| p | Unit Ball Shape | Equation | Vertices/Corners |
|---|---|---|---|
| 1 | Diamond (rotated square) | $|x| + |y| \leq 1$ | 4 corners at axes |
| 2 | Circle | $x^2 + y^2 \leq 1$ | None (smooth) |
| ∞ | Square (axis-aligned) | $\max(|x|, |y|) \leq 1$ | 4 corners at diagonals |
As $p$ increases from 1 to ∞, the unit ball expands monotonically from the diamond toward the square: its intersections with the coordinate axes stay fixed at $\pm 1$, while it bulges outward along the diagonals.
Key insight: The unit ball transitions from a shape with corners on the axes (p=1) to a shape with corners on the diagonals (p=∞), passing through the smooth circle at p=2.
In $n$ dimensions:
| p | Shape Name | Vertices | Facets |
|---|---|---|---|
| 1 | Cross-polytope (hyperoctahedron) | 2n | 2^n |
| 2 | Hypersphere | — | — |
| ∞ | Hypercube | 2^n | 2n |
An important high-dimensional phenomenon: as dimensionality $n$ increases, the volume of Lp unit balls behaves dramatically differently depending on p:

• p = 2 (sphere): Volume → 0 super-exponentially as n → ∞
• p = ∞ (hypercube): Volume = 2^n (exponential growth)
• p = 1 (cross-polytope): Volume = 2^n / n! (smaller than the cube)

This affects how distances concentrate in high dimensions.
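To make these growth rates concrete, here is a minimal sketch that prints the three volumes side by side, using the standard closed-form expressions for the L¹, L², and L∞ unit-ball volumes (the chosen dimensions are arbitrary):

```python
import math

# Quick numerical check of the unit-ball volumes listed above.
#   L2 ball:  pi^(n/2) / Gamma(n/2 + 1)
#   L1 ball:  2^n / n!
#   L∞ ball:  2^n
for n in [2, 5, 10, 20, 50]:
    vol_l2 = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    vol_l1 = 2.0 ** n / math.factorial(n)
    vol_linf = 2.0 ** n
    print(f"n={n:3d}: vol L1 = {vol_l1:.3e}, vol L2 = {vol_l2:.3e}, vol L∞ = {vol_linf:.3e}")
```

Already at n = 20 the cross-polytope's volume is vanishingly small while the hypercube's exceeds a million, which illustrates how differently the metrics treat "typical" points in high dimensions.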
An important relationship: for $1 \leq p \leq q \leq \infty$:
$$|\mathbf{x}|_q \leq |\mathbf{x}|_p \leq n^{\frac{1}{p} - \frac{1}{q}} |\mathbf{x}|_q$$
This implies that for any fixed vector, the Lp norm is non-increasing in $p$: $|\mathbf{x}|_1 \geq |\mathbf{x}|_2 \geq \cdots \geq |\mathbf{x}|_\infty$, and the factor $n^{\frac{1}{p} - \frac{1}{q}}$ bounds how far apart two norms can be.
Geometrically, smaller p values produce "pointier" shapes and larger p values produce "boxy" shapes, with the L² ball nestled in between.
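The bounds are easy to check numerically. Below is a small sketch using np.linalg.norm on a random vector (the seed and dimension are arbitrary choices, not from the original text):

```python
import numpy as np

# Illustrative check (not a proof) of the norm-equivalence bounds
#   ||x||_q <= ||x||_p <= n^(1/p - 1/q) * ||x||_q   for 1 <= p <= q.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
n = x.size

for p, q in [(1, 2), (2, np.inf), (1, np.inf)]:
    norm_p = np.linalg.norm(x, ord=p)
    norm_q = np.linalg.norm(x, ord=q)
    # For q = ∞ the exponent 1/q is 0
    factor = n ** (1 / p - (0 if q == np.inf else 1 / q))
    print(f"p={p}, q={q}: {norm_q:.3f} <= {norm_p:.3f} <= {factor * norm_q:.3f}")
```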
Neighbor selection differs by metric: A query point may have different nearest neighbors under L¹ vs L² vs L∞. Points along axes are relatively closer in L¹; points along diagonals are relatively closer in L∞.
Example: From origin (0,0), compare points A = (1,1) and B = (√2, 0):
| Metric | d(O, A) | d(O, B) | Closer Point |
|---|---|---|---|
| L¹ | 2 | √2 ≈ 1.41 | B |
| L² | √2 ≈ 1.41 | √2 ≈ 1.41 | Tie |
| L∞ | 1 | √2 ≈ 1.41 | A |
```python
import numpy as np
import matplotlib.pyplot as plt


def plot_unit_balls():
    """
    Visualize unit balls for different p values.
    """
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    # Generate angles for parametric plotting
    theta = np.linspace(0, 2 * np.pi, 1000)

    p_values = [1, 2, 4, np.inf]
    titles = ['p=1 (Manhattan)', 'p=2 (Euclidean)', 'p=4', 'p=∞ (Chebyshev)']

    for ax, p, title in zip(axes, p_values, titles):
        if p == np.inf:
            # Square: max(|x|, |y|) = 1
            x = np.array([-1, -1, 1, 1, -1])
            y = np.array([-1, 1, 1, -1, -1])
        elif p == 1:
            # Diamond: |x| + |y| = 1
            x = np.array([1, 0, -1, 0, 1])
            y = np.array([0, 1, 0, -1, 0])
        else:
            # General Lp ball: |x|^p + |y|^p = 1
            # Parametric: x = sgn(cos(θ)) * |cos(θ)|^(2/p)
            #             y = sgn(sin(θ)) * |sin(θ)|^(2/p)
            x = np.sign(np.cos(theta)) * np.abs(np.cos(theta)) ** (2/p)
            y = np.sign(np.sin(theta)) * np.abs(np.sin(theta)) ** (2/p)

        ax.fill(x, y, alpha=0.3, color='blue')
        ax.plot(x, y, 'b-', linewidth=2)
        ax.set_xlim(-1.5, 1.5)
        ax.set_ylim(-1.5, 1.5)
        ax.set_aspect('equal')
        ax.axhline(0, color='gray', linewidth=0.5)
        ax.axvline(0, color='gray', linewidth=0.5)
        ax.set_title(title)
        ax.grid(True, alpha=0.3)

    plt.suptitle('Unit Balls for Different Lp Norms', fontsize=14)
    plt.tight_layout()
    plt.savefig('lp_unit_balls.png', dpi=150, bbox_inches='tight')
    plt.show()


def demonstrate_neighbor_differences():
    """
    Show how different p values can select different neighbors.
    """
    # Query point at origin
    query = np.array([0, 0])

    # Two candidate neighbors
    on_axis = np.array([1.5, 0])        # On the x-axis
    on_diagonal = np.array([1.0, 1.0])  # On the diagonal

    print("Comparing two points from origin:")
    print(f"  Point A (on axis):     {on_axis}")
    print(f"  Point B (on diagonal): {on_diagonal}")
    print()

    for p in [1, 2, 3, 4, 10, np.inf]:
        d_axis = np.linalg.norm(query - on_axis, ord=p)
        d_diag = np.linalg.norm(query - on_diagonal, ord=p)

        closer = "A (axis)" if d_axis < d_diag else "B (diagonal)" if d_diag < d_axis else "Tie"
        p_str = "∞" if p == np.inf else str(p)
        print(f"  p={p_str}: d(A)={d_axis:.3f}, d(B)={d_diag:.3f} → Closer: {closer}")


if __name__ == "__main__":
    demonstrate_neighbor_differences()
    # plot_unit_balls()  # Uncomment if matplotlib is available
```

The Chebyshev distance (L∞ norm), named after the Russian mathematician Pafnuty Chebyshev, deserves special attention. It represents the limit of Minkowski distance as $p \to \infty$.
$$d_\infty(\mathbf{p}, \mathbf{q}) = \max_{i=1,\ldots,n} |p_i - q_i| = |\mathbf{p} - \mathbf{q}|_\infty$$
Only the single largest component difference matters; all other differences are ignored.
Chebyshev distance is famously the distance a chess king travels on an infinite board: a king moves one square in any direction, diagonals included, so reaching a square $\Delta x$ files and $\Delta y$ ranks away takes exactly $\max(|\Delta x|, |\Delta y|)$ moves.
This is unlike a rook-style path, whose length is the Manhattan distance $|\Delta x| + |\Delta y|$, or the straight-line Euclidean distance $\sqrt{\Delta x^2 + \Delta y^2}$.
Extreme sparsity of influence: L∞ is maximally sparse—only one dimension contributes to the distance (the one with maximum difference). All other dimensions are completely ignored.
Rectangular equidistant sets: In L∞, all points equidistant from the origin form a square (2D) or hypercube (nD), aligned with the coordinate axes.
Chebyshev distance is ideal when:

• Any feature being different enough matters: A single out-of-spec measurement should trigger an alert
• Worst-case sensitivity: You care about the maximum deviation, not the average
• Grid movement with diagonals: Pathfinding where diagonal moves cost the same as axis-aligned moves
• Cache/memory access patterns: L∞ relates to cache-line access patterns in computing
```python
import numpy as np


def chebyshev_distance(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Chebyshev distance (L∞): maximum absolute difference.

    This is the distance a chess king would travel on an infinite board.
    """
    return np.max(np.abs(p - q))


def chess_king_example():
    """
    Demonstrate Chebyshev distance with chess king movement.
    """
    # Starting position (0, 0) to target (5, 3)
    start = np.array([0, 0])
    target = np.array([5, 3])

    chebyshev = chebyshev_distance(start, target)
    manhattan = np.sum(np.abs(target - start))
    euclidean = np.linalg.norm(target - start)

    print("Chess King from (0,0) to (5,3):")
    print(f"  Chebyshev (king moves needed): {chebyshev}")      # 5
    print(f"  Manhattan (rook moves needed): {manhattan}")      # 8
    print(f"  Euclidean (flying distance):   {euclidean:.2f}")  # 5.83

    # The king can reach (5,3) in 5 moves:
    # Move diagonally 3 times (covers both x and y), then 2 more x-axis moves
    print("\nKing's optimal path: 3 diagonal moves + 2 horizontal moves = 5 total")


def worst_case_monitoring():
    """
    Use Chebyshev distance for anomaly detection.
    """
    # Normal operating parameters
    normal_state = np.array([100, 50, 75, 200])   # temp, pressure, flow, rpm

    # Current readings with one anomaly
    current_state = np.array([102, 51, 74, 350])  # rpm is way off!

    # Different distance perspectives
    chebyshev = chebyshev_distance(normal_state, current_state)
    euclidean = np.linalg.norm(current_state - normal_state)
    manhattan = np.sum(np.abs(current_state - normal_state))

    print("\nAnomaly Detection Example:")
    print(f"  Normal state:  {normal_state}")
    print(f"  Current state: {current_state}")
    print(f"\n  Chebyshev (max deviation): {chebyshev}")  # 150 (catches the RPM!)
    print(f"  Euclidean distance:        {euclidean:.2f}")
    print(f"  Manhattan distance:        {manhattan}")

    # For anomaly detection, Chebyshev immediately shows the worst offender
    diff = np.abs(current_state - normal_state)
    worst_feature = np.argmax(diff)
    feature_names = ['temperature', 'pressure', 'flow', 'rpm']
    print(f"\n  Worst deviation: {feature_names[worst_feature]} "
          f"(off by {diff[worst_feature]})")


if __name__ == "__main__":
    chess_king_example()
    worst_case_monitoring()
```

| Property | L¹ (Manhattan) | L² (Euclidean) | L∞ (Chebyshev) |
|---|---|---|---|
| Formula | Σ|xᵢ| | √(Σxᵢ²) | max|xᵢ| |
| Unit ball (2D) | Diamond | Circle | Square |
| Rotation invariant | No | Yes | No |
| Features contributing | All equally | All (weighted by square) | Only maximum |
| Outlier sensitivity | Moderate | High | Maximum (only worst matters) |
| Chess analogy | Rook | — | King |
| Computation | O(n) adds | O(n) + sqrt | O(n) to find max |
What happens when $0 < p < 1$? This regime produces interesting "quasi-metrics" that violate the triangle inequality but have useful properties.
The formula still makes sense:
$$|\mathbf{x}|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}}$$
However, for $0 < p < 1$ this is not a true norm: the triangle inequality fails, and equivalently the unit "ball" is no longer convex. It is only a quasi-norm.
Despite not being a true metric, fractional p values have interesting properties:
Enhanced sparsity: Smaller p values emphasize sparse vectors (many zeros). The L0 "norm" (limit as p→0) counts the number of non-zero entries.
Robust estimation: In statistics, Lp regression with $0 < p < 1$ is even more robust to outliers than L¹ (though harder to optimize).
Feature selection: The L0.5 quasi-norm has been used in compressed sensing as a sparsity-inducing regularizer.
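As a concrete illustration of the L⁰ limit mentioned above, the sum $\sum_i |x_i|^p$ (before taking the $1/p$ root) tends to the count of non-zero entries as $p \to 0$. A minimal sketch (the example vector is an arbitrary choice):

```python
import numpy as np

# As p → 0, the sum of |x_i|^p (without the 1/p root) approaches the number
# of non-zero entries — the L0 "norm".
x = np.array([3.0, 0.0, -0.5, 0.0, 2.0])   # 3 non-zero entries

for p in [1.0, 0.5, 0.1, 0.01]:
    nonzero = x[x != 0]                     # zero entries contribute nothing for p > 0
    sum_p = np.sum(np.abs(nonzero) ** p)
    print(f"p={p:5.2f}: sum |x_i|^p = {sum_p:.4f}")

print(f"L0 count of non-zeros: {np.count_nonzero(x)}")
```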
For $p < 1$, the unit "ball" becomes non-convex and "star-shaped": it still touches the axes at $\pm 1$, but the boundary curves inward between them, pinching toward the origin along the diagonals.
When p < 1, the triangle inequality can fail dramatically. For example, with p = 0.5, consider x = (1, 0) and y = (0, 1):

• ‖x‖₀.₅ = 1
• ‖y‖₀.₅ = 1
• ‖x + y‖₀.₅ = ‖(1,1)‖₀.₅ = (1^0.5 + 1^0.5)^2 = 4

But 4 > 1 + 1, so the triangle inequality fails!

Implication: You cannot use p < 1 with KNN data structures that rely on the triangle inequality (KD-trees, ball trees).
```python
import numpy as np


def lp_norm(x: np.ndarray, p: float) -> float:
    """Compute Lp norm (works for any p > 0)."""
    return np.power(np.sum(np.power(np.abs(x), p)), 1/p)


def demonstrate_triangle_violation():
    """
    Show that triangle inequality fails for p < 1.
    """
    x = np.array([1.0, 0.0])
    y = np.array([0.0, 1.0])

    print("Triangle inequality test for different p values:")
    print(f"  x = {x}, y = {y}, x+y = {x+y}")
    print()

    for p in [0.5, 0.75, 1.0, 1.5, 2.0]:
        norm_x = lp_norm(x, p)
        norm_y = lp_norm(y, p)
        norm_sum = lp_norm(x + y, p)

        # Triangle inequality: ||x + y|| <= ||x|| + ||y||
        lhs = norm_sum
        rhs = norm_x + norm_y
        holds = "✓ Holds" if lhs <= rhs + 1e-10 else "✗ VIOLATED"

        print(f"  p = {p}: ||x+y||={lhs:.3f}, ||x||+||y||={rhs:.3f} → {holds}")


def sparsity_comparison():
    """
    Show how smaller p emphasizes sparsity.
    """
    # Sparse vector (mostly zeros)
    sparse = np.array([1, 0, 0, 0, 0])

    # Dense vector (small values everywhere)
    dense = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

    print("\nSparsity comparison:")
    print(f"  Sparse vector: {sparse}")
    print(f"  Dense vector:  {dense}")
    print()

    for p in [0.5, 1.0, 2.0]:
        norm_sparse = lp_norm(sparse, p)
        norm_dense = lp_norm(dense, p)
        ratio = norm_sparse / norm_dense

        print(f"  p = {p}: sparse = {norm_sparse:.3f}, "
              f"dense = {norm_dense:.3f}, ratio = {ratio:.3f}")

    print("\n  As p decreases, sparse vectors have relatively smaller norms.")
    print("  This is why small p values encourage sparsity in optimization.")


if __name__ == "__main__":
    demonstrate_triangle_violation()
    sparsity_comparison()
```

Let us formally verify the metric axioms for Minkowski distance with $p \geq 1$.
$$d_p(\mathbf{x}, \mathbf{y}) \geq 0$$
Proof: Absolute values are non-negative, powers of non-negative numbers are non-negative, sums of non-negative numbers are non-negative, and the p-th root of a non-negative number is non-negative. ∎
$$d_p(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}$$
Proof: If $\mathbf{x} = \mathbf{y}$, all differences are zero, so $d_p = 0$. Conversely, if $d_p = 0$, then $\sum |x_i - y_i|^p = 0$. Since each term is non-negative, all terms must be zero, implying $x_i = y_i$ for all $i$. ∎
$$d_p(\mathbf{x}, \mathbf{y}) = d_p(\mathbf{y}, \mathbf{x})$$
Proof: $|x_i - y_i| = |y_i - x_i|$ for all $i$. ∎
$$d_p(\mathbf{x}, \mathbf{z}) \leq d_p(\mathbf{x}, \mathbf{y}) + d_p(\mathbf{y}, \mathbf{z})$$
This follows from the Minkowski inequality for norms:
$$|\mathbf{a} + \mathbf{b}|_p \leq |\mathbf{a}|_p + |\mathbf{b}|_p$$
valid for $p \geq 1$.
Proof outline: The Minkowski inequality is proven using the Hölder inequality:
$$\sum_i |a_i b_i| \leq |\mathbf{a}|_p \cdot |\mathbf{b}|_q$$
where $\frac{1}{p} + \frac{1}{q} = 1$ (conjugate exponents).
Hölder's inequality yields Minkowski's inequality for norms; the triangle inequality for $d_p$ then follows by setting $\mathbf{a} = \mathbf{x} - \mathbf{y}$ and $\mathbf{b} = \mathbf{y} - \mathbf{z}$, so that $\mathbf{a} + \mathbf{b} = \mathbf{x} - \mathbf{z}$. ∎
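Both inequalities are easy to spot-check numerically. The sketch below evaluates each side on random vectors (the seed, dimension, and the choice $p = 3$ are arbitrary assumptions for illustration, not part of the proof):

```python
import numpy as np

# Spot-check Hölder's and Minkowski's inequalities on random vectors.
rng = np.random.default_rng(42)
a = rng.normal(size=6)
b = rng.normal(size=6)

p = 3.0
q = p / (p - 1)  # conjugate exponent: 1/p + 1/q = 1

holder_lhs = np.sum(np.abs(a * b))
holder_rhs = np.linalg.norm(a, ord=p) * np.linalg.norm(b, ord=q)
print(f"Hölder:    {holder_lhs:.4f} <= {holder_rhs:.4f}")

mink_lhs = np.linalg.norm(a + b, ord=p)
mink_rhs = np.linalg.norm(a, ord=p) + np.linalg.norm(b, ord=p)
print(f"Minkowski: {mink_lhs:.4f} <= {mink_rhs:.4f}")
```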
Hermann Minkowski (1864-1909) was a brilliant mathematician and physicist. His contributions include:

• Geometry of numbers: The mathematical foundation of crystallography and lattice theory
• Minkowski space: The four-dimensional spacetime of special relativity (his former student Einstein called it "superfluous learnedness" before realizing its importance)
• Minkowski inequality: The triangle inequality for Lp spaces

The Minkowski distance is named for his work on generalizing geometric concepts beyond Euclidean space.
| p Range | Norm? | Metric? | Triangle Inequality | Unit Ball Convex? |
|---|---|---|---|---|
| 0 < p < 1 | No (quasi-norm) | No | Fails | No (star-shaped) |
| p = 1 | Yes (L¹) | Yes (Manhattan) | Holds (tight) | Yes (cross-polytope) |
| 1 < p < 2 | Yes | Yes | Holds | Yes |
| p = 2 | Yes (L²) | Yes (Euclidean) | Holds (Cauchy-Schwarz) | Yes (sphere) |
| 2 < p < ∞ | Yes | Yes | Holds | Yes |
| p = ∞ | Yes (L∞) | Yes (Chebyshev) | Holds (tight) | Yes (hypercube) |
With the full Minkowski family available, how do we choose the right $p$ for a KNN problem? This is ultimately an empirical question, but we can develop useful heuristics.
Start with p = 2 (Euclidean): it is the most common choice because it is rotation invariant, corresponds to physical straight-line distance, has smooth geometry, and is the default in most KNN implementations.
Consider p = 1 (Manhattan) when features are on different scales or carry independent meaning, the data is high-dimensional or sparse, or robustness to outliers matters more than rotation invariance.
Consider p = ∞ (Chebyshev) when the worst-case deviation in any single feature is what matters, as in the anomaly-detection example above.
The parameter $p$ can be tuned like any other hyperparameter: evaluate a small set of candidate values with cross-validation, as shown in the code below.
In practice, p = 1, 2, and ∞ are tested most; intermediate values rarely provide significant improvement.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris


def tune_minkowski_p():
    """
    Demonstrate tuning the Minkowski p parameter.
    """
    # Load and prepare data
    X, y = load_iris(return_X_y=True)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Test different p values
    p_values = [1, 1.5, 2, 2.5, 3, 4, 5, 10, np.inf]
    results = []

    print("Cross-validation accuracy for different p values:")
    print("-" * 45)

    for p in p_values:
        if p == np.inf:
            # Handle p=∞ specially (sklearn uses the 'chebyshev' metric)
            knn = KNeighborsClassifier(n_neighbors=5, metric='chebyshev')
        else:
            knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=p)

        scores = cross_val_score(knn, X_scaled, y, cv=5)
        mean_score = scores.mean()
        std_score = scores.std()
        results.append((p, mean_score, std_score))

        p_str = f"p={p}" if p != np.inf else "p=∞"
        print(f"  {p_str:8s}: {mean_score:.3f} (+/- {std_score*2:.3f})")

    # Find best p
    best = max(results, key=lambda x: x[1])
    print(f"\nBest: p={best[0]} with accuracy {best[1]:.3f}")

    return results


def sklearn_gridsearch_example():
    """
    Use GridSearchCV to tune both k and p together.
    """
    X, y = load_iris(return_X_y=True)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Grid over both k and p
    param_grid = {
        'n_neighbors': [3, 5, 7, 9],
        'p': [1, 2, 3]  # sklearn's Minkowski p parameter
    }

    knn = KNeighborsClassifier(metric='minkowski')
    grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_scaled, y)

    print("\nGridSearchCV Results:")
    print(f"  Best parameters: {grid_search.best_params_}")
    print(f"  Best CV score:   {grid_search.best_score_:.3f}")


if __name__ == "__main__":
    tune_minkowski_p()
    sklearn_gridsearch_example()
```

For vectors of dimension $n$:
| p | Operations | Notes |
|---|---|---|
| 1 | n subtractions, n abs, n-1 additions | Simplest; no exponentiation |
| 2 | n subtractions, n squares, n-1 additions, 1 sqrt | Square root is expensive |
| general p | n subtractions, n abs, n power ops, n-1 additions, 1 root | pow() is expensive |
| ∞ | n subtractions, n abs, n-1 comparisons | Finding max |
The general case ($p \neq 1, 2, \infty$) requires expensive pow() operations for each dimension, making it significantly slower than the special cases.
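A rough micro-benchmark makes the gap visible. The sketch below times the three special cases against a general fractional $p$ on a large vector; absolute timings depend on the machine and NumPy build, and the vector size and $p = 3.5$ are arbitrary choices:

```python
import timeit
import numpy as np

# Rough micro-benchmark: special-case vs general-p aggregation.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)
diff = np.abs(x - y)

cases = {
    "p=1   (sum)":         lambda: np.sum(diff),
    "p=2   (sqrt dot)":    lambda: np.sqrt(np.dot(diff, diff)),
    "p=inf (max)":         lambda: np.max(diff),
    "p=3.5 (general pow)": lambda: np.power(np.sum(np.power(diff, 3.5)), 1/3.5),
}

for name, fn in cases.items():
    t = timeit.timeit(fn, number=20)
    print(f"{name}: {t * 1000 / 20:.2f} ms per call")
```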
For large $p$, numerical issues can arise:
$$\left( \sum_i |x_i|^p \right)^{1/p}$$
If $\sum |x_i|^p$ overflows (for large $p$ and $|x_i| > 1$) or underflows (for small values), the result is corrupted.
Solution: factor out the largest component before exponentiating (the same rescaling idea behind the log-sum-exp trick):
$$|\mathbf{x}|_p = M \cdot \left( \sum_i \left|\frac{x_i}{M}\right|^p \right)^{1/p}$$
where $M = \max_i |x_i|$. This normalizes values to [0, 1] before exponentiating.
```python
import numpy as np


def minkowski_stable(x: np.ndarray, y: np.ndarray, p: float) -> float:
    """
    Numerically stable Minkowski distance computation.

    Uses scaling to prevent overflow/underflow for large p.
    """
    diff = np.abs(x - y)

    if p == np.inf:
        return np.max(diff)
    elif p == 1:
        return np.sum(diff)
    elif p == 2:
        return np.sqrt(np.dot(diff, diff))
    else:
        # Stable computation using max scaling
        max_val = np.max(diff)
        if max_val == 0:
            return 0.0

        # Normalize by max to keep values in [0, 1]
        scaled = diff / max_val

        # Now scaled values are in [0, 1], so pow() is safe
        sum_powers = np.sum(np.power(scaled, p))

        # Recover the original scale
        return max_val * np.power(sum_powers, 1/p)


def compare_stability():
    """
    Compare naive vs stable implementations for large p.
    """
    # Create vectors with values that could cause overflow
    x = np.array([10.0, 20.0, 15.0])
    y = np.array([0.0, 0.0, 0.0])

    print("Testing numerical stability:")
    for p in [10, 100, 500, 1000]:
        # Naive computation (overflows to inf for large p)
        diff = np.abs(x - y)
        with np.errstate(over='ignore'):
            naive_val = np.power(np.sum(np.power(diff, p)), 1/p)
        naive = "OVERFLOW" if not np.isfinite(naive_val) else f"{naive_val:.4f}"

        # Stable computation
        stable = minkowski_stable(x, y, p)

        # True answer should approach max(diff) = 20 as p → ∞
        print(f"  p={p:4d}: naive={naive}, stable={stable:.4f}, true limit=20.0")


if __name__ == "__main__":
    compare_stability()
```

Most ML libraries support Minkowski distance natively:

• scikit-learn: KNeighborsClassifier(metric='minkowski', p=2)
• scipy: scipy.spatial.distance.minkowski(u, v, p)
• NumPy: np.linalg.norm(x - y, ord=p)

For p = 1, 2, and ∞, the special-case implementations are used automatically, providing optimal performance.
Minkowski distance provides a unified framework for understanding the Lp family of distance metrics. By varying the parameter $p$, we can smoothly transition between geometries, each with distinct properties suited to different problem domains.
The next page explores cosine similarity, a fundamentally different approach that measures the angle between vectors rather than their distance. This metric is essential for text data, high-dimensional sparse vectors, and situations where magnitude should be ignored. We'll see how it relates to Euclidean distance for normalized vectors.
You now command the full Minkowski distance family. You understand how the parameter p shapes geometry, when each special case is appropriate, and how to tune p as a hyperparameter. This prepares you to understand why cosine similarity—coming next—takes a completely different approach.