We have explored Euclidean distance (L²) and Manhattan distance (L¹) as two distinct approaches to measuring similarity. A natural question emerges: Is there a unified framework that connects these metrics?
The answer is the Minkowski distance, named after the German mathematician Hermann Minkowski (1864-1909), who made foundational contributions to the geometry of numbers and later to the mathematical formulation of special relativity.
Minkowski distance generalizes Euclidean and Manhattan distances through a continuous parameter $p$, revealing them as special cases of a broader family. This framework provides both theoretical elegance and practical flexibility—by tuning $p$, we can interpolate between geometries, adapting the distance metric to the structure of our data.
By mastering this page, you will:

• Derive the Minkowski distance formula as the Lp norm of the difference vector
• Understand how p parameterizes the geometry from L¹ through L∞
• Visualize Lp unit balls and their shape transformations as p varies
• Analyze special cases: p=1 (Manhattan), p=2 (Euclidean), p=∞ (Chebyshev)
• Explore fractional p values (0 < p < 1) and their quasi-metric properties
• Implement Minkowski distance efficiently for arbitrary p
• Develop intuition for selecting appropriate p values in practice
For two points in $n$-dimensional space:
$$\mathbf{p} = (p_1, p_2, \ldots, p_n) \quad \text{and} \quad \mathbf{q} = (q_1, q_2, \ldots, q_n)$$
The Minkowski distance of order $p$ (where $p \geq 1$) is defined as:
$$d_p(\mathbf{p}, \mathbf{q}) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{\frac{1}{p}} = |\mathbf{p} - \mathbf{q}|_p$$
This is the Lp norm of the difference vector. The parameter $p$ controls how individual dimension differences are aggregated.
| p | Name | Formula | Aggregation Style |
|---|---|---|---|
| 1 | Manhattan (L¹) | $\sum_i |p_i - q_i|$ | Linear sum |
| 2 | Euclidean (L²) | $\sqrt{\sum_i (p_i - q_i)^2}$ | Root of sum of squares |
| ∞ | Chebyshev (L∞) | $\max_i |p_i - q_i|$ | Maximum difference |
The Chebyshev distance (L∞ norm) is the limit as $p \to \infty$:
$$|\mathbf{x}|_\infty = \lim_{p \to \infty} |\mathbf{x}|_p = \max_i |x_i|$$
Proof sketch: Let $M = \max_i |x_i|$. Then:
$$|\mathbf{x}|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \leq (n \cdot M^p)^{1/p} = n^{1/p} \cdot M$$
As $p \to \infty$, $n^{1/p} \to 1$, so the upper bound $n^{1/p} \cdot M$ tends to $M$.
For the lower bound, the term with maximum absolute value dominates:
$$|\mathbf{x}|_p \geq (M^p)^{1/p} = M$$
By the squeeze theorem, $|\mathbf{x}|_p \to M$ as $p \to \infty$. ∎
```python
import numpy as np
from typing import Union


def minkowski_distance(p_vec: np.ndarray, q_vec: np.ndarray,
                       p: Union[int, float] = 2) -> float:
    """
    Compute the Minkowski distance of order p between two vectors.

    d_p(p, q) = (Σ |p_i - q_i|^p)^(1/p)

    Parameters:
        p_vec: First point as numpy array
        q_vec: Second point as numpy array
        p: Order of the Minkowski distance (p >= 1, or np.inf for Chebyshev)

    Returns:
        The Minkowski distance of order p

    Special cases:
        p=1: Manhattan distance
        p=2: Euclidean distance
        p=np.inf: Chebyshev distance (max absolute difference)
    """
    diff = np.abs(p_vec - q_vec)

    if p == np.inf:
        # Chebyshev distance: maximum absolute difference
        return np.max(diff)
    elif p == 1:
        # Manhattan distance: sum of absolute differences
        return np.sum(diff)
    elif p == 2:
        # Euclidean distance: use optimized sqrt(dot product)
        return np.sqrt(np.dot(diff, diff))
    else:
        # General case
        return np.power(np.sum(np.power(diff, p)), 1/p)


def minkowski_distance_builtin(p_vec: np.ndarray, q_vec: np.ndarray,
                               p: Union[int, float] = 2) -> float:
    """
    Compute Minkowski distance using NumPy's built-in norm function.

    np.linalg.norm with ord=p computes the Lp norm.
    """
    return np.linalg.norm(p_vec - q_vec, ord=p)


# Demonstrate convergence as p → ∞
if __name__ == "__main__":
    p_vec = np.array([0, 0, 0])
    q_vec = np.array([3, 4, 1])  # Differences: 3, 4, 1

    print("Minkowski distances for different p values:")
    print(f"  p=1 (Manhattan):  {minkowski_distance(p_vec, q_vec, 1):.4f}")
    print(f"  p=2 (Euclidean):  {minkowski_distance(p_vec, q_vec, 2):.4f}")
    print(f"  p=3:              {minkowski_distance(p_vec, q_vec, 3):.4f}")
    print(f"  p=5:              {minkowski_distance(p_vec, q_vec, 5):.4f}")
    print(f"  p=10:             {minkowski_distance(p_vec, q_vec, 10):.4f}")
    print(f"  p=100:            {minkowski_distance(p_vec, q_vec, 100):.4f}")
    print(f"  p=∞ (Chebyshev):  {minkowski_distance(p_vec, q_vec, np.inf):.4f}")

    # The maximum difference is 4, so L∞ = 4
    # As p increases, the distance approaches 4
```

The geometry induced by Minkowski distance is beautifully illustrated by the evolution of unit balls as $p$ varies. The unit ball $B_p$ contains all points with $|\mathbf{x}|_p \leq 1$.
| p | Unit Ball Shape | Equation | Vertices/Corners |
|---|---|---|---|
| 1 | Diamond (rotated square) | $|x| + |y| \leq 1$ | 4 corners at axes |
| 2 | Circle | $x^2 + y^2 \leq 1$ | None (smooth) |
| ∞ | Square (axis-aligned) | $\max(|x|, |y|) \leq 1$ | 4 corners at diagonals |
As $p$ increases from 1 to ∞, the unit ball expands monotonically from the diamond toward the square: its intersections with the coordinate axes stay fixed at $\pm 1$, while it bulges outward along the diagonals.
Key insight: The unit ball transitions from a shape with corners on the axes (p=1) to a shape with corners on the diagonals (p=∞), passing through the smooth circle at p=2.
In $n$ dimensions:
| p | Shape Name | Vertices | Facets |
|---|---|---|---|
| 1 | Cross-polytope (hyperoctahedron) | 2n | 2^n |
| 2 | Hypersphere | — | — |
| ∞ | Hypercube | 2^n | 2n |
An important high-dimensional phenomenon: as dimensionality $n$ increases, the volume of Lp unit balls behaves dramatically differently depending on p:

• p = 2 (sphere): Volume → 0 super-exponentially as n → ∞
• p = ∞ (hypercube): Volume = 2^n (exponential growth)
• p = 1 (cross-polytope): Volume = 2^n / n! (smaller than the cube)

This affects how distances concentrate in high dimensions.
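To make these growth rates concrete, here is a minimal sketch that prints the three volumes side by side, using the standard closed-form expressions for the L¹, L², and L∞ unit-ball volumes (the chosen dimensions are arbitrary):

```python
import math

# Quick numerical check of the unit-ball volumes listed above.
#   L2 ball:  pi^(n/2) / Gamma(n/2 + 1)
#   L1 ball:  2^n / n!
#   L∞ ball:  2^n
for n in [2, 5, 10, 20, 50]:
    vol_l2 = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    vol_l1 = 2.0 ** n / math.factorial(n)
    vol_linf = 2.0 ** n
    print(f"n={n:3d}: vol L1 = {vol_l1:.3e}, vol L2 = {vol_l2:.3e}, vol L∞ = {vol_linf:.3e}")
```

Already at n = 20 the cross-polytope's volume is vanishingly small while the hypercube's exceeds a million, which illustrates how differently the metrics treat "typical" points in high dimensions.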
An important relationship: for $1 \leq p \leq q \leq \infty$:
$$|\mathbf{x}|_q \leq |\mathbf{x}|_p \leq n^{\frac{1}{p} - \frac{1}{q}} |\mathbf{x}|_q$$
This implies that for any fixed vector, the Lp norm is non-increasing in $p$: $|\mathbf{x}|_1 \geq |\mathbf{x}|_2 \geq \cdots \geq |\mathbf{x}|_\infty$, and the factor $n^{\frac{1}{p} - \frac{1}{q}}$ bounds how far apart two norms can be.
Geometrically, smaller p values produce "pointier" shapes and larger p values produce "boxy" shapes, with the L² ball nestled in between.
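The bounds are easy to check numerically. Below is a small sketch using np.linalg.norm on a random vector (the seed and dimension are arbitrary choices, not from the original text):

```python
import numpy as np

# Illustrative check (not a proof) of the norm-equivalence bounds
#   ||x||_q <= ||x||_p <= n^(1/p - 1/q) * ||x||_q   for 1 <= p <= q.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
n = x.size

for p, q in [(1, 2), (2, np.inf), (1, np.inf)]:
    norm_p = np.linalg.norm(x, ord=p)
    norm_q = np.linalg.norm(x, ord=q)
    # For q = ∞ the exponent 1/q is 0
    factor = n ** (1 / p - (0 if q == np.inf else 1 / q))
    print(f"p={p}, q={q}: {norm_q:.3f} <= {norm_p:.3f} <= {factor * norm_q:.3f}")
```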
Neighbor selection differs by metric: A query point may have different nearest neighbors under L¹ vs L² vs L∞. Points along axes are relatively closer in L¹; points along diagonals are relatively closer in L∞.
Example: From origin (0,0), compare points A = (1,1) and B = (√2, 0):
| Metric | d(O, A) | d(O, B) | Closer Point |
|---|---|---|---|
| L¹ | 2 | √2 ≈ 1.41 | B |
| L² | √2 ≈ 1.41 | √2 ≈ 1.41 | Tie |
| L∞ | 1 | √2 ≈ 1.41 | A |
```python
import numpy as np
import matplotlib.pyplot as plt


def plot_unit_balls():
    """
    Visualize unit balls for different p values.
    """
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    # Generate angles for parametric plotting
    theta = np.linspace(0, 2 * np.pi, 1000)

    p_values = [1, 2, 4, np.inf]
    titles = ['p=1 (Manhattan)', 'p=2 (Euclidean)', 'p=4', 'p=∞ (Chebyshev)']

    for ax, p, title in zip(axes, p_values, titles):
        if p == np.inf:
            # Square: max(|x|, |y|) = 1
            x = np.array([-1, -1, 1, 1, -1])
            y = np.array([-1, 1, 1, -1, -1])
        elif p == 1:
            # Diamond: |x| + |y| = 1
            x = np.array([1, 0, -1, 0, 1])
            y = np.array([0, 1, 0, -1, 0])
        else:
            # General Lp ball: |x|^p + |y|^p = 1
            # Parametric: x = sgn(cos(θ)) * |cos(θ)|^(2/p)
            #             y = sgn(sin(θ)) * |sin(θ)|^(2/p)
            x = np.sign(np.cos(theta)) * np.abs(np.cos(theta)) ** (2/p)
            y = np.sign(np.sin(theta)) * np.abs(np.sin(theta)) ** (2/p)

        ax.fill(x, y, alpha=0.3, color='blue')
        ax.plot(x, y, 'b-', linewidth=2)
        ax.set_xlim(-1.5, 1.5)
        ax.set_ylim(-1.5, 1.5)
        ax.set_aspect('equal')
        ax.axhline(0, color='gray', linewidth=0.5)
        ax.axvline(0, color='gray', linewidth=0.5)
        ax.set_title(title)
        ax.grid(True, alpha=0.3)

    plt.suptitle('Unit Balls for Different Lp Norms', fontsize=14)
    plt.tight_layout()
    plt.savefig('lp_unit_balls.png', dpi=150, bbox_inches='tight')
    plt.show()


def demonstrate_neighbor_differences():
    """
    Show how different p values can select different neighbors.
    """
    # Query point at origin
    query = np.array([0, 0])

    # Two candidate neighbors
    on_axis = np.array([1.5, 0])        # On the x-axis
    on_diagonal = np.array([1.0, 1.0])  # On the diagonal

    print("Comparing two points from origin:")
    print(f"  Point A (on axis):     {on_axis}")
    print(f"  Point B (on diagonal): {on_diagonal}")
    print()

    for p in [1, 2, 3, 4, 10, np.inf]:
        d_axis = np.linalg.norm(query - on_axis, ord=p)
        d_diag = np.linalg.norm(query - on_diagonal, ord=p)

        closer = "A (axis)" if d_axis < d_diag else "B (diagonal)" if d_diag < d_axis else "Tie"
        p_str = "∞" if p == np.inf else str(p)
        print(f"  p={p_str}: d(A)={d_axis:.3f}, d(B)={d_diag:.3f} → Closer: {closer}")


if __name__ == "__main__":
    demonstrate_neighbor_differences()
    # plot_unit_balls()  # Uncomment if matplotlib is available
```

The Chebyshev distance (L∞ norm), named after the Russian mathematician Pafnuty Chebyshev, deserves special attention. It represents the limit of Minkowski distance as $p \to \infty$.
$$d_\infty(\mathbf{p}, \mathbf{q}) = \max_{i=1,\ldots,n} |p_i - q_i| = |\mathbf{p} - \mathbf{q}|_\infty$$
Only the single largest component difference matters; all other differences are ignored.
Chebyshev distance is famously the distance a chess king travels on an infinite board: a king moves one square in any direction, diagonals included, so reaching a square $\Delta x$ files and $\Delta y$ ranks away takes exactly $\max(|\Delta x|, |\Delta y|)$ moves.
This is unlike a rook-style path, whose length is the Manhattan distance $|\Delta x| + |\Delta y|$, or the straight-line Euclidean distance $\sqrt{\Delta x^2 + \Delta y^2}$.
Extreme sparsity of influence: L∞ is maximally sparse—only one dimension contributes to the distance (the one with maximum difference). All other dimensions are completely ignored.
Rectangular equidistant sets: In L∞, all points equidistant from the origin form a square (2D) or hypercube (nD), aligned with the coordinate axes.
Chebyshev distance is ideal when:

• Any feature being different enough matters: A single out-of-spec measurement should trigger an alert
• Worst-case sensitivity: You care about the maximum deviation, not the average
• Grid movement with diagonals: Pathfinding where diagonal moves cost the same as axis-aligned moves
• Cache/memory access patterns: L∞ relates to cache-line access patterns in computing
```python
import numpy as np


def chebyshev_distance(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Chebyshev distance (L∞): maximum absolute difference.

    This is the distance a chess king would travel on an infinite board.
    """
    return np.max(np.abs(p - q))


def chess_king_example():
    """
    Demonstrate Chebyshev distance with chess king movement.
    """
    # Starting position (0, 0) to target (5, 3)
    start = np.array([0, 0])
    target = np.array([5, 3])

    chebyshev = chebyshev_distance(start, target)
    manhattan = np.sum(np.abs(target - start))
    euclidean = np.linalg.norm(target - start)

    print("Chess King from (0,0) to (5,3):")
    print(f"  Chebyshev (king moves needed): {chebyshev}")      # 5
    print(f"  Manhattan (rook moves needed): {manhattan}")      # 8
    print(f"  Euclidean (flying distance):   {euclidean:.2f}")  # 5.83

    # The king can reach (5,3) in 5 moves:
    # Move diagonally 3 times (covers both x and y), then 2 more x-axis moves
    print("\nKing's optimal path: 3 diagonal moves + 2 horizontal moves = 5 total")


def worst_case_monitoring():
    """
    Use Chebyshev distance for anomaly detection.
    """
    # Normal operating parameters
    normal_state = np.array([100, 50, 75, 200])   # temp, pressure, flow, rpm

    # Current readings with one anomaly
    current_state = np.array([102, 51, 74, 350])  # rpm is way off!

    # Different distance perspectives
    chebyshev = chebyshev_distance(normal_state, current_state)
    euclidean = np.linalg.norm(current_state - normal_state)
    manhattan = np.sum(np.abs(current_state - normal_state))

    print("\nAnomaly Detection Example:")
    print(f"  Normal state:  {normal_state}")
    print(f"  Current state: {current_state}")
    print(f"\n  Chebyshev (max deviation): {chebyshev}")  # 150 (catches the RPM!)
    print(f"  Euclidean distance:        {euclidean:.2f}")
    print(f"  Manhattan distance:        {manhattan}")

    # For anomaly detection, Chebyshev immediately shows the worst offender
    diff = np.abs(current_state - normal_state)
    worst_feature = np.argmax(diff)
    feature_names = ['temperature', 'pressure', 'flow', 'rpm']
    print(f"\n  Worst deviation: {feature_names[worst_feature]} "
          f"(off by {diff[worst_feature]})")


if __name__ == "__main__":
    chess_king_example()
    worst_case_monitoring()
```

| Property | L¹ (Manhattan) | L² (Euclidean) | L∞ (Chebyshev) |
|---|---|---|---|
| Formula | Σ|xᵢ| | √(Σxᵢ²) | max|xᵢ| |
| Unit ball (2D) | Diamond | Circle | Square |
| Rotation invariant | No | Yes | No |
| Features contributing | All equally | All (weighted by square) | Only maximum |
| Outlier sensitivity | Moderate | High | Maximum (only worst matters) |
| Chess analogy | Rook | — | King |
| Computation | O(n) adds | O(n) + sqrt | O(n) to find max |
What happens when $0 < p < 1$? This regime produces interesting "quasi-metrics" that violate the triangle inequality but have useful properties.
The formula still makes sense:
$$|\mathbf{x}|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}}$$
However, for $0 < p < 1$ this is not a true norm: the triangle inequality fails, and equivalently the unit "ball" is no longer convex. It is only a quasi-norm.
Despite not being a true metric, fractional p values have interesting properties:
Enhanced sparsity: Smaller p values emphasize sparse vectors (many zeros). The L0 "norm" (limit as p→0) counts the number of non-zero entries.
Robust estimation: In statistics, Lp regression with $0 < p < 1$ is even more robust to outliers than L¹ (though harder to optimize).
Feature selection: The L0.5 quasi-norm has been used in compressed sensing as a sparsity-inducing regularizer.
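As a concrete illustration of the L⁰ limit mentioned above, the sum $\sum_i |x_i|^p$ (before taking the $1/p$ root) tends to the count of non-zero entries as $p \to 0$. A minimal sketch (the example vector is an arbitrary choice):

```python
import numpy as np

# As p → 0, the sum of |x_i|^p (without the 1/p root) approaches the number
# of non-zero entries — the L0 "norm".
x = np.array([3.0, 0.0, -0.5, 0.0, 2.0])   # 3 non-zero entries

for p in [1.0, 0.5, 0.1, 0.01]:
    nonzero = x[x != 0]                     # zero entries contribute nothing for p > 0
    sum_p = np.sum(np.abs(nonzero) ** p)
    print(f"p={p:5.2f}: sum |x_i|^p = {sum_p:.4f}")

print(f"L0 count of non-zeros: {np.count_nonzero(x)}")
```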
For $p < 1$, the unit "ball" becomes non-convex and "star-shaped": it still touches the axes at $\pm 1$, but the boundary curves inward between them, pinching toward the origin along the diagonals.
When p < 1, the triangle inequality can fail dramatically. For example, with p = 0.5, consider x = (1, 0) and y = (0, 1):

• ‖x‖₀.₅ = 1
• ‖y‖₀.₅ = 1
• ‖x + y‖₀.₅ = ‖(1,1)‖₀.₅ = (1^0.5 + 1^0.5)^2 = 4

But 4 > 1 + 1, so the triangle inequality fails!

Implication: You cannot use p < 1 with KNN data structures that rely on the triangle inequality (KD-trees, ball trees).
```python
import numpy as np


def lp_norm(x: np.ndarray, p: float) -> float:
    """Compute Lp norm (works for any p > 0)."""
    return np.power(np.sum(np.power(np.abs(x), p)), 1/p)


def demonstrate_triangle_violation():
    """
    Show that triangle inequality fails for p < 1.
    """
    x = np.array([1.0, 0.0])
    y = np.array([0.0, 1.0])

    print("Triangle inequality test for different p values:")
    print(f"  x = {x}, y = {y}, x+y = {x+y}")
    print()

    for p in [0.5, 0.75, 1.0, 1.5, 2.0]:
        norm_x = lp_norm(x, p)
        norm_y = lp_norm(y, p)
        norm_sum = lp_norm(x + y, p)

        # Triangle inequality: ||x + y|| <= ||x|| + ||y||
        lhs = norm_sum
        rhs = norm_x + norm_y
        holds = "✓ Holds" if lhs <= rhs + 1e-10 else "✗ VIOLATED"

        print(f"  p = {p}: ||x+y||={lhs:.3f}, ||x||+||y||={rhs:.3f} → {holds}")


def sparsity_comparison():
    """
    Show how smaller p emphasizes sparsity.
    """
    # Sparse vector (mostly zeros)
    sparse = np.array([1, 0, 0, 0, 0])

    # Dense vector (small values everywhere)
    dense = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

    print("\nSparsity comparison:")
    print(f"  Sparse vector: {sparse}")
    print(f"  Dense vector:  {dense}")
    print()

    for p in [0.5, 1.0, 2.0]:
        norm_sparse = lp_norm(sparse, p)
        norm_dense = lp_norm(dense, p)
        ratio = norm_sparse / norm_dense

        print(f"  p = {p}: sparse = {norm_sparse:.3f}, "
              f"dense = {norm_dense:.3f}, ratio = {ratio:.3f}")

    print("\n  As p decreases, sparse vectors have relatively smaller norms.")
    print("  This is why small p values encourage sparsity in optimization.")


if __name__ == "__main__":
    demonstrate_triangle_violation()
    sparsity_comparison()
```

Let us formally verify the metric axioms for Minkowski distance with $p \geq 1$.
$$d_p(\mathbf{x}, \mathbf{y}) \geq 0$$
Proof: Absolute values are non-negative, powers of non-negative numbers are non-negative, sums of non-negative numbers are non-negative, and the p-th root of a non-negative number is non-negative. ∎
$$d_p(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}$$
Proof: If $\mathbf{x} = \mathbf{y}$, all differences are zero, so $d_p = 0$. Conversely, if $d_p = 0$, then $\sum |x_i - y_i|^p = 0$. Since each term is non-negative, all terms must be zero, implying $x_i = y_i$ for all $i$. ∎
$$d_p(\mathbf{x}, \mathbf{y}) = d_p(\mathbf{y}, \mathbf{x})$$
Proof: $|x_i - y_i| = |y_i - x_i|$ for all $i$. ∎
$$d_p(\mathbf{x}, \mathbf{z}) \leq d_p(\mathbf{x}, \mathbf{y}) + d_p(\mathbf{y}, \mathbf{z})$$
This follows from the Minkowski inequality for norms:
$$|\mathbf{a} + \mathbf{b}|_p \leq |\mathbf{a}|_p + |\mathbf{b}|_p$$
valid for $p \geq 1$.
Proof outline: The Minkowski inequality is proven using the Hölder inequality:
$$\sum_i |a_i b_i| \leq |\mathbf{a}|_p \cdot |\mathbf{b}|_q$$
where $\frac{1}{p} + \frac{1}{q} = 1$ (conjugate exponents).
Hölder's inequality yields Minkowski's inequality for norms; the triangle inequality for $d_p$ then follows by setting $\mathbf{a} = \mathbf{x} - \mathbf{y}$ and $\mathbf{b} = \mathbf{y} - \mathbf{z}$, so that $\mathbf{a} + \mathbf{b} = \mathbf{x} - \mathbf{z}$. ∎
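Both inequalities are easy to spot-check numerically. The sketch below evaluates each side on random vectors (the seed, dimension, and the choice $p = 3$ are arbitrary assumptions for illustration, not part of the proof):

```python
import numpy as np

# Spot-check Hölder's and Minkowski's inequalities on random vectors.
rng = np.random.default_rng(42)
a = rng.normal(size=6)
b = rng.normal(size=6)

p = 3.0
q = p / (p - 1)  # conjugate exponent: 1/p + 1/q = 1

holder_lhs = np.sum(np.abs(a * b))
holder_rhs = np.linalg.norm(a, ord=p) * np.linalg.norm(b, ord=q)
print(f"Hölder:    {holder_lhs:.4f} <= {holder_rhs:.4f}")

mink_lhs = np.linalg.norm(a + b, ord=p)
mink_rhs = np.linalg.norm(a, ord=p) + np.linalg.norm(b, ord=p)
print(f"Minkowski: {mink_lhs:.4f} <= {mink_rhs:.4f}")
```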
Hermann Minkowski (1864-1909) was a brilliant mathematician and physicist. His contributions include:

• Geometry of numbers: The mathematical foundation of crystallography and lattice theory
• Minkowski space: The four-dimensional spacetime of special relativity (his former student Einstein called it "superfluous learnedness" before realizing its importance)
• Minkowski inequality: The triangle inequality for Lp spaces

The Minkowski distance is named for his work on generalizing geometric concepts beyond Euclidean space.
| p Range | Norm? | Metric? | Triangle Inequality | Unit Ball Convex? |
|---|---|---|---|---|
| 0 < p < 1 | No (quasi-norm) | No | Fails | No (star-shaped) |
| p = 1 | Yes (L¹) | Yes (Manhattan) | Holds (tight) | Yes (cross-polytope) |
| 1 < p < 2 | Yes | Yes | Holds | Yes |
| p = 2 | Yes (L²) | Yes (Euclidean) | Holds (Cauchy-Schwarz) | Yes (sphere) |
| 2 < p < ∞ | Yes | Yes | Holds | Yes |
| p = ∞ | Yes (L∞) | Yes (Chebyshev) | Holds (tight) | Yes (hypercube) |
With the full Minkowski family available, how do we choose the right $p$ for a KNN problem? This is ultimately an empirical question, but we can develop useful heuristics.
Start with p = 2 (Euclidean): it is the most common choice because it is rotation invariant, corresponds to physical straight-line distance, has smooth geometry, and is the default in most KNN implementations.
Consider p = 1 (Manhattan) when features are on different scales or carry independent meaning, the data is high-dimensional or sparse, or robustness to outliers matters more than rotation invariance.
Consider p = ∞ (Chebyshev) when the worst-case deviation in any single feature is what matters, as in the anomaly-detection example above.
The parameter $p$ can be tuned like any other hyperparameter: evaluate a small set of candidate values with cross-validation, as shown in the code below.
In practice, p = 1, 2, and ∞ are tested most; intermediate values rarely provide significant improvement.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris


def tune_minkowski_p():
    """
    Demonstrate tuning the Minkowski p parameter.
    """
    # Load and prepare data
    X, y = load_iris(return_X_y=True)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Test different p values
    p_values = [1, 1.5, 2, 2.5, 3, 4, 5, 10, np.inf]
    results = []

    print("Cross-validation accuracy for different p values:")
    print("-" * 45)

    for p in p_values:
        if p == np.inf:
            # Handle p=∞ specially (sklearn uses the 'chebyshev' metric)
            knn = KNeighborsClassifier(n_neighbors=5, metric='chebyshev')
        else:
            knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=p)

        scores = cross_val_score(knn, X_scaled, y, cv=5)
        mean_score = scores.mean()
        std_score = scores.std()
        results.append((p, mean_score, std_score))

        p_str = f"p={p}" if p != np.inf else "p=∞"
        print(f"  {p_str:8s}: {mean_score:.3f} (+/- {std_score*2:.3f})")

    # Find best p
    best = max(results, key=lambda x: x[1])
    print(f"\nBest: p={best[0]} with accuracy {best[1]:.3f}")

    return results


def sklearn_gridsearch_example():
    """
    Use GridSearchCV to tune both k and p together.
    """
    X, y = load_iris(return_X_y=True)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Grid over both k and p
    param_grid = {
        'n_neighbors': [3, 5, 7, 9],
        'p': [1, 2, 3]  # sklearn's Minkowski p parameter
    }

    knn = KNeighborsClassifier(metric='minkowski')
    grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_scaled, y)

    print("\nGridSearchCV Results:")
    print(f"  Best parameters: {grid_search.best_params_}")
    print(f"  Best CV score:   {grid_search.best_score_:.3f}")


if __name__ == "__main__":
    tune_minkowski_p()
    sklearn_gridsearch_example()
```

For vectors of dimension $n$:
| p | Operations | Notes |
|---|---|---|
| 1 | n subtractions, n abs, n-1 additions | Simplest; no exponentiation |
| 2 | n subtractions, n squares, n-1 additions, 1 sqrt | Square root is expensive |
| general p | n subtractions, n abs, n power ops, n-1 additions, 1 root | pow() is expensive |
| ∞ | n subtractions, n abs, n-1 comparisons | Finding max |
The general case ($p \neq 1, 2, \infty$) requires expensive pow() operations for each dimension, making it significantly slower than the special cases.
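A rough micro-benchmark makes the gap visible. The sketch below times the three special cases against a general fractional $p$ on a large vector; absolute timings depend on the machine and NumPy build, and the vector size and $p = 3.5$ are arbitrary choices:

```python
import timeit
import numpy as np

# Rough micro-benchmark: special-case vs general-p aggregation.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)
diff = np.abs(x - y)

cases = {
    "p=1   (sum)":         lambda: np.sum(diff),
    "p=2   (sqrt dot)":    lambda: np.sqrt(np.dot(diff, diff)),
    "p=inf (max)":         lambda: np.max(diff),
    "p=3.5 (general pow)": lambda: np.power(np.sum(np.power(diff, 3.5)), 1/3.5),
}

for name, fn in cases.items():
    t = timeit.timeit(fn, number=20)
    print(f"{name}: {t * 1000 / 20:.2f} ms per call")
```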
For large $p$, numerical issues can arise:
$$\left( \sum_i |x_i|^p \right)^{1/p}$$
If $\sum |x_i|^p$ overflows (for large $p$ and $|x_i| > 1$) or underflows (for small values), the result is corrupted.
Solution: factor out the largest component before exponentiating (the same rescaling idea behind the log-sum-exp trick):
$$|\mathbf{x}|_p = M \cdot \left( \sum_i \left|\frac{x_i}{M}\right|^p \right)^{1/p}$$
where $M = \max_i |x_i|$. This normalizes values to [0, 1] before exponentiating.
```python
import numpy as np


def minkowski_stable(x: np.ndarray, y: np.ndarray, p: float) -> float:
    """
    Numerically stable Minkowski distance computation.

    Uses scaling to prevent overflow/underflow for large p.
    """
    diff = np.abs(x - y)

    if p == np.inf:
        return np.max(diff)
    elif p == 1:
        return np.sum(diff)
    elif p == 2:
        return np.sqrt(np.dot(diff, diff))
    else:
        # Stable computation using max scaling
        max_val = np.max(diff)
        if max_val == 0:
            return 0.0

        # Normalize by max to keep values in [0, 1]
        scaled = diff / max_val

        # Now scaled values are in [0, 1], so pow() is safe
        sum_powers = np.sum(np.power(scaled, p))

        # Recover the original scale
        return max_val * np.power(sum_powers, 1/p)


def compare_stability():
    """
    Compare naive vs stable implementations for large p.
    """
    # Create vectors with values that could cause overflow
    x = np.array([10.0, 20.0, 15.0])
    y = np.array([0.0, 0.0, 0.0])

    print("Testing numerical stability:")
    for p in [10, 100, 500, 1000]:
        # Naive computation (overflows to inf for large p)
        diff = np.abs(x - y)
        with np.errstate(over='ignore'):
            naive_val = np.power(np.sum(np.power(diff, p)), 1/p)
        naive = "OVERFLOW" if not np.isfinite(naive_val) else f"{naive_val:.4f}"

        # Stable computation
        stable = minkowski_stable(x, y, p)

        # True answer should approach max(diff) = 20 as p → ∞
        print(f"  p={p:4d}: naive={naive}, stable={stable:.4f}, true limit=20.0")


if __name__ == "__main__":
    compare_stability()
```

Most ML libraries support Minkowski distance natively:

• scikit-learn: KNeighborsClassifier(metric='minkowski', p=2)
• scipy: scipy.spatial.distance.minkowski(u, v, p)
• NumPy: np.linalg.norm(x - y, ord=p)

For p = 1, 2, and ∞, the special-case implementations are used automatically, providing optimal performance.
Minkowski distance provides a unified framework for understanding the Lp family of distance metrics. By varying the parameter $p$, we can smoothly transition between geometries, each with distinct properties suited to different problem domains.
The next page explores cosine similarity, a fundamentally different approach that measures the angle between vectors rather than their distance. This metric is essential for text data, high-dimensional sparse vectors, and situations where magnitude should be ignored. We'll see how it relates to Euclidean distance for normalized vectors.
You now command the full Minkowski distance family. You understand how the parameter p shapes geometry, when each special case is appropriate, and how to tune p as a hyperparameter. This prepares you to understand why cosine similarity—coming next—takes a completely different approach.