In the realm of machine learning, the concept of distance serves as the mathematical foundation for determining how similar or dissimilar two data points are. Among all distance measures, Euclidean distance stands as the most intuitive and historically significant—it is the "straight-line" distance we experience in physical space.
When we say that K-Nearest Neighbors finds the K closest training examples to make predictions, we must precisely define what closest means. This seemingly simple question opens a rich mathematical domain with profound implications for algorithm behavior, computational efficiency, and prediction quality.
Euclidean distance, named after the ancient Greek mathematician Euclid of Alexandria, extends the familiar notion of distance from the Pythagorean theorem to arbitrary dimensions. It forms the default metric in most implementations of KNN and serves as the reference point against which all other distance measures are compared.
By mastering this page, you will:

• Derive the Euclidean distance formula from first principles using the Pythagorean theorem
• Understand the geometric interpretation in 2D, 3D, and n-dimensional spaces
• Prove that Euclidean distance satisfies the formal axioms of a metric
• Analyze computational considerations including the squared-distance optimization
• Recognize scenarios where Euclidean distance is ideal versus problematic
• Implement efficient Euclidean distance calculations in production code
The Euclidean distance formula emerges naturally from the Pythagorean theorem, one of the oldest and most fundamental results in mathematics. Let us build the formula systematically, starting from the familiar two-dimensional case and extending to arbitrary dimensions.
Consider two points in the Cartesian plane:
$$\mathbf{p} = (x_1, y_1) \quad \text{and} \quad \mathbf{q} = (x_2, y_2)$$
To find the straight-line distance between $\mathbf{p}$ and $\mathbf{q}$, we construct a right triangle where:

• the horizontal leg has length $|x_2 - x_1|$
• the vertical leg has length $|y_2 - y_1|$
• the hypotenuse is the segment joining $\mathbf{p}$ and $\mathbf{q}$, whose length $d$ is the distance we seek
By the Pythagorean theorem, the hypotenuse $d$ satisfies:
$$d^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$$
Taking the square root:
$$d(\mathbf{p}, \mathbf{q}) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
This is the Euclidean distance in two dimensions.
We use $(x_2 - x_1)^2$ rather than $|x_2 - x_1|$ for several reasons:

1. Squaring eliminates the need for absolute values — negative differences become positive automatically
2. Squaring is differentiable everywhere — unlike the absolute value, which is not differentiable at zero
3. The Pythagorean theorem naturally produces squares — the relationship $c^2 = a^2 + b^2$ is fundamental

This choice has profound implications for optimization algorithms, as we will see throughout machine learning.
The extension to three dimensions is straightforward. For points:
$$\mathbf{p} = (x_1, y_1, z_1) \quad \text{and} \quad \mathbf{q} = (x_2, y_2, z_2)$$
We apply the Pythagorean theorem twice. First, in the $xy$-plane, the horizontal displacement between the two points satisfies:

$$d_{xy}^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$$

Then, treating this horizontal displacement and the vertical difference $(z_2 - z_1)$ as the legs of a second right triangle:

$$d^2 = d_{xy}^2 + (z_2 - z_1)^2$$

Substituting the first expression into the second:
$$d^2 = (x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2$$
$$d(\mathbf{p}, \mathbf{q}) = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$$
Inductively extending this pattern to $n$ dimensions, for vectors:
$$\mathbf{p} = (p_1, p_2, \ldots, p_n) \quad \text{and} \quad \mathbf{q} = (q_1, q_2, \ldots, q_n)$$
The Euclidean distance (also called the L² norm of the difference vector) is:
$$d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2} = |\mathbf{p} - \mathbf{q}|_2$$
This elegant formula encapsulates the "straight-line" distance in any number of dimensions.
```python
import numpy as np
from typing import Union, List
import math


def euclidean_distance_naive(p: List[float], q: List[float]) -> float:
    """
    Compute Euclidean distance using the explicit formula.

    This implementation follows the mathematical definition directly:
    d(p, q) = sqrt(sum((p_i - q_i)^2 for all i))

    Parameters:
        p: First point as a list of coordinates
        q: Second point as a list of coordinates

    Returns:
        The Euclidean distance between p and q

    Example:
        >>> euclidean_distance_naive([0, 0], [3, 4])
        5.0  # The classic 3-4-5 right triangle
    """
    if len(p) != len(q):
        raise ValueError("Points must have the same dimensionality")

    # Compute sum of squared differences
    sum_squared_diff = sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q))

    # Take the square root
    return math.sqrt(sum_squared_diff)


def euclidean_distance_numpy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Euclidean distance using NumPy's optimized operations.

    This leverages:
    1. Vectorized subtraction across all dimensions
    2. Efficient dot product for sum of squares
    3. Optimized square root implementation

    Parameters:
        p: First point as numpy array
        q: Second point as numpy array

    Returns:
        The Euclidean distance between p and q

    Performance Note:
        For high-dimensional vectors (n > 100), this is typically
        10-100x faster than the naive Python implementation due to
        NumPy's C-level optimizations and potential SIMD vectorization.
    """
    diff = p - q
    return np.sqrt(np.dot(diff, diff))


def euclidean_distance_builtin(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Euclidean distance using NumPy's built-in norm function.

    np.linalg.norm computes the L2 norm by default, which is exactly
    the Euclidean distance when applied to the difference vector.
    This is the most readable and numerically stable option for
    production code.
    """
    return np.linalg.norm(p - q)


# Demonstration of equivalence
if __name__ == "__main__":
    # Classic 3-4-5 triangle in 2D
    p1 = [0, 0]
    q1 = [3, 4]
    print(f"2D distance: {euclidean_distance_naive(p1, q1)}")  # 5.0

    # 3D example
    p2 = np.array([1, 2, 3])
    q2 = np.array([4, 6, 3])
    print(f"3D distance: {euclidean_distance_numpy(p2, q2)}")  # 5.0

    # High-dimensional example (100D)
    p3 = np.random.randn(100)
    q3 = np.random.randn(100)
    print(f"100D distance: {euclidean_distance_builtin(p3, q3):.4f}")
```

Euclidean distance has a profound connection to linear algebra and vector spaces. Understanding this connection provides deeper insight into why Euclidean distance behaves as it does and how it relates to other concepts in machine learning.
The Euclidean distance between points $\mathbf{p}$ and $\mathbf{q}$ can be expressed as the L² norm (also called the Euclidean norm) of their difference vector:
$$d(\mathbf{p}, \mathbf{q}) = |\mathbf{p} - \mathbf{q}|_2$$
where the L² norm of a vector $\mathbf{v} = (v_1, v_2, \ldots, v_n)$ is:
$$|\mathbf{v}|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$
This interpretation reveals that distance is fundamentally about the magnitude of the displacement vector between two points.
The squared Euclidean distance has an elegant expression using inner products (dot products):
$$d(\mathbf{p}, \mathbf{q})^2 = (\mathbf{p} - \mathbf{q}) \cdot (\mathbf{p} - \mathbf{q}) = \mathbf{p} \cdot \mathbf{p} - 2\mathbf{p} \cdot \mathbf{q} + \mathbf{q} \cdot \mathbf{q}$$
This expansion is computationally useful:
$$d(\mathbf{p}, \mathbf{q})^2 = |\mathbf{p}|^2 - 2\mathbf{p}^T\mathbf{q} + |\mathbf{q}|^2$$
Key insight: If we precompute $|\mathbf{p}|^2$ for all training points, computing distances to a query point $\mathbf{q}$ requires only the dot products $\mathbf{p}^T\mathbf{q}$ and the single value $|\mathbf{q}|^2$. This is especially efficient for matrix-based implementations.
The inner product $\mathbf{p} \cdot \mathbf{q} = |\mathbf{p}| |\mathbf{q}| \cos\theta$ connects Euclidean distance to the angle $\theta$ between vectors. This relationship is fundamental to understanding why Euclidean distance and cosine similarity (covered later) capture different aspects of vector relationships.
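As a quick numerical illustration of this connection, the following sketch (assuming only NumPy; the vector dimension and seed are arbitrary) normalizes two random vectors to unit length and checks that the squared Euclidean distance equals $2 - 2\cos\theta$:

```python
import numpy as np

# A minimal check of ||p - q||^2 = ||p||^2 + ||q||^2 - 2 p.q,
# which for unit-length vectors reduces to 2 - 2*cos(theta).
rng = np.random.default_rng(0)
p = rng.standard_normal(50)
q = rng.standard_normal(50)

# Normalize to unit length so that p.q equals cos(theta)
p /= np.linalg.norm(p)
q /= np.linalg.norm(q)

cos_theta = np.dot(p, q)
squared_distance = np.sum((p - q) ** 2)

print(f"||p - q||^2      = {squared_distance:.6f}")
print(f"2 - 2*cos(theta) = {2 - 2 * cos_theta:.6f}")  # matches up to float error
```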
Every distance metric induces a geometry on the space. The unit ball for a metric is the set of all points at distance ≤ 1 from the origin:
$$B = \{\mathbf{x} : d(\mathbf{0}, \mathbf{x}) \leq 1\}$$
For Euclidean distance, the unit ball is the set $\{\mathbf{x} : \sum_{i=1}^{n} x_i^2 \leq 1\}$: a disk in 2D, a solid sphere in 3D, and a hypersphere (n-ball) in higher dimensions.
This spherical geometry has important implications:
Isotropy: Euclidean distance treats all directions equally. Moving 1 unit in any direction produces the same distance.
Rotation Invariance: Euclidean distance is unchanged by rotations of the coordinate system. Mathematically, for any orthogonal matrix $R$: $$d(R\mathbf{p}, R\mathbf{q}) = d(\mathbf{p}, \mathbf{q})$$
Translation Invariance: Distance depends only on the relative positions, not absolute positions: $$d(\mathbf{p} + \mathbf{t}, \mathbf{q} + \mathbf{t}) = d(\mathbf{p}, \mathbf{q})$$ for any translation vector $\mathbf{t}$
These properties make Euclidean distance natural for physical measurements but can be problematic when features have different scales or when the geometry of the data is non-spherical.
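The invariance properties above are easy to verify numerically. Here is a minimal sketch, assuming NumPy; the orthogonal matrix is taken from the QR decomposition of a random Gaussian matrix, and the dimension is arbitrary:

```python
import numpy as np

# Sketch: numerically verify rotation and translation invariance in 5 dimensions.
rng = np.random.default_rng(42)
p = rng.standard_normal(5)
q = rng.standard_normal(5)

# The first QR factor of a random matrix is orthogonal; call it R to match the text.
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))
t = rng.standard_normal(5)  # arbitrary translation vector

d_original = np.linalg.norm(p - q)
d_rotated = np.linalg.norm(R @ p - R @ q)
d_translated = np.linalg.norm((p + t) - (q + t))

print(f"d(p, q)         = {d_original:.10f}")
print(f"d(Rp, Rq)       = {d_rotated:.10f}")    # same up to float rounding
print(f"d(p + t, q + t) = {d_translated:.10f}")  # identical
```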
| Dimension | Unit Ball Shape | Surface Area | Volume | Notable Property |
|---|---|---|---|---|
| 1D | Line segment [-1, 1] | 2 (endpoints) | 2 (length) | Absolute difference $\lvert p - q \rvert$ |
| 2D | Circle (disk) | 2π (circumference) | π | Classic 2D distance |
| 3D | Sphere | 4π | 4π/3 | Physical space distance |
| nD | n-hypersphere (n-ball) | → 0 as n → ∞ | ~(2πe/n)^(n/2) → 0 | Volume concentrates near the boundary |
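To make the vanishing volume concrete, the short sketch below evaluates the closed-form volume of the unit $n$-ball, $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$, using only Python's standard library; the chosen dimensions are arbitrary:

```python
import math

# Volume of the unit n-ball: V_n = pi^(n/2) / Gamma(n/2 + 1)
def unit_ball_volume(n: int) -> float:
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in [1, 2, 3, 5, 10, 20, 50, 100]:
    print(f"n = {n:3d}   volume = {unit_ball_volume(n):.3e}")
# The volume peaks around n = 5 and then collapses toward zero,
# illustrating why high-dimensional unit balls are "mostly empty".
```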
A distance function (or metric) must satisfy four fundamental axioms to be mathematically well-defined. These axioms capture our intuitive understanding of what "distance" should mean. We will prove that Euclidean distance satisfies all four axioms, making it a proper metric.
A function $d: X \times X \to \mathbb{R}$ is a metric on set $X$ if for all points $\mathbf{p}, \mathbf{q}, \mathbf{r} \in X$:
Axiom 1 (Non-negativity):

$$d(\mathbf{p}, \mathbf{q}) \geq 0$$
Proof for Euclidean distance: Since $(p_i - q_i)^2 \geq 0$ for all $i$, the sum is non-negative, and the square root of a non-negative number is non-negative. ∎
Axiom 2 (Identity of Indiscernibles):

$$d(\mathbf{p}, \mathbf{q}) = 0 \iff \mathbf{p} = \mathbf{q}$$
Proof for Euclidean distance: The sum $\sum_{i=1}^{n}(p_i - q_i)^2$ equals zero if and only if every term $(p_i - q_i)^2$ is zero, which holds if and only if $p_i = q_i$ for all $i$, i.e., $\mathbf{p} = \mathbf{q}$. ∎
Axiom 3 (Symmetry):

$$d(\mathbf{p}, \mathbf{q}) = d(\mathbf{q}, \mathbf{p})$$
Proof for Euclidean distance: Since $(p_i - q_i)^2 = (q_i - p_i)^2$, the sums are equal. ∎
Axiom 4 (Triangle Inequality):

$$d(\mathbf{p}, \mathbf{r}) \leq d(\mathbf{p}, \mathbf{q}) + d(\mathbf{q}, \mathbf{r})$$
The triangle inequality states that the direct path is never longer than any indirect path—"the shortest distance between two points is a straight line."
The triangle inequality for Euclidean distance follows from the Cauchy-Schwarz inequality. The proof requires establishing that for any vectors $\mathbf{u}$ and $\mathbf{v}$:

$|\mathbf{u} + \mathbf{v}| \leq |\mathbf{u}| + |\mathbf{v}|$

Setting $\mathbf{u} = \mathbf{p} - \mathbf{q}$ and $\mathbf{v} = \mathbf{q} - \mathbf{r}$ gives the triangle inequality for distances.
We prove the more general result: for vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$:
$$|\mathbf{u} + \mathbf{v}|_2 \leq |\mathbf{u}|_2 + |\mathbf{v}|_2$$
Step 1: Since both sides are non-negative, it suffices to prove the squared inequality: $$|\mathbf{u} + \mathbf{v}|_2^2 \leq (|\mathbf{u}|_2 + |\mathbf{v}|_2)^2$$
Step 2: Expand the left side: $$|\mathbf{u} + \mathbf{v}|_2^2 = (\mathbf{u} + \mathbf{v}) \cdot (\mathbf{u} + \mathbf{v}) = |\mathbf{u}|_2^2 + 2\mathbf{u} \cdot \mathbf{v} + |\mathbf{v}|_2^2$$
Step 3: Expand the right side: $$(|\mathbf{u}|_2 + |\mathbf{v}|_2)^2 = |\mathbf{u}|_2^2 + 2|\mathbf{u}|_2|\mathbf{v}|_2 + |\mathbf{v}|_2^2$$
Step 4: We need to show: $$\mathbf{u} \cdot \mathbf{v} \leq |\mathbf{u}|_2 |\mathbf{v}|_2$$
This is the Cauchy-Schwarz inequality, one of the most important inequalities in mathematics.
Step 5: Cauchy-Schwarz proof (brief): Define $f(t) = |\mathbf{u} + t\mathbf{v}|_2^2 = |\mathbf{u}|_2^2 + 2t(\mathbf{u} \cdot \mathbf{v}) + t^2|\mathbf{v}|_2^2$
This parabola in $t$ is always non-negative. For a quadratic $at^2 + bt + c \geq 0$ to hold for all $t$, the discriminant must be non-positive: $b^2 - 4ac \leq 0$.
Applying: $(2\mathbf{u} \cdot \mathbf{v})^2 - 4|\mathbf{v}|_2^2 |\mathbf{u}|_2^2 \leq 0$
Simplifying: $(\mathbf{u} \cdot \mathbf{v})^2 \leq |\mathbf{u}|_2^2 |\mathbf{v}|_2^2$
Taking square roots: $|\mathbf{u} \cdot \mathbf{v}| \leq |\mathbf{u}|_2 |\mathbf{v}|_2$ ∎
The triangle inequality follows immediately.
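For readers who prefer empirical evidence alongside the proof, here is a small sketch (assuming NumPy; the dimension and sample count are arbitrary) that spot-checks the axioms on random triples of points. It is a sanity check, not a substitute for the proof:

```python
import numpy as np

# Sketch: spot-check the metric axioms on random triples of 8-dimensional points.
rng = np.random.default_rng(7)

def d(p, q):
    return np.linalg.norm(p - q)

for _ in range(10_000):
    p, q, r = rng.standard_normal((3, 8))
    assert d(p, q) >= 0                           # non-negativity
    assert np.isclose(d(p, p), 0.0)               # d(p, p) = 0 (identity, one direction)
    assert np.isclose(d(p, q), d(q, p))           # symmetry
    assert d(p, r) <= d(p, q) + d(q, r) + 1e-12   # triangle inequality (float tolerance)

print("All checks passed on 10,000 random triples.")
```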
Computing Euclidean distances efficiently is critical for KNN performance. In a brute-force KNN search with $N$ training points of dimension $D$, we compute $N$ distances, each requiring $O(D)$ operations. Understanding computational optimizations is essential for practical implementations.
The most important optimization arises from a key insight: when comparing distances, we often don't need the actual distances—only their relative ordering.
Since the square root function is monotonically increasing: $$d(\mathbf{p}, \mathbf{q}) < d(\mathbf{p}, \mathbf{r}) \iff d(\mathbf{p}, \mathbf{q})^2 < d(\mathbf{p}, \mathbf{r})^2$$
This means we can skip the square root operation entirely when:

• finding the K nearest neighbors, since only the ranking of distances matters
• comparing a distance against a threshold (square the threshold once instead)
• performing any other task that only compares distances to one another
Computational savings: The square root operation is typically 10-30x slower than basic arithmetic operations on modern processors.
```python
import numpy as np
import time


def find_k_nearest_naive(query: np.ndarray, data: np.ndarray, k: int) -> np.ndarray:
    """
    Find K nearest neighbors using full Euclidean distance.

    This computes the square root for every distance, which is
    unnecessary for ranking neighbors.
    """
    distances = np.array([np.sqrt(np.sum((query - point)**2)) for point in data])
    return np.argsort(distances)[:k]


def find_k_nearest_optimized(query: np.ndarray, data: np.ndarray, k: int) -> np.ndarray:
    """
    Find K nearest neighbors using squared Euclidean distance.

    Skips the sqrt operation since we only need relative ordering.
    This is mathematically equivalent but computationally faster.
    """
    squared_distances = np.sum((data - query)**2, axis=1)
    return np.argsort(squared_distances)[:k]


# Performance comparison
np.random.seed(42)
data = np.random.randn(10000, 100)  # 10K points, 100 dimensions
query = np.random.randn(100)

# Time the naive approach
start = time.time()
for _ in range(100):
    result_naive = find_k_nearest_naive(query, data, k=5)
naive_time = time.time() - start

# Time the optimized approach
start = time.time()
for _ in range(100):
    result_optimized = find_k_nearest_optimized(query, data, k=5)
optimized_time = time.time() - start

print(f"Naive time: {naive_time:.3f}s")
print(f"Optimized time: {optimized_time:.3f}s")
print(f"Speedup: {naive_time/optimized_time:.1f}x")
print(f"Results identical: {np.array_equal(result_naive, result_optimized)}")
```

When computing distances between many point pairs, we can leverage matrix operations for massive speedups through BLAS/LAPACK optimizations.
For a query matrix $Q$ of shape $(m, D)$ and a data matrix $X$ of shape $(N, D)$, we want the $(m, N)$ distance matrix whose entry $(i, j)$ is $d(\mathbf{q}_i, \mathbf{x}_j)$.
Using the identity: $$d(\mathbf{q}, \mathbf{x})^2 = |\mathbf{q}|^2 - 2\mathbf{q}^T\mathbf{x} + |\mathbf{x}|^2$$
We can compute all squared distances at once: the cross-term for every pair is a single matrix product $Q X^T$, and the squared norms are broadcast as a column (for queries) and a row (for data points):

$$D^2_{ij} = |\mathbf{q}_i|^2 - 2\,(Q X^T)_{ij} + |\mathbf{x}_j|^2$$
The matrix multiplication dominates the computation and benefits from highly optimized BLAS implementations.
```python
import numpy as np
from typing import Optional


def pairwise_euclidean_distance_squared(
    X: np.ndarray,
    Y: Optional[np.ndarray] = None
) -> np.ndarray:
    """
    Compute pairwise squared Euclidean distances efficiently.

    Uses the identity: ||x-y||² = ||x||² - 2<x,y> + ||y||²

    Parameters:
        X: Matrix of shape (n_samples_X, n_features)
        Y: Matrix of shape (n_samples_Y, n_features), or None
           If None, computes pairwise distances within X

    Returns:
        Distance matrix of shape (n_samples_X, n_samples_Y)
        Entry [i,j] is the squared distance between X[i] and Y[j]

    Complexity:
        O(n_X * n_Y * D) time, O(n_X * n_Y) space
        But the matrix multiply O(n_X * n_Y * D) is highly optimized
    """
    if Y is None:
        Y = X

    # Compute squared norms for each row
    # Shape: (n_X,) and (n_Y,)
    X_norm_sq = np.sum(X ** 2, axis=1)
    Y_norm_sq = np.sum(Y ** 2, axis=1)

    # Compute cross-term using matrix multiplication
    # Shape: (n_X, n_Y)
    cross_term = X @ Y.T

    # Combine: broadcast X_norm_sq as column, Y_norm_sq as row
    # ||x-y||² = ||x||² - 2<x,y> + ||y||²
    distances_sq = X_norm_sq[:, np.newaxis] - 2 * cross_term + Y_norm_sq[np.newaxis, :]

    # Numerical precision can cause tiny negative values
    distances_sq = np.maximum(distances_sq, 0)

    return distances_sq


def pairwise_euclidean_distance(
    X: np.ndarray,
    Y: Optional[np.ndarray] = None
) -> np.ndarray:
    """Compute actual Euclidean distances (with sqrt)."""
    return np.sqrt(pairwise_euclidean_distance_squared(X, Y))


# Example: Computing all pairwise distances for KNN
X_train = np.random.randn(1000, 50)  # 1000 training points
X_query = np.random.randn(10, 50)    # 10 query points

# Compute all distances efficiently (10 × 1000 = 10,000 distances)
distances = pairwise_euclidean_distance(X_query, X_train)

# Find 5 nearest neighbors for each query
k = 5
nearest_indices = np.argsort(distances, axis=1)[:, :k]
print(f"Nearest neighbors shape: {nearest_indices.shape}")  # (10, 5)
```

The matrix-based formula $|\mathbf{x}|^2 - 2\mathbf{x}^T\mathbf{y} + |\mathbf{y}|^2$ can produce small negative values due to floating-point precision errors (subtracting nearly equal large numbers). Always clamp the result to non-negative before taking the square root:

```python
distances_sq = np.maximum(distances_sq, 0)
```

For critical applications, consider using Kahan summation or higher-precision arithmetic.
Euclidean distance has a critical property that is both a strength and a weakness: it treats all dimensions equally. This isotropy becomes problematic when features have different scales or units.
Consider a dataset with two features:

• Age, measured in years (values roughly 0–100)
• Income, measured in dollars (values up to $1,000,000)

Compare two pairs of people:

• Pair A: identical incomes, ages one year apart
• Pair B: identical ages, incomes $900,000 apart

Euclidean distance: the income difference completely dominates, making the age feature effectively invisible. The $900,000 income difference contributes 900,000² = 810 billion to the squared distance, while the 1-year age difference contributes just 1.
Using raw features with different scales is one of the most common errors in KNN implementations. The algorithm will effectively use only the high-magnitude features, ignoring potentially important low-magnitude features entirely.
The standard solutions involve transforming features to comparable scales:
Min-Max Normalization: Scale each feature to the range $[0, 1]$:
$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
Pros: Bounded output, preserves zero values for sparse data
Cons: Sensitive to outliers (a single extreme value stretches the range)
Z-Score Standardization: Transform each feature to have zero mean and unit variance:
$$x'_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
Pros: Handles outliers better, centers the data
Cons: Unbounded output, doesn't preserve sparsity
Robust Scaling: Use the median and interquartile range (IQR) instead of the mean and standard deviation:
$$x'_i = \frac{x_i - \text{median}(x)}{\text{IQR}(x)}$$
Pros: Highly resistant to outliers
Cons: Doesn't capture the full data range
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Simulated data with vastly different scales
np.random.seed(42)
n_samples = 500

# Feature 1: Age (0-100)
age = np.random.normal(35, 15, n_samples).clip(0, 100)

# Feature 2: Income (0-1,000,000)
income = np.random.exponential(50000, n_samples).clip(0, 1000000)

# Feature 3: Credit Score (300-850)
credit_score = np.random.normal(650, 80, n_samples).clip(300, 850)

X = np.column_stack([age, income, credit_score])

# Binary classification target (arbitrary for demonstration)
y = ((age > 40) & (income > 60000)).astype(int)

# Compare KNN performance with different scaling methods
print("Feature ranges before scaling:")
print(f"  Age:    [{X[:, 0].min():.0f}, {X[:, 0].max():.0f}]")
print(f"  Income: [{X[:, 1].min():.0f}, {X[:, 1].max():.0f}]")
print(f"  Credit: [{X[:, 2].min():.0f}, {X[:, 2].max():.0f}]")

# Test different scaling approaches
scalers = {
    "No Scaling": None,
    "Min-Max": MinMaxScaler(),
    "Z-Score": StandardScaler(),
    "Robust": RobustScaler(),
}

print("\nKNN 5-fold CV Accuracy:")
for name, scaler in scalers.items():
    if scaler is not None:
        X_scaled = scaler.fit_transform(X)
    else:
        X_scaled = X

    knn = KNeighborsClassifier(n_neighbors=5)
    scores = cross_val_score(knn, X_scaled, y, cv=5)
    print(f"  {name}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
```

Despite its limitations, Euclidean distance remains the default choice for many applications. Understanding when it performs well helps in selecting the appropriate metric.
Euclidean distance is the optimal or near-optimal choice when:
When the "true" similarity between data points follows spherical geometry—meaning similarity decreases equally in all directions—Euclidean distance correctly captures this structure.
Example: Physical sensor data where deviation in any feature direction is equally significant (temperature, pressure, vibration magnitude).
Euclidean distance works well in low dimensions (D < 20) where the curse of dimensionality hasn't yet made all points equidistant. The geometric intuition from 2D/3D extends naturally.
For continuous numerical features that can take any value in an interval, Euclidean distance provides smooth gradients of similarity. It naturally handles the interpolation between nearby values.
Example: Scientific measurements, engineering specifications, any data derived from physical quantities.
After proper scaling (Z-score standardization), features become comparable, and Euclidean distance provides a principled weighting where each standard deviation of change contributes equally.
Euclidean distance is invariant to rotations of the coordinate system. If your problem has no preferred direction (isotropic), this is desirable.
Example: Comparing molecular conformations in 3D space after alignment.
Modern representation learning (word2vec, BERT, ResNet) explicitly optimizes embeddings for Euclidean similarity. Neural networks trained with L² loss functions learn representations where Euclidean distance is meaningful. For these learned embeddings, Euclidean distance is often the correct choice by construction.
Euclidean distance can fail dramatically in certain scenarios. Recognizing these pathologies helps avoid misapplication.
In high dimensions, Euclidean distance suffers from a phenomenon called distance concentration. For random points in high-dimensional space:
$$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \text{ as } D \to \infty$$
All points become approximately equidistant! This makes nearest-neighbor search meaningless—every point is almost equally close to every other point.
Mathematical intuition: In $D$ dimensions, distance is $d = \sqrt{\sum_{i=1}^D (x_i - y_i)^2}$. By the law of large numbers, as $D \to \infty$, the sum concentrates around its expected value, and relative variations shrink.
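The following sketch (assuming NumPy; the sample sizes are arbitrary) makes the concentration effect visible by measuring the relative contrast $(d_{\max} - d_{\min})/d_{\min}$ from a random query point to 1,000 random points as the dimension grows:

```python
import numpy as np

# Sketch: measure the relative spread of distances from one query point
# to 1,000 random Gaussian points as the dimension increases.
rng = np.random.default_rng(0)

for dim in [2, 10, 100, 1000, 10000]:
    points = rng.standard_normal((1000, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"D = {dim:6d}   relative contrast = {contrast:.3f}")
# The relative contrast shrinks toward 0 as D grows: all points become
# nearly equidistant from the query.
```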
For sparse data (many zero entries), Euclidean distance can be problematic:

• dimensions where both vectors are zero contribute nothing to similarity, yet the few non-zero entries dominate the distance
• differences in overall magnitude (e.g., document length in bag-of-words vectors) inflate distances between items that are otherwise similar
• cosine similarity or Jaccard distance (see the summary table below) is usually a better fit for such data
As dimensionality increases:

• Volume of a hypersphere relative to its bounding hypercube → 0
• Most points are near the boundary, not the center
• Random projections preserve distances (Johnson-Lindenstrauss)

For D > 100, seriously consider dimensionality reduction (PCA, UMAP) or alternative metrics.
Euclidean distance assumes that straight-line travel between any two points is meaningful. In many domains, movement or change is constrained to a grid or to discrete steps, and the straight-line distance misrepresents the true cost:
Example - Chess: A king moves one square in any direction (including diagonals). Manhattan distance (covered next) says diagonal is distance 2; Euclidean says it's $\sqrt{2}$. Neither matches the king's actual movement (Chebyshev distance: 1).
Euclidean distance doesn't handle periodic or cyclical features:
Example - Time of day: 11:55 PM and 12:05 AM (midnight) are 10 minutes apart, but Euclidean distance on 24-hour representation (23.92 vs 0.08) gives 23.84, suggesting nearly maximum difference!
Solutions:

• Encode each cyclical feature as a pair of coordinates on the unit circle, $(\sin(2\pi t / T), \cos(2\pi t / T))$, where $T$ is the period (24 hours, 7 days, 360 degrees), as sketched below
• Alternatively, use a dedicated circular/angular distance for those features
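Here is a minimal sketch of the sine/cosine encoding for hour-of-day; the helper name encode_hour and the specific times are illustrative only:

```python
import numpy as np

# Sketch: map time of day onto the unit circle so that 11:55 PM and
# 12:05 AM come out close under Euclidean distance.
def encode_hour(hour: float) -> np.ndarray:
    angle = 2 * np.pi * hour / 24.0
    return np.array([np.sin(angle), np.cos(angle)])

t1, t2 = 23.92, 0.08  # 11:55 PM and 12:05 AM in fractional hours

raw_difference = abs(t1 - t2)                                     # 23.84 -- misleading
circular_distance = np.linalg.norm(encode_hour(t1) - encode_hour(t2))

print(f"Raw 24-hour difference:     {raw_difference:.2f}")
print(f"Distance after sin/cos map: {circular_distance:.4f}")     # small, as expected
```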
Euclidean distance is fundamentally designed for continuous numerical features. Mixing in:

• categorical features (colors, cities, product categories), whose coded "differences" have no numeric meaning
• ordinal features encoded as integers, where adjacent codes may not be equally spaced
• binary flags, whose 0/1 differences sit on an arbitrary scale relative to the other features

produces distances that are difficult to interpret.
Solutions: Use hybrid distance functions like Gower distance, or embed categorical features appropriately.
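As a rough illustration of the hybrid approach, the sketch below implements a Gower-style dissimilarity in which numeric features contribute a range-normalized absolute difference and categorical features contribute a 0/1 mismatch; the function name, feature split, and ranges are illustrative assumptions, not a library API:

```python
import numpy as np

# Illustrative Gower-style hybrid dissimilarity (sketch, not a canonical library
# implementation): numeric features contribute |difference| / range, categorical
# features contribute 0 for a match and 1 for a mismatch; the result is the mean.
def hybrid_distance(a_num, b_num, a_cat, b_cat, num_ranges):
    numeric_part = np.abs(np.asarray(a_num, float) - np.asarray(b_num, float)) / np.asarray(num_ranges, float)
    categorical_part = np.array([0.0 if x == y else 1.0 for x, y in zip(a_cat, b_cat)])
    contributions = np.concatenate([numeric_part, categorical_part])
    return contributions.mean()

# Example: (age, income) numeric; (city, plan) categorical -- values are made up
person_a = ([34, 52_000], ["Austin", "basic"])
person_b = ([39, 61_000], ["Boston", "basic"])
ranges = [60, 200_000]  # assumed observed ranges of age and income

print(hybrid_distance(person_a[0], person_b[0], person_a[1], person_b[1], ranges))
```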
| Problem | Symptom | Solution |
|---|---|---|
| High dimensionality | All points equidistant | Dimensionality reduction (PCA, UMAP) |
| Different scales | High-magnitude features dominate | Feature standardization |
| Sparse data | Zero values inflate distances | Cosine similarity, Jaccard distance |
| Periodic features | Opposite phases appear far | Circular/angular distance |
| Categorical features | Cannot compute differences | One-hot + Hamming, embeddings |
| Correlated features | Double-counting information | Mahalanobis distance, PCA |
Euclidean distance is the fundamental building block of distance-based machine learning. Its simplicity, intuitive geometric interpretation, and strong theoretical foundations make it the default starting point. However, its assumptions—isotropy, continuous features, comparable scales—must be verified before application.
The next page explores Manhattan distance (L¹ norm), which takes a fundamentally different approach by summing absolute differences rather than squared differences. We will see how this creates grid-like geometry, how it handles outliers differently, and when it outperforms Euclidean distance.
You now have a deep understanding of Euclidean distance—its derivation, properties, computational optimizations, strengths, and limitations. This foundation prepares you to understand the family of distance metrics we explore next, all of which are variations on these core themes.