In the realm of machine learning, the concept of distance serves as the mathematical foundation for determining how similar or dissimilar two data points are. Among all distance measures, Euclidean distance stands as the most intuitive and historically significant—it is the "straight-line" distance we experience in physical space.
When we say that K-Nearest Neighbors finds the K closest training examples to make predictions, we must precisely define what closest means. This seemingly simple question opens a rich mathematical domain with profound implications for algorithm behavior, computational efficiency, and prediction quality.
Euclidean distance, named after the ancient Greek mathematician Euclid of Alexandria, extends the familiar notion of distance from the Pythagorean theorem to arbitrary dimensions. It forms the default metric in most implementations of KNN and serves as the reference point against which all other distance measures are compared.
By mastering this page, you will:

• Derive the Euclidean distance formula from first principles using the Pythagorean theorem
• Understand the geometric interpretation in 2D, 3D, and n-dimensional spaces
• Prove that Euclidean distance satisfies the formal axioms of a metric
• Analyze computational considerations including the squared-distance optimization
• Recognize scenarios where Euclidean distance is ideal versus problematic
• Implement efficient Euclidean distance calculations in production code
The Euclidean distance formula emerges naturally from the Pythagorean theorem, one of the oldest and most fundamental results in mathematics. Let us build the formula systematically, starting from the familiar two-dimensional case and extending to arbitrary dimensions.
Consider two points in the Cartesian plane:
$$\mathbf{p} = (x_1, y_1) \quad \text{and} \quad \mathbf{q} = (x_2, y_2)$$
To find the straight-line distance between $\mathbf{p}$ and $\mathbf{q}$, we construct a right triangle where:

• the horizontal leg has length $|x_2 - x_1|$
• the vertical leg has length $|y_2 - y_1|$
• the hypotenuse is the segment joining $\mathbf{p}$ and $\mathbf{q}$, whose length $d$ is the distance we seek
By the Pythagorean theorem, the hypotenuse $d$ satisfies:
$$d^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$$
Taking the square root:
$$d(\mathbf{p}, \mathbf{q}) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
This is the Euclidean distance in two dimensions.
We use $(x_2 - x_1)^2$ rather than $|x_2 - x_1|$ for several reasons:

1. Squaring eliminates the need for absolute values — negative differences become positive automatically
2. Squaring is differentiable everywhere — unlike the absolute value, which is not differentiable at zero
3. The Pythagorean theorem naturally produces squares — the relationship $c^2 = a^2 + b^2$ is fundamental

This choice has profound implications for optimization algorithms, as we will see throughout machine learning.
The extension to three dimensions is straightforward. For points:
$$\mathbf{p} = (x_1, y_1, z_1) \quad \text{and} \quad \mathbf{q} = (x_2, y_2, z_2)$$
We apply the Pythagorean theorem twice. First, in the $xy$-plane, the horizontal displacement between the two points satisfies:

$$d_{xy}^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$$

Then, treating this horizontal displacement and the vertical difference $(z_2 - z_1)$ as the legs of a second right triangle:

$$d^2 = d_{xy}^2 + (z_2 - z_1)^2$$

Substituting the first expression into the second:
$$d^2 = (x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2$$
$$d(\mathbf{p}, \mathbf{q}) = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$$
Inductively extending this pattern to $n$ dimensions, for vectors:
$$\mathbf{p} = (p_1, p_2, \ldots, p_n) \quad \text{and} \quad \mathbf{q} = (q_1, q_2, \ldots, q_n)$$
The Euclidean distance (also called the L² norm of the difference vector) is:
$$d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2} = |\mathbf{p} - \mathbf{q}|_2$$
This elegant formula encapsulates the "straight-line" distance in any number of dimensions.
```python
import numpy as np
from typing import Union, List
import math


def euclidean_distance_naive(p: List[float], q: List[float]) -> float:
    """
    Compute Euclidean distance using the explicit formula.

    This implementation follows the mathematical definition directly:
    d(p, q) = sqrt(sum((p_i - q_i)^2 for all i))

    Parameters:
        p: First point as a list of coordinates
        q: Second point as a list of coordinates

    Returns:
        The Euclidean distance between p and q

    Example:
        >>> euclidean_distance_naive([0, 0], [3, 4])
        5.0  # The classic 3-4-5 right triangle
    """
    if len(p) != len(q):
        raise ValueError("Points must have the same dimensionality")

    # Compute sum of squared differences
    sum_squared_diff = sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q))

    # Take the square root
    return math.sqrt(sum_squared_diff)


def euclidean_distance_numpy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Euclidean distance using NumPy's optimized operations.

    This leverages:
    1. Vectorized subtraction across all dimensions
    2. Efficient dot product for sum of squares
    3. Optimized square root implementation

    Parameters:
        p: First point as numpy array
        q: Second point as numpy array

    Returns:
        The Euclidean distance between p and q

    Performance Note:
        For high-dimensional vectors (n > 100), this is typically
        10-100x faster than the naive Python implementation due to
        NumPy's C-level optimizations and potential SIMD vectorization.
    """
    diff = p - q
    return np.sqrt(np.dot(diff, diff))


def euclidean_distance_builtin(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Euclidean distance using NumPy's built-in norm function.

    np.linalg.norm computes the L2 norm by default, which is exactly
    the Euclidean distance when applied to the difference vector.
    This is the most readable and numerically stable option for
    production code.
    """
    return np.linalg.norm(p - q)


# Demonstration of equivalence
if __name__ == "__main__":
    # Classic 3-4-5 triangle in 2D
    p1 = [0, 0]
    q1 = [3, 4]
    print(f"2D distance: {euclidean_distance_naive(p1, q1)}")  # 5.0

    # 3D example
    p2 = np.array([1, 2, 3])
    q2 = np.array([4, 6, 3])
    print(f"3D distance: {euclidean_distance_numpy(p2, q2)}")  # 5.0

    # High-dimensional example (100D)
    p3 = np.random.randn(100)
    q3 = np.random.randn(100)
    print(f"100D distance: {euclidean_distance_builtin(p3, q3):.4f}")
```

Euclidean distance has a profound connection to linear algebra and vector spaces. Understanding this connection provides deeper insight into why Euclidean distance behaves as it does and how it relates to other concepts in machine learning.
The Euclidean distance between points $\mathbf{p}$ and $\mathbf{q}$ can be expressed as the L² norm (also called the Euclidean norm) of their difference vector:
$$d(\mathbf{p}, \mathbf{q}) = |\mathbf{p} - \mathbf{q}|_2$$
where the L² norm of a vector $\mathbf{v} = (v_1, v_2, \ldots, v_n)$ is:
$$|\mathbf{v}|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$
This interpretation reveals that distance is fundamentally about the magnitude of the displacement vector between two points.
The squared Euclidean distance has an elegant expression using inner products (dot products):
$$d(\mathbf{p}, \mathbf{q})^2 = (\mathbf{p} - \mathbf{q}) \cdot (\mathbf{p} - \mathbf{q}) = \mathbf{p} \cdot \mathbf{p} - 2\mathbf{p} \cdot \mathbf{q} + \mathbf{q} \cdot \mathbf{q}$$
This expansion is computationally useful:
$$d(\mathbf{p}, \mathbf{q})^2 = |\mathbf{p}|^2 - 2\mathbf{p}^T\mathbf{q} + |\mathbf{q}|^2$$
Key insight: If we precompute $|\mathbf{p}|^2$ for all training points, computing distances to a query point $\mathbf{q}$ requires only the dot products $\mathbf{p}^T\mathbf{q}$ and the single value $|\mathbf{q}|^2$. This is especially efficient for matrix-based implementations.
The inner product $\mathbf{p} \cdot \mathbf{q} = |\mathbf{p}| |\mathbf{q}| \cos\theta$ connects Euclidean distance to the angle $\theta$ between vectors. This relationship is fundamental to understanding why Euclidean distance and cosine similarity (covered later) capture different aspects of vector relationships.
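As a quick numerical illustration of this connection, the following sketch (assuming only NumPy; the vector dimension and seed are arbitrary) normalizes two random vectors to unit length and checks that the squared Euclidean distance equals $2 - 2\cos\theta$:

```python
import numpy as np

# A minimal check of ||p - q||^2 = ||p||^2 + ||q||^2 - 2 p.q,
# which for unit-length vectors reduces to 2 - 2*cos(theta).
rng = np.random.default_rng(0)
p = rng.standard_normal(50)
q = rng.standard_normal(50)

# Normalize to unit length so that p.q equals cos(theta)
p /= np.linalg.norm(p)
q /= np.linalg.norm(q)

cos_theta = np.dot(p, q)
squared_distance = np.sum((p - q) ** 2)

print(f"||p - q||^2      = {squared_distance:.6f}")
print(f"2 - 2*cos(theta) = {2 - 2 * cos_theta:.6f}")  # matches up to float error
```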
Every distance metric induces a geometry on the space. The unit ball for a metric is the set of all points at distance ≤ 1 from the origin:
$$B = \{\mathbf{x} : d(\mathbf{0}, \mathbf{x}) \leq 1\}$$
For Euclidean distance, the unit ball is the set $\{\mathbf{x} : \sum_{i=1}^{n} x_i^2 \leq 1\}$: a disk in 2D, a solid sphere in 3D, and a hypersphere (n-ball) in higher dimensions.
This spherical geometry has important implications:
Isotropy: Euclidean distance treats all directions equally. Moving 1 unit in any direction produces the same distance.
Rotation Invariance: Euclidean distance is unchanged by rotations of the coordinate system. Mathematically, for any orthogonal matrix $R$: $$d(R\mathbf{p}, R\mathbf{q}) = d(\mathbf{p}, \mathbf{q})$$
Translation Invariance: Distance depends only on the relative positions, not absolute positions: $$d(\mathbf{p} + \mathbf{t}, \mathbf{q} + \mathbf{t}) = d(\mathbf{p}, \mathbf{q})$$ for any translation vector $\mathbf{t}$
These properties make Euclidean distance natural for physical measurements but can be problematic when features have different scales or when the geometry of the data is non-spherical.
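The invariance properties above are easy to verify numerically. Here is a minimal sketch, assuming NumPy; the orthogonal matrix is taken from the QR decomposition of a random Gaussian matrix, and the dimension is arbitrary:

```python
import numpy as np

# Sketch: numerically verify rotation and translation invariance in 5 dimensions.
rng = np.random.default_rng(42)
p = rng.standard_normal(5)
q = rng.standard_normal(5)

# The first QR factor of a random matrix is orthogonal; call it R to match the text.
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))
t = rng.standard_normal(5)  # arbitrary translation vector

d_original = np.linalg.norm(p - q)
d_rotated = np.linalg.norm(R @ p - R @ q)
d_translated = np.linalg.norm((p + t) - (q + t))

print(f"d(p, q)         = {d_original:.10f}")
print(f"d(Rp, Rq)       = {d_rotated:.10f}")    # same up to float rounding
print(f"d(p + t, q + t) = {d_translated:.10f}")  # identical
```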
| Dimension | Unit Ball Shape | Surface Area | Volume | Notable Property |
|---|---|---|---|---|
| 1D | Line segment [-1, 1] | 2 (endpoints) | 2 (length) | Absolute difference $\lvert p - q \rvert$ |
| 2D | Circle (disk) | 2π (circumference) | π | Classic 2D distance |
| 3D | Sphere | 4π | 4π/3 | Physical space distance |
| nD | n-hypersphere (n-ball) | → 0 as n → ∞ | ~(2πe/n)^(n/2) → 0 | Volume concentrates near the boundary |
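To make the vanishing volume concrete, the short sketch below evaluates the closed-form volume of the unit $n$-ball, $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$, using only Python's standard library; the chosen dimensions are arbitrary:

```python
import math

# Volume of the unit n-ball: V_n = pi^(n/2) / Gamma(n/2 + 1)
def unit_ball_volume(n: int) -> float:
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in [1, 2, 3, 5, 10, 20, 50, 100]:
    print(f"n = {n:3d}   volume = {unit_ball_volume(n):.3e}")
# The volume peaks around n = 5 and then collapses toward zero,
# illustrating why high-dimensional unit balls are "mostly empty".
```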
A distance function (or metric) must satisfy four fundamental axioms to be mathematically well-defined. These axioms capture our intuitive understanding of what "distance" should mean. We will prove that Euclidean distance satisfies all four axioms, making it a proper metric.
A function $d: X \times X \to \mathbb{R}$ is a metric on set $X$ if for all points $\mathbf{p}, \mathbf{q}, \mathbf{r} \in X$:
Axiom 1 (Non-negativity):

$$d(\mathbf{p}, \mathbf{q}) \geq 0$$
Proof for Euclidean distance: Since $(p_i - q_i)^2 \geq 0$ for all $i$, the sum is non-negative, and the square root of a non-negative number is non-negative. ∎
Axiom 2 (Identity of Indiscernibles):

$$d(\mathbf{p}, \mathbf{q}) = 0 \iff \mathbf{p} = \mathbf{q}$$
Proof for Euclidean distance: The sum $\sum_{i=1}^{n}(p_i - q_i)^2$ equals zero if and only if every term $(p_i - q_i)^2$ is zero, which holds if and only if $p_i = q_i$ for all $i$, i.e., $\mathbf{p} = \mathbf{q}$. ∎
Axiom 3 (Symmetry):

$$d(\mathbf{p}, \mathbf{q}) = d(\mathbf{q}, \mathbf{p})$$
Proof for Euclidean distance: Since $(p_i - q_i)^2 = (q_i - p_i)^2$, the sums are equal. ∎
Axiom 4 (Triangle Inequality):

$$d(\mathbf{p}, \mathbf{r}) \leq d(\mathbf{p}, \mathbf{q}) + d(\mathbf{q}, \mathbf{r})$$
The triangle inequality states that the direct path is never longer than any indirect path—"the shortest distance between two points is a straight line."
The triangle inequality for Euclidean distance follows from the Cauchy-Schwarz inequality. The proof requires establishing that for any vectors $\mathbf{u}$ and $\mathbf{v}$:

$|\mathbf{u} + \mathbf{v}| \leq |\mathbf{u}| + |\mathbf{v}|$

Setting $\mathbf{u} = \mathbf{p} - \mathbf{q}$ and $\mathbf{v} = \mathbf{q} - \mathbf{r}$ gives the triangle inequality for distances.
We prove the more general result: for vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$:
$$|\mathbf{u} + \mathbf{v}|_2 \leq |\mathbf{u}|_2 + |\mathbf{v}|_2$$
Step 1: Since both sides are non-negative, it suffices to prove the squared inequality: $$|\mathbf{u} + \mathbf{v}|_2^2 \leq (|\mathbf{u}|_2 + |\mathbf{v}|_2)^2$$
Step 2: Expand the left side: $$|\mathbf{u} + \mathbf{v}|_2^2 = (\mathbf{u} + \mathbf{v}) \cdot (\mathbf{u} + \mathbf{v}) = |\mathbf{u}|_2^2 + 2\mathbf{u} \cdot \mathbf{v} + |\mathbf{v}|_2^2$$
Step 3: Expand the right side: $$(|\mathbf{u}|_2 + |\mathbf{v}|_2)^2 = |\mathbf{u}|_2^2 + 2|\mathbf{u}|_2|\mathbf{v}|_2 + |\mathbf{v}|_2^2$$
Step 4: We need to show: $$\mathbf{u} \cdot \mathbf{v} \leq |\mathbf{u}|_2 |\mathbf{v}|_2$$
This is the Cauchy-Schwarz inequality, one of the most important inequalities in mathematics.
Step 5: Cauchy-Schwarz proof (brief): Define $f(t) = |\mathbf{u} + t\mathbf{v}|_2^2 = |\mathbf{u}|_2^2 + 2t(\mathbf{u} \cdot \mathbf{v}) + t^2|\mathbf{v}|_2^2$
This parabola in $t$ is always non-negative. For a quadratic $at^2 + bt + c \geq 0$ to hold for all $t$, the discriminant must be non-positive: $b^2 - 4ac \leq 0$.
Applying: $(2\mathbf{u} \cdot \mathbf{v})^2 - 4|\mathbf{v}|_2^2 |\mathbf{u}|_2^2 \leq 0$
Simplifying: $(\mathbf{u} \cdot \mathbf{v})^2 \leq |\mathbf{u}|_2^2 |\mathbf{v}|_2^2$
Taking square roots: $|\mathbf{u} \cdot \mathbf{v}| \leq |\mathbf{u}|_2 |\mathbf{v}|_2$ ∎
The triangle inequality follows immediately.
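For readers who prefer empirical evidence alongside the proof, here is a small sketch (assuming NumPy; the dimension and sample count are arbitrary) that spot-checks the axioms on random triples of points. It is a sanity check, not a substitute for the proof:

```python
import numpy as np

# Sketch: spot-check the metric axioms on random triples of 8-dimensional points.
rng = np.random.default_rng(7)

def d(p, q):
    return np.linalg.norm(p - q)

for _ in range(10_000):
    p, q, r = rng.standard_normal((3, 8))
    assert d(p, q) >= 0                           # non-negativity
    assert np.isclose(d(p, p), 0.0)               # d(p, p) = 0 (identity, one direction)
    assert np.isclose(d(p, q), d(q, p))           # symmetry
    assert d(p, r) <= d(p, q) + d(q, r) + 1e-12   # triangle inequality (float tolerance)

print("All checks passed on 10,000 random triples.")
```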
Computing Euclidean distances efficiently is critical for KNN performance. In a brute-force KNN search with $N$ training points of dimension $D$, we compute $N$ distances, each requiring $O(D)$ operations. Understanding computational optimizations is essential for practical implementations.
The most important optimization arises from a key insight: when comparing distances, we often don't need the actual distances—only their relative ordering.
Since the square root function is monotonically increasing: $$d(\mathbf{p}, \mathbf{q}) < d(\mathbf{p}, \mathbf{r}) \iff d(\mathbf{p}, \mathbf{q})^2 < d(\mathbf{p}, \mathbf{r})^2$$
This means we can skip the square root operation entirely when:

• finding the K nearest neighbors, since only the ranking of distances matters
• comparing a distance against a threshold (square the threshold once instead)
• performing any other task that only compares distances to one another
Computational savings: The square root operation is typically 10-30x slower than basic arithmetic operations on modern processors.
```python
import numpy as np
import time


def find_k_nearest_naive(query: np.ndarray, data: np.ndarray, k: int) -> np.ndarray:
    """
    Find K nearest neighbors using full Euclidean distance.

    This computes the square root for every distance, which is
    unnecessary for ranking neighbors.
    """
    distances = np.array([np.sqrt(np.sum((query - point)**2)) for point in data])
    return np.argsort(distances)[:k]


def find_k_nearest_optimized(query: np.ndarray, data: np.ndarray, k: int) -> np.ndarray:
    """
    Find K nearest neighbors using squared Euclidean distance.

    Skips the sqrt operation since we only need relative ordering.
    This is mathematically equivalent but computationally faster.
    """
    squared_distances = np.sum((data - query)**2, axis=1)
    return np.argsort(squared_distances)[:k]


# Performance comparison
np.random.seed(42)
data = np.random.randn(10000, 100)  # 10K points, 100 dimensions
query = np.random.randn(100)

# Time the naive approach
start = time.time()
for _ in range(100):
    result_naive = find_k_nearest_naive(query, data, k=5)
naive_time = time.time() - start

# Time the optimized approach
start = time.time()
for _ in range(100):
    result_optimized = find_k_nearest_optimized(query, data, k=5)
optimized_time = time.time() - start

print(f"Naive time: {naive_time:.3f}s")
print(f"Optimized time: {optimized_time:.3f}s")
print(f"Speedup: {naive_time/optimized_time:.1f}x")
print(f"Results identical: {np.array_equal(result_naive, result_optimized)}")
```

When computing distances between many point pairs, we can leverage matrix operations for massive speedups through BLAS/LAPACK optimizations.
For a query matrix $Q$ of shape $(m, D)$ and a data matrix $X$ of shape $(N, D)$, we want the $(m, N)$ distance matrix whose entry $(i, j)$ is $d(\mathbf{q}_i, \mathbf{x}_j)$.
Using the identity: $$d(\mathbf{q}, \mathbf{x})^2 = |\mathbf{q}|^2 - 2\mathbf{q}^T\mathbf{x} + |\mathbf{x}|^2$$
We can compute all squared distances at once: the cross-term for every pair is a single matrix product $Q X^T$, and the squared norms are broadcast as a column (for queries) and a row (for data points):

$$D^2_{ij} = |\mathbf{q}_i|^2 - 2\,(Q X^T)_{ij} + |\mathbf{x}_j|^2$$
The matrix multiplication dominates the computation and benefits from highly optimized BLAS implementations.
```python
import numpy as np
from typing import Optional


def pairwise_euclidean_distance_squared(
    X: np.ndarray,
    Y: Optional[np.ndarray] = None
) -> np.ndarray:
    """
    Compute pairwise squared Euclidean distances efficiently.

    Uses the identity: ||x-y||² = ||x||² - 2<x,y> + ||y||²

    Parameters:
        X: Matrix of shape (n_samples_X, n_features)
        Y: Matrix of shape (n_samples_Y, n_features), or None
           If None, computes pairwise distances within X

    Returns:
        Distance matrix of shape (n_samples_X, n_samples_Y)
        Entry [i,j] is the squared distance between X[i] and Y[j]

    Complexity:
        O(n_X * n_Y * D) time, O(n_X * n_Y) space
        But the matrix multiply O(n_X * n_Y * D) is highly optimized
    """
    if Y is None:
        Y = X

    # Compute squared norms for each row
    # Shape: (n_X,) and (n_Y,)
    X_norm_sq = np.sum(X ** 2, axis=1)
    Y_norm_sq = np.sum(Y ** 2, axis=1)

    # Compute cross-term using matrix multiplication
    # Shape: (n_X, n_Y)
    cross_term = X @ Y.T

    # Combine: broadcast X_norm_sq as column, Y_norm_sq as row
    # ||x-y||² = ||x||² - 2<x,y> + ||y||²
    distances_sq = X_norm_sq[:, np.newaxis] - 2 * cross_term + Y_norm_sq[np.newaxis, :]

    # Numerical precision can cause tiny negative values
    distances_sq = np.maximum(distances_sq, 0)

    return distances_sq


def pairwise_euclidean_distance(
    X: np.ndarray,
    Y: Optional[np.ndarray] = None
) -> np.ndarray:
    """Compute actual Euclidean distances (with sqrt)."""
    return np.sqrt(pairwise_euclidean_distance_squared(X, Y))


# Example: Computing all pairwise distances for KNN
X_train = np.random.randn(1000, 50)  # 1000 training points
X_query = np.random.randn(10, 50)    # 10 query points

# Compute all distances efficiently (10 × 1000 = 10,000 distances)
distances = pairwise_euclidean_distance(X_query, X_train)

# Find 5 nearest neighbors for each query
k = 5
nearest_indices = np.argsort(distances, axis=1)[:, :k]
print(f"Nearest neighbors shape: {nearest_indices.shape}")  # (10, 5)
```

The matrix-based formula $|\mathbf{x}|^2 - 2\mathbf{x}^T\mathbf{y} + |\mathbf{y}|^2$ can produce small negative values due to floating-point precision errors (subtracting nearly equal large numbers). Always clamp the result to non-negative before taking the square root:

```python
distances_sq = np.maximum(distances_sq, 0)
```

For critical applications, consider using Kahan summation or higher-precision arithmetic.
Euclidean distance has a critical property that is both a strength and a weakness: it treats all dimensions equally. This isotropy becomes problematic when features have different scales or units.
Consider a dataset with two features:

• Age, measured in years (values roughly 0–100)
• Income, measured in dollars (values up to $1,000,000)

Compare two pairs of people:

• Pair A: identical incomes, ages one year apart
• Pair B: identical ages, incomes $900,000 apart

Euclidean distance: the income difference completely dominates, making the age feature effectively invisible. The $900,000 income difference contributes 900,000² = 810 billion to the squared distance, while the 1-year age difference contributes just 1.
Using raw features with different scales is one of the most common errors in KNN implementations. The algorithm will effectively use only the high-magnitude features, ignoring potentially important low-magnitude features entirely.
The standard solutions involve transforming features to comparable scales:
Min-Max Normalization: Scale each feature to the range $[0, 1]$:
$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
Pros: Bounded output, preserves zero values for sparse data
Cons: Sensitive to outliers (a single extreme value stretches the range)
Z-Score Standardization: Transform each feature to have zero mean and unit variance:
$$x'_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
Pros: Handles outliers better, centers the data
Cons: Unbounded output, doesn't preserve sparsity
Robust Scaling: Use the median and interquartile range (IQR) instead of the mean and standard deviation:
$$x'_i = \frac{x_i - \text{median}(x)}{\text{IQR}(x)}$$
Pros: Highly resistant to outliers
Cons: Doesn't capture the full data range
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Simulated data with vastly different scales
np.random.seed(42)
n_samples = 500

# Feature 1: Age (0-100)
age = np.random.normal(35, 15, n_samples).clip(0, 100)

# Feature 2: Income (0-1,000,000)
income = np.random.exponential(50000, n_samples).clip(0, 1000000)

# Feature 3: Credit Score (300-850)
credit_score = np.random.normal(650, 80, n_samples).clip(300, 850)

X = np.column_stack([age, income, credit_score])

# Binary classification target (arbitrary for demonstration)
y = ((age > 40) & (income > 60000)).astype(int)

# Compare KNN performance with different scaling methods
print("Feature ranges before scaling:")
print(f"  Age:    [{X[:, 0].min():.0f}, {X[:, 0].max():.0f}]")
print(f"  Income: [{X[:, 1].min():.0f}, {X[:, 1].max():.0f}]")
print(f"  Credit: [{X[:, 2].min():.0f}, {X[:, 2].max():.0f}]")

# Test different scaling approaches
scalers = {
    "No Scaling": None,
    "Min-Max": MinMaxScaler(),
    "Z-Score": StandardScaler(),
    "Robust": RobustScaler(),
}

print("\nKNN 5-fold CV Accuracy:")
for name, scaler in scalers.items():
    if scaler is not None:
        X_scaled = scaler.fit_transform(X)
    else:
        X_scaled = X

    knn = KNeighborsClassifier(n_neighbors=5)
    scores = cross_val_score(knn, X_scaled, y, cv=5)
    print(f"  {name}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
```

Despite its limitations, Euclidean distance remains the default choice for many applications. Understanding when it performs well helps in selecting the appropriate metric.
Euclidean distance is the optimal or near-optimal choice when:
When the "true" similarity between data points follows spherical geometry—meaning similarity decreases equally in all directions—Euclidean distance correctly captures this structure.
Example: Physical sensor data where deviation in any feature direction is equally significant (temperature, pressure, vibration magnitude).
Euclidean distance works well in low dimensions (D < 20) where the curse of dimensionality hasn't yet made all points equidistant. The geometric intuition from 2D/3D extends naturally.
For continuous numerical features that can take any value in an interval, Euclidean distance provides smooth gradients of similarity. It naturally handles the interpolation between nearby values.
Example: Scientific measurements, engineering specifications, any data derived from physical quantities.
After proper scaling (Z-score standardization), features become comparable, and Euclidean distance provides a principled weighting where each standard deviation of change contributes equally.
Euclidean distance is invariant to rotations of the coordinate system. If your problem has no preferred direction (isotropic), this is desirable.
Example: Comparing molecular conformations in 3D space after alignment.
Modern representation learning (word2vec, BERT, ResNet) explicitly optimizes embeddings for Euclidean similarity. Neural networks trained with L² loss functions learn representations where Euclidean distance is meaningful. For these learned embeddings, Euclidean distance is often the correct choice by construction.
Euclidean distance can fail dramatically in certain scenarios. Recognizing these pathologies helps avoid misapplication.
In high dimensions, Euclidean distance suffers from a phenomenon called distance concentration. For random points in high-dimensional space:
$$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \text{ as } D \to \infty$$
All points become approximately equidistant! This makes nearest-neighbor search meaningless—every point is almost equally close to every other point.
Mathematical intuition: In $D$ dimensions, distance is $d = \sqrt{\sum_{i=1}^D (x_i - y_i)^2}$. By the law of large numbers, as $D \to \infty$, the sum concentrates around its expected value, and relative variations shrink.
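The following sketch (assuming NumPy; the sample sizes are arbitrary) makes the concentration effect visible by measuring the relative contrast $(d_{\max} - d_{\min})/d_{\min}$ from a random query point to 1,000 random points as the dimension grows:

```python
import numpy as np

# Sketch: measure the relative spread of distances from one query point
# to 1,000 random Gaussian points as the dimension increases.
rng = np.random.default_rng(0)

for dim in [2, 10, 100, 1000, 10000]:
    points = rng.standard_normal((1000, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"D = {dim:6d}   relative contrast = {contrast:.3f}")
# The relative contrast shrinks toward 0 as D grows: all points become
# nearly equidistant from the query.
```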
For sparse data (many zero entries), Euclidean distance can be problematic:

• dimensions where both vectors are zero contribute nothing to similarity, yet the few non-zero entries dominate the distance
• differences in overall magnitude (e.g., document length in bag-of-words vectors) inflate distances between items that are otherwise similar
• cosine similarity or Jaccard distance (see the summary table below) is usually a better fit for such data
As dimensionality increases:

• Volume of a hypersphere relative to its bounding hypercube → 0
• Most points are near the boundary, not the center
• Random projections preserve distances (Johnson-Lindenstrauss)

For D > 100, seriously consider dimensionality reduction (PCA, UMAP) or alternative metrics.
Euclidean distance assumes that straight-line travel between any two points is meaningful. In many domains, movement or change is constrained to a grid or to discrete steps, and the straight-line distance misrepresents the true cost:
Example - Chess: A king moves one square in any direction (including diagonals). Manhattan distance (covered next) says diagonal is distance 2; Euclidean says it's $\sqrt{2}$. Neither matches the king's actual movement (Chebyshev distance: 1).
Euclidean distance doesn't handle periodic or cyclical features:
Example - Time of day: 11:55 PM and 12:05 AM (midnight) are 10 minutes apart, but Euclidean distance on 24-hour representation (23.92 vs 0.08) gives 23.84, suggesting nearly maximum difference!
Solutions:

• Encode each cyclical feature as a pair of coordinates on the unit circle, $(\sin(2\pi t / T), \cos(2\pi t / T))$, where $T$ is the period (24 hours, 7 days, 360 degrees), as sketched below
• Alternatively, use a dedicated circular/angular distance for those features
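Here is a minimal sketch of the sine/cosine encoding for hour-of-day; the helper name encode_hour and the specific times are illustrative only:

```python
import numpy as np

# Sketch: map time of day onto the unit circle so that 11:55 PM and
# 12:05 AM come out close under Euclidean distance.
def encode_hour(hour: float) -> np.ndarray:
    angle = 2 * np.pi * hour / 24.0
    return np.array([np.sin(angle), np.cos(angle)])

t1, t2 = 23.92, 0.08  # 11:55 PM and 12:05 AM in fractional hours

raw_difference = abs(t1 - t2)                                     # 23.84 -- misleading
circular_distance = np.linalg.norm(encode_hour(t1) - encode_hour(t2))

print(f"Raw 24-hour difference:     {raw_difference:.2f}")
print(f"Distance after sin/cos map: {circular_distance:.4f}")     # small, as expected
```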
Euclidean distance is fundamentally designed for continuous numerical features. Mixing in:

• categorical features (colors, cities, product categories), whose coded "differences" have no numeric meaning
• ordinal features encoded as integers, where adjacent codes may not be equally spaced
• binary flags, whose 0/1 differences sit on an arbitrary scale relative to the other features

produces distances that are difficult to interpret.
Solutions: Use hybrid distance functions like Gower distance, or embed categorical features appropriately.
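As a rough illustration of the hybrid approach, the sketch below implements a Gower-style dissimilarity in which numeric features contribute a range-normalized absolute difference and categorical features contribute a 0/1 mismatch; the function name, feature split, and ranges are illustrative assumptions, not a library API:

```python
import numpy as np

# Illustrative Gower-style hybrid dissimilarity (sketch, not a canonical library
# implementation): numeric features contribute |difference| / range, categorical
# features contribute 0 for a match and 1 for a mismatch; the result is the mean.
def hybrid_distance(a_num, b_num, a_cat, b_cat, num_ranges):
    numeric_part = np.abs(np.asarray(a_num, float) - np.asarray(b_num, float)) / np.asarray(num_ranges, float)
    categorical_part = np.array([0.0 if x == y else 1.0 for x, y in zip(a_cat, b_cat)])
    contributions = np.concatenate([numeric_part, categorical_part])
    return contributions.mean()

# Example: (age, income) numeric; (city, plan) categorical -- values are made up
person_a = ([34, 52_000], ["Austin", "basic"])
person_b = ([39, 61_000], ["Boston", "basic"])
ranges = [60, 200_000]  # assumed observed ranges of age and income

print(hybrid_distance(person_a[0], person_b[0], person_a[1], person_b[1], ranges))
```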
| Problem | Symptom | Solution |
|---|---|---|
| High dimensionality | All points equidistant | Dimensionality reduction (PCA, UMAP) |
| Different scales | High-magnitude features dominate | Feature standardization |
| Sparse data | Zero values inflate distances | Cosine similarity, Jaccard distance |
| Periodic features | Opposite phases appear far | Circular/angular distance |
| Categorical features | Cannot compute differences | One-hot + Hamming, embeddings |
| Correlated features | Double-counting information | Mahalanobis distance, PCA |
Euclidean distance is the fundamental building block of distance-based machine learning. Its simplicity, intuitive geometric interpretation, and strong theoretical foundations make it the default starting point. However, its assumptions—isotropy, continuous features, comparable scales—must be verified before application.
The next page explores Manhattan distance (L¹ norm), which takes a fundamentally different approach by summing absolute differences rather than squared differences. We will see how this creates grid-like geometry, how it handles outliers differently, and when it outperforms Euclidean distance.
You now have a deep understanding of Euclidean distance—its derivation, properties, computational optimizations, strengths, and limitations. This foundation prepares you to understand the family of distance metrics we explore next, all of which are variations on these core themes.