Imagine you're a taxi driver in Manhattan. Your passenger asks to travel from the corner of 3rd Avenue and 14th Street to 7th Avenue and 18th Street. You cannot fly diagonally through buildings—you must follow the grid of streets, traveling only north-south or east-west.
This constraint gives rise to Manhattan distance, alternatively called taxicab distance, city-block distance, or the L¹ norm. Unlike Euclidean distance, which measures the straight-line path through space, Manhattan distance measures the total distance traveled along axis-parallel paths.
The key insight is that many real-world problems have natural grid-like constraints, or the semantic meaning of "distance" is better captured by the sum of individual deviations rather than the root of summed squares.
By mastering this page, you will:
• Derive the Manhattan distance formula and understand its geometric interpretation
• Visualize the L¹ unit ball (a diamond/rhombus) and contrast it with the L² ball (a circle)
• Prove that Manhattan distance satisfies all metric space axioms
• Analyze its computational advantages: no square roots, integer-friendly arithmetic
• Understand its robustness to outliers compared to Euclidean distance
• Identify problem domains where Manhattan distance is the natural choice
• Implement efficient Manhattan distance calculations
For two points in $n$-dimensional space:
$$\mathbf{p} = (p_1, p_2, \ldots, p_n) \quad \text{and} \quad \mathbf{q} = (q_1, q_2, \ldots, q_n)$$
The Manhattan distance is defined as:
$$d_1(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i - q_i| = |\mathbf{p} - \mathbf{q}|_1$$
The subscript "1" indicates this is the L¹ norm (or 1-norm) of the difference vector. Compare with the Euclidean (L²) norm:
| Metric | Formula | Norm Type |
|---|---|---|
| Manhattan | $\sum_i |p_i - q_i|$ | L¹ |
| Euclidean | $\sqrt{\sum_i (p_i - q_i)^2}$ | L² |
In the plane, for points $\mathbf{p} = (x_1, y_1)$ and $\mathbf{q} = (x_2, y_2)$:
$$d_1(\mathbf{p}, \mathbf{q}) = |x_2 - x_1| + |y_2 - y_1|$$
This is the minimal path length along a grid—exactly the number of blocks you'd walk in a city.
Unlike Euclidean geometry where the shortest path is unique (a straight line), Manhattan geometry has multiple shortest paths. From $(0,0)$ to $(3,2)$:
Any path that only moves right and up (no backtracking) has the same Manhattan distance. There are $\binom{5}{2} = 10$ such shortest paths.
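The path count follows from choosing which of the $3 + 2$ unit steps move right; a quick sketch using Python's standard library:

```python
from math import comb

def count_shortest_grid_paths(dx: int, dy: int) -> int:
    """Number of monotone lattice paths from (0,0) to (dx,dy):
    choose which dx of the dx+dy unit steps go right."""
    return comb(dx + dy, dx)

print(count_shortest_grid_paths(3, 2))  # 10
print(count_shortest_grid_paths(1, 1))  # 2: right-then-up, or up-then-right
```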
The name comes from the grid layout of streets in Manhattan, New York City, where most streets run north-south or east-west. The term was coined by mathematicians studying geometries where the usual Euclidean straight-line distance doesn't apply. The metric is also called:
• Taxicab distance — taxis follow the street grid
• City-block distance — measuring in city blocks
• Rectilinear distance — every leg of the path is a straight, axis-parallel segment
• L¹ distance — from functional analysis notation
```python
import numpy as np
from typing import List

def manhattan_distance_naive(p: List[float], q: List[float]) -> float:
    """
    Compute Manhattan distance using the explicit formula.

    d₁(p, q) = Σ |pᵢ - qᵢ|

    Parameters:
        p: First point as a list of coordinates
        q: Second point as a list of coordinates

    Returns:
        The Manhattan distance between p and q

    Example:
        >>> manhattan_distance_naive([0, 0], [3, 4])
        7

    (Compare to the Euclidean distance of 5.)
    """
    if len(p) != len(q):
        raise ValueError("Points must have the same dimensionality")
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

def manhattan_distance_numpy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Manhattan distance using NumPy's vectorized operations.

    This is typically 10-100x faster than the naive implementation
    for high-dimensional vectors.
    """
    return np.sum(np.abs(p - q))

def manhattan_distance_norm(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Manhattan distance using NumPy's norm function.

    np.linalg.norm with ord=1 computes the L¹ norm.
    """
    return np.linalg.norm(p - q, ord=1)

# Comparison with Euclidean distance
if __name__ == "__main__":
    p = np.array([0, 0])
    q = np.array([3, 4])

    manhattan = manhattan_distance_numpy(p, q)
    euclidean = np.linalg.norm(p - q, ord=2)

    print(f"Manhattan distance: {manhattan}")  # 7
    print(f"Euclidean distance: {euclidean}")  # 5
    print(f"Ratio (Manhattan/Euclidean): {manhattan/euclidean:.2f}")  # 1.4

    # The ratio is bounded above by √n; for random Gaussian vectors
    # it concentrates near √(2n/π) ≈ 0.8·√n
    n_dims = 100
    p_high = np.random.randn(n_dims)
    q_high = np.random.randn(n_dims)
    manhattan_high = manhattan_distance_numpy(p_high, q_high)
    euclidean_high = np.linalg.norm(p_high - q_high)

    print(f"100D Manhattan: {manhattan_high:.2f}")
    print(f"100D Euclidean: {euclidean_high:.2f}")
    print(f"100D Ratio: {manhattan_high/euclidean_high:.2f}")
    print(f"√100 = {np.sqrt(100):.2f}")
```

Every metric defines a geometry, and that geometry is characterized by the shape of its unit ball—the set of all points at distance ≤ 1 from the origin.
For the L¹ norm in $\mathbb{R}^n$, the unit ball is:
$$B_1 = \{\mathbf{x} \in \mathbb{R}^n : |\mathbf{x}|_1 \leq 1\} = \left\{\mathbf{x} : \sum_{i=1}^n |x_i| \leq 1\right\}$$
In 2D, the L¹ unit ball is defined by:
$$|x| + |y| \leq 1$$
This is a square rotated 45 degrees (a diamond or rhombus) with vertices at $(\pm 1, 0)$ and $(0, \pm 1)$.
Contrast with L²: The Euclidean unit ball $x^2 + y^2 \leq 1$ is a circle that circumscribes this diamond. The diamond's vertices touch the circle on the axes at $(\pm 1, 0)$ and $(0, \pm 1)$, but along the diagonals the circle extends beyond the diamond, consistent with $|\mathbf{x}|_2 \leq |\mathbf{x}|_1$.
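Since $|\mathbf{x}|_2 \leq |\mathbf{x}|_1$, every point of the diamond lies inside the unit circle. A quick numerical check, sampling points on the diamond's boundary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points on the L1 unit sphere |x| + |y| = 1 and verify that they
# satisfy x² + y² <= 1, i.e. the diamond sits inside the L2 unit circle.
t = rng.uniform(0.0, 1.0, size=1000)
pts = np.column_stack([t, 1 - t])             # first-quadrant edge of the diamond
pts *= rng.choice([-1, 1], size=(1000, 2))    # reflect into all four quadrants

l2_norms = np.linalg.norm(pts, axis=1)
assert np.all(np.abs(pts).sum(axis=1) > 1 - 1e-12)   # on the L1 boundary
assert np.all(l2_norms <= 1 + 1e-12)                  # inside/on the circle

# Equality only at the four vertices, which lie on the axes:
assert np.isclose(np.linalg.norm([1.0, 0.0]), 1.0)
```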
| p | Unit Ball Shape (2D) | Corners | Smoothness |
|---|---|---|---|
| p = 1 | Diamond (square rotated 45°) | 4 sharp corners | Non-smooth at axes |
| p = 2 | Circle | None | Perfectly smooth |
| p → ∞ | Square (axis-aligned) | 4 sharp corners | Non-smooth at corners |
In $n$ dimensions, the L¹ unit ball is a cross-polytope (or hyperoctahedron): the convex hull of the $2n$ vertices $\pm\mathbf{e}_i$, bounded by $2^n$ facets.
This combinatorial explosion of facets makes high-dimensional L¹ geometry fundamentally different from the smooth hypersphere of L².
The shape of the unit ball determines which points are considered "equally close." With Euclidean distance, equidistant points form a circle/sphere. With Manhattan distance, equidistant points form a diamond/cross-polytope.
This affects neighbor selection: L¹ naturally favors neighbors aligned with the axes, while L² treats all directions equally.
For any two points, there's a predictable relationship between their L¹ and L² distances:
$$|\mathbf{x}|_2 \leq |\mathbf{x}|_1 \leq \sqrt{n} \cdot |\mathbf{x}|_2$$
Lower bound: Euclidean distance is always less than or equal to Manhattan distance. The straight-line path is never longer than the grid path.
Upper bound: Manhattan distance is at most $\sqrt{n}$ times the Euclidean distance, with equality when all coordinates differ equally (the vector points along the diagonal).
Proof of upper bound ($|\mathbf{x}|_1 \leq \sqrt{n} \cdot |\mathbf{x}|_2$): By Cauchy–Schwarz applied to $(|x_1|, \ldots, |x_n|)$ and $(1, \ldots, 1)$: $$|\mathbf{x}|_1 = \sum_i |x_i| \cdot 1 \leq \sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i 1^2} = |\mathbf{x}|_2 \cdot \sqrt{n}$$
with equality when all the $|x_i|$ are equal (the diagonal direction).
Proof of lower bound ($|\mathbf{x}|_2 \leq |\mathbf{x}|_1$): Directly, since expanding the square introduces only non-negative cross terms $|x_i||x_j|$: $$|\mathbf{x}|_2 = \sqrt{\sum_i x_i^2} \leq \sqrt{\Big(\sum_i |x_i|\Big)^2} = |\mathbf{x}|_1$$
with equality when at most one coordinate is non-zero.
| Vector Direction | L¹ Distance | L² Distance | Ratio L¹/L² |
|---|---|---|---|
| Along axis: (3, 0) | 3 | 3 | 1.00 |
| Balanced: (3, 3) | 6 | 4.24 | 1.41 (≈√2) |
| Unbalanced: (3, 1) | 4 | 3.16 | 1.26 |
| Along 3D diagonal: (1,1,1) | 3 | 1.73 | 1.73 (≈√3) |
| Along nD diagonal: (1,...,1) | n | √n | √n |
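The two-sided bound, and its tightness at each extreme, is easy to verify numerically; a minimal check with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# The sandwich bound ||x||_2 <= ||x||_1 <= sqrt(n)·||x||_2 holds for every vector.
for n in (2, 10, 100):
    x = rng.standard_normal(n)
    l1 = np.sum(np.abs(x))
    l2 = np.linalg.norm(x)
    assert l2 <= l1 <= np.sqrt(n) * l2 + 1e-9

# Lower bound is tight for an axis-aligned vector (one non-zero coordinate):
e1 = np.zeros(5)
e1[0] = 3.0
assert np.isclose(np.sum(np.abs(e1)), np.linalg.norm(e1))

# Upper bound is tight along the diagonal (all coordinates equal):
diag = np.ones(5)
assert np.isclose(np.sum(np.abs(diag)), np.sqrt(5) * np.linalg.norm(diag))
```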
Like Euclidean distance, Manhattan distance is a proper metric, satisfying all four axioms. We verify each:
$$d_1(\mathbf{p}, \mathbf{q}) \geq 0$$
Proof: Absolute values are non-negative, and sums of non-negative numbers are non-negative. ∎
$$d_1(\mathbf{p}, \mathbf{q}) = 0 \iff \mathbf{p} = \mathbf{q}$$
Proof: A sum of non-negative terms is zero if and only if every term is zero, so $d_1(\mathbf{p}, \mathbf{q}) = 0 \iff |p_i - q_i| = 0$ for all $i \iff \mathbf{p} = \mathbf{q}$. ∎
$$d_1(\mathbf{p}, \mathbf{q}) = d_1(\mathbf{q}, \mathbf{p})$$
Proof: $|p_i - q_i| = |q_i - p_i|$ for all $i$. ∎
$$d_1(\mathbf{p}, \mathbf{r}) \leq d_1(\mathbf{p}, \mathbf{q}) + d_1(\mathbf{q}, \mathbf{r})$$
Proof: For each dimension $i$: $$|p_i - r_i| = |p_i - q_i + q_i - r_i| \leq |p_i - q_i| + |q_i - r_i|$$
by the standard triangle inequality for real numbers.
Summing over all dimensions: $$\sum_i |p_i - r_i| \leq \sum_i |p_i - q_i| + \sum_i |q_i - r_i|$$
which is exactly: $$d_1(\mathbf{p}, \mathbf{r}) \leq d_1(\mathbf{p}, \mathbf{q}) + d_1(\mathbf{q}, \mathbf{r})$$ ∎
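A numerical spot-check of the triangle inequality, including the tightness case where the intermediate point lies componentwise between the endpoints:

```python
import numpy as np

rng = np.random.default_rng(1)

def d1(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan (L1) distance."""
    return float(np.sum(np.abs(a - b)))

# Random check of d1(p, r) <= d1(p, q) + d1(q, r)
for _ in range(1000):
    p, q, r = rng.standard_normal((3, 4))
    assert d1(p, r) <= d1(p, q) + d1(q, r) + 1e-12

# Tightness: any q lying componentwise between p and r is on a shortest
# L1 path, so the inequality becomes an equality.
p, r = np.zeros(3), np.array([4.0, 2.0, 6.0])
q = np.array([1.0, 1.0, 3.0])   # 0 <= q_i <= r_i in every coordinate
assert np.isclose(d1(p, r), d1(p, q) + d1(q, r))   # 12 == 5 + 7
```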
An interesting property: the triangle inequality is "more often tight" for L¹ than L². In L¹, the indirect path through an intermediate point has the same length as the direct path whenever the intermediate point lies on any shortest L¹ path.
For L², the triangle inequality is tight only when the intermediate point lies on the straight-line segment between the other two points.
Manhattan distance offers several computational advantages over Euclidean distance:
The L¹ formula involves only three elementary operations: subtraction, absolute value, and addition.
No square root is needed, ever. This is in contrast to Euclidean distance, where the square root is required for the actual distance (though it can be skipped for comparisons).
Impact: Absolute value is a simple branch-free operation on modern CPUs (using bit manipulation), making Manhattan distance approximately 30-50% faster than Euclidean distance for actual distance computation.
For integer coordinates, Manhattan distance is always an integer:
$$\mathbf{p}, \mathbf{q} \in \mathbb{Z}^n \implies d_1(\mathbf{p}, \mathbf{q}) \in \mathbb{Z}$$
This is not true for Euclidean distance, which typically produces irrational numbers (e.g., $\sqrt{2}$).
Benefits: results are exact, with no floating-point rounding error; equality and ordering comparisons are reliable; and distances can be used directly as array indices, hash keys, or loop bounds.
```python
import numpy as np
import time
from typing import Tuple

def benchmark_distances(n_points: int, n_dims: int,
                        n_iterations: int) -> Tuple[float, float]:
    """
    Benchmark Manhattan vs Euclidean distance computation.

    Returns:
        (manhattan_time, euclidean_time) in seconds
    """
    # Generate random data
    X = np.random.randn(n_points, n_dims)
    query = np.random.randn(n_dims)

    # Benchmark Manhattan
    start = time.perf_counter()
    for _ in range(n_iterations):
        distances = np.sum(np.abs(X - query), axis=1)
    manhattan_time = time.perf_counter() - start

    # Benchmark Euclidean
    start = time.perf_counter()
    for _ in range(n_iterations):
        distances = np.sqrt(np.sum((X - query)**2, axis=1))
    euclidean_time = time.perf_counter() - start

    return manhattan_time, euclidean_time

def benchmark_integer_distance():
    """
    Demonstrate integer exactness of Manhattan distance.
    """
    # Integer coordinates
    p = np.array([0, 0, 0], dtype=np.int64)
    q = np.array([3, 4, 5], dtype=np.int64)

    manhattan = np.sum(np.abs(p - q))
    euclidean = np.sqrt(np.sum((p - q)**2))

    print("Integer coordinate example:")
    print(f"  Manhattan: {manhattan} (type: {type(manhattan).__name__})")
    print(f"  Euclidean: {euclidean} (type: {type(euclidean).__name__})")
    print(f"  Manhattan is integer: {isinstance(manhattan, (int, np.integer))}")
    print(f"  Euclidean is irrational: {euclidean != int(euclidean)}")

if __name__ == "__main__":
    # Run integer demonstration
    benchmark_integer_distance()

    # Run speed benchmark
    print("Speed benchmark (10K points, 100 dims, 100 iterations):")
    m_time, e_time = benchmark_distances(10000, 100, 100)
    print(f"  Manhattan: {m_time:.3f}s")
    print(f"  Euclidean: {e_time:.3f}s")
    print(f"  Manhattan is {e_time/m_time:.1f}x faster")
```

Modern CPUs support SIMD (Single Instruction, Multiple Data) operations that process multiple data elements simultaneously. Both Manhattan and Euclidean distance benefit from SIMD, but with different characteristics:
Manhattan distance SIMD optimization: the inner loop is a vectorized subtract, absolute value, and accumulate — simple, fully pipelined instructions.

Euclidean distance SIMD optimization: the subtract and multiply-accumulate steps also vectorize well, but the final square root adds a comparatively high-latency operation at the end.
The absence of square root in Manhattan distance is particularly beneficial on older or embedded architectures where the square root is not vectorized.
Both metrics have identical memory access patterns (sequential access through all dimensions), so neither has an inherent advantage in terms of cache performance. However, Manhattan distance requires slightly less CPU register pressure since it doesn't need to accumulate squares.
Consider Manhattan distance when:
• Real-time systems require predictable, fast distance computation
• Embedded systems lack efficient floating-point support
• Exact arithmetic is required for correctness (financial, cryptographic)
• Large-scale batch processing where small speedups compound
For most ML applications with GPU acceleration, the difference is negligible as both operations are highly optimized.
One of the most important practical differences between Manhattan and Euclidean distance is their sensitivity to outliers. This difference arises from how each metric aggregates component-wise differences.
Euclidean distance squares each difference before summing:
$$d_2^2 = \sum_i (p_i - q_i)^2$$
This means a single large difference dominates the total distance. If one component differs by 10 units and another by 1 unit, their squared contributions are 100 and 1, so the larger difference accounts for roughly 99% of the squared distance.
Manhattan distance sums absolute differences without amplification:
$$d_1 = \sum_i |p_i - q_i|$$
The same example: the contributions are 10 and 1, so the larger difference accounts for roughly 91% of the total.
While still dominant, the outlier's influence is proportionally less severe.
Consider two points differing by $\epsilon$ in most dimensions but by $M$ (a large outlier) in one dimension:
$$\mathbf{d} = (\epsilon, \epsilon, \ldots, \epsilon, M)$$
L² distance: $\sqrt{(n-1)\epsilon^2 + M^2} \approx M$ for large $M$
L¹ distance: $(n-1)\epsilon + M$
As $n$ grows, the L¹ distance picks up more signal from the small differences, while L² remains dominated by the outlier.
```python
import numpy as np

def compare_outlier_sensitivity():
    """
    Demonstrate how L¹ and L² respond differently to outliers.
    """
    n_dims = 100

    # Two points that are close in most dimensions
    p = np.zeros(n_dims)
    q_normal = np.ones(n_dims) * 0.1  # All dimensions differ by 0.1

    # Same point but with one outlier dimension
    q_outlier = q_normal.copy()
    q_outlier[0] = 10  # One dimension has a massive difference

    print("Normal case (all differences = 0.1):")
    print(f"  L¹ distance: {np.sum(np.abs(p - q_normal)):.2f}")
    print(f"  L² distance: {np.linalg.norm(p - q_normal):.2f}")

    print("With outlier (one difference = 10, rest = 0.1):")
    d1_outlier = np.sum(np.abs(p - q_outlier))
    d2_outlier = np.linalg.norm(p - q_outlier)
    print(f"  L¹ distance: {d1_outlier:.2f}")
    print(f"  L² distance: {d2_outlier:.2f}")

    # Compute outlier contribution
    d1_normal = np.sum(np.abs(p - q_normal))
    d2_normal = np.linalg.norm(p - q_normal)

    d1_increase = (d1_outlier - d1_normal) / d1_normal * 100
    d2_increase = (d2_outlier - d2_normal) / d2_normal * 100

    print("Percentage increase due to outlier:")
    print(f"  L¹: {d1_increase:.0f}% increase")
    print(f"  L²: {d2_increase:.0f}% increase")
    print(f"  L² is {d2_increase/d1_increase:.1f}x more sensitive")

def robust_regression_analogy():
    """
    Show connection to robust regression (L1 vs L2 loss).
    """
    print("\n" + "="*50)
    print("Connection to Regression:")
    print("="*50)
    print("• L² loss (MSE) → L² distance → Sensitive to outliers")
    print("• L¹ loss (MAE) → L¹ distance → Robust to outliers")
    print("• In both KNN and regression, L¹ is more robust!")

if __name__ == "__main__":
    compare_outlier_sensitivity()
    robust_regression_analogy()
```

This outlier robustness connects to a broader principle in statistics:
• L² minimization (mean squared error) → Mean (sensitive to outliers)
• L¹ minimization (mean absolute error) → Median (robust to outliers)
The point that minimizes the sum of squared distances to a set of points is the mean. The point that minimizes the sum of Manhattan (L¹) distances is the component-wise median; the analogous minimizer for Euclidean distances is the geometric median.
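This can be seen directly in one dimension by minimizing each loss over a grid of candidate center points; the data values below are made up for illustration:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier

def sum_sq(c: float) -> float:   # L2-style aggregate
    return float(np.sum((data - c) ** 2))

def sum_abs(c: float) -> float:  # L1-style aggregate
    return float(np.sum(np.abs(data - c)))

grid = np.linspace(0, 110, 11001)  # candidate centers, step 0.01
best_sq  = grid[np.argmin([sum_sq(c)  for c in grid])]
best_abs = grid[np.argmin([sum_abs(c) for c in grid])]

# The squared loss is minimized at the mean (dragged to 22 by the outlier);
# the absolute loss is minimized at the median (3, unaffected by the outlier).
assert abs(best_sq - data.mean()) < 0.01        # mean = 22.0
assert abs(best_abs - np.median(data)) < 0.01   # median = 3.0
```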
In KNN, using L¹ makes neighbors less affected by anomalous feature values.
Manhattan distance is the natural choice for several important problem domains. Understanding these cases helps in metric selection.
Any problem where movement is constrained to axis-parallel directions naturally uses Manhattan distance:
Warehouse robotics: Robots in warehouses typically move along aisles (north-south or east-west). The actual travel distance between two locations is the Manhattan distance, not the Euclidean distance.
Game pathfinding: In many games, characters move on grids (chess, checkers, tile-based RPGs). Manhattan distance provides the correct heuristic for A* search when diagonal movement is not allowed.
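As a sketch of the heuristic claim: on an obstacle-free grid, Manhattan distance equals the true 4-directional shortest-path length (computed here by BFS), and adding obstacles can only lengthen the true path, so the heuristic never overestimates — exactly what A* admissibility requires.

```python
from collections import deque

def bfs_grid_distance(start, goal, width, height):
    """Exact shortest-path length on an open grid with 4-directional moves."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == goal:
            return dist[(x, y)]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                frontier.append((nx, ny))
    return None  # goal unreachable

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# On an obstacle-free grid the heuristic is exact:
start, goal = (0, 0), (3, 2)
assert manhattan(start, goal) == bfs_grid_distance(start, goal, 5, 5) == 5
```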
Urban navigation: In grid-planned cities, actual travel distance (by car or on foot) closely approximates Manhattan distance.
Manhattan distance implicitly assumes features are independent and equally important. Each feature's contribution to distance is proportional only to its own deviation, not influenced by other features.
This is appropriate when:
• Features are measured in comparable (or pre-standardized) units
• Each feature carries independent semantic meaning
• Deviations in different features should combine additively
Example: Comparing products by (price, weight, rating). A $10 price difference and a 1 kg weight difference are independent concerns. Manhattan distance sums their contributions directly.
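A minimal sketch of that example; the product numbers are hypothetical, and in practice the features would first be scaled to comparable ranges:

```python
import numpy as np

# Hypothetical products as (price in $, weight in kg, rating out of 5).
# These values are illustrative only.
product_a = np.array([20.0, 1.5, 4.0])
product_b = np.array([30.0, 2.5, 3.0])

# Each feature contributes its own deviation, independently of the others.
contributions = np.abs(product_a - product_b)   # array([10., 1., 1.])
d1 = float(contributions.sum())                  # 12.0

print(dict(zip(["price", "weight", "rating"], contributions)))
assert d1 == 12.0  # note: unscaled, the price difference dominates
```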
For sparse data (many zeros), Manhattan distance can be more appropriate than Euclidean: squaring shrinks the many small coordinate differences toward zero, so L² is dominated by a few large components, while the L¹ sum gives every non-zero difference its full weight.
| Domain | Recommended Metric | Reason |
|---|---|---|
| Warehouse routing | Manhattan | Physical movement is axis-aligned |
| Image similarity | Euclidean (L²) | Continuous features, diagonal similarity matters |
| Game AI (grid movement) | Manhattan | Movement restricted to grid |
| Text documents (TF-IDF) | Cosine | Magnitude-invariant comparison |
| Sensor data with outliers | Manhattan | More robust to spikes |
| Embeddings (word2vec, etc.) | Euclidean or Cosine | Trained for these metrics |
| Integer/discrete features | Manhattan | Integer-preserving, no sqrt |
| Financial time series | Manhattan (or DTW) | Robust to extreme values |
For binary features (0/1), both Manhattan and Euclidean distance reduce to simpler forms, but Manhattan is often more interpretable:
Binary features: Manhattan distance equals the Hamming distance (number of positions where bits differ).
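A one-line check of the Hamming equivalence for binary vectors:

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])

# For 0/1 vectors, |a_i - b_i| is 1 exactly when the bits differ,
# so the L1 sum counts differing positions — the Hamming distance.
manhattan = int(np.sum(np.abs(a - b)))
hamming = int(np.sum(a != b))
assert manhattan == hamming == 2
```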
Count features: When features are counts (document word frequencies, event occurrences), Manhattan distance gives the total count difference, which is often more interpretable than the geometric distance.
There's a deep connection between L¹ distance and L¹ regularization (Lasso regression). Both promote sparsity:
If your features were learned with L¹ regularization, using L¹ distance may be geometrically consistent.
When in doubt, try both! Train KNN with Euclidean distance and Manhattan distance, then compare cross-validation performance. The difference is often small, but when it's large, it provides insight into your data's geometry.
While Manhattan distance has many strengths, it also has limitations that make it unsuitable for certain applications.
Manhattan distance is not invariant to rotations of the coordinate system. Rotating the data changes the distances between points.
Example: Points $(1, 0)$ and $(0, 1)$ have:
• L² distance $\sqrt{2} \approx 1.41$, unchanged by any rotation
• L¹ distance $2$ in the original axes, but $\sqrt{2} \approx 1.41$ after rotating the plane by 45°

The distances changed because Manhattan distance depends on the coordinate axes.
Implication: If your data's natural axes are unknown or arbitrary, Euclidean distance (which is rotation-invariant) may be more appropriate.
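A short demonstration of this rotation sensitivity, using the points $(1, 0)$ and $(0, 1)$:

```python
import numpy as np

p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])

theta = np.pi / 4  # rotate the coordinate system by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def d1(a, b): return float(np.sum(np.abs(a - b)))
def d2(a, b): return float(np.linalg.norm(a - b))

# Euclidean distance is invariant under rotation; Manhattan distance is not.
assert np.isclose(d2(p, q), d2(R @ p, R @ q))       # sqrt(2) both times
assert np.isclose(d1(p, q), 2.0)                    # before rotation
assert np.isclose(d1(R @ p, R @ q), np.sqrt(2))     # after: shrinks to sqrt(2)
```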
Manhattan distance treats diagonal movement as more expensive than Euclidean geometry suggests: stepping from $(0, 0)$ to $(1, 1)$ costs $2$ under L¹ but only $\sqrt{2} \approx 1.41$ under L².
This means diagonally-related points appear farther apart than they "should" in Euclidean space. For data where diagonal relationships are meaningful, this is a distortion.
Like all Lᵖ distances, Manhattan distance suffers from the curse of dimensionality, though with different characteristics: the contrast between nearest and farthest neighbors shrinks as dimensionality grows, although empirical studies suggest that lower values of $p$ (including $p = 1$) preserve relative contrast somewhat better than L².
The biggest practical pitfall with Manhattan distance is its dependence on the coordinate system. If you:
• Apply PCA before KNN (a rotation!)
• Have features that should naturally be combined (e.g., x and y components of a vector)
• Work in a domain where the axis choice is arbitrary
...then Manhattan distance may produce inconsistent or misleading results.
The existence of multiple shortest paths in Manhattan geometry can lead to frequent distance ties among candidate neighbors, with results that depend on how those ties are broken (often iteration or storage order).
This is usually a minor issue but can affect reproducibility.
Modern neural networks (word embeddings, image features, etc.) typically optimize for Euclidean or cosine similarity. Using Manhattan distance on these representations may not align with the geometry the network learned.
Recommendation: For learned embeddings, prefer the metric used during training, or experiment to find what works best for your downstream task.
Manhattan distance provides an alternative geometry to Euclidean distance, with distinct properties that make it ideal for certain applications. Its simplicity, computational efficiency, and robustness to outliers are valuable in many practical scenarios.
The next page introduces Minkowski distance, the generalized family that includes both Euclidean (p=2) and Manhattan (p=1) as special cases. We'll explore how varying the parameter p interpolates between these geometries and beyond, providing a unified framework for distance metrics.
You now understand Manhattan distance as a first-class alternative to Euclidean distance. You can reason about when to use L¹ vs L², understand the geometric implications of this choice, and implement either efficiently. This prepares you for the unified Minkowski framework coming next.