Imagine you're a taxi driver in Manhattan. Your passenger asks to travel from the corner of 3rd Avenue and 14th Street to 7th Avenue and 18th Street. You cannot fly diagonally through buildings—you must follow the grid of streets, traveling only north-south or east-west.
This constraint gives rise to Manhattan distance, alternatively called taxicab distance, city-block distance, or the L¹ norm. Unlike Euclidean distance, which measures the straight-line path through space, Manhattan distance measures the total distance traveled along axis-parallel paths.
The key insight is that many real-world problems have natural grid-like constraints, or the semantic meaning of "distance" is better captured by the sum of individual deviations rather than the root of summed squares.
By mastering this page, you will:
• Derive the Manhattan distance formula and understand its geometric interpretation
• Visualize the L¹ unit ball (a diamond/rhombus) and contrast it with the L² ball (a circle)
• Prove that Manhattan distance satisfies all metric space axioms
• Analyze its computational advantages: no square roots, integer-friendly arithmetic
• Understand its robustness to outliers compared to Euclidean distance
• Identify problem domains where Manhattan distance is the natural choice
• Implement efficient Manhattan distance calculations
For two points in $n$-dimensional space:
$$\mathbf{p} = (p_1, p_2, \ldots, p_n) \quad \text{and} \quad \mathbf{q} = (q_1, q_2, \ldots, q_n)$$
The Manhattan distance is defined as:
$$d_1(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i - q_i| = |\mathbf{p} - \mathbf{q}|_1$$
The subscript "1" indicates this is the L¹ norm (or 1-norm) of the difference vector. Compare with the Euclidean (L²) norm:
| Metric | Formula | Norm Type |
|---|---|---|
| Manhattan | $\sum_i |p_i - q_i|$ | L¹ |
| Euclidean | $\sqrt{\sum_i (p_i - q_i)^2}$ | L² |
In the plane, for points $\mathbf{p} = (x_1, y_1)$ and $\mathbf{q} = (x_2, y_2)$:
$$d_1(\mathbf{p}, \mathbf{q}) = |x_2 - x_1| + |y_2 - y_1|$$
This is the minimal path length along a grid—exactly the number of blocks you'd walk in a city.
Unlike Euclidean geometry where the shortest path is unique (a straight line), Manhattan geometry has multiple shortest paths. From $(0,0)$ to $(3,2)$:
Any path that only moves right and up (no backtracking) has the same Manhattan distance. There are $\binom{5}{2} = 10$ such shortest paths.
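The path count follows from choosing which of the $3 + 2$ unit steps move right; a quick sketch using Python's standard library:

```python
from math import comb

def count_shortest_grid_paths(dx: int, dy: int) -> int:
    """Number of monotone lattice paths from (0,0) to (dx,dy):
    choose which dx of the dx+dy unit steps go right."""
    return comb(dx + dy, dx)

print(count_shortest_grid_paths(3, 2))  # 10
print(count_shortest_grid_paths(1, 1))  # 2: right-then-up, or up-then-right
```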
The name comes from the grid layout of streets in Manhattan, New York City, where most streets run north-south or east-west. The term was coined by mathematicians studying geometries where the usual Euclidean straight-line distance doesn't apply. The metric is also called:
• Taxicab distance — taxis follow the street grid
• City-block distance — measuring in city blocks
• Rectilinear distance — every leg of the path is a straight, axis-parallel segment
• L¹ distance — from functional analysis notation
```python
import numpy as np
from typing import List

def manhattan_distance_naive(p: List[float], q: List[float]) -> float:
    """
    Compute Manhattan distance using the explicit formula.

    d₁(p, q) = Σ |pᵢ - qᵢ|

    Parameters:
        p: First point as a list of coordinates
        q: Second point as a list of coordinates

    Returns:
        The Manhattan distance between p and q

    Example:
        >>> manhattan_distance_naive([0, 0], [3, 4])
        7

    (Compare to the Euclidean distance of 5.)
    """
    if len(p) != len(q):
        raise ValueError("Points must have the same dimensionality")
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

def manhattan_distance_numpy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Manhattan distance using NumPy's vectorized operations.

    This is typically 10-100x faster than the naive implementation
    for high-dimensional vectors.
    """
    return np.sum(np.abs(p - q))

def manhattan_distance_norm(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute Manhattan distance using NumPy's norm function.

    np.linalg.norm with ord=1 computes the L¹ norm.
    """
    return np.linalg.norm(p - q, ord=1)

# Comparison with Euclidean distance
if __name__ == "__main__":
    p = np.array([0, 0])
    q = np.array([3, 4])

    manhattan = manhattan_distance_numpy(p, q)
    euclidean = np.linalg.norm(p - q, ord=2)

    print(f"Manhattan distance: {manhattan}")  # 7
    print(f"Euclidean distance: {euclidean}")  # 5
    print(f"Ratio (Manhattan/Euclidean): {manhattan/euclidean:.2f}")  # 1.4

    # The ratio is bounded above by √n; for random Gaussian vectors
    # it concentrates near √(2n/π) ≈ 0.8·√n
    n_dims = 100
    p_high = np.random.randn(n_dims)
    q_high = np.random.randn(n_dims)
    manhattan_high = manhattan_distance_numpy(p_high, q_high)
    euclidean_high = np.linalg.norm(p_high - q_high)

    print(f"100D Manhattan: {manhattan_high:.2f}")
    print(f"100D Euclidean: {euclidean_high:.2f}")
    print(f"100D Ratio: {manhattan_high/euclidean_high:.2f}")
    print(f"√100 = {np.sqrt(100):.2f}")
```

Every metric defines a geometry, and that geometry is characterized by the shape of its unit ball—the set of all points at distance ≤ 1 from the origin.
For the L¹ norm in $\mathbb{R}^n$, the unit ball is:
$$B_1 = \{\mathbf{x} \in \mathbb{R}^n : |\mathbf{x}|_1 \leq 1\} = \left\{\mathbf{x} : \sum_{i=1}^n |x_i| \leq 1\right\}$$
In 2D, the L¹ unit ball is defined by:
$$|x| + |y| \leq 1$$
This is a square rotated 45 degrees (a diamond or rhombus) with vertices at $(\pm 1, 0)$ and $(0, \pm 1)$.
Contrast with L²: The Euclidean unit ball $x^2 + y^2 \leq 1$ is a circle that circumscribes this diamond. The diamond's vertices touch the circle on the axes at $(\pm 1, 0)$ and $(0, \pm 1)$, but along the diagonals the circle extends beyond the diamond, consistent with $|\mathbf{x}|_2 \leq |\mathbf{x}|_1$.
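Since $|\mathbf{x}|_2 \leq |\mathbf{x}|_1$, every point of the diamond lies inside the unit circle. A quick numerical check, sampling points on the diamond's boundary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points on the L1 unit sphere |x| + |y| = 1 and verify that they
# satisfy x² + y² <= 1, i.e. the diamond sits inside the L2 unit circle.
t = rng.uniform(0.0, 1.0, size=1000)
pts = np.column_stack([t, 1 - t])             # first-quadrant edge of the diamond
pts *= rng.choice([-1, 1], size=(1000, 2))    # reflect into all four quadrants

l2_norms = np.linalg.norm(pts, axis=1)
assert np.all(np.abs(pts).sum(axis=1) > 1 - 1e-12)   # on the L1 boundary
assert np.all(l2_norms <= 1 + 1e-12)                  # inside/on the circle

# Equality only at the four vertices, which lie on the axes:
assert np.isclose(np.linalg.norm([1.0, 0.0]), 1.0)
```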
| p | Unit Ball Shape (2D) | Corners | Smoothness |
|---|---|---|---|
| p = 1 | Diamond (square rotated 45°) | 4 sharp corners | Non-smooth at axes |
| p = 2 | Circle | None | Perfectly smooth |
| p → ∞ | Square (axis-aligned) | 4 sharp corners | Non-smooth at corners |
In $n$ dimensions, the L¹ unit ball is a cross-polytope (or hyperoctahedron): the convex hull of the $2n$ vertices $\pm\mathbf{e}_i$, bounded by $2^n$ facets.
This combinatorial explosion of facets makes high-dimensional L¹ geometry fundamentally different from the smooth hypersphere of L².
The shape of the unit ball determines which points are considered "equally close." With Euclidean distance, equidistant points form a circle/sphere. With Manhattan distance, equidistant points form a diamond/cross-polytope.
This affects neighbor selection: L¹ naturally favors neighbors aligned with the axes, while L² treats all directions equally.
For any two points, there's a predictable relationship between their L¹ and L² distances:
$$|\mathbf{x}|_2 \leq |\mathbf{x}|_1 \leq \sqrt{n} \cdot |\mathbf{x}|_2$$
Lower bound: Euclidean distance is always less than or equal to Manhattan distance. The straight-line path is never longer than the grid path.
Upper bound: Manhattan distance is at most $\sqrt{n}$ times the Euclidean distance, with equality when all coordinates differ equally (the vector points along the diagonal).
Proof of upper bound ($|\mathbf{x}|_1 \leq \sqrt{n} \cdot |\mathbf{x}|_2$): By Cauchy–Schwarz applied to $(|x_1|, \ldots, |x_n|)$ and $(1, \ldots, 1)$: $$|\mathbf{x}|_1 = \sum_i |x_i| \cdot 1 \leq \sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i 1^2} = |\mathbf{x}|_2 \cdot \sqrt{n}$$
with equality when all the $|x_i|$ are equal (the diagonal direction).
Proof of lower bound ($|\mathbf{x}|_2 \leq |\mathbf{x}|_1$): Directly, since expanding the square introduces only non-negative cross terms $|x_i||x_j|$: $$|\mathbf{x}|_2 = \sqrt{\sum_i x_i^2} \leq \sqrt{\Big(\sum_i |x_i|\Big)^2} = |\mathbf{x}|_1$$
with equality when at most one coordinate is non-zero.
| Vector Direction | L¹ Distance | L² Distance | Ratio L¹/L² |
|---|---|---|---|
| Along axis: (3, 0) | 3 | 3 | 1.00 |
| Balanced: (3, 3) | 6 | 4.24 | 1.41 (≈√2) |
| Unbalanced: (3, 1) | 4 | 3.16 | 1.26 |
| Along 3D diagonal: (1,1,1) | 3 | 1.73 | 1.73 (≈√3) |
| Along nD diagonal: (1,...,1) | n | √n | √n |
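The two-sided bound, and its tightness at each extreme, is easy to verify numerically; a minimal check with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# The sandwich bound ||x||_2 <= ||x||_1 <= sqrt(n)·||x||_2 holds for every vector.
for n in (2, 10, 100):
    x = rng.standard_normal(n)
    l1 = np.sum(np.abs(x))
    l2 = np.linalg.norm(x)
    assert l2 <= l1 <= np.sqrt(n) * l2 + 1e-9

# Lower bound is tight for an axis-aligned vector (one non-zero coordinate):
e1 = np.zeros(5)
e1[0] = 3.0
assert np.isclose(np.sum(np.abs(e1)), np.linalg.norm(e1))

# Upper bound is tight along the diagonal (all coordinates equal):
diag = np.ones(5)
assert np.isclose(np.sum(np.abs(diag)), np.sqrt(5) * np.linalg.norm(diag))
```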
Like Euclidean distance, Manhattan distance is a proper metric, satisfying all four axioms. We verify each:
$$d_1(\mathbf{p}, \mathbf{q}) \geq 0$$
Proof: Absolute values are non-negative, and sums of non-negative numbers are non-negative. ∎
$$d_1(\mathbf{p}, \mathbf{q}) = 0 \iff \mathbf{p} = \mathbf{q}$$
Proof: A sum of non-negative terms is zero if and only if every term is zero, so $d_1(\mathbf{p}, \mathbf{q}) = 0 \iff |p_i - q_i| = 0$ for all $i \iff \mathbf{p} = \mathbf{q}$. ∎
$$d_1(\mathbf{p}, \mathbf{q}) = d_1(\mathbf{q}, \mathbf{p})$$
Proof: $|p_i - q_i| = |q_i - p_i|$ for all $i$. ∎
$$d_1(\mathbf{p}, \mathbf{r}) \leq d_1(\mathbf{p}, \mathbf{q}) + d_1(\mathbf{q}, \mathbf{r})$$
Proof: For each dimension $i$: $$|p_i - r_i| = |p_i - q_i + q_i - r_i| \leq |p_i - q_i| + |q_i - r_i|$$
by the standard triangle inequality for real numbers.
Summing over all dimensions: $$\sum_i |p_i - r_i| \leq \sum_i |p_i - q_i| + \sum_i |q_i - r_i|$$
which is exactly: $$d_1(\mathbf{p}, \mathbf{r}) \leq d_1(\mathbf{p}, \mathbf{q}) + d_1(\mathbf{q}, \mathbf{r})$$ ∎
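A numerical spot-check of the triangle inequality, including the tightness case where the intermediate point lies componentwise between the endpoints:

```python
import numpy as np

rng = np.random.default_rng(1)

def d1(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan (L1) distance."""
    return float(np.sum(np.abs(a - b)))

# Random check of d1(p, r) <= d1(p, q) + d1(q, r)
for _ in range(1000):
    p, q, r = rng.standard_normal((3, 4))
    assert d1(p, r) <= d1(p, q) + d1(q, r) + 1e-12

# Tightness: any q lying componentwise between p and r is on a shortest
# L1 path, so the inequality becomes an equality.
p, r = np.zeros(3), np.array([4.0, 2.0, 6.0])
q = np.array([1.0, 1.0, 3.0])   # 0 <= q_i <= r_i in every coordinate
assert np.isclose(d1(p, r), d1(p, q) + d1(q, r))   # 12 == 5 + 7
```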
An interesting property: the triangle inequality is "more often tight" for L¹ than L². In L¹, the indirect path through an intermediate point has the same length as the direct path whenever the intermediate point lies on any shortest L¹ path.
For L², the triangle inequality is tight only when the intermediate point lies on the straight-line segment between the other two points.
Manhattan distance offers several computational advantages over Euclidean distance:
The L¹ formula involves only three elementary operations: subtraction, absolute value, and addition.
No square root is needed, ever. This is in contrast to Euclidean distance, where the square root is required for the actual distance (though it can be skipped for comparisons).
Impact: Absolute value is a simple branch-free operation on modern CPUs (using bit manipulation), making Manhattan distance approximately 30-50% faster than Euclidean distance for actual distance computation.
For integer coordinates, Manhattan distance is always an integer:
$$\mathbf{p}, \mathbf{q} \in \mathbb{Z}^n \implies d_1(\mathbf{p}, \mathbf{q}) \in \mathbb{Z}$$
This is not true for Euclidean distance, which typically produces irrational numbers (e.g., $\sqrt{2}$).
Benefits: results are exact, with no floating-point rounding error; equality and ordering comparisons are reliable; and distances can be used directly as array indices, hash keys, or loop bounds.
```python
import numpy as np
import time
from typing import Tuple

def benchmark_distances(n_points: int, n_dims: int,
                        n_iterations: int) -> Tuple[float, float]:
    """
    Benchmark Manhattan vs Euclidean distance computation.

    Returns:
        (manhattan_time, euclidean_time) in seconds
    """
    # Generate random data
    X = np.random.randn(n_points, n_dims)
    query = np.random.randn(n_dims)

    # Benchmark Manhattan
    start = time.perf_counter()
    for _ in range(n_iterations):
        distances = np.sum(np.abs(X - query), axis=1)
    manhattan_time = time.perf_counter() - start

    # Benchmark Euclidean
    start = time.perf_counter()
    for _ in range(n_iterations):
        distances = np.sqrt(np.sum((X - query)**2, axis=1))
    euclidean_time = time.perf_counter() - start

    return manhattan_time, euclidean_time

def benchmark_integer_distance():
    """
    Demonstrate integer exactness of Manhattan distance.
    """
    # Integer coordinates
    p = np.array([0, 0, 0], dtype=np.int64)
    q = np.array([3, 4, 5], dtype=np.int64)

    manhattan = np.sum(np.abs(p - q))
    euclidean = np.sqrt(np.sum((p - q)**2))

    print("Integer coordinate example:")
    print(f"  Manhattan: {manhattan} (type: {type(manhattan).__name__})")
    print(f"  Euclidean: {euclidean} (type: {type(euclidean).__name__})")
    print(f"  Manhattan is integer: {isinstance(manhattan, (int, np.integer))}")
    print(f"  Euclidean is irrational: {euclidean != int(euclidean)}")

if __name__ == "__main__":
    # Run integer demonstration
    benchmark_integer_distance()

    # Run speed benchmark
    print("Speed benchmark (10K points, 100 dims, 100 iterations):")
    m_time, e_time = benchmark_distances(10000, 100, 100)
    print(f"  Manhattan: {m_time:.3f}s")
    print(f"  Euclidean: {e_time:.3f}s")
    print(f"  Manhattan is {e_time/m_time:.1f}x faster")
```

Modern CPUs support SIMD (Single Instruction, Multiple Data) operations that process multiple data elements simultaneously. Both Manhattan and Euclidean distance benefit from SIMD, but with different characteristics:
Manhattan distance SIMD optimization: the inner loop is a vectorized subtract, absolute value, and accumulate — simple, fully pipelined instructions.

Euclidean distance SIMD optimization: the subtract and multiply-accumulate steps also vectorize well, but the final square root adds a comparatively high-latency operation at the end.
The absence of square root in Manhattan distance is particularly beneficial on older or embedded architectures where the square root is not vectorized.
Both metrics have identical memory access patterns (sequential access through all dimensions), so neither has an inherent advantage in terms of cache performance. However, Manhattan distance requires slightly less CPU register pressure since it doesn't need to accumulate squares.
Consider Manhattan distance when:
• Real-time systems require predictable, fast distance computation
• Embedded systems lack efficient floating-point support
• Exact arithmetic is required for correctness (financial, cryptographic)
• Large-scale batch processing where small speedups compound
For most ML applications with GPU acceleration, the difference is negligible as both operations are highly optimized.
One of the most important practical differences between Manhattan and Euclidean distance is their sensitivity to outliers. This difference arises from how each metric aggregates component-wise differences.
Euclidean distance squares each difference before summing:
$$d_2^2 = \sum_i (p_i - q_i)^2$$
This means a single large difference dominates the total distance. If one component differs by 10 units and another by 1 unit, their squared contributions are 100 and 1, so the larger difference accounts for roughly 99% of the squared distance.
Manhattan distance sums absolute differences without amplification:
$$d_1 = \sum_i |p_i - q_i|$$
The same example: the contributions are 10 and 1, so the larger difference accounts for roughly 91% of the total.
While still dominant, the outlier's influence is proportionally less severe.
Consider two points differing by $\epsilon$ in most dimensions but by $M$ (a large outlier) in one dimension:
$$\mathbf{d} = (\epsilon, \epsilon, \ldots, \epsilon, M)$$
L² distance: $\sqrt{(n-1)\epsilon^2 + M^2} \approx M$ for large $M$
L¹ distance: $(n-1)\epsilon + M$
As $n$ grows, the L¹ distance picks up more signal from the small differences, while L² remains dominated by the outlier.
```python
import numpy as np

def compare_outlier_sensitivity():
    """
    Demonstrate how L¹ and L² respond differently to outliers.
    """
    n_dims = 100

    # Two points that are close in most dimensions
    p = np.zeros(n_dims)
    q_normal = np.ones(n_dims) * 0.1  # All dimensions differ by 0.1

    # Same point but with one outlier dimension
    q_outlier = q_normal.copy()
    q_outlier[0] = 10  # One dimension has a massive difference

    print("Normal case (all differences = 0.1):")
    print(f"  L¹ distance: {np.sum(np.abs(p - q_normal)):.2f}")
    print(f"  L² distance: {np.linalg.norm(p - q_normal):.2f}")

    print("With outlier (one difference = 10, rest = 0.1):")
    d1_outlier = np.sum(np.abs(p - q_outlier))
    d2_outlier = np.linalg.norm(p - q_outlier)
    print(f"  L¹ distance: {d1_outlier:.2f}")
    print(f"  L² distance: {d2_outlier:.2f}")

    # Compute outlier contribution
    d1_normal = np.sum(np.abs(p - q_normal))
    d2_normal = np.linalg.norm(p - q_normal)

    d1_increase = (d1_outlier - d1_normal) / d1_normal * 100
    d2_increase = (d2_outlier - d2_normal) / d2_normal * 100

    print("Percentage increase due to outlier:")
    print(f"  L¹: {d1_increase:.0f}% increase")
    print(f"  L²: {d2_increase:.0f}% increase")
    print(f"  L² is {d2_increase/d1_increase:.1f}x more sensitive")

def robust_regression_analogy():
    """
    Show connection to robust regression (L1 vs L2 loss).
    """
    print("\n" + "="*50)
    print("Connection to Regression:")
    print("="*50)
    print("• L² loss (MSE) → L² distance → Sensitive to outliers")
    print("• L¹ loss (MAE) → L¹ distance → Robust to outliers")
    print("• In both KNN and regression, L¹ is more robust!")

if __name__ == "__main__":
    compare_outlier_sensitivity()
    robust_regression_analogy()
```

This outlier robustness connects to a broader principle in statistics:
• L² minimization (mean squared error) → Mean (sensitive to outliers)
• L¹ minimization (mean absolute error) → Median (robust to outliers)
The point that minimizes the sum of squared distances to a set of points is the mean. The point that minimizes the sum of Manhattan (L¹) distances is the component-wise median; the analogous minimizer for Euclidean distances is the geometric median.
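This can be seen directly in one dimension by minimizing each loss over a grid of candidate center points; the data values below are made up for illustration:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier

def sum_sq(c: float) -> float:   # L2-style aggregate
    return float(np.sum((data - c) ** 2))

def sum_abs(c: float) -> float:  # L1-style aggregate
    return float(np.sum(np.abs(data - c)))

grid = np.linspace(0, 110, 11001)  # candidate centers, step 0.01
best_sq  = grid[np.argmin([sum_sq(c)  for c in grid])]
best_abs = grid[np.argmin([sum_abs(c) for c in grid])]

# The squared loss is minimized at the mean (dragged to 22 by the outlier);
# the absolute loss is minimized at the median (3, unaffected by the outlier).
assert abs(best_sq - data.mean()) < 0.01        # mean = 22.0
assert abs(best_abs - np.median(data)) < 0.01   # median = 3.0
```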
In KNN, using L¹ makes neighbors less affected by anomalous feature values.
Manhattan distance is the natural choice for several important problem domains. Understanding these cases helps in metric selection.
Any problem where movement is constrained to axis-parallel directions naturally uses Manhattan distance:
Warehouse robotics: Robots in warehouses typically move along aisles (north-south or east-west). The actual travel distance between two locations is the Manhattan distance, not the Euclidean distance.
Game pathfinding: In many games, characters move on grids (chess, checkers, tile-based RPGs). Manhattan distance provides the correct heuristic for A* search when diagonal movement is not allowed.
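As a sketch of the heuristic claim: on an obstacle-free grid, Manhattan distance equals the true 4-directional shortest-path length (computed here by BFS), and adding obstacles can only lengthen the true path, so the heuristic never overestimates — exactly what A* admissibility requires.

```python
from collections import deque

def bfs_grid_distance(start, goal, width, height):
    """Exact shortest-path length on an open grid with 4-directional moves."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == goal:
            return dist[(x, y)]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                frontier.append((nx, ny))
    return None  # goal unreachable

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# On an obstacle-free grid the heuristic is exact:
start, goal = (0, 0), (3, 2)
assert manhattan(start, goal) == bfs_grid_distance(start, goal, 5, 5) == 5
```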
Urban navigation: In grid-planned cities, actual travel distance (by car or on foot) closely approximates Manhattan distance.
Manhattan distance implicitly assumes features are independent and equally important. Each feature's contribution to distance is proportional only to its own deviation, not influenced by other features.
This is appropriate when:
• Features are measured in comparable (or pre-standardized) units
• Each feature carries independent semantic meaning
• Deviations in different features should combine additively
Example: Comparing products by (price, weight, rating). A $10 price difference and a 1 kg weight difference are independent concerns. Manhattan distance sums their contributions directly.
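A minimal sketch of that example; the product numbers are hypothetical, and in practice the features would first be scaled to comparable ranges:

```python
import numpy as np

# Hypothetical products as (price in $, weight in kg, rating out of 5).
# These values are illustrative only.
product_a = np.array([20.0, 1.5, 4.0])
product_b = np.array([30.0, 2.5, 3.0])

# Each feature contributes its own deviation, independently of the others.
contributions = np.abs(product_a - product_b)   # array([10., 1., 1.])
d1 = float(contributions.sum())                  # 12.0

print(dict(zip(["price", "weight", "rating"], contributions)))
assert d1 == 12.0  # note: unscaled, the price difference dominates
```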
For sparse data (many zeros), Manhattan distance can be more appropriate than Euclidean: squaring shrinks the many small coordinate differences toward zero, so L² is dominated by a few large components, while the L¹ sum gives every non-zero difference its full weight.
| Domain | Recommended Metric | Reason |
|---|---|---|
| Warehouse routing | Manhattan | Physical movement is axis-aligned |
| Image similarity | Euclidean (L²) | Continuous features, diagonal similarity matters |
| Game AI (grid movement) | Manhattan | Movement restricted to grid |
| Text documents (TF-IDF) | Cosine | Magnitude-invariant comparison |
| Sensor data with outliers | Manhattan | More robust to spikes |
| Embeddings (word2vec, etc.) | Euclidean or Cosine | Trained for these metrics |
| Integer/discrete features | Manhattan | Integer-preserving, no sqrt |
| Financial time series | Manhattan (or DTW) | Robust to extreme values |
For binary features (0/1), both Manhattan and Euclidean distance reduce to simpler forms, but Manhattan is often more interpretable:
Binary features: Manhattan distance equals the Hamming distance (number of positions where bits differ).
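A one-line check of the Hamming equivalence for binary vectors:

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])

# For 0/1 vectors, |a_i - b_i| is 1 exactly when the bits differ,
# so the L1 sum counts differing positions — the Hamming distance.
manhattan = int(np.sum(np.abs(a - b)))
hamming = int(np.sum(a != b))
assert manhattan == hamming == 2
```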
Count features: When features are counts (document word frequencies, event occurrences), Manhattan distance gives the total count difference, which is often more interpretable than the geometric distance.
There's a deep connection between L¹ distance and L¹ regularization (Lasso regression). Both promote sparsity:
If your features were learned with L¹ regularization, using L¹ distance may be geometrically consistent.
When in doubt, try both! Train KNN with Euclidean distance and Manhattan distance, then compare cross-validation performance. The difference is often small, but when it's large, it provides insight into your data's geometry.
While Manhattan distance has many strengths, it also has limitations that make it unsuitable for certain applications.
Manhattan distance is not invariant to rotations of the coordinate system. Rotating the data changes the distances between points.
Example: Points $(1, 0)$ and $(0, 1)$ have:
• L² distance $\sqrt{2} \approx 1.41$, unchanged by any rotation
• L¹ distance $2$ in the original axes, but $\sqrt{2} \approx 1.41$ after rotating the plane by 45°

The distances changed because Manhattan distance depends on the coordinate axes.
Implication: If your data's natural axes are unknown or arbitrary, Euclidean distance (which is rotation-invariant) may be more appropriate.
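A short demonstration of this rotation sensitivity, using the points $(1, 0)$ and $(0, 1)$:

```python
import numpy as np

p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])

theta = np.pi / 4  # rotate the coordinate system by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def d1(a, b): return float(np.sum(np.abs(a - b)))
def d2(a, b): return float(np.linalg.norm(a - b))

# Euclidean distance is invariant under rotation; Manhattan distance is not.
assert np.isclose(d2(p, q), d2(R @ p, R @ q))       # sqrt(2) both times
assert np.isclose(d1(p, q), 2.0)                    # before rotation
assert np.isclose(d1(R @ p, R @ q), np.sqrt(2))     # after: shrinks to sqrt(2)
```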
Manhattan distance treats diagonal movement as more expensive than Euclidean geometry suggests: stepping from $(0, 0)$ to $(1, 1)$ costs $2$ under L¹ but only $\sqrt{2} \approx 1.41$ under L².
This means diagonally-related points appear farther apart than they "should" in Euclidean space. For data where diagonal relationships are meaningful, this is a distortion.
Like all Lᵖ distances, Manhattan distance suffers from the curse of dimensionality, though with different characteristics: the contrast between nearest and farthest neighbors shrinks as dimensionality grows, although empirical studies suggest that lower values of $p$ (including $p = 1$) preserve relative contrast somewhat better than L².
The biggest practical pitfall with Manhattan distance is its dependence on the coordinate system. If you:
• Apply PCA before KNN (a rotation!)
• Have features that should naturally be combined (e.g., x and y components of a vector)
• Work in a domain where the axis choice is arbitrary
...then Manhattan distance may produce inconsistent or misleading results.
The existence of multiple shortest paths in Manhattan geometry can lead to frequent distance ties among candidate neighbors, with results that depend on how those ties are broken (often iteration or storage order).
This is usually a minor issue but can affect reproducibility.
Modern neural networks (word embeddings, image features, etc.) typically optimize for Euclidean or cosine similarity. Using Manhattan distance on these representations may not align with the geometry the network learned.
Recommendation: For learned embeddings, prefer the metric used during training, or experiment to find what works best for your downstream task.
Manhattan distance provides an alternative geometry to Euclidean distance, with distinct properties that make it ideal for certain applications. Its simplicity, computational efficiency, and robustness to outliers are valuable in many practical scenarios.
The next page introduces Minkowski distance, the generalized family that includes both Euclidean (p=2) and Manhattan (p=1) as special cases. We'll explore how varying the parameter p interpolates between these geometries and beyond, providing a unified framework for distance metrics.
You now understand Manhattan distance as a first-class alternative to Euclidean distance. You can reason about when to use L¹ vs L², understand the geometric implications of this choice, and implement either efficiently. This prepares you for the unified Minkowski framework coming next.