In our journey through K-means variants, we've seen how K-medoids achieves robustness by constraining cluster centers to actual data points. K-medians takes a different path to robustness: instead of changing what the center can be, it changes how we measure distance.
K-medians replaces the Euclidean (L2) distance of K-means with Manhattan (L1) distance, and computes cluster centers as coordinate-wise medians rather than means. This seemingly simple change has profound implications for robustness, geometry, and optimization—fundamentally altering the clustering landscape.
By the end of this page, you will master the mathematical foundations of K-medians, understand why L1 distance provides inherent outlier resistance, derive the optimality of coordinate-wise medians, and recognize when K-medians outperforms K-means in practical applications.
To understand K-medians, we must first deeply understand the difference between L1 and L2 distances and their implications for cluster center computation.
The Minkowski Distance Family:
Both L1 and L2 are special cases of the Minkowski distance (Lp norm):
$$d_p(x, y) = \left( \sum_{i=1}^{D} |x_i - y_i|^p \right)^{1/p}$$
| Property | L2 (Euclidean) | L1 (Manhattan) |
|---|---|---|
| Formula | $\sqrt{\sum_i (x_i - y_i)^2}$ | $\sum_i |x_i - y_i|$ |
| Geometric interpretation | Straight-line (crow flies) | Axis-aligned path (city blocks) |
| Unit ball shape | Circle/Sphere | Diamond/Cross-polytope |
| Optimal center | Arithmetic mean | Coordinate-wise median |
| Differentiability | Smooth everywhere | Non-differentiable at origin |
| Outlier sensitivity | High (squared terms) | Lower (linear terms) |
| Rotation invariance | Yes | No (axis-dependent) |
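The table above can be made concrete with a short sketch for the classic 3-4-5 pair of points (the helper `minkowski` is illustrative, not from any library):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# L2: straight-line distance -> sqrt(3^2 + 4^2) = 5
l2 = np.sqrt(np.sum((x - y) ** 2))

# L1: city-block distance -> |3| + |4| = 7
l1 = np.sum(np.abs(x - y))

# Both are Minkowski distances, with p = 2 and p = 1 respectively
def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
```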
The Geometric Intuition:
In Euclidean space, the set of points equidistant from the origin forms a circle (2D) or sphere (3D). In Manhattan distance, this same isodistance curve forms a diamond (2D) or octahedron (3D). This difference in geometry propagates through the entire clustering algorithm.
Consider navigating a city grid: L2 distance measures how far you'd fly in a straight line, while L1 distance measures how far you'd walk along streets and avenues. The Manhattan distance gets its name from this city-block analogy—particularly relevant in Manhattan's grid layout.
Why L1 Provides Robustness:
The key insight is in how each distance handles large deviations:
Squared L2 (K-means): a point at 10× the typical distance contributes 100× to the objective. Outliers dominate the objective.
L1 (K-medians): a point at 10× the typical distance contributes only 10× to the objective. Outliers have proportional, bounded influence.
K-means actually minimizes the squared Euclidean distance, making its outlier sensitivity even more extreme: an outlier at 10× the typical distance contributes 100× to the objective function. K-medians' linear distance metric caps that same outlier's contribution at 10×, so an outlier's influence grows only linearly, not quadratically, with its distance.
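A two-line sanity check of this asymmetry, assuming a typical point at distance 1 and an outlier at distance 10 from a candidate center:

```python
typical, outlier = 1.0, 10.0  # distances to a candidate center

# Squared L2 (K-means objective): the outlier's term is 100x the typical term
ratio_l2_squared = outlier**2 / typical**2   # 100.0

# L1 (K-medians objective): the outlier's term is only 10x the typical term
ratio_l1 = outlier / typical                 # 10.0
```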
K-medians minimizes the sum of L1 distances from each point to its assigned cluster center:
$$J_{\text{K-medians}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - c_k\|_1 = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{d=1}^{D} |x_{i,d} - c_{k,d}|$$
where $x_{i,d}$ is the $d$-th coordinate of point $x_i$, and $c_{k,d}$ is the $d$-th coordinate of center $c_k$.
Key Observation: The objective decomposes across dimensions:
$$J = \sum_{k=1}^{K} \sum_{d=1}^{D} \sum_{x_i \in C_k} |x_{i,d} - c_{k,d}|$$
This separability is crucial: each coordinate of the cluster center can be optimized independently.
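A quick numerical check of this separability (a sketch with random 2-D data): the coordinate-wise medians, each found independently, do at least as well as any jointly chosen center on a dense grid.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(7, 2))  # one cluster of 2-D points

def l1_cost(center):
    """Total L1 distance from all cluster points to a candidate center."""
    return np.sum(np.abs(cluster - center))

# Per-coordinate medians, each found independently of the other dimension
c_star = np.median(cluster, axis=0)

# Brute-force search over a joint grid: no combined choice beats
# the independently optimized coordinates
grid = np.linspace(-3, 3, 201)
best = min(l1_cost(np.array([a, b])) for a in grid for b in grid)
assert l1_cost(c_star) <= best + 1e-9
```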
Deriving the Optimal Center:
For a fixed cluster assignment, what is the optimal center $c_k$ that minimizes:
$$\sum_{x_i \in C_k} \|x_i - c_k\|_1 = \sum_{d=1}^{D} \sum_{x_i \in C_k} |x_{i,d} - c_{k,d}|$$
Since the dimensions are independent, we minimize each coordinate separately:
$$c_{k,d}^* = \arg\min_c \sum_{x_i \in C_k} |x_{i,d} - c|$$
The solution is the median.
Proof: Let $v_1 < v_2 < \ldots < v_n$ be the values of coordinate $d$ for points in cluster $k$. Consider the function:
$$f(c) = \sum_{j=1}^{n} |v_j - c|$$
This is a piecewise linear function. For $c$ in the interval $(v_j, v_{j+1})$, the slope is $f'(c) = j - (n - j)$: each of the $j$ values below $c$ contributes slope $+1$, and each of the $n - j$ values above contributes slope $-1$.
The minimum occurs where the slope changes from negative to non-negative. This happens when the number of points below equals the number above—the definition of the median.
For odd $n$: the minimum is at the middle value $v_{(n+1)/2}$. For even $n$: any value in the interval $[v_{n/2}, v_{n/2+1}]$ is optimal.
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_l1_objective():
    """
    Visualize why the median minimizes sum of absolute deviations.
    """
    # Sample cluster points (1D for visualization)
    points = np.array([1, 3, 4, 7, 15])  # Note: 15 is an outlier

    # Compute objective for range of center values
    c_range = np.linspace(-2, 20, 1000)

    # L1 objective (sum of absolute deviations)
    l1_objective = np.array([np.sum(np.abs(points - c)) for c in c_range])

    # L2 objective (sum of squared deviations)
    l2_objective = np.array([np.sum((points - c)**2) for c in c_range])

    # Optimal centers
    median = np.median(points)  # L1 optimal = 4
    mean = np.mean(points)      # L2 optimal = 6

    print(f"Points: {points}")
    print(f"Median (L1 optimal): {median}")
    print(f"Mean (L2 optimal): {mean}")
    print()
    print("Difference shows outlier influence: mean is pulled toward 15")

    # The median is exactly one of the data points (the middle one)
    # The mean is pulled toward the outlier
    return {
        'points': points,
        'median': median,
        'mean': mean,
        'l1_at_median': np.sum(np.abs(points - median)),
        'l2_at_mean': np.sum((points - mean)**2)
    }

result = visualize_l1_objective()
# Output:
# Points: [ 1  3  4  7 15]
# Median (L1 optimal): 4.0
# Mean (L2 optimal): 6.0
# Difference shows outlier influence: mean is pulled toward 15
```

The K-medians algorithm follows the same iterative structure as K-means, but with two key modifications:
Algorithm:
Input: Data X = {x_1, ..., x_n}, number of clusters K
Output: Cluster centers {c_1, ..., c_K}, assignments {z_1, ..., z_n}
1. Initialize K cluster centers (random, k-medians++, etc.)
2. Repeat until convergence:
a. Assignment: For each x_i, assign to nearest center using L1 distance:
z_i = argmin_k ||x_i - c_k||_1
b. Update: For each cluster k, update center as coordinate-wise median:
c_{k,d} = median({x_{i,d} : z_i = k}) for all dimensions d
3. Return centers and assignments
```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medians(X, K, max_iter=100, tol=1e-6, random_state=None):
    """
    K-medians clustering algorithm.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data matrix
    K : int
        Number of clusters
    max_iter : int
        Maximum number of iterations
    tol : float
        Convergence tolerance for center movement
    random_state : int, optional
        Random seed for reproducibility

    Returns
    -------
    centers : ndarray of shape (K, n_features)
        Final cluster centers (coordinate-wise medians)
    labels : ndarray of shape (n_samples,)
        Cluster assignment for each point
    inertia : float
        Total L1 distance (objective function value)
    n_iter : int
        Number of iterations until convergence
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_features = X.shape

    # Initialize centers randomly (could use k-medians++)
    initial_indices = np.random.choice(n_samples, K, replace=False)
    centers = X[initial_indices].copy()
    labels = np.zeros(n_samples, dtype=int)

    for iteration in range(max_iter):
        old_centers = centers.copy()

        # ===== ASSIGNMENT STEP =====
        # cdist with 'cityblock' metric computes Manhattan (L1) distances
        distances = cdist(X, centers, metric='cityblock')
        labels = np.argmin(distances, axis=1)

        # ===== UPDATE STEP =====
        # Compute coordinate-wise median for each cluster
        new_centers = np.zeros_like(centers)
        for k in range(K):
            cluster_mask = labels == k
            if np.any(cluster_mask):
                # Median computed independently for each dimension
                new_centers[k] = np.median(X[cluster_mask], axis=0)
            else:
                # Empty cluster: reinitialize randomly
                new_centers[k] = X[np.random.randint(n_samples)]

        centers = new_centers

        # Check convergence
        center_shift = np.sum(np.abs(centers - old_centers))
        if center_shift < tol:
            break

    # Compute final inertia (total L1 distance)
    final_distances = cdist(X, centers, metric='cityblock')
    inertia = np.sum(np.min(final_distances, axis=1))

    return centers, labels, inertia, iteration + 1

def k_medians_plusplus_init(X, K, random_state=None):
    """
    K-medians++ initialization: same as k-means++ but with L1 distance.

    The probability of selecting a point as the next center is
    proportional to its L1 distance from the nearest existing center.
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_features = X.shape
    centers = np.zeros((K, n_features))

    # First center: random
    first_idx = np.random.randint(n_samples)
    centers[0] = X[first_idx]

    for k in range(1, K):
        # Compute L1 distance to nearest center for each point
        distances = cdist(X, centers[:k], metric='cityblock')
        min_distances = np.min(distances, axis=1)

        # Probability proportional to distance
        probs = min_distances / min_distances.sum()

        # Sample next center
        next_idx = np.random.choice(n_samples, p=probs)
        centers[k] = X[next_idx]

    return centers

# Example: Compare K-means vs K-medians with outliers
def compare_kmeans_kmedians_outliers():
    np.random.seed(42)

    # Create clean clusters
    cluster1 = np.random.randn(50, 2) * 0.5 + [0, 0]
    cluster2 = np.random.randn(50, 2) * 0.5 + [5, 5]
    cluster3 = np.random.randn(50, 2) * 0.5 + [10, 0]

    # Add outliers
    outliers = np.array([[20, 20], [-10, 15], [25, -5]])
    X = np.vstack([cluster1, cluster2, cluster3, outliers])

    # K-medians
    centers_medians, labels_medians, inertia_medians, n_iter = k_medians(
        X, K=3, random_state=42
    )

    # K-means (for comparison)
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    labels_kmeans = kmeans.fit_predict(X)
    centers_kmeans = kmeans.cluster_centers_

    print("K-medians vs K-means with Outliers:")
    print(f"  K-medians centers:\n{np.round(centers_medians, 2)}")
    print(f"  K-means centers:\n{np.round(centers_kmeans, 2)}")
    print("Note: K-means centers are pulled toward outliers")

    return centers_medians, centers_kmeans

compare_kmeans_kmedians_outliers()
```

K-medians shares many convergence properties with K-means, but the non-differentiability of the L1 norm introduces some nuances.
Guaranteed Monotonic Improvement:
Like K-means, K-medians exhibits monotonic decrease of the objective function:
Assignment step: For fixed centers, assigning each point to its nearest center (L1) can only decrease or maintain the total distance.
Update step: For fixed assignments, setting centers to coordinate-wise medians minimizes the within-cluster L1 distance (proven earlier).
Since both steps are non-increasing and the objective is bounded below by 0, the algorithm must converge.
Convergence to Local Minimum:
K-medians converges to a local minimum where:
- No point can reduce the objective by switching to a different cluster center.
- Each center is the coordinate-wise median of its assigned points.
However, this is generally not the global minimum. The objective landscape has many local minima, especially with L1 distance.
The absolute value function |x| is non-differentiable at x = 0. This means gradient-based optimization methods cannot be directly applied to K-medians. The coordinate-wise median update is a closed-form solution that circumvents this issue, but it limits the optimization techniques available for improvement.
Comparison to K-means Convergence:
| Property | K-means | K-medians |
|---|---|---|
| Monotonic improvement | ✅ Guaranteed | ✅ Guaranteed |
| Convergence to local min | ✅ Guaranteed | ✅ Guaranteed |
| Global optimum | ❌ Not guaranteed | ❌ Not guaranteed |
| Convergence speed | Fast (typically) | Similar |
| Number of iterations | Usually 10-100 | Usually 10-100 |
| Differentiable objective | ✅ Yes | ❌ No |
Uniqueness of Solution:
The median is unique for odd cluster sizes but may be any value in an interval for even sizes. This can cause oscillation between equally optimal centers if not handled carefully. In practice, implementations break the tie deterministically by using the lower median, the upper median, or the midpoint of the interval for even-sized clusters.
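NumPy's `np.median` resolves the even-size ambiguity by returning the midpoint of the two middle values; the sketch below confirms that every point in that interval gives the same L1 cost:

```python
import numpy as np

vals_odd = np.array([1, 3, 4, 7, 15])
vals_even = np.array([1, 3, 4, 7])

print(np.median(vals_odd))   # 4.0 -> the unique middle value
print(np.median(vals_even))  # 3.5 -> midpoint of [3, 4]

# Every c in [3, 4] gives the same sum of absolute deviations:
for c in [3.0, 3.5, 4.0]:
    print(np.sum(np.abs(vals_even - c)))  # 7.0 each time
```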
Sensitivity to Initialization:
Like K-means, K-medians is sensitive to initialization. Use:
- Multiple random restarts, keeping the run with the lowest L1 inertia.
- K-medians++ initialization (k-means++ seeding with L1 distance), which spreads initial centers apart.
```python
import numpy as np
from scipy.spatial.distance import cdist

def track_k_medians_convergence(X, K, max_iter=50, random_state=42):
    """
    K-medians with detailed convergence tracking.
    Returns history of objective values and center movements.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Initialize
    initial_indices = np.random.choice(n_samples, K, replace=False)
    centers = X[initial_indices].copy()

    # Track history
    objective_history = []
    center_history = [centers.copy()]

    for iteration in range(max_iter):
        # Assignment
        distances = cdist(X, centers, metric='cityblock')
        labels = np.argmin(distances, axis=1)

        # Record objective
        inertia = np.sum(np.min(distances, axis=1))
        objective_history.append(inertia)

        # Update (coordinate-wise median)
        new_centers = np.zeros_like(centers)
        for k in range(K):
            mask = labels == k
            if np.any(mask):
                new_centers[k] = np.median(X[mask], axis=0)
            else:
                new_centers[k] = X[np.random.randint(n_samples)]

        center_history.append(new_centers.copy())

        # Check convergence
        center_shift = np.sum(np.abs(centers - new_centers))
        centers = new_centers
        if center_shift < 1e-6:
            break

    return {
        'final_centers': centers,
        'final_labels': labels,
        'objective_history': objective_history,
        'center_history': center_history,
        'converged_at': iteration + 1
    }

# Demonstrate convergence
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) * 0.8 + [0, 0],
    np.random.randn(100, 2) * 0.8 + [4, 4],
    np.random.randn(100, 2) * 0.8 + [8, 0]
])

result = track_k_medians_convergence(X, K=3)

print("K-medians Convergence Analysis:")
print(f"  Converged after {result['converged_at']} iterations")
print(f"  Objective history (first 10): {[f'{v:.2f}' for v in result['objective_history'][:10]]}")
print(f"  Initial objective: {result['objective_history'][0]:.2f}")
print(f"  Final objective: {result['objective_history'][-1]:.2f}")
print(f"  Improvement: {(1 - result['objective_history'][-1]/result['objective_history'][0])*100:.1f}%")
```

K-medians has similar computational complexity to K-means, with one key difference in the update step.
Time Complexity per Iteration:
Assignment Step: O(nKD)
Update Step: O(nD log n) — slightly more expensive than K-means
Total per Iteration: O(nKD + nD log n) ≈ O(nKD) for typical K
Comparison to K-means:
| Operation | K-means | K-medians |
|---|---|---|
| Assignment | O(nKD) | O(nKD) |
| Update | O(nD) | O(nD log n) |
| Total | O(nKD) | O(nKD + nD log n) |
The Median Computation Overhead:
K-means computes means with a simple linear pass:
mean = sum(values) / len(values) # O(n)
K-medians requires finding the median, which is slightly more complex:
Option 1: Sorting — O(n log n)
sorted_values = sorted(values)
median = sorted_values[len(values) // 2]
Option 2: QuickSelect — O(n) expected, O(n²) worst case. A randomized selection algorithm that finds the k-th smallest element without fully sorting.
Option 3: Median-of-medians — O(n) guaranteed. Deterministic linear-time selection, but with larger constants.
In practice, sorting is often used because:
- It is simple and relies on heavily optimized, cache-friendly library routines.
- Cluster sizes are usually small enough that the O(n log n) overhead is negligible.
- NumPy's `np.median` already avoids the full sort internally, using partition-based selection.
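The two main routes can be compared directly in NumPy; `np.partition` uses introselect (a QuickSelect variant with a deterministic fallback), so the selection-based line runs in expected linear time:

```python
import numpy as np

values = np.random.default_rng(42).normal(size=10_001)  # odd length
k = len(values) // 2

# O(n log n): full sort, then take the middle element
median_sorted = np.sort(values)[k]

# O(n) expected: partial selection places the k-th smallest at index k
median_selected = np.partition(values, k)[k]

assert median_sorted == median_selected
```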
In practice, K-medians typically runs within 1.5-2× the time of K-means on the same data. The median computation overhead is dominated by the assignment step for large K or high-dimensional data. For most applications, this difference is negligible.
```python
import numpy as np
import time
from sklearn.cluster import KMeans

def k_medians_benchmark(X, K, max_iter=100):
    """K-medians with timing."""
    from scipy.spatial.distance import cdist

    n_samples, n_features = X.shape
    centers = X[np.random.choice(n_samples, K, replace=False)]

    for _ in range(max_iter):
        distances = cdist(X, centers, metric='cityblock')
        labels = np.argmin(distances, axis=1)

        new_centers = np.zeros_like(centers)
        for k in range(K):
            mask = labels == k
            if np.any(mask):
                new_centers[k] = np.median(X[mask], axis=0)
            else:
                new_centers[k] = X[np.random.randint(n_samples)]

        if np.allclose(centers, new_centers):
            break
        centers = new_centers

    return centers, labels

def benchmark_complexity():
    """Compare K-means and K-medians runtime scaling."""
    results = []

    for n in [1000, 5000, 10000, 50000]:
        for D in [2, 10, 50]:
            for K in [3, 10]:
                X = np.random.randn(n, D)

                # K-means
                start = time.time()
                kmeans = KMeans(n_clusters=K, max_iter=100, n_init=1)
                kmeans.fit(X)
                kmeans_time = time.time() - start

                # K-medians
                np.random.seed(42)
                start = time.time()
                k_medians_benchmark(X, K)
                kmedians_time = time.time() - start

                ratio = kmedians_time / kmeans_time
                results.append({
                    'n': n, 'D': D, 'K': K,
                    'kmeans_time': kmeans_time,
                    'kmedians_time': kmedians_time,
                    'ratio': ratio
                })

                print(f"n={n:5d}, D={D:2d}, K={K:2d} | "
                      f"K-means: {kmeans_time:.3f}s | "
                      f"K-medians: {kmedians_time:.3f}s | "
                      f"Ratio: {ratio:.2f}x")

    return results

# Example output (will vary by machine):
# n= 1000, D= 2, K= 3 | K-means: 0.012s | K-medians: 0.018s | Ratio: 1.50x
# n= 5000, D=10, K=10 | K-means: 0.089s | K-medians: 0.142s | Ratio: 1.60x
# n=50000, D=50, K=10 | K-means: 1.234s | K-medians: 1.891s | Ratio: 1.53x
```

Understanding the trade-offs between K-medians and other clustering variants helps you choose the right algorithm for your specific problem.
Both algorithms provide outlier robustness, but via different mechanisms. K-medians uses L1 distance with virtual centers (medians). K-medoids uses any distance metric but constrains centers to actual data points. K-medoids offers more flexibility in distance choice and interpretable cluster exemplars, while K-medians is computationally simpler and faster for large datasets.
Domain-Specific Applications:
1. Image Processing: L1 distance (Sum of Absolute Differences, SAD) is widely used in video compression and motion estimation because it's faster to compute and more robust to illumination changes.
2. Compressed Sensing: L1 minimization promotes sparsity, making K-medians natural for clustering in sparse signal recovery applications.
3. Robust Statistics: When data may be contaminated with outliers, L1-based methods provide consistent estimation even with up to 50% contamination.
4. Financial Data: Stock returns often have heavy tails. K-medians can cluster assets more meaningfully than K-means, which may be distorted by extreme market events.
5. Urban Planning: For facility location problems in grid-based cities, L1 distance accurately models actual travel distances.
While K-medians is less commonly implemented in standard libraries compared to K-means, several options exist for production use.
```python
import numpy as np
from scipy.spatial.distance import cdist

# ============================================
# Option 1: Custom Implementation (Recommended for learning)
# ============================================

class KMedians:
    """
    Production-ready K-medians implementation.

    Parameters
    ----------
    n_clusters : int
        Number of clusters
    max_iter : int
        Maximum iterations
    n_init : int
        Number of random initializations
    tol : float
        Convergence tolerance
    random_state : int, optional
        Random seed
    """

    def __init__(self, n_clusters=8, max_iter=300, n_init=10,
                 tol=1e-4, random_state=None):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.n_init = n_init
        self.tol = tol
        self.random_state = random_state

    def _init_centers(self, X):
        """K-medians++ initialization."""
        n_samples = len(X)
        centers = np.zeros((self.n_clusters, X.shape[1]))

        # First center: random
        centers[0] = X[np.random.randint(n_samples)]

        for k in range(1, self.n_clusters):
            distances = cdist(X, centers[:k], metric='cityblock')
            min_dist = np.min(distances, axis=1)
            probs = min_dist / min_dist.sum()
            centers[k] = X[np.random.choice(n_samples, p=probs)]

        return centers

    def _single_run(self, X):
        """Single K-medians run."""
        n_samples = len(X)
        centers = self._init_centers(X)

        for iteration in range(self.max_iter):
            # Assignment
            distances = cdist(X, centers, metric='cityblock')
            labels = np.argmin(distances, axis=1)

            # Update
            new_centers = np.zeros_like(centers)
            for k in range(self.n_clusters):
                mask = labels == k
                if np.any(mask):
                    new_centers[k] = np.median(X[mask], axis=0)
                else:
                    new_centers[k] = X[np.random.randint(n_samples)]

            # Convergence check
            shift = np.sum(np.abs(centers - new_centers))
            centers = new_centers
            if shift < self.tol:
                break

        inertia = np.sum(np.min(cdist(X, centers, metric='cityblock'), axis=1))
        return centers, labels, inertia

    def fit(self, X):
        """Fit K-medians to data."""
        if self.random_state is not None:
            np.random.seed(self.random_state)

        best_inertia = np.inf
        for _ in range(self.n_init):
            centers, labels, inertia = self._single_run(X)
            if inertia < best_inertia:
                best_inertia = inertia
                self.cluster_centers_ = centers
                self.labels_ = labels
                self.inertia_ = inertia

        return self

    def predict(self, X):
        """Predict cluster labels for new data."""
        distances = cdist(X, self.cluster_centers_, metric='cityblock')
        return np.argmin(distances, axis=1)

    def fit_predict(self, X):
        """Fit and return labels."""
        self.fit(X)
        return self.labels_

# ============================================
# Option 2: Using PyClustering library
# ============================================

# pip install pyclustering
try:
    from pyclustering.cluster.kmedians import kmedians
    from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

    def pyclustering_kmedians(X, K):
        """K-medians using PyClustering library."""
        # Initialize with k-means++
        initial_centers = kmeans_plusplus_initializer(
            X.tolist(), K
        ).initialize()

        # Run k-medians
        kmedians_instance = kmedians(X.tolist(), initial_centers)
        kmedians_instance.process()

        clusters = kmedians_instance.get_clusters()
        centers = np.array(kmedians_instance.get_medians())

        # Convert cluster lists to labels
        labels = np.zeros(len(X), dtype=int)
        for k, cluster in enumerate(clusters):
            for idx in cluster:
                labels[idx] = k

        return centers, labels

except ImportError:
    pass

# ============================================
# Example usage
# ============================================

# Generate sample data
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) * 0.5 + [0, 0],
    np.random.randn(100, 2) * 0.5 + [4, 4],
    np.random.randn(100, 2) * 0.5 + [8, 0],
])

# Add outliers
X = np.vstack([X, [[15, 15], [-5, 10], [20, -5]]])

# Fit K-medians
kmedians = KMedians(n_clusters=3, random_state=42)
kmedians.fit(X)

print("K-medians Results:")
print(f"  Cluster centers:\n{np.round(kmedians.cluster_centers_, 3)}")
print(f"  Inertia (L1): {kmedians.inertia_:.2f}")
print(f"  Cluster sizes: {np.bincount(kmedians.labels_)}")
```

We've comprehensively explored K-medians and seen how a simple change of distance metric fundamentally alters clustering behavior.
Looking Ahead:
The next page explores Fuzzy C-means, which takes a fundamentally different approach: instead of hard cluster assignments, each point belongs to all clusters with varying degrees of membership. This soft clustering paradigm enables modeling of overlapping clusters and provides richer information about cluster structure.
You now have a thorough understanding of K-medians clustering—from its mathematical foundations in L1 geometry to practical implementation details. You can now make informed decisions about when to use K-medians over K-means based on your data characteristics and robustness requirements.