In our journey through K-means variants, we've seen how K-medoids achieves robustness by constraining cluster centers to actual data points. K-medians takes a different path to robustness: instead of changing what the center can be, it changes how we measure distance.
K-medians replaces the Euclidean (L2) distance of K-means with Manhattan (L1) distance, and computes cluster centers as coordinate-wise medians rather than means. This seemingly simple change has profound implications for robustness, geometry, and optimization—fundamentally altering the clustering landscape.
By the end of this page, you will master the mathematical foundations of K-medians, understand why L1 distance provides inherent outlier resistance, derive the optimality of coordinate-wise medians, and recognize when K-medians outperforms K-means in practical applications.
To understand K-medians, we must first deeply understand the difference between L1 and L2 distances and their implications for cluster center computation.
The Minkowski Distance Family:
Both L1 and L2 are special cases of the Minkowski distance (Lp norm):
$$d_p(x, y) = \left( \sum_{i=1}^{D} |x_i - y_i|^p \right)^{1/p}$$
| Property | L2 (Euclidean) | L1 (Manhattan) |
|---|---|---|
| Formula | $\sqrt{\sum_i (x_i - y_i)^2}$ | $\sum_i |x_i - y_i|$ |
| Geometric interpretation | Straight-line (crow flies) | Axis-aligned path (city blocks) |
| Unit ball shape | Circle/Sphere | Diamond/Cross-polytope |
| Optimal center | Arithmetic mean | Coordinate-wise median |
| Differentiability | Smooth everywhere | Non-differentiable at origin |
| Outlier sensitivity | High (squared terms) | Lower (linear terms) |
| Rotation invariance | Yes | No (axis-dependent) |
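The table above can be made concrete with a short sketch for the classic 3-4-5 pair of points (the helper `minkowski` is illustrative, not from any library):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# L2: straight-line distance -> sqrt(3^2 + 4^2) = 5
l2 = np.sqrt(np.sum((x - y) ** 2))

# L1: city-block distance -> |3| + |4| = 7
l1 = np.sum(np.abs(x - y))

# Both are Minkowski distances, with p = 2 and p = 1 respectively
def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
```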
The Geometric Intuition:
In Euclidean space, the set of points equidistant from the origin forms a circle (2D) or sphere (3D). In Manhattan distance, this same isodistance curve forms a diamond (2D) or octahedron (3D). This difference in geometry propagates through the entire clustering algorithm.
Consider navigating a city grid: L2 distance measures how far you'd fly in a straight line, while L1 distance measures how far you'd walk along streets and avenues. The Manhattan distance gets its name from this city-block analogy—particularly relevant in Manhattan's grid layout.
Why L1 Provides Robustness:
The key insight is in how each distance handles large deviations:
Squared L2 (K-means): a point at 10× the typical distance contributes 100× to the objective. Outliers dominate the objective.
L1 (K-medians): a point at 10× the typical distance contributes only 10× to the objective. Outliers have proportional, bounded influence.
K-means actually minimizes the squared Euclidean distance, making its outlier sensitivity even more extreme: an outlier at 10× the typical distance contributes 100× to the objective function. K-medians' linear distance metric caps that same outlier's contribution at 10×, so an outlier's influence grows only linearly, not quadratically, with its distance.
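A two-line sanity check of this asymmetry, assuming a typical point at distance 1 and an outlier at distance 10 from a candidate center:

```python
typical, outlier = 1.0, 10.0  # distances to a candidate center

# Squared L2 (K-means objective): the outlier's term is 100x the typical term
ratio_l2_squared = outlier**2 / typical**2   # 100.0

# L1 (K-medians objective): the outlier's term is only 10x the typical term
ratio_l1 = outlier / typical                 # 10.0
```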
K-medians minimizes the sum of L1 distances from each point to its assigned cluster center:
$$J_{\text{K-medians}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - c_k\|_1 = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{d=1}^{D} |x_{i,d} - c_{k,d}|$$
where $x_{i,d}$ is the $d$-th coordinate of point $x_i$, and $c_{k,d}$ is the $d$-th coordinate of center $c_k$.
Key Observation: The objective decomposes across dimensions:
$$J = \sum_{k=1}^{K} \sum_{d=1}^{D} \sum_{x_i \in C_k} |x_{i,d} - c_{k,d}|$$
This separability is crucial: each coordinate of the cluster center can be optimized independently.
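A quick numerical check of this separability (a sketch with random 2-D data): the coordinate-wise medians, each found independently, do at least as well as any jointly chosen center on a dense grid.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(7, 2))  # one cluster of 2-D points

def l1_cost(center):
    """Total L1 distance from all cluster points to a candidate center."""
    return np.sum(np.abs(cluster - center))

# Per-coordinate medians, each found independently of the other dimension
c_star = np.median(cluster, axis=0)

# Brute-force search over a joint grid: no combined choice beats
# the independently optimized coordinates
grid = np.linspace(-3, 3, 201)
best = min(l1_cost(np.array([a, b])) for a in grid for b in grid)
assert l1_cost(c_star) <= best + 1e-9
```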
Deriving the Optimal Center:
For a fixed cluster assignment, what is the optimal center $c_k$ that minimizes:
$$\sum_{x_i \in C_k} \|x_i - c_k\|_1 = \sum_{d=1}^{D} \sum_{x_i \in C_k} |x_{i,d} - c_{k,d}|$$
Since the dimensions are independent, we minimize each coordinate separately:
$$c_{k,d}^* = \arg\min_c \sum_{x_i \in C_k} |x_{i,d} - c|$$
The solution is the median.
Proof: Let $v_1 < v_2 < \ldots < v_n$ be the values of coordinate $d$ for points in cluster $k$. Consider the function:
$$f(c) = \sum_{j=1}^{n} |v_j - c|$$
This is a piecewise linear function. For $c$ in the interval $(v_j, v_{j+1})$, the slope is $f'(c) = j - (n - j)$: each of the $j$ values below $c$ contributes slope $+1$, and each of the $n - j$ values above contributes slope $-1$.
The minimum occurs where the slope changes from negative to non-negative. This happens when the number of points below equals the number above—the definition of the median.
For odd $n$: the minimum is at the middle value $v_{(n+1)/2}$. For even $n$: any value in the interval $[v_{n/2}, v_{n/2+1}]$ is optimal.
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_l1_objective():
    """
    Visualize why the median minimizes sum of absolute deviations.
    """
    # Sample cluster points (1D for visualization)
    points = np.array([1, 3, 4, 7, 15])  # Note: 15 is an outlier

    # Compute objective for range of center values
    c_range = np.linspace(-2, 20, 1000)

    # L1 objective (sum of absolute deviations)
    l1_objective = np.array([np.sum(np.abs(points - c)) for c in c_range])

    # L2 objective (sum of squared deviations)
    l2_objective = np.array([np.sum((points - c)**2) for c in c_range])

    # Optimal centers
    median = np.median(points)  # L1 optimal = 4
    mean = np.mean(points)      # L2 optimal = 6

    print(f"Points: {points}")
    print(f"Median (L1 optimal): {median}")
    print(f"Mean (L2 optimal): {mean}")
    print()
    print("Difference shows outlier influence: mean is pulled toward 15")

    # The median is exactly one of the data points (the middle one)
    # The mean is pulled toward the outlier
    return {
        'points': points,
        'median': median,
        'mean': mean,
        'l1_at_median': np.sum(np.abs(points - median)),
        'l2_at_mean': np.sum((points - mean)**2)
    }

result = visualize_l1_objective()
# Output:
# Points: [ 1  3  4  7 15]
# Median (L1 optimal): 4.0
# Mean (L2 optimal): 6.0
# Difference shows outlier influence: mean is pulled toward 15
```

The K-medians algorithm follows the same iterative structure as K-means, but with two key modifications:
Algorithm:
Input: Data X = {x_1, ..., x_n}, number of clusters K
Output: Cluster centers {c_1, ..., c_K}, assignments {z_1, ..., z_n}
1. Initialize K cluster centers (random, k-medians++, etc.)
2. Repeat until convergence:
a. Assignment: For each x_i, assign to nearest center using L1 distance:
z_i = argmin_k ||x_i - c_k||_1
b. Update: For each cluster k, update center as coordinate-wise median:
c_{k,d} = median({x_{i,d} : z_i = k}) for all dimensions d
3. Return centers and assignments
```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medians(X, K, max_iter=100, tol=1e-6, random_state=None):
    """
    K-medians clustering algorithm.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data matrix
    K : int
        Number of clusters
    max_iter : int
        Maximum number of iterations
    tol : float
        Convergence tolerance for center movement
    random_state : int, optional
        Random seed for reproducibility

    Returns
    -------
    centers : ndarray of shape (K, n_features)
        Final cluster centers (coordinate-wise medians)
    labels : ndarray of shape (n_samples,)
        Cluster assignment for each point
    inertia : float
        Total L1 distance (objective function value)
    n_iter : int
        Number of iterations until convergence
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_features = X.shape

    # Initialize centers randomly (could use k-medians++)
    initial_indices = np.random.choice(n_samples, K, replace=False)
    centers = X[initial_indices].copy()
    labels = np.zeros(n_samples, dtype=int)

    for iteration in range(max_iter):
        old_centers = centers.copy()

        # ===== ASSIGNMENT STEP =====
        # cdist with 'cityblock' metric computes Manhattan (L1) distances
        distances = cdist(X, centers, metric='cityblock')
        labels = np.argmin(distances, axis=1)

        # ===== UPDATE STEP =====
        # Compute coordinate-wise median for each cluster
        new_centers = np.zeros_like(centers)
        for k in range(K):
            cluster_mask = labels == k
            if np.any(cluster_mask):
                # Median computed independently for each dimension
                new_centers[k] = np.median(X[cluster_mask], axis=0)
            else:
                # Empty cluster: reinitialize randomly
                new_centers[k] = X[np.random.randint(n_samples)]

        centers = new_centers

        # Check convergence
        center_shift = np.sum(np.abs(centers - old_centers))
        if center_shift < tol:
            break

    # Compute final inertia (total L1 distance)
    final_distances = cdist(X, centers, metric='cityblock')
    inertia = np.sum(np.min(final_distances, axis=1))

    return centers, labels, inertia, iteration + 1

def k_medians_plusplus_init(X, K, random_state=None):
    """
    K-medians++ initialization: same as k-means++ but with L1 distance.

    The probability of selecting a point as the next center is
    proportional to its L1 distance from the nearest existing center.
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_features = X.shape
    centers = np.zeros((K, n_features))

    # First center: random
    first_idx = np.random.randint(n_samples)
    centers[0] = X[first_idx]

    for k in range(1, K):
        # Compute L1 distance to nearest center for each point
        distances = cdist(X, centers[:k], metric='cityblock')
        min_distances = np.min(distances, axis=1)

        # Probability proportional to distance
        probs = min_distances / min_distances.sum()

        # Sample next center
        next_idx = np.random.choice(n_samples, p=probs)
        centers[k] = X[next_idx]

    return centers

# Example: Compare K-means vs K-medians with outliers
def compare_kmeans_kmedians_outliers():
    np.random.seed(42)

    # Create clean clusters
    cluster1 = np.random.randn(50, 2) * 0.5 + [0, 0]
    cluster2 = np.random.randn(50, 2) * 0.5 + [5, 5]
    cluster3 = np.random.randn(50, 2) * 0.5 + [10, 0]

    # Add outliers
    outliers = np.array([[20, 20], [-10, 15], [25, -5]])
    X = np.vstack([cluster1, cluster2, cluster3, outliers])

    # K-medians
    centers_medians, labels_medians, inertia_medians, n_iter = k_medians(
        X, K=3, random_state=42
    )

    # K-means (for comparison)
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    labels_kmeans = kmeans.fit_predict(X)
    centers_kmeans = kmeans.cluster_centers_

    print("K-medians vs K-means with Outliers:")
    print(f"  K-medians centers:\n{np.round(centers_medians, 2)}")
    print(f"  K-means centers:\n{np.round(centers_kmeans, 2)}")
    print("Note: K-means centers are pulled toward outliers")

    return centers_medians, centers_kmeans

compare_kmeans_kmedians_outliers()
```

K-medians shares many convergence properties with K-means, but the non-differentiability of the L1 norm introduces some nuances.
Guaranteed Monotonic Improvement:
Like K-means, K-medians exhibits monotonic decrease of the objective function:
Assignment step: For fixed centers, assigning each point to its nearest center (L1) can only decrease or maintain the total distance.
Update step: For fixed assignments, setting centers to coordinate-wise medians minimizes the within-cluster L1 distance (proven earlier).
Since both steps are non-increasing and the objective is bounded below by 0, the algorithm must converge.
Convergence to Local Minimum:
K-medians converges to a local minimum where:
- No point can reduce the objective by switching to a different cluster center.
- Each center is the coordinate-wise median of its assigned points.
However, this is generally not the global minimum. The objective landscape has many local minima, especially with L1 distance.
The absolute value function |x| is non-differentiable at x = 0. This means gradient-based optimization methods cannot be directly applied to K-medians. The coordinate-wise median update is a closed-form solution that circumvents this issue, but it limits the optimization techniques available for improvement.
Comparison to K-means Convergence:
| Property | K-means | K-medians |
|---|---|---|
| Monotonic improvement | ✅ Guaranteed | ✅ Guaranteed |
| Convergence to local min | ✅ Guaranteed | ✅ Guaranteed |
| Global optimum | ❌ Not guaranteed | ❌ Not guaranteed |
| Convergence speed | Fast (typically) | Similar |
| Number of iterations | Usually 10-100 | Usually 10-100 |
| Differentiable objective | ✅ Yes | ❌ No |
Uniqueness of Solution:
The median is unique for odd cluster sizes but may be any value in an interval for even sizes. This can cause oscillation between equally optimal centers if not handled carefully. In practice, implementations break the tie deterministically by using the lower median, the upper median, or the midpoint of the interval for even-sized clusters.
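NumPy's `np.median` resolves the even-size ambiguity by returning the midpoint of the two middle values; the sketch below confirms that every point in that interval gives the same L1 cost:

```python
import numpy as np

vals_odd = np.array([1, 3, 4, 7, 15])
vals_even = np.array([1, 3, 4, 7])

print(np.median(vals_odd))   # 4.0 -> the unique middle value
print(np.median(vals_even))  # 3.5 -> midpoint of [3, 4]

# Every c in [3, 4] gives the same sum of absolute deviations:
for c in [3.0, 3.5, 4.0]:
    print(np.sum(np.abs(vals_even - c)))  # 7.0 each time
```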
Sensitivity to Initialization:
Like K-means, K-medians is sensitive to initialization. Use:
- Multiple random restarts, keeping the run with the lowest L1 inertia.
- K-medians++ initialization (k-means++ seeding with L1 distance), which spreads initial centers apart.
```python
import numpy as np
from scipy.spatial.distance import cdist

def track_k_medians_convergence(X, K, max_iter=50, random_state=42):
    """
    K-medians with detailed convergence tracking.
    Returns history of objective values and center movements.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Initialize
    initial_indices = np.random.choice(n_samples, K, replace=False)
    centers = X[initial_indices].copy()

    # Track history
    objective_history = []
    center_history = [centers.copy()]

    for iteration in range(max_iter):
        # Assignment
        distances = cdist(X, centers, metric='cityblock')
        labels = np.argmin(distances, axis=1)

        # Record objective
        inertia = np.sum(np.min(distances, axis=1))
        objective_history.append(inertia)

        # Update (coordinate-wise median)
        new_centers = np.zeros_like(centers)
        for k in range(K):
            mask = labels == k
            if np.any(mask):
                new_centers[k] = np.median(X[mask], axis=0)
            else:
                new_centers[k] = X[np.random.randint(n_samples)]

        center_history.append(new_centers.copy())

        # Check convergence
        center_shift = np.sum(np.abs(centers - new_centers))
        centers = new_centers
        if center_shift < 1e-6:
            break

    return {
        'final_centers': centers,
        'final_labels': labels,
        'objective_history': objective_history,
        'center_history': center_history,
        'converged_at': iteration + 1
    }

# Demonstrate convergence
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) * 0.8 + [0, 0],
    np.random.randn(100, 2) * 0.8 + [4, 4],
    np.random.randn(100, 2) * 0.8 + [8, 0]
])

result = track_k_medians_convergence(X, K=3)

print("K-medians Convergence Analysis:")
print(f"  Converged after {result['converged_at']} iterations")
print(f"  Objective history (first 10): {[f'{v:.2f}' for v in result['objective_history'][:10]]}")
print(f"  Initial objective: {result['objective_history'][0]:.2f}")
print(f"  Final objective: {result['objective_history'][-1]:.2f}")
print(f"  Improvement: {(1 - result['objective_history'][-1]/result['objective_history'][0])*100:.1f}%")
```

K-medians has similar computational complexity to K-means, with one key difference in the update step.
Time Complexity per Iteration:
Assignment Step: O(nKD)
Update Step: O(nD log n) — slightly more expensive than K-means
Total per Iteration: O(nKD + nD log n) ≈ O(nKD) for typical K
Comparison to K-means:
| Operation | K-means | K-medians |
|---|---|---|
| Assignment | O(nKD) | O(nKD) |
| Update | O(nD) | O(nD log n) |
| Total | O(nKD) | O(nKD + nD log n) |
The Median Computation Overhead:
K-means computes means with a simple linear pass:
mean = sum(values) / len(values) # O(n)
K-medians requires finding the median, which is slightly more complex:
Option 1: Sorting — O(n log n)
sorted_values = sorted(values)
median = sorted_values[len(values) // 2]
Option 2: QuickSelect — O(n) expected, O(n²) worst case. A randomized selection algorithm that finds the k-th smallest element without fully sorting.
Option 3: Median-of-medians — O(n) guaranteed. Deterministic linear-time selection, but with larger constants.
In practice, sorting is often used because:
- It is simple and relies on heavily optimized, cache-friendly library routines.
- Cluster sizes are usually small enough that the O(n log n) overhead is negligible.
- NumPy's `np.median` already avoids the full sort internally, using partition-based selection.
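The two main routes can be compared directly in NumPy; `np.partition` uses introselect (a QuickSelect variant with a deterministic fallback), so the selection-based line runs in expected linear time:

```python
import numpy as np

values = np.random.default_rng(42).normal(size=10_001)  # odd length
k = len(values) // 2

# O(n log n): full sort, then take the middle element
median_sorted = np.sort(values)[k]

# O(n) expected: partial selection places the k-th smallest at index k
median_selected = np.partition(values, k)[k]

assert median_sorted == median_selected
```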
In practice, K-medians typically runs within 1.5-2× the time of K-means on the same data. The median computation overhead is dominated by the assignment step for large K or high-dimensional data. For most applications, this difference is negligible.
```python
import numpy as np
import time
from sklearn.cluster import KMeans

def k_medians_benchmark(X, K, max_iter=100):
    """K-medians with timing."""
    from scipy.spatial.distance import cdist

    n_samples, n_features = X.shape
    centers = X[np.random.choice(n_samples, K, replace=False)]

    for _ in range(max_iter):
        distances = cdist(X, centers, metric='cityblock')
        labels = np.argmin(distances, axis=1)

        new_centers = np.zeros_like(centers)
        for k in range(K):
            mask = labels == k
            if np.any(mask):
                new_centers[k] = np.median(X[mask], axis=0)
            else:
                new_centers[k] = X[np.random.randint(n_samples)]

        if np.allclose(centers, new_centers):
            break
        centers = new_centers

    return centers, labels

def benchmark_complexity():
    """Compare K-means and K-medians runtime scaling."""
    results = []

    for n in [1000, 5000, 10000, 50000]:
        for D in [2, 10, 50]:
            for K in [3, 10]:
                X = np.random.randn(n, D)

                # K-means
                start = time.time()
                kmeans = KMeans(n_clusters=K, max_iter=100, n_init=1)
                kmeans.fit(X)
                kmeans_time = time.time() - start

                # K-medians
                np.random.seed(42)
                start = time.time()
                k_medians_benchmark(X, K)
                kmedians_time = time.time() - start

                ratio = kmedians_time / kmeans_time
                results.append({
                    'n': n, 'D': D, 'K': K,
                    'kmeans_time': kmeans_time,
                    'kmedians_time': kmedians_time,
                    'ratio': ratio
                })

                print(f"n={n:5d}, D={D:2d}, K={K:2d} | "
                      f"K-means: {kmeans_time:.3f}s | "
                      f"K-medians: {kmedians_time:.3f}s | "
                      f"Ratio: {ratio:.2f}x")

    return results

# Example output (will vary by machine):
# n= 1000, D= 2, K= 3 | K-means: 0.012s | K-medians: 0.018s | Ratio: 1.50x
# n= 5000, D=10, K=10 | K-means: 0.089s | K-medians: 0.142s | Ratio: 1.60x
# n=50000, D=50, K=10 | K-means: 1.234s | K-medians: 1.891s | Ratio: 1.53x
```

Understanding the trade-offs between K-medians and other clustering variants helps you choose the right algorithm for your specific problem.
Both algorithms provide outlier robustness, but via different mechanisms. K-medians uses L1 distance with virtual centers (medians). K-medoids uses any distance metric but constrains centers to actual data points. K-medoids offers more flexibility in distance choice and interpretable cluster exemplars, while K-medians is computationally simpler and faster for large datasets.
Domain-Specific Applications:
1. Image Processing: L1 distance (Sum of Absolute Differences, SAD) is widely used in video compression and motion estimation because it's faster to compute and more robust to illumination changes.
2. Compressed Sensing: L1 minimization promotes sparsity, making K-medians natural for clustering in sparse signal recovery applications.
3. Robust Statistics: When data may be contaminated with outliers, L1-based methods provide consistent estimation even with up to 50% contamination.
4. Financial Data: Stock returns often have heavy tails. K-medians can cluster assets more meaningfully than K-means, which may be distorted by extreme market events.
5. Urban Planning: For facility location problems in grid-based cities, L1 distance accurately models actual travel distances.
While K-medians is less commonly implemented in standard libraries compared to K-means, several options exist for production use.
```python
import numpy as np
from scipy.spatial.distance import cdist

# ============================================
# Option 1: Custom Implementation (Recommended for learning)
# ============================================

class KMedians:
    """
    Production-ready K-medians implementation.

    Parameters
    ----------
    n_clusters : int
        Number of clusters
    max_iter : int
        Maximum iterations
    n_init : int
        Number of random initializations
    tol : float
        Convergence tolerance
    random_state : int, optional
        Random seed
    """

    def __init__(self, n_clusters=8, max_iter=300, n_init=10,
                 tol=1e-4, random_state=None):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.n_init = n_init
        self.tol = tol
        self.random_state = random_state

    def _init_centers(self, X):
        """K-medians++ initialization."""
        n_samples = len(X)
        centers = np.zeros((self.n_clusters, X.shape[1]))

        # First center: random
        centers[0] = X[np.random.randint(n_samples)]

        for k in range(1, self.n_clusters):
            distances = cdist(X, centers[:k], metric='cityblock')
            min_dist = np.min(distances, axis=1)
            probs = min_dist / min_dist.sum()
            centers[k] = X[np.random.choice(n_samples, p=probs)]

        return centers

    def _single_run(self, X):
        """Single K-medians run."""
        n_samples = len(X)
        centers = self._init_centers(X)

        for iteration in range(self.max_iter):
            # Assignment
            distances = cdist(X, centers, metric='cityblock')
            labels = np.argmin(distances, axis=1)

            # Update
            new_centers = np.zeros_like(centers)
            for k in range(self.n_clusters):
                mask = labels == k
                if np.any(mask):
                    new_centers[k] = np.median(X[mask], axis=0)
                else:
                    new_centers[k] = X[np.random.randint(n_samples)]

            # Convergence check
            shift = np.sum(np.abs(centers - new_centers))
            centers = new_centers
            if shift < self.tol:
                break

        inertia = np.sum(np.min(cdist(X, centers, metric='cityblock'), axis=1))
        return centers, labels, inertia

    def fit(self, X):
        """Fit K-medians to data."""
        if self.random_state is not None:
            np.random.seed(self.random_state)

        best_inertia = np.inf
        for _ in range(self.n_init):
            centers, labels, inertia = self._single_run(X)
            if inertia < best_inertia:
                best_inertia = inertia
                self.cluster_centers_ = centers
                self.labels_ = labels
                self.inertia_ = inertia

        return self

    def predict(self, X):
        """Predict cluster labels for new data."""
        distances = cdist(X, self.cluster_centers_, metric='cityblock')
        return np.argmin(distances, axis=1)

    def fit_predict(self, X):
        """Fit and return labels."""
        self.fit(X)
        return self.labels_

# ============================================
# Option 2: Using PyClustering library
# ============================================

# pip install pyclustering
try:
    from pyclustering.cluster.kmedians import kmedians
    from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

    def pyclustering_kmedians(X, K):
        """K-medians using PyClustering library."""
        # Initialize with k-means++
        initial_centers = kmeans_plusplus_initializer(
            X.tolist(), K
        ).initialize()

        # Run k-medians
        kmedians_instance = kmedians(X.tolist(), initial_centers)
        kmedians_instance.process()

        clusters = kmedians_instance.get_clusters()
        centers = np.array(kmedians_instance.get_medians())

        # Convert cluster lists to labels
        labels = np.zeros(len(X), dtype=int)
        for k, cluster in enumerate(clusters):
            for idx in cluster:
                labels[idx] = k

        return centers, labels

except ImportError:
    pass

# ============================================
# Example usage
# ============================================

# Generate sample data
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) * 0.5 + [0, 0],
    np.random.randn(100, 2) * 0.5 + [4, 4],
    np.random.randn(100, 2) * 0.5 + [8, 0],
])

# Add outliers
X = np.vstack([X, [[15, 15], [-5, 10], [20, -5]]])

# Fit K-medians
kmedians = KMedians(n_clusters=3, random_state=42)
kmedians.fit(X)

print("K-medians Results:")
print(f"  Cluster centers:\n{np.round(kmedians.cluster_centers_, 3)}")
print(f"  Inertia (L1): {kmedians.inertia_:.2f}")
print(f"  Cluster sizes: {np.bincount(kmedians.labels_)}")
```

We've comprehensively explored K-medians and seen how a simple change of distance metric fundamentally alters clustering behavior.
Looking Ahead:
The next page explores Fuzzy C-means, which takes a fundamentally different approach: instead of hard cluster assignments, each point belongs to all clusters with varying degrees of membership. This soft clustering paradigm enables modeling of overlapping clusters and provides richer information about cluster structure.
You now have a thorough understanding of K-medians clustering—from its mathematical foundations in L1 geometry to practical implementation details. You can now make informed decisions about when to use K-medians over K-means based on your data characteristics and robustness requirements.