Lloyd's algorithm converges to a local minimum—but which local minimum depends entirely on where you start. Poor initialization can lead to solutions that are arbitrarily worse than optimal, wasted computational effort, and meaningless clusters.
Consider running k-means on a dataset with three well-separated clusters. If all initial centroids happen to fall within one cluster, the algorithm may converge to a configuration where that cluster is split into three parts while the other two clusters are merged. The final WCSS could be 10× higher than optimal.
This sensitivity to initialization motivated decades of research culminating in k-means++—an elegant probabilistic initialization scheme that provides theoretical guarantees on solution quality. Published by Arthur and Vassilvitskii in 2007, k-means++ has become the default initialization method in virtually every k-means implementation.
By the end of this page, you will:
• Understand why random initialization fails
• Master the k-means++ algorithm step-by-step
• Prove the O(log k) approximation guarantee
• Analyze the computational cost of initialization
• Compare k-means++ with alternative strategies
Before presenting solutions, let's deeply understand the problem. K-means is sensitive to initialization because the objective function has many local minima.
Random Initialization Failures:
The simplest approach—randomly selecting $k$ data points as initial centroids—can fail dramatically:
Cluster splitting: If multiple centroids land in the same true cluster, that cluster gets split while others are merged
Empty clusters: If initial centroids are positioned such that some receive no points in the first assignment step, convergence becomes problematic
Slow convergence: Poor initialization requires more iterations to stabilize, wasting computation
Suboptimal cost: Final WCSS can be orders of magnitude higher than optimal
Theorem: For any constant c > 0, there exist datasets where random initialization produces solutions with WCSS > c × OPT with probability at least 1/2.
In other words, random initialization provides NO theoretical guarantees—the solution can be arbitrarily bad compared to optimal.
Visualizing the Problem:
Imagine k=3 clusters arranged in a line: one on the left, one in the middle, one on the right. With random initialization, two or even all three centroids can land in the same cluster; that cluster gets split while the others are merged or left uncovered. The probability of this bad case is substantial, and it only grows with the number of clusters.
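To put a number on this, consider the idealized case of k equal-sized, well-separated clusters: k uniformly drawn centroids cover every cluster only when each draw lands in a different cluster, which happens with probability k!/kᵏ. A quick back-of-envelope check (this simplified model is an assumption, not from the original analysis):

```python
from math import factorial

def prob_all_clusters_covered(k: int) -> float:
    """Probability that k uniformly drawn initial centroids land one
    in each of k equal-sized clusters (idealized model: each draw
    falls in any given cluster with probability 1/k)."""
    return factorial(k) / k ** k

for k in (2, 3, 5, 10):
    print(f"k={k}: P(all clusters covered) = {prob_all_clusters_covered(k):.4f}")
```

For k=3 the good case occurs only about 22% of the time, and by k=10 it is already below 0.1%, so random initialization almost never covers all clusters in a single run.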
| Failure Mode | Cause | Consequence |
|---|---|---|
| Cluster splitting | Multiple centroids in one true cluster | Nearby points assigned to different clusters |
| Cluster merging | No centroid near a true cluster | Distinct groups merged into one cluster |
| Empty clusters | All points closer to other centroids | Degenerate solution, NaN centroids |
| Slow convergence | Centroids far from final positions | Many iterations needed, wasted compute |
K-means++ is a distance-weighted probabilistic initialization. The key insight: subsequent centroids should be far from existing ones. Points far from all current centroids are more likely to be in uncovered clusters.
The Algorithm:
K-Means++ Initialization:
1. Choose first centroid μ₁ uniformly at random from data points
2. For j = 2 to k:
a. For each point xᵢ, compute D(xᵢ) = min distance to existing centroids
b. Choose next centroid μⱼ = xᵢ with probability proportional to D(xᵢ)²
3. Proceed with standard Lloyd's algorithm using these centroids
The magic is in step 2b: points far from all centroids have higher probability of being selected.
Why squared distance, not just distance?
D² weighting provides exactly the right exploration-exploitation balance:
• Points very far from centroids dominate the probability
• Points moderately far still have a reasonable chance
• Points near existing centroids have near-zero probability
This matches the k-means objective (sum of squared distances) and provides the O(log k) guarantee.
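The effect of squaring is easy to see numerically. The snippet below (illustrative distances, chosen to mirror the worked example that follows) compares sampling weights proportional to D against weights proportional to D²:

```python
import numpy as np

# Hypothetical distances from five candidate points to their nearest centroid
d = np.array([1.0, 10.0, 11.0, 20.0, 21.0])

for name, w in [("D  ", d), ("D^2", d ** 2)]:
    p = w / w.sum()  # normalize weights into a probability distribution
    print(name, np.round(p, 3))
```

With plain D weighting the nearest point still receives about 1.6% of the probability mass; squaring pushes it down to about 0.1% while concentrating mass on the far points, which is exactly the behavior the guarantee relies on.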
Step-by-Step Example:
Consider 6 points: A(0,0), B(1,0), C(10,0), D(11,0), E(20,0), F(21,0)
Clustering into k=3:
Step 1: Choose first centroid uniformly. Suppose we pick A(0,0).
Step 2 (j=2): Compute squared distances to A: D(B)² = 1, D(C)² = 100, D(D)² = 121, D(E)² = 400, D(F)² = 441 (and D(A)² = 0).
Total = 1063. Probabilities: A = 0%, B ≈ 0.1%, C ≈ 9.4%, D ≈ 11.4%, E ≈ 37.6%, F ≈ 41.5%
Likely outcome: E or F selected (combined 79% probability).
Step 3 (j=3): Suppose F was selected. Now compute the minimum squared distance to {A, F}: B = 1, C = 100, D = 100, E = 1 (A and F are 0). Total = 202.
Probabilities now overwhelmingly favor C and D (≈ 49.5% each), the middle cluster.
Result: High probability of getting one centroid per true cluster!
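The arithmetic above can be checked in a few lines (assuming, as in the walkthrough, that A is picked first and then F):

```python
import numpy as np

points = np.array([[0, 0], [1, 0], [10, 0], [11, 0], [20, 0], [21, 0]], float)
names = ["A", "B", "C", "D", "E", "F"]

# Round j=2: squared distances to the first centroid A(0,0)
d_sq = np.sum((points - points[0]) ** 2, axis=1)
print("Total D^2:", d_sq.sum())  # 1063
for n, p in zip(names, d_sq / d_sq.sum()):
    print(f"  {n}: {p:.1%}")

# Round j=3, supposing F(21,0) was chosen: min squared distance to {A, F}
d_sq_af = np.minimum(d_sq, np.sum((points - points[5]) ** 2, axis=1))
for n, p in zip(names, d_sq_af / d_sq_af.sum()):
    print(f"  {n}: {p:.1%}")
```

The second round assigns C and D roughly 49.5% each, confirming that D²-sampling steers the third centroid toward the uncovered middle cluster.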
```python
"""K-Means++ Initialization: Complete Implementation

This module implements the k-means++ algorithm with detailed comments
explaining each step and its theoretical motivation.
"""
import numpy as np
from typing import Tuple, Optional


def kmeans_plus_plus_init(
    X: np.ndarray,
    k: int,
    random_state: Optional[int] = None,
) -> np.ndarray:
    """
    K-means++ initialization for centroid selection.

    Algorithm:
        1. Choose first centroid uniformly at random
        2. For each subsequent centroid:
           - Compute D(x)^2 = squared distance to nearest existing centroid
           - Sample next centroid with probability proportional to D(x)^2

    Args:
        X: Data matrix of shape (n_samples, n_features)
        k: Number of clusters/centroids to initialize
        random_state: Random seed for reproducibility

    Returns:
        centroids: Array of shape (k, n_features) with initial centroids
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_features = X.shape
    centroids = np.empty((k, n_features))

    # Track which points have been selected as centroids
    selected_indices = []

    # Step 1: Choose first centroid uniformly at random
    first_idx = np.random.randint(n_samples)
    centroids[0] = X[first_idx].copy()
    selected_indices.append(first_idx)

    # Step 2: Choose remaining k-1 centroids
    for j in range(1, k):
        # Compute squared distance from each point to nearest centroid
        # (this is D(x)^2 in the k-means++ paper)
        distances_sq = compute_min_distances_squared(X, centroids[:j])

        # Convert distances to a probability distribution:
        # P(x_i is chosen) is proportional to D(x_i)^2
        probabilities = distances_sq / distances_sq.sum()

        # Sample the next centroid according to the D^2 distribution
        next_idx = np.random.choice(n_samples, p=probabilities)
        centroids[j] = X[next_idx].copy()
        selected_indices.append(next_idx)

    return centroids


def compute_min_distances_squared(
    X: np.ndarray,
    centroids: np.ndarray,
) -> np.ndarray:
    """
    Compute squared distance from each point to its nearest centroid:

        D(x)^2 = min_j ||x - mu_j||^2

    Args:
        X: Data points (n_samples, n_features)
        centroids: Current centroids (n_centroids, n_features)

    Returns:
        min_distances_sq: (n_samples,) array of minimum squared distances
    """
    # Compute all pairwise squared distances using the identity
    # ||x - mu||^2 = ||x||^2 - 2 x.mu + ||mu||^2
    X_sq = np.sum(X ** 2, axis=1, keepdims=True)   # (n_samples, 1)
    centroids_sq = np.sum(centroids ** 2, axis=1)  # (n_centroids,)
    cross_term = X @ centroids.T                   # (n_samples, n_centroids)
    distances_sq = X_sq - 2 * cross_term + centroids_sq

    # Clip negative values caused by floating-point round-off
    distances_sq = np.maximum(distances_sq, 0)

    # Minimum distance to any centroid, for each point
    return np.min(distances_sq, axis=1)


def kmeans_with_plus_plus(
    X: np.ndarray,
    k: int,
    max_iters: int = 300,
    tol: float = 1e-4,
    random_state: Optional[int] = None,
) -> Tuple[np.ndarray, np.ndarray, float, int]:
    """
    Complete k-means clustering with k-means++ initialization.

    Args:
        X: Data matrix (n_samples, n_features)
        k: Number of clusters
        max_iters: Maximum Lloyd iterations
        tol: Convergence tolerance
        random_state: Random seed

    Returns:
        centroids: Final centroid positions (k, n_features)
        labels: Cluster assignments (n_samples,)
        inertia: Final WCSS value
        n_iters: Number of iterations until convergence
    """
    # K-means++ initialization
    centroids = kmeans_plus_plus_init(X, k, random_state)

    # Lloyd's algorithm
    for iteration in range(max_iters):
        old_centroids = centroids.copy()

        # Assignment step
        distances_sq = compute_all_distances_squared(X, centroids)
        labels = np.argmin(distances_sq, axis=1)

        # Update step
        for j in range(k):
            mask = (labels == j)
            if mask.sum() > 0:
                centroids[j] = X[mask].mean(axis=0)

        # Check convergence
        shift = np.sum((centroids - old_centroids) ** 2)
        if shift < tol:
            break

    # Compute final inertia
    distances_sq = compute_all_distances_squared(X, centroids)
    inertia = np.sum(distances_sq[np.arange(len(labels)), labels])

    return centroids, labels, inertia, iteration + 1


def compute_all_distances_squared(X, centroids):
    """Compute squared distances from all points to all centroids."""
    X_sq = np.sum(X ** 2, axis=1, keepdims=True)
    centroids_sq = np.sum(centroids ** 2, axis=1)
    cross_term = X @ centroids.T
    return np.maximum(X_sq - 2 * cross_term + centroids_sq, 0)


# Demonstration: Random vs K-means++ initialization
if __name__ == "__main__":
    np.random.seed(42)

    # Generate 3 well-separated clusters
    cluster1 = np.random.randn(100, 2) + np.array([0, 0])
    cluster2 = np.random.randn(100, 2) + np.array([10, 0])
    cluster3 = np.random.randn(100, 2) + np.array([5, 8])
    X = np.vstack([cluster1, cluster2, cluster3])

    print("Comparing Random Init vs K-Means++ (10 runs each)")
    print("=" * 55)

    random_inertias = []
    kpp_inertias = []

    for run in range(10):
        # Random initialization
        random_idx = np.random.choice(len(X), 3, replace=False)
        centroids = X[random_idx].copy()

        # Run Lloyd's with random init (simplified)
        for _ in range(100):
            dists = compute_all_distances_squared(X, centroids)
            labels = np.argmin(dists, axis=1)
            for j in range(3):
                mask = labels == j
                if mask.sum() > 0:
                    centroids[j] = X[mask].mean(axis=0)
        dists = compute_all_distances_squared(X, centroids)
        random_inertias.append(np.sum(dists[np.arange(len(labels)), labels]))

        # K-means++ initialization
        _, _, inertia, _ = kmeans_with_plus_plus(X, 3, random_state=run * 10)
        kpp_inertias.append(inertia)

    print(f"Random Init - Mean Inertia: {np.mean(random_inertias):.2f} "
          f"(std: {np.std(random_inertias):.2f})")
    print(f"K-Means++   - Mean Inertia: {np.mean(kpp_inertias):.2f} "
          f"(std: {np.std(kpp_inertias):.2f})")
    print(f"Improvement: "
          f"{(np.mean(random_inertias) / np.mean(kpp_inertias) - 1) * 100:.1f}%")
```

K-means++ provides a remarkable theoretical guarantee that sets it apart from heuristic approaches.
Theorem (Arthur & Vassilvitskii, 2007):
Let $\phi$ denote the WCSS achieved by k-means++ initialization (before running Lloyd's algorithm), and let $\phi^*$ denote the optimal WCSS. Then:
$$\mathbb{E}[\phi] \leq 8(\ln k + 2) \cdot \phi^*$$
In other words, k-means++ is an O(log k)-approximation algorithm in expectation.
What This Means:
The guarantee degrades only logarithmically with k, which is remarkably slow.
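Evaluating the bound factor 8(ln k + 2) from the theorem makes the slow growth concrete:

```python
import math

# Expected-cost bound factor 8(ln k + 2) from the k-means++ theorem
for k in (2, 10, 100, 1000):
    print(f"k={k:5d}: bound factor = {8 * (math.log(k) + 2):.1f}")
```

Going from k=10 to k=1000 (a 100× increase in cluster count) roughly doubles the factor, from about 34 to about 71.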
The O(log k) bound applies to k-means++ initialization alone. Running Lloyd's algorithm afterward can only improve (or maintain) the WCSS. In practice, the combination typically achieves solutions within 1-5% of optimal for well-clustered data.
Proof Sketch:
The proof works by induction on the number of clusters. The key insight is that D²-sampling tends to select points from uncovered clusters.
Lemma 1: If we've selected centroids covering some clusters well, D²-sampling will likely pick a point from an uncovered cluster (since those points have large D² values).
Lemma 2: When we pick a point from cluster $C$, the expected cost contribution from $C$ is at most 8 × OPT($C$), where OPT($C$) is the optimal cost for clustering $C$ with one center.
Induction: After selecting $j$ centroids well covering $j$ clusters, the $(j+1)$-th selection likely covers a new cluster with bounded cost. Summing over all $k$ steps and accounting for failure probabilities gives the O(log k) bound.
The logarithmic factor arises from a coupon collector-type analysis: how many samples until we've covered all clusters?
| Property | Random Init | K-Means++ |
|---|---|---|
| Approximation ratio | Unbounded | O(log k) |
| Probability of good solution | Can be arbitrarily low | High in expectation |
| Time complexity | O(k) | O(nkd) |
| Space complexity | O(k) | O(n) |
| Default in sklearn | No | Yes ('k-means++') |
K-means++ initialization has higher computational cost than random selection, but the investment typically pays off through faster convergence and better solutions.
Time Complexity:
Each of the k selection rounds needs D(x)², the squared distance from every point to its nearest existing centroid. Maintained incrementally (computing distances only to the newest centroid), a round costs O(nd), so initialization totals O(nkd): the same as a single Lloyd iteration. Even a naive implementation that recomputes all distances every round costs at most O(nk²d), i.e., the equivalent of roughly k extra Lloyd iterations.
Is It Worth It?
Yes, almost always: a better starting point means fewer Lloyd iterations and a lower final WCSS, which typically more than repays the one-time initialization cost.
For very large datasets, even O(nkd) initialization becomes expensive. Variants exist:
• K-Means|| (Scalable K-Means++): Sample O(k) points per round for O(log n) rounds, then reduce to k centroids. Achieves O(log k) approximation with O(k log n) distance computations.
• Mini-batch K-Means++: Apply k-means++ to a random sample, then use those centroids.
• Greedy K-Means++: Deterministically choose the farthest point instead of sampling (loses theoretical guarantee but often works well).
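The greedy variant above amounts to farthest-point (maxmin) traversal. A minimal sketch, using a hypothetical `farthest_point_init` helper rather than any library API:

```python
import numpy as np

def farthest_point_init(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedy 'maxmin' initialization: after a random first pick,
    repeatedly take the point farthest from all chosen centroids."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    min_d_sq = np.sum((X - X[chosen[0]]) ** 2, axis=1)
    for _ in range(1, k):
        idx = int(np.argmax(min_d_sq))  # deterministic: always the farthest point
        chosen.append(idx)
        new_d_sq = np.sum((X - X[idx]) ** 2, axis=1)
        min_d_sq = np.minimum(min_d_sq, new_d_sq)
    return X[chosen]

# Three well-separated pairs on a line, as in the worked example
X = np.array([[0.0], [1.0], [10.0], [11.0], [20.0], [21.0]])
print(farthest_point_init(X, 3))  # picks one point from each pair
```

Because the maxmin rule always chases the single farthest point, an outlier is guaranteed to be selected eventually, which is why the comparison table below rates its outlier sensitivity as high.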
Implementation Optimization:
The naive O(nkd) implementation can be optimized:
```python
# Naive: recompute all distances at each step
for j in range(1, k):
    distances = compute_distances_to_all_centroids(X, centroids[:j])
    min_distances = distances.min(axis=1)
    # ... sample next centroid
```

```python
# Optimized: incrementally update minimum distances
min_distances_sq = np.full(n, np.inf)
for j in range(k):
    if j > 0:
        # Only compute distances to the newest centroid
        new_distances_sq = compute_distances(X, centroids[j - 1])
        min_distances_sq = np.minimum(min_distances_sq, new_distances_sq)
    # ... sample based on min_distances_sq
```
The optimized version still requires O(nd) per centroid selection but avoids redundant distance computations, improving constants significantly.
K-means++ is the default choice, but alternative initialization strategies exist for specific scenarios.
Forgy Method (Random Selection)
Simply select $k$ data points uniformly at random as initial centroids.
Pros:
• Essentially free: O(k) time, no distance computations
• Trivially simple to implement
Cons:
• No approximation guarantee; solutions can be arbitrarily bad
• Prone to cluster splitting, cluster merging, and empty clusters
• Typically needs many more restarts to find a good solution
Use when: Computational budget is extremely tight, or data is known to have well-separated, equal-sized clusters.
| Method | Time | Guarantee | Outlier Sensitivity |
|---|---|---|---|
| K-Means++ | O(nkd) | O(log k)-approx | Moderate |
| Forgy (Random) | O(k) | None | Low |
| Random Partition | O(n) | None | Low |
| Maxmin | O(nkd) | 2-approx (k-center) | High |
| K-Means|| | O(k log n · d) | O(log k)-approx | Moderate |
Even with k-means++ initialization, we only have a probabilistic guarantee. The standard practice is to run k-means multiple times and keep the best solution.
The Strategy:
```python
best_inertia = float('inf')
best_result = None
for i in range(n_init):  # typically n_init = 10
    centroids, labels, inertia, _ = kmeans_with_plus_plus(X, k)
    if inertia < best_inertia:
        best_inertia = inertia
        best_result = (centroids, labels)
return best_result
```
How Many Restarts?
With k-means++ initialization, 10 restarts is typically sufficient: each run independently has a good chance of landing near the optimum, so the probability that all 10 fail is small.
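If a single run finds a near-optimal solution with probability p, then n independent restarts fail only with probability (1 − p)ⁿ. The values of p below are illustrative, not figures from the k-means++ paper:

```python
def prob_at_least_one_good(p_single: float, n_runs: int) -> float:
    """P(at least one of n independent runs finds a good solution)."""
    return 1 - (1 - p_single) ** n_runs

for p in (0.3, 0.5, 0.7):
    print(f"p={p}: after 10 restarts -> {prob_at_least_one_good(p, 10):.4f}")
```

Even a modest 30% per-run success rate yields better than 97% success over 10 restarts, which is why small values of n_init work well in practice.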
Multiple restarts are embarrassingly parallel—each run is independent. Modern implementations (sklearn, spark) parallelize across cores:
```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=k,
    init='k-means++',
    n_init=10,  # 10 restarts; the run with the lowest inertia is kept
)
```

Note that older scikit-learn versions accepted an `n_jobs` argument to parallelize the restarts, but it was removed in scikit-learn 1.0; current versions instead parallelize each run internally via OpenMP threads.
Early Stopping Across Restarts:
Advanced implementations can allocate Lloyd iterations to the restarts in rounds and terminate runs that are clearly lagging the best run so far.
This is a form of successive halving applied to k-means restarts.
What's Next:
We now understand how to initialize k-means properly. But when does Lloyd's algorithm actually terminate? In the next page, we'll analyze the convergence properties of k-means—proving that it always converges, understanding the rate of convergence, and examining pathological cases.
You now understand why initialization matters, how k-means++ works, and its theoretical guarantees. K-means++ has become the gold standard for k-means initialization and should be your default choice in practice.