We've established that Lloyd's algorithm decreases the WCSS at each iteration. But does it actually terminate? How quickly? Are there cases where it gets stuck or runs forever?
These questions matter for both theory and practice.
In this page, we'll rigorously analyze convergence—proving that Lloyd's algorithm always terminates, characterizing its convergence rate, and exploring edge cases where behavior becomes pathological.
By the end of this page, you will: • Prove that Lloyd's algorithm always converges • Understand why convergence to local (not global) minima occurs • Analyze convergence rates (worst-case and typical) • Recognize pathological inputs that cause slow convergence • Implement robust stopping criteria
Lloyd's algorithm is guaranteed to converge in finite time. The proof relies on two key observations:
Observation 1: WCSS is monotonically non-increasing
Let $J^{(t)}$ denote the WCSS at iteration $t$. We showed earlier that both the assignment step and the update step can only decrease the WCSS or leave it unchanged.
Therefore: $J^{(t+1)} \leq J^{(t)}$ for all $t$.
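This monotonicity is easy to observe empirically. A minimal numpy-only sketch (synthetic blobs; all names are ours) that runs Lloyd's iterations and records $J^{(t)}$ after each assignment step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 0], [0, 5])])

k = 3
centroids = X[rng.choice(len(X), k, replace=False)]
history = []  # J^(t) after each assignment step

for _ in range(20):
    # Assignment step: nearest centroid by squared Euclidean distance
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    history.append(d2[np.arange(len(X)), labels].sum())
    # Update step: each centroid moves to the mean of its cluster
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = X[labels == j].mean(axis=0)

print(f"J went from {history[0]:.2f} to {history[-1]:.2f}, never increasing")
```

Every adjacent pair in `history` satisfies $J^{(t+1)} \leq J^{(t)}$, exactly as the proof predicts.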
Observation 2: There are finitely many possible configurations
A configuration is a pair $(\{C_j\}, \{\boldsymbol{\mu}_j\})$ of cluster assignments and centroids. Since each of the $n$ points belongs to exactly one of $k$ clusters, and the centroids are fully determined by the assignments (each centroid is the mean of its cluster), the number of distinct configurations is at most $k^n$ (finite).
Theorem: Lloyd's algorithm terminates in at most $k^n$ iterations.
Proof: Each iteration either leaves the configuration unchanged (at which point the algorithm halts) or changes it. Whenever the configuration changes, the WCSS strictly decreases (provided ties are broken consistently), so no configuration can ever repeat. Since there are at most $k^n$ configurations, the algorithm must reach a fixed point within $k^n$ iterations.
Important Caveat: The $k^n$ Bound is Exponential
While convergence is guaranteed, the $k^n$ worst-case bound is exponentially large. For $n = 100$ and $k = 2$, this is $2^{100} \approx 10^{30}$ iterations!
Fortunately, this worst case is rarely (if ever) achieved in practice. Typical convergence is much faster, as we'll see in the next section.
What Does Convergence Mean?
Lloyd's algorithm converges to a fixed point—a configuration where the assignment step changes no labels and the update step moves no centroids.
At a fixed point, we've reached a local minimum of the WCSS. This may or may not be the global minimum.
Lloyd's algorithm converges to a local minimum, not necessarily the global minimum. A configuration is a local minimum if no single-point reassignment decreases WCSS. But there might exist multi-point swaps that would decrease WCSS—the algorithm cannot escape such local minima.
This is why initialization (k-means++) and multiple restarts are essential for finding good solutions.
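The effect of local minima is easy to see: run Lloyd's algorithm from several random initializations and compare the final objectives. A minimal numpy sketch (synthetic blobs; the helper and its defaults are ours):

```python
import numpy as np

def lloyd(X, k, seed, iters=100):
    """Plain Lloyd's algorithm; returns the final WCSS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.4, size=(40, 2))
               for loc in ([0, 0], [4, 0], [0, 4], [4, 4])])

finals = [lloyd(X, k=4, seed=s) for s in range(10)]
best = min(finals)
# A spread of final WCSS values indicates distinct local minima
print(sorted(round(f, 1) for f in finals))
```

Keeping the run with the lowest final WCSS is exactly what multiple restarts (sklearn's `n_init`) automate.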
The $k^n$ worst-case bound tells us convergence is guaranteed but says nothing about typical behavior. In practice, Lloyd's algorithm converges remarkably fast.
Empirical Observations:
Across decades of empirical studies, Lloyd's algorithm has typically been observed to converge in a few dozen iterations, largely independent of $n$, which is vastly better than the $k^n$ worst-case bound.
Smoothed Analysis:
Smoothed analysis, introduced by Spielman and Teng, explains why Lloyd's algorithm behaves well in practice; Arthur, Manthey, and Röglin applied it to k-means:
Theorem (Smoothed Polynomial Convergence): If input data is perturbed by small random Gaussian noise (standard deviation $\sigma$), Lloyd's algorithm converges in polynomial expected time: $O\left(\frac{n^{34}k^{34}d^8}{\sigma^6}\right)$ iterations.
Smoothed analysis bridges worst-case and average-case complexity. It asks: what happens when we slightly perturb worst-case inputs?
For Lloyd's algorithm, adversarial inputs causing exponential convergence are extremely brittle—tiny perturbations restore polynomial convergence. Real-world data always has some noise, so exponential behavior is essentially never observed.
Convergence Characterization:
Typical convergence follows a pattern: large WCSS drops and many reassignments in the first few iterations, then a roughly geometric decrease, and finally tiny changes as the configuration settles into its fixed point.
This suggests an adaptive stopping criterion based on relative improvement:
```python
if (J_old - J_new) / J_old < tolerance:
    converged = True
```
"""Convergence Analysis of Lloyd's Algorithm This module tracks and visualizes convergence behavior across iterations."""import numpy as npfrom typing import List, Tupleimport matplotlib.pyplot as plt def kmeans_with_tracking( X: np.ndarray, k: int, max_iters: int = 100, random_state: int = 42) -> Tuple[np.ndarray, np.ndarray, List[float], List[int]]: """ K-means with convergence tracking. Returns: centroids: Final centroids labels: Final cluster assignments inertia_history: WCSS at each iteration reassignment_history: Number of points reassigned at each iteration """ np.random.seed(random_state) n_samples, n_features = X.shape # K-means++ initialization centroids = kmeans_plus_plus_init(X, k) inertia_history = [] reassignment_history = [] labels = None for iteration in range(max_iters): old_labels = labels.copy() if labels is not None else None # Assignment step distances_sq = compute_distances_squared(X, centroids) labels = np.argmin(distances_sq, axis=1) # Track reassignments if old_labels is not None: n_reassigned = np.sum(labels != old_labels) reassignment_history.append(n_reassigned) else: reassignment_history.append(n_samples) # First iteration # Compute and track inertia inertia = np.sum(distances_sq[np.arange(n_samples), labels]) inertia_history.append(inertia) # Update step new_centroids = np.zeros_like(centroids) for j in range(k): mask = (labels == j) if mask.sum() > 0: new_centroids[j] = X[mask].mean(axis=0) else: new_centroids[j] = centroids[j] # Check convergence if np.allclose(centroids, new_centroids): break centroids = new_centroids return centroids, labels, inertia_history, reassignment_history def analyze_convergence_patterns(X, k, n_runs=10): """ Analyze convergence across multiple runs. 
""" all_histories = [] all_lengths = [] all_final_inertias = [] for run in range(n_runs): _, _, history, _ = kmeans_with_tracking(X, k, random_state=run) all_histories.append(history) all_lengths.append(len(history)) all_final_inertias.append(history[-1]) print(f"Convergence Analysis ({n_runs} runs):") print(f" Iterations until convergence:") print(f" Mean: {np.mean(all_lengths):.1f}") print(f" Min: {np.min(all_lengths)}") print(f" Max: {np.max(all_lengths)}") print(f" Final inertia:") print(f" Mean: {np.mean(all_final_inertias):.2f}") print(f" Std: {np.std(all_final_inertias):.2f}") return all_histories def demonstrate_geometric_convergence(): """ Show that WCSS typically decreases geometrically. """ np.random.seed(42) # Generate well-separated clusters X = np.vstack([ np.random.randn(100, 2) + [0, 0], np.random.randn(100, 2) + [5, 5], np.random.randn(100, 2) + [10, 0], ]) _, _, inertia_history, reassign_history = kmeans_with_tracking(X, 3) print("Iteration-by-Iteration Convergence:") print("-" * 50) print(f"{'Iter':<6}{'Inertia':<15}{'Decrease %':<15}{'Reassigned'}") print("-" * 50) for i, (inertia, reassign) in enumerate(zip(inertia_history, reassign_history)): if i == 0: decrease = 0 else: decrease = (inertia_history[i-1] - inertia) / inertia_history[i-1] * 100 print(f"{i:<6}{inertia:<15.2f}{decrease:<15.2f}{reassign}") print("-" * 50) print(f"Converged in {len(inertia_history)} iterations") # Helper functionsdef kmeans_plus_plus_init(X, k): n = X.shape[0] centroids = [X[np.random.randint(n)]] for _ in range(1, k): distances = np.array([min(np.sum((x - c)**2) for c in centroids) for x in X]) probs = distances / distances.sum() next_idx = np.random.choice(n, p=probs) centroids.append(X[next_idx]) return np.array(centroids) def compute_distances_squared(X, centroids): return np.array([[np.sum((x - c)**2) for c in centroids] for x in X]) if __name__ == "__main__": demonstrate_geometric_convergence()While practical convergence is fast, adversarial inputs exist that 
force Lloyd's algorithm to take exponentially many iterations. Understanding these constructions illuminates the algorithm's behavior.
Arthur & Vassilvitskii Construction (2006):
Arthur and Vassilvitskii constructed datasets and initializations on which Lloyd's algorithm requires $2^{\Omega(\sqrt{n})}$ iterations; Vattani later strengthened the lower bound to $2^{\Omega(n)}$, even for points in the plane.
The Construction:
The key is that each point's assignment depends on a delicate balance, and changing one assignment triggers a chain reaction that takes exponential time to stabilize.
These adversarial constructions require: • Exact arithmetic: Any floating-point noise breaks the cascade • Specific initialization: Slightly different starting points avoid the trap • Contrived geometry: Real datasets don't have this structure
As smoothed analysis shows, adding tiny noise makes exponential behavior vanish. Real data always has noise, so you'll never see this in practice.
Other Pathological Cases:
1. Empty Clusters:
If initialization places centroids such that one receives no points, the update step has no mean to compute for that cluster. Common remedies: keep the previous centroid, drop the cluster, or reseed the empty centroid at the point farthest from its assigned centroid.
2. Ties:
When a point is equidistant from multiple centroids, the assignment is ambiguous. Breaking ties consistently (e.g., always choosing the lowest cluster index) preserves the termination guarantee.
3. Oscillation:
In rare edge cases, assignments can oscillate: with inconsistent tie-breaking or floating-point round-off, a point can flip back and forth between two (near-)equidistant centroids. Consistent tie-breaking and tolerance-based stopping criteria prevent this.
Modern implementations handle all these cases gracefully.
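One common empty-cluster remedy, reseeding the empty centroid at the point currently farthest from its assigned centroid, can be sketched as follows (the function name and toy data are ours):

```python
import numpy as np

def reseed_empty_clusters(X, labels, centroids):
    """If a cluster received no points, move its centroid to the point
    that is currently farthest from its own assigned centroid."""
    centroids = centroids.copy()
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    for j in range(len(centroids)):
        if not np.any(labels == j):
            # Distance of each point to its *assigned* centroid
            assigned_d2 = d2[np.arange(len(X)), labels]
            centroids[j] = X[np.argmax(assigned_d2)]
    return centroids

# Tiny example: cluster 2 received no points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1])  # nobody assigned to cluster 2
centroids = np.array([[0.05, 0.0], [6.0, 3.0], [100.0, 100.0]])

centroids = reseed_empty_clusters(X, labels, centroids)
print(centroids[2])  # the far-out point [9, 0] is chosen
```

Reseeding at a poorly-fit point both fixes the empty cluster and tends to reduce the WCSS on the next iteration.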
In practice, we don't wait for exact convergence—we stop when progress becomes negligible. Several stopping criteria are used.
Recommended Practice:
Use a combination of criteria:
```python
for iteration in range(max_iters):
    # ... assignment and update steps ...

    # Criterion 1: No assignments changed
    if np.all(labels == old_labels):
        break

    # Criterion 2: Centroids barely moved
    centroid_shift = np.sum((centroids - old_centroids) ** 2)
    if centroid_shift < tol:
        break

    # Criterion 3: Objective stopped improving
    if (old_inertia - inertia) / old_inertia < rtol:
        break
```
This ensures termination regardless of edge cases while stopping early when effectively converged.
Centroid tolerance (tol): Typically 1e-4 to 1e-6. Scale with data magnitude.
Relative tolerance (rtol): Typically 1e-4. Stop when improvement is < 0.01%.
Max iterations: 300 is sklearn's default; rarely hit with good initialization.
For large-scale applications, be slightly more aggressive (larger tolerances) to save computation.
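The advice to scale `tol` with data magnitude can be made concrete. Scikit-learn, for instance, multiplies `tol` by the mean per-feature variance of the data; a minimal sketch of that idea (function name is ours):

```python
import numpy as np

def scaled_tolerance(X, tol=1e-4):
    """Make the centroid-shift tolerance scale-invariant by tying it
    to the data's mean per-feature variance."""
    return tol * np.mean(np.var(X, axis=0))

X = np.random.default_rng(0).normal(size=(100, 2))
print(scaled_tolerance(X))          # roughly 1e-4 for unit-variance data
print(scaled_tolerance(1000 * X))   # same data in different units, ~1e6x larger
```

This way the same `tol` setting behaves identically whether your features are in meters or millimeters.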
Monitoring convergence reveals important information about the clustering and potential issues.
Healthy Convergence Signs:
A steep drop in inertia over the first few iterations, reassignment counts falling steadily toward zero, and similar final inertia across the best restarts.
Warning Signs:
| Warning Sign | Possible Cause | Remedy |
|---|---|---|
| Hit max_iters | Poor initialization or difficult data | Increase max_iters, use k-means++, check data scaling |
| Oscillating inertia | Numerical precision issues | Use double precision, center data |
| Empty clusters | k too large or bad initialization | Reduce k, use k-means++, reinitialize empty clusters |
| Very slow convergence | Data on different scales | Standardize features, check for outliers |
| Large variance across restarts | Multiple local minima | Increase n_init, use k-means++ |
Elbow Plot for Convergence:
Plotting inertia vs. iteration number should show an "elbow":
```
Inertia
  |
  |\
  | \
  |  \____________________
  |________________________ Iteration
```
The flat region indicates convergence. If the curve keeps decreasing steeply, you may need more iterations.
Comparing Restarts:
When running multiple restarts, compare the final inertias and how closely the resulting cluster assignments agree across runs.
If restarts produce very different inertias, the objective landscape has multiple distinct local minima—consider increasing n_init.
Several techniques can speed up convergence without changing the final result.
1. Early Termination of Assignment Step:
In the assignment step, we can skip recomputing distances for points that cannot change clusters. By the triangle inequality:
If $\|\mathbf{x} - \boldsymbol{\mu}_{\text{old}}\| \leq \frac{1}{2}\min_{j \neq \text{old}} \|\boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_j\|$
then $\mathbf{x}$ cannot change clusters (it's too close to its current centroid).
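This bound is easy to verify numerically: every point lying within half the separation of a centroid must have that centroid as its nearest. A small numpy check (the random construction is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
centroids = rng.normal(scale=3.0, size=(5, 3))

# Point-to-centroid and centroid-to-centroid distances
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
np.fill_diagonal(cc, np.inf)
half_sep = 0.5 * cc.min(axis=1)  # half the distance to the nearest other centroid

# If x is within half_sep[j] of centroid j, then j is x's nearest centroid
for j in range(len(centroids)):
    close = d[:, j] <= half_sep[j]
    assert np.all(d[close].argmin(axis=1) == j)
print("pruning condition verified for", len(X), "points")
```

The assertion holds because for any other centroid $\boldsymbol{\mu}_j$, the triangle inequality gives $\|\mathbf{x} - \boldsymbol{\mu}_j\| \geq \|\boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_j\| - \|\mathbf{x} - \boldsymbol{\mu}_{\text{old}}\| \geq \|\mathbf{x} - \boldsymbol{\mu}_{\text{old}}\|$.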
2. Elkan's Algorithm:
Maintains bounds on distances to avoid redundant computation: an upper bound on each point's distance to its assigned centroid, and lower bounds on its distances to all other centroids. When the upper bound is below every lower bound, the exact distances need not be recomputed.
Proven to reduce distance computations by 50-90% on many datasets.
Scikit-learn offers two algorithms:
• 'lloyd': Standard algorithm, O(nkd) per iteration • 'elkan': Uses triangle inequality, faster for low-dimensional data
```python
KMeans(n_clusters=k, algorithm='elkan')  # Often 2-3x faster
```
In older scikit-learn versions, the automatic setting chose Elkan's algorithm for dense data and Lloyd's for sparse data (where the triangle-inequality bookkeeping isn't worthwhile); current versions default to 'lloyd'.
3. Mini-Batch K-Means:
Instead of using all $n$ points per iteration, sample a small batch:
```python
for iteration in range(max_iters):
    batch = random_sample(X, batch_size=1000)

    # Assignment step on batch only
    batch_labels = assign_to_nearest(batch, centroids)

    # Update centroids using batch (with learning rate)
    for j in range(k):
        batch_cluster = batch[batch_labels == j]
        if len(batch_cluster) > 0:
            update = batch_cluster.mean(axis=0) - centroids[j]
            centroids[j] += learning_rate * update
```
Trade-offs: each iteration is far cheaper and touches only the batch, so mini-batch k-means scales to huge datasets, but the objective decreases noisily and the final inertia is typically slightly worse than full-batch k-means.
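The generic `learning_rate` in the pseudocode is commonly set per cluster to $1/\text{count}_j$, which turns each centroid into a running mean of every batch point ever assigned to it. A self-contained numpy sketch of this variant on synthetic blobs (farthest-point initialization and all names are ours, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(m, 0.5, size=(200, 2)) for m in true_means])

# Farthest-point initialization: deterministic, and with well-separated
# blobs it places one centroid in each blob
k = 3
centroids = [X[0]]
for _ in range(k - 1):
    d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
    centroids.append(X[np.argmax(d2)])
centroids = np.array(centroids)

counts = np.zeros(k)
for _ in range(100):
    batch = X[rng.choice(len(X), size=100, replace=False)]
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]  # per-cluster lr = 1/count_j

# Each true mean should have a centroid nearby
gaps = np.linalg.norm(true_means[:, None, :] - centroids[None, :, :],
                      axis=2).min(axis=1)
print(gaps)
```

Scikit-learn packages this algorithm as `sklearn.cluster.MiniBatchKMeans`.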
What's Next:
We've now covered the algorithm (Lloyd's), objective (WCSS), initialization (k-means++), and convergence. In the final page of this module, we'll examine the limitations of k-means—understanding when it works well, when it fails, and what alternatives exist.
You now understand the convergence guarantees of Lloyd's algorithm, why practical convergence is fast despite exponential worst cases, and how to implement robust stopping criteria. This knowledge lets you confidently run k-means knowing it will terminate appropriately.