We've established that Lloyd's algorithm decreases the WCSS at each iteration. But does it actually terminate? How quickly? Are there cases where it gets stuck or runs forever?
These questions matter for both theory and practice.
In this page, we'll rigorously analyze convergence—proving that Lloyd's algorithm always terminates, characterizing its convergence rate, and exploring edge cases where behavior becomes pathological.
By the end of this page, you will: • Prove that Lloyd's algorithm always converges • Understand why convergence to local (not global) minima occurs • Analyze convergence rates (worst-case and typical) • Recognize pathological inputs that cause slow convergence • Implement robust stopping criteria
Lloyd's algorithm is guaranteed to converge in finite time. The proof relies on two key observations:
Observation 1: WCSS is monotonically non-increasing
Let $J^{(t)}$ denote the WCSS at iteration $t$. We showed earlier that both the assignment step and the update step can only decrease the WCSS or leave it unchanged.
Therefore: $J^{(t+1)} \leq J^{(t)}$ for all $t$.
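This monotonicity is easy to observe empirically. A minimal numpy-only sketch (synthetic blobs; all names are ours) that runs Lloyd's iterations and records $J^{(t)}$ after each assignment step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 0], [0, 5])])

k = 3
centroids = X[rng.choice(len(X), k, replace=False)]
history = []  # J^(t) after each assignment step

for _ in range(20):
    # Assignment step: nearest centroid by squared Euclidean distance
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    history.append(d2[np.arange(len(X)), labels].sum())
    # Update step: each centroid moves to the mean of its cluster
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = X[labels == j].mean(axis=0)

print(f"J went from {history[0]:.2f} to {history[-1]:.2f}, never increasing")
```

Every adjacent pair in `history` satisfies $J^{(t+1)} \leq J^{(t)}$, exactly as the proof predicts.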
Observation 2: There are finitely many possible configurations
A configuration is a pair $(\{C_j\}, \{\boldsymbol{\mu}_j\})$ of cluster assignments and centroids. Since each of the $n$ points belongs to exactly one of $k$ clusters, and the centroids are fully determined by the assignments (each centroid is the mean of its cluster), the number of distinct configurations is at most $k^n$ (finite).
Theorem: Lloyd's algorithm terminates in at most $k^n$ iterations.
Proof: Each iteration either leaves the configuration unchanged (at which point the algorithm halts) or changes it. Whenever the configuration changes, the WCSS strictly decreases (provided ties are broken consistently), so no configuration can ever repeat. Since there are at most $k^n$ configurations, the algorithm must reach a fixed point within $k^n$ iterations.
Important Caveat: The $k^n$ Bound is Exponential
While convergence is guaranteed, the $k^n$ worst-case bound is exponentially large. For $n = 100$ and $k = 2$, this is $2^{100} \approx 10^{30}$ iterations!
Fortunately, this worst case is rarely (if ever) achieved in practice. Typical convergence is much faster, as we'll see in the next section.
What Does Convergence Mean?
Lloyd's algorithm converges to a fixed point—a configuration where the assignment step changes no labels and the update step moves no centroids.
At a fixed point, we've reached a local minimum of the WCSS. This may or may not be the global minimum.
Lloyd's algorithm converges to a local minimum, not necessarily the global minimum. A configuration is a local minimum if no single-point reassignment decreases WCSS. But there might exist multi-point swaps that would decrease WCSS—the algorithm cannot escape such local minima.
This is why initialization (k-means++) and multiple restarts are essential for finding good solutions.
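The effect of local minima is easy to see: run Lloyd's algorithm from several random initializations and compare the final objectives. A minimal numpy sketch (synthetic blobs; the helper and its defaults are ours):

```python
import numpy as np

def lloyd(X, k, seed, iters=100):
    """Plain Lloyd's algorithm; returns the final WCSS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.4, size=(40, 2))
               for loc in ([0, 0], [4, 0], [0, 4], [4, 4])])

finals = [lloyd(X, k=4, seed=s) for s in range(10)]
best = min(finals)
# A spread of final WCSS values indicates distinct local minima
print(sorted(round(f, 1) for f in finals))
```

Keeping the run with the lowest final WCSS is exactly what multiple restarts (sklearn's `n_init`) automate.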
The $k^n$ worst-case bound tells us convergence is guaranteed but says nothing about typical behavior. In practice, Lloyd's algorithm converges remarkably fast.
Empirical Observations:
Across decades of empirical studies, Lloyd's algorithm has typically been observed to converge in a few dozen iterations, largely independent of $n$, which is vastly better than the $k^n$ worst-case bound.
Smoothed Analysis:
Smoothed analysis, introduced by Spielman and Teng, explains why Lloyd's algorithm behaves well in practice; Arthur, Manthey, and Röglin applied it to k-means:
Theorem (Smoothed Polynomial Convergence): If input data is perturbed by small random Gaussian noise (standard deviation $\sigma$), Lloyd's algorithm converges in polynomial expected time: $O\left(\frac{n^{34}k^{34}d^8}{\sigma^6}\right)$ iterations.
Smoothed analysis bridges worst-case and average-case complexity. It asks: what happens when we slightly perturb worst-case inputs?
For Lloyd's algorithm, adversarial inputs causing exponential convergence are extremely brittle—tiny perturbations restore polynomial convergence. Real-world data always has some noise, so exponential behavior is essentially never observed.
Convergence Characterization:
Typical convergence follows a pattern: large WCSS drops and many reassignments in the first few iterations, then a roughly geometric decrease, and finally tiny changes as the configuration settles into its fixed point.
This suggests an adaptive stopping criterion based on relative improvement:
```python
if (J_old - J_new) / J_old < tolerance:
    converged = True
```
"""Convergence Analysis of Lloyd's Algorithm This module tracks and visualizes convergence behavior across iterations."""import numpy as npfrom typing import List, Tupleimport matplotlib.pyplot as plt def kmeans_with_tracking( X: np.ndarray, k: int, max_iters: int = 100, random_state: int = 42) -> Tuple[np.ndarray, np.ndarray, List[float], List[int]]: """ K-means with convergence tracking. Returns: centroids: Final centroids labels: Final cluster assignments inertia_history: WCSS at each iteration reassignment_history: Number of points reassigned at each iteration """ np.random.seed(random_state) n_samples, n_features = X.shape # K-means++ initialization centroids = kmeans_plus_plus_init(X, k) inertia_history = [] reassignment_history = [] labels = None for iteration in range(max_iters): old_labels = labels.copy() if labels is not None else None # Assignment step distances_sq = compute_distances_squared(X, centroids) labels = np.argmin(distances_sq, axis=1) # Track reassignments if old_labels is not None: n_reassigned = np.sum(labels != old_labels) reassignment_history.append(n_reassigned) else: reassignment_history.append(n_samples) # First iteration # Compute and track inertia inertia = np.sum(distances_sq[np.arange(n_samples), labels]) inertia_history.append(inertia) # Update step new_centroids = np.zeros_like(centroids) for j in range(k): mask = (labels == j) if mask.sum() > 0: new_centroids[j] = X[mask].mean(axis=0) else: new_centroids[j] = centroids[j] # Check convergence if np.allclose(centroids, new_centroids): break centroids = new_centroids return centroids, labels, inertia_history, reassignment_history def analyze_convergence_patterns(X, k, n_runs=10): """ Analyze convergence across multiple runs. 
""" all_histories = [] all_lengths = [] all_final_inertias = [] for run in range(n_runs): _, _, history, _ = kmeans_with_tracking(X, k, random_state=run) all_histories.append(history) all_lengths.append(len(history)) all_final_inertias.append(history[-1]) print(f"Convergence Analysis ({n_runs} runs):") print(f" Iterations until convergence:") print(f" Mean: {np.mean(all_lengths):.1f}") print(f" Min: {np.min(all_lengths)}") print(f" Max: {np.max(all_lengths)}") print(f" Final inertia:") print(f" Mean: {np.mean(all_final_inertias):.2f}") print(f" Std: {np.std(all_final_inertias):.2f}") return all_histories def demonstrate_geometric_convergence(): """ Show that WCSS typically decreases geometrically. """ np.random.seed(42) # Generate well-separated clusters X = np.vstack([ np.random.randn(100, 2) + [0, 0], np.random.randn(100, 2) + [5, 5], np.random.randn(100, 2) + [10, 0], ]) _, _, inertia_history, reassign_history = kmeans_with_tracking(X, 3) print("Iteration-by-Iteration Convergence:") print("-" * 50) print(f"{'Iter':<6}{'Inertia':<15}{'Decrease %':<15}{'Reassigned'}") print("-" * 50) for i, (inertia, reassign) in enumerate(zip(inertia_history, reassign_history)): if i == 0: decrease = 0 else: decrease = (inertia_history[i-1] - inertia) / inertia_history[i-1] * 100 print(f"{i:<6}{inertia:<15.2f}{decrease:<15.2f}{reassign}") print("-" * 50) print(f"Converged in {len(inertia_history)} iterations") # Helper functionsdef kmeans_plus_plus_init(X, k): n = X.shape[0] centroids = [X[np.random.randint(n)]] for _ in range(1, k): distances = np.array([min(np.sum((x - c)**2) for c in centroids) for x in X]) probs = distances / distances.sum() next_idx = np.random.choice(n, p=probs) centroids.append(X[next_idx]) return np.array(centroids) def compute_distances_squared(X, centroids): return np.array([[np.sum((x - c)**2) for c in centroids] for x in X]) if __name__ == "__main__": demonstrate_geometric_convergence()While practical convergence is fast, adversarial inputs exist that 
force Lloyd's algorithm to take exponentially many iterations. Understanding these constructions illuminates the algorithm's behavior.
Arthur & Vassilvitskii Construction (2006):
Arthur and Vassilvitskii constructed datasets and initializations on which Lloyd's algorithm requires $2^{\Omega(\sqrt{n})}$ iterations; Vattani later strengthened the lower bound to $2^{\Omega(n)}$, even for points in the plane.
The Construction:
The key is that each point's assignment depends on a delicate balance, and changing one assignment triggers a chain reaction that takes exponential time to stabilize.
These adversarial constructions require: • Exact arithmetic: Any floating-point noise breaks the cascade • Specific initialization: Slightly different starting points avoid the trap • Contrived geometry: Real datasets don't have this structure
As smoothed analysis shows, adding tiny noise makes exponential behavior vanish. Real data always has noise, so you'll never see this in practice.
Other Pathological Cases:
1. Empty Clusters:
If initialization places centroids such that one receives no points, the update step has no mean to compute for that cluster. Common remedies: keep the previous centroid, drop the cluster, or reseed the empty centroid at the point farthest from its assigned centroid.
2. Ties:
When a point is equidistant from multiple centroids, the assignment is ambiguous. Breaking ties consistently (e.g., always choosing the lowest cluster index) preserves the termination guarantee.
3. Oscillation:
In rare edge cases, assignments can oscillate: with inconsistent tie-breaking or floating-point round-off, a point can flip back and forth between two (near-)equidistant centroids. Consistent tie-breaking and tolerance-based stopping criteria prevent this.
Modern implementations handle all these cases gracefully.
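One common empty-cluster remedy, reseeding the empty centroid at the point currently farthest from its assigned centroid, can be sketched as follows (the function name and toy data are ours):

```python
import numpy as np

def reseed_empty_clusters(X, labels, centroids):
    """If a cluster received no points, move its centroid to the point
    that is currently farthest from its own assigned centroid."""
    centroids = centroids.copy()
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    for j in range(len(centroids)):
        if not np.any(labels == j):
            # Distance of each point to its *assigned* centroid
            assigned_d2 = d2[np.arange(len(X)), labels]
            centroids[j] = X[np.argmax(assigned_d2)]
    return centroids

# Tiny example: cluster 2 received no points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1])  # nobody assigned to cluster 2
centroids = np.array([[0.05, 0.0], [6.0, 3.0], [100.0, 100.0]])

centroids = reseed_empty_clusters(X, labels, centroids)
print(centroids[2])  # the far-out point [9, 0] is chosen
```

Reseeding at a poorly-fit point both fixes the empty cluster and tends to reduce the WCSS on the next iteration.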
In practice, we don't wait for exact convergence—we stop when progress becomes negligible. Several stopping criteria are used.
Recommended Practice:
Use a combination of criteria:
```python
for iteration in range(max_iters):
    # ... assignment and update steps ...

    # Criterion 1: No assignments changed
    if np.all(labels == old_labels):
        break

    # Criterion 2: Centroids barely moved
    centroid_shift = np.sum((centroids - old_centroids) ** 2)
    if centroid_shift < tol:
        break

    # Criterion 3: Objective stopped improving
    if (old_inertia - inertia) / old_inertia < rtol:
        break
```
This ensures termination regardless of edge cases while stopping early when effectively converged.
Centroid tolerance (tol): Typically 1e-4 to 1e-6. Scale with data magnitude.
Relative tolerance (rtol): Typically 1e-4. Stop when improvement is < 0.01%.
Max iterations: 300 is sklearn's default; rarely hit with good initialization.
For large-scale applications, be slightly more aggressive (larger tolerances) to save computation.
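The advice to scale `tol` with data magnitude can be made concrete. Scikit-learn, for instance, multiplies `tol` by the mean per-feature variance of the data; a minimal sketch of that idea (function name is ours):

```python
import numpy as np

def scaled_tolerance(X, tol=1e-4):
    """Make the centroid-shift tolerance scale-invariant by tying it
    to the data's mean per-feature variance."""
    return tol * np.mean(np.var(X, axis=0))

X = np.random.default_rng(0).normal(size=(100, 2))
print(scaled_tolerance(X))          # roughly 1e-4 for unit-variance data
print(scaled_tolerance(1000 * X))   # same data in different units, ~1e6x larger
```

This way the same `tol` setting behaves identically whether your features are in meters or millimeters.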
Monitoring convergence reveals important information about the clustering and potential issues.
Healthy Convergence Signs:
A steep drop in inertia over the first few iterations, reassignment counts falling steadily toward zero, and similar final inertia across the best restarts.
Warning Signs:
| Warning Sign | Possible Cause | Remedy |
|---|---|---|
| Hit max_iters | Poor initialization or difficult data | Increase max_iters, use k-means++, check data scaling |
| Oscillating inertia | Numerical precision issues | Use double precision, center data |
| Empty clusters | k too large or bad initialization | Reduce k, use k-means++, reinitialize empty clusters |
| Very slow convergence | Data on different scales | Standardize features, check for outliers |
| Large variance across restarts | Multiple local minima | Increase n_init, use k-means++ |
Elbow Plot for Convergence:
Plotting inertia vs. iteration number should show an "elbow":
```
Inertia
  |
  |\
  | \
  |  \____________________
  |________________________ Iteration
```
The flat region indicates convergence. If the curve keeps decreasing steeply, you may need more iterations.
Comparing Restarts:
When running multiple restarts, compare the final inertias and how closely the resulting cluster assignments agree across runs.
If restarts produce very different inertias, the objective landscape has multiple distinct local minima—consider increasing n_init.
Several techniques can speed up convergence without changing the final result.
1. Early Termination of Assignment Step:
In the assignment step, we can skip recomputing distances for points that cannot change clusters. By the triangle inequality:
If $\|\mathbf{x} - \boldsymbol{\mu}_{\text{old}}\| \leq \frac{1}{2}\min_{j \neq \text{old}} \|\boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_j\|$
then $\mathbf{x}$ cannot change clusters (it's too close to its current centroid).
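This bound is easy to verify numerically: every point lying within half the separation of a centroid must have that centroid as its nearest. A small numpy check (the random construction is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
centroids = rng.normal(scale=3.0, size=(5, 3))

# Point-to-centroid and centroid-to-centroid distances
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
np.fill_diagonal(cc, np.inf)
half_sep = 0.5 * cc.min(axis=1)  # half the distance to the nearest other centroid

# If x is within half_sep[j] of centroid j, then j is x's nearest centroid
for j in range(len(centroids)):
    close = d[:, j] <= half_sep[j]
    assert np.all(d[close].argmin(axis=1) == j)
print("pruning condition verified for", len(X), "points")
```

The assertion holds because for any other centroid $\boldsymbol{\mu}_j$, the triangle inequality gives $\|\mathbf{x} - \boldsymbol{\mu}_j\| \geq \|\boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_j\| - \|\mathbf{x} - \boldsymbol{\mu}_{\text{old}}\| \geq \|\mathbf{x} - \boldsymbol{\mu}_{\text{old}}\|$.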
2. Elkan's Algorithm:
Maintains bounds on distances to avoid redundant computation: an upper bound on each point's distance to its assigned centroid, and lower bounds on its distances to all other centroids. When the upper bound is below every lower bound, the exact distances need not be recomputed.
Proven to reduce distance computations by 50-90% on many datasets.
Scikit-learn offers two algorithms:
• 'lloyd': Standard algorithm, O(nkd) per iteration • 'elkan': Uses triangle inequality, faster for low-dimensional data
```python
KMeans(n_clusters=k, algorithm='elkan')  # Often 2-3x faster
```
In older scikit-learn versions, the automatic setting chose Elkan's algorithm for dense data and Lloyd's for sparse data (where the triangle-inequality bookkeeping isn't worthwhile); current versions default to 'lloyd'.
3. Mini-Batch K-Means:
Instead of using all $n$ points per iteration, sample a small batch:
```python
for iteration in range(max_iters):
    batch = random_sample(X, batch_size=1000)

    # Assignment step on batch only
    batch_labels = assign_to_nearest(batch, centroids)

    # Update centroids using batch (with learning rate)
    for j in range(k):
        batch_cluster = batch[batch_labels == j]
        if len(batch_cluster) > 0:
            update = batch_cluster.mean(axis=0) - centroids[j]
            centroids[j] += learning_rate * update
```
Trade-offs: each iteration is far cheaper and touches only the batch, so mini-batch k-means scales to huge datasets, but the objective decreases noisily and the final inertia is typically slightly worse than full-batch k-means.
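The generic `learning_rate` in the pseudocode is commonly set per cluster to $1/\text{count}_j$, which turns each centroid into a running mean of every batch point ever assigned to it. A self-contained numpy sketch of this variant on synthetic blobs (farthest-point initialization and all names are ours, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(m, 0.5, size=(200, 2)) for m in true_means])

# Farthest-point initialization: deterministic, and with well-separated
# blobs it places one centroid in each blob
k = 3
centroids = [X[0]]
for _ in range(k - 1):
    d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
    centroids.append(X[np.argmax(d2)])
centroids = np.array(centroids)

counts = np.zeros(k)
for _ in range(100):
    batch = X[rng.choice(len(X), size=100, replace=False)]
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]  # per-cluster lr = 1/count_j

# Each true mean should have a centroid nearby
gaps = np.linalg.norm(true_means[:, None, :] - centroids[None, :, :],
                      axis=2).min(axis=1)
print(gaps)
```

Scikit-learn packages this algorithm as `sklearn.cluster.MiniBatchKMeans`.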
What's Next:
We've now covered the algorithm (Lloyd's), objective (WCSS), initialization (k-means++), and convergence. In the final page of this module, we'll examine the limitations of k-means—understanding when it works well, when it fails, and what alternatives exist.
You now understand the convergence guarantees of Lloyd's algorithm, why practical convergence is fast despite exponential worst cases, and how to implement robust stopping criteria. This knowledge lets you confidently run k-means knowing it will terminate appropriately.