If you've ever used clustering in any practical application—customer segmentation, image compression, document grouping, or anomaly detection—chances are you've encountered k-means clustering. First proposed by Stuart Lloyd at Bell Labs in 1957 (though not published until 1982), this algorithm has become the de facto starting point for partitional clustering.
What makes k-means so ubiquitous? It's conceptually simple, computationally efficient, and surprisingly effective across diverse domains. Yet beneath this simplicity lies rich mathematical structure that connects to optimization theory, computational geometry, and statistical learning.
In this page, we'll develop a deep understanding of Lloyd's algorithm—the iterative procedure that makes k-means work. We'll trace through exactly what happens at each step, build intuition for why it works, and prepare the foundation for understanding its mathematical underpinnings.
By the end of this page, you will:

- Understand the complete Lloyd's algorithm step-by-step
- Trace through the assignment and update phases in detail
- Develop intuition for why the algorithm converges
- Implement k-means from scratch with full understanding
- Recognize the algorithm's connection to optimization
Before diving into the algorithm, let's precisely define the problem k-means solves.
The Input: We have a dataset $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector. We also specify $k$, the desired number of clusters.
The Output: We seek a partition of the data into $k$ disjoint clusters $\{C_1, C_2, \ldots, C_k\}$ such that:

- every point belongs to exactly one cluster ($C_j \cap C_{j'} = \emptyset$ for $j \neq j'$), and
- together the clusters cover the whole dataset ($C_1 \cup C_2 \cup \cdots \cup C_k = \mathcal{X}$).
Additionally, k-means produces cluster centroids $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k\}$ where each $\boldsymbol{\mu}_j \in \mathbb{R}^d$ represents the "center" of cluster $C_j$.
K-means performs hard clustering—each point belongs to exactly one cluster. This contrasts with soft clustering (like Gaussian Mixture Models) where points have probabilistic membership across multiple clusters. The hard assignment makes k-means computationally efficient but can be problematic for points near cluster boundaries.
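To make the contrast concrete, here is a minimal sketch (with made-up numbers) of a point sitting exactly on the boundary between two centroids: the hard assignment must pick one side, while a soft, GMM-style scheme splits the membership. The exponential weighting below is purely illustrative and not the actual GMM update.

```python
import numpy as np

# Toy illustration (not from the text): one point equidistant from two centroids.
x = np.array([2.0, 2.0])
centroids = np.array([[1.0, 1.0], [3.0, 3.0]])

sq_dists = np.sum((centroids - x) ** 2, axis=1)   # [2.0, 2.0]

# Hard assignment (k-means): winner takes all, ties broken by argmin.
hard_label = np.argmin(sq_dists)                  # 0

# Soft assignment (GMM-flavored): graded membership across both clusters.
resp = np.exp(-0.5 * sq_dists)
resp /= resp.sum()                                # [0.5, 0.5]
print(hard_label, resp)
```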
The Intuition:
We want points within the same cluster to be "similar" (close to each other) while points in different clusters should be "dissimilar" (far apart). K-means operationalizes this by representing each cluster with a single centroid and assigning every point to its nearest centroid, choosing centroids and assignments so that the total squared distance from points to their assigned centroids is as small as possible.
This creates clusters that are compact (points are close to their centroid) and separated (centroids are distinct).
| Symbol | Meaning | Dimensionality |
|---|---|---|
| $n$ | Number of data points | Scalar |
| $d$ | Feature dimensionality | Scalar |
| $k$ | Number of clusters | Scalar |
| $\mathbf{x}_i$ | The $i$-th data point | $\mathbb{R}^d$ |
| $\boldsymbol{\mu}_j$ | Centroid of cluster $j$ | $\mathbb{R}^d$ |
| $C_j$ | Set of points in cluster $j$ | Set of indices |
| $r_{ij}$ | Assignment indicator (1 if $\mathbf{x}_i \in C_j$) | $\{0, 1\}$ |
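As a quick sanity check on the notation, here is a minimal sketch of how these symbols typically map to NumPy arrays; the variable names (`X`, `centroids`, `labels`, `R`) are mine, not fixed by the text.

```python
import numpy as np

n, d, k = 6, 2, 2                      # points, dimensions, clusters
X = np.random.rand(n, d)               # data matrix: row i is x_i in R^d
centroids = np.random.rand(k, d)       # row j is mu_j in R^d
labels = np.random.randint(0, k, n)    # labels[i] = index of the cluster containing x_i

# r_ij as a one-hot matrix: R[i, j] = 1 iff x_i belongs to C_j
R = np.zeros((n, k))
R[np.arange(n), labels] = 1
assert R.sum() == n                    # each point belongs to exactly one cluster
```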
Lloyd's algorithm is elegantly simple: it alternates between two steps until convergence. Let's examine each step in detail.
Algorithm Overview:
1. INITIALIZE: Choose k initial centroids μ₁, μ₂, ..., μₖ
2. REPEAT until convergence:
a. ASSIGNMENT STEP: Assign each point to nearest centroid
b. UPDATE STEP: Recompute centroids as cluster means
3. RETURN: Final clusters and centroids
Let's unpack each component:
Step 2a: The Assignment Step
Given current centroids $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$, assign each point to the cluster with the nearest centroid:
$$C_j = \{\mathbf{x}_i : \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \leq \|\mathbf{x}_i - \boldsymbol{\mu}_{j'}\|^2 \text{ for all } j' \neq j\}$$
In other words, point $\mathbf{x}_i$ is assigned to cluster $j^* = \arg\min_j \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$.
Key insight: This step creates a Voronoi tessellation of the feature space. Each cluster corresponds to a Voronoi cell—the region of space closer to that centroid than any other.
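A minimal NumPy sketch of the assignment step, assuming `X` is the $(n, d)$ data matrix and `centroids` is the $(k, d)$ matrix of current centroids; the full implementation later in this page does the same thing with a more memory-efficient distance computation.

```python
import numpy as np

def assign_clusters(X: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assignment step: label each row of X with the index of its nearest centroid."""
    # Broadcast to pairwise differences: (n, 1, d) - (1, k, d) -> (n, k, d)
    diffs = X[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)   # (n, k) squared Euclidean distances
    return np.argmin(sq_dists, axis=1)      # (n,) cluster indices
```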
Step 2b: The Update Step
Given current cluster assignments $\{C_1, \ldots, C_k\}$, recompute each centroid as the mean of its assigned points:
$$\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i$$
Key insight: The centroid is the point that minimizes the sum of squared distances to all points in the cluster. This is why k-means is sometimes called the minimum variance clustering method.
Mathematical justification: For a fixed cluster $C_j$, the point $\boldsymbol{\mu}$ that minimizes $\sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2$ is precisely the arithmetic mean. This follows from setting the gradient to zero:
$$\frac{\partial}{\partial \boldsymbol{\mu}} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2 = -2\sum_{\mathbf{x}_i \in C_j} (\mathbf{x}_i - \boldsymbol{\mu}) = 0$$
Solving yields $\boldsymbol{\mu} = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i$.
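As a quick numerical illustration of this fact, the snippet below (using made-up Gaussian data) checks that perturbing the mean always increases the within-cluster sum of squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(100, 3))          # made-up points belonging to one cluster

def ssd(points: np.ndarray, mu: np.ndarray) -> float:
    """Sum of squared distances from the points to a candidate center mu."""
    return float(np.sum((points - mu) ** 2))

mu_mean = cluster.mean(axis=0)
# Any perturbation of the mean yields a strictly larger sum of squared distances.
for _ in range(5):
    candidate = mu_mean + rng.normal(scale=0.1, size=3)
    assert ssd(cluster, mu_mean) < ssd(cluster, candidate)
print("SSD at the mean:", ssd(cluster, mu_mean))
```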
The algorithm converges when either:

- No assignments change between iterations
- Centroids don't move (or move less than a threshold ε)
- Maximum iterations reached (safeguard against slow convergence)
In practice, 'no assignment changes' is the cleanest criterion and guarantees we've reached a fixed point.
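The sketch below shows how the 'no assignment changes' criterion looks in code; it is a compact illustration with assumed variable names, and it deliberately ignores empty clusters, which the full implementation later on handles explicitly.

```python
import numpy as np

def lloyd_until_stable(X: np.ndarray, centroids: np.ndarray, max_iters: int = 300):
    """Iterate assignment/update until no point changes cluster (or max_iters is hit)."""
    labels = None
    for _ in range(max_iters):
        sq_dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(sq_dists, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # fixed point: assignments stopped changing
        labels = new_labels
        # NOTE: assumes no cluster ever becomes empty; see the full implementation below
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
    return labels, centroids
```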
Let's trace Lloyd's algorithm on a small 2D dataset to build intuition. Consider 6 points in $\mathbb{R}^2$ that we want to partition into $k=2$ clusters:
Dataset: $\mathbf{x}_1 = (1, 1)$, $\mathbf{x}_2 = (1.5, 2)$, $\mathbf{x}_3 = (3, 4)$, $\mathbf{x}_4 = (5, 7)$, $\mathbf{x}_5 = (3.5, 5)$, $\mathbf{x}_6 = (4.5, 5)$
Initialization (Random Selection)
Suppose we randomly select $\mathbf{x}_1 = (1, 1)$ and $\mathbf{x}_4 = (5, 7)$ as the initial centroids.
These are our starting points. Now we begin iterating.
After just 2 iterations, the algorithm converged to:

- Cluster 1: {(1,1), (1.5,2), (3,4)} with centroid (1.83, 2.33)
- Cluster 2: {(5,7), (3.5,5), (4.5,5)} with centroid (4.33, 5.67)
This separates the 'lower-left' points from the 'upper-right' points—an intuitive clustering.
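The short script below reproduces this trace with plain NumPy. The six points come from the example above; treating $(1,1)$ as $\mathbf{x}_1$ and $(5,7)$ as $\mathbf{x}_4$ is the indexing assumed here.

```python
import numpy as np

# The six points from the worked example; the indexing of x_1 and x_4 is assumed.
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]], dtype=float)
centroids = X[[0, 3]].copy()   # initial centroids: x_1 = (1, 1) and x_4 = (5, 7)

for iteration in range(1, 10):
    sq_dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    labels = sq_dists.argmin(axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {iteration}: centroids =\n{new_centroids}")
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving
    centroids = new_centroids
# Converges to centroids of roughly (1.83, 2.33) and (4.33, 5.67), matching the trace above.
```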
Let's implement Lloyd's algorithm from scratch, emphasizing clarity and understanding over optimization. This implementation includes all the key components we've discussed.
"""Lloyd's Algorithm: K-Means Clustering from Scratch A complete, well-documented implementation emphasizing understandingover micro-optimizations. Each step maps directly to the theory."""import numpy as npfrom typing import Tuple, List, Optional class KMeans: """ K-Means clustering using Lloyd's algorithm. Attributes: k: Number of clusters max_iters: Maximum iterations before stopping tol: Convergence tolerance for centroid movement centroids: Final cluster centroids after fitting labels: Cluster assignments for training data inertia: Final within-cluster sum of squares n_iters: Number of iterations until convergence """ def __init__( self, k: int, max_iters: int = 300, tol: float = 1e-4, random_state: Optional[int] = None ): self.k = k self.max_iters = max_iters self.tol = tol self.random_state = random_state # These are set during fit() self.centroids = None self.labels = None self.inertia = None self.n_iters = 0 def _initialize_centroids(self, X: np.ndarray) -> np.ndarray: """ Initialize centroids using random selection (Forgy method). Randomly selects k data points as initial centroids. This is simple but can lead to poor initializations. Args: X: Data matrix of shape (n_samples, n_features) Returns: Initial centroids of shape (k, n_features) """ n_samples = X.shape[0] if self.random_state is not None: np.random.seed(self.random_state) # Select k unique indices indices = np.random.choice(n_samples, size=self.k, replace=False) # Return copies of selected points as centroids return X[indices].copy() def _compute_distances( self, X: np.ndarray, centroids: np.ndarray ) -> np.ndarray: """ Compute squared Euclidean distances from each point to each centroid. Uses the identity ||x - μ||² = ||x||² - 2x·μ + ||μ||² for efficient vectorized computation. Args: X: Data matrix (n_samples, n_features) centroids: Centroid matrix (k, n_features) Returns: Distance matrix (n_samples, k) where D[i,j] = ||x_i - μ_j||² """ # ||x||² for each sample: shape (n_samples, 1) X_sq = np.sum(X ** 2, axis=1, keepdims=True) # ||μ||² for each centroid: shape (1, k) centroids_sq = np.sum(centroids ** 2, axis=1, keepdims=True).T # -2 * X @ μ^T: shape (n_samples, k) cross_term = -2 * X @ centroids.T # ||x - μ||² = ||x||² - 2x·μ + ||μ||² distances_sq = X_sq + cross_term + centroids_sq # Clip negative values (numerical precision issues) return np.maximum(distances_sq, 0) def _assign_clusters( self, X: np.ndarray, centroids: np.ndarray ) -> np.ndarray: """ Assignment step: assign each point to the nearest centroid. This creates a Voronoi tessellation of the feature space. Args: X: Data matrix (n_samples, n_features) centroids: Current centroids (k, n_features) Returns: Cluster labels (n_samples,) where labels[i] ∈ {0, ..., k-1} """ distances_sq = self._compute_distances(X, centroids) return np.argmin(distances_sq, axis=1) def _update_centroids( self, X: np.ndarray, labels: np.ndarray ) -> np.ndarray: """ Update step: recompute centroids as cluster means. 
For each cluster j: μ_j = (1/|C_j|) Σ_{x_i ∈ C_j} x_i Args: X: Data matrix (n_samples, n_features) labels: Current cluster assignments (n_samples,) Returns: New centroids (k, n_features) """ n_features = X.shape[1] new_centroids = np.zeros((self.k, n_features)) for j in range(self.k): # Get all points assigned to cluster j cluster_mask = (labels == j) cluster_points = X[cluster_mask] if len(cluster_points) > 0: # Centroid is the mean of cluster points new_centroids[j] = cluster_points.mean(axis=0) else: # Empty cluster: reinitialize randomly # This handles edge cases where a cluster loses all points random_idx = np.random.randint(X.shape[0]) new_centroids[j] = X[random_idx] return new_centroids def _compute_inertia( self, X: np.ndarray, labels: np.ndarray, centroids: np.ndarray ) -> float: """ Compute within-cluster sum of squares (WCSS/inertia). Inertia = Σ_j Σ_{x_i ∈ C_j} ||x_i - μ_j||² This is the objective function that k-means minimizes. """ distances_sq = self._compute_distances(X, centroids) # Sum of squared distances to assigned centroids return np.sum(distances_sq[np.arange(len(labels)), labels]) def fit(self, X: np.ndarray) -> 'KMeans': """ Fit k-means clustering to data using Lloyd's algorithm. The main loop alternates between: 1. Assignment: assign points to nearest centroid 2. Update: recompute centroids as cluster means Args: X: Training data (n_samples, n_features) Returns: self (fitted estimator) """ X = np.asarray(X, dtype=np.float64) # Step 1: Initialize centroids self.centroids = self._initialize_centroids(X) # Main Lloyd's algorithm loop for iteration in range(self.max_iters): # Store old centroids for convergence check old_centroids = self.centroids.copy() # Step 2a: Assignment - assign points to nearest centroid self.labels = self._assign_clusters(X, self.centroids) # Step 2b: Update - recompute centroids self.centroids = self._update_centroids(X, self.labels) # Check convergence: have centroids stopped moving? centroid_shift = np.sum((self.centroids - old_centroids) ** 2) self.n_iters = iteration + 1 if centroid_shift < self.tol: break # Compute final inertia self.inertia = self._compute_inertia(X, self.labels, self.centroids) return self def predict(self, X: np.ndarray) -> np.ndarray: """ Predict cluster labels for new data points. Simply assigns each point to the nearest centroid. """ if self.centroids is None: raise ValueError("Model not fitted. Call fit() first.") return self._assign_clusters(np.asarray(X), self.centroids) # Example usage and visualizationif __name__ == "__main__": # Generate sample data: 3 clusters np.random.seed(42) cluster_1 = np.random.randn(50, 2) + np.array([0, 0]) cluster_2 = np.random.randn(50, 2) + np.array([5, 5]) cluster_3 = np.random.randn(50, 2) + np.array([10, 0]) X = np.vstack([cluster_1, cluster_2, cluster_3]) # Fit k-means kmeans = KMeans(k=3, random_state=42) kmeans.fit(X) print(f"Converged in {kmeans.n_iters} iterations") print(f"Final inertia: {kmeans.inertia:.2f}") print(f"Cluster sizes: {np.bincount(kmeans.labels)}") print(f"\nCentroids:\n{kmeans.centroids}")Lloyd's algorithm isn't just a clever heuristic—it's a coordinate descent algorithm that monotonically decreases an objective function. Understanding this connection is key to understanding k-means' behavior.
The Objective Being Minimized:
Define the within-cluster sum of squares (WCSS) or inertia:
$$J(\{C_j\}, \{\boldsymbol{\mu}_j\}) = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$$
This measures the total squared distance from each point to its assigned centroid. Lower $J$ means more compact clusters.
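In code, $J$ is a one-liner; the helper below (its name and signature are mine) computes it from the data, the current labels, and the current centroids.

```python
import numpy as np

def wcss(X: np.ndarray, labels: np.ndarray, centroids: np.ndarray) -> float:
    """Within-cluster sum of squares: J = sum_j sum_{x_i in C_j} ||x_i - mu_j||^2."""
    # centroids[labels] picks each point's assigned centroid, row by row
    return float(np.sum((X - centroids[labels]) ** 2))
```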
Lloyd's algorithm alternates between optimizing two sets of variables:
Assignment step: Fix centroids {μⱼ}, optimize assignments {Cⱼ} → Each point goes to nearest centroid, minimizing J
Update step: Fix assignments {Cⱼ}, optimize centroids {μⱼ} → Each centroid becomes cluster mean, minimizing J
Each step reduces (or maintains) J 📉. Since J is bounded below by 0 and decreases monotonically, the algorithm must converge.
Proof Sketch: Each Step Decreases J
Assignment step: For fixed centroids, assigning each point to its nearest centroid is exactly the choice that minimizes that point's contribution to $J$. Any other assignment would increase the sum of squared distances.
Update step: For fixed assignments, the centroid that minimizes $\sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$ is the arithmetic mean, as shown by the gradient calculation above (the gradient vanishes exactly at the mean).
Convergence Guarantee:
Since:

- each iteration decreases $J$ or leaves it unchanged,
- $J$ is bounded below by 0, and
- there are only finitely many ways to partition $n$ points into $k$ clusters, so a decreasing $J$ can never cycle through configurations,

the algorithm must terminate in finite time. However, it may converge to a local minimum, not necessarily the global minimum.
K-means is guaranteed to converge, but NOT guaranteed to find the global optimum. The final solution depends heavily on initialization. This is why practitioners run k-means multiple times with different random seeds and keep the best result (lowest inertia).
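A minimal sketch of that restart strategy, written against the `KMeans` class implemented above (the wrapper function itself is mine):

```python
import numpy as np

def best_of_n_runs(X: np.ndarray, k: int, n_runs: int = 10) -> "KMeans":
    """Run k-means from several random seeds and keep the lowest-inertia fit."""
    best = None
    for seed in range(n_runs):
        model = KMeans(k=k, random_state=seed).fit(X)   # KMeans class defined above
        if best is None or model.inertia < best.inertia:
            best = model
    return best
```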
Understanding the computational cost of k-means is essential for applying it to large datasets.
Per-Iteration Cost:
Assignment step: For each of $n$ points, compute distance to each of $k$ centroids. Each distance computation is $O(d)$.
Update step: For each of $k$ clusters, compute the mean of its assigned points. Summing every point into its cluster costs $O(nd)$ in total, plus $O(kd)$ for the divisions.
Per iteration: $O(nkd)$
Overall Complexity:
$$O(tnkd)$$
where $t$ is the number of iterations until convergence.
| Factor | Effect on Runtime | Practical Consideration |
|---|---|---|
| $n$ (samples) | Linear | Scales well to millions of points |
| $k$ (clusters) | Linear | Keep $k$ reasonable (typically < 1000) |
| $d$ (features) | Linear | High-d requires careful feature selection |
| $t$ (iterations) | Linear | Usually converges in 10-100 iterations |
Space Complexity:
The data matrix requires $O(nd)$, the centroids $O(kd)$, and the label vector $O(n)$. Total: $O(nd + kd) = O((n+k)d)$
For most applications where $n \gg k$, space complexity is dominated by storing the data itself.
Why K-Means is Fast:
Compared to other clustering algorithms, this is cheap: agglomerative hierarchical clustering typically needs $O(n^2)$ memory and at least $O(n^2)$ time, DBSCAN runs in roughly $O(n \log n)$ with a spatial index but degrades to $O(n^2)$ in the worst case, and spectral clustering requires an eigendecomposition that scales roughly as $O(n^3)$. K-means' $O(tnkd)$ cost, linear in $n$, is what makes it viable at scale.
K-means' linear scaling in $n$ makes it the go-to choice for large-scale clustering. With $n = 1{,}000{,}000$, $k = 100$, and $d = 100$, a single iteration processes about 10 billion floating-point operations—achievable in seconds on modern hardware.
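A quick back-of-envelope check of that figure; the $nkd$ term count comes from the analysis above, while the factor of roughly 3 flops per coordinate (subtract, square, accumulate) is an assumption.

```python
n, k, d = 1_000_000, 100, 100
distance_terms = n * k * d                 # coordinate-level terms per assignment step
flops = 3 * distance_terms                 # assumed ~3 flops per term: subtract, square, add
print(f"{distance_terms:.1e} terms, roughly {flops:.1e} flops per iteration")
```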
We've developed a complete understanding of Lloyd's algorithm—the engine behind k-means clustering. Let's consolidate the key insights:

- Lloyd's algorithm alternates between an assignment step (each point goes to its nearest centroid, carving the space into Voronoi cells) and an update step (each centroid becomes the mean of its cluster).
- Both steps decrease (or maintain) the within-cluster sum of squares $J$, so the algorithm is guaranteed to converge, though only to a local minimum that depends on initialization.
- Each iteration costs $O(nkd)$, so the overall cost $O(tnkd)$ scales linearly in the number of points.
What's Next:
Now that we understand how Lloyd's algorithm operates, we need to understand what it's optimizing. In the next page, we'll formally derive the k-means objective function and understand its connection to variance minimization, the EM algorithm, and probabilistic clustering models.
You now understand Lloyd's algorithm in depth—from the intuition to the mathematics to the implementation. Next, we'll explore the objective function that k-means is actually minimizing and its deeper theoretical foundations.