If you've ever used clustering in any practical application—customer segmentation, image compression, document grouping, or anomaly detection—chances are you've encountered k-means clustering. First proposed by Stuart Lloyd at Bell Labs in 1957 (though not published until 1982), this algorithm has become the de facto starting point for partitional clustering.
What makes k-means so ubiquitous? It's conceptually simple, computationally efficient, and surprisingly effective across diverse domains. Yet beneath this simplicity lies rich mathematical structure that connects to optimization theory, computational geometry, and statistical learning.
In this page, we'll develop a deep understanding of Lloyd's algorithm—the iterative procedure that makes k-means work. We'll trace through exactly what happens at each step, build intuition for why it works, and prepare the foundation for understanding its mathematical underpinnings.
By the end of this page, you will:

- Understand the complete Lloyd's algorithm step-by-step
- Trace through the assignment and update phases in detail
- Develop intuition for why the algorithm converges
- Implement k-means from scratch with full understanding
- Recognize the algorithm's connection to optimization
Before diving into the algorithm, let's precisely define the problem k-means solves.
The Input: We have a dataset $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector. We also specify $k$, the desired number of clusters.
The Output: We seek a partition of the data into $k$ disjoint clusters $\{C_1, C_2, \ldots, C_k\}$ such that:

- every point belongs to exactly one cluster ($C_j \cap C_{j'} = \emptyset$ for $j \neq j'$), and
- together the clusters cover the whole dataset ($C_1 \cup C_2 \cup \cdots \cup C_k = \mathcal{X}$).
Additionally, k-means produces cluster centroids $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k\}$ where each $\boldsymbol{\mu}_j \in \mathbb{R}^d$ represents the "center" of cluster $C_j$.
K-means performs hard clustering—each point belongs to exactly one cluster. This contrasts with soft clustering (like Gaussian Mixture Models) where points have probabilistic membership across multiple clusters. The hard assignment makes k-means computationally efficient but can be problematic for points near cluster boundaries.
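To make the contrast concrete, here is a minimal sketch (with made-up numbers) of a point sitting exactly on the boundary between two centroids: the hard assignment must pick one side, while a soft, GMM-style scheme splits the membership. The exponential weighting below is purely illustrative and not the actual GMM update.

```python
import numpy as np

# Toy illustration (not from the text): one point equidistant from two centroids.
x = np.array([2.0, 2.0])
centroids = np.array([[1.0, 1.0], [3.0, 3.0]])

sq_dists = np.sum((centroids - x) ** 2, axis=1)   # [2.0, 2.0]

# Hard assignment (k-means): winner takes all, ties broken by argmin.
hard_label = np.argmin(sq_dists)                  # 0

# Soft assignment (GMM-flavored): graded membership across both clusters.
resp = np.exp(-0.5 * sq_dists)
resp /= resp.sum()                                # [0.5, 0.5]
print(hard_label, resp)
```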
The Intuition:
We want points within the same cluster to be "similar" (close to each other) while points in different clusters should be "dissimilar" (far apart). K-means operationalizes this by representing each cluster with a single centroid and assigning every point to its nearest centroid, choosing centroids and assignments so that the total squared distance from points to their assigned centroids is as small as possible.
This creates clusters that are compact (points are close to their centroid) and separated (centroids are distinct).
| Symbol | Meaning | Dimensionality |
|---|---|---|
| $n$ | Number of data points | Scalar |
| $d$ | Feature dimensionality | Scalar |
| $k$ | Number of clusters | Scalar |
| $\mathbf{x}_i$ | The $i$-th data point | $\mathbb{R}^d$ |
| $\boldsymbol{\mu}_j$ | Centroid of cluster $j$ | $\mathbb{R}^d$ |
| $C_j$ | Set of points in cluster $j$ | Set of indices |
| $r_{ij}$ | Assignment indicator (1 if $\mathbf{x}_i \in C_j$) | $\{0, 1\}$ |
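As a quick sanity check on the notation, here is a minimal sketch of how these symbols typically map to NumPy arrays; the variable names (`X`, `centroids`, `labels`, `R`) are mine, not fixed by the text.

```python
import numpy as np

n, d, k = 6, 2, 2                      # points, dimensions, clusters
X = np.random.rand(n, d)               # data matrix: row i is x_i in R^d
centroids = np.random.rand(k, d)       # row j is mu_j in R^d
labels = np.random.randint(0, k, n)    # labels[i] = index of the cluster containing x_i

# r_ij as a one-hot matrix: R[i, j] = 1 iff x_i belongs to C_j
R = np.zeros((n, k))
R[np.arange(n), labels] = 1
assert R.sum() == n                    # each point belongs to exactly one cluster
```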
Lloyd's algorithm is elegantly simple: it alternates between two steps until convergence. Let's examine each step in detail.
Algorithm Overview:
1. INITIALIZE: Choose k initial centroids μ₁, μ₂, ..., μₖ
2. REPEAT until convergence:
a. ASSIGNMENT STEP: Assign each point to nearest centroid
b. UPDATE STEP: Recompute centroids as cluster means
3. RETURN: Final clusters and centroids
Let's unpack each component:
Step 2a: The Assignment Step
Given current centroids $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$, assign each point to the cluster with the nearest centroid:
$$C_j = \{\mathbf{x}_i : \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \leq \|\mathbf{x}_i - \boldsymbol{\mu}_{j'}\|^2 \text{ for all } j' \neq j\}$$
In other words, point $\mathbf{x}_i$ is assigned to cluster $j^* = \arg\min_j \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$.
Key insight: This step creates a Voronoi tessellation of the feature space. Each cluster corresponds to a Voronoi cell—the region of space closer to that centroid than any other.
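A minimal NumPy sketch of the assignment step, assuming `X` is the $(n, d)$ data matrix and `centroids` is the $(k, d)$ matrix of current centroids; the full implementation later in this page does the same thing with a more memory-efficient distance computation.

```python
import numpy as np

def assign_clusters(X: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assignment step: label each row of X with the index of its nearest centroid."""
    # Broadcast to pairwise differences: (n, 1, d) - (1, k, d) -> (n, k, d)
    diffs = X[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)   # (n, k) squared Euclidean distances
    return np.argmin(sq_dists, axis=1)      # (n,) cluster indices
```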
Step 2b: The Update Step
Given current cluster assignments $\{C_1, \ldots, C_k\}$, recompute each centroid as the mean of its assigned points:
$$\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i$$
Key insight: The centroid is the point that minimizes the sum of squared distances to all points in the cluster. This is why k-means is sometimes called the minimum variance clustering method.
Mathematical justification: For a fixed cluster $C_j$, the point $\boldsymbol{\mu}$ that minimizes $\sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2$ is precisely the arithmetic mean. This follows from setting the gradient to zero:
$$\frac{\partial}{\partial \boldsymbol{\mu}} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2 = -2\sum_{\mathbf{x}_i \in C_j} (\mathbf{x}_i - \boldsymbol{\mu}) = 0$$
Solving yields $\boldsymbol{\mu} = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i$.
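As a quick numerical illustration of this fact, the snippet below (using made-up Gaussian data) checks that perturbing the mean always increases the within-cluster sum of squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(100, 3))          # made-up points belonging to one cluster

def ssd(points: np.ndarray, mu: np.ndarray) -> float:
    """Sum of squared distances from the points to a candidate center mu."""
    return float(np.sum((points - mu) ** 2))

mu_mean = cluster.mean(axis=0)
# Any perturbation of the mean yields a strictly larger sum of squared distances.
for _ in range(5):
    candidate = mu_mean + rng.normal(scale=0.1, size=3)
    assert ssd(cluster, mu_mean) < ssd(cluster, candidate)
print("SSD at the mean:", ssd(cluster, mu_mean))
```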
The algorithm converges when either:

- No assignments change between iterations
- Centroids don't move (or move less than a threshold ε)
- Maximum iterations reached (safeguard against slow convergence)
In practice, 'no assignment changes' is the cleanest criterion and guarantees we've reached a fixed point.
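The sketch below shows how the 'no assignment changes' criterion looks in code; it is a compact illustration with assumed variable names, and it deliberately ignores empty clusters, which the full implementation later on handles explicitly.

```python
import numpy as np

def lloyd_until_stable(X: np.ndarray, centroids: np.ndarray, max_iters: int = 300):
    """Iterate assignment/update until no point changes cluster (or max_iters is hit)."""
    labels = None
    for _ in range(max_iters):
        sq_dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(sq_dists, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # fixed point: assignments stopped changing
        labels = new_labels
        # NOTE: assumes no cluster ever becomes empty; see the full implementation below
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
    return labels, centroids
```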
Let's trace Lloyd's algorithm on a small 2D dataset to build intuition. Consider 6 points in $\mathbb{R}^2$ that we want to partition into $k=2$ clusters:
Dataset: $\mathbf{x}_1 = (1, 1)$, $\mathbf{x}_2 = (1.5, 2)$, $\mathbf{x}_3 = (3, 4)$, $\mathbf{x}_4 = (5, 7)$, $\mathbf{x}_5 = (3.5, 5)$, $\mathbf{x}_6 = (4.5, 5)$
Initialization (Random Selection)
Suppose we randomly select $\mathbf{x}_1 = (1, 1)$ and $\mathbf{x}_4 = (5, 7)$ as the initial centroids.
These are our starting points. Now we begin iterating.
After just 2 iterations, the algorithm converged to:

- Cluster 1: {(1,1), (1.5,2), (3,4)} with centroid (1.83, 2.33)
- Cluster 2: {(5,7), (3.5,5), (4.5,5)} with centroid (4.33, 5.67)
This separates the 'lower-left' points from the 'upper-right' points—an intuitive clustering.
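The short script below reproduces this trace with plain NumPy. The six points come from the example above; treating $(1,1)$ as $\mathbf{x}_1$ and $(5,7)$ as $\mathbf{x}_4$ is the indexing assumed here.

```python
import numpy as np

# The six points from the worked example; the indexing of x_1 and x_4 is assumed.
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]], dtype=float)
centroids = X[[0, 3]].copy()   # initial centroids: x_1 = (1, 1) and x_4 = (5, 7)

for iteration in range(1, 10):
    sq_dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    labels = sq_dists.argmin(axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {iteration}: centroids =\n{new_centroids}")
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving
    centroids = new_centroids
# Converges to centroids of roughly (1.83, 2.33) and (4.33, 5.67), matching the trace above.
```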
Let's implement Lloyd's algorithm from scratch, emphasizing clarity and understanding over optimization. This implementation includes all the key components we've discussed.
"""Lloyd's Algorithm: K-Means Clustering from Scratch A complete, well-documented implementation emphasizing understandingover micro-optimizations. Each step maps directly to the theory."""import numpy as npfrom typing import Tuple, List, Optional class KMeans: """ K-Means clustering using Lloyd's algorithm. Attributes: k: Number of clusters max_iters: Maximum iterations before stopping tol: Convergence tolerance for centroid movement centroids: Final cluster centroids after fitting labels: Cluster assignments for training data inertia: Final within-cluster sum of squares n_iters: Number of iterations until convergence """ def __init__( self, k: int, max_iters: int = 300, tol: float = 1e-4, random_state: Optional[int] = None ): self.k = k self.max_iters = max_iters self.tol = tol self.random_state = random_state # These are set during fit() self.centroids = None self.labels = None self.inertia = None self.n_iters = 0 def _initialize_centroids(self, X: np.ndarray) -> np.ndarray: """ Initialize centroids using random selection (Forgy method). Randomly selects k data points as initial centroids. This is simple but can lead to poor initializations. Args: X: Data matrix of shape (n_samples, n_features) Returns: Initial centroids of shape (k, n_features) """ n_samples = X.shape[0] if self.random_state is not None: np.random.seed(self.random_state) # Select k unique indices indices = np.random.choice(n_samples, size=self.k, replace=False) # Return copies of selected points as centroids return X[indices].copy() def _compute_distances( self, X: np.ndarray, centroids: np.ndarray ) -> np.ndarray: """ Compute squared Euclidean distances from each point to each centroid. Uses the identity ||x - μ||² = ||x||² - 2x·μ + ||μ||² for efficient vectorized computation. Args: X: Data matrix (n_samples, n_features) centroids: Centroid matrix (k, n_features) Returns: Distance matrix (n_samples, k) where D[i,j] = ||x_i - μ_j||² """ # ||x||² for each sample: shape (n_samples, 1) X_sq = np.sum(X ** 2, axis=1, keepdims=True) # ||μ||² for each centroid: shape (1, k) centroids_sq = np.sum(centroids ** 2, axis=1, keepdims=True).T # -2 * X @ μ^T: shape (n_samples, k) cross_term = -2 * X @ centroids.T # ||x - μ||² = ||x||² - 2x·μ + ||μ||² distances_sq = X_sq + cross_term + centroids_sq # Clip negative values (numerical precision issues) return np.maximum(distances_sq, 0) def _assign_clusters( self, X: np.ndarray, centroids: np.ndarray ) -> np.ndarray: """ Assignment step: assign each point to the nearest centroid. This creates a Voronoi tessellation of the feature space. Args: X: Data matrix (n_samples, n_features) centroids: Current centroids (k, n_features) Returns: Cluster labels (n_samples,) where labels[i] ∈ {0, ..., k-1} """ distances_sq = self._compute_distances(X, centroids) return np.argmin(distances_sq, axis=1) def _update_centroids( self, X: np.ndarray, labels: np.ndarray ) -> np.ndarray: """ Update step: recompute centroids as cluster means. 
For each cluster j: μ_j = (1/|C_j|) Σ_{x_i ∈ C_j} x_i Args: X: Data matrix (n_samples, n_features) labels: Current cluster assignments (n_samples,) Returns: New centroids (k, n_features) """ n_features = X.shape[1] new_centroids = np.zeros((self.k, n_features)) for j in range(self.k): # Get all points assigned to cluster j cluster_mask = (labels == j) cluster_points = X[cluster_mask] if len(cluster_points) > 0: # Centroid is the mean of cluster points new_centroids[j] = cluster_points.mean(axis=0) else: # Empty cluster: reinitialize randomly # This handles edge cases where a cluster loses all points random_idx = np.random.randint(X.shape[0]) new_centroids[j] = X[random_idx] return new_centroids def _compute_inertia( self, X: np.ndarray, labels: np.ndarray, centroids: np.ndarray ) -> float: """ Compute within-cluster sum of squares (WCSS/inertia). Inertia = Σ_j Σ_{x_i ∈ C_j} ||x_i - μ_j||² This is the objective function that k-means minimizes. """ distances_sq = self._compute_distances(X, centroids) # Sum of squared distances to assigned centroids return np.sum(distances_sq[np.arange(len(labels)), labels]) def fit(self, X: np.ndarray) -> 'KMeans': """ Fit k-means clustering to data using Lloyd's algorithm. The main loop alternates between: 1. Assignment: assign points to nearest centroid 2. Update: recompute centroids as cluster means Args: X: Training data (n_samples, n_features) Returns: self (fitted estimator) """ X = np.asarray(X, dtype=np.float64) # Step 1: Initialize centroids self.centroids = self._initialize_centroids(X) # Main Lloyd's algorithm loop for iteration in range(self.max_iters): # Store old centroids for convergence check old_centroids = self.centroids.copy() # Step 2a: Assignment - assign points to nearest centroid self.labels = self._assign_clusters(X, self.centroids) # Step 2b: Update - recompute centroids self.centroids = self._update_centroids(X, self.labels) # Check convergence: have centroids stopped moving? centroid_shift = np.sum((self.centroids - old_centroids) ** 2) self.n_iters = iteration + 1 if centroid_shift < self.tol: break # Compute final inertia self.inertia = self._compute_inertia(X, self.labels, self.centroids) return self def predict(self, X: np.ndarray) -> np.ndarray: """ Predict cluster labels for new data points. Simply assigns each point to the nearest centroid. """ if self.centroids is None: raise ValueError("Model not fitted. Call fit() first.") return self._assign_clusters(np.asarray(X), self.centroids) # Example usage and visualizationif __name__ == "__main__": # Generate sample data: 3 clusters np.random.seed(42) cluster_1 = np.random.randn(50, 2) + np.array([0, 0]) cluster_2 = np.random.randn(50, 2) + np.array([5, 5]) cluster_3 = np.random.randn(50, 2) + np.array([10, 0]) X = np.vstack([cluster_1, cluster_2, cluster_3]) # Fit k-means kmeans = KMeans(k=3, random_state=42) kmeans.fit(X) print(f"Converged in {kmeans.n_iters} iterations") print(f"Final inertia: {kmeans.inertia:.2f}") print(f"Cluster sizes: {np.bincount(kmeans.labels)}") print(f"\nCentroids:\n{kmeans.centroids}")Lloyd's algorithm isn't just a clever heuristic—it's a coordinate descent algorithm that monotonically decreases an objective function. Understanding this connection is key to understanding k-means' behavior.
The Objective Being Minimized:
Define the within-cluster sum of squares (WCSS) or inertia:
$$J(\{C_j\}, \{\boldsymbol{\mu}_j\}) = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$$
This measures the total squared distance from each point to its assigned centroid. Lower $J$ means more compact clusters.
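In code, $J$ is a one-liner; the helper below (its name and signature are mine) computes it from the data, the current labels, and the current centroids.

```python
import numpy as np

def wcss(X: np.ndarray, labels: np.ndarray, centroids: np.ndarray) -> float:
    """Within-cluster sum of squares: J = sum_j sum_{x_i in C_j} ||x_i - mu_j||^2."""
    # centroids[labels] picks each point's assigned centroid, row by row
    return float(np.sum((X - centroids[labels]) ** 2))
```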
Lloyd's algorithm alternates between optimizing two sets of variables:
Assignment step: Fix centroids {μⱼ}, optimize assignments {Cⱼ} → Each point goes to nearest centroid, minimizing J
Update step: Fix assignments {Cⱼ}, optimize centroids {μⱼ} → Each centroid becomes cluster mean, minimizing J
Each step reduces (or maintains) J 📉. Since J is bounded below by 0 and decreases monotonically, the algorithm must converge.
Proof Sketch: Each Step Decreases J
Assignment step: For fixed centroids, assigning each point to its nearest centroid is exactly the choice that minimizes that point's contribution to $J$. Any other assignment would increase the sum of squared distances.
Update step: For fixed assignments, the centroid that minimizes $\sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$ is the arithmetic mean, as shown by the gradient calculation above (the gradient vanishes exactly at the mean).
Convergence Guarantee:
Since:

- each iteration decreases $J$ or leaves it unchanged,
- $J$ is bounded below by 0, and
- there are only finitely many ways to partition $n$ points into $k$ clusters, so a decreasing $J$ can never cycle through configurations,

the algorithm must terminate in finite time. However, it may converge to a local minimum, not necessarily the global minimum.
K-means is guaranteed to converge, but NOT guaranteed to find the global optimum. The final solution depends heavily on initialization. This is why practitioners run k-means multiple times with different random seeds and keep the best result (lowest inertia).
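A minimal sketch of that restart strategy, written against the `KMeans` class implemented above (the wrapper function itself is mine):

```python
import numpy as np

def best_of_n_runs(X: np.ndarray, k: int, n_runs: int = 10) -> "KMeans":
    """Run k-means from several random seeds and keep the lowest-inertia fit."""
    best = None
    for seed in range(n_runs):
        model = KMeans(k=k, random_state=seed).fit(X)   # KMeans class defined above
        if best is None or model.inertia < best.inertia:
            best = model
    return best
```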
Understanding the computational cost of k-means is essential for applying it to large datasets.
Per-Iteration Cost:
Assignment step: For each of $n$ points, compute distance to each of $k$ centroids. Each distance computation is $O(d)$.
Update step: For each of $k$ clusters, compute the mean of its assigned points. Summing every point into its cluster costs $O(nd)$ in total, plus $O(kd)$ for the divisions.
Per iteration: $O(nkd)$
Overall Complexity:
$$O(tnkd)$$
where $t$ is the number of iterations until convergence.
| Factor | Effect on Runtime | Practical Consideration |
|---|---|---|
| $n$ (samples) | Linear | Scales well to millions of points |
| $k$ (clusters) | Linear | Keep $k$ reasonable (typically < 1000) |
| $d$ (features) | Linear | High-d requires careful feature selection |
| $t$ (iterations) | Linear | Usually converges in 10-100 iterations |
Space Complexity:
The data matrix requires $O(nd)$, the centroids $O(kd)$, and the label vector $O(n)$. Total: $O(nd + kd) = O((n+k)d)$
For most applications where $n \gg k$, space complexity is dominated by storing the data itself.
Why K-Means is Fast:
Compared to other clustering algorithms, this is cheap: agglomerative hierarchical clustering typically needs $O(n^2)$ memory and at least $O(n^2)$ time, DBSCAN runs in roughly $O(n \log n)$ with a spatial index but degrades to $O(n^2)$ in the worst case, and spectral clustering requires an eigendecomposition that scales roughly as $O(n^3)$. K-means' $O(tnkd)$ cost, linear in $n$, is what makes it viable at scale.
K-means' linear scaling in $n$ makes it the go-to choice for large-scale clustering. With $n = 1{,}000{,}000$, $k = 100$, and $d = 100$, a single iteration processes about 10 billion floating-point operations—achievable in seconds on modern hardware.
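A quick back-of-envelope check of that figure; the $nkd$ term count comes from the analysis above, while the factor of roughly 3 flops per coordinate (subtract, square, accumulate) is an assumption.

```python
n, k, d = 1_000_000, 100, 100
distance_terms = n * k * d                 # coordinate-level terms per assignment step
flops = 3 * distance_terms                 # assumed ~3 flops per term: subtract, square, add
print(f"{distance_terms:.1e} terms, roughly {flops:.1e} flops per iteration")
```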
We've developed a complete understanding of Lloyd's algorithm—the engine behind k-means clustering. Let's consolidate the key insights:

- Lloyd's algorithm alternates between an assignment step (each point goes to its nearest centroid, carving the space into Voronoi cells) and an update step (each centroid becomes the mean of its cluster).
- Both steps decrease (or maintain) the within-cluster sum of squares $J$, so the algorithm is guaranteed to converge, though only to a local minimum that depends on initialization.
- Each iteration costs $O(nkd)$, so the overall cost $O(tnkd)$ scales linearly in the number of points.
What's Next:
Now that we understand how Lloyd's algorithm operates, we need to understand what it's optimizing. In the next page, we'll formally derive the k-means objective function and understand its connection to variance minimization, the EM algorithm, and probabilistic clustering models.
You now understand Lloyd's algorithm in depth—from the intuition to the mathematics to the implementation. Next, we'll explore the objective function that k-means is actually minimizing and its deeper theoretical foundations.