In the previous page, we learned that Lloyd's algorithm monotonically decreases some objective function. But what exactly is that objective? Why does minimizing it produce meaningful clusters?
The k-means objective function—the within-cluster sum of squares (WCSS)—is more than just a mathematical convenience. It encodes a specific notion of what makes a "good" clustering: clusters should be compact (points close to their centroid) and homogeneous (points within a cluster are similar to each other).
Understanding this objective function is crucial: it tells us exactly what Lloyd's algorithm is optimizing, and it explains both the strengths and the limitations of the clusters k-means produces.
By the end of this page, you will:
- Derive the k-means objective function from first principles
- Understand its connection to variance decomposition
- See why squared Euclidean distance is fundamental
- Connect k-means to vector quantization and compression
- Understand the probabilistic interpretation via GMMs
The k-means objective function is elegantly simple. Given a dataset $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, cluster assignments $\{C_1, \ldots, C_k\}$, and centroids $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$, the within-cluster sum of squares (WCSS) is:
$$J = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
Equivalently, using indicator variables $r_{ij} \in \{0, 1\}$ where $r_{ij} = 1$ iff point $i$ belongs to cluster $j$:
$$J = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
subject to $\sum_{j=1}^{k} r_{ij} = 1$ for all $i$ (each point in exactly one cluster).
The same quantity goes by many names:
- WCSS (Within-Cluster Sum of Squares): statistics terminology
- Inertia: scikit-learn terminology
- Distortion: information theory / vector quantization
- Quantization error: signal processing
All refer to the same objective: total squared error from points to their assigned centroids.
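To make the definition concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data is arbitrary) that computes $J$ directly from the definition and compares it with the `inertia_` value scikit-learn reports:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # arbitrary toy data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS from the definition: squared distance of each point
# to the centroid of its assigned cluster, summed over all points.
diffs = X - km.cluster_centers_[km.labels_]
wcss = np.sum(diffs ** 2)

print(f"Manual WCSS:      {wcss:.4f}")
print(f"sklearn inertia_: {km.inertia_:.4f}")  # should agree up to float error
```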
Decomposing the Objective:
We can rewrite $J$ in a revealing form. For each cluster $j$:
$$J_j = \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2 = |C_j| \cdot \text{Var}(C_j)$$
where $\text{Var}(C_j) = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$ is the within-cluster variance.
Thus: $$J = \sum_{j=1}^{k} |C_j| \cdot \text{Var}(C_j),$$ i.e., $J$ equals $n$ times the weighted average of the within-cluster variances (with weights $|C_j|/n$).
Interpretation: K-means minimizes the weighted sum of cluster variances. Larger clusters contribute more to the objective, which aligns with the intuition that we care more about the coherence of large clusters.
A beautiful result from statistics provides deep insight into what k-means accomplishes. The total variance of a dataset can be decomposed into within-cluster and between-cluster components.
Total Sum of Squares (TSS): $$\text{TSS} = \sum_{i=1}^{n} |\mathbf{x}_i - \bar{\mathbf{x}}|^2$$
where $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$ is the global mean.
Within-Cluster Sum of Squares (WCSS): $$\text{WCSS} = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
Between-Cluster Sum of Squares (BCSS): $$\text{BCSS} = \sum_{j=1}^{k} |C_j| |\boldsymbol{\mu}_j - \bar{\mathbf{x}}|^2$$
TSS = WCSS + BCSS
The total variance is the sum of within-cluster variance and between-cluster variance. Since TSS is fixed for a given dataset, minimizing WCSS is equivalent to maximizing BCSS.
In other words: k-means simultaneously makes clusters more compact (low within-cluster spread) and more separated (high between-cluster spread).
Proof of the Decomposition:
For any point $\mathbf{x}_i \in C_j$: $$\mathbf{x}_i - \bar{\mathbf{x}} = (\mathbf{x}_i - \boldsymbol{\mu}_j) + (\boldsymbol{\mu}_j - \bar{\mathbf{x}})$$
Squaring both sides and summing over all points: $$\sum_{i=1}^n |\mathbf{x}_i - \bar{\mathbf{x}}|^2 = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \left[ |\mathbf{x}_i - \boldsymbol{\mu}_j|^2 + |\boldsymbol{\mu}_j - \bar{\mathbf{x}}|^2 + 2(\mathbf{x}_i - \boldsymbol{\mu}_j)^T(\boldsymbol{\mu}_j - \bar{\mathbf{x}}) \right]$$
The cross term vanishes because $\sum_{\mathbf{x}_i \in C_j}(\mathbf{x}_i - \boldsymbol{\mu}_j) = \mathbf{0}$ (the mean is the point where deviations sum to zero). This yields:
$$\text{TSS} = \text{WCSS} + \text{BCSS}$$
Explained Variance Ratio:
We can define the explained variance ratio or R-squared for clustering:
$$R^2 = \frac{\text{BCSS}}{\text{TSS}} = 1 - \frac{\text{WCSS}}{\text{TSS}}$$
This provides a normalized measure of clustering quality that's invariant to the overall scale of the data.
| Component | Formula | K-Means Goal |
|---|---|---|
| TSS (Total) | $\sum_i \|\mathbf{x}_i - \bar{\mathbf{x}}\|^2$ | Fixed (given data) |
| WCSS (Within) | $\sum_j \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$ | Minimize ↓ |
| BCSS (Between) | $\sum_j \lvert C_j \rvert \, \|\boldsymbol{\mu}_j - \bar{\mathbf{x}}\|^2$ | Maximize ↑ |
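The decomposition is easy to check numerically. A minimal sketch, assuming NumPy and scikit-learn and using arbitrary synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three loose Gaussian blobs as toy data
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels, centers = km.labels_, km.cluster_centers_
x_bar = X.mean(axis=0)

tss = np.sum((X - x_bar) ** 2)                      # total sum of squares
wcss = np.sum((X - centers[labels]) ** 2)           # within-cluster
counts = np.bincount(labels)
bcss = np.sum(counts * np.sum((centers - x_bar) ** 2, axis=1))  # between-cluster

print(f"TSS  = {tss:.2f}")
print(f"WCSS = {wcss:.2f}")
print(f"BCSS = {bcss:.2f}")
print(f"WCSS + BCSS = {wcss + bcss:.2f}  (should equal TSS)")
print(f"R^2 = BCSS / TSS = {bcss / tss:.3f}")
```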
K-means specifically uses squared Euclidean distance $|\mathbf{x} - \boldsymbol{\mu}|^2$, not other distance measures. This isn't arbitrary—it's mathematically essential for the algorithm to work.
The Critical Property: The Mean Minimizes Squared Distances
For any set of points $S = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$, the point $\boldsymbol{\mu}$ that minimizes $\sum_{\mathbf{x} \in S} |\mathbf{x} - \boldsymbol{\mu}|^2$ is the arithmetic mean:
$$\boldsymbol{\mu}^* = \frac{1}{m}\sum_{\mathbf{x} \in S} \mathbf{x}$$
Proof: $$\frac{\partial}{\partial \boldsymbol{\mu}} \sum_{\mathbf{x} \in S} |\mathbf{x} - \boldsymbol{\mu}|^2 = -2\sum_{\mathbf{x} \in S}(\mathbf{x} - \boldsymbol{\mu}) = 0$$ $$\Rightarrow \sum_{\mathbf{x} \in S} \mathbf{x} = m \boldsymbol{\mu} \Rightarrow \boldsymbol{\mu} = \frac{1}{m}\sum_{\mathbf{x} \in S} \mathbf{x}$$
Other distance choices lead to different optimal centers:
- Manhattan distance (L1): The minimizer is the coordinate-wise median, not the mean. This leads to k-medians clustering.
- Euclidean distance (L2, not squared): Has no closed-form minimizer (the geometric median), requiring iterative optimization per cluster.
- Cosine distance: Spherical k-means uses normalized centroids, not plain means.
The squared Euclidean distance is special because it yields a closed-form, easily computable optimal centroid.
Geometric Interpretation:
Squared Euclidean distance creates spherical clusters—all points equidistant from a centroid form a hypersphere. The Voronoi cells (regions assigned to each centroid) are convex polytopes.
This has an important practical consequence for how the data must be prepared.
Feature Scaling is Essential:
If feature 1 ranges [0, 1] and feature 2 ranges [0, 1000], the squared distance is dominated by feature 2: $$|\mathbf{x} - \boldsymbol{\mu}|^2 = (x_1 - \mu_1)^2 + (x_2 - \mu_2)^2 \approx (x_2 - \mu_2)^2$$
Standardizing features (z-scores) or min-max scaling ensures all features contribute equally.
"""Why Squared Euclidean Distance is Special for K-Means This demonstrates the mathematical properties that make squared Euclidean distance uniquely suited for k-means."""import numpy as npfrom scipy.optimize import minimize def demonstrate_mean_minimizes_squared(): """ Show that the mean minimizes sum of squared distances. """ # Generate random points np.random.seed(42) points = np.random.randn(100, 2) * 2 + np.array([5, 3]) # The mean mean = points.mean(axis=0) # Sum of squared distances to mean ssd_to_mean = np.sum((points - mean) ** 2) # Try other candidate centers def total_squared_dist(center): return np.sum((points - center) ** 2) # Random candidate random_point = np.array([4.5, 2.5]) ssd_to_random = total_squared_dist(random_point) # Median (minimizes L1, not L2²) median = np.median(points, axis=0) ssd_to_median = total_squared_dist(median) print("Sum of Squared Distances to Different Centers:") print(f" To mean: {ssd_to_mean:.4f} ← MINIMUM") print(f" To median: {ssd_to_median:.4f}") print(f" To random point: {ssd_to_random:.4f}") print() # Verify with optimization result = minimize(total_squared_dist, x0=[0, 0], method='BFGS') print(f"Numerically optimal center: {result.x}") print(f"Analytical mean: {mean}") print(f"Match: {np.allclose(result.x, mean)}") def demonstrate_l1_vs_l2(): """ Show that L1 (Manhattan) distance leads to different optimal center. """ np.random.seed(42) points = np.random.randn(100, 2) # L2² optimal center is the mean mean = points.mean(axis=0) l2_sq_loss = np.sum((points - mean) ** 2) # L1 optimal center is the coordinate-wise median median = np.median(points, axis=0) l1_loss_at_median = np.sum(np.abs(points - median)) l1_loss_at_mean = np.sum(np.abs(points - mean)) print("L1 (Manhattan) vs L2² (Squared Euclidean):") print(f" Mean: {mean}") print(f" Median: {median}") print(f" L1 loss at median: {l1_loss_at_median:.4f} ← MINIMUM for L1") print(f" L1 loss at mean: {l1_loss_at_mean:.4f}") print() print("This is why k-medians uses the median, not the mean!") def demonstrate_feature_scaling(): """ Show impact of feature scaling on k-means clustering. 
""" np.random.seed(42) # Two clusters in original scale cluster1 = np.random.randn(50, 2) + np.array([0, 0]) cluster2 = np.random.randn(50, 2) + np.array([3, 3]) data_scaled = np.vstack([cluster1, cluster2]) # Same data with feature 2 scaled up by 100x data_unscaled = data_scaled.copy() data_unscaled[:, 1] *= 100 # Distance between cluster centers center1_scaled = cluster1.mean(axis=0) center2_scaled = cluster2.mean(axis=0) dist_scaled = np.linalg.norm(center1_scaled - center2_scaled) center1_unscaled = data_unscaled[:50].mean(axis=0) center2_unscaled = data_unscaled[50:].mean(axis=0) dist_unscaled = np.linalg.norm(center1_unscaled - center2_unscaled) print("Impact of Feature Scaling:") print(f" Scaled data - inter-cluster distance: {dist_scaled:.2f}") print(f" Unscaled data - inter-cluster distance: {dist_unscaled:.2f}") print() print(f" Feature 1 contribution (scaled): {abs(center2_scaled[0] - center1_scaled[0]):.2f}") print(f" Feature 2 contribution (scaled): {abs(center2_scaled[1] - center1_scaled[1]):.2f}") print(f" Feature 1 contribution (unscaled): {abs(center2_unscaled[0] - center1_unscaled[0]):.2f}") print(f" Feature 2 contribution (unscaled): {abs(center2_unscaled[1] - center1_unscaled[1]):.2f}") print() print("In unscaled data, feature 2 dominates the distance calculation!") if __name__ == "__main__": demonstrate_mean_minimizes_squared() print("=" * 60) demonstrate_l1_vs_l2() print("=" * 60) demonstrate_feature_scaling()K-means has a fascinating interpretation in signal processing and information theory as vector quantization (VQ)—a technique for lossy data compression.
The VQ Problem:
Given $n$ vectors $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, find $k$ codebook vectors $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$ (also called codewords or prototypes) that best represent the data. Each data point is then quantized by replacing it with its nearest codeword.
The distortion is exactly the k-means objective: $$D = \frac{1}{n}\sum_{i=1}^{n} \min_j |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
In the quantization literature, the same iterative procedure appears as the Lloyd-Max algorithm (for scalar quantization) and as the generalized Lloyd (LBG) algorithm for vector quantization.
A classic application: compress an image by clustering pixel colors.
A 1000×1000 RGB image uses 3MB. With k=16 colors: 48 bytes (centroids) + 500KB (4-bit indices) ≈ 6× compression!
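Here is a rough sketch of color quantization with k-means. The "image" below is a hypothetical synthetic stand-in for real pixel data, and scikit-learn's KMeans serves as the quantizer:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for an RGB image: 200x200 pixels drawn around 5 base colors
themes = rng.integers(0, 256, size=(5, 3))
image = themes[rng.integers(0, 5, size=(200, 200))] + rng.normal(0, 10, size=(200, 200, 3))
image = np.clip(image, 0, 255)

pixels = image.reshape(-1, 3)  # one row per pixel

k = 16
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# Quantize: replace each pixel by its nearest codebook color
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)
n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))

# Distortion = mean squared error per pixel (= inertia / n)
distortion = km.inertia_ / pixels.shape[0]
bits_per_pixel = np.ceil(np.log2(k))  # 4 bits of index per pixel for k=16
print(f"Codebook: {k} colors, image now uses {n_colors} distinct colors")
print(f"Index cost: {bits_per_pixel:.0f} bits per pixel")
print(f"Mean quantization error (squared RGB distance): {distortion:.1f}")
```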
Information-Theoretic View:
From an information theory perspective, the k-means codebook enables a simple coding scheme: transmit the $k$ codewords once, then encode each data point by the index of its nearest codeword, using about $\log_2 k$ bits per point.
This connects to rate-distortion theory: the tradeoff between bits used (rate) and reconstruction error (distortion). Designing the best codebook for a fixed number of codewords is exactly the k-means problem; Lloyd's algorithm finds a locally optimal one.
Comparison with Other Quantizers:
| Method | Codebook Structure | K-Means Equivalent? |
|---|---|---|
| K-Means | Unstructured (any k vectors) | Yes |
| Product Quantization | Cartesian product of subspace codebooks | No (but related) |
| Lattice Quantization | Regular geometric lattice | No |
| Tree-Structured VQ | Hierarchical refinement | Related (hierarchical k-means) |
K-means has a deep connection to Gaussian Mixture Models (GMMs)—specifically, k-means is the limit of GMM when cluster variances go to zero.
The GMM Model:
A Gaussian Mixture Model assumes data is generated from $k$ Gaussian distributions: $$p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_j, \Sigma_j)$$
where $\pi_j$ are mixing weights and $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$ is the Gaussian density.
K-Means as a Limiting Case:
Consider a GMM with equal mixing weights $\pi_j = 1/k$ and shared spherical covariances $\Sigma_j = \sigma^2 I$, and take the limit $\sigma^2 \to 0$.
Derivation:
The probability of point $\mathbf{x}$ belonging to cluster $j$ is: $$p(z = j | \mathbf{x}) = \frac{\pi_j \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \sigma^2 I)}{\sum_{j'} \pi_{j'} \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_{j'}, \sigma^2 I)}$$
For spherical Gaussians: $$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \sigma^2 I) \propto \exp\left(-\frac{|\mathbf{x} - \boldsymbol{\mu}_j|^2}{2\sigma^2}\right)$$
As $\sigma^2 \to 0$, the exponential becomes sharply peaked around the nearest centroid. The posterior becomes a hard assignment:
$$\lim_{\sigma^2 \to 0} p(z = j|\mathbf{x}) = \begin{cases} 1 & \text{if } j = \arg\min_{j'} |\mathbf{x} - \boldsymbol{\mu}_{j'}|^2 \\ 0 & \text{otherwise} \end{cases}$$
This is exactly k-means assignment!
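The hardening of the posterior is easy to see numerically. A small sketch using only NumPy, with two hypothetical fixed centroids and one test point:

```python
import numpy as np

# Two fixed centroids and a point slightly closer to the first one (toy values)
mus = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([0.9, 0.0])

def responsibilities(x, mus, sigma2):
    """GMM posterior p(z=j|x) with equal weights and covariance sigma2 * I."""
    # Work in log space for numerical stability as sigma2 -> 0
    log_p = -np.sum((x - mus) ** 2, axis=1) / (2 * sigma2)
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / p.sum()

for sigma2 in [1.0, 0.1, 0.01, 0.001]:
    r = responsibilities(x, mus, sigma2)
    print(f"sigma^2 = {sigma2:6.3f}  ->  p(z=1|x) = {r[0]:.4f}, p(z=2|x) = {r[1]:.4f}")

# As sigma^2 shrinks, the responsibility collapses onto the nearest centroid,
# reproducing the hard k-means assignment rule.
```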
K-Means (Hard EM):
- Hard cluster assignments (each point in one cluster)
- Assumes equal, spherical clusters
- No uncertainty quantification
- Computationally cheaper

Full GMM (Soft EM):
- Soft assignments (probabilities over clusters)
- Allows different cluster shapes and sizes
- Provides posterior probabilities
- More expressive but expensive
While Lloyd's algorithm is computationally efficient, finding the globally optimal k-means solution is computationally intractable.
Formal Result:
The k-means clustering problem (minimizing the WCSS) is NP-hard. It remains NP-hard even when either the number of clusters $k$ or the dimension $d$ is fixed at 2.
What This Means:
Worst-case iterations: Lloyd's algorithm can require exponentially many iterations, $2^{\Omega(n)}$, to converge on certain adversarial inputs.
Average case: On random data, typically converges in O(n) iterations.
Smoothed analysis: With small random perturbations of the input, the expected number of iterations is polynomial (in the smoothed analysis framework of Spielman & Teng).
In practice, Lloyd's algorithm converges quickly: each iteration costs $O(k \cdot n \cdot d)$, and only a modest number of iterations is usually needed.
Approximation Guarantees:
Despite NP-hardness, good approximations are achievable: k-means++ initialization yields an expected $O(\log k)$-approximation to the optimal WCSS, and constant-factor approximation algorithms (e.g., local-search methods) exist.
For most practical applications, Lloyd's algorithm with good initialization (k-means++) provides solutions within a few percent of optimal.
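As a preview of why initialization matters, here is a brief sketch (assuming scikit-learn; the dataset and seeds are arbitrary) comparing the final WCSS reached from purely random starts versus k-means++ starts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Arbitrary synthetic dataset with 8 well-defined clusters
X, _ = make_blobs(n_samples=500, centers=8, cluster_std=1.2, random_state=7)

for init in ["random", "k-means++"]:
    # One Lloyd run per seed, so we see the raw spread across initializations
    inertias = [
        KMeans(n_clusters=8, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(20)
    ]
    print(f"{init:>10}: best={min(inertias):.1f}  worst={max(inertias):.1f}  "
          f"spread={max(inertias) - min(inertias):.1f}")

# k-means++ starts typically land much closer to the best solution found,
# which is exactly the topic of the next page.
```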
What's Next:
We now understand what k-means is optimizing and why. But the algorithm's success heavily depends on initialization—different starting points can lead to wildly different solutions. In the next page, we'll study initialization strategies, especially the k-means++ algorithm that provides theoretical guarantees on solution quality.
You now understand the k-means objective function from multiple perspectives: statistical (variance decomposition), geometric (squared distances), information-theoretic (vector quantization), and probabilistic (GMM limiting case). This foundation prepares you for understanding why initialization is so critical.