In the previous page, we learned that Lloyd's algorithm monotonically decreases some objective function. But what exactly is that objective? Why does minimizing it produce meaningful clusters?
The k-means objective function—the within-cluster sum of squares (WCSS)—is more than just a mathematical convenience. It encodes a specific notion of what makes a "good" clustering: clusters should be compact (points close to their centroid) and homogeneous (points within a cluster are similar to each other).
Understanding this objective function is crucial: it tells us exactly what Lloyd's algorithm is optimizing, and it explains both the strengths and the limitations of the clusters k-means produces.
By the end of this page, you will:
- Derive the k-means objective function from first principles
- Understand its connection to variance decomposition
- See why squared Euclidean distance is fundamental
- Connect k-means to vector quantization and compression
- Understand the probabilistic interpretation via GMMs
The k-means objective function is elegantly simple. Given a dataset $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, cluster assignments $\{C_1, \ldots, C_k\}$, and centroids $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$, the within-cluster sum of squares (WCSS) is:
$$J = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
Equivalently, using indicator variables $r_{ij} \in \{0, 1\}$ where $r_{ij} = 1$ iff point $i$ belongs to cluster $j$:
$$J = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
subject to $\sum_{j=1}^{k} r_{ij} = 1$ for all $i$ (each point in exactly one cluster).
The same quantity goes by many names:
- WCSS (Within-Cluster Sum of Squares): statistics terminology
- Inertia: scikit-learn terminology
- Distortion: information theory / vector quantization
- Quantization error: signal processing
All refer to the same objective: total squared error from points to their assigned centroids.
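To make the definition concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data is arbitrary) that computes $J$ directly from the definition and compares it with the `inertia_` value scikit-learn reports:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # arbitrary toy data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS from the definition: squared distance of each point
# to the centroid of its assigned cluster, summed over all points.
diffs = X - km.cluster_centers_[km.labels_]
wcss = np.sum(diffs ** 2)

print(f"Manual WCSS:      {wcss:.4f}")
print(f"sklearn inertia_: {km.inertia_:.4f}")  # should agree up to float error
```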
Decomposing the Objective:
We can rewrite $J$ in a revealing form. For each cluster $j$:
$$J_j = \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2 = |C_j| \cdot \text{Var}(C_j)$$
where $\text{Var}(C_j) = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$ is the within-cluster variance.
Thus: $$J = \sum_{j=1}^{k} |C_j| \cdot \text{Var}(C_j),$$ i.e., $J$ equals $n$ times the weighted average of the within-cluster variances (with weights $|C_j|/n$).
Interpretation: K-means minimizes the weighted sum of cluster variances. Larger clusters contribute more to the objective, which aligns with the intuition that we care more about the coherence of large clusters.
A beautiful result from statistics provides deep insight into what k-means accomplishes. The total variance of a dataset can be decomposed into within-cluster and between-cluster components.
Total Sum of Squares (TSS): $$\text{TSS} = \sum_{i=1}^{n} |\mathbf{x}_i - \bar{\mathbf{x}}|^2$$
where $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$ is the global mean.
Within-Cluster Sum of Squares (WCSS): $$\text{WCSS} = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
Between-Cluster Sum of Squares (BCSS): $$\text{BCSS} = \sum_{j=1}^{k} |C_j| |\boldsymbol{\mu}_j - \bar{\mathbf{x}}|^2$$
TSS = WCSS + BCSS
The total variance is the sum of within-cluster variance and between-cluster variance. Since TSS is fixed for a given dataset, minimizing WCSS is equivalent to maximizing BCSS.
In other words: k-means simultaneously makes clusters more compact (low within-cluster spread) and more separated (high between-cluster spread).
Proof of the Decomposition:
For any point $\mathbf{x}_i \in C_j$: $$\mathbf{x}_i - \bar{\mathbf{x}} = (\mathbf{x}_i - \boldsymbol{\mu}_j) + (\boldsymbol{\mu}_j - \bar{\mathbf{x}})$$
Squaring both sides and summing over all points: $$\sum_{i=1}^n |\mathbf{x}_i - \bar{\mathbf{x}}|^2 = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \left[ |\mathbf{x}_i - \boldsymbol{\mu}_j|^2 + |\boldsymbol{\mu}_j - \bar{\mathbf{x}}|^2 + 2(\mathbf{x}_i - \boldsymbol{\mu}_j)^T(\boldsymbol{\mu}_j - \bar{\mathbf{x}}) \right]$$
The cross term vanishes because $\sum_{\mathbf{x}_i \in C_j}(\mathbf{x}_i - \boldsymbol{\mu}_j) = \mathbf{0}$ (the mean is the point where deviations sum to zero). This yields:
$$\text{TSS} = \text{WCSS} + \text{BCSS}$$
Explained Variance Ratio:
We can define the explained variance ratio or R-squared for clustering:
$$R^2 = \frac{\text{BCSS}}{\text{TSS}} = 1 - \frac{\text{WCSS}}{\text{TSS}}$$
This provides a normalized measure of clustering quality that's invariant to the overall scale of the data.
| Component | Formula | K-Means Goal |
|---|---|---|
| TSS (Total) | $\sum_i \|\mathbf{x}_i - \bar{\mathbf{x}}\|^2$ | Fixed (given data) |
| WCSS (Within) | $\sum_j \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$ | Minimize ↓ |
| BCSS (Between) | $\sum_j \lvert C_j \rvert \, \|\boldsymbol{\mu}_j - \bar{\mathbf{x}}\|^2$ | Maximize ↑ |
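The decomposition is easy to check numerically. A minimal sketch, assuming NumPy and scikit-learn and using arbitrary synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three loose Gaussian blobs as toy data
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels, centers = km.labels_, km.cluster_centers_
x_bar = X.mean(axis=0)

tss = np.sum((X - x_bar) ** 2)                      # total sum of squares
wcss = np.sum((X - centers[labels]) ** 2)           # within-cluster
counts = np.bincount(labels)
bcss = np.sum(counts * np.sum((centers - x_bar) ** 2, axis=1))  # between-cluster

print(f"TSS  = {tss:.2f}")
print(f"WCSS = {wcss:.2f}")
print(f"BCSS = {bcss:.2f}")
print(f"WCSS + BCSS = {wcss + bcss:.2f}  (should equal TSS)")
print(f"R^2 = BCSS / TSS = {bcss / tss:.3f}")
```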
K-means specifically uses squared Euclidean distance $|\mathbf{x} - \boldsymbol{\mu}|^2$, not other distance measures. This isn't arbitrary—it's mathematically essential for the algorithm to work.
The Critical Property: The Mean Minimizes Squared Distances
For any set of points $S = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$, the point $\boldsymbol{\mu}$ that minimizes $\sum_{\mathbf{x} \in S} |\mathbf{x} - \boldsymbol{\mu}|^2$ is the arithmetic mean:
$$\boldsymbol{\mu}^* = \frac{1}{m}\sum_{\mathbf{x} \in S} \mathbf{x}$$
Proof: $$\frac{\partial}{\partial \boldsymbol{\mu}} \sum_{\mathbf{x} \in S} |\mathbf{x} - \boldsymbol{\mu}|^2 = -2\sum_{\mathbf{x} \in S}(\mathbf{x} - \boldsymbol{\mu}) = 0$$ $$\Rightarrow \sum_{\mathbf{x} \in S} \mathbf{x} = m \boldsymbol{\mu} \Rightarrow \boldsymbol{\mu} = \frac{1}{m}\sum_{\mathbf{x} \in S} \mathbf{x}$$
Other distance choices lead to different optimal centers:
- Manhattan distance (L1): The minimizer is the coordinate-wise median, not the mean. This leads to k-medians clustering.
- Euclidean distance (L2, not squared): Has no closed-form minimizer (the geometric median), requiring iterative optimization per cluster.
- Cosine distance: Spherical k-means uses normalized centroids, not plain means.
The squared Euclidean distance is special because it yields a closed-form, easily computable optimal centroid.
Geometric Interpretation:
Squared Euclidean distance creates spherical clusters—all points equidistant from a centroid form a hypersphere. The Voronoi cells (regions assigned to each centroid) are convex polytopes.
This has an important practical consequence for how the data must be prepared.
Feature Scaling is Essential:
If feature 1 ranges [0, 1] and feature 2 ranges [0, 1000], the squared distance is dominated by feature 2: $$|\mathbf{x} - \boldsymbol{\mu}|^2 = (x_1 - \mu_1)^2 + (x_2 - \mu_2)^2 \approx (x_2 - \mu_2)^2$$
Standardizing features (z-scores) or min-max scaling ensures all features contribute equally.
"""Why Squared Euclidean Distance is Special for K-Means This demonstrates the mathematical properties that make squared Euclidean distance uniquely suited for k-means."""import numpy as npfrom scipy.optimize import minimize def demonstrate_mean_minimizes_squared(): """ Show that the mean minimizes sum of squared distances. """ # Generate random points np.random.seed(42) points = np.random.randn(100, 2) * 2 + np.array([5, 3]) # The mean mean = points.mean(axis=0) # Sum of squared distances to mean ssd_to_mean = np.sum((points - mean) ** 2) # Try other candidate centers def total_squared_dist(center): return np.sum((points - center) ** 2) # Random candidate random_point = np.array([4.5, 2.5]) ssd_to_random = total_squared_dist(random_point) # Median (minimizes L1, not L2²) median = np.median(points, axis=0) ssd_to_median = total_squared_dist(median) print("Sum of Squared Distances to Different Centers:") print(f" To mean: {ssd_to_mean:.4f} ← MINIMUM") print(f" To median: {ssd_to_median:.4f}") print(f" To random point: {ssd_to_random:.4f}") print() # Verify with optimization result = minimize(total_squared_dist, x0=[0, 0], method='BFGS') print(f"Numerically optimal center: {result.x}") print(f"Analytical mean: {mean}") print(f"Match: {np.allclose(result.x, mean)}") def demonstrate_l1_vs_l2(): """ Show that L1 (Manhattan) distance leads to different optimal center. """ np.random.seed(42) points = np.random.randn(100, 2) # L2² optimal center is the mean mean = points.mean(axis=0) l2_sq_loss = np.sum((points - mean) ** 2) # L1 optimal center is the coordinate-wise median median = np.median(points, axis=0) l1_loss_at_median = np.sum(np.abs(points - median)) l1_loss_at_mean = np.sum(np.abs(points - mean)) print("L1 (Manhattan) vs L2² (Squared Euclidean):") print(f" Mean: {mean}") print(f" Median: {median}") print(f" L1 loss at median: {l1_loss_at_median:.4f} ← MINIMUM for L1") print(f" L1 loss at mean: {l1_loss_at_mean:.4f}") print() print("This is why k-medians uses the median, not the mean!") def demonstrate_feature_scaling(): """ Show impact of feature scaling on k-means clustering. 
""" np.random.seed(42) # Two clusters in original scale cluster1 = np.random.randn(50, 2) + np.array([0, 0]) cluster2 = np.random.randn(50, 2) + np.array([3, 3]) data_scaled = np.vstack([cluster1, cluster2]) # Same data with feature 2 scaled up by 100x data_unscaled = data_scaled.copy() data_unscaled[:, 1] *= 100 # Distance between cluster centers center1_scaled = cluster1.mean(axis=0) center2_scaled = cluster2.mean(axis=0) dist_scaled = np.linalg.norm(center1_scaled - center2_scaled) center1_unscaled = data_unscaled[:50].mean(axis=0) center2_unscaled = data_unscaled[50:].mean(axis=0) dist_unscaled = np.linalg.norm(center1_unscaled - center2_unscaled) print("Impact of Feature Scaling:") print(f" Scaled data - inter-cluster distance: {dist_scaled:.2f}") print(f" Unscaled data - inter-cluster distance: {dist_unscaled:.2f}") print() print(f" Feature 1 contribution (scaled): {abs(center2_scaled[0] - center1_scaled[0]):.2f}") print(f" Feature 2 contribution (scaled): {abs(center2_scaled[1] - center1_scaled[1]):.2f}") print(f" Feature 1 contribution (unscaled): {abs(center2_unscaled[0] - center1_unscaled[0]):.2f}") print(f" Feature 2 contribution (unscaled): {abs(center2_unscaled[1] - center1_unscaled[1]):.2f}") print() print("In unscaled data, feature 2 dominates the distance calculation!") if __name__ == "__main__": demonstrate_mean_minimizes_squared() print("=" * 60) demonstrate_l1_vs_l2() print("=" * 60) demonstrate_feature_scaling()K-means has a fascinating interpretation in signal processing and information theory as vector quantization (VQ)—a technique for lossy data compression.
The VQ Problem:
Given $n$ vectors $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, find $k$ codebook vectors $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$ (also called codewords or prototypes) that best represent the data. Each data point is then quantized by replacing it with its nearest codeword.
The distortion is exactly the k-means objective: $$D = \frac{1}{n}\sum_{i=1}^{n} \min_j |\mathbf{x}_i - \boldsymbol{\mu}_j|^2$$
In the quantization literature, the same iterative procedure appears as the Lloyd-Max algorithm (for scalar quantization) and as the generalized Lloyd (LBG) algorithm for vector quantization.
A classic application: compress an image by clustering pixel colors.
A 1000×1000 RGB image uses 3MB. With k=16 colors: 48 bytes (centroids) + 500KB (4-bit indices) ≈ 6× compression!
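Here is a rough sketch of color quantization with k-means. The "image" below is a hypothetical synthetic stand-in for real pixel data, and scikit-learn's KMeans serves as the quantizer:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for an RGB image: 200x200 pixels drawn around 5 base colors
themes = rng.integers(0, 256, size=(5, 3))
image = themes[rng.integers(0, 5, size=(200, 200))] + rng.normal(0, 10, size=(200, 200, 3))
image = np.clip(image, 0, 255)

pixels = image.reshape(-1, 3)  # one row per pixel

k = 16
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# Quantize: replace each pixel by its nearest codebook color
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)
n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))

# Distortion = mean squared error per pixel (= inertia / n)
distortion = km.inertia_ / pixels.shape[0]
bits_per_pixel = np.ceil(np.log2(k))  # 4 bits of index per pixel for k=16
print(f"Codebook: {k} colors, image now uses {n_colors} distinct colors")
print(f"Index cost: {bits_per_pixel:.0f} bits per pixel")
print(f"Mean quantization error (squared RGB distance): {distortion:.1f}")
```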
Information-Theoretic View:
From an information theory perspective, the k-means codebook enables a simple coding scheme: transmit the $k$ codewords once, then encode each data point by the index of its nearest codeword, using about $\log_2 k$ bits per point.
This connects to rate-distortion theory: the tradeoff between bits used (rate) and reconstruction error (distortion). Designing the best codebook for a fixed number of codewords is exactly the k-means problem; Lloyd's algorithm finds a locally optimal one.
Comparison with Other Quantizers:
| Method | Codebook Structure | K-Means Equivalent? |
|---|---|---|
| K-Means | Unstructured (any k vectors) | Yes |
| Product Quantization | Cartesian product of subspace codebooks | No (but related) |
| Lattice Quantization | Regular geometric lattice | No |
| Tree-Structured VQ | Hierarchical refinement | Related (hierarchical k-means) |
K-means has a deep connection to Gaussian Mixture Models (GMMs)—specifically, k-means is the limit of GMM when cluster variances go to zero.
The GMM Model:
A Gaussian Mixture Model assumes data is generated from $k$ Gaussian distributions: $$p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_j, \Sigma_j)$$
where $\pi_j$ are mixing weights and $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$ is the Gaussian density.
K-Means as a Limiting Case:
Consider a GMM with equal mixing weights $\pi_j = 1/k$ and shared spherical covariances $\Sigma_j = \sigma^2 I$, and take the limit $\sigma^2 \to 0$.
Derivation:
The probability of point $\mathbf{x}$ belonging to cluster $j$ is: $$p(z = j | \mathbf{x}) = \frac{\pi_j \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \sigma^2 I)}{\sum_{j'} \pi_{j'} \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_{j'}, \sigma^2 I)}$$
For spherical Gaussians: $$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \sigma^2 I) \propto \exp\left(-\frac{|\mathbf{x} - \boldsymbol{\mu}_j|^2}{2\sigma^2}\right)$$
As $\sigma^2 \to 0$, the exponential becomes sharply peaked around the nearest centroid. The posterior becomes a hard assignment:
$$\lim_{\sigma^2 \to 0} p(z = j|\mathbf{x}) = \begin{cases} 1 & \text{if } j = \arg\min_{j'} |\mathbf{x} - \boldsymbol{\mu}_{j'}|^2 \\ 0 & \text{otherwise} \end{cases}$$
This is exactly k-means assignment!
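The hardening of the posterior is easy to see numerically. A small sketch using only NumPy, with two hypothetical fixed centroids and one test point:

```python
import numpy as np

# Two fixed centroids and a point slightly closer to the first one (toy values)
mus = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([0.9, 0.0])

def responsibilities(x, mus, sigma2):
    """GMM posterior p(z=j|x) with equal weights and covariance sigma2 * I."""
    # Work in log space for numerical stability as sigma2 -> 0
    log_p = -np.sum((x - mus) ** 2, axis=1) / (2 * sigma2)
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / p.sum()

for sigma2 in [1.0, 0.1, 0.01, 0.001]:
    r = responsibilities(x, mus, sigma2)
    print(f"sigma^2 = {sigma2:6.3f}  ->  p(z=1|x) = {r[0]:.4f}, p(z=2|x) = {r[1]:.4f}")

# As sigma^2 shrinks, the responsibility collapses onto the nearest centroid,
# reproducing the hard k-means assignment rule.
```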
K-Means (Hard EM):
- Hard cluster assignments (each point in one cluster)
- Assumes equal, spherical clusters
- No uncertainty quantification
- Computationally cheaper

Full GMM (Soft EM):
- Soft assignments (probabilities over clusters)
- Allows different cluster shapes and sizes
- Provides posterior probabilities
- More expressive but expensive
While Lloyd's algorithm is computationally efficient, finding the globally optimal k-means solution is computationally intractable.
Formal Result:
The k-means clustering problem (minimizing the WCSS) is NP-hard. It remains NP-hard even when either the number of clusters $k$ or the dimension $d$ is fixed at 2.
What This Means:
Worst-case iterations: Lloyd's algorithm can require exponentially many iterations, $2^{\Omega(n)}$, to converge on certain adversarial inputs.
Average case: On random data, typically converges in O(n) iterations.
Smoothed analysis: With small random perturbations of the input, the expected number of iterations is polynomial (in the smoothed analysis framework of Spielman & Teng).
In practice, Lloyd's algorithm converges quickly: each iteration costs $O(k \cdot n \cdot d)$, and only a modest number of iterations is usually needed.
Approximation Guarantees:
Despite NP-hardness, good approximations are achievable: k-means++ initialization yields an expected $O(\log k)$-approximation to the optimal WCSS, and constant-factor approximation algorithms (e.g., local-search methods) exist.
For most practical applications, Lloyd's algorithm with good initialization (k-means++) provides solutions within a few percent of optimal.
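As a preview of why initialization matters, here is a brief sketch (assuming scikit-learn; the dataset and seeds are arbitrary) comparing the final WCSS reached from purely random starts versus k-means++ starts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Arbitrary synthetic dataset with 8 well-defined clusters
X, _ = make_blobs(n_samples=500, centers=8, cluster_std=1.2, random_state=7)

for init in ["random", "k-means++"]:
    # One Lloyd run per seed, so we see the raw spread across initializations
    inertias = [
        KMeans(n_clusters=8, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(20)
    ]
    print(f"{init:>10}: best={min(inertias):.1f}  worst={max(inertias):.1f}  "
          f"spread={max(inertias) - min(inertias):.1f}")

# k-means++ starts typically land much closer to the best solution found,
# which is exactly the topic of the next page.
```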
What's Next:
We now understand what k-means is optimizing and why. But the algorithm's success heavily depends on initialization—different starting points can lead to wildly different solutions. In the next page, we'll study initialization strategies, especially the k-means++ algorithm that provides theoretical guarantees on solution quality.
You now understand the k-means objective function from multiple perspectives: statistical (variance decomposition), geometric (squared distances), information-theoretic (vector quantization), and probabilistic (GMM limiting case). This foundation prepares you for understanding why initialization is so critical.