Clustering occupies a unique position in machine learning—it seeks to discover structure without being told what to find. Unlike regression and classification where labels guide learning, clustering must identify natural groupings from data alone. This is unsupervised learning in its purest form.
Clustering answers questions like: which observations naturally belong together? How many distinct groups does the data contain? Which points fit no group at all?
The absence of labels is both liberating and challenging. There's no ground truth to optimize against—success depends on whether discovered structure is meaningful and useful for downstream tasks.
By the end of this page, you will understand clustering at a foundational level: formal definitions and objectives, major algorithmic paradigms (partitional, hierarchical, density-based, model-based), the critical role of distance metrics and feature scaling, cluster validity measures, and practical considerations for applying clustering effectively. This knowledge enables you to discover meaningful structure in any unlabeled dataset.
Formalizing clustering is surprisingly subtle. Unlike supervised learning where objective functions are clear, clustering objectives admit multiple valid formulations.
The Basic Setup:
Given a dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ where each $x_i \in \mathbb{R}^d$, clustering seeks a partition: $$\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$$
such that: every point belongs to some cluster ($\bigcup_{k} C_k = \mathcal{D}$), each cluster is non-empty, and, in hard clustering, clusters are pairwise disjoint ($C_k \cap C_l = \emptyset$ for $k \neq l$).
Soft vs. Hard Clustering:
Hard clustering assigns each point to exactly one cluster.
Soft (fuzzy) clustering assigns membership probabilities: $p_{ik}$ = probability that point $i$ belongs to cluster $k$, with $\sum_k p_{ik} = 1$.
Soft clustering is more flexible and often more appropriate—real data often admits ambiguous assignments.
Objective Functions:
Different clustering algorithms optimize different objectives:
Compactness-based (within-cluster distances small): $$\min_{\mathcal{C}} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
where $\mu_k$ is the centroid of cluster $k$. This is the k-means objective.
Separation-based (between-cluster distances large): Maximize $\sum_{k < l} \|\mu_k - \mu_l\|^2$
Ratio-cut and Normalized-cut (graph-based): Minimize edges cut between clusters while keeping clusters balanced.
Likelihood-based (probabilistic models): $$\max_\theta \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \cdot p(x_i | \theta_k)$$
This is the Gaussian Mixture Model (GMM) objective.
There is no single 'correct' clustering objective. Different objectives encode different assumptions about what makes a good cluster. K-means assumes spherical clusters of similar size. DBSCAN assumes clusters are dense regions separated by sparse regions. Hierarchical clustering assumes nested structure. Choose the objective that matches your understanding of the domain.
| Objective | Formula/Description | Assumption | Algorithm |
|---|---|---|---|
| Within-cluster sum of squares | $\sum_k \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$ | Spherical, similar-sized clusters | K-means, K-medoids |
| Log-likelihood (GMM) | $\sum_i \log \sum_k \pi_k \mathcal{N}(x_i; \mu_k, \Sigma_k)$ | Gaussian-distributed clusters | EM for GMM |
| Density connectivity | Maximal density-connected regions | Clusters are dense regions | DBSCAN, HDBSCAN |
| Linkage criterion | Min/max/avg distance between clusters | Hierarchical structure exists | Agglomerative clustering |
| Graph cuts | Minimize edges cut / cluster weight | Data forms a graph | Spectral clustering |
The choice of distance metric fundamentally shapes clustering results. Two datasets that look similar under Euclidean distance may have completely different structures under cosine similarity. Understanding distance is prerequisite to understanding clustering.
What Makes a Valid Distance Metric?
A function $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^+$ is a metric if it satisfies non-negativity ($d(x, y) \geq 0$), identity of indiscernibles ($d(x, y) = 0$ if and only if $x = y$), symmetry ($d(x, y) = d(y, x)$), and the triangle inequality ($d(x, z) \leq d(x, y) + d(y, z)$).
Some useful functions (cosine similarity, KL divergence) are not true metrics but work well in practice.
| Metric | Formula | Properties | Use Cases |
|---|---|---|---|
| Euclidean (L2) | $\sqrt{\sum_i (x_i - y_i)^2}$ | Isotropic; sensitive to scale | General purpose; requires scaled features |
| Manhattan (L1) | $\sum_i |x_i - y_i|$ | More robust to outliers than L2 | When dimensions are independent |
| Chebyshev (L∞) | $\max_i |x_i - y_i|$ | Maximum coordinate difference | When worst-case dimension matters |
| Minkowski (Lp) | $(\sum_i |x_i - y_i|^p)^{1/p}$ | Generalizes L1, L2, L∞ | Tunable robustness |
| Cosine Distance | $1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | Measures angle, ignores magnitude | Text, vector embeddings, high-dim sparse |
| Mahalanobis | $\sqrt{(x-y)^T \Sigma^{-1} (x-y)}$ | Accounts for feature correlations | When correlations matter; requires Σ estimate |
| Hamming | $\sum_i \mathbb{1}[x_i \neq y_i]$ | Counts differing positions | Binary or categorical data |
| Jaccard | $1 - \frac{|A \cap B|}{|A \cup B|}$ | Set overlap | Sets, binary feature vectors |
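As a concrete reference, the sketch below (assuming NumPy and SciPy are available) computes several of the metrics from the table on two toy vectors; the vectors and printed values are illustrative only.

```python
# A minimal sketch comparing common distance metrics on two toy vectors.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # y is a scaled copy of x

print("Euclidean:", distance.euclidean(x, y))      # straight-line distance
print("Manhattan:", distance.cityblock(x, y))      # sum of coordinate differences
print("Chebyshev:", distance.chebyshev(x, y))      # largest coordinate difference
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine:", distance.cosine(x, y))            # ~0: the angle between x and y is 0
```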
The Critical Role of Feature Scaling:
Most distance metrics are sensitive to feature scales. If one feature ranges [0, 1] and another [0, 1000], the second dominates distance calculations.
Common scaling approaches: standardization (z-score: subtract the mean, divide by the standard deviation), min-max scaling to [0, 1], and robust scaling based on the median and interquartile range when outliers are present.
Choosing the right distance and scaling is as important as choosing the right algorithm. A poor choice can make any algorithm fail; a good choice can make simple algorithms succeed.
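A minimal sketch of the scaling problem, assuming NumPy and scikit-learn; the feature ranges mirror the [0, 1] vs. [0, 1000] example above.

```python
# Illustration: an unscaled wide-range feature dominates Euclidean distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 1, 100),      # feature 1: range [0, 1]
    rng.uniform(0, 1000, 100),   # feature 2: range [0, 1000]
])

# Raw distance between the first two points is driven almost entirely by feature 2.
print("raw distance   :", np.linalg.norm(X[0] - X[1]))

# After z-score standardization, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
print("scaled distance:", np.linalg.norm(X_scaled[0] - X_scaled[1]))
```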
In high dimensions, distances concentrate: the ratio of maximum to minimum distances approaches 1. All points become 'equally far' from each other, making distance-based clustering fail. Solutions include: dimensionality reduction (PCA, t-SNE, UMAP) before clustering, careful feature selection, or algorithms designed for high dimensions (subspace clustering).
Partitional algorithms directly partition data into K clusters, typically by optimizing an objective function iteratively.
K-Means: The Workhorse Algorithm
K-means is the most widely used clustering algorithm due to its simplicity and scalability.
Algorithm: (1) initialize K centroids (ideally with k-means++, below); (2) assign each point to its nearest centroid; (3) recompute each centroid as the mean of its assigned points; (4) repeat steps 2–3 until assignments stop changing.
Objective minimized: Within-cluster sum of squares (WCSS)
Complexity: $O(n \cdot K \cdot d \cdot I)$ where $I$ is iterations (typically small)
K-Means++ Initialization:
Poor initialization leads to poor local optima. K-means++ chooses initial centroids to be spread out: the first centroid is a uniformly random data point, and each subsequent centroid is sampled with probability proportional to $D(x)^2$, the squared distance from $x$ to its nearest already-chosen centroid.
This gives $O(\log K)$-competitive approximation to optimal and dramatically improves practical performance.
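A minimal sketch of k-means with k-means++ initialization, assuming scikit-learn; the synthetic blobs, K=4, and other parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

km = KMeans(
    n_clusters=4,
    init="k-means++",   # spread-out initialization
    n_init=10,          # multiple restarts to escape poor local optima
    random_state=42,
)
labels = km.fit_predict(X)

print("WCSS (inertia):", km.inertia_)             # the objective being minimized
print("Centroids shape:", km.cluster_centers_.shape)
```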
K-Medoids (PAM - Partitioning Around Medoids):
Instead of mean centroids, uses actual data points (medoids). More robust to outliers and works with any distance metric (not just Euclidean). But slower: O(K(n-K)²) per iteration.
Gaussian Mixture Models (GMM):
Probabilistic generalization of K-means. Assumes data is generated from a mixture of K Gaussian distributions:
$$p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x; \mu_k, \Sigma_k)$$
where $\pi_k$ are mixing coefficients (sum to 1), $\mu_k$ are means, and $\Sigma_k$ are covariances.
Advantages over K-means: soft (probabilistic) assignments, elliptical cluster shapes via per-cluster covariance matrices, unequal cluster weights via the mixing coefficients, and a likelihood that supports principled model selection (e.g., BIC or AIC).
Trained via the EM algorithm: the E-step computes each point's responsibilities (posterior membership probabilities) under the current parameters; the M-step re-estimates $\pi_k$, $\mu_k$, and $\Sigma_k$ from those responsibilities; the steps alternate until the log-likelihood converges.
More flexible but more parameters to estimate; can overfit with small data.
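A minimal sketch of fitting a GMM and reading off soft assignments, assuming scikit-learn; the data and K=3 are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",   # per-cluster covariance -> elliptical clusters
    random_state=0,
).fit(X)

hard_labels = gmm.predict(X)        # most probable component per point
soft_probs = gmm.predict_proba(X)   # p_ik: membership probabilities (rows sum to 1)
print(soft_probs[0])
print("BIC:", gmm.bic(X))           # likelihood-based model selection criterion
```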
Use K-means when: data is large, clusters are roughly spherical and similar-sized, you need speed, hard assignments are acceptable. Use GMM when: cluster shapes vary (elliptical), you need probability estimates, model selection via likelihood is important, data size is manageable. Both require specifying K.
Hierarchical clustering produces a tree-like hierarchy of clusters (dendrogram) rather than a single partition. This is valuable when the goal is to explore multi-scale structure or when the 'right' number of clusters is unknown.
Two Approaches:
1. Agglomerative (bottom-up): Start with n singleton clusters; repeatedly merge the two closest clusters until one remains.
2. Divisive (top-down): Start with one cluster containing all points; repeatedly split clusters until n singletons remain.
Agglomerative is more common because divisive requires deciding how to split (expensive for large clusters).
Linkage Criteria:
The key choice in agglomerative clustering: how do we define distance between clusters?
Single Linkage: Distance = minimum distance between any pair of points (one from each cluster) $$d(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$$
Complete Linkage: Distance = maximum distance between any pair $$d(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)$$
Average Linkage (UPGMA): Distance = average of all pairwise distances $$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
Ward's Method: Minimize increase in total within-cluster variance when merging
| Linkage | Cluster Shape | Sensitivity | Complexity | Best For |
|---|---|---|---|---|
| Single | Elongated, chained | Very noise-sensitive | O(n²) | Detecting connected regions |
| Complete | Compact, spherical | Outlier-sensitive | O(n²) | Compact clusters of similar diameter |
| Average | Moderate shapes | Balanced sensitivity | O(n²) | General purpose |
| Ward | Compact, spherical | Moderate sensitivity | O(n²) | Variance-based clustering; most popular |
Reading and Cutting Dendrograms:
The dendrogram visualizes the merge history: leaves are individual points, internal nodes are merges, and the height of each node is the inter-cluster distance at which that merge occurred. Large gaps between successive merge heights indicate well-separated clusters.
To obtain K clusters: Draw horizontal line at appropriate height; the number of vertical lines it crosses equals the number of clusters. Alternatively, select K largest gaps in merge heights.
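A minimal sketch of agglomerative clustering with Ward linkage and a dendrogram cut, assuming SciPy (and scikit-learn only for the toy data); K=3 is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

Z = linkage(X, method="ward")                     # merge history, one row per merge
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into exactly 3 clusters

# dendrogram(Z) draws the merge tree (requires matplotlib);
# the merge heights themselves are stored in Z[:, 2].
print("cluster sizes:", np.bincount(labels)[1:])  # fcluster labels are 1-based
```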
Complexity: O(n² log n) with efficient implementations (O(n²) for single linkage). Memory O(n²) for distance matrix. This limits scalability to ~10,000-100,000 points.
Hierarchical clustering is ideal for exploratory analysis where the number of clusters is unknown, when you want to visualize cluster structure at multiple scales, for small-to-medium datasets where the dendrogram provides interpretable insight, and in domains like biology where taxonomic hierarchies are natural (phylogenetic trees).
Density-based methods define clusters as regions of high density separated by regions of low density. This elegant formulation handles arbitrarily shaped clusters and naturally identifies outliers.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
The foundational density-based algorithm. Key concepts:
Core points: Points with at least minPts neighbors within radius ε
Border points: Points within ε of a core point but not core themselves
Noise points: Points that are neither core nor border
Density-connected: Two points are in the same cluster if they're connected through a chain of core points.
DBSCAN Algorithm: visit each unvisited point; if it is a core point, start a new cluster and expand it by adding every density-reachable point (core and border); points reachable from no core point are labeled noise.
Parameters:
ε (epsilon): Radius for neighborhood queries
minPts: Minimum points to form a dense region
Choosing parameters:
minPts ≥ d+1 where d is dimensionality (rule of thumb: minPts = 2×d)
ε: Use the k-distance plot (plot the distance to each point's k-th nearest neighbor, sorted; look for the 'elbow')
Complexity: O(n log n) with spatial indexing (R-tree, KD-tree); O(n²) worst case
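A minimal sketch of the k-distance heuristic followed by DBSCAN, assuming scikit-learn; eps=0.3, minPts=10, and the two-moons data are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-distance plot: sorted distance to each point's k-th neighbor; pick eps at the elbow.
k = 10
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])   # plot this curve to choose eps

db = DBSCAN(eps=0.3, min_samples=k)
labels = db.fit_predict(X)       # noise points receive the label -1

print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```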
HDBSCAN: Hierarchical DBSCAN
Addresses DBSCAN's sensitivity to ε by extracting a hierarchy of clusters: conceptually, it considers every density level (every ε) at once, builds the resulting cluster hierarchy, and selects the most stable clusters from it.
Key advantages: no ε to tune (only a minimum cluster size), clusters of varying density can be recovered, and stability-based cluster selection makes results less sensitive to parameter choices.
HDBSCAN has become the go-to density-based method for many practitioners.
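A minimal sketch, assuming scikit-learn ≥ 1.3 (which includes an HDBSCAN implementation; the standalone `hdbscan` package exposes a very similar interface). The min_cluster_size value is illustrative.

```python
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

hdb = HDBSCAN(min_cluster_size=15)   # the main parameter; no epsilon required
labels = hdb.fit_predict(X)          # -1 marks noise, as in DBSCAN

print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
```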
Use DBSCAN/HDBSCAN when: clusters have arbitrary shapes, outliers exist and should be detected, number of clusters is unknown, density is roughly uniform. Use K-means when: clusters are roughly spherical and similar-sized, you know K, computational speed is critical, outliers are pre-filtered.
How do we know if clustering succeeded? Without ground truth labels, evaluation is fundamentally ambiguous—but several principles and metrics provide guidance.
Two Types of Validation:
Internal validation: Measures based only on the data and clustering (no external labels)
External validation: Measures agreement with known ground truth labels (used for algorithm comparison on labeled datasets)
| Metric | Formula / Description | Range | Interpretation |
|---|---|---|---|
| Silhouette Coefficient | $\frac{b(i) - a(i)}{\max(a(i), b(i))}$ where a=intra-cluster dist, b=nearest-cluster dist | [-1, 1] | Higher = better; >0.7 strong structure |
| Calinski-Harabasz Index | Ratio of between-cluster to within-cluster variance | [0, ∞) | Higher = better defined clusters |
| Davies-Bouldin Index | Average similarity of each cluster to its most similar cluster | [0, ∞) | Lower = better separation |
| Dunn Index | Min inter-cluster dist / max intra-cluster dist | [0, ∞) | Higher = compact and well-separated |
| Within-Cluster Sum of Squares | $\sum_k \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$ | [0, ∞) | Lower = more compact (but decreases with K) |
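A minimal sketch computing the internal metrics from the table for one clustering, assuming scikit-learn; the blobs data and K=4 are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

print("Silhouette        :", silhouette_score(X, labels))          # higher is better
print("Calinski-Harabasz :", calinski_harabasz_score(X, labels))   # higher is better
print("Davies-Bouldin    :", davies_bouldin_score(X, labels))      # lower is better
```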
The Silhouette Coefficient in Detail:
For each point $i$: $a(i)$ is the mean distance from $i$ to the other points in its own cluster (cohesion), $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster (separation), and $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$.
Interpretation: $s(i)$ near 1 means the point sits well inside its cluster, values near 0 indicate a point on the boundary between two clusters, and negative values suggest the point may be assigned to the wrong cluster.
Overall silhouette = average over all points. Values above roughly 0.7 indicate strong structure, 0.5–0.7 reasonable structure, 0.25–0.5 weak structure, and below 0.25 little or no substantial structure.
External Validation Metrics:
When ground truth labels exist (for algorithm benchmarking):
Rand Index (RI): Fraction of point pairs where clustering and ground truth agree (both same or both different clusters). Range [0, 1].
Adjusted Rand Index (ARI): RI corrected for chance. Expected value = 0 for random clustering. Range [-1, 1]; 1 = perfect agreement.
Normalized Mutual Information (NMI): Information-theoretic measure of agreement. Range [0, 1]. $$NMI(C, T) = \frac{2 \cdot I(C; T)}{H(C) + H(T)}$$
Fowlkes-Mallows Index: Geometric mean of precision and recall of pair classifications.
ARI and NMI are the most commonly reported external validation metrics.
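A minimal sketch of external validation against known labels, assuming scikit-learn; the synthetic data stand in for a benchmark dataset with ground truth.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y_true = make_blobs(n_samples=500, centers=3, random_state=3)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Both metrics ignore how cluster IDs are numbered; only the grouping matters.
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```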
No internal metric can definitively validate clustering. Different metrics favor different cluster properties (compactness, separation, balance). Always combine quantitative metrics with domain expertise and visual inspection. A clustering that scores well but doesn't make domain sense is useless.
Successful clustering requires navigating numerous practical decisions beyond algorithm choice. Here are the key considerations that distinguish expert practice.
The Elbow Method for Choosing K: run the algorithm for a range of K values, plot WCSS against K, and choose the K at the 'elbow' of the curve, beyond which additional clusters yield only marginal reductions in WCSS.
Limitations: The elbow is often subjective or absent. Supplement with silhouette analysis (plot average silhouette vs. K; choose K that maximizes) or domain knowledge.
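A minimal sketch of an elbow-plus-silhouette sweep over K, assuming scikit-learn; the K range and data are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                      # plot against k; look for the elbow
    sil = silhouette_score(X, km.labels_)   # prefer the k that maximizes this
    print(f"K={k}  WCSS={wcss:10.1f}  silhouette={sil:.3f}")
```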
Gap Statistic: Compares WCSS to expected WCSS under null reference distribution. Choose K where gap is large. More principled but computationally expensive.
Always visualize your clusters. Use dimensionality reduction (PCA, t-SNE, UMAP) to project high-dimensional data to 2D/3D. Color points by cluster label. If clusters don't look sensible in visualization, they probably aren't—regardless of what metrics say. Human judgment remains essential.
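A minimal sketch of the visualization step, assuming scikit-learn and matplotlib: project high-dimensional data to 2D with PCA and color points by cluster label. The 10-dimensional blobs are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)   # project 10-D data to 2-D for plotting
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap="tab10")
plt.title("Clusters in PCA projection")
plt.show()
```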
We've explored clustering comprehensively—formal definitions, distance metrics, algorithmic paradigms (partitional, hierarchical, density-based), cluster validity, and practical considerations. Let's synthesize the key insights.
Connection to Other Problem Types:
Clustering connects to the broader ML landscape: it pairs naturally with dimensionality reduction (PCA, t-SNE, UMAP) for preprocessing and visualization, probabilistic clustering with GMMs is a form of density estimation, and discovered cluster labels frequently serve as features or segments for downstream supervised tasks.
What's Next:
Having covered regression (continuous), classification (discrete), and clustering (unsupervised), we turn to structured prediction—predicting complex, structured outputs like sequences, trees, and graphs. This generalizes classification to outputs with internal dependencies.
You now possess a comprehensive understanding of clustering problems in machine learning: from formal definitions through practical implementation, you have the conceptual framework to discover meaningful structure in any unlabeled dataset.