With clustering objectives, distance measures, and cluster definitions now understood, we can survey the landscape of clustering approaches. Just as there are many ways to define a cluster, there are many ways to find clusters. Understanding this taxonomy helps you navigate the dozens of clustering algorithms and recognize their fundamental similarities and differences.
Clustering algorithms can be categorized along multiple dimensions: How do they partition data? Do they assign points definitively or probabilistically? Do they handle all points or allow for noise? Do they operate on raw data or transformed representations? Each dimension represents a design choice that affects the algorithm's behavior, strengths, and limitations.
This page provides a comprehensive taxonomy of clustering types, preparing you to understand how specific algorithms (K-means, hierarchical clustering, DBSCAN, spectral clustering, GMMs) fit into the broader landscape.
By the end of this page, you will understand the fundamental distinctions between clustering types: partitional vs. hierarchical, hard vs. soft, complete vs. partial, exclusive vs. overlapping, and more. You'll learn which type suits different applications and how these categories guide algorithm selection.
The most fundamental distinction in clustering is between methods that find a single partition versus those that find a hierarchy of partitions.
Partitional Clustering:
Partitional (or "flat") clustering divides data into a fixed number of non-overlapping groups. The output is a single set of clusters $\{C_1, C_2, \ldots, C_k\}$.
Key characteristics:
- Requires the number of clusters $k$ (or an equivalent parameter) up front
- Produces a single flat partition with no nested structure
- Typically efficient, often scaling to large datasets
Examples: K-means, K-medoids, spectral clustering, and GMMs (when each point is assigned to its most probable component)
Hierarchical Clustering:
Hierarchical clustering creates a tree (dendrogram) of nested cluster relationships. Every partition at every granularity is encoded in a single structure.
Key characteristics:
- No need to fix the number of clusters in advance; cut the dendrogram at any level
- Encodes cluster relationships at every granularity in one structure
- Typically $O(n^2)$ time and memory or worse, which limits scalability
Agglomerative vs. Divisive Hierarchical:
Hierarchical clustering comes in two flavors:
Agglomerative (bottom-up): Start with n clusters (each point is a cluster), iteratively merge the two most similar clusters. More common in practice.
Divisive (top-down): Start with one cluster (all points), recursively split. Less common; harder to decide how to split.
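The agglomerative workflow can be sketched with SciPy: the linkage matrix encodes the full dendrogram, and cutting it at different levels yields flat partitions at different granularities. A minimal sketch on synthetic blobs (the data and parameter choices here are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 20 points each
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# Agglomerative (bottom-up): the linkage matrix records every merge,
# i.e. the entire dendrogram, in one structure
Z = linkage(X, method="ward")

# Cutting the same tree at different granularities gives different flat partitions
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k4 = fcluster(Z, t=4, criterion="maxclust")
```

Moving between granularities requires no new computation; a partitional method would have to be rerun from scratch for each value of k.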
Hybrid Approaches:
Some modern methods combine aspects of both. For example, BIRCH builds a cluster-feature tree incrementally and then applies a flat clustering step to its leaves, while bisecting k-means produces a hierarchy through recursive partitional splits.
Use partitional when: you know k, need efficiency, or have very large data. Use hierarchical when: structure is naturally hierarchical, you want to explore multiple granularities, or visualization of relationships is important. Many practical workflows use hierarchical for exploration, then partitional for production.
Should a point belong to exactly one cluster, or can it have partial membership in multiple clusters?
Hard Clustering:
Each point is assigned to exactly one cluster: $$\forall x_i: \quad x_i \in C_j \text{ for exactly one } j$$
Cluster assignments are deterministic—no ambiguity in which cluster owns each point.
Soft (Fuzzy) Clustering:
Each point has a degree of membership to each cluster: $$\forall x_i: \quad \mu_j(x_i) \in [0, 1] \text{ for each cluster } j$$
where $\sum_j \mu_j(x_i) = 1$ (memberships sum to 1).
In probabilistic clustering (GMMs), these are posterior probabilities: $P(\text{cluster } j | x_i)$.
| Aspect | Hard Clustering | Soft Clustering |
|---|---|---|
| Assignment | Binary: 0 or 1 | Continuous: [0, 1] |
| Boundary points | Forced into one cluster | Shared by multiple clusters |
| Information | Less informative | More informative |
| Complexity | Simpler interpretation | More parameters |
| Examples | K-means, DBSCAN, hierarchical | GMM, Fuzzy C-means |
| Best for | Well-separated clusters | Overlapping clusters, uncertainty |
Fuzzy C-Means (FCM):
The classic soft clustering algorithm generalizes k-means:
$$J_{FCM} = \sum_{i=1}^{n} \sum_{j=1}^{k} \mu_{ij}^m \|x_i - c_j\|^2$$
where $\mu_{ij} \in [0, 1]$ is the membership of point $x_i$ in cluster $j$, $m > 1$ is the fuzzifier (larger $m$ gives softer assignments), and $c_j$ is the center of cluster $j$.
The membership update rule: $$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\|x_i - c_j\|}{\|x_i - c_l\|} \right)^{\frac{2}{m-1}}}$$
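The two update equations can be turned into a compact NumPy sketch. This is illustrative only; `fuzzy_c_means` and its defaults are my own naming, not a standard API:

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch (illustrative, not production code)."""
    rng = np.random.default_rng(seed)
    # Initialize memberships randomly, each row summing to 1
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Center update: membership-weighted means (weights raised to m)
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]
        # Membership update: normalize d_ij^(-2/(m-1)) across clusters,
        # which is algebraically the same as the closed-form rule above
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, C

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
U, C = fuzzy_c_means(X, k=2)
```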
When Soft Assignments Matter:
You can always convert soft to hard assignments by giving each point to its highest-membership cluster. GMM often outputs soft assignments but is used in hard mode (assign to the argmax posterior). Going in the reverse direction is harder: hard k-means doesn't naturally produce meaningful soft assignments without modification.
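With scikit-learn's `GaussianMixture`, for example, the soft-to-hard conversion is a single argmax over the posterior probabilities (synthetic data here for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(4, 1.0, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: posterior P(cluster j | x_i); each row sums to 1
probs = gm.predict_proba(X)

# Hard assignments: argmax over posteriors (what gm.predict does internally)
hard = probs.argmax(axis=1)
```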
Must every data point be assigned to a cluster?
Complete Clustering:
Every point is assigned to exactly one cluster. The clusters form a complete partition of the data: $$\bigcup_{j=1}^{k} C_j = X$$
Most clustering algorithms produce complete clusterings. K-means, hierarchical clustering (with fixed cut), and GMM with hard assignments all assign every point.
Partial Clustering:
Some points may not be assigned to any cluster. These points are classified as noise, outliers, or background: $$\bigcup_{j=1}^{k} C_j \subset X$$
Density-based methods (DBSCAN, HDBSCAN) naturally produce partial clusterings. Points in sparse regions are declared noise rather than forced into inappropriate clusters.
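In scikit-learn's DBSCAN, unassigned points are marked with the label `-1`. A small sketch with two dense blobs plus deliberate outliers (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus two isolated points far away
blobs = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
outliers = np.array([[10.0, 10.0], [-10.0, 5.0]])
X = np.vstack([blobs, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# The isolated points receive label -1 (noise) rather than being
# forced into the nearest cluster
```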
Making Complete Methods Partial:
You can convert a complete clustering to a partial one by post-processing: for example, drop points whose distance to their assigned centroid exceeds a threshold, or, for probabilistic methods, drop points whose maximum posterior falls below a cutoff.
Making Partial Methods Complete:
To assign noise points, common options include assigning each one to the nearest cluster (by centroid, or by nearest core point for density-based methods) or training a classifier on the clustered points and predicting labels for the noise.
The Design Choice:
Choosing between complete and partial depends on your application: if every point must receive a label (e.g., every customer must land in some segment), use complete clustering; if outliers or background noise are expected, partial clustering avoids contaminating clusters with them.
Forcing every point into a cluster can distort cluster statistics. A single outlier assigned to a k-means cluster can significantly shift the centroid. When outliers are expected, use partial clustering methods or robust algorithms.
Can a point belong to multiple clusters simultaneously?
Exclusive Clustering:
Each point belongs to at most one cluster: $$C_i \cap C_j = \emptyset \quad \forall i \neq j$$
This is the standard assumption in most clustering. The clusters partition the data into non-overlapping groups.
Overlapping Clustering:
Points can belong to multiple clusters: $$C_i \cap C_j \neq \emptyset \text{ for some } i \neq j$$
This is different from soft clustering: in overlapping clustering, a point can have full membership in multiple clusters, not partial membership in each.
Why Overlapping Clusters?
Many real-world groupings overlap naturally: a document can cover several topics, a gene can participate in multiple biological pathways, and a person can belong to several social communities at once.
Algorithms for Overlapping Clustering:
| Algorithm | Approach |
|---|---|
| Clique Percolation | Find overlapping graph communities via k-clique patterns |
| OCSM | Overlapping Cluster Spanning Model for bioinformatics |
| Fuzzy approaches | Threshold soft assignments to get overlapping hard assignments |
| Multi-assignment | Run clustering multiple times, assign to all above-threshold clusters |
From Soft to Overlapping:
Soft clustering can approximate overlapping by using a threshold: assign a point to every cluster whose membership $\mu_j(x_i)$ meets or exceeds a cutoff $\tau$ (e.g., $\tau = 0.2$).
This converts degree-of-membership to binary multi-membership.
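The thresholding step is a one-liner on a matrix of soft memberships. A sketch using GMM posteriors, with an illustrative cutoff `tau` (the value 0.2 is an assumption, not a recommendation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two deliberately overlapping blobs
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(2.5, 1.0, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)

tau = 0.2                 # membership cutoff (illustrative choice)
member = probs >= tau     # boolean n-by-k multi-membership matrix

# Points in the overlap region now fully belong to both clusters
n_multi = int((member.sum(axis=1) > 1).sum())
```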
Soft clustering: point has 30% membership in cluster A, 70% in cluster B. Overlapping clustering: point fully belongs to both A and B. Soft represents uncertainty or gradation; overlapping represents genuine multi-class membership. A document might be 'half about ML' (soft) or 'a full member of both ML and healthcare topics' (overlapping).
Does the algorithm always produce the same result, or does randomness play a role?
Deterministic Clustering:
Given the same data and parameters, the algorithm always produces the same clustering:
$$\text{Algorithm}(X, \theta) \rightarrow C \quad \text{(same every time)}$$
Examples: agglomerative hierarchical clustering (given a fixed linkage rule) and DBSCAN (aside from tie-breaking for border points)
Stochastic Clustering:
Results vary across runs due to random initialization or sampling:
$$\text{Algorithm}(X, \theta) \rightarrow C_1, C_2, C_3, \ldots \quad \text{(may differ)}$$
Examples: k-means (random centroid initialization), GMMs fit by EM (random parameter initialization), and any method that subsamples the data
Handling Stochasticity:
When using stochastic algorithms:
Set random seed for reproducibility:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=k, random_state=42).fit(X)  # fixed seed: identical results every run
Run multiple times and select best:
best_km, best_inertia = None, float('inf')
for _ in range(10):
    km = KMeans(n_clusters=k, n_init=1).fit(X)
    if km.inertia_ < best_inertia:
        best_inertia = km.inertia_   # track the best objective seen so far
        best_km = km
Assess stability: If different runs give very different results, the clustering may not be stable.
Use consensus clustering: Combine multiple runs to find robust clusters.
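One common consensus construction is the co-association matrix: the fraction of runs in which each pair of points lands in the same cluster. A minimal sketch (the setup and run count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
n, runs, k = len(X), 20, 2

# Co-association matrix: fraction of runs in which two points co-cluster
co = np.zeros((n, n))
for seed in range(runs):
    labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= runs

# Stable pairs sit near 1; pairs that flip between runs sit well below 1
stable = co[:30, :30].mean()
```

Thresholding or re-clustering the co-association matrix yields a consensus partition that is more robust than any single run.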
Smart Initialization:
K-means++ initialization dramatically reduces the impact of randomness by choosing initial centroids that are spread apart. This makes k-means more reproducible while keeping the efficiency benefits of the algorithm.
For production systems, always set random seeds and document them. For research, report results across multiple runs with confidence intervals. Tools like scikit-learn's n_init parameter automatically run k-means multiple times and return the best result.
A fundamental distinction in the algorithmic approach to clustering:
Distance-Based (Metric) Clustering:
Clustering is based on pairwise distances between points. The algorithm uses distances directly to determine similarity and cluster membership.
Characteristics: needs only a pairwise distance or similarity function; makes no distributional assumptions; the output is typically just the cluster assignments.
Model-Based (Probabilistic) Clustering:
Data is assumed to be generated from a probabilistic model. Clustering involves inferring the model parameters that best explain the data.
Characteristics: assumes a generative model (e.g., a mixture of Gaussians); fitting yields model parameters plus posterior membership probabilities; model comparison can use likelihood-based criteria.
| Aspect | Distance-Based | Model-Based |
|---|---|---|
| Foundation | Pairwise distances | Probability distributions |
| Assumptions | Similarity metric is meaningful | Data follows assumed distribution |
| Output | Cluster assignments | Assignments + model parameters |
| Uncertainty | Not naturally quantified | Posterior probabilities |
| Model selection | Heuristics (elbow, silhouette) | Principled (BIC, AIC, cross-validation) |
| Flexibility | Any distance metric | Distribution family must be chosen |
| New data | Assign by distance to clusters | Assign by posterior probability |
K-Means as a Special Case:
Interestingly, k-means can be derived from both perspectives:
Distance view: Minimize WCSS = sum of squared Euclidean distances to centroids
Model view: Maximum likelihood estimation for a GMM with spherical, equal-variance Gaussians and hard assignments
This duality shows how the paradigms connect. GMM generalizes k-means by allowing non-spherical covariances, unequal cluster weights, and soft (probabilistic) assignments.
Practical Implications:
Choose distance-based when: you have a meaningful similarity metric, want to avoid distributional assumptions, or are in an exploratory phase.
Choose model-based when: you need uncertainty estimates (posterior probabilities), principled model selection (BIC/AIC), or a generative model for scoring new data.
In practice, you might use distance-based methods for exploration (simpler, fewer assumptions) and model-based methods for production (principled model selection, uncertainty estimates). Many workflows start with k-means to get a sense of the data, then refine with GMM for the final model.
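That explore-then-refine workflow can be sketched with scikit-learn, seeding the GMM with the k-means centroids via the `means_init` parameter (the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(5, 1.0, (100, 2))])
k = 2

# Exploration: fast distance-based pass
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Refinement: model-based pass, initialized at the k-means centroids
gm = GaussianMixture(n_components=k, means_init=km.cluster_centers_,
                     random_state=0).fit(X)

# The model-based fit adds posterior uncertainties and a likelihood-based
# criterion (BIC) for principled model selection
score = gm.bic(X)
```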
How does the algorithm handle data: all at once, or point by point?
Batch Clustering:
The algorithm sees all data before beginning and processes it as a whole:
$$\text{Cluster}(\{x_1, x_2, \ldots, x_n\}) \rightarrow \{C_1, C_2, \ldots, C_k\}$$
Most clustering algorithms are batch algorithms: k-means, hierarchical, spectral, GMM.
Online (Incremental) Clustering:
Data arrives one point at a time, and the clustering is updated incrementally:
$$\text{Cluster}_{t+1} = \text{Update}(\text{Cluster}_t, x_{t+1})$$
The algorithm maintains a current clustering that evolves as new data arrives.
Why Online Clustering?
Batch clustering is insufficient when: data arrives continuously as a stream, the dataset is too large to fit in memory, or the underlying distribution drifts over time.
Online K-Means (Mini-Batch K-Means):
Samples small batches randomly and updates centroids incrementally:
for batch in stream:
    assign each batch point to its nearest centroid
    update each centroid using learning rate η:
        μ_j ← (1 − η) μ_j + η · mean(batch points in C_j)
Converges to similar solution as batch k-means but much faster for large data.
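In scikit-learn this is `MiniBatchKMeans`, whose `partial_fit` method updates the centroids one batch at a time. A sketch with a simulated stream (the stream generator and its parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Simulated stream: batches drawn from two fixed, well-separated blobs
def stream(n_batches=50, batch_size=20):
    for _ in range(n_batches):
        a = rng.normal(0, 0.5, (batch_size // 2, 2))
        b = rng.normal(5, 0.5, (batch_size // 2, 2))
        yield np.vstack([a, b])

mbk = MiniBatchKMeans(n_clusters=2, random_state=0)
for batch in stream():
    mbk.partial_fit(batch)   # incremental centroid update; O(batch) memory

# After the stream, the centroids should sit near the blob centers
centers = np.sort(mbk.cluster_centers_[:, 0])
```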
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
Builds a CF (Clustering Feature) tree incrementally: each node stores compact summary statistics (point count, linear sum, squared sum) of the points it absorbs, so new points update the tree without revisiting old data, and a final global clustering step runs over the leaf summaries.
| Aspect | Batch | Online |
|---|---|---|
| Data access | All data available | One point/batch at a time |
| Memory | O(n) or O(n²) | O(k) or O(tree size) |
| Quality | Better (sees all data) | Approximate (limited view) |
| Adaptability | Rerun for new data | Naturally adapts |
| Use case | Static datasets | Streams, large data |
Online algorithms can handle 'concept drift' — when the underlying cluster structure changes over time. A batch algorithm trained on last month's data may be wrong for this month. Online methods can adapt continuously. Techniques like exponential forgetting (giving less weight to old points) help track changes.
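The incremental style is visible in scikit-learn's `Birch`, which also exposes `partial_fit`. A sketch feeding data in chunks (chunk sizes and the `threshold` value are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)

# BIRCH absorbs data chunk by chunk into its CF tree
birch = Birch(n_clusters=2, threshold=0.5)
for _ in range(20):
    chunk = np.vstack([rng.normal(0, 0.3, (25, 2)),
                       rng.normal(4, 0.3, (25, 2))])
    birch.partial_fit(chunk)   # updates the tree; old chunks are not revisited

# The global clustering step operates on CF-tree leaves, not raw points
labels = birch.predict(np.array([[0.0, 0.0], [4.0, 4.0]]))
```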
What if we have some prior knowledge about which points should (or shouldn't) be together?
The Semi-Supervised Setting:
Pure unsupervised clustering uses only feature data. But often we have partial knowledge: a handful of labeled examples, pairwise judgments that two items belong together (or apart), or domain rules about group composition.
Constraint-based clustering incorporates this knowledge.
Types of Constraints:
Must-Link (ML): Points $x_i$ and $x_j$ must be in the same cluster. $$\text{ML}(x_i, x_j): \text{cluster}(x_i) = \text{cluster}(x_j)$$
Cannot-Link (CL): Points $x_i$ and $x_j$ must be in different clusters. $$\text{CL}(x_i, x_j): \text{cluster}(x_i) \neq \text{cluster}(x_j)$$
Must-link constraints are transitive: if ML(a, b) and ML(b, c), then ML(a, c). Cannot-link constraints are not transitive, but the two interact: ML(a, b) together with CL(b, c) entails CL(a, c).
Algorithms for Constrained Clustering:
COP-Kmeans: K-means with hard constraints — an assignment that violates any constraint is forbidden, and the algorithm fails if no feasible assignment exists.
PCKMeans: Penalized Constraint K-means — violations are allowed but add a penalty to the objective, trading constraint satisfaction against cluster compactness.
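The heart of the hard-constraint approach is the assignment step: try clusters nearest-first and skip any that would violate a constraint. A sketch of that step (function and variable names are my own, not a library API):

```python
import numpy as np

def constrained_assign(x_idx, X, centers, labels, must_link, cannot_link):
    """Assign point x_idx to the nearest centroid violating no constraint.

    labels[i] == -1 means point i is not yet assigned. Returns None if no
    cluster is feasible (the case where COP-Kmeans-style algorithms fail).
    """
    order = np.argsort(np.linalg.norm(centers - X[x_idx], axis=1))
    for j in order:                       # try clusters nearest-first
        ok = True
        for a, b in must_link:            # must-link partner assigned elsewhere?
            other = b if a == x_idx else (a if b == x_idx else None)
            if other is not None and labels[other] not in (-1, j):
                ok = False
        for a, b in cannot_link:          # cannot-link partner already in j?
            other = b if a == x_idx else (a if b == x_idx else None)
            if other is not None and labels[other] == j:
                ok = False
        if ok:
            return j
    return None

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, -1, 1])
# A cannot-link constraint pushes point 1 away from its nearest cluster
j = constrained_assign(1, X, centers, labels, must_link=[],
                       cannot_link=[(0, 1)])
```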
Metric Learning: Instead of constraining assignments directly, learn a distance metric (e.g., a Mahalanobis distance) under which must-link pairs are close and cannot-link pairs are far apart, then run any standard clustering algorithm with the learned metric.
When to Use Constraints:
Constraint-based clustering is valuable when: labeling the full dataset is too expensive but a few pairwise judgments are cheap, domain rules must be respected, or purely unsupervised runs keep producing groupings that contradict known structure.
Not all constraints are equally useful. Active learning selects the most informative pairs to query. Typically, pairs where the algorithm is uncertain (boundary points between clusters) provide the most value. This minimizes human labeling effort while maximizing clustering improvement.
We've surveyed the major dimensions along which clustering approaches vary. Let's consolidate: partitional vs. hierarchical (one partition or a tree of them), hard vs. soft (binary or graded membership), complete vs. partial (assign every point or allow noise), exclusive vs. overlapping (one cluster per point or several), deterministic vs. stochastic (repeatable or randomized), distance-based vs. model-based (metric or generative), batch vs. online (all at once or incremental), and unconstrained vs. constraint-based (purely unsupervised or guided by prior knowledge).
What's Next:
With the full taxonomy understood, we turn to the challenges of evaluating clustering—a notoriously difficult problem because there's typically no ground truth. The next page explores internal and external validation metrics, stability analysis, and the fundamental difficulties of assessing unsupervised learning.
You now have a comprehensive taxonomy of clustering approaches. This classification helps you navigate the many algorithms available and understand their fundamental characteristics. Next, we'll tackle the challenging question of how to evaluate whether a clustering is 'good.'