With clustering objectives, distance measures, and cluster definitions now understood, we can survey the landscape of clustering approaches. Just as there are many ways to define a cluster, there are many ways to find clusters. Understanding this taxonomy helps you navigate the dozens of clustering algorithms and recognize their fundamental similarities and differences.
Clustering algorithms can be categorized along multiple dimensions: How do they partition data? Do they assign points definitively or probabilistically? Do they handle all points or allow for noise? Do they operate on raw data or transformed representations? Each dimension represents a design choice that affects the algorithm's behavior, strengths, and limitations.
This page provides a comprehensive taxonomy of clustering types, preparing you to understand how specific algorithms (K-means, hierarchical clustering, DBSCAN, spectral clustering, GMMs) fit into the broader landscape.
By the end of this page, you will understand the fundamental distinctions between clustering types: partitional vs. hierarchical, hard vs. soft, complete vs. partial, exclusive vs. overlapping, and more. You'll learn which type suits different applications and how these categories guide algorithm selection.
The most fundamental distinction in clustering is between methods that find a single partition versus those that find a hierarchy of partitions.
Partitional Clustering:
Partitional (or "flat") clustering divides data into a fixed number of non-overlapping groups. The output is a single set of clusters $\{C_1, C_2, \ldots, C_k\}$.
Key characteristics:
- Requires the number of clusters $k$ (or an equivalent parameter) up front
- Produces a single flat partition with no nested structure
- Typically efficient, often scaling to large datasets
Examples: K-means, K-medoids, spectral clustering, and GMMs (when each point is assigned to its most probable component)
Hierarchical Clustering:
Hierarchical clustering creates a tree (dendrogram) of nested cluster relationships. Every partition at every granularity is encoded in a single structure.
Key characteristics:
- No need to fix the number of clusters in advance; cut the dendrogram at any level
- Encodes cluster relationships at every granularity in one structure
- Typically $O(n^2)$ time and memory or worse, which limits scalability
Agglomerative vs. Divisive Hierarchical:
Hierarchical clustering comes in two flavors:
Agglomerative (bottom-up): Start with n clusters (each point is a cluster), iteratively merge the two most similar clusters. More common in practice.
Divisive (top-down): Start with one cluster (all points), recursively split. Less common; harder to decide how to split.
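The agglomerative workflow can be sketched with SciPy: the linkage matrix encodes the full dendrogram, and cutting it at different levels yields flat partitions at different granularities. A minimal sketch on synthetic blobs (the data and parameter choices here are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 20 points each
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# Agglomerative (bottom-up): the linkage matrix records every merge,
# i.e. the entire dendrogram, in one structure
Z = linkage(X, method="ward")

# Cutting the same tree at different granularities gives different flat partitions
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k4 = fcluster(Z, t=4, criterion="maxclust")
```

Moving between granularities requires no new computation; a partitional method would have to be rerun from scratch for each value of k.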
Hybrid Approaches:
Some modern methods combine aspects of both. For example, BIRCH builds a cluster-feature tree incrementally and then applies a flat clustering step to its leaves, while bisecting k-means produces a hierarchy through recursive partitional splits.
Use partitional when: you know k, need efficiency, or have very large data. Use hierarchical when: structure is naturally hierarchical, you want to explore multiple granularities, or visualization of relationships is important. Many practical workflows use hierarchical for exploration, then partitional for production.
Should a point belong to exactly one cluster, or can it have partial membership in multiple clusters?
Hard Clustering:
Each point is assigned to exactly one cluster: $$\forall x_i: \quad x_i \in C_j \text{ for exactly one } j$$
Cluster assignments are deterministic—no ambiguity in which cluster owns each point.
Soft (Fuzzy) Clustering:
Each point has a degree of membership to each cluster: $$\forall x_i: \quad \mu_j(x_i) \in [0, 1] \text{ for each cluster } j$$
where $\sum_j \mu_j(x_i) = 1$ (memberships sum to 1).
In probabilistic clustering (GMMs), these are posterior probabilities: $P(\text{cluster } j | x_i)$.
| Aspect | Hard Clustering | Soft Clustering |
|---|---|---|
| Assignment | Binary: 0 or 1 | Continuous: [0, 1] |
| Boundary points | Forced into one cluster | Shared by multiple clusters |
| Information | Less informative | More informative |
| Complexity | Simpler interpretation | More parameters |
| Examples | K-means, DBSCAN, hierarchical | GMM, Fuzzy C-means |
| Best for | Well-separated clusters | Overlapping clusters, uncertainty |
Fuzzy C-Means (FCM):
The classic soft clustering algorithm generalizes k-means:
$$J_{FCM} = \sum_{i=1}^{n} \sum_{j=1}^{k} \mu_{ij}^m \|x_i - c_j\|^2$$
where $\mu_{ij} \in [0, 1]$ is the membership of point $x_i$ in cluster $j$, $m > 1$ is the fuzzifier (larger $m$ gives softer assignments), and $c_j$ is the center of cluster $j$.
The membership update rule: $$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\|x_i - c_j\|}{\|x_i - c_l\|} \right)^{\frac{2}{m-1}}}$$
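The two update equations can be turned into a compact NumPy sketch. This is illustrative only; `fuzzy_c_means` and its defaults are my own naming, not a standard API:

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch (illustrative, not production code)."""
    rng = np.random.default_rng(seed)
    # Initialize memberships randomly, each row summing to 1
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Center update: membership-weighted means (weights raised to m)
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]
        # Membership update: normalize d_ij^(-2/(m-1)) across clusters,
        # which is algebraically the same as the closed-form rule above
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, C

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
U, C = fuzzy_c_means(X, k=2)
```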
When Soft Assignments Matter:
You can always convert soft to hard assignments by giving each point to its highest-membership cluster. GMM often outputs soft assignments but is used in hard mode (assign to the argmax posterior). Going in the reverse direction is harder: hard k-means doesn't naturally produce meaningful soft assignments without modification.
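With scikit-learn's `GaussianMixture`, for example, the soft-to-hard conversion is a single argmax over the posterior probabilities (synthetic data here for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(4, 1.0, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: posterior P(cluster j | x_i); each row sums to 1
probs = gm.predict_proba(X)

# Hard assignments: argmax over posteriors (what gm.predict does internally)
hard = probs.argmax(axis=1)
```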
Must every data point be assigned to a cluster?
Complete Clustering:
Every point is assigned to exactly one cluster. The clusters form a complete partition of the data: $$\bigcup_{j=1}^{k} C_j = X$$
Most clustering algorithms produce complete clusterings. K-means, hierarchical clustering (with fixed cut), and GMM with hard assignments all assign every point.
Partial Clustering:
Some points may not be assigned to any cluster. These points are classified as noise, outliers, or background: $$\bigcup_{j=1}^{k} C_j \subset X$$
Density-based methods (DBSCAN, HDBSCAN) naturally produce partial clusterings. Points in sparse regions are declared noise rather than forced into inappropriate clusters.
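In scikit-learn's DBSCAN, unassigned points are marked with the label `-1`. A small sketch with two dense blobs plus deliberate outliers (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus two isolated points far away
blobs = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
outliers = np.array([[10.0, 10.0], [-10.0, 5.0]])
X = np.vstack([blobs, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# The isolated points receive label -1 (noise) rather than being
# forced into the nearest cluster
```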
Making Complete Methods Partial:
You can convert a complete clustering to a partial one by post-processing: for example, drop points whose distance to their assigned centroid exceeds a threshold, or, for probabilistic methods, drop points whose maximum posterior falls below a cutoff.
Making Partial Methods Complete:
To assign noise points, common options include assigning each one to the nearest cluster (by centroid, or by nearest core point for density-based methods) or training a classifier on the clustered points and predicting labels for the noise.
The Design Choice:
Choosing between complete and partial depends on your application: if every point must receive a label (e.g., every customer must land in some segment), use complete clustering; if outliers or background noise are expected, partial clustering avoids contaminating clusters with them.
Forcing every point into a cluster can distort cluster statistics. A single outlier assigned to a k-means cluster can significantly shift the centroid. When outliers are expected, use partial clustering methods or robust algorithms.
Can a point belong to multiple clusters simultaneously?
Exclusive Clustering:
Each point belongs to at most one cluster: $$C_i \cap C_j = \emptyset \quad \forall i \neq j$$
This is the standard assumption in most clustering. The clusters partition the data into non-overlapping groups.
Overlapping Clustering:
Points can belong to multiple clusters: $$C_i \cap C_j \neq \emptyset \text{ for some } i \neq j$$
This is different from soft clustering: in overlapping clustering, a point can have full membership in multiple clusters, not partial membership in each.
Why Overlapping Clusters?
Many real-world groupings overlap naturally: a document can cover several topics, a gene can participate in multiple biological pathways, and a person can belong to several social communities at once.
Algorithms for Overlapping Clustering:
| Algorithm | Approach |
|---|---|
| Clique Percolation | Find overlapping graph communities via k-clique patterns |
| OCSM | Overlapping Cluster Spanning Model for bioinformatics |
| Fuzzy approaches | Threshold soft assignments to get overlapping hard assignments |
| Multi-assignment | Run clustering multiple times, assign to all above-threshold clusters |
From Soft to Overlapping:
Soft clustering can approximate overlapping by using a threshold: assign a point to every cluster whose membership $\mu_j(x_i)$ meets or exceeds a cutoff $\tau$ (e.g., $\tau = 0.2$).
This converts degree-of-membership to binary multi-membership.
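The thresholding step is a one-liner on a matrix of soft memberships. A sketch using GMM posteriors, with an illustrative cutoff `tau` (the value 0.2 is an assumption, not a recommendation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two deliberately overlapping blobs
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(2.5, 1.0, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)

tau = 0.2                 # membership cutoff (illustrative choice)
member = probs >= tau     # boolean n-by-k multi-membership matrix

# Points in the overlap region now fully belong to both clusters
n_multi = int((member.sum(axis=1) > 1).sum())
```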
Soft clustering: point has 30% membership in cluster A, 70% in cluster B. Overlapping clustering: point fully belongs to both A and B. Soft represents uncertainty or gradation; overlapping represents genuine multi-class membership. A document might be 'half about ML' (soft) or 'a full member of both ML and healthcare topics' (overlapping).
Does the algorithm always produce the same result, or does randomness play a role?
Deterministic Clustering:
Given the same data and parameters, the algorithm always produces the same clustering:
$$\text{Algorithm}(X, \theta) \rightarrow C \quad \text{(same every time)}$$
Examples: agglomerative hierarchical clustering (given a fixed linkage rule) and DBSCAN (aside from tie-breaking for border points)
Stochastic Clustering:
Results vary across runs due to random initialization or sampling:
$$\text{Algorithm}(X, \theta) \rightarrow C_1, C_2, C_3, \ldots \quad \text{(may differ)}$$
Examples: k-means (random centroid initialization), GMMs fit by EM (random parameter initialization), and any method that subsamples the data
Handling Stochasticity:
When using stochastic algorithms:
Set random seed for reproducibility:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=k, random_state=42).fit(X)  # fixed seed: identical results every run
Run multiple times and select best:
best_km, best_inertia = None, float('inf')
for _ in range(10):
    km = KMeans(n_clusters=k, n_init=1).fit(X)
    if km.inertia_ < best_inertia:
        best_inertia = km.inertia_   # track the best objective seen so far
        best_km = km
Assess stability: If different runs give very different results, the clustering may not be stable.
Use consensus clustering: Combine multiple runs to find robust clusters.
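One common consensus construction is the co-association matrix: the fraction of runs in which each pair of points lands in the same cluster. A minimal sketch (the setup and run count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
n, runs, k = len(X), 20, 2

# Co-association matrix: fraction of runs in which two points co-cluster
co = np.zeros((n, n))
for seed in range(runs):
    labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= runs

# Stable pairs sit near 1; pairs that flip between runs sit well below 1
stable = co[:30, :30].mean()
```

Thresholding or re-clustering the co-association matrix yields a consensus partition that is more robust than any single run.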
Smart Initialization:
K-means++ initialization dramatically reduces the impact of randomness by choosing initial centroids that are spread apart. This makes k-means more reproducible while keeping the efficiency benefits of the algorithm.
For production systems, always set random seeds and document them. For research, report results across multiple runs with confidence intervals. Tools like scikit-learn's n_init parameter automatically run k-means multiple times and return the best result.
A fundamental distinction in the algorithmic approach to clustering:
Distance-Based (Metric) Clustering:
Clustering is based on pairwise distances between points. The algorithm uses distances directly to determine similarity and cluster membership.
Characteristics: needs only a pairwise distance or similarity function; makes no distributional assumptions; the output is typically just the cluster assignments.
Model-Based (Probabilistic) Clustering:
Data is assumed to be generated from a probabilistic model. Clustering involves inferring the model parameters that best explain the data.
Characteristics: assumes a generative model (e.g., a mixture of Gaussians); fitting yields model parameters plus posterior membership probabilities; model comparison can use likelihood-based criteria.
| Aspect | Distance-Based | Model-Based |
|---|---|---|
| Foundation | Pairwise distances | Probability distributions |
| Assumptions | Similarity metric is meaningful | Data follows assumed distribution |
| Output | Cluster assignments | Assignments + model parameters |
| Uncertainty | Not naturally quantified | Posterior probabilities |
| Model selection | Heuristics (elbow, silhouette) | Principled (BIC, AIC, cross-validation) |
| Flexibility | Any distance metric | Distribution family must be chosen |
| New data | Assign by distance to clusters | Assign by posterior probability |
K-Means as a Special Case:
Interestingly, k-means can be derived from both perspectives:
Distance view: Minimize WCSS = sum of squared Euclidean distances to centroids
Model view: Maximum likelihood estimation for a GMM with spherical, equal-variance Gaussians and hard assignments
This duality shows how the paradigms connect. GMM generalizes k-means by allowing non-spherical covariances, unequal cluster weights, and soft (probabilistic) assignments.
Practical Implications:
Choose distance-based when: you have a meaningful similarity metric, want to avoid distributional assumptions, or are in an exploratory phase.
Choose model-based when: you need uncertainty estimates (posterior probabilities), principled model selection (BIC/AIC), or a generative model for scoring new data.
In practice, you might use distance-based methods for exploration (simpler, fewer assumptions) and model-based methods for production (principled model selection, uncertainty estimates). Many workflows start with k-means to get a sense of the data, then refine with GMM for the final model.
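That explore-then-refine workflow can be sketched with scikit-learn, seeding the GMM with the k-means centroids via the `means_init` parameter (the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(5, 1.0, (100, 2))])
k = 2

# Exploration: fast distance-based pass
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Refinement: model-based pass, initialized at the k-means centroids
gm = GaussianMixture(n_components=k, means_init=km.cluster_centers_,
                     random_state=0).fit(X)

# The model-based fit adds posterior uncertainties and a likelihood-based
# criterion (BIC) for principled model selection
score = gm.bic(X)
```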
How does the algorithm handle data: all at once, or point by point?
Batch Clustering:
The algorithm sees all data before beginning and processes it as a whole:
$$\text{Cluster}(\{x_1, x_2, \ldots, x_n\}) \rightarrow \{C_1, C_2, \ldots, C_k\}$$
Most clustering algorithms are batch algorithms: k-means, hierarchical, spectral, GMM.
Online (Incremental) Clustering:
Data arrives one point at a time, and the clustering is updated incrementally:
$$\text{Cluster}_{t+1} = \text{Update}(\text{Cluster}_t, x_{t+1})$$
The algorithm maintains a current clustering that evolves as new data arrives.
Why Online Clustering?
Batch clustering is insufficient when: data arrives continuously as a stream, the dataset is too large to fit in memory, or the underlying distribution drifts over time.
Online K-Means (Mini-Batch K-Means):
Samples small batches randomly and updates centroids incrementally:
for batch in stream:
    assign each batch point to its nearest centroid
    update each centroid using learning rate η:
        μ_j ← (1 − η) μ_j + η · mean(batch points in C_j)
Converges to similar solution as batch k-means but much faster for large data.
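In scikit-learn this is `MiniBatchKMeans`, whose `partial_fit` method updates the centroids one batch at a time. A sketch with a simulated stream (the stream generator and its parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Simulated stream: batches drawn from two fixed, well-separated blobs
def stream(n_batches=50, batch_size=20):
    for _ in range(n_batches):
        a = rng.normal(0, 0.5, (batch_size // 2, 2))
        b = rng.normal(5, 0.5, (batch_size // 2, 2))
        yield np.vstack([a, b])

mbk = MiniBatchKMeans(n_clusters=2, random_state=0)
for batch in stream():
    mbk.partial_fit(batch)   # incremental centroid update; O(batch) memory

# After the stream, the centroids should sit near the blob centers
centers = np.sort(mbk.cluster_centers_[:, 0])
```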
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
Builds a CF (Clustering Feature) tree incrementally: each node stores compact summary statistics (point count, linear sum, squared sum) of the points it absorbs, so new points update the tree without revisiting old data, and a final global clustering step runs over the leaf summaries.
| Aspect | Batch | Online |
|---|---|---|
| Data access | All data available | One point/batch at a time |
| Memory | O(n) or O(n²) | O(k) or O(tree size) |
| Quality | Better (sees all data) | Approximate (limited view) |
| Adaptability | Rerun for new data | Naturally adapts |
| Use case | Static datasets | Streams, large data |
Online algorithms can handle 'concept drift' — when the underlying cluster structure changes over time. A batch algorithm trained on last month's data may be wrong for this month. Online methods can adapt continuously. Techniques like exponential forgetting (giving less weight to old points) help track changes.
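The incremental style is visible in scikit-learn's `Birch`, which also exposes `partial_fit`. A sketch feeding data in chunks (chunk sizes and the `threshold` value are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)

# BIRCH absorbs data chunk by chunk into its CF tree
birch = Birch(n_clusters=2, threshold=0.5)
for _ in range(20):
    chunk = np.vstack([rng.normal(0, 0.3, (25, 2)),
                       rng.normal(4, 0.3, (25, 2))])
    birch.partial_fit(chunk)   # updates the tree; old chunks are not revisited

# The global clustering step operates on CF-tree leaves, not raw points
labels = birch.predict(np.array([[0.0, 0.0], [4.0, 4.0]]))
```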
What if we have some prior knowledge about which points should (or shouldn't) be together?
The Semi-Supervised Setting:
Pure unsupervised clustering uses only feature data. But often we have partial knowledge: a handful of labeled examples, pairwise judgments that two items belong together (or apart), or domain rules about group composition.
Constraint-based clustering incorporates this knowledge.
Types of Constraints:
Must-Link (ML): Points $x_i$ and $x_j$ must be in the same cluster. $$\text{ML}(x_i, x_j): \text{cluster}(x_i) = \text{cluster}(x_j)$$
Cannot-Link (CL): Points $x_i$ and $x_j$ must be in different clusters. $$\text{CL}(x_i, x_j): \text{cluster}(x_i) \neq \text{cluster}(x_j)$$
Must-link constraints are transitive: if ML(a, b) and ML(b, c), then ML(a, c). Cannot-link constraints are not transitive, but the two interact: ML(a, b) together with CL(b, c) entails CL(a, c).
Algorithms for Constrained Clustering:
COP-Kmeans: K-means with hard constraints — an assignment that violates any constraint is forbidden, and the algorithm fails if no feasible assignment exists.
PCKMeans: Penalized Constraint K-means — violations are allowed but add a penalty to the objective, trading constraint satisfaction against cluster compactness.
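The heart of the hard-constraint approach is the assignment step: try clusters nearest-first and skip any that would violate a constraint. A sketch of that step (function and variable names are my own, not a library API):

```python
import numpy as np

def constrained_assign(x_idx, X, centers, labels, must_link, cannot_link):
    """Assign point x_idx to the nearest centroid violating no constraint.

    labels[i] == -1 means point i is not yet assigned. Returns None if no
    cluster is feasible (the case where COP-Kmeans-style algorithms fail).
    """
    order = np.argsort(np.linalg.norm(centers - X[x_idx], axis=1))
    for j in order:                       # try clusters nearest-first
        ok = True
        for a, b in must_link:            # must-link partner assigned elsewhere?
            other = b if a == x_idx else (a if b == x_idx else None)
            if other is not None and labels[other] not in (-1, j):
                ok = False
        for a, b in cannot_link:          # cannot-link partner already in j?
            other = b if a == x_idx else (a if b == x_idx else None)
            if other is not None and labels[other] == j:
                ok = False
        if ok:
            return j
    return None

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, -1, 1])
# A cannot-link constraint pushes point 1 away from its nearest cluster
j = constrained_assign(1, X, centers, labels, must_link=[],
                       cannot_link=[(0, 1)])
```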
Metric Learning: Instead of constraining assignments directly, learn a distance metric (e.g., a Mahalanobis distance) under which must-link pairs are close and cannot-link pairs are far apart, then run any standard clustering algorithm with the learned metric.
When to Use Constraints:
Constraint-based clustering is valuable when: labeling the full dataset is too expensive but a few pairwise judgments are cheap, domain rules must be respected, or purely unsupervised runs keep producing groupings that contradict known structure.
Not all constraints are equally useful. Active learning selects the most informative pairs to query. Typically, pairs where the algorithm is uncertain (boundary points between clusters) provide the most value. This minimizes human labeling effort while maximizing clustering improvement.
We've surveyed the major dimensions along which clustering approaches vary. Let's consolidate: partitional vs. hierarchical (one partition or a tree of them), hard vs. soft (binary or graded membership), complete vs. partial (assign every point or allow noise), exclusive vs. overlapping (one cluster per point or several), deterministic vs. stochastic (repeatable or randomized), distance-based vs. model-based (metric or generative), batch vs. online (all at once or incremental), and unconstrained vs. constraint-based (purely unsupervised or guided by prior knowledge).
What's Next:
With the full taxonomy understood, we turn to the challenges of evaluating clustering—a notoriously difficult problem because there's typically no ground truth. The next page explores internal and external validation metrics, stability analysis, and the fundamental difficulties of assessing unsupervised learning.
You now have a comprehensive taxonomy of clustering approaches. This classification helps you navigate the many algorithms available and understand their fundamental characteristics. Next, we'll tackle the challenging question of how to evaluate whether a clustering is 'good.'