Clustering occupies a unique position in machine learning—it seeks to discover structure without being told what to find. Unlike regression and classification where labels guide learning, clustering must identify natural groupings from data alone. This is unsupervised learning in its purest form.
Clustering answers questions like: which observations naturally belong together? How many distinct groups does the data contain? Which points fit no group at all?
The absence of labels is both liberating and challenging. There's no ground truth to optimize against—success depends on whether discovered structure is meaningful and useful for downstream tasks.
By the end of this page, you will understand clustering at a foundational level: formal definitions and objectives, major algorithmic paradigms (partitional, hierarchical, density-based, model-based), the critical role of distance metrics and feature scaling, cluster validity measures, and practical considerations for applying clustering effectively. This knowledge enables you to discover meaningful structure in any unlabeled dataset.
Formalizing clustering is surprisingly subtle. Unlike supervised learning where objective functions are clear, clustering objectives admit multiple valid formulations.
The Basic Setup:
Given a dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ where each $x_i \in \mathbb{R}^d$, clustering seeks a partition: $$\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$$
such that: every point belongs to some cluster ($\bigcup_{k} C_k = \mathcal{D}$), each cluster is non-empty, and, in hard clustering, clusters are pairwise disjoint ($C_k \cap C_l = \emptyset$ for $k \neq l$).
Soft vs. Hard Clustering:
Hard clustering assigns each point to exactly one cluster.
Soft (fuzzy) clustering assigns membership probabilities: $p_{ik}$ = probability that point $i$ belongs to cluster $k$, with $\sum_k p_{ik} = 1$.
Soft clustering is more flexible and often more appropriate—real data often admits ambiguous assignments.
Objective Functions:
Different clustering algorithms optimize different objectives:
Compactness-based (within-cluster distances small): $$\min_{\mathcal{C}} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
where $\mu_k$ is the centroid of cluster $k$. This is the k-means objective.
Separation-based (between-cluster distances large): Maximize $\sum_{k < l} \|\mu_k - \mu_l\|^2$
Ratio-cut and Normalized-cut (graph-based): Minimize edges cut between clusters while keeping clusters balanced.
Likelihood-based (probabilistic models): $$\max_\theta \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \cdot p(x_i | \theta_k)$$
This is the Gaussian Mixture Model (GMM) objective.
There is no single 'correct' clustering objective. Different objectives encode different assumptions about what makes a good cluster. K-means assumes spherical clusters of similar size. DBSCAN assumes clusters are dense regions separated by sparse regions. Hierarchical clustering assumes nested structure. Choose the objective that matches your understanding of the domain.
| Objective | Formula/Description | Assumption | Algorithm |
|---|---|---|---|
| Within-cluster sum of squares | $\sum_k \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$ | Spherical, similar-sized clusters | K-means, K-medoids |
| Log-likelihood (GMM) | $\sum_i \log \sum_k \pi_k \mathcal{N}(x_i; \mu_k, \Sigma_k)$ | Gaussian-distributed clusters | EM for GMM |
| Density connectivity | Maximal density-connected regions | Clusters are dense regions | DBSCAN, HDBSCAN |
| Linkage criterion | Min/max/avg distance between clusters | Hierarchical structure exists | Agglomerative clustering |
| Graph cuts | Minimize edges cut / cluster weight | Data forms a graph | Spectral clustering |
The choice of distance metric fundamentally shapes clustering results. Two datasets that look similar under Euclidean distance may have completely different structures under cosine similarity. Understanding distance is prerequisite to understanding clustering.
What Makes a Valid Distance Metric?
A function $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^+$ is a metric if it satisfies non-negativity ($d(x, y) \geq 0$), identity of indiscernibles ($d(x, y) = 0$ if and only if $x = y$), symmetry ($d(x, y) = d(y, x)$), and the triangle inequality ($d(x, z) \leq d(x, y) + d(y, z)$).
Some useful functions (cosine similarity, KL divergence) are not true metrics but work well in practice.
| Metric | Formula | Properties | Use Cases |
|---|---|---|---|
| Euclidean (L2) | $\sqrt{\sum_i (x_i - y_i)^2}$ | Isotropic; sensitive to scale | General purpose; requires scaled features |
| Manhattan (L1) | $\sum_i |x_i - y_i|$ | More robust to outliers than L2 | When dimensions are independent |
| Chebyshev (L∞) | $\max_i |x_i - y_i|$ | Maximum coordinate difference | When worst-case dimension matters |
| Minkowski (Lp) | $(\sum_i |x_i - y_i|^p)^{1/p}$ | Generalizes L1, L2, L∞ | Tunable robustness |
| Cosine Distance | $1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | Measures angle, ignores magnitude | Text, vector embeddings, high-dim sparse |
| Mahalanobis | $\sqrt{(x-y)^T \Sigma^{-1} (x-y)}$ | Accounts for feature correlations | When correlations matter; requires Σ estimate |
| Hamming | $\sum_i \mathbb{1}[x_i \neq y_i]$ | Counts differing positions | Binary or categorical data |
| Jaccard | $1 - \frac{|A \cap B|}{|A \cup B|}$ | Set overlap | Sets, binary feature vectors |
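As a concrete reference, the sketch below (assuming NumPy and SciPy are available) computes several of the metrics from the table on two toy vectors; the vectors and printed values are illustrative only.

```python
# A minimal sketch comparing common distance metrics on two toy vectors.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # y is a scaled copy of x

print("Euclidean:", distance.euclidean(x, y))      # straight-line distance
print("Manhattan:", distance.cityblock(x, y))      # sum of coordinate differences
print("Chebyshev:", distance.chebyshev(x, y))      # largest coordinate difference
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine:", distance.cosine(x, y))            # ~0: the angle between x and y is 0
```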
The Critical Role of Feature Scaling:
Most distance metrics are sensitive to feature scales. If one feature ranges [0, 1] and another [0, 1000], the second dominates distance calculations.
Common scaling approaches: standardization (z-score: subtract the mean, divide by the standard deviation), min-max scaling to [0, 1], and robust scaling based on the median and interquartile range when outliers are present.
Choosing the right distance and scaling is as important as choosing the right algorithm. A poor choice can make any algorithm fail; a good choice can make simple algorithms succeed.
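A minimal sketch of the scaling problem, assuming NumPy and scikit-learn; the feature ranges mirror the [0, 1] vs. [0, 1000] example above.

```python
# Illustration: an unscaled wide-range feature dominates Euclidean distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 1, 100),      # feature 1: range [0, 1]
    rng.uniform(0, 1000, 100),   # feature 2: range [0, 1000]
])

# Raw distance between the first two points is driven almost entirely by feature 2.
print("raw distance   :", np.linalg.norm(X[0] - X[1]))

# After z-score standardization, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
print("scaled distance:", np.linalg.norm(X_scaled[0] - X_scaled[1]))
```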
In high dimensions, distances concentrate: the ratio of maximum to minimum distances approaches 1. All points become 'equally far' from each other, making distance-based clustering fail. Solutions include: dimensionality reduction (PCA, t-SNE, UMAP) before clustering, careful feature selection, or algorithms designed for high dimensions (subspace clustering).
Partitional algorithms directly partition data into K clusters, typically by optimizing an objective function iteratively.
K-Means: The Workhorse Algorithm
K-means is the most widely used clustering algorithm due to its simplicity and scalability.
Algorithm: (1) initialize K centroids (ideally with k-means++, below); (2) assign each point to its nearest centroid; (3) recompute each centroid as the mean of its assigned points; (4) repeat steps 2–3 until assignments stop changing.
Objective minimized: Within-cluster sum of squares (WCSS)
Complexity: $O(n \cdot K \cdot d \cdot I)$ where $I$ is iterations (typically small)
K-Means++ Initialization:
Poor initialization leads to poor local optima. K-means++ chooses initial centroids to be spread out: the first centroid is a uniformly random data point, and each subsequent centroid is sampled with probability proportional to $D(x)^2$, the squared distance from $x$ to its nearest already-chosen centroid.
This gives $O(\log K)$-competitive approximation to optimal and dramatically improves practical performance.
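A minimal sketch of k-means with k-means++ initialization, assuming scikit-learn; the synthetic blobs, K=4, and other parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

km = KMeans(
    n_clusters=4,
    init="k-means++",   # spread-out initialization
    n_init=10,          # multiple restarts to escape poor local optima
    random_state=42,
)
labels = km.fit_predict(X)

print("WCSS (inertia):", km.inertia_)             # the objective being minimized
print("Centroids shape:", km.cluster_centers_.shape)
```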
K-Medoids (PAM - Partitioning Around Medoids):
Instead of mean centroids, uses actual data points (medoids). More robust to outliers and works with any distance metric (not just Euclidean). But slower: O(K(n-K)²) per iteration.
Gaussian Mixture Models (GMM):
Probabilistic generalization of K-means. Assumes data is generated from a mixture of K Gaussian distributions:
$$p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x; \mu_k, \Sigma_k)$$
where $\pi_k$ are mixing coefficients (sum to 1), $\mu_k$ are means, and $\Sigma_k$ are covariances.
Advantages over K-means: soft (probabilistic) assignments, elliptical cluster shapes via per-cluster covariance matrices, unequal cluster weights via the mixing coefficients, and a likelihood that supports principled model selection (e.g., BIC or AIC).
Trained via the EM algorithm: the E-step computes each point's responsibilities (posterior membership probabilities) under the current parameters; the M-step re-estimates $\pi_k$, $\mu_k$, and $\Sigma_k$ from those responsibilities; the steps alternate until the log-likelihood converges.
More flexible but more parameters to estimate; can overfit with small data.
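A minimal sketch of fitting a GMM and reading off soft assignments, assuming scikit-learn; the data and K=3 are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",   # per-cluster covariance -> elliptical clusters
    random_state=0,
).fit(X)

hard_labels = gmm.predict(X)        # most probable component per point
soft_probs = gmm.predict_proba(X)   # p_ik: membership probabilities (rows sum to 1)
print(soft_probs[0])
print("BIC:", gmm.bic(X))           # likelihood-based model selection criterion
```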
Use K-means when: data is large, clusters are roughly spherical and similar-sized, you need speed, hard assignments are acceptable. Use GMM when: cluster shapes vary (elliptical), you need probability estimates, model selection via likelihood is important, data size is manageable. Both require specifying K.
Hierarchical clustering produces a tree-like hierarchy of clusters (dendrogram) rather than a single partition. This is valuable when the goal is to explore multi-scale structure or when the 'right' number of clusters is unknown.
Two Approaches:
1. Agglomerative (bottom-up): Start with n singleton clusters; repeatedly merge the two closest clusters until one remains.
2. Divisive (top-down): Start with one cluster containing all points; repeatedly split clusters until n singletons remain.
Agglomerative is more common because divisive requires deciding how to split (expensive for large clusters).
Linkage Criteria:
The key choice in agglomerative clustering: how do we define distance between clusters?
Single Linkage: Distance = minimum distance between any pair of points (one from each cluster) $$d(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$$
Complete Linkage: Distance = maximum distance between any pair $$d(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)$$
Average Linkage (UPGMA): Distance = average of all pairwise distances $$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
Ward's Method: Minimize increase in total within-cluster variance when merging
| Linkage | Cluster Shape | Sensitivity | Complexity | Best For |
|---|---|---|---|---|
| Single | Elongated, chained | Very noise-sensitive | O(n²) | Detecting connected regions |
| Complete | Compact, spherical | Outlier-sensitive | O(n²) | Compact clusters of similar diameter |
| Average | Moderate shapes | Balanced sensitivity | O(n²) | General purpose |
| Ward | Compact, spherical | Moderate sensitivity | O(n²) | Variance-based clustering; most popular |
Reading and Cutting Dendrograms:
The dendrogram visualizes the merge history: leaves are individual points, internal nodes are merges, and the height of each node is the inter-cluster distance at which that merge occurred. Large gaps between successive merge heights indicate well-separated clusters.
To obtain K clusters: Draw horizontal line at appropriate height; the number of vertical lines it crosses equals the number of clusters. Alternatively, select K largest gaps in merge heights.
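A minimal sketch of agglomerative clustering with Ward linkage and a dendrogram cut, assuming SciPy (and scikit-learn only for the toy data); K=3 is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

Z = linkage(X, method="ward")                     # merge history, one row per merge
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into exactly 3 clusters

# dendrogram(Z) draws the merge tree (requires matplotlib);
# the merge heights themselves are stored in Z[:, 2].
print("cluster sizes:", np.bincount(labels)[1:])  # fcluster labels are 1-based
```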
Complexity: O(n² log n) with efficient implementations (O(n²) for single linkage). Memory O(n²) for distance matrix. This limits scalability to ~10,000-100,000 points.
Hierarchical clustering is ideal for exploratory analysis where the number of clusters is unknown, when you want to visualize cluster structure at multiple scales, for small-to-medium datasets where the dendrogram provides interpretable insight, and in domains like biology where taxonomic hierarchies are natural (phylogenetic trees).
Density-based methods define clusters as regions of high density separated by regions of low density. This elegant formulation handles arbitrarily shaped clusters and naturally identifies outliers.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
The foundational density-based algorithm. Key concepts:
Core points: Points with at least minPts neighbors within radius ε
Border points: Points within ε of a core point but not core themselves
Noise points: Points that are neither core nor border
Density-connected: Two points are in the same cluster if they're connected through a chain of core points.
DBSCAN Algorithm: visit each unvisited point; if it is a core point, start a new cluster and expand it by adding every density-reachable point (core and border); points reachable from no core point are labeled noise.
Parameters:
ε (epsilon): Radius for neighborhood queries
minPts: Minimum points to form a dense region
Choosing parameters:
minPts ≥ d+1 where d is dimensionality (rule of thumb: minPts = 2×d)
ε: Use the k-distance plot (plot the distance to each point's k-th nearest neighbor, sorted; look for the 'elbow')
Complexity: O(n log n) with spatial indexing (R-tree, KD-tree); O(n²) worst case
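A minimal sketch of the k-distance heuristic followed by DBSCAN, assuming scikit-learn; eps=0.3, minPts=10, and the two-moons data are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-distance plot: sorted distance to each point's k-th neighbor; pick eps at the elbow.
k = 10
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])   # plot this curve to choose eps

db = DBSCAN(eps=0.3, min_samples=k)
labels = db.fit_predict(X)       # noise points receive the label -1

print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```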
HDBSCAN: Hierarchical DBSCAN
Addresses DBSCAN's sensitivity to ε by extracting a hierarchy of clusters: conceptually, it considers every density level (every ε) at once, builds the resulting cluster hierarchy, and selects the most stable clusters from it.
Key advantages: no ε to tune (only a minimum cluster size), clusters of varying density can be recovered, and stability-based cluster selection makes results less sensitive to parameter choices.
HDBSCAN has become the go-to density-based method for many practitioners.
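A minimal sketch, assuming scikit-learn ≥ 1.3 (which includes an HDBSCAN implementation; the standalone `hdbscan` package exposes a very similar interface). The min_cluster_size value is illustrative.

```python
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

hdb = HDBSCAN(min_cluster_size=15)   # the main parameter; no epsilon required
labels = hdb.fit_predict(X)          # -1 marks noise, as in DBSCAN

print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
```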
Use DBSCAN/HDBSCAN when: clusters have arbitrary shapes, outliers exist and should be detected, number of clusters is unknown, density is roughly uniform. Use K-means when: clusters are roughly spherical and similar-sized, you know K, computational speed is critical, outliers are pre-filtered.
How do we know if clustering succeeded? Without ground truth labels, evaluation is fundamentally ambiguous—but several principles and metrics provide guidance.
Two Types of Validation:
Internal validation: Measures based only on the data and clustering (no external labels)
External validation: Measures agreement with known ground truth labels (used for algorithm comparison on labeled datasets)
| Metric | Formula / Description | Range | Interpretation |
|---|---|---|---|
| Silhouette Coefficient | $\frac{b(i) - a(i)}{\max(a(i), b(i))}$ where a=intra-cluster dist, b=nearest-cluster dist | [-1, 1] | Higher = better; >0.7 strong structure |
| Calinski-Harabasz Index | Ratio of between-cluster to within-cluster variance | [0, ∞) | Higher = better defined clusters |
| Davies-Bouldin Index | Average similarity of each cluster to its most similar cluster | [0, ∞) | Lower = better separation |
| Dunn Index | Min inter-cluster dist / max intra-cluster dist | [0, ∞) | Higher = compact and well-separated |
| Within-Cluster Sum of Squares | $\sum_k \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$ | [0, ∞) | Lower = more compact (but decreases with K) |
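A minimal sketch computing the internal metrics from the table for one clustering, assuming scikit-learn; the blobs data and K=4 are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

print("Silhouette        :", silhouette_score(X, labels))          # higher is better
print("Calinski-Harabasz :", calinski_harabasz_score(X, labels))   # higher is better
print("Davies-Bouldin    :", davies_bouldin_score(X, labels))      # lower is better
```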
The Silhouette Coefficient in Detail:
For each point $i$: $a(i)$ is the mean distance from $i$ to the other points in its own cluster (cohesion), $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster (separation), and $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$.
Interpretation: $s(i)$ near 1 means the point sits well inside its cluster, values near 0 indicate a point on the boundary between two clusters, and negative values suggest the point may be assigned to the wrong cluster.
Overall silhouette = average over all points. Values above roughly 0.7 indicate strong structure, 0.5–0.7 reasonable structure, 0.25–0.5 weak structure, and below 0.25 little or no substantial structure.
External Validation Metrics:
When ground truth labels exist (for algorithm benchmarking):
Rand Index (RI): Fraction of point pairs where clustering and ground truth agree (both same or both different clusters). Range [0, 1].
Adjusted Rand Index (ARI): RI corrected for chance. Expected value = 0 for random clustering. Range [-1, 1]; 1 = perfect agreement.
Normalized Mutual Information (NMI): Information-theoretic measure of agreement. Range [0, 1]. $$NMI(C, T) = \frac{2 \cdot I(C; T)}{H(C) + H(T)}$$
Fowlkes-Mallows Index: Geometric mean of precision and recall of pair classifications.
ARI and NMI are the most commonly reported external validation metrics.
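A minimal sketch of external validation against known labels, assuming scikit-learn; the synthetic data stand in for a benchmark dataset with ground truth.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y_true = make_blobs(n_samples=500, centers=3, random_state=3)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Both metrics ignore how cluster IDs are numbered; only the grouping matters.
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```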
No internal metric can definitively validate clustering. Different metrics favor different cluster properties (compactness, separation, balance). Always combine quantitative metrics with domain expertise and visual inspection. A clustering that scores well but doesn't make domain sense is useless.
Successful clustering requires navigating numerous practical decisions beyond algorithm choice. Here are the key considerations that distinguish expert practice.
The Elbow Method for Choosing K: run the algorithm for a range of K values, plot WCSS against K, and choose the K at the 'elbow' of the curve, beyond which additional clusters yield only marginal reductions in WCSS.
Limitations: The elbow is often subjective or absent. Supplement with silhouette analysis (plot average silhouette vs. K; choose K that maximizes) or domain knowledge.
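A minimal sketch of an elbow-plus-silhouette sweep over K, assuming scikit-learn; the K range and data are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                      # plot against k; look for the elbow
    sil = silhouette_score(X, km.labels_)   # prefer the k that maximizes this
    print(f"K={k}  WCSS={wcss:10.1f}  silhouette={sil:.3f}")
```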
Gap Statistic: Compares WCSS to expected WCSS under null reference distribution. Choose K where gap is large. More principled but computationally expensive.
Always visualize your clusters. Use dimensionality reduction (PCA, t-SNE, UMAP) to project high-dimensional data to 2D/3D. Color points by cluster label. If clusters don't look sensible in visualization, they probably aren't—regardless of what metrics say. Human judgment remains essential.
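A minimal sketch of the visualization step, assuming scikit-learn and matplotlib: project high-dimensional data to 2D with PCA and color points by cluster label. The 10-dimensional blobs are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)   # project 10-D data to 2-D for plotting
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap="tab10")
plt.title("Clusters in PCA projection")
plt.show()
```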
We've explored clustering comprehensively—formal definitions, distance metrics, algorithmic paradigms (partitional, hierarchical, density-based), cluster validity, and practical considerations. Let's synthesize the key insights.
Connection to Other Problem Types:
Clustering connects to the broader ML landscape: it pairs naturally with dimensionality reduction (PCA, t-SNE, UMAP) for preprocessing and visualization, probabilistic clustering with GMMs is a form of density estimation, and discovered cluster labels frequently serve as features or segments for downstream supervised tasks.
What's Next:
Having covered regression (continuous), classification (discrete), and clustering (unsupervised), we turn to structured prediction—predicting complex, structured outputs like sequences, trees, and graphs. This generalizes classification to outputs with internal dependencies.
You now possess a comprehensive understanding of clustering problems in machine learning: from formal definitions through practical implementation, you have the conceptual framework to discover meaningful structure in any unlabeled dataset.