We've discussed clustering objectives and distance measures, but we've avoided a fundamental question: What exactly is a cluster? This seemingly simple question has no single answer—and that's not a weakness of clustering theory, but a reflection of the genuine complexity of grouping data.
Different definitions of "cluster" lead to fundamentally different algorithms. K-means implicitly defines clusters as spherical regions around centroids. DBSCAN defines clusters as dense regions separated by sparse areas. Hierarchical methods define clusters as nested trees of relationships. Understanding these definitions is key to choosing and interpreting clustering algorithms.
This page provides a rigorous exploration of cluster definitions, from intuitive notions to formal mathematical characterizations. You'll learn to recognize which definition matches your problem and which algorithms implement that definition.
By the end of this page, you will understand multiple formal definitions of clusters including center-based, density-based, connectivity-based, and graph-based formulations. You'll learn how each definition translates into specific algorithms and understand which definition is appropriate for different data characteristics.
Before examining formal definitions, let's understand why defining "cluster" is genuinely difficult. Consider looking at a scatter plot of points. Humans effortlessly see groupings—but formalizing that intuition is surprisingly hard.
The Human Perception Problem:
When we visually identify clusters, we rely on sophisticated perceptual mechanisms, such as the Gestalt principles of proximity, continuity, and closure, that are difficult to replicate mathematically.
No algorithm perfectly captures human cluster perception. Every algorithm is an approximation based on specific mathematical assumptions.
The Multi-Scale Problem:
Consider data with hierarchical structure: galaxies form clusters, which form superclusters, which form cosmic filaments. At which scale should we define clusters? The "correct" answer depends entirely on the question we're asking.
Similarly, customers might cluster by broad demographic segments, by purchasing behavior within a segment, or by fine-grained individual product preferences.
All of these are valid clusterings at different granularities.
The Shape Problem:
Some definitions assume clusters have specific shapes: k-means implies spherical blobs, Gaussian mixtures imply ellipsoids, and density-based methods permit arbitrary shapes.
Real clusters often have irregular shapes that don't fit neat assumptions. An algorithm assuming spherical clusters will artificially split an elongated cluster.
There is no universally accepted definition of 'cluster.' Every definition embodies assumptions that may or may not match your data. The key is to choose a definition—and thus an algorithm—whose assumptions align with your domain knowledge and objectives.
The most intuitive cluster definition: a cluster is a set of points that are closer to their cluster's prototype (center or representative) than to any other cluster's prototype.
Formal Definition:
Given prototypes $\{\mu_1, \mu_2, \ldots, \mu_k\}$, a point $x_i$ belongs to cluster $C_j$ if:
$$C_j = \{x_i : d(x_i, \mu_j) \leq d(x_i, \mu_l) \ \forall l \neq j\}$$
This creates a Voronoi tessellation of the feature space—each cluster occupies a convex region bounded by hyperplanes equidistant between prototypes.
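To make the Voronoi picture concrete, here is a minimal sketch of nearest-prototype assignment (function name and data are illustrative):

```python
import numpy as np

def assign_to_prototypes(X, prototypes):
    """Return the index of the nearest prototype for each row of X."""
    # Pairwise squared Euclidean distances, shape (n_points, n_prototypes).
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Two prototypes on the x-axis; the Voronoi boundary is the hyperplane x = 5.
prototypes = np.array([[0.0, 0.0], [10.0, 0.0]])
X = np.array([[1.0, 1.0], [9.0, -1.0], [4.9, 0.0], [5.1, 0.0]])
labels = assign_to_prototypes(X, prototypes)
print(labels)  # [0 1 0 1]
```

Points just left and right of x = 5 land in different cells, reflecting the hyperplane boundary equidistant between the two prototypes.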
Types of Prototypes:
| Prototype Type | Definition | Algorithm | Properties |
|---|---|---|---|
| Centroid (Mean) | $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$ | K-Means | Minimizes squared distances; not a data point; sensitive to outliers |
| Medoid | $m_j = \arg\min_{x \in C_j} \sum_{y \in C_j} d(x, y)$ | K-Medoids, PAM | Actual data point; more robust; works with any distance |
| Median | Per-dimension median of cluster | K-Medians | Robust to outliers; uses L1 distance |
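To see how the three prototype types differ, here is a small sketch (data values are illustrative; an outlier at x = 100 exposes the robustness differences noted in the table):

```python
import numpy as np

# One cluster with an outlier at (100, 0).
C = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [100.0, 0.0]])

centroid = C.mean(axis=0)        # mean: pulled to x = 25.75 by the outlier
median = np.median(C, axis=0)    # per-dimension median: x = 1.5, robust

# Medoid: the actual data point minimizing total distance to the others
# (using L1 distance here, since medoids work with any distance).
dists = np.abs(C[:, None, :] - C[None, :, :]).sum(axis=2)
medoid = C[dists.sum(axis=1).argmin()]   # x = 1.0, an actual data point
```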
Mathematical Properties:
Convexity: Center-based clusters partition space into convex regions. If two points belong to the same cluster, all points on the line segment between them also belong to that cluster.
Spherical assumption: With Euclidean distance, clusters are implicitly assumed to be spherical (or ellipsoidal if using Mahalanobis). This is because equidistant points from a center form a hypersphere.
Equal variance assumption: Standard k-means treats all clusters as having equal "spread." A tight cluster and a dispersed cluster are treated the same way.
Limitations: center-based definitions cannot represent non-convex or elongated clusters, they assign every point (outliers included) to some cluster, they require choosing $k$ in advance, and results are sensitive to initialization.
Center-based clustering works well when: (1) you expect roughly spherical clusters, (2) clusters have similar sizes, (3) you want interpretable cluster representatives, and (4) computational efficiency matters (k-means is very fast). It's often a good starting point before trying more complex methods.
Density-based clustering defines clusters as regions of high point density separated by regions of low density. This elegantly captures the intuition that clusters are "islands" of points in a "sea" of empty space.
The DBSCAN Definition:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) introduces key concepts:
Density-Reachability:
A point $p$ is directly density-reachable from $q$ if: (1) $p \in N_\varepsilon(q)$, i.e., $d(p, q) \leq \varepsilon$, and (2) $q$ is a core point, meaning $|N_\varepsilon(q)| \geq \text{MinPts}$.
A point $p$ is density-reachable from $q$ if there exists a chain of points $p_1, p_2, \ldots, p_n$ where $p_1 = q$, $p_n = p$, and each $p_{i+1}$ is directly density-reachable from $p_i$.
Cluster definition: A cluster is a maximal set of density-connected points. Two points are density-connected if they are both density-reachable from some common core point.
Mathematical Formalization:
Define the $\varepsilon$-neighborhood $N_\varepsilon(x) = \{y : d(x, y) \leq \varepsilon\}$ and the local density at point $x$ as:
$$\rho(x) = |N_\varepsilon(x)|$$
A cluster can be characterized as a connected region where $\rho(x) \geq \text{MinPts}$ for all interior points.
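The density definition above can be sketched directly (the ε and MinPts values are illustrative):

```python
import numpy as np

def epsilon_density(X, eps):
    """rho(x) = |N_eps(x)|, counting the point itself."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    return (d <= eps).sum(axis=1)

# Three nearby points and one isolated point.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [5.0, 5.0]])
rho = epsilon_density(X, eps=1.0)
core = rho >= 3          # MinPts = 3
print(rho)   # [3 3 3 1]: the isolated point is not a core point
```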
Variations: OPTICS replaces the single global $\varepsilon$ with an ordering of points by reachability distance, and HDBSCAN extracts clusters from a hierarchy of density levels.
Limitations:
If one cluster is much denser than another, no single ε value can capture both: too small an ε fragments the sparser cluster, while too large an ε merges the dense ones. HDBSCAN addresses this by considering multiple density scales simultaneously.
Connectivity-based definitions emphasize relationships between points rather than absolute positions. A cluster is a set of points that are more connected to each other than to points outside the cluster.
Single-Linkage (Minimum) Clustering:
Two clusters are merged based on the smallest distance between any pair of points:
$$d_{\text{single}}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$$
A cluster is defined as a set of points connected by a chain of short distances. This implicitly defines clusters by nearest-neighbor chains rather than compactness, which handles elongated and irregular shapes but is prone to the "chaining" effect, where distinct groups are joined through a thin bridge of points.
Complete-Linkage (Maximum) Clustering:
$$d_{\text{complete}}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)$$
Clusters are defined by their diameter—the maximum distance between any two points. This produces more compact, spherical clusters.
Average-Linkage Clustering:
$$d_{\text{average}}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
Balances between the extremes of single and complete linkage.
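The three linkage formulas can be checked on two tiny one-dimensional clusters (the values are illustrative):

```python
import numpy as np

Ci = np.array([0.0, 1.0])
Cj = np.array([4.0, 9.0])

# All cross-cluster distances |x - y|: {4, 9, 3, 8}.
pairwise = np.abs(Ci[:, None] - Cj[None, :])

d_single = pairwise.min()     # 3.0: the closest pair (1, 4)
d_complete = pairwise.max()   # 9.0: the farthest pair (0, 9)
d_average = pairwise.mean()   # 6.0: the mean of all four cross distances
```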
The Dendrogram:
Hierarchical clustering produces a dendrogram—a tree showing nested cluster relationships. The dendrogram encodes clusterings at all granularities: the leaves are individual points, the root is a single cluster containing everything, and cutting the tree at any height yields a concrete partition.
Ward's Method:
An important variant minimizes the increase in total within-cluster variance when merging:
$$d_{\text{Ward}}(C_i, C_j) = \Delta(C_i, C_j) = \frac{|C_i| \cdot |C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2$$
Ward's method tends to produce clusters of similar size and more spherical shapes.
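Ward's closed-form merge cost can be verified against a direct computation of the SSE increase (the cluster data is illustrative):

```python
import numpy as np

def sse(X):
    """Sum of squared distances to the cluster mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()

Ci = np.array([[0.0, 0.0], [2.0, 0.0]])
Cj = np.array([[10.0, 0.0], [12.0, 0.0], [14.0, 0.0]])

# Direct: by how much does total within-cluster SSE grow if we merge?
delta_direct = sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

# Closed form from Ward's formula above.
mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
n_i, n_j = len(Ci), len(Cj)
delta_formula = n_i * n_j / (n_i + n_j) * ((mu_i - mu_j) ** 2).sum()

print(np.isclose(delta_direct, delta_formula))  # True
```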
Connectivity-based definitions are ideal when: (1) data has natural hierarchy (taxonomies, evolutionary trees), (2) you want to explore multiple granularities, (3) you need a visualization of cluster relationships, or (4) you're uncertain about the number of clusters.
Graph-based methods represent data as a graph where nodes are data points and edges represent relationships (similarity or proximity). Clusters are then defined as graph communities—densely connected subgraphs with sparse connections between them.
From Points to Graphs:
Given data points, construct a graph in one of the standard ways: connect each point to its $k$ nearest neighbors ($k$-NN graph), connect all pairs within distance $\varepsilon$ ($\varepsilon$-neighborhood graph), or connect every pair with an edge weighted by similarity, e.g. $w_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ (fully connected graph).
The Graph Cut Perspective:
A cluster is a subgraph that minimizes the cost of the cut—the edges removed to separate it from other clusters.
Minimum Cut:
$$\text{cut}(C_1, C_2) = \sum_{i \in C_1, j \in C_2} w_{ij}$$
where $w_{ij}$ is the edge weight (similarity) between nodes $i$ and $j$.
Problem: Minimum cut often produces trivially small clusters (single outlier nodes).
Normalized Cuts:
To encourage balanced cluster sizes, normalize the cut:
Ratio Cut: $$\text{RatioCut}(C_1, C_2) = \frac{\text{cut}(C_1, C_2)}{|C_1|} + \frac{\text{cut}(C_1, C_2)}{|C_2|}$$
Normalized Cut (NCut): $$\text{NCut}(C_1, C_2) = \frac{\text{cut}(C_1, C_2)}{\text{vol}(C_1)} + \frac{\text{cut}(C_1, C_2)}{\text{vol}(C_2)}$$
where $\text{vol}(C) = \sum_{i \in C} \sum_j w_{ij}$ is the total edge weight incident to cluster $C$.
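The cut quantities can be computed directly from a toy weighted adjacency matrix (weights are illustrative: two tight pairs joined by weak edges):

```python
import numpy as np

W = np.array([
    [0.0, 2.0, 0.1, 0.0],
    [2.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 3.0],
    [0.0, 0.1, 3.0, 0.0],
])
C1, C2 = [0, 1], [2, 3]

def vol(C):
    """Total edge weight incident to cluster C."""
    return W[C, :].sum()

cut = W[np.ix_(C1, C2)].sum()              # 0.2: weight crossing the split
ratio_cut = cut / len(C1) + cut / len(C2)  # normalized by cluster sizes
ncut = cut / vol(C1) + cut / vol(C2)       # normalized by cluster volumes
```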
Spectral Clustering:
Minimizing normalized cuts directly is NP-hard, but a relaxed solution uses the eigenvectors of the graph Laplacian:
1. Compute the graph Laplacian $L = D - W$, where $D$ is the diagonal degree matrix.
2. Find the eigenvectors of the normalized Laplacian $L_{\text{norm}} = D^{-1/2} L D^{-1/2}$.
3. Use the eigenvectors corresponding to the $k$ smallest eigenvalues as new coordinates for each point.
4. Apply k-means in this spectral embedding space.
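These steps can be sketched with plain NumPy on a graph whose community structure is obvious: two triangles joined by one weak edge (weights are illustrative, and the unnormalized Laplacian is used for simplicity):

```python
import numpy as np

# Adjacency matrix: two triangles {0,1,2} and {3,4,5} with a weak bridge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

D = np.diag(W.sum(axis=1))
L = D - W                          # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

# The eigenvector for the second-smallest eigenvalue (the Fiedler vector)
# is roughly constant within each community, with opposite signs.
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)  # one triangle gets label 0, the other 1 (sign may flip)
```

For $k > 2$ clusters, one would instead take the $k$ smallest eigenvectors as coordinates and run k-means on them, as in the steps above.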
Spectral clustering can separate clusters that k-means cannot (e.g., concentric circles). The spectral embedding 'unfolds' complex geometry. The eigenvectors of the Laplacian encode global graph structure that local methods miss.
Distribution-based methods define clusters as probability distributions from which points are sampled. The data is modeled as a mixture of distributions, and clustering amounts to inferring which distribution generated each point.
Gaussian Mixture Model (GMM) Definition:
Data is generated from a mixture of $k$ Gaussian distributions:
$$p(x) = \sum_{j=1}^{k} \pi_j \cdot \mathcal{N}(x | \mu_j, \Sigma_j)$$
where $\pi_j \geq 0$ are the mixing weights with $\sum_j \pi_j = 1$, and $\mu_j$, $\Sigma_j$ are the mean and covariance matrix of component $j$.
Soft Cluster Assignment:
Unlike hard clustering, GMMs provide soft assignments—the probability that point $x_i$ belongs to cluster $j$:
$$\gamma_{ij} = P(z_i = j | x_i) = \frac{\pi_j \cdot \mathcal{N}(x_i | \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \cdot \mathcal{N}(x_i | \mu_l, \Sigma_l)}$$
These are called responsibilities—the degree to which component $j$ is responsible for generating point $x_i$.
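The responsibility formula can be evaluated directly for a one-dimensional, two-component mixture (all parameter values are illustrative):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

pis = [0.5, 0.5]        # mixing weights
mus = [0.0, 4.0]        # component means
sigmas = [1.0, 1.0]     # component standard deviations

def responsibilities(x):
    weighted = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
    total = sum(weighted)
    return [w / total for w in weighted]

gamma = responsibilities(2.0)   # midpoint between the means
print(gamma)  # [0.5, 0.5]: both components equally responsible
```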
| Covariance Type | Shape | Parameters per Cluster | Use When |
|---|---|---|---|
| Full | Arbitrary ellipsoids (rotated) | $d + d(d+1)/2$ | Clusters have different orientations and spreads |
| Tied | Same ellipsoid shape, different center | $d + d(d+1)/2$ shared | Similar cluster shapes, different locations |
| Diagonal | Axis-aligned ellipsoids | $2d$ | Features are independent within clusters |
| Spherical | Spheres (like k-means) | $d + 1$ | Isotropic clusters; regularizing when data is limited |
Relationship to K-Means:
K-means is a special case of GMM where all components share a spherical covariance $\Sigma_j = \sigma^2 I$, the mixing weights are equal ($\pi_j = 1/k$), and assignments are hard rather than soft.
As $\sigma^2 \rightarrow 0$ or with hard assignment, GMM becomes k-means.
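The hardening of responsibilities in this limit is easy to observe numerically (points and means are illustrative; with equal weights and a shared spherical variance, only squared distances matter):

```python
import math

def gamma_nearer(x, mu0, mu1, sigma):
    """Responsibility of the nearer component mu0 for point x."""
    w0 = math.exp(-0.5 * ((x - mu0) / sigma) ** 2)
    w1 = math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    return w0 / (w0 + w1)

# x = 1.8 is slightly nearer mu0 = 0 than mu1 = 4.
gammas = [gamma_nearer(1.8, 0.0, 4.0, sigma) for sigma in (2.0, 1.0, 0.5, 0.1)]
print(gammas)  # climbs toward 1.0 as sigma shrinks
```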
Advantages of the probabilistic definition: soft assignments quantify uncertainty about membership, per-cluster covariances model different shapes and spreads, and likelihood-based criteria (AIC, BIC) support principled model selection.
Beyond Gaussians:
The mixture model framework extends to other distributions: Bernoulli mixtures for binary data, multinomial mixtures for counts and text, and Student-t mixtures for heavy-tailed data that is robust to outliers.
Probabilistic clustering fundamentally differs from geometric clustering. It doesn't ask 'which points are close together?' but 'what process generated this data?' This perspective is more principled when data truly comes from a mixture of populations, and allows for uncertainty about cluster membership.
Information theory provides another lens for defining clusters. Here, a cluster is defined by how much information is preserved or lost when we compress data by grouping points.
Rate-Distortion Framework:
Imagine compressing data by sending only cluster labels instead of full feature vectors. This introduces distortion (loss of information) but reduces the rate (bits needed). The optimal clustering minimizes distortion for a given rate:
$$\min_{p(c|x)} I(X; C) \quad \text{subject to} \quad \mathbb{E}[d(X, \hat{X})] \leq D$$
where $I(X; C)$ is the mutual information between the data and the cluster labels (the rate), $\hat{X}$ is the reconstruction from the cluster label, and $D$ is the allowed expected distortion.
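For a finite dataset, the rate term $I(X; C)$ can be computed directly from $p(x)$ and the assignment $p(c|x)$; here is a sketch (the distributions are illustrative: a hard clustering of four points into two cells):

```python
import math

p_x = [0.25, 0.25, 0.25, 0.25]      # uniform over four points
p_c_given_x = [                      # hard assignment: two points per cluster
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

# Marginal cluster probabilities p(c) = sum_x p(x) p(c|x).
p_c = [sum(p_x[i] * p_c_given_x[i][c] for i in range(4)) for c in range(2)]

# I(X; C) = sum_{x,c} p(x) p(c|x) log2( p(c|x) / p(c) ), in bits.
I_xc = sum(
    p_x[i] * p_c_given_x[i][c] * math.log2(p_c_given_x[i][c] / p_c[c])
    for i in range(4) for c in range(2)
    if p_c_given_x[i][c] > 0
)
print(I_xc)  # 1.0: a balanced hard 2-way clustering costs H(C) = 1 bit
```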
The Information Bottleneck:
When we have additional information about data points (e.g., labels $Y$), we can define clusters that preserve relevant information:
$$\min_{p(c|x)} I(X; C) - \beta I(C; Y)$$
The first term compresses (forgets irrelevant details); the second preserves information about $Y$. The parameter $\beta$ trades off compression against relevance.
Minimum Description Length (MDL):
A cluster is good if it compresses the data efficiently. The MDL principle says:
$$\text{Best clustering} = \arg\min_{\mathcal{C}} \left[ L(\mathcal{C}) + L(X | \mathcal{C}) \right]$$
where $L(\mathcal{C})$ is the number of bits needed to describe the clustering model and $L(X | \mathcal{C})$ is the number of bits needed to describe the data given that model.
This automatically balances model complexity (more clusters = more bits to describe model) against fit (fewer clusters = more bits to describe deviations).
Practical Implications:
Information-theoretic clustering selects the number of clusters automatically (an extra cluster must pay for itself in compression), links clustering to downstream prediction through the relevance term, and makes the trade-off between compression and fidelity explicit.
This perspective is particularly valuable when: (1) you want principled model selection without heuristics, (2) you're clustering for downstream prediction tasks (preserve relevant information), (3) you want clustering that generalizes well, or (4) you need interpretable compression of large datasets.
Different cluster definitions capture different intuitions and lead to different algorithms. Understanding these differences helps you choose the right approach.
Side-by-Side Comparison:
| Definition | Core Intuition | Cluster Shape | Handles Noise | Specifies k |
|---|---|---|---|---|
| Center-based | Closer to own center than others | Convex (spherical/ellipsoidal) | No (all points assigned) | Yes |
| Density-based | Dense region surrounded by sparse | Arbitrary | Yes (noise class) | No |
| Connectivity-based | Connected by similar points | Arbitrary (depends on linkage) | Partially | No (cut dendrogram) |
| Graph-based | Densely connected subgraph | Arbitrary | Depends on method | Yes for spectral |
| Distribution-based | Sampled from same distribution | Determined by distribution family | Can model as component | Yes |
| Information-theoretic | Points needing same description | Implicit | Can be modeled | Determined by criterion |
Matching Definitions to Data:
| Data Characteristic | Best Definition |
|---|---|
| Roughly spherical clusters | Center-based (k-means) |
| Arbitrary shapes, varying density | Density-based (HDBSCAN) |
| Hierarchical structure | Connectivity-based (hierarchical) |
| Network/graph data | Graph-based (spectral, community detection) |
| Clusters overlap significantly | Distribution-based (GMM) |
| Need a principled choice of k | Information-theoretic (MDL, BIC) |
Important Insight:
The same data can have different valid clusterings under different definitions. Two researchers using different definitions may legitimately find different clusters—and both may be correct for their purposes. This is why stating your clustering objective and definition explicitly is crucial for reproducible research.
Never treat a clustering result as 'the' answer without specifying which definition was used. Different definitions can produce radically different clusterings of the same data. Always report your definition and justify why it's appropriate for your problem.
We've explored the rich landscape of cluster definitions: center-based, density-based, connectivity-based, graph-based, distribution-based, and information-theoretic. The key insight to carry forward is that every definition embodies assumptions, and the right choice depends on your data and objectives.
What's Next:
Now that we understand various ways to define clusters, we turn to the types of clustering approaches—the algorithmic paradigms that implement these definitions. The next page examines partitional vs. hierarchical clustering, hard vs. soft assignments, and other fundamental distinctions that shape how clustering is performed.
You now understand the major formal definitions of clusters and can recognize which definition underlies different algorithms. This foundation is essential for choosing appropriate methods and interpreting results correctly. Next, we'll categorize the types of clustering approaches.