We've discussed clustering objectives and distance measures, but we've avoided a fundamental question: What exactly is a cluster? This seemingly simple question has no single answer—and that's not a weakness of clustering theory, but a reflection of the genuine complexity of grouping data.
Different definitions of "cluster" lead to fundamentally different algorithms. K-means implicitly defines clusters as spherical regions around centroids. DBSCAN defines clusters as dense regions separated by sparse areas. Hierarchical methods define clusters as nested trees of relationships. Understanding these definitions is key to choosing and interpreting clustering algorithms.
This page provides a rigorous exploration of cluster definitions, from intuitive notions to formal mathematical characterizations. You'll learn to recognize which definition matches your problem and which algorithms implement that definition.
By the end of this page, you will understand multiple formal definitions of clusters including center-based, density-based, connectivity-based, and graph-based formulations. You'll learn how each definition translates into specific algorithms and understand which definition is appropriate for different data characteristics.
Before examining formal definitions, let's understand why defining "cluster" is genuinely difficult. Consider looking at a scatter plot of points. Humans effortlessly see groupings—but formalizing that intuition is surprisingly hard.
The Human Perception Problem:
When we visually identify clusters, we rely on sophisticated perceptual mechanisms, such as the Gestalt principles of proximity, continuity, and closure, that are difficult to replicate mathematically.
No algorithm perfectly captures human cluster perception. Every algorithm is an approximation based on specific mathematical assumptions.
The Multi-Scale Problem:
Consider data with hierarchical structure: galaxies form clusters, which form superclusters, which form cosmic filaments. At which scale should we define clusters? The "correct" answer depends entirely on the question we're asking.
Similarly, customers might cluster by broad demographic segments, by purchasing behavior within a segment, or by fine-grained individual product preferences.
All of these are valid clusterings at different granularities.
The Shape Problem:
Some definitions assume clusters have specific shapes: k-means implies spherical blobs, Gaussian mixtures imply ellipsoids, and density-based methods permit arbitrary shapes.
Real clusters often have irregular shapes that don't fit neat assumptions. An algorithm assuming spherical clusters will artificially split an elongated cluster.
There is no universally accepted definition of 'cluster.' Every definition embodies assumptions that may or may not match your data. The key is to choose a definition—and thus an algorithm—whose assumptions align with your domain knowledge and objectives.
The most intuitive cluster definition: a cluster is a set of points that are closer to their cluster's prototype (center or representative) than to any other cluster's prototype.
Formal Definition:
Given prototypes $\{\mu_1, \mu_2, \ldots, \mu_k\}$, a point $x_i$ belongs to cluster $C_j$ if:
$$C_j = \{x_i : d(x_i, \mu_j) \leq d(x_i, \mu_l) \ \forall l \neq j\}$$
This creates a Voronoi tessellation of the feature space—each cluster occupies a convex region bounded by hyperplanes equidistant between prototypes.
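To make the Voronoi picture concrete, here is a minimal sketch of nearest-prototype assignment (function name and data are illustrative):

```python
import numpy as np

def assign_to_prototypes(X, prototypes):
    """Return the index of the nearest prototype for each row of X."""
    # Pairwise squared Euclidean distances, shape (n_points, n_prototypes).
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Two prototypes on the x-axis; the Voronoi boundary is the hyperplane x = 5.
prototypes = np.array([[0.0, 0.0], [10.0, 0.0]])
X = np.array([[1.0, 1.0], [9.0, -1.0], [4.9, 0.0], [5.1, 0.0]])
labels = assign_to_prototypes(X, prototypes)
print(labels)  # [0 1 0 1]
```

Points just left and right of x = 5 land in different cells, reflecting the hyperplane boundary equidistant between the two prototypes.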
Types of Prototypes:
| Prototype Type | Definition | Algorithm | Properties |
|---|---|---|---|
| Centroid (Mean) | $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$ | K-Means | Minimizes squared distances; not a data point; sensitive to outliers |
| Medoid | $m_j = \arg\min_{x \in C_j} \sum_{y \in C_j} d(x, y)$ | K-Medoids, PAM | Actual data point; more robust; works with any distance |
| Median | Per-dimension median of cluster | K-Medians | Robust to outliers; uses L1 distance |
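To see how the three prototype types differ, here is a small sketch (data values are illustrative; an outlier at x = 100 exposes the robustness differences noted in the table):

```python
import numpy as np

# One cluster with an outlier at (100, 0).
C = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [100.0, 0.0]])

centroid = C.mean(axis=0)        # mean: pulled to x = 25.75 by the outlier
median = np.median(C, axis=0)    # per-dimension median: x = 1.5, robust

# Medoid: the actual data point minimizing total distance to the others
# (using L1 distance here, since medoids work with any distance).
dists = np.abs(C[:, None, :] - C[None, :, :]).sum(axis=2)
medoid = C[dists.sum(axis=1).argmin()]   # x = 1.0, an actual data point
```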
Mathematical Properties:
Convexity: Center-based clusters partition space into convex regions. If two points belong to the same cluster, all points on the line segment between them also belong to that cluster.
Spherical assumption: With Euclidean distance, clusters are implicitly assumed to be spherical (or ellipsoidal if using Mahalanobis). This is because equidistant points from a center form a hypersphere.
Equal variance assumption: Standard k-means treats all clusters as having equal "spread." A tight cluster and a dispersed cluster are treated the same way.
Limitations: center-based definitions cannot represent non-convex or elongated clusters, they assign every point (outliers included) to some cluster, they require choosing $k$ in advance, and results are sensitive to initialization.
Center-based clustering works well when: (1) you expect roughly spherical clusters, (2) clusters have similar sizes, (3) you want interpretable cluster representatives, and (4) computational efficiency matters (k-means is very fast). It's often a good starting point before trying more complex methods.
Density-based clustering defines clusters as regions of high point density separated by regions of low density. This elegantly captures the intuition that clusters are "islands" of points in a "sea" of empty space.
The DBSCAN Definition:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) introduces key concepts:
Density-Reachability:
A point $p$ is directly density-reachable from $q$ if: (1) $p \in N_\varepsilon(q)$, i.e., $d(p, q) \leq \varepsilon$, and (2) $q$ is a core point, meaning $|N_\varepsilon(q)| \geq \text{MinPts}$.
A point $p$ is density-reachable from $q$ if there exists a chain of points $p_1, p_2, \ldots, p_n$ where $p_1 = q$, $p_n = p$, and each $p_{i+1}$ is directly density-reachable from $p_i$.
Cluster definition: A cluster is a maximal set of density-connected points. Two points are density-connected if they are both density-reachable from some common core point.
Mathematical Formalization:
Define the $\varepsilon$-neighborhood $N_\varepsilon(x) = \{y : d(x, y) \leq \varepsilon\}$ and the local density at point $x$ as:
$$\rho(x) = |N_\varepsilon(x)|$$
A cluster can be characterized as a connected region where $\rho(x) \geq \text{MinPts}$ for all interior points.
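The density definition above can be sketched directly (the ε and MinPts values are illustrative):

```python
import numpy as np

def epsilon_density(X, eps):
    """rho(x) = |N_eps(x)|, counting the point itself."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    return (d <= eps).sum(axis=1)

# Three nearby points and one isolated point.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [5.0, 5.0]])
rho = epsilon_density(X, eps=1.0)
core = rho >= 3          # MinPts = 3
print(rho)   # [3 3 3 1]: the isolated point is not a core point
```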
Variations: OPTICS replaces the single global $\varepsilon$ with an ordering of points by reachability distance, and HDBSCAN extracts clusters from a hierarchy of density levels.
Limitations:
If one cluster is much denser than another, no single ε value can capture both: too small an ε fragments the sparser cluster, while too large an ε merges the dense ones. HDBSCAN addresses this by considering multiple density scales simultaneously.
Connectivity-based definitions emphasize relationships between points rather than absolute positions. A cluster is a set of points that are more connected to each other than to points outside the cluster.
Single-Linkage (Minimum) Clustering:
Two clusters are merged based on the smallest distance between any pair of points:
$$d_{\text{single}}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$$
A cluster is defined as a set of points connected by a chain of short distances. This implicitly defines clusters by nearest-neighbor chains rather than compactness, which handles elongated and irregular shapes but is prone to the "chaining" effect, where distinct groups are joined through a thin bridge of points.
Complete-Linkage (Maximum) Clustering:
$$d_{\text{complete}}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)$$
Clusters are defined by their diameter—the maximum distance between any two points. This produces more compact, spherical clusters.
Average-Linkage Clustering:
$$d_{\text{average}}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
Balances between the extremes of single and complete linkage.
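The three linkage formulas can be checked on two tiny one-dimensional clusters (the values are illustrative):

```python
import numpy as np

Ci = np.array([0.0, 1.0])
Cj = np.array([4.0, 9.0])

# All cross-cluster distances |x - y|: {4, 9, 3, 8}.
pairwise = np.abs(Ci[:, None] - Cj[None, :])

d_single = pairwise.min()     # 3.0: the closest pair (1, 4)
d_complete = pairwise.max()   # 9.0: the farthest pair (0, 9)
d_average = pairwise.mean()   # 6.0: the mean of all four cross distances
```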
The Dendrogram:
Hierarchical clustering produces a dendrogram—a tree showing nested cluster relationships. The dendrogram encodes clusterings at all granularities: the leaves are individual points, the root is a single cluster containing everything, and cutting the tree at any height yields a concrete partition.
Ward's Method:
An important variant minimizes the increase in total within-cluster variance when merging:
$$d_{\text{Ward}}(C_i, C_j) = \Delta(C_i, C_j) = \frac{|C_i| \cdot |C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2$$
Ward's method tends to produce clusters of similar size and more spherical shapes.
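Ward's closed-form merge cost can be verified against a direct computation of the SSE increase (the cluster data is illustrative):

```python
import numpy as np

def sse(X):
    """Sum of squared distances to the cluster mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()

Ci = np.array([[0.0, 0.0], [2.0, 0.0]])
Cj = np.array([[10.0, 0.0], [12.0, 0.0], [14.0, 0.0]])

# Direct: by how much does total within-cluster SSE grow if we merge?
delta_direct = sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

# Closed form from Ward's formula above.
mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
n_i, n_j = len(Ci), len(Cj)
delta_formula = n_i * n_j / (n_i + n_j) * ((mu_i - mu_j) ** 2).sum()

print(np.isclose(delta_direct, delta_formula))  # True
```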
Connectivity-based definitions are ideal when: (1) data has natural hierarchy (taxonomies, evolutionary trees), (2) you want to explore multiple granularities, (3) you need a visualization of cluster relationships, or (4) you're uncertain about the number of clusters.
Graph-based methods represent data as a graph where nodes are data points and edges represent relationships (similarity or proximity). Clusters are then defined as graph communities—densely connected subgraphs with sparse connections between them.
From Points to Graphs:
Given data points, construct a graph in one of the standard ways: connect each point to its $k$ nearest neighbors ($k$-NN graph), connect all pairs within distance $\varepsilon$ ($\varepsilon$-neighborhood graph), or connect every pair with an edge weighted by similarity, e.g. $w_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ (fully connected graph).
The Graph Cut Perspective:
A cluster is a subgraph that minimizes the cost of the cut—the edges removed to separate it from other clusters.
Minimum Cut:
$$\text{cut}(C_1, C_2) = \sum_{i \in C_1, j \in C_2} w_{ij}$$
where $w_{ij}$ is the edge weight (similarity) between nodes $i$ and $j$.
Problem: Minimum cut often produces trivially small clusters (single outlier nodes).
Normalized Cuts:
To encourage balanced cluster sizes, normalize the cut:
Ratio Cut: $$\text{RatioCut}(C_1, C_2) = \frac{\text{cut}(C_1, C_2)}{|C_1|} + \frac{\text{cut}(C_1, C_2)}{|C_2|}$$
Normalized Cut (NCut): $$\text{NCut}(C_1, C_2) = \frac{\text{cut}(C_1, C_2)}{\text{vol}(C_1)} + \frac{\text{cut}(C_1, C_2)}{\text{vol}(C_2)}$$
where $\text{vol}(C) = \sum_{i \in C} \sum_j w_{ij}$ is the total edge weight incident to cluster $C$.
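The cut quantities can be computed directly from a toy weighted adjacency matrix (weights are illustrative: two tight pairs joined by weak edges):

```python
import numpy as np

W = np.array([
    [0.0, 2.0, 0.1, 0.0],
    [2.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 3.0],
    [0.0, 0.1, 3.0, 0.0],
])
C1, C2 = [0, 1], [2, 3]

def vol(C):
    """Total edge weight incident to cluster C."""
    return W[C, :].sum()

cut = W[np.ix_(C1, C2)].sum()              # 0.2: weight crossing the split
ratio_cut = cut / len(C1) + cut / len(C2)  # normalized by cluster sizes
ncut = cut / vol(C1) + cut / vol(C2)       # normalized by cluster volumes
```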
Spectral Clustering:
Minimizing normalized cuts directly is NP-hard, but a relaxed solution uses the eigenvectors of the graph Laplacian:
1. Compute the graph Laplacian $L = D - W$, where $D$ is the diagonal degree matrix.
2. Find the eigenvectors of the normalized Laplacian $L_{\text{norm}} = D^{-1/2} L D^{-1/2}$.
3. Use the eigenvectors corresponding to the $k$ smallest eigenvalues as new coordinates for each point.
4. Apply k-means in this spectral embedding space.
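These steps can be sketched with plain NumPy on a graph whose community structure is obvious: two triangles joined by one weak edge (weights are illustrative, and the unnormalized Laplacian is used for simplicity):

```python
import numpy as np

# Adjacency matrix: two triangles {0,1,2} and {3,4,5} with a weak bridge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

D = np.diag(W.sum(axis=1))
L = D - W                          # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

# The eigenvector for the second-smallest eigenvalue (the Fiedler vector)
# is roughly constant within each community, with opposite signs.
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)  # one triangle gets label 0, the other 1 (sign may flip)
```

For $k > 2$ clusters, one would instead take the $k$ smallest eigenvectors as coordinates and run k-means on them, as in the steps above.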
Spectral clustering can separate clusters that k-means cannot (e.g., concentric circles). The spectral embedding 'unfolds' complex geometry. The eigenvectors of the Laplacian encode global graph structure that local methods miss.
Distribution-based methods define clusters as probability distributions from which points are sampled. The data is modeled as a mixture of distributions, and clustering amounts to inferring which distribution generated each point.
Gaussian Mixture Model (GMM) Definition:
Data is generated from a mixture of $k$ Gaussian distributions:
$$p(x) = \sum_{j=1}^{k} \pi_j \cdot \mathcal{N}(x | \mu_j, \Sigma_j)$$
where $\pi_j \geq 0$ are the mixing weights with $\sum_j \pi_j = 1$, and $\mu_j$, $\Sigma_j$ are the mean and covariance matrix of component $j$.
Soft Cluster Assignment:
Unlike hard clustering, GMMs provide soft assignments—the probability that point $x_i$ belongs to cluster $j$:
$$\gamma_{ij} = P(z_i = j | x_i) = \frac{\pi_j \cdot \mathcal{N}(x_i | \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \cdot \mathcal{N}(x_i | \mu_l, \Sigma_l)}$$
These are called responsibilities—the degree to which component $j$ is responsible for generating point $x_i$.
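The responsibility formula can be evaluated directly for a one-dimensional, two-component mixture (all parameter values are illustrative):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

pis = [0.5, 0.5]        # mixing weights
mus = [0.0, 4.0]        # component means
sigmas = [1.0, 1.0]     # component standard deviations

def responsibilities(x):
    weighted = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
    total = sum(weighted)
    return [w / total for w in weighted]

gamma = responsibilities(2.0)   # midpoint between the means
print(gamma)  # [0.5, 0.5]: both components equally responsible
```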
| Covariance Type | Shape | Parameters per Cluster | Use When |
|---|---|---|---|
| Full | Arbitrary ellipsoids (rotated) | $d + d(d+1)/2$ | Clusters have different orientations and spreads |
| Tied | Same ellipsoid shape, different center | $d + d(d+1)/2$ shared | Similar cluster shapes, different locations |
| Diagonal | Axis-aligned ellipsoids | $2d$ | Features are independent within clusters |
| Spherical | Spheres (like k-means) | $d + 1$ | Isotropic clusters; regularizing when data is limited |
Relationship to K-Means:
K-means is a special case of GMM where all components share a spherical covariance $\Sigma_j = \sigma^2 I$, the mixing weights are equal ($\pi_j = 1/k$), and assignments are hard rather than soft.
As $\sigma^2 \rightarrow 0$ or with hard assignment, GMM becomes k-means.
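The hardening of responsibilities in this limit is easy to observe numerically (points and means are illustrative; with equal weights and a shared spherical variance, only squared distances matter):

```python
import math

def gamma_nearer(x, mu0, mu1, sigma):
    """Responsibility of the nearer component mu0 for point x."""
    w0 = math.exp(-0.5 * ((x - mu0) / sigma) ** 2)
    w1 = math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    return w0 / (w0 + w1)

# x = 1.8 is slightly nearer mu0 = 0 than mu1 = 4.
gammas = [gamma_nearer(1.8, 0.0, 4.0, sigma) for sigma in (2.0, 1.0, 0.5, 0.1)]
print(gammas)  # climbs toward 1.0 as sigma shrinks
```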
Advantages of the probabilistic definition: soft assignments quantify uncertainty about membership, per-cluster covariances model different shapes and spreads, and likelihood-based criteria (AIC, BIC) support principled model selection.
Beyond Gaussians:
The mixture model framework extends to other distributions: Bernoulli mixtures for binary data, multinomial mixtures for counts and text, and Student-t mixtures for heavy-tailed data that is robust to outliers.
Probabilistic clustering fundamentally differs from geometric clustering. It doesn't ask 'which points are close together?' but 'what process generated this data?' This perspective is more principled when data truly comes from a mixture of populations, and allows for uncertainty about cluster membership.
Information theory provides another lens for defining clusters. Here, a cluster is defined by how much information is preserved or lost when we compress data by grouping points.
Rate-Distortion Framework:
Imagine compressing data by sending only cluster labels instead of full feature vectors. This introduces distortion (loss of information) but reduces the rate (bits needed). The optimal clustering minimizes distortion for a given rate:
$$\min_{p(c|x)} I(X; C) \quad \text{subject to} \quad \mathbb{E}[d(X, \hat{X})] \leq D$$
where $I(X; C)$ is the mutual information between the data and the cluster labels (the rate), $\hat{X}$ is the reconstruction from the cluster label, and $D$ is the allowed expected distortion.
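For a finite dataset, the rate term $I(X; C)$ can be computed directly from $p(x)$ and the assignment $p(c|x)$; here is a sketch (the distributions are illustrative: a hard clustering of four points into two cells):

```python
import math

p_x = [0.25, 0.25, 0.25, 0.25]      # uniform over four points
p_c_given_x = [                      # hard assignment: two points per cluster
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

# Marginal cluster probabilities p(c) = sum_x p(x) p(c|x).
p_c = [sum(p_x[i] * p_c_given_x[i][c] for i in range(4)) for c in range(2)]

# I(X; C) = sum_{x,c} p(x) p(c|x) log2( p(c|x) / p(c) ), in bits.
I_xc = sum(
    p_x[i] * p_c_given_x[i][c] * math.log2(p_c_given_x[i][c] / p_c[c])
    for i in range(4) for c in range(2)
    if p_c_given_x[i][c] > 0
)
print(I_xc)  # 1.0: a balanced hard 2-way clustering costs H(C) = 1 bit
```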
The Information Bottleneck:
When we have additional information about data points (e.g., labels $Y$), we can define clusters that preserve relevant information:
$$\min_{p(c|x)} I(X; C) - \beta I(C; Y)$$
The first term compresses (forgets irrelevant details); the second preserves information about $Y$. The parameter $\beta$ trades off compression against relevance.
Minimum Description Length (MDL):
A cluster is good if it compresses the data efficiently. The MDL principle says:
$$\text{Best clustering} = \arg\min_{\mathcal{C}} \left[ L(\mathcal{C}) + L(X | \mathcal{C}) \right]$$
where $L(\mathcal{C})$ is the number of bits needed to describe the clustering model and $L(X | \mathcal{C})$ is the number of bits needed to describe the data given that model.
This automatically balances model complexity (more clusters = more bits to describe model) against fit (fewer clusters = more bits to describe deviations).
Practical Implications:
Information-theoretic clustering selects the number of clusters automatically (an extra cluster must pay for itself in compression), links clustering to downstream prediction through the relevance term, and makes the trade-off between compression and fidelity explicit.
This perspective is particularly valuable when: (1) you want principled model selection without heuristics, (2) you're clustering for downstream prediction tasks (preserve relevant information), (3) you want clustering that generalizes well, or (4) you need interpretable compression of large datasets.
Different cluster definitions capture different intuitions and lead to different algorithms. Understanding these differences helps you choose the right approach.
Side-by-Side Comparison:
| Definition | Core Intuition | Cluster Shape | Handles Noise | Specifies k |
|---|---|---|---|---|
| Center-based | Closer to own center than others | Convex (spherical/ellipsoidal) | No (all points assigned) | Yes |
| Density-based | Dense region surrounded by sparse | Arbitrary | Yes (noise class) | No |
| Connectivity-based | Connected by similar points | Arbitrary (depends on linkage) | Partially | No (cut dendrogram) |
| Graph-based | Densely connected subgraph | Arbitrary | Depends on method | Yes for spectral |
| Distribution-based | Sampled from same distribution | Determined by distribution family | Can model as component | Yes |
| Information-theoretic | Points needing same description | Implicit | Can be modeled | Determined by criterion |
Matching Definitions to Data:
| Data Characteristic | Best Definition |
|---|---|
| Roughly spherical clusters | Center-based (k-means) |
| Arbitrary shapes, varying density | Density-based (HDBSCAN) |
| Hierarchical structure | Connectivity-based (hierarchical) |
| Network/graph data | Graph-based (spectral, community detection) |
| Clusters overlap significantly | Distribution-based (GMM) |
| Need a principled choice of k | Information-theoretic (MDL, BIC) |
Important Insight:
The same data can have different valid clusterings under different definitions. Two researchers using different definitions may legitimately find different clusters—and both may be correct for their purposes. This is why stating your clustering objective and definition explicitly is crucial for reproducible research.
Never treat a clustering result as 'the' answer without specifying which definition was used. Different definitions can produce radically different clusterings of the same data. Always report your definition and justify why it's appropriate for your problem.
We've explored the rich landscape of cluster definitions: center-based, density-based, connectivity-based, graph-based, distribution-based, and information-theoretic. The key insight to carry forward is that every definition embodies assumptions, and the right choice depends on your data and objectives.
What's Next:
Now that we understand various ways to define clusters, we turn to the types of clustering approaches—the algorithmic paradigms that implement these definitions. The next page examines partitional vs. hierarchical clustering, hard vs. soft assignments, and other fundamental distinctions that shape how clustering is performed.
You now understand the major formal definitions of clusters and can recognize which definition underlies different algorithms. This foundation is essential for choosing appropriate methods and interpreting results correctly. Next, we'll categorize the types of clustering approaches.