DBSCAN gave us density-based clustering. OPTICS gave us multi-scale density analysis. But both still required manual parameter selection and interpretation. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) completes the evolution by combining hierarchical clustering with density-based methods to automatically extract an optimal flat clustering.
Developed by Ricardo Campello, Davoud Moulavi, and Jörg Sander in 2013, HDBSCAN has become the de facto standard for density-based clustering in practice. Its key innovations are a density-based cluster hierarchy built on the mutual reachability distance and a principled, stability-based extraction of a flat clustering from that hierarchy.
By the end of this page, you will understand HDBSCAN's theoretical foundations, including mutual reachability distance and condensed cluster trees. You'll master the algorithm's mechanics, learn to interpret its outputs, and understand why it has become the gold standard for density-based clustering in modern machine learning.
Understanding HDBSCAN requires tracing the conceptual evolution from DBSCAN through OPTICS:
DBSCAN's Limitation: Fixed ε means one density threshold for all clusters. Dense and sparse clusters can't coexist.
OPTICS' Contribution: The reachability plot captures structure at all densities. But extracting clusters still requires choosing thresholds or tuning Xi.
HDBSCAN's Innovation: Build a true hierarchy of clusters based on density, then use a principled criterion to select the 'most stable' clusters from this hierarchy.
The key insight is that clusters exist at different density levels, and the 'true' clusters are those that persist (remain stable) across a range of density thresholds. A cluster that appears briefly at one specific density but quickly splits or merges is less 'stable' than one that persists across many density levels.
| Algorithm | Year | Key Innovation | Remaining Limitation |
|---|---|---|---|
| DBSCAN | 1996 | Density-based cluster definition | Single density scale |
| OPTICS | 1999 | Multi-scale ordering and visualization | Manual extraction required |
| HDBSCAN | 2013 | Automatic optimal cluster extraction | Minimal (very few parameters) |
HDBSCAN's core principle: stable clusters are real clusters. If a grouping only exists at one precise density level and disappears with tiny changes, it's likely noise or artifact. If a grouping persists across a range of densities, it represents genuine structure in the data.
HDBSCAN introduces a key concept that smooths out density variations: the mutual reachability distance.
Core Distance (Recap):
The core distance of point p with respect to parameter k (often min_cluster_size) is the distance to its k-th nearest neighbor:
$$\text{core}_k(p) = d(p, \text{k-th nearest neighbor of } p)$$
This measures the 'local density' at p. Small core distance = dense region.
Mutual Reachability Distance:
The mutual reachability distance between points a and b is:
$$d_{mreach-k}(a, b) = \max\{\text{core}_k(a), \text{core}_k(b), d(a, b)\}$$
This is symmetric (unlike OPTICS' reachability distance) and captures the density 'barrier' between two points.
Think of mutual reachability as answering: 'What's the minimum density level at which a and b could be directly connected?' It's bounded below by both points' local densities (their core distances) and by their actual distance. Two points in dense regions can be mutual-reachability-close even if somewhat distant, because neither of them minds stretching a bit. But a point in a sparse region raises the mutual reachability to everything around it.
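As a quick numeric illustration (the values below are made up, not from any real dataset), suppose point a sits in a dense region and point b in a sparser one:

```python
# Hypothetical values for two points a and b
core_a, core_b = 0.4, 0.9   # core distances: a is in a dense region, b in a sparser one
d_ab = 0.6                  # plain Euclidean distance between a and b

# Mutual reachability takes the max of the three quantities
d_mreach = max(core_a, core_b, d_ab)
print(d_mreach)  # 0.9 -- b's sparse neighbourhood inflates the effective distance
```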
Properties of Mutual Reachability Distance:
- Symmetric: d_mreach-k(a, b) = d_mreach-k(b, a).
- Never smaller than the Euclidean distance d(a, b).
- Never smaller than either point's core distance.
- Reduces to the plain Euclidean distance when both points lie in dense regions (small core distances).
Effect on Clustering:
The mutual reachability distance 'equalizes' dense and sparse regions: within dense regions it is simply the Euclidean distance, while around sparse points it is inflated to at least the local core distance, which pushes low-density points further from everything else and makes them easier to separate as noise. The code below computes the transformation and visualizes its effect.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from typing import Tuple


def compute_mutual_reachability_graph(
    X: np.ndarray,
    min_samples: int = 5
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute the mutual reachability distance graph.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.
    min_samples : int
        The k parameter for core distance (k-th nearest neighbor).

    Returns
    -------
    core_distances : ndarray of shape (n_samples,)
        Core distance for each point.
    mutual_reach_graph : ndarray of shape (n_samples, n_samples)
        Mutual reachability distance matrix.
    """
    n_samples = X.shape[0]

    # Compute k-nearest neighbors for core distances
    nn = NearestNeighbors(n_neighbors=min_samples, metric='euclidean')
    nn.fit(X)
    distances, _ = nn.kneighbors(X)

    # Core distance is distance to k-th neighbor (last column)
    core_distances = distances[:, -1]

    # Compute pairwise Euclidean distances
    euclidean_dist = np.linalg.norm(
        X[:, np.newaxis, :] - X[np.newaxis, :, :], axis=2
    )

    # Mutual reachability: max(core_dist[a], core_dist[b], d(a,b))
    # Broadcasting: core_distances[:, None] for rows, core_distances for cols
    mutual_reach_graph = np.maximum.reduce([
        core_distances[:, np.newaxis],   # core_dist of point a
        core_distances[np.newaxis, :],   # core_dist of point b
        euclidean_dist                   # Euclidean distance
    ])

    return core_distances, mutual_reach_graph


def visualize_mutual_reachability_effect(
    X: np.ndarray,
    min_samples: int = 5
) -> None:
    """
    Visualize how mutual reachability transforms distances.
    """
    import matplotlib.pyplot as plt

    core_dist, mr_graph = compute_mutual_reachability_graph(X, min_samples)
    euclid_graph = np.linalg.norm(
        X[:, np.newaxis, :] - X[np.newaxis, :, :], axis=2
    )

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Core distances as scatter colors
    ax0 = axes[0]
    scatter = ax0.scatter(X[:, 0], X[:, 1], c=core_dist, cmap='viridis', s=50)
    ax0.set_title(f'Core Distance (min_samples={min_samples})')
    plt.colorbar(scatter, ax=ax0, label='Core Distance')

    # Euclidean distance matrix
    ax1 = axes[1]
    im1 = ax1.imshow(euclid_graph, cmap='viridis')
    ax1.set_title('Euclidean Distance Matrix')
    plt.colorbar(im1, ax=ax1)

    # Mutual reachability matrix
    ax2 = axes[2]
    im2 = ax2.imshow(mr_graph, cmap='viridis')
    ax2.set_title('Mutual Reachability Distance Matrix')
    plt.colorbar(im2, ax=ax2)

    plt.tight_layout()
    plt.show()

    # Summary statistics
    print(f"Euclidean distances: min={euclid_graph.min():.3f}, "
          f"max={euclid_graph.max():.3f}, mean={euclid_graph.mean():.3f}")
    print(f"Mutual reach dists: min={mr_graph.min():.3f}, "
          f"max={mr_graph.max():.3f}, mean={mr_graph.mean():.3f}")
    print(f"Inflation factor: {mr_graph.mean() / euclid_graph.mean():.2f}x")
```

HDBSCAN builds a hierarchical clustering structure by constructing a minimum spanning tree (MST) in the mutual reachability graph, then converting it into a hierarchy.
Step 1: Construct Minimum Spanning Tree
Build an MST over all data points where edge weights are mutual reachability distances. This MST captures the 'backbone' of cluster connectivity.
Why MST? The MST connects all points with minimal total 'cost' (mutual reachability). Cutting edges in order of weight reveals the hierarchical cluster structure—like single-linkage clustering, but in mutual reachability space.
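The sketch below shows one way to carry out this step, assuming the compute_mutual_reachability_graph helper from the listing above is in scope; scipy's minimum_spanning_tree does the heavy lifting.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def mutual_reachability_mst(X: np.ndarray, min_samples: int = 5):
    """Build the MST over the mutual reachability graph; return edges sorted by weight."""
    _, mreach = compute_mutual_reachability_graph(X, min_samples)
    mst = minimum_spanning_tree(mreach).tocoo()          # sparse MST with n-1 edges
    edges = sorted(zip(mst.data, mst.row, mst.col))      # (weight, i, j), ascending weight
    return edges
```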
Step 2: Build the Cluster Hierarchy (Dendrogram)
Process MST edges in increasing order of weight, merging the two components that each edge connects (a union-find pass over the sorted edges).
This produces a dendrogram in which every leaf is a single point, every internal node is a merge of two components, and the height of each merge is the mutual reachability distance at which it occurred.
HDBSCAN's hierarchy is exactly single-linkage clustering in mutual reachability space. Single-linkage merges clusters at the distance of their closest points. Since mutual reachability equals Euclidean distance in dense regions and inflates at boundaries, HDBSCAN's hierarchy naturally separates clusters at density-based boundaries.
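If you want to see this equivalence directly, a small sketch (again assuming the mutual reachability helper above) is to hand the mutual reachability matrix to scipy's single-linkage routine; the resulting merge heights match the MST edge weights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

X = np.random.rand(50, 2)  # any small 2-D dataset
_, mreach = compute_mutual_reachability_graph(X, min_samples=5)

# squareform(..., checks=False) ignores the nonzero diagonal (which holds core distances)
Z = linkage(squareform(mreach, checks=False), method='single')
# Z's merge heights coincide with the mutual-reachability MST edge weights
```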
Step 3: Condensed Cluster Tree
The full dendrogram has n-1 merges (one per data point absorbed). This is often too detailed. HDBSCAN 'condenses' the tree using a minimum cluster size parameter: at each split, a child containing fewer than min_cluster_size points is not treated as a new cluster; its points are simply recorded as 'falling out' of the parent cluster at that density level. Only splits in which both children reach the minimum size count as true cluster splits.
The condensed tree has far fewer nodes and captures only the 'meaningful' cluster structure.
Condensed Tree Properties:
| Feature | Full Dendrogram | Condensed Tree |
|---|---|---|
| Number of splits | n-1 | Much fewer |
| Leaf interpretation | Individual points | Noise points or cluster endpoints |
| Node interpretation | Every merge | Only significant splits |
| Visualization | Often overwhelming | Interpretable |
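In practice you rarely build the condensed tree by hand; the hdbscan package exposes it as an attribute. A minimal sketch (column names as exposed by recent versions of that library):

```python
import numpy as np
import hdbscan

X = np.random.rand(300, 2)  # substitute your own data
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

# Condensed tree as a pandas DataFrame with parent, child, lambda_val, child_size columns
tree_df = clusterer.condensed_tree_.to_pandas()
print(tree_df.head())

# Rows with child_size > 1 are surviving clusters; the rest are points falling out
print(tree_df[tree_df['child_size'] > 1])
```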
The defining innovation of HDBSCAN is its principled method for selecting clusters from the condensed tree: stability-based selection.
Defining Lambda (λ):
Instead of working with distance directly, HDBSCAN uses lambda = 1 / distance (a density-like quantity). Higher lambda means denser regions. Each node in the condensed tree has a λ_birth (the density level at which the cluster appears) and a λ_death (the level at which it splits or dissolves); each point additionally records the λ value at which it leaves its cluster.
Cluster Stability:
The stability of a cluster C is the sum of 'how long' each point persisted in that cluster:
$$\text{stability}(C) = \sum_{p \in C} (\lambda_{death}(p, C) - \lambda_{birth}(C))$$
Intuitively, stability measures the total 'lifetime' of points in the cluster. Stable clusters have many points that persist across a range of density levels.
A cluster with high stability has existed for a 'long time' in density space and contains many points. A cluster that only exists at one precise density level (narrow λ range) or contains few points has low stability. High stability → real cluster. Low stability → possibly artifact.
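A tiny worked example with made-up λ values shows how stability rewards persistence:

```python
# Hypothetical cluster C, born at lambda_birth = 0.5
# Each point leaves C at its own lambda value (lambda_p)
lambda_birth = 0.5
lambda_p = [2.5, 2.0, 1.5, 1.0]   # three points persist to high density, one drops out early

stability = sum(lp - lambda_birth for lp in lambda_p)
print(stability)  # 5.0 -- dominated by the points that persisted longest
```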
Optimal Cluster Selection:
HDBSCAN selects clusters by traversing the condensed tree bottom-up:
```
select_clusters(node):
    if node is leaf:
        return stability(node), [node]

    left_stability, left_clusters = select_clusters(left_child)
    right_stability, right_clusters = select_clusters(right_child)
    children_stability = left_stability + right_stability

    if stability(node) > children_stability:
        # This node is more stable than its children combined
        # Select this cluster, not its descendants
        return stability(node), [node]
    else:
        # Children are more stable
        # Select the descendants instead
        return children_stability, left_clusters + right_clusters
```
This greedy algorithm finds the set of non-overlapping clusters that maximizes total stability: the clusters that 'deserve' to be selected based on their persistence.
Key Property: The selected clusters form a valid flat clustering—no point belongs to two selected clusters.
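Here is a runnable sketch of the same recursion over a toy tree; the TreeNode class and the stability numbers are hypothetical, chosen only to show when a parent beats its children and vice versa.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    name: str
    stability: float
    children: List['TreeNode'] = field(default_factory=list)


def select_clusters(node: TreeNode) -> Tuple[float, List[str]]:
    """Return (best total stability, selected cluster names) for the subtree at node."""
    if not node.children:
        return node.stability, [node.name]
    child_total, child_selection = 0.0, []
    for child in node.children:
        s, sel = select_clusters(child)
        child_total += s
        child_selection += sel
    if node.stability > child_total:
        return node.stability, [node.name]   # this cluster beats its descendants
    return child_total, child_selection      # descendants win


# Toy condensed tree with hypothetical stability values
root = TreeNode('root', 1.0, [
    TreeNode('A', 3.0, [TreeNode('A1', 1.0), TreeNode('A2', 1.5)]),  # A beats A1 + A2
    TreeNode('B', 1.5, [TreeNode('B1', 2.0), TreeNode('B2', 1.5)]),  # B1 + B2 beat B
])
print(select_clusters(root))  # (6.5, ['A', 'B1', 'B2'])
```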
Bringing all pieces together, here's the complete HDBSCAN algorithm:
Input: Dataset X, min_cluster_size, min_samples (optional, defaults to min_cluster_size)
Algorithm:
1. Compute Core Distances: for each point, the distance to its min_samples-th nearest neighbor.
2. Build Mutual Reachability Graph: pairwise distances max(core_k(a), core_k(b), d(a, b)).
3. Construct Minimum Spanning Tree: over the mutual reachability graph.
4. Build Cluster Hierarchy: process MST edges in increasing weight order (single linkage).
5. Condense the Tree: discard splits whose children are smaller than min_cluster_size.
6. Extract Clusters via Stability: select the non-overlapping clusters with maximal total stability.
7. Assign Labels: points in selected clusters receive cluster IDs; all other points are labeled noise (-1).
Output: Cluster labels, soft clustering probabilities (optional), condensed tree (for visualization)
```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors
from typing import List, Tuple, Dict, NamedTuple
from dataclasses import dataclass
import heapq


@dataclass
class CondensedNode:
    """A node in the condensed cluster tree."""
    id: int
    parent: int
    lambda_birth: float
    lambda_death: float
    size: int
    stability: float
    children: List[int]
    is_cluster: bool      # Selected as final cluster?
    members: List[int]    # Point indices (for leaves)


class HDBSCANResult(NamedTuple):
    """Result of HDBSCAN clustering."""
    labels: np.ndarray
    probabilities: np.ndarray
    condensed_tree: List[CondensedNode]
    n_clusters: int


def hdbscan_simple(
    X: np.ndarray,
    min_cluster_size: int = 5,
    min_samples: int = None
) -> HDBSCANResult:
    """
    Simplified HDBSCAN implementation for educational purposes.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.
    min_cluster_size : int
        Minimum size for a cluster.
    min_samples : int, optional
        Number of samples in neighborhood for core distance.
        Defaults to min_cluster_size.

    Returns
    -------
    HDBSCANResult
        Clustering results including labels and tree.
    """
    if min_samples is None:
        min_samples = min_cluster_size

    n_samples = X.shape[0]

    # Step 1: Core distances
    nn = NearestNeighbors(n_neighbors=min_samples)
    nn.fit(X)
    distances, _ = nn.kneighbors(X)
    core_distances = distances[:, -1]

    # Step 2: Mutual reachability graph
    # Full distance matrix (for educational clarity)
    dist_matrix = squareform(pdist(X))
    mutual_reach = np.maximum.reduce([
        core_distances[:, np.newaxis],
        core_distances[np.newaxis, :],
        dist_matrix
    ])

    # Step 3: Minimum spanning tree
    # scipy's minimum_spanning_tree expects a sparse matrix, returns CSR
    mst = minimum_spanning_tree(mutual_reach)
    mst_full = mst.toarray()
    # Make symmetric for easier edge extraction
    mst_full = mst_full + mst_full.T

    # Extract edges: (weight, node_a, node_b)
    edges = []
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            if mst_full[i, j] > 0:
                edges.append((mst_full[i, j], i, j))
    edges.sort()  # Sort by weight

    # Step 4: Build hierarchy using union-find
    parent = list(range(n_samples))
    rank = [0] * n_samples
    cluster_size = [1] * n_samples

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    def union(x, y, weight):
        px, py = find(x), find(y)
        if px == py:
            return None
        if rank[px] < rank[py]:
            px, py = py, px
        parent[py] = px
        if rank[px] == rank[py]:
            rank[px] += 1
        new_size = cluster_size[px] + cluster_size[py]
        cluster_size[px] = new_size
        return (weight, px, py, new_size)

    # Process edges in order
    merges = []
    for weight, a, b in edges:
        result = union(a, b, weight)
        if result is not None:
            merges.append(result)

    # Step 5-6: Build condensed tree and compute stability
    # (Simplified: using the merge sequence directly)
    # For this simplified version, we'll use a basic stability heuristic
    # based on cluster "persistence" in the hierarchy
    labels = extract_flat_clustering_simple(
        n_samples, merges, min_cluster_size
    )

    # Compute soft probabilities (simplified)
    probabilities = compute_probabilities(X, labels, core_distances)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    return HDBSCANResult(
        labels=labels,
        probabilities=probabilities,
        condensed_tree=[],  # Simplified version doesn't build full tree
        n_clusters=n_clusters
    )


def extract_flat_clustering_simple(
    n_samples: int,
    merges: List[Tuple],
    min_cluster_size: int
) -> np.ndarray:
    """
    Simplified cluster extraction based on merge hierarchy.

    Uses a heuristic: cut the hierarchy where cluster sizes first
    exceed min_cluster_size and stability is maximized.
    """
    # For simplicity, use a threshold-based extraction
    # More sophisticated: full condensed tree + stability calculation
    parent = list(range(n_samples))
    cluster_members = {i: {i} for i in range(n_samples)}

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    labels = np.full(n_samples, -1, dtype=int)
    cluster_id = 0

    threshold = None
    for i, (weight, _, _, size) in enumerate(merges):
        if size >= min_cluster_size and threshold is None:
            threshold = weight
            break

    if threshold is None:
        return labels  # All noise

    # Replay merges up to threshold
    for weight, root_a, root_b, size in merges:
        if weight > threshold * 2:  # Allow some margin
            break
        pa, pb = find(root_a), find(root_b)
        if pa != pb:
            parent[pb] = pa
            cluster_members[pa] = cluster_members.get(pa, set()) | cluster_members.get(pb, set())

    # Assign labels based on final clusters
    for i in range(n_samples):
        root = find(i)
        members = cluster_members.get(root, {i})
        if len(members) >= min_cluster_size:
            if labels[root] == -1:
                labels[root] = cluster_id
                cluster_id += 1
            for member in members:
                labels[member] = labels[root]

    return labels


def compute_probabilities(
    X: np.ndarray,
    labels: np.ndarray,
    core_distances: np.ndarray
) -> np.ndarray:
    """
    Compute cluster membership probabilities.

    Points with lower core distances (in denser regions) have
    higher probability of belonging to their assigned cluster.
    """
    probabilities = np.zeros(len(labels))

    for cluster_id in set(labels):
        if cluster_id == -1:
            continue
        mask = labels == cluster_id
        cluster_core_dists = core_distances[mask]
        if len(cluster_core_dists) > 0:
            max_core = cluster_core_dists.max()
            # Probability inversely related to core distance
            # Normalized so max is 1
            probs = 1 - (cluster_core_dists / (max_core + 1e-10))
            probs = np.clip(probs, 0.1, 1.0)  # Floor at 0.1
            probabilities[mask] = probs

    return probabilities
```

One of HDBSCAN's great advantages is requiring very few parameters. The primary parameters and their effects:
min_cluster_size (Required):
The minimum number of points required for a group to be considered a cluster.
min_samples (Optional):
The number of neighbors used to compute core distance. Defaults to min_cluster_size if not specified.
cluster_selection_epsilon (Optional):
Merge clusters that are closer than this distance.
| Parameter | Increase Effect | Decrease Effect | Rule of Thumb |
|---|---|---|---|
| min_cluster_size | Fewer, larger clusters | More, smaller clusters | ≈ smallest meaningful cluster |
| min_samples | More noise, denser cores required | Less noise, looser cores | ≈ min_cluster_size |
| cluster_selection_epsilon | Merge nearby clusters | Strict stability separation | 0 for pure HDBSCAN |
Start with min_cluster_size = the smallest meaningful group in your domain. Leave min_samples at default. Run HDBSCAN and examine results. If too many small clusters: increase min_cluster_size. If too much noise: decrease min_samples. If clusters under-segmented: decrease min_cluster_size.
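A quick way to follow that workflow is a small parameter sweep. The sketch below assumes the hdbscan package is installed and uses random placeholder data; substitute your own matrix for X.

```python
import numpy as np
import hdbscan

X = np.random.rand(500, 2)  # placeholder data; substitute your own

for mcs in (5, 10, 25, 50):
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs).fit_predict(X)
    n_clusters = labels.max() + 1           # labels are 0..k-1, noise is -1
    noise_frac = np.mean(labels == -1)
    print(f"min_cluster_size={mcs:3d}  clusters={n_clusters:2d}  noise={noise_frac:.1%}")
```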
Advanced Parameters:
cluster_selection_method:
- 'eom' (Excess of Mass): the default stability-based selection
- 'leaf': select only the leaf clusters of the condensed tree (most granular)

alpha: Distance scaling parameter. alpha = 1.0 (default) uses standard distances.
metric: Distance metric to use. Supports 'euclidean', 'manhattan', 'cosine', and others.
core_dist_n_jobs: Number of parallel jobs for core distance computation. -1 uses all CPUs.
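Putting these parameters together, a constructor call might look like the sketch below (parameter names as in recent versions of the hdbscan package; the specific values are illustrative, not a tuned configuration).

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=10,
    cluster_selection_method='leaf',  # most granular clusters instead of 'eom'
    cluster_selection_epsilon=0.0,    # pure stability-based selection
    alpha=1.0,                        # standard distance scaling
    metric='manhattan',
    core_dist_n_jobs=-1,              # use all CPUs for core distance computation
)
```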
Unlike hard clustering where each point belongs to exactly one cluster, HDBSCAN provides soft clustering probabilities that indicate how confidently each point belongs to its assigned cluster.
Probability Interpretation:
For each point p assigned to cluster C, HDBSCAN computes probability(p, C) ∈ [0, 1]: values near 1 indicate points deep in the dense core of the cluster, while values near 0 indicate peripheral points that only barely belong to it.
Computing Membership Probability:
The probability is based on lambda values:
$$\text{probability}(p, C) = \frac{\lambda_p - \lambda_{birth}(C)}{\lambda_{death}(C) - \lambda_{birth}(C)}$$
where λ_p is the lambda value at which point p 'fell out' of the cluster. Points that persist longer (closer to λ_death) have higher probability.
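Plugging hypothetical λ values into the formula makes the idea concrete:

```python
# Hypothetical lambda values for a cluster C and one of its points p
lambda_birth, lambda_death = 0.5, 2.5   # density range over which C exists
lambda_p = 2.0                          # level at which p falls out of C

probability = (lambda_p - lambda_birth) / (lambda_death - lambda_birth)
print(probability)  # 0.75 -- p persisted through three-quarters of C's density range
```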
Points labeled as noise (-1) don't have meaningful cluster membership probabilities. Some implementations assign 0 or NaN to noise points. Don't try to interpret these values—noise points are, by definition, not confidently assigned to any cluster.
Outlier Scores:
HDBSCAN also provides outlier scores for each point:
$$\text{outlier\_score}(p) = 1 - \text{probability}(p)$$
More sophisticated variants base the score on how 'different' the point's local density is from its cluster's typical density. High outlier score → point is unusual even within its cluster.
Practical Usage:
```python
from hdbscan import HDBSCAN

clusterer = HDBSCAN(min_cluster_size=10)
clusterer.fit(X)

# Hard cluster assignments
labels = clusterer.labels_

# Membership probabilities
probabilities = clusterer.probabilities_

# Outlier scores
outlier_scores = clusterer.outlier_scores_

# Find uncertain assignments
uncertain = probabilities < 0.5
print(f"Uncertain points: {uncertain.sum()}")
```
HDBSCAN provides powerful visualizations that aid interpretation:
1. Condensed Tree Visualization
The condensed tree shows the hierarchical structure with cluster size encoded as the width of each branch, the λ (density) level on the vertical axis, and the clusters selected by the stability criterion highlighted.
2. Single Linkage Tree
The full dendrogram before condensation, showing all n-1 merges. Useful for understanding the complete hierarchy.
3. Cluster Persistence Plot
Shows stability values for each potential cluster, helping understand why certain clusters were selected.
4. Cluster Probability Plot
Scatter plot with point colors indicating membership probability—dense cores bright, periphery dim.
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import hdbscan


def comprehensive_hdbscan_visualization(X, min_cluster_size=10):
    """
    Create comprehensive HDBSCAN visualization suite.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data (2D for visualization).
    min_cluster_size : int
        HDBSCAN parameter.
    """
    # Fit HDBSCAN
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        gen_min_span_tree=True
    )
    clusterer.fit(X)

    fig = plt.figure(figsize=(20, 12))

    # 1. Data with cluster assignments
    ax1 = fig.add_subplot(2, 3, 1)
    scatter = ax1.scatter(
        X[:, 0], X[:, 1], c=clusterer.labels_,
        cmap='Spectral', s=50, alpha=0.7
    )
    ax1.set_title(f'HDBSCAN Clustering (n_clusters={clusterer.labels_.max() + 1})')

    # Mark noise points
    noise_mask = clusterer.labels_ == -1
    ax1.scatter(
        X[noise_mask, 0], X[noise_mask, 1],
        c='gray', marker='x', s=30, label='Noise'
    )
    ax1.legend()

    # 2. Cluster membership probability
    ax2 = fig.add_subplot(2, 3, 2)
    scatter2 = ax2.scatter(
        X[:, 0], X[:, 1], c=clusterer.probabilities_,
        cmap='viridis', s=50, alpha=0.8
    )
    ax2.set_title('Membership Probability')
    plt.colorbar(scatter2, ax=ax2, label='Probability')

    # 3. Outlier scores
    ax3 = fig.add_subplot(2, 3, 3)
    outlier_scores = clusterer.outlier_scores_
    scatter3 = ax3.scatter(
        X[:, 0], X[:, 1], c=outlier_scores,
        cmap='Reds', s=50, alpha=0.8
    )
    ax3.set_title('Outlier Scores')
    plt.colorbar(scatter3, ax=ax3, label='Outlier Score')

    # 4. Condensed tree (if available)
    ax4 = fig.add_subplot(2, 3, 4)
    clusterer.condensed_tree_.plot(
        select_clusters=True, axis=ax4, colorbar=False
    )
    ax4.set_title('Condensed Cluster Tree')

    # 5. Minimum spanning tree (projected to 2D)
    ax5 = fig.add_subplot(2, 3, 5)
    clusterer.minimum_spanning_tree_.plot(
        edge_cmap='viridis', edge_alpha=0.6, node_size=10, axis=ax5
    )
    ax5.set_title('Minimum Spanning Tree')

    # 6. Cluster sizes and stability summary
    ax6 = fig.add_subplot(2, 3, 6)

    # Get cluster statistics
    unique_labels = set(clusterer.labels_)
    unique_labels.discard(-1)  # Remove noise

    cluster_sizes = []
    cluster_avg_probs = []
    cluster_labels = []
    for label in sorted(unique_labels):
        mask = clusterer.labels_ == label
        cluster_sizes.append(mask.sum())
        cluster_avg_probs.append(clusterer.probabilities_[mask].mean())
        cluster_labels.append(f'Cluster {label}')

    x_pos = np.arange(len(cluster_labels))
    ax6.bar(x_pos, cluster_sizes, color='steelblue', alpha=0.7)
    ax6.set_xticks(x_pos)
    ax6.set_xticklabels(cluster_labels, rotation=45)
    ax6.set_ylabel('Cluster Size')
    ax6.set_title('Cluster Size Distribution')

    # Add average probability as text
    for i, (size, prob) in enumerate(zip(cluster_sizes, cluster_avg_probs)):
        ax6.text(i, size + 2, f'p̄={prob:.2f}', ha='center', fontsize=8)

    # Add noise count
    n_noise = (clusterer.labels_ == -1).sum()
    ax6.text(
        0.95, 0.95, f'Noise: {n_noise} points',
        transform=ax6.transAxes, ha='right', va='top',
        fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    )

    plt.tight_layout()
    plt.show()

    return clusterer


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create multi-density clusters
    cluster1 = np.random.normal([0, 0], 0.3, (200, 2))   # Dense
    cluster2 = np.random.normal([3, 3], 0.8, (100, 2))   # Medium
    cluster3 = np.random.normal([6, 0], 1.2, (50, 2))    # Sparse
    noise = np.random.uniform(-2, 8, (30, 2))

    X = np.vstack([cluster1, cluster2, cluster3, noise])

    clusterer = comprehensive_hdbscan_visualization(X, min_cluster_size=10)
```

HDBSCAN represents the culmination of density-based clustering evolution.
Let's consolidate the essential concepts:
- Mutual reachability distance smooths local density into a symmetric, density-aware distance.
- A minimum spanning tree over this distance, processed in order of weight, yields a single-linkage hierarchy.
- Condensing the hierarchy with min_cluster_size removes insignificant splits.
- Stability (persistence across λ levels) selects the final flat clustering automatically.
- Soft membership probabilities and outlier scores enrich the hard labels.
What's Next:
With the three major density-based algorithms mastered (DBSCAN, OPTICS, HDBSCAN), we turn to a crucial practical topic: parameter selection. How do we choose ε, MinPts, and min_cluster_size effectively? What diagnostic tools and heuristics guide these decisions?
You now have a deep understanding of HDBSCAN—the state-of-the-art in density-based clustering. You understand its theoretical foundations, algorithmic mechanics, and practical advantages. You're equipped to apply HDBSCAN effectively and interpret its rich outputs.