DBSCAN gave us density-based clustering. OPTICS gave us multi-scale density analysis. But both still required manual parameter selection and interpretation. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) completes the evolution by combining hierarchical clustering with density-based methods to automatically extract an optimal flat clustering.
Developed by Ricardo Campello, Davoud Moulavi, and Jörg Sander in 2013, HDBSCAN has become the de facto standard for density-based clustering in practice. Its key innovations are a density-based cluster hierarchy built on the mutual reachability distance and a principled, stability-based extraction of a flat clustering from that hierarchy.
By the end of this page, you will understand HDBSCAN's theoretical foundations, including mutual reachability distance and condensed cluster trees. You'll master the algorithm's mechanics, learn to interpret its outputs, and understand why it has become the gold standard for density-based clustering in modern machine learning.
Understanding HDBSCAN requires tracing the conceptual evolution from DBSCAN through OPTICS:
DBSCAN's Limitation: Fixed ε means one density threshold for all clusters. Dense and sparse clusters can't coexist.
OPTICS' Contribution: The reachability plot captures structure at all densities. But extracting clusters still requires choosing thresholds or tuning Xi.
HDBSCAN's Innovation: Build a true hierarchy of clusters based on density, then use a principled criterion to select the 'most stable' clusters from this hierarchy.
The key insight is that clusters exist at different density levels, and the 'true' clusters are those that persist (remain stable) across a range of density thresholds. A cluster that appears briefly at one specific density but quickly splits or merges is less 'stable' than one that persists across many density levels.
| Algorithm | Year | Key Innovation | Remaining Limitation |
|---|---|---|---|
| DBSCAN | 1996 | Density-based cluster definition | Single density scale |
| OPTICS | 1999 | Multi-scale ordering and visualization | Manual extraction required |
| HDBSCAN | 2013 | Automatic optimal cluster extraction | Minimal (very few parameters) |
HDBSCAN's core principle: stable clusters are real clusters. If a grouping only exists at one precise density level and disappears with tiny changes, it's likely noise or artifact. If a grouping persists across a range of densities, it represents genuine structure in the data.
HDBSCAN introduces a key concept that smooths out density variations: the mutual reachability distance.
Core Distance (Recap):
The core distance of point p with respect to parameter k (often min_cluster_size) is the distance to its k-th nearest neighbor:
$$\text{core}_k(p) = d(p, \text{k-th nearest neighbor of } p)$$
This measures the 'local density' at p. Small core distance = dense region.
Mutual Reachability Distance:
The mutual reachability distance between points a and b is:
$$d_{mreach-k}(a, b) = \max\{\text{core}_k(a), \text{core}_k(b), d(a, b)\}$$
This is symmetric (unlike OPTICS' reachability distance) and captures the density 'barrier' between two points.
Think of mutual reachability as answering: 'What's the minimum density level at which a and b could be directly connected?' It's bounded below by both points' local densities (their core distances) and by their actual distance. Two points in dense regions can be mutual-reachability-close even if somewhat distant, because neither of them minds stretching a bit. But a point in a sparse region raises the mutual reachability to everything around it.
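As a quick numeric illustration (the values below are made up, not from any real dataset), suppose point a sits in a dense region and point b in a sparser one:

```python
# Hypothetical values for two points a and b
core_a, core_b = 0.4, 0.9   # core distances: a is in a dense region, b in a sparser one
d_ab = 0.6                  # plain Euclidean distance between a and b

# Mutual reachability takes the max of the three quantities
d_mreach = max(core_a, core_b, d_ab)
print(d_mreach)  # 0.9 -- b's sparse neighbourhood inflates the effective distance
```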
Properties of Mutual Reachability Distance:
- Symmetric: d_mreach-k(a, b) = d_mreach-k(b, a).
- Never smaller than the Euclidean distance d(a, b).
- Never smaller than either point's core distance.
- Reduces to the plain Euclidean distance when both points lie in dense regions (small core distances).
Effect on Clustering:
The mutual reachability distance 'equalizes' dense and sparse regions: within dense regions it is simply the Euclidean distance, while around sparse points it is inflated to at least the local core distance, which pushes low-density points further from everything else and makes them easier to separate as noise. The code below computes the transformation and visualizes its effect.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from typing import Tuple


def compute_mutual_reachability_graph(
    X: np.ndarray,
    min_samples: int = 5
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute the mutual reachability distance graph.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.
    min_samples : int
        The k parameter for core distance (k-th nearest neighbor).

    Returns
    -------
    core_distances : ndarray of shape (n_samples,)
        Core distance for each point.
    mutual_reach_graph : ndarray of shape (n_samples, n_samples)
        Mutual reachability distance matrix.
    """
    n_samples = X.shape[0]

    # Compute k-nearest neighbors for core distances
    nn = NearestNeighbors(n_neighbors=min_samples, metric='euclidean')
    nn.fit(X)
    distances, _ = nn.kneighbors(X)

    # Core distance is distance to k-th neighbor (last column)
    core_distances = distances[:, -1]

    # Compute pairwise Euclidean distances
    euclidean_dist = np.linalg.norm(
        X[:, np.newaxis, :] - X[np.newaxis, :, :], axis=2
    )

    # Mutual reachability: max(core_dist[a], core_dist[b], d(a,b))
    # Broadcasting: core_distances[:, None] for rows, core_distances for cols
    mutual_reach_graph = np.maximum.reduce([
        core_distances[:, np.newaxis],   # core_dist of point a
        core_distances[np.newaxis, :],   # core_dist of point b
        euclidean_dist                   # Euclidean distance
    ])

    return core_distances, mutual_reach_graph


def visualize_mutual_reachability_effect(
    X: np.ndarray,
    min_samples: int = 5
) -> None:
    """
    Visualize how mutual reachability transforms distances.
    """
    import matplotlib.pyplot as plt

    core_dist, mr_graph = compute_mutual_reachability_graph(X, min_samples)
    euclid_graph = np.linalg.norm(
        X[:, np.newaxis, :] - X[np.newaxis, :, :], axis=2
    )

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Core distances as scatter colors
    ax0 = axes[0]
    scatter = ax0.scatter(X[:, 0], X[:, 1], c=core_dist, cmap='viridis', s=50)
    ax0.set_title(f'Core Distance (min_samples={min_samples})')
    plt.colorbar(scatter, ax=ax0, label='Core Distance')

    # Euclidean distance matrix
    ax1 = axes[1]
    im1 = ax1.imshow(euclid_graph, cmap='viridis')
    ax1.set_title('Euclidean Distance Matrix')
    plt.colorbar(im1, ax=ax1)

    # Mutual reachability matrix
    ax2 = axes[2]
    im2 = ax2.imshow(mr_graph, cmap='viridis')
    ax2.set_title('Mutual Reachability Distance Matrix')
    plt.colorbar(im2, ax=ax2)

    plt.tight_layout()
    plt.show()

    # Summary statistics
    print(f"Euclidean distances: min={euclid_graph.min():.3f}, "
          f"max={euclid_graph.max():.3f}, mean={euclid_graph.mean():.3f}")
    print(f"Mutual reach dists: min={mr_graph.min():.3f}, "
          f"max={mr_graph.max():.3f}, mean={mr_graph.mean():.3f}")
    print(f"Inflation factor: {mr_graph.mean() / euclid_graph.mean():.2f}x")
```

HDBSCAN builds a hierarchical clustering structure by constructing a minimum spanning tree (MST) in the mutual reachability graph, then converting it into a hierarchy.
Step 1: Construct Minimum Spanning Tree
Build an MST over all data points where edge weights are mutual reachability distances. This MST captures the 'backbone' of cluster connectivity.
Why MST? The MST connects all points with minimal total 'cost' (mutual reachability). Cutting edges in order of weight reveals the hierarchical cluster structure—like single-linkage clustering, but in mutual reachability space.
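The sketch below shows one way to carry out this step, assuming the compute_mutual_reachability_graph helper from the listing above is in scope; scipy's minimum_spanning_tree does the heavy lifting.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def mutual_reachability_mst(X: np.ndarray, min_samples: int = 5):
    """Build the MST over the mutual reachability graph; return edges sorted by weight."""
    _, mreach = compute_mutual_reachability_graph(X, min_samples)
    mst = minimum_spanning_tree(mreach).tocoo()          # sparse MST with n-1 edges
    edges = sorted(zip(mst.data, mst.row, mst.col))      # (weight, i, j), ascending weight
    return edges
```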
Step 2: Build the Cluster Hierarchy (Dendrogram)
Process MST edges in increasing order of weight, merging the two components that each edge connects (a union-find pass over the sorted edges).
This produces a dendrogram in which every leaf is a single point, every internal node is a merge of two components, and the height of each merge is the mutual reachability distance at which it occurred.
HDBSCAN's hierarchy is exactly single-linkage clustering in mutual reachability space. Single-linkage merges clusters at the distance of their closest points. Since mutual reachability equals Euclidean distance in dense regions and inflates at boundaries, HDBSCAN's hierarchy naturally separates clusters at density-based boundaries.
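If you want to see this equivalence directly, a small sketch (again assuming the mutual reachability helper above) is to hand the mutual reachability matrix to scipy's single-linkage routine; the resulting merge heights match the MST edge weights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

X = np.random.rand(50, 2)  # any small 2-D dataset
_, mreach = compute_mutual_reachability_graph(X, min_samples=5)

# squareform(..., checks=False) ignores the nonzero diagonal (which holds core distances)
Z = linkage(squareform(mreach, checks=False), method='single')
# Z's merge heights coincide with the mutual-reachability MST edge weights
```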
Step 3: Condensed Cluster Tree
The full dendrogram has n-1 merges (one per data point absorbed). This is often too detailed. HDBSCAN 'condenses' the tree using a minimum cluster size parameter: at each split, a child containing fewer than min_cluster_size points is not treated as a new cluster; its points are simply recorded as 'falling out' of the parent cluster at that density level. Only splits in which both children reach the minimum size count as true cluster splits.
The condensed tree has far fewer nodes and captures only the 'meaningful' cluster structure.
Condensed Tree Properties:
| Feature | Full Dendrogram | Condensed Tree |
|---|---|---|
| Number of splits | n-1 | Much fewer |
| Leaf interpretation | Individual points | Noise points or cluster endpoints |
| Node interpretation | Every merge | Only significant splits |
| Visualization | Often overwhelming | Interpretable |
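In practice you rarely build the condensed tree by hand; the hdbscan package exposes it as an attribute. A minimal sketch (column names as exposed by recent versions of that library):

```python
import numpy as np
import hdbscan

X = np.random.rand(300, 2)  # substitute your own data
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

# Condensed tree as a pandas DataFrame with parent, child, lambda_val, child_size columns
tree_df = clusterer.condensed_tree_.to_pandas()
print(tree_df.head())

# Rows with child_size > 1 are surviving clusters; the rest are points falling out
print(tree_df[tree_df['child_size'] > 1])
```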
The defining innovation of HDBSCAN is its principled method for selecting clusters from the condensed tree: stability-based selection.
Defining Lambda (λ):
Instead of working with distance directly, HDBSCAN uses lambda = 1 / distance (a density-like quantity). Higher lambda means denser regions. Each node in the condensed tree has a λ_birth (the density level at which the cluster appears) and a λ_death (the level at which it splits or dissolves); each point additionally records the λ value at which it leaves its cluster.
Cluster Stability:
The stability of a cluster C is the sum of 'how long' each point persisted in that cluster:
$$\text{stability}(C) = \sum_{p \in C} (\lambda_{death}(p, C) - \lambda_{birth}(C))$$
Intuitively, stability measures the total 'lifetime' of points in the cluster. Stable clusters have many points that persist across a range of density levels.
A cluster with high stability has existed for a 'long time' in density space and contains many points. A cluster that only exists at one precise density level (narrow λ range) or contains few points has low stability. High stability → real cluster. Low stability → possibly artifact.
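A tiny worked example with made-up λ values shows how stability rewards persistence:

```python
# Hypothetical cluster C, born at lambda_birth = 0.5
# Each point leaves C at its own lambda value (lambda_p)
lambda_birth = 0.5
lambda_p = [2.5, 2.0, 1.5, 1.0]   # three points persist to high density, one drops out early

stability = sum(lp - lambda_birth for lp in lambda_p)
print(stability)  # 5.0 -- dominated by the points that persisted longest
```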
Optimal Cluster Selection:
HDBSCAN selects clusters by traversing the condensed tree bottom-up:
```
select_clusters(node):
    if node is leaf:
        return stability(node), [node]

    left_stability, left_clusters = select_clusters(left_child)
    right_stability, right_clusters = select_clusters(right_child)
    children_stability = left_stability + right_stability

    if stability(node) > children_stability:
        # This node is more stable than its children combined
        # Select this cluster, not its descendants
        return stability(node), [node]
    else:
        # Children are more stable
        # Select the descendants instead
        return children_stability, left_clusters + right_clusters
```
This greedy algorithm finds the set of non-overlapping clusters that maximizes total stability: the clusters that 'deserve' to be selected based on their persistence.
Key Property: The selected clusters form a valid flat clustering—no point belongs to two selected clusters.
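Here is a runnable sketch of the same recursion over a toy tree; the TreeNode class and the stability numbers are hypothetical, chosen only to show when a parent beats its children and vice versa.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    name: str
    stability: float
    children: List['TreeNode'] = field(default_factory=list)


def select_clusters(node: TreeNode) -> Tuple[float, List[str]]:
    """Return (best total stability, selected cluster names) for the subtree at node."""
    if not node.children:
        return node.stability, [node.name]
    child_total, child_selection = 0.0, []
    for child in node.children:
        s, sel = select_clusters(child)
        child_total += s
        child_selection += sel
    if node.stability > child_total:
        return node.stability, [node.name]   # this cluster beats its descendants
    return child_total, child_selection      # descendants win


# Toy condensed tree with hypothetical stability values
root = TreeNode('root', 1.0, [
    TreeNode('A', 3.0, [TreeNode('A1', 1.0), TreeNode('A2', 1.5)]),  # A beats A1 + A2
    TreeNode('B', 1.5, [TreeNode('B1', 2.0), TreeNode('B2', 1.5)]),  # B1 + B2 beat B
])
print(select_clusters(root))  # (6.5, ['A', 'B1', 'B2'])
```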
Bringing all pieces together, here's the complete HDBSCAN algorithm:
Input: Dataset X, min_cluster_size, min_samples (optional, defaults to min_cluster_size)
Algorithm:
1. Compute Core Distances: for each point, the distance to its min_samples-th nearest neighbor.
2. Build Mutual Reachability Graph: pairwise distances max(core_k(a), core_k(b), d(a, b)).
3. Construct Minimum Spanning Tree: over the mutual reachability graph.
4. Build Cluster Hierarchy: process MST edges in increasing weight order (single linkage).
5. Condense the Tree: discard splits whose children are smaller than min_cluster_size.
6. Extract Clusters via Stability: select the non-overlapping clusters with maximal total stability.
7. Assign Labels: points in selected clusters receive cluster IDs; all other points are labeled noise (-1).
Output: Cluster labels, soft clustering probabilities (optional), condensed tree (for visualization)
```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors
from typing import List, Tuple, Dict, NamedTuple
from dataclasses import dataclass
import heapq


@dataclass
class CondensedNode:
    """A node in the condensed cluster tree."""
    id: int
    parent: int
    lambda_birth: float
    lambda_death: float
    size: int
    stability: float
    children: List[int]
    is_cluster: bool      # Selected as final cluster?
    members: List[int]    # Point indices (for leaves)


class HDBSCANResult(NamedTuple):
    """Result of HDBSCAN clustering."""
    labels: np.ndarray
    probabilities: np.ndarray
    condensed_tree: List[CondensedNode]
    n_clusters: int


def hdbscan_simple(
    X: np.ndarray,
    min_cluster_size: int = 5,
    min_samples: int = None
) -> HDBSCANResult:
    """
    Simplified HDBSCAN implementation for educational purposes.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.
    min_cluster_size : int
        Minimum size for a cluster.
    min_samples : int, optional
        Number of samples in neighborhood for core distance.
        Defaults to min_cluster_size.

    Returns
    -------
    HDBSCANResult
        Clustering results including labels and tree.
    """
    if min_samples is None:
        min_samples = min_cluster_size

    n_samples = X.shape[0]

    # Step 1: Core distances
    nn = NearestNeighbors(n_neighbors=min_samples)
    nn.fit(X)
    distances, _ = nn.kneighbors(X)
    core_distances = distances[:, -1]

    # Step 2: Mutual reachability graph
    # Full distance matrix (for educational clarity)
    dist_matrix = squareform(pdist(X))
    mutual_reach = np.maximum.reduce([
        core_distances[:, np.newaxis],
        core_distances[np.newaxis, :],
        dist_matrix
    ])

    # Step 3: Minimum spanning tree
    # scipy's minimum_spanning_tree expects a sparse matrix, returns CSR
    mst = minimum_spanning_tree(mutual_reach)
    mst_full = mst.toarray()
    # Make symmetric for easier edge extraction
    mst_full = mst_full + mst_full.T

    # Extract edges: (weight, node_a, node_b)
    edges = []
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            if mst_full[i, j] > 0:
                edges.append((mst_full[i, j], i, j))
    edges.sort()  # Sort by weight

    # Step 4: Build hierarchy using union-find
    parent = list(range(n_samples))
    rank = [0] * n_samples
    cluster_size = [1] * n_samples

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    def union(x, y, weight):
        px, py = find(x), find(y)
        if px == py:
            return None
        if rank[px] < rank[py]:
            px, py = py, px
        parent[py] = px
        if rank[px] == rank[py]:
            rank[px] += 1
        new_size = cluster_size[px] + cluster_size[py]
        cluster_size[px] = new_size
        return (weight, px, py, new_size)

    # Process edges in order
    merges = []
    for weight, a, b in edges:
        result = union(a, b, weight)
        if result is not None:
            merges.append(result)

    # Step 5-6: Build condensed tree and compute stability
    # (Simplified: using the merge sequence directly)
    # For this simplified version, we'll use a basic stability heuristic
    # based on cluster "persistence" in the hierarchy
    labels = extract_flat_clustering_simple(
        n_samples, merges, min_cluster_size
    )

    # Compute soft probabilities (simplified)
    probabilities = compute_probabilities(X, labels, core_distances)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    return HDBSCANResult(
        labels=labels,
        probabilities=probabilities,
        condensed_tree=[],  # Simplified version doesn't build full tree
        n_clusters=n_clusters
    )


def extract_flat_clustering_simple(
    n_samples: int,
    merges: List[Tuple],
    min_cluster_size: int
) -> np.ndarray:
    """
    Simplified cluster extraction based on merge hierarchy.

    Uses a heuristic: cut the hierarchy where cluster sizes first
    exceed min_cluster_size and stability is maximized.
    """
    # For simplicity, use a threshold-based extraction
    # More sophisticated: full condensed tree + stability calculation
    parent = list(range(n_samples))
    cluster_members = {i: {i} for i in range(n_samples)}

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    labels = np.full(n_samples, -1, dtype=int)
    cluster_id = 0

    threshold = None
    for i, (weight, _, _, size) in enumerate(merges):
        if size >= min_cluster_size and threshold is None:
            threshold = weight
            break

    if threshold is None:
        return labels  # All noise

    # Replay merges up to threshold
    for weight, root_a, root_b, size in merges:
        if weight > threshold * 2:  # Allow some margin
            break
        pa, pb = find(root_a), find(root_b)
        if pa != pb:
            parent[pb] = pa
            cluster_members[pa] = cluster_members.get(pa, set()) | cluster_members.get(pb, set())

    # Assign labels based on final clusters
    for i in range(n_samples):
        root = find(i)
        members = cluster_members.get(root, {i})
        if len(members) >= min_cluster_size:
            if labels[root] == -1:
                labels[root] = cluster_id
                cluster_id += 1
            for member in members:
                labels[member] = labels[root]

    return labels


def compute_probabilities(
    X: np.ndarray,
    labels: np.ndarray,
    core_distances: np.ndarray
) -> np.ndarray:
    """
    Compute cluster membership probabilities.

    Points with lower core distances (in denser regions) have
    higher probability of belonging to their assigned cluster.
    """
    probabilities = np.zeros(len(labels))

    for cluster_id in set(labels):
        if cluster_id == -1:
            continue
        mask = labels == cluster_id
        cluster_core_dists = core_distances[mask]
        if len(cluster_core_dists) > 0:
            max_core = cluster_core_dists.max()
            # Probability inversely related to core distance
            # Normalized so max is 1
            probs = 1 - (cluster_core_dists / (max_core + 1e-10))
            probs = np.clip(probs, 0.1, 1.0)  # Floor at 0.1
            probabilities[mask] = probs

    return probabilities
```

One of HDBSCAN's great advantages is requiring very few parameters. The primary parameters and their effects:
min_cluster_size (Required):
The minimum number of points required for a group to be considered a cluster.
min_samples (Optional):
The number of neighbors used to compute core distance. Defaults to min_cluster_size if not specified.
cluster_selection_epsilon (Optional):
Merge clusters that are closer than this distance.
| Parameter | Increase Effect | Decrease Effect | Rule of Thumb |
|---|---|---|---|
| min_cluster_size | Fewer, larger clusters | More, smaller clusters | ≈ smallest meaningful cluster |
| min_samples | More noise, denser cores required | Less noise, looser cores | ≈ min_cluster_size |
| cluster_selection_epsilon | Merge nearby clusters | Strict stability separation | 0 for pure HDBSCAN |
Start with min_cluster_size = the smallest meaningful group in your domain. Leave min_samples at default. Run HDBSCAN and examine results. If too many small clusters: increase min_cluster_size. If too much noise: decrease min_samples. If clusters under-segmented: decrease min_cluster_size.
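A quick way to follow that workflow is a small parameter sweep. The sketch below assumes the hdbscan package is installed and uses random placeholder data; substitute your own matrix for X.

```python
import numpy as np
import hdbscan

X = np.random.rand(500, 2)  # placeholder data; substitute your own

for mcs in (5, 10, 25, 50):
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs).fit_predict(X)
    n_clusters = labels.max() + 1           # labels are 0..k-1, noise is -1
    noise_frac = np.mean(labels == -1)
    print(f"min_cluster_size={mcs:3d}  clusters={n_clusters:2d}  noise={noise_frac:.1%}")
```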
Advanced Parameters:
cluster_selection_method:
- 'eom' (Excess of Mass): the default stability-based selection
- 'leaf': select only the leaf clusters of the condensed tree (most granular)

alpha: Distance scaling parameter. alpha = 1.0 (default) uses standard distances.
metric: Distance metric to use. Supports 'euclidean', 'manhattan', 'cosine', and others.
core_dist_n_jobs: Number of parallel jobs for core distance computation. -1 uses all CPUs.
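Putting these parameters together, a constructor call might look like the sketch below (parameter names as in recent versions of the hdbscan package; the specific values are illustrative, not a tuned configuration).

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=10,
    cluster_selection_method='leaf',  # most granular clusters instead of 'eom'
    cluster_selection_epsilon=0.0,    # pure stability-based selection
    alpha=1.0,                        # standard distance scaling
    metric='manhattan',
    core_dist_n_jobs=-1,              # use all CPUs for core distance computation
)
```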
Unlike hard clustering where each point belongs to exactly one cluster, HDBSCAN provides soft clustering probabilities that indicate how confidently each point belongs to its assigned cluster.
Probability Interpretation:
For each point p assigned to cluster C, HDBSCAN computes probability(p, C) ∈ [0, 1]: values near 1 indicate points deep in the dense core of the cluster, while values near 0 indicate peripheral points that only barely belong to it.
Computing Membership Probability:
The probability is based on lambda values:
$$\text{probability}(p, C) = \frac{\lambda_p - \lambda_{birth}(C)}{\lambda_{death}(C) - \lambda_{birth}(C)}$$
where λ_p is the lambda value at which point p 'fell out' of the cluster. Points that persist longer (closer to λ_death) have higher probability.
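Plugging hypothetical λ values into the formula makes the idea concrete:

```python
# Hypothetical lambda values for a cluster C and one of its points p
lambda_birth, lambda_death = 0.5, 2.5   # density range over which C exists
lambda_p = 2.0                          # level at which p falls out of C

probability = (lambda_p - lambda_birth) / (lambda_death - lambda_birth)
print(probability)  # 0.75 -- p persisted through three-quarters of C's density range
```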
Points labeled as noise (-1) don't have meaningful cluster membership probabilities. Some implementations assign 0 or NaN to noise points. Don't try to interpret these values—noise points are, by definition, not confidently assigned to any cluster.
Outlier Scores:
HDBSCAN also provides outlier scores for each point:
$$\text{outlier\_score}(p) = 1 - \text{probability}(p)$$
More sophisticated variants base the score on how 'different' the point's local density is from its cluster's typical density. High outlier score → point is unusual even within its cluster.
Practical Usage:
```python
from hdbscan import HDBSCAN

clusterer = HDBSCAN(min_cluster_size=10)
clusterer.fit(X)

# Hard cluster assignments
labels = clusterer.labels_

# Membership probabilities
probabilities = clusterer.probabilities_

# Outlier scores
outlier_scores = clusterer.outlier_scores_

# Find uncertain assignments
uncertain = probabilities < 0.5
print(f"Uncertain points: {uncertain.sum()}")
```
HDBSCAN provides powerful visualizations that aid interpretation:
1. Condensed Tree Visualization
The condensed tree shows the hierarchical structure with cluster size encoded as the width of each branch, the λ (density) level on the vertical axis, and the clusters selected by the stability criterion highlighted.
2. Single Linkage Tree
The full dendrogram before condensation, showing all n-1 merges. Useful for understanding the complete hierarchy.
3. Cluster Persistence Plot
Shows stability values for each potential cluster, helping understand why certain clusters were selected.
4. Cluster Probability Plot
Scatter plot with point colors indicating membership probability—dense cores bright, periphery dim.
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import hdbscan


def comprehensive_hdbscan_visualization(X, min_cluster_size=10):
    """
    Create comprehensive HDBSCAN visualization suite.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data (2D for visualization).
    min_cluster_size : int
        HDBSCAN parameter.
    """
    # Fit HDBSCAN
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        gen_min_span_tree=True
    )
    clusterer.fit(X)

    fig = plt.figure(figsize=(20, 12))

    # 1. Data with cluster assignments
    ax1 = fig.add_subplot(2, 3, 1)
    scatter = ax1.scatter(
        X[:, 0], X[:, 1], c=clusterer.labels_,
        cmap='Spectral', s=50, alpha=0.7
    )
    ax1.set_title(f'HDBSCAN Clustering (n_clusters={clusterer.labels_.max() + 1})')

    # Mark noise points
    noise_mask = clusterer.labels_ == -1
    ax1.scatter(
        X[noise_mask, 0], X[noise_mask, 1],
        c='gray', marker='x', s=30, label='Noise'
    )
    ax1.legend()

    # 2. Cluster membership probability
    ax2 = fig.add_subplot(2, 3, 2)
    scatter2 = ax2.scatter(
        X[:, 0], X[:, 1], c=clusterer.probabilities_,
        cmap='viridis', s=50, alpha=0.8
    )
    ax2.set_title('Membership Probability')
    plt.colorbar(scatter2, ax=ax2, label='Probability')

    # 3. Outlier scores
    ax3 = fig.add_subplot(2, 3, 3)
    outlier_scores = clusterer.outlier_scores_
    scatter3 = ax3.scatter(
        X[:, 0], X[:, 1], c=outlier_scores,
        cmap='Reds', s=50, alpha=0.8
    )
    ax3.set_title('Outlier Scores')
    plt.colorbar(scatter3, ax=ax3, label='Outlier Score')

    # 4. Condensed tree (if available)
    ax4 = fig.add_subplot(2, 3, 4)
    clusterer.condensed_tree_.plot(
        select_clusters=True, axis=ax4, colorbar=False
    )
    ax4.set_title('Condensed Cluster Tree')

    # 5. Minimum spanning tree (projected to 2D)
    ax5 = fig.add_subplot(2, 3, 5)
    clusterer.minimum_spanning_tree_.plot(
        edge_cmap='viridis', edge_alpha=0.6, node_size=10, axis=ax5
    )
    ax5.set_title('Minimum Spanning Tree')

    # 6. Cluster sizes and stability summary
    ax6 = fig.add_subplot(2, 3, 6)

    # Get cluster statistics
    unique_labels = set(clusterer.labels_)
    unique_labels.discard(-1)  # Remove noise

    cluster_sizes = []
    cluster_avg_probs = []
    cluster_labels = []
    for label in sorted(unique_labels):
        mask = clusterer.labels_ == label
        cluster_sizes.append(mask.sum())
        cluster_avg_probs.append(clusterer.probabilities_[mask].mean())
        cluster_labels.append(f'Cluster {label}')

    x_pos = np.arange(len(cluster_labels))
    ax6.bar(x_pos, cluster_sizes, color='steelblue', alpha=0.7)
    ax6.set_xticks(x_pos)
    ax6.set_xticklabels(cluster_labels, rotation=45)
    ax6.set_ylabel('Cluster Size')
    ax6.set_title('Cluster Size Distribution')

    # Add average probability as text
    for i, (size, prob) in enumerate(zip(cluster_sizes, cluster_avg_probs)):
        ax6.text(i, size + 2, f'p̄={prob:.2f}', ha='center', fontsize=8)

    # Add noise count
    n_noise = (clusterer.labels_ == -1).sum()
    ax6.text(
        0.95, 0.95, f'Noise: {n_noise} points',
        transform=ax6.transAxes, ha='right', va='top',
        fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    )

    plt.tight_layout()
    plt.show()

    return clusterer


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create multi-density clusters
    cluster1 = np.random.normal([0, 0], 0.3, (200, 2))   # Dense
    cluster2 = np.random.normal([3, 3], 0.8, (100, 2))   # Medium
    cluster3 = np.random.normal([6, 0], 1.2, (50, 2))    # Sparse
    noise = np.random.uniform(-2, 8, (30, 2))

    X = np.vstack([cluster1, cluster2, cluster3, noise])

    clusterer = comprehensive_hdbscan_visualization(X, min_cluster_size=10)
```

HDBSCAN represents the culmination of density-based clustering evolution.
Let's consolidate the essential concepts:
- Mutual reachability distance smooths local density into a symmetric, density-aware distance.
- A minimum spanning tree over this distance, processed in order of weight, yields a single-linkage hierarchy.
- Condensing the hierarchy with min_cluster_size removes insignificant splits.
- Stability (persistence across λ levels) selects the final flat clustering automatically.
- Soft membership probabilities and outlier scores enrich the hard labels.
What's Next:
With the three major density-based algorithms mastered (DBSCAN, OPTICS, HDBSCAN), we turn to a crucial practical topic: parameter selection. How do we choose ε, MinPts, and min_cluster_size effectively? What diagnostic tools and heuristics guide these decisions?
You now have a deep understanding of HDBSCAN—the state-of-the-art in density-based clustering. You understand its theoretical foundations, algorithmic mechanics, and practical advantages. You're equipped to apply HDBSCAN effectively and interpret its rich outputs.