Density-based clustering algorithms promise to discover clusters of arbitrary shape without specifying the number of clusters. But this flexibility comes with its own challenge: parameter selection. The quality of clustering results depends critically on choosing appropriate values for ε and MinPts (DBSCAN), ε_max (OPTICS), and min_cluster_size and min_samples (HDBSCAN).
Poorly chosen parameters lead to real clusters dissolving into noise, distinct groups merging into one, or meaningful structure fragmenting into many tiny clusters.
This page provides a comprehensive toolkit for systematic, principled parameter selection—transforming parameter tuning from guesswork into informed decision-making.
By the end of this page, you will master the k-distance plot method for ε selection, understand how domain knowledge guides MinPts choice, learn validation strategies for parameter assessment, and gain practical workflows for parameter optimization across DBSCAN, OPTICS, and HDBSCAN.
Before diving into selection methods, let's clearly understand what each parameter controls and how they interact.
DBSCAN Parameters:
ε (epsilon) — Neighborhood radius
MinPts — Core point threshold
| ε \ MinPts | Low MinPts | High MinPts |
|---|---|---|
| Small ε | Many small clusters, some noise included | Most points become noise |
| Large ε | Few huge clusters, includes noise | Moderate clusters, good noise rejection |
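This interaction is easy to observe empirically. The sketch below (illustrative parameter values on a toy dataset, using scikit-learn's DBSCAN) runs each corner of the table and counts the resulting clusters and noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight blobs plus uniform background noise
X = np.vstack([
    rng.normal([0, 0], 0.3, (100, 2)),
    rng.normal([5, 5], 0.3, (100, 2)),
    rng.uniform(-2, 7, (20, 2)),
])

results = {}
for eps, min_samples in [(0.2, 3), (0.2, 15), (2.0, 3), (2.0, 15)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[(eps, min_samples)] = (n_clusters, n_noise)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

The small-ε, high-MinPts corner should report the most noise, matching the table: raising MinPts can only shrink the set of core points, and shrinking ε can only remove reachability edges.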
OPTICS Parameters:
ε_max — Maximum neighborhood radius
MinPts — Same role as DBSCAN
Extraction parameters — ε threshold or Xi
HDBSCAN Parameters:
min_cluster_size — Minimum cluster membership
min_samples — Core distance neighbor count
Think of ε as 'how far to look' and MinPts as 'how crowded it must be.' Finding clusters requires: (1) ε matches the natural point spacing within clusters, and (2) MinPts matches the typical neighbor count in cluster interiors. These constraints should be satisfied simultaneously.
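One way to see that the two parameters act together: in 2D, the pair (ε, MinPts) jointly asserts a minimum density of MinPts/(πε²) points per unit area. This quick back-of-envelope calculation (an illustrative helper, not part of any library) shows two different parameter pairs encoding the same density threshold:

```python
import math

def implied_density_2d(eps: float, min_pts: int) -> float:
    """Minimum point density (points per unit area) implied by (eps, MinPts) in 2D."""
    return min_pts / (math.pi * eps ** 2)

# Different pairs, same density requirement
print(implied_density_2d(0.5, 5))
print(implied_density_2d(1.0, 20))
```

Both lines print about 6.37 points per unit area, which is why ε and MinPts must be chosen jointly rather than one at a time.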
The k-distance plot (also called k-NN distance plot) is the canonical method for selecting ε in DBSCAN. Introduced in the original DBSCAN paper, it provides a visual heuristic for finding the natural density threshold.
The Method:
1. Choose k, typically equal to MinPts (or MinPts − 1).
2. For each point, compute the distance to its k-th nearest neighbor.
3. Sort these k-distances in ascending order and plot them.
4. Find the 'elbow' where the curve bends sharply upward; the k-distance at the elbow is the candidate ε.
Interpretation:
The k-distance plot typically shows a long, relatively flat region followed by a sharp rise.
In well-clustered data, cluster-interior points have similar k-distances (small, relatively uniform). Noise and boundary points have larger k-distances. The elbow marks the transition: below it, points have enough neighbors to be core; above it, they don't. Setting ε at the elbow captures cluster structure while rejecting noise.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator
from typing import Tuple, Optional, List


def k_distance_plot(
    X: np.ndarray,
    k: int = 5,
    find_elbow: bool = True,
    ax: Optional[plt.Axes] = None
) -> Tuple[np.ndarray, Optional[float]]:
    """
    Create k-distance plot for DBSCAN epsilon selection.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.
    k : int
        Number of neighbors (typically MinPts or MinPts - 1).
    find_elbow : bool
        Whether to automatically detect the elbow.
    ax : matplotlib Axes, optional
        Axes to plot on.

    Returns
    -------
    k_distances_sorted : ndarray
        Sorted k-distances.
    elbow_epsilon : float or None
        Suggested epsilon if elbow detection succeeded.
    """
    # Compute k-nearest neighbor distances
    nn = NearestNeighbors(n_neighbors=k)
    nn.fit(X)
    distances, _ = nn.kneighbors(X)

    # k-distance is distance to k-th neighbor (last column)
    k_distances = distances[:, -1]

    # Sort in ascending order
    k_distances_sorted = np.sort(k_distances)

    # Create plot
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 6))

    n_points = len(k_distances_sorted)
    ax.plot(range(n_points), k_distances_sorted, 'b-', linewidth=1.5)
    ax.set_xlabel('Points (sorted by k-distance)', fontsize=12)
    ax.set_ylabel(f'{k}-distance', fontsize=12)
    ax.set_title(f'{k}-Distance Plot for ε Selection', fontsize=14)
    ax.grid(True, alpha=0.3)

    # Elbow detection
    elbow_epsilon = None
    if find_elbow:
        try:
            # Use KneeLocator for automatic elbow detection
            kneedle = KneeLocator(
                range(n_points),
                k_distances_sorted,
                curve='convex',
                direction='increasing',
                interp_method='polynomial'
            )
            if kneedle.elbow is not None:
                elbow_epsilon = k_distances_sorted[kneedle.elbow]
                ax.axhline(
                    y=elbow_epsilon, color='r', linestyle='--',
                    linewidth=2, label=f'Suggested ε = {elbow_epsilon:.4f}'
                )
                ax.axvline(
                    x=kneedle.elbow, color='r', linestyle=':', alpha=0.5
                )
                ax.scatter(
                    [kneedle.elbow], [elbow_epsilon],
                    color='red', s=100, zorder=5, label='Elbow point'
                )
                ax.legend()
        except Exception as e:
            print(f"Elbow detection failed: {e}")

    plt.tight_layout()
    return k_distances_sorted, elbow_epsilon


def multi_k_distance_analysis(
    X: np.ndarray,
    k_values: List[int] = [3, 5, 7, 10]
) -> dict:
    """
    Analyze k-distance plots for multiple k values.

    Returns suggested epsilon for each k and overall recommendation.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()

    results = {}
    for i, k in enumerate(k_values[:4]):
        k_dist, epsilon = k_distance_plot(X, k=k, ax=axes[i])
        results[k] = {
            'k_distances': k_dist,
            'suggested_epsilon': epsilon
        }

        # Add percentile information
        percentiles = [50, 75, 90, 95]
        for p in percentiles:
            val = np.percentile(k_dist, p)
            axes[i].axhline(y=val, color='gray', linestyle=':', alpha=0.5)
            axes[i].text(
                len(k_dist) * 0.02, val, f'P{p}={val:.3f}',
                fontsize=8, alpha=0.7
            )

    plt.tight_layout()
    plt.show()

    # Summary
    print("\nK-Distance Analysis Summary:")
    print("=" * 50)
    for k, data in results.items():
        eps = data['suggested_epsilon']
        if eps is not None:
            print(f"  k={k}: suggested ε = {eps:.4f}")
        else:
            print(f"  k={k}: elbow detection failed")

    return results


def epsilon_sensitivity_analysis(
    X: np.ndarray,
    min_pts: int = 5,
    epsilon_range: Optional[np.ndarray] = None
) -> None:
    """
    Analyze how clustering changes across epsilon values.
    """
    from sklearn.cluster import DBSCAN

    # Get k-distance for reference
    k_dist_sorted, suggested_eps = k_distance_plot(X, k=min_pts)

    if epsilon_range is None:
        # Create range based on k-distance distribution
        p25, p50, p75 = np.percentile(k_dist_sorted, [25, 50, 75])
        epsilon_range = np.linspace(p25 * 0.5, p75 * 2, 20)

    n_clusters_list = []
    n_noise_list = []
    for eps in epsilon_range:
        dbscan = DBSCAN(eps=eps, min_samples=min_pts)
        labels = dbscan.fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = np.sum(labels == -1)
        n_clusters_list.append(n_clusters)
        n_noise_list.append(n_noise)

    # Plot sensitivity
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    ax1.plot(epsilon_range, n_clusters_list, 'b-o', markersize=5)
    ax1.set_xlabel('ε')
    ax1.set_ylabel('Number of Clusters')
    ax1.set_title('Cluster Count vs. ε')
    ax1.grid(True, alpha=0.3)
    if suggested_eps is not None:
        ax1.axvline(x=suggested_eps, color='r', linestyle='--',
                    label=f'Suggested ε = {suggested_eps:.4f}')
        ax1.legend()

    ax2.plot(epsilon_range, n_noise_list, 'gray', marker='s', markersize=5)
    ax2.set_xlabel('ε')
    ax2.set_ylabel('Number of Noise Points')
    ax2.set_title('Noise Count vs. ε')
    ax2.grid(True, alpha=0.3)

    # Add stable region detection
    stable_region = []
    for i in range(1, len(n_clusters_list) - 1):
        if n_clusters_list[i-1] == n_clusters_list[i] == n_clusters_list[i+1]:
            stable_region.append(i)
    if stable_region:
        stable_eps_range = epsilon_range[stable_region]
        ax1.axvspan(
            stable_eps_range.min(), stable_eps_range.max(),
            alpha=0.2, color='green', label='Stable region'
        )
        ax1.legend()

    plt.tight_layout()
    plt.show()


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create test data
    cluster1 = np.random.normal([0, 0], 0.5, (100, 2))
    cluster2 = np.random.normal([4, 4], 0.5, (80, 2))
    cluster3 = np.random.normal([8, 0], 0.7, (60, 2))
    noise = np.random.uniform(-2, 10, (20, 2))
    X = np.vstack([cluster1, cluster2, cluster3, noise])

    # Analyze
    results = multi_k_distance_analysis(X, k_values=[4, 5, 6, 8])
    epsilon_sensitivity_analysis(X, min_pts=5)

While ε gets most attention, MinPts (or min_samples) is equally important and often determines whether clustering succeeds or fails.
Rules of Thumb:
MinPts ≥ dimensionality + 1
MinPts ≈ 2 × dimensionality (common heuristic)
Log-based rule: MinPts ≈ ln(n)
| Context | MinPts Range | Rationale |
|---|---|---|
| Clean, well-separated data | 3-5 | Low noise, clear structure |
| Moderate noise | 5-10 | Balance density vs. noise rejection |
| High noise / outliers | 10-20 | Aggressive noise filtering |
| High-dimensional data | 2×d to 4×d | Account for dimensional sparsity |
| Very large datasets | ln(n) + 5 | Statistical stability |
| Unknown / exploratory | 5 | Reasonable default starting point |
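The rules of thumb above can be combined into a single starting value. `suggest_min_pts` below is a hypothetical helper, not a library function; treat its output as a first guess to validate, never a final answer:

```python
import math

def suggest_min_pts(n_samples: int, n_dims: int, noisy: bool = False) -> int:
    """Starting MinPts from the heuristics: d + 1, 2 * d, and ln(n)."""
    candidates = [
        n_dims + 1,               # lower bound: MinPts >= dimensionality + 1
        2 * n_dims,               # common heuristic: 2 x dimensionality
        int(math.log(n_samples))  # log-based rule for statistical stability
    ]
    min_pts = max(candidates)
    if noisy:
        # High-noise data calls for more aggressive filtering (table above)
        min_pts = max(min_pts, 10)
    return min_pts

print(suggest_min_pts(1000, 2))     # small 2D dataset
print(suggest_min_pts(50_000, 10))  # larger, higher-dimensional dataset
```

Taking the maximum of the heuristics is a deliberate design choice here: each rule protects against a different failure mode, so the most conservative value is the safer starting point.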
Domain Knowledge Guidance:
The most reliable MinPts selection comes from domain understanding: ask how many similar records a genuine cluster member should have nearby, and how small a group can be before it stops being meaningful.
The Dimensionality Trap:
In high dimensions, distances become less meaningful (curse of dimensionality). This affects parameter selection: k-distance curves flatten, elbows blur, and larger MinPts values are needed to separate signal from noise; reducing dimensionality before clustering is often the better fix.
A common mistake: setting MinPts too high makes every point noise. If DBSCAN returns all or mostly noise points, first try decreasing MinPts before adjusting ε. The k-distance plot assumes MinPts is already reasonable.
HDBSCAN's parameter selection is simpler because the algorithm handles most complexity internally. Focus on:
min_cluster_size:
This is HDBSCAN's primary parameter and deserves careful selection:
Semantic meaning: 'What's the smallest meaningful cluster?'
Practical heuristics: start near 0.5–2% of the dataset size (with a floor of about 5), then raise it if the output is littered with tiny, uninteresting clusters, or lower it if known small groups disappear.
min_samples:
Often left at the default (= min_cluster_size), but can be tuned: larger values make the clustering more conservative, pushing more points into noise; smaller values are more permissive.
import numpy as np
import matplotlib.pyplot as plt
from hdbscan import HDBSCAN
from sklearn.metrics import silhouette_score
from typing import List, Tuple, Dict
import warnings


def hdbscan_parameter_search(
    X: np.ndarray,
    min_cluster_sizes: List[int] = None,
    min_samples_list: List[int] = None,
    metric: str = 'euclidean'
) -> Dict:
    """
    Grid search over HDBSCAN parameters with quality metrics.

    Parameters
    ----------
    X : ndarray
        Input data.
    min_cluster_sizes : list of int
        Values to try for min_cluster_size.
    min_samples_list : list of int
        Values to try for min_samples.

    Returns
    -------
    results : dict
        Search results with best parameters.
    """
    n_samples = X.shape[0]

    # Default search ranges
    if min_cluster_sizes is None:
        base = max(5, int(0.005 * n_samples))
        min_cluster_sizes = [
            base, base * 2, base * 4, base * 8,
            int(0.01 * n_samples), int(0.02 * n_samples)
        ]
        min_cluster_sizes = sorted(set(
            [x for x in min_cluster_sizes if x >= 5]
        ))
    if min_samples_list is None:
        min_samples_list = [None, 3, 5, 10, 15]

    results = []
    for mcs in min_cluster_sizes:
        for ms in min_samples_list:
            # ms = None means use mcs as min_samples
            actual_ms = ms if ms is not None else mcs
            if actual_ms > mcs:
                continue  # Invalid: min_samples > min_cluster_size

            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    clusterer = HDBSCAN(
                        min_cluster_size=mcs,
                        min_samples=actual_ms,
                        metric=metric
                    )
                    labels = clusterer.fit_predict(X)

                # Compute metrics
                n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
                n_noise = np.sum(labels == -1)
                noise_ratio = n_noise / n_samples

                # Silhouette score (only if 2+ clusters and some non-noise)
                if n_clusters >= 2 and n_noise < n_samples - 2:
                    mask = labels != -1
                    if len(set(labels[mask])) >= 2:
                        sil = silhouette_score(X[mask], labels[mask])
                    else:
                        sil = -1
                else:
                    sil = -1

                # Cluster size variability
                cluster_sizes = []
                for c in set(labels):
                    if c != -1:
                        cluster_sizes.append(np.sum(labels == c))
                size_std = np.std(cluster_sizes) if cluster_sizes else 0

                results.append({
                    'min_cluster_size': mcs,
                    'min_samples': actual_ms,
                    'n_clusters': n_clusters,
                    'n_noise': n_noise,
                    'noise_ratio': noise_ratio,
                    'silhouette': sil,
                    'size_std': size_std,
                    'labels': labels
                })
            except Exception as e:
                print(f"Failed for mcs={mcs}, ms={actual_ms}: {e}")

    # Find best configuration
    # Criterion: highest silhouette with reasonable noise ratio
    valid_results = [
        r for r in results
        if r['silhouette'] > 0 and r['n_clusters'] >= 2
    ]
    if valid_results:
        best = max(valid_results, key=lambda r: r['silhouette'])
    else:
        # Fallback: most clusters with reasonable noise
        best = min(results, key=lambda r: r['noise_ratio'])

    return {
        'all_results': results,
        'best': best,
        'recommendation': {
            'min_cluster_size': best['min_cluster_size'],
            'min_samples': best['min_samples']
        }
    }


def visualize_parameter_search(results: Dict, X: np.ndarray = None):
    """
    Visualize parameter search results.
    """
    all_results = results['all_results']
    best = results['best']

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Clusters vs min_cluster_size
    ax1 = axes[0, 0]
    for ms_val in sorted(set(r['min_samples'] for r in all_results)):
        subset = [r for r in all_results if r['min_samples'] == ms_val]
        subset.sort(key=lambda r: r['min_cluster_size'])
        xs = [r['min_cluster_size'] for r in subset]
        ys = [r['n_clusters'] for r in subset]
        ax1.plot(xs, ys, 'o-', label=f'min_samples={ms_val}')
    ax1.set_xlabel('min_cluster_size')
    ax1.set_ylabel('Number of Clusters')
    ax1.set_title('Cluster Count vs. Parameters')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # 2. Noise ratio vs parameters
    ax2 = axes[0, 1]
    for ms_val in sorted(set(r['min_samples'] for r in all_results)):
        subset = [r for r in all_results if r['min_samples'] == ms_val]
        subset.sort(key=lambda r: r['min_cluster_size'])
        xs = [r['min_cluster_size'] for r in subset]
        ys = [r['noise_ratio'] * 100 for r in subset]
        ax2.plot(xs, ys, 's-', label=f'min_samples={ms_val}')
    ax2.set_xlabel('min_cluster_size')
    ax2.set_ylabel('Noise Ratio (%)')
    ax2.set_title('Noise Ratio vs. Parameters')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    # 3. Silhouette score vs parameters
    ax3 = axes[1, 0]
    for ms_val in sorted(set(r['min_samples'] for r in all_results)):
        subset = [r for r in all_results if r['min_samples'] == ms_val]
        subset.sort(key=lambda r: r['min_cluster_size'])
        xs = [r['min_cluster_size'] for r in subset]
        ys = [r['silhouette'] for r in subset]
        ax3.plot(xs, ys, '^-', label=f'min_samples={ms_val}')
    ax3.set_xlabel('min_cluster_size')
    ax3.set_ylabel('Silhouette Score')
    ax3.set_title('Cluster Quality vs. Parameters')
    ax3.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax3.grid(True, alpha=0.3)
    # Highlight best
    ax3.scatter(
        [best['min_cluster_size']], [best['silhouette']],
        s=200, c='red', marker='*', zorder=10,
        label=f"Best: ({best['min_cluster_size']}, {best['min_samples']})"
    )
    ax3.legend()

    # 4. Best clustering (if X provided)
    ax4 = axes[1, 1]
    if X is not None and X.shape[1] == 2:
        scatter = ax4.scatter(
            X[:, 0], X[:, 1], c=best['labels'],
            cmap='Spectral', s=30, alpha=0.7
        )
        ax4.set_title(
            f"Best Clustering: mcs={best['min_cluster_size']}, "
            f"ms={best['min_samples']}, {best['n_clusters']} clusters"
        )
    else:
        # Show summary table
        ax4.axis('off')
        summary_text = f"""Best Parameters Found:
━━━━━━━━━━━━━━━━━━━━━━
min_cluster_size: {best['min_cluster_size']}
min_samples: {best['min_samples']}
━━━━━━━━━━━━━━━━━━━━━━
Clusters: {best['n_clusters']}
Noise points: {best['n_noise']} ({best['noise_ratio']*100:.1f}%)
Silhouette: {best['silhouette']:.3f}"""
        ax4.text(0.1, 0.5, summary_text, fontsize=12,
                 fontfamily='monospace', verticalalignment='center')

    plt.tight_layout()
    plt.show()

Once you've selected parameters, how do you know they're good? Clustering validation provides the answer.
Internal Validation Metrics:
These require only the data and labels, not ground truth:
Silhouette Score (−1 to +1) — higher is better. Compares each point's distance to its own cluster against its distance to the nearest other cluster; compute it on non-noise points only.
Davies-Bouldin Index (0 to ∞) — lower is better. Averages, over clusters, the worst-case ratio of within-cluster scatter to between-cluster separation.
Calinski-Harabasz Index (0 to ∞) — higher is better. Ratio of between-cluster dispersion to within-cluster dispersion.
DBCV (Density-Based Clustering Validation) — ranges from −1 to +1, higher is better. Designed specifically for density-based results, so unlike the three metrics above it handles arbitrary cluster shapes and noise points gracefully.
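The first three metrics are available in scikit-learn and can be computed together. The helper below (an illustrative sketch, not a standard API) excludes noise points first, since these metrics are undefined or misleading when noise labels are mixed in:

```python
import numpy as np
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score)

def evaluate_clustering(X, labels):
    """Compute internal validation metrics, ignoring noise points (label -1)."""
    labels = np.asarray(labels)
    mask = labels != -1
    X_c, labels_c = X[mask], labels[mask]
    if len(set(labels_c)) < 2:
        return None  # metrics are undefined for fewer than two clusters
    return {
        'silhouette': silhouette_score(X_c, labels_c),                # higher is better
        'davies_bouldin': davies_bouldin_score(X_c, labels_c),        # lower is better
        'calinski_harabasz': calinski_harabasz_score(X_c, labels_c),  # higher is better
    }

# Example on two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
scores = evaluate_clustering(X, labels)
print(scores)
```

For this clean two-blob example, expect a high silhouette and a low Davies-Bouldin value; the point of the helper is the noise mask, which the raw scikit-learn calls do not apply for you.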
Stability-Based Validation:
A robust clustering should be stable under: subsampling of the data, small perturbations of the points, and small changes to the parameter values.
Practical Stability Test:
import numpy as np
from sklearn.metrics import adjusted_rand_score

def stability_test(X, clusterer_fn, n_trials=20, subsample_ratio=0.8):
    """
    Test clustering stability via subsampling.

    clusterer_fn takes a data array and returns a label array.
    Returns the mean Adjusted Rand Index across pairs of runs,
    computed on the points each pair of subsamples shares.
    """
    n_samples = len(X)
    runs = []
    for _ in range(n_trials):
        # Random subsample
        indices = np.random.choice(
            n_samples,
            int(n_samples * subsample_ratio),
            replace=False
        )
        # Cluster the subsample, remembering which original point got which label
        labels = clusterer_fn(X[indices])
        runs.append(dict(zip(indices, labels)))
    # Compute pairwise agreement:
    # points that land in the same cluster across runs → stable
    scores = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            shared = sorted(set(runs[i]) & set(runs[j]))
            if len(shared) < 2:
                continue
            labels_i = [runs[i][p] for p in shared]
            labels_j = [runs[j][p] for p in shared]
            scores.append(adjusted_rand_score(labels_i, labels_j))
    # Values near 1.0 indicate stable clustering; low values suggest fragile parameters
    return float(np.mean(scores)) if scores else 0.0
The ultimate test: show clusters to a domain expert. Do they recognize the groups? Are there expected clusters that are missing? Are there strange groupings that don't make sense? Domain validation often reveals parameter issues that metrics miss.
Parameter selection often involves diagnosing and fixing common problems. Here's a troubleshooting guide:
Problem: All points labeled as noise
Causes and solutions: ε too small for the data scale (read a larger value off the k-distance plot), MinPts too high (lower it first, as the tip above suggests), or unscaled features inflating all distances (standardize, then retune).
Problem: Everything in one cluster
Causes and solutions: ε too large, so sparse bridge points chain distinct groups together (decrease ε), or MinPts too low to reject those bridges (increase it).
| Symptom | Likely Cause | First Action |
|---|---|---|
| All noise | ε too small | Check k-distance plot, increase ε |
| One giant cluster | ε too large | Decrease ε |
| Too many small clusters | MinPts too small | Increase MinPts |
| Known clusters missing | ε or MinPts too strict | Relax parameters |
| Clusters merge that shouldn't | ε too large | Decrease ε or use HDBSCAN |
| Border points seem misassigned | ε at cluster interface | Try different ε or accept ambiguity |
| Different runs give different results | Border point order dependence | Use deterministic variant or HDBSCAN |
| High-D data gives poor results | Curse of dimensionality | Reduce dimensions first |
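The first rows of this table can be checked programmatically before you even plot anything. `diagnose_labels` is a hypothetical helper with illustrative thresholds, mapping a label array to the symptoms above:

```python
import numpy as np

def diagnose_labels(labels, noise_threshold=0.9, fragment_threshold=10):
    """Map a DBSCAN label array to a symptom from the troubleshooting table."""
    labels = np.asarray(labels)
    noise_ratio = float(np.mean(labels == -1))
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    if noise_ratio >= noise_threshold:
        return "all noise -- check k-distance plot, increase eps"
    if n_clusters <= 1 and noise_ratio < 0.05:
        return "one giant cluster -- decrease eps"
    if n_clusters >= fragment_threshold:
        return "many small clusters -- increase MinPts"
    return "no obvious pathology -- validate with internal metrics"

print(diagnose_labels([-1] * 100))       # everything rejected
print(diagnose_labels([0] * 100))        # everything absorbed
print(diagnose_labels(list(range(20))))  # heavy fragmentation
```

The thresholds (90% noise, 10 fragments) are arbitrary defaults; adjust them to what counts as pathological in your domain.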
Problem: Clusters have very different densities
Single ε can't capture variable density
Solutions: switch to OPTICS or HDBSCAN, which handle variable density natively; cluster dense and sparse regions separately with different parameters; or transform the feature space to even out densities.
Problem: Results sensitive to ε choice
Small ε changes dramatically change results
This suggests: there is no clear density gap between clusters and noise, so the k-distance elbow is ambiguous. Prefer HDBSCAN's density hierarchy in this situation rather than committing to a single ε.
Many parameter selection issues stem from preprocessing, not the algorithm. Always check: (1) Are features scaled appropriately? (2) Is the distance metric suitable for your features? (3) Are there outliers skewing the distance distribution? Fix preprocessing first, then tune parameters.
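Check (1) is the most frequent culprit. The sketch below puts one feature on a scale a thousand times larger than the other and compares the median k-distance before and after scikit-learn's StandardScaler; the raw distances are dominated entirely by the large-scale feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# One feature ~N(0, 1), one ~N(0, 1000): wildly different scales
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

def median_k_distance(X, k=5):
    """Median distance to the k-th neighbor, as in the k-distance plot."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)
    return float(np.median(distances[:, -1]))

raw = median_k_distance(X)
scaled = median_k_distance(StandardScaler().fit_transform(X))
print(f"median 5-distance, raw: {raw:.1f}, scaled: {scaled:.3f}")
```

Any ε tuned on the raw data would effectively ignore the small-scale feature, so scaling must happen before the k-distance plot, not after.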
Here's a systematic workflow for parameter selection in practice:
Step 1: Data Preparation. Scale features, pick a distance metric that suits them, and screen for extreme outliers (see the preprocessing tip above).
Step 2: Initial Exploration (HDBSCAN)
Start with HDBSCAN for initial exploration—fewer parameters, more robust:
from hdbscan import HDBSCAN
# Conservative start
clusterer = HDBSCAN(
min_cluster_size=max(10, int(0.01 * len(X))),
min_samples=5
)
clusterer.fit(X)
# Visualize condensed tree
clusterer.condensed_tree_.plot(select_clusters=True)
Step 3: Refinement via Analysis. Read the condensed tree: raise min_cluster_size if spurious micro-clusters appear, lower it if expected groups are missing, and adjust min_samples if too much or too little is labeled noise.
Step 4: DBSCAN Analysis (if needed)
If you need DBSCAN specifically (for compatibility, speed, or simplicity):
Fix MinPts based on dimension and noise level:
Use k-distance plot to find ε:
k_distances, suggested_eps = k_distance_plot(X, k=MinPts)
Sensitivity analysis around suggested ε
Validate with internal metrics and domain review
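The DBSCAN steps above compress into a few lines. As a stand-in for visually reading the elbow, this sketch uses the 90th percentile of the k-distances (a crude assumption, not the canonical elbow method):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0, 0], 0.5, (100, 2)),
    rng.normal([4, 4], 0.5, (80, 2)),
    rng.uniform(-2, 6, (15, 2)),
])

# Step 1: scale features
X = StandardScaler().fit_transform(X)

# Fix MinPts from dimensionality (2 x d heuristic, floor of 4)
min_pts = max(4, 2 * X.shape[1])

# Approximate the k-distance elbow with a high percentile
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
k_dist = np.sort(nn.kneighbors(X)[0][:, -1])
eps = float(np.percentile(k_dist, 90))

# Cluster and summarize
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps={eps:.3f}, clusters={n_clusters}, "
      f"noise={int(np.sum(labels == -1))}")
```

For final results, replace the percentile shortcut with an inspected k-distance plot and run the validation steps below.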
Step 5: Validation and Iteration
After selecting parameters: compute internal validation metrics, run a stability test, review the clusters with a domain expert, and iterate on any parameter that fails these checks.
Parameter selection transforms density-based clustering from black-box experimentation into principled analysis: read ε from the k-distance plot, set MinPts from dimensionality and noise level, search HDBSCAN's parameter grid with quality metrics, and confirm every choice with validation and domain review.
Module Complete:
You have now completed the comprehensive module on Density-Based Clustering. You understand how density-based methods define clusters, how DBSCAN, OPTICS, and HDBSCAN differ, and how to select and validate their parameters.
These tools form a powerful toolkit for discovering structure in data without assuming cluster shapes or counts.
Congratulations! You have mastered density-based clustering from theory to practice. You can now effectively apply DBSCAN, OPTICS, and HDBSCAN to real-world data, select appropriate parameters, and validate your results. This knowledge forms a critical component of modern unsupervised learning expertise.