Density-based clustering algorithms promise to discover clusters of arbitrary shape without specifying the number of clusters. But this flexibility comes with its own challenge: parameter selection. The quality of clustering results depends critically on choosing appropriate values for ε and MinPts (DBSCAN), ε_max (OPTICS), and min_cluster_size and min_samples (HDBSCAN).
Poorly chosen parameters lead to real clusters dissolving into noise, distinct groups merging into one, or meaningful structure fragmenting into many tiny clusters.
This page provides a comprehensive toolkit for systematic, principled parameter selection—transforming parameter tuning from guesswork into informed decision-making.
By the end of this page, you will master the k-distance plot method for ε selection, understand how domain knowledge guides MinPts choice, learn validation strategies for parameter assessment, and gain practical workflows for parameter optimization across DBSCAN, OPTICS, and HDBSCAN.
Before diving into selection methods, let's clearly understand what each parameter controls and how they interact.
DBSCAN Parameters:
ε (epsilon) — Neighborhood radius
MinPts — Core point threshold
| ε \ MinPts | Low MinPts | High MinPts |
|---|---|---|
| Small ε | Many small clusters, some noise included | Most points become noise |
| Large ε | Few huge clusters, includes noise | Moderate clusters, good noise rejection |
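This interaction is easy to observe empirically. The sketch below (illustrative parameter values on a toy dataset, using scikit-learn's DBSCAN) runs each corner of the table and counts the resulting clusters and noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight blobs plus uniform background noise
X = np.vstack([
    rng.normal([0, 0], 0.3, (100, 2)),
    rng.normal([5, 5], 0.3, (100, 2)),
    rng.uniform(-2, 7, (20, 2)),
])

results = {}
for eps, min_samples in [(0.2, 3), (0.2, 15), (2.0, 3), (2.0, 15)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[(eps, min_samples)] = (n_clusters, n_noise)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

The small-ε, high-MinPts corner should report the most noise, matching the table: raising MinPts can only shrink the set of core points, and shrinking ε can only remove reachability edges.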
OPTICS Parameters:
ε_max — Maximum neighborhood radius
MinPts — Same role as DBSCAN
Extraction parameters — ε threshold or Xi
HDBSCAN Parameters:
min_cluster_size — Minimum cluster membership
min_samples — Core distance neighbor count
Think of ε as 'how far to look' and MinPts as 'how crowded it must be.' Finding clusters requires: (1) ε matches the natural point spacing within clusters, and (2) MinPts matches the typical neighbor count in cluster interiors. These constraints should be satisfied simultaneously.
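One way to see that the two parameters act together: in 2D, the pair (ε, MinPts) jointly asserts a minimum density of MinPts/(πε²) points per unit area. This quick back-of-envelope calculation (an illustrative helper, not part of any library) shows two different parameter pairs encoding the same density threshold:

```python
import math

def implied_density_2d(eps: float, min_pts: int) -> float:
    """Minimum point density (points per unit area) implied by (eps, MinPts) in 2D."""
    return min_pts / (math.pi * eps ** 2)

# Different pairs, same density requirement
print(implied_density_2d(0.5, 5))
print(implied_density_2d(1.0, 20))
```

Both lines print about 6.37 points per unit area, which is why ε and MinPts must be chosen jointly rather than one at a time.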
The k-distance plot (also called k-NN distance plot) is the canonical method for selecting ε in DBSCAN. Introduced in the original DBSCAN paper, it provides a visual heuristic for finding the natural density threshold.
The Method:
1. Choose k, typically equal to MinPts (or MinPts − 1).
2. For each point, compute the distance to its k-th nearest neighbor.
3. Sort these k-distances in ascending order and plot them.
4. Find the 'elbow' where the curve bends sharply upward; the k-distance at the elbow is the candidate ε.
Interpretation:
The k-distance plot typically shows a long, relatively flat region followed by a sharp rise.
In well-clustered data, cluster-interior points have similar k-distances (small, relatively uniform). Noise and boundary points have larger k-distances. The elbow marks the transition: below it, points have enough neighbors to be core; above it, they don't. Setting ε at the elbow captures cluster structure while rejecting noise.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator
from typing import Tuple, Optional, List


def k_distance_plot(
    X: np.ndarray,
    k: int = 5,
    find_elbow: bool = True,
    ax: Optional[plt.Axes] = None
) -> Tuple[np.ndarray, Optional[float]]:
    """
    Create k-distance plot for DBSCAN epsilon selection.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.
    k : int
        Number of neighbors (typically MinPts or MinPts - 1).
    find_elbow : bool
        Whether to automatically detect the elbow.
    ax : matplotlib Axes, optional
        Axes to plot on.

    Returns
    -------
    k_distances_sorted : ndarray
        Sorted k-distances.
    elbow_epsilon : float or None
        Suggested epsilon if elbow detection succeeded.
    """
    # Compute k-nearest neighbor distances
    nn = NearestNeighbors(n_neighbors=k)
    nn.fit(X)
    distances, _ = nn.kneighbors(X)

    # k-distance is distance to k-th neighbor (last column)
    k_distances = distances[:, -1]

    # Sort in ascending order
    k_distances_sorted = np.sort(k_distances)

    # Create plot
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 6))

    n_points = len(k_distances_sorted)
    ax.plot(range(n_points), k_distances_sorted, 'b-', linewidth=1.5)
    ax.set_xlabel('Points (sorted by k-distance)', fontsize=12)
    ax.set_ylabel(f'{k}-distance', fontsize=12)
    ax.set_title(f'{k}-Distance Plot for ε Selection', fontsize=14)
    ax.grid(True, alpha=0.3)

    # Elbow detection
    elbow_epsilon = None
    if find_elbow:
        try:
            # Use KneeLocator for automatic elbow detection
            kneedle = KneeLocator(
                range(n_points),
                k_distances_sorted,
                curve='convex',
                direction='increasing',
                interp_method='polynomial'
            )
            if kneedle.elbow is not None:
                elbow_epsilon = k_distances_sorted[kneedle.elbow]
                ax.axhline(
                    y=elbow_epsilon, color='r', linestyle='--',
                    linewidth=2, label=f'Suggested ε = {elbow_epsilon:.4f}'
                )
                ax.axvline(
                    x=kneedle.elbow, color='r', linestyle=':', alpha=0.5
                )
                ax.scatter(
                    [kneedle.elbow], [elbow_epsilon],
                    color='red', s=100, zorder=5, label='Elbow point'
                )
                ax.legend()
        except Exception as e:
            print(f"Elbow detection failed: {e}")

    plt.tight_layout()
    return k_distances_sorted, elbow_epsilon


def multi_k_distance_analysis(
    X: np.ndarray,
    k_values: List[int] = [3, 5, 7, 10]
) -> dict:
    """
    Analyze k-distance plots for multiple k values.

    Returns suggested epsilon for each k and overall recommendation.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()

    results = {}
    for i, k in enumerate(k_values[:4]):
        k_dist, epsilon = k_distance_plot(X, k=k, ax=axes[i])
        results[k] = {
            'k_distances': k_dist,
            'suggested_epsilon': epsilon
        }

        # Add percentile information
        percentiles = [50, 75, 90, 95]
        for p in percentiles:
            val = np.percentile(k_dist, p)
            axes[i].axhline(y=val, color='gray', linestyle=':', alpha=0.5)
            axes[i].text(
                len(k_dist) * 0.02, val, f'P{p}={val:.3f}',
                fontsize=8, alpha=0.7
            )

    plt.tight_layout()
    plt.show()

    # Summary
    print("\nK-Distance Analysis Summary:")
    print("=" * 50)
    for k, data in results.items():
        eps = data['suggested_epsilon']
        if eps is not None:
            print(f"  k={k}: suggested ε = {eps:.4f}")
        else:
            print(f"  k={k}: elbow detection failed")

    return results


def epsilon_sensitivity_analysis(
    X: np.ndarray,
    min_pts: int = 5,
    epsilon_range: Optional[np.ndarray] = None
) -> None:
    """
    Analyze how clustering changes across epsilon values.
    """
    from sklearn.cluster import DBSCAN

    # Get k-distance for reference
    k_dist_sorted, suggested_eps = k_distance_plot(X, k=min_pts)

    if epsilon_range is None:
        # Create range based on k-distance distribution
        p25, p50, p75 = np.percentile(k_dist_sorted, [25, 50, 75])
        epsilon_range = np.linspace(p25 * 0.5, p75 * 2, 20)

    n_clusters_list = []
    n_noise_list = []
    for eps in epsilon_range:
        dbscan = DBSCAN(eps=eps, min_samples=min_pts)
        labels = dbscan.fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = np.sum(labels == -1)
        n_clusters_list.append(n_clusters)
        n_noise_list.append(n_noise)

    # Plot sensitivity
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    ax1.plot(epsilon_range, n_clusters_list, 'b-o', markersize=5)
    ax1.set_xlabel('ε')
    ax1.set_ylabel('Number of Clusters')
    ax1.set_title('Cluster Count vs. ε')
    ax1.grid(True, alpha=0.3)
    if suggested_eps is not None:
        ax1.axvline(x=suggested_eps, color='r', linestyle='--',
                    label=f'Suggested ε = {suggested_eps:.4f}')
        ax1.legend()

    ax2.plot(epsilon_range, n_noise_list, 'gray', marker='s', markersize=5)
    ax2.set_xlabel('ε')
    ax2.set_ylabel('Number of Noise Points')
    ax2.set_title('Noise Count vs. ε')
    ax2.grid(True, alpha=0.3)

    # Add stable region detection
    stable_region = []
    for i in range(1, len(n_clusters_list) - 1):
        if n_clusters_list[i-1] == n_clusters_list[i] == n_clusters_list[i+1]:
            stable_region.append(i)
    if stable_region:
        stable_eps_range = epsilon_range[stable_region]
        ax1.axvspan(
            stable_eps_range.min(), stable_eps_range.max(),
            alpha=0.2, color='green', label='Stable region'
        )
        ax1.legend()

    plt.tight_layout()
    plt.show()


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create test data
    cluster1 = np.random.normal([0, 0], 0.5, (100, 2))
    cluster2 = np.random.normal([4, 4], 0.5, (80, 2))
    cluster3 = np.random.normal([8, 0], 0.7, (60, 2))
    noise = np.random.uniform(-2, 10, (20, 2))
    X = np.vstack([cluster1, cluster2, cluster3, noise])

    # Analyze
    results = multi_k_distance_analysis(X, k_values=[4, 5, 6, 8])
    epsilon_sensitivity_analysis(X, min_pts=5)

While ε gets most attention, MinPts (or min_samples) is equally important and often determines whether clustering succeeds or fails.
Rules of Thumb:
MinPts ≥ dimensionality + 1
MinPts ≈ 2 × dimensionality (common heuristic)
Log-based rule: MinPts ≈ ln(n)
| Context | MinPts Range | Rationale |
|---|---|---|
| Clean, well-separated data | 3-5 | Low noise, clear structure |
| Moderate noise | 5-10 | Balance density vs. noise rejection |
| High noise / outliers | 10-20 | Aggressive noise filtering |
| High-dimensional data | 2×d to 4×d | Account for dimensional sparsity |
| Very large datasets | ln(n) + 5 | Statistical stability |
| Unknown / exploratory | 5 | Reasonable default starting point |
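The rules of thumb above can be combined into a single starting value. `suggest_min_pts` below is a hypothetical helper, not a library function; treat its output as a first guess to validate, never a final answer:

```python
import math

def suggest_min_pts(n_samples: int, n_dims: int, noisy: bool = False) -> int:
    """Starting MinPts from the heuristics: d + 1, 2 * d, and ln(n)."""
    candidates = [
        n_dims + 1,               # lower bound: MinPts >= dimensionality + 1
        2 * n_dims,               # common heuristic: 2 x dimensionality
        int(math.log(n_samples))  # log-based rule for statistical stability
    ]
    min_pts = max(candidates)
    if noisy:
        # High-noise data calls for more aggressive filtering (table above)
        min_pts = max(min_pts, 10)
    return min_pts

print(suggest_min_pts(1000, 2))     # small 2D dataset
print(suggest_min_pts(50_000, 10))  # larger, higher-dimensional dataset
```

Taking the maximum of the heuristics is a deliberate design choice here: each rule protects against a different failure mode, so the most conservative value is the safer starting point.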
Domain Knowledge Guidance:
The most reliable MinPts selection comes from domain understanding: ask how many similar records a genuine cluster member should have nearby, and how small a group can be before it stops being meaningful.
The Dimensionality Trap:
In high dimensions, distances become less meaningful (curse of dimensionality). This affects parameter selection: k-distance curves flatten, elbows blur, and larger MinPts values are needed to separate signal from noise; reducing dimensionality before clustering is often the better fix.
A common mistake: setting MinPts too high makes every point noise. If DBSCAN returns all or mostly noise points, first try decreasing MinPts before adjusting ε. The k-distance plot assumes MinPts is already reasonable.
HDBSCAN's parameter selection is simpler because the algorithm handles most complexity internally. Focus on:
min_cluster_size:
This is HDBSCAN's primary parameter and deserves careful selection:
Semantic meaning: 'What's the smallest meaningful cluster?'
Practical heuristics: start near 0.5–2% of the dataset size (with a floor of about 5), then raise it if the output is littered with tiny, uninteresting clusters, or lower it if known small groups disappear.
min_samples:
Often left at the default (= min_cluster_size), but can be tuned: larger values make the clustering more conservative, pushing more points into noise; smaller values are more permissive.
import numpy as np
import matplotlib.pyplot as plt
from hdbscan import HDBSCAN
from sklearn.metrics import silhouette_score
from typing import List, Tuple, Dict
import warnings


def hdbscan_parameter_search(
    X: np.ndarray,
    min_cluster_sizes: List[int] = None,
    min_samples_list: List[int] = None,
    metric: str = 'euclidean'
) -> Dict:
    """
    Grid search over HDBSCAN parameters with quality metrics.

    Parameters
    ----------
    X : ndarray
        Input data.
    min_cluster_sizes : list of int
        Values to try for min_cluster_size.
    min_samples_list : list of int
        Values to try for min_samples.

    Returns
    -------
    results : dict
        Search results with best parameters.
    """
    n_samples = X.shape[0]

    # Default search ranges
    if min_cluster_sizes is None:
        base = max(5, int(0.005 * n_samples))
        min_cluster_sizes = [
            base, base * 2, base * 4, base * 8,
            int(0.01 * n_samples), int(0.02 * n_samples)
        ]
        min_cluster_sizes = sorted(set(
            [x for x in min_cluster_sizes if x >= 5]
        ))
    if min_samples_list is None:
        min_samples_list = [None, 3, 5, 10, 15]

    results = []
    for mcs in min_cluster_sizes:
        for ms in min_samples_list:
            # ms = None means use mcs as min_samples
            actual_ms = ms if ms is not None else mcs
            if actual_ms > mcs:
                continue  # Invalid: min_samples > min_cluster_size

            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    clusterer = HDBSCAN(
                        min_cluster_size=mcs,
                        min_samples=actual_ms,
                        metric=metric
                    )
                    labels = clusterer.fit_predict(X)

                # Compute metrics
                n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
                n_noise = np.sum(labels == -1)
                noise_ratio = n_noise / n_samples

                # Silhouette score (only if 2+ clusters and some non-noise)
                if n_clusters >= 2 and n_noise < n_samples - 2:
                    mask = labels != -1
                    if len(set(labels[mask])) >= 2:
                        sil = silhouette_score(X[mask], labels[mask])
                    else:
                        sil = -1
                else:
                    sil = -1

                # Cluster size variability
                cluster_sizes = []
                for c in set(labels):
                    if c != -1:
                        cluster_sizes.append(np.sum(labels == c))
                size_std = np.std(cluster_sizes) if cluster_sizes else 0

                results.append({
                    'min_cluster_size': mcs,
                    'min_samples': actual_ms,
                    'n_clusters': n_clusters,
                    'n_noise': n_noise,
                    'noise_ratio': noise_ratio,
                    'silhouette': sil,
                    'size_std': size_std,
                    'labels': labels
                })
            except Exception as e:
                print(f"Failed for mcs={mcs}, ms={actual_ms}: {e}")

    # Find best configuration
    # Criterion: highest silhouette with reasonable noise ratio
    valid_results = [
        r for r in results
        if r['silhouette'] > 0 and r['n_clusters'] >= 2
    ]
    if valid_results:
        best = max(valid_results, key=lambda r: r['silhouette'])
    else:
        # Fallback: most clusters with reasonable noise
        best = min(results, key=lambda r: r['noise_ratio'])

    return {
        'all_results': results,
        'best': best,
        'recommendation': {
            'min_cluster_size': best['min_cluster_size'],
            'min_samples': best['min_samples']
        }
    }


def visualize_parameter_search(results: Dict, X: np.ndarray = None):
    """
    Visualize parameter search results.
    """
    all_results = results['all_results']
    best = results['best']

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Clusters vs min_cluster_size
    ax1 = axes[0, 0]
    for ms_val in sorted(set(r['min_samples'] for r in all_results)):
        subset = [r for r in all_results if r['min_samples'] == ms_val]
        subset.sort(key=lambda r: r['min_cluster_size'])
        xs = [r['min_cluster_size'] for r in subset]
        ys = [r['n_clusters'] for r in subset]
        ax1.plot(xs, ys, 'o-', label=f'min_samples={ms_val}')
    ax1.set_xlabel('min_cluster_size')
    ax1.set_ylabel('Number of Clusters')
    ax1.set_title('Cluster Count vs. Parameters')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # 2. Noise ratio vs parameters
    ax2 = axes[0, 1]
    for ms_val in sorted(set(r['min_samples'] for r in all_results)):
        subset = [r for r in all_results if r['min_samples'] == ms_val]
        subset.sort(key=lambda r: r['min_cluster_size'])
        xs = [r['min_cluster_size'] for r in subset]
        ys = [r['noise_ratio'] * 100 for r in subset]
        ax2.plot(xs, ys, 's-', label=f'min_samples={ms_val}')
    ax2.set_xlabel('min_cluster_size')
    ax2.set_ylabel('Noise Ratio (%)')
    ax2.set_title('Noise Ratio vs. Parameters')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    # 3. Silhouette score vs parameters
    ax3 = axes[1, 0]
    for ms_val in sorted(set(r['min_samples'] for r in all_results)):
        subset = [r for r in all_results if r['min_samples'] == ms_val]
        subset.sort(key=lambda r: r['min_cluster_size'])
        xs = [r['min_cluster_size'] for r in subset]
        ys = [r['silhouette'] for r in subset]
        ax3.plot(xs, ys, '^-', label=f'min_samples={ms_val}')
    ax3.set_xlabel('min_cluster_size')
    ax3.set_ylabel('Silhouette Score')
    ax3.set_title('Cluster Quality vs. Parameters')
    ax3.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax3.grid(True, alpha=0.3)
    # Highlight best
    ax3.scatter(
        [best['min_cluster_size']], [best['silhouette']],
        s=200, c='red', marker='*', zorder=10,
        label=f"Best: ({best['min_cluster_size']}, {best['min_samples']})"
    )
    ax3.legend()

    # 4. Best clustering (if X provided)
    ax4 = axes[1, 1]
    if X is not None and X.shape[1] == 2:
        scatter = ax4.scatter(
            X[:, 0], X[:, 1], c=best['labels'],
            cmap='Spectral', s=30, alpha=0.7
        )
        ax4.set_title(
            f"Best Clustering: mcs={best['min_cluster_size']}, "
            f"ms={best['min_samples']}, {best['n_clusters']} clusters"
        )
    else:
        # Show summary table
        ax4.axis('off')
        summary_text = f"""Best Parameters Found:
━━━━━━━━━━━━━━━━━━━━━━
min_cluster_size: {best['min_cluster_size']}
min_samples: {best['min_samples']}
━━━━━━━━━━━━━━━━━━━━━━
Clusters: {best['n_clusters']}
Noise points: {best['n_noise']} ({best['noise_ratio']*100:.1f}%)
Silhouette: {best['silhouette']:.3f}"""
        ax4.text(0.1, 0.5, summary_text, fontsize=12,
                 fontfamily='monospace', verticalalignment='center')

    plt.tight_layout()
    plt.show()

Once you've selected parameters, how do you know they're good? Clustering validation provides the answer.
Internal Validation Metrics:
These require only the data and labels, not ground truth:
Silhouette Score (−1 to +1) — higher is better. Compares each point's distance to its own cluster against its distance to the nearest other cluster; compute it on non-noise points only.
Davies-Bouldin Index (0 to ∞) — lower is better. Averages, over clusters, the worst-case ratio of within-cluster scatter to between-cluster separation.
Calinski-Harabasz Index (0 to ∞) — higher is better. Ratio of between-cluster dispersion to within-cluster dispersion.
DBCV (Density-Based Clustering Validation) — ranges from −1 to +1, higher is better. Designed specifically for density-based results, so unlike the three metrics above it handles arbitrary cluster shapes and noise points gracefully.
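The first three metrics are available in scikit-learn and can be computed together. The helper below (an illustrative sketch, not a standard API) excludes noise points first, since these metrics are undefined or misleading when noise labels are mixed in:

```python
import numpy as np
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score)

def evaluate_clustering(X, labels):
    """Compute internal validation metrics, ignoring noise points (label -1)."""
    labels = np.asarray(labels)
    mask = labels != -1
    X_c, labels_c = X[mask], labels[mask]
    if len(set(labels_c)) < 2:
        return None  # metrics are undefined for fewer than two clusters
    return {
        'silhouette': silhouette_score(X_c, labels_c),                # higher is better
        'davies_bouldin': davies_bouldin_score(X_c, labels_c),        # lower is better
        'calinski_harabasz': calinski_harabasz_score(X_c, labels_c),  # higher is better
    }

# Example on two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
scores = evaluate_clustering(X, labels)
print(scores)
```

For this clean two-blob example, expect a high silhouette and a low Davies-Bouldin value; the point of the helper is the noise mask, which the raw scikit-learn calls do not apply for you.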
Stability-Based Validation:
A robust clustering should be stable under: subsampling of the data, small perturbations of the points, and small changes to the parameter values.
Practical Stability Test:
import numpy as np
from sklearn.metrics import adjusted_rand_score

def stability_test(X, clusterer_fn, n_trials=20, subsample_ratio=0.8):
    """
    Test clustering stability via subsampling.

    clusterer_fn takes a data array and returns a label array.
    Returns the mean Adjusted Rand Index across pairs of runs,
    computed on the points each pair of subsamples shares.
    """
    n_samples = len(X)
    runs = []
    for _ in range(n_trials):
        # Random subsample
        indices = np.random.choice(
            n_samples,
            int(n_samples * subsample_ratio),
            replace=False
        )
        # Cluster the subsample, remembering which original point got which label
        labels = clusterer_fn(X[indices])
        runs.append(dict(zip(indices, labels)))
    # Compute pairwise agreement:
    # points that land in the same cluster across runs → stable
    scores = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            shared = sorted(set(runs[i]) & set(runs[j]))
            if len(shared) < 2:
                continue
            labels_i = [runs[i][p] for p in shared]
            labels_j = [runs[j][p] for p in shared]
            scores.append(adjusted_rand_score(labels_i, labels_j))
    # Values near 1.0 indicate stable clustering; low values suggest fragile parameters
    return float(np.mean(scores)) if scores else 0.0
The ultimate test: show clusters to a domain expert. Do they recognize the groups? Are there expected clusters that are missing? Are there strange groupings that don't make sense? Domain validation often reveals parameter issues that metrics miss.
Parameter selection often involves diagnosing and fixing common problems. Here's a troubleshooting guide:
Problem: All points labeled as noise
Causes and solutions: ε too small for the data scale (read a larger value off the k-distance plot), MinPts too high (lower it first, as the tip above suggests), or unscaled features inflating all distances (standardize, then retune).
Problem: Everything in one cluster
Causes and solutions: ε too large, so sparse bridge points chain distinct groups together (decrease ε), or MinPts too low to reject those bridges (increase it).
| Symptom | Likely Cause | First Action |
|---|---|---|
| All noise | ε too small | Check k-distance plot, increase ε |
| One giant cluster | ε too large | Decrease ε |
| Too many small clusters | MinPts too small | Increase MinPts |
| Known clusters missing | ε or MinPts too strict | Relax parameters |
| Clusters merge that shouldn't | ε too large | Decrease ε or use HDBSCAN |
| Border points seem misassigned | ε at cluster interface | Try different ε or accept ambiguity |
| Different runs give different results | Border point order dependence | Use deterministic variant or HDBSCAN |
| High-D data gives poor results | Curse of dimensionality | Reduce dimensions first |
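The first rows of this table can be checked programmatically before you even plot anything. `diagnose_labels` is a hypothetical helper with illustrative thresholds, mapping a label array to the symptoms above:

```python
import numpy as np

def diagnose_labels(labels, noise_threshold=0.9, fragment_threshold=10):
    """Map a DBSCAN label array to a symptom from the troubleshooting table."""
    labels = np.asarray(labels)
    noise_ratio = float(np.mean(labels == -1))
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    if noise_ratio >= noise_threshold:
        return "all noise -- check k-distance plot, increase eps"
    if n_clusters <= 1 and noise_ratio < 0.05:
        return "one giant cluster -- decrease eps"
    if n_clusters >= fragment_threshold:
        return "many small clusters -- increase MinPts"
    return "no obvious pathology -- validate with internal metrics"

print(diagnose_labels([-1] * 100))       # everything rejected
print(diagnose_labels([0] * 100))        # everything absorbed
print(diagnose_labels(list(range(20))))  # heavy fragmentation
```

The thresholds (90% noise, 10 fragments) are arbitrary defaults; adjust them to what counts as pathological in your domain.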
Problem: Clusters have very different densities
Single ε can't capture variable density
Solutions: switch to OPTICS or HDBSCAN, which handle variable density natively; cluster dense and sparse regions separately with different parameters; or transform the feature space to even out densities.
Problem: Results sensitive to ε choice
Small ε changes dramatically change results
This suggests: there is no clear density gap between clusters and noise, so the k-distance elbow is ambiguous. Prefer HDBSCAN's density hierarchy in this situation rather than committing to a single ε.
Many parameter selection issues stem from preprocessing, not the algorithm. Always check: (1) Are features scaled appropriately? (2) Is the distance metric suitable for your features? (3) Are there outliers skewing the distance distribution? Fix preprocessing first, then tune parameters.
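Check (1) is the most frequent culprit. The sketch below puts one feature on a scale a thousand times larger than the other and compares the median k-distance before and after scikit-learn's StandardScaler; the raw distances are dominated entirely by the large-scale feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# One feature ~N(0, 1), one ~N(0, 1000): wildly different scales
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

def median_k_distance(X, k=5):
    """Median distance to the k-th neighbor, as in the k-distance plot."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)
    return float(np.median(distances[:, -1]))

raw = median_k_distance(X)
scaled = median_k_distance(StandardScaler().fit_transform(X))
print(f"median 5-distance, raw: {raw:.1f}, scaled: {scaled:.3f}")
```

Any ε tuned on the raw data would effectively ignore the small-scale feature, so scaling must happen before the k-distance plot, not after.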
Here's a systematic workflow for parameter selection in practice:
Step 1: Data Preparation. Scale features, pick a distance metric that suits them, and screen for extreme outliers (see the preprocessing tip above).
Step 2: Initial Exploration (HDBSCAN)
Start with HDBSCAN for initial exploration—fewer parameters, more robust:
from hdbscan import HDBSCAN
# Conservative start
clusterer = HDBSCAN(
min_cluster_size=max(10, int(0.01 * len(X))),
min_samples=5
)
clusterer.fit(X)
# Visualize condensed tree
clusterer.condensed_tree_.plot(select_clusters=True)
Step 3: Refinement via Analysis. Read the condensed tree: raise min_cluster_size if spurious micro-clusters appear, lower it if expected groups are missing, and adjust min_samples if too much or too little is labeled noise.
Step 4: DBSCAN Analysis (if needed)
If you need DBSCAN specifically (for compatibility, speed, or simplicity):
Fix MinPts based on dimension and noise level:
Use k-distance plot to find ε:
k_distances, suggested_eps = k_distance_plot(X, k=MinPts)
Sensitivity analysis around suggested ε
Validate with internal metrics and domain review
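The DBSCAN steps above compress into a few lines. As a stand-in for visually reading the elbow, this sketch uses the 90th percentile of the k-distances (a crude assumption, not the canonical elbow method):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0, 0], 0.5, (100, 2)),
    rng.normal([4, 4], 0.5, (80, 2)),
    rng.uniform(-2, 6, (15, 2)),
])

# Step 1: scale features
X = StandardScaler().fit_transform(X)

# Fix MinPts from dimensionality (2 x d heuristic, floor of 4)
min_pts = max(4, 2 * X.shape[1])

# Approximate the k-distance elbow with a high percentile
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
k_dist = np.sort(nn.kneighbors(X)[0][:, -1])
eps = float(np.percentile(k_dist, 90))

# Cluster and summarize
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps={eps:.3f}, clusters={n_clusters}, "
      f"noise={int(np.sum(labels == -1))}")
```

For final results, replace the percentile shortcut with an inspected k-distance plot and run the validation steps below.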
Step 5: Validation and Iteration
After selecting parameters: compute internal validation metrics, run a stability test, review the clusters with a domain expert, and iterate on any parameter that fails these checks.
Parameter selection transforms density-based clustering from black-box experimentation into principled analysis: read ε from the k-distance plot, set MinPts from dimensionality and noise level, search HDBSCAN's parameter grid with quality metrics, and confirm every choice with validation and domain review.
Module Complete:
You have now completed the comprehensive module on Density-Based Clustering. You understand how density-based methods define clusters, how DBSCAN, OPTICS, and HDBSCAN differ, and how to select and validate their parameters.
These tools form a powerful toolkit for discovering structure in data without assuming cluster shapes or counts.
Congratulations! You have mastered density-based clustering from theory to practice. You can now effectively apply DBSCAN, OPTICS, and HDBSCAN to real-world data, select appropriate parameters, and validate your results. This knowledge forms a critical component of modern unsupervised learning expertise.