Internal and external metrics evaluate a single clustering solution, but they don't answer a crucial question: Would we get the same clusters if we collected slightly different data? This is the stability question.
Stability-based evaluation assesses cluster quality by measuring how consistent clustering results are across perturbations—repeated runs, bootstrap samples, or subsets of the data. The core intuition is compelling: if clusters are 'real' structures in the data, they should be recoverable even with some noise or sampling variation.
This page provides rigorous coverage of stability-based methods, including bootstrap resampling approaches, subsampling stability, perturbation-based assessment, and the theoretical foundations connecting stability to generalization.
By the end of this page, you will understand: (1) why stability is a valid criterion for clustering quality, (2) bootstrap and subsampling stability methodologies, (3) how to implement and interpret stability measures, (4) the connection between stability and optimal k selection, and (5) limitations and best practices for stability assessment.
Consider running k-means with k=5 on a dataset and obtaining a clustering. Now imagine collecting a new dataset from the same underlying population. Would you get the same five clusters? If the answer is yes, those clusters likely represent genuine structure. If the answer is no—if small data changes produce wildly different clusterings—the original solution was probably fitting noise rather than signal.
This principle is fundamental:
- Stable clusterings → genuine, reproducible structure
- Unstable clusterings → artifacts of sampling noise, the wrong k, or an inappropriate algorithm
Let A be a clustering algorithm and D our data, sampled from a population P. Generate B perturbed datasets D_1, ..., D_B (for example, bootstrap samples of D), apply A to each to obtain clusterings C_1, ..., C_B, and let S be a similarity measure between clusterings (such as the Adjusted Rand Index). Stability is then the average pairwise similarity:
$$\text{Stability} = \frac{1}{\binom{B}{2}} \sum_{i < j} S(C_i, C_j)$$
Alternatively, we can compare each perturbed clustering to the original clustering on data common to both samples.
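To make the definition concrete, here is a minimal sketch (an added illustration, not from the original text) that treats the perturbed clusterings as label arrays over the same points and uses ARI as the similarity S. In practice the perturbed clusterings live on different subsets of points, which the bootstrap implementation below handles explicitly.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def mean_pairwise_stability(labelings):
    """Average pairwise ARI over a list of label arrays for the same points."""
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))

# Tiny usage example with three hypothetical clusterings of six points
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition, different label names -> ARI = 1
    [0, 0, 1, 1, 1, 1],   # one point moved -> ARI < 1
]
print(mean_pairwise_stability(labelings))
```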
| Aspect | Internal Metrics | External Metrics | Stability |
|---|---|---|---|
| Requires ground truth | No | Yes | No |
| What it measures | Compactness/separation | Agreement with labels | Consistency/robustness |
| Cluster shape assumption | Often spherical | None | None |
| Computationally expensive | Varies | No | Yes (multiple runs) |
| Directly tests generalizability | No | No | Yes |
Stability is complementary to internal and external metrics. A clustering with high Silhouette but low stability may be fitting noise. Conversely, a clustering with moderate Silhouette but high stability likely captures genuine, reproducible structure. Always use stability alongside other metrics.
Bootstrap resampling creates perturbed datasets by sampling n points with replacement from the original n points. Each bootstrap sample contains, on average, roughly 63% of the distinct original points (some appearing more than once), so any two samples overlap substantially without being identical.
This is the most commonly used approach for stability assessment.
Input: Dataset D, Algorithm A, number of bootstraps B, similarity measure S
Output: Stability score
1. For b = 1 to B:
a. Generate bootstrap sample D_b by sampling n points with replacement from D
b. Apply A to D_b, obtaining clustering C_b
c. Record which original points appear in D_b and their assigned cluster labels
2. For each pair (i, j) with i < j:
a. Find points common to both D_i and D_j
b. Compute S(C_i, C_j) on common points
c. Store similarity
3. Return mean similarity across all pairs
A key subtlety: we can only compare clusterings on common points—points that appear in both bootstrap samples.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt


def bootstrap_stability(X, n_clusters, n_bootstrap=100, algorithm=None, random_state=42):
    """
    Compute bootstrap stability of a clustering.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    n_clusters : int, number of clusters
    n_bootstrap : int, number of bootstrap samples
    algorithm : clustering algorithm (default: KMeans)
    random_state : int

    Returns
    -------
    dict with stability score and per-bootstrap details
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]

    if algorithm is None:
        algorithm = KMeans(n_clusters=n_clusters, n_init='auto', random_state=random_state)

    # Store bootstrap clusterings and which points they contain
    bootstrap_results = []

    for b in range(n_bootstrap):
        # Bootstrap sample: sample with replacement
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        unique_indices = np.unique(indices)

        # Cluster the bootstrap sample (using unique points to avoid duplicates)
        X_boot = X[unique_indices]
        if hasattr(algorithm, 'set_params'):
            algo = algorithm.__class__(**algorithm.get_params())
        else:
            algo = KMeans(n_clusters=n_clusters, n_init='auto', random_state=b)
        labels_boot = algo.fit_predict(X_boot)

        # Store mapping from original index to cluster label
        index_to_label = {idx: label for idx, label in zip(unique_indices, labels_boot)}
        bootstrap_results.append(index_to_label)

    # Compute pairwise stability
    pairwise_scores = []
    for i in range(n_bootstrap):
        for j in range(i + 1, n_bootstrap):
            # Find common points
            common_indices = set(bootstrap_results[i].keys()) & set(bootstrap_results[j].keys())
            if len(common_indices) < 10:  # Need minimum points for meaningful comparison
                continue
            common_indices = sorted(common_indices)
            labels_i = [bootstrap_results[i][idx] for idx in common_indices]
            labels_j = [bootstrap_results[j][idx] for idx in common_indices]

            # Compute ARI on common points
            ari = adjusted_rand_score(labels_i, labels_j)
            pairwise_scores.append(ari)

    stability_score = np.mean(pairwise_scores)

    return {
        'stability': stability_score,
        'std': np.std(pairwise_scores),
        'n_comparisons': len(pairwise_scores),
        'pairwise_scores': pairwise_scores
    }


def find_stable_k(X, k_range, n_bootstrap=50, random_state=42):
    """
    Use stability to find the optimal number of clusters.
    """
    results = {}
    for k in k_range:
        print(f"  Evaluating k={k}...", end=' ')
        result = bootstrap_stability(X, k, n_bootstrap=n_bootstrap, random_state=random_state)
        results[k] = result
        print(f"Stability: {result['stability']:.4f}")
    return results


# Example with synthetic data
from sklearn.datasets import make_blobs

# True k = 4
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

print("Bootstrap Stability Analysis")
print("=" * 50)

# Evaluate stability for different k values
k_range = range(2, 8)
stability_results = find_stable_k(X, k_range, n_bootstrap=30)

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Stability plot
stabilities = [stability_results[k]['stability'] for k in k_range]
stds = [stability_results[k]['std'] for k in k_range]

axes[0].errorbar(k_range, stabilities, yerr=stds, marker='o', capsize=5)
axes[0].axvline(x=4, color='red', linestyle='--', label='True k=4')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Bootstrap Stability (ARI)')
axes[0].set_title('Stability vs. Number of Clusters')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Distribution for true k
scores_k4 = stability_results[4]['pairwise_scores']
axes[1].hist(scores_k4, bins=20, edgecolor='black', alpha=0.7)
axes[1].axvline(x=np.mean(scores_k4), color='red', linestyle='--',
                label=f'Mean: {np.mean(scores_k4):.3f}')
axes[1].set_xlabel('Pairwise ARI')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Pairwise Stability (k=4)')
axes[1].legend()

plt.tight_layout()
plt.show()
```

Subsampling creates perturbed datasets by sampling without replacement, typically taking a fraction p (e.g., 80%) of the original data. This approach avoids duplicated points and gives direct control over how strongly the perturbed samples overlap with the original data.
Input: Dataset D, Algorithm A, subsampling fraction p, number of subsamples B
Output: Stability score
1. For b = 1 to B:
a. Sample p*n points without replacement from D → D_b
b. Apply A to D_b → C_b
c. For each point in D_b, record its cluster assignment
2. For each pair of points (i, j) that appear in at least 2 subsamples:
a. Count how often they're in the same cluster vs different clusters
b. Compare to their clustering in original full-data solution
3. Aggregate agreement scores
| Fraction p | Subsample Size | Expected Overlap | Trade-offs |
|---|---|---|---|
| 0.9 | 90% of data | High | More stable but less diverse samples |
| 0.8 | 80% of data | Moderate | Good balance (common choice) |
| 0.7 | 70% of data | Lower | More variance but tests robustness harder |
| 0.5 | 50% of data | Low | Severe test; may be too challenging |
The choice depends on dataset size and how rigorously you want to test stability. Larger datasets can use smaller fractions.
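As a quick, illustrative check on the "Expected Overlap" column (an added sketch, not part of the original text): two independent subsamples that each keep a fraction p of the points share about a fraction p² of the original data in expectation, so p = 0.8 gives roughly 64% shared points while p = 0.5 gives only about 25%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for p in [0.9, 0.8, 0.7, 0.5]:
    size = int(p * n)
    overlaps = []
    for _ in range(200):
        # Two independent subsamples without replacement
        a = set(rng.choice(n, size=size, replace=False))
        b = set(rng.choice(n, size=size, replace=False))
        overlaps.append(len(a & b) / n)
    print(f"p={p}: mean overlap ≈ {np.mean(overlaps):.2f} (theory: {p**2:.2f})")
```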
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt


def subsampling_stability(X, n_clusters, subsample_fraction=0.8,
                          n_subsamples=100, random_state=42):
    """
    Compute subsampling stability of a clustering.

    Strategy: Compare each subsample's clustering on its points
    to a reference clustering on the full data.
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    subsample_size = int(n_samples * subsample_fraction)

    # Reference clustering on full data
    reference_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=random_state)
    reference_labels = reference_model.fit_predict(X)

    stability_scores = []

    for s in range(n_subsamples):
        # Subsample without replacement
        indices = rng.choice(n_samples, size=subsample_size, replace=False)
        X_sub = X[indices]

        # Cluster the subsample
        subsample_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=s)
        subsample_labels = subsample_model.fit_predict(X_sub)

        # Compare subsample clustering to reference on the subsampled points
        reference_labels_sub = reference_labels[indices]
        ari = adjusted_rand_score(reference_labels_sub, subsample_labels)
        stability_scores.append(ari)

    return {
        'stability': np.mean(stability_scores),
        'std': np.std(stability_scores),
        'scores': stability_scores
    }


def prediction_strength(X, n_clusters, test_fraction=0.5, n_splits=50, random_state=42):
    """
    Compute prediction strength (Tibshirani & Walther 2005).

    Train on one half, predict on the other half, measure consistency.
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    test_size = int(n_samples * test_fraction)

    prediction_strengths = []

    for s in range(n_splits):
        # Split into train and test
        indices = rng.permutation(n_samples)
        test_indices = indices[:test_size]
        train_indices = indices[test_size:]

        X_train = X[train_indices]
        X_test = X[test_indices]

        # Cluster training set
        train_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=s)
        train_model.fit(X_train)

        # Assign test points to training clusters
        test_labels = train_model.predict(X_test)

        # Cluster test set independently
        test_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=s + 1000)
        test_labels_independent = test_model.fit_predict(X_test)

        # For each test cluster, compute proportion of pairs that
        # are also co-clustered by the training model
        cluster_strengths = []
        for k in range(n_clusters):
            cluster_points = np.where(test_labels_independent == k)[0]

            if len(cluster_points) < 2:
                continue

            # Count pairs in same vs different training clusters
            same_cluster = 0
            total_pairs = 0
            for i in range(len(cluster_points)):
                for j in range(i + 1, len(cluster_points)):
                    pi, pj = cluster_points[i], cluster_points[j]
                    if test_labels[pi] == test_labels[pj]:
                        same_cluster += 1
                    total_pairs += 1

            if total_pairs > 0:
                cluster_strengths.append(same_cluster / total_pairs)

        if cluster_strengths:
            # Minimum over clusters (most stringent)
            prediction_strengths.append(min(cluster_strengths))

    return {
        'prediction_strength': np.mean(prediction_strengths),
        'std': np.std(prediction_strengths),
        'all_strengths': prediction_strengths
    }


# Example comparison
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

print("Stability Analysis Comparison")
print("=" * 50)

k_range = range(2, 8)

subsamp_results = []
pred_str_results = []

for k in k_range:
    sub_result = subsampling_stability(X, k, subsample_fraction=0.8, n_subsamples=30)
    ps_result = prediction_strength(X, k, n_splits=30)

    subsamp_results.append(sub_result['stability'])
    pred_str_results.append(ps_result['prediction_strength'])

    print(f"k={k}: Subsampling={sub_result['stability']:.3f}, "
          f"Pred Strength={ps_result['prediction_strength']:.3f}")

# Plot comparison
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_range, subsamp_results, 'bo-', label='Subsampling Stability', linewidth=2)
ax.plot(k_range, pred_str_results, 'rs-', label='Prediction Strength', linewidth=2)
ax.axvline(x=4, color='green', linestyle='--', label='True k=4', alpha=0.7)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Stability Score')
ax.set_title('Comparison of Stability Methods')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

Prediction Strength (Tibshirani & Walther, 2005) takes a more principled approach to stability by explicitly separating learning and evaluation: split the data into a training half and a test half, cluster each half independently, assign the test points to the training clusters, and measure how often pairs placed together by the test clustering are also placed together by the training model.
This directly tests whether clusters learned on one sample generalize to another.
Let $C_{\text{train}}$ be the clustering learned on the training data and $C_{\text{test}}$ the independent clustering of the test data. Define the co-membership matrix $D$, where $D[i,j] = 1$ if test points $i$ and $j$ are assigned to the same training cluster, and 0 otherwise. Then, for each test cluster $C_{\text{test}}(k)$:
Cluster-specific prediction strength: $$PS_k = \frac{1}{|C_{test}(k)|(|C_{test}(k)|-1)} \sum_{i,j \in C_{test}(k), i \neq j} D[i,j]$$
Overall prediction strength: $$PS = \min_k PS_k$$
Using the minimum makes this a conservative measure—all clusters must be stable.
Tibshirani & Walther suggest PS > 0.8 indicates stable clusters. Values below 0.8 suggest either wrong k or inherently unstable structure. When evaluating multiple k values, choose the largest k with PS > threshold (0.8 or 0.9).
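A small sketch of this selection rule (an added illustration reusing the `prediction_strength` function defined above; the threshold and the fallback to the smallest candidate k are illustrative assumptions, and `X` is the dataset from the earlier example):

```python
def select_k_by_prediction_strength(X, k_values, threshold=0.8, n_splits=30):
    """Pick the largest k whose mean prediction strength clears the threshold."""
    strengths = {k: prediction_strength(X, k, n_splits=n_splits)['prediction_strength']
                 for k in k_values}
    stable = [k for k, ps in strengths.items() if ps > threshold]
    # Fall back to the smallest candidate k if nothing passes the threshold
    return (max(stable) if stable else min(k_values)), strengths

best_k, strengths = select_k_by_prediction_strength(X, range(2, 8))
print("Prediction strengths:", {k: round(v, 3) for k, v in strengths.items()})
print("Selected k:", best_k)
```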
Some clustering algorithms expose natural stability measures through their model structure; consensus clustering turns the same idea into a general, algorithm-agnostic procedure.

Consensus clustering, also known as cluster ensembles or aggregated clustering, runs the base algorithm on many perturbed samples (for example, bootstrap replicates), records how often each pair of points lands in the same cluster, and collects these frequencies in a co-association matrix M. Clustering M itself then yields a consensus partition.

If the entries of M are close to 0 or 1, stability is high; values near 0.5 indicate ambiguity.
```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt


def consensus_clustering(X, n_clusters, n_bootstrap=100, random_state=42):
    """
    Perform consensus clustering and compute stability.

    Returns
    -------
    dict containing:
    - consensus_labels: Final cluster assignments
    - co_association_matrix: Frequency of co-clustering
    - clarity_score / entropy_stability: Measures of consensus clarity
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]

    # Initialize co-association matrix
    co_association = np.zeros((n_samples, n_samples))
    co_occurrence = np.zeros((n_samples, n_samples))

    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        unique_indices = np.unique(indices)

        # Cluster bootstrap sample
        X_boot = X[unique_indices]
        model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=b)
        labels = model.fit_predict(X_boot)

        # Update co-association for pairs in this bootstrap
        for i, idx_i in enumerate(unique_indices):
            for j, idx_j in enumerate(unique_indices):
                co_occurrence[idx_i, idx_j] += 1
                if labels[i] == labels[j]:
                    co_association[idx_i, idx_j] += 1

    # Normalize by co-occurrence count
    # Avoid division by zero for pairs that never co-occur
    with np.errstate(divide='ignore', invalid='ignore'):
        M = np.divide(co_association, co_occurrence)
    M = np.nan_to_num(M, nan=0.0)

    # Cluster the co-association matrix
    # Convert similarity to distance
    distance_matrix = 1 - M
    np.fill_diagonal(distance_matrix, 0)  # Self-distance is 0

    # Use hierarchical clustering on the distance matrix
    condensed = squareform(distance_matrix, checks=False)
    Z = linkage(condensed, method='average')
    consensus_labels = fcluster(Z, t=n_clusters, criterion='maxclust') - 1

    # Compute stability score based on clarity of co-association values
    # Ideal: values close to 0 or 1
    upper_triangle = M[np.triu_indices(n_samples, k=1)]

    # Bimodality measure: how far from 0.5 on average
    clarity = np.mean(np.abs(upper_triangle - 0.5)) * 2  # Scale to [0, 1]

    # Alternative: entropy-based measure
    def binary_entropy(p):
        if p == 0 or p == 1:
            return 0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    entropies = [binary_entropy(p) for p in upper_triangle]
    avg_entropy = np.mean(entropies)
    stability_entropy = 1 - avg_entropy  # Higher is more stable

    return {
        'consensus_labels': consensus_labels,
        'co_association_matrix': M,
        'clarity_score': clarity,
        'entropy_stability': stability_entropy
    }


# Example
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

print("Consensus Clustering Analysis")
print("=" * 50)

# Compare stability for different k
for k in [2, 3, 4, 5]:
    result = consensus_clustering(X, n_clusters=k, n_bootstrap=50)
    print(f"k={k}: Clarity={result['clarity_score']:.4f}, "
          f"Entropy Stability={result['entropy_stability']:.4f}")

# Visualize co-association matrix for true k
result_k3 = consensus_clustering(X, n_clusters=3, n_bootstrap=100)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Data with true labels
axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=30)
axes[0].set_title('True Labels')

# Data with consensus labels
axes[1].scatter(X[:, 0], X[:, 1], c=result_k3['consensus_labels'], cmap='viridis', s=30)
axes[1].set_title('Consensus Labels (k=3)')

# Co-association matrix (sorted by consensus cluster)
sorted_idx = np.argsort(result_k3['consensus_labels'])
M_sorted = result_k3['co_association_matrix'][sorted_idx][:, sorted_idx]
im = axes[2].imshow(M_sorted, cmap='viridis', aspect='auto')
axes[2].set_title('Co-Association Matrix (sorted)')
plt.colorbar(im, ax=axes[2], fraction=0.046)

plt.tight_layout()
plt.show()

# Distribution of co-association values
plt.figure(figsize=(8, 4))
upper_triangle = result_k3['co_association_matrix'][np.triu_indices(150, k=1)]
plt.hist(upper_triangle, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=0.5, color='red', linestyle='--', label='Maximum ambiguity')
plt.xlabel('Co-Association Frequency')
plt.ylabel('Count')
plt.title('Distribution of Pairwise Co-Association (k=3)')
plt.legend()
plt.tight_layout()
plt.show()
```

One of the primary applications of stability analysis is selecting the number of clusters k. The key insight is:
True clusters are stable, while artificial subdivisions are unstable.
When k is too small, genuine clusters are merged—stability may be high but we're missing structure. When k is too large, the algorithm creates arbitrary subdivisions of genuine clusters—these subdivisions vary between runs, reducing stability.
For prediction strength: choose the largest k whose prediction strength exceeds the chosen threshold (typically 0.8 to 0.9).

For pairwise stability: choose the k that maximizes the mean pairwise similarity (e.g., bootstrap ARI), and check that the spread across comparisons is small.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.datasets import make_blobs


def comprehensive_k_selection(X, k_range, n_bootstrap=50, random_state=42):
    """
    Compare stability-based k selection with internal metrics.
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]

    results = {'k': [], 'stability_mean': [], 'stability_std': [],
               'silhouette': [], 'inertia': []}

    for k in k_range:
        # Bootstrap stability
        pairwise_aris = []
        for b in range(n_bootstrap):
            # Two independent bootstrap samples
            idx1 = rng.choice(n_samples, size=n_samples, replace=True)
            idx2 = rng.choice(n_samples, size=n_samples, replace=True)

            X1 = X[np.unique(idx1)]
            X2 = X[np.unique(idx2)]

            # Cluster each
            m1 = KMeans(n_clusters=k, n_init='auto', random_state=b)
            m2 = KMeans(n_clusters=k, n_init='auto', random_state=b + 1000)

            l1 = m1.fit_predict(X1)
            l2 = m2.fit_predict(X2)

            # Common points
            common = set(np.unique(idx1)) & set(np.unique(idx2))
            if len(common) < 10:
                continue
            common = sorted(common)

            # Get labels for common points
            idx1_to_label = {i: l for i, l in zip(np.unique(idx1), l1)}
            idx2_to_label = {i: l for i, l in zip(np.unique(idx2), l2)}

            labels1 = [idx1_to_label[i] for i in common]
            labels2 = [idx2_to_label[i] for i in common]

            ari = adjusted_rand_score(labels1, labels2)
            pairwise_aris.append(ari)

        # Internal metrics on full data
        model = KMeans(n_clusters=k, n_init='auto', random_state=random_state)
        labels = model.fit_predict(X)
        sil = silhouette_score(X, labels)
        inertia = model.inertia_

        results['k'].append(k)
        results['stability_mean'].append(np.mean(pairwise_aris))
        results['stability_std'].append(np.std(pairwise_aris))
        results['silhouette'].append(sil)
        results['inertia'].append(inertia)

    return results


# Generate data with known structure
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Add some noise to make it more realistic
X += np.random.randn(*X.shape) * 0.5

print("Comprehensive k Selection Analysis")
print("=" * 50)

k_range = range(2, 10)
results = comprehensive_k_selection(X, k_range, n_bootstrap=30)

# Print results
print(f"\n{'k':^4} {'Stability':^12} {'Std':^8} {'Silhouette':^12}")
print("-" * 40)
for i, k in enumerate(results['k']):
    print(f"{k:^4} {results['stability_mean'][i]:^12.4f} "
          f"{results['stability_std'][i]:^8.4f} {results['silhouette'][i]:^12.4f}")

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Stability plot
axes[0, 0].errorbar(results['k'], results['stability_mean'],
                    yerr=results['stability_std'], marker='o', capsize=5, linewidth=2)
axes[0, 0].axhline(y=0.8, color='red', linestyle='--', alpha=0.5, label='Threshold 0.8')
axes[0, 0].axvline(x=4, color='green', linestyle='--', alpha=0.7, label='True k=4')
axes[0, 0].set_xlabel('Number of Clusters (k)')
axes[0, 0].set_ylabel('Bootstrap Stability (ARI)')
axes[0, 0].set_title('Stability-Based Selection')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Silhouette plot
axes[0, 1].plot(results['k'], results['silhouette'], 'bo-', linewidth=2, markersize=8)
axes[0, 1].axvline(x=4, color='green', linestyle='--', alpha=0.7, label='True k=4')
axes[0, 1].set_xlabel('Number of Clusters (k)')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Silhouette-Based Selection')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Elbow plot
axes[1, 0].plot(results['k'], results['inertia'], 'bo-', linewidth=2, markersize=8)
axes[1, 0].axvline(x=4, color='green', linestyle='--', alpha=0.7, label='True k=4')
axes[1, 0].set_xlabel('Number of Clusters (k)')
axes[1, 0].set_ylabel('Inertia (Within-cluster SS)')
axes[1, 0].set_title('Elbow Method')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Combined recommendation
# Normalize metrics for comparison
stab_norm = np.array(results['stability_mean'])
sil_norm = (np.array(results['silhouette']) - min(results['silhouette'])) / (
    max(results['silhouette']) - min(results['silhouette']))

combined = (stab_norm + sil_norm) / 2

axes[1, 1].plot(results['k'], stab_norm, 'b-', label='Stability (normalized)', linewidth=2)
axes[1, 1].plot(results['k'], sil_norm, 'r-', label='Silhouette (normalized)', linewidth=2)
axes[1, 1].plot(results['k'], combined, 'g--', label='Combined', linewidth=2)
axes[1, 1].axvline(x=4, color='green', linestyle=':', alpha=0.7, label='True k=4')
axes[1, 1].set_xlabel('Number of Clusters (k)')
axes[1, 1].set_ylabel('Normalized Score')
axes[1, 1].set_title('Combined Selection')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Recommendation
best_k_stability = results['k'][np.argmax(results['stability_mean'])]
best_k_silhouette = results['k'][np.argmax(results['silhouette'])]
best_k_combined = results['k'][np.argmax(combined)]

print("\nRecommendations:")
print(f"  Best k by stability: {best_k_stability}")
print(f"  Best k by silhouette: {best_k_silhouette}")
print(f"  Best k by combined: {best_k_combined}")
print("  True k: 4")
```

The theoretical justification for stability-based validation comes from statistical learning theory. Key results include:
Ben-David et al. (2006) showed that for certain clustering algorithms, stability implies consistency—stable clusterings converge to the population-optimal clustering as sample size increases.
Von Luxburg (2010) provided conditions under which stability indicates the 'identifiability' of clusters:
$$\Pr[\text{clusters recoverable}] \geq 1 - \delta$$
when stability exceeds a threshold depending on sample size and cluster separation.
Key insight: Stability is a necessary but not sufficient condition for good clustering. A perfectly random (useless) clustering can be perfectly stable if the algorithm is deterministic. But meaningful clusters should be stable.
Trivial solutions (k=1, or k=n) are perfectly stable but not useful. Always combine stability with internal metrics or domain knowledge. Stability tells you the clustering is reproducible, not that it's meaningful.
- Overlapping clusters: true clusters may have inherent ambiguity, leading to low stability even for the 'correct' k
- Varying cluster sizes: small clusters may be unstable due to sampling variance even if genuine
- Hierarchical structure: multiple valid k values exist; stability may not identify the 'best' level
- Algorithm dependence: some algorithms are inherently more stable than others (k-means++ initialization is more stable than random initialization; see the sketch after this list)
- Small samples: insufficient data for reliable stability estimates
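As an illustrative check of the algorithm-dependence point (an added sketch reusing the `bootstrap_stability` function and the dataset `X` from the earlier examples; the init settings are standard scikit-learn options and the specific parameter values are assumptions for demonstration):

```python
from sklearn.cluster import KMeans

# Compare bootstrap stability of k-means++ vs. purely random initialization
# (a single initialization per fit, so the init strategy is the only difference)
for init in ['k-means++', 'random']:
    algo = KMeans(n_clusters=4, init=init, n_init=1, random_state=0)
    result = bootstrap_stability(X, n_clusters=4, n_bootstrap=30, algorithm=algo)
    print(f"init={init:>10}: stability = {result['stability']:.3f} ± {result['std']:.3f}")
```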
| Method | Best For | Key Parameters |
|---|---|---|
| Bootstrap stability | General purpose, most datasets | n_bootstrap ≥ 100 |
| Subsampling | Large datasets, controlled overlap | fraction=0.8, n_subsamples ≥ 50 |
| Prediction strength | Rigorous k selection | threshold=0.8-0.9 |
| Consensus clustering | Final clustering + stability measure | n_bootstrap ≥ 100 |
This page provided comprehensive coverage of stability-based clustering evaluation.
You now understand how to assess clustering stability through bootstrap resampling, subsampling, prediction strength, and consensus approaches. Next, we'll explore the Gap statistic, which provides a principled method for comparing clustering quality against a null reference distribution.