Internal and external metrics evaluate a single clustering solution, but they don't answer a crucial question: Would we get the same clusters if we collected slightly different data? This is the stability question.
Stability-based evaluation assesses cluster quality by measuring how consistent clustering results are across perturbations—repeated runs, bootstrap samples, or subsets of the data. The core intuition is compelling: if clusters are 'real' structures in the data, they should be recoverable even with some noise or sampling variation.
This page provides rigorous coverage of stability-based methods, including bootstrap resampling approaches, subsampling stability, perturbation-based assessment, and the theoretical foundations connecting stability to generalization.
By the end of this page, you will understand: (1) why stability is a valid criterion for clustering quality, (2) bootstrap and subsampling stability methodologies, (3) how to implement and interpret stability measures, (4) the connection between stability and optimal k selection, and (5) limitations and best practices for stability assessment.
Consider running k-means with k=5 on a dataset and obtaining a clustering. Now imagine collecting a new dataset from the same underlying population. Would you get the same five clusters? If the answer is yes, those clusters likely represent genuine structure. If the answer is no—if small data changes produce wildly different clusterings—the original solution was probably fitting noise rather than signal.
This principle is fundamental:
- Stable clusterings → genuine, reproducible structure
- Unstable clusterings → artifacts of sampling noise, the wrong k, or an inappropriate algorithm
Let A be a clustering algorithm and D our data, sampled from a population P. Generate B perturbed datasets D_1, ..., D_B (for example, bootstrap samples of D), apply A to each to obtain clusterings C_1, ..., C_B, and let S be a similarity measure between clusterings (such as the Adjusted Rand Index). Stability is then the average pairwise similarity:
$$\text{Stability} = \frac{1}{\binom{B}{2}} \sum_{i < j} S(C_i, C_j)$$
Alternatively, we can compare each perturbed clustering to the original clustering on data common to both samples.
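To make the definition concrete, here is a minimal sketch (an added illustration, not from the original text) that treats the perturbed clusterings as label arrays over the same points and uses ARI as the similarity S. In practice the perturbed clusterings live on different subsets of points, which the bootstrap implementation below handles explicitly.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def mean_pairwise_stability(labelings):
    """Average pairwise ARI over a list of label arrays for the same points."""
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))

# Tiny usage example with three hypothetical clusterings of six points
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition, different label names -> ARI = 1
    [0, 0, 1, 1, 1, 1],   # one point moved -> ARI < 1
]
print(mean_pairwise_stability(labelings))
```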
| Aspect | Internal Metrics | External Metrics | Stability |
|---|---|---|---|
| Requires ground truth | No | Yes | No |
| What it measures | Compactness/separation | Agreement with labels | Consistency/robustness |
| Cluster shape assumption | Often spherical | None | None |
| Computationally expensive | Varies | No | Yes (multiple runs) |
| Directly tests generalizability | No | No | Yes |
Stability is complementary to internal and external metrics. A clustering with high Silhouette but low stability may be fitting noise. Conversely, a clustering with moderate Silhouette but high stability likely captures genuine, reproducible structure. Always use stability alongside other metrics.
Bootstrap resampling creates perturbed datasets by sampling n points with replacement from the original n points. Each bootstrap sample contains, on average, roughly 63% of the distinct original points (some appearing more than once), so any two samples overlap substantially without being identical.
This is the most commonly used approach for stability assessment.
Input: Dataset D, Algorithm A, number of bootstraps B, similarity measure S
Output: Stability score
1. For b = 1 to B:
a. Generate bootstrap sample D_b by sampling n points with replacement from D
b. Apply A to D_b, obtaining clustering C_b
c. Record which original points appear in D_b and their assigned cluster labels
2. For each pair (i, j) with i < j:
a. Find points common to both D_i and D_j
b. Compute S(C_i, C_j) on common points
c. Store similarity
3. Return mean similarity across all pairs
A key subtlety: we can only compare clusterings on common points—points that appear in both bootstrap samples.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt


def bootstrap_stability(X, n_clusters, n_bootstrap=100, algorithm=None, random_state=42):
    """
    Compute bootstrap stability of a clustering.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    n_clusters : int, number of clusters
    n_bootstrap : int, number of bootstrap samples
    algorithm : clustering algorithm (default: KMeans)
    random_state : int

    Returns
    -------
    dict with stability score and per-bootstrap details
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]

    if algorithm is None:
        algorithm = KMeans(n_clusters=n_clusters, n_init='auto', random_state=random_state)

    # Store bootstrap clusterings and which points they contain
    bootstrap_results = []

    for b in range(n_bootstrap):
        # Bootstrap sample: sample with replacement
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        unique_indices = np.unique(indices)

        # Cluster the bootstrap sample (using unique points to avoid duplicates)
        X_boot = X[unique_indices]
        if hasattr(algorithm, 'set_params'):
            algo = algorithm.__class__(**algorithm.get_params())
        else:
            algo = KMeans(n_clusters=n_clusters, n_init='auto', random_state=b)
        labels_boot = algo.fit_predict(X_boot)

        # Store mapping from original index to cluster label
        index_to_label = {idx: label for idx, label in zip(unique_indices, labels_boot)}
        bootstrap_results.append(index_to_label)

    # Compute pairwise stability
    pairwise_scores = []
    for i in range(n_bootstrap):
        for j in range(i + 1, n_bootstrap):
            # Find common points
            common_indices = set(bootstrap_results[i].keys()) & set(bootstrap_results[j].keys())
            if len(common_indices) < 10:  # Need minimum points for meaningful comparison
                continue
            common_indices = sorted(common_indices)
            labels_i = [bootstrap_results[i][idx] for idx in common_indices]
            labels_j = [bootstrap_results[j][idx] for idx in common_indices]

            # Compute ARI on common points
            ari = adjusted_rand_score(labels_i, labels_j)
            pairwise_scores.append(ari)

    stability_score = np.mean(pairwise_scores)

    return {
        'stability': stability_score,
        'std': np.std(pairwise_scores),
        'n_comparisons': len(pairwise_scores),
        'pairwise_scores': pairwise_scores
    }


def find_stable_k(X, k_range, n_bootstrap=50, random_state=42):
    """
    Use stability to find the optimal number of clusters.
    """
    results = {}
    for k in k_range:
        print(f"  Evaluating k={k}...", end=' ')
        result = bootstrap_stability(X, k, n_bootstrap=n_bootstrap, random_state=random_state)
        results[k] = result
        print(f"Stability: {result['stability']:.4f}")
    return results


# Example with synthetic data
from sklearn.datasets import make_blobs

# True k = 4
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

print("Bootstrap Stability Analysis")
print("=" * 50)

# Evaluate stability for different k values
k_range = range(2, 8)
stability_results = find_stable_k(X, k_range, n_bootstrap=30)

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Stability plot
stabilities = [stability_results[k]['stability'] for k in k_range]
stds = [stability_results[k]['std'] for k in k_range]

axes[0].errorbar(k_range, stabilities, yerr=stds, marker='o', capsize=5)
axes[0].axvline(x=4, color='red', linestyle='--', label='True k=4')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Bootstrap Stability (ARI)')
axes[0].set_title('Stability vs. Number of Clusters')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Distribution for true k
scores_k4 = stability_results[4]['pairwise_scores']
axes[1].hist(scores_k4, bins=20, edgecolor='black', alpha=0.7)
axes[1].axvline(x=np.mean(scores_k4), color='red', linestyle='--',
                label=f'Mean: {np.mean(scores_k4):.3f}')
axes[1].set_xlabel('Pairwise ARI')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Pairwise Stability (k=4)')
axes[1].legend()

plt.tight_layout()
plt.show()
```

Subsampling creates perturbed datasets by sampling without replacement, typically taking a fraction p (e.g., 80%) of the original data. This approach avoids duplicated points and gives direct control over how strongly the perturbed samples overlap with the original data.
Input: Dataset D, Algorithm A, subsampling fraction p, number of subsamples B
Output: Stability score
1. For b = 1 to B:
a. Sample p*n points without replacement from D → D_b
b. Apply A to D_b → C_b
c. For each point in D_b, record its cluster assignment
2. For each pair of points (i, j) that appear in at least 2 subsamples:
a. Count how often they're in the same cluster vs different clusters
b. Compare to their clustering in original full-data solution
3. Aggregate agreement scores
| Fraction p | Subsample Size | Expected Overlap | Trade-offs |
|---|---|---|---|
| 0.9 | 90% of data | High | More stable but less diverse samples |
| 0.8 | 80% of data | Moderate | Good balance (common choice) |
| 0.7 | 70% of data | Lower | More variance but tests robustness harder |
| 0.5 | 50% of data | Low | Severe test; may be too challenging |
The choice depends on dataset size and how rigorously you want to test stability. Larger datasets can use smaller fractions.
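As a quick, illustrative check on the "Expected Overlap" column (an added sketch, not part of the original text): two independent subsamples that each keep a fraction p of the points share about a fraction p² of the original data in expectation, so p = 0.8 gives roughly 64% shared points while p = 0.5 gives only about 25%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for p in [0.9, 0.8, 0.7, 0.5]:
    size = int(p * n)
    overlaps = []
    for _ in range(200):
        # Two independent subsamples without replacement
        a = set(rng.choice(n, size=size, replace=False))
        b = set(rng.choice(n, size=size, replace=False))
        overlaps.append(len(a & b) / n)
    print(f"p={p}: mean overlap ≈ {np.mean(overlaps):.2f} (theory: {p**2:.2f})")
```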
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt


def subsampling_stability(X, n_clusters, subsample_fraction=0.8,
                          n_subsamples=100, random_state=42):
    """
    Compute subsampling stability of a clustering.

    Strategy: Compare each subsample's clustering on its points
    to a reference clustering on the full data.
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    subsample_size = int(n_samples * subsample_fraction)

    # Reference clustering on full data
    reference_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=random_state)
    reference_labels = reference_model.fit_predict(X)

    stability_scores = []

    for s in range(n_subsamples):
        # Subsample without replacement
        indices = rng.choice(n_samples, size=subsample_size, replace=False)
        X_sub = X[indices]

        # Cluster the subsample
        subsample_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=s)
        subsample_labels = subsample_model.fit_predict(X_sub)

        # Compare subsample clustering to reference on the subsampled points
        reference_labels_sub = reference_labels[indices]
        ari = adjusted_rand_score(reference_labels_sub, subsample_labels)
        stability_scores.append(ari)

    return {
        'stability': np.mean(stability_scores),
        'std': np.std(stability_scores),
        'scores': stability_scores
    }


def prediction_strength(X, n_clusters, test_fraction=0.5, n_splits=50, random_state=42):
    """
    Compute prediction strength (Tibshirani & Walther 2005).

    Train on one half, predict on the other half, measure consistency.
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    test_size = int(n_samples * test_fraction)

    prediction_strengths = []

    for s in range(n_splits):
        # Split into train and test
        indices = rng.permutation(n_samples)
        test_indices = indices[:test_size]
        train_indices = indices[test_size:]

        X_train = X[train_indices]
        X_test = X[test_indices]

        # Cluster training set
        train_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=s)
        train_model.fit(X_train)

        # Assign test points to training clusters
        test_labels = train_model.predict(X_test)

        # Cluster test set independently
        test_model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=s + 1000)
        test_labels_independent = test_model.fit_predict(X_test)

        # For each test cluster, compute proportion of pairs that
        # are also co-clustered by the training model
        cluster_strengths = []
        for k in range(n_clusters):
            cluster_points = np.where(test_labels_independent == k)[0]

            if len(cluster_points) < 2:
                continue

            # Count pairs in same vs different training clusters
            same_cluster = 0
            total_pairs = 0
            for i in range(len(cluster_points)):
                for j in range(i + 1, len(cluster_points)):
                    pi, pj = cluster_points[i], cluster_points[j]
                    if test_labels[pi] == test_labels[pj]:
                        same_cluster += 1
                    total_pairs += 1

            if total_pairs > 0:
                cluster_strengths.append(same_cluster / total_pairs)

        if cluster_strengths:
            # Minimum over clusters (most stringent)
            prediction_strengths.append(min(cluster_strengths))

    return {
        'prediction_strength': np.mean(prediction_strengths),
        'std': np.std(prediction_strengths),
        'all_strengths': prediction_strengths
    }


# Example comparison
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

print("Stability Analysis Comparison")
print("=" * 50)

k_range = range(2, 8)

subsamp_results = []
pred_str_results = []

for k in k_range:
    sub_result = subsampling_stability(X, k, subsample_fraction=0.8, n_subsamples=30)
    ps_result = prediction_strength(X, k, n_splits=30)

    subsamp_results.append(sub_result['stability'])
    pred_str_results.append(ps_result['prediction_strength'])

    print(f"k={k}: Subsampling={sub_result['stability']:.3f}, "
          f"Pred Strength={ps_result['prediction_strength']:.3f}")

# Plot comparison
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_range, subsamp_results, 'bo-', label='Subsampling Stability', linewidth=2)
ax.plot(k_range, pred_str_results, 'rs-', label='Prediction Strength', linewidth=2)
ax.axvline(x=4, color='green', linestyle='--', label='True k=4', alpha=0.7)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Stability Score')
ax.set_title('Comparison of Stability Methods')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

Prediction Strength (Tibshirani & Walther, 2005) takes a more principled approach to stability by explicitly separating learning and evaluation: split the data into a training half and a test half, cluster each half independently, assign the test points to the training clusters, and measure how often pairs placed together by the test clustering are also placed together by the training model.
This directly tests whether clusters learned on one sample generalize to another.
Let $C_{\text{train}}$ be the clustering learned on the training data and $C_{\text{test}}$ the independent clustering of the test data. Define the co-membership matrix $D$, where $D[i,j] = 1$ if test points $i$ and $j$ are assigned to the same training cluster, and 0 otherwise. Then, for each test cluster $C_{\text{test}}(k)$:
Cluster-specific prediction strength: $$PS_k = \frac{1}{|C_{test}(k)|(|C_{test}(k)|-1)} \sum_{i,j \in C_{test}(k), i \neq j} D[i,j]$$
Overall prediction strength: $$PS = \min_k PS_k$$
Using the minimum makes this a conservative measure—all clusters must be stable.
Tibshirani & Walther suggest PS > 0.8 indicates stable clusters. Values below 0.8 suggest either wrong k or inherently unstable structure. When evaluating multiple k values, choose the largest k with PS > threshold (0.8 or 0.9).
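A small sketch of this selection rule (an added illustration reusing the `prediction_strength` function defined above; the threshold and the fallback to the smallest candidate k are illustrative assumptions, and `X` is the dataset from the earlier example):

```python
def select_k_by_prediction_strength(X, k_values, threshold=0.8, n_splits=30):
    """Pick the largest k whose mean prediction strength clears the threshold."""
    strengths = {k: prediction_strength(X, k, n_splits=n_splits)['prediction_strength']
                 for k in k_values}
    stable = [k for k, ps in strengths.items() if ps > threshold]
    # Fall back to the smallest candidate k if nothing passes the threshold
    return (max(stable) if stable else min(k_values)), strengths

best_k, strengths = select_k_by_prediction_strength(X, range(2, 8))
print("Prediction strengths:", {k: round(v, 3) for k, v in strengths.items()})
print("Selected k:", best_k)
```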
Some clustering algorithms expose natural stability measures through their model structure; consensus clustering turns the same idea into a general, algorithm-agnostic procedure.

Consensus clustering, also known as cluster ensembles or aggregated clustering, runs the base algorithm on many perturbed samples (for example, bootstrap replicates), records how often each pair of points lands in the same cluster, and collects these frequencies in a co-association matrix M. Clustering M itself then yields a consensus partition.

If the entries of M are close to 0 or 1, stability is high; values near 0.5 indicate ambiguity.
```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt


def consensus_clustering(X, n_clusters, n_bootstrap=100, random_state=42):
    """
    Perform consensus clustering and compute stability.

    Returns
    -------
    dict containing:
    - consensus_labels: Final cluster assignments
    - co_association_matrix: Frequency of co-clustering
    - clarity_score / entropy_stability: Measures of consensus clarity
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]

    # Initialize co-association matrix
    co_association = np.zeros((n_samples, n_samples))
    co_occurrence = np.zeros((n_samples, n_samples))

    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        unique_indices = np.unique(indices)

        # Cluster bootstrap sample
        X_boot = X[unique_indices]
        model = KMeans(n_clusters=n_clusters, n_init='auto', random_state=b)
        labels = model.fit_predict(X_boot)

        # Update co-association for pairs in this bootstrap
        for i, idx_i in enumerate(unique_indices):
            for j, idx_j in enumerate(unique_indices):
                co_occurrence[idx_i, idx_j] += 1
                if labels[i] == labels[j]:
                    co_association[idx_i, idx_j] += 1

    # Normalize by co-occurrence count
    # Avoid division by zero for pairs that never co-occur
    with np.errstate(divide='ignore', invalid='ignore'):
        M = np.divide(co_association, co_occurrence)
    M = np.nan_to_num(M, nan=0.0)

    # Cluster the co-association matrix
    # Convert similarity to distance
    distance_matrix = 1 - M
    np.fill_diagonal(distance_matrix, 0)  # Self-distance is 0

    # Use hierarchical clustering on the distance matrix
    condensed = squareform(distance_matrix, checks=False)
    Z = linkage(condensed, method='average')
    consensus_labels = fcluster(Z, t=n_clusters, criterion='maxclust') - 1

    # Compute stability score based on clarity of co-association values
    # Ideal: values close to 0 or 1
    upper_triangle = M[np.triu_indices(n_samples, k=1)]

    # Bimodality measure: how far from 0.5 on average
    clarity = np.mean(np.abs(upper_triangle - 0.5)) * 2  # Scale to [0, 1]

    # Alternative: entropy-based measure
    def binary_entropy(p):
        if p == 0 or p == 1:
            return 0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    entropies = [binary_entropy(p) for p in upper_triangle]
    avg_entropy = np.mean(entropies)
    stability_entropy = 1 - avg_entropy  # Higher is more stable

    return {
        'consensus_labels': consensus_labels,
        'co_association_matrix': M,
        'clarity_score': clarity,
        'entropy_stability': stability_entropy
    }


# Example
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

print("Consensus Clustering Analysis")
print("=" * 50)

# Compare stability for different k
for k in [2, 3, 4, 5]:
    result = consensus_clustering(X, n_clusters=k, n_bootstrap=50)
    print(f"k={k}: Clarity={result['clarity_score']:.4f}, "
          f"Entropy Stability={result['entropy_stability']:.4f}")

# Visualize co-association matrix for true k
result_k3 = consensus_clustering(X, n_clusters=3, n_bootstrap=100)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Data with true labels
axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=30)
axes[0].set_title('True Labels')

# Data with consensus labels
axes[1].scatter(X[:, 0], X[:, 1], c=result_k3['consensus_labels'], cmap='viridis', s=30)
axes[1].set_title('Consensus Labels (k=3)')

# Co-association matrix (sorted by consensus cluster)
sorted_idx = np.argsort(result_k3['consensus_labels'])
M_sorted = result_k3['co_association_matrix'][sorted_idx][:, sorted_idx]
im = axes[2].imshow(M_sorted, cmap='viridis', aspect='auto')
axes[2].set_title('Co-Association Matrix (sorted)')
plt.colorbar(im, ax=axes[2], fraction=0.046)

plt.tight_layout()
plt.show()

# Distribution of co-association values
plt.figure(figsize=(8, 4))
upper_triangle = result_k3['co_association_matrix'][np.triu_indices(150, k=1)]
plt.hist(upper_triangle, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=0.5, color='red', linestyle='--', label='Maximum ambiguity')
plt.xlabel('Co-Association Frequency')
plt.ylabel('Count')
plt.title('Distribution of Pairwise Co-Association (k=3)')
plt.legend()
plt.tight_layout()
plt.show()
```

One of the primary applications of stability analysis is selecting the number of clusters k. The key insight is:
True clusters are stable, while artificial subdivisions are unstable.
When k is too small, genuine clusters are merged—stability may be high but we're missing structure. When k is too large, the algorithm creates arbitrary subdivisions of genuine clusters—these subdivisions vary between runs, reducing stability.
For prediction strength: choose the largest k whose prediction strength exceeds the chosen threshold (typically 0.8 to 0.9).

For pairwise stability: choose the k that maximizes the mean pairwise similarity (e.g., bootstrap ARI), and check that the spread across comparisons is small.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.datasets import make_blobs


def comprehensive_k_selection(X, k_range, n_bootstrap=50, random_state=42):
    """
    Compare stability-based k selection with internal metrics.
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]

    results = {'k': [], 'stability_mean': [], 'stability_std': [],
               'silhouette': [], 'inertia': []}

    for k in k_range:
        # Bootstrap stability
        pairwise_aris = []
        for b in range(n_bootstrap):
            # Two independent bootstrap samples
            idx1 = rng.choice(n_samples, size=n_samples, replace=True)
            idx2 = rng.choice(n_samples, size=n_samples, replace=True)

            X1 = X[np.unique(idx1)]
            X2 = X[np.unique(idx2)]

            # Cluster each
            m1 = KMeans(n_clusters=k, n_init='auto', random_state=b)
            m2 = KMeans(n_clusters=k, n_init='auto', random_state=b + 1000)

            l1 = m1.fit_predict(X1)
            l2 = m2.fit_predict(X2)

            # Common points
            common = set(np.unique(idx1)) & set(np.unique(idx2))
            if len(common) < 10:
                continue
            common = sorted(common)

            # Get labels for common points
            idx1_to_label = {i: l for i, l in zip(np.unique(idx1), l1)}
            idx2_to_label = {i: l for i, l in zip(np.unique(idx2), l2)}

            labels1 = [idx1_to_label[i] for i in common]
            labels2 = [idx2_to_label[i] for i in common]

            ari = adjusted_rand_score(labels1, labels2)
            pairwise_aris.append(ari)

        # Internal metrics on full data
        model = KMeans(n_clusters=k, n_init='auto', random_state=random_state)
        labels = model.fit_predict(X)
        sil = silhouette_score(X, labels)
        inertia = model.inertia_

        results['k'].append(k)
        results['stability_mean'].append(np.mean(pairwise_aris))
        results['stability_std'].append(np.std(pairwise_aris))
        results['silhouette'].append(sil)
        results['inertia'].append(inertia)

    return results


# Generate data with known structure
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Add some noise to make it more realistic
X += np.random.randn(*X.shape) * 0.5

print("Comprehensive k Selection Analysis")
print("=" * 50)

k_range = range(2, 10)
results = comprehensive_k_selection(X, k_range, n_bootstrap=30)

# Print results
print(f"\n{'k':^4} {'Stability':^12} {'Std':^8} {'Silhouette':^12}")
print("-" * 40)
for i, k in enumerate(results['k']):
    print(f"{k:^4} {results['stability_mean'][i]:^12.4f} "
          f"{results['stability_std'][i]:^8.4f} {results['silhouette'][i]:^12.4f}")

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Stability plot
axes[0, 0].errorbar(results['k'], results['stability_mean'],
                    yerr=results['stability_std'], marker='o', capsize=5, linewidth=2)
axes[0, 0].axhline(y=0.8, color='red', linestyle='--', alpha=0.5, label='Threshold 0.8')
axes[0, 0].axvline(x=4, color='green', linestyle='--', alpha=0.7, label='True k=4')
axes[0, 0].set_xlabel('Number of Clusters (k)')
axes[0, 0].set_ylabel('Bootstrap Stability (ARI)')
axes[0, 0].set_title('Stability-Based Selection')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Silhouette plot
axes[0, 1].plot(results['k'], results['silhouette'], 'bo-', linewidth=2, markersize=8)
axes[0, 1].axvline(x=4, color='green', linestyle='--', alpha=0.7, label='True k=4')
axes[0, 1].set_xlabel('Number of Clusters (k)')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Silhouette-Based Selection')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Elbow plot
axes[1, 0].plot(results['k'], results['inertia'], 'bo-', linewidth=2, markersize=8)
axes[1, 0].axvline(x=4, color='green', linestyle='--', alpha=0.7, label='True k=4')
axes[1, 0].set_xlabel('Number of Clusters (k)')
axes[1, 0].set_ylabel('Inertia (Within-cluster SS)')
axes[1, 0].set_title('Elbow Method')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Combined recommendation
# Normalize metrics for comparison
stab_norm = np.array(results['stability_mean'])
sil_norm = (np.array(results['silhouette']) - min(results['silhouette'])) / (
    max(results['silhouette']) - min(results['silhouette']))

combined = (stab_norm + sil_norm) / 2

axes[1, 1].plot(results['k'], stab_norm, 'b-', label='Stability (normalized)', linewidth=2)
axes[1, 1].plot(results['k'], sil_norm, 'r-', label='Silhouette (normalized)', linewidth=2)
axes[1, 1].plot(results['k'], combined, 'g--', label='Combined', linewidth=2)
axes[1, 1].axvline(x=4, color='green', linestyle=':', alpha=0.7, label='True k=4')
axes[1, 1].set_xlabel('Number of Clusters (k)')
axes[1, 1].set_ylabel('Normalized Score')
axes[1, 1].set_title('Combined Selection')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Recommendation
best_k_stability = results['k'][np.argmax(results['stability_mean'])]
best_k_silhouette = results['k'][np.argmax(results['silhouette'])]
best_k_combined = results['k'][np.argmax(combined)]

print("\nRecommendations:")
print(f"  Best k by stability: {best_k_stability}")
print(f"  Best k by silhouette: {best_k_silhouette}")
print(f"  Best k by combined: {best_k_combined}")
print("  True k: 4")
```

The theoretical justification for stability-based validation comes from statistical learning theory. Key results include:
Ben-David et al. (2006) showed that for certain clustering algorithms, stability implies consistency—stable clusterings converge to the population-optimal clustering as sample size increases.
Von Luxburg (2010) provided conditions under which stability indicates the 'identifiability' of clusters:
$$\Pr[\text{clusters recoverable}] \geq 1 - \delta$$
when stability exceeds a threshold depending on sample size and cluster separation.
Key insight: Stability is a necessary but not sufficient condition for good clustering. A perfectly random (useless) clustering can be perfectly stable if the algorithm is deterministic. But meaningful clusters should be stable.
Trivial solutions (k=1, or k=n) are perfectly stable but not useful. Always combine stability with internal metrics or domain knowledge. Stability tells you the clustering is reproducible, not that it's meaningful.
- Overlapping clusters: true clusters may have inherent ambiguity, leading to low stability even for the 'correct' k
- Varying cluster sizes: small clusters may be unstable due to sampling variance even if genuine
- Hierarchical structure: multiple valid k values exist; stability may not identify the 'best' level
- Algorithm dependence: some algorithms are inherently more stable than others (k-means++ initialization is more stable than random initialization; see the sketch after this list)
- Small samples: insufficient data for reliable stability estimates
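As an illustrative check of the algorithm-dependence point (an added sketch reusing the `bootstrap_stability` function and the dataset `X` from the earlier examples; the init settings are standard scikit-learn options and the specific parameter values are assumptions for demonstration):

```python
from sklearn.cluster import KMeans

# Compare bootstrap stability of k-means++ vs. purely random initialization
# (a single initialization per fit, so the init strategy is the only difference)
for init in ['k-means++', 'random']:
    algo = KMeans(n_clusters=4, init=init, n_init=1, random_state=0)
    result = bootstrap_stability(X, n_clusters=4, n_bootstrap=30, algorithm=algo)
    print(f"init={init:>10}: stability = {result['stability']:.3f} ± {result['std']:.3f}")
```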
| Method | Best For | Key Parameters |
|---|---|---|
| Bootstrap stability | General purpose, most datasets | n_bootstrap ≥ 100 |
| Subsampling | Large datasets, controlled overlap | fraction=0.8, n_subsamples ≥ 50 |
| Prediction strength | Rigorous k selection | threshold=0.8-0.9 |
| Consensus clustering | Final clustering + stability measure | n_bootstrap ≥ 100 |
This page provided comprehensive coverage of stability-based clustering evaluation.
You now understand how to assess clustering stability through bootstrap resampling, subsampling, prediction strength, and consensus approaches. Next, we'll explore the Gap statistic, which provides a principled method for comparing clustering quality against a null reference distribution.