Hierarchical clustering produces a dendrogram—a complete record of how data points merge at every distance. But most downstream applications require flat clusters: a single partition where each point belongs to exactly one group. The final step in hierarchical clustering is therefore cluster extraction: cutting the dendrogram to produce flat cluster assignments.
This is both an opportunity and a challenge. The opportunity: unlike K-Means where you must choose k upfront, you can explore multiple cut points without reclustering. The challenge: with infinitely many possible cuts (any height produces a different clustering), how do you choose the "right" one?
This page develops a systematic approach to cluster extraction, from simple fixed cuts to sophisticated dynamic tree methods, along with validation techniques to assess whether your chosen cut is meaningful.
By the end of this page, you will master: distance-based cutting for domain-interpretable thresholds; k-based cutting when cluster count is known; inconsistency-based cutting for adaptive structure discovery; dynamic tree cutting for complex hierarchies; the gap statistic and silhouette analysis for optimal k selection; and practical strategies for validating and justifying your cut choice.
Formal Definition:
Given a dendrogram D over n data points and a cut criterion, cluster extraction partitions the data into k non-overlapping clusters C₁, C₂, ..., Cₖ where 1 ≤ k ≤ n.
Every horizontal cut at height h intersects some number of vertical branches. Each intersection defines a cluster: all data points below that intersection (within that subtree) are assigned to the same cluster.
The Fundamental Trade-off:
Cut low and you get many small, tight clusters; cut high and you get a few large, loose ones. Every cut trades within-cluster homogeneity against the number of clusters.
Three Perspectives on Cutting:
Distance-based: "I want all points within a cluster to be within distance d of each other" (meaningful when d has domain interpretation)
Count-based: "I need exactly k clusters for my downstream task" (imposed by application requirements)
Structure-based: "Let the data tell me where natural breaks occur" (using statistical criteria to detect gaps)
| Approach | Input | When to Use |
|---|---|---|
| Fixed distance | Height threshold h | Distance h has domain meaning (e.g., max within-cluster diameter) |
| Fixed count | Target k clusters | Downstream task requires specific k (e.g., k market segments) |
| Inconsistency | Inconsistency threshold | Want adaptive cuts based on local structure |
| Gap statistic | None (automatic) | Want statistical validation of optimal k |
| Dynamic tree cut | None (automatic) | Complex hierarchies with variable branch heights |
Unlike partition-based clustering where one solution exists for each k, hierarchical clustering admits multiple perspectives on the same data. There's often no single 'correct' cut—different cuts reveal different aspects of the data structure. This flexibility is a strength when used thoughtfully and a source of confusion when not.
Distance-based cutting is the most direct approach: draw a horizontal line at height h and count intersections with the dendrogram branches.
Mathematical Definition:
For a cut at height h:
The resulting number of clusters k is: n minus the number of merges at distance ≤ h.
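This count can be checked directly against scipy's fcluster; a minimal sketch on arbitrary random data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))           # 20 arbitrary 2-D points
Z = linkage(X, method='complete')

h = 1.5                                # an arbitrary cut height
labels = fcluster(Z, t=h, criterion='distance')
k = len(np.unique(labels))

# k equals n minus the number of merges at height <= h
merges_below = int(np.sum(Z[:, 2] <= h))
print(k, len(X) - merges_below)        # the two numbers match
```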
Interpretation by Linkage:
The meaning of the cut height depends on the linkage:
Complete linkage: h is the maximum within-cluster diameter. All points in a cluster are within distance h of each other.
Single linkage: h is the minimum gap between clusters. Any two points in different clusters are at least distance h apart.
Ward linkage: h relates to the cumulative variance increase. Harder to interpret directly.
Average linkage: h is approximately the average within-cluster pairwise distance.
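The complete-linkage guarantee is easy to verify numerically; the sketch below (on arbitrary random data) confirms that no flat cluster from a cut at height h exceeds diameter h:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
Z = linkage(X, method='complete')
D = squareform(pdist(X))               # full pairwise distance matrix

h = 2.0
labels = fcluster(Z, t=h, criterion='distance')

diameters = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    diameters.append(D[np.ix_(idx, idx)].max())

print(max(diameters) <= h)             # no cluster exceeds diameter h
```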
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
X = np.vstack([
    np.random.randn(40, 2) * 0.5 + [0, 0],
    np.random.randn(40, 2) * 0.5 + [4, 0],
    np.random.randn(40, 2) * 0.5 + [2, 4]
])

Z = linkage(X, method='complete')

# Explore multiple distance thresholds
thresholds = [1.0, 2.0, 3.0, 4.0, 5.0]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Dendrogram with thresholds
ax = axes[0, 0]
dendrogram(Z, ax=ax, truncate_mode='lastp', p=30, no_labels=True)
ax.set_title('Dendrogram with Distance Thresholds')
ax.set_ylabel('Distance')

colors = ['red', 'orange', 'green', 'blue', 'purple']
for h, c in zip(thresholds, colors):
    ax.axhline(y=h, color=c, linestyle='--', alpha=0.7, linewidth=2)
    ax.text(ax.get_xlim()[1], h, f' h={h}', color=c, va='center')

# Number of clusters vs threshold
heights = np.linspace(0.1, 8.0, 100)
n_clusters = []
for h in heights:
    labels = fcluster(Z, t=h, criterion='distance')
    n_clusters.append(len(np.unique(labels)))

ax = axes[0, 1]
ax.plot(heights, n_clusters, 'b-', linewidth=2)
ax.set_xlabel('Cut Height (h)')
ax.set_ylabel('Number of Clusters')
ax.set_title('Clusters vs. Distance Threshold')
ax.grid(True, alpha=0.3)

# Highlight thresholds
for h, c in zip(thresholds, colors):
    labels = fcluster(Z, t=h, criterion='distance')
    k = len(np.unique(labels))
    ax.scatter([h], [k], c=c, s=100, zorder=5)
    ax.annotate(f'k={k}', (h, k), xytext=(5, 5), textcoords='offset points')

# Show cluster assignments for different thresholds
for idx, h in enumerate([1.0, 2.0, 4.0]):
    ax = axes[1, idx]
    labels = fcluster(Z, t=h, criterion='distance')
    k = len(np.unique(labels))
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
    ax.set_title(f'Distance threshold h={h} → {k} clusters')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

# Unused subplot - use for summary
ax = axes[0, 2]
ax.axis('off')
summary = """Distance-Based Cutting:
criterion='distance' in fcluster()

Advantages:
• Interpretable threshold
• Works with any linkage
• Domain-meaningful cutoff

Disadvantages:
• Must know meaningful h
• No automatic k selection
• Different k for different h"""
ax.text(0.1, 0.9, summary, transform=ax.transAxes, fontsize=11,
        verticalalignment='top', fontfamily='monospace',
        bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))

plt.tight_layout()
plt.savefig('distance_based_cutting.png', dpi=150)
print("Distance-based cutting saved to distance_based_cutting.png")
```

With complete linkage, a cut at height h guarantees that every cluster has diameter at most h. This is extremely useful in applications where you need a bound on within-cluster distances—for example, ensuring that all customers in a segment have similarity above a threshold, or that all locations in a cluster are within a maximum travel time.
When you know exactly how many clusters you need—often dictated by business requirements or downstream algorithms—count-based cutting extracts precisely k clusters.
Implementation:
To get k clusters, cut just below the (k-1)th highest merge height: this undoes the k-1 highest merges and leaves exactly k clusters. Equivalently, use fcluster(Z, t=k, criterion='maxclust') in scipy.
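A quick sketch (on arbitrary random data) confirming that criterion='maxclust' and an equivalent distance cut, placed between the kth and (k-1)th highest merge heights, produce the same partition:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
Z = linkage(X, method='ward')

k = 4
labels_maxclust = fcluster(Z, t=k, criterion='maxclust')

# Distance cut between the kth and (k-1)th highest merge heights
hs = np.sort(Z[:, 2])
cut = (hs[-k] + hs[-(k - 1)]) / 2
labels_distance = fcluster(Z, t=cut, criterion='distance')

def as_partition(labels):
    """Represent a labeling as a set of frozensets of point indices."""
    return {frozenset(np.where(labels == c)[0]) for c in np.unique(labels)}

print(as_partition(labels_maxclust) == as_partition(labels_distance))  # True
```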
Why Use Fixed k:
Application requirement: Marketing wants exactly 5 customer segments; recommender system needs 10 topic clusters
Resource constraints: Can only afford k focus groups or k product lines
Comparison: Testing multiple k values and choosing the best via validation metrics
Prior knowledge: Domain expertise suggests approximately k natural groups
Choosing k:
If k is not externally specified, several methods can help choose:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Generate data with 4 clusters
np.random.seed(42)
X = np.vstack([
    np.random.randn(40, 2) * 0.5 + [0, 0],
    np.random.randn(40, 2) * 0.5 + [4, 0],
    np.random.randn(40, 2) * 0.5 + [0, 4],
    np.random.randn(40, 2) * 0.5 + [4, 4]
])

Z = linkage(X, method='ward')

# Evaluate different k values
k_values = range(2, 11)
silhouette_scores = []
inertias = []  # Within-cluster sum of squares

for k in k_values:
    labels = fcluster(Z, t=k, criterion='maxclust')

    # Silhouette score
    sil = silhouette_score(X, labels)
    silhouette_scores.append(sil)

    # Within-cluster variance (inertia)
    inertia = 0
    for cluster_id in np.unique(labels):
        cluster_points = X[labels == cluster_id]
        centroid = cluster_points.mean(axis=0)
        inertia += np.sum((cluster_points - centroid) ** 2)
    inertias.append(inertia)

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Silhouette scores
ax = axes[0, 0]
ax.plot(k_values, silhouette_scores, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Silhouette Score')
ax.set_title('Silhouette Score vs. k')
ax.grid(True, alpha=0.3)

best_k_sil = k_values[np.argmax(silhouette_scores)]
ax.axvline(x=best_k_sil, color='red', linestyle='--', label=f'Best k={best_k_sil}')
ax.legend()

# Elbow plot (inertia)
ax = axes[0, 1]
ax.plot(k_values, inertias, 'go-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Within-Cluster Variance')
ax.set_title('Elbow Method')
ax.grid(True, alpha=0.3)

# Show cluster assignments for k=2,3,4,5 in the remaining subplots
subplot_order = [axes[0, 2], axes[1, 0], axes[1, 1], axes[1, 2]]
for ax, k in zip(subplot_order, [2, 3, 4, 5]):
    labels = fcluster(Z, t=k, criterion='maxclust')
    sil = silhouette_score(X, labels)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
    ax.set_title(f'k={k}, Silhouette={sil:.3f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.savefig('count_based_cutting.png', dpi=150)
print("Count-based cutting saved to count_based_cutting.png")

# Print summary
print("=== Count-Based Cutting Analysis ===")
print(f"{'k':<5} {'Silhouette':>12} {'Inertia':>15}")
print("-" * 35)
for k, sil, inert in zip(k_values, silhouette_scores, inertias):
    marker = ' <-- best' if k == best_k_sil else ''
    print(f"{k:<5} {sil:>12.4f} {inert:>15.2f}{marker}")
```

Inconsistency-based cutting provides an adaptive approach that detects natural cluster boundaries by comparing each merge to its local context.
The Inconsistency Coefficient:
For each merge at height h, we compute how "inconsistent" this merge is compared to previous merges in its subtree:
$$I = \frac{h - \bar{h}_{\text{below}}}{\sigma_{\text{below}}}$$
where $\bar{h}_{\text{below}}$ is the mean height of merges below this point (up to depth d), and $\sigma_{\text{below}}$ is their standard deviation.
Interpretation:
A coefficient near 0 means the merge height is similar to the merges below it, so the merge simply continues an existing cluster. A large coefficient (commonly above 1) means the merge happens at an unusually great height relative to its subtree, which marks a candidate cluster boundary.
How the Cut Works:
Cut all branches where the inconsistency coefficient exceeds threshold t. Points below unchanged merges stay together; points above severed merges split apart.
Depth Parameter:
The depth d controls how far down the tree to look when computing the baseline statistics. Larger d considers more history; smaller d is more local. Typical values: d = 2 or 3.
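A minimal sketch of how the depth parameter changes the computed coefficients, using scipy's inconsistent on arbitrary random data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
Z = linkage(X, method='average')

# Column 3 holds the inconsistency coefficient for each merge
R2 = inconsistent(Z, d=2)   # baseline from merges up to 2 levels down
R3 = inconsistent(Z, d=3)   # deeper lookback -> different baseline statistics

print(R2[-3:, 3])  # coefficients of the last three merges, depth 2
print(R3[-3:, 3])  # typically differ from the d=2 values
```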
Imagine climbing a staircase where some steps are much taller than others. Inconsistency detects the 'unusually tall steps'—merges that happen at much greater distance than the merges just below them. These tall steps often correspond to crossing from tight clusters into the gaps between clusters.
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent, dendrogram
import matplotlib.pyplot as plt

# Generate data with varying cluster tightness
np.random.seed(42)

# Tight cluster
tight = np.random.randn(30, 2) * 0.3 + [0, 0]
# Moderate cluster
moderate = np.random.randn(30, 2) * 0.6 + [4, 0]
# Loose cluster
loose = np.random.randn(30, 2) * 1.0 + [2, 5]

X = np.vstack([tight, moderate, loose])
Z = linkage(X, method='average')

# Compute inconsistency coefficients
R = inconsistent(Z, d=2)
# R has 4 columns:
# 0: mean of heights below
# 1: std of heights below
# 2: number of links below
# 3: inconsistency coefficient

inconsistency_coeffs = R[:, 3]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Dendrogram
dendrogram(Z, ax=axes[0, 0], truncate_mode='lastp', p=25, no_labels=True)
axes[0, 0].set_title('Dendrogram')
axes[0, 0].set_ylabel('Distance')

# Inconsistency coefficients
ax = axes[0, 1]
merge_indices = np.arange(1, len(inconsistency_coeffs) + 1)
ax.plot(merge_indices, inconsistency_coeffs, 'b-', linewidth=1)
ax.scatter(merge_indices, inconsistency_coeffs, c='blue', s=15)
ax.set_xlabel('Merge Index')
ax.set_ylabel('Inconsistency Coefficient')
ax.set_title('Inconsistency Coefficients per Merge')
ax.grid(True, alpha=0.3)

# Show thresholds
for t, c in [(0.7, 'green'), (1.0, 'orange'), (1.5, 'red')]:
    ax.axhline(y=t, color=c, linestyle='--', alpha=0.7, label=f't={t}')
ax.legend()

# Inconsistency histogram
ax = axes[0, 2]
ax.hist(inconsistency_coeffs, bins=20, color='steelblue', edgecolor='black', alpha=0.7)
ax.set_xlabel('Inconsistency Coefficient')
ax.set_ylabel('Count')
ax.set_title('Distribution of Inconsistency Coefficients')
ax.axvline(x=1.0, color='red', linestyle='--', linewidth=2, label='t=1.0')
ax.legend()

# Show cuts at different thresholds
thresholds = [0.7, 1.0, 1.5]
for idx, t in enumerate(thresholds):
    labels = fcluster(Z, t=t, criterion='inconsistent', depth=2)
    k = len(np.unique(labels))
    ax = axes[1, idx]
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
    ax.set_title(f'Inconsistency t={t} → {k} clusters')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.savefig('inconsistency_cutting.png', dpi=150)
print("Inconsistency cutting saved to inconsistency_cutting.png")

# Analyze the inconsistency structure
print("=== Inconsistency Analysis ===")
print(f"Total merges: {len(inconsistency_coeffs)}")
print(f"Inconsistency range: [{inconsistency_coeffs.min():.3f}, {inconsistency_coeffs.max():.3f}]")
print(f"Mean inconsistency: {inconsistency_coeffs.mean():.3f}")
print(f"Std inconsistency: {inconsistency_coeffs.std():.3f}")

print("Merges with high inconsistency (>1.0):")
high_inconsistency = np.where(inconsistency_coeffs > 1.0)[0]
for idx in high_inconsistency[-5:]:  # Last 5 high-inconsistency merges
    print(f"  Merge {idx+1}: I={inconsistency_coeffs[idx]:.3f}, height={Z[idx, 2]:.3f}")
```

Advantages of Inconsistency-Based Cutting:
Adaptive: different branches can effectively be cut at different heights, rather than at one global threshold
No need to prespecify k or an absolute distance cutoff
Detects natural breaks where a merge jumps well above its local context
Limitations:
The threshold t and depth d must still be chosen
The coefficient is unstable for nodes with few merges below them, since the baseline standard deviation is estimated from very few values
Less directly interpretable than a distance threshold
The gap statistic (Tibshirani, Walther, and Hastie, 2001) provides a principled, statistically-grounded method for selecting the optimal number of clusters.
Core Idea:
Compare the within-cluster dispersion of your clustering to what you'd expect from data with no cluster structure (uniform distribution). If your data has real clusters, the dispersion will be much smaller than the null expectation—and this gap will be largest at the correct k.
Mathematical Definition:
$$\text{Gap}(k) = E^*[\log W_k] - \log W_k$$
where:
$W_k$ is the pooled within-cluster sum of squares (dispersion) for a k-cluster partition
$E^*[\log W_k]$ is the expected value of $\log W_k$ under a null reference distribution with no cluster structure, estimated by averaging over B reference datasets
Algorithm:
1. Cluster the data for each candidate k and record log(Wₖ).
2. Generate B reference datasets drawn uniformly from the data's bounding box.
3. Cluster each reference dataset the same way; the mean of log(Wₖ*) across references estimates E*[log Wₖ], and the scaled standard deviation gives the standard error sₖ.
4. Compute Gap(k) = E*[log Wₖ] − log Wₖ.
5. Select the smallest k such that Gap(k) ≥ Gap(k+1) − sₖ₊₁.
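The usual selection rule, choosing the smallest k with Gap(k) ≥ Gap(k+1) − sₖ₊₁, can be illustrated on hypothetical gap values (the numbers below are invented purely for illustration):

```python
import numpy as np

# Hypothetical gap values and standard errors for k = 1..6 (illustrative only)
gaps = np.array([0.10, 0.35, 0.62, 0.60, 0.58, 0.55])
sk   = np.array([0.05, 0.05, 0.04, 0.04, 0.04, 0.04])

# Rule: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
optimal_k = None
for i in range(len(gaps) - 1):
    if gaps[i] >= gaps[i + 1] - sk[i + 1]:
        optimal_k = i + 1   # k values start at 1
        break

print(optimal_k)  # → 3
```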
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
import matplotlib.pyplot as plt

def compute_wk(X, labels):
    """Compute pooled within-cluster sum of squares."""
    wk = 0
    for cluster_id in np.unique(labels):
        cluster_points = X[labels == cluster_id]
        centroid = cluster_points.mean(axis=0)
        wk += np.sum((cluster_points - centroid) ** 2)
    return wk

def gap_statistic(X, k_range, n_refs=20, random_state=42):
    """
    Compute gap statistic for a range of k values.

    Returns:
        gaps: Gap statistic for each k
        sk: Standard error for each k
        log_wks: Log within-cluster dispersion for actual data
        ref_log_wks_mean: Mean log dispersion for reference data
    """
    np.random.seed(random_state)
    n, d = X.shape

    # Bounding box for uniform reference
    mins = X.min(axis=0)
    maxs = X.max(axis=0)

    log_wks = []
    ref_log_wks_all = []

    # Compute linkage once
    Z = linkage(X, method='ward')

    for k in k_range:
        # Cluster actual data
        labels = fcluster(Z, t=k, criterion='maxclust')
        wk = compute_wk(X, labels)
        log_wks.append(np.log(wk))

        # Cluster reference data
        ref_log_wks = []
        for _ in range(n_refs):
            # Generate uniform reference
            ref_X = np.random.uniform(mins, maxs, size=(n, d))
            ref_Z = linkage(ref_X, method='ward')
            ref_labels = fcluster(ref_Z, t=k, criterion='maxclust')
            ref_wk = compute_wk(ref_X, ref_labels)
            ref_log_wks.append(np.log(ref_wk))
        ref_log_wks_all.append(ref_log_wks)

    log_wks = np.array(log_wks)
    ref_log_wks_mean = np.array([np.mean(refs) for refs in ref_log_wks_all])
    ref_log_wks_std = np.array([np.std(refs) for refs in ref_log_wks_all])

    # Gap = E*[log(Wk)] - log(Wk)
    gaps = ref_log_wks_mean - log_wks

    # Standard error with sqrt(1 + 1/B) factor
    sk = ref_log_wks_std * np.sqrt(1 + 1/n_refs)

    return gaps, sk, log_wks, ref_log_wks_mean

# Generate test data with 3 clusters
np.random.seed(42)
X = np.vstack([
    np.random.randn(50, 2) * 0.5 + [0, 0],
    np.random.randn(50, 2) * 0.5 + [4, 4],
    np.random.randn(50, 2) * 0.5 + [8, 0]
])

# Compute gap statistic
k_range = range(1, 11)
gaps, sk, log_wks, ref_log_wks = gap_statistic(X, k_range, n_refs=10)

# Find optimal k using gap criterion
# Select smallest k where Gap(k) >= Gap(k+1) - s_{k+1}
optimal_k = 1
for i in range(len(gaps) - 1):
    if gaps[i] >= gaps[i+1] - sk[i+1]:
        optimal_k = k_range[i]
        break

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Gap statistic
ax = axes[0]
ax.errorbar(list(k_range), gaps, yerr=sk, fmt='bo-', capsize=5, linewidth=2)
ax.axvline(x=optimal_k, color='red', linestyle='--', linewidth=2,
           label=f'Optimal k={optimal_k}')
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Gap Statistic')
ax.set_title('Gap Statistic')
ax.legend()
ax.grid(True, alpha=0.3)

# Log(Wk) comparison
ax = axes[1]
ax.plot(list(k_range), log_wks, 'b-o', linewidth=2, label='Actual data')
ax.plot(list(k_range), ref_log_wks, 'r--s', linewidth=2, label='Reference (uniform)')
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('log(Wk)')
ax.set_title('Within-Cluster Dispersion')
ax.legend()
ax.grid(True, alpha=0.3)

# Resulting clusters
ax = axes[2]
Z = linkage(X, method='ward')
labels = fcluster(Z, t=optimal_k, criterion='maxclust')
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
ax.set_title(f'Gap-Optimal Clustering (k={optimal_k})')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.savefig('gap_statistic.png', dpi=150)
print(f"Gap statistic analysis saved (optimal k={optimal_k})")
```

Use at least 20-50 reference samples for stable estimates. The uniform distribution may not be appropriate for all data—consider matching the data's marginal distributions instead. The gap statistic can be slow for large datasets due to repeated clustering of reference samples.
Beyond automated methods like the gap statistic, several validation strategies help assess whether a chosen cut produces a meaningful clustering:
1. Internal Validation Metrics:
Measures computed from the data and clustering alone:
Silhouette Score: Mean (b-a)/max(a,b) where a = mean intra-cluster distance, b = mean nearest-cluster distance. Range [-1, 1], higher is better.
Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance, weighted by cluster counts. Higher is better.
Davies-Bouldin Index: Average similarity of each cluster to its most similar cluster. Lower is better.
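As a sanity check, the silhouette formula above can be computed by hand and compared with sklearn's silhouette_score (arbitrary two-blob data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(15, 2)), rng.normal(size=(15, 2)) + 5])
labels = fcluster(linkage(X, method='ward'), t=2, criterion='maxclust')
D = squareform(pdist(X))

s = np.zeros(len(X))
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False
    a = D[i, same].mean()                    # mean intra-cluster distance
    b = min(D[i, labels == c].mean()         # mean distance to nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    s[i] = (b - a) / max(a, b)

print(np.isclose(s.mean(), silhouette_score(X, labels)))  # True
```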
2. Stability Analysis:
Do similar cuts emerge when data is perturbed?
3. External Validation (when ground truth exists):
Compare cluster assignments against known labels, e.g., with the Adjusted Rand Index (ARI).
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import (
    silhouette_score, calinski_harabasz_score,
    davies_bouldin_score, adjusted_rand_score
)
import matplotlib.pyplot as plt

# Generate data with ground truth
np.random.seed(42)
n_per_cluster = 50
X = np.vstack([
    np.random.randn(n_per_cluster, 2) * 0.5 + [0, 0],
    np.random.randn(n_per_cluster, 2) * 0.5 + [4, 0],
    np.random.randn(n_per_cluster, 2) * 0.5 + [2, 4]
])
true_labels = np.repeat([0, 1, 2], n_per_cluster)

Z = linkage(X, method='ward')

# Evaluate metrics for different k
k_range = range(2, 11)
metrics = {
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': [],
    'ARI vs True': []
}

for k in k_range:
    labels = fcluster(Z, t=k, criterion='maxclust')
    metrics['Silhouette'].append(silhouette_score(X, labels))
    metrics['Calinski-Harabasz'].append(calinski_harabasz_score(X, labels))
    metrics['Davies-Bouldin'].append(davies_bouldin_score(X, labels))
    metrics['ARI vs True'].append(adjusted_rand_score(true_labels, labels))

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot each metric
for ax, (name, values) in zip(axes.flatten(), metrics.items()):
    ax.plot(list(k_range), values, 'o-', linewidth=2, markersize=8)
    ax.set_xlabel('Number of Clusters (k)')
    ax.set_ylabel(name)
    ax.set_title(name)
    ax.grid(True, alpha=0.3)

    # Mark optimal (Davies-Bouldin is minimized; the others are maximized)
    if name == 'Davies-Bouldin':
        opt_k = k_range[np.argmin(values)]
        opt_val = min(values)
    else:
        opt_k = k_range[np.argmax(values)]
        opt_val = max(values)
    ax.axvline(x=opt_k, color='red', linestyle='--', alpha=0.7)
    ax.scatter([opt_k], [opt_val], color='red', s=150, zorder=5)
    ax.annotate(f'k={opt_k}', (opt_k, opt_val), xytext=(10, 10),
                textcoords='offset points')

plt.tight_layout()
plt.savefig('validation_metrics.png', dpi=150)
print("Validation metrics saved to validation_metrics.png")

# Stability analysis
print("=== Stability Analysis ===")
n_bootstrap = 20

bootstrap_labels = []
for i in range(n_bootstrap):
    # Bootstrap sample
    indices = np.random.choice(len(X), size=len(X), replace=True)
    X_boot = X[indices]
    Z_boot = linkage(X_boot, method='ward')
    labels_boot = fcluster(Z_boot, t=3, criterion='maxclust')
    bootstrap_labels.append((indices, labels_boot))

# Compare pairs of bootstrap clusterings
ari_scores = []
for i in range(n_bootstrap):
    for j in range(i+1, n_bootstrap):
        idx_i, labels_i = bootstrap_labels[i]
        idx_j, labels_j = bootstrap_labels[j]
        # Find common indices
        common = np.intersect1d(idx_i, idx_j)
        if len(common) > 10:
            # Map labels to common indices
            labels_i_common = labels_i[[np.where(idx_i == c)[0][0] for c in common]]
            labels_j_common = labels_j[[np.where(idx_j == c)[0][0] for c in common]]
            ari = adjusted_rand_score(labels_i_common, labels_j_common)
            ari_scores.append(ari)

print(f"Bootstrap stability (k=3):")
print(f"  Mean ARI between bootstrap samples: {np.mean(ari_scores):.4f}")
print(f"  Std ARI: {np.std(ari_scores):.4f}")
print(f"  (ARI > 0.8 suggests stable clusters)")
```

| Metric | Range | Optimal | Interpretation |
|---|---|---|---|
| Silhouette | [-1, 1] | Higher | How well points fit their cluster vs nearest other |
| Calinski-Harabasz | [0, ∞) | Higher | Between-cluster vs within-cluster variance ratio |
| Davies-Bouldin | [0, ∞) | Lower | Average cluster similarity to most similar neighbor |
| ARI | [-1, 1] | Higher | Agreement with ground truth (if available) |
| Stability | [0, 1] | Higher | Consistency across data perturbations |
We've covered the complete spectrum of techniques for extracting flat cluster assignments from hierarchical dendrograms.
What's Next:
We've primarily focused on agglomerative (bottom-up) clustering. In the final page of this module, we'll explore divisive clustering—the top-down alternative that starts with all points in one cluster and recursively splits. We'll understand when divisive approaches might outperform agglomerative methods and see the computational challenges that make them rare in practice.
You now have a complete toolkit for extracting clusters from hierarchical structures: distance cuts, count cuts, inconsistency cuts, gap statistic, and validation strategies. You can justify your cut choice with appropriate metrics and assess cluster stability. Next, we'll complete the picture with divisive clustering.