A dendrogram is far more than a pretty tree diagram—it's a complete record of how your data relates at every level of granularity. A well-interpreted dendrogram can reveal natural groupings, identify outliers, expose hierarchical structure, and guide the critical decision of how many clusters to extract. Mastering dendrogram interpretation transforms hierarchical clustering from a black box into a powerful exploratory tool.
Every branch, split, and merge height in a dendrogram carries meaning. The height at which clusters join reflects their dissimilarity. The shape of the tree reveals whether clusters are well-separated or gradually blending. The pattern of early versus late merges exposes the natural hierarchy in your data. Learning to read these patterns is an essential skill for any practitioner using hierarchical methods.
By the end of this page, you will be able to: read dendrogram structure and identify core components; interpret merge heights and understand what they represent; identify natural cluster boundaries using visual inspection and quantitative methods; recognize common dendrogram patterns and what they indicate about data structure; use the cophenetic distance to evaluate how faithfully a dendrogram preserves the original distances; and apply practical visualization techniques for large and complex hierarchies.
A dendrogram is a binary tree where each leaf is a single data point, each internal node represents a merge of two clusters, and the height of an internal node records the distance at which that merge occurred.
Key Structural Elements:
Horizontal lines (crossbars): Each horizontal line represents a merge event where two clusters combine into one. The y-position of this line indicates the merge distance—how far apart the clusters were when merged (according to the linkage criterion).
Vertical lines: Connect the horizontal merge bar to the clusters being merged below and to the parent merge above.
Leaf positions: The x-axis ordering of leaves is not unique—siblings can be swapped without changing the hierarchy. Good visualizations choose orderings that minimize crossing lines.
Total height: The distance from leaves to the root represents the total range of merge distances in the data.
| Component | Visual Appearance | Interpretation |
|---|---|---|
| Leaf node | Bottom of vertical line, labeled | Original data point or singleton cluster |
| Internal node | Horizontal line with two children below | Merged cluster created at that step |
| Merge height | Y-coordinate of horizontal line | Inter-cluster distance at merge time |
| Branch length | Vertical line length | Distance gap between successive merges |
| Root node | Topmost horizontal line | Final cluster containing all points |
| Subtree | Any connected sub-portion | A cluster at some granularity level |
Only the y-axis (height) carries distance information. The x-axis ordering is chosen for visual clarity but has no mathematical meaning. Two dendrograms can represent identical hierarchies with completely different x-axis arrangements. Never interpret horizontal proximity of leaves as indicating similarity—only the height at which they merge matters.
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Simple 6-point example for clear illustration
np.random.seed(42)
X = np.array([
    [0, 0],      # Point A
    [0.5, 0],    # Point B - close to A
    [3, 0],      # Point C
    [3.5, 0],    # Point D - close to C
    [1.5, 3],    # Point E
    [2, 3.2],    # Point F - close to E
])
labels = ['A', 'B', 'C', 'D', 'E', 'F']

# Compute linkage
Z = linkage(X, method='average')

# Create detailed visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Scatter plot with annotations
axes[0].scatter(X[:, 0], X[:, 1], s=100, c='steelblue',
                edgecolors='black', zorder=5)
for i, label in enumerate(labels):
    axes[0].annotate(label, (X[i, 0], X[i, 1]), fontsize=12,
                     fontweight='bold', xytext=(5, 5),
                     textcoords='offset points')
axes[0].set_title('Data Points in Feature Space')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].grid(True, alpha=0.3)

# Right: Annotated dendrogram
dendro = dendrogram(Z, ax=axes[1], labels=labels, leaf_font_size=12)

# Annotate merge events (order follows average linkage on this data:
# AB and CD merge before EF joins, since their mean distance is smaller)
merge_heights = Z[:, 2]
merge_labels = ['Merge 1 (A+B)', 'Merge 2 (C+D)', 'Merge 3 (E+F)',
                'Merge 4 (AB+CD)', 'Merge 5 (ABCD+EF)']

axes[1].set_title('Dendrogram with Merge Height Interpretation')
axes[1].set_ylabel('Distance (Merge Height)')
axes[1].set_xlabel('Data Points')

# Add horizontal guidelines at major merge heights
for height, label in zip(merge_heights, merge_labels[:len(merge_heights)]):
    axes[1].axhline(y=height, color='gray', linestyle=':', alpha=0.5)
    axes[1].text(5.5, height, f'{height:.2f}', fontsize=9, va='center')

plt.tight_layout()
plt.savefig('dendrogram_anatomy.png', dpi=150)
print("Dendrogram anatomy saved to dendrogram_anatomy.png")

# Print linkage matrix interpretation
print("=== Linkage Matrix Z ===")
print("Format: [cluster_1, cluster_2, distance, size]")
for i, row in enumerate(Z):
    c1, c2, dist, size = row
    c1_name = labels[int(c1)] if c1 < 6 else f"C{int(c1)}"
    c2_name = labels[int(c2)] if c2 < 6 else f"C{int(c2)}"
    print(f"Merge {i+1}: {c1_name} + {c2_name} at distance {dist:.4f}, "
          f"new size: {int(size)}")
```

The y-axis height in a dendrogram represents the distance at which clusters merge, as computed by the chosen linkage function. Understanding exactly what these heights mean is crucial for interpreting the tree correctly.
What Height Means for Different Linkages:
Single linkage: height is the distance between the closest pair of points across the two merged clusters
Complete linkage: height is the distance between the farthest pair of points across the two clusters
Average linkage: height is the mean pairwise distance between points in the two clusters
Ward linkage: height reflects the increase in total within-cluster variance caused by the merge
In a valid dendrogram, merge heights must be monotonically increasing from leaves to root. Each successive merge happens at a height at least as high as the previous. This property holds for single, complete, average, and Ward linkages. Centroid/median linkage can violate this (producing 'inversions'), making those dendrograms harder to interpret.
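This property is easy to verify directly from the linkage matrix; the following minimal sketch (dataset and helper name are illustrative) checks whether merge heights ever decrease:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def is_monotonic(Z):
    """True if merge heights (column 2 of Z) never decrease from one merge to the next."""
    return bool(np.all(np.diff(Z[:, 2]) >= 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Single, complete, average, and Ward linkage are guaranteed monotonic
for method in ['single', 'complete', 'average', 'ward']:
    assert is_monotonic(linkage(X, method=method))

# Centroid linkage carries no such guarantee; whether an inversion
# actually occurs depends on the dataset
print('centroid monotonic:', is_monotonic(linkage(X, method='centroid')))
```

If this check fails for your chosen linkage, plotted branches can dip downward, and height-based cuts become ambiguous.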
Reading Height Patterns:
Steady, gradual increase: Points are relatively uniformly distributed; no clear cluster structure
Sharp jumps followed by plateaus: Well-defined clusters exist; the jumps indicate inter-cluster gaps
Long vertical branches at the bottom: Tight, cohesive clusters with similar internal points
Short branches near the top: Final merges connect already-merged large clusters
Single very tall branch: An outlier that doesn't fit into any cluster until the very end
The Gap-Height Diagnostic:
One can plot the sorted merge heights and look for the largest "gap"—the biggest jump between consecutive heights. This gap often corresponds to a natural cut point for determining cluster count.
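A minimal sketch of the diagnostic (data and variable names are illustrative): for monotonic linkages the heights in the linkage matrix are already in ascending order, so the largest jump falls out of a single np.diff call.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three well-separated blobs, so the diagnostic should suggest k = 3
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 0], [2.5, 4])])

Z = linkage(X, method='ward')
heights = Z[:, 2]          # ascending for monotonic linkages
gaps = np.diff(heights)
i = int(np.argmax(gaps))   # largest gap lies just above merge i+1 (1-indexed)

# After i+1 merges there are n - (i + 1) clusters left, so cutting
# anywhere inside the largest gap yields this many clusters:
suggested_k = len(X) - (i + 1)
print('suggested number of clusters:', suggested_k)
```

Treat the result as a suggestion, not an answer: inspect the second- and third-largest gaps as well, since real data often admits more than one reasonable granularity.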
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Generate data with 3 well-separated clusters
np.random.seed(42)
n_per_cluster = 40

cluster1 = np.random.randn(n_per_cluster, 2) * 0.5 + [0, 0]
cluster2 = np.random.randn(n_per_cluster, 2) * 0.5 + [5, 0]
cluster3 = np.random.randn(n_per_cluster, 2) * 0.5 + [2.5, 4]
# Add some outliers
outliers = np.array([[8, 5], [-3, 6]])

X = np.vstack([cluster1, cluster2, cluster3, outliers])

# Compute linkage
Z = linkage(X, method='ward')
heights = Z[:, 2]

# Calculate gaps between consecutive heights
gaps = np.diff(heights)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Dendrogram
dendrogram(Z, ax=axes[0, 0], truncate_mode='lastp', p=30, no_labels=True)
axes[0, 0].set_title('Dendrogram (Ward Linkage)')
axes[0, 0].set_ylabel('Distance')
axes[0, 0].set_xlabel('Cluster')

# 2. Merge heights plot
merge_indices = np.arange(1, len(heights) + 1)
axes[0, 1].plot(merge_indices, heights, 'b-', linewidth=1)
axes[0, 1].scatter(merge_indices, heights, c='blue', s=10)
axes[0, 1].set_xlabel('Merge Step')
axes[0, 1].set_ylabel('Merge Height')
axes[0, 1].set_title('Merge Height Progression')
axes[0, 1].grid(True, alpha=0.3)

# Highlight the largest gaps
largest_gap_indices = np.argsort(gaps)[-3:][::-1]  # Top 3 gaps
for idx in largest_gap_indices:
    axes[0, 1].axhline(y=(heights[idx] + heights[idx+1]) / 2, color='red',
                       linestyle='--', alpha=0.7)
    axes[0, 1].annotate(f'Gap: {gaps[idx]:.2f}', xy=(idx+1, heights[idx]),
                        fontsize=9)

# 3. Gap sizes
gap_indices = np.arange(1, len(gaps) + 1)
axes[1, 0].bar(gap_indices, gaps, alpha=0.7, color='steelblue')
axes[1, 0].set_xlabel('Gap Index (between merge i and i+1)')
axes[1, 0].set_ylabel('Gap Size')
axes[1, 0].set_title('Height Gaps (Large Gap = Natural Cut Point)')

# Highlight top gaps
for idx in largest_gap_indices:
    axes[1, 0].bar(idx+1, gaps[idx], color='red', alpha=0.8)

# 4. Suggested cuts and resulting clusters
suggested_k_values = [2, 3, 4]
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']

for k in suggested_k_values:
    labels = fcluster(Z, t=k, criterion='maxclust')
    axes[1, 1].scatter(X[:, 0] + (k-3)*0.1, X[:, 1],
                       c=[colors[l-1] for l in labels], s=30,
                       alpha=0.6, label=f'k={k}')

axes[1, 1].set_title('Cluster Assignments for Different k')
axes[1, 1].set_xlabel('Feature 1')
axes[1, 1].set_ylabel('Feature 2')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('height_analysis.png', dpi=150)
print("Height analysis saved to height_analysis.png")

# Print analysis summary
print("=== Height Gap Analysis ===")
print(f"Total merges: {len(heights)}")
print(f"Height range: [{heights.min():.3f}, {heights.max():.3f}]")
print("Largest gaps (suggesting natural cluster boundaries):")
for i, idx in enumerate(largest_gap_indices):
    k = len(heights) - idx  # Number of clusters when cutting inside this gap
    print(f"  Gap {i+1}: {gaps[idx]:.3f} at merge {idx+1} → ~{k} clusters")
```

The fundamental question when using hierarchical clustering is: Where should we cut the dendrogram? This determines the final number of clusters. Several approaches exist:
1. Fixed Distance Cut:
Draw a horizontal line at a chosen height h. Every intersection with a vertical branch defines a cluster. Points below the line within the same subtree belong to the same cluster.
2. Fixed Number of Clusters:
Specify k clusters and cut the dendrogram at a height that produces exactly k clusters: any height between the (n-k)th and (n-k+1)th smallest merge heights.
3. Inconsistency Criterion:
Compare each merge height to the average height of preceding merges in that subtree. Cut where the inconsistency (normalized deviation) exceeds a threshold.
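Concretely, SciPy computes the coefficient for each merge m by standardizing its height against nearby merges (the notation here is ours):

$$\text{inconsistency}(m) = \frac{h_m - \bar{h}_d(m)}{s_d(m)}$$

where h_m is the height of merge m, and h̄_d(m) and s_d(m) are the mean and standard deviation of the merge heights within depth d below m (the coefficient is treated as 0 when s_d(m) = 0, as for a merge of two singletons). A large coefficient marks a merge that sits far above the merges beneath it, which is exactly the signature of a natural cluster boundary.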
| Criterion | Parameter t Meaning | Use Case |
|---|---|---|
| maxclust | Maximum number of clusters | You know how many clusters you want |
| distance | Maximum linkage distance | You have a meaningful distance threshold |
| inconsistent | Inconsistency threshold | Adaptive cut based on local structure |
| maxclust_monocrit | Like maxclust with custom metric | Advanced: custom cutting criterion |
Plot the number of clusters (n-i where i is merge index) versus merge height. Look for an 'elbow' where the height suddenly increases—similar to the elbow method for K-Means. This indicates that merging beyond this point combines very dissimilar clusters.
```python
import numpy as np
from scipy.cluster.hierarchy import (
    linkage, dendrogram, fcluster, inconsistent
)
import matplotlib.pyplot as plt

# Generate data with hierarchical structure
np.random.seed(42)

# Two main clusters, each with sub-clusters
main1_sub1 = np.random.randn(25, 2) * 0.3 + [0, 0]
main1_sub2 = np.random.randn(25, 2) * 0.3 + [1.5, 0]
main2_sub1 = np.random.randn(25, 2) * 0.3 + [5, 0]
main2_sub2 = np.random.randn(25, 2) * 0.3 + [6.5, 0]

X = np.vstack([main1_sub1, main1_sub2, main2_sub1, main2_sub2])
Z = linkage(X, method='ward')

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# 1. Base dendrogram with cut lines
dendrogram(Z, ax=axes[0, 0], truncate_mode='lastp', p=30, no_labels=True)
axes[0, 0].set_title('Dendrogram with Cut Options')
axes[0, 0].set_ylabel('Distance')

# Add horizontal cut lines
cut_heights = [1.0, 2.5, 6.0]
cut_labels = ['4 clusters', '2 clusters', '1 cluster']
colors = ['green', 'orange', 'red']
for h, label, c in zip(cut_heights, cut_labels, colors):
    axes[0, 0].axhline(y=h, color=c, linestyle='--', linewidth=2, label=label)
axes[0, 0].legend(loc='upper right')

# 2. Fixed number of clusters
for k, ax in zip([2, 4], [axes[0, 1], axes[0, 2]]):
    labels = fcluster(Z, t=k, criterion='maxclust')
    scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
    ax.set_title(f'criterion="maxclust", t={k}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

# 3. Fixed distance threshold
for h, ax in zip([1.0, 2.5], [axes[1, 0], axes[1, 1]]):
    labels = fcluster(Z, t=h, criterion='distance')
    n_clusters = len(np.unique(labels))
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
    ax.set_title(f'criterion="distance", t={h}\n({n_clusters} clusters)')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

# 4. Inconsistency-based cutting
# Compute inconsistency coefficients
R = inconsistent(Z, d=2)  # d=2 means look 2 levels deep

# The 4th column of R is the inconsistency coefficient
inconsistencies = R[:, 3]

ax = axes[1, 2]
ax.plot(range(1, len(inconsistencies)+1), inconsistencies, 'b-o', markersize=3)
ax.set_xlabel('Merge Step')
ax.set_ylabel('Inconsistency Coefficient')
ax.set_title('Inconsistency Coefficients')
ax.axhline(y=1.0, color='red', linestyle='--', label='Threshold=1.0')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cluster_boundaries.png', dpi=150)
print("Cluster boundary analysis saved to cluster_boundaries.png")

# Demonstrate inconsistency-based clustering
labels_inconsistent = fcluster(Z, t=1.0, criterion='inconsistent', depth=2)
print("Inconsistency-based clustering (t=1.0, depth=2):")
print(f"  Number of clusters: {len(np.unique(labels_inconsistent))}")
```

How well does the dendrogram represent the original distance structure? The cophenetic distance provides a way to answer this fundamental question.
Definition:
The cophenetic distance c(i,j) between two data points i and j is the height at which they first become part of the same cluster in the dendrogram. It's the y-coordinate of the horizontal merge bar that first joins the branches containing i and j.
Cophenetic Correlation:
The cophenetic correlation coefficient (CPCC) is the Pearson correlation between the original pairwise distances d(i,j) and the cophenetic distances c(i,j):
$$\text{CPCC} = \text{corr}(d(i,j), c(i,j))$$
A CPCC close to 1 indicates that the dendrogram faithfully preserves the original distance relationships. A low CPCC suggests the hierarchical representation distorts the data structure—perhaps indicating that hierarchical clustering isn't appropriate for this data, or that a different linkage might work better.
CPCC ≥ 0.85: Excellent representation; dendrogram reliably reflects true distances.
CPCC 0.7-0.85: Good representation; minor distortions present.
CPCC 0.5-0.7: Moderate; consider whether hierarchical structure is appropriate.
CPCC < 0.5: Poor; the dendrogram significantly distorts the data structure.
Using CPCC to Compare Linkages:
CPCC can help choose between linkage methods. Compute the hierarchy for each linkage, calculate CPCC, and select the linkage that best preserves original distances. Note that this is just one criterion—the "best" linkage also depends on what cluster shapes you expect and what you'll do with the results.
Connection to Ultrametrics:
A valid dendrogram defines an ultrametric on the data points—a distance function where the triangle inequality is strengthened to: d(a,c) ≤ max(d(a,b), d(b,c)). The cophenetic distance is exactly this ultrametric. Hierarchical clustering can be viewed as finding the best ultrametric approximation to the original distances.
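This claim can be checked numerically; the sketch below (dataset and tolerance chosen arbitrarily) verifies the strengthened triangle inequality on every triple of points using cophenetic distances:

```python
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))

Z = linkage(X, method='average')
C = squareform(cophenet(Z))  # square matrix of cophenetic distances

def ultrametric_triple(C, a, b, c, tol=1e-12):
    """Each pairwise distance is at most the max of the other two."""
    d1, d2, d3 = C[a, b], C[b, c], C[a, c]
    return (d3 <= max(d1, d2) + tol and
            d1 <= max(d2, d3) + tol and
            d2 <= max(d1, d3) + tol)

ok = all(ultrametric_triple(C, a, b, c)
         for a, b, c in combinations(range(len(X)), 3))
print('ultrametric inequality holds for all triples:', ok)
```

Equivalently, in an ultrametric every triangle is isosceles with the two longest sides equal, which is why all three orderings of the triple are checked.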
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

# Generate data
np.random.seed(42)
X = np.vstack([
    np.random.randn(40, 2) * 0.5 + [0, 0],
    np.random.randn(40, 2) * 0.5 + [3, 3],
    np.random.randn(40, 2) * 0.5 + [6, 0]
])

# Compute original distances
original_distances = pdist(X)

# Compare linkages using cophenetic correlation
linkages = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, link in enumerate(linkages):
    ax = axes[idx // 2, idx % 2]

    # Compute hierarchical clustering
    Z = linkage(X, method=link)

    # cophenet(Z, Y) returns (CPCC, condensed cophenetic distances)
    cpcc, cophenetic_distances = cophenet(Z, original_distances)

    # Scatter plot: original vs cophenetic distances
    ax.scatter(original_distances, cophenetic_distances, alpha=0.3, s=10)
    ax.plot([0, max(original_distances)], [0, max(original_distances)],
            'r--', linewidth=2, label='Perfect correlation')
    ax.set_xlabel('Original Pairwise Distance')
    ax.set_ylabel('Cophenetic Distance')
    ax.set_title(f'{link.capitalize()} Linkage\nCPCC = {cpcc:.4f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cophenetic_comparison.png', dpi=150)
print("Cophenetic comparison saved to cophenetic_comparison.png")

# Summary table
print("=== Cophenetic Correlation Comparison ===")
print("-" * 45)
print(f"{'Linkage':<12} {'CPCC':>10} {'Interpretation':>20}")
print("-" * 45)

for link in linkages:
    Z = linkage(X, method=link)
    cpcc, _ = cophenet(Z, original_distances)
    if cpcc >= 0.85:
        interp = "Excellent"
    elif cpcc >= 0.7:
        interp = "Good"
    elif cpcc >= 0.5:
        interp = "Moderate"
    else:
        interp = "Poor"
    print(f"{link:<12} {cpcc:>10.4f} {interp:>20}")

print("-" * 45)
```

Experienced practitioners learn to recognize characteristic dendrogram shapes that reveal underlying data structure. Here are the most common patterns and what they indicate:
| Pattern | Visual Description | Indicates |
|---|---|---|
| Well-separated clusters | Long vertical branches at bottom, short at top with large gaps | Clear cluster structure; easy to cut |
| Chaining (straggling) | One branch accumulates points one at a time, like a comb | Single linkage on non-clustered data; possible noise bridges |
| Balanced binary tree | Regular, symmetric tree with even subtrees | Complete/Ward on uniform density; equal-sized clusters |
| Hierarchical nesting | Multiple distinct 'shelves' at different heights | True multi-level hierarchy (e.g., taxonomy) |
| Outlier spikes | Long vertical branches reaching from bottom to near-top | Outliers that don't fit any cluster until the end |
| Gradual merging | No clear jumps; heights increase smoothly | No natural clusters; uniform or continuous distribution |
Pattern 1: Well-Separated Clusters
The ideal pattern for clustering. You'll see tight, low-height subtrees for individual clusters, then large vertical gaps before these subtrees merge into a single root. The gaps represent clear inter-cluster separation, and cutting within a gap produces stable cluster assignments.
Pattern 2: Chaining
Characteristic of single linkage on data without clear clusters, or when noise points form bridges. Instead of balanced subtrees, points accumulate one by one onto a growing chain. This usually indicates single linkage should be replaced with average or complete linkage.
Pattern 3: Outlier Detection
Outliers produce distinctive "spikes"—branches that extend nearly the full height of the dendrogram before joining any cluster. These points are far from every cluster and only merge at the very end. Counting such high branches gives a rough estimate of the outlier count.
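One way to operationalize this count (an assumed heuristic, not a standard API; the threshold factor is a free parameter) is to scan the linkage matrix for points that are still singletons when they merge at an unusually large height:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.4, size=(40, 2)),
    rng.normal(loc=[5, 0], scale=0.4, size=(40, 2)),
    [[12.0, 12.0], [-9.0, 10.0]],   # two planted outliers (rows 80 and 81)
])
n = len(X)
Z = linkage(X, method='average')

# Assumed rule: a singleton merging far above the median merge height is a spike
threshold = 10 * np.median(Z[:, 2])
outliers = set()
for c1, c2, h, _ in Z:
    if h > threshold:
        for c in (c1, c2):
            if c < n:                # ids below n are original points (singletons)
                outliers.add(int(c))

print('flagged outliers:', sorted(outliers))
```

The singleton check matters: late merges between two large clusters also occur at great heights, but they are structural, not outliers, and this rule correctly skips them.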
Pattern 4: No Natural Clusters
When data is uniformly distributed or has no cluster structure, the dendrogram shows steady, gradual height increase with no large gaps. This is a signal that clustering may not be meaningful for this data.
For large n (> 100 points), full dendrograms become illegible. Use truncation (show only top p merges), condensed views, or zoom into specific subtrees. The scipy dendrogram function supports truncation via truncate_mode='lastp' or truncate_mode='level'.
Effective dendrogram visualization is crucial for communicating hierarchical clustering results. Here are techniques for clear, informative dendrograms:
1. Truncation for Large Datasets:
When n > 50-100, showing all leaves makes the plot unreadable. Use:
truncate_mode='lastp': Show only the last p merged clusters
truncate_mode='level': Show only clusters above a certain level
show_leaf_counts=True: Indicate how many original points are in each leaf

2. Color Coding:

Use color_threshold to color branches below a cut height.

3. Leaf Ordering:

The x-axis ordering affects readability. Scipy uses a heuristic to minimize crossings, but you can pass optimal_ordering=True to linkage for a better (but slower) leaf arrangement.

4. Annotations:
```python
import numpy as np
from scipy.cluster.hierarchy import (
    linkage, dendrogram, fcluster, set_link_color_palette
)
import matplotlib.pyplot as plt

# Generate hierarchical data
np.random.seed(42)
n = 200  # Large enough to need truncation

# 4 clusters with sub-structure
X = np.vstack([
    np.random.randn(50, 2) * 0.4 + [0, 0],
    np.random.randn(50, 2) * 0.4 + [3, 0],
    np.random.randn(50, 2) * 0.4 + [1.5, 4],
    np.random.randn(50, 2) * 0.4 + [4.5, 4]
])

Z = linkage(X, method='ward')

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Full dendrogram (illegible at this size)
dendrogram(Z, ax=axes[0, 0], no_labels=True)
axes[0, 0].set_title(f'Full Dendrogram (n={n}) - Hard to Read!')
axes[0, 0].set_ylabel('Distance')

# 2. Truncated dendrogram (last 30 clusters)
dendrogram(Z, ax=axes[0, 1], truncate_mode='lastp', p=30,
           show_leaf_counts=True)
axes[0, 1].set_title('Truncated: Last 30 Clusters')
axes[0, 1].set_ylabel('Distance')
axes[0, 1].set_xlabel('Cluster (number of points)')

# 3. Color-coded with threshold
# Set custom color palette
custom_colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3']
set_link_color_palette(custom_colors)

# Color threshold creates colored branches below the cut
color_threshold = 5.0
dendro = dendrogram(Z, ax=axes[1, 0], truncate_mode='lastp', p=30,
                    color_threshold=color_threshold,
                    above_threshold_color='gray')
axes[1, 0].axhline(y=color_threshold, color='red', linestyle='--',
                   linewidth=2, label=f'Cut at {color_threshold}')
axes[1, 0].set_title('Color-Coded by Cluster')
axes[1, 0].set_ylabel('Distance')
axes[1, 0].legend()

# 4. Side-by-side with scatter plot
# Cut at 4 clusters and show correspondence
labels = fcluster(Z, t=4, criterion='maxclust')

# Make colors match between dendrogram and scatter
cluster_colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3']
point_colors = [cluster_colors[l-1] for l in labels]

axes[1, 1].scatter(X[:, 0], X[:, 1], c=point_colors, s=30, alpha=0.7)
axes[1, 1].set_title('Corresponding Cluster Assignments')
axes[1, 1].set_xlabel('Feature 1')
axes[1, 1].set_ylabel('Feature 2')

# Add cluster centroids
for i in range(1, 5):
    mask = labels == i
    centroid = X[mask].mean(axis=0)
    axes[1, 1].scatter(*centroid, c=cluster_colors[i-1], s=200,
                       marker='*', edgecolors='black', linewidth=2)
    axes[1, 1].annotate(f'C{i}', centroid, fontsize=12, fontweight='bold',
                        ha='center', va='bottom')

plt.tight_layout()
plt.savefig('dendrogram_visualization_tips.png', dpi=150)
print("Visualization saved to dendrogram_visualization_tips.png")

# Reset color palette
set_link_color_palette(None)
```

We've thoroughly covered how to read, interpret, and extract insights from dendrograms, the primary output of hierarchical clustering.
What's Next:
Now that you can interpret dendrograms, the next question is: how do we extract a flat clustering from this hierarchy? In the next page, we'll explore cluster extraction methods in detail—including distance-based cuts, k-based cuts, dynamic tree cutting, and validation strategies for choosing the optimal cut.
You've mastered dendrogram interpretation: reading structure, understanding heights, identifying boundaries, measuring quality with CPCC, recognizing patterns, and visualizing effectively. Next, we'll learn systematic methods for extracting clusters from the hierarchy.