In agglomerative clustering, every iteration asks the same question: Which two clusters are most similar and should be merged? The answer depends entirely on how we define similarity between clusters—and this is where linkage methods enter as the single most consequential design decision in hierarchical clustering.
Given two clusters A and B containing multiple points each, what is the "distance" between them? There's no single correct answer. Should we use the distance between their closest points? Their farthest points? Their centers? Some average? Each choice leads to fundamentally different behavior, producing hierarchies with radically different structures from the same input data.
Understanding linkage methods is understanding hierarchical clustering. The choice of linkage determines whether your algorithm finds elongated chains, compact spheres, or balanced trees. It affects sensitivity to noise, ability to detect irregularly shaped clusters, and computational complexity.
By the end of this page, you will master all major linkage methods: Single (minimum), Complete (maximum), Average (UPGMA/WPGMA), Ward's (minimum variance), Centroid, and Median linkages. For each method, you'll understand the mathematical definition, geometric intuition, characteristic behavior, advantages, disadvantages, and appropriate use cases. You'll be able to select the right linkage for any clustering problem.
Formal Definition:
Given two clusters Cᵢ = {x₁, x₂, ..., xₘ} and Cⱼ = {y₁, y₂, ..., yₙ}, and a point-wise distance function d(x, y), the linkage function L(Cᵢ, Cⱼ) computes a single scalar representing the inter-cluster distance.
The linkage function must satisfy, at a minimum:

- Non-negativity: L(Cᵢ, Cⱼ) ≥ 0
- Symmetry: L(Cᵢ, Cⱼ) = L(Cⱼ, Cᵢ)
- Consistency with the point-wise metric: for singleton clusters, L({x}, {y}) = d(x, y)
Beyond these minimal requirements, different linkage functions encode different notions of cluster similarity, leading to dramatically different hierarchies.
Why Linkage Matters So Much:
Consider a simple example: two clusters where one is compact and one is elongated. The closest points between them might be quite near, but the farthest points might be very far. Single linkage and complete linkage will make opposite decisions about whether to merge these clusters, leading to entirely different hierarchical structures.
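To make this concrete, here is a minimal sketch (with synthetic, made-up coordinates) that computes the closest-pair, average, and farthest-pair distances between a compact cluster and an elongated one. The printed numbers show why single and complete linkage can point in opposite directions:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Synthetic example: one compact cluster, one elongated cluster
rng = np.random.default_rng(0)
compact = rng.normal(scale=0.3, size=(20, 2))                      # tight blob near the origin
elongated = rng.normal(scale=0.3, size=(20, 2)) * [1, 8] + [3, 0]  # stretched along the y-axis

# All pairwise distances between the two clusters
D = cdist(compact, elongated)

print(f"Single   (min distance):  {D.min():.2f}")   # closest pair: relatively near
print(f"Average  (mean distance): {D.mean():.2f}")
print(f"Complete (max distance):  {D.max():.2f}")   # farthest pair: much larger
```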
| Linkage | Also Known As | Formula | Cluster Shape Tendency |
|---|---|---|---|
| Single | Minimum, Nearest-neighbor | min d(x,y) for x∈Cᵢ, y∈Cⱼ | Elongated chains |
| Complete | Maximum, Farthest-neighbor | max d(x,y) for x∈Cᵢ, y∈Cⱼ | Compact spheres |
| Average (UPGMA) | Mean, UPGMA | Mean of all pairwise distances | Moderate compactness |
| Weighted (WPGMA) | WPGMA | Mean of sub-cluster distances | Moderate compactness |
| Ward | Minimum variance | Increase in total within-cluster variance | Compact, equal-sized |
| Centroid | UPGMC | Distance between centroids | Variable |
| Median | WPGMC | Recursive midpoint distance | Variable |
Mathematical Definition:
$$d_{\text{single}}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$$
Single linkage defines inter-cluster distance as the distance between the closest pair of points, one from each cluster. It's the most optimistic measure—clusters are as close as their nearest members.
Geometric Intuition:
Imagine stretching a rubber band from cluster A to cluster B. Single linkage uses the shortest possible rubber band. If there's any path of nearby points connecting two regions, single linkage will merge them early.
The Chaining Effect:
Single linkage is notorious for producing long, chain-like clusters. If points A₁→A₂→A₃→...→Aₙ form a chain where each consecutive pair is close, single linkage will merge them all into one cluster—even if A₁ and Aₙ are far apart.
This "chaining effect" can be a bug or a feature:
Single linkage clustering is mathematically equivalent to the minimum spanning tree (MST) of the data. The dendrogram can be constructed by building the MST (using Kruskal's or Prim's algorithm) and recording edge weights as merge distances. This connection enables the efficient O(n²) SLINK algorithm and provides theoretical guarantees from graph theory.
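Here is a minimal sketch, using only SciPy, that checks this equivalence on random data: the sorted single-linkage merge heights coincide with the sorted edge weights of the MST built over the pairwise-distance graph.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

np.random.seed(0)
X = np.random.randn(40, 2)

# Single-linkage merge heights (third column of the linkage matrix)
merge_heights = np.sort(linkage(X, method='single')[:, 2])

# Edge weights of the minimum spanning tree over the pairwise-distance graph
D = squareform(pdist(X))
mst = minimum_spanning_tree(D).toarray()
mst_weights = np.sort(mst[mst > 0])             # the n - 1 MST edges

# The two sets of n - 1 values coincide (up to floating-point error)
print(np.allclose(merge_heights, mst_weights))  # True
```

The listing below then demonstrates the practical consequence of chaining on data containing a noise bridge.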
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Create data demonstrating chaining effect
np.random.seed(42)

# Two well-separated compact clusters
cluster1 = np.random.randn(30, 2) * 0.5 + [0, 0]
cluster2 = np.random.randn(30, 2) * 0.5 + [8, 0]

# Bridge of noise points connecting them
bridge = np.array([[i, 0] for i in np.linspace(1, 7, 10)])
bridge += np.random.randn(10, 2) * 0.1  # Small noise

X = np.vstack([cluster1, bridge, cluster2])

# Compare single vs complete linkage
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Single linkage
Z_single = linkage(X, method='single')
labels_single = fcluster(Z_single, t=2, criterion='maxclust')

axes[0, 0].scatter(X[:, 0], X[:, 1], c=labels_single, cmap='viridis', s=50)
axes[0, 0].set_title('Single Linkage: 2 Clusters')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')

dendrogram(Z_single, ax=axes[0, 1], truncate_mode='lastp', p=20)
axes[0, 1].set_title('Single Linkage Dendrogram')
axes[0, 1].set_ylabel('Distance')

# Complete linkage
Z_complete = linkage(X, method='complete')
labels_complete = fcluster(Z_complete, t=2, criterion='maxclust')

axes[1, 0].scatter(X[:, 0], X[:, 1], c=labels_complete, cmap='viridis', s=50)
axes[1, 0].set_title('Complete Linkage: 2 Clusters')
axes[1, 0].set_xlabel('Feature 1')
axes[1, 0].set_ylabel('Feature 2')

dendrogram(Z_complete, ax=axes[1, 1], truncate_mode='lastp', p=20)
axes[1, 1].set_title('Complete Linkage Dendrogram')
axes[1, 1].set_ylabel('Distance')

plt.tight_layout()
plt.savefig('linkage_comparison.png', dpi=150)
print("Visualization saved to linkage_comparison.png")

# The chaining effect: single linkage merges through the bridge
print(f"\nSingle linkage found {len(np.unique(labels_single))} clusters")
print(f"Complete linkage found {len(np.unique(labels_complete))} clusters")

# Analyze cluster compositions
for label in np.unique(labels_single):
    size = np.sum(labels_single == label)
    print(f"  Single cluster {label}: {size} points")
```

Mathematical Definition:
$$d_{\text{complete}}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)$$
Complete linkage defines inter-cluster distance as the distance between the farthest pair of points, one from each cluster. It's the most pessimistic measure—clusters are as far apart as their most distant members.
Geometric Intuition:
Complete linkage asks: "What's the worst-case distance if I merge these clusters?" It ensures that after merging, no two points in the combined cluster are farther apart than the merge distance. This creates compact, spherical clusters with known diameter bounds.
Behavior Characteristics:
Complete linkage tends to produce balanced, compact clusters of similar diameter. It's resistant to chaining because a single bridge point cannot make two clusters appear close—the maximum distance still reflects the true separation.
However, it's sensitive to outliers in a different way: an outlier far from a cluster center increases the cluster's effective diameter, potentially preventing merges that would otherwise be intuitive.
Complete linkage provides a useful guarantee: if two clusters were merged at distance d, then the maximum pairwise distance within the merged cluster is at most d. This 'diameter bound' is valuable in applications where you need to ensure all members of a cluster are within a known distance of each other (e.g., network latency constraints, physical proximity requirements).
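As a quick check of this guarantee, the sketch below (random Euclidean data) cuts a complete-linkage dendrogram at a height midway between two merges and verifies that every resulting cluster's diameter stays below the cut:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
D = squareform(pdist(X))          # full pairwise-distance matrix

Z = linkage(X, method='complete')

# Cut between the 5th- and 4th-from-last merges, yielding 5 flat clusters
cut = 0.5 * (Z[-5, 2] + Z[-4, 2])
labels = fcluster(Z, t=cut, criterion='distance')

# Every cluster's diameter (largest intra-cluster distance) is bounded by the cut height
for lab in np.unique(labels):
    members = np.where(labels == lab)[0]
    diameter = D[np.ix_(members, members)].max()
    assert diameter <= cut
print(f"{labels.max()} clusters, all diameters <= cut height {cut:.3f}")
```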
When to Use Complete Linkage:

- You expect compact, roughly spherical clusters of similar diameter.
- You need the diameter guarantee described above, so that every pair of points within a cluster is within a known distance.
- Noise points between clusters would otherwise cause chaining; complete linkage is resistant to single bridge points.

When to Avoid Complete Linkage:

- Clusters are elongated, chain-like, or otherwise non-convex; the maximum distance heavily penalizes such shapes.
- Cluster densities or sizes vary widely, because large natural clusters tend to be split before smaller ones merge.
- A few extreme points inflate a cluster's effective diameter and block merges that would otherwise be intuitive (see the caveat above).
Mathematical Definition:
$$d_{\text{UPGMA}}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) defines inter-cluster distance as the arithmetic mean of all pairwise distances between points in the two clusters. Every pair of points contributes equally, so larger clusters exert proportionally more influence on the result.
Geometric Intuition:
Average linkage asks: "On average, how far apart are points in these two clusters?" It balances between the extremes of single linkage (only considering the closest pair) and complete linkage (only considering the farthest pair).
UPGMA vs WPGMA:
There are two variants of average linkage:
UPGMA (Unweighted): Weights each point equally. A cluster with 100 points has more influence than a cluster with 10 points.
WPGMA (Weighted): Weights each sub-cluster equally. Uses the average of the two sub-cluster distances to any third cluster, regardless of sub-cluster sizes.
$$d_{\text{WPGMA}}(C_i \cup C_j, C_k) = \frac{d(C_i, C_k) + d(C_j, C_k)}{2}$$
| Aspect | UPGMA (Unweighted) | WPGMA (Weighted) |
|---|---|---|
| Weighting | By number of points | By number of sub-clusters |
| Large cluster influence | Strong (proportional to size) | Equal to small clusters |
| Merge height | Exact mean of all inter-cluster pairwise distances | Weighted mean, biased by merge history |
| Use case | When all points equally important | When cluster history matters |
| Sensitivity to merging order | Lower | Higher |
Average linkage is often the default choice when you're unsure. It's less prone to chaining than single linkage, less biased toward spherical shapes than complete linkage, and generally produces reasonable hierarchies for a wide range of data distributions. It's the 'safe' choice for exploratory analysis.
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Create asymmetric clusters to show UPGMA vs WPGMA difference
np.random.seed(42)

# Small cluster (10 points)
small_cluster = np.random.randn(10, 2) * 0.3 + [0, 0]
# Large cluster (50 points)
large_cluster = np.random.randn(50, 2) * 0.3 + [2, 0]
# Medium cluster (20 points)
medium_cluster = np.random.randn(20, 2) * 0.3 + [1, 2]

X = np.vstack([small_cluster, large_cluster, medium_cluster])

# Compare UPGMA (average) vs WPGMA (weighted)
Z_upgma = linkage(X, method='average')   # UPGMA
Z_wpgma = linkage(X, method='weighted')  # WPGMA

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Scatter plot (one call per cluster so the legend labels are correct)
axes[0].scatter(small_cluster[:, 0], small_cluster[:, 1], c='red', s=50, alpha=0.7, label='Small (10)')
axes[0].scatter(large_cluster[:, 0], large_cluster[:, 1], c='blue', s=50, alpha=0.7, label='Large (50)')
axes[0].scatter(medium_cluster[:, 0], medium_cluster[:, 1], c='green', s=50, alpha=0.7, label='Medium (20)')
axes[0].set_title('Data: Unequal Cluster Sizes')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].legend(loc='upper right')

# UPGMA dendrogram
dendrogram(Z_upgma, ax=axes[1], truncate_mode='lastp', p=15)
axes[1].set_title('UPGMA (Average) Linkage')
axes[1].set_ylabel('Distance')
axes[1].set_xlabel('Cluster')

# WPGMA dendrogram
dendrogram(Z_wpgma, ax=axes[2], truncate_mode='lastp', p=15)
axes[2].set_title('WPGMA (Weighted) Linkage')
axes[2].set_ylabel('Distance')
axes[2].set_xlabel('Cluster')

plt.tight_layout()
plt.savefig('average_linkage_variants.png', dpi=150)
print("Comparison saved to average_linkage_variants.png")

# Examine the final merges
print("\n=== Final Merges ===")
print("\nUPGMA (last 3 merges):")
for i in range(-3, 0):
    print(f"  Distance: {Z_upgma[i, 2]:.4f}, Combined size: {int(Z_upgma[i, 3])}")

print("\nWPGMA (last 3 merges):")
for i in range(-3, 0):
    print(f"  Distance: {Z_wpgma[i, 2]:.4f}, Combined size: {int(Z_wpgma[i, 3])}")
```

Mathematical Definition:
Ward's method minimizes the increase in total within-cluster variance (sum of squared deviations from cluster centroids) when merging two clusters:
$$d_{\text{Ward}}(C_i, C_j) = \sqrt{\frac{2\, n_i\, n_j}{n_i + n_j}} \; \lVert \bar{C}_i - \bar{C}_j \rVert$$
where $n_i$ and $n_j$ are cluster sizes, and $\bar{C}_i$ and $\bar{C}_j$ are centroids.
Equivalently, the squared Ward distance is (up to a factor of 2) the increase in the total within-cluster sum of squared deviations caused by the merge:

$$\Delta V = V(C_i \cup C_j) - V(C_i) - V(C_j), \qquad d_{\text{Ward}}(C_i, C_j) = \sqrt{2\, \Delta V}$$

where $V(C) = \sum_{x \in C} \lVert x - \bar{C} \rVert^2$ is the within-cluster sum of squared deviations from the centroid.
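The two formulations are easy to reconcile numerically. In this minimal sketch (two made-up clusters A and B), the brute-force variance increase ΔV matches the closed form nᵢ·nⱼ/(nᵢ+nⱼ) times the squared centroid distance, which is exactly half the squared Ward distance:

```python
import numpy as np

# Two made-up clusters of unequal size
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))          # cluster C_i with n_i = 8 points
B = rng.normal(size=(5, 3)) + 2.0    # cluster C_j with n_j = 5 points

def sse(C):
    """Within-cluster sum of squared deviations from the centroid, V(C)."""
    return ((C - C.mean(axis=0)) ** 2).sum()

# Brute-force variance increase: V(A u B) - V(A) - V(B)
delta_v = sse(np.vstack([A, B])) - sse(A) - sse(B)

# Closed form: (n_i * n_j / (n_i + n_j)) * ||centroid_i - centroid_j||^2
n_i, n_j = len(A), len(B)
closed_form = n_i * n_j / (n_i + n_j) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

print(np.isclose(delta_v, closed_form))  # True: d_Ward = sqrt(2 * delta_v)
```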
Geometric Intuition:
Ward's method asks: "How much will our clusters become more spread out if we merge these two?" It greedily minimizes variance increase at each step, similar to K-Means but building a full hierarchy.
Connection to K-Means:
Ward's linkage optimizes the same objective as K-Means (minimize within-cluster sum of squares). Cutting a Ward dendrogram at a chosen level often produces clusters similar to K-Means with that many clusters. This makes Ward's method a natural choice when you expect K-Means-like compact clusters.
Ward's method fundamentally relies on computing centroids and variances, which requires a Euclidean (or more generally, inner-product) space. Unlike other linkages that only need pairwise distances, Ward's method cannot be applied to arbitrary distance matrices—it needs the original feature vectors. This is a significant limitation for non-Euclidean data.
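To see this constraint in a library, here is a small sketch assuming a recent scikit-learn (where AgglomerativeClustering exposes a metric parameter): requesting Ward linkage with a non-Euclidean metric is rejected at fit time.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).randn(20, 2)

# Ward linkage with the default Euclidean metric works as expected
ward = AgglomerativeClustering(n_clusters=3, linkage='ward')
print(ward.fit_predict(X))

# Ward with a non-Euclidean metric is rejected, because the variance
# criterion is only defined in Euclidean space
try:
    AgglomerativeClustering(n_clusters=3, linkage='ward', metric='manhattan').fit(X)
except ValueError as err:
    print("ValueError:", err)
```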
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt

# Generate data with known cluster structure
np.random.seed(42)
n_samples = 150

# Three well-separated Gaussian clusters (ideal for Ward's method)
X = np.vstack([
    np.random.randn(50, 2) * 0.8 + [0, 0],
    np.random.randn(50, 2) * 0.8 + [4, 4],
    np.random.randn(50, 2) * 0.8 + [8, 0]
])

# Apply Ward's linkage
Z_ward = linkage(X, method='ward')

# Compare with K-Means for k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)

# Cut Ward's dendrogram at 3 clusters
ward_labels = fcluster(Z_ward, t=3, criterion='maxclust')

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Ward clustering result
axes[0, 0].scatter(X[:, 0], X[:, 1], c=ward_labels, cmap='viridis', s=50)
axes[0, 0].set_title("Ward's Linkage (3 clusters)")
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')

# K-Means result
axes[0, 1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=50)
axes[0, 1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                   c='red', marker='X', s=200, edgecolors='black')
axes[0, 1].set_title('K-Means (k=3)')
axes[0, 1].set_xlabel('Feature 1')
axes[0, 1].set_ylabel('Feature 2')

# Ward dendrogram
dendrogram(Z_ward, ax=axes[1, 0], truncate_mode='lastp', p=20)
axes[1, 0].set_title("Ward's Linkage Dendrogram")
axes[1, 0].set_ylabel('Distance (Variance Increase)')
axes[1, 0].axhline(y=Z_ward[-3, 2], color='r', linestyle='--', label='3-cluster cut')
axes[1, 0].legend()

# Within-cluster variance at each merge
variances = Z_ward[:, 2]  # The 3rd column is the merge distance (variance increase)
cumulative_variance = np.cumsum(variances[::-1])[::-1]  # Total variance reduction

axes[1, 1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
axes[1, 1].set_xlabel('Number of Clusters')
axes[1, 1].set_ylabel('Cumulative Within-Cluster Variance')
axes[1, 1].set_title('Variance vs Number of Clusters (Elbow Analysis)')
axes[1, 1].axvline(x=3, color='r', linestyle='--', label='Elbow at k=3')
axes[1, 1].legend()
axes[1, 1].set_xlim(1, 20)

plt.tight_layout()
plt.savefig('ward_vs_kmeans.png', dpi=150)
print("Comparison saved to ward_vs_kmeans.png")

# Quantify agreement between Ward and K-Means
ari = adjusted_rand_score(kmeans_labels, ward_labels)
print(f"\nAdjusted Rand Index (Ward vs K-Means): {ari:.4f}")
print("(1.0 = perfect agreement, 0.0 = random)")
```

Centroid Linkage (UPGMC):
$$d_{\text{centroid}}(C_i, C_j) = \lVert \bar{C}_i - \bar{C}_j \rVert$$
Centroid linkage defines inter-cluster distance as the Euclidean distance between cluster centroids (means). After merging, the new centroid is the weighted average:
$$\bar{C}_{ij} = \frac{n_i \bar{C}_i + n_j \bar{C}_j}{n_i + n_j}$$
Median Linkage (WPGMC):
Median linkage also uses centroids but ignores cluster sizes:
$$\bar{C}_{ij} = \frac{\bar{C}_i + \bar{C}_j}{2}$$
The merged cluster centroid is the midpoint of the two sub-cluster centroids, regardless of how many points are in each.
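A two-line arithmetic example (with illustrative centroids and sizes) shows how differently the two update rules behave when cluster sizes are unequal:

```python
import numpy as np

# Illustrative sub-cluster centroids and sizes
c_i, n_i = np.array([0.0, 0.0]), 50   # large cluster
c_j, n_j = np.array([4.0, 0.0]), 10   # small cluster

# Centroid linkage (UPGMC): size-weighted average, pulled toward the large cluster
centroid_update = (n_i * c_i + n_j * c_j) / (n_i + n_j)   # -> [0.667, 0.0]

# Median linkage (WPGMC): plain midpoint, ignoring sizes
median_update = (c_i + c_j) / 2                           # -> [2.0, 0.0]

print(centroid_update, median_update)
```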
The Inversion Problem:
Both centroid and median linkages can produce inversions in the dendrogram—where a merge occurs at a smaller distance than a previous merge. This violates the monotonicity property expected in valid dendrograms and can make interpretation difficult.
In a valid dendrogram, merge distances should be monotonically increasing: later merges happen at larger distances. Centroid and median linkages can violate this, producing 'inversions' where the dendrogram 'goes backwards.' This happens because centroids move during merging, potentially bringing clusters closer. Inversions make it impossible to cut the dendrogram at a fixed height to get a valid clustering.
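Inversions are easy to reproduce. In the sketch below, three points forming an (almost) equilateral triangle make the second merge happen at a smaller height than the first under centroid and median linkage, while monotone linkages such as average and Ward do not invert:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# An almost-equilateral triangle: after the two nearest points merge at
# distance 1.0, their centroid lies only ~0.87 from the third point,
# so the next merge happens at a *smaller* height -- an inversion.
X = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, np.sqrt(3) / 2 * 1.01],   # slight perturbation makes the first merge unique
])

for method in ['centroid', 'median', 'average', 'ward']:
    heights = linkage(X, method=method)[:, 2]
    inversion = bool(np.any(np.diff(heights) < 0))
    print(f"{method:>8}: merge heights = {np.round(heights, 3)}, inversion = {inversion}")
```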
| Aspect | Centroid (UPGMC) | Median (WPGMC) |
|---|---|---|
| Centroid computation | Weighted by size | Unweighted (midpoint) |
| Large cluster influence | Strong | Equal to small clusters |
| Inversion risk | Moderate | Higher |
| Requires Euclidean | Yes (needs centroids) | Yes (needs centroids) |
| Common use case | Rare (use Ward's instead) | Very rare |
Why These Linkages Are Rarely Used:

- They can produce dendrogram inversions, which break the usual "cut at a single height" interpretation.
- Like Ward's method, they require Euclidean data to compute centroids, so they offer no extra flexibility over Ward's method.
- Ward's method captures a similar "distance between cluster centers" intuition while guaranteeing a monotone dendrogram, making it the better choice in practice.
When to Consider Centroid Linkage:
Despite its limitations, centroid linkage can be useful when:

- The centroid is itself a meaningful prototype in your domain, and "distance between cluster centers" is the quantity you want to report.
- You only need a flat clustering at a single level, so occasional dendrogram inversions are tolerable.
In practice, Ward's method is almost always preferred over centroid/median linkages for Euclidean data.
Selecting the appropriate linkage method depends on your data characteristics, domain knowledge, and clustering goals. Here's a systematic decision framework:
| If Your Data Has... | Consider... | Avoid... |
|---|---|---|
| Compact, spherical clusters | Ward, Complete | Single |
| Elongated or chain-like clusters | Single | Complete, Ward |
| Unknown structure (exploratory) | Average (UPGMA) | Centroid, Median |
| Significant outliers | Average, Complete | Single, Ward |
| Varying cluster densities | Single, Average | Complete, Ward |
| Euclidean features | Ward (variance focus) | N/A |
| Non-Euclidean distances | Single, Complete, Average | Ward, Centroid |
| Need diameter guarantee | Complete | All others |
| Very large n (>10,000) | Single (SLINK) | Ward (memory-intensive) |
When uncertain, start with Ward's linkage for Euclidean data (robust, K-Means-like) or Average linkage for arbitrary distances (balanced compromise). If you see unexpected chaining, switch to Complete. If you believe clusters are non-convex, try Single and inspect for spurious bridges. Always visualize dendrograms and compare multiple linkages on your data.
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.metrics import silhouette_score, calinski_harabasz_score, adjusted_rand_score
import matplotlib.pyplot as plt

# Generate 2D data with known structure
np.random.seed(42)

# Three clusters with different characteristics
cluster1 = np.random.randn(50, 2) * 0.5 + [0, 0]       # Compact
cluster2 = np.random.randn(50, 2) * [0.3, 2] + [4, 0]  # Elongated
cluster3 = np.random.randn(50, 2) * 0.5 + [2, 4]       # Compact

X = np.vstack([cluster1, cluster2, cluster3])
true_labels = np.array([0]*50 + [1]*50 + [2]*50)

# Compare all standard linkages
linkages = ['single', 'complete', 'average', 'ward']
results = {}

print("Linkage Comparison (3 clusters):")
print("-" * 70)
print(f"{'Linkage':<12} {'Silhouette':>12} {'Calinski-Harabasz':>18} {'ARI vs True':>12}")
print("-" * 70)

fig, axes = plt.subplots(2, 4, figsize=(18, 9))

for idx, link in enumerate(linkages):
    # Compute linkage
    Z = linkage(X, method=link)
    labels = fcluster(Z, t=3, criterion='maxclust')

    # Compute metrics
    sil = silhouette_score(X, labels)
    ch = calinski_harabasz_score(X, labels)
    ari = adjusted_rand_score(true_labels, labels)

    results[link] = {'silhouette': sil, 'ch': ch, 'ari': ari, 'labels': labels}
    print(f"{link:<12} {sil:>12.4f} {ch:>18.2f} {ari:>12.4f}")

    # Scatter plot
    axes[0, idx].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
    axes[0, idx].set_title(f'{link.capitalize()} Linkage')
    axes[0, idx].set_xlabel('Feature 1')
    if idx == 0:
        axes[0, idx].set_ylabel('Feature 2')

    # Dendrogram
    dendrogram(Z, ax=axes[1, idx], truncate_mode='lastp', p=15, no_labels=True)
    axes[1, idx].set_title(f'{link.capitalize()} Dendrogram')
    if idx == 0:
        axes[1, idx].set_ylabel('Distance')

print("-" * 70)

plt.tight_layout()
plt.savefig('all_linkages_comparison.png', dpi=150)
print("\nVisualization saved to all_linkages_comparison.png")
```

We've comprehensively covered the linkage methods that define how agglomerative clustering measures inter-cluster distance—the central design decision in hierarchical clustering.
What's Next:
Linkage methods produce the dendrogram, but interpreting this hierarchical structure requires careful analysis. In the next page, we'll learn to read and interpret dendrograms: understanding merge heights, identifying cluster boundaries, recognizing hierarchical patterns, and extracting actionable insights from these visualizations.
You now command a complete understanding of hierarchical clustering linkage methods: their mathematical definitions, geometric intuitions, behavioral characteristics, and practical trade-offs. You can intelligently select linkage methods based on data properties and clustering goals. Next, we'll master dendrogram interpretation.