While agglomerative clustering builds from the ground up—starting with individual points and merging upward—divisive clustering takes the opposite approach: start with all data in a single cluster and recursively split until reaching individual points. This top-down perspective fundamentally changes how the hierarchy is constructed and discovered.
Divisive methods are conceptually appealing: when exploring a new dataset, it's often natural to first identify the gross structure (major groupings) before drilling into finer distinctions. A divisive hierarchy reveals the most significant splits first—the primary fault lines in your data—before showing the detailed structure within each partition.
However, divisive clustering presents formidable computational challenges. At each step, we must find the optimal way to split a cluster into two—a problem that is NP-hard in general. This has made divisive methods far less common than their agglomerative counterparts, though clever heuristics and specific algorithms like DIANA make them practical for certain applications.
By the end of this page, you will understand:
- the divisive paradigm and how it differs from agglomerative approaches;
- the DIANA algorithm and its splinter-based splitting strategy;
- recursive bisection using K-Means;
- the computational complexity that makes optimal divisive clustering intractable;
- when to prefer divisive over agglomerative methods;
- practical implementations of divisive hierarchical clustering.
Formal Definition:
Divisive clustering starts with a single cluster containing all n data points and recursively partitions clusters until each point forms its own singleton cluster, producing a hierarchy of n-1 splits.
Input: A set of n data points X = {x₁, x₂, ..., xₙ} and a splitting criterion
Output: A dendrogram recording all n-1 splits
Algorithm Template:
1. Start with all n points in a single cluster.
2. Repeat until every cluster is a singleton (or a stopping criterion is met):
   a. Select the cluster to split next (typically the largest or most heterogeneous one).
   b. Find the best binary partition of the selected cluster under the splitting criterion.
   c. Replace the cluster with its two subclusters and record the split.
3. Output the dendrogram of the n-1 recorded splits.
The Splitting Challenge:
Step 2b is the critical challenge. For a cluster of size m, there are 2^(m-1) - 1 possible binary partitions. Exhaustively evaluating all partitions to find the optimal split is exponential in m—completely intractable for clusters larger than about 20 points.
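To make this growth concrete, here is a small standalone sketch (an illustration for this page, not part of any library) that brute-force enumerates the binary partitions of a tiny cluster, checks the count against the 2^(m-1) - 1 formula, and contrasts it with the O(m²) candidate pairs an agglomerative merge step considers:

```python
from itertools import combinations


def count_binary_partitions(m: int) -> int:
    """Count binary partitions of m labeled points by brute-force enumeration."""
    points = range(m)
    splits = set()
    for size in range(1, m):
        for side in combinations(points, size):
            rest = tuple(p for p in points if p not in side)
            # {A, B} and {B, A} are the same split; a frozenset is the canonical form
            splits.add(frozenset((side, rest)))
    return len(splits)


for m in (3, 5, 8, 10, 20, 50):
    formula = 2 ** (m - 1) - 1
    brute = count_binary_partitions(m) if m <= 12 else None  # enumeration is hopeless beyond small m
    pairs = m * (m - 1) // 2  # merge candidates an agglomerative step considers
    print(f"m={m:3d}: possible splits={formula:>18,}  brute-force check={brute}  merge pairs={pairs:,}")
```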
| Aspect | Agglomerative (Bottom-Up) | Divisive (Top-Down) |
|---|---|---|
| Starting state | n singleton clusters | 1 cluster with all n points |
| Each step | Merge closest pair | Split most heterogeneous cluster |
| Number of steps | n-1 merges | n-1 splits |
| Local decision | O(n²) pairwise comparisons | O(2^m) possible partitions |
| Early structure | Fine-grained local similarities | Gross-level major divisions |
| Hierarchy direction | Leaves → Root (bottom-up) | Root → Leaves (top-down) |
| Typical complexity | O(n² log n) or O(n²) | O(2^n) optimal; polynomial heuristics (e.g., DIANA O(n³)) |
The asymmetry in local decision complexity explains why agglomerative methods are far more common. Merging requires choosing from O(n²) pairs—polynomial. Splitting optimally requires choosing from O(2^m) partitions—exponential. This fundamental difference means divisive methods must rely on heuristics, while agglomerative methods can find exact solutions efficiently.
DIANA (DIvisive ANAlysis) is the classic divisive hierarchical clustering algorithm, introduced by Kaufman and Rousseeuw (1990). It uses a clever heuristic based on splinter groups to avoid the exponential complexity of optimal splitting.
The Splinter Heuristic:
Instead of evaluating all 2^(m-1) - 1 partitions, DIANA grows a "splinter group" one point at a time:
1. Find the point with the largest average dissimilarity to all other points in the cluster; it seeds the splinter group.
2. For every point still in the original cluster, compute the difference between its average distance to the other remaining points and its average distance to the splinter group.
3. If the largest difference is positive, move that point to the splinter group (it is, on average, closer to the splinter than to the rest).
4. Repeat steps 2-3 until no difference is positive; the cluster is then split into the splinter group and the remaining points.
Intuition: DIANA identifies points that are more "at home" in the splinter than in the remaining cluster, incrementally building the splinter until no more points want to leave.
```python
import numpy as np
from typing import List, Set, Tuple, Dict
from scipy.spatial.distance import pdist, squareform


class DIANAClustering:
    """
    DIANA (DIvisive ANAlysis) hierarchical clustering implementation.
    Uses the splinter group heuristic to avoid exponential splitting complexity.
    """

    def __init__(self):
        self.dendrogram: List[Tuple[int, Set[int], Set[int], float]] = []
        self.clusters: List[Set[int]] = []

    def fit(self, X: np.ndarray) -> 'DIANAClustering':
        """
        Perform divisive clustering on data X.

        Args:
            X: Data matrix of shape (n_samples, n_features)

        Returns:
            self with fitted dendrogram
        """
        n = len(X)

        # Precompute full distance matrix
        self.distance_matrix = squareform(pdist(X))

        # Start with all points in one cluster
        all_indices = set(range(n))
        self.clusters = [all_indices.copy()]

        # Track split order for dendrogram
        split_id = 0

        while any(len(c) > 1 for c in self.clusters):
            # Select cluster to split: most heterogeneous (largest diameter)
            split_idx = self._select_cluster_to_split()
            cluster = self.clusters[split_idx]

            if len(cluster) <= 1:
                continue

            # Perform DIANA split
            splinter, remaining, split_distance = self._diana_split(cluster)

            if len(splinter) == 0 or len(remaining) == 0:
                # Can't split further (shouldn't happen with proper data)
                break

            # Record split
            self.dendrogram.append((
                split_id,
                remaining.copy(),
                splinter.copy(),
                split_distance
            ))
            split_id += 1

            # Update cluster list
            self.clusters.pop(split_idx)
            self.clusters.append(remaining)
            self.clusters.append(splinter)

        return self

    def _select_cluster_to_split(self) -> int:
        """Select the most heterogeneous cluster (largest diameter)."""
        max_diameter = -1
        max_idx = 0

        for i, cluster in enumerate(self.clusters):
            if len(cluster) < 2:
                continue

            # Compute diameter (maximum pairwise distance)
            indices = list(cluster)
            diameter = max(
                self.distance_matrix[a, b]
                for a in indices for b in indices if a < b
            )

            if diameter > max_diameter:
                max_diameter = diameter
                max_idx = i

        return max_idx

    def _diana_split(self, cluster: Set[int]) -> Tuple[Set[int], Set[int], float]:
        """
        Split a cluster using DIANA's splinter group heuristic.

        Returns:
            (splinter, remaining, split_distance)
        """
        indices = list(cluster)
        n = len(indices)

        if n <= 1:
            return set(), cluster, 0.0

        # Step 1: Find point with largest average distance to others
        avg_distances = []
        for i in indices:
            avg_dist = np.mean([
                self.distance_matrix[i, j] for j in indices if j != i
            ])
            avg_distances.append((i, avg_dist))

        # Start splinter with most distant point
        seed_point = max(avg_distances, key=lambda x: x[1])[0]
        splinter = {seed_point}
        remaining = cluster - splinter

        # Steps 2-4: Grow splinter.
        # Stop once only one point remains outside the splinter; otherwise the last
        # point would always "prefer" the splinter and the remaining side would empty.
        changed = True
        while changed and len(remaining) > 1:
            changed = False
            best_candidate = None
            best_diff = 0

            for point in remaining:
                # Average distance to the rest of the remaining cluster
                d_in = np.mean([
                    self.distance_matrix[point, j]
                    for j in remaining if j != point
                ])

                # Average distance to splinter
                d_out = np.mean([
                    self.distance_matrix[point, j] for j in splinter
                ])

                diff = d_in - d_out  # Positive means point prefers splinter

                if diff > best_diff:
                    best_diff = diff
                    best_candidate = point

            if best_candidate is not None and best_diff > 0:
                splinter.add(best_candidate)
                remaining.remove(best_candidate)
                changed = True

        # Compute split distance (average between-cluster distance)
        split_distance = np.mean([
            self.distance_matrix[a, b]
            for a in splinter for b in remaining
        ]) if splinter and remaining else 0.0

        return splinter, remaining, split_distance


# Usage example
if __name__ == "__main__":
    np.random.seed(42)

    # Generate test data with 3 clusters
    X = np.vstack([
        np.random.randn(20, 2) * 0.5 + [0, 0],
        np.random.randn(20, 2) * 0.5 + [4, 0],
        np.random.randn(20, 2) * 0.5 + [2, 4]
    ])

    # Fit DIANA
    diana = DIANAClustering()
    diana.fit(X)

    print("DIANA Divisive Clustering Results:")
    print("=" * 50)
    print(f"Total splits: {len(diana.dendrogram)}")

    # Show first few splits
    print("\nFirst 5 splits (top of hierarchy):")
    for split_id, remaining, splinter, distance in diana.dendrogram[:5]:
        print(f"  Split {split_id}: {len(remaining)} points + {len(splinter)} points, dist={distance:.3f}")
```

An alternative to DIANA's splinter heuristic is recursive bisection: use an existing clustering algorithm (typically K-Means with k=2) to split each cluster into two parts.
Algorithm:
1. Start with all points in one cluster.
2. Select a cluster to split (e.g., the largest one, or the one with the highest SSE).
3. Run K-Means with k=2 on that cluster and replace it with the two resulting subclusters.
4. Repeat steps 2-3 until the desired number of clusters is reached (or every cluster is a singleton).
Why K-Means Bisection Works Well:
- Each split is a simple two-cluster problem, which K-Means solves quickly and with far fewer local-optima issues than a single run with large k.
- A bisection costs only O(m · d · iterations) for a cluster of m points in d dimensions; no pairwise distance matrix is needed.
- Each split minimizes within-cluster variance (SSE) for the two halves, a natural and interpretable splitting criterion.
Bisecting K-Means:
This approach is so common it has its own name: Bisecting K-Means. It's often more efficient than standard K-Means for large k, and can produce hierarchical structure as a byproduct.
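If a recent scikit-learn is available (version 1.1 and later ship a `BisectingKMeans` estimator), you can use it directly rather than rolling your own. A minimal usage sketch, assuming that version is installed:

```python
import numpy as np
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[2, 4], scale=0.5, size=(50, 2)),
])

# 'biggest_inertia' splits the cluster with the largest SSE at each step;
# 'largest_cluster' splits the cluster with the most points.
model = BisectingKMeans(n_clusters=3, bisecting_strategy="biggest_inertia", random_state=0)
labels = model.fit_predict(X)

print("Cluster sizes:", np.bincount(labels))
print("Centers:\n", model.cluster_centers_)
```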
```python
import numpy as np
from sklearn.cluster import KMeans
from typing import List, Dict, Tuple, Optional
import matplotlib.pyplot as plt


class BisectingKMeans:
    """
    Divisive hierarchical clustering using K-Means bisection.
    At each step, splits the selected cluster using K-Means(k=2).
    """

    def __init__(
        self,
        n_clusters: int = 8,
        selection: str = 'largest',  # 'largest' or 'sse'
        n_init: int = 10,
        random_state: int = 42
    ):
        self.n_clusters = n_clusters
        self.selection = selection
        self.n_init = n_init
        self.random_state = random_state
        self.labels_ = None
        self.cluster_centers_ = None
        self.split_history_: List[Dict] = []

    def fit(self, X: np.ndarray) -> 'BisectingKMeans':
        """Fit bisecting k-means."""
        n = len(X)

        # Initialize: all points in cluster 0
        labels = np.zeros(n, dtype=int)

        # Track clusters: {cluster_id: indices}
        clusters = {0: np.arange(n)}
        next_cluster_id = 1

        while len(clusters) < self.n_clusters:
            # Select cluster to split
            split_cluster_id = self._select_cluster(X, clusters)
            cluster_indices = clusters[split_cluster_id]

            if len(cluster_indices) < 2:
                # Can't split singleton
                break

            # Extract cluster data
            cluster_data = X[cluster_indices]

            # Apply K-Means bisection
            kmeans = KMeans(
                n_clusters=2,
                n_init=self.n_init,
                random_state=self.random_state
            )
            sub_labels = kmeans.fit_predict(cluster_data)

            # Compute SSE reduction for this split
            old_sse = self._compute_sse(cluster_data)
            new_sse = sum(
                self._compute_sse(cluster_data[sub_labels == k])
                for k in [0, 1]
            )

            # Create new clusters
            new_cluster_id_0 = split_cluster_id  # Reuse parent ID
            new_cluster_id_1 = next_cluster_id
            next_cluster_id += 1

            mask_0 = sub_labels == 0
            mask_1 = sub_labels == 1

            clusters[new_cluster_id_0] = cluster_indices[mask_0]
            clusters[new_cluster_id_1] = cluster_indices[mask_1]

            # Update global labels
            labels[cluster_indices[mask_1]] = new_cluster_id_1

            # Record split
            self.split_history_.append({
                'parent': split_cluster_id,
                'children': (new_cluster_id_0, new_cluster_id_1),
                'parent_size': len(cluster_indices),
                'child_sizes': (np.sum(mask_0), np.sum(mask_1)),
                'sse_reduction': old_sse - new_sse
            })

        # Renumber clusters to be 0...k-1
        unique_labels = np.unique(labels)
        label_map = {old: new for new, old in enumerate(unique_labels)}
        self.labels_ = np.array([label_map[l] for l in labels])

        # Compute final centroids
        self.cluster_centers_ = np.array([
            X[self.labels_ == k].mean(axis=0)
            for k in range(len(unique_labels))
        ])

        return self

    def _select_cluster(
        self,
        X: np.ndarray,
        clusters: Dict[int, np.ndarray]
    ) -> int:
        """Select which cluster to split next."""
        if self.selection == 'largest':
            return max(clusters.keys(), key=lambda k: len(clusters[k]))
        elif self.selection == 'sse':
            return max(
                clusters.keys(),
                key=lambda k: self._compute_sse(X[clusters[k]])
            )
        else:
            raise ValueError(f"Unknown selection: {self.selection}")

    def _compute_sse(self, points: np.ndarray) -> float:
        """Compute sum of squared distances to centroid."""
        if len(points) == 0:
            return 0.0
        centroid = points.mean(axis=0)
        return np.sum((points - centroid) ** 2)


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Generate hierarchical data
    # Level 1: Two main groups
    # Level 2: Each main group has 2 sub-groups
    main1_sub1 = np.random.randn(40, 2) * 0.5 + [0, 0]
    main1_sub2 = np.random.randn(40, 2) * 0.5 + [2, 0]
    main2_sub1 = np.random.randn(40, 2) * 0.5 + [6, 6]
    main2_sub2 = np.random.randn(40, 2) * 0.5 + [8, 6]

    X = np.vstack([main1_sub1, main1_sub2, main2_sub1, main2_sub2])

    # Compare different k values
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    for idx, k in enumerate([2, 3, 4, 6]):
        ax = axes[idx // 2, idx % 2]

        bisecting = BisectingKMeans(n_clusters=k, selection='sse', random_state=42)
        bisecting.fit(X)

        scatter = ax.scatter(X[:, 0], X[:, 1], c=bisecting.labels_,
                             cmap='viridis', s=50, alpha=0.7)
        ax.scatter(bisecting.cluster_centers_[:, 0],
                   bisecting.cluster_centers_[:, 1],
                   c='red', marker='X', s=200, edgecolors='black')
        ax.set_title(f'Bisecting K-Means (k={k})')
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')

    plt.tight_layout()
    plt.savefig('bisecting_kmeans.png', dpi=150)
    print("Bisecting K-Means saved to bisecting_kmeans.png")

    # Print split hierarchy
    bisecting = BisectingKMeans(n_clusters=4, selection='sse', random_state=42)
    bisecting.fit(X)

    print("\n=== Split History ===")
    for i, split in enumerate(bisecting.split_history_):
        print(f"Split {i+1}: Cluster {split['parent']} "
              f"({split['parent_size']} pts) → "
              f"Clusters {split['children']} "
              f"({split['child_sizes'][0]}+{split['child_sizes'][1]} pts), "
              f"SSE reduction: {split['sse_reduction']:.2f}")
```

Bisecting K-Means often outperforms standard K-Means for large k because each bisection is a simple 2-class problem with stable solutions. Standard K-Means with large k faces more local optima and initialization sensitivity. Additionally, bisecting K-Means naturally produces a hierarchy—useful for understanding data structure at multiple resolutions.
Understanding the computational complexity of divisive methods helps explain why they're less common than agglomerative approaches.
Optimal Divisive Clustering (Intractable):
Finding the optimal binary partition of m points requires evaluating 2^(m-1) - 1 possible splits. For the first split (m = n), this is:
$$T_{\text{optimal}} = O(2^n)$$
This is clearly intractable. For just n = 50 points, we'd need to evaluate over 10^14 partitions—impossible in practice.
DIANA Complexity:
DIANA's splinter heuristic avoids exponential complexity:
- The full distance matrix is computed once at O(n²) time and space.
- Splitting a cluster of size m takes O(m²) work to pick the seed point, and each pass of splinter growth is O(m²), with up to O(m) passes in the worst case.
- Summed over all n-1 splits, the worst-case total is on the order of O(n³).
While cubic, DIANA is still slower than the O(n² log n) or O(n²) achievable for agglomerative methods.
Bisecting K-Means Complexity:
Each bisection runs K-Means with k=2 on a cluster of m points, costing O(m · d · iterations). Producing k clusters takes k-1 bisections, for a total of roughly O(n · d · k · iterations), and no O(n²) distance matrix is ever needed.
This is typically much faster than DIANA, especially for high-dimensional data.
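As a rough empirical check, the sketch below times the two implementations from this page on the same synthetic data. It assumes the `DIANAClustering` and `BisectingKMeans` classes defined in the listings above are in scope; the absolute numbers will vary by machine, and the naive DIANA code may take several seconds even at this modest size.

```python
import time
import numpy as np

# Assumes DIANAClustering and BisectingKMeans from the listings above are defined or imported.
np.random.seed(0)
X = np.random.randn(200, 5)  # small n keeps the naive DIANA implementation's runtime reasonable

start = time.perf_counter()
DIANAClustering().fit(X)                                   # full hierarchy, n-1 splits
diana_seconds = time.perf_counter() - start

start = time.perf_counter()
BisectingKMeans(n_clusters=20, selection='sse').fit(X)     # stops after 19 bisections
bisecting_seconds = time.perf_counter() - start

print(f"DIANA (full hierarchy, n=200):   {diana_seconds:.2f} s")
print(f"Bisecting K-Means (20 clusters): {bisecting_seconds:.2f} s")
```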
| Method | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Optimal divisive | O(2^n) | O(n²) | Intractable for n > 20 |
| DIANA | O(n³) | O(n²) | Polynomial but slow |
| Bisecting K-Means | O(ndk × iters) | O(nd + k) | Fast, practical choice |
| Agglomerative (comparison) | O(n² log n) | O(n²) | More efficient alternative |
Why the Asymmetry Matters:
The fundamental complexity asymmetry between agglomerative and divisive methods has deep consequences:
Algorithm Availability: Far more agglomerative algorithms exist because they're tractable
Software Support: Libraries like scipy, sklearn, and R's hclust focus on agglomerative methods
Research Focus: Most hierarchical clustering research addresses agglomerative approaches
Practical Usage: Industry deployments overwhelmingly choose agglomerative methods
When Divisive Methods Are Worth It:
Despite their complexity disadvantages, divisive methods shine when:
Finding the optimal binary partition that minimizes within-cluster variance is NP-hard in general. This is why all practical divisive methods use heuristics. There's no known polynomial-time algorithm that guarantees optimal splits for arbitrary data distributions.
The choice between agglomerative and divisive methods depends on data characteristics, computational constraints, and analysis goals.
Structural Differences:
Agglomerative and divisive methods produce different hierarchies even on the same data:
Imagine two tight clusters with a few points between them. Agglomerative methods might merge the bridge points with one cluster based on nearest neighbors. Divisive methods are more likely to correctly separate the major clusters first, then deal with the bridge.
Hierarchy Quality:
In practice, agglomerative methods tend to be more faithful near the bottom of the hierarchy, since they are built directly from fine-grained local similarities, while divisive methods tend to capture the top-level structure more reliably, since the first splits are made with all of the data in view.
If you need to cut toward the leaves (many clusters), agglomerative may be better. If you need few clusters, divisive may be more accurate.
| Criterion | Choose Agglomerative | Choose Divisive |
|---|---|---|
| Library support needed | ✓ Widely available | Limited support |
| Large n (> 10,000) | ✓ O(n² log n) algorithms exist | Usually too slow |
| Need many clusters | ✓ Fine-grained structure better | Top structure matters less |
| Need few clusters | Good but may merge incorrectly | ✓ Major divisions reliable |
| Exploratory analysis | ✓ Full dendrogram standard | Early termination possible |
| Computational budget | ✓ Well-optimized implementations | Slower in general |
| Domain is top-down | May miss global structure | ✓ Natural fit for hierarchical domains |
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score
import matplotlib.pyplot as plt
import time

# Generate challenging data with clear global structure but local noise
np.random.seed(42)

# Two main groups with sub-structure and noise
main1 = np.vstack([
    np.random.randn(30, 2) * 0.4 + [0, 0],
    np.random.randn(30, 2) * 0.4 + [1.5, 0],
])
main2 = np.vstack([
    np.random.randn(30, 2) * 0.4 + [6, 0],
    np.random.randn(30, 2) * 0.4 + [7.5, 0],
])

# Add bridge noise between main groups
bridge = np.random.randn(5, 2) * 0.3 + [3.5, 0]

X = np.vstack([main1, main2, bridge])
true_labels_2 = np.array([0]*60 + [1]*60 + [0]*5)  # 2 main groups
true_labels_4 = np.array([0]*30 + [1]*30 + [2]*30 + [3]*30 + [0]*5)  # 4 subgroups

print("Comparing Agglomerative vs Divisive Approaches")
print("=" * 55)

# Agglomerative clustering
Z_agg = linkage(X, method='ward')

# Simulate divisive via bisecting K-means
# (Simplified for comparison)
from sklearn.cluster import KMeans


def bisecting_kmeans_labels(X, n_clusters):
    """Simple bisecting k-means for comparison."""
    labels = np.zeros(len(X), dtype=int)
    next_label = 1

    for _ in range(n_clusters - 1):
        # Find largest cluster
        unique, counts = np.unique(labels, return_counts=True)
        largest_cluster = unique[np.argmax(counts)]

        # Split it
        mask = labels == largest_cluster
        if np.sum(mask) < 2:
            break

        km = KMeans(n_clusters=2, n_init=10, random_state=42)
        sub_labels = km.fit_predict(X[mask])

        # Assign new labels
        labels[mask] = np.where(sub_labels == 0, largest_cluster, next_label)
        next_label += 1

    return labels


# Compare at k=2 and k=4
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for row, k in enumerate([2, 4]):
    # Agglomerative
    labels_agg = fcluster(Z_agg, t=k, criterion='maxclust')

    # Divisive (bisecting)
    labels_div = bisecting_kmeans_labels(X, k)

    # True labels for comparison
    true_labels = true_labels_2 if k == 2 else true_labels_4

    # Compute metrics
    ari_agg = adjusted_rand_score(true_labels, labels_agg)
    ari_div = adjusted_rand_score(true_labels, labels_div)
    sil_agg = silhouette_score(X, labels_agg)
    sil_div = silhouette_score(X, labels_div)

    # Plot
    axes[row, 0].scatter(X[:, 0], X[:, 1], c=labels_agg, cmap='viridis', s=50)
    axes[row, 0].set_title(f'Agglomerative (k={k})\nARI={ari_agg:.3f}, Sil={sil_agg:.3f}')

    axes[row, 1].scatter(X[:, 0], X[:, 1], c=labels_div, cmap='viridis', s=50)
    axes[row, 1].set_title(f'Divisive/Bisecting (k={k})\nARI={ari_div:.3f}, Sil={sil_div:.3f}')

    axes[row, 2].scatter(X[:, 0], X[:, 1], c=true_labels, cmap='viridis', s=50)
    axes[row, 2].set_title(f'True Labels (k={k})')

    print(f"\nk={k}:")
    print(f"  Agglomerative - ARI: {ari_agg:.3f}, Silhouette: {sil_agg:.3f}")
    print(f"  Divisive      - ARI: {ari_div:.3f}, Silhouette: {sil_div:.3f}")

for ax in axes.flatten():
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.savefig('agg_vs_div_comparison.png', dpi=150)
print("\nComparison saved to agg_vs_div_comparison.png")
```

Despite their relative obscurity compared to agglomerative methods, divisive approaches have specific applications where they excel:
1. Document/Topic Hierarchies:
Organizing documents into a topic taxonomy is inherently top-down. You first divide "all documents" into broad categories (e.g., "Science" vs "Arts"), then subdivide each. Divisive clustering mimics this natural conceptualization.
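As a toy illustration of this workflow, the hedged sketch below builds a two-level topic split with TF-IDF features and scikit-learn's `BisectingKMeans` (version 1.1 or later); the six-document corpus is made up purely for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import BisectingKMeans  # scikit-learn >= 1.1

# Hypothetical mini-corpus: two broad topics (science vs. arts), each with sub-themes
docs = [
    "quantum physics particles energy experiment",
    "genome dna sequencing biology cells",
    "telescope galaxy astronomy stars orbit",
    "oil painting renaissance portrait canvas",
    "symphony orchestra violin concerto composer",
    "sculpture marble museum exhibition artist",
]

# Dense array for simplicity on this tiny corpus
X = TfidfVectorizer().fit_transform(docs).toarray()

# First level of the taxonomy: split "all documents" into 2 broad groups
top_level = BisectingKMeans(n_clusters=2, random_state=0).fit_predict(X)
print("Top-level topics:", top_level)

# Drill down: the same top-down procedure with more clusters yields finer sub-topics
fine_level = BisectingKMeans(n_clusters=4, random_state=0).fit_predict(X)
print("Finer topics:    ", fine_level)
```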
2. Decision Tree Preprocessing:
Bisecting K-Means can efficiently create initial cluster assignments for decision tree-based algorithms, providing a hierarchy that can be refined by supervised methods.
3. Coarse-to-Fine Analysis:
In exploratory data analysis, you often want to understand the major divisions first. Divisive methods provide immediate insight into top-level structure without computing the full n-1 splits.
4. Large-Scale Approximate Clustering:
Bisecting K-Means is used for large-scale clustering because each bisection is O(n), and you can stop after k-1 splits for k clusters. This avoids the O(n²) distance matrix entirely.
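A quick back-of-the-envelope calculation makes the memory argument concrete; the short sketch below compares the size of a full pairwise distance matrix with the size of the raw data for a hypothetical one-million-point, 100-feature dataset.

```python
n = 1_000_000        # hypothetical number of points
d = 100              # hypothetical number of features
bytes_per_float = 8  # float64

distance_matrix_gb = n * n * bytes_per_float / 1e9  # what agglomerative methods would need
raw_data_gb = n * d * bytes_per_float / 1e9         # all bisecting K-Means needs to hold

print(f"Pairwise distance matrix: {distance_matrix_gb:,.1f} GB")  # 8,000 GB (8 TB)
print(f"Raw data matrix:          {raw_data_gb:,.1f} GB")         # under 1 GB
```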
Apache Spark MLlib includes BisectingKMeans as a scalable clustering algorithm. It's specifically designed for large-scale distributed environments where the O(n²) distance matrix of agglomerative methods is prohibitive. For big data clustering with hierarchical output, bisecting K-Means is often the only practical choice.
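For reference, here is a minimal PySpark sketch of that estimator, assuming a local Spark installation; the four-point dataset is invented just to make the snippet self-contained.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import BisectingKMeans

spark = SparkSession.builder.appName("bisecting-kmeans-sketch").getOrCreate()

# Tiny made-up dataset: two obvious groups in 2-D feature space
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.5, 0.2]),),
        (Vectors.dense([8.0, 8.0]),), (Vectors.dense([8.3, 7.9]),)]
df = spark.createDataFrame(data, ["features"])

bkm = BisectingKMeans(k=2, seed=42)
model = bkm.fit(df)

model.transform(df).show()           # adds a "prediction" column with cluster assignments
for center in model.clusterCenters():
    print(center)

spark.stop()
```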
We've completed our exploration of hierarchical clustering with this examination of divisive (top-down) methods—the conceptual opposite of the agglomerative approaches that dominate practice.
Module Complete:
You've now mastered hierarchical clustering comprehensively: the agglomerative paradigm, all major linkage methods, dendrogram interpretation, cluster extraction strategies, and divisive alternatives. You understand when to apply hierarchical methods versus partition-based alternatives like K-Means, how to choose between linkages, and how to validate your clustering results.
Hierarchical clustering provides a powerful lens for understanding data structure at multiple scales—a capability that partition methods cannot offer. While computationally more demanding than K-Means, the multi-resolution view from a dendrogram is invaluable for exploratory analysis and domains where natural hierarchies exist.
Congratulations! You've completed the Hierarchical Clustering module with deep understanding of both agglomerative and divisive approaches. You can now: implement and optimize agglomerative clustering; select appropriate linkage methods; interpret dendrograms expertly; extract meaningful clusters using multiple criteria; and understand when divisive methods might be preferable. This knowledge completes your understanding of classical clustering approaches.