Standard K-means approaches clustering as a simultaneous partitioning problem: given K clusters, find the best assignment of all points at once. But what if we built cluster structure incrementally, starting with one cluster and recursively dividing it?
Bisecting K-means takes this top-down, divisive approach. It begins with all data in a single cluster, then repeatedly splits the "best" cluster into two until the desired number of clusters is reached. This simple modification yields surprising benefits: often better quality than standard K-means, natural cluster hierarchy, and efficient computation.
By the end of this page, you will understand the bisecting K-means algorithm and its variants, master different cluster selection and splitting strategies, analyze the computational and quality trade-offs versus standard K-means, and recognize when bisecting K-means provides superior results.
Clustering algorithms can be categorized by how they construct cluster structure:
Agglomerative (Bottom-up): start with every point as its own cluster and repeatedly merge the two closest clusters until one cluster (or a target K) remains.
Divisive (Top-down): start with all points in one cluster and repeatedly split a cluster until the target number of clusters is reached; bisecting K-means belongs to this family.
Partitional (Direct): fix K and assign all points to clusters at once, iteratively refining the partition; standard K-means is the canonical example.
| Aspect | Agglomerative | Divisive (Bisecting) | Partitional (K-means) |
|---|---|---|---|
| Direction | Bottom-up | Top-down | Direct |
| Initial state | n singletons | 1 cluster (all data) | K random centers |
| Operation | Merge pairs | Split one | Reassign & update |
| Hierarchy | Full dendrogram | Binary tree | None |
| Complexity | O(n² log n) | O(nt log K) | O(nKt) |
| Global optimization | Greedy merges | Independent bisections | Local minimum |
Agglomerative clustering makes early decisions (initial merges) that cannot be undone, and these compound over n-1 merge steps. Divisive clustering makes K-1 split decisions, and each split is relatively independent—errors in one branch don't affect others. This isolation of decisions often leads to better global solutions.
Bisecting K-means is elegantly simple: repeatedly apply K-means with K=2 (bisection) to selected clusters until the desired number of clusters is reached.
Algorithm: Bisecting K-means
Input: Data X = {x_1, ..., x_n}, target clusters K
Output: K cluster assignments
1. Initialize: cluster_list = [all data points]
2. While len(cluster_list) < K:
a. Select a cluster C from cluster_list to split
b. Apply K-means with K=2 to C, obtaining C1 and C2
c. Remove C from cluster_list
d. Add C1 and C2 to cluster_list
3. Return cluster_list
Key Design Decisions: which cluster to split next (the selection strategy), how many trial bisections to run per split (keeping the best), and how each local K=2 run is initialized and terminated. The implementation below exposes each of these as a parameter.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist


class BisectingKMeans:
    """
    Bisecting K-means clustering algorithm.

    Parameters:
    -----------
    n_clusters : int
        Target number of clusters (K)
    n_bisections : int
        Number of bisection trials per split (best one is kept)
    selection : str
        Cluster selection strategy: 'largest', 'highest_sse', 'hybrid'
    max_iter_bisect : int
        Max iterations for each K=2 split
    random_state : int
        Random seed for reproducibility
    """

    def __init__(self, n_clusters=8, n_bisections=5, selection='largest',
                 max_iter_bisect=100, random_state=None):
        self.n_clusters = n_clusters
        self.n_bisections = n_bisections
        self.selection = selection
        self.max_iter_bisect = max_iter_bisect
        self.random_state = random_state

    def _compute_sse(self, X, center):
        """Compute sum of squared errors for a cluster."""
        return np.sum((X - center) ** 2)

    def _bisect_cluster(self, X, n_trials=5):
        """
        Split a cluster into two using K-means with K=2.
        Run multiple trials and keep the best split.
        """
        best_inertia = np.inf
        best_labels = None
        best_centers = None

        for trial in range(n_trials):
            km = KMeans(
                n_clusters=2,
                n_init=1,
                max_iter=self.max_iter_bisect,
                random_state=self.random_state + trial if self.random_state else None
            )
            km.fit(X)

            if km.inertia_ < best_inertia:
                best_inertia = km.inertia_
                best_labels = km.labels_
                best_centers = km.cluster_centers_

        return best_labels, best_centers, best_inertia

    def _select_cluster_to_split(self, clusters):
        """
        Select which cluster to split next.

        Strategies:
        - 'largest': Split the cluster with most points
        - 'highest_sse': Split the cluster with highest SSE
        - 'hybrid': Balance size and SSE
        """
        if self.selection == 'largest':
            # Select largest cluster
            sizes = [c['size'] for c in clusters]
            return np.argmax(sizes)
        elif self.selection == 'highest_sse':
            # Select cluster with highest SSE
            sses = [c['sse'] for c in clusters]
            return np.argmax(sses)
        elif self.selection == 'hybrid':
            # Score = size * sse (balance both factors)
            scores = [c['size'] * c['sse'] for c in clusters]
            return np.argmax(scores)
        else:
            raise ValueError(f"Unknown selection strategy: {self.selection}")

    def fit(self, X):
        """Fit bisecting K-means to data."""
        if self.random_state is not None:
            np.random.seed(self.random_state)

        n_samples, n_features = X.shape

        # Initialize: all points in one cluster
        all_indices = np.arange(n_samples)
        initial_center = X.mean(axis=0)
        initial_sse = self._compute_sse(X, initial_center)

        clusters = [{
            'indices': all_indices,
            'center': initial_center,
            'sse': initial_sse,
            'size': n_samples
        }]

        # Split history for building hierarchy
        self.split_history_ = []

        # Repeatedly bisect until K clusters
        while len(clusters) < self.n_clusters:
            # Select cluster to split
            split_idx = self._select_cluster_to_split(clusters)
            cluster_to_split = clusters[split_idx]

            # Get data for this cluster
            cluster_data = X[cluster_to_split['indices']]

            # Bisect
            labels, centers, inertia = self._bisect_cluster(
                cluster_data, n_trials=self.n_bisections
            )

            # Create two new clusters
            mask_0 = labels == 0
            mask_1 = labels == 1
            indices_0 = cluster_to_split['indices'][mask_0]
            indices_1 = cluster_to_split['indices'][mask_1]

            new_cluster_0 = {
                'indices': indices_0,
                'center': centers[0],
                'sse': self._compute_sse(X[indices_0], centers[0]),
                'size': len(indices_0)
            }
            new_cluster_1 = {
                'indices': indices_1,
                'center': centers[1],
                'sse': self._compute_sse(X[indices_1], centers[1]),
                'size': len(indices_1)
            }

            # Record split for hierarchy
            self.split_history_.append({
                'parent_idx': split_idx,
                'parent_sse': cluster_to_split['sse'],
                'child_sses': [new_cluster_0['sse'], new_cluster_1['sse']],
                'n_clusters_after': len(clusters) + 1
            })

            # Replace parent with children
            clusters.pop(split_idx)
            clusters.append(new_cluster_0)
            clusters.append(new_cluster_1)

        # Store final results
        self.cluster_centers_ = np.array([c['center'] for c in clusters])
        self.labels_ = np.zeros(n_samples, dtype=int)
        for k, c in enumerate(clusters):
            self.labels_[c['indices']] = k

        # Compute inertia
        self.inertia_ = sum(c['sse'] for c in clusters)

        return self

    def predict(self, X):
        """Predict cluster labels for new data."""
        distances = cdist(X, self.cluster_centers_)
        return np.argmin(distances, axis=1)

    def fit_predict(self, X):
        """Fit and return labels."""
        self.fit(X)
        return self.labels_


# Example usage
np.random.seed(42)

# Create well-separated clusters
X = np.vstack([
    np.random.randn(200, 2) * 0.5 + [0, 0],
    np.random.randn(200, 2) * 0.5 + [5, 0],
    np.random.randn(200, 2) * 0.5 + [2.5, 4],
    np.random.randn(200, 2) * 0.5 + [7.5, 4],
    np.random.randn(200, 2) * 0.5 + [2.5, -4],
])

# Compare bisecting K-means with standard K-means
bisecting = BisectingKMeans(n_clusters=5, selection='largest', random_state=42)
bisecting.fit(X)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X)

print("Bisecting K-means vs Standard K-means:")
print(f"  Bisecting inertia: {bisecting.inertia_:.2f}")
print(f"  Standard inertia:  {kmeans.inertia_:.2f}")
print(f"  Improvement: {(1 - bisecting.inertia_ / kmeans.inertia_) * 100:.2f}%")

The choice of which cluster to split next significantly impacts the final clustering quality. Different strategies optimize for different objectives.
Strategy 1: Largest Cluster
Split the cluster with the most points: $$C^* = \arg\max_{C \in \text{clusters}} |C|$$
Rationale: Large clusters likely contain multiple natural subgroups. Splitting them first produces more balanced cluster sizes.
Pros: tends to produce balanced cluster sizes and is trivial to evaluate.
Cons: ignores dispersion; a large but tight cluster may be split while a small, spread-out cluster is left intact.
Strategy 2: Highest SSE (Within-Cluster Variance)
Split the cluster with the highest sum of squared errors: $$C^* = \arg\max_{C \in \text{clusters}} \sum_{x \in C} \|x - \mu_C\|^2$$
Rationale: High SSE indicates high dispersion—the cluster is spread out and likely contains multiple natural subgroups.
Pros: directly targets the quantity K-means minimizes, so it usually reduces total SSE fastest.
Cons: sensitive to outliers, which inflate SSE and can cause noisy clusters to be split repeatedly.
Strategy 3: Largest × SSE (Hybrid)
Balance both factors: $$C^* = \arg\max_{C \in \text{clusters}} |C| \cdot \text{SSE}(C)$$
Rationale: Large, dispersed clusters are most deserving of splits. Small tight clusters and large tight clusters are left alone.
| Strategy | Formula | Best For | Weakness |
|---|---|---|---|
| Largest | max \|C\| | Balanced sizes needed | Ignores dispersion |
| Highest SSE | max SSE(C) | Optimizing total SSE | Outlier sensitive |
| Hybrid | max \|C\|·SSE(C) | General purpose | Two hyperparams to balance |
| Lowest Density | max SSE(C)/\|C\| | Density-based analysis | Ignores absolute sizes |
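As a concrete illustration, the short sketch below scores a hypothetical set of cluster summaries under each rule from the table. The dictionary layout mirrors the implementation above; the numeric values and the treatment of "lowest density" as mean SSE per point are illustrative assumptions, not taken from the text.

import numpy as np

# Hypothetical cluster summaries (size, SSE); values chosen for illustration
clusters = [
    {'size': 500, 'sse': 800.0},   # large, moderately spread
    {'size': 100, 'sse': 900.0},   # small but very dispersed
    {'size': 300, 'sse': 150.0},   # medium, tight
]

scores = {
    'largest':        [c['size'] for c in clusters],
    'highest_sse':    [c['sse'] for c in clusters],
    'hybrid':         [c['size'] * c['sse'] for c in clusters],
    'lowest_density': [c['sse'] / c['size'] for c in clusters],  # mean SSE per point
}

for rule, vals in scores.items():
    print(f"{rule:15s} -> would split cluster {int(np.argmax(vals))}")
# 'largest' and 'hybrid' pick cluster 0; 'highest_sse' and 'lowest_density' pick cluster 1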
import numpy as np
from sklearn.metrics import silhouette_score


def compare_selection_strategies(X, K, n_trials=5):
    """
    Compare different cluster selection strategies.
    """
    strategies = ['largest', 'highest_sse', 'hybrid']
    results = {}

    for strategy in strategies:
        inertias = []
        silhouettes = []

        for trial in range(n_trials):
            bkm = BisectingKMeans(
                n_clusters=K,
                selection=strategy,
                n_bisections=3,
                random_state=42 + trial
            )
            bkm.fit(X)
            inertias.append(bkm.inertia_)
            silhouettes.append(silhouette_score(X, bkm.labels_))

        results[strategy] = {
            'mean_inertia': np.mean(inertias),
            'std_inertia': np.std(inertias),
            'mean_silhouette': np.mean(silhouettes),
            'std_silhouette': np.std(silhouettes)
        }

        print(f"{strategy:15}: inertia={results[strategy]['mean_inertia']:.2f} "
              f"(±{results[strategy]['std_inertia']:.2f}), "
              f"silhouette={results[strategy]['mean_silhouette']:.4f}")

    return results


# Create test data with varying cluster sizes
np.random.seed(42)
X = np.vstack([
    np.random.randn(500, 2) * 1.0 + [0, 0],    # Large, spread
    np.random.randn(100, 2) * 0.3 + [5, 5],    # Small, tight
    np.random.randn(300, 2) * 0.5 + [10, 0],   # Medium
    np.random.randn(50, 2) * 2.0 + [-5, 5],    # Small, very spread
])

print("Comparing selection strategies on imbalanced data:")
print("=" * 60)
results = compare_selection_strategies(X, K=4, n_trials=5)

Bisecting K-means often outperforms standard K-means despite its simplicity. Understanding why reveals important insights about optimization landscapes.
Standard K-means with K clusters searches over O(K^n) possible assignments of n points, and the optimization landscape becomes increasingly complex as K grows. Bisecting limits each optimization to K=2: a cluster of m points has only 2^m possible assignments, and after the first split each bisection operates on progressively smaller subsets.
Empirical Quality Comparison:
Research consistently shows that bisecting K-means achieves lower inertia than standard K-means on many real-world datasets:
| Dataset | Standard K-means (inertia) | Bisecting K-means (inertia) | Typical improvement |
|---|---|---|---|
| Text documents | Higher | Lower | 2-5% |
| Image features | Higher | Lower | 1-3% |
| Gene expression | Higher | Lower | 3-7% |
| Random Gaussian | Similar | Similar | <1% |
The advantage is most pronounced when the data has a natural hierarchical or nested structure, when cluster sizes are imbalanced, and in high-dimensional, sparse domains such as text; on isotropic random Gaussians the two methods perform about equally, as the table above suggests.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
import time


def comprehensive_comparison(X, y_true, K, n_trials=10):
    """
    Comprehensive comparison of bisecting vs standard K-means.
    """
    results = {
        'bisecting': {'inertia': [], 'ari': [], 'nmi': [], 'time': []},
        'standard': {'inertia': [], 'ari': [], 'nmi': [], 'time': []}
    }

    for trial in range(n_trials):
        # Bisecting K-means
        start = time.time()
        bkm = BisectingKMeans(n_clusters=K, selection='highest_sse',
                              n_bisections=5, random_state=trial)
        bkm.fit(X)
        results['bisecting']['time'].append(time.time() - start)
        results['bisecting']['inertia'].append(bkm.inertia_)
        results['bisecting']['ari'].append(adjusted_rand_score(y_true, bkm.labels_))
        results['bisecting']['nmi'].append(normalized_mutual_info_score(y_true, bkm.labels_))

        # Standard K-means
        start = time.time()
        km = KMeans(n_clusters=K, n_init=10, random_state=trial)
        km.fit(X)
        results['standard']['time'].append(time.time() - start)
        results['standard']['inertia'].append(km.inertia_)
        results['standard']['ari'].append(adjusted_rand_score(y_true, km.labels_))
        results['standard']['nmi'].append(normalized_mutual_info_score(y_true, km.labels_))

    print("Comparison: Bisecting vs Standard K-means")
    print("=" * 60)

    for method in ['bisecting', 'standard']:
        print(f"\n{method.upper()}:")
        print(f"  Inertia: {np.mean(results[method]['inertia']):.2f} "
              f"± {np.std(results[method]['inertia']):.2f}")
        print(f"  ARI:     {np.mean(results[method]['ari']):.4f} "
              f"± {np.std(results[method]['ari']):.4f}")
        print(f"  NMI:     {np.mean(results[method]['nmi']):.4f} "
              f"± {np.std(results[method]['nmi']):.4f}")
        print(f"  Time:    {np.mean(results[method]['time']):.4f}s "
              f"± {np.std(results[method]['time']):.4f}s")

    # Improvement summary
    inertia_improvement = (1 - np.mean(results['bisecting']['inertia']) /
                           np.mean(results['standard']['inertia'])) * 100
    print(f"\nBisecting inertia improvement: {inertia_improvement:.2f}%")

    return results


# Generate hierarchical data (where bisecting should excel)
np.random.seed(42)

# Create 2-level hierarchical structure
# Top level: 2 super-clusters
# Bottom level: 4 clusters each
X_list = []
y_list = []
cluster_id = 0

for super_center in [[0, 0], [20, 20]]:
    for offset in [[-2, -2], [2, -2], [-2, 2], [2, 2]]:
        center = np.array(super_center) + np.array(offset) * 2
        points = np.random.randn(100, 2) * 0.5 + center
        X_list.append(points)
        y_list.extend([cluster_id] * 100)
        cluster_id += 1

X_hierarchical = np.vstack(X_list)
y_true = np.array(y_list)

results = comprehensive_comparison(X_hierarchical, y_true, K=8, n_trials=10)

Unlike flat partitioning algorithms, bisecting K-means naturally produces a binary tree (dendrogram) of clusters. This hierarchy encodes relationships between clusters that can be exploited for analysis and for flexible control over granularity.
The Bisection Tree:
Each split in bisecting K-means creates a parent-child relationship:
                      [All Data]
                     /          \
            [Cluster A]        [Cluster B]
              /     \            /     \
           [A1]    [A2]       [B1]    [B2]
           /  \    /  \       /  \    /  \
         ...  ... ...  ...  ...  ... ...  ...
Uses of the Hierarchy: cutting the tree at different depths yields clusterings at multiple granularities without refitting, and the parent-child structure records how each cluster was carved out of a larger one, which aids interpretation and drill-down analysis.
import numpy as np
from collections import defaultdict


class BisectingKMeansWithHierarchy(BisectingKMeans):
    """
    Extended bisecting K-means that explicitly tracks hierarchy.
    """

    def fit(self, X):
        """Fit and build explicit tree structure."""
        super().fit(X)
        # Build tree from split history
        self._build_tree()
        return self

    def _build_tree(self):
        """Construct explicit tree structure from splits."""
        # Tree nodes: {node_id: {'parent': id, 'children': [ids], 'cluster_idx': k}}
        self.tree_ = {}

        # Start with root
        self.tree_[0] = {
            'parent': None,
            'children': [],
            'level': 0,
            'indices': None  # Will be filled
        }

        # Track current leaves
        current_leaves = [0]
        node_counter = 1

        # Replay splits to build tree
        for split in self.split_history_:
            split_node = current_leaves[split['parent_idx']]

            # Create two children
            child_1 = node_counter
            child_2 = node_counter + 1
            node_counter += 2

            self.tree_[split_node]['children'] = [child_1, child_2]
            self.tree_[child_1] = {
                'parent': split_node,
                'children': [],
                'level': self.tree_[split_node]['level'] + 1
            }
            self.tree_[child_2] = {
                'parent': split_node,
                'children': [],
                'level': self.tree_[split_node]['level'] + 1
            }

            # Update leaves
            current_leaves.pop(split['parent_idx'])
            current_leaves.append(child_1)
            current_leaves.append(child_2)

        # Assign final cluster indices to leaves
        self.leaf_nodes_ = current_leaves
        for k, leaf in enumerate(current_leaves):
            self.tree_[leaf]['cluster_idx'] = k

    def get_clusters_at_level(self, level):
        """
        Get cluster assignments at a specific hierarchy level.

        Level 0 = all data in one cluster
        Level 1 = 2 clusters
        Level log2(K) = K clusters
        """
        # Find nodes at this level (or their descendants)
        nodes_at_level = []

        def traverse(node_id, current_level):
            if current_level == level or len(self.tree_[node_id]['children']) == 0:
                nodes_at_level.append(node_id)
            else:
                for child in self.tree_[node_id]['children']:
                    traverse(child, current_level + 1)

        traverse(0, 0)

        # Map final cluster labels to level-clusters
        level_labels = np.zeros(len(self.labels_), dtype=int)
        for level_cluster_id, node in enumerate(nodes_at_level):
            # Find all leaf descendants
            leaf_clusters = self._get_leaf_descendants(node)
            for leaf_cluster in leaf_clusters:
                mask = self.labels_ == leaf_cluster
                level_labels[mask] = level_cluster_id

        return level_labels, len(nodes_at_level)

    def _get_leaf_descendants(self, node_id):
        """Get all leaf cluster indices under a node."""
        if len(self.tree_[node_id]['children']) == 0:
            return [self.tree_[node_id]['cluster_idx']]

        descendants = []
        for child in self.tree_[node_id]['children']:
            descendants.extend(self._get_leaf_descendants(child))
        return descendants

    def print_hierarchy(self):
        """Print tree structure."""
        def print_node(node_id, indent=""):
            node = self.tree_[node_id]
            if len(node['children']) == 0:
                k = node['cluster_idx']
                size = np.sum(self.labels_ == k)
                print(f"{indent}Leaf Cluster {k} (n={size})")
            else:
                print(f"{indent}Internal Node (level={node['level']})")
                for child in node['children']:
                    print_node(child, indent + "  ")

        print("Cluster Hierarchy:")
        print_node(0)


# Demonstrate hierarchy
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) * 0.5 + [0, 0],
    np.random.randn(100, 2) * 0.5 + [3, 0],
    np.random.randn(100, 2) * 0.5 + [6, 0],
    np.random.randn(100, 2) * 0.5 + [9, 0],
    np.random.randn(100, 2) * 0.5 + [12, 0],
    np.random.randn(100, 2) * 0.5 + [15, 0],
    np.random.randn(100, 2) * 0.5 + [18, 0],
    np.random.randn(100, 2) * 0.5 + [21, 0],
])

bkm = BisectingKMeansWithHierarchy(n_clusters=8, selection='largest', random_state=42)
bkm.fit(X)

bkm.print_hierarchy()

print("\nMulti-resolution clustering:")
for level in range(4):
    labels_level, n_clusters = bkm.get_clusters_at_level(level)
    print(f"  Level {level}: {n_clusters} clusters, "
          f"sizes = {np.bincount(labels_level)}")

Understanding bisecting K-means' computational behavior helps in capacity planning and algorithm selection.
Time Complexity:
Bisecting K-means performs K-1 splits to reach K clusters. Each split runs K-means with K=2 on the points of the selected cluster, possibly over several trial restarts, keeping the best.
Per Split: O(m × t × 2 × D) = O(mtD), where m is the number of points in the cluster being split, t is the number of K-means iterations per trial, D is the dimensionality, and the factor 2 is the two centers.
Total Across All Splits:
The sum of cluster sizes across all splits varies by selection strategy, but for balanced splits each level of the bisection tree touches roughly n points in total and there are about log₂K levels: $$\sum_{\text{splits}} m_i \approx \underbrace{n}_{\text{level 1}} + \underbrace{\tfrac{n}{2} + \tfrac{n}{2}}_{\text{level 2}} + \dots = O(n \log_2 K)$$ (Highly unbalanced splits can approach O(nK) in the worst case.)
Total Time Complexity: O(ntD log K)
Compared to standard K-means: O(ntKD)
For balanced clusters and similar t, bisecting is faster whenever log₂K < K, which holds for every K ≥ 2.
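As a rough back-of-the-envelope check of these bounds (illustrative numbers only, assuming balanced splits, two centers per bisection, and the same iteration count t for both methods):

import math

n, D, t, K = 1_000_000, 128, 50, 100   # assumed problem sizes, for illustration only

standard  = n * t * K * D                    # every iteration: n points vs K centers
bisecting = 2 * n * t * D * math.log2(K)     # ~n*log2(K) points processed, 2 centers per split

print(f"standard : {standard:.2e} distance computations")
print(f"bisecting: {bisecting:.2e} distance computations")
print(f"ratio    : ~{standard / bisecting:.1f}x")   # about K / (2*log2 K), here roughly 7.5x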
| Algorithm | Time | Space | Notes |
|---|---|---|---|
| Standard K-means | O(ntKD) | O(nD + KD) | All data every iteration |
| Bisecting K-means | O(ntD log K) | O(nD + KD) | Subsets per split |
| Mini-batch K-means | O(btKD) | O(bD + KD) | Fixed batch b << n |
Combine bisecting K-means with mini-batch K-means for each split: the resulting mini-batch bisecting K-means keeps the structural benefits of divisive clustering while gaining the speed of stochastic optimization, as sketched below.
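A minimal sketch of that hybrid, assuming the BisectingKMeans class implemented earlier on this page: only the bisection step is overridden to use scikit-learn's MiniBatchKMeans with two centers. The subclass name and batch_size default are illustrative choices, not an established API.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

class MiniBatchBisectingKMeans(BisectingKMeans):
    """Bisecting K-means whose individual K=2 splits use mini-batch K-means."""

    def __init__(self, batch_size=1024, **kwargs):
        super().__init__(**kwargs)
        self.batch_size = batch_size

    def _bisect_cluster(self, X, n_trials=5):
        # Same trial-and-keep-best logic as the parent, but each trial is a
        # stochastic mini-batch optimization instead of full K-means.
        best_inertia, best_labels, best_centers = np.inf, None, None
        for trial in range(n_trials):
            seed = self.random_state + trial if self.random_state is not None else None
            mbk = MiniBatchKMeans(n_clusters=2, batch_size=self.batch_size,
                                  n_init=1, random_state=seed)
            mbk.fit(X)
            if mbk.inertia_ < best_inertia:
                best_inertia, best_labels, best_centers = (
                    mbk.inertia_, mbk.labels_, mbk.cluster_centers_)
        return best_labels, best_centers, best_inertia

# Usage sketch: same interface as BisectingKMeans, but each split is stochastic
# mbbk = MiniBatchBisectingKMeans(n_clusters=20, batch_size=2048, random_state=0).fit(X)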
Space Complexity:
Total: O(nD + KD) — same as standard K-means.
Parallelization Potential: the trial restarts within a single bisection are independent and can run concurrently, each K=2 run can itself be parallelized over points, and once several leaf clusters exist, candidate splits of different clusters can be evaluated in parallel. A sketch of the first option follows below.
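For instance, the independent restarts of a single bisection can be spread across cores. A minimal sketch using joblib (the helper names here are illustrative; this is one possible parallelization, not the only one):

from joblib import Parallel, delayed
from sklearn.cluster import KMeans

def _one_bisection_trial(X, max_iter, seed):
    """Run one K=2 trial and return (inertia, labels, centers)."""
    km = KMeans(n_clusters=2, n_init=1, max_iter=max_iter, random_state=seed)
    km.fit(X)
    return km.inertia_, km.labels_, km.cluster_centers_

def parallel_bisect(X, n_trials=8, max_iter=100, n_jobs=-1):
    """Run all bisection trials in parallel and keep the lowest-inertia split."""
    trials = Parallel(n_jobs=n_jobs)(
        delayed(_one_bisection_trial)(X, max_iter, seed) for seed in range(n_trials)
    )
    return min(trials, key=lambda t: t[0])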
Guidelines for deploying bisecting K-means effectively in production systems.
# Scikit-learn includes BisectingKMeans as of version 1.1
from sklearn.cluster import BisectingKMeans as SklearnBisecting
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np


def production_bisecting_kmeans(X, K_max=10, random_state=42):
    """
    Production bisecting K-means with automated K selection.
    """
    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print(f"Data: {X.shape[0]} samples × {X.shape[1]} features")
    print()

    # Find optimal K using silhouette
    results = []
    best_K = 2
    best_score = -1

    for K in range(2, K_max + 1):
        bkm = SklearnBisecting(
            n_clusters=K,
            init='k-means++',                       # Initialization for bisections
            n_init=5,                               # Number of bisection trials
            bisecting_strategy='largest_cluster',   # or 'biggest_inertia'
            random_state=random_state
        )
        labels = bkm.fit_predict(X_scaled)

        inertia = bkm.inertia_
        silhouette = silhouette_score(X_scaled, labels)

        results.append({
            'K': K,
            'inertia': inertia,
            'silhouette': silhouette
        })

        print(f"K={K}: silhouette={silhouette:.4f}, inertia={inertia:.2e}")

        if silhouette > best_score:
            best_score = silhouette
            best_K = K

    print(f"\nOptimal K: {best_K} (silhouette={best_score:.4f})")

    # Train final model
    final_model = SklearnBisecting(
        n_clusters=best_K,
        init='k-means++',
        n_init=10,  # More trials for final
        random_state=random_state
    )
    final_model.fit(X_scaled)

    return {
        'model': final_model,
        'scaler': scaler,
        'optimal_K': best_K,
        'results': results,
        'cluster_centers_original': scaler.inverse_transform(final_model.cluster_centers_)
    }


# Example
np.random.seed(42)
X = np.vstack([
    np.random.randn(200, 5) + np.random.randn(5) * 3
    for _ in range(6)
])

result = production_bisecting_kmeans(X, K_max=10)

print(f"\nFinal cluster sizes: {np.bincount(result['model'].labels_)}")

We've comprehensively explored bisecting K-means and seen how the divisive approach provides distinct advantages over direct partitioning.
Module Complete:
This completes our exploration of K-means variants. Each variant addresses specific limitations of standard K-means, providing a rich toolkit for diverse clustering challenges.
You now have mastery over the complete family of K-means variants. You can select the appropriate variant based on your data characteristics (outliers, scale, structure), computational constraints (dataset size, available time), and analytical requirements (hard vs. soft clustering, hierarchy).