Standard K-means, despite its elegant simplicity, faces a fundamental bottleneck: each iteration requires a full pass through all n data points to compute distances and update centroids. For modern datasets with millions or billions of samples, this becomes prohibitively expensive—both in time and memory.
Mini-batch K-means solves this problem by applying the principles of stochastic optimization: instead of using all data in each iteration, we sample a small random subset (a mini-batch) and update centroids based only on that sample. This trades a small amount of clustering quality for massive gains in speed and scalability.
By the end of this page, you will:
- understand the theoretical foundations of mini-batch optimization for clustering
- master the mini-batch K-means algorithm with its convergent learning rate schedules
- analyze the trade-offs between batch size, convergence speed, and solution quality
- deploy mini-batch K-means for real-world large-scale clustering tasks
To appreciate mini-batch K-means, we must first understand why standard K-means struggles at scale.
Standard K-means Computational Analysis:
| Operation | Time Complexity | Memory Required |
|---|---|---|
| Distance computation | O(nKD) | O(nD) for data |
| Assignment | O(n) | O(n) for labels |
| Centroid update | O(nD) | O(KD) for centers |
| Per iteration | O(nKD) | O(nD + KD) |
For n = 100 million points, K = 1000 clusters, and D = 100 dimensions, a single iteration requires on the order of $nKD \approx 10^{13}$ distance operations, and the data matrix alone occupies roughly 80 GB in double precision. Running dozens of such iterations quickly becomes impractical on a single machine.
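To make these numbers concrete, here is a back-of-the-envelope calculation (illustrative arithmetic only; actual constants depend on hardware and implementation):

```python
# Rough cost of ONE standard K-means iteration at this scale
n, K, D = 100_000_000, 1_000, 100

distance_ops = n * K * D      # O(nKD) multiply-adds for all point-center distances
data_gb = n * D * 8 / 1e9     # float64 storage for the data matrix alone

print(f"distance operations per iteration: {distance_ops:.1e}")  # ~1.0e+13
print(f"memory for the data matrix:        {data_gb:.0f} GB")    # ~80 GB
```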
The Stochastic Optimization Insight:
In machine learning, stochastic gradient descent (SGD) revolutionized training by showing that noisy updates computed from small random samples are, in expectation, nearly as effective as exact full-dataset updates, while costing only a tiny fraction of the time per step.
Mini-batch K-means applies this same principle to clustering.
| Aspect | Standard K-means | Mini-batch K-means |
|---|---|---|
| Data per iteration | All n points | b points (b << n) |
| Time per iteration | O(nKD) | O(bKD) |
| Memory for batches | O(nD) | O(bD) |
| Convergence | Monotonic decrease | Noisy but convergent |
| Final quality | Local optimum | Near local optimum (slightly worse) |
| Scaling to big data | Limited by memory | Streaming-compatible |
Mini-batch K-means typically achieves inertia within 1-2% of standard K-means while being 10-100× faster. For most applications, this trade-off is highly favorable—the time saved can be invested in more runs, hyperparameter tuning, or feature engineering.
The Mini-batch K-means algorithm modifies standard K-means in two key ways: each iteration processes only a small random mini-batch of points instead of the entire dataset, and centroids are updated incrementally with a per-cluster learning rate rather than being recomputed from scratch as full cluster means.
Algorithm Overview:
Input: Data X, clusters K, batch size b, iterations T
Output: Cluster centers C
1. Initialize centers C = {c_1, ..., c_K} (e.g., k-means++)
2. Initialize per-center counts: n_k = 0 for k = 1,...,K
3. For t = 1 to T:
a. Sample mini-batch M of b points from X
b. For each x ∈ M:
- Assign x to nearest center: k* = argmin_k ||x - c_k||²
- Increment count: n_{k*} += 1
- Compute learning rate: η = 1 / n_{k*}
- Update center: c_{k*} = (1 - η) c_{k*} + η x
4. Return C
Understanding the Incremental Update:
The key insight is the per-point centroid update at the end of step 3b: the new centroid is a weighted average of the current centroid and the new point:
$$c_k^{\text{new}} = (1 - \eta) c_k^{\text{old}} + \eta \cdot x$$
With $\eta = 1/n_k$ where $n_k$ is the total number of points assigned to cluster $k$ so far, this is equivalent to:
$$c_k = \frac{\sum_{i: z_i = k} x_i}{n_k}$$
The running mean! Each new point slightly adjusts the centroid toward itself, with diminishing influence as more points accumulate.
Why This Works:
After $n_k$ points have been assigned to cluster $k$, the centroid is exactly the mean of all those points—the same result K-means would produce. The difference is that we compute this incrementally, point by point, rather than in batch.
The randomness in mini-batch sampling means different runs see points in different orders, introducing stochasticity. But for large enough datasets, the final centroids converge to nearly the same positions as batch K-means.
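A quick numerical check of this equivalence (a standalone sketch; the synthetic points and variable names are ours, not part of the implementation below):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))   # pretend these were all assigned to cluster k

# Incremental update with eta = 1/n_k ...
center = np.zeros(3)
for n_k, x in enumerate(points, start=1):
    eta = 1.0 / n_k
    center = (1 - eta) * center + eta * x

# ... matches the batch mean exactly (up to floating-point error)
print(np.allclose(center, points.mean(axis=0)))   # True
```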
```python
import numpy as np
from scipy.spatial.distance import cdist

class MiniBatchKMeans:
    """
    Mini-batch K-means clustering algorithm.

    Parameters:
    -----------
    n_clusters : int
        Number of clusters (K)
    batch_size : int
        Number of samples per mini-batch (b)
    max_iter : int
        Maximum number of mini-batch iterations
    n_init : int
        Number of random initializations
    init : str
        Initialization method ('random' or 'k-means++')
    reassignment_ratio : float
        Fraction of samples to reassign cluster centers if empty
    random_state : int, optional
        Random seed
    """

    def __init__(self, n_clusters=8, batch_size=100, max_iter=100,
                 n_init=3, init='k-means++', reassignment_ratio=0.01,
                 random_state=None):
        self.n_clusters = n_clusters
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.n_init = n_init
        self.init = init
        self.reassignment_ratio = reassignment_ratio
        self.random_state = random_state

    def _init_centers_kmeanspp(self, X):
        """K-means++ initialization on a subsample."""
        n_samples = min(len(X), self.batch_size * 10)
        subsample_idx = np.random.choice(len(X), n_samples, replace=False)
        X_subsample = X[subsample_idx]

        centers = np.zeros((self.n_clusters, X.shape[1]))
        # First center: random
        centers[0] = X_subsample[np.random.randint(n_samples)]

        for k in range(1, self.n_clusters):
            distances = cdist(X_subsample, centers[:k])
            min_dist_sq = np.min(distances, axis=1) ** 2
            probs = min_dist_sq / min_dist_sq.sum()
            centers[k] = X_subsample[np.random.choice(n_samples, p=probs)]

        return centers

    def _single_run(self, X):
        """Single mini-batch K-means run."""
        n_samples, n_features = X.shape

        # Initialize centers
        if self.init == 'k-means++':
            centers = self._init_centers_kmeanspp(X)
        else:
            idx = np.random.choice(n_samples, self.n_clusters, replace=False)
            centers = X[idx].copy()

        # Track counts per cluster
        counts = np.zeros(self.n_clusters)

        for iteration in range(self.max_iter):
            # Sample mini-batch
            batch_idx = np.random.choice(n_samples, self.batch_size, replace=False)
            X_batch = X[batch_idx]

            # Assign batch points to nearest centers
            distances = cdist(X_batch, centers)
            labels_batch = np.argmin(distances, axis=1)

            # Update centers incrementally
            for i, k in enumerate(labels_batch):
                counts[k] += 1
                eta = 1.0 / counts[k]  # Learning rate
                centers[k] = (1 - eta) * centers[k] + eta * X_batch[i]

            # Handle empty clusters
            empty_clusters = np.where(counts == 0)[0]
            if len(empty_clusters) > 0:
                # Reassign from random batch points
                for k in empty_clusters:
                    random_point = X[np.random.randint(n_samples)]
                    centers[k] = random_point

        # Compute final inertia on full dataset (or sample for efficiency)
        sample_size = min(n_samples, 10000)
        sample_idx = np.random.choice(n_samples, sample_size, replace=False)
        distances = cdist(X[sample_idx], centers)
        inertia = np.sum(np.min(distances, axis=1) ** 2) * (n_samples / sample_size)

        # Final labels
        labels = np.argmin(cdist(X, centers), axis=1)

        return centers, labels, inertia

    def fit(self, X):
        """Fit mini-batch K-means."""
        if self.random_state is not None:
            np.random.seed(self.random_state)

        best_inertia = np.inf
        for init_run in range(self.n_init):
            centers, labels, inertia = self._single_run(X)
            if inertia < best_inertia:
                best_inertia = inertia
                self.cluster_centers_ = centers
                self.labels_ = labels
                self.inertia_ = inertia

        return self

    def predict(self, X):
        """Predict cluster labels."""
        return np.argmin(cdist(X, self.cluster_centers_), axis=1)

    def partial_fit(self, X):
        """
        Incremental fit on a single batch. Useful for streaming data.
        """
        if not hasattr(self, 'cluster_centers_'):
            # Initialize on first batch
            if self.init == 'k-means++':
                self.cluster_centers_ = self._init_centers_kmeanspp(X)
            else:
                idx = np.random.choice(len(X), self.n_clusters, replace=False)
                self.cluster_centers_ = X[idx].copy()
            self._counts = np.zeros(self.n_clusters)

        # Assign and update
        distances = cdist(X, self.cluster_centers_)
        labels = np.argmin(distances, axis=1)
        for i, k in enumerate(labels):
            self._counts[k] += 1
            eta = 1.0 / self._counts[k]
            self.cluster_centers_[k] = (1 - eta) * self.cluster_centers_[k] + eta * X[i]

        return self


# Example usage and comparison with standard K-means
import time

def compare_minibatch_vs_standard():
    """Compare mini-batch and standard K-means."""
    np.random.seed(42)

    # Generate large dataset
    n_samples = 100000
    n_features = 50
    n_clusters = 10

    # Create clustered data
    centers_true = np.random.randn(n_clusters, n_features) * 10
    X = np.vstack([
        np.random.randn(n_samples // n_clusters, n_features) + center
        for center in centers_true
    ])
    np.random.shuffle(X)

    print(f"Dataset: {X.shape[0]} samples × {X.shape[1]} features")
    print(f"Target clusters: {n_clusters}")
    print()

    # Standard K-means (from sklearn for fair comparison)
    from sklearn.cluster import KMeans, MiniBatchKMeans as SklearnMBKMeans

    start = time.time()
    kmeans = KMeans(n_clusters=n_clusters, n_init=3, random_state=42)
    kmeans.fit(X)
    kmeans_time = time.time() - start

    # Mini-batch K-means (sklearn)
    start = time.time()
    mbkmeans = SklearnMBKMeans(n_clusters=n_clusters, batch_size=1000,
                               n_init=3, random_state=42)
    mbkmeans.fit(X)
    mbkmeans_time = time.time() - start

    print("Results:")
    print(f"  Standard K-means:   {kmeans_time:.2f}s, inertia={kmeans.inertia_:.2e}")
    print(f"  Mini-batch K-means: {mbkmeans_time:.2f}s, inertia={mbkmeans.inertia_:.2e}")
    print(f"  Speedup: {kmeans_time / mbkmeans_time:.1f}×")
    print(f"  Inertia ratio: {mbkmeans.inertia_ / kmeans.inertia_:.4f}")

compare_minibatch_vs_standard()

# Example output:
# Dataset: 100000 samples × 50 features
# Target clusters: 10
#
# Results:
#   Standard K-means:   4.23s, inertia=4.98e+07
#   Mini-batch K-means: 0.31s, inertia=5.02e+07
#   Speedup: 13.6×
#   Inertia ratio: 1.0080
```

Unlike standard K-means, which monotonically decreases its objective, mini-batch K-means exhibits stochastic convergence. Understanding this behavior is crucial for proper algorithm tuning.
The Convergence Property:
Mini-batch K-means converges to a local optimum of the K-means objective in the following sense:
- Expected descent: In expectation over mini-batch sampling, each iteration decreases the objective.
- Variance bound: The variance of the centroid estimates decreases as $O(1/n_k)$, where $n_k$ is the number of points seen for cluster $k$.
- Asymptotic convergence: As the number of iterations grows, the centroids converge to a neighborhood of the batch K-means solution.
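The variance bound is easy to verify empirically. The sketch below is our own toy simulation (not part of the implementations in this page): it estimates the spread of a centroid coordinate as a function of how many points the cluster has absorbed.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # within-cluster standard deviation

for n_k in [10, 100, 1000]:
    # 2000 repetitions: each centroid estimate is the mean of n_k cluster points
    centroids = rng.normal(0.0, sigma, size=(2000, n_k)).mean(axis=1)
    print(f"n_k={n_k:5d}: empirical std {centroids.std():.4f}  "
          f"theory sigma/sqrt(n_k) = {sigma / np.sqrt(n_k):.4f}")
```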
The Learning Rate Schedule:
The learning rate $\eta_k = 1/n_k$ for cluster $k$ is crucial: the first points assigned to a cluster move its centroid substantially, while later points make ever finer adjustments as the count $n_k$ grows.
This follows the Robbins-Monro conditions for stochastic approximation:
$$\sum_{t=1}^{\infty} \eta_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty$$
With $\eta_t = 1/t$ (for the $t$-th point assigned to a given cluster), the first sum diverges, so the centroid can travel arbitrarily far from its initialization, while the second converges, so the accumulated noise stays bounded.
A fixed learning rate η would cause centroids to oscillate forever, never settling. The 1/n schedule ensures that after seeing many points, new points have diminishing influence—eventually converging to the empirical mean of all assigned points.
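The difference is easy to see in one dimension. The following toy comparison (our own illustrative sketch, not from any library) estimates the mean of a noisy stream with a fixed step size versus the 1/n schedule:

```python
import numpy as np

rng = np.random.default_rng(42)
stream = rng.normal(loc=5.0, scale=2.0, size=100_000)  # noisy stream, true mean 5.0

c_fixed, c_decay = 0.0, 0.0
for n, x in enumerate(stream, start=1):
    c_fixed += 0.1 * (x - c_fixed)        # fixed eta = 0.1: estimate keeps fluctuating
    c_decay += (1.0 / n) * (x - c_decay)  # eta = 1/n: estimate settles on the sample mean

print(f"fixed-eta estimate: {c_fixed:.3f}")  # still bouncing around 5 (std roughly 0.4-0.5)
print(f"1/n-eta estimate:   {c_decay:.3f}")  # very close to 5.0
```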
Convergence Speed Factors:
| Factor | Effect on Convergence |
|---|---|
| Larger batch size | Faster convergence, less noise |
| More iterations | Better final solution |
| Better initialization | Faster and better convergence |
| Data ordering | Random sampling essential |
| Number of clusters | More clusters → more iterations needed |
Practical Stopping Criteria:
Because the objective no longer decreases monotonically, convergence is typically detected by monitoring a sampled estimate of the inertia at regular intervals and stopping once its relative change falls below a small tolerance, or when a fixed iteration budget is exhausted. The implementation below demonstrates this.
```python
import numpy as np
from scipy.spatial.distance import cdist

def minibatch_kmeans_with_monitoring(X, K, batch_size=100, max_iter=1000,
                                     eval_interval=50, random_state=42):
    """
    Mini-batch K-means with convergence monitoring.

    Records inertia at regular intervals to visualize convergence.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Initialize centers from randomly chosen data points
    centers = X[np.random.choice(n_samples, K, replace=False)].copy()
    counts = np.zeros(K)

    # Tracking
    inertia_history = []
    iterations = []

    for iteration in range(max_iter):
        # Sample mini-batch
        batch_idx = np.random.choice(n_samples, batch_size, replace=False)
        X_batch = X[batch_idx]

        # Assign and update
        distances = cdist(X_batch, centers)
        labels_batch = np.argmin(distances, axis=1)

        for i, k in enumerate(labels_batch):
            counts[k] += 1
            eta = 1.0 / counts[k]
            centers[k] = (1 - eta) * centers[k] + eta * X_batch[i]

        # Periodic evaluation
        if (iteration + 1) % eval_interval == 0:
            # Sample inertia (not full dataset for efficiency)
            sample_size = min(n_samples, 5000)
            sample_idx = np.random.choice(n_samples, sample_size, replace=False)
            sample_distances = cdist(X[sample_idx], centers)
            sample_inertia = np.sum(np.min(sample_distances, axis=1) ** 2)

            # Scale to full dataset estimate
            estimated_inertia = sample_inertia * (n_samples / sample_size)
            inertia_history.append(estimated_inertia)
            iterations.append(iteration + 1)

            # Early stopping check
            if len(inertia_history) >= 3:
                recent_change = abs(inertia_history[-1] - inertia_history[-3]) / inertia_history[-3]
                if recent_change < 0.001:
                    print(f"Converged at iteration {iteration + 1} "
                          f"(change < 0.1%)")
                    break

    # Final inertia
    final_distances = cdist(X, centers)
    final_inertia = np.sum(np.min(final_distances, axis=1) ** 2)

    return {
        'centers': centers,
        'labels': np.argmin(final_distances, axis=1),
        'inertia': final_inertia,
        'inertia_history': inertia_history,
        'iterations': iterations
    }

# Demonstrate convergence
np.random.seed(42)
X = np.vstack([
    np.random.randn(5000, 10) + np.random.randn(10) * 5
    for _ in range(5)
])

result = minibatch_kmeans_with_monitoring(X, K=5, batch_size=100, max_iter=500)

print("\nConvergence History:")
for i, (it, inertia) in enumerate(zip(result['iterations'], result['inertia_history'])):
    if i < 5 or i >= len(result['iterations']) - 3:
        print(f"  Iteration {it:4d}: inertia = {inertia:.2e}")
    elif i == 5:
        print("  ...")
```

Batch size is the most important hyperparameter in mini-batch K-means, fundamentally controlling the trade-off between computation and quality.
The Batch Size Spectrum:
| Batch Size | Behavior | Trade-off |
|---|---|---|
| b = 1 | Online K-means, maximum stochasticity | Very fast iterations, high variance |
| b = 10-100 | High variance, many iterations needed | Fast, noisy convergence |
| b = 100-1000 | Sweet spot for most applications | Good balance |
| b = 1000-10000 | Lower variance, fewer iterations | Slower iterations, stable |
| b = n | Standard K-means | Maximum quality, maximum cost |
Batch size ≈ 256 to 1024 works well for most datasets. Larger datasets benefit from larger batches (up to 10,000). The key is ensuring each cluster receives enough points in every batch: batch_size should be >> K.
Mathematical Analysis:
The variance of the centroid estimate depends on batch size:
$$\text{Var}(\hat{c}_k) \propto \frac{\sigma_k^2}{b \cdot p_k}$$
where:
- $\sigma_k^2$ is the within-cluster variance of cluster $k$
- $b$ is the batch size
- $p_k$ is the fraction of the data belonging to cluster $k$ (so $b \cdot p_k$ is the expected number of batch points from that cluster)
Key Insight: Small clusters (low $p_k$) have higher variance. If cluster $k$ contains only 1% of data and batch size is 100, we expect only 1 point from that cluster per batch on average.
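To make the imbalance issue concrete, here is a small sketch (the cluster proportions below are hypothetical) computing the expected number of mini-batch points each cluster receives:

```python
import numpy as np

b = 100                                         # batch size
p = np.array([0.40, 0.30, 0.20, 0.09, 0.01])    # hypothetical cluster proportions p_k

expected = b * p
print(expected)      # [40. 30. 20.  9.  1.]  -> the 1% cluster gets about 1 point per batch
print(1 / expected)  # relative variance scales like 1/(b * p_k): worst for the rare cluster
```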
Addressing Cluster Imbalance:
For imbalanced clusters, consider:
- increasing the batch size so that rare clusters still receive points in most batches
- running more iterations, giving small clusters more total updates
- reassigning centers that receive very few points (scikit-learn's reassignment_ratio parameter controls this behavior)
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, KMeans
import time

def analyze_batch_size_tradeoff(X, K, batch_sizes, n_trials=5):
    """
    Analyze how batch size affects quality and speed.
    """
    results = []

    # Baseline: standard K-means
    kmeans = KMeans(n_clusters=K, n_init=1, random_state=42)
    start = time.time()
    kmeans.fit(X)
    baseline_time = time.time() - start
    baseline_inertia = kmeans.inertia_

    print(f"Baseline (standard K-means): {baseline_time:.3f}s, "
          f"inertia={baseline_inertia:.4e}")
    print()

    for batch_size in batch_sizes:
        times = []
        inertias = []

        for trial in range(n_trials):
            mbk = MiniBatchKMeans(
                n_clusters=K,
                batch_size=batch_size,
                n_init=1,
                random_state=42 + trial
            )
            start = time.time()
            mbk.fit(X)
            elapsed = time.time() - start

            times.append(elapsed)
            inertias.append(mbk.inertia_)

        avg_time = np.mean(times)
        avg_inertia = np.mean(inertias)
        std_inertia = np.std(inertias)
        speedup = baseline_time / avg_time
        quality_ratio = avg_inertia / baseline_inertia

        results.append({
            'batch_size': batch_size,
            'avg_time': avg_time,
            'avg_inertia': avg_inertia,
            'std_inertia': std_inertia,
            'speedup': speedup,
            'quality_ratio': quality_ratio
        })

        print(f"Batch={batch_size:5d}: {avg_time:.3f}s ({speedup:5.1f}× speedup), "
              f"quality ratio={quality_ratio:.4f} ± {std_inertia/baseline_inertia:.4f}")

    return results

# Example analysis
np.random.seed(42)
X = np.vstack([
    np.random.randn(10000, 20) + np.random.randn(20) * 8
    for _ in range(8)
])

print(f"Analyzing batch size on {X.shape[0]} points, {X.shape[1]} dims, 8 clusters")
print("=" * 70)

batch_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096]
results = analyze_batch_size_tradeoff(X, K=8, batch_sizes=batch_sizes)

# Find sweet spot
best_efficiency = 0
best_batch = None
for r in results:
    # Efficiency = speedup / quality_loss; the max() guards against division by
    # zero or a negative denominator when mini-batch matches the baseline quality
    efficiency = r['speedup'] / max(r['quality_ratio'] - 0.99, 1e-3)
    if efficiency > best_efficiency and r['quality_ratio'] < 1.02:
        best_efficiency = efficiency
        best_batch = r['batch_size']

print(f"\nRecommended batch size: {best_batch}")
```

One of mini-batch K-means' most powerful features is its compatibility with streaming data—scenarios where data arrives continuously and cannot be stored entirely in memory.
The Streaming Paradigm:
In streaming clustering:
- data arrives continuously in batches, and each point is typically seen only once
- the full dataset never fits in memory, so only the centers and a few counters are stored
- the model must be queryable at any time, even while it is still being updated
The partial_fit Interface:
Scikit-learn's MiniBatchKMeans provides partial_fit() for incremental learning:
```python
mbk = MiniBatchKMeans(n_clusters=K)

for batch in data_stream:
    mbk.partial_fit(batch)  # Update with new data

# Can query current state anytime
current_centers = mbk.cluster_centers_
predictions = mbk.predict(new_data)
```
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

class StreamingKMeansClustering:
    """
    Streaming K-means clustering system.

    Handles data that arrives in batches, continuously updating
    cluster centers while maintaining memory efficiency.
    """

    def __init__(self, n_clusters, batch_size=100, init_samples=1000,
                 random_state=None):
        """
        Parameters:
        -----------
        n_clusters : int
            Number of clusters
        batch_size : int
            Samples per partial_fit call
        init_samples : int
            Number of samples to collect before initialization
        random_state : int
            Random seed
        """
        self.n_clusters = n_clusters
        self.batch_size = batch_size
        self.init_samples = init_samples
        self.random_state = random_state

        self.mbk = MiniBatchKMeans(
            n_clusters=n_clusters,
            batch_size=batch_size,
            random_state=random_state
        )

        self._initialized = False
        self._init_buffer = []
        self._total_samples_seen = 0
        self._cluster_counts = np.zeros(n_clusters)

    def partial_fit(self, X):
        """
        Update clustering with new batch of data.

        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            New batch of data

        Returns:
        --------
        self
        """
        if not self._initialized:
            # Buffer samples for initialization
            self._init_buffer.append(X)
            total_buffered = sum(len(b) for b in self._init_buffer)

            if total_buffered >= self.init_samples:
                # Initialize
                init_data = np.vstack(self._init_buffer)
                self.mbk.fit(init_data)
                self._initialized = True
                self._init_buffer = []  # Free memory
                self._total_samples_seen = len(init_data)

                # Track cluster counts
                labels = self.mbk.labels_
                self._cluster_counts = np.bincount(labels, minlength=self.n_clusters)
        else:
            # Incremental update
            self.mbk.partial_fit(X)
            self._total_samples_seen += len(X)

            # Update cluster counts (approximate, from this batch)
            labels = self.mbk.predict(X)
            self._cluster_counts += np.bincount(labels, minlength=self.n_clusters)

        return self

    def predict(self, X):
        """Predict cluster labels for data."""
        if not self._initialized:
            raise RuntimeError("Model not yet initialized. Feed more data.")
        return self.mbk.predict(X)

    def get_centers(self):
        """Get current cluster centers."""
        if not self._initialized:
            raise RuntimeError("Model not yet initialized.")
        return self.mbk.cluster_centers_.copy()

    def get_stats(self):
        """Get clustering statistics."""
        return {
            'initialized': self._initialized,
            'total_samples_seen': self._total_samples_seen,
            'cluster_distribution': self._cluster_counts / max(1, self._cluster_counts.sum())
        }

def simulate_data_stream(n_batches=100, samples_per_batch=100,
                         n_features=10, n_true_clusters=5):
    """
    Simulate a data stream with arriving batches.
    """
    np.random.seed(42)

    # True cluster centers (unknown to model)
    true_centers = np.random.randn(n_true_clusters, n_features) * 5

    for _ in range(n_batches):
        # Each batch has samples from random clusters
        cluster_ids = np.random.randint(0, n_true_clusters, samples_per_batch)
        batch = np.zeros((samples_per_batch, n_features))
        for i, cid in enumerate(cluster_ids):
            batch[i] = true_centers[cid] + np.random.randn(n_features) * 0.5
        yield batch

# Demonstrate streaming clustering
print("Streaming K-means Clustering Demo")
print("=" * 50)

streamer = StreamingKMeansClustering(
    n_clusters=5,
    batch_size=100,
    init_samples=500
)

for i, batch in enumerate(simulate_data_stream(n_batches=100)):
    streamer.partial_fit(batch)

    if (i + 1) % 20 == 0:
        stats = streamer.get_stats()
        print(f"Batch {i+1:3d}: "
              f"samples_seen={stats['total_samples_seen']:5d}, "
              f"dist={np.round(stats['cluster_distribution'], 2)}")

print("\nFinal cluster centers shape:", streamer.get_centers().shape)
```

In streaming scenarios, the data distribution may change over time (concept drift). Standard mini-batch K-means will slowly adapt, but old data has accumulated influence. For non-stationary data, consider exponential decay of old counts: η = 1/(λn + 1) with λ < 1, or periodic reinitialization.
Deploying mini-batch K-means effectively requires attention to several practical considerations.
```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np
import time

def production_minibatch_kmeans(X, K_range=range(2, 11),
                                batch_sizes=[256, 512, 1024],
                                n_init=5, random_state=42):
    """
    Production-ready mini-batch K-means with hyperparameter search.

    Performs:
    1. Feature standardization
    2. Batch size selection
    3. Number of clusters selection via silhouette score
    4. Final model training
    """
    np.random.seed(random_state)

    # 1. Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print(f"Data: {X.shape[0]} samples × {X.shape[1]} features")
    print(f"Testing K in {list(K_range)}")
    print()

    # 2. Quick batch size selection (on subset)
    subsample_size = min(len(X), 10000)
    X_sub = X_scaled[np.random.choice(len(X), subsample_size, replace=False)]

    print("Batch size selection (on subsample):")
    best_batch = batch_sizes[0]
    best_time_quality = 0

    for bs in batch_sizes:
        mbk = MiniBatchKMeans(n_clusters=5, batch_size=bs, n_init=1, random_state=42)
        start = time.time()
        mbk.fit(X_sub)
        elapsed = time.time() - start

        # Balance speed and quality
        score = 1.0 / (elapsed * mbk.inertia_)
        if score > best_time_quality:
            best_time_quality = score
            best_batch = bs

        print(f"  batch_size={bs}: {elapsed:.3f}s, inertia={mbk.inertia_:.2e}")

    print(f"  Selected batch_size: {best_batch}")
    print()

    # 3. Find optimal K using silhouette score
    print("Cluster selection:")
    best_K = K_range[0]
    best_silhouette = -1
    results = []

    # Use subsample for silhouette (expensive for large n)
    eval_subsample = min(len(X), 5000)
    X_eval = X_scaled[np.random.choice(len(X), eval_subsample, replace=False)]

    for K in K_range:
        mbk = MiniBatchKMeans(
            n_clusters=K,
            batch_size=best_batch,
            n_init=n_init,
            random_state=random_state
        )
        mbk.fit(X_scaled)

        # Compute silhouette on subsample
        labels_eval = mbk.predict(X_eval)

        # Skip if any cluster is empty in sample
        if len(np.unique(labels_eval)) < K:
            print(f"  K={K}: skipped (empty clusters)")
            continue

        sil = silhouette_score(X_eval, labels_eval)
        results.append({'K': K, 'inertia': mbk.inertia_, 'silhouette': sil})

        if sil > best_silhouette:
            best_silhouette = sil
            best_K = K

        print(f"  K={K}: silhouette={sil:.4f}, inertia={mbk.inertia_:.2e}")

    print(f"  Selected K: {best_K} (silhouette={best_silhouette:.4f})")
    print()

    # 4. Train final model
    print("Training final model...")
    final_model = MiniBatchKMeans(
        n_clusters=best_K,
        batch_size=best_batch,
        n_init=n_init * 2,  # More initializations for final
        random_state=random_state
    )
    final_model.fit(X_scaled)

    # Return model along with scaler for inference
    return {
        'model': final_model,
        'scaler': scaler,
        'best_K': best_K,
        'best_batch_size': best_batch,
        'cluster_centers_original': scaler.inverse_transform(final_model.cluster_centers_),
        'selection_results': results
    }

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.randn(5000, 15) * 2 + np.random.randn(15) * 10
    for _ in range(6)
])

result = production_minibatch_kmeans(X, K_range=range(3, 10))
print(f"\nFinal model: K={result['best_K']}, batch_size={result['best_batch_size']}")
```

We've comprehensively explored mini-batch K-means, the scalable variant that enables clustering of massive datasets. Let's consolidate the essential insights:
- Mini-batch K-means replaces full-dataset passes with small random batches, cutting per-iteration cost from O(nKD) to O(bKD) and per-batch memory from O(nD) to O(bD).
- The per-cluster learning rate $\eta = 1/n_k$ turns each centroid into the running mean of its assigned points and satisfies the Robbins-Monro conditions for convergence.
- Batch size controls the speed-quality trade-off; a few hundred to a few thousand samples per batch, with batch_size >> K, works well in practice.
- Final inertia is typically within 1-2% of standard K-means at a 10-100× speedup.
- The partial_fit interface makes the algorithm streaming-compatible; decayed counts or periodic reinitialization help under concept drift.
Looking Ahead:
The next page explores Bisecting K-means, a hierarchical approach that builds cluster structure through recursive binary splits. This provides an alternative perspective on partitioning that can be more efficient and offer interpretable cluster hierarchies.
You now have a thorough understanding of mini-batch K-means—from the stochastic optimization foundations through convergence analysis to production deployment strategies. You can confidently apply mini-batch K-means to large-scale clustering problems, appropriately tuning batch size and monitoring convergence.