Standard K-means, despite its elegant simplicity, faces a fundamental bottleneck: each iteration requires a full pass through all n data points to compute distances and update centroids. For modern datasets with millions or billions of samples, this becomes prohibitively expensive—both in time and memory.
Mini-batch K-means solves this problem by applying the principles of stochastic optimization: instead of using all data in each iteration, we sample a small random subset (a mini-batch) and update centroids based only on that sample. This trades a small amount of clustering quality for massive gains in speed and scalability.
By the end of this page, you will:
- understand the theoretical foundations of mini-batch optimization for clustering
- master the mini-batch K-means algorithm with its convergent learning rate schedules
- analyze the trade-offs between batch size, convergence speed, and solution quality
- deploy mini-batch K-means for real-world large-scale clustering tasks
To appreciate mini-batch K-means, we must first understand why standard K-means struggles at scale.
Standard K-means Computational Analysis:
| Operation | Time Complexity | Memory Required |
|---|---|---|
| Distance computation | O(nKD) | O(nD) for data |
| Assignment | O(n) | O(n) for labels |
| Centroid update | O(nD) | O(KD) for centers |
| Per iteration | O(nKD) | O(nD + KD) |
For n = 100 million points, K = 1000 clusters, and D = 100 dimensions, a single iteration requires on the order of $nKD \approx 10^{13}$ distance operations, and the data matrix alone occupies roughly 80 GB in double precision. Running dozens of such iterations quickly becomes impractical on a single machine.
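To make these numbers concrete, here is a back-of-the-envelope calculation (illustrative arithmetic only; actual constants depend on hardware and implementation):

```python
# Rough cost of ONE standard K-means iteration at this scale
n, K, D = 100_000_000, 1_000, 100

distance_ops = n * K * D      # O(nKD) multiply-adds for all point-center distances
data_gb = n * D * 8 / 1e9     # float64 storage for the data matrix alone

print(f"distance operations per iteration: {distance_ops:.1e}")  # ~1.0e+13
print(f"memory for the data matrix:        {data_gb:.0f} GB")    # ~80 GB
```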
The Stochastic Optimization Insight:
In machine learning, stochastic gradient descent (SGD) revolutionized training by showing that noisy updates computed from small random samples are, in expectation, nearly as effective as exact full-dataset updates, while costing only a tiny fraction of the time per step.
Mini-batch K-means applies this same principle to clustering.
| Aspect | Standard K-means | Mini-batch K-means |
|---|---|---|
| Data per iteration | All n points | b points (b << n) |
| Time per iteration | O(nKD) | O(bKD) |
| Memory for batches | O(nD) | O(bD) |
| Convergence | Monotonic decrease | Noisy but convergent |
| Final quality | Local optimum | Near local optimum (slightly worse) |
| Scaling to big data | Limited by memory | Streaming-compatible |
Mini-batch K-means typically achieves inertia within 1-2% of standard K-means while being 10-100× faster. For most applications, this trade-off is highly favorable—the time saved can be invested in more runs, hyperparameter tuning, or feature engineering.
The Mini-batch K-means algorithm modifies standard K-means in two key ways: each iteration processes only a small random mini-batch of points instead of the entire dataset, and centroids are updated incrementally with a per-cluster learning rate rather than being recomputed from scratch as full cluster means.
Algorithm Overview:
Input: Data X, clusters K, batch size b, iterations T
Output: Cluster centers C
1. Initialize centers C = {c_1, ..., c_K} (e.g., k-means++)
2. Initialize per-center counts: n_k = 0 for k = 1,...,K
3. For t = 1 to T:
a. Sample mini-batch M of b points from X
b. For each x ∈ M:
- Assign x to nearest center: k* = argmin_k ||x - c_k||²
- Increment count: n_{k*} += 1
- Compute learning rate: η = 1 / n_{k*}
- Update center: c_{k*} = (1 - η) c_{k*} + η x
4. Return C
Understanding the Incremental Update:
The key insight is the per-point centroid update at the end of step 3b: the new centroid is a weighted average of the current centroid and the new point:
$$c_k^{\text{new}} = (1 - \eta) c_k^{\text{old}} + \eta \cdot x$$
With $\eta = 1/n_k$ where $n_k$ is the total number of points assigned to cluster $k$ so far, this is equivalent to:
$$c_k = \frac{\sum_{i: z_i = k} x_i}{n_k}$$
The running mean! Each new point slightly adjusts the centroid toward itself, with diminishing influence as more points accumulate.
Why This Works:
After $n_k$ points have been assigned to cluster $k$, the centroid is exactly the mean of all those points—the same result K-means would produce. The difference is that we compute this incrementally, point by point, rather than in batch.
The randomness in mini-batch sampling means different runs see points in different orders, introducing stochasticity. But for large enough datasets, the final centroids converge to nearly the same positions as batch K-means.
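A quick numerical check of this equivalence (a standalone sketch; the synthetic points and variable names are ours, not part of the implementation below):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))   # pretend these were all assigned to cluster k

# Incremental update with eta = 1/n_k ...
center = np.zeros(3)
for n_k, x in enumerate(points, start=1):
    eta = 1.0 / n_k
    center = (1 - eta) * center + eta * x

# ... matches the batch mean exactly (up to floating-point error)
print(np.allclose(center, points.mean(axis=0)))   # True
```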
```python
import numpy as np
from scipy.spatial.distance import cdist

class MiniBatchKMeans:
    """
    Mini-batch K-means clustering algorithm.

    Parameters:
    -----------
    n_clusters : int
        Number of clusters (K)
    batch_size : int
        Number of samples per mini-batch (b)
    max_iter : int
        Maximum number of mini-batch iterations
    n_init : int
        Number of random initializations
    init : str
        Initialization method ('random' or 'k-means++')
    reassignment_ratio : float
        Fraction of samples to reassign cluster centers if empty
    random_state : int, optional
        Random seed
    """

    def __init__(self, n_clusters=8, batch_size=100, max_iter=100,
                 n_init=3, init='k-means++', reassignment_ratio=0.01,
                 random_state=None):
        self.n_clusters = n_clusters
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.n_init = n_init
        self.init = init
        self.reassignment_ratio = reassignment_ratio
        self.random_state = random_state

    def _init_centers_kmeanspp(self, X):
        """K-means++ initialization on a subsample."""
        n_samples = min(len(X), self.batch_size * 10)
        subsample_idx = np.random.choice(len(X), n_samples, replace=False)
        X_subsample = X[subsample_idx]

        centers = np.zeros((self.n_clusters, X.shape[1]))
        # First center: random
        centers[0] = X_subsample[np.random.randint(n_samples)]

        for k in range(1, self.n_clusters):
            distances = cdist(X_subsample, centers[:k])
            min_dist_sq = np.min(distances, axis=1) ** 2
            probs = min_dist_sq / min_dist_sq.sum()
            centers[k] = X_subsample[np.random.choice(n_samples, p=probs)]

        return centers

    def _single_run(self, X):
        """Single mini-batch K-means run."""
        n_samples, n_features = X.shape

        # Initialize centers
        if self.init == 'k-means++':
            centers = self._init_centers_kmeanspp(X)
        else:
            idx = np.random.choice(n_samples, self.n_clusters, replace=False)
            centers = X[idx].copy()

        # Track counts per cluster
        counts = np.zeros(self.n_clusters)

        for iteration in range(self.max_iter):
            # Sample mini-batch
            batch_idx = np.random.choice(n_samples, self.batch_size, replace=False)
            X_batch = X[batch_idx]

            # Assign batch points to nearest centers
            distances = cdist(X_batch, centers)
            labels_batch = np.argmin(distances, axis=1)

            # Update centers incrementally
            for i, k in enumerate(labels_batch):
                counts[k] += 1
                eta = 1.0 / counts[k]  # Learning rate
                centers[k] = (1 - eta) * centers[k] + eta * X_batch[i]

            # Handle empty clusters
            empty_clusters = np.where(counts == 0)[0]
            if len(empty_clusters) > 0:
                # Reassign from random batch points
                for k in empty_clusters:
                    random_point = X[np.random.randint(n_samples)]
                    centers[k] = random_point

        # Compute final inertia on full dataset (or sample for efficiency)
        sample_size = min(n_samples, 10000)
        sample_idx = np.random.choice(n_samples, sample_size, replace=False)
        distances = cdist(X[sample_idx], centers)
        inertia = np.sum(np.min(distances, axis=1) ** 2) * (n_samples / sample_size)

        # Final labels
        labels = np.argmin(cdist(X, centers), axis=1)

        return centers, labels, inertia

    def fit(self, X):
        """Fit mini-batch K-means."""
        if self.random_state is not None:
            np.random.seed(self.random_state)

        best_inertia = np.inf
        for init_run in range(self.n_init):
            centers, labels, inertia = self._single_run(X)
            if inertia < best_inertia:
                best_inertia = inertia
                self.cluster_centers_ = centers
                self.labels_ = labels
                self.inertia_ = inertia

        return self

    def predict(self, X):
        """Predict cluster labels."""
        return np.argmin(cdist(X, self.cluster_centers_), axis=1)

    def partial_fit(self, X):
        """
        Incremental fit on a single batch. Useful for streaming data.
        """
        if not hasattr(self, 'cluster_centers_'):
            # Initialize on first batch
            if self.init == 'k-means++':
                self.cluster_centers_ = self._init_centers_kmeanspp(X)
            else:
                idx = np.random.choice(len(X), self.n_clusters, replace=False)
                self.cluster_centers_ = X[idx].copy()
            self._counts = np.zeros(self.n_clusters)

        # Assign and update
        distances = cdist(X, self.cluster_centers_)
        labels = np.argmin(distances, axis=1)
        for i, k in enumerate(labels):
            self._counts[k] += 1
            eta = 1.0 / self._counts[k]
            self.cluster_centers_[k] = (1 - eta) * self.cluster_centers_[k] + eta * X[i]

        return self


# Example usage and comparison with standard K-means
import time

def compare_minibatch_vs_standard():
    """Compare mini-batch and standard K-means."""
    np.random.seed(42)

    # Generate large dataset
    n_samples = 100000
    n_features = 50
    n_clusters = 10

    # Create clustered data
    centers_true = np.random.randn(n_clusters, n_features) * 10
    X = np.vstack([
        np.random.randn(n_samples // n_clusters, n_features) + center
        for center in centers_true
    ])
    np.random.shuffle(X)

    print(f"Dataset: {X.shape[0]} samples × {X.shape[1]} features")
    print(f"Target clusters: {n_clusters}")
    print()

    # Standard K-means (from sklearn for fair comparison)
    from sklearn.cluster import KMeans, MiniBatchKMeans as SklearnMBKMeans

    start = time.time()
    kmeans = KMeans(n_clusters=n_clusters, n_init=3, random_state=42)
    kmeans.fit(X)
    kmeans_time = time.time() - start

    # Mini-batch K-means (sklearn)
    start = time.time()
    mbkmeans = SklearnMBKMeans(n_clusters=n_clusters, batch_size=1000,
                               n_init=3, random_state=42)
    mbkmeans.fit(X)
    mbkmeans_time = time.time() - start

    print("Results:")
    print(f"  Standard K-means:   {kmeans_time:.2f}s, inertia={kmeans.inertia_:.2e}")
    print(f"  Mini-batch K-means: {mbkmeans_time:.2f}s, inertia={mbkmeans.inertia_:.2e}")
    print(f"  Speedup: {kmeans_time / mbkmeans_time:.1f}×")
    print(f"  Inertia ratio: {mbkmeans.inertia_ / kmeans.inertia_:.4f}")

compare_minibatch_vs_standard()

# Example output:
# Dataset: 100000 samples × 50 features
# Target clusters: 10
#
# Results:
#   Standard K-means:   4.23s, inertia=4.98e+07
#   Mini-batch K-means: 0.31s, inertia=5.02e+07
#   Speedup: 13.6×
#   Inertia ratio: 1.0080
```

Unlike standard K-means, which monotonically decreases its objective, mini-batch K-means exhibits stochastic convergence. Understanding this behavior is crucial for proper algorithm tuning.
The Convergence Property:
Mini-batch K-means converges to a local optimum of the K-means objective in the following sense:
- Expected descent: In expectation over mini-batch sampling, each iteration decreases the objective.
- Variance bound: The variance of the centroid estimates decreases as $O(1/n_k)$, where $n_k$ is the number of points seen for cluster $k$.
- Asymptotic convergence: As the number of iterations grows, the centroids converge to a neighborhood of the batch K-means solution.
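The variance bound is easy to verify empirically. The sketch below is our own toy simulation (not part of the implementations in this page): it estimates the spread of a centroid coordinate as a function of how many points the cluster has absorbed.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # within-cluster standard deviation

for n_k in [10, 100, 1000]:
    # 2000 repetitions: each centroid estimate is the mean of n_k cluster points
    centroids = rng.normal(0.0, sigma, size=(2000, n_k)).mean(axis=1)
    print(f"n_k={n_k:5d}: empirical std {centroids.std():.4f}  "
          f"theory sigma/sqrt(n_k) = {sigma / np.sqrt(n_k):.4f}")
```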
The Learning Rate Schedule:
The learning rate $\eta_k = 1/n_k$ for cluster $k$ is crucial: the first points assigned to a cluster move its centroid substantially, while later points make ever finer adjustments as the count $n_k$ grows.
This follows the Robbins-Monro conditions for stochastic approximation:
$$\sum_{t=1}^{\infty} \eta_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty$$
With $\eta_t = 1/t$ (for the $t$-th point assigned to a given cluster), the first sum diverges, so the centroid can travel arbitrarily far from its initialization, while the second converges, so the accumulated noise stays bounded.
A fixed learning rate η would cause centroids to oscillate forever, never settling. The 1/n schedule ensures that after seeing many points, new points have diminishing influence—eventually converging to the empirical mean of all assigned points.
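The difference is easy to see in one dimension. The following toy comparison (our own illustrative sketch, not from any library) estimates the mean of a noisy stream with a fixed step size versus the 1/n schedule:

```python
import numpy as np

rng = np.random.default_rng(42)
stream = rng.normal(loc=5.0, scale=2.0, size=100_000)  # noisy stream, true mean 5.0

c_fixed, c_decay = 0.0, 0.0
for n, x in enumerate(stream, start=1):
    c_fixed += 0.1 * (x - c_fixed)        # fixed eta = 0.1: estimate keeps fluctuating
    c_decay += (1.0 / n) * (x - c_decay)  # eta = 1/n: estimate settles on the sample mean

print(f"fixed-eta estimate: {c_fixed:.3f}")  # still bouncing around 5 (std roughly 0.4-0.5)
print(f"1/n-eta estimate:   {c_decay:.3f}")  # very close to 5.0
```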
Convergence Speed Factors:
| Factor | Effect on Convergence |
|---|---|
| Larger batch size | Faster convergence, less noise |
| More iterations | Better final solution |
| Better initialization | Faster and better convergence |
| Data ordering | Random sampling essential |
| Number of clusters | More clusters → more iterations needed |
Practical Stopping Criteria:
Because the objective no longer decreases monotonically, convergence is typically detected by monitoring a sampled estimate of the inertia at regular intervals and stopping once its relative change falls below a small tolerance, or when a fixed iteration budget is exhausted. The implementation below demonstrates this.
```python
import numpy as np
from scipy.spatial.distance import cdist

def minibatch_kmeans_with_monitoring(X, K, batch_size=100, max_iter=1000,
                                     eval_interval=50, random_state=42):
    """
    Mini-batch K-means with convergence monitoring.

    Records inertia at regular intervals to visualize convergence.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Initialize centers from randomly chosen data points
    centers = X[np.random.choice(n_samples, K, replace=False)].copy()
    counts = np.zeros(K)

    # Tracking
    inertia_history = []
    iterations = []

    for iteration in range(max_iter):
        # Sample mini-batch
        batch_idx = np.random.choice(n_samples, batch_size, replace=False)
        X_batch = X[batch_idx]

        # Assign and update
        distances = cdist(X_batch, centers)
        labels_batch = np.argmin(distances, axis=1)

        for i, k in enumerate(labels_batch):
            counts[k] += 1
            eta = 1.0 / counts[k]
            centers[k] = (1 - eta) * centers[k] + eta * X_batch[i]

        # Periodic evaluation
        if (iteration + 1) % eval_interval == 0:
            # Sample inertia (not full dataset for efficiency)
            sample_size = min(n_samples, 5000)
            sample_idx = np.random.choice(n_samples, sample_size, replace=False)
            sample_distances = cdist(X[sample_idx], centers)
            sample_inertia = np.sum(np.min(sample_distances, axis=1) ** 2)

            # Scale to full dataset estimate
            estimated_inertia = sample_inertia * (n_samples / sample_size)
            inertia_history.append(estimated_inertia)
            iterations.append(iteration + 1)

            # Early stopping check
            if len(inertia_history) >= 3:
                recent_change = abs(inertia_history[-1] - inertia_history[-3]) / inertia_history[-3]
                if recent_change < 0.001:
                    print(f"Converged at iteration {iteration + 1} "
                          f"(change < 0.1%)")
                    break

    # Final inertia
    final_distances = cdist(X, centers)
    final_inertia = np.sum(np.min(final_distances, axis=1) ** 2)

    return {
        'centers': centers,
        'labels': np.argmin(final_distances, axis=1),
        'inertia': final_inertia,
        'inertia_history': inertia_history,
        'iterations': iterations
    }

# Demonstrate convergence
np.random.seed(42)
X = np.vstack([
    np.random.randn(5000, 10) + np.random.randn(10) * 5
    for _ in range(5)
])

result = minibatch_kmeans_with_monitoring(X, K=5, batch_size=100, max_iter=500)

print("\nConvergence History:")
for i, (it, inertia) in enumerate(zip(result['iterations'], result['inertia_history'])):
    if i < 5 or i >= len(result['iterations']) - 3:
        print(f"  Iteration {it:4d}: inertia = {inertia:.2e}")
    elif i == 5:
        print("  ...")
```

Batch size is the most important hyperparameter in mini-batch K-means, fundamentally controlling the trade-off between computation and quality.
The Batch Size Spectrum:
| Batch Size | Behavior | Trade-off |
|---|---|---|
| b = 1 | Online K-means, maximum stochasticity | Very fast iterations, high variance |
| b = 10-100 | High variance, many iterations needed | Fast, noisy convergence |
| b = 100-1000 | Sweet spot for most applications | Good balance |
| b = 1000-10000 | Lower variance, fewer iterations | Slower iterations, stable |
| b = n | Standard K-means | Maximum quality, maximum cost |
Batch size ≈ 256 to 1024 works well for most datasets. Larger datasets benefit from larger batches (up to 10,000). The key is ensuring each cluster receives enough points in every batch: batch_size should be >> K.
Mathematical Analysis:
The variance of the centroid estimate depends on batch size:
$$\text{Var}(\hat{c}_k) \propto \frac{\sigma_k^2}{b \cdot p_k}$$
where:
- $\sigma_k^2$ is the within-cluster variance of cluster $k$
- $b$ is the batch size
- $p_k$ is the fraction of the data belonging to cluster $k$ (so $b \cdot p_k$ is the expected number of batch points from that cluster)
Key Insight: Small clusters (low $p_k$) have higher variance. If cluster $k$ contains only 1% of data and batch size is 100, we expect only 1 point from that cluster per batch on average.
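To make the imbalance issue concrete, here is a small sketch (the cluster proportions below are hypothetical) computing the expected number of mini-batch points each cluster receives:

```python
import numpy as np

b = 100                                         # batch size
p = np.array([0.40, 0.30, 0.20, 0.09, 0.01])    # hypothetical cluster proportions p_k

expected = b * p
print(expected)      # [40. 30. 20.  9.  1.]  -> the 1% cluster gets about 1 point per batch
print(1 / expected)  # relative variance scales like 1/(b * p_k): worst for the rare cluster
```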
Addressing Cluster Imbalance:
For imbalanced clusters, consider:
- increasing the batch size so that rare clusters still receive points in most batches
- running more iterations, giving small clusters more total updates
- reassigning centers that receive very few points (scikit-learn's reassignment_ratio parameter controls this behavior)
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, KMeans
import time

def analyze_batch_size_tradeoff(X, K, batch_sizes, n_trials=5):
    """
    Analyze how batch size affects quality and speed.
    """
    results = []

    # Baseline: standard K-means
    kmeans = KMeans(n_clusters=K, n_init=1, random_state=42)
    start = time.time()
    kmeans.fit(X)
    baseline_time = time.time() - start
    baseline_inertia = kmeans.inertia_

    print(f"Baseline (standard K-means): {baseline_time:.3f}s, "
          f"inertia={baseline_inertia:.4e}")
    print()

    for batch_size in batch_sizes:
        times = []
        inertias = []

        for trial in range(n_trials):
            mbk = MiniBatchKMeans(
                n_clusters=K,
                batch_size=batch_size,
                n_init=1,
                random_state=42 + trial
            )
            start = time.time()
            mbk.fit(X)
            elapsed = time.time() - start

            times.append(elapsed)
            inertias.append(mbk.inertia_)

        avg_time = np.mean(times)
        avg_inertia = np.mean(inertias)
        std_inertia = np.std(inertias)
        speedup = baseline_time / avg_time
        quality_ratio = avg_inertia / baseline_inertia

        results.append({
            'batch_size': batch_size,
            'avg_time': avg_time,
            'avg_inertia': avg_inertia,
            'std_inertia': std_inertia,
            'speedup': speedup,
            'quality_ratio': quality_ratio
        })

        print(f"Batch={batch_size:5d}: {avg_time:.3f}s ({speedup:5.1f}× speedup), "
              f"quality ratio={quality_ratio:.4f} ± {std_inertia/baseline_inertia:.4f}")

    return results

# Example analysis
np.random.seed(42)
X = np.vstack([
    np.random.randn(10000, 20) + np.random.randn(20) * 8
    for _ in range(8)
])

print(f"Analyzing batch size on {X.shape[0]} points, {X.shape[1]} dims, 8 clusters")
print("=" * 70)

batch_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096]
results = analyze_batch_size_tradeoff(X, K=8, batch_sizes=batch_sizes)

# Find sweet spot
best_efficiency = 0
best_batch = None
for r in results:
    # Efficiency = speedup / quality_loss; the max() guards against division by
    # zero or a negative denominator when mini-batch matches the baseline quality
    efficiency = r['speedup'] / max(r['quality_ratio'] - 0.99, 1e-3)
    if efficiency > best_efficiency and r['quality_ratio'] < 1.02:
        best_efficiency = efficiency
        best_batch = r['batch_size']

print(f"\nRecommended batch size: {best_batch}")
```

One of mini-batch K-means' most powerful features is its compatibility with streaming data—scenarios where data arrives continuously and cannot be stored entirely in memory.
The Streaming Paradigm:
In streaming clustering:
- data arrives continuously in batches, and each point is typically seen only once
- the full dataset never fits in memory, so only the centers and a few counters are stored
- the model must be queryable at any time, even while it is still being updated
The partial_fit Interface:
Scikit-learn's MiniBatchKMeans provides partial_fit() for incremental learning:
```python
mbk = MiniBatchKMeans(n_clusters=K)

for batch in data_stream:
    mbk.partial_fit(batch)  # Update with new data

# Can query current state anytime
current_centers = mbk.cluster_centers_
predictions = mbk.predict(new_data)
```
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

class StreamingKMeansClustering:
    """
    Streaming K-means clustering system.

    Handles data that arrives in batches, continuously updating
    cluster centers while maintaining memory efficiency.
    """

    def __init__(self, n_clusters, batch_size=100, init_samples=1000,
                 random_state=None):
        """
        Parameters:
        -----------
        n_clusters : int
            Number of clusters
        batch_size : int
            Samples per partial_fit call
        init_samples : int
            Number of samples to collect before initialization
        random_state : int
            Random seed
        """
        self.n_clusters = n_clusters
        self.batch_size = batch_size
        self.init_samples = init_samples
        self.random_state = random_state

        self.mbk = MiniBatchKMeans(
            n_clusters=n_clusters,
            batch_size=batch_size,
            random_state=random_state
        )

        self._initialized = False
        self._init_buffer = []
        self._total_samples_seen = 0
        self._cluster_counts = np.zeros(n_clusters)

    def partial_fit(self, X):
        """
        Update clustering with new batch of data.

        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            New batch of data

        Returns:
        --------
        self
        """
        if not self._initialized:
            # Buffer samples for initialization
            self._init_buffer.append(X)
            total_buffered = sum(len(b) for b in self._init_buffer)

            if total_buffered >= self.init_samples:
                # Initialize
                init_data = np.vstack(self._init_buffer)
                self.mbk.fit(init_data)
                self._initialized = True
                self._init_buffer = []  # Free memory
                self._total_samples_seen = len(init_data)

                # Track cluster counts
                labels = self.mbk.labels_
                self._cluster_counts = np.bincount(labels, minlength=self.n_clusters)
        else:
            # Incremental update
            self.mbk.partial_fit(X)
            self._total_samples_seen += len(X)

            # Update cluster counts (approximate, from this batch)
            labels = self.mbk.predict(X)
            self._cluster_counts += np.bincount(labels, minlength=self.n_clusters)

        return self

    def predict(self, X):
        """Predict cluster labels for data."""
        if not self._initialized:
            raise RuntimeError("Model not yet initialized. Feed more data.")
        return self.mbk.predict(X)

    def get_centers(self):
        """Get current cluster centers."""
        if not self._initialized:
            raise RuntimeError("Model not yet initialized.")
        return self.mbk.cluster_centers_.copy()

    def get_stats(self):
        """Get clustering statistics."""
        return {
            'initialized': self._initialized,
            'total_samples_seen': self._total_samples_seen,
            'cluster_distribution': self._cluster_counts / max(1, self._cluster_counts.sum())
        }

def simulate_data_stream(n_batches=100, samples_per_batch=100,
                         n_features=10, n_true_clusters=5):
    """
    Simulate a data stream with arriving batches.
    """
    np.random.seed(42)

    # True cluster centers (unknown to model)
    true_centers = np.random.randn(n_true_clusters, n_features) * 5

    for _ in range(n_batches):
        # Each batch has samples from random clusters
        cluster_ids = np.random.randint(0, n_true_clusters, samples_per_batch)
        batch = np.zeros((samples_per_batch, n_features))
        for i, cid in enumerate(cluster_ids):
            batch[i] = true_centers[cid] + np.random.randn(n_features) * 0.5
        yield batch

# Demonstrate streaming clustering
print("Streaming K-means Clustering Demo")
print("=" * 50)

streamer = StreamingKMeansClustering(
    n_clusters=5,
    batch_size=100,
    init_samples=500
)

for i, batch in enumerate(simulate_data_stream(n_batches=100)):
    streamer.partial_fit(batch)

    if (i + 1) % 20 == 0:
        stats = streamer.get_stats()
        print(f"Batch {i+1:3d}: "
              f"samples_seen={stats['total_samples_seen']:5d}, "
              f"dist={np.round(stats['cluster_distribution'], 2)}")

print("\nFinal cluster centers shape:", streamer.get_centers().shape)
```

In streaming scenarios, the data distribution may change over time (concept drift). Standard mini-batch K-means will slowly adapt, but old data has accumulated influence. For non-stationary data, consider exponential decay of old counts: η = 1/(λn + 1) with λ < 1, or periodic reinitialization.
Deploying mini-batch K-means effectively requires attention to several practical considerations.
```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np
import time

def production_minibatch_kmeans(X, K_range=range(2, 11),
                                batch_sizes=[256, 512, 1024],
                                n_init=5, random_state=42):
    """
    Production-ready mini-batch K-means with hyperparameter search.

    Performs:
    1. Feature standardization
    2. Batch size selection
    3. Number of clusters selection via silhouette score
    4. Final model training
    """
    np.random.seed(random_state)

    # 1. Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print(f"Data: {X.shape[0]} samples × {X.shape[1]} features")
    print(f"Testing K in {list(K_range)}")
    print()

    # 2. Quick batch size selection (on subset)
    subsample_size = min(len(X), 10000)
    X_sub = X_scaled[np.random.choice(len(X), subsample_size, replace=False)]

    print("Batch size selection (on subsample):")
    best_batch = batch_sizes[0]
    best_time_quality = 0

    for bs in batch_sizes:
        mbk = MiniBatchKMeans(n_clusters=5, batch_size=bs, n_init=1, random_state=42)
        start = time.time()
        mbk.fit(X_sub)
        elapsed = time.time() - start

        # Balance speed and quality
        score = 1.0 / (elapsed * mbk.inertia_)
        if score > best_time_quality:
            best_time_quality = score
            best_batch = bs

        print(f"  batch_size={bs}: {elapsed:.3f}s, inertia={mbk.inertia_:.2e}")

    print(f"  Selected batch_size: {best_batch}")
    print()

    # 3. Find optimal K using silhouette score
    print("Cluster selection:")
    best_K = K_range[0]
    best_silhouette = -1
    results = []

    # Use subsample for silhouette (expensive for large n)
    eval_subsample = min(len(X), 5000)
    X_eval = X_scaled[np.random.choice(len(X), eval_subsample, replace=False)]

    for K in K_range:
        mbk = MiniBatchKMeans(
            n_clusters=K,
            batch_size=best_batch,
            n_init=n_init,
            random_state=random_state
        )
        mbk.fit(X_scaled)

        # Compute silhouette on subsample
        labels_eval = mbk.predict(X_eval)

        # Skip if any cluster is empty in sample
        if len(np.unique(labels_eval)) < K:
            print(f"  K={K}: skipped (empty clusters)")
            continue

        sil = silhouette_score(X_eval, labels_eval)
        results.append({'K': K, 'inertia': mbk.inertia_, 'silhouette': sil})

        if sil > best_silhouette:
            best_silhouette = sil
            best_K = K

        print(f"  K={K}: silhouette={sil:.4f}, inertia={mbk.inertia_:.2e}")

    print(f"  Selected K: {best_K} (silhouette={best_silhouette:.4f})")
    print()

    # 4. Train final model
    print("Training final model...")
    final_model = MiniBatchKMeans(
        n_clusters=best_K,
        batch_size=best_batch,
        n_init=n_init * 2,  # More initializations for final
        random_state=random_state
    )
    final_model.fit(X_scaled)

    # Return model along with scaler for inference
    return {
        'model': final_model,
        'scaler': scaler,
        'best_K': best_K,
        'best_batch_size': best_batch,
        'cluster_centers_original': scaler.inverse_transform(final_model.cluster_centers_),
        'selection_results': results
    }

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.randn(5000, 15) * 2 + np.random.randn(15) * 10
    for _ in range(6)
])

result = production_minibatch_kmeans(X, K_range=range(3, 10))
print(f"\nFinal model: K={result['best_K']}, batch_size={result['best_batch_size']}")
```

We've comprehensively explored mini-batch K-means, the scalable variant that enables clustering of massive datasets. Let's consolidate the essential insights:
- Mini-batch K-means replaces full-dataset passes with small random batches, cutting per-iteration cost from O(nKD) to O(bKD) and per-batch memory from O(nD) to O(bD).
- The per-cluster learning rate $\eta = 1/n_k$ turns each centroid into the running mean of its assigned points and satisfies the Robbins-Monro conditions for convergence.
- Batch size controls the speed-quality trade-off; a few hundred to a few thousand samples per batch, with batch_size >> K, works well in practice.
- Final inertia is typically within 1-2% of standard K-means at a 10-100× speedup.
- The partial_fit interface makes the algorithm streaming-compatible; decayed counts or periodic reinitialization help under concept drift.
Looking Ahead:
The next page explores Bisecting K-means, a hierarchical approach that builds cluster structure through recursive binary splits. This provides an alternative perspective on partitioning that can be more efficient and offer interpretable cluster hierarchies.
You now have a thorough understanding of mini-batch K-means—from the stochastic optimization foundations through convergence analysis to production deployment strategies. You can confidently apply mini-batch K-means to large-scale clustering problems, appropriately tuning batch size and monitoring convergence.