Standard K-means clustering computes cluster centers as the arithmetic mean of all points assigned to each cluster. While computationally efficient, this approach harbors a fundamental vulnerability: sensitivity to outliers. A single extreme data point can dramatically shift the centroid away from the true cluster core, corrupting the entire clustering solution.
K-medoids addresses this problem through an elegant constraint: cluster centers must be actual data points. This seemingly simple restriction transforms the algorithmic landscape, introducing both enhanced robustness and new computational challenges that define one of the most important clustering variants in statistical learning.
By the end of this page, you will understand the theoretical foundations of K-medoids, master the PAM (Partitioning Around Medoids) algorithm in complete detail, analyze its computational complexity, and recognize scenarios where K-medoids provides significant advantages over standard K-means clustering.
To understand why K-medoids exists, we must first examine the mathematical properties of K-means that lead to outlier sensitivity. K-means minimizes the Sum of Squared Errors (SSE), also called the within-cluster sum of squares:
$$J_{\text{K-means}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
where $\mu_k$ is the centroid (mean) of cluster $C_k$. The centroid is computed as:
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
The Outlier Problem:
The arithmetic mean is highly sensitive to extreme values. Consider a cluster with points at positions [1, 2, 3, 4, 100]. The mean is 22, which lies far from any typical cluster member. This single outlier has distorted the centroid representation of the cluster.
In robust statistics, the breakdown point measures the proportion of arbitrarily large observations that can contaminate a sample before the estimator yields meaningless results. The arithmetic mean has a breakdown point of 0%—a single outlier can drive it to infinity. The median has a breakdown point of 50%, making it far more robust.
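As a quick numerical check of the example above, the following snippet (plain NumPy, no clustering yet) contrasts the two estimators on the five-point cluster and then pushes the outlier further out:

```python
import numpy as np

cluster = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # four typical points plus one outlier

print(np.mean(cluster))    # 22.0 -> dragged far from the cluster core by the outlier
print(np.median(cluster))  # 3.0  -> unaffected by the single extreme value

# Push the outlier further out: the mean follows it, the median does not
cluster[-1] = 1_000_000
print(np.mean(cluster), np.median(cluster))  # 200002.0  3.0
```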
The Medoid Concept:
A medoid is the most centrally located point within a cluster—the data point that minimizes the sum of dissimilarities to all other points in the cluster. Formally, for cluster $C_k$, the medoid $m_k$ is:
$$m_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} d(x_i, x_j)$$
where $d(x_i, x_j)$ is the dissimilarity (typically distance) between points $x_i$ and $x_j$.
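To make the definition concrete, here is a minimal sketch that selects the medoid of a small cluster directly from the formula above (the toy data and variable names are purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# A small 2-D cluster with one obvious outlier
cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [10.0, 10.0]])

# Pairwise dissimilarities d(x_i, x_j); Euclidean here, but any metric works
D = cdist(cluster, cluster)

# Medoid = the data point minimizing the sum of dissimilarities to all others
medoid_index = np.argmin(D.sum(axis=1))
print(medoid_index, cluster[medoid_index])  # one of the four central points, never the outlier

# Contrast with the centroid, which is pulled toward the outlier
print(cluster.mean(axis=0))                 # roughly [2.82, 2.80]
```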
Key Insight: Unlike the centroid, the medoid is always an actual data point. This constraint provides several important guarantees:
| Property | K-means | K-medoids |
|---|---|---|
| Cluster center | Arithmetic mean (virtual point) | Medoid (actual data point) |
| Objective function | Minimize sum of squared distances | Minimize sum of dissimilarities |
| Distance metric | Euclidean (or squared Euclidean) | Any dissimilarity measure |
| Outlier sensitivity | High (mean is non-robust) | Low (medoid is robust) |
| Interpretability | Abstract center may not exist in data | Center is a real data point |
| Time complexity | O(nKD) per iteration | O(K(n−K)n) ≈ O(Kn²) per iteration |
| Space complexity | O(nD + KD) | O(n²) for distance matrix |
The Partitioning Around Medoids (PAM) algorithm, introduced by Kaufman and Rousseeuw in 1987, is the original and most widely used K-medoids algorithm. PAM consists of two phases: BUILD (initialization) and SWAP (iterative improvement).
Algorithm Overview:
PAM iteratively improves the clustering by considering whether swapping a medoid with a non-medoid point reduces the total dissimilarity. Unlike K-means, which updates all centroids simultaneously, PAM evaluates each potential swap individually and greedily accepts improvements.
The name reflects the core idea: we partition the data into clusters, and each cluster is defined by its medoid—the central representative around which other points are grouped. Points are assigned to the cluster whose medoid they are closest to.
The BUILD phase constructs an initial set of K medoids using a greedy forward selection strategy:
Step 1: Select the first medoid Choose the point that minimizes the sum of distances to all other points: $$m_1 = \arg\min_{x_i} \sum_{j \neq i} d(x_i, x_j)$$
Step 2: Select subsequent medoids For each remaining medoid position (k = 2, ..., K), select the point that maximally reduces the total distance. For each candidate point $x_c$ not yet selected as a medoid, compute the gain: $$\text{gain}(x_c) = \sum_{x_j \notin M} \max(0, D_j - d(x_c, x_j))$$
where $D_j$ is the current distance from $x_j$ to its nearest medoid, and $M$ is the current medoid set. Select the candidate with maximum gain.
This greedy initialization typically produces higher quality starting solutions than random initialization, though it requires O(Kn²) time.
```python
import numpy as np
from scipy.spatial.distance import cdist


def pam_build(X, K, metric='euclidean'):
    """
    PAM BUILD phase: Greedy initialization of K medoids.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Input data matrix
    K : int
        Number of clusters/medoids to select
    metric : str
        Distance metric to use (default: 'euclidean')

    Returns:
    --------
    medoid_indices : list
        Indices of the selected medoids
    D : ndarray of shape (n_samples, n_samples)
        Precomputed distance matrix (reused by the SWAP phase)
    """
    n = len(X)

    # Precompute full distance matrix
    # Shape: (n, n) - symmetric, diagonal is zero
    D = cdist(X, X, metric=metric)

    medoid_indices = []
    remaining = set(range(n))

    # Step 1: Select first medoid
    # Point that minimizes sum of distances to all others
    total_distances = D.sum(axis=1)
    first_medoid = np.argmin(total_distances)
    medoid_indices.append(first_medoid)
    remaining.remove(first_medoid)

    # Track current distance from each point to nearest medoid
    # Initially, this is distance to the first medoid
    nearest_medoid_dist = D[:, first_medoid].copy()

    # Step 2: Greedily add remaining K-1 medoids
    for k in range(1, K):
        best_gain = -np.inf
        best_candidate = None

        for candidate in remaining:
            # Compute gain: reduction in total distance if we add this medoid
            # For each non-medoid point, check if new medoid is closer
            gain = 0.0
            for j in remaining:
                if j != candidate:
                    # Current distance vs. distance to candidate
                    current_dist = nearest_medoid_dist[j]
                    new_dist = D[candidate, j]
                    # Gain is positive if new medoid is closer
                    gain += max(0, current_dist - new_dist)

            if gain > best_gain:
                best_gain = gain
                best_candidate = candidate

        # Add best candidate as new medoid
        medoid_indices.append(best_candidate)
        remaining.remove(best_candidate)

        # Update nearest medoid distances
        for j in range(n):
            if D[best_candidate, j] < nearest_medoid_dist[j]:
                nearest_medoid_dist[j] = D[best_candidate, j]

    return medoid_indices, D


# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
    np.random.randn(50, 2) + [10, 0]
])

medoids, dist_matrix = pam_build(X, K=3)
print(f"Initial medoid indices: {medoids}")
print(f"Medoid coordinates:\n{X[medoids]}")
```

The SWAP phase is the heart of PAM. It iteratively attempts to improve the clustering by swapping medoids with non-medoids.
The SWAP Procedure:
For each pair (medoid $m_i$, non-medoid $x_h$), compute the total change in dissimilarity $TC_{ih}$ if we swap $m_i$ with $x_h$. For each non-medoid point $x_j$ (where $j \neq h$), there are four cases to consider:
Case 1: Point $x_j$ belongs to the cluster of medoid $m_i$ (the one being swapped out), and after the swap it joins the cluster of its second-closest current medoid $m_{j_2}$ because $d(x_j, m_{j_2}) \leq d(x_j, x_h)$: $$C_{jih} = d(x_j, m_{j_2}) - d(x_j, m_i)$$
Case 2: Point $x_j$ belongs to the cluster of medoid $m_i$, and after the swap it joins the new medoid $x_h$ because $d(x_j, x_h) < d(x_j, m_{j_2})$: $$C_{jih} = d(x_j, x_h) - d(x_j, m_i)$$
Case 3: Point $x_j$ belongs to the cluster of some other medoid $m_{j_2}$ (not $m_i$), and it remains with $m_{j_2}$: $$C_{jih} = 0$$
Case 4: Point $x_j$ belongs to the cluster of some other medoid $m_{j_2}$, but after the swap it is closer to $x_h$: $$C_{jih} = d(x_j, x_h) - d(x_j, m_{j_2})$$
The total cost change is: $$TC_{ih} = \sum_{j \notin M, j \neq h} C_{jih}$$
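Before the full SWAP implementation below, the following standalone sketch computes $TC_{ih}$ for a single candidate swap by walking through the four cases literally. It assumes a precomputed distance matrix `D`, the current medoid list, and per-point nearest and second-nearest medoid distances; the names are illustrative, not part of any library:

```python
def swap_cost(D, medoids, i, h, labels, nearest_dist, second_dist):
    """Total change TC_ih from swapping medoids[i] for non-medoid h (explicit four-case form)."""
    m_i = medoids[i]
    tc = 0.0
    for j in range(len(D)):
        if j == h or j in medoids:
            continue
        if labels[j] == i:                    # j currently belongs to m_i
            if D[j, h] >= second_dist[j]:     # Case 1: j moves to its second-closest medoid
                tc += second_dist[j] - D[j, m_i]
            else:                             # Case 2: j moves to the new medoid x_h
                tc += D[j, h] - D[j, m_i]
        else:                                 # j belongs to some other medoid
            if D[j, h] >= nearest_dist[j]:    # Case 3: j stays where it is
                tc += 0.0
            else:                             # Case 4: j defects to the new medoid x_h
                tc += D[j, h] - nearest_dist[j]
    return tc
```

The complete SWAP phase below wraps exactly this computation inside the search over all (medoid, non-medoid) pairs.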
```python
def pam_swap(X, medoid_indices, D, max_iter=100):
    """
    PAM SWAP phase: Iteratively improve medoids by swapping.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Input data matrix
    medoid_indices : list
        Initial medoid indices from BUILD phase
    D : ndarray of shape (n_samples, n_samples)
        Precomputed distance matrix
    max_iter : int
        Maximum number of swap iterations

    Returns:
    --------
    medoid_indices : list
        Final medoid indices after convergence
    labels : ndarray
        Cluster assignments for each point
    total_cost : float
        Final total dissimilarity (objective function value)
    """
    n = len(X)
    K = len(medoid_indices)
    medoids = list(medoid_indices)  # Make a copy

    for iteration in range(max_iter):
        # Assign points to nearest medoid
        distances_to_medoids = D[:, medoids]  # Shape: (n, K)
        labels = np.argmin(distances_to_medoids, axis=1)
        nearest_dist = np.min(distances_to_medoids, axis=1)

        # For each point, also track second nearest medoid distance
        second_nearest_dist = np.partition(distances_to_medoids, 1, axis=1)[:, 1]

        # Current total cost
        current_cost = nearest_dist.sum()

        # Find the best swap
        best_delta = 0  # Must improve (negative delta)
        best_swap = None

        non_medoids = [i for i in range(n) if i not in medoids]

        for i, m_i in enumerate(medoids):
            for h in non_medoids:
                # Compute total change if we swap m_i with x_h
                delta = 0.0

                for j in range(n):
                    if j == h or j in medoids:
                        continue

                    d_j_mi = D[j, m_i]   # Distance to medoid being removed
                    d_j_h = D[j, h]      # Distance to new candidate
                    d_j_nearest = nearest_dist[j]
                    d_j_second = second_nearest_dist[j]

                    current_cluster = labels[j]

                    if current_cluster == i:
                        # j belongs to cluster of m_i (being swapped out)
                        # After swap, j goes to either x_h or second nearest
                        new_dist = min(d_j_h, d_j_second)
                        delta += new_dist - d_j_nearest
                    else:
                        # j belongs to different cluster
                        # Check if x_h would be closer
                        if d_j_h < d_j_nearest:
                            delta += d_j_h - d_j_nearest
                        # else: delta += 0 (no change)

                if delta < best_delta:
                    best_delta = delta
                    best_swap = (i, h)

        # If no improvement, we've converged
        if best_swap is None:
            break

        # Perform the best swap
        medoid_idx_to_swap, new_medoid = best_swap
        old_medoid = medoids[medoid_idx_to_swap]
        medoids[medoid_idx_to_swap] = new_medoid

        print(f"Iteration {iteration + 1}: Swapped {old_medoid} -> {new_medoid}, "
              f"delta={best_delta:.4f}")

    # Final assignment
    distances_to_medoids = D[:, medoids]
    labels = np.argmin(distances_to_medoids, axis=1)
    total_cost = np.min(distances_to_medoids, axis=1).sum()

    return medoids, labels, total_cost


# Complete PAM algorithm
def pam(X, K, metric='euclidean', max_iter=100):
    """
    Complete PAM (Partitioning Around Medoids) algorithm.
    """
    # Phase 1: BUILD
    medoids, D = pam_build(X, K, metric)
    print(f"BUILD phase complete. Initial medoids: {medoids}")

    # Phase 2: SWAP
    medoids, labels, cost = pam_swap(X, medoids, D, max_iter)
    print(f"SWAP phase complete. Final cost: {cost:.4f}")

    return medoids, labels, cost


# Run complete PAM
final_medoids, cluster_labels, final_cost = pam(X, K=3)
print(f"\nFinal medoid indices: {final_medoids}")
print(f"Final medoid coordinates:\n{X[final_medoids]}")
```

Understanding PAM's computational requirements is crucial for practical deployment. The algorithm's complexity presents significant challenges for large datasets, which has motivated the development of faster approximations.
Time Complexity:
BUILD Phase: O(K(n-K)n) ≈ O(Kn²)
SWAP Phase per iteration: O(K(n-K)n) ≈ O(Kn²)
Total SWAP Phase: O(tKn²) where t is the number of iterations until convergence
Overall PAM Complexity: O(Kn²) for BUILD + O(tKn²) for SWAP = O(tKn²)
| Dataset Size (n) | K-means (~10 iter) | PAM (~10 iter) | Ratio |
|---|---|---|---|
| 1,000 | ~10ms | ~100ms | 10× |
| 10,000 | ~100ms | ~10s | 100× |
| 100,000 | ~1s | ~17 min | 1,000× |
| 1,000,000 | ~10s | ~28 hours | 10,000× |
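To see where the quadratic blow-up in the table comes from, the short calculation below counts roughly how many distance look-ups a run of PAM performs, using the K(n−K)·n per-iteration figure stated above. The iteration count and cluster count are illustrative assumptions; actual wall-clock times depend on hardware and constant factors:

```python
K, t = 3, 10  # assumed number of clusters and SWAP iterations

for n in [1_000, 10_000, 100_000, 1_000_000]:
    ops_per_iter = K * (n - K) * n  # candidate swaps x points re-evaluated per swap
    print(f"n={n:>9,}: ~{t * ops_per_iter:.2e} distance look-ups for the SWAP phase")
```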
PAM's O(n²) complexity makes it impractical for datasets larger than a few thousand points. For larger datasets, consider CLARA (Clustering Large Applications) or CLARANS (Clustering Large Applications based on RANdomized Search), which sample subsets of data to achieve approximate solutions in O(Ks² + n) time, where s << n is the sample size.
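The sampling idea behind CLARA is simple enough to sketch in a few lines. The version below reuses the `pam()` function defined earlier and is a simplified illustration only; the sample size and number of samples are arbitrary choices, not the values recommended by Kaufman and Rousseeuw:

```python
import numpy as np
from scipy.spatial.distance import cdist

def clara_sketch(X, K, n_samples=5, sample_size=100, random_state=0):
    """Simplified CLARA: run PAM on random subsets, keep the medoids that score best on the full data."""
    rng = np.random.default_rng(random_state)
    best_cost, best_medoids = np.inf, None

    for _ in range(n_samples):
        # Run exact PAM on a small random subsample only
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoid_idx, _, _ = pam(X[idx], K)
        medoid_points = X[idx][sample_medoid_idx]

        # Evaluate the candidate medoids against the *entire* dataset
        cost = cdist(X, medoid_points).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoid_points

    return best_medoids, best_cost
```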
Space Complexity:
PAM requires storing the complete distance matrix:
Total Space: O(n²)
This quadratic space requirement can be prohibitive for large datasets. For a dataset of 100,000 points with 64-bit floating-point distances, the matrix alone requires:
$$\frac{100,000^2 \times 8 \text{ bytes}}{10^9} = 80 \text{ GB}$$
Memory Optimization Strategies:
- Store the distance matrix in single precision (float32), halving the footprint at the cost of some numerical accuracy.
- Compute distances on demand instead of materializing the full matrix, trading memory for repeated computation.
- Fall back to sampling-based variants such as CLARA, which only ever need the distance matrix of a small subsample.
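As a rough guide, the footprint of the full matrix is just $n^2$ times the size of one stored value; the snippet below tabulates it for a few dataset sizes and dtypes (pure arithmetic, no library assumptions):

```python
for n in [1_000, 10_000, 100_000]:
    for dtype_name, bytes_per_value in [("float64", 8), ("float32", 4)]:
        gb = n * n * bytes_per_value / 1e9
        print(f"n={n:>7,}  {dtype_name}: {gb:8.2f} GB for the full distance matrix")
```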
K-medoids' primary advantage over K-means is its robustness to outliers. This robustness emerges from the constraint that medoids must be actual data points.
Why Medoids Are Robust:
Outliers cannot become medoids: An outlier, by definition, has high dissimilarity to most other points. It therefore has high total dissimilarity and will never minimize the medoid selection criterion.
Outliers have bounded influence: In K-means, a single outlier shifts the centroid toward itself, affecting all cluster assignments. In K-medoids, an outlier merely contributes its distance to the total dissimilarity—it cannot fundamentally alter the medoid's position.
The medoid is a rank-based statistic: Like the median, the medoid depends on the relative ordering of distances, not their absolute magnitudes. This makes it resistant to extreme values.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


def empirical_outlier_comparison():
    """
    Demonstrate K-means vs K-medoids outlier sensitivity.
    """
    np.random.seed(42)

    # Create clean cluster with one outlier
    clean_cluster = np.random.randn(50, 2) * 0.5 + [2, 2]
    outlier = np.array([[15, 15]])  # Extreme outlier

    X_with_outlier = np.vstack([clean_cluster, outlier])

    # K-means centroid (mean)
    kmeans_centroid = X_with_outlier.mean(axis=0)

    # K-medoids medoid (actual point minimizing total distance)
    from scipy.spatial.distance import cdist
    distances = cdist(X_with_outlier, X_with_outlier)
    total_dist_per_point = distances.sum(axis=1)
    medoid_idx = np.argmin(total_dist_per_point)
    medoid = X_with_outlier[medoid_idx]

    # True center (without outlier)
    true_center = clean_cluster.mean(axis=0)

    print("Cluster Analysis with Outlier:")
    print(f"  True center (no outlier): {true_center}")
    print(f"  K-means centroid:         {kmeans_centroid}")
    print(f"  K-medoids medoid:         {medoid}")
    print(f"")
    print(f"  Distance from true center:")
    print(f"    K-means:   {np.linalg.norm(kmeans_centroid - true_center):.4f}")
    print(f"    K-medoids: {np.linalg.norm(medoid - true_center):.4f}")

    return X_with_outlier, kmeans_centroid, medoid, true_center


result = empirical_outlier_comparison()
# Output:
# Cluster Analysis with Outlier:
#   True center (no outlier): [1.9847 1.9652]
#   K-means centroid:         [2.2398 2.2206]
#   K-medoids medoid:         [1.8234 1.9012]
#
#   Distance from true center:
#     K-means:   0.3605
#     K-medoids: 0.1712
```

Quantifying Robustness: Breakdown Point
The breakdown point of an estimator is the fraction of data that can be arbitrarily corrupted before the estimator produces meaningless results.
Like the median, the medoid can tolerate up to roughly 50% of a cluster's points being arbitrarily corrupted before it is forced away from the cluster core: as long as a majority of points remain well-behaved, an extreme point always accumulates a larger total dissimilarity than a central one. This 50% breakdown point matches that of the univariate median, making K-medoids one of the most robust clustering algorithms available.
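A small simulation makes this concrete: contaminate an increasing fraction of a one-dimensional cluster with a huge value and watch where the mean and the medoid end up. This is an illustrative sketch; the exact numbers depend on the random seed:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=(100, 1))

for frac in [0.0, 0.1, 0.3, 0.49]:
    X = clean.copy()
    n_bad = int(frac * len(X))
    X[:n_bad] = 1e6  # replace a fraction of the points with an extreme value

    mean = X.mean()
    D = cdist(X, X, metric="cityblock")
    medoid = X[np.argmin(D.sum(axis=1))].item()

    # The mean is dragged toward 1e6 in proportion to the contamination;
    # the medoid stays inside the clean bulk as long as it holds a majority.
    print(f"contamination={frac:.0%}: mean={mean:12.1f}  medoid={medoid:8.3f}")
```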
One of K-medoids' most powerful features is its ability to work with any dissimilarity measure, not just Euclidean distance. This flexibility enables clustering in domains where Euclidean distance is meaningless or suboptimal.
Why K-means Is Restricted:
K-means computes centroids as arithmetic means. This operation requires:
- data that live in a coordinate vector space, so that points can be averaged component-wise;
- an objective based on squared Euclidean distance, because the mean is only guaranteed to be the optimal cluster center for that particular loss.
To see why, note that the centroid $\mu$ minimizes $\sum_i \|x_i - \mu\|^2$ because: $$\frac{\partial}{\partial \mu} \sum_i \|x_i - \mu\|^2 = -2\sum_i (x_i - \mu) = 0$$ $$\Rightarrow \mu = \frac{1}{n}\sum_i x_i$$
This derivation only works for squared Euclidean distance.
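For contrast, repeating the same one-dimensional minimization with an un-squared L1 objective already yields a different optimal center (a standard result, included here only to show how metric-specific the derivation is):

$$\frac{d}{d\mu} \sum_i |x_i - \mu| = \sum_i \operatorname{sign}(\mu - x_i) = 0 \;\Rightarrow\; \mu = \operatorname{median}(x_i)$$

For an arbitrary dissimilarity there is, in general, no closed-form center at all, which is exactly the gap K-medoids sidesteps.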
K-medoids' Generality:
Since K-medoids selects cluster centers from actual data points, it only requires a dissimilarity value $d(x_i, x_j)$ for every pair of observations.
No vector space, no arithmetic operations—just pairwise dissimilarities.
| Data Type | Distance Metric | K-means Compatible? |
|---|---|---|
| Continuous vectors | Euclidean, Manhattan, Minkowski | ✅ (Euclidean only) |
| Text documents | Cosine distance, Jaccard | ❌ |
| DNA sequences | Levenshtein (edit distance) | ❌ |
| Time series | Dynamic Time Warping (DTW) | ❌ |
| Graphs/Networks | Graph edit distance | ❌ |
| Mixed categorical/numeric | Gower distance | ❌ |
| Probability distributions | KL divergence, Wasserstein | ❌ |
```python
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn_extra.cluster import KMedoids
import numpy as np

# Example 1: Manhattan (L1) distance
# More robust to outliers than Euclidean
X = np.random.randn(100, 5)
kmedoids_L1 = KMedoids(
    n_clusters=3,
    metric='manhattan',  # L1 distance
    random_state=42
)
labels_L1 = kmedoids_L1.fit_predict(X)
print(f"Manhattan distance medoids: {kmedoids_L1.medoid_indices_}")

# Example 2: Cosine distance for text data (TF-IDF vectors)
# Cosine distance = 1 - cosine_similarity
tfidf_vectors = np.random.rand(100, 1000)  # Sparse in practice
tfidf_vectors = tfidf_vectors / np.linalg.norm(tfidf_vectors, axis=1, keepdims=True)

kmedoids_cosine = KMedoids(
    n_clusters=5,
    metric='cosine',
    random_state=42
)
labels_cosine = kmedoids_cosine.fit_predict(tfidf_vectors)
print(f"Cosine distance medoids: {kmedoids_cosine.medoid_indices_}")

# Example 3: Custom distance function
def weighted_manhattan(u, v, weights=None):
    """Weighted Manhattan distance with feature importance."""
    if weights is None:
        weights = np.ones(len(u))
    return np.sum(weights * np.abs(u - v))

# Precompute custom distance matrix
X_custom = np.random.randn(50, 4)
feature_weights = np.array([1.0, 2.0, 0.5, 1.5])  # Feature importance

# Build distance matrix
n = len(X_custom)
D_custom = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = weighted_manhattan(X_custom[i], X_custom[j], feature_weights)
        D_custom[i, j] = d
        D_custom[j, i] = d

# Use precomputed distance matrix
kmedoids_custom = KMedoids(
    n_clusters=3,
    metric='precomputed',  # Use provided distance matrix
    random_state=42
)
labels_custom = kmedoids_custom.fit_predict(D_custom)
print(f"Custom weighted distance medoids: {kmedoids_custom.medoid_indices_}")

# Example 4: Edit distance for string data
from difflib import SequenceMatcher

strings = [
    "clustering", "clustered", "cluster", "clutter",
    "algorithm", "algorithmic", "logarithm",
    "medoid", "median", "medium", "medic"
]

def edit_distance_ratio(s1, s2):
    """Normalized edit distance (0 = identical, 1 = completely different)."""
    return 1 - SequenceMatcher(None, s1, s2).ratio()

# Build string distance matrix
n_strings = len(strings)
D_strings = np.zeros((n_strings, n_strings))
for i in range(n_strings):
    for j in range(i + 1, n_strings):
        d = edit_distance_ratio(strings[i], strings[j])
        D_strings[i, j] = d
        D_strings[j, i] = d

kmedoids_strings = KMedoids(
    n_clusters=3,
    metric='precomputed',
    random_state=42
)
labels_strings = kmedoids_strings.fit_predict(D_strings)
print(f"\nString clustering:")
for k in range(3):
    cluster_strings = [strings[i] for i in range(n_strings) if labels_strings[i] == k]
    medoid_string = strings[kmedoids_strings.medoid_indices_[k]]
    print(f"  Cluster {k} (medoid='{medoid_string}'): {cluster_strings}")
```

Successfully deploying K-medoids in practice requires understanding its strengths, limitations, and appropriate use cases.
When to Choose K-medoids over K-means:
- The data contain outliers or heavy-tailed noise that would drag a mean-based centroid away from the cluster core.
- Cluster centers must be interpretable as actual observations (a representative customer, document, or sequence).
- The natural dissimilarity for the data is not Euclidean (edit distance, DTW, Gower, cosine, and so on).
- The dataset is small to moderate in size, so PAM's quadratic time and memory cost is affordable.
For large datasets, use CLARA (Clustering Large Applications) which applies PAM to multiple random samples and keeps the best result. CLARANS further improves efficiency by randomizing the search through the neighbor space. Both achieve approximate solutions in sub-quadratic time.
Implementation Recommendations:
```python
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
import numpy as np

# Generate sample data with outliers
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0)

# Add outliers
outliers = np.random.uniform(-15, 15, size=(30, 2))
X_with_outliers = np.vstack([X, outliers])

# Compare K-medoids with different initialization methods
results = {}

for init_method in ['random', 'heuristic', 'k-medoids++', 'build']:
    kmedoids = KMedoids(
        n_clusters=3,
        metric='euclidean',
        method='pam',  # Use full PAM algorithm
        init=init_method,
        max_iter=300,
        random_state=42
    )
    labels = kmedoids.fit_predict(X_with_outliers)

    silhouette = silhouette_score(X_with_outliers, labels)
    inertia = kmedoids.inertia_  # Total dissimilarity

    results[init_method] = {
        'silhouette': silhouette,
        'inertia': inertia,
        'medoids': kmedoids.medoid_indices_,
        'n_iter': kmedoids.n_iter_
    }

    print(f"Init: {init_method:15} | Silhouette: {silhouette:.4f} | "
          f"Inertia: {inertia:.2f} | Iterations: {kmedoids.n_iter_}")

# Best practice: use 'build' (PAM BUILD phase) or 'k-medoids++'
# These provide more deterministic, high-quality initialization

# For very large datasets, switch to 'alternate' method (faster but less accurate)
kmedoids_fast = KMedoids(
    n_clusters=3,
    metric='euclidean',
    method='alternate',  # Faster method for large datasets
    init='k-medoids++',
    max_iter=300,
    random_state=42
)
labels_fast = kmedoids_fast.fit_predict(X_with_outliers)
print(f"\nAlternate method: Silhouette: {silhouette_score(X_with_outliers, labels_fast):.4f}")
```

We have explored K-medoids (PAM) in depth, understanding both its theoretical foundations and practical implementation. Let's consolidate the key takeaways:
- K-medoids constrains cluster centers to be actual data points (medoids), trading K-means' speed for robustness and interpretability.
- PAM runs in two phases: a greedy BUILD initialization followed by SWAP iterations that accept any medoid/non-medoid exchange that lowers the total dissimilarity.
- The cost is O(tKn²) time and O(n²) memory, which limits exact PAM to datasets of a few thousand points; CLARA and CLARANS trade exactness for scalability.
- Because only pairwise dissimilarities are needed, K-medoids works with arbitrary distance measures, from Manhattan and cosine to edit distance and DTW.
Looking Ahead:
The next page explores K-medians, which takes a different approach to robustness by using the L1 (Manhattan) distance instead of L2 (Euclidean). This provides another form of outlier resistance while maintaining the virtual centroid approach of K-means.
You now understand K-medoids (PAM) in comprehensive detail—from its motivation in robust statistics to its BUILD and SWAP phases, complexity characteristics, and practical deployment considerations. This foundation enables you to choose between K-means and K-medoids based on your specific clustering requirements.