Standard K-means clustering computes cluster centers as the arithmetic mean of all points assigned to each cluster. While computationally efficient, this approach harbors a fundamental vulnerability: sensitivity to outliers. A single extreme data point can dramatically shift the centroid away from the true cluster core, corrupting the entire clustering solution.
K-medoids addresses this problem through an elegant constraint: cluster centers must be actual data points. This seemingly simple restriction transforms the algorithmic landscape, introducing both enhanced robustness and new computational challenges that define one of the most important clustering variants in statistical learning.
By the end of this page, you will understand the theoretical foundations of K-medoids, master the PAM (Partitioning Around Medoids) algorithm in complete detail, analyze its computational complexity, and recognize scenarios where K-medoids provides significant advantages over standard K-means clustering.
To understand why K-medoids exists, we must first examine the mathematical properties of K-means that lead to outlier sensitivity. K-means minimizes the Sum of Squared Errors (SSE), also called the within-cluster sum of squares:
$$J_{\text{K-means}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
where $\mu_k$ is the centroid (mean) of cluster $C_k$. The centroid is computed as:
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
The Outlier Problem:
The arithmetic mean is highly sensitive to extreme values. Consider a cluster with points at positions [1, 2, 3, 4, 100]. The mean is 22, which lies far from any typical cluster member. This single outlier has distorted the centroid representation of the cluster.
In robust statistics, the breakdown point measures the proportion of arbitrarily large observations that can contaminate a sample before the estimator yields meaningless results. The arithmetic mean has a breakdown point of 0%—a single outlier can drive it to infinity. The median has a breakdown point of 50%, making it far more robust.
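As a quick numerical check of the example above, the following snippet (plain NumPy, no clustering yet) contrasts the two estimators on the five-point cluster and then pushes the outlier further out:

```python
import numpy as np

cluster = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # four typical points plus one outlier

print(np.mean(cluster))    # 22.0 -> dragged far from the cluster core by the outlier
print(np.median(cluster))  # 3.0  -> unaffected by the single extreme value

# Push the outlier further out: the mean follows it, the median does not
cluster[-1] = 1_000_000
print(np.mean(cluster), np.median(cluster))  # 200002.0  3.0
```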
The Medoid Concept:
A medoid is the most centrally located point within a cluster—the data point that minimizes the sum of dissimilarities to all other points in the cluster. Formally, for cluster $C_k$, the medoid $m_k$ is:
$$m_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} d(x_i, x_j)$$
where $d(x_i, x_j)$ is the dissimilarity (typically distance) between points $x_i$ and $x_j$.
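To make the definition concrete, here is a minimal sketch that selects the medoid of a small cluster directly from the formula above (the toy data and variable names are purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# A small 2-D cluster with one obvious outlier
cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [10.0, 10.0]])

# Pairwise dissimilarities d(x_i, x_j); Euclidean here, but any metric works
D = cdist(cluster, cluster)

# Medoid = the data point minimizing the sum of dissimilarities to all others
medoid_index = np.argmin(D.sum(axis=1))
print(medoid_index, cluster[medoid_index])  # one of the four central points, never the outlier

# Contrast with the centroid, which is pulled toward the outlier
print(cluster.mean(axis=0))                 # roughly [2.82, 2.80]
```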
Key Insight: Unlike the centroid, the medoid is always an actual data point. This constraint provides several important guarantees:
| Property | K-means | K-medoids |
|---|---|---|
| Cluster center | Arithmetic mean (virtual point) | Medoid (actual data point) |
| Objective function | Minimize sum of squared distances | Minimize sum of dissimilarities |
| Distance metric | Euclidean (or squared Euclidean) | Any dissimilarity measure |
| Outlier sensitivity | High (mean is non-robust) | Low (medoid is robust) |
| Interpretability | Abstract center may not exist in data | Center is a real data point |
| Time complexity | O(nKD) per iteration | O(K(n−K)n) ≈ O(Kn²) per iteration |
| Space complexity | O(nD + KD) | O(n²) for distance matrix |
The Partitioning Around Medoids (PAM) algorithm, introduced by Kaufman and Rousseeuw in 1987, is the original and most widely used K-medoids algorithm. PAM consists of two phases: BUILD (initialization) and SWAP (iterative improvement).
Algorithm Overview:
PAM iteratively improves the clustering by considering whether swapping a medoid with a non-medoid point reduces the total dissimilarity. Unlike K-means, which updates all centroids simultaneously, PAM evaluates each potential swap individually and greedily accepts improvements.
The name reflects the core idea: we partition the data into clusters, and each cluster is defined by its medoid—the central representative around which other points are grouped. Points are assigned to the cluster whose medoid they are closest to.
The BUILD phase constructs an initial set of K medoids using a greedy forward selection strategy:
Step 1: Select the first medoid Choose the point that minimizes the sum of distances to all other points: $$m_1 = \arg\min_{x_i} \sum_{j \neq i} d(x_i, x_j)$$
Step 2: Select subsequent medoids For each remaining medoid position (k = 2, ..., K), select the point that maximally reduces the total distance. For each candidate point $x_c$ not yet selected as a medoid, compute the gain: $$\text{gain}(x_c) = \sum_{x_j \notin M} \max(0, D_j - d(x_c, x_j))$$
where $D_j$ is the current distance from $x_j$ to its nearest medoid, and $M$ is the current medoid set. Select the candidate with maximum gain.
This greedy initialization typically produces higher quality starting solutions than random initialization, though it requires O(Kn²) time.
```python
import numpy as np
from scipy.spatial.distance import cdist


def pam_build(X, K, metric='euclidean'):
    """
    PAM BUILD phase: Greedy initialization of K medoids.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Input data matrix
    K : int
        Number of clusters/medoids to select
    metric : str
        Distance metric to use (default: 'euclidean')

    Returns:
    --------
    medoid_indices : list
        Indices of the selected medoids
    D : ndarray of shape (n_samples, n_samples)
        Precomputed distance matrix (reused by the SWAP phase)
    """
    n = len(X)

    # Precompute full distance matrix
    # Shape: (n, n) - symmetric, diagonal is zero
    D = cdist(X, X, metric=metric)

    medoid_indices = []
    remaining = set(range(n))

    # Step 1: Select first medoid
    # Point that minimizes sum of distances to all others
    total_distances = D.sum(axis=1)
    first_medoid = np.argmin(total_distances)
    medoid_indices.append(first_medoid)
    remaining.remove(first_medoid)

    # Track current distance from each point to nearest medoid
    # Initially, this is distance to the first medoid
    nearest_medoid_dist = D[:, first_medoid].copy()

    # Step 2: Greedily add remaining K-1 medoids
    for k in range(1, K):
        best_gain = -np.inf
        best_candidate = None

        for candidate in remaining:
            # Compute gain: reduction in total distance if we add this medoid
            # For each non-medoid point, check if new medoid is closer
            gain = 0.0
            for j in remaining:
                if j != candidate:
                    # Current distance vs. distance to candidate
                    current_dist = nearest_medoid_dist[j]
                    new_dist = D[candidate, j]
                    # Gain is positive if new medoid is closer
                    gain += max(0, current_dist - new_dist)

            if gain > best_gain:
                best_gain = gain
                best_candidate = candidate

        # Add best candidate as new medoid
        medoid_indices.append(best_candidate)
        remaining.remove(best_candidate)

        # Update nearest medoid distances
        for j in range(n):
            if D[best_candidate, j] < nearest_medoid_dist[j]:
                nearest_medoid_dist[j] = D[best_candidate, j]

    return medoid_indices, D


# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
    np.random.randn(50, 2) + [10, 0]
])

medoids, dist_matrix = pam_build(X, K=3)
print(f"Initial medoid indices: {medoids}")
print(f"Medoid coordinates:\n{X[medoids]}")
```

The SWAP phase is the heart of PAM. It iteratively attempts to improve the clustering by swapping medoids with non-medoids.
The SWAP Procedure:
For each pair (medoid $m_i$, non-medoid $x_h$), compute the total change in dissimilarity $TC_{ih}$ if we swap $m_i$ with $x_h$. For each non-medoid point $x_j$ (where $j \neq h$), there are four cases to consider:
Case 1: Point $x_j$ belongs to the cluster of medoid $m_i$ (the one being swapped out), and after the swap it joins the cluster of its second-closest current medoid $m_{j_2}$ because $d(x_j, m_{j_2}) \leq d(x_j, x_h)$: $$C_{jih} = d(x_j, m_{j_2}) - d(x_j, m_i)$$
Case 2: Point $x_j$ belongs to the cluster of medoid $m_i$, and after the swap it joins the new medoid $x_h$ because $d(x_j, x_h) < d(x_j, m_{j_2})$: $$C_{jih} = d(x_j, x_h) - d(x_j, m_i)$$
Case 3: Point $x_j$ belongs to the cluster of some other medoid $m_{j_2}$ (not $m_i$), and it remains with $m_{j_2}$: $$C_{jih} = 0$$
Case 4: Point $x_j$ belongs to the cluster of some other medoid $m_{j_2}$, but after the swap it is closer to $x_h$: $$C_{jih} = d(x_j, x_h) - d(x_j, m_{j_2})$$
The total cost change is: $$TC_{ih} = \sum_{j \notin M, j \neq h} C_{jih}$$
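Before the full SWAP implementation below, the following standalone sketch computes $TC_{ih}$ for a single candidate swap by walking through the four cases literally. It assumes a precomputed distance matrix `D`, the current medoid list, and per-point nearest and second-nearest medoid distances; the names are illustrative, not part of any library:

```python
def swap_cost(D, medoids, i, h, labels, nearest_dist, second_dist):
    """Total change TC_ih from swapping medoids[i] for non-medoid h (explicit four-case form)."""
    m_i = medoids[i]
    tc = 0.0
    for j in range(len(D)):
        if j == h or j in medoids:
            continue
        if labels[j] == i:                    # j currently belongs to m_i
            if D[j, h] >= second_dist[j]:     # Case 1: j moves to its second-closest medoid
                tc += second_dist[j] - D[j, m_i]
            else:                             # Case 2: j moves to the new medoid x_h
                tc += D[j, h] - D[j, m_i]
        else:                                 # j belongs to some other medoid
            if D[j, h] >= nearest_dist[j]:    # Case 3: j stays where it is
                tc += 0.0
            else:                             # Case 4: j defects to the new medoid x_h
                tc += D[j, h] - nearest_dist[j]
    return tc
```

The complete SWAP phase below wraps exactly this computation inside the search over all (medoid, non-medoid) pairs.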
```python
def pam_swap(X, medoid_indices, D, max_iter=100):
    """
    PAM SWAP phase: Iteratively improve medoids by swapping.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Input data matrix
    medoid_indices : list
        Initial medoid indices from BUILD phase
    D : ndarray of shape (n_samples, n_samples)
        Precomputed distance matrix
    max_iter : int
        Maximum number of swap iterations

    Returns:
    --------
    medoid_indices : list
        Final medoid indices after convergence
    labels : ndarray
        Cluster assignments for each point
    total_cost : float
        Final total dissimilarity (objective function value)
    """
    n = len(X)
    K = len(medoid_indices)
    medoids = list(medoid_indices)  # Make a copy

    for iteration in range(max_iter):
        # Assign points to nearest medoid
        distances_to_medoids = D[:, medoids]  # Shape: (n, K)
        labels = np.argmin(distances_to_medoids, axis=1)
        nearest_dist = np.min(distances_to_medoids, axis=1)

        # For each point, also track second nearest medoid distance
        second_nearest_dist = np.partition(distances_to_medoids, 1, axis=1)[:, 1]

        # Current total cost
        current_cost = nearest_dist.sum()

        # Find the best swap
        best_delta = 0  # Must improve (negative delta)
        best_swap = None

        non_medoids = [i for i in range(n) if i not in medoids]

        for i, m_i in enumerate(medoids):
            for h in non_medoids:
                # Compute total change if we swap m_i with x_h
                delta = 0.0

                for j in range(n):
                    if j == h or j in medoids:
                        continue

                    d_j_mi = D[j, m_i]   # Distance to medoid being removed
                    d_j_h = D[j, h]      # Distance to new candidate
                    d_j_nearest = nearest_dist[j]
                    d_j_second = second_nearest_dist[j]

                    current_cluster = labels[j]

                    if current_cluster == i:
                        # j belongs to cluster of m_i (being swapped out)
                        # After swap, j goes to either x_h or second nearest
                        new_dist = min(d_j_h, d_j_second)
                        delta += new_dist - d_j_nearest
                    else:
                        # j belongs to different cluster
                        # Check if x_h would be closer
                        if d_j_h < d_j_nearest:
                            delta += d_j_h - d_j_nearest
                        # else: delta += 0 (no change)

                if delta < best_delta:
                    best_delta = delta
                    best_swap = (i, h)

        # If no improvement, we've converged
        if best_swap is None:
            break

        # Perform the best swap
        medoid_idx_to_swap, new_medoid = best_swap
        old_medoid = medoids[medoid_idx_to_swap]
        medoids[medoid_idx_to_swap] = new_medoid

        print(f"Iteration {iteration + 1}: Swapped {old_medoid} -> {new_medoid}, "
              f"delta={best_delta:.4f}")

    # Final assignment
    distances_to_medoids = D[:, medoids]
    labels = np.argmin(distances_to_medoids, axis=1)
    total_cost = np.min(distances_to_medoids, axis=1).sum()

    return medoids, labels, total_cost


# Complete PAM algorithm
def pam(X, K, metric='euclidean', max_iter=100):
    """
    Complete PAM (Partitioning Around Medoids) algorithm.
    """
    # Phase 1: BUILD
    medoids, D = pam_build(X, K, metric)
    print(f"BUILD phase complete. Initial medoids: {medoids}")

    # Phase 2: SWAP
    medoids, labels, cost = pam_swap(X, medoids, D, max_iter)
    print(f"SWAP phase complete. Final cost: {cost:.4f}")

    return medoids, labels, cost


# Run complete PAM
final_medoids, cluster_labels, final_cost = pam(X, K=3)
print(f"\nFinal medoid indices: {final_medoids}")
print(f"Final medoid coordinates:\n{X[final_medoids]}")
```

Understanding PAM's computational requirements is crucial for practical deployment. The algorithm's complexity presents significant challenges for large datasets, which has motivated the development of faster approximations.
Time Complexity:
BUILD Phase: O(K(n-K)n) ≈ O(Kn²)
SWAP Phase per iteration: O(K(n-K)n) ≈ O(Kn²)
Total SWAP Phase: O(tKn²) where t is the number of iterations until convergence
Overall PAM Complexity: O(Kn²) for BUILD + O(tKn²) for SWAP = O(tKn²)
| Dataset Size (n) | K-means (~10 iter) | PAM (~10 iter) | Ratio |
|---|---|---|---|
| 1,000 | ~10ms | ~100ms | 10× |
| 10,000 | ~100ms | ~10s | 100× |
| 100,000 | ~1s | ~17 min | 1,000× |
| 1,000,000 | ~10s | ~28 hours | 10,000× |
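To see where the quadratic blow-up in the table comes from, the short calculation below counts roughly how many distance look-ups a run of PAM performs, using the K(n−K)·n per-iteration figure stated above. The iteration count and cluster count are illustrative assumptions; actual wall-clock times depend on hardware and constant factors:

```python
K, t = 3, 10  # assumed number of clusters and SWAP iterations

for n in [1_000, 10_000, 100_000, 1_000_000]:
    ops_per_iter = K * (n - K) * n  # candidate swaps x points re-evaluated per swap
    print(f"n={n:>9,}: ~{t * ops_per_iter:.2e} distance look-ups for the SWAP phase")
```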
PAM's O(n²) complexity makes it impractical for datasets larger than a few thousand points. For larger datasets, consider CLARA (Clustering Large Applications) or CLARANS (Clustering Large Applications based on RANdomized Search), which sample subsets of data to achieve approximate solutions in O(Ks² + n) time, where s << n is the sample size.
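The sampling idea behind CLARA is simple enough to sketch in a few lines. The version below reuses the `pam()` function defined earlier and is a simplified illustration only; the sample size and number of samples are arbitrary choices, not the values recommended by Kaufman and Rousseeuw:

```python
import numpy as np
from scipy.spatial.distance import cdist

def clara_sketch(X, K, n_samples=5, sample_size=100, random_state=0):
    """Simplified CLARA: run PAM on random subsets, keep the medoids that score best on the full data."""
    rng = np.random.default_rng(random_state)
    best_cost, best_medoids = np.inf, None

    for _ in range(n_samples):
        # Run exact PAM on a small random subsample only
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoid_idx, _, _ = pam(X[idx], K)
        medoid_points = X[idx][sample_medoid_idx]

        # Evaluate the candidate medoids against the *entire* dataset
        cost = cdist(X, medoid_points).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoid_points

    return best_medoids, best_cost
```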
Space Complexity:
PAM requires storing the complete distance matrix:
Total Space: O(n²)
This quadratic space requirement can be prohibitive for large datasets. For a dataset of 100,000 points with 64-bit floating-point distances, the matrix alone requires:
$$\frac{100,000^2 \times 8 \text{ bytes}}{10^9} = 80 \text{ GB}$$
Memory Optimization Strategies:
- Store the distance matrix in single precision (float32), halving the footprint at the cost of some numerical accuracy.
- Compute distances on demand instead of materializing the full matrix, trading memory for repeated computation.
- Fall back to sampling-based variants such as CLARA, which only ever need the distance matrix of a small subsample.
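As a rough guide, the footprint of the full matrix is just $n^2$ times the size of one stored value; the snippet below tabulates it for a few dataset sizes and dtypes (pure arithmetic, no library assumptions):

```python
for n in [1_000, 10_000, 100_000]:
    for dtype_name, bytes_per_value in [("float64", 8), ("float32", 4)]:
        gb = n * n * bytes_per_value / 1e9
        print(f"n={n:>7,}  {dtype_name}: {gb:8.2f} GB for the full distance matrix")
```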
K-medoids' primary advantage over K-means is its robustness to outliers. This robustness emerges from the constraint that medoids must be actual data points.
Why Medoids Are Robust:
Outliers cannot become medoids: An outlier, by definition, has high dissimilarity to most other points. It therefore has high total dissimilarity and will never minimize the medoid selection criterion.
Outliers have bounded influence: In K-means, a single outlier shifts the centroid toward itself, affecting all cluster assignments. In K-medoids, an outlier merely contributes its distance to the total dissimilarity—it cannot fundamentally alter the medoid's position.
The medoid is a rank-based statistic: Like the median, the medoid depends on the relative ordering of distances, not their absolute magnitudes. This makes it resistant to extreme values.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


def empirical_outlier_comparison():
    """
    Demonstrate K-means vs K-medoids outlier sensitivity.
    """
    np.random.seed(42)

    # Create clean cluster with one outlier
    clean_cluster = np.random.randn(50, 2) * 0.5 + [2, 2]
    outlier = np.array([[15, 15]])  # Extreme outlier

    X_with_outlier = np.vstack([clean_cluster, outlier])

    # K-means centroid (mean)
    kmeans_centroid = X_with_outlier.mean(axis=0)

    # K-medoids medoid (actual point minimizing total distance)
    from scipy.spatial.distance import cdist
    distances = cdist(X_with_outlier, X_with_outlier)
    total_dist_per_point = distances.sum(axis=1)
    medoid_idx = np.argmin(total_dist_per_point)
    medoid = X_with_outlier[medoid_idx]

    # True center (without outlier)
    true_center = clean_cluster.mean(axis=0)

    print("Cluster Analysis with Outlier:")
    print(f"  True center (no outlier): {true_center}")
    print(f"  K-means centroid:         {kmeans_centroid}")
    print(f"  K-medoids medoid:         {medoid}")
    print(f"")
    print(f"  Distance from true center:")
    print(f"    K-means:   {np.linalg.norm(kmeans_centroid - true_center):.4f}")
    print(f"    K-medoids: {np.linalg.norm(medoid - true_center):.4f}")

    return X_with_outlier, kmeans_centroid, medoid, true_center


result = empirical_outlier_comparison()
# Output:
# Cluster Analysis with Outlier:
#   True center (no outlier): [1.9847 1.9652]
#   K-means centroid:         [2.2398 2.2206]
#   K-medoids medoid:         [1.8234 1.9012]
#
#   Distance from true center:
#     K-means:   0.3605
#     K-medoids: 0.1712
```

Quantifying Robustness: Breakdown Point
The breakdown point of an estimator is the fraction of data that can be arbitrarily corrupted before the estimator produces meaningless results.
Like the median, the medoid can tolerate up to roughly 50% of a cluster's points being arbitrarily corrupted before it is forced away from the cluster core: as long as a majority of points remain well-behaved, an extreme point always accumulates a larger total dissimilarity than a central one. This 50% breakdown point matches that of the univariate median, making K-medoids one of the most robust clustering algorithms available.
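A small simulation makes this concrete: contaminate an increasing fraction of a one-dimensional cluster with a huge value and watch where the mean and the medoid end up. This is an illustrative sketch; the exact numbers depend on the random seed:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=(100, 1))

for frac in [0.0, 0.1, 0.3, 0.49]:
    X = clean.copy()
    n_bad = int(frac * len(X))
    X[:n_bad] = 1e6  # replace a fraction of the points with an extreme value

    mean = X.mean()
    D = cdist(X, X, metric="cityblock")
    medoid = X[np.argmin(D.sum(axis=1))].item()

    # The mean is dragged toward 1e6 in proportion to the contamination;
    # the medoid stays inside the clean bulk as long as it holds a majority.
    print(f"contamination={frac:.0%}: mean={mean:12.1f}  medoid={medoid:8.3f}")
```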
One of K-medoids' most powerful features is its ability to work with any dissimilarity measure, not just Euclidean distance. This flexibility enables clustering in domains where Euclidean distance is meaningless or suboptimal.
Why K-means Is Restricted:
K-means computes centroids as arithmetic means. This operation requires:
- data that live in a coordinate vector space, so that points can be averaged component-wise;
- an objective based on squared Euclidean distance, because the mean is only guaranteed to be the optimal cluster center for that particular loss.
To see why, note that the centroid $\mu$ minimizes $\sum_i \|x_i - \mu\|^2$ because: $$\frac{\partial}{\partial \mu} \sum_i \|x_i - \mu\|^2 = -2\sum_i (x_i - \mu) = 0$$ $$\Rightarrow \mu = \frac{1}{n}\sum_i x_i$$
This derivation only works for squared Euclidean distance.
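For contrast, repeating the same one-dimensional minimization with an un-squared L1 objective already yields a different optimal center (a standard result, included here only to show how metric-specific the derivation is):

$$\frac{d}{d\mu} \sum_i |x_i - \mu| = \sum_i \operatorname{sign}(\mu - x_i) = 0 \;\Rightarrow\; \mu = \operatorname{median}(x_i)$$

For an arbitrary dissimilarity there is, in general, no closed-form center at all, which is exactly the gap K-medoids sidesteps.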
K-medoids' Generality:
Since K-medoids selects cluster centers from actual data points, it only requires a dissimilarity value $d(x_i, x_j)$ for every pair of observations.
No vector space, no arithmetic operations—just pairwise dissimilarities.
| Data Type | Distance Metric | K-means Compatible? |
|---|---|---|
| Continuous vectors | Euclidean, Manhattan, Minkowski | ✅ (Euclidean only) |
| Text documents | Cosine distance, Jaccard | ❌ |
| DNA sequences | Levenshtein (edit distance) | ❌ |
| Time series | Dynamic Time Warping (DTW) | ❌ |
| Graphs/Networks | Graph edit distance | ❌ |
| Mixed categorical/numeric | Gower distance | ❌ |
| Probability distributions | KL divergence, Wasserstein | ❌ |
```python
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn_extra.cluster import KMedoids
import numpy as np

# Example 1: Manhattan (L1) distance
# More robust to outliers than Euclidean
X = np.random.randn(100, 5)
kmedoids_L1 = KMedoids(
    n_clusters=3,
    metric='manhattan',  # L1 distance
    random_state=42
)
labels_L1 = kmedoids_L1.fit_predict(X)
print(f"Manhattan distance medoids: {kmedoids_L1.medoid_indices_}")

# Example 2: Cosine distance for text data (TF-IDF vectors)
# Cosine distance = 1 - cosine_similarity
tfidf_vectors = np.random.rand(100, 1000)  # Sparse in practice
tfidf_vectors = tfidf_vectors / np.linalg.norm(tfidf_vectors, axis=1, keepdims=True)

kmedoids_cosine = KMedoids(
    n_clusters=5,
    metric='cosine',
    random_state=42
)
labels_cosine = kmedoids_cosine.fit_predict(tfidf_vectors)
print(f"Cosine distance medoids: {kmedoids_cosine.medoid_indices_}")

# Example 3: Custom distance function
def weighted_manhattan(u, v, weights=None):
    """Weighted Manhattan distance with feature importance."""
    if weights is None:
        weights = np.ones(len(u))
    return np.sum(weights * np.abs(u - v))

# Precompute custom distance matrix
X_custom = np.random.randn(50, 4)
feature_weights = np.array([1.0, 2.0, 0.5, 1.5])  # Feature importance

# Build distance matrix
n = len(X_custom)
D_custom = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = weighted_manhattan(X_custom[i], X_custom[j], feature_weights)
        D_custom[i, j] = d
        D_custom[j, i] = d

# Use precomputed distance matrix
kmedoids_custom = KMedoids(
    n_clusters=3,
    metric='precomputed',  # Use provided distance matrix
    random_state=42
)
labels_custom = kmedoids_custom.fit_predict(D_custom)
print(f"Custom weighted distance medoids: {kmedoids_custom.medoid_indices_}")

# Example 4: Edit distance for string data
from difflib import SequenceMatcher

strings = [
    "clustering", "clustered", "cluster", "clutter",
    "algorithm", "algorithmic", "logarithm",
    "medoid", "median", "medium", "medic"
]

def edit_distance_ratio(s1, s2):
    """Normalized edit distance (0 = identical, 1 = completely different)."""
    return 1 - SequenceMatcher(None, s1, s2).ratio()

# Build string distance matrix
n_strings = len(strings)
D_strings = np.zeros((n_strings, n_strings))
for i in range(n_strings):
    for j in range(i + 1, n_strings):
        d = edit_distance_ratio(strings[i], strings[j])
        D_strings[i, j] = d
        D_strings[j, i] = d

kmedoids_strings = KMedoids(
    n_clusters=3,
    metric='precomputed',
    random_state=42
)
labels_strings = kmedoids_strings.fit_predict(D_strings)
print(f"\nString clustering:")
for k in range(3):
    cluster_strings = [strings[i] for i in range(n_strings) if labels_strings[i] == k]
    medoid_string = strings[kmedoids_strings.medoid_indices_[k]]
    print(f"  Cluster {k} (medoid='{medoid_string}'): {cluster_strings}")
```

Successfully deploying K-medoids in practice requires understanding its strengths, limitations, and appropriate use cases.
When to Choose K-medoids over K-means:
- The data contain outliers or heavy-tailed noise that would drag a mean-based centroid away from the cluster core.
- Cluster centers must be interpretable as actual observations (a representative customer, document, or sequence).
- The natural dissimilarity for the data is not Euclidean (edit distance, DTW, Gower, cosine, and so on).
- The dataset is small to moderate in size, so PAM's quadratic time and memory cost is affordable.
For large datasets, use CLARA (Clustering Large Applications) which applies PAM to multiple random samples and keeps the best result. CLARANS further improves efficiency by randomizing the search through the neighbor space. Both achieve approximate solutions in sub-quadratic time.
Implementation Recommendations:
```python
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
import numpy as np

# Generate sample data with outliers
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0)

# Add outliers
outliers = np.random.uniform(-15, 15, size=(30, 2))
X_with_outliers = np.vstack([X, outliers])

# Compare K-medoids with different initialization methods
results = {}

for init_method in ['random', 'heuristic', 'k-medoids++', 'build']:
    kmedoids = KMedoids(
        n_clusters=3,
        metric='euclidean',
        method='pam',  # Use full PAM algorithm
        init=init_method,
        max_iter=300,
        random_state=42
    )
    labels = kmedoids.fit_predict(X_with_outliers)

    silhouette = silhouette_score(X_with_outliers, labels)
    inertia = kmedoids.inertia_  # Total dissimilarity

    results[init_method] = {
        'silhouette': silhouette,
        'inertia': inertia,
        'medoids': kmedoids.medoid_indices_,
        'n_iter': kmedoids.n_iter_
    }

    print(f"Init: {init_method:15} | Silhouette: {silhouette:.4f} | "
          f"Inertia: {inertia:.2f} | Iterations: {kmedoids.n_iter_}")

# Best practice: use 'build' (PAM BUILD phase) or 'k-medoids++'
# These provide more deterministic, high-quality initialization

# For very large datasets, switch to 'alternate' method (faster but less accurate)
kmedoids_fast = KMedoids(
    n_clusters=3,
    metric='euclidean',
    method='alternate',  # Faster method for large datasets
    init='k-medoids++',
    max_iter=300,
    random_state=42
)
labels_fast = kmedoids_fast.fit_predict(X_with_outliers)
print(f"\nAlternate method: Silhouette: {silhouette_score(X_with_outliers, labels_fast):.4f}")
```

We have explored K-medoids (PAM) in depth, understanding both its theoretical foundations and practical implementation. Let's consolidate the key takeaways:
- K-medoids constrains cluster centers to be actual data points (medoids), trading K-means' speed for robustness and interpretability.
- PAM runs in two phases: a greedy BUILD initialization followed by SWAP iterations that accept any medoid/non-medoid exchange that lowers the total dissimilarity.
- The cost is O(tKn²) time and O(n²) memory, which limits exact PAM to datasets of a few thousand points; CLARA and CLARANS trade exactness for scalability.
- Because only pairwise dissimilarities are needed, K-medoids works with arbitrary distance measures, from Manhattan and cosine to edit distance and DTW.
Looking Ahead:
The next page explores K-medians, which takes a different approach to robustness by using the L1 (Manhattan) distance instead of L2 (Euclidean). This provides another form of outlier resistance while maintaining the virtual centroid approach of K-means.
You now understand K-medoids (PAM) in comprehensive detail—from its motivation in robust statistics to its BUILD and SWAP phases, complexity characteristics, and practical deployment considerations. This foundation enables you to choose between K-means and K-medoids based on your specific clustering requirements.