Clustering algorithms produce partitions of data, but unlike supervised learning, there is no ground truth to compare against. This creates a fundamental challenge: how do we know if the clustering is any good?
Internal validation metrics address this by measuring cluster quality using only intrinsic properties of the data and the resulting cluster assignments. They quantify intuitive notions like 'points within a cluster should be similar' and 'points in different clusters should be dissimilar.'
This page provides a deep, rigorous exploration of two of the most widely-used internal metrics: the Silhouette Coefficient and the Calinski-Harabasz Index. We will derive their mathematical foundations, understand their geometric interpretations, analyze their strengths and weaknesses, and learn when to use each.
By the end of this page, you will understand: (1) the mathematical derivation and geometric intuition behind Silhouette scores, (2) the variance-ratio formulation of the Calinski-Harabasz Index, (3) how to interpret and compare these metrics, (4) computational considerations for large datasets, and (5) the inherent limitations and biases of internal validation metrics.
Before diving into specific metrics, it's crucial to understand the distinction between internal and external validation paradigms:
Internal Validation evaluates clustering quality using only the data itself and the resulting cluster assignments, through measures of compactness (how similar points are within a cluster) and separation (how dissimilar different clusters are).
External Validation evaluates clustering quality by comparing the discovered partition against known ground-truth labels or a reference partition, measuring agreement with that known structure.
Internal metrics are essential because ground truth is often unavailable in real clustering applications—if we knew the true labels, we likely wouldn't need clustering in the first place.
| Aspect | Internal Validation | External Validation |
|---|---|---|
| Ground truth required | No | Yes |
| Primary use case | Model selection, hyperparameter tuning | Algorithm benchmarking, method comparison |
| What it measures | Compactness and separation | Agreement with known structure |
| Examples | Silhouette, CH, Davies-Bouldin | ARI, NMI, Fowlkes-Mallows |
| Bias toward | Spherical, convex clusters | Structure matching ground truth |
Internal metrics inherently encode assumptions about cluster geometry—typically favoring compact, spherical, well-separated clusters. Consequently, they may give poor scores to valid non-convex or density-based clusterings. Always interpret internal metrics within the context of what structure you expect in your data.
The Silhouette Coefficient, introduced by Peter Rousseeuw in 1987, provides a per-sample measure of how appropriately each data point is assigned to its cluster. It elegantly balances two competing objectives: cohesion (a point should be close to the other members of its own cluster) and separation (a point should be far from the members of other clusters).
The genius of the Silhouette formulation is that it combines these into a single normalized score that is intuitive to interpret.
For a data point xᵢ assigned to cluster Cₖ, we define:
a(i) = average distance from xᵢ to all other points in the same cluster Cₖ
$$a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k, j \neq i} d(x_i, x_j)$$
b(i) = minimum average distance from xᵢ to points in any other cluster
$$b(i) = \min_{l \neq k} \frac{1}{|C_l|} \sum_{j \in C_l} d(x_i, x_j)$$
The cluster achieving this minimum is called the nearest neighboring cluster of xᵢ—the cluster that xᵢ would be assigned to if removed from its current cluster.
The silhouette score for point xᵢ is:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
This can be equivalently written as:
$$s(i) = \begin{cases} 1 - \frac{a(i)}{b(i)} & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ \frac{b(i)}{a(i)} - 1 & \text{if } a(i) > b(i) \end{cases}$$
The overall Silhouette Coefficient for a clustering is the mean of all individual silhouette scores:
$$\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)$$
The silhouette score measures how much more similar a point is to its own cluster compared to other clusters. A score of +1 means the point is perfectly placed (very far from neighboring clusters, very close to its own). A score of 0 means the point is on the boundary between clusters. A score of -1 means the point is likely misclassified (closer to another cluster than its assigned one).
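To make these definitions concrete, here is a minimal NumPy sketch that computes a(i), b(i), and s(i) directly from a pairwise distance matrix. The function name `silhouette_from_scratch` and the toy data are illustrative, not part of any library; in practice you would use `sklearn.metrics.silhouette_samples`, shown later on this page.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def silhouette_from_scratch(X, labels):
    """Per-point silhouette values computed directly from the definitions of a(i) and b(i)."""
    D = squareform(pdist(X))                 # full pairwise distance matrix, O(n^2) memory
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = (labels == labels[i])
        same[i] = False                      # a(i) excludes the point itself
        if not same.any():                   # singleton cluster: s(i) is defined as 0
            continue
        a_i = D[i, same].mean()
        b_i = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s

# Tiny example: two well-separated groups on a line
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
labels = np.array([0, 0, 0, 1, 1, 1])
scores = silhouette_from_scratch(X, labels)
print(scores.round(3), "mean:", round(scores.mean(), 3))   # values near 1 indicate strong structure
```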
Understanding what silhouette scores mean at different scales is essential for practical use:
| Score Range | Interpretation |
|---|---|
| 0.71 – 1.00 | Strong structure; clusters are dense and well-separated |
| 0.51 – 0.70 | Reasonable structure; may be worth investigating further |
| 0.26 – 0.50 | Weak structure; clusters may be overlapping |
| ≤ 0.25 | No substantial structure; may have incorrect k or data not clusterable |
These thresholds are heuristics, not absolute rules. Context matters enormously—some datasets have inherently overlapping structure where a score of 0.4 represents the best achievable clustering.
The mean silhouette score aggregates information, but examining per-cluster silhouette distributions reveals much more: which clusters have mean silhouettes below the overall average, how widely scores spread within each cluster, and how many points have negative scores and are therefore likely misassigned.
A silhouette plot visualizes these distributions by showing sorted silhouette values for each cluster, making it easy to spot problematic clusters.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Perform clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(X)

# Compute silhouette scores
silhouette_avg = silhouette_score(X, cluster_labels)
sample_silhouette_values = silhouette_samples(X, cluster_labels)

print(f"Overall Silhouette Score: {silhouette_avg:.4f}")

# Per-cluster analysis
n_clusters = len(np.unique(cluster_labels))
for i in range(n_clusters):
    cluster_silhouette = sample_silhouette_values[cluster_labels == i]
    cluster_size = len(cluster_silhouette)
    cluster_mean = np.mean(cluster_silhouette)
    negative_count = np.sum(cluster_silhouette < 0)
    print(f"\nCluster {i}:")
    print(f"  Size: {cluster_size}")
    print(f"  Mean silhouette: {cluster_mean:.4f}")
    print(f"  Min: {np.min(cluster_silhouette):.4f}, Max: {np.max(cluster_silhouette):.4f}")
    print(f"  Potentially misassigned (s < 0): {negative_count} ({100*negative_count/cluster_size:.1f}%)")

# Create silhouette plot
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
y_lower = 10

for i in range(n_clusters):
    cluster_silhouette = sample_silhouette_values[cluster_labels == i]
    cluster_silhouette.sort()
    y_upper = y_lower + len(cluster_silhouette)
    color = plt.cm.viridis(float(i) / n_clusters)
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_silhouette,
                     facecolor=color, edgecolor=color, alpha=0.7)
    ax.text(-0.05, y_lower + 0.5 * len(cluster_silhouette), str(i))
    y_lower = y_upper + 10

ax.axvline(x=silhouette_avg, color="red", linestyle="--", label=f"Mean: {silhouette_avg:.3f}")
ax.set_xlabel("Silhouette Coefficient")
ax.set_ylabel("Cluster")
ax.set_title("Silhouette Plot for Cluster Analysis")
ax.legend()
plt.tight_layout()
plt.show()
```

Silhouette analysis helps diagnose specific problems: clusters whose mean silhouette falls well below the overall average, many negative values within a cluster (points likely misassigned to it), and strongly uneven cluster widths in the plot, all of which suggest the clustering or the choice of k should be revisited.
The naïve computation of silhouette scores has complexity O(n²) because it requires computing pairwise distances between all points. For large datasets, this becomes prohibitive:
| Dataset Size | Pairwise Comparisons | Memory for Distance Matrix |
|---|---|---|
| 1,000 | 500,000 | ~4 MB |
| 10,000 | 50 million | ~400 MB |
| 100,000 | 5 billion | ~40 GB |
| 1,000,000 | 500 billion | ~4 TB |
This quadratic scaling presents serious challenges for large-scale clustering evaluation.
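A quick back-of-the-envelope check of the figures in the table (which appear to assume 4-byte float32 entries; float64 doubles the memory numbers):

```python
# Pairwise comparisons and dense-distance-matrix memory as n grows
for n in (1_000, 10_000, 100_000, 1_000_000):
    n_pairs = n * (n - 1) // 2          # unique point pairs
    matrix_bytes = n * n * 4            # full n x n matrix of float32 distances
    print(f"n={n:>9,}: {n_pairs:>18,} pairs, ~{matrix_bytes / 1e9:,.3f} GB")
```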
Several approaches can make silhouette computation tractable for large datasets:
1. Sampling-Based Approximation: compute the silhouette on a random (ideally stratified) subset of points, optionally with bootstrap confidence intervals; scikit-learn supports simple sampling directly, as shown after this list.
2. Simplified Silhouette: replace the average distances a(i) and b(i) with distances to cluster centroids, reducing the cost from O(n²) to O(n·k).
3. Batch Processing: compute pairwise distances in chunks so that the full n × n distance matrix never has to be held in memory at once.
4. Approximate Nearest Neighbors: use approximate nearest-neighbor indexes to estimate the relevant distances instead of computing all of them exactly.
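Before writing custom code for strategy 1, note that scikit-learn's `silhouette_score` already exposes simple random sampling via its `sample_size` parameter; the dataset below is illustrative. The longer listing that follows sketches strategies 1 and 2 in more detail.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=50_000, centers=5, random_state=42)
labels = KMeans(n_clusters=5, random_state=42, n_init='auto').fit_predict(X)

# Estimate the silhouette on a random subset of 2,000 points instead of all 50,000
approx = silhouette_score(X, labels, sample_size=2_000, random_state=0)
print(f"Sampled silhouette estimate: {approx:.4f}")
```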
```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist


def simplified_silhouette(X, labels, centroids):
    """
    Simplified silhouette using distances to centroids
    instead of full pairwise distances.

    Complexity: O(n * k) instead of O(n²).

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    labels : ndarray of shape (n_samples,) - cluster assignments
    centroids : ndarray of shape (n_clusters, n_features)

    Returns:
    --------
    float : simplified silhouette score
    """
    n_samples = X.shape[0]
    n_clusters = len(centroids)

    # Distance from each point to each centroid: O(n * k)
    distances_to_centroids = cdist(X, centroids, metric='euclidean')

    silhouette_scores = np.zeros(n_samples)

    for i in range(n_samples):
        cluster_i = labels[i]

        # a(i): distance to own centroid
        a_i = distances_to_centroids[i, cluster_i]

        # b(i): minimum distance to other centroids
        other_distances = distances_to_centroids[i, np.arange(n_clusters) != cluster_i]
        b_i = np.min(other_distances)

        # Silhouette score
        if max(a_i, b_i) == 0:
            silhouette_scores[i] = 0
        else:
            silhouette_scores[i] = (b_i - a_i) / max(a_i, b_i)

    return np.mean(silhouette_scores)


def sampled_silhouette(X, labels, sample_size=1000, n_bootstrap=10, random_state=42):
    """
    Estimate silhouette score using stratified sampling
    with bootstrap confidence intervals.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    labels : ndarray of shape (n_samples,)
    sample_size : int, number of samples to use
    n_bootstrap : int, number of bootstrap iterations
    random_state : int

    Returns:
    --------
    tuple : (mean estimate, standard error)
    """
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    unique_labels = np.unique(labels)

    bootstrap_scores = []

    for b in range(n_bootstrap):
        # Stratified sampling
        sample_indices = []
        for label in unique_labels:
            label_indices = np.where(labels == label)[0]
            # Sample proportionally from each cluster
            n_from_cluster = max(1, int(sample_size * len(label_indices) / n_samples))
            sampled = rng.choice(label_indices,
                                 size=min(n_from_cluster, len(label_indices)),
                                 replace=False)
            sample_indices.extend(sampled)

        sample_indices = np.array(sample_indices)
        X_sample = X[sample_indices]
        labels_sample = labels[sample_indices]

        score = silhouette_score(X_sample, labels_sample)
        bootstrap_scores.append(score)

    return np.mean(bootstrap_scores), np.std(bootstrap_scores)


# Example usage
from sklearn.datasets import make_blobs

# Generate larger dataset
X, y = make_blobs(n_samples=10000, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42, n_init='auto')
labels = kmeans.fit_predict(X)

# Exact silhouette (for comparison)
exact_score = silhouette_score(X, labels)
print(f"Exact Silhouette Score: {exact_score:.4f}")

# Simplified silhouette
simplified_score = simplified_silhouette(X, labels, kmeans.cluster_centers_)
print(f"Simplified Silhouette Score: {simplified_score:.4f}")

# Sampled silhouette with CI
mean_est, std_err = sampled_silhouette(X, labels, sample_size=1000, n_bootstrap=20)
print(f"Sampled Silhouette Score: {mean_est:.4f} ± {1.96*std_err:.4f} (95% CI)")
```

The Calinski-Harabasz Index (CH), also known as the Variance Ratio Criterion (VRC), was introduced by Tadeusz Caliński and Jerzy Harabasz in 1974. It takes a fundamentally different approach from Silhouette: rather than measuring per-point cohesion and separation, it computes a global ratio of between-cluster variance to within-cluster variance.
The intuition is straightforward: good clusterings have tight clusters (low within-cluster variance) that are far apart (high between-cluster variance). The CH index formalizes this as a single ratio.
Let X be a dataset with n points in d dimensions, partitioned into k clusters. Define:
Within-cluster dispersion (W): $$W = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|^2$$
where μⱼ is the centroid of cluster Cⱼ.
Between-cluster dispersion (B): $$B = \sum_{j=1}^{k} |C_j| \cdot \|\mu_j - \mu\|^2$$
where μ is the global centroid (mean of all points) and |Cⱼ| is the size of cluster j.
The Calinski-Harabasz Index is then:
$$CH = \frac{B / (k-1)}{W / (n-k)} = \frac{B}{W} \cdot \frac{n-k}{k-1}$$
The CH index is essentially an F-statistic from one-way ANOVA, treating cluster assignments as groups. B/(k-1) is the between-group mean square, W/(n-k) is the within-group mean square. Higher values indicate better-defined clusters, just as higher F-statistics indicate significant group differences in ANOVA.
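This ANOVA connection can be checked numerically. The sketch below (synthetic 1-D data, chosen only for illustration) compares `calinski_harabasz_score` with the F statistic from `scipy.stats.f_oneway`; the two agree up to floating-point error.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Three 1-D "clusters" with different means
groups = [rng.normal(loc=m, scale=1.0, size=50) for m in (0.0, 3.0, 6.0)]
X = np.concatenate(groups).reshape(-1, 1)
labels = np.repeat([0, 1, 2], 50)

ch = calinski_harabasz_score(X, labels)
f_stat, _ = f_oneway(*groups)
print(f"CH index     : {ch:.4f}")
print(f"ANOVA F-stat : {f_stat:.4f}")   # same value: CH is the one-way ANOVA F ratio
```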
Unlike silhouette scores which range from -1 to 1, there's no universal threshold for 'good' CH values. The metric is primarily used to compare clusterings:
$$k^* = \arg\max_k CH(k)$$
where k* is the optimal number of clusters.
The CH index is built on the fundamental total variance decomposition:
$$T = W + B$$
where T is the total sum of squared deviations of all points from the global centroid μ, and W and B are the within- and between-cluster dispersions defined above.
This identity always holds for a fixed dataset, so any increase in B is exactly matched by a decrease in W. The CH index captures this trade-off as a ratio adjusted for degrees of freedom.
Understanding how CH behaves as k changes is critical for its use:
At k = 1: B = 0 (all points in one cluster means no between-cluster variance) and the k - 1 denominator also vanishes, so CH is undefined; the index is only evaluated for k ≥ 2.
As k approaches n: each point tends toward its own cluster and W → 0, which by itself inflates B/W, but the correction factor (n-k)/(k-1) shrinks toward 0 and penalizes the extra clusters; in practice CH typically peaks at a moderate k rather than increasing without bound.
At the 'true' k: We expect a local maximum in CH, where the clustering captures natural structure without overfitting.
This behavior makes CH useful for selecting k: plot CH(k) for k = 2, 3, ..., K and look for the maximum.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def compute_variance_decomposition(X, labels):
    """
    Compute the within-cluster (W), between-cluster (B),
    and total (T) sum of squares.
    """
    n = X.shape[0]
    global_centroid = np.mean(X, axis=0)
    unique_labels = np.unique(labels)

    # Total sum of squares
    T = np.sum((X - global_centroid) ** 2)

    # Within-cluster sum of squares
    W = 0
    for label in unique_labels:
        cluster_points = X[labels == label]
        cluster_centroid = np.mean(cluster_points, axis=0)
        W += np.sum((cluster_points - cluster_centroid) ** 2)

    # Between-cluster sum of squares
    B = 0
    for label in unique_labels:
        cluster_points = X[labels == label]
        cluster_centroid = np.mean(cluster_points, axis=0)
        B += len(cluster_points) * np.sum((cluster_centroid - global_centroid) ** 2)

    return T, W, B


def manual_calinski_harabasz(X, labels):
    """
    Manual implementation of CH index for understanding.
    """
    n = X.shape[0]
    k = len(np.unique(labels))

    T, W, B = compute_variance_decomposition(X, labels)

    # CH = (B / (k-1)) / (W / (n-k))
    ch = (B / (k - 1)) / (W / (n - k))
    return ch, T, W, B


# Generate data with 4 true clusters
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)

# Compute CH for different values of k
k_range = range(2, 11)
ch_scores = []
variance_data = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    labels = kmeans.fit_predict(X)

    # Using sklearn
    ch_sklearn = calinski_harabasz_score(X, labels)

    # Manual computation for verification
    ch_manual, T, W, B = manual_calinski_harabasz(X, labels)

    ch_scores.append(ch_sklearn)
    variance_data.append({'k': k, 'T': T, 'W': W, 'B': B, 'CH': ch_sklearn})

    print(f"k={k}: CH={ch_sklearn:.2f}, W={W:.2f}, B={B:.2f}, W+B={W+B:.2f}, T={T:.2f}")

# Plot CH vs k
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# CH score plot
axes[0].plot(k_range, ch_scores, 'bo-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='red', linestyle='--', label='True k=4')
optimal_k = k_range[np.argmax(ch_scores)]
axes[0].scatter([optimal_k], [max(ch_scores)], color='green', s=200, zorder=5,
                label=f'Optimal k={optimal_k}')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Calinski-Harabasz Score')
axes[0].set_title('CH Score vs Number of Clusters')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Variance decomposition plot
W_values = [v['W'] for v in variance_data]
B_values = [v['B'] for v in variance_data]

axes[1].bar(k_range, W_values, label='Within (W)', alpha=0.7)
axes[1].bar(k_range, B_values, bottom=W_values, label='Between (B)', alpha=0.7)
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Sum of Squares')
axes[1].set_title('Variance Decomposition: W + B = T')
axes[1].legend()

plt.tight_layout()
plt.show()
```

Both Silhouette and CH are widely used, but they have distinct characteristics that make each more suitable in different scenarios:
| Criterion | Silhouette | Calinski-Harabasz |
|---|---|---|
| Computational complexity | O(n²) for full computation | O(n·k) — much faster |
| Score range | [-1, 1] — interpretable | [0, ∞) — no absolute meaning |
| Per-sample analysis | Yes — inspect individual points | No — global metric only |
| Cluster shape bias | Prefers convex, similar-sized clusters | Prefers convex, balanced clusters |
| Density variation | Handles somewhat | Sensitive to density differences |
| Finding optimal k | Maximum value heuristic | Maximum value heuristic |
| Implementation | Distance-based | Variance-based (centroid) |
When computational resources permit, use both metrics along with the Davies-Bouldin Index. If all three agree on the optimal k, you have strong evidence for that choice. If they disagree, investigate the clusterings manually and consider domain knowledge.
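A minimal sketch of that cross-checking workflow, using scikit-learn's three built-in internal metrics over a range of k (the dataset and k range here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=1_000, centers=4, cluster_std=1.0, random_state=42)

results = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init='auto').fit_predict(X)
    results[k] = (silhouette_score(X, labels),
                  calinski_harabasz_score(X, labels),
                  davies_bouldin_score(X, labels))

best_sil = max(results, key=lambda k: results[k][0])   # Silhouette: maximize
best_ch = max(results, key=lambda k: results[k][1])    # CH: maximize
best_db = min(results, key=lambda k: results[k][2])    # Davies-Bouldin: minimize
print(f"Silhouette picks k={best_sil}, CH picks k={best_ch}, DB picks k={best_db}")
```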
Internal validation metrics have fundamental limitations that practitioners must understand:
Both Silhouette and CH implicitly assume clusters are convex (roughly spherical) and similarly sized. They penalize elongated or irregularly shaped clusters, non-convex structures such as interleaved half-moons or nested rings, and clusters with very different sizes or densities, even when those structures are the correct grouping, as the experiment below demonstrates.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering
from sklearn.datasets import make_moons, make_circles, make_blobs

# Create datasets with different structures
datasets = {
    'Blobs (convex)': make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42),
    'Moons (non-convex)': make_moons(n_samples=300, noise=0.05, random_state=42),
    'Circles (nested)': make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42),
}

fig, axes = plt.subplots(len(datasets), 4, figsize=(16, 12))

for row, (name, (X, y_true)) in enumerate(datasets.items()):
    n_clusters = len(np.unique(y_true))

    # K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')
    labels_kmeans = kmeans.fit_predict(X)

    # Spectral clustering (can handle non-convex)
    spectral = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors',
                                  n_neighbors=10, random_state=42)
    labels_spectral = spectral.fit_predict(X)

    # Compute metrics
    sil_true = silhouette_score(X, y_true)
    ch_true = calinski_harabasz_score(X, y_true)
    sil_kmeans = silhouette_score(X, labels_kmeans)
    ch_kmeans = calinski_harabasz_score(X, labels_kmeans)
    sil_spectral = silhouette_score(X, labels_spectral)
    ch_spectral = calinski_harabasz_score(X, labels_spectral)

    # Plot ground truth
    axes[row, 0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=20)
    axes[row, 0].set_title(f'{name}\nGround Truth')
    axes[row, 0].set_xlabel(f'Sil: {sil_true:.3f}, CH: {ch_true:.1f}')

    # Plot K-means result
    axes[row, 1].scatter(X[:, 0], X[:, 1], c=labels_kmeans, cmap='viridis', s=20)
    axes[row, 1].set_title('K-means Clustering')
    axes[row, 1].set_xlabel(f'Sil: {sil_kmeans:.3f}, CH: {ch_kmeans:.1f}')

    # Plot Spectral result
    axes[row, 2].scatter(X[:, 0], X[:, 1], c=labels_spectral, cmap='viridis', s=20)
    axes[row, 2].set_title('Spectral Clustering')
    axes[row, 2].set_xlabel(f'Sil: {sil_spectral:.3f}, CH: {ch_spectral:.1f}')

    # Comparison bar chart
    methods = ['True', 'K-means', 'Spectral']
    sil_scores = [sil_true, sil_kmeans, sil_spectral]
    x_pos = np.arange(len(methods))
    axes[row, 3].bar(x_pos, sil_scores, color=['green', 'blue', 'orange'])
    axes[row, 3].set_xticks(x_pos)
    axes[row, 3].set_xticklabels(methods)
    axes[row, 3].set_ylabel('Silhouette Score')
    axes[row, 3].set_title('Metric Comparison')

plt.tight_layout()
plt.show()

# Key observation: For moons and circles, the "wrong" clustering (K-means)
# often gets HIGHER silhouette scores than the correct one!
print("\n--- Key Insight ---")
print("Internal metrics can prefer 'wrong' clusterings if they're more convex!")
```

A clustering with higher Silhouette or CH score is NOT necessarily better! These metrics can score an incorrect convex partitioning higher than a correct non-convex one. Always visualize clusters when possible and consider whether the metric assumptions match your data's structure.
1. Monotonicity with k (for some metrics): Some internal metrics tend to improve as k increases, making direct comparison across different k values misleading.
2. Sensitivity to noise: Outliers can dramatically affect both metrics, as they inflate within-cluster distances or variance.
3. Scale dependence: Both metrics depend on the distance/variance scale of the data, so preprocessing choices (normalization, standardization) change the scores even when the cluster assignments stay the same (see the sketch after this list).
4. No absolute threshold: Unlike classification accuracy, there's no universal 'good enough' threshold. A Silhouette of 0.4 might be excellent for one dataset and poor for another.
5. No consideration of interpretability: High-scoring clusters may not correspond to meaningful domain concepts.
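To illustrate the scale-dependence point (3), the sketch below recomputes both metrics after standardizing features that sit on very different scales; the labels are held fixed, yet the scores change. The specific dataset is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

# Two features, the second stretched to a much larger scale
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
X[:, 1] *= 100

labels = KMeans(n_clusters=3, random_state=42, n_init='auto').fit_predict(X)
X_scaled = StandardScaler().fit_transform(X)

# Same labels, different feature scaling: the metric values change
print("raw   :", silhouette_score(X, labels), calinski_harabasz_score(X, labels))
print("scaled:", silhouette_score(X_scaled, labels), calinski_harabasz_score(X_scaled, labels))
```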
While Silhouette and CH are the most common, several other internal metrics exist:
The Davies-Bouldin Index measures the average similarity between each cluster and its most similar cluster:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)}$$
where σᵢ is the average distance of points in cluster i to centroid μᵢ. Lower is better (unlike Silhouette and CH).
The Dunn Index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance:
$$Dunn = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_k \Delta_k}$$
where d(Cᵢ, Cⱼ) is the distance between clusters and Δₖ is the diameter of cluster k. Higher is better, but sensitive to outliers.
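scikit-learn does not provide a Dunn index, so here is a minimal from-scratch sketch using the single-linkage inter-cluster distance and complete-diameter definitions given above (O(n²), and, as noted, sensitive to outliers); the function name and example data are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance divided by max cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # d(C_i, C_j): smallest distance between points in two different clusters
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    # Delta_k: largest pairwise distance within a single cluster
    max_diam = max(cdist(c, c).max() for c in clusters)
    return min_inter / max_diam

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init='auto').fit_predict(X)
print(f"Dunn index: {dunn_index(X, labels):.4f}")
```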
The Xie-Beni Index is used with fuzzy c-means:
$$XB = \frac{\sum_i \sum_j u_{ij}^m \|x_i - c_j\|^2}{n \cdot \min_{i \neq j} \|c_i - c_j\|^2}$$
Lower is better.
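A hedged sketch of the Xie-Beni computation. It assumes you already have a membership matrix `u` (shape n × k, rows summing to 1) and centers from some fuzzy c-means implementation; both are treated as inputs here, and the function name is illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def xie_beni(X, u, centers, m=2.0):
    """Xie-Beni index for a fuzzy partition (lower is better).

    X       : (n, d) data matrix
    u       : (n, k) membership matrix, rows summing to 1
    centers : (k, d) cluster centers
    m       : fuzzifier exponent used by fuzzy c-means
    """
    n = X.shape[0]
    sq_dists = cdist(X, centers, metric='sqeuclidean')      # ||x_i - c_j||^2
    numerator = np.sum((u ** m) * sq_dists)
    center_seps = cdist(centers, centers, metric='sqeuclidean')
    np.fill_diagonal(center_seps, np.inf)                   # exclude i == j
    return numerator / (n * center_seps.min())
```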
| Metric | Optimal | Complexity | Key Characteristics |
|---|---|---|---|
| Silhouette | Maximize (near 1) | O(n²) | Per-sample, interpretable range |
| Calinski-Harabasz | Maximize | O(n·k) | ANOVA-based, very fast |
| Davies-Bouldin | Minimize | O(n·k) | Cluster similarity ratio |
| Dunn | Maximize | O(n²) | Sensitive to outliers |
| Xie-Beni | Minimize | O(n·k) | For fuzzy clustering |
This page provided a comprehensive treatment of internal clustering validation metrics, with deep focus on Silhouette and Calinski-Harabasz: the per-point silhouette construction from a(i) and b(i) and its interpretable [-1, 1] range, the variance-ratio (ANOVA-style) formulation of CH, strategies for scaling silhouette computation to large datasets, guidance on comparing and combining the two metrics, and the convexity and scale biases that limit all internal metrics.
You now understand the mathematical foundations, interpretations, computational considerations, and limitations of internal clustering validation metrics. Next, we'll explore external metrics like Adjusted Rand Index and Normalized Mutual Information, which compare clusterings against known ground truth.