In the previous page, we explored internal validation metrics that assess clustering quality without reference to ground truth. But in many scenarios—algorithm benchmarking, method validation, or labeled datasets—we have access to true class labels. External validation metrics compare the discovered cluster structure against these known labels.
External metrics answer a fundamentally different question: instead of 'Are these clusters compact and separated?' they ask 'Do these clusters match the known structure?' This makes them essential for objectively comparing clustering algorithms and validating new methods.
This page provides rigorous coverage of the two most important external metrics: the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). We'll derive them from first principles, understand their properties, and learn when to use each.
By the end of this page, you will understand: (1) the contingency table representation of clustering comparisons, (2) the Rand Index and why adjustment for chance is necessary, (3) the information-theoretic foundations of Mutual Information, (4) different normalization schemes (NMI, AMI, V-measure), and (5) practical guidance for choosing between metrics.
All external clustering metrics can be understood through the contingency table, which cross-tabulates two partitions of the same data.
Let U = {U₁, U₂, ..., Uᵣ} be the ground truth partition with r classes, and V = {V₁, V₂, ..., Vₛ} be the clustering with s clusters. The contingency table N has entries:
$$n_{ij} = |U_i \cap V_j|$$
the number of points that belong to both true class i and cluster j.
|  | Cluster V₁ | Cluster V₂ | Cluster V₃ | Row Sum |
|---|---|---|---|---|
| Class U₁ | 40 | 5 | 5 | 50 |
| Class U₂ | 3 | 45 | 2 | 50 |
| Class U₃ | 2 | 5 | 43 | 50 |
| Col Sum | 45 | 55 | 50 | n=150 |
This table tells us that V₁ mostly captures U₁, V₂ mostly captures U₂, and V₃ mostly captures U₃. A perfect clustering would have all mass on the diagonal.
From the contingency table, we compute:
Marginal sums: the row sums aᵢ = Σⱼ nᵢⱼ = |Uᵢ| (the class sizes) and the column sums bⱼ = Σᵢ nᵢⱼ = |Vⱼ| (the cluster sizes).
Total: n = Σᵢ aᵢ = Σⱼ bⱼ = Σᵢⱼ nᵢⱼ
These marginals are crucial for computing both pair-counting metrics (Rand Index) and information-theoretic metrics (Mutual Information).
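To make the marginals concrete, here is a minimal numpy sketch that reproduces the example table above and reads off aᵢ, bⱼ, and n (the array literal simply transcribes the table; nothing else is assumed):

```python
import numpy as np

# The example contingency table above: rows are true classes U1..U3,
# columns are clusters V1..V3
N = np.array([[40,  5,  5],
              [ 3, 45,  2],
              [ 2,  5, 43]])

a = N.sum(axis=1)   # class sizes a_i (row sums)    -> [50 50 50]
b = N.sum(axis=0)   # cluster sizes b_j (column sums) -> [45 55 50]
n = N.sum()         # total number of points        -> 150

print("a_i =", a, " b_j =", b, " n =", n)
```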
Most external metrics are symmetric: comparing U to V gives the same result as comparing V to U. This is important because it means the metric measures agreement between partitions without assuming one is 'correct.'
The Rand Index (RI), introduced by William Rand in 1971, takes a pair-counting approach: it considers all possible pairs of data points and asks whether the two partitions agree on each pair.
For any pair of points (xᵢ, xⱼ), there are four possibilities:
| Ground Truth | Clustering | Description |
|---|---|---|
| Same class | Same cluster | True Positive (TP) — Correctly grouped together |
| Different class | Different cluster | True Negative (TN) — Correctly separated |
| Same class | Different cluster | False Negative (FN) — Should be together, separated |
| Different class | Same cluster | False Positive (FP) — Should be separate, grouped |
From the contingency table:
TP: Pairs that are together in both partitions: $$TP = \sum_{i,j} \binom{n_{ij}}{2}$$
FP: Pairs together in clustering but not in ground truth: $$FP = \sum_j \binom{b_j}{2} - TP$$
FN: Pairs together in ground truth but not in clustering: $$FN = \sum_i \binom{a_i}{2} - TP$$
TN: Pairs separated in both: $$TN = \binom{n}{2} - TP - FP - FN$$
The Rand Index is simply the accuracy of pair classifications:
$$RI = \frac{TP + TN}{TP + FP + FN + TN} = \frac{TP + TN}{\binom{n}{2}}$$
Range: [0, 1], where 1 indicates perfect agreement and 0 indicates complete disagreement (the partitions disagree on every single pair, which essentially never occurs in practice).
Interpretation: The fraction of all pairs on which the two partitions agree.
```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import rand_score

def contingency_table(labels_true, labels_pred):
    """Construct the contingency table from two label vectors."""
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    n_classes = len(classes)
    n_clusters = len(clusters)

    # Map labels to indices
    class_to_idx = {c: i for i, c in enumerate(classes)}
    cluster_to_idx = {c: i for i, c in enumerate(clusters)}

    # Build contingency table
    table = np.zeros((n_classes, n_clusters), dtype=int)
    for true_label, pred_label in zip(labels_true, labels_pred):
        i = class_to_idx[true_label]
        j = cluster_to_idx[pred_label]
        table[i, j] += 1

    return table, classes, clusters

def rand_index_from_contingency(table):
    """Compute Rand Index from contingency table."""
    n = table.sum()

    # Row and column sums
    a = table.sum(axis=1)  # Class sizes
    b = table.sum(axis=0)  # Cluster sizes

    # TP: pairs together in both
    tp = sum(comb(table[i, j], 2)
             for i in range(table.shape[0])
             for j in range(table.shape[1]))

    # Sum of C(a_i, 2) for all classes
    sum_a = sum(comb(ai, 2) for ai in a)

    # Sum of C(b_j, 2) for all clusters
    sum_b = sum(comb(bj, 2) for bj in b)

    # FP and FN
    fp = sum_b - tp
    fn = sum_a - tp

    # Total pairs
    total_pairs = comb(n, 2)

    # TN
    tn = total_pairs - tp - fp - fn

    # Rand Index
    ri = (tp + tn) / total_pairs

    return ri, {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn}

# Example
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

table, classes, clusters = contingency_table(labels_true, labels_pred)
print("Contingency Table:")
print(table)

ri, counts = rand_index_from_contingency(table)
print(f"\nPair Counts: {counts}")
print(f"Rand Index: {ri:.4f}")

# Verify with sklearn
ri_sklearn = rand_score(labels_true, labels_pred)
print(f"Sklearn Rand Index: {ri_sklearn:.4f}")
```

The raw Rand Index has a significant flaw: it does not account for chance agreement. Even random clusterings will achieve non-zero RI scores, and the expected value under random clustering is not zero.
Consider two independent random partitions of the data into k roughly equal-sized clusters each. For a random pair of points, each partition places the pair in the same cluster with probability about 1/k, so the probability that the two partitions agree on the pair is approximately:
$$E[RI] \approx \frac{1}{k^2} + \left(1 - \frac{1}{k}\right)^2 = 1 - \frac{2}{k} + \frac{2}{k^2}$$
For k = 10, E[RI] ≈ 0.82! This means a completely random clustering appears to have 82% 'agreement' with ground truth.
This inflated baseline makes the RI misleading, as the following demonstration shows.
```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

def demonstrate_chance_problem():
    """Demonstrate that random clusterings get high Rand Index."""
    np.random.seed(42)
    n = 1000
    n_classes = 10

    # Random ground truth
    labels_true = np.random.randint(0, n_classes, n)

    # Multiple random clusterings
    random_scores = []
    for _ in range(100):
        labels_random = np.random.randint(0, n_classes, n)
        ri = rand_score(labels_true, labels_random)
        ari = adjusted_rand_score(labels_true, labels_random)
        random_scores.append((ri, ari))

    ri_mean = np.mean([s[0] for s in random_scores])
    ri_std = np.std([s[0] for s in random_scores])
    ari_mean = np.mean([s[1] for s in random_scores])
    ari_std = np.std([s[1] for s in random_scores])

    print("Random Clustering vs Random Ground Truth")
    print(f"  Rand Index:          {ri_mean:.4f} ± {ri_std:.4f}")
    print(f"  Adjusted Rand Index: {ari_mean:.4f} ± {ari_std:.4f}")
    print()
    print("Expected values:")
    # Theoretical approximation: 1/k^2 + (1 - 1/k)^2
    print(f"  RI for random:  ~{1 - 2/n_classes + 2/n_classes**2:.4f} (theoretical)")
    print(f"  ARI for random: ~0.0 (by design)")

demonstrate_chance_problem()
```

The raw Rand Index should almost never be used in practice. Its high baseline for random clusterings makes it unsuitable for meaningful evaluation. Always use the Adjusted Rand Index (ARI) instead, which corrects for chance agreement.
The Adjusted Rand Index (ARI), introduced by Hubert and Arabie in 1985, corrects the RI for chance using a hypergeometric model of randomness:
$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
This adjustment ensures that the expected ARI is 0 for random labelings with the same marginals, and that the maximum value of 1 is attained only when the partitions are identical (up to label permutation).
Let the contingency table be N with entries nᵢⱼ, row sums aᵢ, and column sums bⱼ. Define:
$$Index = \sum_{i,j} \binom{n_{ij}}{2}$$
$$ExpectedIndex = \frac{\sum_i \binom{a_i}{2} \cdot \sum_j \binom{b_j}{2}}{\binom{n}{2}}$$
$$MaxIndex = \frac{1}{2}\left(\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right)$$
Then:
$$ARI = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}$$
This formulation directly computes the chance-adjusted score.
```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score

def adjusted_rand_index_manual(labels_true, labels_pred):
    """Manual implementation of ARI for understanding the formula."""
    # Build contingency table
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    class_to_idx = {c: i for i, c in enumerate(classes)}
    cluster_to_idx = {c: i for i, c in enumerate(clusters)}

    n = len(labels_true)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for true_label, pred_label in zip(labels_true, labels_pred):
        i = class_to_idx[true_label]
        j = cluster_to_idx[pred_label]
        table[i, j] += 1

    # Marginal sums
    a = table.sum(axis=1)  # Row sums (class sizes)
    b = table.sum(axis=0)  # Column sums (cluster sizes)

    # Index: sum of C(n_ij, 2)
    index = sum(comb(table[i, j], 2)
                for i in range(table.shape[0])
                for j in range(table.shape[1]))

    # Sum of C(a_i, 2)
    sum_a = sum(comb(ai, 2) for ai in a)

    # Sum of C(b_j, 2)
    sum_b = sum(comb(bj, 2) for bj in b)

    # Total pairs
    n_pairs = comb(n, 2)

    # Expected index under null hypothesis (random permutation)
    expected_index = (sum_a * sum_b) / n_pairs

    # Maximum index
    max_index = 0.5 * (sum_a + sum_b)

    # ARI (degenerate case: a single cluster/class on both sides)
    if max_index == expected_index:
        return 1.0, {}

    ari = (index - expected_index) / (max_index - expected_index)

    return ari, {
        'index': index,
        'expected_index': expected_index,
        'max_index': max_index,
        'sum_a': sum_a,
        'sum_b': sum_b,
        'n_pairs': n_pairs
    }

# Example with varying agreement levels
examples = [
    ("Perfect match",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]),
    ("Good match",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 0, 1, 1, 1, 1, 2, 2, 2]),
    ("Poor match",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]),
    ("Permuted labels",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [2, 2, 2, 0, 0, 0, 1, 1, 1]),
]

print("ARI Examples:\n")
for name, true, pred in examples:
    true = np.array(true)
    pred = np.array(pred)
    ari_manual, details = adjusted_rand_index_manual(true, pred)
    ari_sklearn = adjusted_rand_score(true, pred)
    print(f"{name}:")
    print(f"  True: {true}")
    print(f"  Pred: {pred}")
    print(f"  ARI:  {ari_sklearn:.4f}")
    print()

# Key property: ARI is invariant to label permutation!
print("Note: 'Perfect match' and 'Permuted labels' both get ARI = 1.0")
print("This label invariance is crucial - we care about structure, not label names.")
```

A crucial property of ARI (and all proper external metrics) is label invariance: the score depends only on how points are grouped, not on the labels assigned to groups. Swapping cluster labels 0 and 1 doesn't change the ARI. This is essential because cluster labels are arbitrary.
Range: Theoretically [-0.5, 1]; in practice usually within [-0.1, 1]
Chance-adjusted: E[ARI] = 0 for random clustering with fixed marginals
Symmetric: ARI(U, V) = ARI(V, U)
Label-invariant: Renaming labels doesn't change the score
Does not require matching cluster counts: Can compare partitions with different numbers of clusters, as the sketch below illustrates
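A short sketch using scikit-learn's adjusted_rand_score on small made-up label vectors illustrates three of these properties: label invariance, symmetry, and comparison across different cluster counts.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Label invariance: renaming the clusters does not change the score
renamed = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
print("ARI(truth, renamed):", adjusted_rand_score(truth, renamed))   # 1.0

# Symmetry: ARI(U, V) == ARI(V, U)
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])
print("Symmetric:", np.isclose(adjusted_rand_score(truth, pred),
                               adjusted_rand_score(pred, truth)))    # True

# Different numbers of clusters: the comparison is still well-defined
five_clusters = np.array([0, 0, 1, 2, 2, 3, 4, 4, 4])
print("ARI, 5 clusters vs 3 classes:",
      round(adjusted_rand_score(truth, five_clusters), 4))
```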
| ARI Range | Interpretation | Typical Scenario |
|---|---|---|
| 1.0 | Perfect agreement | Identical partitions (up to label permutation) |
| 0.8 – 1.0 | Strong agreement | Minor disagreements on boundaries |
| 0.5 – 0.8 | Moderate agreement | Captures main structure with some errors |
| 0.2 – 0.5 | Weak agreement | Some correspondence but substantial differences |
| 0.0 – 0.2 | Near-chance | Little meaningful correspondence |
| < 0 | Worse than random | Actively anti-correlated (very rare) |
ARI is sensitive to: how points in large clusters are grouped, since large clusters contribute the bulk of all pairs, and to splits or merges of the dominant clusters.
ARI is less sensitive to: misassignments confined to small clusters, the particular labels assigned to each cluster, and the ordering of clusters.
1. Large cluster bias: Errors in large clusters affect more pairs, so they're penalized more heavily. This may or may not be desirable (the sketch after this list makes the effect concrete).
2. Doesn't distinguish error types: A point assigned to the wrong cluster is equally bad regardless of which wrong cluster.
3. Number of clusters: ARI can favor solutions with similar numbers of clusters as the ground truth, even when this matches by coincidence.
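To make the large-cluster bias concrete, the hypothetical example below compares two single structural errors with adjusted_rand_score: splitting an 80-point cluster in half versus merging two 10-point clusters. The cluster sizes are arbitrary; only the qualitative gap matters.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Ground truth: one large cluster (80 points) and two small ones (10 each)
truth = np.array([0] * 80 + [1] * 10 + [2] * 10)

# Error A: split the large cluster in half (breaks many pairs)
split_large = truth.copy()
split_large[:40] = 3

# Error B: merge the two small clusters (affects far fewer pairs)
merge_small = truth.copy()
merge_small[truth == 2] = 1

print("Split the large cluster:  ARI =",
      round(adjusted_rand_score(truth, split_large), 3))   # ~0.41
print("Merge the small clusters: ARI =",
      round(adjusted_rand_score(truth, merge_small), 3))   # ~0.95
```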
The second major family of external metrics is based on information theory, treating clusterings as random variables and measuring the information shared between them.
The entropy of a partition U measures its uncertainty:
$$H(U) = -\sum_{i=1}^{r} P(U_i) \log P(U_i) = -\sum_{i=1}^{r} \frac{a_i}{n} \log \frac{a_i}{n}$$
where aᵢ = |Uᵢ| is the size of class i.
Interpretation: Entropy measures how 'spread out' the distribution is. Maximum entropy occurs when all classes are equally sized. Minimum entropy (0) occurs when all points are in one class.
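As a quick numerical check (a minimal sketch with made-up label vectors), a balanced three-class partition attains the maximum entropy log 3, while a one-class partition has entropy 0:

```python
import numpy as np

def partition_entropy(labels):
    """Entropy of a partition, in nats."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # (+0.0 normalizes the -0.0 that appears in the single-class case)
    return -np.sum(p * np.log(p)) + 0.0

balanced = np.repeat([0, 1, 2], 10)       # three equal classes of 10 points
skewed   = np.array([0] * 28 + [1, 2])    # one dominant class
single   = np.zeros(30, dtype=int)        # all points in one class

print(f"Balanced: {partition_entropy(balanced):.4f}  (log 3 = {np.log(3):.4f})")
print(f"Skewed:   {partition_entropy(skewed):.4f}")
print(f"Single:   {partition_entropy(single):.4f}")
```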
Joint entropy of U and V: $$H(U, V) = -\sum_{i,j} P(U_i \cap V_j) \log P(U_i \cap V_j) = -\sum_{i,j} \frac{n_{ij}}{n} \log \frac{n_{ij}}{n}$$
Conditional entropy of U given V: $$H(U|V) = H(U, V) - H(V) = -\sum_{i,j} \frac{n_{ij}}{n} \log \frac{n_{ij}}{b_j}$$
Interpretation: H(U|V) measures how much uncertainty remains about U after observing V. If V perfectly predicts U, then H(U|V) = 0.
The choice of logarithm base (2, e, or 10) affects the scale of entropy values but not the normalized metrics. Log base 2 gives entropy in 'bits', natural log in 'nats'. Scikit-learn uses natural log by default.
Mutual Information (MI) measures the information shared between two partitions—how much knowing one reduces uncertainty about the other:
$$MI(U, V) = \sum_{i=1}^{r} \sum_{j=1}^{s} P(U_i \cap V_j) \log \frac{P(U_i \cap V_j)}{P(U_i) P(V_j)}$$
$$= \sum_{i,j} \frac{n_{ij}}{n} \log \frac{n \cdot n_{ij}}{a_i \cdot b_j}$$
MI can be expressed as:
$$MI(U, V) = H(U) - H(U|V) = H(V) - H(V|U) = H(U) + H(V) - H(U, V)$$
Interpretation: MI is the reduction in uncertainty about U when V is known (and vice versa).
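These identities are easy to verify numerically. The sketch below builds the joint distribution with scikit-learn's contingency_matrix on small made-up label vectors and checks that MI = H(U) + H(V) − H(U, V):

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2])

# Joint distribution P(U_i, V_j) from the contingency table
P = contingency_matrix(labels_true, labels_pred) / len(labels_true)

def H(p):
    """Entropy of a probability vector (zero entries are skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h_u = H(P.sum(axis=1))   # H(U) from row marginals
h_v = H(P.sum(axis=0))   # H(V) from column marginals
h_uv = H(P.ravel())      # joint entropy H(U, V)

mi_identity = h_u + h_v - h_uv
mi_sklearn = mutual_info_score(labels_true, labels_pred)

print(f"H(U)={h_u:.4f}  H(V)={h_v:.4f}  H(U,V)={h_uv:.4f}")
print(f"MI via identity: {mi_identity:.4f}")
print(f"MI via sklearn:  {mi_sklearn:.4f}")
print(f"H(U|V) = H(U,V) - H(V) = {h_uv - h_v:.4f}")
```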
Raw MI is not directly comparable across different datasets or clusterings because its scale depends on the entropies involved. A clustering with more clusters will have higher entropy and potentially higher MI, even if the agreement is proportionally worse.
```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score

def compute_entropy(labels):
    """Compute entropy of a label distribution."""
    n = len(labels)
    unique, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

def compute_mi_manual(labels_true, labels_pred):
    """Manual Mutual Information computation."""
    n = len(labels_true)

    # Joint distribution
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)

    mi = 0
    for c in classes:
        for k in clusters:
            # Joint probability
            p_ij = np.sum((labels_true == c) & (labels_pred == k)) / n
            if p_ij == 0:
                continue
            # Marginal probabilities
            p_i = np.sum(labels_true == c) / n
            p_j = np.sum(labels_pred == k) / n
            mi += p_ij * np.log(p_ij / (p_i * p_j))

    return mi

# Example
labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2])

h_true = compute_entropy(labels_true)
h_pred = compute_entropy(labels_pred)
mi_manual = compute_mi_manual(labels_true, labels_pred)
mi_sklearn = mutual_info_score(labels_true, labels_pred)

print("Information-Theoretic Analysis:")
print(f"  H(true): {h_true:.4f} nats")
print(f"  H(pred): {h_pred:.4f} nats")
print(f"  MI: {mi_sklearn:.4f} nats")
print(f"  MI (manual): {mi_manual:.4f} nats")
print()
print(f"  MI / H(true): {mi_sklearn / h_true:.4f}")
print(f"  MI / H(pred): {mi_sklearn / h_pred:.4f}")
print()

# Show why normalization is needed
print("Effect of cluster count on MI:")
for n_clusters in [3, 5, 10, 20]:
    random_labels = np.random.randint(0, n_clusters, len(labels_true))
    mi = mutual_info_score(labels_true, random_labels)
    nmi = normalized_mutual_info_score(labels_true, random_labels)
    print(f"  k={n_clusters}: MI={mi:.4f}, NMI={nmi:.4f}")
```

Normalized Mutual Information (NMI) scales MI to the [0, 1] range, making it comparable across different datasets and clusterings. Several normalization schemes exist:
1. Geometric mean normalization: $$NMI_{sqrt} = \frac{MI(U, V)}{\sqrt{H(U) \cdot H(V)}}$$
2. Arithmetic mean normalization: $$NMI_{avg} = \frac{2 \cdot MI(U, V)}{H(U) + H(V)}$$
3. Minimum normalization: $$NMI_{min} = \frac{MI(U, V)}{\min(H(U), H(V))}$$
4. Maximum normalization: $$NMI_{max} = \frac{MI(U, V)}{\max(H(U), H(V))}$$
The most commonly used is the arithmetic mean (also called 'sum' normalization).
Like raw Rand Index, raw NMI has a non-zero expected value under random clustering. The expected value depends on the number of clusters and can be substantial. This led to the development of Adjusted Mutual Information.
```python
import numpy as np
from sklearn.metrics import (
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score
)

def compare_nmi_variants(labels_true, labels_pred):
    """Compare different NMI normalization schemes."""
    mi = mutual_info_score(labels_true, labels_pred)

    # Compute entropies
    def entropy(labels):
        n = len(labels)
        _, counts = np.unique(labels, return_counts=True)
        probs = counts / n
        return -np.sum(probs * np.log(probs))

    h_true = entropy(labels_true)
    h_pred = entropy(labels_pred)

    # Different normalizations (these mirror the sklearn results below)
    nmi_geometric = mi / np.sqrt(h_true * h_pred)
    nmi_arithmetic = 2 * mi / (h_true + h_pred)
    nmi_min = mi / min(h_true, h_pred)
    nmi_max = mi / max(h_true, h_pred)

    # Sklearn variants
    nmi_sklearn_geometric = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='geometric')
    nmi_sklearn_arithmetic = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='arithmetic')
    nmi_sklearn_min = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='min')
    nmi_sklearn_max = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='max')

    ami = adjusted_mutual_info_score(labels_true, labels_pred)

    print("NMI Variants Comparison:")
    print(f"  MI: {mi:.4f}")
    print(f"  H(true): {h_true:.4f}, H(pred): {h_pred:.4f}")
    print()
    print(f"  NMI (geometric):  {nmi_sklearn_geometric:.4f}")
    print(f"  NMI (arithmetic): {nmi_sklearn_arithmetic:.4f}")
    print(f"  NMI (min):        {nmi_sklearn_min:.4f}")
    print(f"  NMI (max):        {nmi_sklearn_max:.4f}")
    print(f"  AMI (adjusted):   {ami:.4f}")

    return {
        'MI': mi,
        'NMI_geometric': nmi_sklearn_geometric,
        'NMI_arithmetic': nmi_sklearn_arithmetic,
        'NMI_min': nmi_sklearn_min,
        'NMI_max': nmi_sklearn_max,
        'AMI': ami
    }

# Test with different scenarios
print("\n=== Perfect Agreement ===")
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # Permuted labels
compare_nmi_variants(labels_true, labels_pred)

print("\n=== Partial Agreement ===")
labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
labels_pred = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
compare_nmi_variants(labels_true, labels_pred)

print("\n=== Random Clustering (Chance) ===")
np.random.seed(42)
labels_true = np.repeat([0, 1, 2, 3, 4], 20)
labels_pred = np.random.permutation(labels_true)
compare_nmi_variants(labels_true, labels_pred)
```

Adjusted Mutual Information (AMI) applies a chance correction to NMI, analogous to how ARI corrects RI:
$$AMI = \frac{MI - E[MI]}{\frac{1}{2}(H(U) + H(V)) - E[MI]}$$
where E[MI] is the expected mutual information under a hypergeometric model of randomness (random clustering with fixed marginals).
The expected MI is computed as:
$$E[MI] = \sum_{i,j} \sum_{n_{ij}=\max(0, a_i+b_j-n)}^{\min(a_i, b_j)} \frac{n_{ij}}{n} \log\frac{n \cdot n_{ij}}{a_i b_j} \cdot P(N_{ij} = n_{ij})$$
where the probability is the hypergeometric distribution:
$$P(N_{ij} = n_{ij}) = \frac{\binom{a_i}{n_{ij}} \binom{n-a_i}{b_j-n_{ij}}}{\binom{n}{b_j}}$$
This correction ensures E[AMI] ≈ 0 for random clusterings.
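For intuition, E[MI] can be computed directly by summing the hypergeometric expectation cell by cell. The following is a slow, didactic sketch (it assumes scipy's hypergeom and scikit-learn's contingency_matrix, and skips the nᵢⱼ = 0 term because it contributes nothing); on a small example it reproduces scikit-learn's AMI:

```python
import numpy as np
from scipy.stats import hypergeom
from sklearn.metrics import mutual_info_score, adjusted_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def expected_mutual_info(a, b, n):
    """E[MI] under the hypergeometric (fixed-marginals) null model."""
    emi = 0.0
    for ai in a:
        for bj in b:
            # n_ij = 0 contributes nothing, so start the sum at max(1, ...)
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = hypergeom.pmf(nij, n, ai, bj)
                emi += p * (nij / n) * np.log(n * nij / (ai * bj))
    return emi

def H(sizes):
    """Entropy of a partition given its cluster sizes."""
    p = sizes / sizes.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

labels_true = np.repeat([0, 1, 2], 8)
rng = np.random.default_rng(0)
labels_pred = rng.integers(0, 4, size=labels_true.size)  # random 4-cluster partition

C = contingency_matrix(labels_true, labels_pred)
a, b, n = C.sum(axis=1), C.sum(axis=0), C.sum()

mi = mutual_info_score(labels_true, labels_pred)
emi = expected_mutual_info(a, b, n)
ami_manual = (mi - emi) / (0.5 * (H(a) + H(b)) - emi)

print(f"MI = {mi:.4f}, E[MI] = {emi:.4f}")
print(f"AMI (manual):  {ami_manual:.4f}")
print(f"AMI (sklearn): {adjusted_mutual_info_score(labels_true, labels_pred):.4f}")
```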
| Property | NMI | AMI |
|---|---|---|
| Range | [0, 1] | ≤ 1 (can be slightly negative) |
| Perfect agreement | 1 | 1 |
| Random clustering | 0 (biased) | ≈ 0 (unbiased) |
| Computational cost | O(n + r × s) | Higher (requires computing E[MI]) |
| Sensitivity to k | Increases with k | Stable across k |
When comparing clusterings with different numbers of clusters, or when you need a statistically grounded metric, use AMI over NMI. The chance correction is essential for fair comparisons, especially when some clusterings have many more clusters than others.
Beyond ARI and NMI/AMI, several other external metrics are used:
The Fowlkes-Mallows Index (FMI) is the geometric mean of precision and recall at the pair level:
$$FMI = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}} = \frac{TP}{\sqrt{(TP+FP)(TP+FN)}}$$
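A brief sketch (made-up labels, cross-checked against scikit-learn's fowlkes_mallows_score) shows that FMI is exactly the geometric mean of pairwise precision and recall computed from the contingency table:

```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import fowlkes_mallows_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

C = contingency_matrix(labels_true, labels_pred)
tp = comb(C, 2).sum()                      # pairs together in both partitions
fp = comb(C.sum(axis=0), 2).sum() - tp     # together in clustering only
fn = comb(C.sum(axis=1), 2).sum() - tp     # together in ground truth only

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fmi_manual = np.sqrt(precision * recall)

print(f"Pairwise precision={precision:.4f}, recall={recall:.4f}")
print(f"FMI (manual):  {fmi_manual:.4f}")
print(f"FMI (sklearn): {fowlkes_mallows_score(labels_true, labels_pred):.4f}")
```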
Homogeneity, completeness, and the V-measure form an entropy-based family of criteria inspired by precision and recall:
Homogeneity: Each cluster should contain only members of a single class $$h = 1 - \frac{H(U|V)}{H(U)}$$
Completeness: All members of a given class should be assigned to the same cluster $$c = 1 - \frac{H(V|U)}{H(V)}$$
V-Measure: Harmonic mean of homogeneity and completeness $$V = \frac{2 \cdot h \cdot c}{h + c}$$
V-measure equals NMI with arithmetic normalization.
```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    fowlkes_mallows_score,
    homogeneity_score,
    completeness_score,
    v_measure_score,
    homogeneity_completeness_v_measure
)

def comprehensive_external_evaluation(labels_true, labels_pred, name=""):
    """Compute all major external clustering metrics."""
    print(f"\n{'='*50}")
    print(f"External Metrics Evaluation: {name}")
    print('='*50)

    # Pair-counting metrics
    ari = adjusted_rand_score(labels_true, labels_pred)
    fmi = fowlkes_mallows_score(labels_true, labels_pred)

    # Information-theoretic metrics
    nmi = normalized_mutual_info_score(labels_true, labels_pred)
    ami = adjusted_mutual_info_score(labels_true, labels_pred)

    # Homogeneity/Completeness
    h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)

    print(f"\nPair-Counting Metrics:")
    print(f"  Adjusted Rand Index (ARI): {ari:.4f}")
    print(f"  Fowlkes-Mallows Index:     {fmi:.4f}")
    print(f"\nInformation-Theoretic Metrics:")
    print(f"  Normalized MI (NMI): {nmi:.4f}")
    print(f"  Adjusted MI (AMI):   {ami:.4f}")
    print(f"\nHomogeneity-Completeness:")
    print(f"  Homogeneity:  {h:.4f}")
    print(f"  Completeness: {c:.4f}")
    print(f"  V-Measure:    {v:.4f}")

    return {
        'ARI': ari, 'FMI': fmi, 'NMI': nmi, 'AMI': ami,
        'Homogeneity': h, 'Completeness': c, 'V-Measure': v
    }

# Scenario 1: Perfect clustering
print("\n" + "="*60)
print("SCENARIO 1: Perfect Clustering")
print("="*60)
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([1, 1, 1, 2, 2, 2, 0, 0, 0])  # Different labels, same structure
comprehensive_external_evaluation(labels_true, labels_pred, "Perfect (permuted labels)")

# Scenario 2: Over-clustering (too many clusters)
print("\n" + "="*60)
print("SCENARIO 2: Over-Clustering")
print("="*60)
labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels_pred = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # Each true cluster split in two
comprehensive_external_evaluation(labels_true, labels_pred, "Over-clustering")

# Scenario 3: Under-clustering (too few clusters)
print("\n" + "="*60)
print("SCENARIO 3: Under-Clustering")
print("="*60)
labels_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
labels_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # Clusters merged
comprehensive_external_evaluation(labels_true, labels_pred, "Under-clustering")

print("\n" + "="*60)
print("INTERPRETATION NOTES:")
print("="*60)
print("- Over-clustering: High homogeneity, low completeness")
print("- Under-clustering: Low homogeneity, high completeness")
print("- ARI and AMI are more balanced metrics")
```

Different metrics capture different aspects of clustering quality. Here's guidance on when to use each:
For rigorous comparison: Use ARI and AMI together. If they agree, you have robust evidence.
For diagnosing problems: Use homogeneity and completeness separately. High homogeneity but low completeness indicates over-clustering. The opposite indicates under-clustering.
For benchmarking: Report at least ARI and NMI—they're the most widely used and expected in publications.
Avoid raw RI and raw MI: Always use adjusted versions (ARI, AMI) or properly normalized versions (NMI).
Consider your domain: If certain types of errors are more costly, consider weighted or custom metrics.
ARI and AMI often correlate highly but can differ when partitions have very different numbers of clusters or when cluster sizes are imbalanced. When they disagree substantially, investigate the clustering more closely and consider which metric's assumptions better match your problem.
This page provided comprehensive coverage of external clustering validation metrics: the contingency table representation, the Rand Index and its chance-corrected form (ARI), the information-theoretic family (MI, NMI, AMI), related measures such as the Fowlkes-Mallows Index, homogeneity, completeness, and the V-measure, and practical guidance on choosing among them.
You now understand the mathematical foundations and practical usage of external clustering validation metrics. Next, we'll explore stability-based evaluation, which assesses cluster quality by measuring consistency across perturbations of the data.