In the previous page, we explored internal validation metrics that assess clustering quality without reference to ground truth. But in many scenarios—algorithm benchmarking, method validation, or labeled datasets—we have access to true class labels. External validation metrics compare the discovered cluster structure against these known labels.
External metrics answer a fundamentally different question: instead of 'Are these clusters compact and separated?' they ask 'Do these clusters match the known structure?' This makes them essential for objectively comparing clustering algorithms and validating new methods.
This page provides rigorous coverage of the two most important external metrics: the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). We'll derive them from first principles, understand their properties, and learn when to use each.
By the end of this page, you will understand: (1) the contingency table representation of clustering comparisons, (2) the Rand Index and why adjustment for chance is necessary, (3) the information-theoretic foundations of Mutual Information, (4) different normalization schemes (NMI, AMI, V-measure), and (5) practical guidance for choosing between metrics.
All external clustering metrics can be understood through the contingency table, which cross-tabulates two partitions of the same data.
Let U = {U₁, U₂, ..., Uᵣ} be the ground truth partition with r classes, and V = {V₁, V₂, ..., Vₛ} be the clustering with s clusters. The contingency table N has entries:
$$n_{ij} = |U_i \cap V_j|$$
the number of points that belong to both true class i and cluster j.
|  | Cluster V₁ | Cluster V₂ | Cluster V₃ | Row Sum |
|---|---|---|---|---|
| Class U₁ | 40 | 5 | 5 | 50 |
| Class U₂ | 3 | 45 | 2 | 50 |
| Class U₃ | 2 | 5 | 43 | 50 |
| Col Sum | 45 | 55 | 50 | n=150 |
This table tells us that V₁ mostly captures U₁, V₂ mostly captures U₂, and V₃ mostly captures U₃. A perfect clustering would have all mass on the diagonal.
From the contingency table, we compute:
Marginal sums: the row sums aᵢ = Σⱼ nᵢⱼ = |Uᵢ| (the class sizes) and the column sums bⱼ = Σᵢ nᵢⱼ = |Vⱼ| (the cluster sizes).
Total: n = Σᵢ aᵢ = Σⱼ bⱼ = Σᵢⱼ nᵢⱼ
These marginals are crucial for computing both pair-counting metrics (Rand Index) and information-theoretic metrics (Mutual Information).
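To make the marginals concrete, here is a minimal numpy sketch that reproduces the example table above and reads off aᵢ, bⱼ, and n (the array literal simply transcribes the table; nothing else is assumed):

```python
import numpy as np

# The example contingency table above: rows are true classes U1..U3,
# columns are clusters V1..V3
N = np.array([[40,  5,  5],
              [ 3, 45,  2],
              [ 2,  5, 43]])

a = N.sum(axis=1)   # class sizes a_i (row sums)    -> [50 50 50]
b = N.sum(axis=0)   # cluster sizes b_j (column sums) -> [45 55 50]
n = N.sum()         # total number of points        -> 150

print("a_i =", a, " b_j =", b, " n =", n)
```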
Most external metrics are symmetric: comparing U to V gives the same result as comparing V to U. This is important because it means the metric measures agreement between partitions without assuming one is 'correct.'
The Rand Index (RI), introduced by William Rand in 1971, takes a pair-counting approach: it considers all possible pairs of data points and asks whether the two partitions agree on each pair.
For any pair of points (xᵢ, xⱼ), there are four possibilities:
| Ground Truth | Clustering | Description |
|---|---|---|
| Same class | Same cluster | True Positive (TP) — Correctly grouped together |
| Different class | Different cluster | True Negative (TN) — Correctly separated |
| Same class | Different cluster | False Negative (FN) — Should be together, separated |
| Different class | Same cluster | False Positive (FP) — Should be separate, grouped |
From the contingency table:
TP: Pairs that are together in both partitions: $$TP = \sum_{i,j} \binom{n_{ij}}{2}$$
FP: Pairs together in clustering but not in ground truth: $$FP = \sum_j \binom{b_j}{2} - TP$$
FN: Pairs together in ground truth but not in clustering: $$FN = \sum_i \binom{a_i}{2} - TP$$
TN: Pairs separated in both: $$TN = \binom{n}{2} - TP - FP - FN$$
The Rand Index is simply the accuracy of pair classifications:
$$RI = \frac{TP + TN}{TP + FP + FN + TN} = \frac{TP + TN}{\binom{n}{2}}$$
Range: [0, 1], where 1 indicates perfect agreement and 0 indicates complete disagreement (the partitions disagree on every single pair, which essentially never occurs in practice).
Interpretation: The fraction of all pairs on which the two partitions agree.
```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import rand_score

def contingency_table(labels_true, labels_pred):
    """Construct the contingency table from two label vectors."""
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    n_classes = len(classes)
    n_clusters = len(clusters)

    # Map labels to indices
    class_to_idx = {c: i for i, c in enumerate(classes)}
    cluster_to_idx = {c: i for i, c in enumerate(clusters)}

    # Build contingency table
    table = np.zeros((n_classes, n_clusters), dtype=int)
    for true_label, pred_label in zip(labels_true, labels_pred):
        i = class_to_idx[true_label]
        j = cluster_to_idx[pred_label]
        table[i, j] += 1

    return table, classes, clusters

def rand_index_from_contingency(table):
    """Compute Rand Index from contingency table."""
    n = table.sum()

    # Row and column sums
    a = table.sum(axis=1)  # Class sizes
    b = table.sum(axis=0)  # Cluster sizes

    # TP: pairs together in both
    tp = sum(comb(table[i, j], 2)
             for i in range(table.shape[0])
             for j in range(table.shape[1]))

    # Sum of C(a_i, 2) for all classes
    sum_a = sum(comb(ai, 2) for ai in a)

    # Sum of C(b_j, 2) for all clusters
    sum_b = sum(comb(bj, 2) for bj in b)

    # FP and FN
    fp = sum_b - tp
    fn = sum_a - tp

    # Total pairs
    total_pairs = comb(n, 2)

    # TN
    tn = total_pairs - tp - fp - fn

    # Rand Index
    ri = (tp + tn) / total_pairs

    return ri, {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn}

# Example
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

table, classes, clusters = contingency_table(labels_true, labels_pred)
print("Contingency Table:")
print(table)

ri, counts = rand_index_from_contingency(table)
print(f"\nPair Counts: {counts}")
print(f"Rand Index: {ri:.4f}")

# Verify with sklearn
ri_sklearn = rand_score(labels_true, labels_pred)
print(f"Sklearn Rand Index: {ri_sklearn:.4f}")
```

The raw Rand Index has a significant flaw: it does not account for chance agreement. Even random clusterings will achieve non-zero RI scores, and the expected value under random clustering is not zero.
Consider two independent random partitions of the data into k roughly equal-sized clusters each. For a random pair of points, each partition places the pair in the same cluster with probability about 1/k, so the probability that the two partitions agree on the pair is approximately:
$$E[RI] \approx \frac{1}{k^2} + \left(1 - \frac{1}{k}\right)^2 = 1 - \frac{2}{k} + \frac{2}{k^2}$$
For k = 10, E[RI] ≈ 0.82! This means a completely random clustering appears to have 82% 'agreement' with ground truth.
This inflated baseline makes the RI misleading, as the following demonstration shows.
```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

def demonstrate_chance_problem():
    """Demonstrate that random clusterings get high Rand Index."""
    np.random.seed(42)
    n = 1000
    n_classes = 10

    # Random ground truth
    labels_true = np.random.randint(0, n_classes, n)

    # Multiple random clusterings
    random_scores = []
    for _ in range(100):
        labels_random = np.random.randint(0, n_classes, n)
        ri = rand_score(labels_true, labels_random)
        ari = adjusted_rand_score(labels_true, labels_random)
        random_scores.append((ri, ari))

    ri_mean = np.mean([s[0] for s in random_scores])
    ri_std = np.std([s[0] for s in random_scores])
    ari_mean = np.mean([s[1] for s in random_scores])
    ari_std = np.std([s[1] for s in random_scores])

    print("Random Clustering vs Random Ground Truth")
    print(f"  Rand Index:          {ri_mean:.4f} ± {ri_std:.4f}")
    print(f"  Adjusted Rand Index: {ari_mean:.4f} ± {ari_std:.4f}")
    print()
    print("Expected values:")
    # Theoretical approximation: 1/k^2 + (1 - 1/k)^2
    print(f"  RI for random:  ~{1 - 2/n_classes + 2/n_classes**2:.4f} (theoretical)")
    print(f"  ARI for random: ~0.0 (by design)")

demonstrate_chance_problem()
```

The raw Rand Index should almost never be used in practice. Its high baseline for random clusterings makes it unsuitable for meaningful evaluation. Always use the Adjusted Rand Index (ARI) instead, which corrects for chance agreement.
The Adjusted Rand Index (ARI), introduced by Hubert and Arabie in 1985, corrects the RI for chance using a hypergeometric model of randomness:
$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
This adjustment ensures that the expected ARI is 0 for random labelings with the same marginals, and that the maximum value of 1 is attained only when the partitions are identical (up to label permutation).
Let the contingency table be N with entries nᵢⱼ, row sums aᵢ, and column sums bⱼ. Define:
$$Index = \sum_{i,j} \binom{n_{ij}}{2}$$
$$ExpectedIndex = \frac{\sum_i \binom{a_i}{2} \cdot \sum_j \binom{b_j}{2}}{\binom{n}{2}}$$
$$MaxIndex = \frac{1}{2}\left(\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right)$$
Then:
$$ARI = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}$$
This formulation directly computes the chance-adjusted score.
```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score

def adjusted_rand_index_manual(labels_true, labels_pred):
    """Manual implementation of ARI for understanding the formula."""
    # Build contingency table
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    class_to_idx = {c: i for i, c in enumerate(classes)}
    cluster_to_idx = {c: i for i, c in enumerate(clusters)}

    n = len(labels_true)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for true_label, pred_label in zip(labels_true, labels_pred):
        i = class_to_idx[true_label]
        j = cluster_to_idx[pred_label]
        table[i, j] += 1

    # Marginal sums
    a = table.sum(axis=1)  # Row sums (class sizes)
    b = table.sum(axis=0)  # Column sums (cluster sizes)

    # Index: sum of C(n_ij, 2)
    index = sum(comb(table[i, j], 2)
                for i in range(table.shape[0])
                for j in range(table.shape[1]))

    # Sum of C(a_i, 2)
    sum_a = sum(comb(ai, 2) for ai in a)

    # Sum of C(b_j, 2)
    sum_b = sum(comb(bj, 2) for bj in b)

    # Total pairs
    n_pairs = comb(n, 2)

    # Expected index under null hypothesis (random permutation)
    expected_index = (sum_a * sum_b) / n_pairs

    # Maximum index
    max_index = 0.5 * (sum_a + sum_b)

    # ARI (degenerate case: a single cluster/class on both sides)
    if max_index == expected_index:
        return 1.0, {}

    ari = (index - expected_index) / (max_index - expected_index)

    return ari, {
        'index': index,
        'expected_index': expected_index,
        'max_index': max_index,
        'sum_a': sum_a,
        'sum_b': sum_b,
        'n_pairs': n_pairs
    }

# Example with varying agreement levels
examples = [
    ("Perfect match",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]),
    ("Good match",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 0, 1, 1, 1, 1, 2, 2, 2]),
    ("Poor match",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]),
    ("Permuted labels",
     [0, 0, 0, 1, 1, 1, 2, 2, 2], [2, 2, 2, 0, 0, 0, 1, 1, 1]),
]

print("ARI Examples:\n")
for name, true, pred in examples:
    true = np.array(true)
    pred = np.array(pred)
    ari_manual, details = adjusted_rand_index_manual(true, pred)
    ari_sklearn = adjusted_rand_score(true, pred)
    print(f"{name}:")
    print(f"  True: {true}")
    print(f"  Pred: {pred}")
    print(f"  ARI:  {ari_sklearn:.4f}")
    print()

# Key property: ARI is invariant to label permutation!
print("Note: 'Perfect match' and 'Permuted labels' both get ARI = 1.0")
print("This label invariance is crucial - we care about structure, not label names.")
```

A crucial property of ARI (and all proper external metrics) is label invariance: the score depends only on how points are grouped, not on the labels assigned to groups. Swapping cluster labels 0 and 1 doesn't change the ARI. This is essential because cluster labels are arbitrary.
Range: Theoretically [-0.5, 1]; in practice usually within [-0.1, 1]
Chance-adjusted: E[ARI] = 0 for random clustering with fixed marginals
Symmetric: ARI(U, V) = ARI(V, U)
Label-invariant: Renaming labels doesn't change the score
Does not require matching cluster counts: Can compare partitions with different numbers of clusters, as the sketch below illustrates
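A short sketch using scikit-learn's adjusted_rand_score on small made-up label vectors illustrates three of these properties: label invariance, symmetry, and comparison across different cluster counts.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Label invariance: renaming the clusters does not change the score
renamed = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
print("ARI(truth, renamed):", adjusted_rand_score(truth, renamed))   # 1.0

# Symmetry: ARI(U, V) == ARI(V, U)
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])
print("Symmetric:", np.isclose(adjusted_rand_score(truth, pred),
                               adjusted_rand_score(pred, truth)))    # True

# Different numbers of clusters: the comparison is still well-defined
five_clusters = np.array([0, 0, 1, 2, 2, 3, 4, 4, 4])
print("ARI, 5 clusters vs 3 classes:",
      round(adjusted_rand_score(truth, five_clusters), 4))
```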
| ARI Range | Interpretation | Typical Scenario |
|---|---|---|
| 1.0 | Perfect agreement | Identical partitions (up to label permutation) |
| 0.8 – 1.0 | Strong agreement | Minor disagreements on boundaries |
| 0.5 – 0.8 | Moderate agreement | Captures main structure with some errors |
| 0.2 – 0.5 | Weak agreement | Some correspondence but substantial differences |
| 0.0 – 0.2 | Near-chance | Little meaningful correspondence |
| < 0 | Worse than random | Actively anti-correlated (very rare) |
ARI is sensitive to: how points in large clusters are grouped, since large clusters contribute the bulk of all pairs, and to splits or merges of the dominant clusters.
ARI is less sensitive to: misassignments confined to small clusters, the particular labels assigned to each cluster, and the ordering of clusters.
1. Large cluster bias: Errors in large clusters affect more pairs, so they're penalized more heavily. This may or may not be desirable (the sketch after this list makes the effect concrete).
2. Doesn't distinguish error types: A point assigned to the wrong cluster is equally bad regardless of which wrong cluster.
3. Number of clusters: ARI can favor solutions with similar numbers of clusters as the ground truth, even when this matches by coincidence.
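To make the large-cluster bias concrete, the hypothetical example below compares two single structural errors with adjusted_rand_score: splitting an 80-point cluster in half versus merging two 10-point clusters. The cluster sizes are arbitrary; only the qualitative gap matters.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Ground truth: one large cluster (80 points) and two small ones (10 each)
truth = np.array([0] * 80 + [1] * 10 + [2] * 10)

# Error A: split the large cluster in half (breaks many pairs)
split_large = truth.copy()
split_large[:40] = 3

# Error B: merge the two small clusters (affects far fewer pairs)
merge_small = truth.copy()
merge_small[truth == 2] = 1

print("Split the large cluster:  ARI =",
      round(adjusted_rand_score(truth, split_large), 3))   # ~0.41
print("Merge the small clusters: ARI =",
      round(adjusted_rand_score(truth, merge_small), 3))   # ~0.95
```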
The second major family of external metrics is based on information theory, treating clusterings as random variables and measuring the information shared between them.
The entropy of a partition U measures its uncertainty:
$$H(U) = -\sum_{i=1}^{r} P(U_i) \log P(U_i) = -\sum_{i=1}^{r} \frac{a_i}{n} \log \frac{a_i}{n}$$
where aᵢ = |Uᵢ| is the size of class i.
Interpretation: Entropy measures how 'spread out' the distribution is. Maximum entropy occurs when all classes are equally sized. Minimum entropy (0) occurs when all points are in one class.
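As a quick numerical check (a minimal sketch with made-up label vectors), a balanced three-class partition attains the maximum entropy log 3, while a one-class partition has entropy 0:

```python
import numpy as np

def partition_entropy(labels):
    """Entropy of a partition, in nats."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # (+0.0 normalizes the -0.0 that appears in the single-class case)
    return -np.sum(p * np.log(p)) + 0.0

balanced = np.repeat([0, 1, 2], 10)       # three equal classes of 10 points
skewed   = np.array([0] * 28 + [1, 2])    # one dominant class
single   = np.zeros(30, dtype=int)        # all points in one class

print(f"Balanced: {partition_entropy(balanced):.4f}  (log 3 = {np.log(3):.4f})")
print(f"Skewed:   {partition_entropy(skewed):.4f}")
print(f"Single:   {partition_entropy(single):.4f}")
```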
Joint entropy of U and V: $$H(U, V) = -\sum_{i,j} P(U_i \cap V_j) \log P(U_i \cap V_j) = -\sum_{i,j} \frac{n_{ij}}{n} \log \frac{n_{ij}}{n}$$
Conditional entropy of U given V: $$H(U|V) = H(U, V) - H(V) = -\sum_{i,j} \frac{n_{ij}}{n} \log \frac{n_{ij}}{b_j}$$
Interpretation: H(U|V) measures how much uncertainty remains about U after observing V. If V perfectly predicts U, then H(U|V) = 0.
The choice of logarithm base (2, e, or 10) affects the scale of entropy values but not the normalized metrics. Log base 2 gives entropy in 'bits', natural log in 'nats'. Scikit-learn uses natural log by default.
Mutual Information (MI) measures the information shared between two partitions—how much knowing one reduces uncertainty about the other:
$$MI(U, V) = \sum_{i=1}^{r} \sum_{j=1}^{s} P(U_i \cap V_j) \log \frac{P(U_i \cap V_j)}{P(U_i) P(V_j)}$$
$$= \sum_{i,j} \frac{n_{ij}}{n} \log \frac{n \cdot n_{ij}}{a_i \cdot b_j}$$
MI can be expressed as:
$$MI(U, V) = H(U) - H(U|V) = H(V) - H(V|U) = H(U) + H(V) - H(U, V)$$
Interpretation: MI is the reduction in uncertainty about U when V is known (and vice versa).
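These identities are easy to verify numerically. The sketch below builds the joint distribution with scikit-learn's contingency_matrix on small made-up label vectors and checks that MI = H(U) + H(V) − H(U, V):

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2])

# Joint distribution P(U_i, V_j) from the contingency table
P = contingency_matrix(labels_true, labels_pred) / len(labels_true)

def H(p):
    """Entropy of a probability vector (zero entries are skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h_u = H(P.sum(axis=1))   # H(U) from row marginals
h_v = H(P.sum(axis=0))   # H(V) from column marginals
h_uv = H(P.ravel())      # joint entropy H(U, V)

mi_identity = h_u + h_v - h_uv
mi_sklearn = mutual_info_score(labels_true, labels_pred)

print(f"H(U)={h_u:.4f}  H(V)={h_v:.4f}  H(U,V)={h_uv:.4f}")
print(f"MI via identity: {mi_identity:.4f}")
print(f"MI via sklearn:  {mi_sklearn:.4f}")
print(f"H(U|V) = H(U,V) - H(V) = {h_uv - h_v:.4f}")
```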
Raw MI is not directly comparable across different datasets or clusterings because its scale depends on the entropies involved. A clustering with more clusters will have higher entropy and potentially higher MI, even if the agreement is proportionally worse.
```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score

def compute_entropy(labels):
    """Compute entropy of a label distribution."""
    n = len(labels)
    unique, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

def compute_mi_manual(labels_true, labels_pred):
    """Manual Mutual Information computation."""
    n = len(labels_true)

    # Joint distribution
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)

    mi = 0
    for c in classes:
        for k in clusters:
            # Joint probability
            p_ij = np.sum((labels_true == c) & (labels_pred == k)) / n
            if p_ij == 0:
                continue
            # Marginal probabilities
            p_i = np.sum(labels_true == c) / n
            p_j = np.sum(labels_pred == k) / n
            mi += p_ij * np.log(p_ij / (p_i * p_j))

    return mi

# Example
labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2])

h_true = compute_entropy(labels_true)
h_pred = compute_entropy(labels_pred)
mi_manual = compute_mi_manual(labels_true, labels_pred)
mi_sklearn = mutual_info_score(labels_true, labels_pred)

print("Information-Theoretic Analysis:")
print(f"  H(true): {h_true:.4f} nats")
print(f"  H(pred): {h_pred:.4f} nats")
print(f"  MI: {mi_sklearn:.4f} nats")
print(f"  MI (manual): {mi_manual:.4f} nats")
print()
print(f"  MI / H(true): {mi_sklearn / h_true:.4f}")
print(f"  MI / H(pred): {mi_sklearn / h_pred:.4f}")
print()

# Show why normalization is needed
print("Effect of cluster count on MI:")
for n_clusters in [3, 5, 10, 20]:
    random_labels = np.random.randint(0, n_clusters, len(labels_true))
    mi = mutual_info_score(labels_true, random_labels)
    nmi = normalized_mutual_info_score(labels_true, random_labels)
    print(f"  k={n_clusters}: MI={mi:.4f}, NMI={nmi:.4f}")
```

Normalized Mutual Information (NMI) scales MI to the [0, 1] range, making it comparable across different datasets and clusterings. Several normalization schemes exist:
1. Geometric mean normalization: $$NMI_{sqrt} = \frac{MI(U, V)}{\sqrt{H(U) \cdot H(V)}}$$
2. Arithmetic mean normalization: $$NMI_{avg} = \frac{2 \cdot MI(U, V)}{H(U) + H(V)}$$
3. Minimum normalization: $$NMI_{min} = \frac{MI(U, V)}{\min(H(U), H(V))}$$
4. Maximum normalization: $$NMI_{max} = \frac{MI(U, V)}{\max(H(U), H(V))}$$
The most commonly used is the arithmetic mean (also called 'sum' normalization).
Like raw Rand Index, raw NMI has a non-zero expected value under random clustering. The expected value depends on the number of clusters and can be substantial. This led to the development of Adjusted Mutual Information.
```python
import numpy as np
from sklearn.metrics import (
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score
)

def compare_nmi_variants(labels_true, labels_pred):
    """Compare different NMI normalization schemes."""
    mi = mutual_info_score(labels_true, labels_pred)

    # Compute entropies
    def entropy(labels):
        n = len(labels)
        _, counts = np.unique(labels, return_counts=True)
        probs = counts / n
        return -np.sum(probs * np.log(probs))

    h_true = entropy(labels_true)
    h_pred = entropy(labels_pred)

    # Different normalizations (these mirror the sklearn results below)
    nmi_geometric = mi / np.sqrt(h_true * h_pred)
    nmi_arithmetic = 2 * mi / (h_true + h_pred)
    nmi_min = mi / min(h_true, h_pred)
    nmi_max = mi / max(h_true, h_pred)

    # Sklearn variants
    nmi_sklearn_geometric = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='geometric')
    nmi_sklearn_arithmetic = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='arithmetic')
    nmi_sklearn_min = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='min')
    nmi_sklearn_max = normalized_mutual_info_score(
        labels_true, labels_pred, average_method='max')

    ami = adjusted_mutual_info_score(labels_true, labels_pred)

    print("NMI Variants Comparison:")
    print(f"  MI: {mi:.4f}")
    print(f"  H(true): {h_true:.4f}, H(pred): {h_pred:.4f}")
    print()
    print(f"  NMI (geometric):  {nmi_sklearn_geometric:.4f}")
    print(f"  NMI (arithmetic): {nmi_sklearn_arithmetic:.4f}")
    print(f"  NMI (min):        {nmi_sklearn_min:.4f}")
    print(f"  NMI (max):        {nmi_sklearn_max:.4f}")
    print(f"  AMI (adjusted):   {ami:.4f}")

    return {
        'MI': mi,
        'NMI_geometric': nmi_sklearn_geometric,
        'NMI_arithmetic': nmi_sklearn_arithmetic,
        'NMI_min': nmi_sklearn_min,
        'NMI_max': nmi_sklearn_max,
        'AMI': ami
    }

# Test with different scenarios
print("\n=== Perfect Agreement ===")
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # Permuted labels
compare_nmi_variants(labels_true, labels_pred)

print("\n=== Partial Agreement ===")
labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
labels_pred = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
compare_nmi_variants(labels_true, labels_pred)

print("\n=== Random Clustering (Chance) ===")
np.random.seed(42)
labels_true = np.repeat([0, 1, 2, 3, 4], 20)
labels_pred = np.random.permutation(labels_true)
compare_nmi_variants(labels_true, labels_pred)
```

Adjusted Mutual Information (AMI) applies a chance correction to NMI, analogous to how ARI corrects RI:
$$AMI = \frac{MI - E[MI]}{\frac{1}{2}(H(U) + H(V)) - E[MI]}$$
where E[MI] is the expected mutual information under a hypergeometric model of randomness (random clustering with fixed marginals).
The expected MI is computed as:
$$E[MI] = \sum_{i,j} \sum_{n_{ij}=\max(0, a_i+b_j-n)}^{\min(a_i, b_j)} \frac{n_{ij}}{n} \log\frac{n \cdot n_{ij}}{a_i b_j} \cdot P(N_{ij} = n_{ij})$$
where the probability is the hypergeometric distribution:
$$P(N_{ij} = n_{ij}) = \frac{\binom{a_i}{n_{ij}} \binom{n-a_i}{b_j-n_{ij}}}{\binom{n}{b_j}}$$
This correction ensures E[AMI] ≈ 0 for random clusterings.
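For intuition, E[MI] can be computed directly by summing the hypergeometric expectation cell by cell. The following is a slow, didactic sketch (it assumes scipy's hypergeom and scikit-learn's contingency_matrix, and skips the nᵢⱼ = 0 term because it contributes nothing); on a small example it reproduces scikit-learn's AMI:

```python
import numpy as np
from scipy.stats import hypergeom
from sklearn.metrics import mutual_info_score, adjusted_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def expected_mutual_info(a, b, n):
    """E[MI] under the hypergeometric (fixed-marginals) null model."""
    emi = 0.0
    for ai in a:
        for bj in b:
            # n_ij = 0 contributes nothing, so start the sum at max(1, ...)
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = hypergeom.pmf(nij, n, ai, bj)
                emi += p * (nij / n) * np.log(n * nij / (ai * bj))
    return emi

def H(sizes):
    """Entropy of a partition given its cluster sizes."""
    p = sizes / sizes.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

labels_true = np.repeat([0, 1, 2], 8)
rng = np.random.default_rng(0)
labels_pred = rng.integers(0, 4, size=labels_true.size)  # random 4-cluster partition

C = contingency_matrix(labels_true, labels_pred)
a, b, n = C.sum(axis=1), C.sum(axis=0), C.sum()

mi = mutual_info_score(labels_true, labels_pred)
emi = expected_mutual_info(a, b, n)
ami_manual = (mi - emi) / (0.5 * (H(a) + H(b)) - emi)

print(f"MI = {mi:.4f}, E[MI] = {emi:.4f}")
print(f"AMI (manual):  {ami_manual:.4f}")
print(f"AMI (sklearn): {adjusted_mutual_info_score(labels_true, labels_pred):.4f}")
```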
| Property | NMI | AMI |
|---|---|---|
| Range | [0, 1] | ≤ 1 (can be slightly negative) |
| Perfect agreement | 1 | 1 |
| Random clustering | 0 (biased) | ≈ 0 (unbiased) |
| Computational cost | O(n + r × s) | Higher (requires computing E[MI]) |
| Sensitivity to k | Increases with k | Stable across k |
When comparing clusterings with different numbers of clusters, or when you need a statistically grounded metric, use AMI over NMI. The chance correction is essential for fair comparisons, especially when some clusterings have many more clusters than others.
Beyond ARI and NMI/AMI, several other external metrics are used:
The Fowlkes-Mallows Index (FMI) is the geometric mean of precision and recall at the pair level:
$$FMI = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}} = \frac{TP}{\sqrt{(TP+FP)(TP+FN)}}$$
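A brief sketch (made-up labels, cross-checked against scikit-learn's fowlkes_mallows_score) shows that FMI is exactly the geometric mean of pairwise precision and recall computed from the contingency table:

```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import fowlkes_mallows_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

C = contingency_matrix(labels_true, labels_pred)
tp = comb(C, 2).sum()                      # pairs together in both partitions
fp = comb(C.sum(axis=0), 2).sum() - tp     # together in clustering only
fn = comb(C.sum(axis=1), 2).sum() - tp     # together in ground truth only

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fmi_manual = np.sqrt(precision * recall)

print(f"Pairwise precision={precision:.4f}, recall={recall:.4f}")
print(f"FMI (manual):  {fmi_manual:.4f}")
print(f"FMI (sklearn): {fowlkes_mallows_score(labels_true, labels_pred):.4f}")
```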
Homogeneity, completeness, and the V-measure form an entropy-based family of criteria inspired by precision and recall:
Homogeneity: Each cluster should contain only members of a single class $$h = 1 - \frac{H(U|V)}{H(U)}$$
Completeness: All members of a given class should be assigned to the same cluster $$c = 1 - \frac{H(V|U)}{H(V)}$$
V-Measure: Harmonic mean of homogeneity and completeness $$V = \frac{2 \cdot h \cdot c}{h + c}$$
V-measure equals NMI with arithmetic normalization.
```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    fowlkes_mallows_score,
    homogeneity_score,
    completeness_score,
    v_measure_score,
    homogeneity_completeness_v_measure
)

def comprehensive_external_evaluation(labels_true, labels_pred, name=""):
    """Compute all major external clustering metrics."""
    print(f"\n{'='*50}")
    print(f"External Metrics Evaluation: {name}")
    print('='*50)

    # Pair-counting metrics
    ari = adjusted_rand_score(labels_true, labels_pred)
    fmi = fowlkes_mallows_score(labels_true, labels_pred)

    # Information-theoretic metrics
    nmi = normalized_mutual_info_score(labels_true, labels_pred)
    ami = adjusted_mutual_info_score(labels_true, labels_pred)

    # Homogeneity/Completeness
    h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)

    print(f"\nPair-Counting Metrics:")
    print(f"  Adjusted Rand Index (ARI): {ari:.4f}")
    print(f"  Fowlkes-Mallows Index:     {fmi:.4f}")
    print(f"\nInformation-Theoretic Metrics:")
    print(f"  Normalized MI (NMI): {nmi:.4f}")
    print(f"  Adjusted MI (AMI):   {ami:.4f}")
    print(f"\nHomogeneity-Completeness:")
    print(f"  Homogeneity:  {h:.4f}")
    print(f"  Completeness: {c:.4f}")
    print(f"  V-Measure:    {v:.4f}")

    return {
        'ARI': ari, 'FMI': fmi, 'NMI': nmi, 'AMI': ami,
        'Homogeneity': h, 'Completeness': c, 'V-Measure': v
    }

# Scenario 1: Perfect clustering
print("\n" + "="*60)
print("SCENARIO 1: Perfect Clustering")
print("="*60)
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([1, 1, 1, 2, 2, 2, 0, 0, 0])  # Different labels, same structure
comprehensive_external_evaluation(labels_true, labels_pred, "Perfect (permuted labels)")

# Scenario 2: Over-clustering (too many clusters)
print("\n" + "="*60)
print("SCENARIO 2: Over-Clustering")
print("="*60)
labels_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels_pred = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # Each true cluster split in two
comprehensive_external_evaluation(labels_true, labels_pred, "Over-clustering")

# Scenario 3: Under-clustering (too few clusters)
print("\n" + "="*60)
print("SCENARIO 3: Under-Clustering")
print("="*60)
labels_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
labels_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # Clusters merged
comprehensive_external_evaluation(labels_true, labels_pred, "Under-clustering")

print("\n" + "="*60)
print("INTERPRETATION NOTES:")
print("="*60)
print("- Over-clustering: High homogeneity, low completeness")
print("- Under-clustering: Low homogeneity, high completeness")
print("- ARI and AMI are more balanced metrics")
```

Different metrics capture different aspects of clustering quality. Here's guidance on when to use each:
For rigorous comparison: Use ARI and AMI together. If they agree, you have robust evidence.
For diagnosing problems: Use homogeneity and completeness separately. High homogeneity but low completeness indicates over-clustering. The opposite indicates under-clustering.
For benchmarking: Report at least ARI and NMI—they're the most widely used and expected in publications.
Avoid raw RI and raw MI: Always use adjusted versions (ARI, AMI) or properly normalized versions (NMI).
Consider your domain: If certain types of errors are more costly, consider weighted or custom metrics.
ARI and AMI often correlate highly but can differ when partitions have very different numbers of clusters or when cluster sizes are imbalanced. When they disagree substantially, investigate the clustering more closely and consider which metric's assumptions better match your problem.
This page provided comprehensive coverage of external clustering validation metrics: the contingency table representation, the Rand Index and its chance-corrected form (ARI), the information-theoretic family (MI, NMI, AMI), related measures such as the Fowlkes-Mallows Index, homogeneity, completeness, and the V-measure, and practical guidance on choosing among them.
You now understand the mathematical foundations and practical usage of external clustering validation metrics. Next, we'll explore stability-based evaluation, which assesses cluster quality by measuring consistency across perturbations of the data.