When performing unsupervised learning, one of the most challenging aspects is determining how good a clustering solution actually is. Unlike supervised learning where we have ground truth labels, clustering requires intrinsic metrics that evaluate cluster quality based solely on the data structure.
The Variance Ratio Criterion (VRC), also known as the Calinski-Harabasz Index, is a powerful internal evaluation metric that quantifies cluster separation quality. It operates on a fundamental principle: good clustering should produce compact, well-separated groups—points within the same cluster should be close together, while points in different clusters should be far apart.
The VRC score is computed as the ratio of between-cluster dispersion to within-cluster dispersion, normalized by degrees of freedom:
$$VRC = \frac{B_k / (k - 1)}{W_k / (n - k)}$$
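Where $B_k$ is the between-cluster dispersion, $W_k$ is the within-cluster dispersion, $k$ is the number of clusters, and $n$ is the total number of data points.
The between-cluster dispersion $B_k$ measures how spread out the cluster centroids are from the global centroid: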
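$$B_k = \sum_{q=1}^{k} n_q \cdot \|c_q - c\|^2$$
Where $n_q$ is the number of points in cluster $q$, $c_q$ is the centroid of cluster $q$, and $c$ is the global centroid of the whole dataset.
The within-cluster dispersion $W_k$ measures how tightly packed points are within their respective clusters, with $C_q$ denoting the set of points assigned to cluster $q$:
$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} \|x - c_q\|^2$$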
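The two dispersion sums can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `dispersions` and its helpers are hypothetical, and it assumes `X` is a non-empty list of equal-length numeric points with one label per point.

```python
from collections import defaultdict


def dispersions(X, labels):
    """Return (B_k, W_k) for points X and their cluster labels.

    Hypothetical helper for illustration; assumes X is non-empty.
    """
    dim = len(X[0])

    # Group points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(X, labels):
        clusters[label].append(point)

    def centroid(points):
        # Coordinate-wise mean of a list of points.
        return [sum(p[d] for p in points) / len(points) for d in range(dim)]

    def sq_dist(a, b):
        # Squared Euclidean distance between two points.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    c = centroid(X)  # global centroid
    b_k = 0.0
    w_k = 0.0
    for points in clusters.values():
        c_q = centroid(points)
        b_k += len(points) * sq_dist(c_q, c)         # n_q * ||c_q - c||^2
        w_k += sum(sq_dist(x, c_q) for x in points)  # sum of ||x - c_q||^2
    return b_k, w_k


X = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
labels = [0, 0, 0, 1, 1, 1]
b_k, w_k = dispersions(X, labels)
print(round(b_k, 4), round(w_k, 4))  # 300.0 2.6667
```

Grouping points by label first keeps each centroid computation to a single pass over its cluster, which is enough for small inputs like the examples here.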
Implement a function that computes the Variance Ratio Criterion given a dataset and its cluster assignments. Handle edge cases appropriately and return the score rounded to 4 decimal places.
X = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
labels = [0, 0, 0, 1, 1, 1]
Output: 450.0
Dataset Analysis: The dataset contains 6 points in 2D space, divided into 2 distinct clusters.
Cluster 0: Points near the origin
Cluster 1: Points near (10, 10)
Global Centroid: c = (5.333, 5.333)
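Within-Cluster Dispersion (W_k): each cluster contributes 2/9 + 5/9 + 5/9 = 4/3 in squared distances to its own centroid, so W_k = 4/3 + 4/3 = 8/3 ≈ 2.6667.
Between-Cluster Dispersion (B_k): the cluster centroids (0.333, 0.333) and (10.333, 10.333) each lie at squared distance 5² + 5² = 50 from the global centroid, so B_k = 3 · 50 + 3 · 50 = 300.
Result: VRC = (300 / (2 - 1)) / ((8/3) / (6 - 2)) = 300 / 0.6667 = 450.0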
This high score indicates excellent cluster separation—the clusters are tight internally and well-separated from each other.
X = [[0.445, -1.0836], [0.1704, -0.892], [0.395, -0.3682], [9.8784, 9.5017], [9.1809, 8.9391], [9.6732, 10.102], [20.4938, 20.3002], [20.7388, 19.6266], [20.3168, 20.5986]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
Output: 2076.7937
Dataset Analysis: 9 points in 2D space divided into 3 clusters, each with very tight groupings.
Cluster 0: Points near origin (lower-left region)
Cluster 1: Points near (9.5, 9.5) (middle region)
Cluster 2: Points near (20.5, 20.2) (upper-right region)
Key Observations:
Result: VRC = 2076.7937
This very high score reflects the exceptional cluster quality: extremely compact clusters with large separation distances between them.
X = [[9.1443, -3.2681], [-8.1451, -8.0657], [6.9499, 2.0745], [6.1426, 4.5946], [0.7246, 9.4623]]
labels = [0, 0, 0, 0, 0]
Output: 0.0
Edge Case: Single Cluster
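When all data points are assigned to the same cluster (k = 1), the Variance Ratio Criterion cannot be meaningfully computed because the between-cluster degrees of freedom k - 1 equal zero, so the numerator B_k / (k - 1) is undefined. B_k is itself zero in this case, since the only cluster centroid coincides with the global centroid.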
By convention, we return 0.0 for this edge case, indicating that the clustering provides no information about cluster quality.
Similarly, if k = n (each point is its own cluster), we return 0.0: every point coincides with its own centroid, so W_k = 0, and n - k = 0, which leaves the denominator W_k / (n - k) undefined.
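One way to put the formula and the edge cases above together is sketched below in plain Python. It is a hedged illustration, not the official solution: the function name `variance_ratio_criterion` is hypothetical, and returning 0.0 when W_k is zero for 1 < k < n (for example, when all points in each cluster are duplicates) is an assumption the text does not cover.

```python
from collections import defaultdict


def variance_ratio_criterion(X, labels):
    """Calinski-Harabasz score rounded to 4 decimals; 0.0 for degenerate cases."""
    n = len(X)

    # Group points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(X, labels):
        clusters[label].append(point)
    k = len(clusters)

    # Edge cases from the text: a single cluster, or every point its own cluster.
    if k <= 1 or k >= n:
        return 0.0

    dim = len(X[0])

    def centroid(points):
        # Coordinate-wise mean of a list of points.
        return [sum(p[d] for p in points) / len(points) for d in range(dim)]

    def sq_dist(a, b):
        # Squared Euclidean distance between two points.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    c = centroid(X)  # global centroid
    b_k = 0.0
    w_k = 0.0
    for points in clusters.values():
        c_q = centroid(points)
        b_k += len(points) * sq_dist(c_q, c)
        w_k += sum(sq_dist(x, c_q) for x in points)

    # Assumption (not in the source): guard against W_k == 0 when 1 < k < n.
    if w_k == 0:
        return 0.0

    return round((b_k / (k - 1)) / (w_k / (n - k)), 4)


X = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
labels = [0, 0, 0, 1, 1, 1]
print(variance_ratio_criterion(X, labels))  # 450.0
```

Checking the cluster count before computing any centroid keeps the degenerate inputs from ever reaching a division by zero.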
Constraints