In competitive machine learning evaluation, ranking models based on their relative performance is essential for establishing leaderboards and understanding model quality. A powerful approach to this problem is the pairwise comparison rating system, which dynamically adjusts skill ratings based on head-to-head outcomes between competitors.
This rating methodology, originally developed for chess tournaments, has become the gold standard for evaluating large language models (LLMs) through human preference assessments. Platforms like the Chatbot Arena leaderboard use this system to rank AI models based on thousands of blind pairwise comparisons.
The system works by computing expected win probabilities based on the current rating difference between two competitors, then adjusting ratings according to whether the actual outcome matched or deviated from expectations.
For two competitors A and B with ratings ( R_A ) and ( R_B ), the expected score for competitor A is:
$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$
Similarly, for competitor B:
$$E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$
Note that ( E_A + E_B = 1 ), reflecting the zero-sum nature of each comparison.
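As a sanity check, the expected-score formula can be evaluated directly in Python (a minimal sketch; the name `expected_score` is illustrative, not part of the task):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of the competitor rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Equal ratings give a 50% expectation; a 50-point edge gives ~57%.
e_a = expected_score(1600, 1550)  # ≈ 0.5714
e_b = expected_score(1550, 1600)  # ≈ 0.4286
assert abs(e_a + e_b - 1.0) < 1e-12  # expectations always sum to 1
```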
Based on the comparison outcome, each competitor receives an actual score: ( S = 1 ) for a win, ( S = 0 ) for a loss, and ( S = 0.5 ) for a tie.
After each comparison, ratings are updated sequentially:
$$R_A^{new} = R_A + K \times (S_A - E_A)$$
$$R_B^{new} = R_B + K \times (S_B - E_B)$$
Where:
• ( K ) is the adjustment factor, which caps the maximum rating change per comparison
• ( S_A ) and ( S_B ) are the actual scores (1 for a win, 0 for a loss, 0.5 for a tie)
• ( E_A ) and ( E_B ) are the expected scores defined above

The adjustment magnitude reflects the surprise factor of the outcome: an upset (a lower-rated competitor beating a higher-rated one) produces a large rating swing, while an outcome that matches expectations produces only a small correction.
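The pair of update equations can be folded into a single helper (a sketch; the name `update_ratings` is illustrative):

```python
def update_ratings(r_a: float, r_b: float, s_a: float, k: float = 32.0):
    """Apply one pairwise update; s_a is 1.0 (win), 0.0 (loss), or 0.5 (tie)."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    e_b = 1.0 - e_a
    s_b = 1.0 - s_a  # each comparison is zero-sum
    return r_a + k * (s_a - e_a), r_b + k * (s_b - e_b)

print(update_ratings(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Note how the surprise factor falls out of the formula: a 1400-rated competitor beating a 1600-rated one gains about 24 points, while an even-odds winner gains only 16.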
Your Task:
Implement a function that processes a sequence of pairwise comparisons and returns the final adjusted ratings. The comparisons must be processed sequentially, with each subsequent comparison using the updated ratings from the previous one.
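One possible implementation sketch, assuming the outcome codes used in the examples below ('1' = first competitor wins, '2' = second wins, 'tie' = draw); the function name and signature are illustrative:

```python
def process_comparisons(ratings, comparisons, adjustment_factor=32.0):
    """Return final ratings after applying each comparison in order."""
    ratings = dict(ratings)  # copy so the caller's dict is not mutated
    scores = {'1': 1.0, '2': 0.0, 'tie': 0.5}  # actual score for competitor A
    for name_a, name_b, outcome in comparisons:
        r_a, r_b = ratings[name_a], ratings[name_b]
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        s_a = scores[outcome]
        # Sequential update: later comparisons see these new ratings.
        ratings[name_a] = r_a + adjustment_factor * (s_a - e_a)
        ratings[name_b] = r_b + adjustment_factor * ((1.0 - s_a) - (1.0 - e_a))
    return ratings
```

With the Example 1 inputs this returns {'model_a': 1516.0, 'model_b': 1484.0}.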
Example 1:

Input:
ratings = {'model_a': 1500, 'model_b': 1500}
comparisons = [('model_a', 'model_b', '1')]
adjustment_factor = 32

Output:
{'model_a': 1516.0, 'model_b': 1484.0}

Explanation:
Both models begin with identical ratings of 1500. Since the ratings are equal, the expected win probability for each is 50% (expected score = 0.5).
For model_a (winner):
• Actual score = 1.0 (win)
• Rating change = 32 × (1.0 - 0.5) = 16
• New rating = 1500 + 16 = 1516.0

For model_b (loser):
• Actual score = 0.0 (loss)
• Rating change = 32 × (0.0 - 0.5) = -16
• New rating = 1500 - 16 = 1484.0
The total points exchanged sum to zero (zero-sum property).
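The zero-sum property can be verified numerically (a standalone sketch of the arithmetic above):

```python
K = 32.0
e_a = 1.0 / (1.0 + 10 ** ((1500 - 1500) / 400))  # 0.5 for equal ratings
delta_a = K * (1.0 - e_a)          # winner's change: +16.0
delta_b = K * (0.0 - (1.0 - e_a))  # loser's change:  -16.0
assert delta_a + delta_b == 0.0    # points exchanged sum to zero
print(1500 + delta_a, 1500 + delta_b)  # 1516.0 1484.0
```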
Example 2:

Input:
ratings = {'gpt4': 1600, 'claude': 1550, 'llama': 1450}
comparisons = [('gpt4', 'claude', '1'), ('claude', 'llama', '2'), ('gpt4', 'llama', 'tie')]
adjustment_factor = 32

Output:
{'gpt4': 1607.4441, 'claude': 1516.3929, 'llama': 1476.163}

Explanation:
Comparison 1: gpt4 (1600) vs claude (1550) → gpt4 wins
• Expected score for gpt4: 1/(1 + 10^((1550-1600)/400)) ≈ 0.5714
• gpt4 rating: 1600 + 32 × (1.0 - 0.5714) ≈ 1613.71
• claude rating: 1550 + 32 × (0.0 - 0.4286) ≈ 1536.29
Comparison 2: claude (~1536.29) vs llama (1450) → llama wins (an upset!)
• Expected score for claude ≈ 0.6217 (claude favored)
• claude rating: 1536.29 + 32 × (0.0 - 0.6217) ≈ 1516.39
• llama rating: 1450 + 32 × (1.0 - 0.3783) ≈ 1469.89

Comparison 3: gpt4 (~1613.71) vs llama (~1469.89) → tie
• Expected score for gpt4 ≈ 0.6959
• gpt4 rating: 1613.71 + 32 × (0.5 - 0.6959) ≈ 1607.44
• llama rating: 1469.89 + 32 × (0.5 - 0.3041) ≈ 1476.16
(Intermediate values in this walkthrough are rounded for readability; computing at full precision yields the expected output exactly.)
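The full three-step trace can be reproduced with a short standalone script (a sketch; here the outcomes are encoded directly as the first competitor's actual score rather than the '1'/'2'/'tie' codes):

```python
ratings = {'gpt4': 1600.0, 'claude': 1550.0, 'llama': 1450.0}
# (competitor_a, competitor_b, actual score of competitor_a)
comparisons = [('gpt4', 'claude', 1.0), ('claude', 'llama', 0.0), ('gpt4', 'llama', 0.5)]
K = 32.0

for a, b, s_a in comparisons:
    e_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))  # zero-sum counterpart

print({m: round(r, 4) for m, r in ratings.items()})
# {'gpt4': 1607.4441, 'claude': 1516.3929, 'llama': 1476.163}
```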
Example 3:

Input:
ratings = {'alpha': 1500, 'beta': 1500}
comparisons = [('alpha', 'beta', 'tie')]
adjustment_factor = 32

Output:
{'alpha': 1500.0, 'beta': 1500.0}

Explanation:
Both competitors have equal ratings, so the expected score is 0.5 for each.
For a tie:
• Both actual scores = 0.5
• Rating change for alpha = 32 × (0.5 - 0.5) = 0
• Rating change for beta = 32 × (0.5 - 0.5) = 0
A tie between equally-rated competitors confirms expectations and results in no rating change.
Constraints