In competitive machine learning evaluation, ranking models based on their relative performance is essential for establishing leaderboards and understanding model quality. A powerful approach to this problem is the pairwise comparison rating system, which dynamically adjusts skill ratings based on head-to-head outcomes between competitors.
This rating methodology, originally developed for chess tournaments, has become the gold standard for evaluating large language models (LLMs) through human preference assessments. Platforms like the Chatbot Arena leaderboard use this system to rank AI models based on thousands of blind pairwise comparisons.
The system works by computing expected win probabilities based on the current rating difference between two competitors, then adjusting ratings according to whether the actual outcome matched or deviated from expectations.
For two competitors A and B with ratings ( R_A ) and ( R_B ), the expected score for competitor A is:
$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$
Similarly, for competitor B:
$$E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$
Note that ( E_A + E_B = 1 ), reflecting the zero-sum nature of each comparison.
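As a sanity check, the expected-score formula can be evaluated directly in Python (a minimal sketch; the name `expected_score` is illustrative, not part of the task):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of the competitor rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Equal ratings give a 50% expectation; a 50-point edge gives ~57%.
e_a = expected_score(1600, 1550)  # ≈ 0.5714
e_b = expected_score(1550, 1600)  # ≈ 0.4286
assert abs(e_a + e_b - 1.0) < 1e-12  # expectations always sum to 1
```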
Based on the comparison outcome, each competitor receives an actual score: ( S = 1 ) for a win, ( S = 0 ) for a loss, and ( S = 0.5 ) for a tie.
After each comparison, ratings are updated sequentially:
$$R_A^{new} = R_A + K \times (S_A - E_A)$$
$$R_B^{new} = R_B + K \times (S_B - E_B)$$
Where:
• ( K ) is the adjustment factor, which caps the maximum rating change per comparison
• ( S_A ) and ( S_B ) are the actual scores (1 for a win, 0 for a loss, 0.5 for a tie)
• ( E_A ) and ( E_B ) are the expected scores defined above

The adjustment magnitude reflects the surprise factor of the outcome: an upset (a lower-rated competitor beating a higher-rated one) produces a large rating swing, while an outcome that matches expectations produces only a small correction.
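The pair of update equations can be folded into a single helper (a sketch; the name `update_ratings` is illustrative):

```python
def update_ratings(r_a: float, r_b: float, s_a: float, k: float = 32.0):
    """Apply one pairwise update; s_a is 1.0 (win), 0.0 (loss), or 0.5 (tie)."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    e_b = 1.0 - e_a
    s_b = 1.0 - s_a  # each comparison is zero-sum
    return r_a + k * (s_a - e_a), r_b + k * (s_b - e_b)

print(update_ratings(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Note how the surprise factor falls out of the formula: a 1400-rated competitor beating a 1600-rated one gains about 24 points, while an even-odds winner gains only 16.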
Your Task:
Implement a function that processes a sequence of pairwise comparisons and returns the final adjusted ratings. The comparisons must be processed sequentially, with each subsequent comparison using the updated ratings from the previous one.
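One possible implementation sketch, assuming the outcome codes used in the examples below ('1' = first competitor wins, '2' = second wins, 'tie' = draw); the function name and signature are illustrative:

```python
def process_comparisons(ratings, comparisons, adjustment_factor=32.0):
    """Return final ratings after applying each comparison in order."""
    ratings = dict(ratings)  # copy so the caller's dict is not mutated
    scores = {'1': 1.0, '2': 0.0, 'tie': 0.5}  # actual score for competitor A
    for name_a, name_b, outcome in comparisons:
        r_a, r_b = ratings[name_a], ratings[name_b]
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        s_a = scores[outcome]
        # Sequential update: later comparisons see these new ratings.
        ratings[name_a] = r_a + adjustment_factor * (s_a - e_a)
        ratings[name_b] = r_b + adjustment_factor * ((1.0 - s_a) - (1.0 - e_a))
    return ratings
```

With the Example 1 inputs this returns {'model_a': 1516.0, 'model_b': 1484.0}.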
Example 1:

Input:
ratings = {'model_a': 1500, 'model_b': 1500}
comparisons = [('model_a', 'model_b', '1')]
adjustment_factor = 32

Output:
{'model_a': 1516.0, 'model_b': 1484.0}

Explanation:
Both models begin with identical ratings of 1500. Since the ratings are equal, the expected win probability for each is 50% (expected score = 0.5).
For model_a (winner):
• Actual score = 1.0 (win)
• Rating change = 32 × (1.0 - 0.5) = 16
• New rating = 1500 + 16 = 1516.0

For model_b (loser):
• Actual score = 0.0 (loss)
• Rating change = 32 × (0.0 - 0.5) = -16
• New rating = 1500 - 16 = 1484.0
The total points exchanged sum to zero (zero-sum property).
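The zero-sum property can be verified numerically (a standalone sketch of the arithmetic above):

```python
K = 32.0
e_a = 1.0 / (1.0 + 10 ** ((1500 - 1500) / 400))  # 0.5 for equal ratings
delta_a = K * (1.0 - e_a)          # winner's change: +16.0
delta_b = K * (0.0 - (1.0 - e_a))  # loser's change:  -16.0
assert delta_a + delta_b == 0.0    # points exchanged sum to zero
print(1500 + delta_a, 1500 + delta_b)  # 1516.0 1484.0
```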
Example 2:

Input:
ratings = {'gpt4': 1600, 'claude': 1550, 'llama': 1450}
comparisons = [('gpt4', 'claude', '1'), ('claude', 'llama', '2'), ('gpt4', 'llama', 'tie')]
adjustment_factor = 32

Output:
{'gpt4': 1607.4441, 'claude': 1516.3929, 'llama': 1476.163}

Explanation:
Comparison 1: gpt4 (1600) vs claude (1550) → gpt4 wins
• Expected score for gpt4: 1/(1 + 10^((1550-1600)/400)) ≈ 0.5714
• gpt4 rating: 1600 + 32 × (1.0 - 0.5714) ≈ 1613.71
• claude rating: 1550 + 32 × (0.0 - 0.4286) ≈ 1536.29
Comparison 2: claude (~1536.29) vs llama (1450) → llama wins (an upset!)
• Expected score for claude ≈ 0.6217 (claude favored)
• claude rating: 1536.29 + 32 × (0.0 - 0.6217) ≈ 1516.39
• llama rating: 1450 + 32 × (1.0 - 0.3783) ≈ 1469.89

Comparison 3: gpt4 (~1613.71) vs llama (~1469.89) → tie
• Expected score for gpt4 ≈ 0.6959
• gpt4 rating: 1613.71 + 32 × (0.5 - 0.6959) ≈ 1607.44
• llama rating: 1469.89 + 32 × (0.5 - 0.3041) ≈ 1476.16
(Intermediate values in this walkthrough are rounded for readability; computing at full precision yields the expected output exactly.)
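The full three-step trace can be reproduced with a short standalone script (a sketch; here the outcomes are encoded directly as the first competitor's actual score rather than the '1'/'2'/'tie' codes):

```python
ratings = {'gpt4': 1600.0, 'claude': 1550.0, 'llama': 1450.0}
# (competitor_a, competitor_b, actual score of competitor_a)
comparisons = [('gpt4', 'claude', 1.0), ('claude', 'llama', 0.0), ('gpt4', 'llama', 0.5)]
K = 32.0

for a, b, s_a in comparisons:
    e_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))  # zero-sum counterpart

print({m: round(r, 4) for m, r in ratings.items()})
# {'gpt4': 1607.4441, 'claude': 1516.3929, 'llama': 1476.163}
```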
Example 3:

Input:
ratings = {'alpha': 1500, 'beta': 1500}
comparisons = [('alpha', 'beta', 'tie')]
adjustment_factor = 32

Output:
{'alpha': 1500.0, 'beta': 1500.0}

Explanation:
Both competitors have equal ratings, so the expected score is 0.5 for each.
For a tie:
• Both actual scores = 0.5
• Rating change for alpha = 32 × (0.5 - 0.5) = 0
• Rating change for beta = 32 × (0.5 - 0.5) = 0
A tie between equally-rated competitors confirms expectations and results in no rating change.
Constraints