In information retrieval systems, determining which documents are most relevant to a user's search query is a fundamental challenge. The Probabilistic Relevance Scorer, based on the renowned BM25 (Best Matching 25) algorithm, is one of the most effective and widely adopted ranking functions in modern search engines, databases, and NLP applications.
Unlike simpler term-matching approaches, BM25 introduces two critical refinements that make it remarkably effective in practice:
1. **Term Frequency Saturation**: While it's intuitive that a document mentioning a query term more frequently should be considered more relevant, the relationship is not strictly linear. A document containing "machine learning" 100 times is unlikely to be 100× more relevant than one containing it once. BM25 uses a saturation function that diminishes the impact of each additional term occurrence, preventing "keyword stuffing" from artificially inflating relevance scores.
2. **Document Length Normalization**: Longer documents naturally contain more words and thus have a higher probability of containing query terms by chance. BM25 accounts for this by penalizing documents that are long relative to the average document length in the corpus, ensuring that shorter, focused documents aren't unfairly ranked below lengthy, tangential ones.
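The saturation effect can be seen in a short sketch of the BM25 term-frequency weight in isolation (length normalization omitted, using the default \( k_1 = 1.5 \)):

```python
def tf_weight(tf, k1=1.5):
    # BM25 term-frequency saturation: grows with tf but is bounded above by k1 + 1
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 10, 100):
    print(tf, round(tf_weight(tf), 3))
# 1   -> 1.0
# 10  -> 2.174
# 100 -> 2.463  (100x the occurrences, nowhere near 100x the weight)
```

The weight can never exceed \( k_1 + 1 = 2.5 \), which is exactly why repeating a term 100 times cannot make a document 100× more relevant.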
For a query Q containing terms \( q_1, q_2, ..., q_n \), the BM25 score for a document D is:
$$\text{Score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
Where:
- \( f(q_i, D) \) is the frequency of term \( q_i \) in document D
- \( |D| \) is the length of document D in words
- avgdl is the average document length across the corpus
- \( k_1 \) controls the strength of term frequency saturation
- \( b \) controls the strength of document length normalization (0 = none, 1 = full)
The Inverse Document Frequency (IDF) component measures how informative a term is:
$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$
Where:
- N is the total number of documents in the corpus
- \( n(q_i) \) is the number of documents containing \( q_i \)
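A quick way to build intuition for this formula is to evaluate it for a common term versus a rare one (a minimal sketch; the helper name `idf` is my own):

```python
import math

def idf(N, n_q):
    # BM25 IDF: rare terms (small n_q) receive higher weights
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

# In a 3-document corpus:
print(round(idf(3, 3), 3))  # 0.134 — term in every document, nearly uninformative
print(round(idf(3, 1), 3))  # 0.981 — term in one document, highly informative
```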
Your Task: Implement a function that computes the BM25 relevance score for each document in a corpus given a query. Use default parameter values of k1 = 1.5 and b = 0.75. Return a list of scores rounded to 3 decimal places.
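One way to translate the formulas above directly into code is the following sketch (the function name `bm25_scores` is my own choice, not mandated by the task):

```python
import math

def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with BM25."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # Document frequency for each distinct query term
    doc_freq = {t: sum(1 for doc in corpus if t in doc) for t in set(query)}
    scores = []
    for doc in corpus:
        score = 0.0
        # Length normalization factor: 1 for an average-length document
        length_norm = 1 - b + b * len(doc) / avgdl
        for term in query:
            f = doc.count(term)
            idf = math.log((N - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * (f * (k1 + 1)) / (f + k1 * length_norm)
        scores.append(round(score, 3))
    return scores
```

A term absent from a document contributes 0 automatically (its numerator is 0), so no special casing is needed.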
Example 1:
corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'ran'], ['the', 'bird', 'flew']]
query = ['the', 'cat']
Expected output: [1.114, 0.134, 0.134]
We have 3 documents, each with 3 terms. The average document length is 3.0.
Term Analysis:
• 'the' appears in all 3 documents → IDF('the') = ln((3 - 3 + 0.5)/(3 + 0.5) + 1) = ln(1.143) ≈ 0.134
• 'cat' appears in 1 document → IDF('cat') = ln((3 - 1 + 0.5)/(1 + 0.5) + 1) = ln(2.667) ≈ 0.981
Document 1 Score:
• 'the': freq = 1, contribution = 0.134 × (1 × 2.5)/(1 + 1.5 × (1 - 0.75 + 0.75 × 1)) = 0.134 × 1.0 = 0.134
• 'cat': freq = 1, contribution = 0.981 × 1.0 = 0.981
• Total: 0.134 + 0.981 = 1.114
Documents 2 & 3: Only contain 'the' (no 'cat'), so score = 0.134 each.
Example 2:
corpus = [['quick', 'brown', 'fox'], ['lazy', 'brown', 'dog'], ['quick', 'lazy', 'cat']]
query = ['quick']
Expected output: [0.47, 0.0, 0.47]
The query contains only 'quick', which appears in documents 1 and 3 but not in document 2.
IDF Calculation:
• 'quick' appears in 2 out of 3 documents
• IDF('quick') = ln((3 - 2 + 0.5)/(2 + 0.5) + 1) = ln(1.6) ≈ 0.47
Scoring:
• Document 1: contains 'quick' → score = 0.47
• Document 2: does not contain 'quick' → score = 0.0
• Document 3: contains 'quick' → score = 0.47
Example 3:
corpus = [['apple', 'banana', 'orange'], ['banana', 'grape', 'lemon'], ['apple', 'apple', 'grape'], ['lemon', 'orange', 'banana']]
query = ['apple', 'banana']
Expected output: [1.05, 0.357, 0.99, 0.357]
A 4-document corpus with an average document length of 3.0. The query contains 'apple' and 'banana'.
IDF Values:
• 'apple' appears in 2 documents → IDF('apple') = ln((4 - 2 + 0.5)/(2 + 0.5) + 1) = ln(2) ≈ 0.693
• 'banana' appears in 3 documents → IDF('banana') = ln((4 - 3 + 0.5)/(3 + 0.5) + 1) = ln(1.429) ≈ 0.357
Document Scores:
• Document 1: has both 'apple' (×1) and 'banana' (×1) → combined score of 0.693 + 0.357 = 1.05
• Document 2: has only 'banana' (×1) → score of 0.357
• Document 3: has 'apple' (×2) → the higher term frequency contributes 0.99, not 2 × 0.693 = 1.386, because of term frequency saturation
• Document 4: has only 'banana' (×1) → score of 0.357
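Document 3's saturation effect can be checked by hand with a self-contained calculation (all numbers taken from the example above):

```python
import math

k1, b = 1.5, 0.75
# Document 3 is ['apple', 'apple', 'grape']: tf('apple') = 2, |D| = avgdl = 3
idf_apple = math.log((4 - 2 + 0.5) / (2 + 0.5) + 1)  # ln(2) ≈ 0.693
length_norm = 1 - b + b * 3 / 3                      # = 1.0 for an average-length doc
tf = 2
score = idf_apple * (tf * (k1 + 1)) / (tf + k1 * length_norm)
print(round(score, 3))  # 0.99 — less than 2 × 0.693, so doubling tf did not double the score
```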
Constraints