In information retrieval systems, determining which documents are most relevant to a user's search query is a fundamental challenge. The Probabilistic Relevance Scorer, based on the renowned BM25 (Best Matching 25) algorithm, is one of the most effective and widely adopted ranking functions in modern search engines, databases, and NLP applications.
Unlike simpler term-matching approaches, BM25 introduces two critical refinements that make it remarkably effective in practice:
1. **Term Frequency Saturation**: While it's intuitive that a document mentioning a query term more frequently should be considered more relevant, the relationship is not strictly linear. A document containing "machine learning" 100 times is unlikely to be 100× more relevant than one containing it once. BM25 uses a saturation function that diminishes the impact of each additional term occurrence, preventing "keyword stuffing" from artificially inflating relevance scores.
2. **Document Length Normalization**: Longer documents naturally contain more words and thus have a higher probability of containing query terms by chance. BM25 accounts for this by penalizing documents that are long relative to the average document length in the corpus, ensuring that shorter, focused documents aren't unfairly ranked below lengthy, tangential ones.
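The saturation effect can be seen in a short sketch of the BM25 term-frequency weight in isolation (length normalization omitted, using the default \( k_1 = 1.5 \)):

```python
def tf_weight(tf, k1=1.5):
    # BM25 term-frequency saturation: grows with tf but is bounded above by k1 + 1
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 10, 100):
    print(tf, round(tf_weight(tf), 3))
# 1   -> 1.0
# 10  -> 2.174
# 100 -> 2.463  (100x the occurrences, nowhere near 100x the weight)
```

The weight can never exceed \( k_1 + 1 = 2.5 \), which is exactly why repeating a term 100 times cannot make a document 100× more relevant.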
For a query Q containing terms \( q_1, q_2, ..., q_n \), the BM25 score for a document D is:
$$\text{Score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
Where:
- \( f(q_i, D) \) is the frequency of term \( q_i \) in document D
- \( |D| \) is the length of document D in words
- avgdl is the average document length across the corpus
- \( k_1 \) controls the strength of term frequency saturation
- \( b \) controls the strength of document length normalization (0 = none, 1 = full)
The Inverse Document Frequency (IDF) component measures how informative a term is:
$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$
Where:
- N is the total number of documents in the corpus
- \( n(q_i) \) is the number of documents containing \( q_i \)
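A quick way to build intuition for this formula is to evaluate it for a common term versus a rare one (a minimal sketch; the helper name `idf` is my own):

```python
import math

def idf(N, n_q):
    # BM25 IDF: rare terms (small n_q) receive higher weights
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)

# In a 3-document corpus:
print(round(idf(3, 3), 3))  # 0.134 — term in every document, nearly uninformative
print(round(idf(3, 1), 3))  # 0.981 — term in one document, highly informative
```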
Your Task: Implement a function that computes the BM25 relevance score for each document in a corpus given a query. Use default parameter values of k1 = 1.5 and b = 0.75. Return a list of scores rounded to 3 decimal places.
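One way to translate the formulas above directly into code is the following sketch (the function name `bm25_scores` is my own choice, not mandated by the task):

```python
import math

def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with BM25."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # Document frequency for each distinct query term
    doc_freq = {t: sum(1 for doc in corpus if t in doc) for t in set(query)}
    scores = []
    for doc in corpus:
        score = 0.0
        # Length normalization factor: 1 for an average-length document
        length_norm = 1 - b + b * len(doc) / avgdl
        for term in query:
            f = doc.count(term)
            idf = math.log((N - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * (f * (k1 + 1)) / (f + k1 * length_norm)
        scores.append(round(score, 3))
    return scores
```

A term absent from a document contributes 0 automatically (its numerator is 0), so no special casing is needed.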
Example 1:
corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'ran'], ['the', 'bird', 'flew']]
query = ['the', 'cat']
Expected output: [1.114, 0.134, 0.134]
We have 3 documents, each with 3 terms. The average document length is 3.0.
Term Analysis:
• 'the' appears in all 3 documents → IDF('the') = ln((3 - 3 + 0.5)/(3 + 0.5) + 1) = ln(1.143) ≈ 0.134
• 'cat' appears in 1 document → IDF('cat') = ln((3 - 1 + 0.5)/(1 + 0.5) + 1) = ln(2.667) ≈ 0.981
Document 1 Score:
• 'the': freq = 1, contribution = 0.134 × (1 × 2.5)/(1 + 1.5 × (1 - 0.75 + 0.75 × 1)) = 0.134 × 1.0 = 0.134
• 'cat': freq = 1, contribution = 0.981 × 1.0 = 0.981
• Total: 0.134 + 0.981 = 1.114
Documents 2 & 3: Only contain 'the' (no 'cat'), so score = 0.134 each.
Example 2:
corpus = [['quick', 'brown', 'fox'], ['lazy', 'brown', 'dog'], ['quick', 'lazy', 'cat']]
query = ['quick']
Expected output: [0.47, 0.0, 0.47]
The query contains only 'quick', which appears in documents 1 and 3 but not in document 2.
IDF Calculation:
• 'quick' appears in 2 out of 3 documents
• IDF('quick') = ln((3 - 2 + 0.5)/(2 + 0.5) + 1) = ln(1.6) ≈ 0.47
Scoring:
• Document 1: contains 'quick' → score = 0.47
• Document 2: does not contain 'quick' → score = 0.0
• Document 3: contains 'quick' → score = 0.47
Example 3:
corpus = [['apple', 'banana', 'orange'], ['banana', 'grape', 'lemon'], ['apple', 'apple', 'grape'], ['lemon', 'orange', 'banana']]
query = ['apple', 'banana']
Expected output: [1.05, 0.357, 0.99, 0.357]
A 4-document corpus with an average document length of 3.0. The query contains 'apple' and 'banana'.
IDF Values:
• 'apple' appears in 2 documents → IDF('apple') = ln((4 - 2 + 0.5)/(2 + 0.5) + 1) = ln(2) ≈ 0.693
• 'banana' appears in 3 documents → IDF('banana') = ln((4 - 3 + 0.5)/(3 + 0.5) + 1) = ln(1.429) ≈ 0.357
Document Scores:
• Document 1: has both 'apple' (×1) and 'banana' (×1) → combined score of 0.693 + 0.357 = 1.05
• Document 2: has only 'banana' (×1) → score of 0.357
• Document 3: has 'apple' (×2) → the higher term frequency contributes 0.99, not 2 × 0.693 = 1.386, because of term frequency saturation
• Document 4: has only 'banana' (×1) → score of 0.357
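Document 3's saturation effect can be checked by hand with a self-contained calculation (all numbers taken from the example above):

```python
import math

k1, b = 1.5, 0.75
# Document 3 is ['apple', 'apple', 'grape']: tf('apple') = 2, |D| = avgdl = 3
idf_apple = math.log((4 - 2 + 0.5) / (2 + 0.5) + 1)  # ln(2) ≈ 0.693
length_norm = 1 - b + b * 3 / 3                      # = 1.0 for an average-length doc
tf = 2
score = idf_apple * (tf * (k1 + 1)) / (tf + k1 * length_norm)
print(round(score, 3))  # 0.99 — less than 2 × 0.693, so doubling tf did not double the score
```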
Constraints