In natural language processing and machine translation, evaluating the quality of machine-generated text is a critical challenge. One of the most widely adopted approaches is to measure how well the generated text overlaps with one or more human-written reference texts using n-gram precision scoring.
An n-gram is a contiguous sequence of n tokens from a text. For example, in the sentence "the cat sat", the 1-grams (unigrams) are ["the", "cat", "sat"], and the 2-grams (bigrams) are [("the", "cat"), ("cat", "sat")].
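As a minimal sketch, n-gram extraction can be written as a one-line helper (the name `ngrams` is illustrative, not part of the task's required interface):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams (as tuples) in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(ngrams(["the", "cat", "sat"], 2))
# Counter({('the', 'cat'): 1, ('cat', 'sat'): 1})
```

Returning a `Counter` rather than a list makes the clipped-count comparison later a simple per-key `min`.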
The N-Gram Precision Score evaluates the quality of generated text through the following components:
For each n from 1 to max_n, calculate the precision of n-grams in the candidate text:
$$p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \text{Count}_{\text{clipped}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{candidate}} \text{Count}(\text{n-gram})}$$
The clipped count for each n-gram is the minimum of:
• its count in the candidate, and
• the maximum number of times it appears in any single reference.
This clipping mechanism prevents gaming by repetition—a candidate of "the the the the" would not score highly just because "the" appears in the reference.
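The clipped precision for one order n can be sketched as follows (helper names `ngrams` and `clipped_precision` are illustrative; the zero-denominator guard is an assumption for empty inputs):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """p_n: clipped n-gram matches divided by total candidate n-grams."""
    cand_counts = ngrams(candidate, n)
    # Clipping limit: the n-gram's maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# Repetition does not pay: "the" is clipped to its single reference occurrence.
print(clipped_precision(["the"] * 4, [["the", "cat"]], 1))  # 0.25
```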
To discourage overly short translations that might achieve high precision by only producing "safe" words, a brevity penalty is applied:
$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$$
Where:
• $c$ is the length of the candidate sentence
• $r$ is the effective reference length
When multiple references exist, the effective reference length is chosen as the reference length closest to the candidate length. If two reference lengths are equidistant, choose the shorter one.
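A sketch of the brevity penalty, including the tie-breaking rule above (the function name is illustrative; behavior for an empty candidate, where $r/c$ is undefined, is left out as the problem does not specify it):

```python
import math

def brevity_penalty(candidate_len, reference_lens):
    # Effective reference length: closest to the candidate length;
    # the (distance, length) key breaks ties in favor of the shorter one.
    r = min(reference_lens, key=lambda rl: (abs(rl - candidate_len), rl))
    c = candidate_len
    return 1.0 if c > r else math.exp(1 - r / c)

print(brevity_penalty(3, [2, 4]))  # tie at distance 1 -> r = 2, c > r -> 1.0
```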
The final score combines the brevity penalty with a geometric mean of the n-gram precisions:
$$\text{Score} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \cdot \log(p_n)\right)$$
Where $w_n = \frac{1}{N}$ (uniform weighting).
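The combination step can be sketched like this (the `combine` name is illustrative; treating any $p_n = 0$ as an overall score of 0 is an assumption, since $\log 0$ is undefined and the problem statement does not prescribe a smoothing scheme):

```python
import math

def combine(bp, precisions):
    """Brevity penalty times the uniformly weighted geometric mean of p_n."""
    if any(p == 0 for p in precisions):
        return 0.0  # assumption: log(0) case collapses the score to 0
    w = 1 / len(precisions)
    return bp * math.exp(sum(w * math.log(p) for p in precisions))

print(combine(1.0, [0.75, 1/3]))  # matches Example 1 below: ~0.5
```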
Your Task: Implement a function that computes this n-gram precision score given a candidate sentence (as a list of tokens), one or more reference sentences (each as a list of tokens), and the maximum n-gram order to consider.
candidate = ['a', 'b', 'c', 'd']
references = [['a', 'b', 'x', 'd']]
max_n = 2

Expected score: 0.5

Step 1: Calculate 1-gram (Unigram) Precision
Candidate 1-grams: {a: 1, b: 1, c: 1, d: 1} → 4 total
Reference 1-grams: {a: 1, b: 1, x: 1, d: 1}
Clipped counts:
• 'a': min(1, 1) = 1 ✓
• 'b': min(1, 1) = 1 ✓
• 'c': min(1, 0) = 0 ✗ (not in reference)
• 'd': min(1, 1) = 1 ✓
Sum of clipped counts = 3
p₁ = 3/4 = 0.75
Step 2: Calculate 2-gram (Bigram) Precision
Candidate 2-grams: {(a,b): 1, (b,c): 1, (c,d): 1} → 3 total
Reference 2-grams: {(a,b): 1, (b,x): 1, (x,d): 1}
Clipped counts:
• (a,b): min(1, 1) = 1 ✓
• (b,c): min(1, 0) = 0 ✗
• (c,d): min(1, 0) = 0 ✗
Sum of clipped counts = 1
p₂ = 1/3 ≈ 0.333
Step 3: Calculate Brevity Penalty
Candidate length (c) = 4
Reference length (r) = 4
Since c = r, BP = e^(1 − 4/4) = e⁰ = 1.0
Step 4: Compute Final Score
Geometric mean = exp((log(0.75) + log(0.333)) / 2)
= exp((−0.288 − 1.099) / 2)
= exp(−0.693) ≈ 0.5
Final Score = BP × Geometric Mean = 1.0 × 0.5 = 0.5
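The arithmetic of this example can be checked directly (note that the geometric mean of two values is just the square root of their product):

```python
import math

p1, p2 = 3/4, 1/3
bp = 1.0
score = bp * math.exp((math.log(p1) + math.log(p2)) / 2)
print(round(score, 4))  # 0.5, i.e. sqrt(0.75 * 1/3) = sqrt(0.25)
```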
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
references = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
max_n = 4

Expected score: 1.0

This is a perfect match scenario where the candidate exactly equals the reference.
N-gram Precisions:
• p₁ (unigrams): all 6 tokens match → 6/6 = 1.0
• p₂ (bigrams): all 5 bigrams match → 5/5 = 1.0
• p₃ (trigrams): all 4 trigrams match → 4/4 = 1.0
• p₄ (4-grams): all 3 4-grams match → 3/3 = 1.0
Brevity Penalty:
Candidate length = Reference length = 6
BP = 1.0 (no penalty)
Final Score:
Geometric mean of all 1.0 values = 1.0
Score = 1.0 × 1.0 = 1.0
This demonstrates that perfect reproduction of the reference achieves the maximum score.
candidate = ['the', 'quick', 'brown', 'fox']
references = [['a', 'fast', 'brown', 'fox'], ['the', 'slow', 'brown', 'dog']]
max_n = 2

Expected score: 0.5

With multiple references, the clipping limit for each n-gram is its maximum count in any single reference.
Step 1: Calculate 1-gram Precision
Candidate tokens: {the: 1, quick: 1, brown: 1, fox: 1}
Reference 1 tokens: {a: 1, fast: 1, brown: 1, fox: 1}
Reference 2 tokens: {the: 1, slow: 1, brown: 1, dog: 1}
Clipped counts, min(candidate count, max count in any single reference):
• 'the': min(1, max(0, 1)) = 1 ✓ (from ref 2)
• 'quick': min(1, max(0, 0)) = 0 ✗
• 'brown': min(1, max(1, 1)) = 1 ✓ (both refs)
• 'fox': min(1, max(1, 0)) = 1 ✓ (from ref 1)
p₁ = 3/4 = 0.75
Step 2: Calculate 2-gram Precision
Candidate bigrams: (the,quick), (quick,brown), (brown,fox)
Ref 1 bigrams: (a,fast), (fast,brown), (brown,fox)
Ref 2 bigrams: (the,slow), (slow,brown), (brown,dog)
Clipped counts:
• (the,quick): 0 ✗
• (quick,brown): 0 ✗
• (brown,fox): 1 ✓ (from ref 1)
p₂ = 1/3 ≈ 0.333
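Multi-reference clipping for this example can be sketched as below (the `clipped_stats` helper name is illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_stats(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one order n."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(c, max_ref[g]) for g, c in cand.items()), sum(cand.values())

candidate = ['the', 'quick', 'brown', 'fox']
references = [['a', 'fast', 'brown', 'fox'], ['the', 'slow', 'brown', 'dog']]
print(clipped_stats(candidate, references, 1))  # (3, 4) -> p1 = 0.75
print(clipped_stats(candidate, references, 2))  # (1, 3) -> p2 = 1/3
```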
Step 3: Brevity Penalty
Candidate length = 4
Reference 1 length = 4 → distance = |4 − 4| = 0
Reference 2 length = 4 → distance = |4 − 4| = 0
Both references are equally close; effective reference length r = 4
BP = 1.0
Final Score = 1.0 × exp((log(0.75) + log(1/3)) / 2) = 0.5
Constraints