In natural language processing (NLP), evaluating the quality of machine-generated text is a fundamental challenge. One widely-used approach is to measure the lexical overlap between a generated text and a human-written reference. By quantifying how many words (unigrams) the candidate text shares with the reference, we can assess how well the generated output captures the essential content.
The Unigram Overlap Evaluation Metric computes three complementary scores that together provide a comprehensive view of text quality:
Precision measures the proportion of words in the candidate text that also appear in the reference text. It answers the question: "Of everything the model generated, how much was actually relevant?"
$$\text{Precision} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in candidate}}$$
High precision indicates that the generated text is concise and doesn't contain irrelevant words.
Recall measures the proportion of words in the reference text that are captured by the candidate text. It answers the question: "Of everything that should have been included, how much did the model capture?"
$$\text{Recall} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference}}$$
High recall indicates that the generated text comprehensively covers the reference content.
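Once the overlap count and the token totals are known, both scores are simple ratios. A minimal sketch (the function names are illustrative, not part of the problem statement):

```python
def precision(overlap: int, candidate_total: int) -> float:
    # Fraction of candidate unigrams that also appear in the reference
    return overlap / candidate_total if candidate_total else 0.0

def recall(overlap: int, reference_total: int) -> float:
    # Fraction of reference unigrams captured by the candidate
    return overlap / reference_total if reference_total else 0.0
```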
The F1 Score is the harmonic mean of precision and recall, providing a single balanced metric that penalizes extreme imbalances between the two:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
If either precision or recall is zero, the F1 score is defined as 0.
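This zero case matters in code, since the harmonic mean would otherwise divide by zero. A sketch of the guard:

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when either is 0
    if precision == 0 or recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```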
When counting overlapping unigrams, we must account for word frequency. If a word appears multiple times in both texts, the overlap count for that word is the minimum of its counts in the reference and candidate:
$$\text{Overlap for word } w = \min(\text{count}_{\text{reference}}(w), \text{count}_{\text{candidate}}(w))$$
The total overlap is the sum of these minimum counts across all unique words.
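This clipped-count rule maps directly onto Python's `collections.Counter`, whose intersection operator `&` keeps the minimum count per key. One way to sketch it:

```python
from collections import Counter

def unigram_overlap(reference_tokens, candidate_tokens):
    # Counter intersection keeps min(count_reference(w), count_candidate(w))
    clipped = Counter(reference_tokens) & Counter(candidate_tokens)
    return sum(clipped.values())
```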
Write a Python function that computes the unigram overlap evaluation scores between a reference text and a candidate text. The function should return a dictionary with precision, recall, and f1 scores as keys.

Example 1

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

Expected output:

{"precision": 0.8333333333333334, "recall": 0.8333333333333334, "f1": 0.8333333333333334}

Step 1: Tokenize both texts
Reference tokens: ["the", "cat", "sat", "on", "the", "mat"] (6 unigrams)
Candidate tokens: ["the", "cat", "is", "on", "the", "mat"] (6 unigrams)
Step 2: Count unigram frequencies
Reference counts: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
Candidate counts: {"the": 2, "cat": 1, "is": 1, "on": 1, "mat": 1}
Step 3: Calculate overlapping unigrams. For each word present in both texts, take the minimum count:
"the": min(2, 2) = 2
"cat": min(1, 1) = 1
"on": min(1, 1) = 1
"mat": min(1, 1) = 1
Total overlap = 2 + 1 + 1 + 1 = 5
Step 4: Compute metrics
Precision = 5 / 6 ≈ 0.8333
Recall = 5 / 6 ≈ 0.8333
F1 = 2 × (0.8333 × 0.8333) / (0.8333 + 0.8333) ≈ 0.8333
Example 2

reference = "machine learning is amazing"
candidate = "deep learning is powerful"

Expected output:

{"precision": 0.5, "recall": 0.5, "f1": 0.5}

Step 1: Tokenize both texts
Reference tokens: ["machine", "learning", "is", "amazing"] (4 unigrams)
Candidate tokens: ["deep", "learning", "is", "powerful"] (4 unigrams)
Step 2: Count unigram frequencies
Reference counts: {"machine": 1, "learning": 1, "is": 1, "amazing": 1}
Candidate counts: {"deep": 1, "learning": 1, "is": 1, "powerful": 1}
Step 3: Calculate overlapping unigrams. Common words: "learning" and "is", each with min(1, 1) = 1.
Total overlap = 1 + 1 = 2
Step 4: Compute metrics
Precision = 2 / 4 = 0.5
Recall = 2 / 4 = 0.5
F1 = 2 × (0.5 × 0.5) / (0.5 + 0.5) = 0.5
Example 3

reference = "hello world"
candidate = "hello world"

Expected output:

{"precision": 1.0, "recall": 1.0, "f1": 1.0}

Perfect Match Case
When the candidate exactly matches the reference:
Both words overlap completely:
"hello": min(1, 1) = 1
"world": min(1, 1) = 1
Total overlap = 2
A perfect score of 1.0 for all metrics indicates the candidate is identical to the reference at the unigram level.
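Putting the pieces together, one possible solution sketch, assuming lowercase whitespace tokenization (the problem statement does not pin down a tokenizer):

```python
from collections import Counter

def unigram_overlap_scores(reference: str, candidate: str) -> dict:
    # Tokenize: lowercase, split on whitespace (an assumption)
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Clipped overlap: Counter intersection keeps the min count per word
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision > 0 and recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For instance, `unigram_overlap_scores("hello world", "hello world")` returns `{"precision": 1.0, "recall": 1.0, "f1": 1.0}`, matching the perfect-match case above.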
Constraints