In information retrieval and text mining, quantifying the importance of words within documents is a foundational task. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates how relevant a term is to a specific document within a larger collection (corpus). This technique is widely used in search engines, document classification, and feature extraction for machine learning models.
TF-IDF combines two complementary metrics:
Term Frequency measures how often a term occurs in a document. The intuition is that the more frequently a term appears, the more important it might be to that document:
$$TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}$$
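This ratio is straightforward to compute directly; a minimal Python sketch (the function name `term_frequency` is illustrative):

```python
def term_frequency(term: str, document: list[str]) -> float:
    """Fraction of tokens in `document` equal to `term`."""
    return document.count(term) / len(document)

# "the" occurs twice in a six-token document
print(term_frequency("the", ["the", "cat", "sat", "on", "the", "mat"]))  # 0.3333333333333333
```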
Inverse Document Frequency measures the discriminative power of a term across the corpus. Terms that appear in many documents (like "the", "and", "is") are less useful for distinguishing between documents:
$$IDF(t) = \ln\left(\frac{N + 1}{df(t) + 1}\right)$$
Where:

- $N$ is the total number of documents in the corpus
- $df(t)$ is the number of documents that contain term $t$

Adding 1 to both the numerator and the denominator smooths the estimate and prevents division by zero when a term appears in no document.
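The smoothed IDF formula above translates directly into code; a sketch (the function name is illustrative):

```python
import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """Smoothed IDF: ln((N + 1) / (df + 1))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1))

# A term found in 2 of 3 documents: ln(4/3)
print(inverse_document_frequency("cat", [["cat"], ["cat", "dog"], ["bird"]]))  # 0.28768...
```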
The final TF-IDF score is the product of these two components:
$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
Implement a function that calculates TF-IDF scores for a list of query terms across a corpus of documents. Your function must:

- accept a corpus (a list of tokenized documents) and a query (a list of terms)
- compute the TF-IDF score of every query term in every document, using the smoothed IDF formula above
- return a matrix with one row per document and one column per query term, with each score rounded to 5 decimal places
corpus = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "chased", "the", "cat"],
["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

Output: [[0.04795], [0.05754], [0.0]]

Let's calculate the TF-IDF score for "cat" in each document:
Step 1: Calculate Document Frequencies

"cat" appears in document 0 and document 1, so df("cat") = 2.

Step 2: Calculate IDF

IDF("cat") = ln((3 + 1) / (2 + 1)) = ln(4/3) ≈ 0.28768

Step 3: Calculate TF for each document

- Document 0 (6 terms): TF = 1/6 ≈ 0.16667
- Document 1 (5 terms): TF = 1/5 = 0.2
- Document 2 (6 terms): TF = 0/6 = 0.0

Step 4: Calculate TF-IDF

- Document 0: (1/6) × 0.28768 ≈ 0.04795
- Document 1: (1/5) × 0.28768 ≈ 0.05754
- Document 2: 0 × 0.28768 = 0.0
The result [[0.04795], [0.05754], [0.0]] shows that "cat" is most relevant to document 1.
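The arithmetic above can be checked directly (the constants come from this example):

```python
import math

# Smoothed IDF for "cat": 3 documents, df = 2
idf_cat = math.log((3 + 1) / (2 + 1))  # ln(4/3) ≈ 0.28768

# TF of "cat" in each document: 1/6, 1/5, 0/6
scores = [round(tf * idf_cat, 5) for tf in (1/6, 1/5, 0/6)]
print(scores)  # [0.04795, 0.05754, 0.0]
```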
corpus = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "chased", "the", "cat"],
["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat", "dog", "mat"]

Output: [[0.04795, 0.0, 0.04795], [0.05754, 0.13863, 0.0], [0.0, 0.0, 0.04795]]

This example demonstrates TF-IDF calculation for multiple query terms:
Document Frequencies:

- df("cat") = 2 (documents 0 and 1)
- df("dog") = 1 (document 1 only)
- df("mat") = 2 (documents 0 and 2)

So IDF("cat") = IDF("mat") = ln(4/3) ≈ 0.28768, and IDF("dog") = ln(4/2) ≈ 0.69315.
TF-IDF Matrix Calculation:
| Document | cat | dog | mat |
|---|---|---|---|
| Doc 0 (6 words) | (1/6) × 0.28768 = 0.04795 | (0/6) × 0.69315 = 0.0 | (1/6) × 0.28768 = 0.04795 |
| Doc 1 (5 words) | (1/5) × 0.28768 = 0.05754 | (1/5) × 0.69315 = 0.13863 | (0/5) × 0.28768 = 0.0 |
| Doc 2 (6 words) | (0/6) × 0.28768 = 0.0 | (0/6) × 0.69315 = 0.0 | (1/6) × 0.28768 = 0.04795 |
Note that "dog" has the highest IDF because it appears in only one document, making it a more discriminative term.
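The whole matrix from this example can be reproduced with a short nested loop (a sketch, not a reference solution):

```python
import math

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]
query = ["cat", "dog", "mat"]

n = len(corpus)
matrix = []
for doc in corpus:
    row = []
    for term in query:
        tf = doc.count(term) / len(doc)               # term frequency in this document
        df = sum(1 for d in corpus if term in d)      # document frequency across the corpus
        idf = math.log((n + 1) / (df + 1))            # smoothed IDF
        row.append(round(tf * idf, 5))
    matrix.append(row)

print(matrix)
# [[0.04795, 0.0, 0.04795], [0.05754, 0.13863, 0.0], [0.0, 0.0, 0.04795]]
```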
corpus = [
["apple", "banana", "cherry"],
["date", "elderberry", "fig"],
["grape", "honeydew", "kiwi"]
]
query = ["mango"]

Output: [[0.0], [0.0], [0.0]]

When a query term doesn't exist in any document:
Step 1: Calculate Document Frequency

"mango" appears in no document, so df("mango") = 0.

Step 2: Calculate IDF with smoothing

IDF("mango") = ln((3 + 1) / (0 + 1)) = ln(4) ≈ 1.38629. The +1 in the denominator is what prevents division by zero here.

Step 3: Calculate TF

TF("mango") = 0 in every document.

Step 4: Calculate TF-IDF

0 × 1.38629 = 0.0 for every document.
Even though "mango" has a high IDF (rare term), the TF-IDF score is zero everywhere because the term frequency is zero in all documents.
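Putting the pieces together, one possible implementation of the required function (the name `tf_idf` and the 5-decimal rounding are inferred from the examples above):

```python
import math

def tf_idf(corpus: list[list[str]], query: list[str]) -> list[list[float]]:
    """One row per document; each row holds the TF-IDF score of
    every query term in that document, rounded to 5 decimal places."""
    n = len(corpus)
    # Smoothed IDF per query term: ln((N + 1) / (df + 1))
    idf = {
        term: math.log((n + 1) / (sum(term in doc for doc in corpus) + 1))
        for term in query
    }
    return [
        [round(doc.count(term) / len(doc) * idf[term], 5) for term in query]
        for doc in corpus
    ]

# Example 3: a term absent from every document scores 0.0 everywhere
print(tf_idf(
    [["apple", "banana", "cherry"],
     ["date", "elderberry", "fig"],
     ["grape", "honeydew", "kiwi"]],
    ["mango"],
))  # [[0.0], [0.0], [0.0]]
```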
Constraints