In information retrieval and text mining, quantifying the importance of words within documents is a foundational task. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates how relevant a term is to a specific document within a larger collection (corpus). This technique is widely used in search engines, document classification, and feature extraction for machine learning models.
TF-IDF combines two complementary metrics:
Term Frequency measures how often a term occurs in a document. The intuition is that the more frequently a term appears, the more important it might be to that document:
$$TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}$$
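This ratio is straightforward to compute directly; a minimal Python sketch (the function name `term_frequency` is illustrative):

```python
def term_frequency(term: str, document: list[str]) -> float:
    """Fraction of tokens in `document` equal to `term`."""
    return document.count(term) / len(document)

# "the" occurs twice in a six-token document
print(term_frequency("the", ["the", "cat", "sat", "on", "the", "mat"]))  # 0.3333333333333333
```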
Inverse Document Frequency measures the discriminative power of a term across the corpus. Terms that appear in many documents (like "the", "and", "is") are less useful for distinguishing between documents:
$$IDF(t) = \ln\left(\frac{N + 1}{df(t) + 1}\right)$$
Where:

- $N$ is the total number of documents in the corpus
- $df(t)$ is the number of documents that contain term $t$

Adding 1 to both the numerator and the denominator smooths the estimate and prevents division by zero when a term appears in no document.
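The smoothed IDF formula above translates directly into code; a sketch (the function name is illustrative):

```python
import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """Smoothed IDF: ln((N + 1) / (df + 1))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1))

# A term found in 2 of 3 documents: ln(4/3)
print(inverse_document_frequency("cat", [["cat"], ["cat", "dog"], ["bird"]]))  # 0.28768...
```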
The final TF-IDF score is the product of these two components:
$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
Implement a function that calculates TF-IDF scores for a list of query terms across a corpus of documents. Your function must:

- accept a corpus (a list of tokenized documents) and a query (a list of terms)
- compute the TF-IDF score of every query term in every document, using the smoothed IDF formula above
- return a matrix with one row per document and one column per query term, with each score rounded to 5 decimal places
corpus = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "chased", "the", "cat"],
["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

Output: [[0.04795], [0.05754], [0.0]]

Let's calculate the TF-IDF score for "cat" in each document:
Step 1: Calculate Document Frequencies

"cat" appears in document 0 and document 1, so df("cat") = 2.

Step 2: Calculate IDF

IDF("cat") = ln((3 + 1) / (2 + 1)) = ln(4/3) ≈ 0.28768

Step 3: Calculate TF for each document

- Document 0 (6 terms): TF = 1/6 ≈ 0.16667
- Document 1 (5 terms): TF = 1/5 = 0.2
- Document 2 (6 terms): TF = 0/6 = 0.0

Step 4: Calculate TF-IDF

- Document 0: (1/6) × 0.28768 ≈ 0.04795
- Document 1: (1/5) × 0.28768 ≈ 0.05754
- Document 2: 0 × 0.28768 = 0.0
The result [[0.04795], [0.05754], [0.0]] shows that "cat" is most relevant to document 1.
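The arithmetic above can be checked directly (the constants come from this example):

```python
import math

# Smoothed IDF for "cat": 3 documents, df = 2
idf_cat = math.log((3 + 1) / (2 + 1))  # ln(4/3) ≈ 0.28768

# TF of "cat" in each document: 1/6, 1/5, 0/6
scores = [round(tf * idf_cat, 5) for tf in (1/6, 1/5, 0/6)]
print(scores)  # [0.04795, 0.05754, 0.0]
```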
corpus = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "chased", "the", "cat"],
["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat", "dog", "mat"]

Output: [[0.04795, 0.0, 0.04795], [0.05754, 0.13863, 0.0], [0.0, 0.0, 0.04795]]

This example demonstrates TF-IDF calculation for multiple query terms:
Document Frequencies:

- df("cat") = 2 (documents 0 and 1)
- df("dog") = 1 (document 1 only)
- df("mat") = 2 (documents 0 and 2)

So IDF("cat") = IDF("mat") = ln(4/3) ≈ 0.28768, and IDF("dog") = ln(4/2) ≈ 0.69315.
TF-IDF Matrix Calculation:
| Document | cat | dog | mat |
|---|---|---|---|
| Doc 0 (6 words) | (1/6) × 0.28768 = 0.04795 | (0/6) × 0.69315 = 0.0 | (1/6) × 0.28768 = 0.04795 |
| Doc 1 (5 words) | (1/5) × 0.28768 = 0.05754 | (1/5) × 0.69315 = 0.13863 | (0/5) × 0.28768 = 0.0 |
| Doc 2 (6 words) | (0/6) × 0.28768 = 0.0 | (0/6) × 0.69315 = 0.0 | (1/6) × 0.28768 = 0.04795 |
Note that "dog" has the highest IDF because it appears in only one document, making it a more discriminative term.
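The whole matrix from this example can be reproduced with a short nested loop (a sketch, not a reference solution):

```python
import math

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]
query = ["cat", "dog", "mat"]

n = len(corpus)
matrix = []
for doc in corpus:
    row = []
    for term in query:
        tf = doc.count(term) / len(doc)               # term frequency in this document
        df = sum(1 for d in corpus if term in d)      # document frequency across the corpus
        idf = math.log((n + 1) / (df + 1))            # smoothed IDF
        row.append(round(tf * idf, 5))
    matrix.append(row)

print(matrix)
# [[0.04795, 0.0, 0.04795], [0.05754, 0.13863, 0.0], [0.0, 0.0, 0.04795]]
```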
corpus = [
["apple", "banana", "cherry"],
["date", "elderberry", "fig"],
["grape", "honeydew", "kiwi"]
]
query = ["mango"]

Output: [[0.0], [0.0], [0.0]]

When a query term doesn't exist in any document:
Step 1: Calculate Document Frequency

"mango" appears in no document, so df("mango") = 0.

Step 2: Calculate IDF with smoothing

IDF("mango") = ln((3 + 1) / (0 + 1)) = ln(4) ≈ 1.38629. The +1 in the denominator is what prevents division by zero here.

Step 3: Calculate TF

TF("mango") = 0 in every document.

Step 4: Calculate TF-IDF

0 × 1.38629 = 0.0 for every document.
Even though "mango" has a high IDF (rare term), the TF-IDF score is zero everywhere because the term frequency is zero in all documents.
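Putting the pieces together, one possible implementation of the required function (the name `tf_idf` and the 5-decimal rounding are inferred from the examples above):

```python
import math

def tf_idf(corpus: list[list[str]], query: list[str]) -> list[list[float]]:
    """One row per document; each row holds the TF-IDF score of
    every query term in that document, rounded to 5 decimal places."""
    n = len(corpus)
    # Smoothed IDF per query term: ln((N + 1) / (df + 1))
    idf = {
        term: math.log((n + 1) / (sum(term in doc for doc in corpus) + 1))
        for term in query
    }
    return [
        [round(doc.count(term) / len(doc) * idf[term], 5) for term in query]
        for doc in corpus
    ]

# Example 3: a term absent from every document scores 0.0 everywhere
print(tf_idf(
    [["apple", "banana", "cherry"],
     ["date", "elderberry", "fig"],
     ["grape", "honeydew", "kiwi"]],
    ["mango"],
))  # [[0.0], [0.0], [0.0]]
```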
Constraints