When you search for 'database optimization' and get 10,000 results, how does the search system decide which documents appear first? The answer lies in relevance scoring—mathematical models that quantify how well each document matches your query.
At the heart of most relevance scoring systems is TF-IDF (Term Frequency-Inverse Document Frequency), a deceptively simple formula that has powered search engines for decades. Despite its age, TF-IDF remains remarkably effective and forms the foundation even for modern neural ranking systems.
By the end of this page, you will understand term frequency, document frequency, inverse document frequency, and how they combine in TF-IDF scoring. You'll learn the mathematical intuition, practical considerations, and variants used in production systems.
Given a query and a collection of documents, we need to assign each document a relevance score indicating how well it satisfies the user's information need. This score must be:

- Efficient to compute, even across collections of millions of documents
- Comparable across documents, so results can be sorted into a ranking
- Derived from measurable properties of the text, not subjective judgment
What Makes a Document Relevant?
Intuitively, a document is relevant if:

1. It contains the query terms, and contains them frequently
2. It matches the rare, discriminative query terms, not just the common ones
3. It is genuinely focused on the topic, rather than mentioning it in passing inside a much longer text
4. The query terms appear close together, in a meaningful order
5. It comes from an authoritative, trustworthy source
TF-IDF captures the first three intuitions mathematically. More advanced models extend to the others.
The Bag-of-Words Model:
TF-IDF assumes a 'bag-of-words' representation—documents are treated as unordered collections of words. Word order, grammar, and meaning are ignored; only word presence and frequency matter.
```
Document: "The quick brown fox jumps over the lazy dog"

Bag-of-Words Representation:
{
  "the": 2,
  "quick": 1,
  "brown": 1,
  "fox": 1,
  "jumps": 1,
  "over": 1,
  "lazy": 1,
  "dog": 1
}

What's preserved: Word frequencies
What's lost: Word order, grammar, phrases

"The dog jumps over the lazy brown fox" has an identical representation!
Despite this limitation, bag-of-words works remarkably well for ranking.
```

The bag-of-words model succeeds because documents about a topic tend to use specific vocabulary regardless of how sentences are structured. A document about 'machine learning' will contain those words frequently, no matter the writing style.
Term Frequency (TF) measures how often a term appears in a document. The basic intuition: a document mentioning 'database' 10 times is probably more about databases than one mentioning it once.
Raw Term Frequency:
The simplest definition is the raw count:
```
Raw Term Frequency:
TF(t, d) = number of times term t appears in document d

Example Document:
"Database systems use database indexes to speed up database queries.
 Efficient database design matters."

TF("database", doc)  = 4
TF("systems", doc)   = 1
TF("efficient", doc) = 1
TF("algorithm", doc) = 0
```
Problems with Raw TF:

Raw counts have two main issues:

- Length bias: longer documents contain more occurrences of every term, so raw counts favor long documents regardless of relevance.
- Linear growth: the tenth occurrence of a term adds as much to the score as the first, even though it carries far less evidence of relevance.
Normalized Term Frequency:
To address length bias, normalize by document length:
```
Normalized Term Frequency:
TF(t, d) = count(t, d) / total_terms(d)

Example:
Doc A: 100 words,  "database" appears 10 times → TF = 10/100  = 0.10
Doc B: 1000 words, "database" appears 10 times → TF = 10/1000 = 0.01

Despite the same raw count, Doc A is more focused on databases.

Alternative: divide by the maximum term frequency in the document
TF(t, d) = count(t, d) / max_count(d)

This normalizes to the [0, 1] range.
```
Sublinear TF Scaling:

To address diminishing returns, use logarithmic scaling:
```
Sublinear (Log) Term Frequency:

TF(t, d) = 1 + log(count(t, d))   if count > 0
         = 0                      if count = 0

Examples (natural log):
count = 1   → TF = 1 + log(1)   = 1 + 0     = 1.0
count = 2   → TF = 1 + log(2)   = 1 + 0.693 = 1.693
count = 10  → TF = 1 + log(10)  = 1 + 2.303 = 3.303
count = 100 → TF = 1 + log(100) = 1 + 4.605 = 5.605

Key insight: 100 occurrences scores only ~5.6× higher than 1, not 100×.
This models the intuition that relevance doesn't scale linearly with frequency.
```

| Variant | Formula | Properties |
|---|---|---|
| Raw | count(t, d) | Simple but biased toward long docs |
| Normalized | count / doc_length | Length-independent |
| Log Normalized | 1 + log(count) | Diminishing returns, 0 for absent |
| Double Normalization | 0.5 + 0.5 × (count/max) | Bounded [0.5, 1], smoothed |
| Boolean | 1 if count > 0, else 0 | Ignores frequency entirely |
Most production systems use log-normalized TF (1 + log) or BM25's saturation function. Pure raw counts almost never work well because they over-emphasize high-frequency terms.
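A sketch of these variants as plain Python functions (the function names are ours, for illustration):

```python
import math
from collections import Counter

def tf_raw(count: int) -> float:
    return float(count)

def tf_normalized(count: int, doc_length: int) -> float:
    return count / doc_length

def tf_log(count: int) -> float:
    # Sublinear scaling: 0 for absent terms, 1 + ln(count) otherwise.
    return 1 + math.log(count) if count > 0 else 0.0

def tf_double_norm(count: int, max_count: int) -> float:
    # Bounded in [0.5, 1] for present terms.
    return 0.5 + 0.5 * (count / max_count)

counts = Counter({"database": 4, "systems": 1})
print(tf_log(counts["database"]))   # 1 + ln(4) ≈ 2.386
print(tf_log(counts["algorithm"]))  # 0.0 (term absent)
```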
Document Frequency (DF) counts how many documents in the collection contain a term. This measures how common or rare a term is across the entire collection.
Definition:
```
Document Frequency:
DF(t) = number of documents containing term t

Example Collection (N = 1,000,000 documents):

Term          | DF      | DF/N     | Interpretation
--------------|---------|----------|------------------------------
"the"         | 999,000 | 0.999    | Extremely common (stop word)
"database"    | 150,000 | 0.150    | Common technical term
"postgresql"  | 25,000  | 0.025    | Specific technology
"b-plus-tree" | 3,000   | 0.003    | Specialized concept
"xyzquery123" | 5       | 0.000005 | Very rare (perhaps jargon)

Key insight: Matching "b-plus-tree" is more significant than matching "the"
because fewer documents contain it.
```

Why DF Matters:
Consider a query: 'database performance optimization'
If 'database' appears in 50% of documents and 'optimization' appears in 5%, matching 'optimization' is more discriminative. A document matching this rare term is more likely to be relevant.
Collection Statistics:
DF is computed once when indexing and stored in the vocabulary:
```
Vocabulary Entry Structure:
---------------------------
{
  term: "database",
  document_frequency: 150000,
  collection_frequency: 2500000,  // Total occurrences across all docs
  posting_list_pointer: 0x4A3F00,
  posting_list_length: 150000
}

Document Frequency vs Collection Frequency:
- DF: In how many documents does the term appear? (used for IDF)
- CF: How many times does the term appear in total? (less commonly used)

A term in 100 documents with 1 occurrence each:   DF = 100, CF = 100
A term in 100 documents with 10 occurrences each: DF = 100, CF = 1000
```

Document frequency is calculated during indexing and stored in the vocabulary. At query time, looking up a term's DF is O(1)—just read it from the vocabulary entry. No scanning required.
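Here's a minimal sketch of how DF might be accumulated at indexing time—real engines build this alongside the posting lists, but the counting logic is the same idea:

```python
import re
from collections import Counter

def build_document_frequencies(docs: list[str]) -> Counter:
    """For each term, count how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        terms = set(re.findall(r"[a-z0-9-]+", doc.lower()))  # set: each doc counts once
        df.update(terms)
    return df

docs = [
    "Database systems use database indexes",
    "Query optimization for database performance",
    "Neural networks and deep learning",
]
df = build_document_frequencies(docs)
print(df["database"])  # 2 — present in two of the three documents
```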
Inverse Document Frequency (IDF) transforms document frequency into a weight that favors rare terms. The intuition: terms appearing in many documents are less discriminative.
Basic IDF Formula:
```
Basic IDF:
IDF(t) = log(N / DF(t))

Where:
  N     = total number of documents in collection
  DF(t) = number of documents containing term t

Example (N = 1,000,000, natural log):

Term          | DF      | N/DF    | IDF = log(N/DF)
--------------|---------|---------|----------------
"the"         | 999,000 | 1.001   | 0.001
"database"    | 150,000 | 6.67    | 1.90
"postgresql"  | 25,000  | 40      | 3.69
"b-plus-tree" | 3,000   | 333     | 5.81
"xyzquery123" | 5       | 200,000 | 12.21

Interpretation:
- "the" has near-zero IDF (useless for ranking)
- Rare terms have high IDF (very discriminative)
```

Why Logarithm?
The logarithm serves multiple purposes:

- It dampens the range of weights: a term that is 1,000× rarer is more valuable, but not 1,000× more valuable
- It keeps per-term scores on a comparable scale, so they can be meaningfully summed across query terms
- It sends universal terms to exactly zero: DF = N gives IDF = log(1) = 0
IDF Variants:
| Variant | Formula | Notes |
|---|---|---|
| Standard | log(N / DF) | Basic form; zero when DF = N, never negative |
| Smooth | log(1 + N / DF) | Always positive, smoother |
| Probabilistic | log((N - DF) / DF) | Used in BM25; negative when DF > N/2 |
| Max | log(max_DF / DF) | Relative to most common term |
| Plus One | log(N / (DF + 1)) | Avoids division by zero |
```
Edge Cases and Solutions:

Problem 1: Term appears in all documents
  DF(t) = N → IDF = log(N/N) = log(1) = 0
  Solution: This is correct! Universal terms shouldn't affect ranking.

Problem 2: Term not in collection (DF = 0)
  IDF = log(N/0) = undefined
  Solution: Use log(N/(DF+1)) or return 0 for missing terms.

Problem 3: Collection grows (N changes)
  All IDFs change, affecting stored scores.
  Solution: Recompute IDFs periodically or use relative IDF.

Problem 4: Very rare term becomes overly powerful
  DF = 1 → IDF = log(1,000,000) = 13.8 (dominates everything)
  Solution: Cap maximum IDF or use smooth variants.
```

IDF naturally down-weights stop words like 'the', 'is', and 'and' to near-zero. This is why TF-IDF-based systems often don't need explicit stop word removal—IDF handles it automatically by making common words irrelevant to scoring.
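Putting the smoothing and capping together, here's a sketch of a defensive IDF function (the +1 smoothing and the cap value are illustrative choices, not a universal standard):

```python
import math
from collections import Counter

def idf(term: str, df: Counter, n_docs: int, max_idf: float = 10.0) -> float:
    """Smoothed IDF: log(N / (DF + 1)) — defined even when DF = 0 — then capped."""
    value = math.log(n_docs / (df[term] + 1))
    return max(0.0, min(value, max_idf))

df = Counter({"the": 999_000, "database": 150_000, "b-plus-tree": 3_000})
N = 1_000_000
print(round(idf("the", df, N), 3))            # 0.001 — stop word, near zero
print(round(idf("b-plus-tree", df, N), 2))    # 5.81  — rare, discriminative
print(round(idf("xyznotindexed", df, N), 2))  # 10.0  — DF=0 is safe, capped
```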
TF-IDF combines term frequency and inverse document frequency to score term importance in a document relative to a collection.
The Core Formula:
```
TF-IDF Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)

Full expansion (with log normalization):
TF-IDF(t, d) = (1 + log(count(t, d))) × log(N / DF(t))
             = TF component × IDF component

What this captures:
- High TF, High IDF: Term appears often in doc, rarely in collection → HIGH SCORE
- High TF, Low IDF:  Term appears often but is common → MODERATE SCORE
- Low TF, High IDF:  Rare term appears once → MODERATE SCORE
- Low TF, Low IDF:   Common term appears once → LOW SCORE
- Zero TF:           Term not in document → ZERO SCORE
```

Worked Example:
Let's compute TF-IDF for a query against three documents:
```
Collection: N = 10,000 documents
Query: "database optimization"

Term Statistics:
  "database":     DF = 2,000 → IDF = log(10000/2000) = 1.61
  "optimization": DF = 500   → IDF = log(10000/500)  = 3.00

Document Analysis:

Doc A: "Database performance and database optimization techniques"
  TF("database")     = 2 → 1 + log(2) = 1.69
  TF("optimization") = 1 → 1 + log(1) = 1.00
  TF-IDF("database")     = 1.69 × 1.61 = 2.72
  TF-IDF("optimization") = 1.00 × 3.00 = 3.00
  Doc A Score = 2.72 + 3.00 = 5.72

Doc B: "The database stores data in tables"
  TF("database")     = 1 → 1 + log(1) = 1.00
  TF("optimization") = 0 → 0
  TF-IDF("database")     = 1.00 × 1.61 = 1.61
  TF-IDF("optimization") = 0
  Doc B Score = 1.61

Doc C: "Query optimization and index optimization for performance"
  TF("database")     = 0 → 0
  TF("optimization") = 2 → 1 + log(2) = 1.69
  TF-IDF("database")     = 0
  TF-IDF("optimization") = 1.69 × 3.00 = 5.07
  Doc C Score = 5.07

Ranking: A (5.72) > C (5.07) > B (1.61)

Doc A wins because it contains both terms!
```

Notice that Doc A scores highest not just because it has both terms, but because matching the rarer term 'optimization' (IDF = 3.00) contributes more than the common term 'database' (IDF = 1.61). TF-IDF naturally emphasizes specific, discriminative terms.
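This hand calculation is easy to verify in code. A self-contained sketch with the example's DF values hard-coded (whitespace tokenization is a deliberate simplification):

```python
import math

N = 10_000
DF = {"database": 2_000, "optimization": 500}
docs = {
    "A": "database performance and database optimization techniques",
    "B": "the database stores data in tables",
    "C": "query optimization and index optimization for performance",
}

def score(query_terms: list[str], text: str) -> float:
    words = text.split()
    total = 0.0
    for t in query_terms:
        count = words.count(t)
        if count == 0:
            continue                     # zero TF contributes nothing
        tf = 1 + math.log(count)         # sublinear TF
        idf = math.log(N / DF[t])        # basic IDF
        total += tf * idf
    return total

for name, text in docs.items():
    print(name, round(score(["database", "optimization"], text), 2))
# A 5.72, B 1.61, C 5.07 — matches the hand calculation
```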
TF-IDF naturally leads to the Vector Space Model (VSM)—representing documents and queries as vectors in a high-dimensional space where each dimension corresponds to a term.
Documents and Queries as Vectors:
```
Vocabulary:  [database, optimization, performance, index, query]
Dimension:       0          1             2          3      4

Document vectors (TF-IDF weights):
Doc A: [2.72, 3.00, 1.38, 0.00, 0.00]  "Database optimization..."
Doc B: [1.61, 0.00, 0.00, 0.00, 0.00]  "The database stores..."
Doc C: [0.00, 5.07, 2.76, 2.15, 1.85]  "Query optimization..."

Query vector:
Query: [1.61, 3.00, 0.00, 0.00, 0.00]  "database optimization"

Each document is a point in 5-dimensional space.
The query is also a point in the same space.
"Similar" documents are nearby in this space.
```
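Because almost every entry is zero, production systems store these vectors sparsely—as term → weight maps—rather than as dense arrays. A minimal sketch:

```python
# Sparse representation: only nonzero dimensions are stored.
doc_a = {"database": 2.72, "optimization": 3.00, "performance": 1.38}
doc_b = {"database": 1.61}
doc_c = {"optimization": 5.07, "performance": 2.76, "index": 2.15, "query": 1.85}
query = {"database": 1.61, "optimization": 3.00}

def dot(u: dict, v: dict) -> float:
    """Dot product over shared terms only (missing terms contribute 0)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

print(dot(query, doc_a))  # 1.61×2.72 + 3.00×3.00 = 13.38
```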
Cosine Similarity:

Rather than summing TF-IDF scores, we often compute cosine similarity between query and document vectors. This measures the angle between vectors, ignoring magnitude:
```
Cosine Similarity:

                    ∑(qᵢ × dᵢ)
cosine(Q, D) = ─────────────────
                  |Q| × |D|

Where:
  qᵢ  = query vector component for term i
  dᵢ  = document vector component for term i
  |Q| = length (magnitude) of query vector = √(∑qᵢ²)
  |D| = length of document vector          = √(∑dᵢ²)

Example:
Q = [1.61, 3.00, 0, 0, 0]    → |Q| = √(1.61² + 3.00²)         = 3.40
D = [2.72, 3.00, 1.38, 0, 0] → |D| = √(2.72² + 3.00² + 1.38²) = 4.28

Dot product: 1.61×2.72 + 3.00×3.00 + 0×1.38 = 4.38 + 9.00 = 13.38

Cosine = 13.38 / (3.40 × 4.28) = 13.38 / 14.55 = 0.92

Interpretation: Very high similarity (max = 1.0)
```

Why Cosine Similarity?

Because it measures only the angle, cosine doesn't reward documents simply for being long—long documents accumulate larger TF-IDF weights, but magnitude cancels out. And with non-negative weights the score is bounded in [0, 1], making results easy to compare and threshold.
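The computation is compact in code. A self-contained sketch using the sparse map representation from above (values match the worked example):

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse TF-IDF vectors (term → weight)."""
    num = sum(w * v[t] for t, w in u.items() if t in v)   # dot product
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return num / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = {"database": 1.61, "optimization": 3.00}
doc_a = {"database": 2.72, "optimization": 3.00, "performance": 1.38}
print(round(cosine(query, doc_a), 2))  # ≈ 0.92, matching the hand calculation
```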
Real vector spaces have dimensions equal to the vocabulary size—often 100,000+. Visualization is impossible, but the intuitions from 2D and 3D still apply: most documents are far from each other (the vectors are sparse), and a query retrieves its nearest neighbors.
While remarkably effective, TF-IDF has known limitations that have driven research into improved models.
Known Limitations:

- Bag-of-words: word order, phrases, and term proximity are ignored
- Exact matching only: synonyms and related terms contribute nothing ('automobile' doesn't match 'car')
- Unbounded TF: even with log scaling, a term's contribution grows without limit
- Crude length handling: normalization is bolted on rather than built into the model
BM25: The Industry Standard:
BM25 (Best Match 25) is the most widely used improvement over basic TF-IDF:
```
BM25 Scoring Formula:

                            (k₁ + 1) × TF(t,d)
BM25(t,d) = IDF(t) × ─────────────────────────────────────
                     TF(t,d) + k₁ × (1 - b + b × |d|/avgdl)

Where:
  k₁    = term frequency saturation parameter (typically 1.2-2.0)
  b     = document length normalization (typically 0.75)
  |d|   = document length
  avgdl = average document length in collection

Key Improvements over TF-IDF:

1. TF Saturation: As TF grows, the contribution asymptotically approaches (k₁ + 1).
   With k₁ = 2 and |d| = avgdl:
   TF=1:   contribution = 1.0
   TF=5:   contribution ≈ 2.1 (not 5×)
   TF=100: contribution ≈ 2.9 (nearly the same as TF=50 at ≈ 2.9)

2. Length Normalization: Parameter b controls the length penalty
   b=0: No length normalization
   b=1: Full length normalization

3. Probabilistic IDF: log((N - DF + 0.5) / (DF + 0.5))
```

| Aspect | TF-IDF | BM25 |
|---|---|---|
| TF handling | Log or raw | Saturating (asymptotic) |
| Length normalization | Optional, separate | Built into formula |
| Parameters | Choice of TF/IDF formulas | k₁ and b tunable |
| Performance | Good baseline | Usually 5-20% better |
| Adoption | Educational, simple systems | Elasticsearch, Lucene, etc. |
BM25 is the default scoring model in Elasticsearch, Apache Solr, and most production search systems. Understanding TF-IDF conceptually translates directly to BM25—the intuitions are identical, just with better mathematical formulation.
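To make the formula concrete, here's a sketch of the per-term BM25 score as a standalone Python function—an illustration of the formula above, not Elasticsearch's or Lucene's exact implementation:

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int, doc_len: int,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))  # probabilistic IDF
    norm = 1 - b + b * (doc_len / avg_doc_len)        # length normalization
    return idf * ((k1 + 1) * tf) / (tf + k1 * norm)

# Saturation in action (average-length doc, N = 10,000, DF = 500):
for tf in (1, 5, 100):
    print(tf, round(bm25_term_score(tf, 500, 10_000, 100, 100.0), 2))
# 1 → 2.94, 5 → 5.22, 100 → 6.4 — each extra occurrence adds less and less
```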
We've explored the mathematical foundations of relevance scoring in full-text search. Let's consolidate the key concepts:

- Term Frequency (TF): how often a term appears in a document, usually log-scaled and length-normalized
- Document Frequency (DF): how many documents contain a term, computed once at indexing time
- Inverse Document Frequency (IDF): log(N/DF), weighting rare, discriminative terms highly and common terms near zero
- TF-IDF: the product TF × IDF, summed over query terms to score each document
- Vector Space Model: documents and queries as sparse TF-IDF vectors, compared with cosine similarity
- BM25: the production successor to TF-IDF, adding TF saturation and built-in length normalization
What's Next:
The next page explores Stop Words—those extremely common words that TF-IDF down-weights automatically. We'll understand when explicit stop word removal helps, when it hurts, and how to handle them in production systems.
You now understand the mathematical foundation of relevance scoring. TF-IDF and its successor BM25 power billions of daily searches. This knowledge enables you to tune search systems, debug ranking issues, and understand why certain results appear.