In the previous page, we established that Term Frequency measures a word's importance within a document. But we identified a critical limitation: common words like "the," "is," and "of" achieve high TF scores in virtually every document, despite carrying minimal semantic value.
The question becomes: How do we distinguish terms that are genuinely informative from those that are merely frequent?
The answer lies in a corpus-level perspective. A term appearing in every document tells us nothing about what makes any particular document unique. Conversely, a term appearing in only a handful of documents is highly discriminative—its presence strongly signals specific content.
This insight leads us to Inverse Document Frequency (IDF): a measure of how much information a term provides based on its rarity across a document collection.
By the end of this page, you will understand: (1) The mathematical derivation and intuition of IDF, (2) The information-theoretic interpretation connecting IDF to entropy, (3) Smoothed variants that handle edge cases, (4) The document frequency spectrum and its implications, (5) Practical implementation patterns for large-scale corpora.
Inverse Document Frequency answers a simple question: How rare is this term across the corpus?
Consider a corpus of 10,000 news articles:
| Term | Documents Containing Term | Rarity |
|---|---|---|
| "the" | 10,000 (100%) | Completely common |
| "government" | 2,500 (25%) | Moderately common |
| "cryptocurrency" | 150 (1.5%) | Relatively rare |
| "zkSNARK" | 3 (0.03%) | Extremely rare |
Terms appearing everywhere ("the") provide no discrimination—they can't help distinguish one document from another. Terms appearing rarely ("zkSNARK") are highly discriminative—their presence strongly indicates specific content.
The IDF Principle:
$$\text{Discriminative Power} \propto \frac{1}{\text{Document Frequency}}$$
We "invert" the document frequency: terms in many documents get low weight; terms in few documents get high weight.
Think of IDF as measuring "uniqueness" or "specificity." A term that appears in only 1% of documents is 100x more unique than one appearing in every document. By inverting the frequency, we quantify this uniqueness numerically.
Visual Intuition:
Imagine you're searching for a specific research paper in a library. Searching for "the" or "analysis" matches nearly every shelf and gets you nowhere; searching for a rare technical term like "zkSNARK" immediately narrows the search to a handful of works. IDF formalizes this intuition: the more documents contain a term, the less informative that term is for retrieval and classification.
Let's develop the mathematical formulation of IDF with precision.
Notation:
- $N$: the total number of documents in the corpus
- $\text{df}(t)$: the document frequency of term $t$—the number of documents that contain $t$ at least once
The Standard IDF Formula:
$$\text{idf}(t) = \log\left(\frac{N}{\text{df}(t)}\right)$$
Interpretation:
- If $t$ appears in every document, $\text{df}(t) = N$ and $\text{idf}(t) = \log(1) = 0$: the term carries no discriminative weight.
- As $\text{df}(t)$ decreases, $\text{idf}(t)$ grows, reaching its maximum of $\log(N)$ when $\text{df}(t) = 1$.
Example Calculations:
For a corpus of $N = 10,000$ documents:
| Term | df(t) | N/df(t) | idf(t) = log(N/df) |
|---|---|---|---|
| "the" | 10,000 | 1 | 0.00 |
| "government" | 2,500 | 4 | 1.39 |
| "cryptocurrency" | 150 | 66.67 | 4.20 |
| "zkSNARK" | 3 | 3,333 | 8.11 |
Note the logarithmic compression: "zkSNARK" appears in over 3,000× fewer documents than "the," yet its IDF is only about 8 rather than thousands. The logarithm prevents extremely rare terms from completely dominating the weighting.
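The table above can be reproduced in a few lines. This is a minimal sketch using the natural logarithm, as in the standard formula:

```python
import math

N = 10_000  # corpus size from the example above

def idf(df: int, n: int = N) -> float:
    """Standard IDF: log(N / df), natural logarithm."""
    return math.log(n / df)

# Reproduce the table values
print(f"the:            {idf(10_000):.2f}")  # 0.00
print(f"government:     {idf(2_500):.2f}")   # 1.39
print(f"cryptocurrency: {idf(150):.2f}")     # 4.20
print(f"zkSNARK:        {idf(3):.2f}")       # 8.11
```

Swapping `math.log` for `math.log2` or `math.log10` changes the scale but not the ranking, which is why libraries differ on the base without changing behavior in ranked retrieval.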
IDF uses document frequency (how many documents contain the term), not total term frequency (how many times the term appears across all documents). A term appearing 100 times in one document but nowhere else has df(t) = 1, giving it maximum IDF. This is intentional: we measure the term's distribution across documents, not its raw count.
Formal Properties of IDF:
Non-negativity: $\text{idf}(t) \geq 0$, since $\text{df}(t) \leq N$ holds for every term that appears in the corpus
Monotonicity: If $\text{df}(t_1) < \text{df}(t_2)$, then $\text{idf}(t_1) > \text{idf}(t_2)$
Bounded range: $\text{idf}(t) \in [0, \log(N)]$
Corpus-dependent: IDF values depend on the specific corpus; the same term has different IDF in different collections
Term-specific, document-agnostic: Unlike TF, IDF is computed per term, not per (term, document) pair
The logarithm in IDF isn't arbitrary—it connects directly to information theory. Understanding this connection provides deep insight into why IDF works.
Shannon's Information Content:
In information theory, the information content (or "surprisal") of an event with probability $p$ is:
$$I = -\log(p) = \log\left(\frac{1}{p}\right)$$
IDF as Information Content:
Consider the probability that a randomly selected document contains term $t$:
$$P(t) = \frac{\text{df}(t)}{N}$$
The information content of observing term $t$ is:
$$I(t) = \log\left(\frac{1}{P(t)}\right) = \log\left(\frac{N}{\text{df}(t)}\right) = \text{idf}(t)$$
IDF equals the Shannon information content of observing a term!
IDF measures how much information (in the Shannon sense) a term provides. Common terms are predictable—observing them tells us little. Rare terms are surprising—observing them is highly informative. This information-theoretic foundation explains why IDF is so effective: it weights terms by their actual information content.
Connection to Entropy:
Entropy measures the average information content across all possible outcomes. For a corpus with vocabulary $V$:
$$H = -\sum_{t \in V} P(t) \log P(t)$$
IDF plays a role analogous to the $-\log P(t)$ term: it measures the information gained from each term.
Why Logarithms Are Natural:
Additivity: Information from independent events adds: $I(A \text{ and } B) = I(A) + I(B)$. Logarithms convert multiplication to addition.
Bits interpretation: Using $\log_2$, IDF gives information in bits. A term with IDF = 10 bits tells you as much about the document as the answers to 10 yes/no questions.
Compression bound: The logarithm connects to minimum average code length for encoding term occurrences.
Example: Bits of Information
Using $\log_2$ for a corpus of $N = 1,024$ documents:
| Term | df(t) | P(t) | idf(t) = $\log_2$(1024/df) | Interpretation |
|---|---|---|---|---|
| "the" | 1,024 | 1.0 | 0 bits | No information |
| "software" | 512 | 0.5 | 1 bit | Like a coin flip |
| "kubernetes" | 64 | 0.0625 | 4 bits | Significant info |
| "istio" | 8 | 0.0078 | 7 bits | Highly informative |
| "zkSNARK" | 1 | 0.001 | 10 bits | Maximum info |
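The bits in the table fall out directly from $\log_2(N/\text{df})$. A quick check of the round-number entries:

```python
import math

N = 1024  # corpus size chosen so the bit counts are whole numbers
terms = {"the": 1024, "software": 512, "kubernetes": 64, "istio": 8, "zkSNARK": 1}

for term, df in terms.items():
    p = df / N
    bits = math.log2(N / df)  # idf in bits = -log2 P(t)
    print(f"{term:<12} P={p:.4f}  idf={bits:.0f} bits")
```

Because 1024 is a power of two, every entry comes out as an exact integer number of bits, which is why the example corpus size was chosen this way.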
The standard IDF formula has edge cases and numerical issues that motivate various smoothed variants.
Problem 1: Division by Zero
If a term doesn't appear in the training corpus but appears at inference time, $\text{df}(t) = 0$, causing division by zero.
Problem 2: Zero IDF for Universal Terms
Terms appearing in every document get $\text{idf}(t) = 0$, completely ignoring them. Sometimes we want a small positive weight.
Problem 3: Extreme Values for Rare Terms
Terms appearing in only one document get IDF = $\log(N)$, which can be very large for big corpora, potentially over-weighting singleton terms that might be typos or noise.
| Variant | Formula | Purpose | Notes |
|---|---|---|---|
| Standard IDF | $\log\left(\frac{N}{\text{df}(t)}\right)$ | Basic formulation | Division by zero if df=0 |
| Smooth IDF | $\log\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1$ | Prevent division by zero | Scikit-learn default |
| Probabilistic IDF | $\log\left(\frac{N - \text{df}(t)}{\text{df}(t)}\right)$ | Negative weight if df > N/2 | Used in BM25/Okapi |
| Max IDF | $\log\left(\frac{\max_t \text{df}(t)}{\text{df}(t)}\right)$ | Relative to most common term | Corpus-relative scaling |
| Unary IDF | $1$ (constant) | IDF disabled | Reduces to raw TF |
Deep Dive: Scikit-learn's Smooth IDF
Scikit-learn's TfidfVectorizer uses the following smooth IDF by default:
$$\text{idf}(t) = \log\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1$$
Analysis: The $+1$ inside the fraction acts as if one extra pseudo-document containing every term had been added to the corpus, so the ratio is defined even for unseen terms with $\text{df}(t) = 0$. The $+1$ added after the logarithm guarantees that even terms appearing in every document keep a small positive weight (exactly 1) instead of being zeroed out entirely.
Deep Dive: Probabilistic IDF (BM25)
The Okapi BM25 ranking function uses probabilistic IDF:
$$\text{idf}_{\text{prob}}(t) = \log\left(\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5}\right)$$
Key Property: This can be negative when $\text{df}(t) > N/2$.
Interpretation: Terms appearing in more than half the documents are actually anti-discriminative—their absence is more informative than their presence. The probabilistic IDF captures this nuance.
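The sign change is easy to demonstrate. A minimal sketch of the BM25 formula from above, evaluated on either side of the $N/2$ boundary:

```python
import math

def bm25_idf(df: int, n: int) -> float:
    """Probabilistic (Okapi BM25) IDF: log((N - df + 0.5) / (df + 0.5))."""
    return math.log((n - df + 0.5) / (df + 0.5))

N = 1000
print(bm25_idf(100, N))  # df well below N/2: positive weight
print(bm25_idf(600, N))  # df above N/2: negative weight
```

Production BM25 implementations often clamp this value at zero (or use the smooth variant) so that matching a very common term never actively lowers a document's score.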
Different IDF variants can significantly affect model performance. When comparing results across papers or libraries, always check which IDF formula is used. Scikit-learn's smooth IDF differs from the standard textbook formula, which can cause confusion when replicating results.
Understanding the distribution of document frequencies across a vocabulary reveals important properties of text corpora and guides preprocessing decisions.
Zipf's Law and Document Frequency:
Text follows Zipf's law: a small number of terms are extremely frequent, while a large number of terms are rare. The document frequency distribution typically shows a handful of head terms appearing in nearly every document, a modest band of mid-frequency content words, and a long tail of terms that occur in only one or two documents.
The DF Spectrum:
| DF Range | % of Vocab | Term Types | IDF Range |
|---|---|---|---|
| >90% docs | ~0.1% | Stop words (the, a, is) | 0 - 0.1 |
| 50-90% docs | ~1% | Common words (said, people, new) | 0.1 - 0.7 |
| 10-50% docs | ~5% | Content words (government, technology) | 0.7 - 2.3 |
| 1-10% docs | ~20% | Domain terms (neural, algorithm) | 2.3 - 4.6 |
| <1% docs | ~75% | Rare terms (zkrollup, eigendecomposition) | > 4.6 |
Implications for Feature Engineering:
Stop Word Identification: Terms with DF > 80-90% are candidates for removal
Rare Term Thresholds: Terms with DF = 1 or 2 are often noise; setting min_df = 2 or 3 reduces vocabulary significantly
Vocabulary Size vs. DF Threshold Trade-off:
| min_df | Vocabulary Reduction | Risk |
|---|---|---|
| 1 | 0% (baseline) | Noise, typos, overfitting |
| 2 | ~30-50% | Still includes rare valid terms |
| 5 | ~50-70% | Balanced for most tasks |
| 10 | ~60-80% | May lose important rare terms |
| 100 | ~90% | Only frequent terms remain |
The Sweet Spot:
The most discriminative terms typically have DF in the 1-10% range: frequent enough to be statistically reliable, yet rare enough to distinguish documents from one another. This "torso" region is where IDF provides the most value.
Before building any text model, plot the cumulative document frequency distribution of your vocabulary. This reveals: (1) How much vocabulary you can eliminate with min_df thresholds, (2) Whether your corpus has unusual DF patterns, (3) The IDF range you'll actually see in practice. Many practitioners skip this step and miss obvious preprocessing opportunities.
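One quick way to act on this advice without plotting anything is to tabulate how much vocabulary each `min_df` threshold would keep. A minimal sketch on a toy corpus (the documents here are illustrative, not from the text):

```python
from collections import Counter

docs = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "hard"],
    ["machine", "learning", "rocks"],
    ["the", "cat", "sat"],
]

# Document frequency: each document counts a term at most once
df = Counter(t for doc in docs for t in set(doc))

for min_df in (1, 2, 3):
    kept = sum(1 for c in df.values() if c >= min_df)
    print(f"min_df={min_df}: keep {kept}/{len(df)} terms "
          f"({kept / len(df):.0%} of vocabulary)")
```

Even on this tiny corpus the long tail is visible: moving from `min_df=1` to `min_df=2` drops the singletons, which are the bulk of the vocabulary.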
Let's implement IDF computation with attention to efficiency, edge cases, and variant support.
```python
import numpy as np
from collections import Counter
from typing import Dict, List, Literal
from dataclasses import dataclass
import matplotlib.pyplot as plt


@dataclass
class IDFStatistics:
    """Statistics about IDF computation for a corpus."""
    n_documents: int
    vocabulary_size: int
    min_df: int
    max_df: int
    mean_df: float
    median_df: float
    min_idf: float
    max_idf: float
    mean_idf: float
    terms_with_zero_idf: int


class IDFCalculator:
    """
    Compute Inverse Document Frequency with multiple variant support.

    Supports:
    - Standard IDF:      log(N / df)
    - Smooth IDF:        log((1 + N) / (1 + df)) + 1   (scikit-learn default)
    - Probabilistic IDF: log((N - df + 0.5) / (df + 0.5))   (BM25)
    - Max IDF:           log(max_df / df)
    """

    IDFVariant = Literal["standard", "smooth", "probabilistic", "max"]

    def __init__(
        self,
        variant: IDFVariant = "smooth",
        min_df: int = 1,
        max_df_ratio: float = 1.0,
    ):
        """
        Parameters
        ----------
        variant : str
            IDF variant to use.
        min_df : int
            Minimum document frequency threshold.
        max_df_ratio : float
            Maximum document frequency as a proportion of N.
        """
        self.variant = variant
        self.min_df = min_df
        self.max_df_ratio = max_df_ratio
        self.document_frequencies_: Dict[str, int] = {}
        self.idf_values_: Dict[str, float] = {}
        self.n_documents_: int = 0
        self._fitted = False

    def fit(self, tokenized_documents: List[List[str]]) -> "IDFCalculator":
        """Compute document frequencies and IDF values from a corpus of
        pre-tokenized documents (a list of token lists)."""
        self.n_documents_ = len(tokenized_documents)
        N = self.n_documents_

        # Count document frequencies: a document contributes at most once
        # per term, so deduplicate its tokens with set()
        df_counter = Counter()
        for doc_tokens in tokenized_documents:
            for token in set(doc_tokens):
                df_counter[token] += 1

        # Apply df thresholds
        max_df = int(self.max_df_ratio * N)
        self.document_frequencies_ = {
            term: df for term, df in df_counter.items()
            if self.min_df <= df <= max_df
        }

        # Compute IDF for each surviving term
        self.idf_values_ = {
            term: self._compute_idf(df, N)
            for term, df in self.document_frequencies_.items()
        }
        self._fitted = True
        return self

    def _compute_idf(self, df: int, N: int) -> float:
        """Compute IDF value based on the selected variant."""
        if self.variant == "standard":
            if df == 0:
                return 0.0  # unseen term: avoid division by zero
            return np.log(N / df)
        elif self.variant == "smooth":
            return np.log((1 + N) / (1 + df)) + 1
        elif self.variant == "probabilistic":
            return np.log((N - df + 0.5) / (df + 0.5))
        elif self.variant == "max":
            max_df = max(self.document_frequencies_.values())
            if df == 0:
                return 0.0
            return np.log(max_df / df)
        else:
            raise ValueError(f"Unknown IDF variant: {self.variant}")

    def get_idf(self, term: str, default: float = 0.0) -> float:
        """Get the IDF value for a term (default for out-of-vocabulary terms)."""
        return self.idf_values_.get(term, default)

    def transform(self, terms: List[str]) -> np.ndarray:
        """Get IDF values for a list of terms."""
        return np.array([self.get_idf(t) for t in terms])

    def get_statistics(self) -> IDFStatistics:
        """Compute summary statistics about DF and IDF values."""
        if not self._fitted:
            raise RuntimeError("Must fit before getting statistics")
        dfs = list(self.document_frequencies_.values())
        idfs = list(self.idf_values_.values())
        return IDFStatistics(
            n_documents=self.n_documents_,
            vocabulary_size=len(self.document_frequencies_),
            min_df=min(dfs) if dfs else 0,
            max_df=max(dfs) if dfs else 0,
            mean_df=float(np.mean(dfs)) if dfs else 0.0,
            median_df=float(np.median(dfs)) if dfs else 0.0,
            min_idf=min(idfs) if idfs else 0.0,
            max_idf=max(idfs) if idfs else 0.0,
            mean_idf=float(np.mean(idfs)) if idfs else 0.0,
            terms_with_zero_idf=sum(1 for i in idfs if i <= 0),
        )

    def plot_df_distribution(self, bins: int = 50):
        """Plot document frequency and IDF distributions side by side."""
        if not self._fitted:
            raise RuntimeError("Must fit before plotting")
        dfs = list(self.document_frequencies_.values())
        idfs = list(self.idf_values_.values())

        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # DF histogram (log-scaled counts: the Zipfian tail is huge)
        axes[0].hist(dfs, bins=bins, edgecolor="black", alpha=0.7)
        axes[0].set_xlabel("Document Frequency")
        axes[0].set_ylabel("Number of Terms")
        axes[0].set_title("Document Frequency Distribution")
        axes[0].set_yscale("log")

        # IDF histogram
        axes[1].hist(idfs, bins=bins, edgecolor="black", alpha=0.7, color="orange")
        axes[1].set_xlabel("IDF Value")
        axes[1].set_ylabel("Number of Terms")
        axes[1].set_title(f"IDF Distribution ({self.variant})")

        plt.tight_layout()
        return fig


def compare_idf_variants(tokenized_docs: List[List[str]]) -> Dict:
    """
    Compare IDF values across variants for the most interesting terms
    (the five highest- and five lowest-IDF terms under standard IDF).
    Returns a term -> {variant: idf_value, "df": df} mapping.
    """
    variants = ["standard", "smooth", "probabilistic"]
    results = {}
    for variant in variants:
        calc = IDFCalculator(variant=variant)
        calc.fit(tokenized_docs)
        results[variant] = calc.idf_values_.copy()

    # Rank terms by standard IDF to pick the extremes
    standard_calc = IDFCalculator(variant="standard")
    standard_calc.fit(tokenized_docs)
    sorted_terms = sorted(
        standard_calc.idf_values_.items(), key=lambda x: x[1], reverse=True
    )
    comparison_terms = (
        [t for t, _ in sorted_terms[:5]] + [t for t, _ in sorted_terms[-5:]]
    )

    comparison = {}
    for term in comparison_terms:
        comparison[term] = {v: results[v].get(term, 0) for v in variants}
        comparison[term]["df"] = standard_calc.document_frequencies_.get(term, 0)
    return comparison


# Example usage
if __name__ == "__main__":
    documents = [
        ["machine", "learning", "algorithm", "data"],
        ["deep", "learning", "neural", "network"],
        ["machine", "learning", "model", "training"],
        ["the", "quick", "brown", "fox"],
        ["neural", "network", "deep", "learning"],
        ["data", "science", "machine", "learning"],
        ["algorithm", "optimization", "gradient", "descent"],
        ["the", "lazy", "dog", "sleeps"],
    ]

    print("=" * 60)
    print("IDF VARIANT COMPARISON")
    print("=" * 60)
    for variant in ["standard", "smooth", "probabilistic"]:
        calc = IDFCalculator(variant=variant)
        calc.fit(documents)
        stats = calc.get_statistics()
        print(f"{variant.upper()} IDF:")
        print(f"  Vocabulary size: {stats.vocabulary_size}")
        print(f"  IDF range: [{stats.min_idf:.3f}, {stats.max_idf:.3f}]")
        print(f"  Mean IDF: {stats.mean_idf:.3f}")
        print(f"  Terms with zero/negative IDF: {stats.terms_with_zero_idf}")

    print("\n" + "=" * 60)
    print("TERM-BY-TERM COMPARISON")
    print("=" * 60)
    comparison = compare_idf_variants(documents)
    print(f"{'Term':<15} {'DF':<5} {'Standard':<10} {'Smooth':<10} {'Prob.':<10}")
    print("-" * 50)
    for term, values in comparison.items():
        print(f"{term:<15} {values['df']:<5} {values['standard']:<10.3f} "
              f"{values['smooth']:<10.3f} {values['probabilistic']:<10.3f}")
```

This implementation separates IDF calculation from TF calculation, allowing you to compute IDF once and apply it to multiple TF representations. In production, libraries like scikit-learn combine TF and IDF computation for efficiency, but understanding them separately aids debugging and customization.
Computing IDF for large corpora requires careful attention to memory and computation.
Scalability Characteristics:
| Corpus Size | Documents | Vocabulary | Memory (DF Dict) | Time |
|---|---|---|---|---|
| Small | 10K | 50K | ~2 MB | Seconds |
| Medium | 100K | 200K | ~10 MB | Minutes |
| Large | 1M | 500K | ~25 MB | 10s of minutes |
| Web-scale | 100M+ | 10M+ | ~500 MB+ | Hours (distributed) |
Key Insight: IDF computation requires only a single pass over the corpus to count document frequencies. The resulting DF dictionary is relatively compact (linear in vocabulary size), not quadratic in corpus size.
The Corpus Stability Question:
IDF is computed on a training corpus, but documents at inference time may use different language. How stable is IDF?
Empirical finding: For large, representative training corpora (>100K documents), IDF values for common and mid-frequency terms are remarkably stable. Rare terms show more variance but have less impact on downstream tasks.
Best Practice: Periodically recompute IDF on updated corpora, especially if domain or language use shifts. For search engines, IDF is typically recomputed with each major index update.
Adding a new document to a corpus can change IDF for every term in that document. In practice, small additions (< 1% of corpus) have negligible effect on IDF. For streaming systems, IDF is typically updated in batches rather than per-document.
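The batched update pattern can be sketched in a few lines. This is an illustrative class (the name `StreamingIDF` and its interface are assumptions, not from any library), accumulating document frequencies and recomputing IDF once per batch:

```python
import math
from collections import Counter

class StreamingIDF:
    """Accumulate document frequencies and recompute IDF once per batch."""

    def __init__(self):
        self.df = Counter()
        self.n_docs = 0
        self.idf = {}

    def add_batch(self, tokenized_docs):
        """Fold a batch of documents into the DF counts, then refresh IDF."""
        for doc in tokenized_docs:
            self.n_docs += 1
            for term in set(doc):  # each doc counts a term at most once
                self.df[term] += 1
        # Recompute once per batch, not per document
        self.idf = {t: math.log(self.n_docs / c) for t, c in self.df.items()}

stream = StreamingIDF()
stream.add_batch([["a", "b"], ["b", "c"]])
stream.add_batch([["c", "d"]])
print(stream.idf["b"])  # log(3/2): "b" appears in 2 of 3 docs
```

The cost of a refresh is linear in vocabulary size, so recomputing per batch rather than per document amortizes it over many insertions.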
We've now completed our deep dive into Inverse Document Frequency—the discriminative component that identifies rare, informative terms.
The Complete Picture:
| Metric | Question Answered | Scope |
|---|---|---|
| TF | How important is term $t$ in document $d$? | Per (term, document) pair |
| IDF | How rare is term $t$ across the corpus? | Per term (corpus-level) |
| TF-IDF | How distinctively important is term $t$ to document $d$? | Combines both perspectives |
The Combination:
$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
A term has high TF-IDF when it appears frequently within a document (high TF) and rarely across the corpus (high IDF).
This multiplication creates precisely what we want: a weight that identifies terms that are both important to a specific document and distinctive from the general corpus.
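The combination is a single multiplication per (term, document) pair. A minimal sketch on a toy corpus (the documents are illustrative), using relative TF and standard IDF:

```python
import math
from collections import Counter

docs = [
    ["machine", "learning", "the", "model"],
    ["the", "cat", "sat"],
    ["machine", "learning", "the", "algorithm"],
]
N = len(docs)
df = Counter(t for d in docs for t in set(d))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)   # relative term frequency in this doc
    idf = math.log(N / df[term])      # standard IDF over the corpus
    return tf * idf

doc = docs[0]
for t in ["the", "machine"]:
    print(f"{t}: {tfidf(t, doc):.3f}")
```

"the" appears in all three documents, so its IDF is zero and its TF-IDF is zero regardless of how often it occurs in the document; "machine" appears in two of three documents and keeps a positive weight, which is exactly the filtering behavior the multiplication is designed to produce.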
What's Next:
In the next page, we'll explore why IDF uses a logarithm and not a simpler inverse. We'll see how different logarithm arguments affect the weighting and examine practical considerations for choosing TF-IDF variants.
You now have a thorough understanding of Inverse Document Frequency—its mathematical formulation, information-theoretic interpretation, variants, and practical implementation. Next, we'll examine why the logarithm is fundamental to IDF and explore different weighting schemes.