Before any machine learning algorithm can process text, that text must be transformed into numbers. This seemingly simple requirement has driven decades of research in computational linguistics and information retrieval. At the heart of this transformation lies a deceptively simple question: How do we measure the importance of a word in a document?
The answer begins with Term Frequency (TF), the foundational building block of text representation. While modern deep learning approaches use word embeddings and attention mechanisms, TF-based representations remain critically important in practice. Understanding TF deeply isn't just historical—it's essential for any practitioner working with text data.
By the end of this page, you will understand: (1) The mathematical formulation and intuition behind Term Frequency, (2) Multiple TF variants and when to use each, (3) Implementation considerations and computational efficiency, (4) Limitations of raw TF and how they motivate IDF, and (5) Practical code implementations in Python.
Term Frequency embodies a simple yet powerful intuition: the more times a word appears in a document, the more important that word is to the document's content.
Consider two documents about machine learning: Document A, which uses the phrase "neural networks" repeatedly throughout, and Document B, which mentions it only once in passing.
Intuitively, Document A is more focused on neural networks than Document B. The frequency of terms serves as a proxy for topical emphasis. This is the core insight of Term Frequency: frequency signals relevance.
Term Frequency is intimately connected to the Bag-of-Words (BoW) representation. In BoW, we represent a document as a vector where each dimension corresponds to a vocabulary term, and the value is the term's count. TF formalizes and extends this idea, adding mathematical rigor and normalization options.
Formal Definition:
The raw term frequency of term $t$ in document $d$ is simply the count of occurrences:
$$\text{tf}(t, d) = f_{t,d}$$
where $f_{t,d}$ denotes the number of times term $t$ appears in document $d$.
Example:
Consider the document: "The quick brown fox jumps over the lazy dog. The dog was lazy."
After tokenization and lowercasing:
| Term | Raw TF |
|---|---|
| the | 3 |
| quick | 1 |
| brown | 1 |
| fox | 1 |
| jumps | 1 |
| over | 1 |
| lazy | 2 |
| dog | 2 |
| was | 1 |
The TF vector for this document has 9 dimensions, with values ranging from 1 to 3.
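The counts above can be reproduced in a few lines of Python, using a simple `\b\w+\b` regex tokenizer (one reasonable tokenization choice) with lowercasing:

```python
import re
from collections import Counter

doc = "The quick brown fox jumps over the lazy dog. The dog was lazy."

# Tokenize: lowercase, then extract word-character runs
tokens = re.findall(r"\b\w+\b", doc.lower())

# Raw term frequency is just a count of each token
tf = Counter(tokens)
print(tf)  # e.g. 'the' -> 3, 'lazy' -> 2, 'dog' -> 2, nine terms total
```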
To work rigorously with Term Frequency, we need precise mathematical notation. This foundation enables us to understand variants, prove properties, and implement algorithms correctly.
Notation:
Let:
- $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ be a corpus of $N$ documents
- $\mathcal{V} = \{t_1, t_2, \ldots, t_M\}$ be the vocabulary of $M$ unique terms
- $f_{t,d}$ be the raw count of term $t$ in document $d$
The term-document matrix $\mathbf{A}$ is an $M \times N$ matrix where $A_{ij} = f_{t_i, d_j}$, representing the frequency of term $t_i$ in document $d_j$. For example, with a five-term vocabulary and three documents:
| Term | Doc 1 | Doc 2 | Doc 3 |
|---|---|---|---|
| machine | 3 | 0 | 5 |
| learning | 4 | 1 | 7 |
| algorithm | 2 | 3 | 1 |
| data | 5 | 2 | 4 |
| model | 1 | 4 | 2 |
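As a sketch, the small matrix above maps directly onto a SciPy sparse matrix (rows are terms, columns are documents, matching the $A_{ij} = f_{t_i, d_j}$ convention):

```python
import numpy as np
from scipy.sparse import csr_matrix

terms = ["machine", "learning", "algorithm", "data", "model"]
counts = np.array([
    [3, 0, 5],   # machine
    [4, 1, 7],   # learning
    [2, 3, 1],   # algorithm
    [5, 2, 4],   # data
    [1, 4, 2],   # model
])

A = csr_matrix(counts)  # terms x documents; A[i, j] = f_{t_i, d_j}
print(A.shape)          # (5, 3)
print(A[0, 2])          # frequency of "machine" in Doc 3 -> 5
print(A.nnz)            # 14 non-zero entries (one cell is zero)
```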
Properties of the Term-Document Matrix:
Sparsity: In practice, most entries are zero. With a vocabulary of 50,000 terms and typical documents of 500 words, each document touches at most about 1% of the vocabulary.
Non-negativity: All values are ≥ 0, since term counts cannot be negative.
Integer values: Raw TF produces integer counts (though transformation may yield reals).
Column sparsity pattern: Each column (document) has non-zero entries only for terms present in that document.
Sparse Storage Implications:
For a corpus with 100,000 documents, a 50,000-term vocabulary, and roughly 500 unique terms per document:
- Dense matrix: $100{,}000 \times 50{,}000 \times 8$ bytes $= 40$ GB
- Sparse matrix ($\approx 12$ bytes per non-zero): $100{,}000 \times 500 \times 12$ bytes $\approx 0.6$ GB

This reduction of nearly two orders of magnitude makes TF calculations tractable for large corpora.
The sparsity of term-document matrices is both a blessing and a constraint. It enables efficient storage and computation, but in such high-dimensional sparse spaces, distance metrics like Euclidean distance become poor discriminators. This sparsity is one motivation for applying dimensionality-reduction techniques like LSA/SVD to text data.
Raw term counts have significant limitations. A term appearing 100 times isn't necessarily 100 times more important than one appearing once. Various TF variants address this issue, each with distinct mathematical properties and use cases.
The Problem with Raw Counts:
Consider a 10,000-word legal document where "contract" appears 200 times, and a 100-word summary where "contract" appears 5 times. Raw TF suggests the long document has 40x more emphasis on "contract," but the short document dedicates 5% of its words to the term while the long document dedicates only 2%.
This motivates normalized and transformed TF variants.
| Variant | Formula | Range | Use Case |
|---|---|---|---|
| Raw TF | $f_{t,d}$ | $[0, \infty)$ | Simple counting; baseline approach |
| Normalized TF | $\frac{f_{t,d}}{|d|}$ | $[0, 1]$ | Comparing documents of different lengths |
| Log TF | $1 + \log(f_{t,d})$ if $f_{t,d} > 0$, else $0$ | $[0, \infty)$ | Dampening high-frequency terms |
| Augmented TF | $0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_t f_{t,d}}$ | $[0.5, 1]$ | Preventing bias toward longer documents |
| Boolean TF | $1$ if $f_{t,d} > 0$, else $0$ | $\{0, 1\}$ | Binary presence/absence only |
| Double Normalization (K) | $K + (1-K) \cdot \frac{f_{t,d}}{\max_t f_{t,d}}$ | $[K, 1]$ | Tunable smoothing parameter |
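A minimal helper (a hypothetical function mirroring the formulas in the table) makes the variants concrete for a single term count:

```python
import math

def tf_variants(f, doc_len, max_f, K=0.5):
    """Compute each TF variant from the table for one term count f,
    given the document length and the count of its most frequent term."""
    return {
        "raw": f,
        "normalized": f / doc_len,
        "log": 1 + math.log(f) if f > 0 else 0,
        "augmented": 0.5 + 0.5 * f / max_f,
        "boolean": 1 if f > 0 else 0,
        "double_norm_K": K + (1 - K) * f / max_f,
    }

# A term appearing 4 times in a 100-token document whose most
# frequent term appears 10 times:
print(tf_variants(4, 100, 10))
# raw = 4, normalized = 0.04, augmented = 0.7, boolean = 1
```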
Deep Dive: Log Normalization
Logarithmic TF is perhaps the most widely used variant:
$$\text{tf}_{\log}(t, d) = \begin{cases} 1 + \log(f_{t,d}) & \text{if } f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Why the "1 +" offset?
Without the offset, a term appearing once would have $\log(1) = 0$, making it indistinguishable from absent terms. The offset ensures:
- A single occurrence receives weight $1$, not $0$
- Every present term has a strictly positive weight
- Weights grow sublinearly with frequency

Notice that 100 occurrences yield only about $5.6\times$ the weight of a single occurrence ($1 + \ln 100 \approx 5.61$ versus $1$), not $100\times$. This sublinear scaling prevents frequent terms from dominating.
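The sublinear growth is easy to see numerically (using the natural log here):

```python
import math

# Log TF weight 1 + ln(f) for increasing raw counts
for f in (1, 10, 100, 1000):
    print(f, 1 + math.log(f))
# weights grow roughly by a constant ~2.3 per tenfold increase in count
```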
Different implementations use different logarithm bases: natural log (ln), log base 2, or log base 10. Changing the base only scales the logarithmic term by a constant factor, and since the logarithm is monotonic, relative rankings within a document are unchanged. However, for interpretability and consistency, document your choice clearly.
Deep Dive: Augmented Term Frequency
The augmented TF variant, introduced by Salton (1988), addresses document length bias more aggressively:
$$\text{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$$
Here, $\max_{t' \in d} f_{t',d}$ is the frequency of the most frequent term in document $d$.
Properties:
- The most frequent term in a document always receives weight exactly $1$
- Every term present in the document receives weight at least $0.5$
- The weight depends only on relative frequency within the document, not on document length
The "0.5" smoothing:
The 0.5 baseline (also called "K" in generalized double normalization) prevents terms from ever having zero weight when present. Some variants use K = 0.4 or other values based on empirical tuning.
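A small sketch of augmented TF on hypothetical counts shows these properties directly:

```python
from collections import Counter

# Hypothetical term counts for one document
counts = Counter({"learning": 10, "neural": 4, "model": 1})
max_f = max(counts.values())

# Augmented TF: 0.5 baseline plus 0.5 * relative frequency
aug = {t: 0.5 + 0.5 * f / max_f for t, f in counts.items()}
print(aug)  # most frequent term gets 1.0; every value lies in [0.5, 1]
```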
Implementing TF efficiently requires careful attention to data structures, memory management, and computational complexity. Let's examine production-grade implementation patterns.
```python
import re
from collections import Counter
from typing import Dict, List

import numpy as np
from scipy.sparse import csr_matrix, lil_matrix


class TermFrequencyVectorizer:
    """
    Production-ready Term Frequency vectorizer with multiple
    normalization options.

    Supports:
    - Raw TF (count-based)
    - Normalized TF (length-normalized)
    - Log TF (logarithmic scaling)
    - Augmented TF (max-frequency normalized)
    - Boolean TF (binary presence)
    """

    def __init__(
        self,
        tf_variant: str = "raw",
        min_df: int = 1,
        max_df: float = 1.0,
        lowercase: bool = True,
        token_pattern: str = r"(?u)\b\w\w+\b"
    ):
        """
        Initialize the vectorizer.

        Parameters
        ----------
        tf_variant : str
            One of 'raw', 'normalized', 'log', 'augmented', 'boolean'
        min_df : int
            Minimum document frequency for a term to be included
        max_df : float
            Maximum document frequency (as a proportion) for a term
        lowercase : bool
            Whether to convert text to lowercase
        token_pattern : str
            Regex pattern for tokenization
        """
        self.tf_variant = tf_variant
        self.min_df = min_df
        self.max_df = max_df
        self.lowercase = lowercase
        self.token_pattern = re.compile(token_pattern)
        self.vocabulary_: Dict[str, int] = {}
        self.document_frequencies_: Dict[str, int] = {}
        self._fitted = False

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a single document."""
        if self.lowercase:
            text = text.lower()
        return self.token_pattern.findall(text)

    def _compute_tf(
        self,
        term_counts: Counter,
        doc_length: int
    ) -> Dict[str, float]:
        """Compute term frequency values based on the selected variant."""
        if self.tf_variant == "raw":
            return dict(term_counts)
        elif self.tf_variant == "normalized":
            return {
                term: count / doc_length
                for term, count in term_counts.items()
            }
        elif self.tf_variant == "log":
            return {
                term: 1 + np.log(count) if count > 0 else 0
                for term, count in term_counts.items()
            }
        elif self.tf_variant == "augmented":
            max_count = max(term_counts.values()) if term_counts else 1
            return {
                term: 0.5 + 0.5 * (count / max_count)
                for term, count in term_counts.items()
            }
        elif self.tf_variant == "boolean":
            return {
                term: 1.0 if count > 0 else 0.0
                for term, count in term_counts.items()
            }
        else:
            raise ValueError(f"Unknown TF variant: {self.tf_variant}")

    def fit(self, documents: List[str]) -> 'TermFrequencyVectorizer':
        """
        Build vocabulary from the corpus.

        Parameters
        ----------
        documents : List[str]
            List of document strings

        Returns
        -------
        self : TermFrequencyVectorizer
            Fitted vectorizer
        """
        n_docs = len(documents)
        term_doc_counts = Counter()
        all_terms = set()

        # First pass: count document frequencies
        for doc in documents:
            tokens = self._tokenize(doc)
            unique_tokens = set(tokens)
            all_terms.update(unique_tokens)
            for token in unique_tokens:
                term_doc_counts[token] += 1

        # Filter by document frequency thresholds
        max_doc_count = int(self.max_df * n_docs)
        filtered_terms = [
            term for term in all_terms
            if self.min_df <= term_doc_counts[term] <= max_doc_count
        ]

        # Build vocabulary (sorted for deterministic ordering)
        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted(filtered_terms))
        }
        self.document_frequencies_ = {
            term: term_doc_counts[term] for term in self.vocabulary_
        }
        self._fitted = True
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """
        Transform documents to a TF matrix.

        Parameters
        ----------
        documents : List[str]
            List of document strings

        Returns
        -------
        X : csr_matrix
            Sparse matrix of shape (n_documents, n_vocabulary)
        """
        if not self._fitted:
            raise RuntimeError("Vectorizer must be fitted before transform")

        n_docs = len(documents)
        n_vocab = len(self.vocabulary_)

        # Use LIL matrix for efficient incremental construction
        X = lil_matrix((n_docs, n_vocab), dtype=np.float64)

        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            term_counts = Counter(tokens)
            doc_length = len(tokens)

            # Compute TF values
            tf_values = self._compute_tf(term_counts, doc_length)

            # Fill sparse matrix row
            for term, tf_value in tf_values.items():
                if term in self.vocabulary_:
                    term_idx = self.vocabulary_[term]
                    X[doc_idx, term_idx] = tf_value

        # Convert to CSR for efficient arithmetic operations
        return X.tocsr()

    def fit_transform(self, documents: List[str]) -> csr_matrix:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)

    def get_feature_names(self) -> List[str]:
        """Return vocabulary terms in index order."""
        return [
            term for term, _ in sorted(
                self.vocabulary_.items(), key=lambda x: x[1]
            )
        ]


# Example usage and demonstration
if __name__ == "__main__":
    documents = [
        "Machine learning algorithms learn patterns from data",
        "Deep learning uses neural networks for learning",
        "Data science combines statistics and machine learning",
        "Neural networks are powerful machine learning models"
    ]

    print("=" * 60)
    print("TERM FREQUENCY VARIANTS COMPARISON")
    print("=" * 60)

    for variant in ["raw", "normalized", "log", "augmented", "boolean"]:
        vectorizer = TermFrequencyVectorizer(tf_variant=variant)
        X = vectorizer.fit_transform(documents)

        print(f"{variant.upper()} TF:")
        print(f"  Matrix shape: {X.shape}")
        print(f"  Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")

        # Show TF for "learning" across documents
        learning_idx = vectorizer.vocabulary_.get("learning", -1)
        if learning_idx >= 0:
            print(f"  TF('learning'): {X[:, learning_idx].toarray().flatten()}")
```

We use LIL (List of Lists) format during construction for efficient row-by-row filling, then convert to CSR (Compressed Sparse Row) for efficient arithmetic and slicing. This two-phase approach is standard practice for building sparse matrices incrementally.
Understanding the computational complexity of TF calculations is essential for handling large-scale corpora efficiently.
Notation:
- $N$ — number of documents in the corpus
- $M$ — vocabulary size (number of unique terms)
- $L$ — average document length in tokens
Time Complexity:
| Operation | Time Complexity | Notes |
|---|---|---|
| Tokenization (per doc) | $O(L)$ | Linear in document length |
| Term counting (per doc) | $O(L)$ | Hash table insertions |
| Vocabulary building | $O(N \cdot L)$ | Single pass over corpus |
| TF computation (all docs) | $O(N \cdot L)$ | Count and normalize |
| Sparse matrix construction | $O(N \cdot L)$ | One entry per token |
| Total | $O(N \cdot L)$ | Linear in total corpus tokens |
Space Complexity:
| Component | Space | Notes |
|---|---|---|
| Vocabulary | $O(M)$ | One entry per unique term |
| Dense TF matrix | $O(N \cdot M)$ | Prohibitive for large corpora |
| Sparse TF matrix | $O(\text{nnz})$ | Where nnz = number of non-zeros |
| Typical nnz | $O(N \cdot L)$ | Each doc contributes ~L entries |
Practical Example:
For a corpus with:
- $10^6$ documents
- $10^5$ vocabulary terms
- ~500 unique terms per document
Dense representation: $$10^6 \times 10^5 \times 8 \text{ bytes} = 800 \text{ GB}$$
Sparse representation (about 12 bytes per non-zero: an 8-byte value plus index overhead): $$10^6 \times 500 \times 12 \text{ bytes} = 6 \text{ GB}$$
The sparse representation is 133x more efficient, making TF computation tractable.
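This arithmetic is easy to check directly:

```python
n_docs, vocab, terms_per_doc = 10**6, 10**5, 500

dense_bytes = n_docs * vocab * 8            # 8-byte float per matrix cell
sparse_bytes = n_docs * terms_per_doc * 12  # ~12 bytes per non-zero entry

print(dense_bytes / 1e9, "GB dense")        # 800.0 GB dense
print(sparse_bytes / 1e9, "GB sparse")      # 6.0 GB sparse
print(dense_bytes / sparse_bytes)           # ~133x reduction
```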
Be careful with intermediate representations. During vectorization, holding all tokenized documents in memory simultaneously can exhaust RAM before sparse matrix construction begins. Production implementations often use streaming/iterative approaches that process documents in batches.
While Term Frequency captures the intuition that repeated terms are important, it has fundamental limitations that motivate more sophisticated weighting schemes.
Limitation 1: The Common Word Problem
Consider TF values for a document about "machine learning algorithms":
| Term | Raw TF | Semantic Value |
|---|---|---|
| the | 47 | ❌ Near-zero (function word) |
| is | 31 | ❌ Near-zero (function word) |
| a | 28 | ❌ Near-zero (function word) |
| learning | 15 | ✅ High (topic word) |
| algorithm | 12 | ✅ High (topic word) |
| neural | 8 | ✅ High (topic word) |
Function words like "the," "is," and "a" dominate by TF despite carrying almost no semantic content. This is the core problem TF alone cannot solve, and it motivates:
- Stop-word removal as a preprocessing step
- Inverse Document Frequency (IDF) weighting, which down-weights terms that appear in most documents
Limitation 2: Document Length Bias
Even with normalized TF variants, longer documents naturally have more opportunities for term repetition. Consider a 200-word abstract in which a key term appears 5 times (2.5% of tokens) versus the corresponding 10,000-word paper in which it appears 100 times (1% of tokens).
Despite the abstract having a higher proportion, the paper's raw TF is 20x higher. Normalization helps but doesn't fully eliminate this bias, because longer documents cover more subtopics.
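With illustrative numbers (a 200-word abstract containing a key term 5 times, and a 10,000-word paper containing it 100 times), the bias and its partial fix are easy to see:

```python
# Hypothetical lengths and counts for the same key term
abstract_len, abstract_count = 200, 5
paper_len, paper_count = 10_000, 100

print(paper_count / abstract_count)   # raw TF ratio: paper looks 20x stronger
print(abstract_count / abstract_len)  # normalized TF, abstract: 0.025
print(paper_count / paper_len)        # normalized TF, paper: 0.01
```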
Limitation 3: Synonymy Blindness
TF treats each term independently. Two documents, one using "automobile" and one using "car," appear dissimilar despite discussing the same concept. This is the synonymy problem, which TF cannot address at all.
Limitation 4: Polysemy Ignorance
The word "bank" in "river bank" and "bank account" creates the same TF contribution despite completely different meanings. This is the polysemy problem, also beyond TF's reach.
Limitation 5: No Semantic Ordering
TF ignores word order entirely. "Dog bites man" and "Man bites dog" have identical TF vectors despite opposite meanings. This is inherent to the bag-of-words assumption.
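This order-blindness is trivial to demonstrate with term counts:

```python
from collections import Counter

# Opposite meanings, identical bags of words
a = Counter("dog bites man".split())
b = Counter("man bites dog".split())

print(a == b)  # True: the TF vectors are identical
```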
TF is fundamentally a statistical measure of term occurrence. It cannot capture semantics, context, or linguistic relationships. These limitations are not flaws but rather scope boundaries. TF is a building block, not a complete solution. Modern approaches combine TF-IDF with word embeddings or use transformer-based models to address semantic limitations.
Beyond the mathematical formulation, several practical decisions affect TF quality in production systems.
Tokenization Choices:
Tokenization directly impacts what "terms" exist:
| Tokenization Approach | Example: "machine-learning" | Trade-off |
|---|---|---|
| Whitespace split | ["machine-learning"] | Preserves compounds, misses parts |
| Punctuation split | ["machine", "learning"] | Loses compound meaning |
| Subword (BPE) | ["mach", "ine", "learn", "ing"] | Handles OOV, obscures words |
Case Sensitivity:
Lowercasing merges variants such as "The"/"the" and shrinks the vocabulary, but it also conflates distinct meanings (e.g., "Apple" the company versus "apple" the fruit). Lowercasing is the common default for TF-based models.
Out-of-Vocabulary (OOV) Handling:
At prediction time, documents may contain terms not in the training vocabulary. Options:
- Silently ignore OOV terms (the standard approach; unseen terms contribute nothing)
- Map all OOV terms to a shared "unknown" index
- Use feature hashing, which avoids maintaining a fixed vocabulary entirely
In practice, use sklearn.feature_extraction.text.CountVectorizer for raw TF or TfidfVectorizer with use_idf=False for normalized TF. These implementations are highly optimized and handle edge cases correctly. Our custom implementation above is for pedagogical purposes—understanding the mechanics enables debugging production issues.
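A quick sketch of the scikit-learn equivalents (assuming scikit-learn is installed; `norm="l1"` makes each row sum to 1, matching length-normalized TF):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "machine learning algorithms learn patterns from data",
    "deep learning uses neural networks for learning",
]

# Raw TF: plain term counts per document
raw = CountVectorizer().fit_transform(docs)

# Normalized TF: counts scaled so each document row sums to 1, no IDF
norm = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs)

print(raw.toarray())
print(norm.toarray())
```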
We've now established a rigorous understanding of Term Frequency—the first half of TF-IDF. Let's consolidate the key insights:
- Frequency signals relevance: repeated terms indicate topical emphasis
- Raw counts mislead: normalization and log scaling correct for document length and dampen high frequencies
- Sparsity is essential: sparse matrix formats make TF tractable at corpus scale
- TF is purely statistical: it cannot handle stop words, synonyms, polysemy, or word order
The Bridge to IDF:
Term Frequency answers: How important is this term within this document?
But it fails to answer: How distinctive is this term across the entire corpus?
A term like "the" has high TF in every document, making it useless for distinguishing documents. The Inverse Document Frequency (IDF) component, which we'll explore next, measures term rarity across a corpus.
The product TF × IDF:
$$w_{t,d} = \text{tf}(t, d) \times \text{idf}(t)$$
assigns the highest weight to terms that are frequent within a document yet rare across the corpus. This combination is the insight that made TF-IDF one of the most successful text representation techniques in information retrieval history.
You now have a thorough understanding of Term Frequency—its mathematical foundations, variants, implementation, complexity, and limitations. Next, we'll explore Inverse Document Frequency (IDF), which addresses TF's inability to distinguish common from rare terms.