Before any machine learning algorithm can process text, that text must be transformed into numbers. This seemingly simple requirement has driven decades of research in computational linguistics and information retrieval. At the heart of this transformation lies a deceptively simple question: How do we measure the importance of a word in a document?
The answer begins with Term Frequency (TF), the foundational building block of text representation. While modern deep learning approaches use word embeddings and attention mechanisms, TF-based representations remain critically important in practice. Understanding TF deeply isn't just historical—it's essential for any practitioner working with text data.
By the end of this page, you will understand: (1) The mathematical formulation and intuition behind Term Frequency, (2) Multiple TF variants and when to use each, (3) Implementation considerations and computational efficiency, (4) Limitations of raw TF and how they motivate IDF, and (5) Practical code implementations in Python.
Term Frequency embodies a simple yet powerful intuition: the more times a word appears in a document, the more important that word is to the document's content.
Consider two documents about machine learning: Document A, which uses the phrase "neural networks" repeatedly throughout, and Document B, which mentions it only once in passing.
Intuitively, Document A is more focused on neural networks than Document B. The frequency of terms serves as a proxy for topical emphasis. This is the core insight of Term Frequency: frequency signals relevance.
Term Frequency is intimately connected to the Bag-of-Words (BoW) representation. In BoW, we represent a document as a vector where each dimension corresponds to a vocabulary term, and the value is the term's count. TF formalizes and extends this idea, adding mathematical rigor and normalization options.
Formal Definition:
The raw term frequency of term $t$ in document $d$ is simply the count of occurrences:
$$\text{tf}(t, d) = f_{t,d}$$
where $f_{t,d}$ denotes the number of times term $t$ appears in document $d$.
Example:
Consider the document: "The quick brown fox jumps over the lazy dog. The dog was lazy."
After tokenization and lowercasing:
| Term | Raw TF |
|---|---|
| the | 3 |
| quick | 1 |
| brown | 1 |
| fox | 1 |
| jumps | 1 |
| over | 1 |
| lazy | 2 |
| dog | 2 |
| was | 1 |
The TF vector for this document has 9 dimensions, with values ranging from 1 to 3.
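The counts above can be reproduced in a few lines of Python, using a simple `\b\w+\b` regex tokenizer (one reasonable tokenization choice) with lowercasing:

```python
import re
from collections import Counter

doc = "The quick brown fox jumps over the lazy dog. The dog was lazy."

# Tokenize: lowercase, then extract word-character runs
tokens = re.findall(r"\b\w+\b", doc.lower())

# Raw term frequency is just a count of each token
tf = Counter(tokens)
print(tf)  # e.g. 'the' -> 3, 'lazy' -> 2, 'dog' -> 2, nine terms total
```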
To work rigorously with Term Frequency, we need precise mathematical notation. This foundation enables us to understand variants, prove properties, and implement algorithms correctly.
Notation:
Let:
- $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ be a corpus of $N$ documents
- $\mathcal{V} = \{t_1, t_2, \ldots, t_M\}$ be the vocabulary of $M$ unique terms
- $f_{t,d}$ be the raw count of term $t$ in document $d$
The term-document matrix $\mathbf{A}$ is an $M \times N$ matrix where $A_{ij} = f_{t_i, d_j}$, representing the frequency of term $t_i$ in document $d_j$. For example, with a five-term vocabulary and three documents:
| Term | Doc 1 | Doc 2 | Doc 3 |
|---|---|---|---|
| machine | 3 | 0 | 5 |
| learning | 4 | 1 | 7 |
| algorithm | 2 | 3 | 1 |
| data | 5 | 2 | 4 |
| model | 1 | 4 | 2 |
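As a sketch, the small matrix above maps directly onto a SciPy sparse matrix (rows are terms, columns are documents, matching the $A_{ij} = f_{t_i, d_j}$ convention):

```python
import numpy as np
from scipy.sparse import csr_matrix

terms = ["machine", "learning", "algorithm", "data", "model"]
counts = np.array([
    [3, 0, 5],   # machine
    [4, 1, 7],   # learning
    [2, 3, 1],   # algorithm
    [5, 2, 4],   # data
    [1, 4, 2],   # model
])

A = csr_matrix(counts)  # terms x documents; A[i, j] = f_{t_i, d_j}
print(A.shape)          # (5, 3)
print(A[0, 2])          # frequency of "machine" in Doc 3 -> 5
print(A.nnz)            # 14 non-zero entries (one cell is zero)
```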
Properties of the Term-Document Matrix:
Sparsity: In practice, most entries are zero. With a vocabulary of 50,000 terms and typical documents of 500 words, each document touches at most about 1% of the vocabulary.
Non-negativity: All values are ≥ 0, since term counts cannot be negative.
Integer values: Raw TF produces integer counts (though transformation may yield reals).
Column sparsity pattern: Each column (document) has non-zero entries only for terms present in that document.
Sparse Storage Implications:
For a corpus with 100,000 documents, a 50,000-term vocabulary, and roughly 500 unique terms per document:
- Dense matrix: $100{,}000 \times 50{,}000 \times 8$ bytes $= 40$ GB
- Sparse matrix ($\approx 12$ bytes per non-zero): $100{,}000 \times 500 \times 12$ bytes $\approx 0.6$ GB

This reduction of nearly two orders of magnitude makes TF calculations tractable for large corpora.
The sparsity of term-document matrices is both a blessing and a constraint. It enables efficient storage and computation, but in such high-dimensional sparse spaces, distance metrics like Euclidean distance become poor discriminators. This sparsity is one motivation for applying dimensionality-reduction techniques like LSA/SVD to text data.
Raw term counts have significant limitations. A term appearing 100 times isn't necessarily 100 times more important than one appearing once. Various TF variants address this issue, each with distinct mathematical properties and use cases.
The Problem with Raw Counts:
Consider a 10,000-word legal document where "contract" appears 200 times, and a 100-word summary where "contract" appears 5 times. Raw TF suggests the long document has 40x more emphasis on "contract," but the short document dedicates 5% of its words to the term while the long document dedicates only 2%.
This motivates normalized and transformed TF variants.
| Variant | Formula | Range | Use Case |
|---|---|---|---|
| Raw TF | $f_{t,d}$ | $[0, \infty)$ | Simple counting; baseline approach |
| Normalized TF | $\frac{f_{t,d}}{|d|}$ | $[0, 1]$ | Comparing documents of different lengths |
| Log TF | $1 + \log(f_{t,d})$ if $f_{t,d} > 0$, else $0$ | $[0, \infty)$ | Dampening high-frequency terms |
| Augmented TF | $0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_t f_{t,d}}$ | $[0.5, 1]$ | Preventing bias toward longer documents |
| Boolean TF | $1$ if $f_{t,d} > 0$, else $0$ | $\{0, 1\}$ | Binary presence/absence only |
| Double Normalization (K) | $K + (1-K) \cdot \frac{f_{t,d}}{\max_t f_{t,d}}$ | $[K, 1]$ | Tunable smoothing parameter |
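A minimal helper (a hypothetical function mirroring the formulas in the table) makes the variants concrete for a single term count:

```python
import math

def tf_variants(f, doc_len, max_f, K=0.5):
    """Compute each TF variant from the table for one term count f,
    given the document length and the count of its most frequent term."""
    return {
        "raw": f,
        "normalized": f / doc_len,
        "log": 1 + math.log(f) if f > 0 else 0,
        "augmented": 0.5 + 0.5 * f / max_f,
        "boolean": 1 if f > 0 else 0,
        "double_norm_K": K + (1 - K) * f / max_f,
    }

# A term appearing 4 times in a 100-token document whose most
# frequent term appears 10 times:
print(tf_variants(4, 100, 10))
# raw = 4, normalized = 0.04, augmented = 0.7, boolean = 1
```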
Deep Dive: Log Normalization
Logarithmic TF is perhaps the most widely used variant:
$$\text{tf}_{\log}(t, d) = \begin{cases} 1 + \log(f_{t,d}) & \text{if } f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Why the "1 +" offset?
Without the offset, a term appearing once would have $\log(1) = 0$, making it indistinguishable from absent terms. The offset ensures:
- A single occurrence receives weight $1$, not $0$
- Every present term has a strictly positive weight
- Weights grow sublinearly with frequency

Notice that 100 occurrences yield only about $5.6\times$ the weight of a single occurrence ($1 + \ln 100 \approx 5.61$ versus $1$), not $100\times$. This sublinear scaling prevents frequent terms from dominating.
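The sublinear growth is easy to see numerically (using the natural log here):

```python
import math

# Log TF weight 1 + ln(f) for increasing raw counts
for f in (1, 10, 100, 1000):
    print(f, 1 + math.log(f))
# weights grow roughly by a constant ~2.3 per tenfold increase in count
```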
Different implementations use different logarithm bases: natural log (ln), log base 2, or log base 10. Changing the base only scales the logarithmic term by a constant factor, and since the logarithm is monotonic, relative rankings within a document are unchanged. However, for interpretability and consistency, document your choice clearly.
Deep Dive: Augmented Term Frequency
The augmented TF variant, introduced by Salton (1988), addresses document length bias more aggressively:
$$\text{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$$
Here, $\max_{t' \in d} f_{t',d}$ is the frequency of the most frequent term in document $d$.
Properties:
- The most frequent term in a document always receives weight exactly $1$
- Every term present in the document receives weight at least $0.5$
- The weight depends only on relative frequency within the document, not on document length
The "0.5" smoothing:
The 0.5 baseline (also called "K" in generalized double normalization) prevents terms from ever having zero weight when present. Some variants use K = 0.4 or other values based on empirical tuning.
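A small sketch of augmented TF on hypothetical counts shows these properties directly:

```python
from collections import Counter

# Hypothetical term counts for one document
counts = Counter({"learning": 10, "neural": 4, "model": 1})
max_f = max(counts.values())

# Augmented TF: 0.5 baseline plus 0.5 * relative frequency
aug = {t: 0.5 + 0.5 * f / max_f for t, f in counts.items()}
print(aug)  # most frequent term gets 1.0; every value lies in [0.5, 1]
```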
Implementing TF efficiently requires careful attention to data structures, memory management, and computational complexity. Let's examine production-grade implementation patterns.
```python
import re
from collections import Counter
from typing import Dict, List

import numpy as np
from scipy.sparse import csr_matrix, lil_matrix


class TermFrequencyVectorizer:
    """
    Production-ready Term Frequency vectorizer with multiple
    normalization options.

    Supports:
    - Raw TF (count-based)
    - Normalized TF (length-normalized)
    - Log TF (logarithmic scaling)
    - Augmented TF (max-frequency normalized)
    - Boolean TF (binary presence)
    """

    def __init__(
        self,
        tf_variant: str = "raw",
        min_df: int = 1,
        max_df: float = 1.0,
        lowercase: bool = True,
        token_pattern: str = r"(?u)\b\w\w+\b"
    ):
        """
        Initialize the vectorizer.

        Parameters
        ----------
        tf_variant : str
            One of 'raw', 'normalized', 'log', 'augmented', 'boolean'
        min_df : int
            Minimum document frequency for a term to be included
        max_df : float
            Maximum document frequency (as a proportion) for a term
        lowercase : bool
            Whether to convert text to lowercase
        token_pattern : str
            Regex pattern for tokenization
        """
        self.tf_variant = tf_variant
        self.min_df = min_df
        self.max_df = max_df
        self.lowercase = lowercase
        self.token_pattern = re.compile(token_pattern)
        self.vocabulary_: Dict[str, int] = {}
        self.document_frequencies_: Dict[str, int] = {}
        self._fitted = False

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a single document."""
        if self.lowercase:
            text = text.lower()
        return self.token_pattern.findall(text)

    def _compute_tf(
        self,
        term_counts: Counter,
        doc_length: int
    ) -> Dict[str, float]:
        """Compute term frequency values based on the selected variant."""
        if self.tf_variant == "raw":
            return dict(term_counts)
        elif self.tf_variant == "normalized":
            return {
                term: count / doc_length
                for term, count in term_counts.items()
            }
        elif self.tf_variant == "log":
            return {
                term: 1 + np.log(count) if count > 0 else 0
                for term, count in term_counts.items()
            }
        elif self.tf_variant == "augmented":
            max_count = max(term_counts.values()) if term_counts else 1
            return {
                term: 0.5 + 0.5 * (count / max_count)
                for term, count in term_counts.items()
            }
        elif self.tf_variant == "boolean":
            return {
                term: 1.0 if count > 0 else 0.0
                for term, count in term_counts.items()
            }
        else:
            raise ValueError(f"Unknown TF variant: {self.tf_variant}")

    def fit(self, documents: List[str]) -> 'TermFrequencyVectorizer':
        """
        Build vocabulary from the corpus.

        Parameters
        ----------
        documents : List[str]
            List of document strings

        Returns
        -------
        self : TermFrequencyVectorizer
            Fitted vectorizer
        """
        n_docs = len(documents)
        term_doc_counts = Counter()
        all_terms = set()

        # First pass: count document frequencies
        for doc in documents:
            tokens = self._tokenize(doc)
            unique_tokens = set(tokens)
            all_terms.update(unique_tokens)
            for token in unique_tokens:
                term_doc_counts[token] += 1

        # Filter by document frequency thresholds
        max_doc_count = int(self.max_df * n_docs)
        filtered_terms = [
            term for term in all_terms
            if self.min_df <= term_doc_counts[term] <= max_doc_count
        ]

        # Build vocabulary (sorted for deterministic ordering)
        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted(filtered_terms))
        }
        self.document_frequencies_ = {
            term: term_doc_counts[term] for term in self.vocabulary_
        }
        self._fitted = True
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """
        Transform documents to a TF matrix.

        Parameters
        ----------
        documents : List[str]
            List of document strings

        Returns
        -------
        X : csr_matrix
            Sparse matrix of shape (n_documents, n_vocabulary)
        """
        if not self._fitted:
            raise RuntimeError("Vectorizer must be fitted before transform")

        n_docs = len(documents)
        n_vocab = len(self.vocabulary_)

        # Use LIL matrix for efficient incremental construction
        X = lil_matrix((n_docs, n_vocab), dtype=np.float64)

        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            term_counts = Counter(tokens)
            doc_length = len(tokens)

            # Compute TF values
            tf_values = self._compute_tf(term_counts, doc_length)

            # Fill sparse matrix row
            for term, tf_value in tf_values.items():
                if term in self.vocabulary_:
                    term_idx = self.vocabulary_[term]
                    X[doc_idx, term_idx] = tf_value

        # Convert to CSR for efficient arithmetic operations
        return X.tocsr()

    def fit_transform(self, documents: List[str]) -> csr_matrix:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)

    def get_feature_names(self) -> List[str]:
        """Return vocabulary terms in index order."""
        return [
            term for term, _ in sorted(
                self.vocabulary_.items(), key=lambda x: x[1]
            )
        ]


# Example usage and demonstration
if __name__ == "__main__":
    documents = [
        "Machine learning algorithms learn patterns from data",
        "Deep learning uses neural networks for learning",
        "Data science combines statistics and machine learning",
        "Neural networks are powerful machine learning models"
    ]

    print("=" * 60)
    print("TERM FREQUENCY VARIANTS COMPARISON")
    print("=" * 60)

    for variant in ["raw", "normalized", "log", "augmented", "boolean"]:
        vectorizer = TermFrequencyVectorizer(tf_variant=variant)
        X = vectorizer.fit_transform(documents)

        print(f"{variant.upper()} TF:")
        print(f"  Matrix shape: {X.shape}")
        print(f"  Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")

        # Show TF for "learning" across documents
        learning_idx = vectorizer.vocabulary_.get("learning", -1)
        if learning_idx >= 0:
            print(f"  TF('learning'): {X[:, learning_idx].toarray().flatten()}")
```

We use LIL (List of Lists) format during construction for efficient row-by-row filling, then convert to CSR (Compressed Sparse Row) for efficient arithmetic and slicing. This two-phase approach is standard practice for building sparse matrices incrementally.
Understanding the computational complexity of TF calculations is essential for handling large-scale corpora efficiently.
Notation:
- $N$ — number of documents in the corpus
- $M$ — vocabulary size (number of unique terms)
- $L$ — average document length in tokens
Time Complexity:
| Operation | Time Complexity | Notes |
|---|---|---|
| Tokenization (per doc) | $O(L)$ | Linear in document length |
| Term counting (per doc) | $O(L)$ | Hash table insertions |
| Vocabulary building | $O(N \cdot L)$ | Single pass over corpus |
| TF computation (all docs) | $O(N \cdot L)$ | Count and normalize |
| Sparse matrix construction | $O(N \cdot L)$ | One entry per token |
| Total | $O(N \cdot L)$ | Linear in total corpus tokens |
Space Complexity:
| Component | Space | Notes |
|---|---|---|
| Vocabulary | $O(M)$ | One entry per unique term |
| Dense TF matrix | $O(N \cdot M)$ | Prohibitive for large corpora |
| Sparse TF matrix | $O(\text{nnz})$ | Where nnz = number of non-zeros |
| Typical nnz | $O(N \cdot L)$ | Each doc contributes ~L entries |
Practical Example:
For a corpus with:
- $10^6$ documents
- $10^5$ vocabulary terms
- ~500 unique terms per document
Dense representation: $$10^6 \times 10^5 \times 8 \text{ bytes} = 800 \text{ GB}$$
Sparse representation (about 12 bytes per non-zero: an 8-byte value plus index overhead): $$10^6 \times 500 \times 12 \text{ bytes} = 6 \text{ GB}$$
The sparse representation is 133x more efficient, making TF computation tractable.
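This arithmetic is easy to check directly:

```python
n_docs, vocab, terms_per_doc = 10**6, 10**5, 500

dense_bytes = n_docs * vocab * 8            # 8-byte float per matrix cell
sparse_bytes = n_docs * terms_per_doc * 12  # ~12 bytes per non-zero entry

print(dense_bytes / 1e9, "GB dense")        # 800.0 GB dense
print(sparse_bytes / 1e9, "GB sparse")      # 6.0 GB sparse
print(dense_bytes / sparse_bytes)           # ~133x reduction
```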
Be careful with intermediate representations. During vectorization, holding all tokenized documents in memory simultaneously can exhaust RAM before sparse matrix construction begins. Production implementations often use streaming/iterative approaches that process documents in batches.
While Term Frequency captures the intuition that repeated terms are important, it has fundamental limitations that motivate more sophisticated weighting schemes.
Limitation 1: The Common Word Problem
Consider TF values for a document about "machine learning algorithms":
| Term | Raw TF | Semantic Value |
|---|---|---|
| the | 47 | ❌ Near-zero (function word) |
| is | 31 | ❌ Near-zero (function word) |
| a | 28 | ❌ Near-zero (function word) |
| learning | 15 | ✅ High (topic word) |
| algorithm | 12 | ✅ High (topic word) |
| neural | 8 | ✅ High (topic word) |
Function words like "the," "is," and "a" dominate by TF despite carrying almost no semantic content. This is the core problem TF alone cannot solve, and it motivates:
- Stop-word removal as a preprocessing step
- Inverse Document Frequency (IDF) weighting, which down-weights terms that appear in most documents
Limitation 2: Document Length Bias
Even with normalized TF variants, longer documents naturally have more opportunities for term repetition. Consider a 200-word abstract in which a key term appears 5 times (2.5% of tokens) versus the corresponding 10,000-word paper in which it appears 100 times (1% of tokens).
Despite the abstract having a higher proportion, the paper's raw TF is 20x higher. Normalization helps but doesn't fully eliminate this bias, because longer documents cover more subtopics.
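With illustrative numbers (a 200-word abstract containing a key term 5 times, and a 10,000-word paper containing it 100 times), the bias and its partial fix are easy to see:

```python
# Hypothetical lengths and counts for the same key term
abstract_len, abstract_count = 200, 5
paper_len, paper_count = 10_000, 100

print(paper_count / abstract_count)   # raw TF ratio: paper looks 20x stronger
print(abstract_count / abstract_len)  # normalized TF, abstract: 0.025
print(paper_count / paper_len)        # normalized TF, paper: 0.01
```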
Limitation 3: Synonymy Blindness
TF treats each term independently. Two documents, one using "automobile" and one using "car," appear dissimilar despite discussing the same concept. This is the synonymy problem, which TF cannot address at all.
Limitation 4: Polysemy Ignorance
The word "bank" in "river bank" and "bank account" creates the same TF contribution despite completely different meanings. This is the polysemy problem, also beyond TF's reach.
Limitation 5: No Semantic Ordering
TF ignores word order entirely. "Dog bites man" and "Man bites dog" have identical TF vectors despite opposite meanings. This is inherent to the bag-of-words assumption.
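This order-blindness is trivial to demonstrate with term counts:

```python
from collections import Counter

# Opposite meanings, identical bags of words
a = Counter("dog bites man".split())
b = Counter("man bites dog".split())

print(a == b)  # True: the TF vectors are identical
```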
TF is fundamentally a statistical measure of term occurrence. It cannot capture semantics, context, or linguistic relationships. These limitations are not flaws but rather scope boundaries. TF is a building block, not a complete solution. Modern approaches combine TF-IDF with word embeddings or use transformer-based models to address semantic limitations.
Beyond the mathematical formulation, several practical decisions affect TF quality in production systems.
Tokenization Choices:
Tokenization directly impacts what "terms" exist:
| Tokenization Approach | Example: "machine-learning" | Trade-off |
|---|---|---|
| Whitespace split | ["machine-learning"] | Preserves compounds, misses parts |
| Punctuation split | ["machine", "learning"] | Loses compound meaning |
| Subword (BPE) | ["mach", "ine", "learn", "ing"] | Handles OOV, obscures words |
Case Sensitivity:
Lowercasing merges variants such as "The"/"the" and shrinks the vocabulary, but it also conflates distinct meanings (e.g., "Apple" the company versus "apple" the fruit). Lowercasing is the common default for TF-based models.
Out-of-Vocabulary (OOV) Handling:
At prediction time, documents may contain terms not in the training vocabulary. Options:
- Silently ignore OOV terms (the standard approach; unseen terms contribute nothing)
- Map all OOV terms to a shared "unknown" index
- Use feature hashing, which avoids maintaining a fixed vocabulary entirely
In practice, use sklearn.feature_extraction.text.CountVectorizer for raw TF or TfidfVectorizer with use_idf=False for normalized TF. These implementations are highly optimized and handle edge cases correctly. Our custom implementation above is for pedagogical purposes—understanding the mechanics enables debugging production issues.
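A quick sketch of the scikit-learn equivalents (assuming scikit-learn is installed; `norm="l1"` makes each row sum to 1, matching length-normalized TF):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "machine learning algorithms learn patterns from data",
    "deep learning uses neural networks for learning",
]

# Raw TF: plain term counts per document
raw = CountVectorizer().fit_transform(docs)

# Normalized TF: counts scaled so each document row sums to 1, no IDF
norm = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs)

print(raw.toarray())
print(norm.toarray())
```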
We've now established a rigorous understanding of Term Frequency—the first half of TF-IDF. Let's consolidate the key insights:
- Frequency signals relevance: repeated terms indicate topical emphasis
- Raw counts mislead: normalization and log scaling correct for document length and dampen high frequencies
- Sparsity is essential: sparse matrix formats make TF tractable at corpus scale
- TF is purely statistical: it cannot handle stop words, synonyms, polysemy, or word order
The Bridge to IDF:
Term Frequency answers: How important is this term within this document?
But it fails to answer: How distinctive is this term across the entire corpus?
A term like "the" has high TF in every document, making it useless for distinguishing documents. The Inverse Document Frequency (IDF) component, which we'll explore next, measures term rarity across a corpus.
The product TF × IDF:
$$w_{t,d} = \text{tf}(t, d) \times \text{idf}(t)$$
assigns the highest weight to terms that are frequent within a document yet rare across the corpus. This combination is the insight that made TF-IDF one of the most successful text representation techniques in information retrieval history.
You now have a thorough understanding of Term Frequency—its mathematical foundations, variants, implementation, complexity, and limitations. Next, we'll explore Inverse Document Frequency (IDF), which addresses TF's inability to distinguish common from rare terms.