At first glance, counting how often words appear in a document seems trivially simple—just iterate through the text and increment counters. Yet Term Frequency (TF) is one of the most important concepts in information retrieval and text feature engineering, with subtle nuances that significantly impact downstream model performance.
Consider this question: If a document contains the word "algorithm" 10 times and another document contains it 5 times, is the first document necessarily twice as relevant to a query about algorithms? What if the first document is 10,000 words long and the second is only 100 words? What if "algorithm" appears 10 times but so do 50 other terms—versus appearing 10 times when only 20 unique terms exist?
Term Frequency is the formalization of word importance within a document. Getting it right—choosing the appropriate normalization, handling edge cases, and understanding the statistical implications—separates effective text representations from naive ones.
By the end of this page, you will understand: (1) The formal definition of term frequency and its variants, (2) Why raw counts alone are insufficient for document comparison, (3) Different normalization strategies and when to apply each, (4) The mathematical properties that make TF effective, and (5) How TF connects to TF-IDF and modern retrieval systems.
Raw term frequency is the simplest and most intuitive definition: the number of times a term appears in a document.
Definition:
For a term t and document d, the raw term frequency is:
$$tf(t, d) = f_{t,d}$$
where $f_{t,d}$ denotes the count of occurrences of term t in document d.
Example:
Consider the document: "The quick brown fox jumps over the lazy dog. The fox is quick."
| Term | Raw TF |
|---|---|
| the | 3 |
| quick | 2 |
| fox | 2 |
| brown | 1 |
| jumps | 1 |
| over | 1 |
| lazy | 1 |
| dog | 1 |
| is | 1 |
Raw TF captures the intuition that words appearing more frequently are more important to the document's meaning. A document mentioning "machine learning" 20 times is probably more focused on ML than one mentioning it once.
```python
from collections import Counter
from typing import Dict, List
import re

def compute_raw_tf(document: str) -> Dict[str, int]:
    """
    Compute raw term frequency for all terms in a document.

    Args:
        document: Input text string

    Returns:
        Dictionary mapping each term to its raw frequency
    """
    # Tokenize: lowercase and split on non-alphanumeric
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    # Count occurrences
    return dict(Counter(tokens))

def get_term_frequency(term: str, document: str) -> int:
    """
    Get the raw frequency of a specific term in a document.

    Args:
        term: The term to count
        document: The document text

    Returns:
        Number of occurrences of term in document
    """
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    return tokens.count(term.lower())

# Demonstration
doc = "The quick brown fox jumps over the lazy dog. The fox is quick."

tf = compute_raw_tf(doc)
print("Raw Term Frequencies:")
for term, freq in sorted(tf.items(), key=lambda x: -x[1]):
    print(f"  {term}: {freq}")

# Query specific term
print(f"\nFrequency of 'fox': {get_term_frequency('fox', doc)}")
print(f"Frequency of 'cat': {get_term_frequency('cat', doc)}")
```

Raw TF has a fundamental flaw: it's biased toward longer documents. A 10,000-word document will naturally have higher raw TF values than a 100-word document, even if both discuss the same topic with equal focus. This makes raw TF unsuitable for comparing documents of different lengths.
To compare term importance across documents of varying lengths, we must normalize raw frequencies. Several normalization strategies exist, each with distinct properties and use cases.
Relative Term Frequency (L1 Normalization):
The most intuitive normalization divides raw count by document length:
$$tf_{norm}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} = \frac{f_{t,d}}{|d|}$$
where $|d|$ is the total number of term occurrences in document d (document length in tokens).
This produces values in [0, 1] that sum to 1 across all terms—essentially converting counts to a probability distribution over vocabulary terms.
Example:
If "algorithm" appears 5 times in a 100-word document: $$tf_{norm}(\text{algorithm}, d) = \frac{5}{100} = 0.05$$
If "algorithm" appears 10 times in a 10,000-word document: $$tf_{norm}(\text{algorithm}, d') = \frac{10}{10000} = 0.001$$
Despite having higher raw frequency, the second document has lower normalized TF—indicating "algorithm" is less central to its content.
| Method | Formula | Range | Properties |
|---|---|---|---|
| Raw Count | f(t,d) | [0, ∞) | Simple but biased toward long documents |
| L1 Normalization (Relative TF) | f(t,d) / \|d\| | [0, 1] | Sums to 1; represents proportion of document |
| L2 Normalization (Unit Vector) | f(t,d) / \|\|d\|\|₂ | [0, 1] | Document vector has unit length; cosine similarity = dot product |
| Max Normalization | f(t,d) / max(f(t',d)) | [0, 1] | Scales relative to most frequent term in document |
```python
import numpy as np
from collections import Counter
from typing import Dict, Literal
import re

NormMethod = Literal["raw", "l1", "l2", "max"]

def compute_normalized_tf(
    document: str,
    method: NormMethod = "l1"
) -> Dict[str, float]:
    """
    Compute term frequency with various normalization methods.

    Args:
        document: Input text string
        method: Normalization method ('raw', 'l1', 'l2', 'max')

    Returns:
        Dictionary mapping terms to normalized frequencies
    """
    # Tokenize
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    if not tokens:
        return {}

    # Get raw counts
    raw_counts = Counter(tokens)

    if method == "raw":
        return dict(raw_counts)

    elif method == "l1":
        # Divide by document length (sum of all counts)
        total = sum(raw_counts.values())
        return {term: count / total for term, count in raw_counts.items()}

    elif method == "l2":
        # Divide by L2 norm (Euclidean length of count vector)
        l2_norm = np.sqrt(sum(c ** 2 for c in raw_counts.values()))
        return {term: count / l2_norm for term, count in raw_counts.items()}

    elif method == "max":
        # Divide by maximum term frequency in document
        max_freq = max(raw_counts.values())
        return {term: count / max_freq for term, count in raw_counts.items()}

    else:
        raise ValueError(f"Unknown normalization method: {method}")

# Demonstration with two documents of different lengths
doc_short = "Algorithm design. Algorithm analysis. Algorithm optimization."
doc_long = (
    "This comprehensive guide covers software engineering practices. "
    "We discuss algorithm design principles briefly. "
    "The focus is on maintainability, scalability, testing, documentation, "
    "code review, deployment strategies, monitoring, and team collaboration."
)

print("Short document (6 words):")
tf_short = compute_normalized_tf(doc_short, "l1")
print(f"  TF('algorithm') = {tf_short.get('algorithm', 0):.4f}")

print("\nLong document (~30 words):")
tf_long = compute_normalized_tf(doc_long, "l1")
print(f"  TF('algorithm') = {tf_long.get('algorithm', 0):.4f}")

print("\n--- Comparison of normalization methods ---")
doc = "data science data analysis data visualization machine learning"

for method in ["raw", "l1", "l2", "max"]:
    tf = compute_normalized_tf(doc, method)
    print(f"\n{method.upper()} normalization:")
    for term in sorted(tf.keys()):
        print(f"  {term}: {tf[term]:.4f}")
```

A fundamental question in TF design is: Does the 10th occurrence of a word provide as much information as the 1st?
Intuitively, the answer is no. The difference between a word appearing 0 times and 1 time is significant—it tells us the document is about that topic. The difference between 10 and 11 occurrences is much less meaningful.
This insight motivates sublinear scaling of term frequency, where additional occurrences provide diminishing returns.
Logarithmic TF:
The most common sublinear scaling applies a logarithm:
$$tf_{log}(t, d) = \begin{cases} 1 + \log(f_{t,d}) & \text{if } f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
The +1 ensures the minimum value for present terms is 1 (since log(1) = 0).
Effect of logarithmic scaling:
| Raw TF | Log TF (1 + log) |
|---|---|
| 1 | 1.00 |
| 2 | 1.69 |
| 5 | 2.61 |
| 10 | 3.30 |
| 100 | 5.61 |
| 1000 | 7.91 |
Notice how a 1000x increase in raw frequency translates to only ~8x increase in log TF. This compression prevents highly frequent terms from dominating the representation.
The choice of logarithm isn't arbitrary—it reflects information-theoretic principles. In information theory, the 'surprise' or information content of an event with probability p is -log(p). Zipf's law tells us word frequencies follow power-law distributions. Logarithmic scaling effectively linearizes this distribution, making TF values more comparable across the vocabulary.
```python
import numpy as np
from collections import Counter
from typing import Dict, Literal
import re

SublinearMethod = Literal["raw", "log", "double_norm", "boolean"]

def compute_sublinear_tf(
    document: str,
    method: SublinearMethod = "log",
    k: float = 0.5  # For double normalization
) -> Dict[str, float]:
    """
    Compute term frequency with sublinear scaling.

    Args:
        document: Input text
        method: Scaling method
        k: Smoothing parameter for double normalization (typically 0.4-0.5)

    Returns:
        Dictionary mapping terms to scaled frequencies
    """
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    if not tokens:
        return {}

    raw_counts = Counter(tokens)

    if method == "raw":
        return {t: float(c) for t, c in raw_counts.items()}

    elif method == "log":
        # 1 + log(count) for count > 0
        return {
            term: 1 + np.log(count) if count > 0 else 0
            for term, count in raw_counts.items()
        }

    elif method == "double_norm":
        # k + (1-k) * (count / max_count)
        # Also called "augmented frequency"
        max_count = max(raw_counts.values())
        return {
            term: k + (1 - k) * (count / max_count)
            for term, count in raw_counts.items()
        }

    elif method == "boolean":
        # Binary: 1 if present, 0 otherwise
        return {term: 1.0 for term in raw_counts}

    else:
        raise ValueError(f"Unknown method: {method}")

# Demonstrate the dampening effect
doc = "neural " * 100 + "networks " * 50 + "deep learning machine algorithm"

print("Sublinear scaling comparison:")
print(f"\nDocument has 'neural' 100x, 'networks' 50x, others 1x")

for method in ["raw", "log", "double_norm", "boolean"]:
    tf = compute_sublinear_tf(doc, method)
    print(f"\n{method.upper()}:")
    # Show top 5 by value
    sorted_tf = sorted(tf.items(), key=lambda x: -x[1])[:5]
    for term, val in sorted_tf:
        print(f"  {term}: {val:.3f}")

# Show the compression effect across different frequencies
print("\n--- Logarithmic compression ---")
print("Raw Count → Log TF")
for raw in [1, 2, 5, 10, 20, 50, 100, 500, 1000]:
    log_tf = 1 + np.log(raw)
    print(f"  {raw:5d} → {log_tf:.3f}")
```

Double Normalization (Augmented Frequency):
Another sublinear approach, often called "augmented" or "double" normalization, scales TF relative to the maximum frequency in the document:
$$tf_{aug}(t, d) = k + (1 - k) \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$$
where k is a smoothing constant (typically 0.4 or 0.5).
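As a quick worked example with made-up counts: take k = 0.5 and a document whose most frequent term appears 6 times. A term appearing 3 times receives $$tf_{aug} = 0.5 + 0.5 \cdot \frac{3}{6} = 0.75,$$ while a term appearing once receives $0.5 + 0.5 \cdot \frac{1}{6} \approx 0.58$.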
This approach bounds TF values for present terms to the interval [k, 1] and scales every count against the document's own most frequent term, which dampens the advantage that long or repetitive documents gain from sheer repetition.
Double normalization was popular in early information retrieval systems (notably in the SMART system) but has been largely superseded by simpler schemes combined with IDF weighting.
Term Frequency originated in information retrieval (IR)—the science of finding relevant documents in response to user queries. Understanding TF's role in IR illuminates why certain design choices became standard.
The Retrieval Problem:
Given a query q = (q₁, q₂, ..., qₖ) consisting of query terms, rank documents in a corpus by relevance to q.
TF-based Scoring:
A simple scoring function sums TF values for query terms:
$$score(q, d) = \sum_{t \in q} tf(t, d)$$
Documents mentioning query terms more frequently score higher.
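As a minimal sketch of this additive scoring (using the log-TF variant from earlier; the corpus, query, and helper names here are illustrative, not a standard API):

```python
from collections import Counter
import math
import re

def log_tf(term: str, document: str) -> float:
    """1 + log(count) if the term occurs in the document, else 0."""
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    count = Counter(tokens)[term.lower()]
    return 1 + math.log(count) if count > 0 else 0.0

def tf_score(query: str, document: str) -> float:
    """score(q, d) = sum of tf(t, d) over the query terms t."""
    return sum(log_tf(t, document) for t in query.lower().split())

docs = [
    "Sorting algorithms: quicksort is a divide and conquer algorithm.",
    "Today we discuss gardening, not a single algorithm in depth.",
]
for d in docs:
    print(f"{tf_score('algorithm quicksort', d):.2f}  {d}")
```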
Vector Space Model:
In the vector space model, both queries and documents are represented as TF vectors. Relevance is measured by vector similarity:
$$similarity(q, d) = \frac{\vec{q} \cdot \vec{d}}{||\vec{q}|| \cdot ||\vec{d}||}$$
This cosine similarity measures the angle between query and document vectors, making it invariant to document length when vectors are L2-normalized.
Queries also have term frequencies. In 'cheap flights cheap hotels', 'cheap' has TF=2. This double-weights relevance for the repeated term, reflecting the user's emphasis. Short queries typically use raw TF; normalization matters less when queries have few terms.
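A quick check of the query-side counts (a minimal sketch):

```python
from collections import Counter

query = "cheap flights cheap hotels"
print(Counter(query.split()))  # Counter({'cheap': 2, 'flights': 1, 'hotels': 1})
```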
TF alone leads to poor retrieval because common words dominate. 'The' appears frequently in every document—but it's not discriminative. This motivates IDF (Inverse Document Frequency), covered in the next module, which downweights ubiquitous terms.
```python
import numpy as np
from collections import Counter
from typing import List, Tuple
import re

def tokenize(text: str) -> List[str]:
    """Simple tokenization: lowercase + alphanumeric."""
    return re.findall(r'\b[a-z]+\b', text.lower())

def build_tf_vector(
    tokens: List[str],
    vocabulary: dict
) -> np.ndarray:
    """
    Build TF vector for a token sequence.

    Args:
        tokens: List of tokens
        vocabulary: Mapping from term to index

    Returns:
        TF vector of shape (vocab_size,)
    """
    vec = np.zeros(len(vocabulary))
    counts = Counter(tokens)
    for term, count in counts.items():
        if term in vocabulary:
            # Using log TF
            vec[vocabulary[term]] = 1 + np.log(count)
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return np.dot(a, b) / (norm_a * norm_b)

def tf_retrieval(
    query: str,
    corpus: List[str]
) -> List[Tuple[int, float, str]]:
    """
    Simple TF-based document retrieval.

    Args:
        query: Search query
        corpus: List of documents

    Returns:
        List of (doc_index, score, doc_preview) sorted by score descending
    """
    # Build vocabulary from corpus
    all_tokens = []
    doc_tokens = []
    for doc in corpus:
        tokens = tokenize(doc)
        doc_tokens.append(tokens)
        all_tokens.extend(tokens)

    vocabulary = {term: idx for idx, term in enumerate(set(all_tokens))}

    # Build query vector
    query_tokens = tokenize(query)
    query_vec = build_tf_vector(query_tokens, vocabulary)

    # Score each document
    results = []
    for doc_idx, tokens in enumerate(doc_tokens):
        doc_vec = build_tf_vector(tokens, vocabulary)
        score = cosine_similarity(query_vec, doc_vec)
        preview = corpus[doc_idx][:80] + "..." if len(corpus[doc_idx]) > 80 else corpus[doc_idx]
        results.append((doc_idx, score, preview))

    # Sort by score descending
    results.sort(key=lambda x: -x[1])
    return results

# Demo: Simple search engine
corpus = [
    "Machine learning algorithms learn patterns from data automatically.",
    "Deep neural networks have multiple hidden layers for feature learning.",
    "Data preprocessing is essential before training machine learning models.",
    "The weather today is sunny with a chance of rain.",
    "Natural language processing uses machine learning for text analysis.",
    "Supervised learning requires labeled training examples.",
]

query = "machine learning training"

print(f"Query: '{query}'")
print("\nRanked results (TF-based):")
print("-" * 60)

for doc_idx, score, preview in tf_retrieval(query, corpus):
    if score > 0:
        print(f"  Score: {score:.4f} | Doc {doc_idx}: {preview}")
```

Understanding the mathematical properties of TF helps explain its effectiveness and limitations. Let's analyze key properties rigorously.
Property 1: Non-negativity
$$tf(t, d) \geq 0 \quad \forall t, d$$
TF values are always non-negative. This seems obvious but is important: it means document vectors lie in the positive orthant of the feature space, which has implications for distance metrics and model assumptions.
Property 2: Sparsity
For any document d, the TF vector is sparse: $$|\{t : tf(t, d) > 0\}| \ll |V|$$
Most vocabulary terms don't appear in any given document. A 500-word document might have 200 unique terms from a 100,000-word vocabulary—99.8% zero values.
Property 3: Zipf's Law Distribution
Term frequencies follow Zipf's law: $$f_r \propto \frac{1}{r^\alpha}$$
where r is the rank of a term by frequency, and α ≈ 1. In practice this means the most frequent term appears roughly twice as often as the second-ranked term and ten times as often as the tenth, while most vocabulary terms occur only once or twice; a handful of terms accounts for the bulk of all token occurrences.
This extreme skew motivates logarithmic scaling—without it, common terms completely dominate representations.
```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import re

# Sample text (you'd use a real corpus in practice)
text = """Natural language processing is a field of artificial intelligence.
Language models learn from large text corpora.
Machine learning algorithms process text data.
Deep learning has revolutionized natural language processing.
Text classification, sentiment analysis, and machine translation
are common natural language processing tasks.
Language understanding requires processing context and meaning."""

# Tokenize and count
tokens = re.findall(r'\b[a-z]+\b', text.lower())
freq = Counter(tokens)

# Sort by frequency (descending)
sorted_freq = sorted(freq.values(), reverse=True)
ranks = range(1, len(sorted_freq) + 1)

print("Demonstrating Zipf's Law in term frequencies:")
print("\nRank | Frequency | Term")
print("-" * 35)
for i, (term, count) in enumerate(freq.most_common(15), 1):
    print(f"  {i:2d} | {count:3d} | {term}")

# Calculate sparsity for a hypothetical large vocabulary
vocab_size = 100000  # Realistic English vocabulary
unique_terms = len(freq)
sparsity = 1 - (unique_terms / vocab_size)

print(f"\n--- Sparsity Analysis ---")
print(f"Vocabulary size (full): {vocab_size:,}")
print(f"Unique terms in document: {unique_terms}")
print(f"Sparsity ratio: {sparsity:.4%}")
print(f"Non-zero entries: {(1-sparsity)*100:.4f}%")

# Zipf's Law verification
print("\n--- Zipf's Law (rank × frequency ≈ constant) ---")
for rank, f in list(zip(ranks, sorted_freq))[:10]:
    print(f"Rank {rank:2d}: freq={f:3d}, rank×freq={rank*f}")
```

Property 4: Linear Additivity
For document concatenation, raw TF is additive: $$tf(t, d_1 \oplus d_2) = tf(t, d_1) + tf(t, d_2)$$
This property is useful for document merging and hierarchical text aggregation. However, normalized TF variants don't preserve this property—the normalization denominator changes.
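The claim above is easy to verify directly; here is a small sketch (the two snippets of text are arbitrary):

```python
from collections import Counter
import re

def raw_tf(text: str) -> Counter:
    """Raw term counts."""
    return Counter(re.findall(r'\b[a-z]+\b', text.lower()))

def l1_tf(text: str) -> dict:
    """L1-normalized (relative) term frequency."""
    counts = raw_tf(text)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

d1 = "data analysis and data cleaning"
d2 = "data visualization"
merged = d1 + " " + d2

# Raw TF is additive under concatenation ...
assert raw_tf(merged)["data"] == raw_tf(d1)["data"] + raw_tf(d2)["data"]  # 3 == 2 + 1

# ... but L1-normalized TF is not: the denominator changes when documents merge
print(l1_tf(d1)["data"], l1_tf(d2)["data"], l1_tf(merged)["data"])  # 0.4, 0.5, ~0.43
```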
Property 5: Invariance Under Permutation
Bag of Words, and thus TF, is invariant to word order: $$tf(t, permute(d)) = tf(t, d)$$
This is a defining characteristic: "dog bites man" and "man bites dog" have identical TF representations. It's both a strength (simplicity, robustness to syntactic variation) and a weakness (loss of semantic distinctions).
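A one-line check of this invariance (a minimal sketch):

```python
from collections import Counter

# Identical bag-of-words representations, despite opposite meanings
print(Counter("dog bites man".split()) == Counter("man bites dog".split()))  # True
```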
Property 6: Dimensionality
TF vectors have dimensionality equal to vocabulary size: $$dim(tf(d)) = |V|$$
This high dimensionality (often 10,000-100,000+) would be computationally prohibitive without sparsity. Sparse representations exploit the fact that only ~0.1% of entries are non-zero.
Different TF computation variants are standardized in information retrieval research using the SMART notation (developed at Cornell University). Understanding this notation helps you read academic papers and configure retrieval systems.
| Code | Name | Formula | Common Use |
|---|---|---|---|
| n | Natural | f(t,d) | Raw counts; simple baseline |
| l | Logarithm | 1 + log(f(t,d)) | Most common; dampens high frequencies |
| a | Augmented | 0.5 + 0.5 × f(t,d) / max(f) | Bounds to [0.5, 1]; legacy systems |
| b | Boolean | 1 if f(t,d) > 0 else 0 | Presence-only; spam detection |
| L | Log average | (1 + log(f)) / (1 + log(avg)) | Normalized by average TF in doc |
Recommended TF Variants by Task:
Text Classification: Logarithmic (l) or Boolean (b)
Document Retrieval: Logarithmic (l) combined with IDF
Semantic Similarity: Logarithmic (l) with L2 normalization
Topic Modeling: Natural (n) or Logarithmic (l)
Short Text (tweets, titles): Boolean (b) or Natural (n)
```python
import numpy as np
from collections import Counter
from typing import Dict, Literal
import re

SmartTF = Literal["n", "l", "a", "b", "L"]

def smart_tf(
    document: str,
    scheme: SmartTF = "l"
) -> Dict[str, float]:
    """
    Compute TF using SMART notation schemes.

    Args:
        document: Input text
        scheme: SMART TF code (n, l, a, b, L)

    Returns:
        Dictionary of term frequencies
    """
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    if not tokens:
        return {}

    raw = Counter(tokens)

    if scheme == "n":
        # Natural (raw count)
        return {t: float(c) for t, c in raw.items()}

    elif scheme == "l":
        # Logarithm: 1 + log(f)
        return {t: 1 + np.log(c) for t, c in raw.items()}

    elif scheme == "a":
        # Augmented: 0.5 + 0.5 * f / max_f
        max_f = max(raw.values())
        return {t: 0.5 + 0.5 * (c / max_f) for t, c in raw.items()}

    elif scheme == "b":
        # Boolean: 1 if present
        return {t: 1.0 for t in raw}

    elif scheme == "L":
        # Log average: (1 + log(f)) / (1 + log(avg))
        avg_f = np.mean(list(raw.values()))
        denom = 1 + np.log(avg_f)
        return {t: (1 + np.log(c)) / denom for t, c in raw.items()}

    else:
        raise ValueError(f"Unknown SMART scheme: {scheme}")

# Compare schemes on a sample document
doc = """Machine learning machine learning neural networks.
Deep learning uses neural networks for machine learning tasks.
Learning algorithms learn patterns."""

print("SMART TF Scheme Comparison:")
print("=" * 50)

for scheme in ["n", "l", "a", "b", "L"]:
    tf = smart_tf(doc, scheme)
    # Show key terms
    sorted_tf = sorted(tf.items(), key=lambda x: -x[1])[:5]
    print(f"\n[{scheme}] Scheme:")
    for term, val in sorted_tf:
        print(f"  {term:15s}: {val:6.3f}")
```

Computing TF at scale—across millions of documents with vocabularies of hundreds of thousands of terms—requires careful attention to computational efficiency.
```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer
)
from collections import Counter
import re
from typing import List
import time

def benchmark_tf_methods(corpus: List[str]):
    """
    Compare efficiency of different TF computation approaches.
    """
    print(f"Corpus size: {len(corpus)} documents")
    print(f"Total tokens: ~{sum(len(d.split()) for d in corpus):,}")
    print("-" * 50)

    # Method 1: Naive dense approach (don't do this!)
    # Skipped for large corpora - would exhaust memory

    # Method 2: Scikit-learn CountVectorizer (sparse, optimized)
    start = time.time()
    vectorizer = CountVectorizer(
        lowercase=True,
        token_pattern=r'\b[a-zA-Z]+\b'
    )
    X_count = vectorizer.fit_transform(corpus)
    elapsed_count = time.time() - start

    print(f"\nCountVectorizer:")
    print(f"  Time: {elapsed_count:.3f}s")
    print(f"  Shape: {X_count.shape}")
    print(f"  Non-zeros: {X_count.nnz:,} ({100*X_count.nnz/(X_count.shape[0]*X_count.shape[1]):.4f}%)")
    print(f"  Memory (sparse): ~{X_count.data.nbytes / 1024:.1f} KB")

    # Method 3: HashingVectorizer (no vocabulary storage)
    start = time.time()
    hasher = HashingVectorizer(
        n_features=2**16,  # 65,536 features
        lowercase=True,
        token_pattern=r'\b[a-zA-Z]+\b',
        norm=None,  # Raw counts
        alternate_sign=False
    )
    X_hash = hasher.fit_transform(corpus)
    elapsed_hash = time.time() - start

    print(f"\nHashingVectorizer:")
    print(f"  Time: {elapsed_hash:.3f}s")
    print(f"  Shape: {X_hash.shape} (fixed)")
    print(f"  Non-zeros: {X_hash.nnz:,}")
    print(f"  Memory (sparse): ~{X_hash.data.nbytes / 1024:.1f} KB")
    print(f"  Note: No vocabulary stored - constant memory!")

    return X_count, X_hash

# Generate sample corpus
np.random.seed(42)
vocab = [f"word{i}" for i in range(1000)]
corpus = [
    " ".join(np.random.choice(vocab, size=np.random.randint(50, 200)))
    for _ in range(1000)
]

X_count, X_hash = benchmark_tf_methods(corpus)

# Show sparse matrix efficiency
print("\n--- Sparse Matrix Formats ---")
dense_size = X_count.shape[0] * X_count.shape[1] * 4  # 4 bytes per int32
sparse_size = X_count.data.nbytes + X_count.indices.nbytes + X_count.indptr.nbytes

print(f"Dense storage would need: {dense_size / 1e6:.1f} MB")
print(f"Sparse storage (CSR): {sparse_size / 1e6:.3f} MB")
print(f"Compression ratio: {dense_size / sparse_size:.1f}x")
```

In production systems processing streaming text, HashingVectorizer is invaluable. It requires no vocabulary storage, handles out-of-vocabulary terms gracefully, and has constant memory. The trade-off: slight collision risk and inability to map features back to terms. For many applications (online learning, privacy-preserving ML), these trade-offs are acceptable.
We've explored Term Frequency in depth—from raw counts through normalization variants to computational optimizations. Let's consolidate the key insights and preview how TF evolves into TF-IDF.
The Missing Piece: Global Discriminative Power
TF tells us how important a term is within a document. But it doesn't tell us how discriminative that term is across documents.
Consider: "algorithm" appearing 5 times in a CS paper is significant. "The" appearing 50 times in the same paper is not—because "the" appears frequently in every document.
This observation leads to Inverse Document Frequency (IDF), covered in the next module:
$$idf(t) = \log\frac{N}{df(t)}$$
where N is the total number of documents and df(t) is the number of documents containing term t.
Combining TF and IDF yields TF-IDF, arguably the most successful classical text representation:
$$tfidf(t, d) = tf(t, d) \times idf(t)$$
This multiplies local importance (TF) by global discriminative power (IDF), creating representations that have driven information retrieval and text classification for decades.
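As a preview, here is a minimal sketch that combines the two formulas above on a toy corpus, using raw TF and the unsmoothed idf(t) = log(N / df(t)); production implementations (e.g., scikit-learn's TfidfVectorizer) differ in smoothing and normalization details:

```python
import math
import re
from collections import Counter

def tokenize(text: str):
    return re.findall(r'\b[a-z]+\b', text.lower())

corpus = [
    "the algorithm sorts the data",
    "the data pipeline cleans the data",
    "the cat sat on the mat",
]

# Document frequency: number of documents containing each term
N = len(corpus)
df = Counter(t for doc in corpus for t in set(tokenize(doc)))

def tfidf(doc: str) -> dict:
    """tfidf(t, d) = tf(t, d) * log(N / df(t)), with raw TF."""
    tf = Counter(tokenize(doc))
    return {t: c * math.log(N / df[t]) for t, c in tf.items()}

for term, weight in sorted(tfidf(corpus[0]).items(), key=lambda x: -x[1]):
    print(f"{term:10s} {weight:.3f}")
# 'the' gets weight 0 (it appears in every document); 'algorithm' and 'sorts' rank highest
```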
You now have a rigorous understanding of Term Frequency—its definitions, variants, mathematical properties, and computational considerations. This knowledge is foundational: TF is a component of TF-IDF, BM25, and even influences how neural language models process input. Next, we'll explore vocabulary construction strategies that determine which terms receive TF values.