At first glance, counting how often words appear in a document seems trivially simple—just iterate through the text and increment counters. Yet Term Frequency (TF) is one of the most important concepts in information retrieval and text feature engineering, with subtle nuances that significantly impact downstream model performance.
Consider this question: If a document contains the word "algorithm" 10 times and another document contains it 5 times, is the first document necessarily twice as relevant to a query about algorithms? What if the first document is 10,000 words long and the second is only 100 words? What if "algorithm" appears 10 times but so do 50 other terms—versus appearing 10 times when only 20 unique terms exist?
Term Frequency is the formalization of word importance within a document. Getting it right—choosing the appropriate normalization, handling edge cases, and understanding the statistical implications—separates effective text representations from naive ones.
By the end of this page, you will understand: (1) The formal definition of term frequency and its variants, (2) Why raw counts alone are insufficient for document comparison, (3) Different normalization strategies and when to apply each, (4) The mathematical properties that make TF effective, and (5) How TF connects to TF-IDF and modern retrieval systems.
Raw term frequency is the simplest and most intuitive definition: the number of times a term appears in a document.
Definition:
For a term t and document d, the raw term frequency is:
$$tf(t, d) = f_{t,d}$$
where $f_{t,d}$ denotes the count of occurrences of term t in document d.
Example:
Consider the document: "The quick brown fox jumps over the lazy dog. The fox is quick."
| Term | Raw TF |
|---|---|
| the | 3 |
| quick | 2 |
| fox | 2 |
| brown | 1 |
| jumps | 1 |
| over | 1 |
| lazy | 1 |
| dog | 1 |
| is | 1 |
Raw TF captures the intuition that words appearing more frequently are more important to the document's meaning. A document mentioning "machine learning" 20 times is probably more focused on ML than one mentioning it once.
```python
from collections import Counter
from typing import Dict, List
import re

def compute_raw_tf(document: str) -> Dict[str, int]:
    """
    Compute raw term frequency for all terms in a document.

    Args:
        document: Input text string

    Returns:
        Dictionary mapping each term to its raw frequency
    """
    # Tokenize: lowercase and split on non-alphanumeric
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    # Count occurrences
    return dict(Counter(tokens))

def get_term_frequency(term: str, document: str) -> int:
    """
    Get the raw frequency of a specific term in a document.

    Args:
        term: The term to count
        document: The document text

    Returns:
        Number of occurrences of term in document
    """
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    return tokens.count(term.lower())

# Demonstration
doc = "The quick brown fox jumps over the lazy dog. The fox is quick."

tf = compute_raw_tf(doc)
print("Raw Term Frequencies:")
for term, freq in sorted(tf.items(), key=lambda x: -x[1]):
    print(f"  {term}: {freq}")

# Query specific term
print(f"\nFrequency of 'fox': {get_term_frequency('fox', doc)}")
print(f"Frequency of 'cat': {get_term_frequency('cat', doc)}")
```

Raw TF has a fundamental flaw: it's biased toward longer documents. A 10,000-word document will naturally have higher raw TF values than a 100-word document, even if both discuss the same topic with equal focus. This makes raw TF unsuitable for comparing documents of different lengths.
To compare term importance across documents of varying lengths, we must normalize raw frequencies. Several normalization strategies exist, each with distinct properties and use cases.
Relative Term Frequency (L1 Normalization):
The most intuitive normalization divides raw count by document length:
$$tf_{norm}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} = \frac{f_{t,d}}{|d|}$$
where $|d|$ is the total number of term occurrences in document d (document length in tokens).
This produces values in [0, 1] that sum to 1 across all terms—essentially converting counts to a probability distribution over vocabulary terms.
Example:
If "algorithm" appears 5 times in a 100-word document: $$tf_{norm}(\text{algorithm}, d) = \frac{5}{100} = 0.05$$
If "algorithm" appears 10 times in a 10,000-word document: $$tf_{norm}(\text{algorithm}, d') = \frac{10}{10000} = 0.001$$
Despite having higher raw frequency, the second document has lower normalized TF—indicating "algorithm" is less central to its content.
| Method | Formula | Range | Properties |
|---|---|---|---|
| Raw Count | f(t,d) | [0, ∞) | Simple but biased toward long documents |
| L1 Normalization (Relative TF) | f(t,d) / \|d\| | [0, 1] | Sums to 1; represents proportion of document |
| L2 Normalization (Unit Vector) | f(t,d) / \|\|d\|\|₂ | [0, 1] | Document vector has unit length; cosine similarity = dot product |
| Max Normalization | f(t,d) / max(f(t',d)) | [0, 1] | Scales relative to most frequent term in document |
```python
import numpy as np
from collections import Counter
from typing import Dict, Literal
import re

NormMethod = Literal["raw", "l1", "l2", "max"]

def compute_normalized_tf(
    document: str,
    method: NormMethod = "l1"
) -> Dict[str, float]:
    """
    Compute term frequency with various normalization methods.

    Args:
        document: Input text string
        method: Normalization method ('raw', 'l1', 'l2', 'max')

    Returns:
        Dictionary mapping terms to normalized frequencies
    """
    # Tokenize
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    if not tokens:
        return {}

    # Get raw counts
    raw_counts = Counter(tokens)

    if method == "raw":
        return dict(raw_counts)

    elif method == "l1":
        # Divide by document length (sum of all counts)
        total = sum(raw_counts.values())
        return {term: count / total for term, count in raw_counts.items()}

    elif method == "l2":
        # Divide by L2 norm (Euclidean length of count vector)
        l2_norm = np.sqrt(sum(c ** 2 for c in raw_counts.values()))
        return {term: count / l2_norm for term, count in raw_counts.items()}

    elif method == "max":
        # Divide by maximum term frequency in document
        max_freq = max(raw_counts.values())
        return {term: count / max_freq for term, count in raw_counts.items()}

    else:
        raise ValueError(f"Unknown normalization method: {method}")

# Demonstration with two documents of different lengths
doc_short = "Algorithm design. Algorithm analysis. Algorithm optimization."
doc_long = (
    "This comprehensive guide covers software engineering practices. "
    "We discuss algorithm design principles briefly. "
    "The focus is on maintainability, scalability, testing, documentation, "
    "code review, deployment strategies, monitoring, and team collaboration."
)

print("Short document (6 words):")
tf_short = compute_normalized_tf(doc_short, "l1")
print(f"  TF('algorithm') = {tf_short.get('algorithm', 0):.4f}")

print("\nLong document (~30 words):")
tf_long = compute_normalized_tf(doc_long, "l1")
print(f"  TF('algorithm') = {tf_long.get('algorithm', 0):.4f}")

print("\n--- Comparison of normalization methods ---")
doc = "data science data analysis data visualization machine learning"

for method in ["raw", "l1", "l2", "max"]:
    tf = compute_normalized_tf(doc, method)
    print(f"\n{method.upper()} normalization:")
    for term in sorted(tf.keys()):
        print(f"  {term}: {tf[term]:.4f}")
```

A fundamental question in TF design is: Does the 10th occurrence of a word provide as much information as the 1st?
Intuitively, the answer is no. The difference between a word appearing 0 times and 1 time is significant—it tells us the document is about that topic. The difference between 10 and 11 occurrences is much less meaningful.
This insight motivates sublinear scaling of term frequency, where additional occurrences provide diminishing returns.
Logarithmic TF:
The most common sublinear scaling applies a logarithm:
$$tf_{log}(t, d) = \begin{cases} 1 + \log(f_{t,d}) & \text{if } f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
The +1 ensures the minimum value for present terms is 1 (since log(1) = 0).
Effect of logarithmic scaling:
| Raw TF | Log TF (1 + log) |
|---|---|
| 1 | 1.00 |
| 2 | 1.69 |
| 5 | 2.61 |
| 10 | 3.30 |
| 100 | 5.61 |
| 1000 | 7.91 |
Notice how a 1000x increase in raw frequency translates to only ~8x increase in log TF. This compression prevents highly frequent terms from dominating the representation.
The choice of logarithm isn't arbitrary—it reflects information-theoretic principles. In information theory, the 'surprise' or information content of an event with probability p is -log(p). Zipf's law tells us word frequencies follow power-law distributions. Logarithmic scaling effectively linearizes this distribution, making TF values more comparable across the vocabulary.
```python
import numpy as np
from collections import Counter
from typing import Dict, Literal
import re

SublinearMethod = Literal["raw", "log", "double_norm", "boolean"]

def compute_sublinear_tf(
    document: str,
    method: SublinearMethod = "log",
    k: float = 0.5  # For double normalization
) -> Dict[str, float]:
    """
    Compute term frequency with sublinear scaling.

    Args:
        document: Input text
        method: Scaling method
        k: Smoothing parameter for double normalization (typically 0.4-0.5)

    Returns:
        Dictionary mapping terms to scaled frequencies
    """
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    if not tokens:
        return {}

    raw_counts = Counter(tokens)

    if method == "raw":
        return {t: float(c) for t, c in raw_counts.items()}

    elif method == "log":
        # 1 + log(count) for count > 0
        return {
            term: 1 + np.log(count) if count > 0 else 0
            for term, count in raw_counts.items()
        }

    elif method == "double_norm":
        # k + (1-k) * (count / max_count)
        # Also called "augmented frequency"
        max_count = max(raw_counts.values())
        return {
            term: k + (1 - k) * (count / max_count)
            for term, count in raw_counts.items()
        }

    elif method == "boolean":
        # Binary: 1 if present, 0 otherwise
        return {term: 1.0 for term in raw_counts}

    else:
        raise ValueError(f"Unknown method: {method}")

# Demonstrate the dampening effect
doc = "neural " * 100 + "networks " * 50 + "deep learning machine algorithm"

print("Sublinear scaling comparison:")
print(f"\nDocument has 'neural' 100x, 'networks' 50x, others 1x")

for method in ["raw", "log", "double_norm", "boolean"]:
    tf = compute_sublinear_tf(doc, method)
    print(f"\n{method.upper()}:")
    # Show top 5 by value
    sorted_tf = sorted(tf.items(), key=lambda x: -x[1])[:5]
    for term, val in sorted_tf:
        print(f"  {term}: {val:.3f}")

# Show the compression effect across different frequencies
print("\n--- Logarithmic compression ---")
print("Raw Count → Log TF")
for raw in [1, 2, 5, 10, 20, 50, 100, 500, 1000]:
    log_tf = 1 + np.log(raw)
    print(f"  {raw:5d} → {log_tf:.3f}")
```

Double Normalization (Augmented Frequency):
Another sublinear approach, often called "augmented" or "double" normalization, scales TF relative to the maximum frequency in the document:
$$tf_{aug}(t, d) = k + (1 - k) \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$$
where k is a smoothing constant (typically 0.4 or 0.5).
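As a quick worked example with made-up counts: take k = 0.5 and a document whose most frequent term appears 6 times. A term appearing 3 times receives $$tf_{aug} = 0.5 + 0.5 \cdot \frac{3}{6} = 0.75,$$ while a term appearing once receives $0.5 + 0.5 \cdot \frac{1}{6} \approx 0.58$.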
This approach bounds TF values for present terms to the interval [k, 1] and scales every count against the document's own most frequent term, which dampens the advantage that long or repetitive documents gain from sheer repetition.
Double normalization was popular in early information retrieval systems (notably in the SMART system) but has been largely superseded by simpler schemes combined with IDF weighting.
Term Frequency originated in information retrieval (IR)—the science of finding relevant documents in response to user queries. Understanding TF's role in IR illuminates why certain design choices became standard.
The Retrieval Problem:
Given a query q = (q₁, q₂, ..., qₖ) consisting of query terms, rank documents in a corpus by relevance to q.
TF-based Scoring:
A simple scoring function sums TF values for query terms:
$$score(q, d) = \sum_{t \in q} tf(t, d)$$
Documents mentioning query terms more frequently score higher.
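As a minimal sketch of this additive scoring (using the log-TF variant from earlier; the corpus, query, and helper names here are illustrative, not a standard API):

```python
from collections import Counter
import math
import re

def log_tf(term: str, document: str) -> float:
    """1 + log(count) if the term occurs in the document, else 0."""
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    count = Counter(tokens)[term.lower()]
    return 1 + math.log(count) if count > 0 else 0.0

def tf_score(query: str, document: str) -> float:
    """score(q, d) = sum of tf(t, d) over the query terms t."""
    return sum(log_tf(t, document) for t in query.lower().split())

docs = [
    "Sorting algorithms: quicksort is a divide and conquer algorithm.",
    "Today we discuss gardening, not a single algorithm in depth.",
]
for d in docs:
    print(f"{tf_score('algorithm quicksort', d):.2f}  {d}")
```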
Vector Space Model:
In the vector space model, both queries and documents are represented as TF vectors. Relevance is measured by vector similarity:
$$similarity(q, d) = \frac{\vec{q} \cdot \vec{d}}{||\vec{q}|| \cdot ||\vec{d}||}$$
This cosine similarity measures the angle between query and document vectors, making it invariant to document length when vectors are L2-normalized.
Queries also have term frequencies. In 'cheap flights cheap hotels', 'cheap' has TF=2. This double-weights relevance for the repeated term, reflecting the user's emphasis. Short queries typically use raw TF; normalization matters less when queries have few terms.
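A quick check of the query-side counts (a minimal sketch):

```python
from collections import Counter

query = "cheap flights cheap hotels"
print(Counter(query.split()))  # Counter({'cheap': 2, 'flights': 1, 'hotels': 1})
```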
TF alone leads to poor retrieval because common words dominate. 'The' appears frequently in every document—but it's not discriminative. This motivates IDF (Inverse Document Frequency), covered in the next module, which downweights ubiquitous terms.
```python
import numpy as np
from collections import Counter
from typing import List, Tuple
import re

def tokenize(text: str) -> List[str]:
    """Simple tokenization: lowercase + alphanumeric."""
    return re.findall(r'\b[a-z]+\b', text.lower())

def build_tf_vector(
    tokens: List[str],
    vocabulary: dict
) -> np.ndarray:
    """
    Build TF vector for a token sequence.

    Args:
        tokens: List of tokens
        vocabulary: Mapping from term to index

    Returns:
        TF vector of shape (vocab_size,)
    """
    vec = np.zeros(len(vocabulary))
    counts = Counter(tokens)
    for term, count in counts.items():
        if term in vocabulary:
            # Using log TF
            vec[vocabulary[term]] = 1 + np.log(count)
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return np.dot(a, b) / (norm_a * norm_b)

def tf_retrieval(
    query: str,
    corpus: List[str]
) -> List[Tuple[int, float, str]]:
    """
    Simple TF-based document retrieval.

    Args:
        query: Search query
        corpus: List of documents

    Returns:
        List of (doc_index, score, doc_preview) sorted by score descending
    """
    # Build vocabulary from corpus
    all_tokens = []
    doc_tokens = []
    for doc in corpus:
        tokens = tokenize(doc)
        doc_tokens.append(tokens)
        all_tokens.extend(tokens)

    vocabulary = {term: idx for idx, term in enumerate(set(all_tokens))}

    # Build query vector
    query_tokens = tokenize(query)
    query_vec = build_tf_vector(query_tokens, vocabulary)

    # Score each document
    results = []
    for doc_idx, tokens in enumerate(doc_tokens):
        doc_vec = build_tf_vector(tokens, vocabulary)
        score = cosine_similarity(query_vec, doc_vec)
        preview = corpus[doc_idx][:80] + "..." if len(corpus[doc_idx]) > 80 else corpus[doc_idx]
        results.append((doc_idx, score, preview))

    # Sort by score descending
    results.sort(key=lambda x: -x[1])
    return results

# Demo: Simple search engine
corpus = [
    "Machine learning algorithms learn patterns from data automatically.",
    "Deep neural networks have multiple hidden layers for feature learning.",
    "Data preprocessing is essential before training machine learning models.",
    "The weather today is sunny with a chance of rain.",
    "Natural language processing uses machine learning for text analysis.",
    "Supervised learning requires labeled training examples.",
]

query = "machine learning training"

print(f"Query: '{query}'")
print("\nRanked results (TF-based):")
print("-" * 60)

for doc_idx, score, preview in tf_retrieval(query, corpus):
    if score > 0:
        print(f"  Score: {score:.4f} | Doc {doc_idx}: {preview}")
```

Understanding the mathematical properties of TF helps explain its effectiveness and limitations. Let's analyze key properties rigorously.
Property 1: Non-negativity
$$tf(t, d) \geq 0 \quad \forall t, d$$
TF values are always non-negative. This seems obvious but is important: it means document vectors lie in the positive orthant of the feature space, which has implications for distance metrics and model assumptions.
Property 2: Sparsity
For any document d, the TF vector is sparse: $$|\{t : tf(t, d) > 0\}| \ll |V|$$
Most vocabulary terms don't appear in any given document. A 500-word document might have 200 unique terms from a 100,000-word vocabulary—99.8% zero values.
Property 3: Zipf's Law Distribution
Term frequencies follow Zipf's law: $$f_r \propto \frac{1}{r^\alpha}$$
where r is the rank of a term by frequency, and α ≈ 1. In practice this means the most frequent term appears roughly twice as often as the second-ranked term and ten times as often as the tenth, while most vocabulary terms occur only once or twice; a handful of terms accounts for the bulk of all token occurrences.
This extreme skew motivates logarithmic scaling—without it, common terms completely dominate representations.
```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import re

# Sample text (you'd use a real corpus in practice)
text = """Natural language processing is a field of artificial intelligence.
Language models learn from large text corpora.
Machine learning algorithms process text data.
Deep learning has revolutionized natural language processing.
Text classification, sentiment analysis, and machine translation
are common natural language processing tasks.
Language understanding requires processing context and meaning."""

# Tokenize and count
tokens = re.findall(r'\b[a-z]+\b', text.lower())
freq = Counter(tokens)

# Sort by frequency (descending)
sorted_freq = sorted(freq.values(), reverse=True)
ranks = range(1, len(sorted_freq) + 1)

print("Demonstrating Zipf's Law in term frequencies:")
print("\nRank | Frequency | Term")
print("-" * 35)
for i, (term, count) in enumerate(freq.most_common(15), 1):
    print(f"  {i:2d} | {count:3d} | {term}")

# Calculate sparsity for a hypothetical large vocabulary
vocab_size = 100000  # Realistic English vocabulary
unique_terms = len(freq)
sparsity = 1 - (unique_terms / vocab_size)

print(f"\n--- Sparsity Analysis ---")
print(f"Vocabulary size (full): {vocab_size:,}")
print(f"Unique terms in document: {unique_terms}")
print(f"Sparsity ratio: {sparsity:.4%}")
print(f"Non-zero entries: {(1-sparsity)*100:.4f}%")

# Zipf's Law verification
print("\n--- Zipf's Law (rank × frequency ≈ constant) ---")
for rank, f in list(zip(ranks, sorted_freq))[:10]:
    print(f"Rank {rank:2d}: freq={f:3d}, rank×freq={rank*f}")
```

Property 4: Linear Additivity
For document concatenation, raw TF is additive: $$tf(t, d_1 \oplus d_2) = tf(t, d_1) + tf(t, d_2)$$
This property is useful for document merging and hierarchical text aggregation. However, normalized TF variants don't preserve this property—the normalization denominator changes.
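The claim above is easy to verify directly; here is a small sketch (the two snippets of text are arbitrary):

```python
from collections import Counter
import re

def raw_tf(text: str) -> Counter:
    """Raw term counts."""
    return Counter(re.findall(r'\b[a-z]+\b', text.lower()))

def l1_tf(text: str) -> dict:
    """L1-normalized (relative) term frequency."""
    counts = raw_tf(text)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

d1 = "data analysis and data cleaning"
d2 = "data visualization"
merged = d1 + " " + d2

# Raw TF is additive under concatenation ...
assert raw_tf(merged)["data"] == raw_tf(d1)["data"] + raw_tf(d2)["data"]  # 3 == 2 + 1

# ... but L1-normalized TF is not: the denominator changes when documents merge
print(l1_tf(d1)["data"], l1_tf(d2)["data"], l1_tf(merged)["data"])  # 0.4, 0.5, ~0.43
```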
Property 5: Invariance Under Permutation
Bag of Words, and thus TF, is invariant to word order: $$tf(t, permute(d)) = tf(t, d)$$
This is a defining characteristic: "dog bites man" and "man bites dog" have identical TF representations. It's both a strength (simplicity, robustness to syntactic variation) and a weakness (loss of semantic distinctions).
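A one-line check of this invariance (a minimal sketch):

```python
from collections import Counter

# Identical bag-of-words representations, despite opposite meanings
print(Counter("dog bites man".split()) == Counter("man bites dog".split()))  # True
```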
Property 6: Dimensionality
TF vectors have dimensionality equal to vocabulary size: $$dim(tf(d)) = |V|$$
This high dimensionality (often 10,000-100,000+) would be computationally prohibitive without sparsity. Sparse representations exploit the fact that only ~0.1% of entries are non-zero.
Different TF computation variants are standardized in information retrieval research using the SMART notation (developed at Cornell University). Understanding this notation helps you read academic papers and configure retrieval systems.
| Code | Name | Formula | Common Use |
|---|---|---|---|
| n | Natural | f(t,d) | Raw counts; simple baseline |
| l | Logarithm | 1 + log(f(t,d)) | Most common; dampens high frequencies |
| a | Augmented | 0.5 + 0.5 × f(t,d) / max(f) | Bounds to [0.5, 1]; legacy systems |
| b | Boolean | 1 if f(t,d) > 0 else 0 | Presence-only; spam detection |
| L | Log average | (1 + log(f)) / (1 + log(avg)) | Normalized by average TF in doc |
Recommended TF Variants by Task:
Text Classification: Logarithmic (l) or Boolean (b)
Document Retrieval: Logarithmic (l) combined with IDF
Semantic Similarity: Logarithmic (l) with L2 normalization
Topic Modeling: Natural (n) or Logarithmic (l)
Short Text (tweets, titles): Boolean (b) or Natural (n)
```python
import numpy as np
from collections import Counter
from typing import Dict, Literal
import re

SmartTF = Literal["n", "l", "a", "b", "L"]

def smart_tf(
    document: str,
    scheme: SmartTF = "l"
) -> Dict[str, float]:
    """
    Compute TF using SMART notation schemes.

    Args:
        document: Input text
        scheme: SMART TF code (n, l, a, b, L)

    Returns:
        Dictionary of term frequencies
    """
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    if not tokens:
        return {}

    raw = Counter(tokens)

    if scheme == "n":
        # Natural (raw count)
        return {t: float(c) for t, c in raw.items()}

    elif scheme == "l":
        # Logarithm: 1 + log(f)
        return {t: 1 + np.log(c) for t, c in raw.items()}

    elif scheme == "a":
        # Augmented: 0.5 + 0.5 * f / max_f
        max_f = max(raw.values())
        return {t: 0.5 + 0.5 * (c / max_f) for t, c in raw.items()}

    elif scheme == "b":
        # Boolean: 1 if present
        return {t: 1.0 for t in raw}

    elif scheme == "L":
        # Log average: (1 + log(f)) / (1 + log(avg))
        avg_f = np.mean(list(raw.values()))
        denom = 1 + np.log(avg_f)
        return {t: (1 + np.log(c)) / denom for t, c in raw.items()}

    else:
        raise ValueError(f"Unknown SMART scheme: {scheme}")

# Compare schemes on a sample document
doc = """Machine learning machine learning neural networks.
Deep learning uses neural networks for machine learning tasks.
Learning algorithms learn patterns."""

print("SMART TF Scheme Comparison:")
print("=" * 50)

for scheme in ["n", "l", "a", "b", "L"]:
    tf = smart_tf(doc, scheme)
    # Show key terms
    sorted_tf = sorted(tf.items(), key=lambda x: -x[1])[:5]
    print(f"\n[{scheme}] Scheme:")
    for term, val in sorted_tf:
        print(f"  {term:15s}: {val:6.3f}")
```

Computing TF at scale—across millions of documents with vocabularies of hundreds of thousands of terms—requires careful attention to computational efficiency.
```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer
)
from collections import Counter
import re
from typing import List
import time

def benchmark_tf_methods(corpus: List[str]):
    """
    Compare efficiency of different TF computation approaches.
    """
    print(f"Corpus size: {len(corpus)} documents")
    print(f"Total tokens: ~{sum(len(d.split()) for d in corpus):,}")
    print("-" * 50)

    # Method 1: Naive dense approach (don't do this!)
    # Skipped for large corpora - would exhaust memory

    # Method 2: Scikit-learn CountVectorizer (sparse, optimized)
    start = time.time()
    vectorizer = CountVectorizer(
        lowercase=True,
        token_pattern=r'\b[a-zA-Z]+\b'
    )
    X_count = vectorizer.fit_transform(corpus)
    elapsed_count = time.time() - start

    print(f"\nCountVectorizer:")
    print(f"  Time: {elapsed_count:.3f}s")
    print(f"  Shape: {X_count.shape}")
    print(f"  Non-zeros: {X_count.nnz:,} ({100*X_count.nnz/(X_count.shape[0]*X_count.shape[1]):.4f}%)")
    print(f"  Memory (sparse): ~{X_count.data.nbytes / 1024:.1f} KB")

    # Method 3: HashingVectorizer (no vocabulary storage)
    start = time.time()
    hasher = HashingVectorizer(
        n_features=2**16,  # 65,536 features
        lowercase=True,
        token_pattern=r'\b[a-zA-Z]+\b',
        norm=None,  # Raw counts
        alternate_sign=False
    )
    X_hash = hasher.fit_transform(corpus)
    elapsed_hash = time.time() - start

    print(f"\nHashingVectorizer:")
    print(f"  Time: {elapsed_hash:.3f}s")
    print(f"  Shape: {X_hash.shape} (fixed)")
    print(f"  Non-zeros: {X_hash.nnz:,}")
    print(f"  Memory (sparse): ~{X_hash.data.nbytes / 1024:.1f} KB")
    print(f"  Note: No vocabulary stored - constant memory!")

    return X_count, X_hash

# Generate sample corpus
np.random.seed(42)
vocab = [f"word{i}" for i in range(1000)]
corpus = [
    " ".join(np.random.choice(vocab, size=np.random.randint(50, 200)))
    for _ in range(1000)
]

X_count, X_hash = benchmark_tf_methods(corpus)

# Show sparse matrix efficiency
print("\n--- Sparse Matrix Formats ---")
dense_size = X_count.shape[0] * X_count.shape[1] * 4  # 4 bytes per int32
sparse_size = X_count.data.nbytes + X_count.indices.nbytes + X_count.indptr.nbytes

print(f"Dense storage would need: {dense_size / 1e6:.1f} MB")
print(f"Sparse storage (CSR): {sparse_size / 1e6:.3f} MB")
print(f"Compression ratio: {dense_size / sparse_size:.1f}x")
```

In production systems processing streaming text, HashingVectorizer is invaluable. It requires no vocabulary storage, handles out-of-vocabulary terms gracefully, and has constant memory. The trade-off: slight collision risk and inability to map features back to terms. For many applications (online learning, privacy-preserving ML), these trade-offs are acceptable.
We've explored Term Frequency in depth—from raw counts through normalization variants to computational optimizations. Let's consolidate the key insights and preview how TF evolves into TF-IDF.
The Missing Piece: Global Discriminative Power
TF tells us how important a term is within a document. But it doesn't tell us how discriminative that term is across documents.
Consider: "algorithm" appearing 5 times in a CS paper is significant. "The" appearing 50 times in the same paper is not—because "the" appears frequently in every document.
This observation leads to Inverse Document Frequency (IDF), covered in the next module:
$$idf(t) = \log\frac{N}{df(t)}$$
where N is the total number of documents and df(t) is the number of documents containing term t.
Combining TF and IDF yields TF-IDF, arguably the most successful classical text representation:
$$tfidf(t, d) = tf(t, d) \times idf(t)$$
This multiplies local importance (TF) by global discriminative power (IDF), creating representations that have driven information retrieval and text classification for decades.
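As a preview, here is a minimal sketch that combines the two formulas above on a toy corpus, using raw TF and the unsmoothed idf(t) = log(N / df(t)); production implementations (e.g., scikit-learn's TfidfVectorizer) differ in smoothing and normalization details:

```python
import math
import re
from collections import Counter

def tokenize(text: str):
    return re.findall(r'\b[a-z]+\b', text.lower())

corpus = [
    "the algorithm sorts the data",
    "the data pipeline cleans the data",
    "the cat sat on the mat",
]

# Document frequency: number of documents containing each term
N = len(corpus)
df = Counter(t for doc in corpus for t in set(tokenize(doc)))

def tfidf(doc: str) -> dict:
    """tfidf(t, d) = tf(t, d) * log(N / df(t)), with raw TF."""
    tf = Counter(tokenize(doc))
    return {t: c * math.log(N / df[t]) for t, c in tf.items()}

for term, weight in sorted(tfidf(corpus[0]).items(), key=lambda x: -x[1]):
    print(f"{term:10s} {weight:.3f}")
# 'the' gets weight 0 (it appears in every document); 'algorithm' and 'sorts' rank highest
```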
You now have a rigorous understanding of Term Frequency—its definitions, variants, mathematical properties, and computational considerations. This knowledge is foundational: TF is a component of TF-IDF, BM25, and even influences how neural language models process input. Next, we'll explore vocabulary construction strategies that determine which terms receive TF values.