We've now developed deep understanding of both Term Frequency (local importance) and Inverse Document Frequency (global discrimination). The power of TF-IDF lies in their combination—a term matters when it's frequent in a document AND rare across the corpus.
But how exactly should we combine them? It turns out there are many valid weighting schemes, each with different properties. The classic "TF × IDF" is just one option among several, and the choice matters more than many practitioners realize.
This page explores the complete landscape of TF-IDF weighting: the SMART notation for describing schemes, common variants, implementation details, and guidance on selecting the right approach for your task.
By the end of this page, you will understand: (1) How TF and IDF combine into complete weighting schemes, (2) The SMART notation for describing TF-IDF variants, (3) Common weighting schemes and their trade-offs, (4) Complete implementation of multiple TF-IDF variants, and (5) How to choose the right scheme for your application.
At its core, TF-IDF is a product of two factors:
$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
Interpretation: TF captures local importance (how prominent a term is within one document), while IDF captures global rarity (how distinctive the term is across the corpus). The product is large only when both factors are high.
Example:
Consider a document about "deep learning" in a general news corpus:
| Term | TF in Doc | IDF | TF-IDF | Interpretation |
|---|---|---|---|---|
| "the" | 45 | 0.0 | 0.0 | Common word, zero weight |
| "said" | 12 | 0.5 | 6.0 | Common in news, modest weight |
| "technology" | 8 | 1.8 | 14.4 | Moderately distinctive |
| "learning" | 6 | 3.2 | 19.2 | More distinctive |
| "neural" | 4 | 4.5 | 18.0 | Highly distinctive |
| "backpropagation" | 2 | 7.1 | 14.2 | Very rare, but only mentioned twice |
Multiplication means BOTH factors must be significant for the product to be large. A term needs to be important locally (high TF) AND globally distinctive (high IDF). Either factor being zero (or near-zero) makes the product small. This AND-like behavior is exactly what we want for identifying characteristic terms.
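To make the multiplication concrete, here is a minimal plain-Python sketch that recomputes the TF-IDF column of the table above:

```python
# TF and IDF values taken from the example table above.
terms = {
    "the":             (45, 0.0),
    "said":            (12, 0.5),
    "technology":      (8,  1.8),
    "learning":        (6,  3.2),
    "neural":          (4,  4.5),
    "backpropagation": (2,  7.1),
}

for term, (tf, idf) in terms.items():
    # The product is large only when BOTH factors are non-trivial.
    print(f"{term:16s} tf={tf:2d}  idf={idf:.1f}  tfidf={tf * idf:.1f}")
```

Notice how "the" is zeroed out entirely and "backpropagation" is held back by its low count, while "learning" and "neural" come out on top.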
The Vector Space Model:
TF-IDF creates a vector representation for each document:
$$\vec{d} = [\text{tfidf}(t_1, d), \text{tfidf}(t_2, d), ..., \text{tfidf}(t_M, d)]$$
where $M$ is the vocabulary size.
Properties of TF-IDF Vectors: they are high-dimensional ($M$ entries, one per vocabulary term), extremely sparse (each document uses only a small fraction of the vocabulary, so most entries are zero), and non-negative under the standard weightings.
Common Operations:
| Operation | Formula | Purpose |
|---|---|---|
| Cosine Similarity | $\frac{\vec{d_1} \cdot \vec{d_2}}{\lVert\vec{d_1}\rVert \, \lVert\vec{d_2}\rVert}$ | Document similarity |
| Euclidean Distance | $\lVert\vec{d_1} - \vec{d_2}\rVert$ | Clustering, nearest neighbors |
| Query Matching | $\vec{q} \cdot \vec{d}$ | Information retrieval |
| Feature Input | $\vec{d}$ | ML classification/regression |
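Here is a minimal NumPy sketch of the first two operations, using two made-up five-dimensional TF-IDF vectors:

```python
import numpy as np

# Two hypothetical TF-IDF vectors over the same vocabulary.
d1 = np.array([0.0, 1.2, 0.0, 3.4, 0.8])
d2 = np.array([0.5, 0.9, 0.0, 2.1, 0.0])

# Cosine similarity: dot product of the L2-normalized vectors.
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

# Euclidean distance: straight-line distance between the vectors.
euc_dist = np.linalg.norm(d1 - d2)

print(f"cosine similarity:  {cos_sim:.3f}")
print(f"euclidean distance: {euc_dist:.3f}")
```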
The information retrieval community developed SMART notation to precisely describe TF-IDF weighting schemes. Understanding SMART notation is essential for reproducing results and comparing approaches.
SMART Format: ddd.qqq, where the first triplet (ddd) describes how document terms are weighted and the second triplet (qqq) how query terms are weighted. Within each triplet, the positions specify:
| Position | Meaning | Options |
|---|---|---|
| 1st | TF component | n, l, a, b, L (see below) |
| 2nd | IDF component | n, t, p (see below) |
| 3rd | Normalization | n, c, u, b (see below) |
TF Component (1st position):
| Code | Name | Formula | Description |
|---|---|---|---|
| n | natural | $f_{t,d}$ | Raw term frequency (count) |
| l | logarithm | $1 + \log(f_{t,d})$ | Log-scaled frequency |
| a | augmented | $0.5 + \frac{0.5 \cdot f_{t,d}}{\max_t f_{t,d}}$ | Normalized by max TF in doc |
| b | boolean | $1$ if $f_{t,d} > 0$, else $0$ | Binary presence |
| L | log average | $\frac{1 + \log(f_{t,d})}{1 + \log(\text{avg}_{t' \in d} f_{t',d})}$ | Log TF normalized by the document's average TF |
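As a quick illustration, the following sketch applies the first four TF options to a toy set of term counts (the `L` variant additionally needs the document's average TF and is omitted here):

```python
import math
from collections import Counter

counts = Counter({"neural": 4, "learning": 6, "the": 45})
max_tf = max(counts.values())  # needed for the augmented scheme

for term, f in counts.items():
    natural   = f
    log_tf    = 1 + math.log(f)
    augmented = 0.5 + 0.5 * f / max_tf
    boolean   = 1.0 if f > 0 else 0.0
    print(f"{term:8s} n={natural:2d}  l={log_tf:.2f}  a={augmented:.2f}  b={boolean}")
```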
IDF Component (2nd position):
| Code | Name | Formula | Description |
|---|---|---|---|
| n | none | $1$ | No IDF weighting |
| t | idf | $\log(N / \text{df}(t))$ | Standard IDF |
| p | prob idf | $\max(0, \log\frac{N - \text{df}(t)}{\text{df}(t)})$ | Probabilistic IDF (BM25-style) |
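The sketch below compares the standard and probabilistic IDF options for a hypothetical corpus of N = 1000 documents; note how the probabilistic variant clips to zero once a term appears in more than half the corpus:

```python
import math

N = 1000  # hypothetical corpus size

for df in [1, 10, 100, 500, 900, 999]:
    standard = math.log(N / df)                     # t: log(N/df)
    prob     = max(0.0, math.log((N - df) / df))    # p: max(0, log((N-df)/df))
    print(f"df={df:4d}  t={standard:5.2f}  p={prob:5.2f}")
```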
Normalization Component (3rd position):
| Code | Name | Formula | Description |
|---|---|---|---|
| n | none | unchanged | No normalization |
| c | cosine | $\frac{w_t}{\sqrt{\sum_{t'} w_{t'}^2}}$ | L2 normalization |
| u | pivoted unique | (formula omitted) | Pivoted length normalization based on the number of unique terms |
| b | byte size | $\frac{w}{\text{CharLength}^{\alpha}}$ | Byte-length normalization |
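A minimal sketch of the cosine (`c`) option, which rescales a weight vector to unit Euclidean length (the weights here are made up):

```python
import numpy as np

w = np.array([2.0, 0.0, 3.0, 6.0])     # unnormalized TF-IDF weights
w_l2 = w / np.sqrt(np.sum(w ** 2))     # divide by the vector's Euclidean length

print(w_l2)                  # ~[0.286, 0.0, 0.429, 0.857]
print(np.linalg.norm(w_l2))  # 1.0 -- every document becomes a unit vector
```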
Common Scheme Combinations:
- ntc.ntc: Natural TF, standard IDF, cosine norm (for both documents and queries)
- ltc.ltc: Log TF, standard IDF, cosine norm (very common)
- lnc.ltc: Log TF, no IDF, cosine norm for documents; log TF, IDF, cosine norm for queries
- atc.atc: Augmented TF, IDF, cosine norm (classic SMART default)
- bnc.bnc: Boolean TF, no IDF, cosine norm (for set-based matching)
Let's examine the most commonly used TF-IDF variants and their properties.
Variant 1: Basic TF-IDF (ntc)
$$\text{tfidf}_{\text{basic}}(t, d) = f_{t,d} \cdot \log\left(\frac{N}{\text{df}(t)}\right)$$
Variant 2: Log-normalized TF-IDF (ltc)
$$\text{tfidf}_{\text{log}}(t, d) = (1 + \log f_{t,d}) \cdot \log\left(\frac{N}{\text{df}(t)}\right)$$
| Variant | SMART | TF Handling | Best For |
|---|---|---|---|
| Basic TF-IDF | ntc | Raw count | Uniform-length documents |
| Log TF-IDF | ltc | 1 + log(count) | General purpose; default choice |
| Augmented TF-IDF | atc | 0.5 + 0.5*(count/max) | Variable-length documents |
| Boolean TF-IDF | btc | Binary presence | Set-based matching; topics |
| Sublinear TF-IDF | ltc | 1 + log(count) | Avoiding keyword stuffing effects |
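To see why the sublinear (log) variant resists keyword stuffing, compare how raw and log-scaled TF grow as a term is repeated:

```python
import math

for f in [1, 2, 5, 10, 50, 200]:
    print(f"count={f:3d}  raw tf={f:3d}  log tf={1 + math.log(f):.2f}")
```

Two hundred repetitions earn only about six times the weight of a single occurrence under the log scheme, rather than two hundred times.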
Scikit-learn's Default:
When you use TfidfVectorizer with default parameters, you get:
$$\text{tfidf}_{\text{sklearn}}(t, d) = f_{t,d} \cdot \left(\log\frac{1 + N}{1 + \text{df}(t)} + 1\right)$$
with L2 normalization applied afterward.
Key differences from textbook TF-IDF: the IDF is smoothed by adding 1 to both numerator and denominator (as if an extra document containing every term existed), 1 is added to the IDF itself so terms appearing in every document are not zeroed out entirely, and L2 normalization is applied to each document vector by default.
To get textbook TF-IDF in sklearn:

```python
TfidfVectorizer(
    norm=None,            # Disable normalization
    smooth_idf=False,     # Use N/df instead of (1+N)/(1+df)
    sublinear_tf=False    # Use raw TF, not 1+log(TF)
)
```
Many practitioners are surprised that sklearn's TfidfVectorizer doesn't compute textbook TF-IDF. The defaults are chosen for practical robustness (avoiding division by zero, numerical stability), not theoretical purity. Always check your library's documentation and be explicit about which variant you're using.
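As a quick check, the sketch below reproduces scikit-learn's default formula by hand and compares it against TfidfVectorizer on three toy documents (made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran fast"]

# Library result with default settings (raw TF, smoothed IDF, L2 norm).
tfidf = TfidfVectorizer()
X_lib = tfidf.fit_transform(docs).toarray()

# Manual reproduction of the same formula.
counts = CountVectorizer(vocabulary=tfidf.vocabulary_).fit_transform(docs).toarray()
N = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + N) / (1 + df)) + 1
X_manual = counts * idf
X_manual = X_manual / np.linalg.norm(X_manual, axis=1, keepdims=True)

print(np.allclose(X_lib, X_manual))  # expected: True
```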
Let's implement a flexible TF-IDF vectorizer that supports multiple weighting schemes.
```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Literal, Optional

import numpy as np
from scipy.sparse import csr_matrix, lil_matrix


@dataclass
class TFIDFConfig:
    """Configuration for TF-IDF computation using SMART-like notation."""
    tf_scheme: Literal["natural", "log", "augmented", "boolean"] = "log"
    idf_scheme: Literal["none", "standard", "smooth", "probabilistic"] = "smooth"
    normalization: Literal["none", "l1", "l2"] = "l2"

    @classmethod
    def from_smart(cls, smart_code: str) -> 'TFIDFConfig':
        """Parse SMART notation (e.g., 'ltc' -> log TF, standard IDF, cosine norm)."""
        if len(smart_code) != 3:
            raise ValueError(f"SMART code must be 3 characters, got: {smart_code}")
        tf_map = {'n': 'natural', 'l': 'log', 'a': 'augmented', 'b': 'boolean'}
        idf_map = {'n': 'none', 't': 'standard', 'p': 'probabilistic'}
        norm_map = {'n': 'none', 'c': 'l2', 'u': 'l2'}  # 'u' simplified to l2
        return cls(
            tf_scheme=tf_map.get(smart_code[0], 'natural'),
            idf_scheme=idf_map.get(smart_code[1], 'standard'),
            normalization=norm_map.get(smart_code[2], 'none')
        )


class TFIDFVectorizer:
    """
    Flexible TF-IDF vectorizer supporting multiple weighting schemes.

    Supports SMART notation and provides detailed control over
    TF, IDF, and normalization components.
    """

    def __init__(
        self,
        config: Optional[TFIDFConfig] = None,
        min_df: int = 1,
        max_df: float = 1.0,
        lowercase: bool = True,
        token_pattern: str = r"(?u)\b\w\w+\b"
    ):
        self.config = config or TFIDFConfig()
        self.min_df = min_df
        self.max_df = max_df
        self.lowercase = lowercase
        self.token_pattern = re.compile(token_pattern)

        self.vocabulary_: Dict[str, int] = {}
        self.idf_values_: np.ndarray = np.array([])
        self.n_documents_: int = 0
        self._fitted = False

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a document."""
        if self.lowercase:
            text = text.lower()
        return self.token_pattern.findall(text)

    def _compute_tf(self, term_counts: Counter, scheme: str) -> Dict[str, float]:
        """Compute term frequency values based on the chosen scheme."""
        if not term_counts:
            return {}

        if scheme == "natural":
            return dict(term_counts)
        elif scheme == "log":
            return {
                term: 1 + np.log(count) if count > 0 else 0
                for term, count in term_counts.items()
            }
        elif scheme == "augmented":
            max_count = max(term_counts.values())
            return {
                term: 0.5 + 0.5 * (count / max_count)
                for term, count in term_counts.items()
            }
        elif scheme == "boolean":
            return {term: 1.0 for term in term_counts}
        else:
            raise ValueError(f"Unknown TF scheme: {scheme}")

    def _compute_idf_vector(
        self,
        document_frequencies: Dict[str, int],
        n_docs: int,
        scheme: str
    ) -> np.ndarray:
        """Compute IDF values for all vocabulary terms."""
        n_vocab = len(self.vocabulary_)
        idf = np.zeros(n_vocab)

        for term, idx in self.vocabulary_.items():
            df = document_frequencies.get(term, 0)

            if scheme == "none":
                idf[idx] = 1.0
            elif scheme == "standard":
                idf[idx] = np.log(n_docs / df) if df > 0 else 0.0
            elif scheme == "smooth":
                # Scikit-learn style: log((1+N)/(1+df)) + 1
                idf[idx] = np.log((1 + n_docs) / (1 + df)) + 1
            elif scheme == "probabilistic":
                # BM25 style: max(0, log((N-df+0.5)/(df+0.5)))
                if df > 0:
                    idf[idx] = max(0, np.log((n_docs - df + 0.5) / (df + 0.5)))
                else:
                    idf[idx] = 0.0
            else:
                raise ValueError(f"Unknown IDF scheme: {scheme}")

        return idf

    def _normalize(self, X: csr_matrix, norm: str) -> csr_matrix:
        """Apply row-wise normalization to the TF-IDF matrix."""
        if norm == "none":
            return X

        # Convert to LIL for efficient row operations
        X_lil = X.tolil()

        for i in range(X.shape[0]):
            row = X_lil[i, :].toarray().flatten()
            if norm == "l1":
                row_norm = np.sum(np.abs(row))
            elif norm == "l2":
                row_norm = np.sqrt(np.sum(row ** 2))
            else:
                raise ValueError(f"Unknown normalization: {norm}")

            if row_norm > 0:
                X_lil[i, :] = row / row_norm

        return X_lil.tocsr()

    def fit(self, documents: List[str]) -> 'TFIDFVectorizer':
        """Fit the vectorizer on a corpus."""
        self.n_documents_ = len(documents)
        n_docs = self.n_documents_

        # Count document frequencies
        df_counter = Counter()
        all_terms = set()

        for doc in documents:
            tokens = self._tokenize(doc)
            unique_tokens = set(tokens)
            all_terms.update(unique_tokens)
            for token in unique_tokens:
                df_counter[token] += 1

        # Apply DF thresholds
        max_doc_count = int(self.max_df * n_docs)
        filtered_terms = [
            term for term in all_terms
            if self.min_df <= df_counter[term] <= max_doc_count
        ]

        # Build vocabulary
        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted(filtered_terms))
        }

        # Compute IDF values
        self.idf_values_ = self._compute_idf_vector(
            df_counter, n_docs, self.config.idf_scheme
        )

        self._fitted = True
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to a TF-IDF matrix."""
        if not self._fitted:
            raise RuntimeError("Must fit before transform")

        n_docs = len(documents)
        n_vocab = len(self.vocabulary_)

        # Build TF matrix
        X = lil_matrix((n_docs, n_vocab), dtype=np.float64)

        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            term_counts = Counter(tokens)
            tf_values = self._compute_tf(term_counts, self.config.tf_scheme)

            for term, tf in tf_values.items():
                if term in self.vocabulary_:
                    term_idx = self.vocabulary_[term]
                    # TF * IDF
                    X[doc_idx, term_idx] = tf * self.idf_values_[term_idx]

        X_csr = X.tocsr()

        # Apply normalization
        return self._normalize(X_csr, self.config.normalization)

    def fit_transform(self, documents: List[str]) -> csr_matrix:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)

    def get_feature_names(self) -> List[str]:
        """Return vocabulary terms in index order."""
        return [term for term, _ in sorted(self.vocabulary_.items(), key=lambda x: x[1])]


# Demonstration
if __name__ == "__main__":
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "A quick brown dog outpaces a lazy fox",
        "Machine learning algorithms process data efficiently",
        "Deep learning neural networks learn patterns from data",
        "Natural language processing uses machine learning",
    ]

    print("=" * 70)
    print("TF-IDF WEIGHTING SCHEME COMPARISON")
    print("=" * 70)

    schemes = [
        ("ntc (Basic)", TFIDFConfig.from_smart("ntc")),
        ("ltc (Log TF)", TFIDFConfig.from_smart("ltc")),
        ("atc (Augmented)", TFIDFConfig.from_smart("atc")),
        ("ltn (No norm)", TFIDFConfig(tf_scheme="log", idf_scheme="standard", normalization="none")),
    ]

    for name, config in schemes:
        vectorizer = TFIDFVectorizer(config=config)
        X = vectorizer.fit_transform(documents)

        print(f"\n{name}:")
        print(f"  Shape: {X.shape}")
        print(f"  Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")
        print(f"  Value range: [{X.min():.4f}, {X.max():.4f}]")

        # Show top terms for first document
        vocab = vectorizer.get_feature_names()
        doc0_weights = X[0].toarray().flatten()
        top_indices = np.argsort(doc0_weights)[-5:][::-1]
        print("  Top terms (doc 0): ", end="")
        print(", ".join(f"{vocab[i]}({doc0_weights[i]:.3f})" for i in top_indices))
```

This implementation prioritizes clarity over performance. For production use, prefer scikit-learn's TfidfVectorizer, which is far better optimized for tokenization and sparse matrix construction. Use this custom implementation for understanding and experimentation.
In information retrieval, documents and queries often use different weighting schemes. This asymmetry reflects their different roles.
Why Asymmetric?
Queries and documents have very different characteristics: queries are short, with a handful of terms that are rarely repeated, while documents are long and repeat terms many times. These differences suggest different optimal weightings for each side.
Common Asymmetric Schemes:
| Scheme | Document | Query | Rationale |
|---|---|---|---|
| lnc.ltc | Log TF, No IDF, Cosine | Log TF, IDF, Cosine | IDF discriminates queries, not docs |
| atn.ntc | Aug TF, IDF, None | Natural TF, IDF, Cosine | Different length handling |
| Lnc.Ltc | LogAvg TF, No IDF, Cosine | LogAvg TF, IDF, Cosine | Advanced normalization |
The lnc.ltc Scheme (Detailed):
This popular scheme uses:
Documents (lnc): $$w_{t,d} = \frac{1 + \log(f_{t,d})}{\sqrt{\sum_{t' \in d}(1 + \log(f_{t',d}))^2}}$$
Queries (ltc): $$w_{t,q} = \frac{(1 + \log(f_{t,q})) \cdot \log(N/\text{df}(t))}{\sqrt{\sum_{t' \in q}((1 + \log(f_{t',q})) \cdot \log(N/\text{df}(t')))^2}}$$
Key Insight: Documents don't use IDF at all! The IDF discrimination happens only at query time. This is computationally efficient (document vectors can be precomputed without knowing the query) and empirically effective.
When IDF is applied at query time only, it acts as a query term weighting rather than document term weighting. This makes intuitive sense: we want to give more weight to rare query terms when matching, but document terms should be weighted by their local importance. The cross-product at retrieval time implicitly combines both perspectives.
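The following sketch scores one toy document against one toy query under lnc.ltc; the corpus size and document frequencies are assumed values for illustration:

```python
import math

N = 1000                                           # hypothetical corpus size
df = {"neural": 50, "network": 80, "the": 990}     # hypothetical document frequencies

doc_tf   = {"neural": 3, "network": 2, "the": 10}  # term counts in the document
query_tf = {"neural": 1, "network": 1}             # term counts in the query

# Document side (lnc): log TF, no IDF, cosine-normalized.
d = {t: 1 + math.log(f) for t, f in doc_tf.items()}
d_norm = math.sqrt(sum(w * w for w in d.values()))
d = {t: w / d_norm for t, w in d.items()}

# Query side (ltc): log TF times IDF, cosine-normalized.
q = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in query_tf.items()}
q_norm = math.sqrt(sum(w * w for w in q.values()))
q = {t: w / q_norm for t, w in q.items()}

# Retrieval score: dot product over shared terms.
score = sum(d[t] * q[t] for t in q if t in d)
print(f"lnc.ltc score: {score:.3f}")
```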
With many variants available, how do you choose? Here's a decision framework based on task requirements.
| Task | Recommended | Avoid |
|---|---|---|
| Text Classification | ltc, sklearn defaults | Raw counts without normalization (ntn) |
| Document Retrieval | lnc.ltc, BM25 | Symmetric schemes |
| Clustering | ltc with L2 norm | Unnormalized variants |
| Feature Engineering | Experiment; ltc is a good default | Boolean TF (discards frequency information) |
| Duplicate Detection | Boolean TF, Jaccard similarity | Heavy IDF weighting |
Scikit-learn's TfidfVectorizer defaults (smooth IDF, raw TF, L2 normalization) work well for most classification and clustering tasks. Start there and experiment only if performance is insufficient. Don't over-optimize the weighting scheme—often the choice of min_df, max_df, and vocabulary size matters more.
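As a hedged starting point for experimentation (the thresholds below are illustrative, not prescriptive):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=2,            # drop terms appearing in fewer than 2 documents
    max_df=0.9,          # drop terms appearing in more than 90% of documents
    sublinear_tf=True,   # 1 + log(tf): roughly the "ltc" family
    norm="l2",           # cosine normalization (the default)
)
```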
The TF-IDF principle—weighting by local frequency and global rarity—applies beyond natural language text.
Genomics: TF-IDF for DNA Sequences — treating k-mers (fixed-length subsequences) as terms and sequences as documents highlights the subsequences that characterize a particular sequence family (see the sketch after this list).
Log Analysis: treating log message templates as terms and time windows or machines as documents surfaces events that are frequent locally but rare overall.
Recommendation Systems: treating items as terms and users as documents downweights universally popular items and emphasizes distinctive tastes.
Code Analysis: treating identifiers and tokens as terms and source files as documents highlights the vocabulary that distinguishes one module from the rest of a codebase.
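As a sketch of the genomics case mentioned above, the snippet below converts made-up DNA sequences into "documents" of overlapping 4-mers and feeds them to a standard TF-IDF vectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical DNA sequences; k-mers play the role of terms.
sequences = ["ATCGATCGTAGC", "GGCATCGATTAC", "TTAGCCGATCGA"]

def to_kmers(seq: str, k: int = 4) -> str:
    """Represent a sequence as a space-separated 'document' of overlapping k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

docs = [to_kmers(s) for s in sequences]
vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"[ACGT]+")
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, number of distinct 4-mers)
```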
TF-IDF embodies a universal principle: features that are frequent in a specific context but rare globally are highly informative. This "local frequency × global rarity" pattern appears throughout machine learning: attention mechanisms, contrastive learning, and anomaly detection all share this intuition.
What's Next:
We've explored TF-IDF computation in depth, but there's one crucial component remaining: normalization. The next page examines why and how we normalize TF-IDF vectors, covering L1 vs. L2 normalization, document length effects, and the profound impact of normalization choices on similarity computations.
You now understand TF-IDF weighting schemes comprehensively—from basic formulas to SMART notation to variant selection. Next, we'll complete the picture with vector normalization, ensuring fair comparisons across documents of varying lengths.