We've now developed deep understanding of both Term Frequency (local importance) and Inverse Document Frequency (global discrimination). The power of TF-IDF lies in their combination—a term matters when it's frequent in a document AND rare across the corpus.
But how exactly should we combine them? It turns out there are many valid weighting schemes, each with different properties. The classic "TF × IDF" is just one option among several, and the choice matters more than many practitioners realize.
This page explores the complete landscape of TF-IDF weighting: the SMART notation for describing schemes, common variants, implementation details, and guidance on selecting the right approach for your task.
By the end of this page, you will understand: (1) How TF and IDF combine into complete weighting schemes, (2) The SMART notation for describing TF-IDF variants, (3) Common weighting schemes and their trade-offs, (4) Complete implementation of multiple TF-IDF variants, and (5) How to choose the right scheme for your application.
At its core, TF-IDF is a product of two factors:
$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
Interpretation: TF captures local importance (how prominent a term is within one document), while IDF captures global rarity (how distinctive the term is across the corpus). The product is large only when both factors are high.
Example:
Consider a document about "deep learning" in a general news corpus:
| Term | TF in Doc | IDF | TF-IDF | Interpretation |
|---|---|---|---|---|
| "the" | 45 | 0.0 | 0.0 | Common word, zero weight |
| "said" | 12 | 0.5 | 6.0 | Common in news, modest weight |
| "technology" | 8 | 1.8 | 14.4 | Moderately distinctive |
| "learning" | 6 | 3.2 | 19.2 | More distinctive |
| "neural" | 4 | 4.5 | 18.0 | Highly distinctive |
| "backpropagation" | 2 | 7.1 | 14.2 | Very rare, but only mentioned twice |
Multiplication means BOTH factors must be significant for the product to be large. A term needs to be important locally (high TF) AND globally distinctive (high IDF). Either factor being zero (or near-zero) makes the product small. This AND-like behavior is exactly what we want for identifying characteristic terms.
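To make the multiplication concrete, here is a minimal plain-Python sketch that recomputes the TF-IDF column of the table above:

```python
# TF and IDF values taken from the example table above.
terms = {
    "the":             (45, 0.0),
    "said":            (12, 0.5),
    "technology":      (8,  1.8),
    "learning":        (6,  3.2),
    "neural":          (4,  4.5),
    "backpropagation": (2,  7.1),
}

for term, (tf, idf) in terms.items():
    # The product is large only when BOTH factors are non-trivial.
    print(f"{term:16s} tf={tf:2d}  idf={idf:.1f}  tfidf={tf * idf:.1f}")
```

Notice how "the" is zeroed out entirely and "backpropagation" is held back by its low count, while "learning" and "neural" come out on top.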
The Vector Space Model:
TF-IDF creates a vector representation for each document:
$$\vec{d} = [\text{tfidf}(t_1, d), \text{tfidf}(t_2, d), ..., \text{tfidf}(t_M, d)]$$
where $M$ is the vocabulary size.
Properties of TF-IDF Vectors: they are high-dimensional ($M$ entries, one per vocabulary term), extremely sparse (each document uses only a small fraction of the vocabulary, so most entries are zero), and non-negative under the standard weightings.
Common Operations:
| Operation | Formula | Purpose |
|---|---|---|
| Cosine Similarity | $\frac{\vec{d_1} \cdot \vec{d_2}}{\lVert\vec{d_1}\rVert \, \lVert\vec{d_2}\rVert}$ | Document similarity |
| Euclidean Distance | $\lVert\vec{d_1} - \vec{d_2}\rVert$ | Clustering, nearest neighbors |
| Query Matching | $\vec{q} \cdot \vec{d}$ | Information retrieval |
| Feature Input | $\vec{d}$ | ML classification/regression |
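Here is a minimal NumPy sketch of the first two operations, using two made-up five-dimensional TF-IDF vectors:

```python
import numpy as np

# Two hypothetical TF-IDF vectors over the same vocabulary.
d1 = np.array([0.0, 1.2, 0.0, 3.4, 0.8])
d2 = np.array([0.5, 0.9, 0.0, 2.1, 0.0])

# Cosine similarity: dot product of the L2-normalized vectors.
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

# Euclidean distance: straight-line distance between the vectors.
euc_dist = np.linalg.norm(d1 - d2)

print(f"cosine similarity:  {cos_sim:.3f}")
print(f"euclidean distance: {euc_dist:.3f}")
```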
The information retrieval community developed SMART notation to precisely describe TF-IDF weighting schemes. Understanding SMART notation is essential for reproducing results and comparing approaches.
SMART Format: ddd.qqq, where the first triplet (ddd) describes how document terms are weighted and the second triplet (qqq) how query terms are weighted. Within each triplet, the positions specify:
| Position | Meaning | Options |
|---|---|---|
| 1st | TF component | n, l, a, b, L (see below) |
| 2nd | IDF component | n, t, p (see below) |
| 3rd | Normalization | n, c, u, b (see below) |
TF Component (1st position):
| Code | Name | Formula | Description |
|---|---|---|---|
| n | natural | $f_{t,d}$ | Raw term frequency (count) |
| l | logarithm | $1 + \log(f_{t,d})$ | Log-scaled frequency |
| a | augmented | $0.5 + \frac{0.5 \cdot f_{t,d}}{\max_t f_{t,d}}$ | Normalized by max TF in doc |
| b | boolean | $1$ if $f_{t,d} > 0$, else $0$ | Binary presence |
| L | log average | $\frac{1 + \log(f_{t,d})}{1 + \log(\text{avg}_{t' \in d} f_{t',d})}$ | Log TF normalized by the document's average TF |
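As a quick illustration, the following sketch applies the first four TF options to a toy set of term counts (the `L` variant additionally needs the document's average TF and is omitted here):

```python
import math
from collections import Counter

counts = Counter({"neural": 4, "learning": 6, "the": 45})
max_tf = max(counts.values())  # needed for the augmented scheme

for term, f in counts.items():
    natural   = f
    log_tf    = 1 + math.log(f)
    augmented = 0.5 + 0.5 * f / max_tf
    boolean   = 1.0 if f > 0 else 0.0
    print(f"{term:8s} n={natural:2d}  l={log_tf:.2f}  a={augmented:.2f}  b={boolean}")
```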
IDF Component (2nd position):
| Code | Name | Formula | Description |
|---|---|---|---|
| n | none | $1$ | No IDF weighting |
| t | idf | $\log(N / \text{df}(t))$ | Standard IDF |
| p | prob idf | $\max(0, \log\frac{N - \text{df}(t)}{\text{df}(t)})$ | Probabilistic IDF (BM25-style) |
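The sketch below compares the standard and probabilistic IDF options for a hypothetical corpus of N = 1000 documents; note how the probabilistic variant clips to zero once a term appears in more than half the corpus:

```python
import math

N = 1000  # hypothetical corpus size

for df in [1, 10, 100, 500, 900, 999]:
    standard = math.log(N / df)                     # t: log(N/df)
    prob     = max(0.0, math.log((N - df) / df))    # p: max(0, log((N-df)/df))
    print(f"df={df:4d}  t={standard:5.2f}  p={prob:5.2f}")
```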
Normalization Component (3rd position):
| Code | Name | Formula | Description |
|---|---|---|---|
| n | none | unchanged | No normalization |
| c | cosine | $\frac{w_t}{\sqrt{\sum_{t'} w_{t'}^2}}$ | L2 normalization |
| u | pivoted unique | (formula omitted) | Pivoted length normalization based on the number of unique terms |
| b | byte size | $\frac{w}{\text{CharLength}^{\alpha}}$ | Byte-length normalization |
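A minimal sketch of the cosine (`c`) option, which rescales a weight vector to unit Euclidean length (the weights here are made up):

```python
import numpy as np

w = np.array([2.0, 0.0, 3.0, 6.0])     # unnormalized TF-IDF weights
w_l2 = w / np.sqrt(np.sum(w ** 2))     # divide by the vector's Euclidean length

print(w_l2)                  # ~[0.286, 0.0, 0.429, 0.857]
print(np.linalg.norm(w_l2))  # 1.0 -- every document becomes a unit vector
```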
Common Scheme Combinations:
- ntc.ntc: Natural TF, standard IDF, cosine norm (for both documents and queries)
- ltc.ltc: Log TF, standard IDF, cosine norm (very common)
- lnc.ltc: Log TF, no IDF, cosine norm for documents; log TF, IDF, cosine norm for queries
- atc.atc: Augmented TF, IDF, cosine norm (classic SMART default)
- bnc.bnc: Boolean TF, no IDF, cosine norm (for set-based matching)
Let's examine the most commonly used TF-IDF variants and their properties.
Variant 1: Basic TF-IDF (ntc)
$$\text{tfidf}_{\text{basic}}(t, d) = f_{t,d} \cdot \log\left(\frac{N}{\text{df}(t)}\right)$$
Variant 2: Log-normalized TF-IDF (ltc)
$$\text{tfidf}_{\text{log}}(t, d) = (1 + \log f_{t,d}) \cdot \log\left(\frac{N}{\text{df}(t)}\right)$$
| Variant | SMART | TF Handling | Best For |
|---|---|---|---|
| Basic TF-IDF | ntc | Raw count | Uniform-length documents |
| Log TF-IDF | ltc | 1 + log(count) | General purpose; default choice |
| Augmented TF-IDF | atc | 0.5 + 0.5*(count/max) | Variable-length documents |
| Boolean TF-IDF | btc | Binary presence | Set-based matching; topics |
| Sublinear TF-IDF | ltc | 1 + log(count) | Avoiding keyword stuffing effects |
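To see why the sublinear (log) variant resists keyword stuffing, compare how raw and log-scaled TF grow as a term is repeated:

```python
import math

for f in [1, 2, 5, 10, 50, 200]:
    print(f"count={f:3d}  raw tf={f:3d}  log tf={1 + math.log(f):.2f}")
```

Two hundred repetitions earn only about six times the weight of a single occurrence under the log scheme, rather than two hundred times.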
Scikit-learn's Default:
When you use TfidfVectorizer with default parameters, you get:
$$\text{tfidf}_{\text{sklearn}}(t, d) = f_{t,d} \cdot \left(\log\frac{1 + N}{1 + \text{df}(t)} + 1\right)$$
with L2 normalization applied afterward.
Key differences from textbook TF-IDF: the IDF is smoothed by adding 1 to both numerator and denominator (as if an extra document containing every term existed), 1 is added to the IDF itself so terms appearing in every document are not zeroed out entirely, and L2 normalization is applied to each document vector by default.
To get textbook TF-IDF in sklearn:

```python
TfidfVectorizer(
    norm=None,            # Disable normalization
    smooth_idf=False,     # Use N/df instead of (1+N)/(1+df)
    sublinear_tf=False    # Use raw TF, not 1+log(TF)
)
```
Many practitioners are surprised that sklearn's TfidfVectorizer doesn't compute textbook TF-IDF. The defaults are chosen for practical robustness (avoiding division by zero, numerical stability), not theoretical purity. Always check your library's documentation and be explicit about which variant you're using.
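As a quick check, the sketch below reproduces scikit-learn's default formula by hand and compares it against TfidfVectorizer on three toy documents (made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran fast"]

# Library result with default settings (raw TF, smoothed IDF, L2 norm).
tfidf = TfidfVectorizer()
X_lib = tfidf.fit_transform(docs).toarray()

# Manual reproduction of the same formula.
counts = CountVectorizer(vocabulary=tfidf.vocabulary_).fit_transform(docs).toarray()
N = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + N) / (1 + df)) + 1
X_manual = counts * idf
X_manual = X_manual / np.linalg.norm(X_manual, axis=1, keepdims=True)

print(np.allclose(X_lib, X_manual))  # expected: True
```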
Let's implement a flexible TF-IDF vectorizer that supports multiple weighting schemes.
```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Literal, Optional

import numpy as np
from scipy.sparse import csr_matrix, lil_matrix


@dataclass
class TFIDFConfig:
    """Configuration for TF-IDF computation using SMART-like notation."""
    tf_scheme: Literal["natural", "log", "augmented", "boolean"] = "log"
    idf_scheme: Literal["none", "standard", "smooth", "probabilistic"] = "smooth"
    normalization: Literal["none", "l1", "l2"] = "l2"

    @classmethod
    def from_smart(cls, smart_code: str) -> 'TFIDFConfig':
        """Parse SMART notation (e.g., 'ltc' -> log TF, standard IDF, cosine norm)."""
        if len(smart_code) != 3:
            raise ValueError(f"SMART code must be 3 characters, got: {smart_code}")
        tf_map = {'n': 'natural', 'l': 'log', 'a': 'augmented', 'b': 'boolean'}
        idf_map = {'n': 'none', 't': 'standard', 'p': 'probabilistic'}
        norm_map = {'n': 'none', 'c': 'l2', 'u': 'l2'}  # 'u' simplified to l2
        return cls(
            tf_scheme=tf_map.get(smart_code[0], 'natural'),
            idf_scheme=idf_map.get(smart_code[1], 'standard'),
            normalization=norm_map.get(smart_code[2], 'none')
        )


class TFIDFVectorizer:
    """
    Flexible TF-IDF vectorizer supporting multiple weighting schemes.

    Supports SMART notation and provides detailed control over
    TF, IDF, and normalization components.
    """

    def __init__(
        self,
        config: Optional[TFIDFConfig] = None,
        min_df: int = 1,
        max_df: float = 1.0,
        lowercase: bool = True,
        token_pattern: str = r"(?u)\b\w\w+\b"
    ):
        self.config = config or TFIDFConfig()
        self.min_df = min_df
        self.max_df = max_df
        self.lowercase = lowercase
        self.token_pattern = re.compile(token_pattern)

        self.vocabulary_: Dict[str, int] = {}
        self.idf_values_: np.ndarray = np.array([])
        self.n_documents_: int = 0
        self._fitted = False

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a document."""
        if self.lowercase:
            text = text.lower()
        return self.token_pattern.findall(text)

    def _compute_tf(self, term_counts: Counter, scheme: str) -> Dict[str, float]:
        """Compute term frequency values based on the chosen scheme."""
        if not term_counts:
            return {}

        if scheme == "natural":
            return dict(term_counts)
        elif scheme == "log":
            return {
                term: 1 + np.log(count) if count > 0 else 0
                for term, count in term_counts.items()
            }
        elif scheme == "augmented":
            max_count = max(term_counts.values())
            return {
                term: 0.5 + 0.5 * (count / max_count)
                for term, count in term_counts.items()
            }
        elif scheme == "boolean":
            return {term: 1.0 for term in term_counts}
        else:
            raise ValueError(f"Unknown TF scheme: {scheme}")

    def _compute_idf_vector(
        self,
        document_frequencies: Dict[str, int],
        n_docs: int,
        scheme: str
    ) -> np.ndarray:
        """Compute IDF values for all vocabulary terms."""
        n_vocab = len(self.vocabulary_)
        idf = np.zeros(n_vocab)

        for term, idx in self.vocabulary_.items():
            df = document_frequencies.get(term, 0)

            if scheme == "none":
                idf[idx] = 1.0
            elif scheme == "standard":
                idf[idx] = np.log(n_docs / df) if df > 0 else 0.0
            elif scheme == "smooth":
                # Scikit-learn style: log((1+N)/(1+df)) + 1
                idf[idx] = np.log((1 + n_docs) / (1 + df)) + 1
            elif scheme == "probabilistic":
                # BM25 style: max(0, log((N-df+0.5)/(df+0.5)))
                if df > 0:
                    idf[idx] = max(0, np.log((n_docs - df + 0.5) / (df + 0.5)))
                else:
                    idf[idx] = 0.0
            else:
                raise ValueError(f"Unknown IDF scheme: {scheme}")

        return idf

    def _normalize(self, X: csr_matrix, norm: str) -> csr_matrix:
        """Apply row-wise normalization to the TF-IDF matrix."""
        if norm == "none":
            return X

        # Convert to LIL for efficient row operations
        X_lil = X.tolil()

        for i in range(X.shape[0]):
            row = X_lil[i, :].toarray().flatten()
            if norm == "l1":
                row_norm = np.sum(np.abs(row))
            elif norm == "l2":
                row_norm = np.sqrt(np.sum(row ** 2))
            else:
                raise ValueError(f"Unknown normalization: {norm}")

            if row_norm > 0:
                X_lil[i, :] = row / row_norm

        return X_lil.tocsr()

    def fit(self, documents: List[str]) -> 'TFIDFVectorizer':
        """Fit the vectorizer on a corpus."""
        self.n_documents_ = len(documents)
        n_docs = self.n_documents_

        # Count document frequencies
        df_counter = Counter()
        all_terms = set()

        for doc in documents:
            tokens = self._tokenize(doc)
            unique_tokens = set(tokens)
            all_terms.update(unique_tokens)
            for token in unique_tokens:
                df_counter[token] += 1

        # Apply DF thresholds
        max_doc_count = int(self.max_df * n_docs)
        filtered_terms = [
            term for term in all_terms
            if self.min_df <= df_counter[term] <= max_doc_count
        ]

        # Build vocabulary
        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted(filtered_terms))
        }

        # Compute IDF values
        self.idf_values_ = self._compute_idf_vector(
            df_counter, n_docs, self.config.idf_scheme
        )

        self._fitted = True
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to a TF-IDF matrix."""
        if not self._fitted:
            raise RuntimeError("Must fit before transform")

        n_docs = len(documents)
        n_vocab = len(self.vocabulary_)

        # Build TF matrix
        X = lil_matrix((n_docs, n_vocab), dtype=np.float64)

        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            term_counts = Counter(tokens)
            tf_values = self._compute_tf(term_counts, self.config.tf_scheme)

            for term, tf in tf_values.items():
                if term in self.vocabulary_:
                    term_idx = self.vocabulary_[term]
                    # TF * IDF
                    X[doc_idx, term_idx] = tf * self.idf_values_[term_idx]

        X_csr = X.tocsr()

        # Apply normalization
        return self._normalize(X_csr, self.config.normalization)

    def fit_transform(self, documents: List[str]) -> csr_matrix:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)

    def get_feature_names(self) -> List[str]:
        """Return vocabulary terms in index order."""
        return [term for term, _ in sorted(self.vocabulary_.items(), key=lambda x: x[1])]


# Demonstration
if __name__ == "__main__":
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "A quick brown dog outpaces a lazy fox",
        "Machine learning algorithms process data efficiently",
        "Deep learning neural networks learn patterns from data",
        "Natural language processing uses machine learning",
    ]

    print("=" * 70)
    print("TF-IDF WEIGHTING SCHEME COMPARISON")
    print("=" * 70)

    schemes = [
        ("ntc (Basic)", TFIDFConfig.from_smart("ntc")),
        ("ltc (Log TF)", TFIDFConfig.from_smart("ltc")),
        ("atc (Augmented)", TFIDFConfig.from_smart("atc")),
        ("ltn (No norm)", TFIDFConfig(tf_scheme="log", idf_scheme="standard", normalization="none")),
    ]

    for name, config in schemes:
        vectorizer = TFIDFVectorizer(config=config)
        X = vectorizer.fit_transform(documents)

        print(f"\n{name}:")
        print(f"  Shape: {X.shape}")
        print(f"  Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")
        print(f"  Value range: [{X.min():.4f}, {X.max():.4f}]")

        # Show top terms for first document
        vocab = vectorizer.get_feature_names()
        doc0_weights = X[0].toarray().flatten()
        top_indices = np.argsort(doc0_weights)[-5:][::-1]
        print("  Top terms (doc 0): ", end="")
        print(", ".join(f"{vocab[i]}({doc0_weights[i]:.3f})" for i in top_indices))
```

This implementation prioritizes clarity over performance. For production use, prefer scikit-learn's TfidfVectorizer, which is far better optimized for tokenization and sparse matrix construction. Use this custom implementation for understanding and experimentation.
In information retrieval, documents and queries often use different weighting schemes. This asymmetry reflects their different roles.
Why Asymmetric?
Queries and documents have very different characteristics: queries are short, with a handful of terms that are rarely repeated, while documents are long and repeat terms many times. These differences suggest different optimal weightings for each side.
Common Asymmetric Schemes:
| Scheme | Document | Query | Rationale |
|---|---|---|---|
| lnc.ltc | Log TF, No IDF, Cosine | Log TF, IDF, Cosine | IDF discriminates queries, not docs |
| atn.ntc | Aug TF, IDF, None | Natural TF, IDF, Cosine | Different length handling |
| Lnc.Ltc | LogAvg TF, No IDF, Cosine | LogAvg TF, IDF, Cosine | Advanced normalization |
The lnc.ltc Scheme (Detailed):
This popular scheme uses:
Documents (lnc): $$w_{t,d} = \frac{1 + \log(f_{t,d})}{\sqrt{\sum_{t' \in d}(1 + \log(f_{t',d}))^2}}$$
Queries (ltc): $$w_{t,q} = \frac{(1 + \log(f_{t,q})) \cdot \log(N/\text{df}(t))}{\sqrt{\sum_{t' \in q}((1 + \log(f_{t',q})) \cdot \log(N/\text{df}(t')))^2}}$$
Key Insight: Documents don't use IDF at all! The IDF discrimination happens only at query time. This is computationally efficient (document vectors can be precomputed without knowing the query) and empirically effective.
When IDF is applied at query time only, it acts as a query term weighting rather than document term weighting. This makes intuitive sense: we want to give more weight to rare query terms when matching, but document terms should be weighted by their local importance. The cross-product at retrieval time implicitly combines both perspectives.
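The following sketch scores one toy document against one toy query under lnc.ltc; the corpus size and document frequencies are assumed values for illustration:

```python
import math

N = 1000                                           # hypothetical corpus size
df = {"neural": 50, "network": 80, "the": 990}     # hypothetical document frequencies

doc_tf   = {"neural": 3, "network": 2, "the": 10}  # term counts in the document
query_tf = {"neural": 1, "network": 1}             # term counts in the query

# Document side (lnc): log TF, no IDF, cosine-normalized.
d = {t: 1 + math.log(f) for t, f in doc_tf.items()}
d_norm = math.sqrt(sum(w * w for w in d.values()))
d = {t: w / d_norm for t, w in d.items()}

# Query side (ltc): log TF times IDF, cosine-normalized.
q = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in query_tf.items()}
q_norm = math.sqrt(sum(w * w for w in q.values()))
q = {t: w / q_norm for t, w in q.items()}

# Retrieval score: dot product over shared terms.
score = sum(d[t] * q[t] for t in q if t in d)
print(f"lnc.ltc score: {score:.3f}")
```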
With many variants available, how do you choose? Here's a decision framework based on task requirements.
| Task | Recommended | Avoid |
|---|---|---|
| Text Classification | ltc, sklearn defaults | Raw counts without normalization (ntn) |
| Document Retrieval | lnc.ltc, BM25 | Symmetric schemes |
| Clustering | ltc with L2 norm | Unnormalized variants |
| Feature Engineering | Experiment; ltc is a good default | Boolean TF (discards frequency information) |
| Duplicate Detection | Boolean TF, Jaccard similarity | Heavy IDF weighting |
Scikit-learn's TfidfVectorizer defaults (smooth IDF, raw TF, L2 normalization) work well for most classification and clustering tasks. Start there and experiment only if performance is insufficient. Don't over-optimize the weighting scheme—often the choice of min_df, max_df, and vocabulary size matters more.
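As a hedged starting point for experimentation (the thresholds below are illustrative, not prescriptive):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=2,            # drop terms appearing in fewer than 2 documents
    max_df=0.9,          # drop terms appearing in more than 90% of documents
    sublinear_tf=True,   # 1 + log(tf): roughly the "ltc" family
    norm="l2",           # cosine normalization (the default)
)
```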
The TF-IDF principle—weighting by local frequency and global rarity—applies beyond natural language text.
Genomics: TF-IDF for DNA Sequences — treating k-mers (fixed-length subsequences) as terms and sequences as documents highlights the subsequences that characterize a particular sequence family (see the sketch after this list).
Log Analysis: treating log message templates as terms and time windows or machines as documents surfaces events that are frequent locally but rare overall.
Recommendation Systems: treating items as terms and users as documents downweights universally popular items and emphasizes distinctive tastes.
Code Analysis: treating identifiers and tokens as terms and source files as documents highlights the vocabulary that distinguishes one module from the rest of a codebase.
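As a sketch of the genomics case mentioned above, the snippet below converts made-up DNA sequences into "documents" of overlapping 4-mers and feeds them to a standard TF-IDF vectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical DNA sequences; k-mers play the role of terms.
sequences = ["ATCGATCGTAGC", "GGCATCGATTAC", "TTAGCCGATCGA"]

def to_kmers(seq: str, k: int = 4) -> str:
    """Represent a sequence as a space-separated 'document' of overlapping k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

docs = [to_kmers(s) for s in sequences]
vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"[ACGT]+")
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, number of distinct 4-mers)
```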
TF-IDF embodies a universal principle: features that are frequent in a specific context but rare globally are highly informative. This "local frequency × global rarity" pattern appears throughout machine learning: attention mechanisms, contrastive learning, and anomaly detection all share this intuition.
What's Next:
We've explored TF-IDF computation in depth, but there's one crucial component remaining: normalization. The next page examines why and how we normalize TF-IDF vectors, covering L1 vs. L2 normalization, document length effects, and the profound impact of normalization choices on similarity computations.
You now understand TF-IDF weighting schemes comprehensively—from basic formulas to SMART notation to variant selection. Next, we'll complete the picture with vector normalization, ensuring fair comparisons across documents of varying lengths.