You've mastered Term Frequency, Inverse Document Frequency, and their combination into TF-IDF weights. But there's one more critical component that profoundly affects how TF-IDF vectors behave: normalization.
Without normalization, a 10,000-word document will have dramatically larger TF-IDF magnitude than a 100-word document, even if both discuss the same topic with equal focus. This creates unfair comparisons—longer documents appear more "similar" to everything simply because they have more terms.
Normalization addresses this by scaling vectors to comparable magnitudes. But the choice of normalization (L1, L2, or none) has deep implications for similarity metrics, clustering behavior, and machine learning performance.
This page provides a complete treatment of normalization: why it matters, how different norms work, their geometric interpretations, and practical guidelines for choosing the right approach.
By the end of this page, you will understand: (1) Why normalization is essential for fair document comparison, (2) Mathematical properties of L1 and L2 norms, (3) Geometric interpretations and their implications, (4) The relationship between normalization and similarity metrics, (5) Document length effects and pivoted normalization, and (6) Practical implementation and selection guidance.
Before examining solutions, let's fully understand the problem normalization solves.
The Magnitude Disparity:
Consider two documents about "machine learning":

- Document A: 5,000 words, with "learning" appearing 50 times (1% of terms)
- Document B: 500 words, with "learning" appearing 5 times (1% of terms)

Both have the same proportion of "learning" (1%), but:
| Metric | Document A | Document B | Ratio |
|---|---|---|---|
| Raw TF(learning) | 50 | 5 | 10:1 |
| TF-IDF(learning) | 50 × idf | 5 × idf | 10:1 |
| Vector magnitude | ~10× larger than B's | baseline | ~10:1 |
Why This Matters: The Core Insight
We want similarity to measure topic overlap, not document length overlap. A 100-word document entirely about machine learning should be as similar to a machine learning query as a 10,000-word document entirely about machine learning.
Normalization makes this possible by removing the magnitude dimension—after normalization, documents are compared by their direction in term space, not their length.
Think of TF-IDF vectors as arrows in a high-dimensional space. Without normalization, arrows have different lengths. Normalization scales all arrows to the same length (typically 1), so we compare only their DIRECTIONS. Two arrows pointing in similar directions are similar, regardless of their original lengths.
L2 normalization, also called Euclidean or cosine normalization, is the most common choice for TF-IDF vectors.
Definition:
For a vector $\vec{v} = [v_1, v_2, ..., v_n]$, the L2 norm is:
$$|\vec{v}|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$
The L2-normalized vector is:
$$\hat{v}_i = \frac{v_i}{|\vec{v}|_2}$$
Properties of L2-normalized vectors: every normalized vector has Euclidean length exactly 1, and the dot product of two L2-normalized vectors equals their cosine similarity.

Numerical Example:
| Term | TF-IDF | Squared | L2-Normalized |
|---|---|---|---|
| machine | 3.2 | 10.24 | 0.499 |
| learning | 4.1 | 16.81 | 0.639 |
| algorithm | 2.8 | 7.84 | 0.437 |
| data | 2.5 | 6.25 | 0.390 |
| Sum / Norm | — | 41.14 | ‖v‖₂ = √41.14 ≈ 6.41 |
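A quick check of the arithmetic in this table, in plain NumPy (the term names are just labels):

```python
import numpy as np

# TF-IDF weights for: machine, learning, algorithm, data
v = np.array([3.2, 4.1, 2.8, 2.5])

l2_norm = np.sqrt((v ** 2).sum())        # sqrt(10.24 + 16.81 + 7.84 + 6.25)
v_hat = v / l2_norm

print(f"L2 norm: {l2_norm:.2f}")         # 6.41
print("Normalized:", np.round(v_hat, 3))
print(f"Length after normalization: {np.linalg.norm(v_hat):.4f}")  # 1.0000
```

Note that the normalized vector's own L2 norm is exactly 1, which is the defining property.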
Geometric Interpretation:
L2 normalization projects all vectors onto the unit hypersphere. In 2D, this is the unit circle; in 3D, the unit sphere; in high-dimensional TF-IDF space, a hypersphere of radius 1.
Why L2 is so common: cosine similarity, the standard similarity metric for text, is defined as
$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|_2 |\vec{b}|_2}$$
For L2-normalized vectors, $|\hat{a}|_2 = |\hat{b}|_2 = 1$, so:
$$\cos(\theta) = \hat{a} \cdot \hat{b}$$
This is computationally efficient—just compute the dot product!
A related identity connects Euclidean distance to cosine:

$$|\hat{a} - \hat{b}|_2^2 = 2(1 - \cos(\theta))$$
For L2-normalized vectors, Euclidean distance and cosine similarity are monotonically related. Minimizing distance = maximizing similarity.
After L2 normalization, cosine similarity is just the dot product. This enables massive speedups: dot products are highly optimized in BLAS libraries, GPUs, and specialized hardware. Precompute normalized vectors once, then compute millions of similarities with fast matrix multiplication.
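A minimal sketch of this trick: normalize every row once up front, then all-pairs cosine similarities fall out of a single matrix multiplication. Random vectors stand in for real TF-IDF rows here:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((4, 6))            # 4 "documents", 6 "terms"

# Normalize each row once, up front
normed = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# All-pairs cosine similarity: a single matrix multiplication
sims = normed @ normed.T

# Cross-check one pair against the explicit cosine formula
a, b = docs[0], docs[1]
explicit = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(sims[0, 1], explicit)
print(np.round(sims, 3))
```

The diagonal of `sims` is all 1s (every document is perfectly similar to itself), a handy sanity check in practice.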
L1 normalization, while less common for TF-IDF, has distinct properties that make it valuable in certain contexts.
Definition:
For a vector $\vec{v} = [v_1, v_2, ..., v_n]$, the L1 norm is:
$$|\vec{v}|_1 = \sum_{i=1}^{n} |v_i|$$
The L1-normalized vector is:
$$\hat{v}_i = \frac{v_i}{|\vec{v}|_1}$$
Properties of L1-normalized vectors, compared with L2:
| Aspect | L1 Normalization | L2 Normalization |
|---|---|---|
| Formula | $v_i / \sum_j \vert v_j \vert$ | $v_i / \sqrt{\sum_j v_j^2}$ |
| Unit | Sum = 1 | Euclidean length = 1 |
| Interpretation | Probability distribution | Direction in space |
| High values | Equally penalized | More penalized (squared) |
| Similarity metric | Manhattan distance, KL divergence | Cosine similarity, Euclidean |
| Common use | Topic models, probability contexts | Most TF-IDF applications |
When to Use L1:
Probabilistic interpretations: When you want TF-IDF values to represent "probability of term given document"
KL divergence comparisons: KL divergence requires probability distributions (sum to 1)
Robustness to outliers: L1 is less sensitive to extreme values than L2 (no squaring)
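As a sketch of the KL-divergence use case: `scipy.stats.entropy(p, q)` computes KL(p ‖ q) when given two distributions. The vectors and the smoothing constant below are illustrative choices, not from the text:

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) gives KL(p || q)

# Two documents' TF-IDF vectors over the same vocabulary, with a tiny
# smoothing constant (KL is undefined where q = 0 but p > 0)
p = np.array([3.2, 4.1, 2.8, 2.5]) + 1e-6
q = np.array([1.0, 5.0, 3.0, 2.0]) + 1e-6

# L1-normalize so each vector sums to 1 (a valid probability distribution)
p = p / p.sum()
q = q / q.sum()

kl = entropy(p, q)                 # in nats
print(f"KL(p || q) = {kl:.4f}")
assert kl >= 0                     # KL divergence is always non-negative
```

Note that KL divergence is asymmetric: `entropy(p, q)` and `entropy(q, p)` generally differ.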
Numerical Example:
| Term | TF-IDF | L1-Normalized | L2-Normalized |
|---|---|---|---|
| machine | 3.2 | 0.254 | 0.499 |
| learning | 4.1 | 0.325 | 0.639 |
| algorithm | 2.8 | 0.222 | 0.437 |
| data | 2.5 | 0.198 | 0.390 |
| Sum | 12.6 | 1.000 | 1.965 |
| Norm | — | 12.6 | 6.41 |
Notice how L1 values sum to 1 (probability-like), while L2 values are individually larger but have Euclidean length 1.
L2 squares values before summing, making it more sensitive to high values. If one term has TF-IDF = 10 and others have TF-IDF = 1, L2 normalization will be dominated by the large term (100 vs 1 in squared space). L1 treats all values equally by magnitude. This can matter when one term dominates a document.
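A short illustration of this dominance effect, assuming one term with TF-IDF 10 among four terms with TF-IDF 1:

```python
import numpy as np

v = np.array([10.0, 1.0, 1.0, 1.0, 1.0])   # one dominant term

l1 = v / np.abs(v).sum()            # shares proportional to magnitude
l2 = v / np.sqrt((v ** 2).sum())    # squaring amplifies the large term

print(f"L1 share of dominant term: {l1[0]:.3f}")   # 10/14 = 0.714
print(f"L2 weight of dominant term: {l2[0]:.3f}")  # 10/sqrt(104) = 0.981
```

Under L2 the normalized vector points almost entirely along the dominant term; under L1 the smaller terms retain a larger combined share.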
Normalization isn't always appropriate. There are valid cases where unnormalized TF-IDF is preferred.
Case 1: Document Length Is Informative
If longer documents genuinely contain more information (not just more filler), normalization discards this signal.
Example: In academic paper retrieval, a comprehensive survey paper covering a topic extensively may be more valuable than a brief note. The longer paper's larger TF-IDF magnitude reflects its comprehensiveness.
Case 2: Subsequent Normalization by Downstream Model
Many ML models (neural networks with batch normalization, SVMs with certain kernels) effectively normalize inputs internally. Pre-normalizing may be redundant or even harmful.
Case 3: Retrieval with Length-Based Ranking Factors
Search engines often incorporate document length as an explicit ranking factor. Normalizing TF-IDF and then re-incorporating length can be less effective than using unnormalized TF-IDF with separate length features.
Case 4: Sparse Linear Models
For interpretable linear models (logistic regression, linear SVM), unnormalized TF-IDF coefficients have clearer interpretations: "each additional occurrence of term X increases log-odds by β."
When skipping normalization, ensure your similarity metric or model doesn't implicitly expect normalized inputs. Cosine similarity on unnormalized vectors works (it normalizes internally), but Euclidean distance on unnormalized TF-IDF will be heavily biased toward long documents.
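A small demonstration of that caveat, using a toy query and two topically identical documents that differ only in length (the vectors are made up for illustration):

```python
import numpy as np

query     = np.array([5.0, 2.0, 1.0])
short_doc = np.array([5.0, 2.0, 1.0])
long_doc  = short_doc * 10          # same topic mix, 10x the magnitude

def cosine(a, b):
    # Cosine similarity normalizes internally, so raw vectors are fine
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(query, short_doc))     # 1.0 (identical direction)
print(cosine(query, long_doc))      # 1.0 (length is ignored)

# Euclidean distance on the raw vectors penalizes the long document heavily
print(np.linalg.norm(query - short_doc))   # 0.0
print(np.linalg.norm(query - long_doc))    # ~49.3
```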
| Scenario | Recommendation | Rationale |
|---|---|---|
| Document similarity/clustering | L2 normalize | Fair comparison regardless of length |
| Text classification | Usually L2 normalize | Prevents length-based classification |
| Topic modeling input | L1 or no normalization | Probabilistic interpretation needed |
| Search ranking | Often no normalization | Length can indicate relevance |
| Neural network input | Experiment | Internal normalization may suffice |
Beyond L1/L2 normalization, there are specialized techniques specifically addressing document length effects.
Pivoted Length Normalization:
Formalized by Singhal, Buckley, and Mitra as pivoted document length normalization, and used in a similar form inside BM25, this technique adjusts for the observation that the "optimal" normalization strength depends on document length.
$$\text{norm}_{\text{pivot}}(d) = (1 - b) + b \cdot \frac{|d|}{\text{avgdl}}$$
where:

- $b$ controls the normalization strength ($0 \le b \le 1$)
- $|d|$ is the length of document $d$ in tokens
- $\text{avgdl}$ is the average document length across the collection
The "Pivot":
The pivot point is at average document length—weights "pivot" around this point.
BM25's Length Normalization:
The famous BM25 scoring function incorporates this directly:
$$\text{BM25}(t, d) = \text{idf}(t) \cdot \frac{f_{t,d} \cdot (k_1 + 1)}{f_{t,d} + k_1 \cdot (1 - b + b \cdot |d|/\text{avgdl})}$$
Parameter effects:
| Parameter | Effect | Typical Value |
|---|---|---|
| $k_1$ | TF saturation rate | 1.2 - 2.0 |
| $b$ | Length normalization strength | 0.75 |
When $b = 0$: no length normalization.
When $b = 1$: full length normalization.
The default $b = 0.75$ provides partial length normalization, acknowledging that longer documents may have more relevant content while still penalizing excessive length.
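To see the tunability concretely, here is BM25's TF component (with the IDF factor omitted) evaluated at a few values of $b$, for two documents with the same raw term frequency. The lengths and avgdl = 300 are hypothetical:

```python
def bm25_tf(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25's TF component; the IDF factor is omitted for clarity."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# Same raw TF (5); one short and one long document; avgdl = 300
for b in (0.0, 0.75, 1.0):
    short = bm25_tf(tf=5, doc_len=100, avgdl=300, b=b)
    long_ = bm25_tf(tf=5, doc_len=900, avgdl=300, b=b)
    # At b=0 both scores are identical; larger b widens the gap
    print(f"b={b}: short={short:.3f}  long={long_:.3f}")
```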
Why Not Just Use L2?
L2 normalization is "all or nothing"—it completely removes length effects. Pivoted normalization is tunable: you can choose how much length should matter. This flexibility often yields better retrieval performance.
BM25 isn't strictly TF-IDF, but it addresses TF-IDF's weaknesses through: (1) TF saturation (sublinear, bounded growth), (2) Tunable length normalization, and (3) Probabilistic IDF. For retrieval and ranking tasks, BM25 consistently outperforms basic TF-IDF. Consider it as the "evolved" form of TF-IDF.
Let's implement various normalization approaches with attention to efficiency and edge cases.
```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from typing import Literal, Union, Optional
from dataclasses import dataclass


@dataclass
class NormalizationStats:
    """Statistics from a normalization operation."""
    n_documents: int
    n_zero_vectors: int
    mean_original_norm: float
    std_original_norm: float
    min_original_norm: float
    max_original_norm: float


def normalize_vectors(
    X: Union[np.ndarray, csr_matrix],
    norm: Literal["l1", "l2", "max", "none"] = "l2",
    copy: bool = True,
    return_stats: bool = False,
) -> Union[np.ndarray, csr_matrix, tuple]:
    """
    Normalize TF-IDF vectors with various norms.

    Parameters
    ----------
    X : array-like or sparse matrix
        Input vectors (rows are documents).
    norm : str
        Normalization type: 'l1', 'l2', 'max', or 'none'.
    copy : bool
        Whether to copy the input before modifying.
    return_stats : bool
        Whether to also return normalization statistics.

    Returns
    -------
    X_normalized : same type as input
    stats : NormalizationStats (only if return_stats=True)
    """
    if norm == "none":
        X_out = X.copy() if copy else X
        if return_stats:
            # Stats are trivial when no normalization is applied
            return X_out, NormalizationStats(X.shape[0], 0, 0.0, 0.0, 0.0, 0.0)
        return X_out

    if copy:
        X = X.copy()

    is_sparse = issparse(X)

    # Compute norms for each row
    X_dense_rows = X.toarray() if is_sparse else X

    if norm == "l1":
        norms = np.abs(X_dense_rows).sum(axis=1)
    elif norm == "l2":
        norms = np.sqrt((X_dense_rows ** 2).sum(axis=1))
    elif norm == "max":
        norms = np.abs(X_dense_rows).max(axis=1)
    else:
        raise ValueError(f"Unknown norm: {norm}")

    # Collect statistics before normalization
    n_zero = int(np.sum(norms == 0))
    nonzero = norms[norms > 0]
    stats = NormalizationStats(
        n_documents=X.shape[0],
        n_zero_vectors=n_zero,
        mean_original_norm=float(np.mean(nonzero)) if len(nonzero) else 0.0,
        std_original_norm=float(np.std(nonzero)) if len(nonzero) else 0.0,
        min_original_norm=float(np.min(nonzero)) if len(nonzero) else 0.0,
        max_original_norm=float(np.max(nonzero)) if len(nonzero) else 0.0,
    )

    # Avoid division by zero for empty documents
    norms = norms.astype(float)
    norms[norms == 0] = 1.0

    if is_sparse:
        # Efficient sparse normalization: scale the CSR data array in place
        X = X.tocsr()
        for i in range(X.shape[0]):
            start, end = X.indptr[i], X.indptr[i + 1]
            X.data[start:end] /= norms[i]
    else:
        X = X / norms[:, np.newaxis]

    if return_stats:
        return X, stats
    return X


def pivoted_length_normalization(
    X: Union[np.ndarray, csr_matrix],
    doc_lengths: np.ndarray,
    avg_doc_length: Optional[float] = None,
    b: float = 0.75,
) -> Union[np.ndarray, csr_matrix]:
    """
    Apply pivoted document length normalization.

    Parameters
    ----------
    X : TF-IDF vectors (not yet normalized)
    doc_lengths : length of each document (number of tokens)
    avg_doc_length : average document length; computed from doc_lengths if None
    b : normalization strength (0 = none, 1 = full)
    """
    if avg_doc_length is None:
        avg_doc_length = float(np.mean(doc_lengths))

    # Pivoted normalization factor: 1 at average length,
    # < 1 for shorter documents, > 1 for longer ones
    norm_factors = (1 - b) + b * (doc_lengths / avg_doc_length)

    X = X.copy()
    if issparse(X):
        X = X.tocsr()
        for i in range(X.shape[0]):
            start, end = X.indptr[i], X.indptr[i + 1]
            X.data[start:end] /= norm_factors[i]
    else:
        X = X / norm_factors[:, np.newaxis]
    return X


def demonstrate_normalization_effects():
    """Demonstrate how normalization affects document comparison."""
    # Doc 0: short, focused on term 0
    # Doc 1: long, covers many terms
    # Doc 2: medium, focused on term 0 (topically similar to doc 0)
    X = np.array([
        [5.0, 0.5, 0.0, 0.0, 0.0],
        [8.0, 6.0, 4.0, 3.0, 2.0],
        [10.0, 1.0, 0.0, 0.0, 0.0],
    ])

    print("Original TF-IDF vectors:")
    print(X)
    print()

    for norm in ["none", "l1", "l2"]:
        X_norm, stats = normalize_vectors(X, norm=norm, return_stats=True)
        print(f"{norm.upper()} normalization:")
        print(f"  Vectors:\n{X_norm}")

        if norm == "l2":
            # For L2-normalized vectors, dot product = cosine similarity
            sims = X_norm @ X_norm.T
        else:
            # Otherwise compute cosine similarity explicitly
            norms = np.sqrt((X_norm ** 2).sum(axis=1))
            sims = (X_norm @ X_norm.T) / np.outer(norms, norms)

        print("  Cosine similarities:")
        print(f"    Doc0-Doc1: {sims[0, 1]:.4f}")
        print(f"    Doc0-Doc2: {sims[0, 2]:.4f}")
        print(f"    Doc1-Doc2: {sims[1, 2]:.4f}")

        # Key insight: Doc0 and Doc2 are most similar (both focus on
        # term 0), despite Doc1 having the largest magnitude
        if norm == "l2":
            print("  ✓ L2 correctly identifies Doc0 and Doc2 as most similar")


if __name__ == "__main__":
    demonstrate_normalization_effects()

    print("\n" + "=" * 60)
    print("PIVOTED LENGTH NORMALIZATION DEMO")
    print("=" * 60)

    # Show the effect of different 'b' values
    X = np.array([
        [10.0, 5.0],   # short doc
        [50.0, 25.0],  # long doc (5x longer)
    ])
    doc_lengths = np.array([100, 500])

    print(f"Original vectors: {X[0]}, {X[1]}")
    print(f"Doc lengths: {doc_lengths}")

    for b in [0.0, 0.5, 0.75, 1.0]:
        X_norm = pivoted_length_normalization(X, doc_lengths, b=b)
        ratio = X_norm[1, 0] / X_norm[0, 0]
        print(f"  b={b}: normalized values = {X_norm[0, 0]:.2f}, "
              f"{X_norm[1, 0]:.2f} (ratio: {ratio:.2f})")
```

For sparse matrices, row-wise normalization can be slow if done naively. The implementation above directly modifies the data array of CSR sparse matrices, avoiding expensive conversions. For very large matrices, consider using sklearn.preprocessing.normalize, which is highly optimized.
Normalization choice interacts deeply with similarity metrics. Understanding these interactions prevents unexpected behavior.
The Similarity-Normalization Correspondence:
| Similarity Metric | Best Normalization | Why |
|---|---|---|
| Cosine Similarity | L2 or none | Cosine normalizes internally; L2 pre-norm makes dot product = cosine |
| Dot Product | L2 (if you want cosine) | Without normalization, dot product favors long docs |
| Euclidean Distance | L2 | Without normalization, distances dominated by magnitude |
| Manhattan Distance | L1 or none | L1 normalization makes Manhattan interpretable |
| Jaccard Similarity | None or boolean | Jaccard typically uses set membership, not magnitudes |
| KL Divergence | L1 | KL requires probability distributions (sum to 1) |
Deep Dive: Cosine Similarity and L2 Normalization
Cosine similarity is defined as:
$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|_2 |\vec{b}|_2}$$
For L2-normalized vectors ($|\hat{a}|_2 = |\hat{b}|_2 = 1$):
$$\cos(\theta) = \hat{a} \cdot \hat{b}$$
The Computational Advantage:
| Operation | Unnormalized | L2-Normalized |
|---|---|---|
| Single similarity | 3N operations | N operations |
| All-pairs (M docs) | O(M²N) | O(M²N), but simpler |
| Nearest neighbors | Complex | Matrix multiply + argmax |
With L2-normalized vectors, finding most similar documents becomes a single matrix multiplication followed by argmax—highly optimized operations.
Euclidean Distance and Cosine:
For L2-normalized vectors, there's a beautiful relationship:
$$|\hat{a} - \hat{b}|_2^2 = (\hat{a} - \hat{b}) \cdot (\hat{a} - \hat{b}) = |\hat{a}|_2^2 + |\hat{b}|_2^2 - 2\hat{a} \cdot \hat{b}$$ $$= 1 + 1 - 2\cos(\theta) = 2(1 - \cos(\theta))$$
So: $|\hat{a} - \hat{b}|_2 = \sqrt{2(1 - \cos(\theta))} = 2\sin(\theta/2)$, using the half-angle identity $1 - \cos\theta = 2\sin^2(\theta/2)$.
Minimizing Euclidean distance ≡ maximizing cosine similarity for L2-normalized vectors!
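The identity is easy to verify numerically for a random pair of L2-normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.random(5), rng.random(5)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)              # both now unit length

cos = a @ b
dist_sq = np.linalg.norm(a - b) ** 2
assert np.isclose(dist_sq, 2 * (1 - cos))        # squared-distance identity

theta = np.arccos(np.clip(cos, -1.0, 1.0))
assert np.isclose(np.linalg.norm(a - b), 2 * np.sin(theta / 2))
print("identities verified")
```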
Choose normalization and similarity metric together as a pair. Don't normalize with L1 and then use Euclidean distance—the results won't be meaningful. Common pairs: (L2, cosine), (L2, Euclidean), (L1, Manhattan), (none, cosine with internal normalization).
We've completed our comprehensive journey through TF-IDF. Normalization is the final piece that makes TF-IDF vectors truly comparable.
The Complete TF-IDF Pipeline:

You now understand every component of TF-IDF: term frequency (with its sublinear variants), inverse document frequency, their combination into TF-IDF weights, and normalization of the resulting vectors.
Together, these components create one of the most successful text representation techniques in NLP history—simple enough to implement from scratch, yet powerful enough to underpin production search engines and classification systems.
What's Next:
With TF-IDF mastered, you're ready to explore more advanced text representations: word embeddings (Word2Vec, GloVe), contextual embeddings (BERT, GPT), and how TF-IDF relates to modern neural approaches.
Congratulations! You've completed the comprehensive TF-IDF module. You now understand TF-IDF at a depth matching experienced practitioners: the mathematics, the intuitions, the variants, the implementations, and the practical considerations. This knowledge forms a solid foundation for text feature engineering and information retrieval.