Consider a text classification task with 100,000 documents and a vocabulary of 50,000 terms. A naive dense representation would require:
100,000 documents × 50,000 features × 4 bytes = 20 GB
This enormous memory footprint makes the dataset impractical to load, let alone process. Yet the vast majority of this space is wasted on zeros—over 99% of entries in a typical document-term matrix are zero.
This extreme sparsity is not a bug but a fundamental property of language. Any given document uses only a tiny fraction of the total vocabulary. A 500-word news article might contain 200 unique words from a 50,000-word vocabulary—that's 99.6% zeros.
Sparse representations exploit this structure by storing only the non-zero values. That same 20 GB matrix, at 99% sparsity, compresses to a few hundred megabytes in a sparse format, a reduction of roughly 50-100× depending on the value and index types. This isn't just a memory optimization; it's what makes text feature engineering computationally feasible at scale.
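As a back-of-envelope check, here is a minimal sketch of that arithmetic, assuming float32 values and 32-bit column indices (matching the 4-byte figure above):

```python
# Hypothetical sizing for the 100,000 × 50,000 example at ~1% density.
n_docs, vocab, density = 100_000, 50_000, 0.01

dense_bytes = n_docs * vocab * 4                      # every cell, 4 bytes each
nnz = int(n_docs * vocab * density)                   # non-zero entries only
csr_bytes = nnz * 4 + nnz * 4 + (n_docs + 1) * 4      # values + column indices + row pointers

print(f"Dense: {dense_bytes / 1e9:.0f} GB")           # ≈ 20 GB
print(f"CSR:   {csr_bytes / 1e6:.0f} MB")             # ≈ 400 MB
print(f"Ratio: {dense_bytes / csr_bytes:.0f}x")       # ≈ 50x
```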
By the end of this page, you will understand: (1) The mathematical definition of sparsity and its prevalence in text data, (2) Sparse matrix storage formats (CSR, CSC, COO, DOK, LIL), (3) When and how to use each format, (4) Computational operations on sparse matrices, (5) Memory and performance trade-offs, and (6) Integration with ML libraries and pipelines.
What is Sparsity?
A vector or matrix is sparse when most of its entries are zero. Formally:
Sparsity ratio = (number of zero entries) / (total entries)
A matrix with 95% sparsity has 95% zeros and only 5% non-zero values.
Density is the complement:
Density = 1 - sparsity = (number of non-zero entries) / (total entries)
For text data, typical densities range from 0.001 (0.1%) to 0.05 (5%), meaning 95-99.9% of entries are zero.
| Scenario | Vocab Size | Avg Doc Length | Typical Density | Sparsity |
|---|---|---|---|---|
| Short tweets | 10,000 | 15 words | 0.15% | 99.85% |
| News articles | 50,000 | 500 words | 0.6% | 99.4% |
| Academic papers | 100,000 | 8,000 words | 1.5% | 98.5% |
| Legal documents | 80,000 | 20,000 words | 3% | 97% |
| Web corpus (bigrams) | 1,000,000 | 1,000 words | 0.01% | 99.99% |
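These density figures follow directly from how many distinct vocabulary terms an average document actually uses. A quick sketch (the unique-term counts below are illustrative assumptions, not measurements):

```python
# density ≈ unique terms per document / vocabulary size
scenarios = {
    "Short tweets":    (15, 10_000),      # ~15 unique terms
    "News articles":   (300, 50_000),     # ~300 unique terms out of 500 words
    "Academic papers": (1_500, 100_000),  # ~1,500 unique terms
}

for name, (unique_terms, vocab_size) in scenarios.items():
    density = unique_terms / vocab_size
    print(f"{name:15s}: density ≈ {density:.2%}, sparsity ≈ {1 - density:.2%}")
```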
```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from typing import Tuple

def analyze_sparsity(X) -> dict:
    """
    Analyze sparsity properties of a matrix.

    Args:
        X: Dense array or sparse matrix

    Returns:
        Dictionary of sparsity metrics
    """
    if hasattr(X, 'toarray'):
        # Sparse matrix
        n_nonzero = X.nnz
        total = X.shape[0] * X.shape[1]
    else:
        # Dense array
        n_nonzero = np.count_nonzero(X)
        total = X.size
    n_zero = total - n_nonzero
    return {
        'shape': X.shape,
        'total_entries': total,
        'nonzero_entries': n_nonzero,
        'zero_entries': n_zero,
        'density': n_nonzero / total,
        'sparsity': n_zero / total,
        'compression_potential': total / max(n_nonzero, 1)
    }

def memory_comparison(shape: Tuple[int, int], density: float) -> dict:
    """
    Compare memory usage of dense vs sparse representations.
    """
    n_rows, n_cols = shape
    total_entries = n_rows * n_cols
    n_nonzero = int(total_entries * density)

    # Dense: 8 bytes per float64
    dense_bytes = total_entries * 8

    # CSR sparse: data (8 bytes) + indices (4 bytes) + indptr (4 bytes)
    # data: n_nonzero * 8 bytes
    # column indices: n_nonzero * 4 bytes
    # row pointers: (n_rows + 1) * 4 bytes
    sparse_bytes = n_nonzero * 8 + n_nonzero * 4 + (n_rows + 1) * 4

    return {
        'dense_mb': dense_bytes / (1024 ** 2),
        'sparse_mb': sparse_bytes / (1024 ** 2),
        'compression_ratio': dense_bytes / max(sparse_bytes, 1),
        'memory_saved_pct': 100 * (1 - sparse_bytes / dense_bytes)
    }

# Demonstration with real text data
corpus = [
    "Machine learning algorithms process data to find patterns.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural language processing enables text understanding.",
    "Computer vision algorithms detect objects in images.",
    "Reinforcement learning agents learn through interaction.",
] * 100  # 500 documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Sparsity Analysis of Text Document-Term Matrix")
print("=" * 60)

metrics = analyze_sparsity(X)
print(f"\nMatrix shape: {metrics['shape']}")
print(f"Total entries: {metrics['total_entries']:,}")
print(f"Non-zero entries: {metrics['nonzero_entries']:,}")
print(f"Density: {metrics['density']:.4%}")
print(f"Sparsity: {metrics['sparsity']:.4%}")

# Memory comparison
mem = memory_comparison(X.shape, metrics['density'])
print(f"\n--- Memory Comparison ---")
print(f"Dense representation: {mem['dense_mb']:.2f} MB")
print(f"Sparse (CSR): {mem['sparse_mb']:.4f} MB")
print(f"Compression ratio: {mem['compression_ratio']:.1f}x")
print(f"Memory saved: {mem['memory_saved_pct']:.1f}%")

# Scale to realistic corpus
print("\n--- Scaled to Realistic Corpus ---")
large_shape = (100_000, 50_000)  # 100K docs, 50K vocab
large_density = 0.005  # 0.5% typical

large_mem = memory_comparison(large_shape, large_density)
print(f"Corpus: {large_shape[0]:,} documents, {large_shape[1]:,} vocabulary")
print(f"Dense: {large_mem['dense_mb']:,.0f} MB ({large_mem['dense_mb']/1024:.1f} GB)")
print(f"Sparse: {large_mem['sparse_mb']:,.0f} MB")
print(f"Compression: {large_mem['compression_ratio']:.0f}x")
```

Sparse representations become advantageous when density falls below ~10-30%. At 1% density (typical for text), sparse formats use ~100x less memory. At 0.1% density (large vocabularies, n-grams), the savings approach 1000x.
Different sparse matrix formats optimize for different operations. Understanding these formats is essential for writing efficient code.
The Core Trade-off:
Each format balances construction cost, access patterns, and arithmetic performance; no single format is best at everything:
| Format | Full Name | Best For | Avoid For |
|---|---|---|---|
| CSR | Compressed Sparse Row | Row slicing, matrix-vector products, ML algorithms | Column slicing, incremental construction |
| CSC | Compressed Sparse Column | Column slicing, some linear algebra operations | Row slicing, incremental construction |
| COO | Coordinate List | Incremental construction, format conversion | Arithmetic operations, slicing |
| DOK | Dictionary of Keys | Random element access, incremental construction | Arithmetic operations (slow) |
| LIL | List of Lists | Row-wise incremental construction | Column operations, arithmetic |
CSR (Compressed Sparse Row) — The Workhorse:
CSR is the most common format for sparse matrices in machine learning. It stores three arrays: data (the non-zero values), indices (the column index of each value), and indptr (row pointers marking where each row's entries begin in data).
Example:
Consider this 3×4 matrix:
[1, 0, 2, 0]
[0, 0, 3, 4]
[5, 0, 0, 6]
CSR representation:
data    = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 3, 0, 3]
indptr  = [0, 2, 4, 6]
To access row i: the values are data[indptr[i]:indptr[i+1]] and their column indices are indices[indptr[i]:indptr[i+1]].
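A minimal sketch of that indexing rule using scipy on the 3×4 example above (data, indices, and indptr are attributes of scipy.sparse.csr_matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 0, 2, 0],
                         [0, 0, 3, 4],
                         [5, 0, 0, 6]], dtype=float))

i = 1  # pull row 1 straight out of the CSR arrays
start, end = A.indptr[i], A.indptr[i + 1]
print(A.data[start:end])     # [3. 4.]
print(A.indices[start:end])  # [2 3]
```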
```python
import numpy as np
from scipy.sparse import (
    csr_matrix, csc_matrix, coo_matrix, dok_matrix, lil_matrix
)

# Create a sample dense matrix
dense = np.array([
    [1, 0, 2, 0],
    [0, 0, 3, 4],
    [5, 0, 0, 6]
], dtype=float)

print("Original Dense Matrix:")
print(dense)

# Convert to different sparse formats
print("\n" + "=" * 60)
print("Sparse Format Representations")
print("=" * 60)

# CSR - Compressed Sparse Row
csr = csr_matrix(dense)
print("\n--- CSR (Compressed Sparse Row) ---")
print(f"data: {csr.data}")
print(f"indices (columns): {csr.indices}")
print(f"indptr (row pointers): {csr.indptr}")
print(f"Shape: {csr.shape}, NNZ: {csr.nnz}")

# CSC - Compressed Sparse Column
csc = csc_matrix(dense)
print("\n--- CSC (Compressed Sparse Column) ---")
print(f"data: {csc.data}")
print(f"indices (rows): {csc.indices}")
print(f"indptr (column pointers): {csc.indptr}")

# COO - Coordinate List
coo = coo_matrix(dense)
print("\n--- COO (Coordinate List) ---")
print(f"data: {coo.data}")
print(f"row: {coo.row}")
print(f"col: {coo.col}")

# Memory usage comparison
print("\n" + "=" * 60)
print("Memory Usage Comparison (bytes)")
print("=" * 60)

def sparse_memory(m):
    """Estimate memory usage of sparse matrix."""
    if hasattr(m, 'data'):
        total = m.data.nbytes
        if hasattr(m, 'indices'):
            total += m.indices.nbytes
        if hasattr(m, 'indptr'):
            total += m.indptr.nbytes
        if hasattr(m, 'row'):
            total += m.row.nbytes
        if hasattr(m, 'col'):
            total += m.col.nbytes
        return total
    return 0

print(f"Dense: {dense.nbytes} bytes")
print(f"CSR: {sparse_memory(csr)} bytes")
print(f"CSC: {sparse_memory(csc)} bytes")
print(f"COO: {sparse_memory(coo)} bytes")

# Access patterns
print("\n" + "=" * 60)
print("Access Pattern Demonstration")
print("=" * 60)

print("\nRow 1 access (CSR is optimized):")
print(f"  CSR row 1: {csr[1].toarray()}")

print("\nColumn 2 access (CSC is optimized):")
print(f"  CSC col 2: {csc[:, 2].toarray().flatten()}")
```

Sparse matrices support most operations you'd perform on dense matrices, but efficiency varies dramatically by format and operation.
Key Operations and Their Complexity:
Let nnz = number of non-zeros, n = number of rows, m = number of columns.
| Operation | CSR | CSC | COO | Dense |
|---|---|---|---|---|
| Matrix × Vector | O(nnz) ✓ | O(nnz) | O(nnz) | O(nm) |
| Row slice | O(row_nnz) ✓ | O(nnz) | O(nnz) | O(m) |
| Column slice | O(nnz) | O(col_nnz) ✓ | O(nnz) | O(n) |
| Element access | O(log row_nnz) | O(log col_nnz) | O(nnz) | O(1) ✓ |
| Matrix × Matrix | O(nnz × avg_row) | O(nnz × avg_col) | Convert first | O(nmk) |
| Element-wise ops | O(nnz) ✓ | O(nnz) ✓ | O(nnz) | O(nm) |
```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix
from scipy.sparse import diags, eye, vstack, hstack
import time

def benchmark(func, *args, n_runs=100):
    """Benchmark a function call."""
    start = time.perf_counter()
    for _ in range(n_runs):
        result = func(*args)
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed * 1000  # milliseconds

# Create test matrices
np.random.seed(42)
n, m = 5000, 10000
density = 0.01  # 1% non-zero

# Random sparse matrix
rows = np.random.randint(0, n, int(n * m * density))
cols = np.random.randint(0, m, int(n * m * density))
data = np.random.rand(len(rows))

sparse_csr = csr_matrix((data, (rows, cols)), shape=(n, m))
sparse_csc = csc_matrix(sparse_csr)
dense = sparse_csr.toarray()

print(f"Matrix shape: {n} × {m}")
print(f"Density: {density:.1%}")
print(f"Non-zeros: {sparse_csr.nnz:,}")

# Benchmark matrix-vector multiplication
vec = np.random.rand(m)

print("\n--- Matrix-Vector Multiplication ---")
t_sparse = benchmark(lambda: sparse_csr @ vec)
t_dense = benchmark(lambda: dense @ vec)
print(f"Sparse (CSR): {t_sparse:.3f} ms")
print(f"Dense: {t_dense:.3f} ms")
print(f"Speedup: {t_dense/t_sparse:.1f}x")

# Benchmark row slicing
print("\n--- Row Slicing (100 rows) ---")
t_csr = benchmark(lambda: sparse_csr[100:200])
t_csc = benchmark(lambda: sparse_csc[100:200])
t_dense_slice = benchmark(lambda: dense[100:200])
print(f"CSR: {t_csr:.3f} ms (optimized)")
print(f"CSC: {t_csc:.3f} ms")
print(f"Dense: {t_dense_slice:.3f} ms")

# Benchmark column slicing
print("\n--- Column Slicing (50 columns) ---")
t_csr_col = benchmark(lambda: sparse_csr[:, 100:150])
t_csc_col = benchmark(lambda: sparse_csc[:, 100:150])
print(f"CSR: {t_csr_col:.3f} ms")
print(f"CSC: {t_csc_col:.3f} ms (optimized)")

# Demonstrate sparse-specific operations
print("\n--- Sparse-Specific Operations ---")

# Create identity and diagonal matrices efficiently
identity = eye(1000, format='csr')
diagonal = diags([1, 2, 3], [0, 1, -1], shape=(1000, 1000), format='csr')

print(f"Identity (1000×1000): {identity.nnz} non-zeros")
print(f"Tridiagonal (1000×1000): {diagonal.nnz} non-zeros")

# Efficient stacking
sparse1 = csr_matrix(np.random.rand(100, 500) > 0.99)
sparse2 = csr_matrix(np.random.rand(100, 500) > 0.99)

stacked_v = vstack([sparse1, sparse2])
stacked_h = hstack([sparse1, sparse2])

print(f"\nVertical stack: {sparse1.shape} + {sparse2.shape} = {stacked_v.shape}")
print(f"Horizontal stack: {sparse1.shape} + {sparse2.shape} = {stacked_h.shape}")
```

Calling .toarray() or .todense() defeats the purpose of sparse storage. A 100K × 50K sparse matrix at 0.5% density fits in roughly 300 MB in CSR form (float64 values plus int32 indices); dense would require ~40 GB. Only convert small matrices or when absolutely necessary.
How you construct a sparse matrix significantly affects performance. Different patterns call for different approaches.
```python
import numpy as np
from scipy.sparse import csr_matrix, coo_matrix, lil_matrix, dok_matrix
import time
from typing import List, Tuple

def time_construction(name: str, func):
    """Time a sparse matrix construction."""
    start = time.perf_counter()
    result = func()
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name:30s}: {elapsed:8.2f} ms, shape={result.shape}, nnz={result.nnz:,}")
    return result

# Simulate text data: documents with word counts
np.random.seed(42)
n_docs = 10000
vocab_size = 50000
avg_unique_words = 200

print("Sparse Matrix Construction Comparison")
print("=" * 60)
print(f"Simulating {n_docs:,} documents, {vocab_size:,} vocabulary")
print(f"Average {avg_unique_words} unique words per document")
print()

# Generate sample data: list of (doc_id, word_id, count) triplets
triplets: List[Tuple[int, int, int]] = []
for doc_id in range(n_docs):
    n_unique = np.random.randint(100, 300)
    word_ids = np.random.choice(vocab_size, n_unique, replace=False)
    counts = np.random.randint(1, 10, n_unique)
    for word_id, count in zip(word_ids, counts):
        triplets.append((doc_id, word_id, count))

rows, cols, data = zip(*triplets)
rows = np.array(rows)
cols = np.array(cols)
data = np.array(data, dtype=np.float64)

print(f"Total non-zeros: {len(triplets):,}")
print()

# Method 1: COO (recommended for batch construction)
def coo_construction():
    return coo_matrix((data, (rows, cols)), shape=(n_docs, vocab_size)).tocsr()

csr_from_coo = time_construction("COO → CSR", coo_construction)

# Method 2: Direct CSR construction
def direct_csr():
    # Sort by row for CSR construction
    sort_idx = np.lexsort((cols, rows))
    sorted_rows = rows[sort_idx]
    sorted_cols = cols[sort_idx]
    sorted_data = data[sort_idx]

    # Build indptr
    indptr = np.zeros(n_docs + 1, dtype=np.int32)
    for r in sorted_rows:
        indptr[r + 1] += 1
    indptr = np.cumsum(indptr)

    return csr_matrix((sorted_data, sorted_cols, indptr), shape=(n_docs, vocab_size))

csr_direct = time_construction("Direct CSR", direct_csr)

# Method 3: LIL (row-wise construction)
def lil_construction():
    lil = lil_matrix((n_docs, vocab_size))
    for doc_id, word_id, count in triplets:
        lil[doc_id, word_id] = count
    return lil.tocsr()

# Note: LIL is slow for this many insertions; we'll skip on large data
print("LIL construction: [skipped - too slow for 2M insertions]")

# Method 4: DOK (dictionary construction)
def dok_construction():
    dok = dok_matrix((n_docs, vocab_size))
    for doc_id, word_id, count in triplets:
        dok[doc_id, word_id] = count
    return dok.tocsr()

# Also slow for many insertions
print("DOK construction: [skipped - too slow for 2M insertions]")

# Verify matrices are equivalent
print("\n--- Verification ---")
print(f"COO→CSR and Direct CSR are equal: {(csr_from_coo != csr_direct).nnz == 0}")

# Memory comparison
print("\n--- Memory Usage ---")
def csr_memory(m):
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

print(f"CSR memory: {csr_from_coo.data.nbytes/1e6:.1f} MB (data)")
print(f"          + {csr_from_coo.indices.nbytes/1e6:.1f} MB (indices)")
print(f"          + {csr_from_coo.indptr.nbytes/1e3:.1f} KB (indptr)")
print(f"          = {csr_memory(csr_from_coo)/1e6:.1f} MB total")

dense_size = n_docs * vocab_size * 8 / 1e9
print(f"\nDense equivalent: {dense_size:.1f} GB")
print(f"Compression: {dense_size*1000/csr_memory(csr_from_coo)*1e6:.0f}x")
```

Scikit-learn and other ML libraries have excellent sparse matrix support, but you need to know which algorithms maintain sparsity and which don't.
| Algorithm/Transform | Sparse Input | Sparse Output | Notes |
|---|---|---|---|
| CountVectorizer | N/A (text) | ✓ | Always outputs CSR |
| TfidfVectorizer | N/A (text) | ✓ | Always outputs CSR |
| TfidfTransformer | ✓ | ✓ | Preserves sparsity |
| LogisticRegression | ✓ | N/A | Efficient on sparse data |
| SGDClassifier | ✓ | N/A | Designed for sparse data |
| MultinomialNB | ✓ | N/A | Expects sparse counts |
| RandomForest | ✓ (slow) | N/A | Works but not optimized |
| SVC (kernel) | ✓ | N/A | Linear kernel only for speed |
| StandardScaler | ✓ (with_mean=False) | ✓ | Centering destroys sparsity! |
| MaxAbsScaler | ✓ | ✓ | Preserves sparsity |
| PCA | ✗ | ✗ | Converts to dense; use TruncatedSVD |
| TruncatedSVD | ✓ | ✗ | Sparse input, dense output |
StandardScaler with with_mean=True (the default!) subtracts the mean from every entry, turning all zeros into non-zeros. For sparse data, always use with_mean=False or use MaxAbsScaler. This is a common performance trap.
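The same caution applies to the dimensionality-reduction rows in the table above: TruncatedSVD consumes a sparse matrix directly, whereas PCA would densify it first. A minimal sketch (the random matrix and component count here are arbitrary illustrations):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random CSR matrix standing in for a TF-IDF document-term matrix
X = sparse_random(1_000, 5_000, density=0.01, format='csr', random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)   # sparse in, dense (1000, 50) out

print(X_reduced.shape)
print(f"Explained variance captured: {svd.explained_variance_ratio_.sum():.2%}")
```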
```python
import numpy as np
from scipy.sparse import issparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

# Load text dataset
print("Loading 20 Newsgroups dataset...")
categories = ['comp.graphics', 'sci.med', 'rec.sport.baseball', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

X_text = newsgroups.data
y = newsgroups.target

print(f"Documents: {len(X_text)}")
print(f"Categories: {len(set(y))}")

# Vectorize
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X_sparse = vectorizer.fit_transform(X_text)

print(f"\nSparse matrix shape: {X_sparse.shape}")
print(f"Density: {X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.4%}")
print(f"Is sparse: {issparse(X_sparse)}")

# Demonstrate sparse-aware scaling
print("\n--- Scaling Comparison ---")

# WRONG: StandardScaler destroys sparsity
scaler_std = StandardScaler(with_mean=True, with_std=True)
try:
    X_std = scaler_std.fit_transform(X_sparse)
except ValueError as e:
    print(f"StandardScaler(with_mean=True): ERROR - {str(e)[:50]}...")

# RIGHT: StandardScaler with with_mean=False
scaler_std_nomean = StandardScaler(with_mean=False)
X_std_nomean = scaler_std_nomean.fit_transform(X_sparse)
print(f"StandardScaler(with_mean=False): sparse={issparse(X_std_nomean)}")

# RIGHT: MaxAbsScaler preserves sparsity
scaler_maxabs = MaxAbsScaler()
X_maxabs = scaler_maxabs.fit_transform(X_sparse)
print(f"MaxAbsScaler: sparse={issparse(X_maxabs)}")

# Benchmark different classifiers on sparse data
print("\n--- Classifier Comparison (5-fold CV) ---")

classifiers = [
    ('LogisticRegression', LogisticRegression(max_iter=1000, random_state=42)),
    ('SGDClassifier', SGDClassifier(random_state=42)),
    ('MultinomialNB', MultinomialNB()),
]

for name, clf in classifiers:
    scores = cross_val_score(clf, X_sparse, y, cv=5, scoring='accuracy')
    print(f"{name:20s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Complete sparse-aware pipeline
print("\n--- Complete Pipeline ---")
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('scaler', MaxAbsScaler()),  # Sparse-safe scaler
    ('classifier', LogisticRegression(max_iter=1000))
])

scores = cross_val_score(pipeline, X_text, y, cv=5)
print(f"Pipeline accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Beyond basic operations, several advanced techniques are essential for production sparse matrix handling.
```python
import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz, vstack
from scipy.sparse.linalg import norm, svds
import tempfile
import os

# Create a sample sparse matrix
np.random.seed(42)
n_rows, n_cols = 10000, 5000
density = 0.01

rows = np.random.randint(0, n_rows, int(n_rows * n_cols * density))
cols = np.random.randint(0, n_cols, int(n_rows * n_cols * density))
data = np.random.rand(len(rows))
X = csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

print(f"Sparse matrix: {X.shape}, nnz={X.nnz:,}, density={X.nnz/(n_rows*n_cols):.4%}")

# 1. Efficient I/O with NPZ format
print("\n--- Sparse I/O ---")
with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, 'matrix.npz')

    # Save
    save_npz(filepath, X)
    file_size = os.path.getsize(filepath)
    print(f"Saved to NPZ: {file_size / 1024:.1f} KB")

    # Load
    X_loaded = load_npz(filepath)
    print(f"Loaded: shape={X_loaded.shape}, nnz={X_loaded.nnz}")
    print(f"Matrices equal: {(X != X_loaded).nnz == 0}")

# 2. Sparse matrix norms
print("\n--- Sparse Norms ---")
frobenius = norm(X, 'fro')
l1_norm = np.abs(X).sum()
max_norm = np.abs(X).max()

print(f"Frobenius norm: {frobenius:.4f}")
print(f"L1 norm (sum of |elements|): {l1_norm:.4f}")
print(f"Max norm (largest |element|): {max_norm:.4f}")

# 3. Truncated SVD on sparse matrices
print("\n--- Truncated SVD (sparse-efficient) ---")
k = 100  # Number of components
U, s, Vt = svds(X, k=k)

print(f"U: {U.shape}, s: {s.shape}, Vt: {Vt.shape}")
print(f"Explained variance ratio (approximate): {np.sum(s**2) / (norm(X, 'fro')**2):.2%}")

# 4. Chunked processing for memory efficiency
print("\n--- Chunked Processing ---")

def process_in_chunks(X, chunk_size=1000, process_func=None):
    """
    Process a sparse matrix in row chunks.

    Useful when full matrix operations exceed memory.
    """
    n_rows = X.shape[0]
    results = []
    for start in range(0, n_rows, chunk_size):
        end = min(start + chunk_size, n_rows)
        chunk = X[start:end]
        if process_func is not None:
            result = process_func(chunk)
        else:
            result = chunk.mean(axis=1).A1  # Example: row means
        results.append(result)
    return np.concatenate(results)

# Compute row means in chunks
row_means_chunked = process_in_chunks(X, chunk_size=1000)
row_means_direct = np.array(X.mean(axis=1)).flatten()

print(f"Chunked row means: {row_means_chunked[:5]}")
print(f"Direct row means: {row_means_direct[:5]}")
print(f"Results match: {np.allclose(row_means_chunked, row_means_direct)}")

# 5. Sparse matrix statistics
print("\n--- Sparse Matrix Statistics ---")

# Row-wise statistics
row_nnz = np.diff(X.indptr)  # Non-zeros per row
print(f"Non-zeros per row: min={row_nnz.min()}, max={row_nnz.max()}, mean={row_nnz.mean():.1f}")

# Column-wise statistics (need CSC for efficiency)
X_csc = X.tocsc()
col_nnz = np.diff(X_csc.indptr)
print(f"Non-zeros per column: min={col_nnz.min()}, max={col_nnz.max()}, mean={col_nnz.mean():.1f}")
```

Always verify your pipeline maintains sparsity: from scipy.sparse import issparse; assert issparse(X_transformed). Add this after every transformation during development. It catches silent densification that kills performance.
We've explored sparse representations comprehensively—from the fundamental concept through storage formats, operations, construction patterns, and integration with ML pipelines. The key insights:
- Text data is overwhelmingly sparse: typical document-term matrices are 95-99.9% zeros, so sparse storage saves memory by factors of roughly 50-1000×.
- Match the format to the access pattern: CSR for row slicing, matrix-vector products, and ML algorithms; CSC for column operations; COO for batch construction; LIL and DOK only for small incremental builds.
- Guard against silent densification: avoid .toarray()/.todense() on large matrices, scale with MaxAbsScaler or StandardScaler(with_mean=False), and use TruncatedSVD instead of PCA.
Looking Ahead:
With sparse representations mastered, the next page examines the limitations of Bag of Words—where this foundational approach fails and what techniques address those failures. Understanding BoW's weaknesses is essential for knowing when to reach for more sophisticated representations like word embeddings or transformers.
You now have comprehensive knowledge of sparse representations—the data structures that make Bag of Words computationally feasible at scale. From storage formats through operations to ML pipeline integration, you understand how to work efficiently with high-dimensional text data. Next, we'll examine where BoW falls short and what alternatives have emerged.