Consider a text classification task with 100,000 documents and a vocabulary of 50,000 terms. A naive dense representation would require:
100,000 documents × 50,000 features × 4 bytes = 20 GB
This enormous memory footprint makes the dataset impractical to load, let alone process. Yet the vast majority of this space is wasted on zeros—over 99% of entries in a typical document-term matrix are zero.
This extreme sparsity is not a bug but a fundamental property of language. Any given document uses only a tiny fraction of the total vocabulary. A 500-word news article might contain 200 unique words from a 50,000-word vocabulary—that's 99.6% zeros.
Sparse representations exploit this structure by storing only the non-zero values. That same 20 GB matrix, at 99% sparsity, compresses to a few hundred megabytes in a sparse format, a reduction of roughly 50-100× depending on the value and index types. This isn't just a memory optimization; it's what makes text feature engineering computationally feasible at scale.
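As a back-of-envelope check, here is a minimal sketch of that arithmetic, assuming float32 values and 32-bit column indices (matching the 4-byte figure above):

```python
# Hypothetical sizing for the 100,000 × 50,000 example at ~1% density.
n_docs, vocab, density = 100_000, 50_000, 0.01

dense_bytes = n_docs * vocab * 4                      # every cell, 4 bytes each
nnz = int(n_docs * vocab * density)                   # non-zero entries only
csr_bytes = nnz * 4 + nnz * 4 + (n_docs + 1) * 4      # values + column indices + row pointers

print(f"Dense: {dense_bytes / 1e9:.0f} GB")           # ≈ 20 GB
print(f"CSR:   {csr_bytes / 1e6:.0f} MB")             # ≈ 400 MB
print(f"Ratio: {dense_bytes / csr_bytes:.0f}x")       # ≈ 50x
```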
By the end of this page, you will understand: (1) The mathematical definition of sparsity and its prevalence in text data, (2) Sparse matrix storage formats (CSR, CSC, COO, DOK, LIL), (3) When and how to use each format, (4) Computational operations on sparse matrices, (5) Memory and performance trade-offs, and (6) Integration with ML libraries and pipelines.
What is Sparsity?
A vector or matrix is sparse when most of its entries are zero. Formally:
Sparsity ratio = (number of zero entries) / (total entries)
A matrix with 95% sparsity has 95% zeros and only 5% non-zero values.
Density is the complement:
Density = 1 - sparsity = (number of non-zero entries) / (total entries)
For text data, typical densities range from 0.001 (0.1%) to 0.05 (5%), meaning 95-99.9% of entries are zero.
| Scenario | Vocab Size | Avg Doc Length | Typical Density | Sparsity |
|---|---|---|---|---|
| Short tweets | 10,000 | 15 words | 0.15% | 99.85% |
| News articles | 50,000 | 500 words | 0.6% | 99.4% |
| Academic papers | 100,000 | 8,000 words | 1.5% | 98.5% |
| Legal documents | 80,000 | 20,000 words | 3% | 97% |
| Web corpus (bigrams) | 1,000,000 | 1,000 words | 0.01% | 99.99% |
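These density figures follow directly from how many distinct vocabulary terms an average document actually uses. A quick sketch (the unique-term counts below are illustrative assumptions, not measurements):

```python
# density ≈ unique terms per document / vocabulary size
scenarios = {
    "Short tweets":    (15, 10_000),      # ~15 unique terms
    "News articles":   (300, 50_000),     # ~300 unique terms out of 500 words
    "Academic papers": (1_500, 100_000),  # ~1,500 unique terms
}

for name, (unique_terms, vocab_size) in scenarios.items():
    density = unique_terms / vocab_size
    print(f"{name:15s}: density ≈ {density:.2%}, sparsity ≈ {1 - density:.2%}")
```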
```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from typing import Tuple

def analyze_sparsity(X) -> dict:
    """
    Analyze sparsity properties of a matrix.

    Args:
        X: Dense array or sparse matrix

    Returns:
        Dictionary of sparsity metrics
    """
    if hasattr(X, 'toarray'):
        # Sparse matrix
        n_nonzero = X.nnz
        total = X.shape[0] * X.shape[1]
    else:
        # Dense array
        n_nonzero = np.count_nonzero(X)
        total = X.size
    n_zero = total - n_nonzero
    return {
        'shape': X.shape,
        'total_entries': total,
        'nonzero_entries': n_nonzero,
        'zero_entries': n_zero,
        'density': n_nonzero / total,
        'sparsity': n_zero / total,
        'compression_potential': total / max(n_nonzero, 1)
    }

def memory_comparison(shape: Tuple[int, int], density: float) -> dict:
    """
    Compare memory usage of dense vs sparse representations.
    """
    n_rows, n_cols = shape
    total_entries = n_rows * n_cols
    n_nonzero = int(total_entries * density)

    # Dense: 8 bytes per float64
    dense_bytes = total_entries * 8

    # CSR sparse: data (8 bytes) + indices (4 bytes) + indptr (4 bytes)
    # data: n_nonzero * 8 bytes
    # column indices: n_nonzero * 4 bytes
    # row pointers: (n_rows + 1) * 4 bytes
    sparse_bytes = n_nonzero * 8 + n_nonzero * 4 + (n_rows + 1) * 4

    return {
        'dense_mb': dense_bytes / (1024 ** 2),
        'sparse_mb': sparse_bytes / (1024 ** 2),
        'compression_ratio': dense_bytes / max(sparse_bytes, 1),
        'memory_saved_pct': 100 * (1 - sparse_bytes / dense_bytes)
    }

# Demonstration with real text data
corpus = [
    "Machine learning algorithms process data to find patterns.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural language processing enables text understanding.",
    "Computer vision algorithms detect objects in images.",
    "Reinforcement learning agents learn through interaction.",
] * 100  # 500 documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Sparsity Analysis of Text Document-Term Matrix")
print("=" * 60)

metrics = analyze_sparsity(X)
print(f"\nMatrix shape: {metrics['shape']}")
print(f"Total entries: {metrics['total_entries']:,}")
print(f"Non-zero entries: {metrics['nonzero_entries']:,}")
print(f"Density: {metrics['density']:.4%}")
print(f"Sparsity: {metrics['sparsity']:.4%}")

# Memory comparison
mem = memory_comparison(X.shape, metrics['density'])
print(f"\n--- Memory Comparison ---")
print(f"Dense representation: {mem['dense_mb']:.2f} MB")
print(f"Sparse (CSR): {mem['sparse_mb']:.4f} MB")
print(f"Compression ratio: {mem['compression_ratio']:.1f}x")
print(f"Memory saved: {mem['memory_saved_pct']:.1f}%")

# Scale to realistic corpus
print("\n--- Scaled to Realistic Corpus ---")
large_shape = (100_000, 50_000)  # 100K docs, 50K vocab
large_density = 0.005  # 0.5% typical

large_mem = memory_comparison(large_shape, large_density)
print(f"Corpus: {large_shape[0]:,} documents, {large_shape[1]:,} vocabulary")
print(f"Dense: {large_mem['dense_mb']:,.0f} MB ({large_mem['dense_mb']/1024:.1f} GB)")
print(f"Sparse: {large_mem['sparse_mb']:,.0f} MB")
print(f"Compression: {large_mem['compression_ratio']:.0f}x")
```

Sparse representations become advantageous when density falls below ~10-30%. At 1% density (typical for text), sparse formats use ~100x less memory. At 0.1% density (large vocabularies, n-grams), the savings approach 1000x.
Different sparse matrix formats optimize for different operations. Understanding these formats is essential for writing efficient code.
The Core Trade-off:
Each format balances construction cost, access patterns, and arithmetic performance; no single format is best at everything:
| Format | Full Name | Best For | Avoid For |
|---|---|---|---|
| CSR | Compressed Sparse Row | Row slicing, matrix-vector products, ML algorithms | Column slicing, incremental construction |
| CSC | Compressed Sparse Column | Column slicing, some linear algebra operations | Row slicing, incremental construction |
| COO | Coordinate List | Incremental construction, format conversion | Arithmetic operations, slicing |
| DOK | Dictionary of Keys | Random element access, incremental construction | Arithmetic operations (slow) |
| LIL | List of Lists | Row-wise incremental construction | Column operations, arithmetic |
CSR (Compressed Sparse Row) — The Workhorse:
CSR is the most common format for sparse matrices in machine learning. It stores three arrays: data (the non-zero values), indices (the column index of each value), and indptr (row pointers marking where each row's entries begin in data).
Example:
Consider this 3×4 matrix:
[1, 0, 2, 0]
[0, 0, 3, 4]
[5, 0, 0, 6]
CSR representation:
data    = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 3, 0, 3]
indptr  = [0, 2, 4, 6]
To access row i: the values are data[indptr[i]:indptr[i+1]] and their column indices are indices[indptr[i]:indptr[i+1]].
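A minimal sketch of that indexing rule using scipy on the 3×4 example above (data, indices, and indptr are attributes of scipy.sparse.csr_matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 0, 2, 0],
                         [0, 0, 3, 4],
                         [5, 0, 0, 6]], dtype=float))

i = 1  # pull row 1 straight out of the CSR arrays
start, end = A.indptr[i], A.indptr[i + 1]
print(A.data[start:end])     # [3. 4.]
print(A.indices[start:end])  # [2 3]
```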
```python
import numpy as np
from scipy.sparse import (
    csr_matrix, csc_matrix, coo_matrix, dok_matrix, lil_matrix
)

# Create a sample dense matrix
dense = np.array([
    [1, 0, 2, 0],
    [0, 0, 3, 4],
    [5, 0, 0, 6]
], dtype=float)

print("Original Dense Matrix:")
print(dense)

# Convert to different sparse formats
print("\n" + "=" * 60)
print("Sparse Format Representations")
print("=" * 60)

# CSR - Compressed Sparse Row
csr = csr_matrix(dense)
print("\n--- CSR (Compressed Sparse Row) ---")
print(f"data: {csr.data}")
print(f"indices (columns): {csr.indices}")
print(f"indptr (row pointers): {csr.indptr}")
print(f"Shape: {csr.shape}, NNZ: {csr.nnz}")

# CSC - Compressed Sparse Column
csc = csc_matrix(dense)
print("\n--- CSC (Compressed Sparse Column) ---")
print(f"data: {csc.data}")
print(f"indices (rows): {csc.indices}")
print(f"indptr (column pointers): {csc.indptr}")

# COO - Coordinate List
coo = coo_matrix(dense)
print("\n--- COO (Coordinate List) ---")
print(f"data: {coo.data}")
print(f"row: {coo.row}")
print(f"col: {coo.col}")

# Memory usage comparison
print("\n" + "=" * 60)
print("Memory Usage Comparison (bytes)")
print("=" * 60)

def sparse_memory(m):
    """Estimate memory usage of sparse matrix."""
    if hasattr(m, 'data'):
        total = m.data.nbytes
        if hasattr(m, 'indices'):
            total += m.indices.nbytes
        if hasattr(m, 'indptr'):
            total += m.indptr.nbytes
        if hasattr(m, 'row'):
            total += m.row.nbytes
        if hasattr(m, 'col'):
            total += m.col.nbytes
        return total
    return 0

print(f"Dense: {dense.nbytes} bytes")
print(f"CSR: {sparse_memory(csr)} bytes")
print(f"CSC: {sparse_memory(csc)} bytes")
print(f"COO: {sparse_memory(coo)} bytes")

# Access patterns
print("\n" + "=" * 60)
print("Access Pattern Demonstration")
print("=" * 60)

print("\nRow 1 access (CSR is optimized):")
print(f"  CSR row 1: {csr[1].toarray()}")

print("\nColumn 2 access (CSC is optimized):")
print(f"  CSC col 2: {csc[:, 2].toarray().flatten()}")
```

Sparse matrices support most operations you'd perform on dense matrices, but efficiency varies dramatically by format and operation.
Key Operations and Their Complexity:
Let nnz = number of non-zeros, n = number of rows, m = number of columns.
| Operation | CSR | CSC | COO | Dense |
|---|---|---|---|---|
| Matrix × Vector | O(nnz) ✓ | O(nnz) | O(nnz) | O(nm) |
| Row slice | O(row_nnz) ✓ | O(nnz) | O(nnz) | O(m) |
| Column slice | O(nnz) | O(col_nnz) ✓ | O(nnz) | O(n) |
| Element access | O(log row_nnz) | O(log col_nnz) | O(nnz) | O(1) ✓ |
| Matrix × Matrix | O(nnz × avg_row) | O(nnz × avg_col) | Convert first | O(nmk) |
| Element-wise ops | O(nnz) ✓ | O(nnz) ✓ | O(nnz) | O(nm) |
```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix
from scipy.sparse import diags, eye, vstack, hstack
import time

def benchmark(func, *args, n_runs=100):
    """Benchmark a function call."""
    start = time.perf_counter()
    for _ in range(n_runs):
        result = func(*args)
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed * 1000  # milliseconds

# Create test matrices
np.random.seed(42)
n, m = 5000, 10000
density = 0.01  # 1% non-zero

# Random sparse matrix
rows = np.random.randint(0, n, int(n * m * density))
cols = np.random.randint(0, m, int(n * m * density))
data = np.random.rand(len(rows))

sparse_csr = csr_matrix((data, (rows, cols)), shape=(n, m))
sparse_csc = csc_matrix(sparse_csr)
dense = sparse_csr.toarray()

print(f"Matrix shape: {n} × {m}")
print(f"Density: {density:.1%}")
print(f"Non-zeros: {sparse_csr.nnz:,}")

# Benchmark matrix-vector multiplication
vec = np.random.rand(m)

print("\n--- Matrix-Vector Multiplication ---")
t_sparse = benchmark(lambda: sparse_csr @ vec)
t_dense = benchmark(lambda: dense @ vec)
print(f"Sparse (CSR): {t_sparse:.3f} ms")
print(f"Dense: {t_dense:.3f} ms")
print(f"Speedup: {t_dense/t_sparse:.1f}x")

# Benchmark row slicing
print("\n--- Row Slicing (100 rows) ---")
t_csr = benchmark(lambda: sparse_csr[100:200])
t_csc = benchmark(lambda: sparse_csc[100:200])
t_dense_slice = benchmark(lambda: dense[100:200])
print(f"CSR: {t_csr:.3f} ms (optimized)")
print(f"CSC: {t_csc:.3f} ms")
print(f"Dense: {t_dense_slice:.3f} ms")

# Benchmark column slicing
print("\n--- Column Slicing (50 columns) ---")
t_csr_col = benchmark(lambda: sparse_csr[:, 100:150])
t_csc_col = benchmark(lambda: sparse_csc[:, 100:150])
print(f"CSR: {t_csr_col:.3f} ms")
print(f"CSC: {t_csc_col:.3f} ms (optimized)")

# Demonstrate sparse-specific operations
print("\n--- Sparse-Specific Operations ---")

# Create identity and diagonal matrices efficiently
identity = eye(1000, format='csr')
diagonal = diags([1, 2, 3], [0, 1, -1], shape=(1000, 1000), format='csr')

print(f"Identity (1000×1000): {identity.nnz} non-zeros")
print(f"Tridiagonal (1000×1000): {diagonal.nnz} non-zeros")

# Efficient stacking
sparse1 = csr_matrix(np.random.rand(100, 500) > 0.99)
sparse2 = csr_matrix(np.random.rand(100, 500) > 0.99)

stacked_v = vstack([sparse1, sparse2])
stacked_h = hstack([sparse1, sparse2])

print(f"\nVertical stack: {sparse1.shape} + {sparse2.shape} = {stacked_v.shape}")
print(f"Horizontal stack: {sparse1.shape} + {sparse2.shape} = {stacked_h.shape}")
```

Calling .toarray() or .todense() defeats the purpose of sparse storage. A 100K × 50K sparse matrix at 0.5% density fits in roughly 300 MB in CSR form (float64 values plus int32 indices); dense would require ~40 GB. Only convert small matrices or when absolutely necessary.
How you construct a sparse matrix significantly affects performance. Different patterns call for different approaches.
```python
import numpy as np
from scipy.sparse import csr_matrix, coo_matrix, lil_matrix, dok_matrix
import time
from typing import List, Tuple

def time_construction(name: str, func):
    """Time a sparse matrix construction."""
    start = time.perf_counter()
    result = func()
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name:30s}: {elapsed:8.2f} ms, shape={result.shape}, nnz={result.nnz:,}")
    return result

# Simulate text data: documents with word counts
np.random.seed(42)
n_docs = 10000
vocab_size = 50000
avg_unique_words = 200

print("Sparse Matrix Construction Comparison")
print("=" * 60)
print(f"Simulating {n_docs:,} documents, {vocab_size:,} vocabulary")
print(f"Average {avg_unique_words} unique words per document")
print()

# Generate sample data: list of (doc_id, word_id, count) triplets
triplets: List[Tuple[int, int, int]] = []
for doc_id in range(n_docs):
    n_unique = np.random.randint(100, 300)
    word_ids = np.random.choice(vocab_size, n_unique, replace=False)
    counts = np.random.randint(1, 10, n_unique)
    for word_id, count in zip(word_ids, counts):
        triplets.append((doc_id, word_id, count))

rows, cols, data = zip(*triplets)
rows = np.array(rows)
cols = np.array(cols)
data = np.array(data, dtype=np.float64)

print(f"Total non-zeros: {len(triplets):,}")
print()

# Method 1: COO (recommended for batch construction)
def coo_construction():
    return coo_matrix((data, (rows, cols)), shape=(n_docs, vocab_size)).tocsr()

csr_from_coo = time_construction("COO → CSR", coo_construction)

# Method 2: Direct CSR construction
def direct_csr():
    # Sort by row for CSR construction
    sort_idx = np.lexsort((cols, rows))
    sorted_rows = rows[sort_idx]
    sorted_cols = cols[sort_idx]
    sorted_data = data[sort_idx]

    # Build indptr
    indptr = np.zeros(n_docs + 1, dtype=np.int32)
    for r in sorted_rows:
        indptr[r + 1] += 1
    indptr = np.cumsum(indptr)

    return csr_matrix((sorted_data, sorted_cols, indptr), shape=(n_docs, vocab_size))

csr_direct = time_construction("Direct CSR", direct_csr)

# Method 3: LIL (row-wise construction)
def lil_construction():
    lil = lil_matrix((n_docs, vocab_size))
    for doc_id, word_id, count in triplets:
        lil[doc_id, word_id] = count
    return lil.tocsr()

# Note: LIL is slow for this many insertions; we'll skip on large data
print("LIL construction: [skipped - too slow for 2M insertions]")

# Method 4: DOK (dictionary construction)
def dok_construction():
    dok = dok_matrix((n_docs, vocab_size))
    for doc_id, word_id, count in triplets:
        dok[doc_id, word_id] = count
    return dok.tocsr()

# Also slow for many insertions
print("DOK construction: [skipped - too slow for 2M insertions]")

# Verify matrices are equivalent
print("\n--- Verification ---")
print(f"COO→CSR and Direct CSR are equal: {(csr_from_coo != csr_direct).nnz == 0}")

# Memory comparison
print("\n--- Memory Usage ---")
def csr_memory(m):
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

print(f"CSR memory: {csr_from_coo.data.nbytes/1e6:.1f} MB (data)")
print(f"          + {csr_from_coo.indices.nbytes/1e6:.1f} MB (indices)")
print(f"          + {csr_from_coo.indptr.nbytes/1e3:.1f} KB (indptr)")
print(f"          = {csr_memory(csr_from_coo)/1e6:.1f} MB total")

dense_size = n_docs * vocab_size * 8 / 1e9
print(f"\nDense equivalent: {dense_size:.1f} GB")
print(f"Compression: {dense_size*1000/csr_memory(csr_from_coo)*1e6:.0f}x")
```

Scikit-learn and other ML libraries have excellent sparse matrix support, but you need to know which algorithms maintain sparsity and which don't.
| Algorithm/Transform | Sparse Input | Sparse Output | Notes |
|---|---|---|---|
| CountVectorizer | N/A (text) | ✓ | Always outputs CSR |
| TfidfVectorizer | N/A (text) | ✓ | Always outputs CSR |
| TfidfTransformer | ✓ | ✓ | Preserves sparsity |
| LogisticRegression | ✓ | N/A | Efficient on sparse data |
| SGDClassifier | ✓ | N/A | Designed for sparse data |
| MultinomialNB | ✓ | N/A | Expects sparse counts |
| RandomForest | ✓ (slow) | N/A | Works but not optimized |
| SVC (kernel) | ✓ | N/A | Linear kernel only for speed |
| StandardScaler | ✓ (with_mean=False) | ✓ | Centering destroys sparsity! |
| MaxAbsScaler | ✓ | ✓ | Preserves sparsity |
| PCA | ✗ | ✗ | Converts to dense; use TruncatedSVD |
| TruncatedSVD | ✓ | ✗ | Sparse input, dense output |
StandardScaler with with_mean=True (the default!) subtracts the mean from every entry, turning all zeros into non-zeros. For sparse data, always use with_mean=False or use MaxAbsScaler. This is a common performance trap.
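The same caution applies to the dimensionality-reduction rows in the table above: TruncatedSVD consumes a sparse matrix directly, whereas PCA would densify it first. A minimal sketch (the random matrix and component count here are arbitrary illustrations):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random CSR matrix standing in for a TF-IDF document-term matrix
X = sparse_random(1_000, 5_000, density=0.01, format='csr', random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)   # sparse in, dense (1000, 50) out

print(X_reduced.shape)
print(f"Explained variance captured: {svd.explained_variance_ratio_.sum():.2%}")
```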
```python
import numpy as np
from scipy.sparse import issparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

# Load text dataset
print("Loading 20 Newsgroups dataset...")
categories = ['comp.graphics', 'sci.med', 'rec.sport.baseball', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

X_text = newsgroups.data
y = newsgroups.target

print(f"Documents: {len(X_text)}")
print(f"Categories: {len(set(y))}")

# Vectorize
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X_sparse = vectorizer.fit_transform(X_text)

print(f"\nSparse matrix shape: {X_sparse.shape}")
print(f"Density: {X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.4%}")
print(f"Is sparse: {issparse(X_sparse)}")

# Demonstrate sparse-aware scaling
print("\n--- Scaling Comparison ---")

# WRONG: StandardScaler destroys sparsity
scaler_std = StandardScaler(with_mean=True, with_std=True)
try:
    X_std = scaler_std.fit_transform(X_sparse)
except ValueError as e:
    print(f"StandardScaler(with_mean=True): ERROR - {str(e)[:50]}...")

# RIGHT: StandardScaler with with_mean=False
scaler_std_nomean = StandardScaler(with_mean=False)
X_std_nomean = scaler_std_nomean.fit_transform(X_sparse)
print(f"StandardScaler(with_mean=False): sparse={issparse(X_std_nomean)}")

# RIGHT: MaxAbsScaler preserves sparsity
scaler_maxabs = MaxAbsScaler()
X_maxabs = scaler_maxabs.fit_transform(X_sparse)
print(f"MaxAbsScaler: sparse={issparse(X_maxabs)}")

# Benchmark different classifiers on sparse data
print("\n--- Classifier Comparison (5-fold CV) ---")

classifiers = [
    ('LogisticRegression', LogisticRegression(max_iter=1000, random_state=42)),
    ('SGDClassifier', SGDClassifier(random_state=42)),
    ('MultinomialNB', MultinomialNB()),
]

for name, clf in classifiers:
    scores = cross_val_score(clf, X_sparse, y, cv=5, scoring='accuracy')
    print(f"{name:20s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Complete sparse-aware pipeline
print("\n--- Complete Pipeline ---")
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('scaler', MaxAbsScaler()),  # Sparse-safe scaler
    ('classifier', LogisticRegression(max_iter=1000))
])

scores = cross_val_score(pipeline, X_text, y, cv=5)
print(f"Pipeline accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Beyond basic operations, several advanced techniques are essential for production sparse matrix handling.
```python
import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz, vstack
from scipy.sparse.linalg import norm, svds
import tempfile
import os

# Create a sample sparse matrix
np.random.seed(42)
n_rows, n_cols = 10000, 5000
density = 0.01

rows = np.random.randint(0, n_rows, int(n_rows * n_cols * density))
cols = np.random.randint(0, n_cols, int(n_rows * n_cols * density))
data = np.random.rand(len(rows))
X = csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

print(f"Sparse matrix: {X.shape}, nnz={X.nnz:,}, density={X.nnz/(n_rows*n_cols):.4%}")

# 1. Efficient I/O with NPZ format
print("\n--- Sparse I/O ---")
with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, 'matrix.npz')

    # Save
    save_npz(filepath, X)
    file_size = os.path.getsize(filepath)
    print(f"Saved to NPZ: {file_size / 1024:.1f} KB")

    # Load
    X_loaded = load_npz(filepath)
    print(f"Loaded: shape={X_loaded.shape}, nnz={X_loaded.nnz}")
    print(f"Matrices equal: {(X != X_loaded).nnz == 0}")

# 2. Sparse matrix norms
print("\n--- Sparse Norms ---")
frobenius = norm(X, 'fro')
l1_norm = np.abs(X).sum()
max_norm = np.abs(X).max()

print(f"Frobenius norm: {frobenius:.4f}")
print(f"L1 norm (sum of |elements|): {l1_norm:.4f}")
print(f"Max norm (largest |element|): {max_norm:.4f}")

# 3. Truncated SVD on sparse matrices
print("\n--- Truncated SVD (sparse-efficient) ---")
k = 100  # Number of components
U, s, Vt = svds(X, k=k)

print(f"U: {U.shape}, s: {s.shape}, Vt: {Vt.shape}")
print(f"Explained variance ratio (approximate): {np.sum(s**2) / (norm(X, 'fro')**2):.2%}")

# 4. Chunked processing for memory efficiency
print("\n--- Chunked Processing ---")

def process_in_chunks(X, chunk_size=1000, process_func=None):
    """
    Process a sparse matrix in row chunks.

    Useful when full matrix operations exceed memory.
    """
    n_rows = X.shape[0]
    results = []
    for start in range(0, n_rows, chunk_size):
        end = min(start + chunk_size, n_rows)
        chunk = X[start:end]
        if process_func is not None:
            result = process_func(chunk)
        else:
            result = chunk.mean(axis=1).A1  # Example: row means
        results.append(result)
    return np.concatenate(results)

# Compute row means in chunks
row_means_chunked = process_in_chunks(X, chunk_size=1000)
row_means_direct = np.array(X.mean(axis=1)).flatten()

print(f"Chunked row means: {row_means_chunked[:5]}")
print(f"Direct row means: {row_means_direct[:5]}")
print(f"Results match: {np.allclose(row_means_chunked, row_means_direct)}")

# 5. Sparse matrix statistics
print("\n--- Sparse Matrix Statistics ---")

# Row-wise statistics
row_nnz = np.diff(X.indptr)  # Non-zeros per row
print(f"Non-zeros per row: min={row_nnz.min()}, max={row_nnz.max()}, mean={row_nnz.mean():.1f}")

# Column-wise statistics (need CSC for efficiency)
X_csc = X.tocsc()
col_nnz = np.diff(X_csc.indptr)
print(f"Non-zeros per column: min={col_nnz.min()}, max={col_nnz.max()}, mean={col_nnz.mean():.1f}")
```

Always verify your pipeline maintains sparsity: from scipy.sparse import issparse; assert issparse(X_transformed). Add this after every transformation during development. It catches silent densification that kills performance.
We've explored sparse representations comprehensively—from the fundamental concept through storage formats, operations, construction patterns, and integration with ML pipelines. The key insights:
- Text data is overwhelmingly sparse: typical document-term matrices are 95-99.9% zeros, so sparse storage saves memory by factors of roughly 50-1000×.
- Match the format to the access pattern: CSR for row slicing, matrix-vector products, and ML algorithms; CSC for column operations; COO for batch construction; LIL and DOK only for small incremental builds.
- Guard against silent densification: avoid .toarray()/.todense() on large matrices, scale with MaxAbsScaler or StandardScaler(with_mean=False), and use TruncatedSVD instead of PCA.
Looking Ahead:
With sparse representations mastered, the next page examines the limitations of Bag of Words—where this foundational approach fails and what techniques address those failures. Understanding BoW's weaknesses is essential for knowing when to reach for more sophisticated representations like word embeddings or transformers.
You now have comprehensive knowledge of sparse representations—the data structures that make Bag of Words computationally feasible at scale. From storage formats through operations to ML pipeline integration, you understand how to work efficiently with high-dimensional text data. Next, we'll examine where BoW falls short and what alternatives have emerged.