In an era of exponentially growing data, storage and bandwidth are precious resources. A single genomics experiment might produce terabytes of data. A recommendation system might track billions of user interactions. A sensor network might generate millions of readings per second. Storing, transmitting, and processing this data at full fidelity is often impossible—or at least prohibitively expensive.
Compression addresses this challenge by finding more compact representations of data. Traditional compression algorithms (gzip, PNG, JPEG) exploit statistical redundancy in individual data points. Dimensionality reduction offers a fundamentally different approach: it exploits structure across the dataset to compress the entire collection.
The key insight: if your data lies on or near a low-dimensional subspace of the original feature space, storing the subspace coordinates plus the subspace itself is far more efficient than storing full-dimensional vectors. A dataset of 1 million samples with 10,000 features each might be well approximated using only 50 principal components, achieving a roughly 200× compression ratio while preserving most of the information.
This page explores dimensionality reduction as a compression strategy: when it works, how much compression is achievable, the relationship between compression and reconstruction quality, and the tradeoffs with traditional compression methods.
By the end of this page, you will understand how dimensionality reduction achieves compression, the relationship between compression ratio and reconstruction quality, how to optimize for different fidelity-efficiency tradeoffs, and when DR-based compression is preferable to traditional methods. You'll gain practical insight into implementing compression pipelines for high-dimensional data.
Compression via dimensionality reduction follows a simple encoder-decoder framework:
Encoding (Compression):
Center each sample by subtracting the mean vector, then project it onto the top k principal directions, producing a k-dimensional code z for each d-dimensional input x.
Decoding (Reconstruction):
Multiply the code by the projection matrix and add the mean back, yielding an approximation x̂ of the original sample.
What Must Be Stored:
The codes for all n samples, plus the one-time decoder parameters: the k × d projection matrix and the d-dimensional mean vector.
Storage Analysis for PCA:
Original data: n × d floating point numbers
Compressed representation: n × k codes, plus a k × d projection matrix and a d-dimensional mean vector (n·k + k·d + d numbers in total)
Compression Ratio:
$$\text{Compression Ratio} = \frac{n \cdot d}{n \cdot k + k \cdot d + d} \approx \frac{d}{k}$$
When n is large (many samples), the one-time storage cost of the projection matrix becomes negligible, and compression ratio approaches d/k.
Example:
```python
import numpy as np

def analyze_compression_ratio(n_samples, original_dim, compressed_dim, dtype_bytes=8):
    """
    Calculate compression ratio and storage requirements for PCA compression.

    Parameters:
    -----------
    n_samples : int
        Number of data points
    original_dim : int
        Original feature dimensionality
    compressed_dim : int
        Number of principal components to keep
    dtype_bytes : int
        Bytes per number (8 for float64, 4 for float32)

    Returns:
    --------
    dict with storage analysis
    """
    # Original storage
    original_bytes = n_samples * original_dim * dtype_bytes

    # Compressed storage
    codes_bytes = n_samples * compressed_dim * dtype_bytes          # z vectors
    components_bytes = compressed_dim * original_dim * dtype_bytes  # PC matrix
    mean_bytes = original_dim * dtype_bytes                         # Mean vector
    compressed_total = codes_bytes + components_bytes + mean_bytes

    # Compression ratio
    ratio = original_bytes / compressed_total
    asymptotic_ratio = original_dim / compressed_dim  # For large n

    return {
        'original_GB': original_bytes / 1e9,
        'compressed_GB': compressed_total / 1e9,
        'codes_GB': codes_bytes / 1e9,
        'overhead_GB': (components_bytes + mean_bytes) / 1e9,
        'compression_ratio': ratio,
        'asymptotic_ratio': asymptotic_ratio,
        'overhead_percentage': 100 * (components_bytes + mean_bytes) / compressed_total
    }

# Example scenarios
scenarios = [
    {'name': 'Genomics (small)', 'n': 1000, 'd': 20000, 'k': 50},
    {'name': 'Genomics (large)', 'n': 100000, 'd': 20000, 'k': 50},
    {'name': 'Images (features)', 'n': 1000000, 'd': 4096, 'k': 128},
    {'name': 'Text embeddings', 'n': 10000000, 'd': 768, 'k': 64},
    {'name': 'Sensor data', 'n': 100000000, 'd': 100, 'k': 10},
]

print("PCA Compression Analysis")
print("=" * 80)
for s in scenarios:
    result = analyze_compression_ratio(s['n'], s['d'], s['k'])
    print(f"\n{s['name']}:")
    print(f"  Samples: {s['n']:,}, Dims: {s['d']} -> {s['k']}")
    print(f"  Original: {result['original_GB']:.2f} GB")
    print(f"  Compressed: {result['compressed_GB']:.4f} GB")
    print(f"  Compression ratio: {result['compression_ratio']:.1f}x")
    print(f"  Overhead (PC matrix): {result['overhead_percentage']:.1f}%")
```

For small datasets, the overhead of storing the projection matrix dominates: the k × d component matrix outweighs the n × k codes whenever n < d, so the achieved ratio falls well short of the asymptotic d/k (and there are no savings at all once n drops to around k). In that regime, dimensionality reduction for compression makes little sense, though it may still be valuable for other purposes like visualization or denoising.
Dimensionality reduction is inherently lossy compression: information is discarded and cannot be perfectly recovered. This fundamentally differs from lossless compression (gzip, PNG) where the original data is exactly recoverable.
The Key Insight:
In lossy compression, we trade reconstruction quality for compression ratio. More aggressive compression (smaller k) means a higher compression ratio but a larger reconstruction error; more conservative compression (larger k) preserves detail at the cost of storage.
Rate-Distortion Theory:
Information theory provides a rigorous framework for thinking about lossy compression. The rate-distortion curve characterizes the fundamental tradeoff:
For a given source distribution, there exists a function D(R) giving the minimum achievable distortion at rate R. For Gaussian data, PCA (the Karhunen–Loève transform) combined with suitable bit allocation across components approaches this limit.
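As a reference point from information theory (a standard result, not derived on this page): for a single Gaussian source with variance σ² under squared-error distortion, the rate-distortion function has a closed form:

$$D(R) = \sigma^2 \, 2^{-2R}, \qquad R(D) = \frac{1}{2}\log_2\frac{\sigma^2}{D} \quad \text{for } 0 \le D \le \sigma^2$$

Each additional bit of rate cuts the minimum achievable distortion by a factor of four.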
Reconstruction Error Decomposition:
When we keep k principal components:
$$\text{Reconstruction Error} = \sum_{i=k+1}^{d} \lambda_i$$
where the λ_i are the eigenvalues of the data covariance matrix, sorted in decreasing order. The expected per-sample error equals the sum of the discarded eigenvalues, which is precisely the variance we did not capture.
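This identity is easy to verify numerically. The sketch below assumes numpy and scikit-learn and uses synthetic low-rank data; note that sklearn's explained_variance_ divides by n − 1, so it is rescaled to match the per-sample error formula above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d, k = 2000, 50, 10
# Low-rank structure plus noise (illustrative data, not from the page)
X = rng.standard_normal((n, 20)) @ rng.standard_normal((20, d)) \
    + 0.1 * rng.standard_normal((n, d))

pca = PCA().fit(X)
Z_k = pca.transform(X)[:, :k]                   # keep k components
X_hat = Z_k @ pca.components_[:k] + pca.mean_   # reconstruct

mse_per_sample = np.mean(np.sum((X - X_hat) ** 2, axis=1))
# sklearn's eigenvalues use an (n - 1) denominator; rescale to the 1/n convention
discarded = pca.explained_variance_[k:].sum() * (n - 1) / n

print(f"reconstruction error:          {mse_per_sample:.4f}")
print(f"sum of discarded eigenvalues:  {discarded:.4f}")  # the two values agree
```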
For many applications, numerical reconstruction error doesn't capture what matters. JPEG discards high-frequency information humans can't perceive. Similarly, for ML features, what matters is preserving predictive information for downstream tasks—not exact numerical fidelity. Evaluate compression quality on the end task, not just MSE.
How many components should we keep? The answer depends on your constraints and goals:
Variance-Based Selection:
The simplest approach: keep enough components to capture a specified fraction of variance.
$$\text{Explained Variance Ratio}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$
Common thresholds are 80% (aggressive compression), 90% (a typical balance), and 95–99% (conservative, near-lossless for most purposes).
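scikit-learn supports this criterion directly: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on the digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Keep just enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_, "components retain",
      f"{pca_95.explained_variance_ratio_.sum():.3f} of the variance")
```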
Reconstruction Error Target:
Alternatively, specify a maximum acceptable reconstruction error (MSE or RMSE) and find the minimum k that achieves it.
Downstream Task Performance:
The most principled approach: evaluate compressed representations on your actual downstream task.
This accounts for task-specific importance of different directions—some variance might be irrelevant for your goal.
Storage Constraint:
If storage is the primary constraint:
$$k_{\max} = \left\lfloor \frac{B/b - d}{n + d} \right\rfloor$$
where B is the budget in bytes and b is bytes per number. This follows from requiring the total storage (n·k + k·d + d)·b ≤ B and solving for k.
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def select_components_for_compression(X, y=None,
                                      variance_thresholds=[0.95, 0.90, 0.80],
                                      storage_budget_gb=None):
    """
    Analyze component selection for compression.
    Returns k values for different selection criteria.
    """
    n, d = X.shape

    # Fit full PCA
    pca = PCA()
    pca.fit(X)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    eigenvalues = pca.explained_variance_

    results = {}

    # Variance-based selection
    for thresh in variance_thresholds:
        k = np.argmax(cumvar >= thresh) + 1
        ratio = d / k
        results[f'variance_{int(thresh*100)}pct'] = {
            'k': k,
            'actual_variance': cumvar[k-1],
            'compression_ratio': ratio
        }

    # MSE-based selection (relative to max MSE = sum of all eigenvalues)
    total_var = sum(eigenvalues)
    for mse_fraction in [0.05, 0.10, 0.20]:  # Target 5%, 10%, 20% of total MSE
        target_mse = mse_fraction * total_var
        cumulative_discarded = total_var - np.cumsum(eigenvalues)
        k = np.argmax(cumulative_discarded <= target_mse) + 1
        results[f'mse_{int(mse_fraction*100)}pct'] = {
            'k': k,
            'actual_mse_fraction': cumulative_discarded[k-1] / total_var,
            'compression_ratio': d / k
        }

    # Storage budget selection
    if storage_budget_gb is not None:
        budget_bytes = storage_budget_gb * 1e9
        bytes_per_float = 8  # float64
        # Solve: n*k + k*d + d <= budget/bytes_per_float
        max_floats = budget_bytes / bytes_per_float
        # Linear in k: k*(n + d) <= max_floats - d
        #   =>  k <= (max_floats - d) / (n + d)
        k_max = int((max_floats - d) / (n + d))
        k_max = max(1, min(k_max, d))
        results['storage_budget'] = {
            'k': k_max,
            'variance': cumvar[k_max-1] if k_max <= len(cumvar) else 1.0,
            'compression_ratio': d / k_max
        }

    # Task-based selection (if labels provided)
    if y is not None:
        scores = []
        k_values = list(range(1, min(51, d+1)))  # Test up to 50 components
        for k in k_values:
            X_compressed = pca.transform(X)[:, :k]
            clf = LogisticRegression(max_iter=1000, random_state=42)
            score = np.mean(cross_val_score(clf, X_compressed, y, cv=5))
            scores.append(score)

        # Find "elbow" - where adding components gives diminishing returns
        best_k = k_values[np.argmax(scores)]

        # Also find minimum k for 99% of best score
        max_score = max(scores)
        for i, k in enumerate(k_values):
            if scores[i] >= 0.99 * max_score:
                min_adequate_k = k
                break

        results['task_based'] = {
            'best_k': best_k,
            'best_score': max(scores),
            'min_adequate_k': min_adequate_k,
            'compression_ratio': d / min_adequate_k
        }

    return results, cumvar, eigenvalues

# Example
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target

results, cumvar, eigenvalues = select_components_for_compression(
    X, y, storage_budget_gb=0.001)

print("Component Selection Analysis for Digits Dataset")
print(f"Original dimensions: {X.shape[1]}")
print("=" * 60)
for method, info in results.items():
    print(f"\n{method}:")
    for key, val in info.items():
        if isinstance(val, float):
            print(f"  {key}: {val:.4f}")
        else:
            print(f"  {key}: {val}")
```

If you select k based on downstream task performance on the same data you'll deploy on, you risk overfitting the compression level. Use held-out validation data for selection, or use theoretical criteria that don't depend on the specific samples.
PCA provides optimal linear compression but cannot capture nonlinear structure. Autoencoders extend the encoder-decoder framework with neural networks, learning nonlinear projections that can achieve better compression for complex data.
Autoencoder Architecture:
$$\text{Encoder: } z = f_\theta(x) \in \mathbb{R}^k$$
$$\text{Decoder: } \hat{x} = g_\phi(z) \in \mathbb{R}^d$$
$$\text{Loss: } \mathcal{L} = \|x - \hat{x}\|^2$$
Where f and g are neural networks with learnable parameters θ and φ.
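A minimal sketch of this encoder-decoder pair, assuming PyTorch; the hidden width (256) and the dimensions used here are illustrative choices, not values prescribed by the page:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """d-dimensional input -> k-dimensional code -> d-dimensional reconstruction."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))
        self.decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x):
        z = self.encoder(x)      # compression: z = f_theta(x)
        return self.decoder(z)   # reconstruction: x_hat = g_phi(z)

model = Autoencoder(d=784, k=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 784)         # a dummy batch standing in for real data
x_hat = model(x)
loss = loss_fn(x_hat, x)         # squared reconstruction error
loss.backward()
optimizer.step()
```

A known reference point: with purely linear layers, the optimal autoencoder spans the same subspace as PCA; the nonlinearities are what allow it to do better on curved manifolds.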
Advantages Over PCA:
- Captures nonlinear structure, so for complex data (images, audio, curved manifolds) it can reach lower reconstruction error at the same code size k.
- The encoder and decoder architectures can be tailored to the data modality (convolutions for images, transformers for sequences).
Disadvantages:
- Requires iterative training, substantial data, and hyperparameter tuning; there is no closed-form solution.
- The latent dimensions are not ordered or orthogonal, so codes are not progressive by default, and results vary across training runs.
Variational Autoencoders (VAE):
VAEs add probabilistic structure, learning a distribution over codes rather than deterministic mappings. This enables:
- sampling new data by decoding draws from the prior (generation),
- a smoother, more regular latent space, since the KL term pulls codes toward a standard normal prior, and
- per-sample uncertainty estimates over the code.
For compression, VAEs provide rate-distortion optimal encoding when combined with entropy coding of the latent distribution.
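Continuing the PyTorch sketch, the two ingredients a VAE adds on top of a plain autoencoder are reparameterized sampling of the code and a KL penalty in the loss (standard formulation, shown schematically):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """ELBO-style loss: reconstruction error plus the KL divergence of the
    approximate posterior N(mu, sigma^2) from the standard normal prior."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```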
| Method | Type | Strengths | Best For |
|---|---|---|---|
| PCA | Linear | Fast, optimal for Gaussian, closed-form | General numeric data, baselines |
| Autoencoder | Nonlinear | Captures complex structure | Images, audio, complex manifolds |
| VAE | Probabilistic | Smooth latent space, generation | When generation or uncertainty needed |
| Sparse coding | Linear + sparse | Interpretable, part-based | When interpretability matters |
| Vector Quantization | Discrete codes | Very high compression | Extreme compression ratios |
Modern learned image codecs (using neural autoencoders with entropy models) now outperform classical JPEG and approach or exceed hand-engineered standards like HEVC/H.265. These methods learn end-to-end from data, automatically discovering what humans perceive as important to preserve.
A major application of dimensionality reduction compression is enabling fast similarity search at scale. Modern embedding-based search systems (image search, recommendation, semantic text search) must find nearest neighbors among billions of vectors. DR-based compression makes this tractable.
The Vector Search Problem:
Given:
- a query vector q with d dimensions, and
- a database of N vectors in the same space, where N may reach billions.
Find k vectors most similar to q (typically by Euclidean distance or cosine similarity).
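For scale, the exact brute-force baseline looks like this (a sketch with illustrative sizes; production systems face billions of vectors, not hundreds of thousands):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200_000, 128
database = rng.standard_normal((N, d)).astype(np.float32)  # ~100 MB at float32
query = rng.standard_normal(d).astype(np.float32)

# Exact search: one full-dimensional distance per database vector, O(N * d)
dists = np.linalg.norm(database - query, axis=1)
top5 = np.argsort(dists)[:5]
print("nearest indices:", top5)
```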
Why Compression Helps:
- Compressed vectors can fit entirely in RAM (or even cache), avoiding disk access during search.
- Each distance computation touches far fewer bytes, and memory bandwidth is usually the bottleneck at scale.
- Cheap approximate distances on the codes prune most candidates before any exact re-ranking of a short list.
Product Quantization (PQ):
A powerful technique combining dimensionality reduction with quantization:
1. Optionally rotate or PCA-project the vectors to spread information evenly across dimensions.
2. Split each d-dimensional vector into m contiguous subvectors.
3. Run k-means independently in each subspace to learn a small codebook (typically 256 centroids).
4. Store each vector as m one-byte centroid indices.
Compression ratio: d × 4 bytes (float32) → m × 1 byte = 4d/m compression. For d=768, m=8: 384× compression!
PQ-Based Search:
Distances can be computed approximately from the compressed codes using precomputed lookup tables—enabling billion-scale search on a single machine.
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

class SimpleProductQuantizer:
    """
    Simple Product Quantization implementation for vector compression.
    Splits vectors into subspaces and quantizes each separately.
    """

    def __init__(self, n_subspaces=8, n_clusters=256):
        """
        Parameters:
        -----------
        n_subspaces : int
            Number of subspaces to split vectors into
        n_clusters : int
            Codebook size per subspace (typically 256 for uint8 codes)
        """
        self.n_subspaces = n_subspaces
        self.n_clusters = n_clusters
        self.codebooks = []
        self.subspace_dims = None

    def fit(self, X):
        """Learn codebooks from training data."""
        n_samples, d = X.shape

        # Split dimensions evenly across subspaces
        self.subspace_dims = d // self.n_subspaces
        assert d % self.n_subspaces == 0, "Dimensions must be divisible by n_subspaces"

        self.codebooks = []
        for i in range(self.n_subspaces):
            # Extract subspace
            start = i * self.subspace_dims
            end = start + self.subspace_dims
            X_sub = X[:, start:end]

            # Learn codebook via k-means
            kmeans = MiniBatchKMeans(n_clusters=self.n_clusters,
                                     random_state=42, n_init='auto')
            kmeans.fit(X_sub)
            self.codebooks.append(kmeans)

        return self

    def encode(self, X):
        """Compress vectors to codes."""
        n_samples = X.shape[0]
        codes = np.zeros((n_samples, self.n_subspaces), dtype=np.uint8)

        for i, kmeans in enumerate(self.codebooks):
            start = i * self.subspace_dims
            end = start + self.subspace_dims
            codes[:, i] = kmeans.predict(X[:, start:end])

        return codes

    def decode(self, codes):
        """Decompress codes to vectors."""
        n_samples = codes.shape[0]
        d = self.n_subspaces * self.subspace_dims
        X_reconstructed = np.zeros((n_samples, d))

        for i, kmeans in enumerate(self.codebooks):
            start = i * self.subspace_dims
            end = start + self.subspace_dims
            X_reconstructed[:, start:end] = kmeans.cluster_centers_[codes[:, i]]

        return X_reconstructed

    def compression_ratio(self, d):
        """Calculate compression ratio."""
        original_bytes = d * 4               # float32
        compressed_bytes = self.n_subspaces  # one uint8 code per subspace
        return original_bytes / compressed_bytes

# Demo
np.random.seed(42)
n, d = 100000, 768  # Simulate embedding vectors

# Generate structured data (low intrinsic dimensionality)
true_dim = 32
V = np.random.randn(d, true_dim)
X = np.random.randn(n, true_dim) @ V.T
X = X + 0.1 * np.random.randn(n, d)  # Add noise

# PQ compression
pq = SimpleProductQuantizer(n_subspaces=8, n_clusters=256)
pq.fit(X[:10000])  # Train on subset

codes = pq.encode(X)
X_reconstructed = pq.decode(codes)

# Evaluate
mse = np.mean((X - X_reconstructed) ** 2)
total_var = np.var(X)

print(f"Vector Compression Results:")
print(f"  Original dimensions: {d}")
print(f"  Compression ratio: {pq.compression_ratio(d):.0f}x")
print(f"  Storage: {n*d*4/1e9:.2f} GB -> {n*8/1e9:.4f} GB")
print(f"  MSE: {mse:.4f}")
print(f"  Relative error: {np.sqrt(mse / total_var):.4f}")
```

Production vector search systems (Pinecone, Weaviate, Milvus, FAISS) use sophisticated combinations of PCA, Product Quantization, and inverted file indices. Google's ScaNN and Meta's FAISS achieve sub-millisecond search over billion-scale databases through these techniques. Understanding the compression principles helps you configure these systems effectively.
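The lookup-table idea described earlier (asymmetric distance computation) can be sketched on top of the SimpleProductQuantizer class from the listing; the commented usage line assumes the pq, codes, and X objects created in the demo above:

```python
import numpy as np

def pq_search(pq, codes, query, top_k=5):
    """Approximate nearest-neighbour search over PQ codes using
    per-subspace lookup tables (asymmetric distance computation)."""
    d_sub = pq.subspace_dims

    # One table per subspace: squared distance from the query sub-vector
    # to every centroid in that subspace's codebook.
    tables = []
    for i, kmeans in enumerate(pq.codebooks):
        q_sub = query[i * d_sub:(i + 1) * d_sub]
        diffs = kmeans.cluster_centers_ - q_sub       # (n_clusters, d_sub)
        tables.append(np.sum(diffs ** 2, axis=1))     # (n_clusters,)
    tables = np.stack(tables)                         # (n_subspaces, n_clusters)

    # Approximate squared distance = sum of table entries selected by each code.
    approx_dists = tables[np.arange(pq.n_subspaces), codes].sum(axis=1)
    return np.argsort(approx_dists)[:top_k]

# Hypothetical usage with the quantizer trained in the demo:
# neighbours = pq_search(pq, codes, X[0])
```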
Beyond storage efficiency, dimensionality reduction compression accelerates entire ML pipelines:
Training Speedup:
Many algorithms have time complexity that depends on the dimensionality d:
- Brute-force nearest-neighbor search: O(n·d) per query
- Linear models: O(n·d) per pass over the data
- Kernel methods: O(n²·d) to build the Gram matrix
- k-means: O(n·c·d) per iteration (c clusters)
Reducing d by 10× accelerates these algorithms proportionally—often with minimal accuracy loss.
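A rough way to check this on your own hardware is to time the same model on raw versus PCA-reduced features. This sketch uses synthetic data and illustrative sizes; actual speedups depend on the solver, hardware, and data:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic wide dataset: 500 features, only 50 of them informative
X, y = make_classification(n_samples=20000, n_features=500,
                           n_informative=50, random_state=0)

t0 = time.perf_counter()
LogisticRegression(max_iter=1000).fit(X, y)   # full 500-dimensional features
t_full = time.perf_counter() - t0

Z = PCA(n_components=50).fit_transform(X)     # 10x fewer dimensions
t0 = time.perf_counter()                      # (one-time PCA fit not timed here)
LogisticRegression(max_iter=1000).fit(Z, y)
t_reduced = time.perf_counter() - t0

print(f"full: {t_full:.2f}s | reduced: {t_reduced:.2f}s | "
      f"speedup: {t_full / t_reduced:.1f}x")
```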
Memory Efficiency During Training:
Deep learning training is often memory-bound:
- Input batches, activations, and gradients all scale with feature dimensionality.
- Optimizer state (e.g., Adam's moment estimates) grows with the size of the input-facing layers.
Compressing input features enables:
- larger batch sizes on the same hardware,
- keeping more (or all) of the dataset resident in memory, reducing I/O stalls, and
- training on memory-constrained devices.
Inference Latency:
For real-time applications, dimension reduction directly impacts latency:
- smaller inputs mean fewer multiply-accumulates in the first layers of a model,
- less data moves over the network and between CPU and accelerator, and
- distance computations in retrieval systems scale linearly with d.
Practical Considerations:
In practice, find the Pareto frontier of compression ratio, training/inference speed, and model quality. Often, moderate compression (4-10×) gives substantial speedup with negligible quality loss. More aggressive compression (10-100×) may require quality tradeoffs acceptable for non-critical applications.
An elegant property of PCA-based compression is progressivity: early components capture coarse structure, later components add fine detail. This enables hierarchical, multi-resolution representations.
Progressive Transmission:
Instead of sending all k components at once, send them progressively:
1. Transmit the first few components; the receiver can already form a coarse reconstruction.
2. Transmit the next block of components; the reconstruction sharpens.
3. Continue until all k components are delivered or the bandwidth/latency budget is exhausted.
The receiver can begin processing/displaying before transmission completes—crucial for bandwidth-limited or latency-sensitive applications.
Applications:
- progressive preview loading of images or embeddings over slow links,
- streaming analytics where an approximate answer now is worth more than an exact answer later, and
- tiered storage, with coarse codes kept hot and refinements archived.
Hierarchical Representations:
Store data at multiple resolutions:
- a coarse level with a handful of components (cheap, always in memory),
- an intermediate level with a few dozen components, and
- a fine level with hundreds of components (near-full fidelity).
Different queries access different levels: coarse filtering, clustering, and candidate screening read only the top level, while detailed analysis and final re-ranking drill down to the finer levels.
This mirrors database indexing strategies—trade precision for speed depending on query needs.
```python
import numpy as np
from sklearn.decomposition import PCA
import time

class ProgressiveCompressor:
    """
    Implements progressive/hierarchical compression using PCA.
    Enables multi-resolution storage and progressive transmission.
    """

    def __init__(self, levels=[4, 16, 64, 256]):
        """
        Parameters:
        -----------
        levels : list of int
            Number of components at each quality level (coarse to fine)
        """
        self.levels = sorted(levels)
        self.pca = None

    def fit(self, X):
        """Fit PCA on training data."""
        max_k = min(self.levels[-1], X.shape[0]-1, X.shape[1])
        self.pca = PCA(n_components=max_k)
        self.pca.fit(X)

        # Adjust levels if they exceed max_k
        self.levels = [l for l in self.levels if l <= max_k]
        return self

    def compress_hierarchical(self, X):
        """
        Compress data to hierarchical representation.
        Returns dict with compressed representation at each level.
        """
        codes = self.pca.transform(X)
        return {
            level: codes[:, :level]
            for level in self.levels
        }

    def decompress(self, codes, level=None):
        """Decompress from specified level."""
        if level is None:
            level = codes.shape[1]
        k = min(level, codes.shape[1])
        return codes[:, :k] @ self.pca.components_[:k, :] + self.pca.mean_

    def simulate_progressive_transmission(self, X, bandwidth_mbps=1.0):
        """
        Simulate progressive transmission with timing.

        Parameters:
        -----------
        X : array (n_samples, d)
        bandwidth_mbps : float
            Simulated bandwidth in MB/s
        """
        compressed = self.compress_hierarchical(X)
        n = X.shape[0]
        d = X.shape[1]

        print(f"Progressive Transmission Simulation")
        print(f"Original: {n} samples, {d} dimensions")
        print(f"Bandwidth: {bandwidth_mbps} MB/s")
        print("-" * 60)

        total_bytes = 0
        prev_level = 0

        for level in self.levels:
            # Calculate transmission size for this increment
            level_bytes = n * (level - prev_level) * 4  # float32
            time_seconds = level_bytes / (bandwidth_mbps * 1e6)
            total_bytes += level_bytes

            # Calculate quality
            reconstructed = self.decompress(compressed[level], level)
            mse = np.mean((X - reconstructed) ** 2)
            psnr = 10 * np.log10(np.max(X)**2 / (mse + 1e-10))
            variance_explained = sum(self.pca.explained_variance_ratio_[:level])

            print(f"Level {level:3d}: {level_bytes/1e3:8.1f} KB | "
                  f"Time: {time_seconds:.2f}s | "
                  f"Variance: {100*variance_explained:.1f}% | "
                  f"PSNR: {psnr:.1f} dB")

            prev_level = level

        # Compare to non-progressive
        full_bytes = n * d * 4
        print("-" * 60)
        print(f"Full transmission: {full_bytes/1e6:.1f} MB | "
              f"Time: {full_bytes/(bandwidth_mbps*1e6):.1f}s")

# Demo
np.random.seed(42)
n, d = 10000, 1000

# Generate data
X = np.random.randn(n, 50) @ np.random.randn(50, d) + 0.1 * np.random.randn(n, d)

compressor = ProgressiveCompressor(levels=[8, 32, 128, 256])
compressor.fit(X)
compressor.simulate_progressive_transmission(X, bandwidth_mbps=10.0)
```

Dimensionality reduction provides a powerful approach to data compression: it exploits structure across an entire dataset to find compact representations. Unlike lossless compression, DR-based compression is lossy, but it achieves compression ratios that lossless methods cannot reach while preserving the essential structure of the data.
Key takeaways from this page:
- DR-based compression stores per-sample codes plus a one-time decoder (projection matrix and mean); for large n the compression ratio approaches d/k.
- The compression is lossy: the reconstruction error equals the sum of the discarded eigenvalues, so the rate-distortion tradeoff is controlled by k.
- Choose k by explained-variance thresholds, a reconstruction-error target, downstream task performance, or a storage budget.
- Autoencoders extend the framework to nonlinear structure; product quantization pushes to extreme ratios and powers billion-scale vector search.
- Progressive (hierarchical) PCA codes enable coarse-to-fine transmission and multi-resolution storage.
Next up:
Having explored visualization, noise reduction, and compression as motivations for dimensionality reduction, we turn to the final foundational topic: Feature Extraction vs. Feature Selection. These represent fundamentally different philosophies for reducing dimensionality, with distinct tradeoffs in interpretability, computational cost, and information preservation.
You now understand dimensionality reduction as a compression strategy: the encoder-decoder framework, compression ratio analysis, selection criteria, neural extensions, applications in vector search, ML pipeline acceleration, and progressive transmission. You can apply these techniques to efficiently store, transmit, and process high-dimensional data.