When practitioners need to visualize high-dimensional data, two algorithms dominate the conversation: t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection). Both produce striking visualizations that reveal cluster structure invisible to linear methods like PCA. Both have become essential tools in fields from single-cell genomics to deep learning interpretability.
Yet despite superficially similar outputs, these algorithms rest on fundamentally different mathematical foundations and exhibit distinct behaviors that matter deeply in practice. This page provides a rigorous comparative analysis, moving beyond the surface-level "UMAP is faster" to understand the theoretical roots of their differences and practical implications for choosing between them.
Understanding these distinctions elevates you from a tool user to a tool expert—someone who knows not just how to apply these methods, but when and why each excels.
By the end of this page, you will understand: (1) The theoretical foundations distinguishing t-SNE and UMAP, (2) Empirical differences in structure preservation, speed, and scalability, (3) When to choose each method based on your specific requirements, and (4) Common misconceptions and pitfalls when interpreting results from either method.
Both t-SNE and UMAP aim to preserve neighborhood structure during dimensionality reduction, but they conceptualize this goal through completely different mathematical lenses.
t-SNE: Information-Theoretic Foundation
t-SNE (van der Maaten & Hinton, 2008) frames dimensionality reduction as an information-theoretic problem:
Convert high-D distances to conditional probabilities: $$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$
Symmetrize: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$
Define low-D affinities using Student-t distribution: $$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$
Minimize KL divergence: $\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$
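The four steps above can be sketched directly in NumPy. This is a toy sketch with a single fixed $\sigma$ for all points; real t-SNE binary-searches each $\sigma_i$ to hit the target perplexity.

```python
import numpy as np

def tsne_quantities(X, Y, sigma=1.0):
    """Compute P (high-D affinities), Q (low-D Student-t affinities),
    and KL(P || Q) as defined above."""
    n = X.shape[0]

    # p_{j|i}: Gaussian affinities, normalized per row, diagonal excluded
    d2_hi = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -d2_hi / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)      # exp(-inf) = 0 excludes j = i
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)

    # Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2n), so P sums to 1
    P = (P_cond + P_cond.T) / (2.0 * n)

    # q_ij: Student-t kernel with a single GLOBAL normalization
    d2_lo = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + d2_lo)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()

    # KL(P || Q) over i != j (zero-probability pairs contribute nothing)
    mask = P > 0
    kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
    return P, Q, kl

rng = np.random.default_rng(0)
P, Q, kl = tsne_quantities(rng.normal(size=(20, 10)), rng.normal(size=(20, 2)))
print(P.sum(), Q.sum(), kl)
```

Note how both $P$ and $Q$ are normalized to sum to 1 over all pairs; this global normalization is exactly the coupling discussed below.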
UMAP: Topological Foundation
UMAP (McInnes, Healy & Melville, 2018) frames dimensionality reduction as a topological problem:

Build local fuzzy neighborhoods: $$\mu_{j|i} = \exp\left(-\frac{\max(0,\, d(x_i, x_j) - \rho_i)}{\sigma_i}\right)$$ where $\rho_i$ is the distance to $x_i$'s nearest neighbor and $\sigma_i$ is a local scale factor.

Symmetrize via fuzzy union: $\mu_{ij} = \mu_{j|i} + \mu_{i|j} - \mu_{j|i}\,\mu_{i|j}$

Define low-D membership strengths with a smooth kernel: $\nu_{ij} = (1 + a\,\|y_i - y_j\|^{2b})^{-1}$, where $a$ and $b$ are fit from min_dist and spread.

Minimize fuzzy set cross-entropy: $$\sum_{i \neq j} \left[ \mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log \frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right]$$
The key philosophical difference: t-SNE optimizes probability distributions (normalized to sum to 1), while UMAP optimizes membership functions (independent fuzzy degrees).
| Aspect | t-SNE | UMAP |
|---|---|---|
| Mathematical basis | Information theory | Algebraic topology + category theory |
| High-D representation | Probability distribution over pairs | Fuzzy simplicial set |
| Low-D representation | Probability distribution (normalized) | Fuzzy simplicial set (unnormalized) |
| Objective function | KL divergence | Fuzzy set cross-entropy |
| Core assumption | Local Gaussian neighborhoods | Uniform distribution on Riemannian manifold |
| Theoretical grounding | Empirical (works well in practice) | Principled (derived from axioms) |
t-SNE's normalization (probabilities sum to 1) means that learning about one relationship affects all others. If points A and B get closer, everyone else must get relatively farther. UMAP's unnormalized formulation treats relationships independently—A and B getting closer doesn't directly affect C and D. This fundamental difference explains many empirical observations about global structure preservation.
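The coupling can be seen in a toy calculation (not either algorithm's actual code): under a global softmax normalization, boosting one pair's similarity shrinks every other pair's share, while independent sigmoid memberships leave the rest untouched.

```python
import numpy as np

sims = np.array([5.0, 1.0, 1.0, 1.0])   # similarity scores for four pairs

def softmax(s):
    # Normalized affinities: all values compete for a fixed budget of 1
    e = np.exp(s - s.max())
    return e / e.sum()

def sigmoid(s):
    # Independent fuzzy memberships: each value stands alone
    return 1.0 / (1.0 + np.exp(-s))

q_before, m_before = softmax(sims), sigmoid(sims)

sims2 = sims.copy()
sims2[0] += 2.0                          # pair 0 becomes more similar

q_after, m_after = softmax(sims2), sigmoid(sims2)

print("softmax: other pairs shrank ->", np.all(q_after[1:] < q_before[1:]))   # True
print("sigmoid: other pairs fixed  ->", np.allclose(m_after[1:], m_before[1:]))  # True
```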
One of the most practically important differences between t-SNE and UMAP is their treatment of global structure—the large-scale arrangement of clusters and their relative positions.
t-SNE's Global Structure Problem:
t-SNE is notorious for distorting global structure. Clusters that are far apart in high-D may appear at arbitrary distances in the t-SNE embedding. The relative positions of clusters often don't reflect their true relationships.
Why? Two key factors:
Global normalization: The $q_{ij}$ probabilities must sum to 1. This creates a "crowding problem" where points compete for probability mass, forcing arbitrary arrangements.
KL divergence asymmetry: $\mathrm{KL}(P \,\|\, Q)$ heavily penalizes $q_{ij} < p_{ij}$ (missing neighbors) but not $q_{ij} > p_{ij}$ (false neighbors). t-SNE prioritizes preserving local structure over global.
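The asymmetry is easy to see numerically: a neighbor that the embedding misses (large $p$, small $q$) contributes orders of magnitude more to the KL loss than a spurious closeness (small $p$, large $q$).

```python
import numpy as np

def kl_term(p, q):
    """One pair's contribution to KL(P || Q)."""
    return p * np.log(p / q)

# Nearby in high-D but placed far apart in the embedding: large penalty
missing_neighbor = kl_term(0.01, 0.0001)

# Far apart in high-D but placed close in the embedding: tiny (even negative) term
false_neighbor = kl_term(0.0001, 0.01)

print(f"missing neighbor: {missing_neighbor:.5f}")
print(f"false neighbor:   {false_neighbor:.6f}")
```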
Never interpret distances between clusters in t-SNE embeddings as meaningful! Cluster A appearing twice as far from cluster B as from cluster C tells you nothing about the true high-dimensional relationships. Only within-cluster structure is reliable.
UMAP's Better Global Structure:
UMAP generally preserves global structure more faithfully:
No global normalization: Relationships are treated independently. Bringing clusters together doesn't force others apart.
Explicit repulsion term: The $(1-\mu)\log(1-\nu)$ term in the cross-entropy (where $\mu$ and $\nu$ are the high- and low-dimensional membership strengths) directly pushes unrelated points apart without normalization side effects.
Spectral initialization: Starting from graph Laplacian eigenvectors captures global structure that the optimization preserves.
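As a concrete illustration of the initialization point, scikit-learn's TSNE accepts init="pca", a deterministic, globally informed start conceptually analogous to UMAP's default spectral initialization, versus the classic init="random" (a minimal sketch on synthetic blobs):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two well-separated blobs in 10-D
X = np.vstack([rng.normal(0.0, 1.0, (40, 10)),
               rng.normal(8.0, 1.0, (40, 10))])

# PCA init: global layout is seeded by the data's principal axes
emb_pca = TSNE(n_components=2, perplexity=15, init="pca",
               random_state=0).fit_transform(X)

# Random init: global layout is left entirely to the optimizer
emb_rand = TSNE(n_components=2, perplexity=15, init="random",
                random_state=0).fit_transform(X)

print(emb_pca.shape, emb_rand.shape)
```

With informed initialization, the coarse arrangement of clusters tends to be more stable across runs; recent work suggests much of UMAP's global-structure advantage over t-SNE traces back to this choice.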
Empirical Evidence:
Studies comparing t-SNE and UMAP on known ground-truth data consistently show that UMAP better preserves large-scale properties such as the relative placement of clusters and coarse inter-cluster distances.
However, "better" doesn't mean "perfect"—UMAP's global structure is also approximate, especially for very high-dimensional data.
Both t-SNE and UMAP prioritize local structure preservation—keeping nearby points nearby. Here they perform more similarly, though with characteristic differences.
Quantifying Local Structure:
Local structure preservation is typically measured by:
Trustworthiness: Are the k nearest neighbors in the embedding also true neighbors in high-D? $$T(k) = 1 - \frac{2}{nk(2n-3k-1)} \sum_{i=1}^n \sum_{j \in U_k(i)} (r(i,j) - k)$$ where $U_k(i)$ is the set of k nearest neighbors of $i$ in the embedding that are not among its k nearest neighbors in high-D, and $r(i,j)$ is the rank of $j$ among point $i$'s high-D neighbors.
Continuity: Are the k nearest neighbors in high-D also neighbors in the embedding?
k-NN accuracy: If we train k-NN classifiers on both spaces, how well do predictions match?
| Metric | t-SNE | UMAP | Notes |
|---|---|---|---|
| Trustworthiness (k=10) | 0.95-0.99 | 0.94-0.98 | Both excellent; t-SNE slightly better |
| Continuity (k=10) | 0.92-0.97 | 0.93-0.98 | UMAP slightly better |
| k-NN accuracy | High | High | Comparable for small k |
| Neighborhood overlap | High for small k | High for small k | Both degrade for large k |
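The trustworthiness score defined above has a reference implementation in scikit-learn (`sklearn.manifold.trustworthiness`). A minimal sketch, using a PCA embedding as the method under test and a structure-free random embedding as a baseline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
# Anisotropic data: most variance lives in the first few dimensions,
# so a good 2-D embedding can preserve most neighborhoods
X = rng.normal(size=(200, 30)) * (5.0 * 0.5 ** np.arange(30))

emb_pca = PCA(n_components=2, random_state=0).fit_transform(X)
emb_rand = rng.normal(size=(200, 2))   # baseline: ignores the data entirely

t_pca = trustworthiness(X, emb_pca, n_neighbors=10)
t_rand = trustworthiness(X, emb_rand, n_neighbors=10)
print(f"trustworthiness  PCA: {t_pca:.3f}  random: {t_rand:.3f}")
```

The same call works for t-SNE or UMAP output: pass the original data and the embedding, and compare scores at the neighborhood size k you care about.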
Key Differences in Local Structure:
| Aspect | t-SNE | UMAP |
|---|---|---|
| Cluster tightness | Very tight, compact clusters | Slightly looser clusters |
| Cluster separation | Sharp boundaries | More gradual transitions |
| Within-cluster density | Relatively uniform | Can show internal structure |
| Outliers | Often absorbed into clusters | More likely to remain isolated |
t-SNE tends to "inflate" clusters to approximately equal sizes, potentially obscuring true density differences. UMAP retains somewhat more of the relative density variation, though neither method preserves density faithfully.
t-SNE's perplexity and UMAP's n_neighbors serve analogous roles—controlling the local scale. Perplexity is defined as $2^{H(P_i)}$, the exponential of the entropy of point $i$'s neighbor distribution, so it directly measures the effective number of neighbors; the common defaults (perplexity 30, n_neighbors 15) operate at broadly similar local scales. The correspondence is only approximate, however, so tuning each method's parameter independently is recommended.
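The definition can be checked numerically: a uniform distribution over $k$ neighbors has perplexity exactly $k$, while a peaked distribution has fewer effective neighbors (a minimal sketch):

```python
import numpy as np

def perplexity(p):
    """2 ** H(p): the exponential (base 2) of the Shannon entropy."""
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))   # entropy in bits
    return 2.0 ** H

# Uniform over 30 neighbors -> perplexity 30 (exactly, up to float error)
uniform_30 = np.full(30, 1 / 30)
print(perplexity(uniform_30))

# A peaked distribution concentrates mass on a few neighbors,
# so its effective neighbor count is much smaller than its length
peaked = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
print(perplexity(peaked))
```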
Cluster Boundary Behavior:
A notable difference: t-SNE creates very sharp cluster boundaries, while UMAP allows more gradual transitions.
For data with hard cluster boundaries (e.g., distinct cell types), this difference is minor. For data with continuous structure (e.g., developmental trajectories), UMAP's softer boundaries may be more appropriate.
For practitioners, computational efficiency often determines which method is practical. UMAP holds a significant advantage here.
Complexity Analysis:
| Operation | t-SNE (exact) | t-SNE (Barnes-Hut) | UMAP |
|---|---|---|---|
| Nearest neighbors | O(n²) | O(n log n) | O(n log n) or O(n) |
| Per-iteration cost | O(n²) | O(n log n) | O(n × k × neg) |
| Space complexity | O(n²) | O(n) | O(n × k) |
| Typical runtime (100k pts) | Hours | 30-60 minutes | 2-5 minutes |
| Typical runtime (1M pts) | Days | Hours | 15-30 minutes |
Where: $n$ is the number of points, $k$ is the neighborhood size (n_neighbors), and neg is the number of negative samples per edge.
Why UMAP is Faster:
Negative sampling: Instead of computing all O(n²) pairwise repulsions, UMAP samples O(n × k × neg) pairs. For n=100k, k=15, neg=5: 7.5M operations vs. 10B operations.
Approximate nearest neighbors: UMAP uses NNDescent or Annoy for near-linear neighbor finding.
No normalization: Each iteration doesn't require O(n²) sums across all pairs.
Numba JIT compilation: UMAP's implementation uses just-in-time compilation for the core loops.
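The negative-sampling arithmetic above is easy to verify:

```python
# Per-iteration pair interactions, using the figures from the text
n, k, neg = 100_000, 15, 5

pairs_sampled = n * k * neg   # UMAP: sampled attractive/repulsive pairs
pairs_all = n * n             # t-SNE (exact): every pairwise repulsion

print(f"sampled pairs: {pairs_sampled:,}")      # 7,500,000
print(f"all pairs:     {pairs_all:,}")          # 10,000,000,000
print(f"reduction:     {pairs_all / pairs_sampled:.0f}x")
```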
```python
"""Empirical timing comparison between UMAP and t-SNE.

This script demonstrates the practical speed difference
you can expect when embedding datasets of various sizes."""

import time

import numpy as np
from sklearn.manifold import TSNE
import umap


def timing_comparison(n_samples_list=[1000, 5000, 10000, 50000]):
    """
    Compare UMAP and t-SNE execution times.

    Note: For very large datasets, t-SNE may run out of memory
    or take impractically long times.
    """
    results = []

    for n in n_samples_list:
        print(f"Testing with n={n} samples...")

        # Generate random high-dimensional data
        np.random.seed(42)
        X = np.random.randn(n, 100).astype(np.float32)

        # Time UMAP
        start = time.time()
        umap_model = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
        _ = umap_model.fit_transform(X)
        umap_time = time.time() - start
        print(f"  UMAP: {umap_time:.2f}s")

        # Time t-SNE (scikit-learn uses Barnes-Hut by default)
        if n <= 10000:  # Skip for very large n
            start = time.time()
            tsne = TSNE(n_components=2, perplexity=30, random_state=42)
            _ = tsne.fit_transform(X)
            tsne_time = time.time() - start
            print(f"  t-SNE: {tsne_time:.2f}s")
        else:
            tsne_time = float('inf')
            print(f"  t-SNE: skipped (too slow)")

        results.append({
            'n_samples': n,
            'umap_seconds': umap_time,
            'tsne_seconds': tsne_time,
            'speedup': tsne_time / umap_time if tsne_time < float('inf') else None
        })

    return results


def memory_comparison(n_samples=50000, n_features=100):
    """
    Compare memory footprint.

    UMAP's sparse graph representation uses much less memory
    than t-SNE's dense probability matrix.
    """
    import tracemalloc

    X = np.random.randn(n_samples, n_features).astype(np.float32)

    # UMAP memory
    tracemalloc.start()
    umap_model = umap.UMAP(n_neighbors=15, low_memory=True)
    _ = umap_model.fit_transform(X)
    umap_current, umap_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"UMAP peak memory: {umap_peak / 1e9:.2f} GB")

    # Exact t-SNE would require approximately:
    # n^2 float32 values for the P matrix = n^2 * 4 bytes
    tsne_estimated = n_samples ** 2 * 4 / 1e9
    print(f"t-SNE estimated memory (exact): {tsne_estimated:.2f} GB")

    return {
        'umap_peak_gb': umap_peak / 1e9,
        'tsne_exact_estimated_gb': tsne_estimated
    }


if __name__ == "__main__":
    # Run timing comparison
    timing_results = timing_comparison([1000, 5000, 10000])

    print("=== Summary ===")
    for r in timing_results:
        if r['speedup']:
            print(f"n={r['n_samples']}: UMAP is {r['speedup']:.1f}x faster than t-SNE")
```

Both methods have GPU-accelerated implementations (RAPIDS cuML for UMAP, tsne-cuda for t-SNE) that dramatically improve speed for large datasets. With GPUs, million-point embeddings become feasible for either method in minutes.
Both t-SNE and UMAP involve stochastic elements that affect reproducibility. Understanding these helps manage expectations and ensure comparable results.
Sources of Randomness:
| Component | t-SNE | UMAP | Impact |
|---|---|---|---|
| Initialization | Random (default) | Spectral (default) | UMAP more deterministic start |
| Optimization | SGD with momentum | SGD with negative sampling | Both have random noise |
| Nearest neighbors | Exact (slow) or approximate | Always approximate | Approximate adds variability |
| Gradient estimation | Exact (slow) or Barnes-Hut | Negative sampling | Sampling adds variance |
Practical Implications:
| Aspect | t-SNE | UMAP |
|---|---|---|
| Same seed → same result? | Yes (exact), Mostly (BH) | Yes (within library version) |
| Local structure consistency | Excellent | Excellent |
| Global arrangement consistency | Poor | Moderate |
| Run-to-run cluster identity | Usually consistent | Usually consistent |
| Run-to-run cluster position | Highly variable | Moderately variable |
For reproducible results: (1) Always set random_state/random_seed, (2) Use the same library version, (3) Document all parameters, (4) Consider running multiple times and reporting ensemble results for high-stakes analyses.
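A minimal sketch of point (1), using scikit-learn's t-SNE: with a fixed random_state and the same library version, repeated runs on the same machine should coincide.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 10)).astype(np.float32)

# Same data, same parameters, same seed -> same embedding
emb_a = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

print("runs match:", np.allclose(emb_a, emb_b))
```

Omitting random_state (or changing library versions) breaks this guarantee, which is why pinning both is listed above.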
Stability with Respect to Perturbations:
How stable are embeddings when data changes slightly?
UMAP's ability to embed new points into an existing embedding is a significant practical advantage for applications like real-time visualization of streaming data.
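A sketch of the API difference: scikit-learn's TSNE exposes fit_transform only, with no transform() for unseen points, whereas umap-learn's UMAP implements the full fit/transform estimator API (shown in comments here to keep the sketch dependency-free).

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
print(hasattr(tsne, "fit_transform"))  # the only way to embed with t-SNE
print(hasattr(tsne, "transform"))      # no out-of-sample mapping exists

# With umap-learn installed, the streaming pattern looks like:
#   reducer = umap.UMAP(n_neighbors=15).fit(X_train)
#   new_coords = reducer.transform(X_new)   # embed unseen points
```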
Both methods have hyperparameters that significantly affect output. Understanding sensitivity helps avoid misleading visualizations.
t-SNE Hyperparameters:
| Parameter | Range | Effect | Sensitivity |
|---|---|---|---|
| perplexity | 5-50 (typically 30) | Effective neighborhood size | HIGH - changes cluster structure |
| learning_rate | 100-1000 (typically 200) | Optimization step size | Medium - affects convergence |
| n_iter | 250-2000 | Optimization iterations | Medium - must be enough to converge |
| early_exaggeration | 4-50 (typically 12) | Initial attractive force multiplier | Low - mainly affects speed |
UMAP Hyperparameters:
| Parameter | Range | Effect | Sensitivity |
|---|---|---|---|
| n_neighbors | 5-100 (typically 15) | Local neighborhood size | HIGH - changes connectivity/clusters |
| min_dist | 0.0-0.99 (typically 0.1) | Minimum embedding distance | HIGH - changes cluster tightness |
| n_epochs | 200-500 | Optimization iterations | Low - more = better convergence |
| metric | euclidean, cosine, etc. | High-D distance measure | Can be HIGH for some data types |
| spread | 0.5-3.0 (typically 1.0) | Embedding scale | Low - mostly visual adjustment |
It's possible to find parameter settings that support almost any desired narrative. Different perplexity/n_neighbors values can make the same data look clustered or continuous, show 3 clusters or 7. Always report parameters, try multiple settings, and be skeptical of "perfect" results.
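The practical defense against this pitfall is a parameter sweep: inspect several settings side by side rather than trusting one. A minimal sketch with scikit-learn's t-SNE (the same loop applies to UMAP's n_neighbors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Three synthetic blobs in 8-D
X = np.vstack([rng.normal(i * 4.0, 1.0, (30, 8)) for i in range(3)])

# Re-embed at several perplexities; conclusions should survive all of them
embeddings = {}
for perp in (5, 15, 30):
    embeddings[perp] = TSNE(n_components=2, perplexity=perp,
                            random_state=0).fit_transform(X)
    print(f"perplexity={perp}: embedding shape {embeddings[perp].shape}")
```

Structure that appears at only one setting deserves skepticism; structure that persists across the sweep is far more likely to be real.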
Comparing Sensitivity:
In practice, UMAP is often considered more forgiving:
t-SNE can produce dramatically different results with perplexity changes, sometimes requiring careful tuning for each dataset.
Given the analysis above, when should you choose t-SNE vs. UMAP? Here's practical guidance:
Use Case Decision Framework:
| Use Case | Recommended | Reason |
|---|---|---|
| Exploratory visualization | Either | Both reveal cluster structure well |
| Publication figure | UMAP (or both) | UMAP's global structure is more interpretable |
| Single-cell RNA-seq | Either (field uses both) | Consider journal/reviewer expectations |
| Real-time dashboard | UMAP | Transform support, speed |
| Pre-processing for clustering | UMAP | Better global structure for downstream tasks |
| Word embeddings visualization | Both (compare) | Different insights from each |
| Very high dimensions (>10k) | UMAP | Speed; consider PCA first |
| Streaming data | UMAP | Transform support |
| Detecting subtle local structure | t-SNE | Strong local preservation |
For important analyses, run both t-SNE and UMAP. If they agree, you can be confident in the structure. If they disagree, investigate why—the disagreement often reveals something important about the data (e.g., multi-scale structure that one method captures better).
Both methods are commonly misused. Understanding these pitfalls prevents incorrect conclusions.
t-SNE and UMAP are for visualization—helping humans see patterns. They are NOT for quantitative analysis. If you're doing clustering, regression, or any numerical analysis, work in the original high-dimensional space or use appropriate dimensionality reduction (like PCA) that preserves distances.
Common Misconceptions:
| Misconception | Reality |
|---|---|
| "UMAP is always better than t-SNE" | Both have strengths; choice depends on use case |
| "Clusters in the plot are real clusters" | Visual clusters suggest structure but don't prove it |
| "I can compare positions across runs" | Only compare with fixed random seed and parameters |
| "More iterations always helps" | After convergence, more iterations add runtime without benefit |
| "Perplexity 30 is always correct" | Optimal perplexity is data-dependent |
| "UMAP preserves global distances" | UMAP is better than t-SNE at global structure but still distorts |
This comprehensive comparison has highlighted the theoretical and practical distinctions between UMAP and t-SNE: different mathematical foundations (information-theoretic versus topological), UMAP's advantages in global structure, speed, and out-of-sample transforms, t-SNE's very tight local clusters, and both methods' sensitivity to hyperparameters and stochasticity.
What's Next:
The final page of this module covers hyperparameter tuning for UMAP—providing detailed guidance on how n_neighbors, min_dist, and other parameters affect results, with practical strategies for finding optimal settings for your specific use case.
You now have a comprehensive understanding of how UMAP and t-SNE compare—theoretically, empirically, and practically. This knowledge enables you to make informed choices between these methods and correctly interpret their outputs.