t-SNE's beautiful objective function—minimizing KL divergence between probability distributions—hides a computationally challenging reality. The optimization landscape is highly non-convex, riddled with local minima, and sensitive to algorithmic choices that can make the difference between stunning visualizations and meaningless noise.
This page confronts these challenges head-on. We will explore the non-convex nature of the cost function, understand why initialization matters so critically, examine the "early exaggeration" trick that enables cluster formation, and develop practical strategies for achieving high-quality embeddings.
Mastering t-SNE optimization is essential because the same data with the same perplexity can produce wildly different visualizations depending on how you run the optimization. Understanding these issues transforms t-SNE from a black box into a tool you control with precision.
By the end of this page, you will:
• Understand why t-SNE's optimization landscape is non-convex
• Master initialization strategies and their effects on results
• Comprehend the early exaggeration phase and its purpose
• Know how to set learning rates and iteration counts
• Recognize when optimization has succeeded or failed
• Apply Barnes-Hut and other acceleration techniques
The t-SNE cost function, C = KL(P || Q), is not convex in the embedding coordinates Y. This has profound implications for optimization.
Why Non-Convexity Arises:
The Q distribution depends on pairwise distances in the embedding:
qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Z
where Z = Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹ is the normalization constant.
This creates complex dependencies: every qᵢⱼ depends on all pairwise distances through the shared normalization constant Z, so moving a single point changes every entry of Q (the short sketch below makes this concrete).
The result is a cost function with many local minima, saddle points, and plateaus.
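To see the coupling directly, here is a small NumPy sketch (illustrative only; `q_matrix` is a hypothetical helper, not library code) that computes Q for a random embedding, nudges a single point, and counts how many entries of Q change:

```python
import numpy as np

def q_matrix(Y):
    """Student-t similarities: q_ij = (1 + ||y_i - y_j||^2)^-1 / Z."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 2))

Q_before = q_matrix(Y)
Y[0] += 0.5                       # nudge a single point
Q_after = q_matrix(Y)

# Every off-diagonal q_ij changes, even for pairs that do not involve
# point 0, because the shared normalization constant Z has changed.
changed = np.sum(~np.isclose(Q_before, Q_after))
print(f"{changed} of {Q_before.size} entries changed")
```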
Unlike convex optimization where any local minimum is global, t-SNE genuinely has different local minima that produce qualitatively different embeddings. Running t-SNE twice with different random seeds can yield different-looking results. This is not a bug—it's a fundamental property of the optimization landscape.
Characteristics of the Landscape:
Symmetry breaking: The initial random embedding must "break symmetry" to form clusters. Points that start uniformly distributed have no reason to cluster until optimization creates separation.
Basin of attraction: Each local minimum has a "basin of attraction"—initializations within the basin converge to that minimum. Different basins lead to different cluster arrangements.
Invariances: The cost function depends only on pairwise distances, so it is unchanged by translations, rotations, and reflections of the embedding.
This means there's an infinite family of equivalent solutions.
Scale sensitivity: The absolute scale of the embedding matters because it affects qᵢⱼ values through the Student-t kernel.
Implications for Practice: Because different runs can land in different local minima, initialization, learning rate, and iteration count all materially affect the final picture; the rest of this page walks through each of these levers.
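As a quick demonstration of run-to-run variability, the sketch below (assuming scikit-learn's TSNE and the digits dataset as a stand-in for your data X) runs the same configuration with three different seeds and random initialization:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data   # stand-in dataset for illustration

# Same data, same perplexity - only the random seed differs
embeddings = []
for seed in (0, 1, 2):
    tsne = TSNE(n_components=2, perplexity=30,
                init='random', random_state=seed)
    embeddings.append(tsne.fit_transform(X))

# Each run settles into its own local minimum, so the layouts differ
# (beyond mere rotation or reflection). Compare them side by side
# with scatter plots to see qualitatively different arrangements.
```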
The initial embedding Y⁰ significantly influences the final result. Modern t-SNE implementations offer several initialization strategies, each with tradeoffs.
Random Initialization:
The simplest approach samples initial positions from a Gaussian distribution:
```python
import numpy as np

def random_initialization(N, d=2, scale=1e-4):
    """
    Initialize embedding with small random values.

    Args:
        N: Number of points
        d: Embedding dimensionality (usually 2)
        scale: Standard deviation of initialization (critical!)

    Returns:
        Y: Initial embedding (N x d)
    """
    Y = np.random.randn(N, d) * scale
    return Y

# Why small scale (1e-4)?
# - Prevents initial distances from being too large
# - Allows early exaggeration to work effectively
# - Points start "collapsed" and expand during optimization
```

Properties of Random Initialization: simple and fast, but not reproducible: different seeds land in different local minima, so results vary from run to run.
PCA Initialization (Recommended):
Initialize using the first d principal components of the data:
```python
import numpy as np
from sklearn.decomposition import PCA

def pca_initialization(X, d=2, scale=1e-4):
    """
    Initialize embedding using PCA.

    PCA provides a sensible starting point that respects
    the global variance structure of the data.

    Args:
        X: Input data (N x D)
        d: Embedding dimensionality
        scale: Scale factor for initialization

    Returns:
        Y: Initial embedding (N x d)
    """
    # Center the data
    X_centered = X - X.mean(axis=0)

    # PCA
    pca = PCA(n_components=d)
    Y = pca.fit_transform(X_centered)

    # Scale to small values (important!)
    Y = Y / np.std(Y) * scale

    return Y

# Why scale down?
# - t-SNE optimization works best when initial embedding is compact
# - Early exaggeration then "explodes" the embedding
# - This process creates better cluster separation
```

Properties of PCA Initialization: deterministic (the same data always gives the same starting point), respects the data's global variance structure, and tends to produce more reproducible embeddings than random initialization.
Spectral Initialization:
Some implementations offer spectral initialization based on the graph Laplacian:
```
# Spectral initialization idea:
1. Construct affinity matrix from high-D affinities
2. Compute normalized graph Laplacian L
3. Find smallest eigenvectors of L (excluding trivial eigenvector)
4. Use these as initial embedding

# Advantages:
- Respects neighborhood structure from the start
- Can lead to faster convergence

# Disadvantages:
- Computationally expensive for large datasets
- May not be necessary with proper optimization
```

Use PCA initialization (init='pca' in sklearn). It provides a good balance of reproducibility and quality. Random initialization is still useful if you want to explore different local minima intentionally. Always use a small initial scale regardless of initialization method.
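If you do want to experiment with spectral initialization, the sketch below is one possible realization using scikit-learn's SpectralEmbedding to compute the Laplacian eigenvectors; the helper name and the rescaling step are illustrative assumptions, not a standard t-SNE API.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def spectral_initialization(X, d=2, n_neighbors=15, scale=1e-4):
    """Hypothetical spectral initializer: Laplacian eigenmaps, rescaled to a compact start."""
    # Embed with the smallest non-trivial eigenvectors of the
    # normalized graph Laplacian of a k-nearest-neighbor graph
    se = SpectralEmbedding(n_components=d,
                           affinity='nearest_neighbors',
                           n_neighbors=n_neighbors,
                           random_state=42)
    Y = se.fit_transform(X)

    # Rescale so the embedding starts compact, as with the other initializers
    return Y / np.std(Y) * scale
```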
One of the cleverest tricks in t-SNE optimization is early exaggeration—a technique that artificially inflates the high-dimensional affinities during the initial phase of optimization.
The Mechanism:
During the first E iterations (typically 250), multiply all pᵢⱼ values by an exaggeration factor α (typically 4-12):
p̃ᵢⱼ = α · pᵢⱼ for iterations 1 to E
After iteration E, revert to normal: p̃ᵢⱼ = pᵢⱼ
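In code, the schedule is just a conditional scaling of the affinity matrix. A minimal sketch (`current_affinities` is a hypothetical helper mirroring the full optimization loop later on this page):

```python
def current_affinities(P, iteration, alpha=12.0, exag_iters=250):
    """Return the affinities used at this iteration of the optimization."""
    # During the first `exag_iters` iterations every p_ij is inflated by alpha
    # (early exaggeration); afterwards the true affinities are used again.
    return P * alpha if iteration < exag_iters else P
```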
Why This Works:
The intuition is beautifully simple:
Normal (unexaggerated) pᵢⱼ sum to 1: This means individual pᵢⱼ values are quite small for most pairs (on the order of 1/N² for a uniform distribution)
Small pᵢⱼ → weak attractive forces: Without exaggeration, attraction between neighbors is initially weak because pᵢⱼ is small
Compact initialization: Points start close together, so qᵢⱼ values are all similar and moderate
The problem: Weak attractions can't overcome the tendency for points to remain uniformly distributed
The solution: Multiply pᵢⱼ by α to create strong "super-attractive" forces between designated neighbors
Without early exaggeration:
- pᵢⱼ ≈ 0.0001 for neighbors (small)
- Attractive force ∝ pᵢⱼ - qᵢⱼ ≈ weak
- Points drift slowly; may get stuck in "uniform soup"

With early exaggeration (α = 12):
- p̃ᵢⱼ ≈ 0.0012 for neighbors (12× larger)
- Total p̃ᵢⱼ sums to α (not 1) → artificially "boosted" affinities
- Attractive force ∝ p̃ᵢⱼ - qᵢⱼ ≈ much stronger
- Strong attraction → points rapidly form tight clusters
- Clusters form quickly, establishing structure

After exaggeration ends:
- Revert to pᵢⱼ (normal affinities)
- Clusters are already formed; fine-tuning continues
- Balanced forces lead to stable final embedding

Physical Analogy:
Think of early exaggeration like heating a material before shaping it: while the system is "hot" (exaggerated affinities), points can rearrange freely and clusters take shape quickly; as it "cools" (affinities revert to normal), the established structure is refined and locked into place.
Practical Parameters:
| Parameter | Typical Value | Effect of Increasing | Effect of Decreasing |
|---|---|---|---|
| Exaggeration factor (α) | 12 | Tighter initial clusters, more separation | Weaker clustering effect, may not form clusters |
| Exaggeration iterations (E) | 250 | Longer cluster formation phase | Shorter phase, may not fully separate clusters |
| Learning rate during exaggeration | Higher | Faster convergence in early phase | Slower but more stable early phase |
Modern implementations (sklearn 1.2+) default to learning_rate='auto', which scales the learning rate with the dataset size (roughly N / early_exaggeration; see the exact formula below). This adaptive choice works well across dataset sizes and is now the recommended default. It's particularly important during the exaggeration phase.
```python
import numpy as np

def tsne_optimize(P, N, d=2, n_iter=1000,
                  early_exag_factor=12.0, early_exag_iter=250,
                  learning_rate=200.0,
                  momentum_init=0.5, momentum_final=0.8,
                  random_state=42):
    """
    Complete t-SNE optimization loop with early exaggeration.
    """
    np.random.seed(random_state)

    # Initialize embedding
    Y = np.random.randn(N, d) * 1e-4
    velocity = np.zeros_like(Y)

    for iteration in range(n_iter):
        # Early exaggeration phase
        if iteration < early_exag_iter:
            P_current = P * early_exag_factor
            momentum = momentum_init
        else:
            P_current = P
            momentum = momentum_final

        # Compute Q distribution
        Q, Q_numerator = compute_q_distribution(Y)

        # Compute gradient
        grad = compute_gradient(P_current, Q, Q_numerator, Y)

        # Gradient descent with momentum
        velocity = momentum * velocity - learning_rate * grad
        Y = Y + velocity

        # Center embedding (remove mean for numerical stability)
        Y = Y - Y.mean(axis=0)

        # Optional: compute and log cost every 50 iterations
        if iteration % 50 == 0:
            cost = compute_kl_divergence(P, Q)
            print(f"Iteration {iteration}: KL divergence = {cost:.4f}")

    return Y

def compute_q_distribution(Y):
    """Compute Student-t based Q distribution."""
    N = Y.shape[0]
    D = np.sum(Y**2, axis=1)[:, np.newaxis] + np.sum(Y**2, axis=1) - 2 * Y @ Y.T

    # Student-t kernel
    Q_numerator = 1 / (1 + D)
    np.fill_diagonal(Q_numerator, 0)

    Q = Q_numerator / np.sum(Q_numerator)
    Q = np.maximum(Q, 1e-12)
    return Q, Q_numerator

def compute_gradient(P, Q, Q_numerator, Y):
    """Compute gradient of KL divergence."""
    N = Y.shape[0]
    PQ_diff = P - Q
    grad = np.zeros_like(Y)

    for i in range(N):
        diff = Y[i] - Y  # (N, d)
        grad[i] = 4 * np.sum(
            (PQ_diff[i, :] * Q_numerator[i, :])[:, np.newaxis] * diff,
            axis=0
        )
    return grad

def compute_kl_divergence(P, Q):
    """KL(P || Q), ignoring zero entries of P."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

Proper tuning of the learning rate and momentum is essential for successful t-SNE optimization. These parameters control the speed and stability of gradient descent.
Learning Rate:
The learning rate (often denoted η or lr) determines the step size in each gradient descent iteration:
Y^(t+1) = Y^(t) - η · ∇C(Y^(t))
Too small: Slow convergence; may not reach a good solution in the allotted iterations
Too large: Unstable oscillations; the embedding may "explode" or fail to converge
Just right: Steady convergence to a well-structured embedding
| Dataset Size | Recommended Learning Rate | Notes |
|---|---|---|
| < 1,000 | 100-200 | Standard range works well |
| 1,000-10,000 | 200-500 | May need higher for larger datasets |
| 10,000-100,000 | N / early_exag (auto) | Use automatic learning rate |
| > 100,000 | N / early_exag (auto) | Automatic is essential |
In sklearn, set learning_rate='auto'. This computes learning_rate = max(N / early_exaggeration / 4, 50), which scales correctly with dataset size and is now the recommended default. If using older implementations, consider setting learning_rate ≈ N/12 as a starting point.
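For a sense of scale, the small sketch below reproduces the heuristic above for a few dataset sizes (the exact formula and clamping may differ slightly between library versions):

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    # Scale the learning rate with N, but never let it drop below 50
    return max(n_samples / early_exaggeration / 4, 50)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}  ->  learning rate ≈ {auto_learning_rate(n):,.0f}")
```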
Momentum:
Momentum accelerates gradient descent by accumulating a running average of gradients:
v^(t+1) = μ · v^(t) - η · ∇C(Y^(t))
Y^(t+1) = Y^(t) + v^(t+1)
where μ is the momentum coefficient (0 ≤ μ < 1).
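As a minimal sketch, a single momentum update can be written as a small helper (names are illustrative; this mirrors the update inside the full optimization loop above):

```python
def momentum_step(Y, velocity, grad, learning_rate=200.0, momentum=0.8):
    """One gradient-descent-with-momentum update: v <- mu*v - eta*grad, then Y <- Y + v."""
    velocity = momentum * velocity - learning_rate * grad
    return Y + velocity, velocity
```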
Why momentum helps t-SNE: it smooths out oscillations, speeds progress across the plateaus of the non-convex landscape, and helps points keep moving in a consistent direction while clusters are forming.
Standard momentum schedule:
```
# Two-phase momentum schedule (commonly used):

Phase 1 (Early exaggeration, iterations 0-250):
    momentum = 0.5
    # Lower momentum during cluster formation
    # Allows rapid restructuring without overshooting

Phase 2 (Main optimization, iterations 250+):
    momentum = 0.8
    # Higher momentum for efficient convergence
    # Accelerates "fine-tuning" of established structure

# Some implementations use a gradual increase:
momentum(t) = 0.5 + 0.3 * min(t / 250, 1)
```

Signs of Learning Rate Problems:
Learning rate too high: the KL divergence oscillates or increases, points overshoot and the embedding may "explode", and the final layout can look like structureless scatter.
Learning rate too low: the KL divergence decreases very slowly, clusters fail to separate within the iteration budget, and the embedding stays compressed.
If your t-SNE embedding looks bad, try: (1) increasing n_iter to 1000+, (2) using learning_rate='auto' or increasing learning rate, (3) checking that early exaggeration is enabled, (4) using init='pca'. Most failed embeddings come from insufficient iterations or suboptimal learning rate.
Knowing when t-SNE optimization has converged is important for obtaining high-quality embeddings without wasting computation.
Convergence Indicators:
```python
import matplotlib.pyplot as plt

def monitor_tsne_convergence(X, perplexity=30, n_iter=1500, verbose=True):
    """
    Run t-SNE while monitoring convergence metrics.
    Returns embedding and convergence history.
    """
    from sklearn.manifold import TSNE

    # Run t-SNE with verbose output
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        n_iter=n_iter,
        learning_rate='auto',
        init='pca',
        verbose=2 if verbose else 0,  # Shows KL divergence per iteration
        random_state=42
    )

    Y = tsne.fit_transform(X)

    # tsne.kl_divergence_ contains final KL divergence
    print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")

    return Y, tsne.kl_divergence_

# Typical convergence pattern:
# Iteration 50:   KL divergence = 4.5  (early exaggeration)
# Iteration 100:  KL divergence = 3.2  (clusters forming)
# Iteration 250:  KL divergence = 2.8  (end of exaggeration)
# Iteration 300:  KL divergence = 1.5  (jump down after exaggeration ends)
# Iteration 500:  KL divergence = 1.2  (fine-tuning)
# Iteration 1000: KL divergence = 1.1  (near convergence)
```

Understanding the KL Divergence Curve:
The KL divergence typically follows a characteristic pattern: it decreases gradually during early exaggeration, drops sharply once exaggeration ends, then declines slowly and flattens out as the embedding converges.
How Many Iterations?
| Scenario | Recommended Iterations | Rationale |
|---|---|---|
| Quick exploration | 250-500 | See rough structure; may not be fully optimized |
| Standard visualization | 1000 | Default; usually sufficient for good embedding |
| Publication-quality | 1500-2000 | Ensure full convergence |
| Large datasets (N > 50k) | 500-1000 | More iterations may not help; focus on other parameters |
| Reproducibility check | Same as primary run | Use same n_iter across runs for fair comparison |
Alternative Stopping Criteria:
While a fixed iteration count is most common, alternatives exist (a small code sketch follows the list):
Gradient norm: Stop when ||∇C|| < ε. Indicates forces are balanced.
Cost change: Stop when |C^(t) - C^(t-k)| / C^(t) < ε over last k iterations.
Embedding stability: Stop when embedding changes negligibly: ||Y^(t) - Y^(t-k)|| < ε.
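A minimal sketch of these checks (hypothetical helper functions, not part of any library):

```python
import numpy as np

def gradient_converged(grad, eps=1e-5):
    """Stop when the gradient norm is small (attractive and repulsive forces balance)."""
    return np.linalg.norm(grad) < eps

def cost_converged(cost_history, k=50, eps=1e-4):
    """Stop when the relative change in KL divergence over the last k iterations is tiny."""
    if len(cost_history) <= k:
        return False
    return abs(cost_history[-1] - cost_history[-1 - k]) / cost_history[-1] < eps

def embedding_stable(Y_new, Y_old, eps=1e-3):
    """Stop when the embedding itself barely moves between checkpoints."""
    return np.linalg.norm(Y_new - Y_old) < eps
```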
In practice, these are rarely needed—1000 iterations with good settings almost always suffices.
A converged embedding is not necessarily a good embedding. t-SNE can converge to poor local minima. Always visually inspect results, try multiple random seeds, and validate against known structure if available. Low final KL divergence suggests a good fit, but interpretation still requires care.
Naive t-SNE has O(N²) complexity per iteration—computing all pairwise interactions. For large datasets, this becomes impractical. The Barnes-Hut approximation reduces this to O(N log N), enabling t-SNE on datasets with tens of thousands to millions of points.
The Barnes-Hut Idea:
Barnes-Hut is a tree-based approximation originally developed for N-body gravitational simulations. The key insight: when computing the repulsive force on a point, a group of points that is sufficiently far away can be treated as a single point located at the group's center of mass.
How It Works for t-SNE:
```
Barnes-Hut t-SNE Algorithm:

1. BUILD TREE: Construct quad-tree of current embedding Y
   - Recursively subdivide space into quadrants
   - Each node stores: center of mass, total mass, bounds

2. COMPUTE ATTRACTIVE FORCES (unchanged):
   - For each point i, compute attraction to neighbors
   - Uses precomputed sparse P matrix (only k nearest neighbors)
   - Complexity: O(N × k) where k ≪ N

3. COMPUTE REPULSIVE FORCES (approximated):
   For each point i:
       Walk tree from root:
           If current node is "far enough" (θ criterion):
               Treat entire subtree as single point at center of mass
               Add approximate repulsive force
           Else:
               Recurse into children

   The θ parameter controls accuracy:
   - θ = 0:   Exact computation (no approximation)
   - θ = 0.5: Good balance of speed and accuracy (default)
   - θ = 1.0: Faster but less accurate

4. UPDATE: Apply gradient descent update as usual

Complexity: O(N log N) per iteration with good θ
```

The θ Parameter:
The Barnes-Hut approximation uses a parameter θ (angle) to decide when to approximate:
A tree node can be approximated if: (node size) / (distance to point) < θ
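In code, the acceptance test is just a ratio check; a small sketch with names chosen for illustration:

```python
def can_summarize(node_size, distance_to_point, theta=0.5):
    """Barnes-Hut criterion: a distant tree cell may be treated as a single
    point at its center of mass when (node size) / (distance) < theta."""
    if distance_to_point == 0.0:
        return False  # the query point sits inside the cell; recurse instead
    return node_size / distance_to_point < theta
```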
| Algorithm | Complexity per Iteration | Suitable Dataset Size | Notes |
|---|---|---|---|
| Exact t-SNE | O(N²) | < 5,000 | Accurate but slow |
| Barnes-Hut t-SNE | O(N log N) | 5,000 - 100,000 | Default in sklearn |
| FIt-SNE (FFT-accelerated) | O(N) | 100,000 - 1,000,000 | Uses FFT for repulsive forces |
| Approximate NN + BH | O(N log N) | 1,000,000+ | Combines ANN for P with BH |
In sklearn, Barnes-Hut is used when method='barnes_hut' (the default; it supports only n_components ≤ 3). Set the angle parameter to control θ (default 0.5). For N < 500, the exact method may be preferable. For very large datasets, consider FIt-SNE or openTSNE implementations.
```python
from sklearn.manifold import TSNE

# Barnes-Hut t-SNE for medium/large datasets
tsne_bh = TSNE(
    n_components=2,
    perplexity=30,
    method='barnes_hut',   # Default method (supports n_components <= 3)
    angle=0.5,             # Barnes-Hut angle (θ)
    n_iter=1000,
    learning_rate='auto',
    init='pca',
    random_state=42,
    n_jobs=-1              # Parallelize where possible
)

Y = tsne_bh.fit_transform(X)

# For small datasets, exact method may be better
tsne_exact = TSNE(
    n_components=2,
    perplexity=30,
    method='exact',        # Exact O(N²) computation
    n_iter=1000,
    random_state=42
)
```

t-SNE optimization is challenging but manageable with the right techniques. Let's consolidate the key lessons:
• Use PCA initialization (init='pca') with a small initial scale.
• Keep early exaggeration enabled; the defaults (α ≈ 12 for ~250 iterations) work well.
• Use learning_rate='auto' (or roughly N/12) and at least 1000 iterations.
• Monitor the KL divergence curve, visually inspect the result, and try multiple seeds when in doubt.
• Use Barnes-Hut (or FIt-SNE/openTSNE) for large datasets.
What's Next:
With a solid understanding of t-SNE's objective, perplexity, and optimization, we now turn to interpretation. t-SNE visualizations are often misread—cluster sizes don't mean anything, distances between clusters aren't reliable, and many visual patterns can be artifacts. The next page provides rigorous guidelines for correctly interpreting t-SNE results.
You now understand the optimization challenges in t-SNE and how to address them. With proper initialization, early exaggeration, learning rate, and sufficient iterations, you can reliably obtain high-quality embeddings. Next, we'll learn how to correctly interpret these visualizations.