t-SNE's beautiful objective function—minimizing KL divergence between probability distributions—hides a computationally challenging reality. The optimization landscape is highly non-convex, riddled with local minima, and sensitive to algorithmic choices that can make the difference between stunning visualizations and meaningless noise.
This page confronts these challenges head-on. We will explore the non-convex nature of the cost function, understand why initialization matters so critically, examine the "early exaggeration" trick that enables cluster formation, and develop practical strategies for achieving high-quality embeddings.
Mastering t-SNE optimization is essential because the same data with the same perplexity can produce wildly different visualizations depending on how you run the optimization. Understanding these issues transforms t-SNE from a black box into a tool you control with precision.
By the end of this page, you will:
• Understand why t-SNE's optimization landscape is non-convex
• Master initialization strategies and their effects on results
• Comprehend the early exaggeration phase and its purpose
• Know how to set learning rates and iteration counts
• Recognize when optimization has succeeded or failed
• Apply Barnes-Hut and other acceleration techniques
The t-SNE cost function, C = KL(P || Q), is not convex in the embedding coordinates Y. This has profound implications for optimization.
Why Non-Convexity Arises:
The Q distribution depends on pairwise distances in the embedding:
qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Z
where Z = Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹ is the normalization constant.
This creates complex dependencies: every qᵢⱼ depends on all pairwise distances through the shared normalization constant Z, so moving a single point changes every entry of Q (the short sketch below makes this concrete).
The result is a cost function with many local minima, saddle points, and plateaus.
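To see the coupling directly, here is a small NumPy sketch (illustrative only; `q_matrix` is a hypothetical helper, not library code) that computes Q for a random embedding, nudges a single point, and counts how many entries of Q change:

```python
import numpy as np

def q_matrix(Y):
    """Student-t similarities: q_ij = (1 + ||y_i - y_j||^2)^-1 / Z."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 2))

Q_before = q_matrix(Y)
Y[0] += 0.5                       # nudge a single point
Q_after = q_matrix(Y)

# Every off-diagonal q_ij changes, even for pairs that do not involve
# point 0, because the shared normalization constant Z has changed.
changed = np.sum(~np.isclose(Q_before, Q_after))
print(f"{changed} of {Q_before.size} entries changed")
```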
Unlike convex optimization where any local minimum is global, t-SNE genuinely has different local minima that produce qualitatively different embeddings. Running t-SNE twice with different random seeds can yield different-looking results. This is not a bug—it's a fundamental property of the optimization landscape.
Characteristics of the Landscape:
Symmetry breaking: The initial random embedding must "break symmetry" to form clusters. Points that start uniformly distributed have no reason to cluster until optimization creates separation.
Basin of attraction: Each local minimum has a "basin of attraction"—initializations within the basin converge to that minimum. Different basins lead to different cluster arrangements.
Invariances: The cost function depends only on pairwise distances, so it is unchanged by translations, rotations, and reflections of the embedding.
This means there's an infinite family of equivalent solutions.
Scale sensitivity: The absolute scale of the embedding matters because it affects qᵢⱼ values through the Student-t kernel.
Implications for Practice: Because different runs can land in different local minima, initialization, learning rate, and iteration count all materially affect the final picture; the rest of this page walks through each of these levers.
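As a quick demonstration of run-to-run variability, the sketch below (assuming scikit-learn's TSNE and the digits dataset as a stand-in for your data X) runs the same configuration with three different seeds and random initialization:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data   # stand-in dataset for illustration

# Same data, same perplexity - only the random seed differs
embeddings = []
for seed in (0, 1, 2):
    tsne = TSNE(n_components=2, perplexity=30,
                init='random', random_state=seed)
    embeddings.append(tsne.fit_transform(X))

# Each run settles into its own local minimum, so the layouts differ
# (beyond mere rotation or reflection). Compare them side by side
# with scatter plots to see qualitatively different arrangements.
```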
The initial embedding Y⁰ significantly influences the final result. Modern t-SNE implementations offer several initialization strategies, each with tradeoffs.
Random Initialization:
The simplest approach samples initial positions from a Gaussian distribution:
```python
import numpy as np

def random_initialization(N, d=2, scale=1e-4):
    """
    Initialize embedding with small random values.

    Args:
        N: Number of points
        d: Embedding dimensionality (usually 2)
        scale: Standard deviation of initialization (critical!)

    Returns:
        Y: Initial embedding (N x d)
    """
    Y = np.random.randn(N, d) * scale
    return Y

# Why small scale (1e-4)?
# - Prevents initial distances from being too large
# - Allows early exaggeration to work effectively
# - Points start "collapsed" and expand during optimization
```

Properties of Random Initialization: simple and fast, but not reproducible: different seeds land in different local minima, so results vary from run to run.
PCA Initialization (Recommended):
Initialize using the first d principal components of the data:
```python
import numpy as np
from sklearn.decomposition import PCA

def pca_initialization(X, d=2, scale=1e-4):
    """
    Initialize embedding using PCA.

    PCA provides a sensible starting point that respects
    the global variance structure of the data.

    Args:
        X: Input data (N x D)
        d: Embedding dimensionality
        scale: Scale factor for initialization

    Returns:
        Y: Initial embedding (N x d)
    """
    # Center the data
    X_centered = X - X.mean(axis=0)

    # PCA
    pca = PCA(n_components=d)
    Y = pca.fit_transform(X_centered)

    # Scale to small values (important!)
    Y = Y / np.std(Y) * scale

    return Y

# Why scale down?
# - t-SNE optimization works best when initial embedding is compact
# - Early exaggeration then "explodes" the embedding
# - This process creates better cluster separation
```

Properties of PCA Initialization: deterministic (the same data always gives the same starting point), respects the data's global variance structure, and tends to produce more reproducible embeddings than random initialization.
Spectral Initialization:
Some implementations offer spectral initialization based on the graph Laplacian:
```
# Spectral initialization idea:
1. Construct affinity matrix from high-D affinities
2. Compute normalized graph Laplacian L
3. Find smallest eigenvectors of L (excluding trivial eigenvector)
4. Use these as initial embedding

# Advantages:
- Respects neighborhood structure from the start
- Can lead to faster convergence

# Disadvantages:
- Computationally expensive for large datasets
- May not be necessary with proper optimization
```

Use PCA initialization (init='pca' in sklearn). It provides a good balance of reproducibility and quality. Random initialization is still useful if you want to explore different local minima intentionally. Always use a small initial scale regardless of initialization method.
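If you do want to experiment with spectral initialization, the sketch below is one possible realization using scikit-learn's SpectralEmbedding to compute the Laplacian eigenvectors; the helper name and the rescaling step are illustrative assumptions, not a standard t-SNE API.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def spectral_initialization(X, d=2, n_neighbors=15, scale=1e-4):
    """Hypothetical spectral initializer: Laplacian eigenmaps, rescaled to a compact start."""
    # Embed with the smallest non-trivial eigenvectors of the
    # normalized graph Laplacian of a k-nearest-neighbor graph
    se = SpectralEmbedding(n_components=d,
                           affinity='nearest_neighbors',
                           n_neighbors=n_neighbors,
                           random_state=42)
    Y = se.fit_transform(X)

    # Rescale so the embedding starts compact, as with the other initializers
    return Y / np.std(Y) * scale
```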
One of the cleverest tricks in t-SNE optimization is early exaggeration—a technique that artificially inflates the high-dimensional affinities during the initial phase of optimization.
The Mechanism:
During the first E iterations (typically 250), multiply all pᵢⱼ values by an exaggeration factor α (typically 4-12):
p̃ᵢⱼ = α · pᵢⱼ for iterations 1 to E
After iteration E, revert to normal: p̃ᵢⱼ = pᵢⱼ
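In code, the schedule is just a conditional scaling of the affinity matrix. A minimal sketch (`current_affinities` is a hypothetical helper mirroring the full optimization loop later on this page):

```python
def current_affinities(P, iteration, alpha=12.0, exag_iters=250):
    """Return the affinities used at this iteration of the optimization."""
    # During the first `exag_iters` iterations every p_ij is inflated by alpha
    # (early exaggeration); afterwards the true affinities are used again.
    return P * alpha if iteration < exag_iters else P
```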
Why This Works:
The intuition is beautifully simple:
Normal (unexaggerated) pᵢⱼ sum to 1: This means individual pᵢⱼ values are quite small for most pairs (on the order of 1/N² for a uniform distribution)
Small pᵢⱼ → weak attractive forces: Without exaggeration, attraction between neighbors is initially weak because pᵢⱼ is small
Compact initialization: Points start close together, so qᵢⱼ values are all similar and moderate
The problem: Weak attractions can't overcome the tendency for points to remain uniformly distributed
The solution: Multiply pᵢⱼ by α to create strong "super-attractive" forces between designated neighbors
Without early exaggeration:
- pᵢⱼ ≈ 0.0001 for neighbors (small)
- Attractive force ∝ pᵢⱼ - qᵢⱼ ≈ weak
- Points drift slowly; may get stuck in "uniform soup"

With early exaggeration (α = 12):
- p̃ᵢⱼ ≈ 0.0012 for neighbors (12× larger)
- Total p̃ᵢⱼ sums to α (not 1) → artificially "boosted" affinities
- Attractive force ∝ p̃ᵢⱼ - qᵢⱼ ≈ much stronger
- Strong attraction → points rapidly form tight clusters
- Clusters form quickly, establishing structure

After exaggeration ends:
- Revert to pᵢⱼ (normal affinities)
- Clusters are already formed; fine-tuning continues
- Balanced forces lead to stable final embedding

Physical Analogy:
Think of early exaggeration like heating a material before shaping it: while the system is "hot" (exaggerated affinities), points can rearrange freely and clusters take shape quickly; as it "cools" (affinities revert to normal), the established structure is refined and locked into place.
Practical Parameters:
| Parameter | Typical Value | Effect of Increasing | Effect of Decreasing |
|---|---|---|---|
| Exaggeration factor (α) | 12 | Tighter initial clusters, more separation | Weaker clustering effect, may not form clusters |
| Exaggeration iterations (E) | 250 | Longer cluster formation phase | Shorter phase, may not fully separate clusters |
| Learning rate during exaggeration | Higher | Faster convergence in early phase | Slower but more stable early phase |
Modern implementations (sklearn 1.2+) default to learning_rate='auto', which scales the learning rate with the dataset size (roughly N / early_exaggeration; see the exact formula below). This adaptive choice works well across dataset sizes and is now the recommended default. It's particularly important during the exaggeration phase.
```python
import numpy as np

def tsne_optimize(P, N, d=2, n_iter=1000,
                  early_exag_factor=12.0, early_exag_iter=250,
                  learning_rate=200.0,
                  momentum_init=0.5, momentum_final=0.8,
                  random_state=42):
    """
    Complete t-SNE optimization loop with early exaggeration.
    """
    np.random.seed(random_state)

    # Initialize embedding
    Y = np.random.randn(N, d) * 1e-4
    velocity = np.zeros_like(Y)

    for iteration in range(n_iter):
        # Early exaggeration phase
        if iteration < early_exag_iter:
            P_current = P * early_exag_factor
            momentum = momentum_init
        else:
            P_current = P
            momentum = momentum_final

        # Compute Q distribution
        Q, Q_numerator = compute_q_distribution(Y)

        # Compute gradient
        grad = compute_gradient(P_current, Q, Q_numerator, Y)

        # Gradient descent with momentum
        velocity = momentum * velocity - learning_rate * grad
        Y = Y + velocity

        # Center embedding (remove mean for numerical stability)
        Y = Y - Y.mean(axis=0)

        # Optional: compute and log cost every 50 iterations
        if iteration % 50 == 0:
            cost = compute_kl_divergence(P, Q)
            print(f"Iteration {iteration}: KL divergence = {cost:.4f}")

    return Y

def compute_q_distribution(Y):
    """Compute Student-t based Q distribution."""
    N = Y.shape[0]
    D = np.sum(Y**2, axis=1)[:, np.newaxis] + np.sum(Y**2, axis=1) - 2 * Y @ Y.T

    # Student-t kernel
    Q_numerator = 1 / (1 + D)
    np.fill_diagonal(Q_numerator, 0)

    Q = Q_numerator / np.sum(Q_numerator)
    Q = np.maximum(Q, 1e-12)
    return Q, Q_numerator

def compute_gradient(P, Q, Q_numerator, Y):
    """Compute gradient of KL divergence."""
    N = Y.shape[0]
    PQ_diff = P - Q
    grad = np.zeros_like(Y)

    for i in range(N):
        diff = Y[i] - Y  # (N, d)
        grad[i] = 4 * np.sum(
            (PQ_diff[i, :] * Q_numerator[i, :])[:, np.newaxis] * diff,
            axis=0
        )
    return grad

def compute_kl_divergence(P, Q):
    """KL(P || Q), ignoring zero entries of P."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

Proper tuning of the learning rate and momentum is essential for successful t-SNE optimization. These parameters control the speed and stability of gradient descent.
Learning Rate:
The learning rate (often denoted η or lr) determines the step size in each gradient descent iteration:
Y^(t+1) = Y^(t) - η · ∇C(Y^(t))
Too small: Slow convergence; may not reach a good solution in the allotted iterations
Too large: Unstable oscillations; the embedding may "explode" or fail to converge
Just right: Steady convergence to a well-structured embedding
| Dataset Size | Recommended Learning Rate | Notes |
|---|---|---|
| < 1,000 | 100-200 | Standard range works well |
| 1,000-10,000 | 200-500 | May need higher for larger datasets |
| 10,000-100,000 | N / early_exag (auto) | Use automatic learning rate |
| > 100,000 | N / early_exag (auto) | Automatic is essential |
In sklearn, set learning_rate='auto'. This computes learning_rate = max(N / early_exaggeration / 4, 50), which scales correctly with dataset size and is now the recommended default. If using older implementations, consider setting learning_rate ≈ N/12 as a starting point.
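For a sense of scale, the small sketch below reproduces the heuristic above for a few dataset sizes (the exact formula and clamping may differ slightly between library versions):

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    # Scale the learning rate with N, but never let it drop below 50
    return max(n_samples / early_exaggeration / 4, 50)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}  ->  learning rate ≈ {auto_learning_rate(n):,.0f}")
```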
Momentum:
Momentum accelerates gradient descent by accumulating a running average of gradients:
v^(t+1) = μ · v^(t) - η · ∇C(Y^(t))
Y^(t+1) = Y^(t) + v^(t+1)
where μ is the momentum coefficient (0 ≤ μ < 1).
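As a minimal sketch, a single momentum update can be written as a small helper (names are illustrative; this mirrors the update inside the full optimization loop above):

```python
def momentum_step(Y, velocity, grad, learning_rate=200.0, momentum=0.8):
    """One gradient-descent-with-momentum update: v <- mu*v - eta*grad, then Y <- Y + v."""
    velocity = momentum * velocity - learning_rate * grad
    return Y + velocity, velocity
```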
Why momentum helps t-SNE: it smooths out oscillations, speeds progress across the plateaus of the non-convex landscape, and helps points keep moving in a consistent direction while clusters are forming.
Standard momentum schedule:
```
# Two-phase momentum schedule (commonly used):

Phase 1 (Early exaggeration, iterations 0-250):
    momentum = 0.5
    # Lower momentum during cluster formation
    # Allows rapid restructuring without overshooting

Phase 2 (Main optimization, iterations 250+):
    momentum = 0.8
    # Higher momentum for efficient convergence
    # Accelerates "fine-tuning" of established structure

# Some implementations use a gradual increase:
momentum(t) = 0.5 + 0.3 * min(t / 250, 1)
```

Signs of Learning Rate Problems:
Learning rate too high: the KL divergence oscillates or increases, points overshoot and the embedding may "explode", and the final layout can look like structureless scatter.
Learning rate too low: the KL divergence decreases very slowly, clusters fail to separate within the iteration budget, and the embedding stays compressed.
If your t-SNE embedding looks bad, try: (1) increasing n_iter to 1000+, (2) using learning_rate='auto' or increasing learning rate, (3) checking that early exaggeration is enabled, (4) using init='pca'. Most failed embeddings come from insufficient iterations or suboptimal learning rate.
Knowing when t-SNE optimization has converged is important for obtaining high-quality embeddings without wasting computation.
Convergence Indicators:
```python
import matplotlib.pyplot as plt

def monitor_tsne_convergence(X, perplexity=30, n_iter=1500, verbose=True):
    """
    Run t-SNE while monitoring convergence metrics.
    Returns embedding and convergence history.
    """
    from sklearn.manifold import TSNE

    # Run t-SNE with verbose output
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        n_iter=n_iter,
        learning_rate='auto',
        init='pca',
        verbose=2 if verbose else 0,  # Shows KL divergence per iteration
        random_state=42
    )

    Y = tsne.fit_transform(X)

    # tsne.kl_divergence_ contains final KL divergence
    print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")

    return Y, tsne.kl_divergence_

# Typical convergence pattern:
# Iteration 50:   KL divergence = 4.5  (early exaggeration)
# Iteration 100:  KL divergence = 3.2  (clusters forming)
# Iteration 250:  KL divergence = 2.8  (end of exaggeration)
# Iteration 300:  KL divergence = 1.5  (jump down after exaggeration ends)
# Iteration 500:  KL divergence = 1.2  (fine-tuning)
# Iteration 1000: KL divergence = 1.1  (near convergence)
```

Understanding the KL Divergence Curve:
The KL divergence typically follows a characteristic pattern: it decreases gradually during early exaggeration, drops sharply once exaggeration ends, then declines slowly and flattens out as the embedding converges.
How Many Iterations?
| Scenario | Recommended Iterations | Rationale |
|---|---|---|
| Quick exploration | 250-500 | See rough structure; may not be fully optimized |
| Standard visualization | 1000 | Default; usually sufficient for good embedding |
| Publication-quality | 1500-2000 | Ensure full convergence |
| Large datasets (N > 50k) | 500-1000 | More iterations may not help; focus on other parameters |
| Reproducibility check | Same as primary run | Use same n_iter across runs for fair comparison |
Alternative Stopping Criteria:
While a fixed iteration count is most common, alternatives exist (a small code sketch follows the list):
Gradient norm: Stop when ||∇C|| < ε. Indicates forces are balanced.
Cost change: Stop when |C^(t) - C^(t-k)| / C^(t) < ε over last k iterations.
Embedding stability: Stop when embedding changes negligibly: ||Y^(t) - Y^(t-k)|| < ε.
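A minimal sketch of these checks (hypothetical helper functions, not part of any library):

```python
import numpy as np

def gradient_converged(grad, eps=1e-5):
    """Stop when the gradient norm is small (attractive and repulsive forces balance)."""
    return np.linalg.norm(grad) < eps

def cost_converged(cost_history, k=50, eps=1e-4):
    """Stop when the relative change in KL divergence over the last k iterations is tiny."""
    if len(cost_history) <= k:
        return False
    return abs(cost_history[-1] - cost_history[-1 - k]) / cost_history[-1] < eps

def embedding_stable(Y_new, Y_old, eps=1e-3):
    """Stop when the embedding itself barely moves between checkpoints."""
    return np.linalg.norm(Y_new - Y_old) < eps
```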
In practice, these are rarely needed—1000 iterations with good settings almost always suffices.
A converged embedding is not necessarily a good embedding. t-SNE can converge to poor local minima. Always visually inspect results, try multiple random seeds, and validate against known structure if available. Low final KL divergence suggests a good fit, but interpretation still requires care.
Naive t-SNE has O(N²) complexity per iteration—computing all pairwise interactions. For large datasets, this becomes impractical. The Barnes-Hut approximation reduces this to O(N log N), enabling t-SNE on datasets with tens of thousands to millions of points.
The Barnes-Hut Idea:
Barnes-Hut is a tree-based approximation originally developed for N-body gravitational simulations. The key insight: when computing the repulsive force on a point, a group of points that is sufficiently far away can be treated as a single point located at the group's center of mass.
How It Works for t-SNE:
```
Barnes-Hut t-SNE Algorithm:

1. BUILD TREE: Construct quad-tree of current embedding Y
   - Recursively subdivide space into quadrants
   - Each node stores: center of mass, total mass, bounds

2. COMPUTE ATTRACTIVE FORCES (unchanged):
   - For each point i, compute attraction to neighbors
   - Uses precomputed sparse P matrix (only k nearest neighbors)
   - Complexity: O(N × k) where k ≪ N

3. COMPUTE REPULSIVE FORCES (approximated):
   For each point i:
       Walk tree from root:
           If current node is "far enough" (θ criterion):
               Treat entire subtree as single point at center of mass
               Add approximate repulsive force
           Else:
               Recurse into children

   The θ parameter controls accuracy:
   - θ = 0:   Exact computation (no approximation)
   - θ = 0.5: Good balance of speed and accuracy (default)
   - θ = 1.0: Faster but less accurate

4. UPDATE: Apply gradient descent update as usual

Complexity: O(N log N) per iteration with good θ
```

The θ Parameter:
The Barnes-Hut approximation uses a parameter θ (angle) to decide when to approximate:
A tree node can be approximated if: (node size) / (distance to point) < θ
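In code, the acceptance test is just a ratio check; a small sketch with names chosen for illustration:

```python
def can_summarize(node_size, distance_to_point, theta=0.5):
    """Barnes-Hut criterion: a distant tree cell may be treated as a single
    point at its center of mass when (node size) / (distance) < theta."""
    if distance_to_point == 0.0:
        return False  # the query point sits inside the cell; recurse instead
    return node_size / distance_to_point < theta
```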
| Algorithm | Complexity per Iteration | Suitable Dataset Size | Notes |
|---|---|---|---|
| Exact t-SNE | O(N²) | < 5,000 | Accurate but slow |
| Barnes-Hut t-SNE | O(N log N) | 5,000 - 100,000 | Default in sklearn |
| FIt-SNE (FFT-accelerated) | O(N) | 100,000 - 1,000,000 | Uses FFT for repulsive forces |
| Approximate NN + BH | O(N log N) | 1,000,000+ | Combines ANN for P with BH |
In sklearn, Barnes-Hut is used when method='barnes_hut' (the default; it supports only n_components ≤ 3). Set the angle parameter to control θ (default 0.5). For N < 500, the exact method may be preferable. For very large datasets, consider FIt-SNE or openTSNE implementations.
```python
from sklearn.manifold import TSNE

# Barnes-Hut t-SNE for medium/large datasets
tsne_bh = TSNE(
    n_components=2,
    perplexity=30,
    method='barnes_hut',   # Default method (supports n_components <= 3)
    angle=0.5,             # Barnes-Hut angle (θ)
    n_iter=1000,
    learning_rate='auto',
    init='pca',
    random_state=42,
    n_jobs=-1              # Parallelize where possible
)

Y = tsne_bh.fit_transform(X)

# For small datasets, exact method may be better
tsne_exact = TSNE(
    n_components=2,
    perplexity=30,
    method='exact',        # Exact O(N²) computation
    n_iter=1000,
    random_state=42
)
```

t-SNE optimization is challenging but manageable with the right techniques. Let's consolidate the key lessons:
• Use PCA initialization (init='pca') with a small initial scale.
• Keep early exaggeration enabled; the defaults (α ≈ 12 for ~250 iterations) work well.
• Use learning_rate='auto' (or roughly N/12) and at least 1000 iterations.
• Monitor the KL divergence curve, visually inspect the result, and try multiple seeds when in doubt.
• Use Barnes-Hut (or FIt-SNE/openTSNE) for large datasets.
What's Next:
With a solid understanding of t-SNE's objective, perplexity, and optimization, we now turn to interpretation. t-SNE visualizations are often misread—cluster sizes don't mean anything, distances between clusters aren't reliable, and many visual patterns can be artifacts. The next page provides rigorous guidelines for correctly interpreting t-SNE results.
You now understand the optimization challenges in t-SNE and how to address them. With proper initialization, early exaggeration, learning rate, and sufficient iterations, you can reliably obtain high-quality embeddings. Next, we'll learn how to correctly interpret these visualizations.