If there is one t-SNE hyperparameter that every practitioner must understand deeply, it is perplexity. This single number controls the fundamental tradeoff between local and global structure in your visualizations, determines how many neighbors influence each point's embedding, and can dramatically change the appearance of your results.
Yet perplexity remains one of the most misunderstood concepts in dimensionality reduction. Users often treat it as a mysterious "magic number" to be tuned by trial and error. This page will dispel that mystery. We will develop a rigorous understanding of what perplexity measures, how it determines the Gaussian bandwidths in high-dimensional space, and how to reason about appropriate values for different datasets and use cases.
By the end of this page, you will have the theoretical foundation and practical intuition to confidently set perplexity based on your data characteristics and visualization goals.
By the end of this page, you will:
- Understand the mathematical definition of perplexity in terms of entropy
- Master the binary search algorithm that finds bandwidths σᵢ from perplexity
- Comprehend the relationship between perplexity and the effective number of neighbors
- Develop intuition for how perplexity affects visualization structure
- Know practical guidelines for setting perplexity across different scenarios
Perplexity is an information-theoretic measure that quantifies the "effective number of neighbors" each point considers in t-SNE's neighborhood probability distribution. It directly determines the Gaussian bandwidth σᵢ for each high-dimensional point.
The Definition:
Recall that for each point xᵢ, t-SNE defines conditional probabilities:
p(j|i) = exp(-||xᵢ - xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / 2σᵢ²)
Different values of σᵢ produce different probability distributions over neighbors. A small σᵢ concentrates probability on the nearest neighbors; a large σᵢ spreads probability more uniformly.
Perplexity measures the effective number of neighbors implied by this distribution:
Perplexity(Pᵢ) = 2^H(Pᵢ)

where H(Pᵢ) is the Shannon entropy (in bits) of the conditional distribution:

H(Pᵢ) = -Σⱼ≠ᵢ p(j|i) log₂ p(j|i)

Equivalently, Perplexity = exp(H(Pᵢ)) when entropy is measured with the natural logarithm, and Perplexity = 2^H(Pᵢ) when it is measured in log base 2.

Key insight: for a uniform distribution over k elements, H = log₂(k), so Perplexity = 2^log₂(k) = k. Perplexity is therefore the "effective number" of equally weighted neighbors.

Imagine you have 100 data points. If you set perplexity = 30, each point's neighborhood probability distribution will have entropy equivalent to that of a uniform distribution over 30 points. This doesn't mean exactly 30 points receive positive probability; rather, probability is spread across neighbors so that the "effective" count is 30, balancing certainty (few neighbors) against coverage (many neighbors).
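To make the "effective number of neighbors" reading concrete, here is a minimal sketch (the helper name `perplexity_of` is ours, not a library function) that computes 2^H for any probability vector:

```python
import numpy as np

def perplexity_of(p):
    """Perplexity = 2^H(p), with H the Shannon entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return 2.0 ** (-np.sum(p * np.log2(p)))

# A uniform distribution over k = 30 neighbors has perplexity k
uniform = np.full(30, 1.0 / 30.0)
assert abs(perplexity_of(uniform) - 30.0) < 1e-9

# A peaked distribution over 100 neighbors has far fewer "effective" neighbors
peaked = np.array([0.9] + [0.1 / 99.0] * 99)
assert perplexity_of(peaked) < 5.0
```

The second case illustrates the point in the text: all 100 neighbors receive positive probability, yet the effective count stays small because most of the mass sits on one neighbor.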
The Relationship Between σᵢ and Perplexity:
Given a target perplexity (specified by the user), t-SNE must find the bandwidth σᵢ for each point that achieves this target. This is an inverse problem: we know the desired entropy, and we must find σᵢ.
Larger σᵢ → more spread-out distribution → higher entropy → higher perplexity
Smaller σᵢ → more concentrated distribution → lower entropy → lower perplexity
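This monotone chain is easy to verify numerically. The sketch below (with hypothetical random squared distances; the helper name is ours) evaluates the perplexity of the Gaussian conditional distribution at several bandwidths:

```python
import numpy as np

def perplexity_for_sigma(distances_sq, sigma):
    """Perplexity of the Gaussian conditional distribution at bandwidth sigma."""
    kernel = np.exp(-distances_sq / (2.0 * sigma ** 2))
    P = kernel / kernel.sum()
    H = -np.sum(P * np.log(np.maximum(P, 1e-12)))  # entropy in nats
    return np.exp(H)

rng = np.random.default_rng(0)
d_sq = rng.uniform(0.5, 4.0, size=100)  # hypothetical squared distances to 100 neighbors

perps = [perplexity_for_sigma(d_sq, s) for s in (0.2, 0.5, 1.0, 2.0, 5.0)]
assert all(a < b for a, b in zip(perps, perps[1:]))  # perplexity grows with sigma
```

This monotonicity is exactly what licenses the binary search described in the next section: there is a unique σᵢ for every achievable target perplexity.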
Why does each point get its own σᵢ?
Different points live in regions of varying density: in dense regions, a small σᵢ already covers many neighbors, so the binary search settles on a small bandwidth; in sparse regions, a larger σᵢ is needed to reach the same effective number of neighbors.
The perplexity parameter provides a consistent measure across varying densities: regardless of local density, each point will consider an approximately equal effective number of neighbors.
Given a target perplexity value, how do we find the corresponding bandwidth σᵢ for each point? The solution is elegant: binary search.
The Algorithm:
For each point xᵢ:
1. Initialize search bounds [σ_min, σ_max] for the bandwidth
2. Evaluate the conditional distribution and its perplexity at the current σᵢ
3. If the perplexity is too high, decrease σᵢ; if too low, increase it (bisecting the bracket)
4. Repeat until the perplexity matches the target within tolerance
The binary search works because perplexity is a monotonically increasing function of σᵢ:
```python
import numpy as np

def compute_perplexity_and_P(distances_sq, sigma):
    """
    Compute conditional probabilities P(j|i) and perplexity for a given sigma.

    Args:
        distances_sq: Squared distances from point i to all other points (1D array)
        sigma: Bandwidth parameter

    Returns:
        P: Conditional probability distribution
        perplexity: Resulting perplexity value
    """
    # Compute Gaussian kernel values
    beta = 1.0 / (2.0 * sigma**2)  # Precision
    kernel = np.exp(-distances_sq * beta)

    # Normalize to get probabilities
    sum_kernel = np.sum(kernel)
    if sum_kernel < 1e-10:
        # All points too far; use uniform distribution
        P = np.ones_like(kernel) / len(kernel)
    else:
        P = kernel / sum_kernel

    # Compute entropy (using natural log, then convert to perplexity)
    # Avoid log(0) by masking zero probabilities
    P_safe = np.maximum(P, 1e-12)
    H = -np.sum(P * np.log(P_safe))  # Entropy in nats
    perplexity = np.exp(H)  # Convert to perplexity

    return P, perplexity


def find_sigma_binary_search(distances_sq, target_perplexity, tol=1e-5, max_iter=50):
    """
    Binary search to find sigma that achieves the target perplexity.

    Args:
        distances_sq: Squared distances from point i to all others
        target_perplexity: Desired perplexity value
        tol: Tolerance for perplexity match
        max_iter: Maximum iterations for binary search

    Returns:
        sigma: Bandwidth achieving target perplexity
        P: Resulting probability distribution
    """
    # Initialize search bounds for sigma
    sigma_min = 1e-10
    sigma_max = 1e10
    sigma = 1.0  # Initial guess

    for iteration in range(max_iter):
        P, current_perplexity = compute_perplexity_and_P(distances_sq, sigma)
        perplexity_diff = current_perplexity - target_perplexity

        if np.abs(perplexity_diff) < tol:
            break  # Converged

        if perplexity_diff > 0:
            # Perplexity too high -> sigma too large -> decrease sigma
            sigma_max = sigma
            sigma = (sigma_min + sigma) / 2
        else:
            # Perplexity too low -> sigma too small -> increase sigma
            sigma_min = sigma
            sigma = (sigma + sigma_max) / 2

    return sigma, P


def compute_all_P(X, target_perplexity):
    """
    Compute the symmetrized joint probability matrix P for all points.

    Args:
        X: Input data matrix (N x D)
        target_perplexity: Perplexity hyperparameter

    Returns:
        P: Symmetrized joint probability matrix (N x N)
        sigmas: Per-point bandwidths found by the binary search
    """
    N = X.shape[0]
    P = np.zeros((N, N))
    sigmas = np.zeros(N)

    # Compute pairwise squared distances
    sum_X = np.sum(X**2, axis=1)
    D_sq = sum_X[:, np.newaxis] + sum_X - 2 * X @ X.T

    for i in range(N):
        # Get distances from point i to all others (excluding self)
        distances_sq_i = np.concatenate([D_sq[i, :i], D_sq[i, i+1:]])

        # Binary search for sigma
        sigma_i, P_i = find_sigma_binary_search(distances_sq_i, target_perplexity)
        sigmas[i] = sigma_i

        # Fill in conditional probabilities
        P[i, :i] = P_i[:i]
        P[i, i+1:] = P_i[i:]

    # Symmetrize: P_ij = (P(j|i) + P(i|j)) / (2N)
    P = (P + P.T) / (2 * N)
    P = np.maximum(P, 1e-12)  # Ensure numerical stability

    return P, sigmas
```

Computational Considerations:
The binary search typically converges in 10-30 iterations per point. For N points, this gives O(N²) complexity just for computing P (dominated by pairwise distances). In practice, this preprocessing step is fast compared to the main optimization loop.
What the binary search reveals:
After running the binary search, you obtain a bandwidth σᵢ for each point. These bandwidths encode local density information: small σᵢ values correspond to dense neighborhoods, while large σᵢ values flag sparse regions.
This adaptive bandwidth mechanism is what makes t-SNE robust to varying data density—a crucial advantage over methods that use a fixed kernel width.
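As a quick numerical check of this adaptivity, the compact bisection sketch below (the helper `sigma_for_perplexity` and the synthetic distances are ours) finds σ for the same neighbor layout at two density scales. Scaling every distance up by a factor of 10 scales the recovered bandwidth by the same factor, leaving the effective neighbor count unchanged:

```python
import numpy as np

def sigma_for_perplexity(d_sq, target, iters=60):
    """Bisection on sigma until the Gaussian neighborhood hits a target perplexity."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(iters):
        sigma = (lo + hi) / 2
        P = np.exp(-d_sq / (2.0 * sigma ** 2))
        P /= P.sum()
        perp = np.exp(-np.sum(P * np.log(np.maximum(P, 1e-12))))
        if perp > target:
            hi = sigma  # distribution too spread out -> shrink the bandwidth
        else:
            lo = sigma
    return sigma

rng = np.random.default_rng(1)
dense_d_sq = rng.uniform(0.01, 0.1, size=200)  # squared distances in a dense region
sparse_d_sq = dense_d_sq * 100.0               # same layout, all distances 10x larger

s_dense = sigma_for_perplexity(dense_d_sq, 30.0)
s_sparse = sigma_for_perplexity(sparse_d_sq, 30.0)
assert s_sparse > s_dense  # sparser neighborhoods get wider kernels
```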
Many implementations parameterize the Gaussian using precision β = 1/(2σ²) rather than variance σ². This is mathematically equivalent but can be more numerically stable. The binary search then optimizes over β instead of σ, finding β such that the resulting distribution has the target perplexity.
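A sketch of the same bisection reparameterized in precision (our helper `beta_for_perplexity`; note that perplexity is monotonically decreasing in β, so the update direction flips relative to the σ version):

```python
import numpy as np

def beta_for_perplexity(d_sq, target, tol=1e-5, max_iter=200):
    """Binary search on the precision beta = 1/(2 sigma^2).
    Perplexity decreases as beta grows, so high perplexity means beta is too small."""
    beta, beta_min, beta_max = 1.0, 0.0, np.inf
    for _ in range(max_iter):
        P = np.exp(-d_sq * beta)
        P /= P.sum()
        perp = np.exp(-np.sum(P * np.log(np.maximum(P, 1e-12))))
        if abs(perp - target) < tol:
            break
        if perp > target:
            # Too many effective neighbors -> sharpen the kernel (raise beta)
            beta_min = beta
            beta = beta * 2.0 if np.isinf(beta_max) else (beta_min + beta_max) / 2.0
        else:
            # Too few effective neighbors -> widen the kernel (lower beta)
            beta_max = beta
            beta = (beta_min + beta_max) / 2.0
    return beta
```

The bandwidth is recovered as σ = sqrt(1 / (2β)). Doubling β while the upper bound is still unknown is a common trick for establishing the bracket without guessing its scale up front.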
The perplexity parameter fundamentally controls the scale at which t-SNE operates. Understanding how different perplexity values affect the visualization is essential for effective use.
Low Perplexity (5-15): emphasizes fine-grained local structure; clusters tend to fragment into small pieces, and spurious subclusters can appear.

Medium Perplexity (30-50): the standard range; balances local detail with broader neighborhood structure and works well for most datasets.

High Perplexity (100+): emphasizes broader organization; fine subclusters may merge, and the result is smoother but less detailed.
Perplexity must be smaller than the number of data points (N). In practice, perplexity should be significantly smaller—a common rule is perplexity < N/3. Setting perplexity too high relative to N results in nearly uniform probabilities, erasing meaningful structure.
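That rule of thumb is straightforward to encode. The helper below (`safe_perplexity` is our name, not a library function) clamps a requested value to the heuristic bound:

```python
def safe_perplexity(requested, n_samples):
    """Clamp a requested perplexity to the heuristic bound perplexity < N/3."""
    upper = max(2, (n_samples - 1) // 3)  # heuristic cap, not a hard rule
    return min(requested, upper)

print(safe_perplexity(30, 10_000))  # plenty of data: 30 passes through
print(safe_perplexity(30, 50))      # only 50 points: clamped to 16
```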
| Dataset Size (N) | Suggested Perplexity Range | Rationale |
|---|---|---|
| 100-500 | 5-30 | Limited data; focus on local structure |
| 500-5,000 | 15-50 | Moderate data; balance local and global |
| 5,000-50,000 | 30-100 | Ample data; can explore multiple scales |
| 50,000+ | 50-200 | Large data; may need higher for global view |
Visualizing the Effect:
Consider embedding the MNIST dataset (70,000 handwritten digits) with different perplexity values:
The key insight is that perplexity selects the scale of neighborhoods that t-SNE tries to preserve. There is no single "correct" value—the choice depends on what structure you want to reveal.
Perplexity controls a fundamental tradeoff in t-SNE: the balance between preserving local neighborhood relationships and global structure. Understanding this tradeoff is essential for interpreting t-SNE visualizations correctly.
The Core Tension:
t-SNE's objective function (KL divergence from P to Q) primarily rewards preserving local structure—specifically, pairs with high pᵢⱼ must have high qᵢⱼ. But the objective doesn't strongly penalize placing globally distant points near each other in the embedding.
Perplexity modulates what "local" means: with perplexity k, each point's neighborhood probability effectively spans about its k nearest neighbors, so larger perplexities widen the range of structure the objective rewards preserving.
Why t-SNE Doesn't Preserve Global Structure:
Even with high perplexity, t-SNE is not designed to preserve global distances. The reasons are:
1. Probability normalization: The pᵢⱼ values sum to 1, so all pairs compete for probability mass. Preserving global structure would require explicit preservation of distances, which t-SNE doesn't do.

2. KL asymmetry: The KL(P||Q) formulation tolerates placing distant points near each other (low pᵢⱼ with high qᵢⱼ is cheap). This is by design; it allows flexible cluster arrangement.

3. Heavy tails of the Student-t: The Student-t kernel makes large distances in the embedding cheap in terms of probability. This creates separation but doesn't guarantee meaningful global distances.
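The tail difference is easy to quantify: comparing the unnormalized Gaussian similarity exp(-d²) with the Student-t (1 degree of freedom) similarity 1/(1 + d²) at a few distances shows how cheap large embedding distances are under the heavy-tailed kernel:

```python
import numpy as np

d = np.array([1.0, 5.0, 10.0])  # embedding distances

gaussian = np.exp(-d ** 2)        # unnormalized Gaussian similarity
student_t = 1.0 / (1.0 + d ** 2)  # unnormalized Student-t (1 dof) similarity

for di, g, t in zip(d, gaussian, student_t):
    print(f"d = {di:4.1f}   gaussian = {g:.2e}   student-t = {t:.2e}")
# At d = 10 the Gaussian similarity has collapsed to ~4e-44 while the
# Student-t similarity is still ~1e-2: large embedding distances are
# nearly free under the heavy-tailed kernel.
```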
Cluster distances in t-SNE are NOT meaningful. A cluster appearing far from another doesn't necessarily mean those data points are more different. t-SNE only reliably preserves local neighborhoods. The relative positions and distances between clusters should not be over-interpreted.
Multi-Scale t-SNE:
One approach to addressing the local-global tradeoff is to run t-SNE at multiple perplexity values and compare results:
```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

def multi_scale_tsne(X, perplexities=[5, 30, 100, 300], random_state=42):
    """
    Run t-SNE at multiple perplexity values for multi-scale analysis.

    This reveals structure at different scales:
    - Low perplexity: Fine-grained local structure
    - High perplexity: Broader neighborhood structure
    """
    n_perplexities = len(perplexities)
    fig, axes = plt.subplots(1, n_perplexities, figsize=(5 * n_perplexities, 5))
    axes = np.atleast_1d(axes)  # handle the single-perplexity case
    embeddings = {}

    for idx, perp in enumerate(perplexities):
        print(f"Running t-SNE with perplexity={perp}...")
        tsne = TSNE(
            n_components=2,
            perplexity=perp,
            random_state=random_state,
            n_iter=1000,
            learning_rate='auto',
            init='pca'
        )
        Y = tsne.fit_transform(X)
        embeddings[perp] = Y

        ax = axes[idx]
        ax.scatter(Y[:, 0], Y[:, 1], s=5, alpha=0.5)
        ax.set_title(f'Perplexity = {perp}')
        ax.set_xticks([])
        ax.set_yticks([])

    plt.tight_layout()
    return embeddings, fig

# Usage:
# embeddings, fig = multi_scale_tsne(X_normalized, perplexities=[5, 30, 100, 300])
# plt.savefig('multi_scale_tsne.png', dpi=150)
```

Interpreting Multi-Scale Results:
When analyzing multi-scale t-SNE, trust structure that persists across perplexity values; features that appear at only one setting are more likely artifacts of that particular scale.
This multi-scale approach aligns with recent work on hierarchical t-SNE and multi-resolution embeddings that explicitly model structure at multiple scales.
While there's no universal "correct" perplexity for all datasets, the following guidelines help you make informed choices.
Rule of Thumb: Start with Perplexity = 30
The canonical default of 30 works surprisingly well across many datasets. This is a good starting point unless you have specific reasons to deviate.
Adjust Based on Dataset Characteristics:
| Scenario | Recommended Perplexity | Rationale |
|---|---|---|
| Small dataset (N < 200) | 5-15 | Not enough points for high perplexity |
| Looking for small subclusters | 10-20 | Fine-grained local structure |
| General exploration | 30-50 | Balanced view |
| Very large dataset (N > 50k) | 50-100 | More neighbors for stability |
| Confirming broad cluster structure | 100-200 | Global organization |
| High intrinsic dimensionality | 50+ | Need larger neighborhoods |
Signs Your Perplexity is Too Low:
- Clusters fragment into many small, disconnected islands
- The embedding looks noisy and scattered, with little coherent structure
Signs Your Perplexity is Too High:
- Distinct clusters merge into undifferentiated blobs
- Fine subcluster structure disappears and the embedding tends toward a single diffuse cloud
If you have ground truth labels or domain knowledge about expected cluster structure, use them to guide perplexity selection. Run t-SNE at several perplexity values and choose the one that best reveals known structure while also showing new patterns. This semi-supervised approach leverages your expertise.
```python
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import numpy as np

def select_perplexity_with_labels(X, labels, perplexities=[5, 15, 30, 50, 100]):
    """
    Use ground truth labels to help select perplexity.

    Computes silhouette score on the embedding for each perplexity.
    Higher silhouette = better cluster separation in embedding.

    Note: This is a heuristic, not a definitive measure of quality.
    """
    results = []

    for perp in perplexities:
        if perp >= X.shape[0] / 3:
            print(f"Skipping perplexity={perp} (too high for N={X.shape[0]})")
            continue

        tsne = TSNE(
            n_components=2,
            perplexity=perp,
            random_state=42,
            n_iter=1000
        )
        Y = tsne.fit_transform(X)

        # Compute silhouette score on embedding
        score = silhouette_score(Y, labels)
        results.append({
            'perplexity': perp,
            'silhouette': score
        })
        print(f"Perplexity {perp:3d}: Silhouette = {score:.3f}")

    if not results:
        raise ValueError("No perplexity value satisfied perplexity < N/3")

    # Find best perplexity
    best = max(results, key=lambda x: x['silhouette'])
    print(f"Recommended perplexity: {best['perplexity']}")

    return results

# Example usage:
# results = select_perplexity_with_labels(X_train, y_train)
```

Advanced: Automatic Perplexity Selection
Recent research has proposed methods for automatic perplexity selection.
However, for practical visualization, manually exploring 3-5 perplexity values usually suffices and provides valuable insight into data structure at different scales.
While perplexity is primarily a modeling parameter, it also has computational implications, especially for approximate t-SNE implementations.
Exact t-SNE:
In the naive O(N²) implementation, perplexity doesn't affect computational cost significantly. The binary search for σᵢ converges in O(log(range)) iterations regardless of the target perplexity.
Barnes-Hut t-SNE:
The Barnes-Hut approximation uses tree-based acceleration (typically quad-trees in 2D) to approximate the repulsive forces. Perplexity affects the attractive-force side: more neighbors per point means more attractive terms to evaluate, so higher perplexity costs somewhat more per iteration. In practice, the effect is usually modest.
Approximate Nearest Neighbors:
For very large datasets, t-SNE implementations use approximate nearest neighbor (ANN) algorithms to compute the k nearest neighbors needed for constructing P. The relationship with perplexity:
```python
# Typical relationship in approximate t-SNE:
#     K_neighbors ≈ 3 × perplexity
#
# Why 3×?
# The neighborhood probability extends beyond the exact perplexity
# nearest neighbors. To ensure accurate probabilities, we need
# neighbors in the "tail" of the distribution.
#
# For perplexity = 30:
#     K_neighbors ≈ 90 nearest neighbors computed
# For perplexity = 100:
#     K_neighbors ≈ 300 nearest neighbors computed
#
# Computational implication:
# - ANN algorithms scale as O(N × K × log(N))
# - Higher perplexity → larger K → more computation
# - But still much faster than exact O(N²)
```

Modern implementations like openTSNE, FIt-SNE, and scikit-learn's TSNE combine the Barnes-Hut approximation with approximate nearest neighbors. For datasets up to ~100k points, these run in minutes. In these implementations, the perplexity parameter primarily affects quality, not runtime.
Memory Considerations:
Perplexity indirectly affects memory usage: with the K ≈ 3 × perplexity convention, higher perplexity means more stored neighbors per point, and hence more nonzero entries in P. However, efficient implementations store P as a sparse matrix, so the overhead is typically manageable.
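A back-of-the-envelope sketch of the dense-versus-sparse gap (our helper; the K ≈ 3 × perplexity convention and ~12 bytes per stored entry are ballpark assumptions, not exact figures for any particular library):

```python
def sparse_P_memory_mb(n_points, perplexity, bytes_per_entry=12):
    """Approximate memory (MB) for a sparse P storing ~3 * perplexity
    neighbors per point, assuming ~12 bytes per stored entry
    (8-byte float value + 4-byte column index)."""
    nnz = n_points * 3 * perplexity
    return nnz * bytes_per_entry / 1e6

n = 100_000
dense_mb = n ** 2 * 8 / 1e6            # dense float64 P: 80,000 MB (80 GB)
sparse_mb = sparse_P_memory_mb(n, 30)  # sparse P at perplexity 30: 108 MB
print(f"dense: {dense_mb:,.0f} MB   sparse: {sparse_mb:,.0f} MB")
```

Even tripling the perplexity only triples the sparse footprint, which remains several orders of magnitude below the dense matrix.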
Perplexity is the central hyperparameter of t-SNE, controlling the fundamental scale at which neighborhood structure is preserved. The key insights: perplexity measures the effective number of neighbors via 2^H; per-point bandwidths σᵢ are found by binary search, adapting automatically to local density; and the value you choose sets the scale of structure that the embedding preserves.
What's Next:
With a solid understanding of the objective function and perplexity, we now turn to the challenges of optimization. t-SNE's optimization landscape is highly non-convex, leading to local minima, sensitivity to initialization, and other challenges. The next page explores these issues and the techniques developed to address them.
You now understand the perplexity parameter—its mathematical definition, how it determines Gaussian bandwidths via binary search, and how to choose appropriate values. This knowledge is essential for using t-SNE effectively and interpreting results correctly. Next, we'll tackle the optimization challenges that make t-SNE tricky to use in practice.