If there is one t-SNE hyperparameter that every practitioner must understand deeply, it is perplexity. This single number controls the fundamental tradeoff between local and global structure in your visualizations, determines how many neighbors influence each point's embedding, and can dramatically change the appearance of your results.
Yet perplexity remains one of the most misunderstood concepts in dimensionality reduction. Users often treat it as a mysterious "magic number" to be tuned by trial and error. This page will dispel that mystery. We will develop a rigorous understanding of what perplexity measures, how it determines the Gaussian bandwidths in high-dimensional space, and how to reason about appropriate values for different datasets and use cases.
By the end of this page, you will have the theoretical foundation and practical intuition to confidently set perplexity based on your data characteristics and visualization goals.
By the end of this page, you will:
- Understand the mathematical definition of perplexity in terms of entropy
- Master the binary search algorithm that finds bandwidths σᵢ from perplexity
- Comprehend the relationship between perplexity and the effective number of neighbors
- Develop intuition for how perplexity affects visualization structure
- Know practical guidelines for setting perplexity across different scenarios
Perplexity is an information-theoretic measure that quantifies the "effective number of neighbors" each point considers in t-SNE's neighborhood probability distribution. It directly determines the Gaussian bandwidth σᵢ for each high-dimensional point.
The Definition:
Recall that for each point xᵢ, t-SNE defines conditional probabilities:
p(j|i) = exp(-||xᵢ - xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / 2σᵢ²)
Different values of σᵢ produce different probability distributions over neighbors. A small σᵢ concentrates probability on the nearest neighbors; a large σᵢ spreads probability more uniformly.
Perplexity measures the effective number of neighbors implied by this distribution:
Perplexity(Pᵢ) = 2^H(Pᵢ)

where H(Pᵢ) is the Shannon entropy (in bits) of the conditional distribution:

H(Pᵢ) = -Σⱼ≠ᵢ p(j|i) log₂ p(j|i)

Equivalently, Perplexity = exp(H(Pᵢ)) when entropy is measured with the natural logarithm, and Perplexity = 2^H(Pᵢ) when it is measured in log base 2.

Key insight: for a uniform distribution over k elements, H = log₂(k), so Perplexity = 2^log₂(k) = k. Perplexity is therefore the "effective number" of equally weighted neighbors.

Imagine you have 100 data points. If you set perplexity = 30, each point's neighborhood probability distribution will have entropy equivalent to that of a uniform distribution over 30 points. This doesn't mean exactly 30 points receive positive probability; rather, probability is spread across neighbors so that the "effective" count is 30, balancing certainty (few neighbors) against coverage (many neighbors).
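To make the "effective number of neighbors" reading concrete, here is a minimal sketch (the helper name `perplexity_of` is ours, not a library function) that computes 2^H for any probability vector:

```python
import numpy as np

def perplexity_of(p):
    """Perplexity = 2^H(p), with H the Shannon entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return 2.0 ** (-np.sum(p * np.log2(p)))

# A uniform distribution over k = 30 neighbors has perplexity k
uniform = np.full(30, 1.0 / 30.0)
assert abs(perplexity_of(uniform) - 30.0) < 1e-9

# A peaked distribution over 100 neighbors has far fewer "effective" neighbors
peaked = np.array([0.9] + [0.1 / 99.0] * 99)
assert perplexity_of(peaked) < 5.0
```

The second case illustrates the point in the text: all 100 neighbors receive positive probability, yet the effective count stays small because most of the mass sits on one neighbor.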
The Relationship Between σᵢ and Perplexity:
Given a target perplexity (specified by the user), t-SNE must find the bandwidth σᵢ for each point that achieves this target. This is an inverse problem: we know the desired entropy, and we must find σᵢ.
Larger σᵢ → more spread-out distribution → higher entropy → higher perplexity
Smaller σᵢ → more concentrated distribution → lower entropy → lower perplexity
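This monotone chain is easy to verify numerically. The sketch below (with hypothetical random squared distances; the helper name is ours) evaluates the perplexity of the Gaussian conditional distribution at several bandwidths:

```python
import numpy as np

def perplexity_for_sigma(distances_sq, sigma):
    """Perplexity of the Gaussian conditional distribution at bandwidth sigma."""
    kernel = np.exp(-distances_sq / (2.0 * sigma ** 2))
    P = kernel / kernel.sum()
    H = -np.sum(P * np.log(np.maximum(P, 1e-12)))  # entropy in nats
    return np.exp(H)

rng = np.random.default_rng(0)
d_sq = rng.uniform(0.5, 4.0, size=100)  # hypothetical squared distances to 100 neighbors

perps = [perplexity_for_sigma(d_sq, s) for s in (0.2, 0.5, 1.0, 2.0, 5.0)]
assert all(a < b for a, b in zip(perps, perps[1:]))  # perplexity grows with sigma
```

This monotonicity is exactly what licenses the binary search described in the next section: there is a unique σᵢ for every achievable target perplexity.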
Why does each point get its own σᵢ?
Different points live in regions of varying density: in dense regions, a small σᵢ already covers many neighbors, so the binary search settles on a small bandwidth; in sparse regions, a larger σᵢ is needed to reach the same effective number of neighbors.
The perplexity parameter provides a consistent measure across varying densities: regardless of local density, each point will consider an approximately equal effective number of neighbors.
Given a target perplexity value, how do we find the corresponding bandwidth σᵢ for each point? The solution is elegant: binary search.
The Algorithm:
For each point xᵢ:
1. Initialize search bounds [σ_min, σ_max] for the bandwidth
2. Evaluate the conditional distribution and its perplexity at the current σᵢ
3. If the perplexity is too high, decrease σᵢ; if too low, increase it (bisecting the bracket)
4. Repeat until the perplexity matches the target within tolerance
The binary search works because perplexity is a monotonically increasing function of σᵢ:
```python
import numpy as np

def compute_perplexity_and_P(distances_sq, sigma):
    """
    Compute conditional probabilities P(j|i) and perplexity for a given sigma.

    Args:
        distances_sq: Squared distances from point i to all other points (1D array)
        sigma: Bandwidth parameter

    Returns:
        P: Conditional probability distribution
        perplexity: Resulting perplexity value
    """
    # Compute Gaussian kernel values
    beta = 1.0 / (2.0 * sigma**2)  # Precision
    kernel = np.exp(-distances_sq * beta)

    # Normalize to get probabilities
    sum_kernel = np.sum(kernel)
    if sum_kernel < 1e-10:
        # All points too far; use uniform distribution
        P = np.ones_like(kernel) / len(kernel)
    else:
        P = kernel / sum_kernel

    # Compute entropy (using natural log, then convert to perplexity)
    # Avoid log(0) by masking zero probabilities
    P_safe = np.maximum(P, 1e-12)
    H = -np.sum(P * np.log(P_safe))  # Entropy in nats
    perplexity = np.exp(H)  # Convert to perplexity

    return P, perplexity


def find_sigma_binary_search(distances_sq, target_perplexity, tol=1e-5, max_iter=50):
    """
    Binary search to find sigma that achieves the target perplexity.

    Args:
        distances_sq: Squared distances from point i to all others
        target_perplexity: Desired perplexity value
        tol: Tolerance for perplexity match
        max_iter: Maximum iterations for binary search

    Returns:
        sigma: Bandwidth achieving target perplexity
        P: Resulting probability distribution
    """
    # Initialize search bounds for sigma
    sigma_min = 1e-10
    sigma_max = 1e10
    sigma = 1.0  # Initial guess

    for iteration in range(max_iter):
        P, current_perplexity = compute_perplexity_and_P(distances_sq, sigma)
        perplexity_diff = current_perplexity - target_perplexity

        if np.abs(perplexity_diff) < tol:
            break  # Converged

        if perplexity_diff > 0:
            # Perplexity too high -> sigma too large -> decrease sigma
            sigma_max = sigma
            sigma = (sigma_min + sigma) / 2
        else:
            # Perplexity too low -> sigma too small -> increase sigma
            sigma_min = sigma
            sigma = (sigma + sigma_max) / 2

    return sigma, P


def compute_all_P(X, target_perplexity):
    """
    Compute the symmetrized joint probability matrix P for all points.

    Args:
        X: Input data matrix (N x D)
        target_perplexity: Perplexity hyperparameter

    Returns:
        P: Symmetrized joint probability matrix (N x N)
        sigmas: Per-point bandwidths found by the binary search
    """
    N = X.shape[0]
    P = np.zeros((N, N))
    sigmas = np.zeros(N)

    # Compute pairwise squared distances
    sum_X = np.sum(X**2, axis=1)
    D_sq = sum_X[:, np.newaxis] + sum_X - 2 * X @ X.T

    for i in range(N):
        # Get distances from point i to all others (excluding self)
        distances_sq_i = np.concatenate([D_sq[i, :i], D_sq[i, i+1:]])

        # Binary search for sigma
        sigma_i, P_i = find_sigma_binary_search(distances_sq_i, target_perplexity)
        sigmas[i] = sigma_i

        # Fill in conditional probabilities
        P[i, :i] = P_i[:i]
        P[i, i+1:] = P_i[i:]

    # Symmetrize: P_ij = (P(j|i) + P(i|j)) / (2N)
    P = (P + P.T) / (2 * N)
    P = np.maximum(P, 1e-12)  # Ensure numerical stability

    return P, sigmas
```

Computational Considerations:
The binary search typically converges in 10-30 iterations per point. For N points, this gives O(N²) complexity just for computing P (dominated by pairwise distances). In practice, this preprocessing step is fast compared to the main optimization loop.
What the binary search reveals:
After running the binary search, you obtain a bandwidth σᵢ for each point. These bandwidths encode local density information: small σᵢ values correspond to dense neighborhoods, while large σᵢ values flag sparse regions.
This adaptive bandwidth mechanism is what makes t-SNE robust to varying data density—a crucial advantage over methods that use a fixed kernel width.
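As a quick numerical check of this adaptivity, the compact bisection sketch below (the helper `sigma_for_perplexity` and the synthetic distances are ours) finds σ for the same neighbor layout at two density scales. Scaling every distance up by a factor of 10 scales the recovered bandwidth by the same factor, leaving the effective neighbor count unchanged:

```python
import numpy as np

def sigma_for_perplexity(d_sq, target, iters=60):
    """Bisection on sigma until the Gaussian neighborhood hits a target perplexity."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(iters):
        sigma = (lo + hi) / 2
        P = np.exp(-d_sq / (2.0 * sigma ** 2))
        P /= P.sum()
        perp = np.exp(-np.sum(P * np.log(np.maximum(P, 1e-12))))
        if perp > target:
            hi = sigma  # distribution too spread out -> shrink the bandwidth
        else:
            lo = sigma
    return sigma

rng = np.random.default_rng(1)
dense_d_sq = rng.uniform(0.01, 0.1, size=200)  # squared distances in a dense region
sparse_d_sq = dense_d_sq * 100.0               # same layout, all distances 10x larger

s_dense = sigma_for_perplexity(dense_d_sq, 30.0)
s_sparse = sigma_for_perplexity(sparse_d_sq, 30.0)
assert s_sparse > s_dense  # sparser neighborhoods get wider kernels
```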
Many implementations parameterize the Gaussian using precision β = 1/(2σ²) rather than variance σ². This is mathematically equivalent but can be more numerically stable. The binary search then optimizes over β instead of σ, finding β such that the resulting distribution has the target perplexity.
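A sketch of the same bisection reparameterized in precision (our helper `beta_for_perplexity`; note that perplexity is monotonically decreasing in β, so the update direction flips relative to the σ version):

```python
import numpy as np

def beta_for_perplexity(d_sq, target, tol=1e-5, max_iter=200):
    """Binary search on the precision beta = 1/(2 sigma^2).
    Perplexity decreases as beta grows, so high perplexity means beta is too small."""
    beta, beta_min, beta_max = 1.0, 0.0, np.inf
    for _ in range(max_iter):
        P = np.exp(-d_sq * beta)
        P /= P.sum()
        perp = np.exp(-np.sum(P * np.log(np.maximum(P, 1e-12))))
        if abs(perp - target) < tol:
            break
        if perp > target:
            # Too many effective neighbors -> sharpen the kernel (raise beta)
            beta_min = beta
            beta = beta * 2.0 if np.isinf(beta_max) else (beta_min + beta_max) / 2.0
        else:
            # Too few effective neighbors -> widen the kernel (lower beta)
            beta_max = beta
            beta = (beta_min + beta_max) / 2.0
    return beta
```

The bandwidth is recovered as σ = sqrt(1 / (2β)). Doubling β while the upper bound is still unknown is a common trick for establishing the bracket without guessing its scale up front.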
The perplexity parameter fundamentally controls the scale at which t-SNE operates. Understanding how different perplexity values affect the visualization is essential for effective use.
Low Perplexity (5-15): emphasizes fine-grained local structure; clusters tend to fragment into small pieces, and spurious subclusters can appear.

Medium Perplexity (30-50): the standard range; balances local detail with broader neighborhood structure and works well for most datasets.

High Perplexity (100+): emphasizes broader organization; fine subclusters may merge, and the result is smoother but less detailed.
Perplexity must be smaller than the number of data points (N). In practice, perplexity should be significantly smaller—a common rule is perplexity < N/3. Setting perplexity too high relative to N results in nearly uniform probabilities, erasing meaningful structure.
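That rule of thumb is straightforward to encode. The helper below (`safe_perplexity` is our name, not a library function) clamps a requested value to the heuristic bound:

```python
def safe_perplexity(requested, n_samples):
    """Clamp a requested perplexity to the heuristic bound perplexity < N/3."""
    upper = max(2, (n_samples - 1) // 3)  # heuristic cap, not a hard rule
    return min(requested, upper)

print(safe_perplexity(30, 10_000))  # plenty of data: 30 passes through
print(safe_perplexity(30, 50))      # only 50 points: clamped to 16
```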
| Dataset Size (N) | Suggested Perplexity Range | Rationale |
|---|---|---|
| 100-500 | 5-30 | Limited data; focus on local structure |
| 500-5,000 | 15-50 | Moderate data; balance local and global |
| 5,000-50,000 | 30-100 | Ample data; can explore multiple scales |
| 50,000+ | 50-200 | Large data; may need higher for global view |
Visualizing the Effect:
Consider embedding the MNIST dataset (70,000 handwritten digits) with different perplexity values:
The key insight is that perplexity selects the scale of neighborhoods that t-SNE tries to preserve. There is no single "correct" value—the choice depends on what structure you want to reveal.
Perplexity controls a fundamental tradeoff in t-SNE: the balance between preserving local neighborhood relationships and global structure. Understanding this tradeoff is essential for interpreting t-SNE visualizations correctly.
The Core Tension:
t-SNE's objective function (KL divergence from P to Q) primarily rewards preserving local structure—specifically, pairs with high pᵢⱼ must have high qᵢⱼ. But the objective doesn't strongly penalize placing globally distant points near each other in the embedding.
Perplexity modulates what "local" means: with perplexity k, each point's neighborhood probability effectively spans about its k nearest neighbors, so larger perplexities widen the range of structure the objective rewards preserving.
Why t-SNE Doesn't Preserve Global Structure:
Even with high perplexity, t-SNE is not designed to preserve global distances. The reasons are:
1. Probability normalization: The pᵢⱼ values sum to 1, so all pairs compete for probability mass. Preserving global structure would require explicit preservation of distances, which t-SNE doesn't do.

2. KL asymmetry: The KL(P||Q) formulation tolerates placing distant points near each other (low pᵢⱼ with high qᵢⱼ is cheap). This is by design; it allows flexible cluster arrangement.

3. Heavy tails of the Student-t: The Student-t kernel makes large distances in the embedding cheap in terms of probability. This creates separation but doesn't guarantee meaningful global distances.
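The tail difference is easy to quantify: comparing the unnormalized Gaussian similarity exp(-d²) with the Student-t (1 degree of freedom) similarity 1/(1 + d²) at a few distances shows how cheap large embedding distances are under the heavy-tailed kernel:

```python
import numpy as np

d = np.array([1.0, 5.0, 10.0])  # embedding distances

gaussian = np.exp(-d ** 2)        # unnormalized Gaussian similarity
student_t = 1.0 / (1.0 + d ** 2)  # unnormalized Student-t (1 dof) similarity

for di, g, t in zip(d, gaussian, student_t):
    print(f"d = {di:4.1f}   gaussian = {g:.2e}   student-t = {t:.2e}")
# At d = 10 the Gaussian similarity has collapsed to ~4e-44 while the
# Student-t similarity is still ~1e-2: large embedding distances are
# nearly free under the heavy-tailed kernel.
```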
Cluster distances in t-SNE are NOT meaningful. A cluster appearing far from another doesn't necessarily mean those data points are more different. t-SNE only reliably preserves local neighborhoods. The relative positions and distances between clusters should not be over-interpreted.
Multi-Scale t-SNE:
One approach to addressing the local-global tradeoff is to run t-SNE at multiple perplexity values and compare results:
```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

def multi_scale_tsne(X, perplexities=[5, 30, 100, 300], random_state=42):
    """
    Run t-SNE at multiple perplexity values for multi-scale analysis.

    This reveals structure at different scales:
    - Low perplexity: Fine-grained local structure
    - High perplexity: Broader neighborhood structure
    """
    n_perplexities = len(perplexities)
    fig, axes = plt.subplots(1, n_perplexities, figsize=(5 * n_perplexities, 5))
    axes = np.atleast_1d(axes)  # handle the single-perplexity case
    embeddings = {}

    for idx, perp in enumerate(perplexities):
        print(f"Running t-SNE with perplexity={perp}...")
        tsne = TSNE(
            n_components=2,
            perplexity=perp,
            random_state=random_state,
            n_iter=1000,
            learning_rate='auto',
            init='pca'
        )
        Y = tsne.fit_transform(X)
        embeddings[perp] = Y

        ax = axes[idx]
        ax.scatter(Y[:, 0], Y[:, 1], s=5, alpha=0.5)
        ax.set_title(f'Perplexity = {perp}')
        ax.set_xticks([])
        ax.set_yticks([])

    plt.tight_layout()
    return embeddings, fig

# Usage:
# embeddings, fig = multi_scale_tsne(X_normalized, perplexities=[5, 30, 100, 300])
# plt.savefig('multi_scale_tsne.png', dpi=150)
```

Interpreting Multi-Scale Results:
When analyzing multi-scale t-SNE, trust structure that persists across perplexity values; features that appear at only one setting are more likely artifacts of that particular scale.
This multi-scale approach aligns with recent work on hierarchical t-SNE and multi-resolution embeddings that explicitly model structure at multiple scales.
While there's no universal "correct" perplexity for all datasets, the following guidelines help you make informed choices.
Rule of Thumb: Start with Perplexity = 30
The canonical default of 30 works surprisingly well across many datasets. This is a good starting point unless you have specific reasons to deviate.
Adjust Based on Dataset Characteristics:
| Scenario | Recommended Perplexity | Rationale |
|---|---|---|
| Small dataset (N < 200) | 5-15 | Not enough points for high perplexity |
| Looking for small subclusters | 10-20 | Fine-grained local structure |
| General exploration | 30-50 | Balanced view |
| Very large dataset (N > 50k) | 50-100 | More neighbors for stability |
| Confirming broad cluster structure | 100-200 | Global organization |
| High intrinsic dimensionality | 50+ | Need larger neighborhoods |
Signs Your Perplexity is Too Low:
- Clusters fragment into many small, disconnected islands
- The embedding looks noisy and scattered, with little coherent structure
Signs Your Perplexity is Too High:
- Distinct clusters merge into undifferentiated blobs
- Fine subcluster structure disappears and the embedding tends toward a single diffuse cloud
If you have ground truth labels or domain knowledge about expected cluster structure, use them to guide perplexity selection. Run t-SNE at several perplexity values and choose the one that best reveals known structure while also showing new patterns. This semi-supervised approach leverages your expertise.
```python
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import numpy as np

def select_perplexity_with_labels(X, labels, perplexities=[5, 15, 30, 50, 100]):
    """
    Use ground truth labels to help select perplexity.

    Computes silhouette score on the embedding for each perplexity.
    Higher silhouette = better cluster separation in embedding.

    Note: This is a heuristic, not a definitive measure of quality.
    """
    results = []

    for perp in perplexities:
        if perp >= X.shape[0] / 3:
            print(f"Skipping perplexity={perp} (too high for N={X.shape[0]})")
            continue

        tsne = TSNE(
            n_components=2,
            perplexity=perp,
            random_state=42,
            n_iter=1000
        )
        Y = tsne.fit_transform(X)

        # Compute silhouette score on embedding
        score = silhouette_score(Y, labels)
        results.append({
            'perplexity': perp,
            'silhouette': score
        })
        print(f"Perplexity {perp:3d}: Silhouette = {score:.3f}")

    if not results:
        raise ValueError("No perplexity value satisfied perplexity < N/3")

    # Find best perplexity
    best = max(results, key=lambda x: x['silhouette'])
    print(f"Recommended perplexity: {best['perplexity']}")

    return results

# Example usage:
# results = select_perplexity_with_labels(X_train, y_train)
```

Advanced: Automatic Perplexity Selection
Recent research has proposed methods for automatic perplexity selection.
However, for practical visualization, manually exploring 3-5 perplexity values usually suffices and provides valuable insight into data structure at different scales.
While perplexity is primarily a modeling parameter, it also has computational implications, especially for approximate t-SNE implementations.
Exact t-SNE:
In the naive O(N²) implementation, perplexity doesn't affect computational cost significantly. The binary search for σᵢ converges in O(log(range)) iterations regardless of the target perplexity.
Barnes-Hut t-SNE:
The Barnes-Hut approximation uses tree-based acceleration (typically quad-trees in 2D) to approximate the repulsive forces. Perplexity affects the attractive-force side: more neighbors per point means more attractive terms to evaluate, so higher perplexity costs somewhat more per iteration. In practice, the effect is usually modest.
Approximate Nearest Neighbors:
For very large datasets, t-SNE implementations use approximate nearest neighbor (ANN) algorithms to compute the k nearest neighbors needed for constructing P. The relationship with perplexity:
```python
# Typical relationship in approximate t-SNE:
#     K_neighbors ≈ 3 × perplexity
#
# Why 3×?
# The neighborhood probability extends beyond the exact perplexity
# nearest neighbors. To ensure accurate probabilities, we need
# neighbors in the "tail" of the distribution.
#
# For perplexity = 30:
#     K_neighbors ≈ 90 nearest neighbors computed
# For perplexity = 100:
#     K_neighbors ≈ 300 nearest neighbors computed
#
# Computational implication:
# - ANN algorithms scale as O(N × K × log(N))
# - Higher perplexity → larger K → more computation
# - But still much faster than exact O(N²)
```

Modern implementations like openTSNE, FIt-SNE, and scikit-learn's TSNE combine the Barnes-Hut approximation with approximate nearest neighbors. For datasets up to ~100k points, these run in minutes. In these implementations, the perplexity parameter primarily affects quality, not runtime.
Memory Considerations:
Perplexity indirectly affects memory usage: with the K ≈ 3 × perplexity convention, higher perplexity means more stored neighbors per point, and hence more nonzero entries in P. However, efficient implementations store P as a sparse matrix, so the overhead is typically manageable.
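A back-of-the-envelope sketch of the dense-versus-sparse gap (our helper; the K ≈ 3 × perplexity convention and ~12 bytes per stored entry are ballpark assumptions, not exact figures for any particular library):

```python
def sparse_P_memory_mb(n_points, perplexity, bytes_per_entry=12):
    """Approximate memory (MB) for a sparse P storing ~3 * perplexity
    neighbors per point, assuming ~12 bytes per stored entry
    (8-byte float value + 4-byte column index)."""
    nnz = n_points * 3 * perplexity
    return nnz * bytes_per_entry / 1e6

n = 100_000
dense_mb = n ** 2 * 8 / 1e6            # dense float64 P: 80,000 MB (80 GB)
sparse_mb = sparse_P_memory_mb(n, 30)  # sparse P at perplexity 30: 108 MB
print(f"dense: {dense_mb:,.0f} MB   sparse: {sparse_mb:,.0f} MB")
```

Even tripling the perplexity only triples the sparse footprint, which remains several orders of magnitude below the dense matrix.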
Perplexity is the central hyperparameter of t-SNE, controlling the fundamental scale at which neighborhood structure is preserved. The key insights: perplexity measures the effective number of neighbors via 2^H; per-point bandwidths σᵢ are found by binary search, adapting automatically to local density; and the value you choose sets the scale of structure that the embedding preserves.
What's Next:
With a solid understanding of the objective function and perplexity, we now turn to the challenges of optimization. t-SNE's optimization landscape is highly non-convex, leading to local minima, sensitivity to initialization, and other challenges. The next page explores these issues and the techniques developed to address them.
You now understand the perplexity parameter—its mathematical definition, how it determines Gaussian bandwidths via binary search, and how to choose appropriate values. This knowledge is essential for using t-SNE effectively and interpreting results correctly. Next, we'll tackle the optimization challenges that make t-SNE tricky to use in practice.