In 2008, Laurens van der Maaten and Geoffrey Hinton published a paper that would fundamentally transform how we visualize high-dimensional data. Their algorithm, t-Distributed Stochastic Neighbor Embedding (t-SNE), solved a problem that had plagued dimensionality reduction for decades: how do you faithfully represent the local structure of high-dimensional data in two or three dimensions?
The answer lies in a beautifully elegant objective function that frames dimensionality reduction as a problem of matching probability distributions. Rather than preserving distances (like PCA or MDS), t-SNE preserves the probabilities that points are neighbors—a subtle but profound distinction that makes t-SNE extraordinarily effective at revealing cluster structure.
This page develops a complete, rigorous understanding of the t-SNE objective function. We will derive every component from first principles, understand the mathematical intuitions, and see exactly why this formulation produces such compelling visualizations.
By the end of this page, you will: • Understand the probabilistic formulation of neighborhood relationships • Derive the joint probability distributions in both high and low dimensions • Comprehend why t-SNE uses the Student-t distribution in the embedding space • Master the Kullback-Leibler divergence objective and its implications • Analyze the gradient and understand the attractive/repulsive force interpretation
Traditional dimensionality reduction methods like PCA and MDS attempt to preserve distances between points. While mathematically natural, this approach has a fundamental limitation: distances in high dimensions behave counterintuitively due to the curse of dimensionality.
In high-dimensional spaces, distances between points become increasingly similar. A consequence of measure concentration is that, for uniformly distributed points in high dimensions, the difference between the nearest and farthest neighbor shrinks relative to the absolute distances. This makes distance-based methods less effective at distinguishing meaningful structure.
t-SNE takes a radically different approach: instead of preserving distances, it preserves neighborhood probabilities. The key insight is to convert distances into probabilities that quantify "the likelihood that point i would pick point j as its neighbor."
Think of each point i as randomly selecting neighbors according to their proximity. Nearby points have high selection probability; distant points have near-zero probability. t-SNE tries to ensure that this neighbor-selection distribution is the same in high-dimensional and low-dimensional spaces.
Why probabilities instead of distances?
Probabilities offer several advantages over raw distances: they adapt to local density through a per-point bandwidth, they decay smoothly so that only nearby points carry significant weight, and mismatches between two probability distributions can be measured with principled information-theoretic tools such as the KL divergence.
The probabilistic formulation transforms dimensionality reduction from a geometric problem into a statistical inference problem: finding a low-dimensional configuration whose neighborhood probabilities best match those in high dimensions.
We begin by defining neighborhood relationships in the original high-dimensional space. Given N data points X = {x₁, x₂, ..., xₙ} in ℝᴰ, we need to quantify the probability that point xᵢ would "choose" point xⱼ as its neighbor.
Conditional Probabilities
t-SNE models this using a Gaussian distribution centered at each point. The conditional probability that xᵢ picks xⱼ as a neighbor is:
p(j|i) = exp(-||xᵢ - xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / 2σᵢ²)

where:
• ||xᵢ - xⱼ||² is the squared Euclidean distance
• σᵢ is the bandwidth (standard deviation) of the Gaussian centered at xᵢ
• the sum in the denominator runs over all points k ≠ i
• by definition, p(i|i) = 0

Understanding the Gaussian kernel:
The Gaussian kernel has several desirable properties for defining neighborhood probabilities: it decays smoothly and monotonically with distance, it concentrates nearly all probability mass on nearby points, and its bandwidth gives direct control over the scale of the neighborhood.
The bandwidth parameter σᵢ:
Notice that each point i has its own bandwidth σᵢ. This is crucial because data density varies across the manifold. In dense regions, we need smaller σᵢ to distinguish between many nearby points. In sparse regions, larger σᵢ ensures points still have meaningful neighbors.
The bandwidth is determined by a user-specified perplexity (discussed in detail on the next page), which effectively controls how many neighbors each point considers.
Note that p(j|i) ≠ p(i|j) in general! Point i might consider j a close neighbor (high p(j|i)), while j, surrounded by many other close points, might assign low probability to i (low p(i|j)). This asymmetry complicates optimization, which is why t-SNE symmetrizes the distribution.
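To make the conditional distribution concrete, here is a minimal sketch (not the paper's reference implementation) that assumes a single fixed bandwidth σ for every point; in the real algorithm each σᵢ is found from the perplexity, as discussed on the next page. It also illustrates the asymmetry just described.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Compute p(j|i) with a Gaussian kernel and one fixed bandwidth sigma."""
    # Squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))  # unnormalized Gaussian kernel
    np.fill_diagonal(affinities, 0.0)                  # p(i|i) = 0
    return affinities / affinities.sum(axis=1, keepdims=True)  # rows sum to 1

# A tight pair of points plus one distant point
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]])
P_cond = conditional_probabilities(X, sigma=1.0)
print(P_cond[0, 2])  # ~4e-6: point 0 almost never picks the distant point 2
print(P_cond[2, 0])  # ~0.38: point 2 must pick someone, so p(0|2) is sizable
```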
Symmetrization: Joint Probabilities
To eliminate asymmetry and simplify the optimization, t-SNE converts conditional probabilities to symmetric joint probabilities:
pᵢⱼ = (p(j|i) + p(i|j)) / 2N

Properties:
• pᵢⱼ = pⱼᵢ (symmetric)
• Σᵢⱼ pᵢⱼ = 1 (normalized to form a proper probability distribution)
• pᵢᵢ = 0 (no self-loops)
• pᵢⱼ > 0 for all pairs i ≠ j (strictly positive)

The division by 2N ensures the joint distribution sums to 1. This symmetrization has a nice interpretation: pᵢⱼ is the probability of obtaining the pair (i, j) when we first pick one of the two points uniformly at random and then pick the other according to its conditional neighbor distribution.
Handling outliers:
An important benefit of this symmetrization is robustness to outliers. If point i is an outlier (far from all other points), every other point j assigns it a vanishingly small p(i|j), so with purely conditional probabilities the position of yᵢ would barely affect the cost. Symmetrizing guarantees that Σⱼ pᵢⱼ > 1/(2N) for every point i, so even outliers make a meaningful contribution to the objective and receive well-determined embeddings.
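Continuing the sketch above (this reuses `P_cond` and the hypothetical `conditional_probabilities` helper), the symmetrization itself is a one-liner, and its properties are easy to verify:

```python
import numpy as np

def joint_probabilities(P_cond):
    """Symmetrize conditional probabilities into the joint distribution p_ij."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * N)  # p_ij = (p(j|i) + p(i|j)) / 2N

P = joint_probabilities(P_cond)
print(np.allclose(P, P.T))       # True: symmetric
print(np.isclose(P.sum(), 1.0))  # True: a proper probability distribution
print(P.sum(axis=1))             # each row sums to at least 1/(2N), even for the outlier
```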
Now we define neighborhood relationships in the low-dimensional embedding space. Given the map points Y = {y₁, y₂, ..., yₙ} in ℝᵈ (typically d = 2 or 3), we define joint probabilities qᵢⱼ analogously—but with a crucial difference.
The Crowding Problem
Using a Gaussian kernel in the low-dimensional space leads to a fundamental problem that van der Maaten and Hinton termed the crowding problem.
Consider a moderately large number of points in D dimensions that are mutual neighbors (all within similar distance of each other). When we try to embed these into 2D while preserving distances, we run into a geometric impossibility: the 2D space doesn't have "room" for all points to maintain their high-dimensional relative positions.
Mathematically, the volume of a d-dimensional ball scales as rᵈ. In D dimensions, there's ample room for many points at similar distances; in 2 dimensions, points get "crowded" together. To accommodate moderately distant points, truly nearby points must be crushed together, destroying local structure.
Imagine 10 points arranged as vertices of a 9-simplex in high dimensions, all equidistant from each other. In 2D, it's geometrically impossible to place 10 points all equidistant. Something has to give—either nearby points collapse together, or moderately distant points are pushed too far apart. This is the crowding problem.
The Student-t Solution
t-SNE's key innovation is using a Student-t distribution with one degree of freedom (equivalently, a Cauchy distribution) in the low-dimensional space:
qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹

where:
• ||yᵢ - yⱼ||² is the squared Euclidean distance in embedding space
• the normalization sum runs over all pairs k ≠ l, ensuring Σᵢⱼ qᵢⱼ = 1
• qᵢᵢ = 0 (no self-loops)

The Student-t distribution with ν = 1 degree of freedom has the form f(t) ∝ (1 + t²)⁻¹, a "heavy-tailed" distribution compared to the Gaussian.

Why heavy tails solve the crowding problem:
The Student-t distribution has heavier tails than the Gaussian, meaning it assigns relatively more probability to distant points and relatively less to nearby points. This seemingly counterintuitive choice is exactly what's needed:
The heavy tails effectively create more space in the probability distribution for intermediate distances, alleviating the crowding problem.
| Property | Gaussian (High-D) | Student-t (Low-D) |
|---|---|---|
| Formula | exp(-d²/2σ²) | (1 + d²)⁻¹ |
| Tail behavior | Exponential decay (light tails) | Polynomial decay (heavy tails) |
| Probability at d=0 | Maximum (normalized) | Maximum (normalized) |
| Probability at d=2 | Small (e⁻² ≈ 0.14 for σ=1) | Moderate (1/5) |
| Probability at d=10 | Essentially zero (e⁻⁵⁰ ≈ 10⁻²²) | Still ~1% of maximum |
| Effect on embedding | N/A (data space) | Allows spreading of distant points |
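The table's qualitative claims are easy to check numerically. This small sketch (assuming σ = 1 for the Gaussian column) simply evaluates the two unnormalized kernels at the distances listed above:

```python
import numpy as np

def gaussian_kernel(d, sigma=1.0):
    return np.exp(-d**2 / (2 * sigma**2))

def student_t_kernel(d):
    return 1.0 / (1.0 + d**2)

for d in [0.0, 2.0, 10.0]:
    print(f"d={d:4.1f}  Gaussian={gaussian_kernel(d):.2e}  Student-t={student_t_kernel(d):.2e}")
# d= 0.0  Gaussian=1.00e+00  Student-t=1.00e+00
# d= 2.0  Gaussian=1.35e-01  Student-t=2.00e-01
# d=10.0  Gaussian=1.93e-22  Student-t=9.90e-03
```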
The t-distribution as a limiting case:
The choice of ν = 1 degree of freedom is largely pragmatic. Van der Maaten and Hinton showed empirically that ν = 1 works well across many datasets. Theoretically, lower degrees of freedom give heavier tails (more spreading), while higher degrees approach the Gaussian (less spreading). The value ν = 1 provides a balance that works well for typical 2D visualizations.
Recent work on parametric t-SNE variants has explored learning the degrees of freedom, but ν = 1 remains the default for its robustness and simplicity.
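The effect of the degrees-of-freedom parameter can be checked directly. The sketch below evaluates the unnormalized Student-t kernel (1 + d²/ν)^(−(ν+1)/2) at d = 5 for several ν; note that this generalized kernel is an illustration of the Student-t family, while t-SNE itself uses only ν = 1.

```python
import numpy as np

def student_t_kernel(d, nu):
    """Unnormalized Student-t kernel with nu degrees of freedom."""
    return (1.0 + d**2 / nu) ** (-(nu + 1) / 2)

d = 5.0
for nu in [1, 2, 10, 100]:
    print(f"nu={nu:3d}  kernel(d=5) = {student_t_kernel(d, nu):.1e}")
print(f"Gaussian limit exp(-d^2/2) = {np.exp(-d**2 / 2):.1e}")
# nu=  1  kernel(d=5) = 3.8e-02
# nu=  2  kernel(d=5) = 2.0e-02
# nu= 10  kernel(d=5) = 1.0e-03
# nu=100  kernel(d=5) = 1.3e-05
# Gaussian limit exp(-d^2/2) = 3.7e-06
```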
With probability distributions P (from high-dimensional affinities) and Q (from low-dimensional affinities), we need an objective function that measures their similarity. t-SNE uses the Kullback-Leibler (KL) divergence, a fundamental information-theoretic measure.
The Cost Function
t-SNE minimizes the KL divergence from P to Q over the embedding coordinates Y:
C = KL(P || Q) = Σᵢⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

Expanding:

C = Σᵢⱼ pᵢⱼ log(pᵢⱼ) - Σᵢⱼ pᵢⱼ log(qᵢⱼ) = -H(P) - Σᵢⱼ pᵢⱼ log(qᵢⱼ)

where H(P) = -Σᵢⱼ pᵢⱼ log(pᵢⱼ) is the entropy of P. Since P is fixed (computed once from the input data), minimizing C is equivalent to maximizing Σᵢⱼ pᵢⱼ log(qᵢⱼ), i.e., maximizing the expected log-likelihood of Q under P.

Properties of KL Divergence:
• KL(P || Q) ≥ 0, with equality if and only if P = Q
• It is not a metric: in general KL(P || Q) ≠ KL(Q || P)
• Terms with large pᵢⱼ and small qᵢⱼ dominate the sum
The Critical Asymmetry:
The direction of KL divergence—KL(P || Q) rather than KL(Q || P)—fundamentally shapes t-SNE's behavior:
Think of pᵢⱼ as "true importance" and qᵢⱼ as "modeled importance." KL(P||Q) says: "If a pair is truly important (high pᵢⱼ), make sure you model it as important (high qᵢⱼ)." It's okay to model unimportant pairs as important—you just can't miss the important ones. This is exactly right for visualization: preserve what matters locally.
Mathematical Consequence:
When qᵢⱼ is small but pᵢⱼ is large, the term pᵢⱼ log(pᵢⱼ/qᵢⱼ) contributes a large positive value to the cost. The log(1/qᵢⱼ) term can grow without bound as qᵢⱼ → 0, creating strong pressure to maintain high qᵢⱼ for pairs with high pᵢⱼ.
Conversely, when pᵢⱼ is small, even if qᵢⱼ is large, the multiplicative factor pᵢⱼ keeps the contribution small. This is why t-SNE can place globally distant points near each other in the embedding—there's little cost to doing so if they weren't neighbors in high dimensions.
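A small numerical illustration of this asymmetry, using made-up three-pair distributions (the specific numbers are arbitrary):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i log(p_i / q_i), treating 0 * log(0/q) as 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.80, 0.15, 0.05]                # "true" pair probabilities

q_missed   = [0.10, 0.45, 0.45]       # the truly close pair (p=0.80) modeled as far apart
q_spurious = [0.75, 0.05, 0.20]       # a truly distant pair (p=0.05) modeled as close

print(kl_divergence(p, q_missed))     # ~1.39: missing a real neighbor is expensive
print(kl_divergence(p, q_spurious))   # ~0.15: a spurious neighbor costs little
```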
To minimize the KL divergence, t-SNE uses gradient descent on the embedding coordinates. The gradient with respect to each embedding point yᵢ reveals a beautiful physical interpretation.
Deriving the Gradient:
The gradient of the cost function with respect to yᵢ is:
∂C/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ)(1 + ||yᵢ - yⱼ||²)⁻¹ (yᵢ - yⱼ)

Since qᵢⱼ Z = (1 + ||yᵢ - yⱼ||²)⁻¹, where Z = Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹ is the normalization constant, this can be rewritten as:

∂C/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ) qᵢⱼ Z (yᵢ - yⱼ)

Breaking into components:

∂C/∂yᵢ = 4 [ Σⱼ pᵢⱼ qᵢⱼ Z (yᵢ - yⱼ) - Σⱼ qᵢⱼ² Z (yᵢ - yⱼ) ] = Attractive Forces - Repulsive Forces

Physical Interpretation: Springs and Repulsion
The gradient has an elegant interpretation as a system of forces acting on each point:
Attractive Forces (pᵢⱼ > qᵢⱼ): When two points should be neighbors (high pᵢⱼ) but are currently too far apart in the embedding (low qᵢⱼ), the term (pᵢⱼ - qᵢⱼ) is positive. This creates an attractive force that pulls yᵢ toward yⱼ, like a spring connecting neighborhood points.
Repulsive Forces (pᵢⱼ < qᵢⱼ): When two points shouldn't be close neighbors (low pᵢⱼ) but are currently too close in the embedding (high qᵢⱼ), the term (pᵢⱼ - qᵢⱼ) is negative. This creates a repulsive force that pushes yᵢ away from yⱼ.
The Attenuation Factor (1 + ||yᵢ - yⱼ||²)⁻¹: This factor, coming from the Student-t kernel, attenuates forces for distant pairs. Points far apart in the embedding exert weaker forces on each other, allowing local neighborhoods to settle without interference from distant points.
t-SNE is closely related to force-directed graph layouts used in network visualization. In graph drawing, nodes are connected by springs (edges) that exert attractive forces, while all node pairs repel to prevent overlap. t-SNE generalizes this: spring strengths are probabilities pᵢⱼ, and the Student-t kernel provides the repulsion profile.
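To connect the formula to this force picture, the hedged sketch below splits the gradient on each point into its attractive and repulsive parts (it assumes `P` is the joint distribution computed earlier; the full gradient step appears in the code block further down):

```python
import numpy as np

def tsne_forces(Y, P):
    """Decompose the t-SNE gradient on each point into attraction and repulsion."""
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T   # pairwise squared distances
    W = 1.0 / (1.0 + D)                           # Student-t kernel, equals q_ij * Z
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                               # normalized q_ij

    attract = np.zeros_like(Y)
    repel = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        diff = Y[i] - Y                           # y_i - y_j for every j
        attract[i] = 4 * np.sum((P[i] * W[i])[:, None] * diff, axis=0)  # p_ij q_ij Z terms
        repel[i]   = 4 * np.sum((Q[i] * W[i])[:, None] * diff, axis=0)  # q_ij^2 Z terms
    # The gradient on point i is attract[i] - repel[i]
    return attract, repel
```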
Equilibrium and Cluster Formation:
At equilibrium (when the gradient is near zero for all points), each point experiences balanced attractive and repulsive forces: the pull from its true high-dimensional neighbors is offset by the push from all other nearby points in the embedding.
This force-based view explains why t-SNE produces separated clusters: the combination of local attraction and global repulsion naturally creates discrete, well-separated groups from continuous high-dimensional manifolds.
The gradient computation translates directly into code. Below is a single gradient descent step with momentum:

```python
import numpy as np

def tsne_gradient_step(Y, P, velocity, learning_rate=200.0, momentum=0.8):
    """
    Perform one gradient descent step (with momentum) for t-SNE.

    Y:        current embedding (N x d array)
    P:        high-dimensional joint probabilities (N x N array)
    velocity: previous update (N x d array), initialized to zeros
    """
    N, d = Y.shape

    # Pairwise squared distances in the embedding space
    sum_Y = np.sum(Y**2, axis=1)
    D = sum_Y[:, np.newaxis] + sum_Y[np.newaxis, :] - 2 * Y @ Y.T

    # Q distribution (Student-t kernel with 1 degree of freedom)
    Q_numerator = 1 / (1 + D)
    np.fill_diagonal(Q_numerator, 0)
    Q = Q_numerator / np.sum(Q_numerator)
    Q = np.maximum(Q, 1e-12)  # numerical stability

    # Gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij)(1 + ||y_i - y_j||^2)^-1 (y_i - y_j)
    PQ_diff = P - Q
    grad = np.zeros_like(Y)
    for i in range(N):
        diff = Y[i] - Y  # (N x d) array of y_i - y_j
        grad[i] = 4 * np.sum(
            (PQ_diff[i, :] * Q_numerator[i, :])[:, np.newaxis] * diff,
            axis=0
        )

    # Momentum update: retain part of the previous step, move against the gradient
    velocity = momentum * velocity - learning_rate * grad
    Y_new = Y + velocity
    return Y_new, velocity

# Typical usage: Y = 1e-4 * np.random.randn(N, 2), velocity = np.zeros_like(Y),
# then call tsne_gradient_step repeatedly. (Real implementations also add early
# exaggeration of P and use a lower momentum during the first iterations.)
```

t-SNE evolved from the original Stochastic Neighbor Embedding (SNE) algorithm proposed by Hinton and Roweis in 2002. Understanding the differences illuminates why t-SNE's modifications are essential.
Original SNE Formulation:
SNE used Gaussian kernels in both high-dimensional and low-dimensional spaces:
Original SNE:

High-D: p(j|i) = exp(-||xᵢ - xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / 2σᵢ²)
Low-D:  q(j|i) = exp(-||yᵢ - yⱼ||²) / Σₖ≠ᵢ exp(-||yᵢ - yₖ||²)
Cost:   Σᵢ KL(Pᵢ || Qᵢ) = Σᵢⱼ p(j|i) log(p(j|i) / q(j|i))

Key differences from t-SNE:
1. Uses a Gaussian in low-D (not Student-t)
2. Uses conditional probabilities (not joint)
3. Sums KL divergences for each point (not a single joint KL)

| Aspect | Original SNE | t-SNE | Why t-SNE is Better |
|---|---|---|---|
| Low-D kernel | Gaussian | Student-t (ν=1) | Solves crowding problem; heavy tails allow spreading |
| Probability type | Conditional p(j|i) | Joint pᵢⱼ | Symmetric; simpler gradient; better optimization |
| Outlier handling | Can collapse to center | Robust embedding | Symmetrized p includes information from neighbors of outliers |
| Gradient | Complex, asymmetric | Clean, symmetric | Easier to optimize; more stable dynamics |
| Cluster separation | Moderate | Excellent | Heavy tails create natural gaps between clusters |
Why the Student-t Distribution Works:
The original SNE suffered severely from the crowding problem because Gaussian kernels in both spaces meant that any spreading of moderately distant points would cause their probabilities to plummet, incurring huge cost. Points were forced into a compromised state where neither local nor global structure was well-preserved.
The Student-t distribution's polynomial tails (versus the Gaussian's exponential tails) change the game:
At d = 2 units apart: SNE's low-dimensional Gaussian kernel exp(-d²) has dropped to e⁻⁴ ≈ 0.02 of its peak value, while the Student-t kernel retains (1 + 4)⁻¹ = 0.2.
At d = 5 units apart: the Gaussian kernel is e⁻²⁵ ≈ 10⁻¹¹, effectively zero, while the Student-t kernel still retains (1 + 25)⁻¹ ≈ 0.04.
The Student-t distribution maintains meaningful probability for moderately distant points, giving the embedding "room to breathe" and place clusters at appropriate distances.
The choice of exactly one degree of freedom makes the Student-t distribution equivalent to the Cauchy distribution. This is the heaviest possible Student-t distribution (heavier tails than any ν > 1). The heavy tails provide maximum flexibility for spreading clusters apart. Empirically, this choice works remarkably well across diverse datasets.
We have developed a complete understanding of the t-SNE objective function, its mathematical foundations, and the design choices that make it effective for visualization.
What's Next:
The objective function provides the mathematical foundation. But t-SNE's behavior in practice depends critically on the perplexity parameter, which controls the effective number of neighbors considered for each point. On the next page, we'll dive deep into perplexity—how it's defined, how it determines the bandwidths σᵢ, and how to choose it appropriately for different datasets.
You now understand the mathematical foundations of t-SNE's objective function. The probabilistic approach—matching neighborhood distributions via KL divergence—combined with the Student-t kernel for the embedding space, creates a powerful method for visualizing high-dimensional structure. Next, we'll explore the perplexity parameter that controls the scale of neighborhoods.