In 2008, Laurens van der Maaten and Geoffrey Hinton published a paper that would fundamentally transform how we visualize high-dimensional data. Their algorithm, t-Distributed Stochastic Neighbor Embedding (t-SNE), solved a problem that had plagued dimensionality reduction for decades: how do you faithfully represent the local structure of high-dimensional data in two or three dimensions?
The answer lies in a beautifully elegant objective function that frames dimensionality reduction as a problem of matching probability distributions. Rather than preserving distances (like PCA or MDS), t-SNE preserves the probabilities that points are neighbors—a subtle but profound distinction that makes t-SNE extraordinarily effective at revealing cluster structure.
This page develops a complete, rigorous understanding of the t-SNE objective function. We will derive every component from first principles, understand the mathematical intuitions, and see exactly why this formulation produces such compelling visualizations.
By the end of this page, you will: • Understand the probabilistic formulation of neighborhood relationships • Derive the joint probability distributions in both high and low dimensions • Comprehend why t-SNE uses the Student-t distribution in the embedding space • Master the Kullback-Leibler divergence objective and its implications • Analyze the gradient and understand the attractive/repulsive force interpretation
Traditional dimensionality reduction methods like PCA and MDS attempt to preserve distances between points. While mathematically natural, this approach has a fundamental limitation: distances in high dimensions behave counterintuitively due to the curse of dimensionality.
In high-dimensional spaces, distances between points become increasingly similar. A consequence of measure concentration is that, for uniformly distributed points in high dimensions, the difference between the nearest and farthest neighbor shrinks relative to the absolute distances. This makes distance-based methods less effective at distinguishing meaningful structure.
t-SNE takes a radically different approach: instead of preserving distances, it preserves neighborhood probabilities. The key insight is to convert distances into probabilities that quantify "the likelihood that point i would pick point j as its neighbor."
Think of each point i as randomly selecting neighbors according to their proximity. Nearby points have high selection probability; distant points have near-zero probability. t-SNE tries to ensure that this neighbor-selection distribution is the same in high-dimensional and low-dimensional spaces.
Why probabilities instead of distances?
Probabilities offer several advantages over raw distances: they adapt to local density through a per-point bandwidth, they decay smoothly so that only nearby points carry significant weight, and mismatches between two probability distributions can be measured with principled information-theoretic tools such as the KL divergence.
The probabilistic formulation transforms dimensionality reduction from a geometric problem into a statistical inference problem: finding a low-dimensional configuration whose neighborhood probabilities best match those in high dimensions.
We begin by defining neighborhood relationships in the original high-dimensional space. Given N data points X = {x₁, x₂, ..., xₙ} in ℝᴰ, we need to quantify the probability that point xᵢ would "choose" point xⱼ as its neighbor.
Conditional Probabilities
t-SNE models this using a Gaussian distribution centered at each point. The conditional probability that xᵢ picks xⱼ as a neighbor is:
p(j|i) = exp(-||xᵢ - xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / 2σᵢ²)

where:
• ||xᵢ - xⱼ||² is the squared Euclidean distance
• σᵢ is the bandwidth (standard deviation) of the Gaussian centered at xᵢ
• the sum in the denominator runs over all points k ≠ i
• by definition, p(i|i) = 0

Understanding the Gaussian kernel:
The Gaussian kernel has several desirable properties for defining neighborhood probabilities: it decays smoothly and monotonically with distance, it concentrates nearly all probability mass on nearby points, and its bandwidth gives direct control over the scale of the neighborhood.
The bandwidth parameter σᵢ:
Notice that each point i has its own bandwidth σᵢ. This is crucial because data density varies across the manifold. In dense regions, we need smaller σᵢ to distinguish between many nearby points. In sparse regions, larger σᵢ ensures points still have meaningful neighbors.
The bandwidth is determined by a user-specified perplexity (discussed in detail on the next page), which effectively controls how many neighbors each point considers.
Note that p(j|i) ≠ p(i|j) in general! Point i might consider j a close neighbor (high p(j|i)), while j, surrounded by many other close points, might assign low probability to i (low p(i|j)). This asymmetry complicates optimization, which is why t-SNE symmetrizes the distribution.
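To make the conditional distribution concrete, here is a minimal sketch (not the paper's reference implementation) that assumes a single fixed bandwidth σ for every point; in the real algorithm each σᵢ is found from the perplexity, as discussed on the next page. It also illustrates the asymmetry just described.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Compute p(j|i) with a Gaussian kernel and one fixed bandwidth sigma."""
    # Squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))  # unnormalized Gaussian kernel
    np.fill_diagonal(affinities, 0.0)                  # p(i|i) = 0
    return affinities / affinities.sum(axis=1, keepdims=True)  # rows sum to 1

# A tight pair of points plus one distant point
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]])
P_cond = conditional_probabilities(X, sigma=1.0)
print(P_cond[0, 2])  # ~4e-6: point 0 almost never picks the distant point 2
print(P_cond[2, 0])  # ~0.38: point 2 must pick someone, so p(0|2) is sizable
```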
Symmetrization: Joint Probabilities
To eliminate asymmetry and simplify the optimization, t-SNE converts conditional probabilities to symmetric joint probabilities:
pᵢⱼ = (p(j|i) + p(i|j)) / 2N

Properties:
• pᵢⱼ = pⱼᵢ (symmetric)
• Σᵢⱼ pᵢⱼ = 1 (normalized to form a proper probability distribution)
• pᵢᵢ = 0 (no self-loops)
• pᵢⱼ > 0 for all pairs i ≠ j (strictly positive)

The division by 2N ensures the joint distribution sums to 1. This symmetrization has a nice interpretation: pᵢⱼ is the probability of obtaining the pair (i, j) when we first pick one of the two points uniformly at random and then pick the other according to its conditional neighbor distribution.
Handling outliers:
An important benefit of this symmetrization is robustness to outliers. If point i is an outlier (far from all other points), every other point j assigns it a vanishingly small p(i|j), so with purely conditional probabilities the position of yᵢ would barely affect the cost. Symmetrizing guarantees that Σⱼ pᵢⱼ > 1/(2N) for every point i, so even outliers make a meaningful contribution to the objective and receive well-determined embeddings.
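Continuing the sketch above (this reuses `P_cond` and the hypothetical `conditional_probabilities` helper), the symmetrization itself is a one-liner, and its properties are easy to verify:

```python
import numpy as np

def joint_probabilities(P_cond):
    """Symmetrize conditional probabilities into the joint distribution p_ij."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * N)  # p_ij = (p(j|i) + p(i|j)) / 2N

P = joint_probabilities(P_cond)
print(np.allclose(P, P.T))       # True: symmetric
print(np.isclose(P.sum(), 1.0))  # True: a proper probability distribution
print(P.sum(axis=1))             # each row sums to at least 1/(2N), even for the outlier
```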
Now we define neighborhood relationships in the low-dimensional embedding space. Given the map points Y = {y₁, y₂, ..., yₙ} in ℝᵈ (typically d = 2 or 3), we define joint probabilities qᵢⱼ analogously—but with a crucial difference.
The Crowding Problem
Using a Gaussian kernel in the low-dimensional space leads to a fundamental problem that van der Maaten and Hinton termed the crowding problem.
Consider a moderately large number of points in D dimensions that are mutual neighbors (all within similar distance of each other). When we try to embed these into 2D while preserving distances, we run into a geometric impossibility: the 2D space doesn't have "room" for all points to maintain their high-dimensional relative positions.
Mathematically, the volume of a d-dimensional ball scales as rᵈ. In D dimensions, there's ample room for many points at similar distances; in 2 dimensions, points get "crowded" together. To accommodate moderately distant points, truly nearby points must be crushed together, destroying local structure.
Imagine 10 points arranged as vertices of a 9-simplex in high dimensions, all equidistant from each other. In 2D, it's geometrically impossible to place 10 points all equidistant. Something has to give—either nearby points collapse together, or moderately distant points are pushed too far apart. This is the crowding problem.
The Student-t Solution
t-SNE's key innovation is using a Student-t distribution with one degree of freedom (equivalently, a Cauchy distribution) in the low-dimensional space:
qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹

where:
• ||yᵢ - yⱼ||² is the squared Euclidean distance in embedding space
• the normalization sum runs over all pairs k ≠ l, ensuring Σᵢⱼ qᵢⱼ = 1
• qᵢᵢ = 0 (no self-loops)

The Student-t distribution with ν = 1 degree of freedom has the form f(t) ∝ (1 + t²)⁻¹, a "heavy-tailed" distribution compared to the Gaussian.

Why heavy tails solve the crowding problem:
The Student-t distribution has heavier tails than the Gaussian, meaning it assigns relatively more probability to distant points and relatively less to nearby points. This seemingly counterintuitive choice is exactly what's needed:
The heavy tails effectively create more space in the probability distribution for intermediate distances, alleviating the crowding problem.
| Property | Gaussian (High-D) | Student-t (Low-D) |
|---|---|---|
| Formula | exp(-d²/2σ²) | (1 + d²)⁻¹ |
| Tail behavior | Exponential decay (light tails) | Polynomial decay (heavy tails) |
| Probability at d=0 | Maximum (normalized) | Maximum (normalized) |
| Probability at d=2 | Small (e⁻² ≈ 0.14 for σ=1) | Moderate (1/5) |
| Probability at d=10 | Essentially zero (e⁻⁵⁰ ≈ 10⁻²²) | Still ~1% of maximum |
| Effect on embedding | N/A (data space) | Allows spreading of distant points |
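The table's qualitative claims are easy to check numerically. This small sketch (assuming σ = 1 for the Gaussian column) simply evaluates the two unnormalized kernels at the distances listed above:

```python
import numpy as np

def gaussian_kernel(d, sigma=1.0):
    return np.exp(-d**2 / (2 * sigma**2))

def student_t_kernel(d):
    return 1.0 / (1.0 + d**2)

for d in [0.0, 2.0, 10.0]:
    print(f"d={d:4.1f}  Gaussian={gaussian_kernel(d):.2e}  Student-t={student_t_kernel(d):.2e}")
# d= 0.0  Gaussian=1.00e+00  Student-t=1.00e+00
# d= 2.0  Gaussian=1.35e-01  Student-t=2.00e-01
# d=10.0  Gaussian=1.93e-22  Student-t=9.90e-03
```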
The t-distribution as a limiting case:
The choice of ν = 1 degree of freedom is largely pragmatic. Van der Maaten and Hinton showed empirically that ν = 1 works well across many datasets. Theoretically, lower degrees of freedom give heavier tails (more spreading), while higher degrees approach the Gaussian (less spreading). The value ν = 1 provides a balance that works well for typical 2D visualizations.
Recent work on parametric t-SNE variants has explored learning the degrees of freedom, but ν = 1 remains the default for its robustness and simplicity.
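The effect of the degrees-of-freedom parameter can be checked directly. The sketch below evaluates the unnormalized Student-t kernel (1 + d²/ν)^(−(ν+1)/2) at d = 5 for several ν; note that this generalized kernel is an illustration of the Student-t family, while t-SNE itself uses only ν = 1.

```python
import numpy as np

def student_t_kernel(d, nu):
    """Unnormalized Student-t kernel with nu degrees of freedom."""
    return (1.0 + d**2 / nu) ** (-(nu + 1) / 2)

d = 5.0
for nu in [1, 2, 10, 100]:
    print(f"nu={nu:3d}  kernel(d=5) = {student_t_kernel(d, nu):.1e}")
print(f"Gaussian limit exp(-d^2/2) = {np.exp(-d**2 / 2):.1e}")
# nu=  1  kernel(d=5) = 3.8e-02
# nu=  2  kernel(d=5) = 2.0e-02
# nu= 10  kernel(d=5) = 1.0e-03
# nu=100  kernel(d=5) = 1.3e-05
# Gaussian limit exp(-d^2/2) = 3.7e-06
```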
With probability distributions P (from high-dimensional affinities) and Q (from low-dimensional affinities), we need an objective function that measures their similarity. t-SNE uses the Kullback-Leibler (KL) divergence, a fundamental information-theoretic measure.
The Cost Function
t-SNE minimizes the KL divergence from P to Q over the embedding coordinates Y:
C = KL(P || Q) = Σᵢⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

Expanding:

C = Σᵢⱼ pᵢⱼ log(pᵢⱼ) - Σᵢⱼ pᵢⱼ log(qᵢⱼ) = -H(P) - Σᵢⱼ pᵢⱼ log(qᵢⱼ)

where H(P) = -Σᵢⱼ pᵢⱼ log(pᵢⱼ) is the entropy of P. Since P is fixed (computed once from the input data), minimizing C is equivalent to maximizing Σᵢⱼ pᵢⱼ log(qᵢⱼ), i.e., maximizing the expected log-likelihood of Q under P.

Properties of KL Divergence:
• KL(P || Q) ≥ 0, with equality if and only if P = Q
• It is not a metric: in general KL(P || Q) ≠ KL(Q || P)
• Terms with large pᵢⱼ and small qᵢⱼ dominate the sum
The Critical Asymmetry:
The direction of KL divergence—KL(P || Q) rather than KL(Q || P)—fundamentally shapes t-SNE's behavior:
Think of pᵢⱼ as "true importance" and qᵢⱼ as "modeled importance." KL(P||Q) says: "If a pair is truly important (high pᵢⱼ), make sure you model it as important (high qᵢⱼ)." It's okay to model unimportant pairs as important—you just can't miss the important ones. This is exactly right for visualization: preserve what matters locally.
Mathematical Consequence:
When qᵢⱼ is small but pᵢⱼ is large, the term pᵢⱼ log(pᵢⱼ/qᵢⱼ) contributes a large positive value to the cost. The log(1/qᵢⱼ) term can grow without bound as qᵢⱼ → 0, creating strong pressure to maintain high qᵢⱼ for pairs with high pᵢⱼ.
Conversely, when pᵢⱼ is small, even if qᵢⱼ is large, the multiplicative factor pᵢⱼ keeps the contribution small. This is why t-SNE can place globally distant points near each other in the embedding—there's little cost to doing so if they weren't neighbors in high dimensions.
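A small numerical illustration of this asymmetry, using made-up three-pair distributions (the specific numbers are arbitrary):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i log(p_i / q_i), treating 0 * log(0/q) as 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.80, 0.15, 0.05]                # "true" pair probabilities

q_missed   = [0.10, 0.45, 0.45]       # the truly close pair (p=0.80) modeled as far apart
q_spurious = [0.75, 0.05, 0.20]       # a truly distant pair (p=0.05) modeled as close

print(kl_divergence(p, q_missed))     # ~1.39: missing a real neighbor is expensive
print(kl_divergence(p, q_spurious))   # ~0.15: a spurious neighbor costs little
```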
To minimize the KL divergence, t-SNE uses gradient descent on the embedding coordinates. The gradient with respect to each embedding point yᵢ reveals a beautiful physical interpretation.
Deriving the Gradient:
The gradient of the cost function with respect to yᵢ is:
∂C/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ)(1 + ||yᵢ - yⱼ||²)⁻¹ (yᵢ - yⱼ)

Since qᵢⱼ Z = (1 + ||yᵢ - yⱼ||²)⁻¹, where Z = Σₖ≠ₗ (1 + ||yₖ - yₗ||²)⁻¹ is the normalization constant, this can be rewritten as:

∂C/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ) qᵢⱼ Z (yᵢ - yⱼ)

Breaking into components:

∂C/∂yᵢ = 4 [ Σⱼ pᵢⱼ qᵢⱼ Z (yᵢ - yⱼ) - Σⱼ qᵢⱼ² Z (yᵢ - yⱼ) ] = Attractive Forces - Repulsive Forces

Physical Interpretation: Springs and Repulsion
The gradient has an elegant interpretation as a system of forces acting on each point:
Attractive Forces (pᵢⱼ > qᵢⱼ): When two points should be neighbors (high pᵢⱼ) but are currently too far apart in the embedding (low qᵢⱼ), the term (pᵢⱼ - qᵢⱼ) is positive. This creates an attractive force that pulls yᵢ toward yⱼ, like a spring connecting neighborhood points.
Repulsive Forces (pᵢⱼ < qᵢⱼ): When two points shouldn't be close neighbors (low pᵢⱼ) but are currently too close in the embedding (high qᵢⱼ), the term (pᵢⱼ - qᵢⱼ) is negative. This creates a repulsive force that pushes yᵢ away from yⱼ.
The Attenuation Factor (1 + ||yᵢ - yⱼ||²)⁻¹: This factor, coming from the Student-t kernel, attenuates forces for distant pairs. Points far apart in the embedding exert weaker forces on each other, allowing local neighborhoods to settle without interference from distant points.
t-SNE is closely related to force-directed graph layouts used in network visualization. In graph drawing, nodes are connected by springs (edges) that exert attractive forces, while all node pairs repel to prevent overlap. t-SNE generalizes this: spring strengths are probabilities pᵢⱼ, and the Student-t kernel provides the repulsion profile.
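To connect the formula to this force picture, the hedged sketch below splits the gradient on each point into its attractive and repulsive parts (it assumes `P` is the joint distribution computed earlier; the full gradient step appears in the code block further down):

```python
import numpy as np

def tsne_forces(Y, P):
    """Decompose the t-SNE gradient on each point into attraction and repulsion."""
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T   # pairwise squared distances
    W = 1.0 / (1.0 + D)                           # Student-t kernel, equals q_ij * Z
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                               # normalized q_ij

    attract = np.zeros_like(Y)
    repel = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        diff = Y[i] - Y                           # y_i - y_j for every j
        attract[i] = 4 * np.sum((P[i] * W[i])[:, None] * diff, axis=0)  # p_ij q_ij Z terms
        repel[i]   = 4 * np.sum((Q[i] * W[i])[:, None] * diff, axis=0)  # q_ij^2 Z terms
    # The gradient on point i is attract[i] - repel[i]
    return attract, repel
```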
Equilibrium and Cluster Formation:
At equilibrium (when the gradient is near zero for all points), each point experiences balanced attractive and repulsive forces: the pull from its true high-dimensional neighbors is offset by the push from all other nearby points in the embedding.
This force-based view explains why t-SNE produces separated clusters: the combination of local attraction and global repulsion naturally creates discrete, well-separated groups from continuous high-dimensional manifolds.
The gradient computation translates directly into code. Below is a single gradient descent step with momentum:

```python
import numpy as np

def tsne_gradient_step(Y, P, velocity, learning_rate=200.0, momentum=0.8):
    """
    Perform one gradient descent step (with momentum) for t-SNE.

    Y:        current embedding (N x d array)
    P:        high-dimensional joint probabilities (N x N array)
    velocity: previous update (N x d array), initialized to zeros
    """
    N, d = Y.shape

    # Pairwise squared distances in the embedding space
    sum_Y = np.sum(Y**2, axis=1)
    D = sum_Y[:, np.newaxis] + sum_Y[np.newaxis, :] - 2 * Y @ Y.T

    # Q distribution (Student-t kernel with 1 degree of freedom)
    Q_numerator = 1 / (1 + D)
    np.fill_diagonal(Q_numerator, 0)
    Q = Q_numerator / np.sum(Q_numerator)
    Q = np.maximum(Q, 1e-12)  # numerical stability

    # Gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij)(1 + ||y_i - y_j||^2)^-1 (y_i - y_j)
    PQ_diff = P - Q
    grad = np.zeros_like(Y)
    for i in range(N):
        diff = Y[i] - Y  # (N x d) array of y_i - y_j
        grad[i] = 4 * np.sum(
            (PQ_diff[i, :] * Q_numerator[i, :])[:, np.newaxis] * diff,
            axis=0
        )

    # Momentum update: retain part of the previous step, move against the gradient
    velocity = momentum * velocity - learning_rate * grad
    Y_new = Y + velocity
    return Y_new, velocity

# Typical usage: Y = 1e-4 * np.random.randn(N, 2), velocity = np.zeros_like(Y),
# then call tsne_gradient_step repeatedly. (Real implementations also add early
# exaggeration of P and use a lower momentum during the first iterations.)
```

t-SNE evolved from the original Stochastic Neighbor Embedding (SNE) algorithm proposed by Hinton and Roweis in 2002. Understanding the differences illuminates why t-SNE's modifications are essential.
Original SNE Formulation:
SNE used Gaussian kernels in both high-dimensional and low-dimensional spaces:
Original SNE:

High-D: p(j|i) = exp(-||xᵢ - xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / 2σᵢ²)
Low-D:  q(j|i) = exp(-||yᵢ - yⱼ||²) / Σₖ≠ᵢ exp(-||yᵢ - yₖ||²)
Cost:   Σᵢ KL(Pᵢ || Qᵢ) = Σᵢⱼ p(j|i) log(p(j|i) / q(j|i))

Key differences from t-SNE:
1. Uses a Gaussian in low-D (not Student-t)
2. Uses conditional probabilities (not joint)
3. Sums KL divergences for each point (not a single joint KL)

| Aspect | Original SNE | t-SNE | Why t-SNE is Better |
|---|---|---|---|
| Low-D kernel | Gaussian | Student-t (ν=1) | Solves crowding problem; heavy tails allow spreading |
| Probability type | Conditional p(j|i) | Joint pᵢⱼ | Symmetric; simpler gradient; better optimization |
| Outlier handling | Can collapse to center | Robust embedding | Symmetrized p includes information from neighbors of outliers |
| Gradient | Complex, asymmetric | Clean, symmetric | Easier to optimize; more stable dynamics |
| Cluster separation | Moderate | Excellent | Heavy tails create natural gaps between clusters |
Why the Student-t Distribution Works:
The original SNE suffered severely from the crowding problem because Gaussian kernels in both spaces meant that any spreading of moderately distant points would cause their probabilities to plummet, incurring huge cost. Points were forced into a compromised state where neither local nor global structure was well-preserved.
The Student-t distribution's polynomial tails (versus the Gaussian's exponential tails) change the game:
At d = 2 units apart: SNE's low-dimensional Gaussian kernel exp(-d²) has dropped to e⁻⁴ ≈ 0.02 of its peak value, while the Student-t kernel retains (1 + 4)⁻¹ = 0.2.
At d = 5 units apart: the Gaussian kernel is e⁻²⁵ ≈ 10⁻¹¹, effectively zero, while the Student-t kernel still retains (1 + 25)⁻¹ ≈ 0.04.
The Student-t distribution maintains meaningful probability for moderately distant points, giving the embedding "room to breathe" and place clusters at appropriate distances.
The choice of exactly one degree of freedom makes the Student-t distribution equivalent to the Cauchy distribution. This is the heaviest possible Student-t distribution (heavier tails than any ν > 1). The heavy tails provide maximum flexibility for spreading clusters apart. Empirically, this choice works remarkably well across diverse datasets.
We have developed a complete understanding of the t-SNE objective function, its mathematical foundations, and the design choices that make it effective for visualization.
What's Next:
The objective function provides the mathematical foundation. But t-SNE's behavior in practice depends critically on the perplexity parameter, which controls the effective number of neighbors considered for each point. On the next page, we'll dive deep into perplexity—how it's defined, how it determines the bandwidths σᵢ, and how to choose it appropriately for different datasets.
You now understand the mathematical foundations of t-SNE's objective function. The probabilistic approach—matching neighborhood distributions via KL divergence—combined with the Student-t kernel for the embedding space, creates a powerful method for visualizing high-dimensional structure. Next, we'll explore the perplexity parameter that controls the scale of neighborhoods.