With the high-dimensional fuzzy simplicial set constructed—a weighted graph encoding the topological structure of our data—UMAP faces its core computational challenge: finding a low-dimensional embedding whose fuzzy simplicial set matches the high-dimensional one as closely as possible.
This matching problem is framed as an optimization task. UMAP defines a loss function measuring the divergence between high- and low-dimensional fuzzy sets, then uses stochastic gradient descent to minimize it. The specific loss function—fuzzy set cross-entropy—is not arbitrary but emerges from principled information-theoretic considerations applied to membership functions.
This page provides a comprehensive treatment of UMAP's optimization framework, from the mathematical formulation of the loss function through the practical details of the optimization procedure.
By the end of this page, you will understand: (1) The fuzzy set cross-entropy loss function and its information-theoretic interpretation, (2) How the loss decomposes into attractive and repulsive forces, (3) The stochastic gradient descent optimization procedure with negative sampling, and (4) How these design choices lead to UMAP's remarkable computational efficiency.
Let's precisely formulate what UMAP optimizes. We have:
High-Dimensional Representation: the fuzzy simplicial set constructed on the previous page, summarized by edge membership strengths $\mu_{ij}^{\text{high}} \in [0, 1]$ for each pair of data points.
Low-Dimensional Representation: embedding coordinates $Y = \{y_1, \ldots, y_n\} \subset \mathbb{R}^d$ (typically $d = 2$ or $3$), which induce their own fuzzy simplicial set.
The low-dimensional affinities are defined as:
$$\mu_{ij}^{\text{low}} = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$
where $a$ and $b$ are parameters fitted to match the min_dist hyperparameter (approximately $a \approx 1.58$, $b \approx 0.90$ for the defaults min_dist = 0.1, spread = 1.0).
When a = b = 1, the low-dimensional affinity becomes 1/(1 + ||y_i - y_j||²)—exactly the Student-t distribution with 1 degree of freedom (Cauchy distribution), same as t-SNE. The min_dist parameter allows UMAP to generalize this family for better control over embedding density.
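The fit of $a$ and $b$ can be reproduced in a few lines. The sketch below mirrors the approach of umap-learn's `find_ab_params` but is independent of it: least-squares fit the affinity curve to a target that is flat up to min_dist and decays exponentially beyond it (`fit_ab` is a name chosen for this page, not a library function):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit the (a, b) parameters of the low-dimensional affinity curve.

    The target curve is ~1 for distances below min_dist and decays
    exponentially beyond it; (1 + a*d^(2b))^(-1) is fit to it by
    nonlinear least squares.
    """
    def affinity(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))

    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(affinity, d, target)
    return a, b

a, b = fit_ab(min_dist=0.1)
```

Smaller min_dist values push the fitted curve closer to a hard cutoff, which is why tighter clusters emerge.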
The Optimization Goal:
Find embedding coordinates $Y$ such that $\mu^{\text{low}}$ is as "close" as possible to $\mu^{\text{high}}$.
But what does "close" mean for fuzzy sets? We need a divergence measure between two fuzzy set membership functions.
UMAP uses fuzzy set cross-entropy as its divergence measure. For membership functions $\mu$ (high-D) and $\nu$ (low-D) over a set of 1-simplices (edges), the cross-entropy is:
$$CE(\mu, \nu) = \sum_{(i,j)} \left[ \mu_{ij} \log\left(\frac{\mu_{ij}}{\nu_{ij}}\right) + (1 - \mu_{ij}) \log\left(\frac{1 - \mu_{ij}}{1 - \nu_{ij}}\right) \right]$$
This is the binary cross-entropy applied to each edge, summed over all edges. Equivalently:
$$CE(\mu, \nu) = \sum_{(i,j)} \left[ -\mu_{ij} \log(\nu_{ij}) - (1 - \mu_{ij}) \log(1 - \nu_{ij}) \right] + \text{const}$$
since the $\mu \log \mu + (1-\mu) \log(1-\mu)$ terms don't depend on $Y$.
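To make the objective concrete, here is a minimal NumPy implementation of the cross-entropy (dropping the constant term, as above) over a list of edges; `fuzzy_set_cross_entropy` is a helper name for this page, not part of any library:

```python
import numpy as np

def fuzzy_set_cross_entropy(mu, nu, eps=1e-12):
    """Fuzzy set cross-entropy (up to the mu-only constant) between
    high-D memberships mu and low-D memberships nu for the same edges.

    mu, nu : arrays of values in [0, 1]; eps guards the logarithms.
    """
    mu = np.asarray(mu, dtype=float)
    nu = np.clip(np.asarray(nu, dtype=float), eps, 1.0 - eps)
    attract = -mu * np.log(nu)               # large when connected pairs end up far apart
    repel = -(1.0 - mu) * np.log(1.0 - nu)   # large when disconnected pairs end up close
    return float(np.sum(attract + repel))
```

When ν matches μ exactly, the value reduces to the entropy of μ (the constant the derivation discards); any mismatch raises it.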
| Term | When It's Large | Effect on Optimization |
|---|---|---|
| $-\mu_{ij} \log(\nu_{ij})$ | High-D connected (μ≈1), low-D far apart (ν≈0) | Pull connected points together |
| $-(1-\mu_{ij}) \log(1-\nu_{ij})$ | High-D disconnected (μ≈0), low-D close (ν≈1) | Push disconnected points apart |
Attractive and Repulsive Terms:
We can decompose the loss into two conceptually distinct parts:
$$\mathcal{L} = \underbrace{\sum_{(i,j)} -\mu_{ij} \log(\nu_{ij})}_{\text{Attractive (pull together)}} + \underbrace{\sum_{(i,j)} -(1 - \mu_{ij}) \log(1 - \nu_{ij})}_{\text{Repulsive (push apart)}}$$
Attractive Term: When $\mu_{ij} > 0$ (points are connected in high-D), minimizing $-\mu_{ij} \log(\nu_{ij})$ requires $\nu_{ij}$ to be large—i.e., $y_i$ and $y_j$ must be close.
Repulsive Term: When $\mu_{ij} < 1$ (points are not fully connected), minimizing $-(1-\mu_{ij}) \log(1-\nu_{ij})$ requires $\nu_{ij}$ to be small—i.e., $y_i$ and $y_j$ must be far apart.
The balance between these forces creates well-separated clusters while preserving internal structure.
Cross-entropy is the natural divergence for probability distributions. When μ and ν are membership functions (soft probabilities of edge existence), cross-entropy measures how much information is lost by using ν to encode data generated by μ. Minimizing it makes the low-D structure as faithful as possible to the high-D structure.
Understanding how UMAP's objective differs from t-SNE's illuminates key design decisions.
t-SNE's Objective:
t-SNE minimizes the Kullback-Leibler (KL) divergence between high-D and low-D probability distributions:
$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)$$
where $p_{ij}$ and $q_{ij}$ are normalized probability distributions (they sum to 1).
UMAP's Objective:
$$CE(\mu, \nu) = \sum_{i \neq j} \left[ \mu_{ij} \log\left(\frac{\mu_{ij}}{\nu_{ij}}\right) + (1-\mu_{ij}) \log\left(\frac{1-\mu_{ij}}{1-\nu_{ij}}\right) \right]$$
where $\mu_{ij}$ and $\nu_{ij}$ are unnormalized membership strengths.
Consequences of the Difference:
| Aspect | t-SNE | UMAP |
|---|---|---|
| Global structure | Poor (normalization hides it) | Better (no global normalization) |
| Relative cluster distances | Not meaningful | More meaningful |
| Computational scaling | O(n²) for exact, O(n log n) approximate | O(n) with negative sampling |
| Run-to-run consistency | Variable | More consistent |
The key insight: t-SNE's normalization means that pushing 10% of all pairs apart is equivalent to pulling 90% together—the absolute scale vanishes. UMAP's unnormalized approach preserves more information about global relationships.
In t-SNE, repulsion is implicit: the q_ij normalization means that when some pairs are pulled together, others are automatically pushed apart (probability mass is conserved). In UMAP, repulsion is explicit: the (1-μ)log(1-ν) term directly penalizes low-D proximity when high-D distance is large. This explicit formulation enables efficient optimization via negative sampling.
To minimize the cross-entropy via gradient descent, we compute gradients with respect to the embedding coordinates.
Gradient Derivation:
The low-dimensional affinity is:
$$\nu_{ij} = \left(1 + a\|y_i - y_j\|^{2b}\right)^{-1}$$
Let $d_{ij} = \|y_i - y_j\|$ and $\phi_{ij} = 1 + a \cdot d_{ij}^{2b}$. Then $\nu_{ij} = \phi_{ij}^{-1}$.
The gradient of the attractive term:
$$\frac{\partial}{\partial y_i}\left(-\mu_{ij} \log \nu_{ij}\right) = -\frac{\mu_{ij}}{\nu_{ij}} \cdot \frac{\partial \nu_{ij}}{\partial y_i}$$
Substituting $\frac{\partial \nu_{ij}}{\partial y_i} = -2ab \cdot d_{ij}^{2b-2} \cdot \nu_{ij}^2 \cdot (y_i - y_j)$ and negating (gradient descent steps opposite the gradient), the attractive update direction is:
$$\nabla_{y_i}^{\text{attr}} = -\frac{2ab \cdot d_{ij}^{2b-2}}{\phi_{ij}} \cdot \mu_{ij} \cdot (y_i - y_j)$$
Similarly, the repulsive update direction (simplifying with $\phi_{ij} - a \cdot d_{ij}^{2b} = 1$) is:
$$\nabla_{y_i}^{\text{rep}} = \frac{2b}{d_{ij}^2 \cdot \phi_{ij}} \cdot (1 - \mu_{ij}) \cdot (y_i - y_j)$$
```python
import numpy as np

def compute_umap_gradients(Y, high_weights, a, b, negative_sample_rate=5):
    """
    Compute UMAP update directions for embedding optimization.

    This implementation shows the core gradient computation. Production
    UMAP uses stochastic sampling for efficiency.

    Parameters
    ----------
    Y : ndarray of shape (n_samples, n_components)
        Current embedding coordinates
    high_weights : sparse matrix
        High-dimensional fuzzy set weights (mu_ij)
    a, b : floats
        Low-dimensional affinity parameters
    negative_sample_rate : int
        Number of negative samples per positive edge

    Returns
    -------
    gradients : ndarray of shape (n_samples, n_components)
        Descent direction (negative gradient) for each point;
        apply as Y += learning_rate * gradients
    """
    n, _ = Y.shape
    gradients = np.zeros_like(Y)

    # Iterate over the nonzero entries of the sparse weight matrix
    rows, cols = high_weights.nonzero()

    # Positive edges (attractive forces)
    for idx in range(len(rows)):
        i, j = rows[idx], cols[idx]
        if i >= j:
            continue  # Process each undirected edge once
        mu_ij = high_weights[i, j]

        # Distance and affinity in low-D: nu_ij = (1 + a * d^(2b))^(-1)
        diff = Y[i] - Y[j]
        dist = np.sqrt(np.sum(diff ** 2)) + 1e-10
        phi = 1.0 + a * (dist ** (2 * b))

        # Attractive update: -2ab * d^(2b-2) / phi * mu * (y_i - y_j)
        grad_coeff = -2.0 * a * b * (dist ** (2 * b - 2)) / phi * mu_ij
        grad = grad_coeff * diff

        # Apply to both points (Newton's third law)
        gradients[i] += grad
        gradients[j] -= grad

    # Negative samples (repulsive forces)
    # For each positive edge, sample random points for repulsion
    for idx in range(len(rows)):
        i, j = rows[idx], cols[idx]
        if i >= j:
            continue
        for _ in range(negative_sample_rate):
            # Sample a random point k != i
            k = np.random.randint(n)
            while k == i:
                k = np.random.randint(n)

            # Assume mu_ik = 0 for random pairs (almost never neighbors)
            mu_ik = 0.0

            diff = Y[i] - Y[k]
            dist_sq = np.sum(diff ** 2) + 1e-10
            dist = np.sqrt(dist_sq)
            phi = 1.0 + a * (dist ** (2 * b))

            # Repulsive update: 2b / (d^2 * phi) * (1 - mu) * (y_i - y_k)
            grad_coeff = 2.0 * b / (dist_sq * phi) * (1.0 - mu_ik)
            grad = grad_coeff * diff

            # Apply repulsion (push i away from k)
            gradients[i] += grad

    return gradients


def attractive_force(y_i, y_j, mu_ij, a, b):
    """
    Compute the attractive force between connected points.

    The force pulls y_i toward y_j, proportional to their high-D
    connection strength mu_ij.
    """
    diff = y_i - y_j
    dist = np.linalg.norm(diff) + 1e-10
    phi = 1.0 + a * (dist ** (2 * b))
    # Force magnitude: proportional to mu and the gradient of log(nu)
    force_mag = 2.0 * a * b * (dist ** (2 * b - 2)) / phi * mu_ij
    # Force direction: toward the other point
    return -force_mag * diff


def repulsive_force(y_i, y_k, a, b, gamma=1.0):
    """
    Compute the repulsive force between unconnected points.

    The force pushes y_i away from y_k. gamma controls repulsion
    strength (UMAP uses 1.0).
    """
    diff = y_i - y_k
    dist_sq = np.sum(diff ** 2) + 0.001  # Regularization
    phi = 1.0 + a * (dist_sq ** b)
    # Force magnitude: want (1 - mu) * log(1 - nu) small -> push apart.
    # Assuming mu = 0 for negative samples:
    force_mag = gamma * 2.0 * b / (dist_sq * phi)
    # Force direction: away from the other point
    return force_mag * diff
```

Physical Interpretation:
The gradients correspond to forces in a spring-mass system:
Attractive forces: Connected points are joined by springs that pull them together. Spring strength is proportional to $\mu_{ij}$.
Repulsive forces: All points repel each other, preventing collapse. Repulsion is stronger for points that should be far apart (low $\mu_{ij}$).
The equilibrium configuration—where all forces balance—is the UMAP embedding.
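This force balance can be watched on a toy problem. The sketch below is a minimal full-batch version (not UMAP's stochastic loop), assuming $a = b = 1$: four points where pairs $(0,1)$ and $(2,3)$ are connected with $\mu = 1$ and all cross pairs have $\mu = 0$. Connected pairs contract; the two "clusters" push apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Edge weights: (0,1) and (2,3) connected, everything else mu = 0.
# With a = b = 1 the update directions derived above simplify to:
#   attraction: -2 / (1 + d^2) * mu       * (y_i - y_j)
#   repulsion:   2 / (d^2 * (1 + d^2)) * (1 - mu) * (y_i - y_j)
mu = np.zeros((4, 4))
mu[0, 1] = mu[1, 0] = 1.0
mu[2, 3] = mu[3, 2] = 1.0

Y = rng.normal(size=(4, 2))
lr = 0.05
for _ in range(1000):
    step = np.zeros_like(Y)
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            diff = Y[i] - Y[j]
            d2 = diff @ diff + 1e-8
            attract = -2.0 / (1.0 + d2) * mu[i, j]
            repel = 2.0 / (d2 * (1.0 + d2)) * (1.0 - mu[i, j])
            # Clip per-pair updates, as UMAP does, to prevent explosions
            step[i] += np.clip((attract + repel) * diff, -4.0, 4.0)
    Y += lr * step

intra = np.linalg.norm(Y[0] - Y[1])  # connected pair: should be small
inter = np.linalg.norm(Y[0] - Y[2])  # unconnected pair: should be large
```

At equilibrium the connected pairs sit almost on top of each other (no repulsion acts between them here because $\mu = 1$), while cross-cluster repulsion holds the pairs apart.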
The naive implementation of UMAP's repulsive term requires summing over all $O(n^2)$ pairs—prohibitively expensive for large datasets. UMAP achieves linear scaling through negative sampling, borrowed from word embedding methods like word2vec.
The Efficiency Problem:
The repulsive term is: $$-\sum_{i \neq j} (1 - \mu_{ij}) \log(1 - \nu_{ij})$$
For $n$ points, this sum has $O(n^2)$ terms. Moreover, most pairs have $\mu_{ij} \approx 0$ (not neighbors), so the $(1 - \mu_{ij}) \approx 1$ factor does nothing to shrink the sum: nearly every pair contributes.
The Negative Sampling Solution:
Instead of computing repulsion between all pairs, UMAP samples a small number of "negative" pairs uniformly at random. For each positive edge $(i, j)$ with $\mu_{ij} > 0$: (1) apply the attractive update to $y_i$ and $y_j$; (2) sample negative_sample_rate random points $k$; (3) apply the repulsive update to $y_i$ against each sampled $k$, treating $\mu_{ik} \approx 0$.
Negative sampling provides an unbiased estimator of the full repulsive gradient. Since most pairs are negatives (not neighbors), random samples are almost always true negatives. The variance scales with 1/k, so k=5 gives reasonable estimates with huge computational savings. The noise actually helps optimization by acting like regularization.
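A quick numerical check of the estimator claim: sampling negatives uniformly and rescaling by the number of candidates recovers the full repulsive sum in expectation. The sketch below (toy data and helper names chosen here) compares the exact sum of repulsive coefficients for one point against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 random 2-D points; a = b = 1 for simplicity
n, a, b = 200, 1.0, 1.0
Y = rng.normal(size=(n, 2))

def rep_coeff(i, k):
    """Repulsive coefficient 2b / (d^2 * phi) between points i and k."""
    d2 = np.sum((Y[i] - Y[k]) ** 2) + 1e-6
    return 2.0 * b / (d2 * (1.0 + a * d2 ** b))

# Exact repulsion strength on point 0: sum over all other points
full = sum(rep_coeff(0, k) for k in range(1, n))

# Monte Carlo estimate: sample k uniformly, rescale by (n - 1) candidates
samples = rng.integers(1, n, size=50_000)
estimate = (n - 1) * np.mean([rep_coeff(0, k) for k in samples])

rel_err = abs(estimate - full) / full
```

With enough samples the relative error shrinks like $1/\sqrt{k}$; UMAP gets away with $k = 5$ per edge because the noise averages out over many edges and epochs.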
| Method | Per-Iteration Cost | Total Cost (typical) | Memory |
|---|---|---|---|
| Exact UMAP | O(n²) | O(n² × epochs) | O(n²) |
| UMAP + Negative Sampling | O(E × k) | O(n × k × epochs) | O(n + E) |
| Exact t-SNE | O(n²) | O(n² × epochs) | O(n²) |
| Barnes-Hut t-SNE | O(n log n) | O(n log n × epochs) | O(n) |
Where: $n$ is the number of points, $E$ is the number of edges in the fuzzy graph (roughly $n \times$ n_neighbors), and $k$ is the negative_sample_rate.
For a dataset with 100,000 points and 15 neighbors, exact repulsion touches $n^2 = 10^{10}$ pairs per epoch, while negative sampling with $k = 5$ touches about $E \times k \approx 1.5 \times 10^6 \times 5 = 7.5 \times 10^6$ updates per epoch.
That's a speedup of over 1000x—the difference between hours and seconds.
```python
import numpy as np
from numba import njit

@njit
def optimize_layout_epoch(
    embedding,           # (n, d) float32 array
    head,                # Edge source indices
    tail,                # Edge target indices
    weights,             # Edge weights (mu values)
    a, b,                # Low-D affinity parameters
    n_negative_samples,  # Negative samples per positive edge
    learning_rate,       # SGD learning rate
    epochs_per_sample    # How many times to sample each edge
):
    """
    One epoch of UMAP optimization with negative sampling.

    This is the core optimization loop, JIT-compiled for speed.
    """
    n_vertices = embedding.shape[0]
    n_edges = len(head)

    for edge in range(n_edges):
        # How many times to process this edge?
        n_samples = int(epochs_per_sample[edge])

        for _ in range(n_samples):
            i = head[edge]
            j = tail[edge]

            # === ATTRACTIVE FORCE ===
            dist_sq = 0.0
            for d in range(embedding.shape[1]):
                dist_sq += (embedding[i, d] - embedding[j, d]) ** 2

            if dist_sq > 0.0:
                # Update coefficient for attraction
                # (negative: the update moves i toward j)
                grad_coeff = -2.0 * a * b * (dist_sq ** (b - 1.0))
                grad_coeff /= (1.0 + a * (dist_sq ** b))
                grad_coeff *= weights[edge]  # Scale by mu_ij
            else:
                grad_coeff = 0.0

            # Apply attractive update
            for d in range(embedding.shape[1]):
                diff = embedding[i, d] - embedding[j, d]
                gradient = grad_coeff * diff
                # Clip gradient to prevent explosion
                gradient = max(-4.0, min(4.0, gradient))
                embedding[i, d] += learning_rate * gradient
                embedding[j, d] -= learning_rate * gradient

            # === REPULSIVE FORCES (Negative Sampling) ===
            for _ in range(n_negative_samples):
                # Sample random negative target
                k = np.random.randint(n_vertices)

                # Skip if we sampled the same point
                if k == i:
                    continue

                dist_sq = 0.0
                for d in range(embedding.shape[1]):
                    dist_sq += (embedding[i, d] - embedding[k, d]) ** 2

                if dist_sq > 0.0:
                    # Update coefficient for repulsion
                    # (1 - mu_ik) ~= 1 for random negative samples
                    grad_coeff = 2.0 * b
                    grad_coeff /= (1e-6 + dist_sq) * (1.0 + a * (dist_sq ** b))
                else:
                    grad_coeff = 4.0  # Max repulsion when points overlap

                # Apply repulsive update (only to point i)
                for d in range(embedding.shape[1]):
                    diff = embedding[i, d] - embedding[k, d]
                    gradient = grad_coeff * diff
                    # Clip gradient
                    gradient = max(-4.0, min(4.0, gradient))
                    embedding[i, d] += learning_rate * gradient

    return embedding
```

UMAP's optimization uses several techniques to ensure stable convergence to good embeddings.
Initialization:
The initial embedding positions significantly affect the final result. UMAP offers several initialization strategies:
Spectral initialization (default): Use eigenvectors of the graph Laplacian. This gives a globally coherent starting point that captures coarse structure.
Random initialization: Uniform random in a small cube. Simple but may require more epochs.
Custom initialization: User-provided coordinates (e.g., from a previous UMAP run).
Spectral initialization is crucial for global structure preservation—it starts the optimization near a good local minimum.
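A dense sketch of spectral initialization follows (production UMAP uses sparse eigensolvers and adds small noise; `spectral_init` is a name chosen for this page, and the scaling to a fixed range is an assumption modeled on umap-learn's behavior):

```python
import numpy as np

def spectral_init(W, n_components=2, scale=10.0):
    """Initialize an embedding from the fuzzy graph's Laplacian.

    W : dense symmetric nonnegative affinity matrix.
    Returns the eigenvectors of the symmetric normalized Laplacian for
    the smallest nontrivial eigenvalues, rescaled to a small range.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # L_sym = I - D^(-1/2) W D^(-1/2)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)  # ascending eigenvalues
    # Skip the trivial eigenvector (eigenvalue ~0), take the next ones
    init = vecs[:, 1:n_components + 1]
    return scale * init / np.abs(init).max()

# Example: ring graph on 6 nodes
W = np.zeros((6, 6))
for i in range(6):
    W[i, (i + 1) % 6] = W[(i + 1) % 6, i] = 1.0
init = spectral_init(W)
```

On the ring example the two nontrivial eigenvectors lay the nodes out on a circle, recovering the graph's coarse geometry before any SGD step runs.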
| Parameter | Default | Effect | Guidance |
|---|---|---|---|
| n_epochs | 200 (small n) to 500 (large n) | Total optimization iterations | More epochs = better convergence, slower |
| learning_rate | 1.0 | SGD step size | Rarely needs tuning |
| negative_sample_rate | 5 | Negatives per positive edge | Higher = better repulsion estimate, slower |
| init | spectral | Initialization method | Use spectral for global structure |
| spread | 1.0 | Embedding scale | Affects visual density |
| min_dist | 0.1 | Minimum embedding distance | Lower = tighter clusters |
Learning Rate Schedule:
UMAP uses a decreasing learning rate:
$$\alpha_t = \alpha_0 \cdot \left(1 - \frac{t}{T}\right)$$
where $t$ is the current epoch and $T$ is the total number of epochs. This "annealing" schedule allows large early steps that establish the global arrangement, followed by progressively smaller steps that refine local detail without destroying it.
Edge Sampling Schedule:
Not all edges are sampled equally often. UMAP samples edges proportional to their weights $\mu_{ij}$. High-weight edges (strong connections) are sampled more frequently, ensuring important local relationships are well-optimized.
The epochs_per_sample array controls this:
$$\text{epochs\_per\_sample}_{ij} = \frac{\mu_{ij}}{\max_{kl} \mu_{kl}} \cdot n_{\text{epochs}}$$
so the strongest edge is processed once per epoch on average, and an edge with half its weight half as often.
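Both schedules are one-liners. A sketch with helper names chosen for this page (treating the per-edge value as the total number of times that edge is processed over the full run):

```python
import numpy as np

def make_edge_sample_counts(weights, n_epochs):
    """Total times each edge is sampled over training, proportional to
    its membership weight; the strongest edge is sampled every epoch."""
    weights = np.asarray(weights, dtype=float)
    return np.round(weights / weights.max() * n_epochs).astype(int)

def learning_rate_schedule(alpha0, n_epochs):
    """Linearly annealed SGD step size: alpha_t = alpha0 * (1 - t/T)."""
    t = np.arange(n_epochs)
    return alpha0 * (1.0 - t / n_epochs)

counts = make_edge_sample_counts([1.0, 0.5, 0.1], n_epochs=200)
alphas = learning_rate_schedule(1.0, n_epochs=200)
```

Weak edges thus receive only a handful of updates, which is fine: their contribution to the layout is correspondingly small.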
Because negative sampling is stochastic (and initialization can be too), UMAP runs are not perfectly reproducible unless you set a random seed. The overall structure should nonetheless be consistent across runs. For reproducibility, set random_state in the UMAP constructor.
How do we know if UMAP has converged to a good solution? Unlike supervised learning, there's no obvious "loss on test set" to monitor.
Monitoring Convergence:
UMAP doesn't typically expose the loss during training, but you can monitor:
Point movement: Track how much the embedding changes between epochs. At convergence, movement should be minimal.
Loss value: Compute cross-entropy periodically (expensive but informative).
Visual stability: For visualization, the appearance should stabilize.
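The first of these signals, point movement, is cheap to compute. A sketch (helper name chosen here) that normalizes displacement by the embedding's spread, so a fixed threshold is meaningful across datasets:

```python
import numpy as np

def mean_point_movement(Y_prev, Y_curr):
    """Mean per-point displacement between two epochs, normalized by
    the embedding's overall scale (mean distance from the centroid)."""
    move = np.linalg.norm(Y_curr - Y_prev, axis=1).mean()
    scale = np.linalg.norm(Y_curr - Y_curr.mean(axis=0), axis=1).mean()
    return move / max(scale, 1e-12)
```

A value that stays below, say, 0.001 of the embedding scale for several consecutive epochs is a reasonable (if informal) stopping signal.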
Indicators of Good Convergence: per-epoch point movement has fallen to a small fraction of the embedding's overall scale, the cross-entropy (if monitored) has plateaued, and the layout no longer changes qualitatively between checkpoints.
Common Problems and Solutions:
| Problem | Symptom | Likely Cause | Solution |
|---|---|---|---|
| Disconnected fragments | Small isolated point groups | n_neighbors too small | Increase n_neighbors |
| Everything in one blob | No cluster separation | n_neighbors too large OR min_dist too high | Decrease n_neighbors or min_dist |
| Inconsistent runs | Very different layouts each time | Poor initialization or too few epochs | Use spectral init; increase n_epochs |
| Noisy embedding | Points scattered chaotically | Too much repulsion or bad data | Check data quality; reduce negative_sample_rate |
| Slow convergence | Structure still changing at epoch 500 | Complex data structure | Increase n_epochs; decrease learning_rate |
Remember: UMAP optimizes a proxy objective (cross-entropy), not your actual task (visualization quality, downstream classification, etc.). A lower cross-entropy doesn't guarantee a more useful embedding. Always evaluate embeddings based on your actual use case.
We have now covered UMAP's optimization procedure comprehensively. The key concepts to consolidate: the fuzzy set cross-entropy objective, its decomposition into attractive and repulsive terms, negative sampling as the trick that makes each epoch linear in the number of edges, and the annealed, weight-proportional sampling schedule that stabilizes convergence.
What's Next:
With UMAP's theory and optimization fully covered, the next page provides a comparative analysis of UMAP vs. t-SNE—examining their similarities, differences, and guidance on when to prefer each method.
You now understand UMAP's optimization framework: the cross-entropy objective that measures fuzzy set divergence, the gradient decomposition into attractive and repulsive forces, the negative sampling strategy that enables linear scaling, and the optimization schedule that balances global and local structure optimization.