With the high-dimensional fuzzy simplicial set constructed—a weighted graph encoding the topological structure of our data—UMAP faces its core computational challenge: finding a low-dimensional embedding whose fuzzy simplicial set matches the high-dimensional one as closely as possible.
This matching problem is framed as an optimization task. UMAP defines a loss function measuring the divergence between high- and low-dimensional fuzzy sets, then uses stochastic gradient descent to minimize it. The specific loss function—fuzzy set cross-entropy—is not arbitrary but emerges from principled information-theoretic considerations applied to membership functions.
This page provides a comprehensive treatment of UMAP's optimization framework, from the mathematical formulation of the loss function through the practical details of the optimization procedure.
By the end of this page, you will understand: (1) The fuzzy set cross-entropy loss function and its information-theoretic interpretation, (2) How the loss decomposes into attractive and repulsive forces, (3) The stochastic gradient descent optimization procedure with negative sampling, and (4) How these design choices lead to UMAP's remarkable computational efficiency.
Let's precisely formulate what UMAP optimizes. We have:
High-Dimensional Representation: the fuzzy simplicial set constructed on the previous page, summarized by edge membership strengths $\mu_{ij}^{\text{high}} \in [0, 1]$ for each pair of data points.
Low-Dimensional Representation: embedding coordinates $Y = \{y_1, \ldots, y_n\} \subset \mathbb{R}^d$ (typically $d = 2$ or $3$), which induce their own fuzzy simplicial set.
The low-dimensional affinities are defined as:
$$\mu_{ij}^{\text{low}} = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$
where $a$ and $b$ are parameters fitted to match the min_dist hyperparameter (approximately $a \approx 1.58$, $b \approx 0.90$ for the defaults min_dist = 0.1, spread = 1.0).
When a = b = 1, the low-dimensional affinity becomes 1/(1 + ||y_i - y_j||²)—exactly the Student-t distribution with 1 degree of freedom (Cauchy distribution), same as t-SNE. The min_dist parameter allows UMAP to generalize this family for better control over embedding density.
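The fit of $a$ and $b$ can be reproduced in a few lines. The sketch below mirrors the approach of umap-learn's `find_ab_params` but is independent of it: least-squares fit the affinity curve to a target that is flat up to min_dist and decays exponentially beyond it (`fit_ab` is a name chosen for this page, not a library function):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit the (a, b) parameters of the low-dimensional affinity curve.

    The target curve is ~1 for distances below min_dist and decays
    exponentially beyond it; (1 + a*d^(2b))^(-1) is fit to it by
    nonlinear least squares.
    """
    def affinity(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))

    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(affinity, d, target)
    return a, b

a, b = fit_ab(min_dist=0.1)
```

Smaller min_dist values push the fitted curve closer to a hard cutoff, which is why tighter clusters emerge.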
The Optimization Goal:
Find embedding coordinates $Y$ such that $\mu^{\text{low}}$ is as "close" as possible to $\mu^{\text{high}}$.
But what does "close" mean for fuzzy sets? We need a divergence measure between two fuzzy set membership functions.
UMAP uses fuzzy set cross-entropy as its divergence measure. For membership functions $\mu$ (high-D) and $\nu$ (low-D) over a set of 1-simplices (edges), the cross-entropy is:
$$CE(\mu, \nu) = \sum_{(i,j)} \left[ \mu_{ij} \log\left(\frac{\mu_{ij}}{\nu_{ij}}\right) + (1 - \mu_{ij}) \log\left(\frac{1 - \mu_{ij}}{1 - \nu_{ij}}\right) \right]$$
This is the binary cross-entropy applied to each edge, summed over all edges. Equivalently:
$$CE(\mu, \nu) = \sum_{(i,j)} \left[ -\mu_{ij} \log(\nu_{ij}) - (1 - \mu_{ij}) \log(1 - \nu_{ij}) \right] + \text{const}$$
since the $\mu \log \mu + (1-\mu) \log(1-\mu)$ terms don't depend on $Y$.
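To make the objective concrete, here is a minimal NumPy implementation of the cross-entropy (dropping the constant term, as above) over a list of edges; `fuzzy_set_cross_entropy` is a helper name for this page, not part of any library:

```python
import numpy as np

def fuzzy_set_cross_entropy(mu, nu, eps=1e-12):
    """Fuzzy set cross-entropy (up to the mu-only constant) between
    high-D memberships mu and low-D memberships nu for the same edges.

    mu, nu : arrays of values in [0, 1]; eps guards the logarithms.
    """
    mu = np.asarray(mu, dtype=float)
    nu = np.clip(np.asarray(nu, dtype=float), eps, 1.0 - eps)
    attract = -mu * np.log(nu)               # large when connected pairs end up far apart
    repel = -(1.0 - mu) * np.log(1.0 - nu)   # large when disconnected pairs end up close
    return float(np.sum(attract + repel))
```

When ν matches μ exactly, the value reduces to the entropy of μ (the constant the derivation discards); any mismatch raises it.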
| Term | When It's Large | Effect on Optimization |
|---|---|---|
| $-\mu_{ij} \log(\nu_{ij})$ | High-D connected (μ≈1), low-D far apart (ν≈0) | Pull connected points together |
| $-(1-\mu_{ij}) \log(1-\nu_{ij})$ | High-D disconnected (μ≈0), low-D close (ν≈1) | Push disconnected points apart |
Attractive and Repulsive Terms:
We can decompose the loss into two conceptually distinct parts:
$$\mathcal{L} = \underbrace{\sum_{(i,j)} -\mu_{ij} \log(\nu_{ij})}_{\text{Attractive (pull together)}} + \underbrace{\sum_{(i,j)} -(1 - \mu_{ij}) \log(1 - \nu_{ij})}_{\text{Repulsive (push apart)}}$$
Attractive Term: When $\mu_{ij} > 0$ (points are connected in high-D), minimizing $-\mu_{ij} \log(\nu_{ij})$ requires $\nu_{ij}$ to be large—i.e., $y_i$ and $y_j$ must be close.
Repulsive Term: When $\mu_{ij} < 1$ (points are not fully connected), minimizing $-(1-\mu_{ij}) \log(1-\nu_{ij})$ requires $\nu_{ij}$ to be small—i.e., $y_i$ and $y_j$ must be far apart.
The balance between these forces creates well-separated clusters while preserving internal structure.
Cross-entropy is the natural divergence for probability distributions. When μ and ν are membership functions (soft probabilities of edge existence), cross-entropy measures how much information is lost by using ν to encode data generated by μ. Minimizing it makes the low-D structure as faithful as possible to the high-D structure.
Understanding how UMAP's objective differs from t-SNE's illuminates key design decisions.
t-SNE's Objective:
t-SNE minimizes the Kullback-Leibler (KL) divergence between high-D and low-D probability distributions:
$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)$$
where $p_{ij}$ and $q_{ij}$ are normalized probability distributions (they sum to 1).
UMAP's Objective:
$$CE(\mu, \nu) = \sum_{i \neq j} \left[ \mu_{ij} \log\left(\frac{\mu_{ij}}{\nu_{ij}}\right) + (1-\mu_{ij}) \log\left(\frac{1-\mu_{ij}}{1-\nu_{ij}}\right) \right]$$
where $\mu_{ij}$ and $\nu_{ij}$ are unnormalized membership strengths.
Consequences of the Difference:
| Aspect | t-SNE | UMAP |
|---|---|---|
| Global structure | Poor (normalization hides it) | Better (no global normalization) |
| Relative cluster distances | Not meaningful | More meaningful |
| Computational scaling | O(n²) for exact, O(n log n) approximate | O(n) with negative sampling |
| Run-to-run consistency | Variable | More consistent |
The key insight: t-SNE's normalization means that pushing 10% of all pairs apart is equivalent to pulling 90% together—the absolute scale vanishes. UMAP's unnormalized approach preserves more information about global relationships.
In t-SNE, repulsion is implicit: the q_ij normalization means that when some pairs are pulled together, others are automatically pushed apart (probability mass is conserved). In UMAP, repulsion is explicit: the (1-μ)log(1-ν) term directly penalizes low-D proximity when high-D distance is large. This explicit formulation enables efficient optimization via negative sampling.
To minimize the cross-entropy via gradient descent, we compute gradients with respect to the embedding coordinates.
Gradient Derivation:
The low-dimensional affinity is:
$$\nu_{ij} = \left(1 + a\|y_i - y_j\|^{2b}\right)^{-1}$$
Let $d_{ij} = \|y_i - y_j\|$ and $\phi_{ij} = 1 + a \cdot d_{ij}^{2b}$. Then $\nu_{ij} = \phi_{ij}^{-1}$.
The gradient of the attractive term:
$$\frac{\partial}{\partial y_i}\left(-\mu_{ij} \log \nu_{ij}\right) = -\frac{\mu_{ij}}{\nu_{ij}} \cdot \frac{\partial \nu_{ij}}{\partial y_i}$$
Substituting $\frac{\partial \nu_{ij}}{\partial y_i} = -2ab \cdot d_{ij}^{2b-2} \cdot \nu_{ij}^2 \cdot (y_i - y_j)$ and negating (gradient descent steps opposite the gradient), the attractive update direction is:
$$\nabla_{y_i}^{\text{attr}} = -\frac{2ab \cdot d_{ij}^{2b-2}}{\phi_{ij}} \cdot \mu_{ij} \cdot (y_i - y_j)$$
Similarly, the repulsive update direction (simplifying with $\phi_{ij} - a \cdot d_{ij}^{2b} = 1$) is:
$$\nabla_{y_i}^{\text{rep}} = \frac{2b}{d_{ij}^2 \cdot \phi_{ij}} \cdot (1 - \mu_{ij}) \cdot (y_i - y_j)$$
```python
import numpy as np

def compute_umap_gradients(Y, high_weights, a, b, negative_sample_rate=5):
    """
    Compute UMAP update directions for embedding optimization.

    This implementation shows the core gradient computation. Production
    UMAP uses stochastic sampling for efficiency.

    Parameters
    ----------
    Y : ndarray of shape (n_samples, n_components)
        Current embedding coordinates
    high_weights : sparse matrix
        High-dimensional fuzzy set weights (mu_ij)
    a, b : floats
        Low-dimensional affinity parameters
    negative_sample_rate : int
        Number of negative samples per positive edge

    Returns
    -------
    gradients : ndarray of shape (n_samples, n_components)
        Descent direction (negative gradient) for each point;
        apply as Y += learning_rate * gradients
    """
    n, _ = Y.shape
    gradients = np.zeros_like(Y)

    # Iterate over the nonzero entries of the sparse weight matrix
    rows, cols = high_weights.nonzero()

    # Positive edges (attractive forces)
    for idx in range(len(rows)):
        i, j = rows[idx], cols[idx]
        if i >= j:
            continue  # Process each undirected edge once
        mu_ij = high_weights[i, j]

        # Distance and affinity in low-D: nu_ij = (1 + a * d^(2b))^(-1)
        diff = Y[i] - Y[j]
        dist = np.sqrt(np.sum(diff ** 2)) + 1e-10
        phi = 1.0 + a * (dist ** (2 * b))

        # Attractive update: -2ab * d^(2b-2) / phi * mu * (y_i - y_j)
        grad_coeff = -2.0 * a * b * (dist ** (2 * b - 2)) / phi * mu_ij
        grad = grad_coeff * diff

        # Apply to both points (Newton's third law)
        gradients[i] += grad
        gradients[j] -= grad

    # Negative samples (repulsive forces)
    # For each positive edge, sample random points for repulsion
    for idx in range(len(rows)):
        i, j = rows[idx], cols[idx]
        if i >= j:
            continue
        for _ in range(negative_sample_rate):
            # Sample a random point k != i
            k = np.random.randint(n)
            while k == i:
                k = np.random.randint(n)

            # Assume mu_ik = 0 for random pairs (almost never neighbors)
            mu_ik = 0.0

            diff = Y[i] - Y[k]
            dist_sq = np.sum(diff ** 2) + 1e-10
            dist = np.sqrt(dist_sq)
            phi = 1.0 + a * (dist ** (2 * b))

            # Repulsive update: 2b / (d^2 * phi) * (1 - mu) * (y_i - y_k)
            grad_coeff = 2.0 * b / (dist_sq * phi) * (1.0 - mu_ik)
            grad = grad_coeff * diff

            # Apply repulsion (push i away from k)
            gradients[i] += grad

    return gradients


def attractive_force(y_i, y_j, mu_ij, a, b):
    """
    Compute the attractive force between connected points.

    The force pulls y_i toward y_j, proportional to their high-D
    connection strength mu_ij.
    """
    diff = y_i - y_j
    dist = np.linalg.norm(diff) + 1e-10
    phi = 1.0 + a * (dist ** (2 * b))
    # Force magnitude: proportional to mu and the gradient of log(nu)
    force_mag = 2.0 * a * b * (dist ** (2 * b - 2)) / phi * mu_ij
    # Force direction: toward the other point
    return -force_mag * diff


def repulsive_force(y_i, y_k, a, b, gamma=1.0):
    """
    Compute the repulsive force between unconnected points.

    The force pushes y_i away from y_k. gamma controls repulsion
    strength (UMAP uses 1.0).
    """
    diff = y_i - y_k
    dist_sq = np.sum(diff ** 2) + 0.001  # Regularization
    phi = 1.0 + a * (dist_sq ** b)
    # Force magnitude: want (1 - mu) * log(1 - nu) small -> push apart.
    # Assuming mu = 0 for negative samples:
    force_mag = gamma * 2.0 * b / (dist_sq * phi)
    # Force direction: away from the other point
    return force_mag * diff
```

Physical Interpretation:
The gradients correspond to forces in a spring-mass system:
Attractive forces: Connected points are joined by springs that pull them together. Spring strength is proportional to $\mu_{ij}$.
Repulsive forces: All points repel each other, preventing collapse. Repulsion is stronger for points that should be far apart (low $\mu_{ij}$).
The equilibrium configuration—where all forces balance—is the UMAP embedding.
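This force balance can be watched on a toy problem. The sketch below is a minimal full-batch version (not UMAP's stochastic loop), assuming $a = b = 1$: four points where pairs $(0,1)$ and $(2,3)$ are connected with $\mu = 1$ and all cross pairs have $\mu = 0$. Connected pairs contract; the two "clusters" push apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Edge weights: (0,1) and (2,3) connected, everything else mu = 0.
# With a = b = 1 the update directions derived above simplify to:
#   attraction: -2 / (1 + d^2) * mu       * (y_i - y_j)
#   repulsion:   2 / (d^2 * (1 + d^2)) * (1 - mu) * (y_i - y_j)
mu = np.zeros((4, 4))
mu[0, 1] = mu[1, 0] = 1.0
mu[2, 3] = mu[3, 2] = 1.0

Y = rng.normal(size=(4, 2))
lr = 0.05
for _ in range(1000):
    step = np.zeros_like(Y)
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            diff = Y[i] - Y[j]
            d2 = diff @ diff + 1e-8
            attract = -2.0 / (1.0 + d2) * mu[i, j]
            repel = 2.0 / (d2 * (1.0 + d2)) * (1.0 - mu[i, j])
            # Clip per-pair updates, as UMAP does, to prevent explosions
            step[i] += np.clip((attract + repel) * diff, -4.0, 4.0)
    Y += lr * step

intra = np.linalg.norm(Y[0] - Y[1])  # connected pair: should be small
inter = np.linalg.norm(Y[0] - Y[2])  # unconnected pair: should be large
```

At equilibrium the connected pairs sit almost on top of each other (no repulsion acts between them here because $\mu = 1$), while cross-cluster repulsion holds the pairs apart.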
The naive implementation of UMAP's repulsive term requires summing over all $O(n^2)$ pairs—prohibitively expensive for large datasets. UMAP achieves linear scaling through negative sampling, borrowed from word embedding methods like word2vec.
The Efficiency Problem:
The repulsive term is: $$-\sum_{i \neq j} (1 - \mu_{ij}) \log(1 - \nu_{ij})$$
For $n$ points, this sum has $O(n^2)$ terms. Moreover, most pairs have $\mu_{ij} \approx 0$ (not neighbors), so the $(1 - \mu_{ij}) \approx 1$ factor does nothing to shrink the sum: nearly every pair contributes.
The Negative Sampling Solution:
Instead of computing repulsion between all pairs, UMAP samples a small number of "negative" pairs uniformly at random. For each positive edge $(i, j)$ with $\mu_{ij} > 0$: (1) apply the attractive update to $y_i$ and $y_j$; (2) sample negative_sample_rate random points $k$; (3) apply the repulsive update to $y_i$ against each sampled $k$, treating $\mu_{ik} \approx 0$.
Negative sampling provides an unbiased estimator of the full repulsive gradient. Since most pairs are negatives (not neighbors), random samples are almost always true negatives. The variance scales with 1/k, so k=5 gives reasonable estimates with huge computational savings. The noise actually helps optimization by acting like regularization.
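A quick numerical check of the estimator claim: sampling negatives uniformly and rescaling by the number of candidates recovers the full repulsive sum in expectation. The sketch below (toy data and helper names chosen here) compares the exact sum of repulsive coefficients for one point against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 random 2-D points; a = b = 1 for simplicity
n, a, b = 200, 1.0, 1.0
Y = rng.normal(size=(n, 2))

def rep_coeff(i, k):
    """Repulsive coefficient 2b / (d^2 * phi) between points i and k."""
    d2 = np.sum((Y[i] - Y[k]) ** 2) + 1e-6
    return 2.0 * b / (d2 * (1.0 + a * d2 ** b))

# Exact repulsion strength on point 0: sum over all other points
full = sum(rep_coeff(0, k) for k in range(1, n))

# Monte Carlo estimate: sample k uniformly, rescale by (n - 1) candidates
samples = rng.integers(1, n, size=50_000)
estimate = (n - 1) * np.mean([rep_coeff(0, k) for k in samples])

rel_err = abs(estimate - full) / full
```

With enough samples the relative error shrinks like $1/\sqrt{k}$; UMAP gets away with $k = 5$ per edge because the noise averages out over many edges and epochs.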
| Method | Per-Iteration Cost | Total Cost (typical) | Memory |
|---|---|---|---|
| Exact UMAP | O(n²) | O(n² × epochs) | O(n²) |
| UMAP + Negative Sampling | O(E × k) | O(n × k × epochs) | O(n + E) |
| Exact t-SNE | O(n²) | O(n² × epochs) | O(n²) |
| Barnes-Hut t-SNE | O(n log n) | O(n log n × epochs) | O(n) |
Where: $n$ is the number of points, $E$ is the number of edges in the fuzzy graph (roughly $n \times$ n_neighbors), and $k$ is the negative_sample_rate.
For a dataset with 100,000 points and 15 neighbors, exact repulsion touches $n^2 = 10^{10}$ pairs per epoch, while negative sampling with $k = 5$ touches about $E \times k \approx 1.5 \times 10^6 \times 5 = 7.5 \times 10^6$ updates per epoch.
That's a speedup of over 1000x—the difference between hours and seconds.
```python
import numpy as np
from numba import njit

@njit
def optimize_layout_epoch(
    embedding,           # (n, d) float32 array
    head,                # Edge source indices
    tail,                # Edge target indices
    weights,             # Edge weights (mu values)
    a, b,                # Low-D affinity parameters
    n_negative_samples,  # Negative samples per positive edge
    learning_rate,       # SGD learning rate
    epochs_per_sample    # How many times to sample each edge
):
    """
    One epoch of UMAP optimization with negative sampling.

    This is the core optimization loop, JIT-compiled for speed.
    """
    n_vertices = embedding.shape[0]
    n_edges = len(head)

    for edge in range(n_edges):
        # How many times to process this edge?
        n_samples = int(epochs_per_sample[edge])

        for _ in range(n_samples):
            i = head[edge]
            j = tail[edge]

            # === ATTRACTIVE FORCE ===
            dist_sq = 0.0
            for d in range(embedding.shape[1]):
                dist_sq += (embedding[i, d] - embedding[j, d]) ** 2

            if dist_sq > 0.0:
                # Update coefficient for attraction
                # (negative: the update moves i toward j)
                grad_coeff = -2.0 * a * b * (dist_sq ** (b - 1.0))
                grad_coeff /= (1.0 + a * (dist_sq ** b))
                grad_coeff *= weights[edge]  # Scale by mu_ij
            else:
                grad_coeff = 0.0

            # Apply attractive update
            for d in range(embedding.shape[1]):
                diff = embedding[i, d] - embedding[j, d]
                gradient = grad_coeff * diff
                # Clip gradient to prevent explosion
                gradient = max(-4.0, min(4.0, gradient))
                embedding[i, d] += learning_rate * gradient
                embedding[j, d] -= learning_rate * gradient

            # === REPULSIVE FORCES (Negative Sampling) ===
            for _ in range(n_negative_samples):
                # Sample random negative target
                k = np.random.randint(n_vertices)

                # Skip if we sampled the same point
                if k == i:
                    continue

                dist_sq = 0.0
                for d in range(embedding.shape[1]):
                    dist_sq += (embedding[i, d] - embedding[k, d]) ** 2

                if dist_sq > 0.0:
                    # Update coefficient for repulsion
                    # (1 - mu_ik) ~= 1 for random negative samples
                    grad_coeff = 2.0 * b
                    grad_coeff /= (1e-6 + dist_sq) * (1.0 + a * (dist_sq ** b))
                else:
                    grad_coeff = 4.0  # Max repulsion when points overlap

                # Apply repulsive update (only to point i)
                for d in range(embedding.shape[1]):
                    diff = embedding[i, d] - embedding[k, d]
                    gradient = grad_coeff * diff
                    # Clip gradient
                    gradient = max(-4.0, min(4.0, gradient))
                    embedding[i, d] += learning_rate * gradient

    return embedding
```

UMAP's optimization uses several techniques to ensure stable convergence to good embeddings.
Initialization:
The initial embedding positions significantly affect the final result. UMAP offers several initialization strategies:
Spectral initialization (default): Use eigenvectors of the graph Laplacian. This gives a globally coherent starting point that captures coarse structure.
Random initialization: Uniform random in a small cube. Simple but may require more epochs.
Custom initialization: User-provided coordinates (e.g., from a previous UMAP run).
Spectral initialization is crucial for global structure preservation—it starts the optimization near a good local minimum.
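A dense sketch of spectral initialization follows (production UMAP uses sparse eigensolvers and adds small noise; `spectral_init` is a name chosen for this page, and the scaling to a fixed range is an assumption modeled on umap-learn's behavior):

```python
import numpy as np

def spectral_init(W, n_components=2, scale=10.0):
    """Initialize an embedding from the fuzzy graph's Laplacian.

    W : dense symmetric nonnegative affinity matrix.
    Returns the eigenvectors of the symmetric normalized Laplacian for
    the smallest nontrivial eigenvalues, rescaled to a small range.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # L_sym = I - D^(-1/2) W D^(-1/2)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)  # ascending eigenvalues
    # Skip the trivial eigenvector (eigenvalue ~0), take the next ones
    init = vecs[:, 1:n_components + 1]
    return scale * init / np.abs(init).max()

# Example: ring graph on 6 nodes
W = np.zeros((6, 6))
for i in range(6):
    W[i, (i + 1) % 6] = W[(i + 1) % 6, i] = 1.0
init = spectral_init(W)
```

On the ring example the two nontrivial eigenvectors lay the nodes out on a circle, recovering the graph's coarse geometry before any SGD step runs.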
| Parameter | Default | Effect | Guidance |
|---|---|---|---|
| n_epochs | 200 (small n) to 500 (large n) | Total optimization iterations | More epochs = better convergence, slower |
| learning_rate | 1.0 | SGD step size | Rarely needs tuning |
| negative_sample_rate | 5 | Negatives per positive edge | Higher = better repulsion estimate, slower |
| init | spectral | Initialization method | Use spectral for global structure |
| spread | 1.0 | Embedding scale | Affects visual density |
| min_dist | 0.1 | Minimum embedding distance | Lower = tighter clusters |
Learning Rate Schedule:
UMAP uses a decreasing learning rate:
$$\alpha_t = \alpha_0 \cdot \left(1 - \frac{t}{T}\right)$$
where $t$ is the current epoch and $T$ is the total number of epochs. This "annealing" schedule allows large early steps that establish the global arrangement, followed by progressively smaller steps that refine local detail without destroying it.
Edge Sampling Schedule:
Not all edges are sampled equally often. UMAP samples edges proportional to their weights $\mu_{ij}$. High-weight edges (strong connections) are sampled more frequently, ensuring important local relationships are well-optimized.
The epochs_per_sample array controls this:
$$\text{epochs\_per\_sample}_{ij} = \frac{\mu_{ij}}{\max_{kl} \mu_{kl}} \cdot n_{\text{epochs}}$$
so the strongest edge is processed once per epoch on average, and an edge with half its weight half as often.
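Both schedules are one-liners. A sketch with helper names chosen for this page (treating the per-edge value as the total number of times that edge is processed over the full run):

```python
import numpy as np

def make_edge_sample_counts(weights, n_epochs):
    """Total times each edge is sampled over training, proportional to
    its membership weight; the strongest edge is sampled every epoch."""
    weights = np.asarray(weights, dtype=float)
    return np.round(weights / weights.max() * n_epochs).astype(int)

def learning_rate_schedule(alpha0, n_epochs):
    """Linearly annealed SGD step size: alpha_t = alpha0 * (1 - t/T)."""
    t = np.arange(n_epochs)
    return alpha0 * (1.0 - t / n_epochs)

counts = make_edge_sample_counts([1.0, 0.5, 0.1], n_epochs=200)
alphas = learning_rate_schedule(1.0, n_epochs=200)
```

Weak edges thus receive only a handful of updates, which is fine: their contribution to the layout is correspondingly small.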
Because negative sampling is stochastic (and initialization can be too), UMAP runs are not perfectly reproducible unless you set a random seed. The overall structure should nonetheless be consistent across runs. For reproducibility, set random_state in the UMAP constructor.
How do we know if UMAP has converged to a good solution? Unlike supervised learning, there's no obvious "loss on test set" to monitor.
Monitoring Convergence:
UMAP doesn't typically expose the loss during training, but you can monitor:
Point movement: Track how much the embedding changes between epochs. At convergence, movement should be minimal.
Loss value: Compute cross-entropy periodically (expensive but informative).
Visual stability: For visualization, the appearance should stabilize.
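The first of these signals, point movement, is cheap to compute. A sketch (helper name chosen here) that normalizes displacement by the embedding's spread, so a fixed threshold is meaningful across datasets:

```python
import numpy as np

def mean_point_movement(Y_prev, Y_curr):
    """Mean per-point displacement between two epochs, normalized by
    the embedding's overall scale (mean distance from the centroid)."""
    move = np.linalg.norm(Y_curr - Y_prev, axis=1).mean()
    scale = np.linalg.norm(Y_curr - Y_curr.mean(axis=0), axis=1).mean()
    return move / max(scale, 1e-12)
```

A value that stays below, say, 0.001 of the embedding scale for several consecutive epochs is a reasonable (if informal) stopping signal.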
Indicators of Good Convergence: per-epoch point movement has fallen to a small fraction of the embedding's overall scale, the cross-entropy (if monitored) has plateaued, and the layout no longer changes qualitatively between checkpoints.
Common Problems and Solutions:
| Problem | Symptom | Likely Cause | Solution |
|---|---|---|---|
| Disconnected fragments | Small isolated point groups | n_neighbors too small | Increase n_neighbors |
| Everything in one blob | No cluster separation | n_neighbors too large OR min_dist too high | Decrease n_neighbors or min_dist |
| Inconsistent runs | Very different layouts each time | Poor initialization or too few epochs | Use spectral init; increase n_epochs |
| Noisy embedding | Points scattered chaotically | Too much repulsion or bad data | Check data quality; reduce negative_sample_rate |
| Slow convergence | Structure still changing at epoch 500 | Complex data structure | Increase n_epochs; decrease learning_rate |
Remember: UMAP optimizes a proxy objective (cross-entropy), not your actual task (visualization quality, downstream classification, etc.). A lower cross-entropy doesn't guarantee a more useful embedding. Always evaluate embeddings based on your actual use case.
We have now covered UMAP's optimization procedure comprehensively. The key concepts to consolidate: the fuzzy set cross-entropy objective, its decomposition into attractive and repulsive terms, negative sampling as the trick that makes each epoch linear in the number of edges, and the annealed, weight-proportional sampling schedule that stabilizes convergence.
What's Next:
With UMAP's theory and optimization fully covered, the next page provides a comparative analysis of UMAP vs. t-SNE—examining their similarities, differences, and guidance on when to prefer each method.
You now understand UMAP's optimization framework: the cross-entropy objective that measures fuzzy set divergence, the gradient decomposition into attractive and repulsive forces, the negative sampling strategy that enables linear scaling, and the optimization schedule that balances global and local structure optimization.