In 2018, Leland McInnes, John Healy, and James Melville introduced Uniform Manifold Approximation and Projection (UMAP)—an algorithm that would fundamentally transform how practitioners approach high-dimensional data visualization and analysis. UMAP emerged not merely as an incremental improvement over existing methods, but as a paradigm shift grounded in rigorous mathematical foundations from algebraic topology and category theory.
What makes UMAP remarkable is its ability to simultaneously achieve what seemed like incompatible goals: preserving both local and global structure, scaling efficiently to massive datasets, and providing a theoretically principled framework that extends beyond mere heuristics. Where t-SNE pioneered the use of heavy-tailed distributions for dimensionality reduction, UMAP constructs a fundamentally different mathematical object—a weighted graph derived from fuzzy simplicial sets—that captures the topological essence of high-dimensional data.
This page establishes the theoretical foundations necessary to understand not just how UMAP works, but why it works so effectively. We will build intuition from first principles, connecting abstract mathematical concepts to practical algorithmic decisions.
By the end of this page, you will understand: (1) The mathematical foundations of UMAP from Riemannian geometry and algebraic topology, (2) How UMAP constructs manifold approximations using fuzzy simplicial sets, (3) The key theoretical insights that distinguish UMAP from other dimensionality reduction methods, and (4) Why UMAP's approach enables both superior embeddings and computational efficiency.
Before diving into UMAP's specifics, we must precisely articulate the problem it solves. The manifold hypothesis posits that high-dimensional data often lies on or near a low-dimensional manifold embedded in the ambient space. This isn't merely a convenient assumption—it reflects deep structure in how real-world data is generated.
Consider a dataset of face images. Each image might contain millions of pixels, placing it in a correspondingly high-dimensional space. Yet the degrees of freedom generating these images—head pose, lighting angle, expression, identity—number far fewer. The data lies on a manifold whose intrinsic dimensionality matches these generative factors, not the pixel count.
The Central Challenge:
Given samples from an unknown manifold $\mathcal{M}$ embedded in $\mathbb{R}^D$, we seek a mapping $f: \mathcal{M} \rightarrow \mathbb{R}^d$ (where $d \ll D$) that preserves essential structural properties. But what exactly should be preserved?
| Property | Definition | Importance | Challenge |
|---|---|---|---|
| Local Structure | Neighborhoods of nearby points | Captures fine-grained data relationships | Sensitive to noise; requires careful distance calculations |
| Global Structure | Large-scale arrangement and clusters | Reveals overall data organization | Computationally expensive; often sacrificed for local fidelity |
| Topological Structure | Connectivity, holes, loops in the data | Captures intrinsic shape invariants | Requires sophisticated mathematical machinery |
| Metric Structure | Distances between all point pairs | Enables quantitative comparisons | Complete preservation is generally impossible (Johnson-Lindenstrauss) |
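To make the last row concrete, here is a small illustrative sketch (not UMAP code) that projects high-dimensional points through a random linear map and measures how pairwise distances distort; with a small target dimension, distances survive only approximately, never exactly:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

# 200 points in a 1000-dimensional ambient space
n, D, d = 200, 1000, 10
X = rng.normal(size=(n, D))

# Random Gaussian projection, scaled so squared distances
# are preserved in expectation (Johnson-Lindenstrauss style)
P = rng.normal(size=(D, d)) / np.sqrt(d)
Y = X @ P

# Ratio of low-D to high-D pairwise distances: mean near 1,
# but with substantial spread when d is small
ratios = pdist(Y) / pdist(X)
print(f"distance ratio: mean={ratios.mean():.3f}, std={ratios.std():.3f}")
```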
Classical approaches like PCA focus on variance preservation—a global criterion that ignores manifold curvature entirely. Methods like Isomap attempt to preserve geodesic distances but struggle with non-convex manifolds. LLE preserves local linear relationships but provides no guarantees about global structure. t-SNE revolutionized the field by optimizing for local neighborhood preservation using probability distributions, but sacrifices global structure and lacks theoretical grounding beyond empirical success.
UMAP's Innovation:
UMAP approaches this problem from a fundamentally different angle: rather than defining what to preserve in terms of distances or probabilities, it asks: What is the topological structure of the data, and how can we find an embedding that has equivalent topology?
This shift from metric-centric to topology-centric thinking is the key insight underlying UMAP's effectiveness.
UMAP's theoretical framework rests on a profound assumption: the data is uniformly distributed on a Riemannian manifold. This might seem restrictive, but it leads to a powerful construction that handles arbitrary data distributions through local metric adaptation.
Riemannian Manifolds and Local Metrics:
A Riemannian manifold is a smooth manifold equipped with a metric tensor that defines inner products on tangent spaces at each point. This metric determines how we measure distances and angles locally. Crucially, the metric can vary from point to point—allowing the geometry to "bend" and "stretch."
For a manifold $\mathcal{M}$ with metric tensor $g$, the distance between infinitesimally close points is given by:
$$ds^2 = \sum_{i,j} g_{ij}(x)\, dx^i\, dx^j$$
In Euclidean space, $g_{ij} = \delta_{ij}$ (the identity), so $ds^2 = \sum_i (dx^i)^2$—the familiar Pythagorean theorem. On a curved manifold, $g_{ij}$ varies with position, creating non-Euclidean geometry.
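To make the position-dependent metric concrete, the sketch below (illustrative only, not UMAP code) numerically integrates $ds$ along curves under the polar-coordinate metric $g = \mathrm{diag}(1, r^2)$: the same coordinate displacement has different lengths at different positions.

```python
import numpy as np

def curve_length(path, metric):
    """Numerically integrate ds = sqrt(dx^T g(x) dx) along a sampled path."""
    length = 0.0
    for p, q in zip(path[:-1], path[1:]):
        dx = q - p
        g = metric((p + q) / 2)  # evaluate the metric at the step midpoint
        length += np.sqrt(dx @ g @ dx)
    return length

# Polar coordinates (r, theta): g = diag(1, r^2), so ds^2 = dr^2 + r^2 dtheta^2
polar_metric = lambda x: np.diag([1.0, x[0] ** 2])

theta = np.linspace(0, np.pi / 2, 1000)

# The same quarter-turn in theta at two radii: identical coordinate
# change, very different lengths under the metric
inner = np.column_stack([np.full_like(theta, 1.0), theta])
outer = np.column_stack([np.full_like(theta, 3.0), theta])
print(curve_length(inner, polar_metric))  # ~ pi/2   (about 1.571)
print(curve_length(outer, polar_metric))  # ~ 3*pi/2 (about 4.712)
```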
UMAP assumes data is uniformly distributed with respect to the Riemannian metric on the manifold. The metric therefore adapts to the local density of points: in regions where data is dense in the ambient space, the local metric stretches distances (nearby ambient points are pushed apart), while in sparse regions it contracts them. This assumption is what allows UMAP to handle varying data densities gracefully.
Local Distance Normalization:
Given data points sampled from an unknown manifold, UMAP cannot directly compute the true Riemannian metric. Instead, it infers a locally adapted metric that would make the data appear uniformly distributed.
For each point $x_i$, UMAP defines a local distance function:
$$d_i(x_j) = \frac{d(x_i, x_j) - \rho_i}{\sigma_i}$$
where:
- $d(x_i, x_j)$ is the distance between $x_i$ and $x_j$ in the ambient space,
- $\rho_i$ is the distance from $x_i$ to its nearest neighbor, and
- $\sigma_i$ is a local bandwidth (scaling factor) chosen so that $x_i$'s neighborhood has a fixed effective size.

This transformation is conceptually similar to defining a local metric at each point that expands or contracts distances based on local data density. In dense regions, $\sigma_i$ is small, making normalized distances larger; in sparse regions, $\sigma_i$ is large, making normalized distances smaller.
```python
import numpy as np
from scipy.spatial.distance import cdist


def compute_local_distances(X, k_neighbors=15, local_connectivity=1.0):
    """
    Compute locally-adapted distances for UMAP.

    This function demonstrates UMAP's core insight: transforming
    ambient distances into manifold-respecting local distances.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        High-dimensional data points
    k_neighbors : int
        Number of neighbors to consider (controls locality)
    local_connectivity : float
        Minimum number of local connections (typically 1.0)

    Returns
    -------
    rho : ndarray of shape (n_samples,)
        Distance to each point's nearest neighbor
    sigma : ndarray of shape (n_samples,)
        Local scaling factors (bandwidth)
    """
    n = X.shape[0]

    # Compute all pairwise distances
    distances = cdist(X, X)

    # For each point, find k nearest neighbors (excluding self)
    neighbor_distances = np.sort(distances, axis=1)[:, 1:k_neighbors + 1]

    # rho: distance to nearest neighbor (ensures graph connectivity);
    # local_connectivity selects which sorted neighbor defines it
    rho = neighbor_distances[:, int(np.floor(local_connectivity)) - 1]

    # sigma: local bandwidth that makes the local neighborhood
    # "look uniform" - solved via binary search to achieve a
    # target sum of membership strengths (log2(k_neighbors))
    sigma = np.zeros(n)
    target = np.log2(k_neighbors)
    for i in range(n):
        # Binary search for the sigma that hits the target sum
        lo, hi = 1e-10, 1000.0
        for _ in range(64):  # binary search iterations
            mid = (lo + hi) / 2
            # Sum of fuzzy membership strengths at this bandwidth
            psum = np.sum(np.exp(-(neighbor_distances[i] - rho[i]) / mid))
            if psum > target:
                hi = mid
            else:
                lo = mid
        sigma[i] = mid
    return rho, sigma


def fuzzy_membership_strength(d_ij, rho_i, sigma_i):
    """
    Compute fuzzy set membership strength from point i to point j.

    This is the core transformation: ambient distances become
    membership strengths in a fuzzy topological representation.

    The formula: exp(-(d_ij - rho_i) / sigma_i)
    - Subtracting rho_i: ensures at least one neighbor has strength ~1
    - Dividing by sigma_i: normalizes for local density
    - Exponential: converts to (0, 1] membership strength
    """
    if d_ij <= rho_i:
        return 1.0
    return np.exp(-(d_ij - rho_i) / sigma_i)
```

The Exponential Map Connection:
In Riemannian geometry, the exponential map transforms tangent vectors at a point into points on the manifold by "shooting" along geodesics. UMAP's local distance normalization can be understood as implicitly constructing coordinates in the tangent space at each point, where distances respect the local Riemannian metric induced by the uniform distribution assumption.
This geometric perspective explains why UMAP handles varying densities more gracefully than methods assuming a fixed global metric. Each point effectively lives in its own locally-adapted coordinate system, and UMAP's fuzzy set formulation provides a principled way to reconcile these different local views into a coherent global structure.
The second pillar of UMAP's theoretical foundation is algebraic topology—specifically, the theory of simplicial sets, augmented with a "fuzzy" extension. This provides the mathematical machinery to define and manipulate the topological structure we wish to preserve.
From Point Clouds to Topology:
Given a finite set of data points, we need to construct a mathematical object that captures their topological relationships. The classical approach uses simplicial complexes:
A simplicial complex is a collection of simplices (points, edges, triangles, and their higher-dimensional analogues) satisfying closure under taking faces: every face of a simplex in the complex is also in the complex.
A fundamental result in algebraic topology, the Nerve Theorem, states that if we cover a space with open sets forming a "good" cover (one where all non-empty intersections are contractible), the nerve of that cover (the simplicial complex containing a simplex for each non-empty intersection of covering sets) is homotopy equivalent to the original space. This is the theoretical justification for approximating manifold topology from local neighborhoods.
The Čech Complex:
For point cloud data, a natural construction is the Čech complex: place a ball of radius $\epsilon$ around each point, and form simplices whenever balls intersect. The Nerve Theorem guarantees this captures manifold topology—if we choose $\epsilon$ correctly.
The problem: there is no single "correct" $\epsilon$. Too small, and the complex is disconnected; too large, and we lose topological detail. Persistent homology addresses this by studying how topology changes across all scales, but this is computationally expensive.
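A quick way to see the threshold problem is to count connected components of the $\epsilon$-neighborhood graph at several scales. The sketch below (illustrative, not UMAP code) does this with scipy on two blobs of different density:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)

# Two Gaussian blobs with different densities
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),   # dense cluster
    rng.normal(3.0, 0.5, size=(50, 2)),   # sparse cluster
])

D = squareform(pdist(X))
for eps in [0.05, 0.2, 0.8, 5.0]:
    adjacency = csr_matrix((D < eps) & (D > 0))
    n_components, _ = connected_components(adjacency, directed=False)
    print(f"eps={eps:4.2f}: {n_components} connected components")

# Small eps fragments the sparse cluster; large eps merges the two blobs.
# No single eps captures both structures -- hence fuzzy memberships.
```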
UMAP's Solution: Fuzzy Simplicial Sets
Rather than choosing a single threshold, UMAP introduces fuzzy simplicial sets where simplices have membership strengths in $[0, 1]$ rather than binary presence/absence. This elegantly handles the multi-scale nature of real data.
A fuzzy simplicial set assigns to each potential simplex a degree of membership. For UMAP's purposes, we focus primarily on 0-simplices (points, always with membership 1) and 1-simplices (edges with fuzzy weights).
| Aspect | Classical Simplicial Complex | Fuzzy Simplicial Set |
|---|---|---|
| Edge existence | Binary: present or absent | Continuous: strength in [0, 1] |
| Threshold dependence | Requires choosing ε | Integrates over all scales implicitly |
| Multi-scale structure | Lost at fixed scale | Preserved in edge weights |
| Distance sensitivity | Discontinuous (threshold) | Smooth (exponential decay) |
| Computational form | Sparse adjacency matrix | Weighted graph / sparse matrix |
Mathematical Formulation:
For each data point $x_i$, UMAP constructs a local fuzzy simplicial set. The membership strength of the edge $(x_i, x_j)$ in $x_i$'s local set is:
$$\mu_i(x_j) = \exp\left( -\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i} \right)$$
Note that this is asymmetric: $\mu_i(x_j) \neq \mu_j(x_i)$ in general, because $\rho$ and $\sigma$ differ for each point. This asymmetry reflects the different local geometries at each point.
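A tiny worked example of this asymmetry, with hypothetical $\rho$ and $\sigma$ values for endpoints sitting in regions of different density:

```python
import numpy as np

def membership(d, rho, sigma):
    """mu_i(x_j) = exp(-max(0, d - rho_i) / sigma_i)"""
    return np.exp(-max(0.0, d - rho) / sigma)

d_ij = 1.0  # the same ambient distance, seen from both endpoints

# x_i sits in a sparse region: large rho and sigma, distances shrink
print(membership(d_ij, rho=0.8, sigma=0.5))   # ~0.67: strong edge

# x_j sits in a dense region: small rho and sigma, distances stretch
print(membership(d_ij, rho=0.1, sigma=0.3))   # ~0.05: weak edge
```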
In the original UMAP paper, the authors develop these ideas using category theory and extended pseudo-metric spaces. The fuzzy simplicial set construction arises from a functor (the "fuzzy singular set" functor) that translates between metric and topological representations. While this mathematical machinery isn't necessary to use UMAP, it provides the theoretical guarantees that justify the algorithmic choices.
Each data point's local neighborhood defines an asymmetric fuzzy simplicial set. To obtain a global representation of the entire dataset, we must combine these local views. This is where UMAP's fuzzy set theory comes into play.
Reconciling Asymmetric Views:
Consider points $x_i$ and $x_j$. From $x_i$'s perspective, the edge $(x_i, x_j)$ might have strength 0.8 (they're close in $x_i$'s local metric). From $x_j$'s perspective, the same edge might have strength 0.2 ($x_j$ is in a denser region, so distances are larger).
How do we reconcile these views? UMAP uses a probabilistic t-conorm (fuzzy union):
$$\mu(x_i, x_j) = \mu_i(x_j) + \mu_j(x_i) - \mu_i(x_j) \cdot \mu_j(x_i)$$
This is equivalent to the probability of at least one event occurring if we treat membership strengths as independent probabilities.
```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial import cKDTree


def compute_membership_strengths(X, rho, sigma, n_neighbors):
    """
    Compute asymmetric fuzzy membership strengths.

    For each point i, computes the membership strength of edges
    to its k nearest neighbors based on locally-adapted distances.
    """
    n = X.shape[0]
    tree = cKDTree(X)

    # Query k+1 neighbors (includes self)
    distances, indices = tree.query(X, k=n_neighbors + 1)
    distances = distances[:, 1:]  # exclude self
    indices = indices[:, 1:]

    # Compute membership strengths
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = indices.ravel()

    # μ_i(x_j) = exp(-max(0, d_ij - ρ_i) / σ_i)
    d_normalized = np.maximum(0, distances - rho[:, np.newaxis])
    strengths = np.exp(-d_normalized / sigma[:, np.newaxis])

    # Create sparse asymmetric membership matrix
    membership = coo_matrix(
        (strengths.ravel(), (rows, cols)), shape=(n, n)
    ).tocsr()
    return membership


def fuzzy_simplicial_set_union(membership):
    """
    Combine asymmetric local fuzzy sets into a symmetric global fuzzy set.

    Uses the probabilistic t-conorm (fuzzy union):
        μ(i,j) = μ_i(j) + μ_j(i) - μ_i(j) * μ_j(i)

    This can be rewritten as:
        μ(i,j) = 1 - (1 - μ_i(j)) * (1 - μ_j(i))

    Interpretation: probability that the edge exists if each
    direction contributes independently.
    """
    membership_T = membership.T  # reverse direction

    # Fuzzy union: A + B - A*B, element-wise on sparse matrices
    combined = membership + membership_T - membership.multiply(membership_T)
    return combined


def construct_fuzzy_simplicial_set(X, n_neighbors=15, local_connectivity=1.0):
    """
    Complete pipeline: from raw data to fuzzy simplicial set.

    This is the high-dimensional representation that UMAP will
    try to match in the low-dimensional embedding.
    """
    # Step 1: Compute local scaling parameters
    rho, sigma = compute_local_distances(X, n_neighbors, local_connectivity)

    # Step 2: Compute asymmetric membership strengths
    membership = compute_membership_strengths(X, rho, sigma, n_neighbors)

    # Step 3: Combine into a symmetric fuzzy simplicial set
    fuzzy_set = fuzzy_simplicial_set_union(membership)

    # Step 4: Prune numerically negligible edges that only add noise
    fuzzy_set = csr_matrix(fuzzy_set)
    fuzzy_set.data[fuzzy_set.data < 1e-4] = 0
    fuzzy_set.eliminate_zeros()
    return fuzzy_set
```

Why Fuzzy Union Works:
The fuzzy union operation has several desirable properties:
Symmetry: The result $\mu(x_i, x_j)$ equals $\mu(x_j, x_i)$, essential for undirected graph optimization.
Preservation of Strong Connections: If either $\mu_i(x_j)$ or $\mu_j(x_i)$ is close to 1, the combined strength is close to 1. Strong relationships are preserved.
Weak Connection Handling: If both strengths are small, the combined strength remains small (since $a + b - ab \leq a + b$), so noise edges do not accumulate into strong connections.
Fixed Extremes: Combining a membership with itself gives $\mu + \mu - \mu^2 = \mu(2 - \mu)$, which leaves the extreme values fixed (0 stays 0, 1 stays 1); for intermediate values the operation is not idempotent and nudges strengths toward 1.
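These behaviors are easy to verify numerically; a quick standalone check of the probabilistic t-conorm:

```python
# Probabilistic t-conorm: a + b - a*b
t_conorm = lambda a, b: a + b - a * b

print(t_conorm(0.8, 0.2), t_conorm(0.2, 0.8))  # 0.84, 0.84 -> symmetric
print(t_conorm(0.95, 0.1))                      # 0.955  -> strong edges survive
print(t_conorm(0.05, 0.05))                     # 0.0975 -> weak edges stay weak
print(t_conorm(0.0, 0.0), t_conorm(1.0, 1.0))   # 0.0, 1.0 -> extremes are fixed
print(t_conorm(0.5, 0.5))                       # 0.75 -> not idempotent in general
```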
The Result: A Weighted k-NN Graph
After fuzzy union, we have a symmetric weighted graph where edge weights represent "topological similarity" between points. This graph is the high-dimensional fuzzy simplicial set that UMAP will seek to reproduce in low-dimensional space.
For large datasets, exact nearest neighbor computation becomes expensive (brute force is $O(n^2)$ in the number of points). UMAP uses approximate nearest neighbor algorithms (such as NN-Descent or Annoy) that achieve empirically near-linear scaling. This is one of the key practical advantages over t-SNE, which traditionally relied on exact neighbor computations.
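As a sketch of the approximate k-NN step, assuming the pynndescent package (the NN-Descent implementation that umap-learn uses under the hood; API details are from its documentation and may vary across versions):

```python
import numpy as np

# Assumes `pip install pynndescent`; API per its documentation
from pynndescent import NNDescent

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))

# Build an approximate k-NN index in empirically near-linear time
index = NNDescent(X, n_neighbors=15, metric="euclidean")
knn_indices, knn_distances = index.neighbor_graph

print(knn_indices.shape, knn_distances.shape)  # (10000, 15) each
```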
With the high-dimensional fuzzy simplicial set constructed, we can now state UMAP's fundamental optimization principle:
UMAP's Core Objective:
Find a low-dimensional embedding such that the fuzzy simplicial set constructed from the embedding is as close as possible to the fuzzy simplicial set from the high-dimensional data.
Mathematically, if we denote:
- $G^h$: the fuzzy simplicial set (weighted graph) constructed from the high-dimensional data, and
- $G^l$: the fuzzy simplicial set constructed from the candidate low-dimensional embedding,

then UMAP minimizes a measure of divergence between $G^h$ and $G^l$: the fuzzy set cross-entropy, summed over edges.
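A minimal sketch of that cross-entropy loss, written densely for clarity (the actual implementation works on the sparse graph and optimizes stochastically with negative sampling):

```python
import numpy as np

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """
    Cross-entropy between high-D memberships mu (fixed) and
    low-D memberships nu (functions of the embedding):

        sum_ij [ mu log(mu/nu) + (1 - mu) log((1 - mu)/(1 - nu)) ]

    The first term pulls connected points together (attraction);
    the second pushes unconnected points apart (repulsion).
    """
    mu = np.clip(mu, eps, 1 - eps)
    nu = np.clip(nu, eps, 1 - eps)
    attraction = mu * np.log(mu / nu)
    repulsion = (1 - mu) * np.log((1 - mu) / (1 - nu))
    return float(np.sum(attraction + repulsion))
```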
Contrast with t-SNE:
Both t-SNE and UMAP can be viewed as optimizing for agreement between high-dimensional and low-dimensional representations. The key differences are:
| Aspect | t-SNE | UMAP |
|---|---|---|
| High-D representation | Gaussian affinities normalized per point | Fuzzy set memberships from local metrics |
| Low-D representation | Student-t affinities (global normalization) | Fuzzy set memberships (no normalization) |
| Symmetrization | Average of conditional probabilities | Fuzzy union (probabilistic t-conorm) |
| Theoretical basis | Information-theoretic (KL divergence) | Topological (fuzzy simplicial sets) |
While the practical outcomes often look similar, UMAP's theoretical foundations provide stronger guarantees and enable principled extensions (e.g., supervised UMAP, metric learning).
UMAP's approach exemplifies a powerful pattern in applied mathematics: by formulating the problem at the right level of abstraction (topology rather than geometry), we obtain solutions that are both more general and more effective. The topology captures what truly matters for understanding data structure while ignoring irrelevant metric details that vary with noise and sampling.
To complete the theoretical picture, we must define how UMAP constructs the low-dimensional fuzzy simplicial set $G^l$ from an embedding $Y = \{y_1, \ldots, y_n\} \subset \mathbb{R}^d$.
The Low-Dimensional Affinity Function:
In low-dimensional space, UMAP uses a parametric family of functions to define edge weights:
$$\nu(y_i, y_j) = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$
where $a$ and $b$ are parameters (typically $a \approx 1.93$, $b \approx 0.79$ for the default min_dist of 0.1).
When $a = b = 1$, this simplifies to:
$$\nu(y_i, y_j) = \frac{1}{1 + \|y_i - y_j\|^2}$$
which is exactly the Student-t distribution with 1 degree of freedom—the same heavy-tailed distribution used by t-SNE!
Heavy-tailed distributions in low-dimensional space are crucial for dimensionality reduction. In high dimensions, most randomly placed points are far apart. A Gaussian affinity would place vanishingly small probability on these distances. The heavy-tailed Student-t (or UMAP's variant) allows moderate-distance relationships in high-D to become larger distances in low-D without vanishing gradient—enabling clusters to separate cleanly.
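A small numeric comparison makes the tail behavior concrete:

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

gaussian = np.exp(-d**2)          # light tail: weights vanish quickly
student_t = 1.0 / (1.0 + d**2)    # heavy tail: weights decay polynomially

for dist, g, t in zip(d, gaussian, student_t):
    print(f"d={dist:4.1f}  gaussian={g:.2e}  student_t={t:.2e}")

# At d=4 the Gaussian weight is ~1e-7 (essentially no gradient left),
# while the Student-t weight is ~0.06 -- still pushing points apart.
```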
The min_dist Parameter:
UMAP's distinctive $a$ and $b$ parameters are derived from the user-specified min_dist hyperparameter, which controls the minimum distance at which points are allowed to separate. The relationship is found by fitting:
$$\psi(d) = \begin{cases} 1 & \text{if } d \leq \text{min\_dist} \\ e^{-(d - \text{min\_dist})} & \text{otherwise} \end{cases}$$
with the parametric family $\left(1 + a \cdot d^{2b}\right)^{-1}$.
Effect of min_dist:
| min_dist | Effect | Use Case |
|---|---|---|
| 0.0 | Maximum packing, points can overlap | Dense cluster analysis |
| 0.1 (default) | Balanced separation | General visualization |
| 0.5 | More spread within clusters | When individual points matter |
| 0.99 | Very loose packing | Smooth, continuous embeddings |
Lower min_dist values create tighter, more distinct clusters; higher values produce more uniform, spread-out embeddings.
```python
import numpy as np
from scipy.optimize import curve_fit


def fit_ab_parameters(min_dist=0.1, spread=1.0):
    """
    Fit the a and b parameters for the low-dimensional affinity function.

    UMAP defines a target function:
        psi(d) = 1                              if d <= min_dist
        psi(d) = exp(-(d - min_dist) / spread)  otherwise

    We fit (1 + a * d^(2b))^(-1) to match this target.

    Parameters
    ----------
    min_dist : float
        Minimum distance at which points can appear in the embedding
    spread : float
        Effective scale of the embedding (usually 1.0)

    Returns
    -------
    a, b : floats
        Parameters for the low-dimensional weight function
    """
    def target_curve(x, min_dist, spread):
        """Piecewise target function."""
        result = np.zeros_like(x)
        result[x <= min_dist] = 1.0
        result[x > min_dist] = np.exp(-(x[x > min_dist] - min_dist) / spread)
        return result

    def parametric_curve(x, a, b):
        """Parametric family to fit."""
        return 1.0 / (1.0 + a * np.power(x, 2 * b))

    # Generate target values
    x = np.linspace(0, 3 * spread, 300)
    y = target_curve(x, min_dist, spread)

    # Fit parameters (initial guess based on Student-t)
    (a, b), _ = curve_fit(parametric_curve, x, y, p0=[1.0, 1.0])
    return a, b


def low_dimensional_weight(d, a, b):
    """
    Compute the low-dimensional edge weight: the membership strength
    in the low-dimensional fuzzy simplicial set.

    Parameters
    ----------
    d : float or ndarray
        Euclidean distance(s) in embedding space
    a, b : floats
        Fitted curve parameters

    Returns
    -------
    weight : same shape as d
        Edge weight(s) in [0, 1]
    """
    return 1.0 / (1.0 + a * np.power(d, 2 * b))


# Example: compute weights for various distances
if __name__ == "__main__":
    a, b = fit_ab_parameters(min_dist=0.1)
    print(f"Fitted parameters: a={a:.4f}, b={b:.4f}")

    distances = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0])
    weights = low_dimensional_weight(distances, a, b)
    print("\nDistance -> Weight mapping:")
    for d, w in zip(distances, weights):
        print(f"  d={d:.1f} -> w={w:.4f}")
```

We have now established the complete theoretical foundation for UMAP. Let's consolidate the key concepts:
- The manifold hypothesis: high-dimensional data lies on or near a low-dimensional manifold, whose intrinsic dimensionality matches the generative degrees of freedom.
- The uniform distribution assumption: a locally adapted metric, captured by $\rho_i$ and $\sigma_i$, makes the data appear uniformly distributed on that manifold.
- Fuzzy simplicial sets: local neighborhoods become fuzzy edge memberships, and asymmetric local views are reconciled via the probabilistic t-conorm into a weighted k-NN graph.
- The optimization principle: find an embedding whose fuzzy simplicial set matches the high-dimensional one, using the $(1 + a d^{2b})^{-1}$ affinity family parameterized by min_dist.
What's Next:
With the theoretical foundations established, the next page explores the fuzzy topological representation in greater depth—examining how UMAP's category-theoretic foundations lead to specific algorithmic choices and how the fuzzy simplicial set captures manifold topology through weighted graphs.
You now understand UMAP's theoretical foundations: the Riemannian geometry assumptions, fuzzy simplicial set constructions, and the fundamental optimization principle. These concepts provide the mathematical rigor that distinguishes UMAP from heuristic dimensionality reduction methods and enables its remarkable effectiveness across diverse applications.