In 2018, Leland McInnes, John Healy, and James Melville introduced Uniform Manifold Approximation and Projection (UMAP)—an algorithm that would fundamentally transform how practitioners approach high-dimensional data visualization and analysis. UMAP emerged not merely as an incremental improvement over existing methods, but as a paradigm shift grounded in rigorous mathematical foundations from algebraic topology and category theory.
What makes UMAP remarkable is its ability to simultaneously achieve what seemed like incompatible goals: preserving both local and global structure, scaling efficiently to massive datasets, and providing a theoretically principled framework that extends beyond mere heuristics. Where t-SNE pioneered the use of heavy-tailed distributions for dimensionality reduction, UMAP constructs a fundamentally different mathematical object—a weighted graph derived from fuzzy simplicial sets—that captures the topological essence of high-dimensional data.
This page establishes the theoretical foundations necessary to understand not just how UMAP works, but why it works so effectively. We will build intuition from first principles, connecting abstract mathematical concepts to practical algorithmic decisions.
By the end of this page, you will understand: (1) The mathematical foundations of UMAP from Riemannian geometry and algebraic topology, (2) How UMAP constructs manifold approximations using fuzzy simplicial sets, (3) The key theoretical insights that distinguish UMAP from other dimensionality reduction methods, and (4) Why UMAP's approach enables both superior embeddings and computational efficiency.
Before diving into UMAP's specifics, we must precisely articulate the problem it solves. The manifold hypothesis posits that high-dimensional data often lies on or near a low-dimensional manifold embedded in the ambient space. This isn't merely a convenient assumption—it reflects deep structure in how real-world data is generated.
Consider a dataset of face images. Each image might contain millions of pixels, placing it in a correspondingly high-dimensional space. Yet the degrees of freedom generating these images—head pose, lighting angle, expression, identity—number far fewer. The data lies on a manifold whose intrinsic dimensionality matches these generative factors, not the pixel count.
The Central Challenge:
Given samples from an unknown manifold $\mathcal{M}$ embedded in $\mathbb{R}^D$, we seek a mapping $f: \mathcal{M} \rightarrow \mathbb{R}^d$ (where $d \ll D$) that preserves essential structural properties. But what exactly should be preserved?
| Property | Definition | Importance | Challenge |
|---|---|---|---|
| Local Structure | Neighborhoods of nearby points | Captures fine-grained data relationships | Sensitive to noise; requires careful distance calculations |
| Global Structure | Large-scale arrangement and clusters | Reveals overall data organization | Computationally expensive; often sacrificed for local fidelity |
| Topological Structure | Connectivity, holes, loops in the data | Captures intrinsic shape invariants | Requires sophisticated mathematical machinery |
| Metric Structure | Distances between all point pairs | Enables quantitative comparisons | Complete preservation is generally impossible (Johnson-Lindenstrauss) |
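To make the last row concrete, here is a small illustrative sketch (not UMAP code) that projects high-dimensional points through a random linear map and measures how pairwise distances distort; with a small target dimension, distances survive only approximately, never exactly:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

# 200 points in a 1000-dimensional ambient space
n, D, d = 200, 1000, 10
X = rng.normal(size=(n, D))

# Random Gaussian projection, scaled so squared distances
# are preserved in expectation (Johnson-Lindenstrauss style)
P = rng.normal(size=(D, d)) / np.sqrt(d)
Y = X @ P

# Ratio of low-D to high-D pairwise distances: mean near 1,
# but with substantial spread when d is small
ratios = pdist(Y) / pdist(X)
print(f"distance ratio: mean={ratios.mean():.3f}, std={ratios.std():.3f}")
```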
Classical approaches like PCA focus on variance preservation—a global criterion that ignores manifold curvature entirely. Methods like Isomap attempt to preserve geodesic distances but struggle with non-convex manifolds. LLE preserves local linear relationships but provides no guarantees about global structure. t-SNE revolutionized the field by optimizing for local neighborhood preservation using probability distributions, but sacrifices global structure and lacks theoretical grounding beyond empirical success.
UMAP's Innovation:
UMAP approaches this problem from a fundamentally different angle: rather than defining what to preserve in terms of distances or probabilities, it asks: What is the topological structure of the data, and how can we find an embedding that has equivalent topology?
This shift from metric-centric to topology-centric thinking is the key insight underlying UMAP's effectiveness.
UMAP's theoretical framework rests on a profound assumption: the data is uniformly distributed on a Riemannian manifold. This might seem restrictive, but it leads to a powerful construction that handles arbitrary data distributions through local metric adaptation.
Riemannian Manifolds and Local Metrics:
A Riemannian manifold is a smooth manifold equipped with a metric tensor that defines inner products on tangent spaces at each point. This metric determines how we measure distances and angles locally. Crucially, the metric can vary from point to point—allowing the geometry to "bend" and "stretch."
For a manifold $\mathcal{M}$ with metric tensor $g$, the distance between infinitesimally close points is given by:
$$ds^2 = \sum_{i,j} g_{ij}(x)\, dx^i\, dx^j$$
In Euclidean space, $g_{ij} = \delta_{ij}$ (the identity), so $ds^2 = \sum_i (dx^i)^2$—the familiar Pythagorean theorem. On a curved manifold, $g_{ij}$ varies with position, creating non-Euclidean geometry.
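To make the position-dependent metric concrete, the sketch below (illustrative only, not UMAP code) numerically integrates $ds$ along curves under the polar-coordinate metric $g = \mathrm{diag}(1, r^2)$: the same coordinate displacement has different lengths at different positions.

```python
import numpy as np

def curve_length(path, metric):
    """Numerically integrate ds = sqrt(dx^T g(x) dx) along a sampled path."""
    length = 0.0
    for p, q in zip(path[:-1], path[1:]):
        dx = q - p
        g = metric((p + q) / 2)  # evaluate the metric at the step midpoint
        length += np.sqrt(dx @ g @ dx)
    return length

# Polar coordinates (r, theta): g = diag(1, r^2), so ds^2 = dr^2 + r^2 dtheta^2
polar_metric = lambda x: np.diag([1.0, x[0] ** 2])

theta = np.linspace(0, np.pi / 2, 1000)

# The same quarter-turn in theta at two radii: identical coordinate
# change, very different lengths under the metric
inner = np.column_stack([np.full_like(theta, 1.0), theta])
outer = np.column_stack([np.full_like(theta, 3.0), theta])
print(curve_length(inner, polar_metric))  # ~ pi/2   (about 1.571)
print(curve_length(outer, polar_metric))  # ~ 3*pi/2 (about 4.712)
```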
UMAP assumes data is uniformly distributed with respect to the Riemannian metric on the manifold. The metric therefore adapts to the local density of points: in regions where data is dense in the ambient space, the local metric stretches distances (nearby ambient points are pushed apart), while in sparse regions it contracts them. This assumption is what allows UMAP to handle varying data densities gracefully.
Local Distance Normalization:
Given data points sampled from an unknown manifold, UMAP cannot directly compute the true Riemannian metric. Instead, it infers a locally adapted metric that would make the data appear uniformly distributed.
For each point $x_i$, UMAP defines a local distance function:
$$d_i(x_j) = \frac{d(x_i, x_j) - \rho_i}{\sigma_i}$$
where:
- $d(x_i, x_j)$ is the distance between $x_i$ and $x_j$ in the ambient space,
- $\rho_i$ is the distance from $x_i$ to its nearest neighbor, and
- $\sigma_i$ is a local bandwidth (scaling factor) chosen so that $x_i$'s neighborhood has a fixed effective size.

This transformation is conceptually similar to defining a local metric at each point that expands or contracts distances based on local data density. In dense regions, $\sigma_i$ is small, making normalized distances larger; in sparse regions, $\sigma_i$ is large, making normalized distances smaller.
```python
import numpy as np
from scipy.spatial.distance import cdist


def compute_local_distances(X, k_neighbors=15, local_connectivity=1.0):
    """
    Compute locally-adapted distances for UMAP.

    This function demonstrates UMAP's core insight: transforming
    ambient distances into manifold-respecting local distances.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        High-dimensional data points
    k_neighbors : int
        Number of neighbors to consider (controls locality)
    local_connectivity : float
        Minimum number of local connections (typically 1.0)

    Returns
    -------
    rho : ndarray of shape (n_samples,)
        Distance to each point's nearest neighbor
    sigma : ndarray of shape (n_samples,)
        Local scaling factors (bandwidth)
    """
    n = X.shape[0]

    # Compute all pairwise distances
    distances = cdist(X, X)

    # For each point, find k nearest neighbors (excluding self)
    neighbor_distances = np.sort(distances, axis=1)[:, 1:k_neighbors + 1]

    # rho: distance to nearest neighbor (ensures graph connectivity);
    # local_connectivity selects which sorted neighbor defines it
    rho = neighbor_distances[:, int(np.floor(local_connectivity)) - 1]

    # sigma: local bandwidth that makes the local neighborhood
    # "look uniform" - solved via binary search to achieve a
    # target sum of membership strengths (log2(k_neighbors))
    sigma = np.zeros(n)
    target = np.log2(k_neighbors)
    for i in range(n):
        # Binary search for the sigma that hits the target sum
        lo, hi = 1e-10, 1000.0
        for _ in range(64):  # binary search iterations
            mid = (lo + hi) / 2
            # Sum of fuzzy membership strengths at this bandwidth
            psum = np.sum(np.exp(-(neighbor_distances[i] - rho[i]) / mid))
            if psum > target:
                hi = mid
            else:
                lo = mid
        sigma[i] = mid
    return rho, sigma


def fuzzy_membership_strength(d_ij, rho_i, sigma_i):
    """
    Compute fuzzy set membership strength from point i to point j.

    This is the core transformation: ambient distances become
    membership strengths in a fuzzy topological representation.

    The formula: exp(-(d_ij - rho_i) / sigma_i)
    - Subtracting rho_i: ensures at least one neighbor has strength ~1
    - Dividing by sigma_i: normalizes for local density
    - Exponential: converts to (0, 1] membership strength
    """
    if d_ij <= rho_i:
        return 1.0
    return np.exp(-(d_ij - rho_i) / sigma_i)
```

The Exponential Map Connection:
In Riemannian geometry, the exponential map transforms tangent vectors at a point into points on the manifold by "shooting" along geodesics. UMAP's local distance normalization can be understood as implicitly constructing coordinates in the tangent space at each point, where distances respect the local Riemannian metric induced by the uniform distribution assumption.
This geometric perspective explains why UMAP handles varying densities more gracefully than methods assuming a fixed global metric. Each point effectively lives in its own locally-adapted coordinate system, and UMAP's fuzzy set formulation provides a principled way to reconcile these different local views into a coherent global structure.
The second pillar of UMAP's theoretical foundation is algebraic topology—specifically, the theory of simplicial sets, augmented with a "fuzzy" extension. This provides the mathematical machinery to define and manipulate the topological structure we wish to preserve.
From Point Clouds to Topology:
Given a finite set of data points, we need to construct a mathematical object that captures their topological relationships. The classical approach uses simplicial complexes:
A simplicial complex is a collection of simplices (points, edges, triangles, and their higher-dimensional analogues) satisfying closure under taking faces: every face of a simplex in the complex is also in the complex.
A fundamental result in algebraic topology, the Nerve Theorem, states that if we cover a space with open sets forming a "good" cover (one where all non-empty intersections are contractible), the nerve of that cover (the simplicial complex containing a simplex for each non-empty intersection of covering sets) is homotopy equivalent to the original space. This is the theoretical justification for approximating manifold topology from local neighborhoods.
The Čech Complex:
For point cloud data, a natural construction is the Čech complex: place a ball of radius $\epsilon$ around each point, and form simplices whenever balls intersect. The Nerve Theorem guarantees this captures manifold topology—if we choose $\epsilon$ correctly.
The problem: there is no single "correct" $\epsilon$. Too small, and the complex is disconnected; too large, and we lose topological detail. Persistent homology addresses this by studying how topology changes across all scales, but this is computationally expensive.
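A quick way to see the threshold problem is to count connected components of the $\epsilon$-neighborhood graph at several scales. The sketch below (illustrative, not UMAP code) does this with scipy on two blobs of different density:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)

# Two Gaussian blobs with different densities
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),   # dense cluster
    rng.normal(3.0, 0.5, size=(50, 2)),   # sparse cluster
])

D = squareform(pdist(X))
for eps in [0.05, 0.2, 0.8, 5.0]:
    adjacency = csr_matrix((D < eps) & (D > 0))
    n_components, _ = connected_components(adjacency, directed=False)
    print(f"eps={eps:4.2f}: {n_components} connected components")

# Small eps fragments the sparse cluster; large eps merges the two blobs.
# No single eps captures both structures -- hence fuzzy memberships.
```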
UMAP's Solution: Fuzzy Simplicial Sets
Rather than choosing a single threshold, UMAP introduces fuzzy simplicial sets where simplices have membership strengths in $[0, 1]$ rather than binary presence/absence. This elegantly handles the multi-scale nature of real data.
A fuzzy simplicial set assigns to each potential simplex a degree of membership. For UMAP's purposes, we focus primarily on 0-simplices (points, always with membership 1) and 1-simplices (edges with fuzzy weights).
| Aspect | Classical Simplicial Complex | Fuzzy Simplicial Set |
|---|---|---|
| Edge existence | Binary: present or absent | Continuous: strength in [0, 1] |
| Threshold dependence | Requires choosing ε | Integrates over all scales implicitly |
| Multi-scale structure | Lost at fixed scale | Preserved in edge weights |
| Distance sensitivity | Discontinuous (threshold) | Smooth (exponential decay) |
| Computational form | Sparse adjacency matrix | Weighted graph / sparse matrix |
Mathematical Formulation:
For each data point $x_i$, UMAP constructs a local fuzzy simplicial set. The membership strength of the edge $(x_i, x_j)$ in $x_i$'s local set is:
$$\mu_i(x_j) = \exp\left( -\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i} \right)$$
Note that this is asymmetric: $\mu_i(x_j) \neq \mu_j(x_i)$ in general, because $\rho$ and $\sigma$ differ for each point. This asymmetry reflects the different local geometries at each point.
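A tiny worked example of this asymmetry, with hypothetical $\rho$ and $\sigma$ values for endpoints sitting in regions of different density:

```python
import numpy as np

def membership(d, rho, sigma):
    """mu_i(x_j) = exp(-max(0, d - rho_i) / sigma_i)"""
    return np.exp(-max(0.0, d - rho) / sigma)

d_ij = 1.0  # the same ambient distance, seen from both endpoints

# x_i sits in a sparse region: large rho and sigma, distances shrink
print(membership(d_ij, rho=0.8, sigma=0.5))   # ~0.67: strong edge

# x_j sits in a dense region: small rho and sigma, distances stretch
print(membership(d_ij, rho=0.1, sigma=0.3))   # ~0.05: weak edge
```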
In the original UMAP paper, the authors develop these ideas using category theory and extended pseudo-metric spaces. The fuzzy simplicial set construction arises from a functor (the "fuzzy singular set" functor) that translates between metric and topological representations. While this mathematical machinery isn't necessary to use UMAP, it provides the theoretical guarantees that justify the algorithmic choices.
Each data point's local neighborhood defines an asymmetric fuzzy simplicial set. To obtain a global representation of the entire dataset, we must combine these local views. This is where UMAP's fuzzy set theory comes into play.
Reconciling Asymmetric Views:
Consider points $x_i$ and $x_j$. From $x_i$'s perspective, the edge $(x_i, x_j)$ might have strength 0.8 (they're close in $x_i$'s local metric). From $x_j$'s perspective, the same edge might have strength 0.2 ($x_j$ is in a denser region, so distances are larger).
How do we reconcile these views? UMAP uses a probabilistic t-conorm (fuzzy union):
$$\mu(x_i, x_j) = \mu_i(x_j) + \mu_j(x_i) - \mu_i(x_j) \cdot \mu_j(x_i)$$
This is equivalent to the probability of at least one event occurring if we treat membership strengths as independent probabilities.
```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial import cKDTree


def compute_membership_strengths(X, rho, sigma, n_neighbors):
    """
    Compute asymmetric fuzzy membership strengths.

    For each point i, computes the membership strength of edges
    to its k nearest neighbors based on locally-adapted distances.
    """
    n = X.shape[0]
    tree = cKDTree(X)

    # Query k+1 neighbors (includes self)
    distances, indices = tree.query(X, k=n_neighbors + 1)
    distances = distances[:, 1:]  # exclude self
    indices = indices[:, 1:]

    # Compute membership strengths
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = indices.ravel()

    # μ_i(x_j) = exp(-max(0, d_ij - ρ_i) / σ_i)
    d_normalized = np.maximum(0, distances - rho[:, np.newaxis])
    strengths = np.exp(-d_normalized / sigma[:, np.newaxis])

    # Create sparse asymmetric membership matrix
    membership = coo_matrix(
        (strengths.ravel(), (rows, cols)), shape=(n, n)
    ).tocsr()
    return membership


def fuzzy_simplicial_set_union(membership):
    """
    Combine asymmetric local fuzzy sets into a symmetric global fuzzy set.

    Uses the probabilistic t-conorm (fuzzy union):
        μ(i,j) = μ_i(j) + μ_j(i) - μ_i(j) * μ_j(i)

    This can be rewritten as:
        μ(i,j) = 1 - (1 - μ_i(j)) * (1 - μ_j(i))

    Interpretation: probability that the edge exists if each
    direction contributes independently.
    """
    membership_T = membership.T  # reverse direction

    # Fuzzy union: A + B - A*B, element-wise on sparse matrices
    combined = membership + membership_T - membership.multiply(membership_T)
    return combined


def construct_fuzzy_simplicial_set(X, n_neighbors=15, local_connectivity=1.0):
    """
    Complete pipeline: from raw data to fuzzy simplicial set.

    This is the high-dimensional representation that UMAP will
    try to match in the low-dimensional embedding.
    """
    # Step 1: Compute local scaling parameters
    rho, sigma = compute_local_distances(X, n_neighbors, local_connectivity)

    # Step 2: Compute asymmetric membership strengths
    membership = compute_membership_strengths(X, rho, sigma, n_neighbors)

    # Step 3: Combine into a symmetric fuzzy simplicial set
    fuzzy_set = fuzzy_simplicial_set_union(membership)

    # Step 4: Prune numerically negligible edges that only add noise
    fuzzy_set = csr_matrix(fuzzy_set)
    fuzzy_set.data[fuzzy_set.data < 1e-4] = 0
    fuzzy_set.eliminate_zeros()
    return fuzzy_set
```

Why Fuzzy Union Works:
The fuzzy union operation has several desirable properties:
Symmetry: The result $\mu(x_i, x_j)$ equals $\mu(x_j, x_i)$, essential for undirected graph optimization.
Preservation of Strong Connections: If either $\mu_i(x_j)$ or $\mu_j(x_i)$ is close to 1, the combined strength is close to 1. Strong relationships are preserved.
Weak Connection Handling: If both strengths are small, the combined strength remains small (since $a + b - ab \leq a + b$), so noise edges do not accumulate into strong connections.
Fixed Extremes: Combining a membership with itself gives $\mu + \mu - \mu^2 = \mu(2 - \mu)$, which leaves the extreme values fixed (0 stays 0, 1 stays 1); for intermediate values the operation is not idempotent and nudges strengths toward 1.
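These behaviors are easy to verify numerically; a quick standalone check of the probabilistic t-conorm:

```python
# Probabilistic t-conorm: a + b - a*b
t_conorm = lambda a, b: a + b - a * b

print(t_conorm(0.8, 0.2), t_conorm(0.2, 0.8))  # 0.84, 0.84 -> symmetric
print(t_conorm(0.95, 0.1))                      # 0.955  -> strong edges survive
print(t_conorm(0.05, 0.05))                     # 0.0975 -> weak edges stay weak
print(t_conorm(0.0, 0.0), t_conorm(1.0, 1.0))   # 0.0, 1.0 -> extremes are fixed
print(t_conorm(0.5, 0.5))                       # 0.75 -> not idempotent in general
```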
The Result: A Weighted k-NN Graph
After fuzzy union, we have a symmetric weighted graph where edge weights represent "topological similarity" between points. This graph is the high-dimensional fuzzy simplicial set that UMAP will seek to reproduce in low-dimensional space.
For large datasets, exact nearest neighbor computation becomes expensive (brute force is $O(n^2)$ in the number of points). UMAP uses approximate nearest neighbor algorithms (such as NN-Descent or Annoy) that achieve empirically near-linear scaling. This is one of the key practical advantages over t-SNE, which traditionally relied on exact neighbor computations.
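As a sketch of the approximate k-NN step, assuming the pynndescent package (the NN-Descent implementation that umap-learn uses under the hood; API details are from its documentation and may vary across versions):

```python
import numpy as np

# Assumes `pip install pynndescent`; API per its documentation
from pynndescent import NNDescent

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))

# Build an approximate k-NN index in empirically near-linear time
index = NNDescent(X, n_neighbors=15, metric="euclidean")
knn_indices, knn_distances = index.neighbor_graph

print(knn_indices.shape, knn_distances.shape)  # (10000, 15) each
```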
With the high-dimensional fuzzy simplicial set constructed, we can now state UMAP's fundamental optimization principle:
UMAP's Core Objective:
Find a low-dimensional embedding such that the fuzzy simplicial set constructed from the embedding is as close as possible to the fuzzy simplicial set from the high-dimensional data.
Mathematically, if we denote:
- $G^h$: the fuzzy simplicial set (weighted graph) constructed from the high-dimensional data, and
- $G^l$: the fuzzy simplicial set constructed from the candidate low-dimensional embedding,

then UMAP minimizes a measure of divergence between $G^h$ and $G^l$: the fuzzy set cross-entropy, summed over edges.
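A minimal sketch of that cross-entropy loss, written densely for clarity (the actual implementation works on the sparse graph and optimizes stochastically with negative sampling):

```python
import numpy as np

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """
    Cross-entropy between high-D memberships mu (fixed) and
    low-D memberships nu (functions of the embedding):

        sum_ij [ mu log(mu/nu) + (1 - mu) log((1 - mu)/(1 - nu)) ]

    The first term pulls connected points together (attraction);
    the second pushes unconnected points apart (repulsion).
    """
    mu = np.clip(mu, eps, 1 - eps)
    nu = np.clip(nu, eps, 1 - eps)
    attraction = mu * np.log(mu / nu)
    repulsion = (1 - mu) * np.log((1 - mu) / (1 - nu))
    return float(np.sum(attraction + repulsion))
```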
Contrast with t-SNE:
Both t-SNE and UMAP can be viewed as optimizing for agreement between high-dimensional and low-dimensional representations. The key differences are:
| Aspect | t-SNE | UMAP |
|---|---|---|
| High-D representation | Gaussian affinities normalized per point | Fuzzy set memberships from local metrics |
| Low-D representation | Student-t affinities (global normalization) | Fuzzy set memberships (no normalization) |
| Symmetrization | Average of conditional probabilities | Fuzzy union (probabilistic t-conorm) |
| Theoretical basis | Information-theoretic (KL divergence) | Topological (fuzzy simplicial sets) |
While the practical outcomes often look similar, UMAP's theoretical foundations provide stronger guarantees and enable principled extensions (e.g., supervised UMAP, metric learning).
UMAP's approach exemplifies a powerful pattern in applied mathematics: by formulating the problem at the right level of abstraction (topology rather than geometry), we obtain solutions that are both more general and more effective. The topology captures what truly matters for understanding data structure while ignoring irrelevant metric details that vary with noise and sampling.
To complete the theoretical picture, we must define how UMAP constructs the low-dimensional fuzzy simplicial set $G^l$ from an embedding $Y = \{y_1, \ldots, y_n\} \subset \mathbb{R}^d$.
The Low-Dimensional Affinity Function:
In low-dimensional space, UMAP uses a parametric family of functions to define edge weights:
$$\nu(y_i, y_j) = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$
where $a$ and $b$ are parameters (typically $a \approx 1.93$, $b \approx 0.79$ for the default min_dist of 0.1).
When $a = b = 1$, this simplifies to:
$$\nu(y_i, y_j) = \frac{1}{1 + \|y_i - y_j\|^2}$$
which is exactly the Student-t distribution with 1 degree of freedom—the same heavy-tailed distribution used by t-SNE!
Heavy-tailed distributions in low-dimensional space are crucial for dimensionality reduction. In high dimensions, most randomly placed points are far apart. A Gaussian affinity would place vanishingly small probability on these distances. The heavy-tailed Student-t (or UMAP's variant) allows moderate-distance relationships in high-D to become larger distances in low-D without vanishing gradient—enabling clusters to separate cleanly.
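A small numeric comparison makes the tail behavior concrete:

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

gaussian = np.exp(-d**2)          # light tail: weights vanish quickly
student_t = 1.0 / (1.0 + d**2)    # heavy tail: weights decay polynomially

for dist, g, t in zip(d, gaussian, student_t):
    print(f"d={dist:4.1f}  gaussian={g:.2e}  student_t={t:.2e}")

# At d=4 the Gaussian weight is ~1e-7 (essentially no gradient left),
# while the Student-t weight is ~0.06 -- still pushing points apart.
```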
The min_dist Parameter:
UMAP's distinctive $a$ and $b$ parameters are derived from the user-specified min_dist hyperparameter, which controls the minimum distance at which points are allowed to separate. The relationship is found by fitting:
$$\psi(d) = \begin{cases} 1 & \text{if } d \leq \text{min\_dist} \\ e^{-(d - \text{min\_dist})} & \text{otherwise} \end{cases}$$
with the parametric family $\left(1 + a \cdot d^{2b}\right)^{-1}$.
Effect of min_dist:
| min_dist | Effect | Use Case |
|---|---|---|
| 0.0 | Maximum packing, points can overlap | Dense cluster analysis |
| 0.1 (default) | Balanced separation | General visualization |
| 0.5 | More spread within clusters | When individual points matter |
| 0.99 | Very loose packing | Smooth, continuous embeddings |
Lower min_dist values create tighter, more distinct clusters; higher values produce more uniform, spread-out embeddings.
```python
import numpy as np
from scipy.optimize import curve_fit


def fit_ab_parameters(min_dist=0.1, spread=1.0):
    """
    Fit the a and b parameters for the low-dimensional affinity function.

    UMAP defines a target function:
        psi(d) = 1                              if d <= min_dist
        psi(d) = exp(-(d - min_dist) / spread)  otherwise

    We fit (1 + a * d^(2b))^(-1) to match this target.

    Parameters
    ----------
    min_dist : float
        Minimum distance at which points can appear in the embedding
    spread : float
        Effective scale of the embedding (usually 1.0)

    Returns
    -------
    a, b : floats
        Parameters for the low-dimensional weight function
    """
    def target_curve(x, min_dist, spread):
        """Piecewise target function."""
        result = np.zeros_like(x)
        result[x <= min_dist] = 1.0
        result[x > min_dist] = np.exp(-(x[x > min_dist] - min_dist) / spread)
        return result

    def parametric_curve(x, a, b):
        """Parametric family to fit."""
        return 1.0 / (1.0 + a * np.power(x, 2 * b))

    # Generate target values
    x = np.linspace(0, 3 * spread, 300)
    y = target_curve(x, min_dist, spread)

    # Fit parameters (initial guess based on Student-t)
    (a, b), _ = curve_fit(parametric_curve, x, y, p0=[1.0, 1.0])
    return a, b


def low_dimensional_weight(d, a, b):
    """
    Compute the low-dimensional edge weight: the membership strength
    in the low-dimensional fuzzy simplicial set.

    Parameters
    ----------
    d : float or ndarray
        Euclidean distance(s) in embedding space
    a, b : floats
        Fitted curve parameters

    Returns
    -------
    weight : same shape as d
        Edge weight(s) in [0, 1]
    """
    return 1.0 / (1.0 + a * np.power(d, 2 * b))


# Example: compute weights for various distances
if __name__ == "__main__":
    a, b = fit_ab_parameters(min_dist=0.1)
    print(f"Fitted parameters: a={a:.4f}, b={b:.4f}")

    distances = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0])
    weights = low_dimensional_weight(distances, a, b)
    print("\nDistance -> Weight mapping:")
    for d, w in zip(distances, weights):
        print(f"  d={d:.1f} -> w={w:.4f}")
```

We have now established the complete theoretical foundation for UMAP. Let's consolidate the key concepts:
- The manifold hypothesis: high-dimensional data lies on or near a low-dimensional manifold, whose intrinsic dimensionality matches the generative degrees of freedom.
- The uniform distribution assumption: a locally adapted metric, captured by $\rho_i$ and $\sigma_i$, makes the data appear uniformly distributed on that manifold.
- Fuzzy simplicial sets: local neighborhoods become fuzzy edge memberships, and asymmetric local views are reconciled via the probabilistic t-conorm into a weighted k-NN graph.
- The optimization principle: find an embedding whose fuzzy simplicial set matches the high-dimensional one, using the $(1 + a d^{2b})^{-1}$ affinity family parameterized by min_dist.
What's Next:
With the theoretical foundations established, the next page explores the fuzzy topological representation in greater depth—examining how UMAP's category-theoretic foundations lead to specific algorithmic choices and how the fuzzy simplicial set captures manifold topology through weighted graphs.
You now understand UMAP's theoretical foundations: the Riemannian geometry assumptions, fuzzy simplicial set constructions, and the fundamental optimization principle. These concepts provide the mathematical rigor that distinguishes UMAP from heuristic dimensionality reduction methods and enables its remarkable effectiveness across diverse applications.