UMAP's power comes with responsibility: its hyperparameters profoundly influence the resulting embedding. While default parameters work well for many scenarios, understanding how to tune UMAP unlocks its full potential for your specific data and objectives.
This page transforms you from a UMAP user who accepts default outputs to a practitioner who crafts embeddings optimized for their exact needs. We'll explore each parameter's effect, provide systematic tuning strategies, and address the subtle interactions between parameters that even experienced users often overlook.
Effective hyperparameter tuning isn't just about "making the visualization look good"—it's about ensuring the embedding faithfully represents the aspects of your data that matter for your analysis.
By the end of this page, you will understand: (1) The effect and interaction of each major UMAP hyperparameter, (2) Systematic approaches for tuning parameters to your objectives, (3) How data characteristics should influence parameter choices, and (4) Diagnostic techniques for evaluating embedding quality.
UMAP exposes numerous parameters. We'll organize them by category, focusing on those with the most significant impact on results.
Graph Construction Parameters (affect the high-D fuzzy simplicial set):
| Parameter | Default | Controls | Typical Range |
|---|---|---|---|
| n_neighbors | 15 | Local neighborhood size | 5-100 |
| metric | euclidean | Distance function in high-D | euclidean, cosine, manhattan, etc. |
| local_connectivity | 1.0 | Minimum neighbors per point | 1-5 |
| set_op_mix_ratio | 1.0 | Fuzzy union vs intersection blend | 0.0-1.0 |
Embedding Parameters (affect the low-D representation):
| Parameter | Default | Controls | Typical Range |
|---|---|---|---|
| n_components | 2 | Embedding dimensionality | 2, 3, or higher for ML tasks |
| min_dist | 0.1 | Minimum distance between points | 0.0-0.99 |
| spread | 1.0 | Embedding scale | 0.5-3.0 |
| negative_sample_rate | 5 | Repulsive samples per edge | 1-20 |
Optimization Parameters (affect convergence):
| Parameter | Default | Controls | Typical Range |
|---|---|---|---|
| n_epochs | auto | Optimization iterations | 200-1000 |
| learning_rate | 1.0 | SGD step size | 0.1-2.0 |
| init | spectral | Initialization strategy | spectral, random, or custom |
| random_state | None | RNG seed for reproducibility | Any integer |
For most tuning tasks, focus on three parameters: n_neighbors, min_dist, and metric. These have the largest impact on embedding quality and interpretability. The other parameters rarely need adjustment from defaults.
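To make that concrete, here is a minimal sketch of a UMAP call that sets only those three parameters. The data array `X` and the specific values are placeholders for illustration, not recommendations for any particular dataset:

```python
import umap

# Minimal sketch: only the three high-impact parameters are set explicitly.
# X is assumed to be an (n_samples, n_features) NumPy array of your data.
reducer = umap.UMAP(
    n_neighbors=15,    # local vs. global balance
    min_dist=0.1,      # how tightly points pack in the embedding
    metric="cosine",   # distance used in the original high-D space
    random_state=42,   # fixes the layout for reproducibility
)
embedding = reducer.fit_transform(X)   # shape: (n_samples, 2)
```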
n_neighbors is UMAP's most important parameter. It controls the balance between local and global structure preservation.
What n_neighbors Controls:
This parameter determines how many neighbors each point considers when constructing the fuzzy simplicial set. Higher values mean each point "sees" more of the dataset, creating a more connected graph that emphasizes global structure. Lower values isolate local neighborhoods, emphasizing fine-grained local patterns.
Mathematical Effect:
n_neighbors directly affects the σ (sigma) normalization parameter at each point. With $\rho_i$ the distance from $x_i$ to its nearest neighbor, $\sigma_i$ is chosen so that:
$$\sum_{j=1}^{k} \exp\left(-\frac{\max\bigl(0,\; d(x_i, x_{i_j}) - \rho_i\bigr)}{\sigma_i}\right) = \log_2(k)$$
where $k$ is n_neighbors and $x_{i_1}, \dots, x_{i_k}$ are the $k$ nearest neighbors of $x_i$. More neighbors → larger $\sigma_i$ → smoother, more connected graph.
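To see what this normalization does in practice, here is a small, self-contained sketch of the binary search used conceptually to find σᵢ for a single point, given its sorted distances to its k nearest neighbors. This is a simplification of the idea, not umap-learn's exact `smooth_knn_dist` routine:

```python
import numpy as np

def find_sigma(knn_dists, n_iter=64, tol=1e-5):
    """Binary-search sigma so the smoothed neighbor weights sum to log2(k).

    knn_dists: sorted distances from one point to its k nearest neighbors.
    A simplified sketch of the idea, not the library's exact implementation.
    """
    k = len(knn_dists)
    target = np.log2(k)
    rho = knn_dists[0]                # distance to the nearest neighbor
    lo, hi = 0.0, np.inf
    sigma = 1.0
    for _ in range(n_iter):
        weights = np.exp(-np.maximum(knn_dists - rho, 0.0) / sigma)
        total = weights.sum()
        if abs(total - target) < tol:
            break
        if total > target:            # graph too connected -> shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                         # graph too disconnected -> grow sigma
            lo = sigma
            sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return rho, sigma

# Example with synthetic neighbor distances: larger k drives sigma up.
dists = np.sort(np.random.default_rng(0).uniform(0.5, 2.0, size=15))
print(find_sigma(dists))
```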
| Value | Local Structure | Global Structure | Cluster Appearance | Best For |
|---|---|---|---|---|
| 2-5 | Excellent | Poor | Fragmented, disconnected | Very fine local detail |
| 10-15 | Very good | Moderate | Distinct, separated | General-purpose visualization |
| 30-50 | Good | Good | Connected, flowing | Trajectory/continuum data |
| 100+ | Moderate | Excellent | Merged, continuous | Emphasizing global topology |
```python
import numpy as np
import matplotlib.pyplot as plt
import umap


def explore_n_neighbors(X, y=None, values=[5, 15, 50, 100]):
    """
    Visualize the effect of n_neighbors on embedding.

    Creates a grid of embeddings at different n_neighbors values
    to help identify the appropriate setting for your data.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        High-dimensional data
    y : ndarray of shape (n_samples,), optional
        Labels for coloring (ground truth or cluster assignments)
    values : list of int
        n_neighbors values to explore
    """
    n_values = len(values)
    fig, axes = plt.subplots(2, (n_values + 1) // 2,
                             figsize=(5 * (n_values + 1) // 2, 10))
    axes = axes.flatten()

    for i, n_neighbors in enumerate(values):
        print(f"Computing UMAP with n_neighbors={n_neighbors}...")
        reducer = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=0.1,  # Keep fixed to isolate the n_neighbors effect
            random_state=42
        )
        embedding = reducer.fit_transform(X)

        ax = axes[i]
        ax.scatter(
            embedding[:, 0], embedding[:, 1],
            c=y if y is not None else 'steelblue',
            cmap='Spectral' if y is not None else None,
            s=5, alpha=0.6
        )
        ax.set_title(f'n_neighbors = {n_neighbors}')
        ax.set_xticks([])
        ax.set_yticks([])

    # Hide unused axes
    for j in range(len(values), len(axes)):
        axes[j].axis('off')

    plt.tight_layout()
    plt.suptitle('Effect of n_neighbors on UMAP Embedding', y=1.02, fontsize=14)
    return fig


def estimate_optimal_n_neighbors(X, k_range=range(5, 101, 5)):
    """
    Heuristic for estimating optimal n_neighbors.

    Uses the elbow method on the graph connectivity:
    - Too small k: disconnected graph
    - Too large k: over-smoothed structure
    - Optimal k: connected graph with distinct structure

    Metric: number of connected components + edge weight entropy
    """
    from scipy.sparse.csgraph import connected_components

    results = []
    for k in k_range:
        reducer = umap.UMAP(n_neighbors=k, random_state=42)
        reducer.fit(X)

        # Get the fuzzy graph
        graph = reducer.graph_

        # Metric 1: Connected components (want 1)
        n_components, _ = connected_components(graph > 0.01)

        # Metric 2: Edge weight entropy (measure of structure)
        weights = graph.data
        weights_norm = weights / weights.sum()
        entropy = -np.sum(weights_norm * np.log(weights_norm + 1e-10))

        results.append({
            'n_neighbors': k,
            'n_components': n_components,
            'entropy': entropy
        })
    return results


def plot_n_neighbors_diagnostics(results):
    """Plot diagnostic metrics for n_neighbors selection."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ks = [r['n_neighbors'] for r in results]
    n_comps = [r['n_components'] for r in results]
    entropies = [r['entropy'] for r in results]

    ax1.plot(ks, n_comps, 'b-o')
    ax1.axhline(y=1, color='r', linestyle='--', label='Target: 1 component')
    ax1.set_xlabel('n_neighbors')
    ax1.set_ylabel('Number of Connected Components')
    ax1.set_title('Graph Connectivity')
    ax1.legend()

    ax2.plot(ks, entropies, 'g-o')
    ax2.set_xlabel('n_neighbors')
    ax2.set_ylabel('Edge Weight Entropy')
    ax2.set_title('Structure Complexity')

    plt.tight_layout()
    return fig
```

Optimal n_neighbors often scales with dataset size. For 1,000 points, n_neighbors=15 might be ideal. For 100,000 points, n_neighbors=30-50 may be better. A rule of thumb: try sqrt(n) / 2 as a starting point for large datasets, but always validate visually.
min_dist controls how tightly points pack together in the low-dimensional embedding. It determines the effective "radius" of each point.
What min_dist Controls:
This parameter sets the minimum distance at which points can appear in the embedding. Lower values produce tighter, more distinct clusters; higher values create more uniform, spread-out embeddings.
Technically, min_dist affects the shape of the curve mapping high-D distances to low-D edge weights:
$$\nu(d) \approx \begin{cases} 1 & \text{if } d \leq \text{min\_dist} \\ \exp\left(-(d - \text{min\_dist})\right) & \text{otherwise} \end{cases}$$
UMAP does not use this piecewise curve directly; it fits the smooth family $\frac{1}{1 + a d^{2b}}$ to it, with parameters $a$ and $b$ computed from min_dist (and spread).
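The sketch below mirrors the curve-fitting idea behind umap-learn's `find_ab_params`: build the piecewise target curve from min_dist and spread, then fit the smooth family to it. The exact constants may differ slightly from the library's:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(spread=1.0, min_dist=0.1):
    """Fit a, b so that 1/(1 + a*d^(2b)) approximates the min_dist/spread target curve."""
    d = np.linspace(0.0, spread * 3.0, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))

    def smooth(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))

    (a, b), _ = curve_fit(smooth, d, target)
    return a, b

# Smaller min_dist pushes the curve toward a sharper drop-off.
for md in [0.0, 0.1, 0.5]:
    a, b = fit_ab(min_dist=md)
    print(f"min_dist={md}: a={a:.3f}, b={b:.3f}")
```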
| Value | Cluster Appearance | Within-Cluster Structure | Use Case |
|---|---|---|---|
| 0.0 | Extremely tight, points can overlap | Minimal visible structure | Dense cluster analysis, data density |
| 0.1 | Tight, well-separated clusters | Some internal structure visible | General visualization (default) |
| 0.25 | Moderate density clusters | Clear internal structure | When internal gradients matter |
| 0.5 | Loose, spread-out clusters | Excellent internal visibility | Continuous data, trajectories |
| 0.8-1.0 | Very spread out, almost uniform | Maximum internal detail | When every point matters visually |
A common misinterpretation: dense regions in UMAP visualizations don't necessarily correspond to dense regions in the original data. min_dist affects visual density directly but doesn't change the underlying relationship structure. Always verify density interpretations against the original high-dimensional data.
The min_dist and n_neighbors Interaction:
These parameters interact significantly:
| Combination | Typical Result |
|---|---|
| Low n_neighbors + Low min_dist | Small, tight, scattered clusters (possibly fragmented) |
| Low n_neighbors + High min_dist | Spread-out local patches |
| High n_neighbors + Low min_dist | Large, tight, connected regions |
| High n_neighbors + High min_dist | Smooth, continuous, spread-out embedding |
Finding the right combination requires experimentation. Start with defaults (n_neighbors=15, min_dist=0.1), then adjust one parameter at a time.
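A small grid sweep makes this interaction visible. The sketch below (assuming `X` and optional labels `y` are already defined) lays out one embedding per combination:

```python
import itertools
import matplotlib.pyplot as plt
import umap

def grid_sweep(X, y=None, nn_values=(5, 15, 50), md_values=(0.0, 0.1, 0.5)):
    """Plot a grid of embeddings over n_neighbors (rows) x min_dist (columns)."""
    fig, axes = plt.subplots(len(nn_values), len(md_values),
                             figsize=(4 * len(md_values), 4 * len(nn_values)))
    for (i, nn), (j, md) in itertools.product(enumerate(nn_values), enumerate(md_values)):
        emb = umap.UMAP(n_neighbors=nn, min_dist=md, random_state=42).fit_transform(X)
        ax = axes[i, j]
        ax.scatter(emb[:, 0], emb[:, 1],
                   c=y if y is not None else 'steelblue',
                   cmap='Spectral' if y is not None else None, s=4, alpha=0.6)
        ax.set_title(f'n_neighbors={nn}, min_dist={md}', fontsize=9)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.tight_layout()
    return fig
```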
The metric parameter determines how distances are computed in the high-dimensional space. Choosing the right metric can dramatically improve embedding quality for specific data types.
Available Metrics in UMAP:
| Metric | Formula | Best For | Considerations |
|---|---|---|---|
| euclidean | ‖x − y‖₂ | Dense, continuous features | Sensitive to scale; normalize first |
| cosine | 1 − (x·y)/(‖x‖ ‖y‖) | Text, sparse vectors, embeddings | Ignores magnitude |
| manhattan | Σᵢ \|xᵢ − yᵢ\| | High dimensions, sparse data | More robust to outliers |
| correlation | 1 − correlation(x, y) | Gene expression, time series | Centers data implicitly |
| jaccard | 1 − \|A∩B\|/\|A∪B\| | Binary/set data | For binary features only |
| hamming | Fraction of differing bits | Binary vectors, categorical | For binary features only |
Data Type → Metric Selection:
| Data Type | Recommended Metric | Reasoning |
|---|---|---|
| Images (pixels) | euclidean or cosine | Euclidean after normalization works well |
| Text (TF-IDF) | cosine | Angle matters more than magnitude |
| Text (embeddings) | cosine or euclidean | Both work; cosine is more common |
| Single-cell RNA-seq | correlation | Centers per-cell variation |
| GPS/spatial coordinates | haversine | Proper great-circle distances |
| Binary features | jaccard or hamming | Appropriate for binary data |
| Mixed types | gower (custom) | Handles mixed numeric/categorical |
```python
import numpy as np
import umap
import matplotlib.pyplot as plt


def compare_metrics(X, y=None,
                    metrics=['euclidean', 'cosine', 'manhattan', 'correlation']):
    """
    Compare UMAP embeddings using different distance metrics.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        High-dimensional data
    y : ndarray of shape (n_samples,), optional
        Labels for coloring
    metrics : list of str
        Metrics to compare
    """
    n_metrics = len(metrics)
    fig, axes = plt.subplots(1, n_metrics, figsize=(5 * n_metrics, 5))

    for i, metric in enumerate(metrics):
        print(f"Computing UMAP with metric={metric}...")
        ax = axes[i] if n_metrics > 1 else axes
        try:
            reducer = umap.UMAP(
                n_neighbors=15,
                min_dist=0.1,
                metric=metric,
                random_state=42
            )
            embedding = reducer.fit_transform(X)

            ax.scatter(
                embedding[:, 0], embedding[:, 1],
                c=y if y is not None else 'steelblue',
                cmap='Spectral' if y is not None else None,
                s=5, alpha=0.6
            )
            ax.set_title(f'metric = "{metric}"')
            ax.set_xticks([])
            ax.set_yticks([])
        except Exception as e:
            ax.text(0.5, 0.5, f"Error: {e}", ha='center', va='center')
            ax.set_title(f'metric = "{metric}" (failed)')

    plt.tight_layout()
    return fig


def custom_metric_example():
    """
    Example: Using a custom distance metric.

    UMAP supports any numba-compilable function with signature
    f(x, y) -> float, where x and y are 1D arrays.
    """
    from numba import njit

    @njit
    def weighted_euclidean(x, y, weights):
        """
        Weighted Euclidean distance.

        Allows different features to contribute differently
        to the distance calculation.
        """
        d = 0.0
        for i in range(len(x)):
            d += weights[i] * (x[i] - y[i]) ** 2
        return np.sqrt(d)

    # Note: for custom metrics, pass any additional parameters
    # through metric_kwds.
    # Example usage (pseudo-code):
    # reducer = umap.UMAP(metric=weighted_euclidean,
    #                     metric_kwds={'weights': my_weights})

    return weighted_euclidean
```

If you've already computed a distance matrix, use metric='precomputed' and pass the distance matrix directly. This is useful when using domain-specific distance functions too complex for UMAP's metric interface, or when reusing distances across multiple UMAP runs.
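A minimal sketch of the precomputed route follows. The `cityblock` metric and variable names are placeholders; any domain-specific distance matrix could be substituted. Note that `transform()` on genuinely new points is not available in this mode, because UMAP never sees the raw feature vectors:

```python
import umap
from scipy.spatial.distance import pdist, squareform

# X is assumed to be an (n_samples, n_features) array.
# pdist here is just an example; any square distance matrix works.
D = squareform(pdist(X, metric='cityblock'))   # (n_samples, n_samples) distances

reducer = umap.UMAP(metric='precomputed', n_neighbors=15, min_dist=0.1,
                    random_state=42)
embedding = reducer.fit_transform(D)           # pass distances, not features
```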
While the main parameters (n_neighbors, min_dist, metric) control what UMAP learns, optimization parameters control how it learns. These rarely need adjustment but can help in specific situations.
n_epochs (Number of Training Iterations):
| Dataset Size | Default n_epochs | When to Increase | When to Decrease |
|---|---|---|---|
| < 10,000 | 500 | Complex structure not converging | Quick exploratory runs |
| 10,000 - 100,000 | 200 | Large n_neighbors, complex data | Good enough visual quality |
| > 100,000 | 200 | Need perfect convergence | Time constraints |
init (Initialization Strategy):
| Strategy | Description | When to Use |
|---|---|---|
| spectral | Graph Laplacian eigenvectors | Default; best for global structure |
| random | Uniform random | When spectral fails on disconnected graphs |
| pca | PCA projection | Fast, deterministic starting point |
| custom | User-provided coordinates | Warm-starting, animation continuity |
Spectral initialization is crucial for global structure preservation. If UMAP produces fragmented or inconsistent results, initialization problems may be the cause.
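For example, passing explicit starting coordinates is one way to warm-start the layout or keep successive embeddings visually consistent; `init` also accepts an (n_samples, n_components) array. A sketch using a PCA projection as the starting positions (names are illustrative):

```python
import umap
from sklearn.decomposition import PCA

# X is assumed to be the high-dimensional data.
pca_coords = PCA(n_components=2).fit_transform(X)

# Use the PCA layout as the starting positions for UMAP's optimization.
reducer = umap.UMAP(init=pca_coords, n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)
```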
learning_rate:
The learning rate controls SGD step size. The default (1.0) works well almost always.
negative_sample_rate:
Controls how many negative (repulsive) samples are drawn per positive (attractive) edge.
| Value | Effect | Use Case |
|---|---|---|
| 1-3 | Weak repulsion, possible crowding | Rarely useful |
| 5 (default) | Balanced forces | General use |
| 10-20 | Strong repulsion, wide separation | Very dense data |
The spread parameter (default 1.0) works with min_dist to determine the a and b curve parameters. In practice, you rarely need to change spread independently—min_dist alone is usually sufficient. If you need very large or compact embeddings, adjust spread proportionally with min_dist.
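If you do scale them together, keep min_dist at or below spread. A quick sketch comparing the default pairing with a proportionally stretched one (`X` assumed defined):

```python
import umap

# min_dist must stay <= spread; scaling both stretches the whole layout.
emb_default = umap.UMAP(min_dist=0.1, spread=1.0, random_state=42).fit_transform(X)
emb_stretched = umap.UMAP(min_dist=0.2, spread=2.0, random_state=42).fit_transform(X)
```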
Rather than randomly trying parameters, follow a systematic approach:
Step 1: Start with Defaults and Understand Your Data
Step 2: Adjust n_neighbors First
This has the largest impact. Create a sweep:
```python
import numpy as np
import umap
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt


def systematic_umap_tuning(X, y=None):
    """
    Systematic UMAP hyperparameter tuning.

    Follows a structured approach:
    1. Sweep n_neighbors
    2. For the best n_neighbors, sweep min_dist
    3. Evaluate with quantitative metrics and visual inspection

    Parameters:
    -----------
    X : ndarray
        High-dimensional data
    y : ndarray, optional
        Labels for silhouette score calculation

    Returns:
    --------
    best_params : dict
        Recommended parameters
    results : list
        All evaluation results
    """
    results = []

    # ============================================
    # STEP 1: n_neighbors sweep
    # ============================================
    print("Step 1: Sweeping n_neighbors...")
    n_neighbors_values = [5, 10, 15, 20, 30, 50, 75, 100]

    for n_neighbors in n_neighbors_values:
        reducer = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=0.1,  # Fixed for this sweep
            random_state=42
        )
        embedding = reducer.fit_transform(X)

        # Compute evaluation metrics
        metrics = evaluate_embedding(X, embedding, y)
        metrics['n_neighbors'] = n_neighbors
        metrics['min_dist'] = 0.1
        metrics['sweep'] = 'n_neighbors'   # tag which sweep produced this result
        results.append(metrics)

        print(f"  n_neighbors={n_neighbors}: "
              f"trustworthiness={metrics['trustworthiness']:.3f}, "
              f"continuity={metrics['continuity']:.3f}")

    # Find best n_neighbors based on combined metric
    nn_results = [r for r in results if r['sweep'] == 'n_neighbors']
    best_nn_result = max(nn_results,
                         key=lambda r: r['trustworthiness'] + r['continuity'])
    best_n_neighbors = best_nn_result['n_neighbors']
    print(f"\nBest n_neighbors: {best_n_neighbors}")

    # ============================================
    # STEP 2: min_dist sweep with best n_neighbors
    # ============================================
    print("\nStep 2: Sweeping min_dist...")
    min_dist_values = [0.0, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8]

    for min_dist in min_dist_values:
        reducer = umap.UMAP(
            n_neighbors=best_n_neighbors,
            min_dist=min_dist,
            random_state=42
        )
        embedding = reducer.fit_transform(X)

        metrics = evaluate_embedding(X, embedding, y)
        metrics['n_neighbors'] = best_n_neighbors
        metrics['min_dist'] = min_dist
        metrics['sweep'] = 'min_dist'
        results.append(metrics)

        print(f"  min_dist={min_dist}: "
              f"trustworthiness={metrics['trustworthiness']:.3f}")

    # Find best min_dist
    md_results = [r for r in results if r['sweep'] == 'min_dist']
    best_md_result = max(md_results, key=lambda r: r['trustworthiness'])
    best_min_dist = best_md_result['min_dist']
    print(f"\nBest min_dist: {best_min_dist}")

    best_params = {
        'n_neighbors': best_n_neighbors,
        'min_dist': best_min_dist
    }
    return best_params, results


def evaluate_embedding(X, embedding, y=None, k=10):
    """
    Compute quantitative metrics for embedding quality.

    Metrics:
    - Trustworthiness: Are embedding neighbors true neighbors?
    - Continuity: Are high-D neighbors still neighbors in the embedding?
    - Silhouette (if labels): How well-separated are clusters?
    """
    from sklearn.manifold import trustworthiness as tw

    metrics = {}

    # Trustworthiness: penalizes false neighbors in the embedding
    metrics['trustworthiness'] = tw(X, embedding, n_neighbors=k)

    # Continuity: trustworthiness with the roles of the two spaces swapped
    metrics['continuity'] = tw(embedding, X, n_neighbors=k)

    # Silhouette score if labels are provided
    if y is not None and len(np.unique(y)) > 1:
        metrics['silhouette'] = silhouette_score(embedding, y)
    else:
        metrics['silhouette'] = np.nan

    return metrics


def visualize_tuning_results(results):
    """Visualize the tuning sweep results."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # n_neighbors sweep (min_dist fixed at 0.1)
    nn_results = [r for r in results if r['sweep'] == 'n_neighbors']
    nn_values = [r['n_neighbors'] for r in nn_results]
    trust = [r['trustworthiness'] for r in nn_results]
    cont = [r['continuity'] for r in nn_results]

    axes[0].plot(nn_values, trust, 'b-o', label='Trustworthiness')
    axes[0].plot(nn_values, cont, 'g-o', label='Continuity')
    axes[0].set_xlabel('n_neighbors')
    axes[0].set_ylabel('Score')
    axes[0].set_title('n_neighbors Sweep')
    axes[0].legend()

    # min_dist sweep (best n_neighbors)
    md_results = [r for r in results if r['sweep'] == 'min_dist']
    md_values = [r['min_dist'] for r in md_results]
    trust = [r['trustworthiness'] for r in md_results]

    axes[1].plot(md_values, trust, 'b-o')
    axes[1].set_xlabel('min_dist')
    axes[1].set_ylabel('Trustworthiness')
    axes[1].set_title('min_dist Sweep (best n_neighbors)')

    plt.tight_layout()
    return fig
```

Step 3: Validate Visually
Quantitative metrics are helpful but not sufficient. Always visually inspect:
- Whether known groups (labels, batches, conditions) form coherent regions
- Whether clusters look fragmented or artificially split
- Whether the layout is stable across a few random seeds
- Whether outliers and small groups remain visible rather than being absorbed
Step 4: Consider Your Use Case
Optimal parameters depend on your goal:
| Goal | Parameter Preference |
|---|---|
| Exploratory visualization | Higher n_neighbors, lower min_dist |
| Cluster identification | Lower n_neighbors, very low min_dist |
| Trajectory analysis | Higher n_neighbors, higher min_dist |
| Pre-processing for ML | Optimize for downstream task metric |
How do you know if your embedding is "good"? Several quantitative metrics help assess embedding quality.
Core Metrics:
| Metric | What It Measures | Range | Good Value |
|---|---|---|---|
| Trustworthiness | False neighbors in embedding | 0‒1 | > 0.95 |
| Continuity | Missing neighbors in embedding | 0‒1 | > 0.90 |
| Silhouette (with labels) | Cluster separation quality | -1‒1 | > 0.5 |
| k-NN accuracy (with labels) | Neighborhood label preservation | 0‒1 | Close to the high-D baseline |
| Stress (MDS-style) | Distance distortion | 0‒∞ | Lower is better |
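The trustworthiness and continuity calls already appear in the tuning code above. For the label-based k-NN accuracy in this table, a small sketch (assuming labels `y` are available) could look like this:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_label_accuracy(X_highd, embedding, y, k=10, cv=5):
    """Compare k-NN classification accuracy in the original space vs. the embedding.

    If the embedding preserves neighborhood label structure, the two scores
    should be close; a large drop signals that informative neighborhoods were lost.
    """
    clf = KNeighborsClassifier(n_neighbors=k)
    acc_high = cross_val_score(clf, X_highd, y, cv=cv).mean()
    acc_low = cross_val_score(clf, embedding, y, cv=cv).mean()
    return acc_high, acc_low

# Example usage (X, embedding, y assumed defined):
# acc_high, acc_low = knn_label_accuracy(X, embedding, y)
# print(f"high-D: {acc_high:.3f}  embedding: {acc_low:.3f}")
```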
Interpreting Trustworthiness and Continuity:
Trustworthiness drops when points that are close in the embedding were not close in the original space (false neighbors); continuity drops when original neighbors end up far apart in the embedding (missing neighbors). High trustworthiness doesn't guarantee a useful embedding: you could have perfect local structure but terrible global structure. Always combine quantitative metrics with visual inspection and domain knowledge. If the embedding doesn't make sense for your application, the metrics don't matter.
Diagnostic Visualizations:
Beyond quantitative metrics, these visualizations help diagnose issues:
Residual distance plots: Scatter high-D distances vs. low-D distances. Ideally shows monotonic relationship.
k-NN overlap histograms: Distribution of how many k nearest neighbors are preserved.
Multi-parameter comparison grids: Side-by-side embeddings at different parameters.
Label coloring: If you have labels, color by them to check cluster preservation.
Feature hover: Interactive plots showing original features on hover.
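As a concrete example of the first diagnostic above, the sketch below draws a Shepard-style plot of pairwise distances before and after embedding. Pair subsampling and the variable names (`X`, `embedding`) are assumptions for illustration; for very large datasets, subsample rows of `X` before calling `pdist`:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

def shepard_plot(X, embedding, max_pairs=20000, seed=0):
    """Scatter original-space pairwise distances against embedding distances."""
    d_high = pdist(X)          # note: O(n^2) pairs; subsample rows of X if n is large
    d_low = pdist(embedding)
    rng = np.random.default_rng(seed)
    if len(d_high) > max_pairs:                      # subsample pairs for plotting speed
        idx = rng.choice(len(d_high), size=max_pairs, replace=False)
        d_high, d_low = d_high[idx], d_low[idx]
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(d_high, d_low, s=2, alpha=0.2)
    ax.set_xlabel('High-dimensional distance')
    ax.set_ylabel('Embedding distance')
    ax.set_title('Shepard diagram: a monotonic trend means distances are roughly preserved')
    return fig
```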
Some datasets and use cases require special consideration.
Very Large Datasets (> 1 million points):
- Tune parameters on a random subsample (e.g. 50,000-100,000 points), then fit the full dataset with the chosen settings
- Use a moderately larger n_neighbors (30-50) and fewer n_epochs (200 or less) to keep runtime manageable
- Set random_state only if you need exact reproducibility; fixing the seed forces slower, deterministic optimization
Very High Dimensions (> 10,000 features):
| Strategy | When To Use |
|---|---|
| PCA pre-reduction | Standard approach; reduce to 50-200 dims first |
| Feature selection | If only some features are relevant |
| Random projection | Fast dimensionality reduction when PCA is too slow |
| Use cosine metric | Often works better for very sparse high-D data |
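A common pattern for very wide data is the first row of this table: reduce to roughly 50 dimensions with PCA, then run UMAP on the result. A minimal sketch (`X` assumed to be the wide feature matrix):

```python
import umap
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),                       # put features on a comparable scale first
    PCA(n_components=50, random_state=42),  # strip noise dimensions, speed up neighbor search
    umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42),
)
embedding = pipeline.fit_transform(X)
```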
Supervised/Semi-Supervised UMAP:
UMAP can incorporate label information:
```python
# Supervised UMAP - uses labels to guide the embedding
reducer = umap.UMAP(target_metric='categorical')
embedding = reducer.fit_transform(X, y=labels)

# Semi-supervised - pass labels where available, with -1 marking unlabeled points
```
This produces embeddings that better separate known classes while still respecting unlabeled point relationships.
Unlike t-SNE, UMAP can embed new points without refitting: reducer.transform(X_new). This is invaluable for production systems, streaming data, and consistent embeddings across train/test splits. The transform uses the learned fuzzy graph structure to place new points consistently with the existing embedding.
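A short sketch of that workflow, fitting on training data and projecting held-out points into the same space (the train/test split and variable names are illustrative):

```python
import umap
from sklearn.model_selection import train_test_split

# X assumed to be the full (n_samples, n_features) dataset.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
train_embedding = reducer.fit_transform(X_train)   # learn the map from training data only
test_embedding = reducer.transform(X_test)         # place new points in the same space
```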
You now have comprehensive knowledge for tuning UMAP to achieve optimal embeddings. Let's consolidate the key insights:
| Scenario | n_neighbors | min_dist | Other |
|---|---|---|---|
| General visualization | 15 | 0.1 | Defaults work well |
| Large dataset | 30-50 | 0.1 | Reduce n_epochs |
| Find tight clusters | 10-15 | 0.0-0.05 | Use spectral init |
| Continuous trajectories | 30-50 | 0.3-0.5 | Consider higher n_epochs |
| Text data | 15-30 | 0.1 | Use cosine metric |
| Pre-processing for ML | Tune for task | 0.0 | Evaluate downstream |
Module Complete:
This concludes Module 6 on UMAP. You now possess a comprehensive understanding of UMAP—from its theoretical foundations in Riemannian geometry and algebraic topology, through its fuzzy topological representation and cross-entropy optimization, to practical comparison with t-SNE and hyperparameter tuning strategies.
Armed with this knowledge, you can confidently apply UMAP to your high-dimensional data, anticipate its behavior, interpret its outputs correctly, and tune it for optimal results in your specific domain.
Congratulations on completing Module 6: UMAP. You now have world-class knowledge of one of the most important dimensionality reduction algorithms in modern machine learning—its theory, implementation, comparison to alternatives, and practical tuning. Apply this knowledge to explore the hidden structure in your high-dimensional data.