UMAP's power comes with responsibility: its hyperparameters profoundly influence the resulting embedding. While default parameters work well for many scenarios, understanding how to tune UMAP unlocks its full potential for your specific data and objectives.
This page transforms you from a UMAP user who accepts default outputs to a practitioner who crafts embeddings optimized for their exact needs. We'll explore each parameter's effect, provide systematic tuning strategies, and address the subtle interactions between parameters that even experienced users often overlook.
Effective hyperparameter tuning isn't just about "making the visualization look good"—it's about ensuring the embedding faithfully represents the aspects of your data that matter for your analysis.
By the end of this page, you will understand: (1) The effect and interaction of each major UMAP hyperparameter, (2) Systematic approaches for tuning parameters to your objectives, (3) How data characteristics should influence parameter choices, and (4) Diagnostic techniques for evaluating embedding quality.
UMAP exposes numerous parameters. We'll organize them by category, focusing on those with the most significant impact on results.
Graph Construction Parameters (affect the high-D fuzzy simplicial set):
| Parameter | Default | Controls | Typical Range |
|---|---|---|---|
| n_neighbors | 15 | Local neighborhood size | 5-100 |
| metric | euclidean | Distance function in high-D | euclidean, cosine, manhattan, etc. |
| local_connectivity | 1.0 | Minimum neighbors per point | 1-5 |
| set_op_mix_ratio | 1.0 | Fuzzy union vs intersection blend | 0.0-1.0 |
Embedding Parameters (affect the low-D representation):
| Parameter | Default | Controls | Typical Range |
|---|---|---|---|
| n_components | 2 | Embedding dimensionality | 2, 3, or higher for ML tasks |
| min_dist | 0.1 | Minimum distance between points | 0.0-0.99 |
| spread | 1.0 | Embedding scale | 0.5-3.0 |
| negative_sample_rate | 5 | Repulsive samples per edge | 1-20 |
Optimization Parameters (affect convergence):
| Parameter | Default | Controls | Typical Range |
|---|---|---|---|
| n_epochs | auto | Optimization iterations | 200-1000 |
| learning_rate | 1.0 | SGD step size | 0.1-2.0 |
| init | spectral | Initialization strategy | spectral, random, or custom |
| random_state | None | RNG seed for reproducibility | Any integer |
For most tuning tasks, focus on three parameters: n_neighbors, min_dist, and metric. These have the largest impact on embedding quality and interpretability. The other parameters rarely need adjustment from defaults.
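To make that concrete, here is a minimal sketch of a UMAP call that sets only those three parameters. The data array `X` and the specific values are placeholders for illustration, not recommendations for any particular dataset:

```python
import umap

# Minimal sketch: only the three high-impact parameters are set explicitly.
# X is assumed to be an (n_samples, n_features) NumPy array of your data.
reducer = umap.UMAP(
    n_neighbors=15,    # local vs. global balance
    min_dist=0.1,      # how tightly points pack in the embedding
    metric="cosine",   # distance used in the original high-D space
    random_state=42,   # fixes the layout for reproducibility
)
embedding = reducer.fit_transform(X)   # shape: (n_samples, 2)
```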
n_neighbors is UMAP's most important parameter. It controls the balance between local and global structure preservation.
What n_neighbors Controls:
This parameter determines how many neighbors each point considers when constructing the fuzzy simplicial set. Higher values mean each point "sees" more of the dataset, creating a more connected graph that emphasizes global structure. Lower values isolate local neighborhoods, emphasizing fine-grained local patterns.
Mathematical Effect:
n_neighbors directly affects the σ (sigma) normalization parameter at each point. With $\rho_i$ the distance from $x_i$ to its nearest neighbor, $\sigma_i$ is chosen so that:
$$\sum_{j=1}^{k} \exp\left(-\frac{\max\bigl(0,\; d(x_i, x_{i_j}) - \rho_i\bigr)}{\sigma_i}\right) = \log_2(k)$$
where $k$ is n_neighbors and $x_{i_1}, \dots, x_{i_k}$ are the $k$ nearest neighbors of $x_i$. More neighbors → larger $\sigma_i$ → smoother, more connected graph.
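To see what this normalization does in practice, here is a small, self-contained sketch of the binary search used conceptually to find σᵢ for a single point, given its sorted distances to its k nearest neighbors. This is a simplification of the idea, not umap-learn's exact `smooth_knn_dist` routine:

```python
import numpy as np

def find_sigma(knn_dists, n_iter=64, tol=1e-5):
    """Binary-search sigma so the smoothed neighbor weights sum to log2(k).

    knn_dists: sorted distances from one point to its k nearest neighbors.
    A simplified sketch of the idea, not the library's exact implementation.
    """
    k = len(knn_dists)
    target = np.log2(k)
    rho = knn_dists[0]                # distance to the nearest neighbor
    lo, hi = 0.0, np.inf
    sigma = 1.0
    for _ in range(n_iter):
        weights = np.exp(-np.maximum(knn_dists - rho, 0.0) / sigma)
        total = weights.sum()
        if abs(total - target) < tol:
            break
        if total > target:            # graph too connected -> shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                         # graph too disconnected -> grow sigma
            lo = sigma
            sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return rho, sigma

# Example with synthetic neighbor distances: larger k drives sigma up.
dists = np.sort(np.random.default_rng(0).uniform(0.5, 2.0, size=15))
print(find_sigma(dists))
```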
| Value | Local Structure | Global Structure | Cluster Appearance | Best For |
|---|---|---|---|---|
| 2-5 | Excellent | Poor | Fragmented, disconnected | Very fine local detail |
| 10-15 | Very good | Moderate | Distinct, separated | General-purpose visualization |
| 30-50 | Good | Good | Connected, flowing | Trajectory/continuum data |
| 100+ | Moderate | Excellent | Merged, continuous | Emphasizing global topology |
```python
import numpy as np
import matplotlib.pyplot as plt
import umap


def explore_n_neighbors(X, y=None, values=[5, 15, 50, 100]):
    """
    Visualize the effect of n_neighbors on embedding.

    Creates a grid of embeddings at different n_neighbors values
    to help identify the appropriate setting for your data.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        High-dimensional data
    y : ndarray of shape (n_samples,), optional
        Labels for coloring (ground truth or cluster assignments)
    values : list of int
        n_neighbors values to explore
    """
    n_values = len(values)
    fig, axes = plt.subplots(2, (n_values + 1) // 2,
                             figsize=(5 * (n_values + 1) // 2, 10))
    axes = axes.flatten()

    for i, n_neighbors in enumerate(values):
        print(f"Computing UMAP with n_neighbors={n_neighbors}...")
        reducer = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=0.1,  # Keep fixed to isolate the n_neighbors effect
            random_state=42
        )
        embedding = reducer.fit_transform(X)

        ax = axes[i]
        ax.scatter(
            embedding[:, 0], embedding[:, 1],
            c=y if y is not None else 'steelblue',
            cmap='Spectral' if y is not None else None,
            s=5, alpha=0.6
        )
        ax.set_title(f'n_neighbors = {n_neighbors}')
        ax.set_xticks([])
        ax.set_yticks([])

    # Hide unused axes
    for j in range(len(values), len(axes)):
        axes[j].axis('off')

    plt.tight_layout()
    plt.suptitle('Effect of n_neighbors on UMAP Embedding', y=1.02, fontsize=14)
    return fig


def estimate_optimal_n_neighbors(X, k_range=range(5, 101, 5)):
    """
    Heuristic for estimating optimal n_neighbors.

    Uses the elbow method on the graph connectivity:
    - Too small k: disconnected graph
    - Too large k: over-smoothed structure
    - Optimal k: connected graph with distinct structure

    Metric: number of connected components + edge weight entropy
    """
    from scipy.sparse.csgraph import connected_components

    results = []
    for k in k_range:
        reducer = umap.UMAP(n_neighbors=k, random_state=42)
        reducer.fit(X)

        # Get the fuzzy graph
        graph = reducer.graph_

        # Metric 1: Connected components (want 1)
        n_components, _ = connected_components(graph > 0.01)

        # Metric 2: Edge weight entropy (measure of structure)
        weights = graph.data
        weights_norm = weights / weights.sum()
        entropy = -np.sum(weights_norm * np.log(weights_norm + 1e-10))

        results.append({
            'n_neighbors': k,
            'n_components': n_components,
            'entropy': entropy
        })
    return results


def plot_n_neighbors_diagnostics(results):
    """Plot diagnostic metrics for n_neighbors selection."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ks = [r['n_neighbors'] for r in results]
    n_comps = [r['n_components'] for r in results]
    entropies = [r['entropy'] for r in results]

    ax1.plot(ks, n_comps, 'b-o')
    ax1.axhline(y=1, color='r', linestyle='--', label='Target: 1 component')
    ax1.set_xlabel('n_neighbors')
    ax1.set_ylabel('Number of Connected Components')
    ax1.set_title('Graph Connectivity')
    ax1.legend()

    ax2.plot(ks, entropies, 'g-o')
    ax2.set_xlabel('n_neighbors')
    ax2.set_ylabel('Edge Weight Entropy')
    ax2.set_title('Structure Complexity')

    plt.tight_layout()
    return fig
```

Optimal n_neighbors often scales with dataset size. For 1,000 points, n_neighbors=15 might be ideal. For 100,000 points, n_neighbors=30-50 may be better. A rule of thumb: try sqrt(n) / 2 as a starting point for large datasets, but always validate visually.
min_dist controls how tightly points pack together in the low-dimensional embedding. It determines the effective "radius" of each point.
What min_dist Controls:
This parameter sets the minimum distance at which points can appear in the embedding. Lower values produce tighter, more distinct clusters; higher values create more uniform, spread-out embeddings.
Technically, min_dist affects the shape of the curve mapping high-D distances to low-D edge weights:
$$\nu(d) \approx \begin{cases} 1 & \text{if } d \leq \text{min\_dist} \\ \exp\left(-(d - \text{min\_dist})\right) & \text{otherwise} \end{cases}$$
UMAP does not use this piecewise curve directly; it fits the smooth family $\frac{1}{1 + a d^{2b}}$ to it, with parameters $a$ and $b$ computed from min_dist (and spread).
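The sketch below mirrors the curve-fitting idea behind umap-learn's `find_ab_params`: build the piecewise target curve from min_dist and spread, then fit the smooth family to it. The exact constants may differ slightly from the library's:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(spread=1.0, min_dist=0.1):
    """Fit a, b so that 1/(1 + a*d^(2b)) approximates the min_dist/spread target curve."""
    d = np.linspace(0.0, spread * 3.0, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))

    def smooth(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))

    (a, b), _ = curve_fit(smooth, d, target)
    return a, b

# Smaller min_dist pushes the curve toward a sharper drop-off.
for md in [0.0, 0.1, 0.5]:
    a, b = fit_ab(min_dist=md)
    print(f"min_dist={md}: a={a:.3f}, b={b:.3f}")
```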
| Value | Cluster Appearance | Within-Cluster Structure | Use Case |
|---|---|---|---|
| 0.0 | Extremely tight, points can overlap | Minimal visible structure | Dense cluster analysis, data density |
| 0.1 | Tight, well-separated clusters | Some internal structure visible | General visualization (default) |
| 0.25 | Moderate density clusters | Clear internal structure | When internal gradients matter |
| 0.5 | Loose, spread-out clusters | Excellent internal visibility | Continuous data, trajectories |
| 0.8-1.0 | Very spread out, almost uniform | Maximum internal detail | When every point matters visually |
A common misinterpretation: dense regions in UMAP visualizations don't necessarily correspond to dense regions in the original data. min_dist affects visual density directly but doesn't change the underlying relationship structure. Always verify density interpretations against the original high-dimensional data.
The min_dist and n_neighbors Interaction:
These parameters interact significantly:
| Combination | Typical Result |
|---|---|
| Low n_neighbors + Low min_dist | Small, tight, scattered clusters (possibly fragmented) |
| Low n_neighbors + High min_dist | Spread-out local patches |
| High n_neighbors + Low min_dist | Large, tight, connected regions |
| High n_neighbors + High min_dist | Smooth, continuous, spread-out embedding |
Finding the right combination requires experimentation. Start with defaults (n_neighbors=15, min_dist=0.1), then adjust one parameter at a time.
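A small grid sweep makes this interaction visible. The sketch below (assuming `X` and optional labels `y` are already defined) lays out one embedding per combination:

```python
import itertools
import matplotlib.pyplot as plt
import umap

def grid_sweep(X, y=None, nn_values=(5, 15, 50), md_values=(0.0, 0.1, 0.5)):
    """Plot a grid of embeddings over n_neighbors (rows) x min_dist (columns)."""
    fig, axes = plt.subplots(len(nn_values), len(md_values),
                             figsize=(4 * len(md_values), 4 * len(nn_values)))
    for (i, nn), (j, md) in itertools.product(enumerate(nn_values), enumerate(md_values)):
        emb = umap.UMAP(n_neighbors=nn, min_dist=md, random_state=42).fit_transform(X)
        ax = axes[i, j]
        ax.scatter(emb[:, 0], emb[:, 1],
                   c=y if y is not None else 'steelblue',
                   cmap='Spectral' if y is not None else None, s=4, alpha=0.6)
        ax.set_title(f'n_neighbors={nn}, min_dist={md}', fontsize=9)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.tight_layout()
    return fig
```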
The metric parameter determines how distances are computed in the high-dimensional space. Choosing the right metric can dramatically improve embedding quality for specific data types.
Available Metrics in UMAP:
| Metric | Formula | Best For | Considerations |
|---|---|---|---|
| euclidean | ‖x − y‖₂ | Dense, continuous features | Sensitive to scale; normalize first |
| cosine | 1 − (x·y)/(‖x‖ ‖y‖) | Text, sparse vectors, embeddings | Ignores magnitude |
| manhattan | Σᵢ \|xᵢ − yᵢ\| | High dimensions, sparse data | More robust to outliers |
| correlation | 1 − correlation(x, y) | Gene expression, time series | Centers data implicitly |
| jaccard | 1 − \|A∩B\|/\|A∪B\| | Binary/set data | For binary features only |
| hamming | Fraction of differing bits | Binary vectors, categorical | For binary features only |
Data Type → Metric Selection:
| Data Type | Recommended Metric | Reasoning |
|---|---|---|
| Images (pixels) | euclidean or cosine | Euclidean after normalization works well |
| Text (TF-IDF) | cosine | Angle matters more than magnitude |
| Text (embeddings) | cosine or euclidean | Both work; cosine is more common |
| Single-cell RNA-seq | correlation | Centers per-cell variation |
| GPS/spatial coordinates | haversine | Proper great-circle distances |
| Binary features | jaccard or hamming | Appropriate for binary data |
| Mixed types | gower (custom) | Handles mixed numeric/categorical |
```python
import numpy as np
import umap
import matplotlib.pyplot as plt


def compare_metrics(X, y=None,
                    metrics=['euclidean', 'cosine', 'manhattan', 'correlation']):
    """
    Compare UMAP embeddings using different distance metrics.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        High-dimensional data
    y : ndarray of shape (n_samples,), optional
        Labels for coloring
    metrics : list of str
        Metrics to compare
    """
    n_metrics = len(metrics)
    fig, axes = plt.subplots(1, n_metrics, figsize=(5 * n_metrics, 5))

    for i, metric in enumerate(metrics):
        print(f"Computing UMAP with metric={metric}...")
        ax = axes[i] if n_metrics > 1 else axes
        try:
            reducer = umap.UMAP(
                n_neighbors=15,
                min_dist=0.1,
                metric=metric,
                random_state=42
            )
            embedding = reducer.fit_transform(X)

            ax.scatter(
                embedding[:, 0], embedding[:, 1],
                c=y if y is not None else 'steelblue',
                cmap='Spectral' if y is not None else None,
                s=5, alpha=0.6
            )
            ax.set_title(f'metric = "{metric}"')
            ax.set_xticks([])
            ax.set_yticks([])
        except Exception as e:
            ax.text(0.5, 0.5, f"Error: {e}", ha='center', va='center')
            ax.set_title(f'metric = "{metric}" (failed)')

    plt.tight_layout()
    return fig


def custom_metric_example():
    """
    Example: Using a custom distance metric.

    UMAP supports any numba-compilable function with signature
    f(x, y) -> float, where x and y are 1D arrays.
    """
    from numba import njit

    @njit
    def weighted_euclidean(x, y, weights):
        """
        Weighted Euclidean distance.

        Allows different features to contribute differently
        to the distance calculation.
        """
        d = 0.0
        for i in range(len(x)):
            d += weights[i] * (x[i] - y[i]) ** 2
        return np.sqrt(d)

    # Note: for custom metrics, pass any additional parameters
    # through metric_kwds.
    # Example usage (pseudo-code):
    # reducer = umap.UMAP(metric=weighted_euclidean,
    #                     metric_kwds={'weights': my_weights})

    return weighted_euclidean
```

If you've already computed a distance matrix, use metric='precomputed' and pass the distance matrix directly. This is useful when using domain-specific distance functions too complex for UMAP's metric interface, or when reusing distances across multiple UMAP runs.
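A minimal sketch of the precomputed route follows. The `cityblock` metric and variable names are placeholders; any domain-specific distance matrix could be substituted. Note that `transform()` on genuinely new points is not available in this mode, because UMAP never sees the raw feature vectors:

```python
import umap
from scipy.spatial.distance import pdist, squareform

# X is assumed to be an (n_samples, n_features) array.
# pdist here is just an example; any square distance matrix works.
D = squareform(pdist(X, metric='cityblock'))   # (n_samples, n_samples) distances

reducer = umap.UMAP(metric='precomputed', n_neighbors=15, min_dist=0.1,
                    random_state=42)
embedding = reducer.fit_transform(D)           # pass distances, not features
```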
While the main parameters (n_neighbors, min_dist, metric) control what UMAP learns, optimization parameters control how it learns. These rarely need adjustment but can help in specific situations.
n_epochs (Number of Training Iterations):
| Dataset Size | Default n_epochs | When to Increase | When to Decrease |
|---|---|---|---|
| < 10,000 | 500 | Complex structure not converging | Quick exploratory runs |
| 10,000 - 100,000 | 200 | Large n_neighbors, complex data | Good enough visual quality |
| > 100,000 | 200 | Need perfect convergence | Time constraints |
init (Initialization Strategy):
| Strategy | Description | When to Use |
|---|---|---|
| spectral | Graph Laplacian eigenvectors | Default; best for global structure |
| random | Uniform random | When spectral fails on disconnected graphs |
| pca | PCA projection | Fast, deterministic starting point |
| custom | User-provided coordinates | Warm-starting, animation continuity |
Spectral initialization is crucial for global structure preservation. If UMAP produces fragmented or inconsistent results, initialization problems may be the cause.
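For example, passing explicit starting coordinates is one way to warm-start the layout or keep successive embeddings visually consistent; `init` also accepts an (n_samples, n_components) array. A sketch using a PCA projection as the starting positions (names are illustrative):

```python
import umap
from sklearn.decomposition import PCA

# X is assumed to be the high-dimensional data.
pca_coords = PCA(n_components=2).fit_transform(X)

# Use the PCA layout as the starting positions for UMAP's optimization.
reducer = umap.UMAP(init=pca_coords, n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)
```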
learning_rate:
The learning rate controls SGD step size. The default (1.0) works well almost always.
negative_sample_rate:
Controls how many negative (repulsive) samples are drawn per positive (attractive) edge.
| Value | Effect | Use Case |
|---|---|---|
| 1-3 | Weak repulsion, possible crowding | Rarely useful |
| 5 (default) | Balanced forces | General use |
| 10-20 | Strong repulsion, wide separation | Very dense data |
The spread parameter (default 1.0) works with min_dist to determine the a and b curve parameters. In practice, you rarely need to change spread independently—min_dist alone is usually sufficient. If you need very large or compact embeddings, adjust spread proportionally with min_dist.
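If you do scale them together, keep min_dist at or below spread. A quick sketch comparing the default pairing with a proportionally stretched one (`X` assumed defined):

```python
import umap

# min_dist must stay <= spread; scaling both stretches the whole layout.
emb_default = umap.UMAP(min_dist=0.1, spread=1.0, random_state=42).fit_transform(X)
emb_stretched = umap.UMAP(min_dist=0.2, spread=2.0, random_state=42).fit_transform(X)
```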
Rather than randomly trying parameters, follow a systematic approach:
Step 1: Start with Defaults and Understand Your Data
Step 2: Adjust n_neighbors First
This has the largest impact. Create a sweep:
```python
import numpy as np
import umap
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt


def systematic_umap_tuning(X, y=None):
    """
    Systematic UMAP hyperparameter tuning.

    Follows a structured approach:
    1. Sweep n_neighbors
    2. For the best n_neighbors, sweep min_dist
    3. Evaluate with quantitative metrics and visual inspection

    Parameters:
    -----------
    X : ndarray
        High-dimensional data
    y : ndarray, optional
        Labels for silhouette score calculation

    Returns:
    --------
    best_params : dict
        Recommended parameters
    results : list
        All evaluation results
    """
    results = []

    # ============================================
    # STEP 1: n_neighbors sweep
    # ============================================
    print("Step 1: Sweeping n_neighbors...")
    n_neighbors_values = [5, 10, 15, 20, 30, 50, 75, 100]

    for n_neighbors in n_neighbors_values:
        reducer = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=0.1,  # Fixed for this sweep
            random_state=42
        )
        embedding = reducer.fit_transform(X)

        # Compute evaluation metrics
        metrics = evaluate_embedding(X, embedding, y)
        metrics['n_neighbors'] = n_neighbors
        metrics['min_dist'] = 0.1
        metrics['sweep'] = 'n_neighbors'   # tag which sweep produced this result
        results.append(metrics)

        print(f"  n_neighbors={n_neighbors}: "
              f"trustworthiness={metrics['trustworthiness']:.3f}, "
              f"continuity={metrics['continuity']:.3f}")

    # Find best n_neighbors based on combined metric
    nn_results = [r for r in results if r['sweep'] == 'n_neighbors']
    best_nn_result = max(nn_results,
                         key=lambda r: r['trustworthiness'] + r['continuity'])
    best_n_neighbors = best_nn_result['n_neighbors']
    print(f"\nBest n_neighbors: {best_n_neighbors}")

    # ============================================
    # STEP 2: min_dist sweep with best n_neighbors
    # ============================================
    print("\nStep 2: Sweeping min_dist...")
    min_dist_values = [0.0, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8]

    for min_dist in min_dist_values:
        reducer = umap.UMAP(
            n_neighbors=best_n_neighbors,
            min_dist=min_dist,
            random_state=42
        )
        embedding = reducer.fit_transform(X)

        metrics = evaluate_embedding(X, embedding, y)
        metrics['n_neighbors'] = best_n_neighbors
        metrics['min_dist'] = min_dist
        metrics['sweep'] = 'min_dist'
        results.append(metrics)

        print(f"  min_dist={min_dist}: "
              f"trustworthiness={metrics['trustworthiness']:.3f}")

    # Find best min_dist
    md_results = [r for r in results if r['sweep'] == 'min_dist']
    best_md_result = max(md_results, key=lambda r: r['trustworthiness'])
    best_min_dist = best_md_result['min_dist']
    print(f"\nBest min_dist: {best_min_dist}")

    best_params = {
        'n_neighbors': best_n_neighbors,
        'min_dist': best_min_dist
    }
    return best_params, results


def evaluate_embedding(X, embedding, y=None, k=10):
    """
    Compute quantitative metrics for embedding quality.

    Metrics:
    - Trustworthiness: Are embedding neighbors true neighbors?
    - Continuity: Are high-D neighbors still neighbors in the embedding?
    - Silhouette (if labels): How well-separated are clusters?
    """
    from sklearn.manifold import trustworthiness as tw

    metrics = {}

    # Trustworthiness: penalizes false neighbors in the embedding
    metrics['trustworthiness'] = tw(X, embedding, n_neighbors=k)

    # Continuity: trustworthiness with the roles of the two spaces swapped
    metrics['continuity'] = tw(embedding, X, n_neighbors=k)

    # Silhouette score if labels are provided
    if y is not None and len(np.unique(y)) > 1:
        metrics['silhouette'] = silhouette_score(embedding, y)
    else:
        metrics['silhouette'] = np.nan

    return metrics


def visualize_tuning_results(results):
    """Visualize the tuning sweep results."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # n_neighbors sweep (min_dist fixed at 0.1)
    nn_results = [r for r in results if r['sweep'] == 'n_neighbors']
    nn_values = [r['n_neighbors'] for r in nn_results]
    trust = [r['trustworthiness'] for r in nn_results]
    cont = [r['continuity'] for r in nn_results]

    axes[0].plot(nn_values, trust, 'b-o', label='Trustworthiness')
    axes[0].plot(nn_values, cont, 'g-o', label='Continuity')
    axes[0].set_xlabel('n_neighbors')
    axes[0].set_ylabel('Score')
    axes[0].set_title('n_neighbors Sweep')
    axes[0].legend()

    # min_dist sweep (best n_neighbors)
    md_results = [r for r in results if r['sweep'] == 'min_dist']
    md_values = [r['min_dist'] for r in md_results]
    trust = [r['trustworthiness'] for r in md_results]

    axes[1].plot(md_values, trust, 'b-o')
    axes[1].set_xlabel('min_dist')
    axes[1].set_ylabel('Trustworthiness')
    axes[1].set_title('min_dist Sweep (best n_neighbors)')

    plt.tight_layout()
    return fig
```

Step 3: Validate Visually
Quantitative metrics are helpful but not sufficient. Always visually inspect:
- Whether known groups (labels, batches, conditions) form coherent regions
- Whether clusters look fragmented or artificially split
- Whether the layout is stable across a few random seeds
- Whether outliers and small groups remain visible rather than being absorbed
Step 4: Consider Your Use Case
Optimal parameters depend on your goal:
| Goal | Parameter Preference |
|---|---|
| Exploratory visualization | Higher n_neighbors, lower min_dist |
| Cluster identification | Lower n_neighbors, very low min_dist |
| Trajectory analysis | Higher n_neighbors, higher min_dist |
| Pre-processing for ML | Optimize for downstream task metric |
How do you know if your embedding is "good"? Several quantitative metrics help assess embedding quality.
Core Metrics:
| Metric | What It Measures | Range | Good Value |
|---|---|---|---|
| Trustworthiness | False neighbors in embedding | 0‒1 | > 0.95 |
| Continuity | Missing neighbors in embedding | 0‒1 | > 0.90 |
| Silhouette (with labels) | Cluster separation quality | -1‒1 | > 0.5 |
| k-NN accuracy (with labels) | Neighborhood label preservation | 0‒1 | Close to the high-D baseline |
| Stress (MDS-style) | Distance distortion | 0‒∞ | Lower is better |
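The trustworthiness and continuity calls already appear in the tuning code above. For the label-based k-NN accuracy in this table, a small sketch (assuming labels `y` are available) could look like this:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_label_accuracy(X_highd, embedding, y, k=10, cv=5):
    """Compare k-NN classification accuracy in the original space vs. the embedding.

    If the embedding preserves neighborhood label structure, the two scores
    should be close; a large drop signals that informative neighborhoods were lost.
    """
    clf = KNeighborsClassifier(n_neighbors=k)
    acc_high = cross_val_score(clf, X_highd, y, cv=cv).mean()
    acc_low = cross_val_score(clf, embedding, y, cv=cv).mean()
    return acc_high, acc_low

# Example usage (X, embedding, y assumed defined):
# acc_high, acc_low = knn_label_accuracy(X, embedding, y)
# print(f"high-D: {acc_high:.3f}  embedding: {acc_low:.3f}")
```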
Interpreting Trustworthiness and Continuity:
Trustworthiness drops when points that are close in the embedding were not close in the original space (false neighbors); continuity drops when original neighbors end up far apart in the embedding (missing neighbors). High trustworthiness doesn't guarantee a useful embedding: you could have perfect local structure but terrible global structure. Always combine quantitative metrics with visual inspection and domain knowledge. If the embedding doesn't make sense for your application, the metrics don't matter.
Diagnostic Visualizations:
Beyond quantitative metrics, these visualizations help diagnose issues:
Residual distance plots: Scatter high-D distances vs. low-D distances. Ideally shows monotonic relationship.
k-NN overlap histograms: Distribution of how many k nearest neighbors are preserved.
Multi-parameter comparison grids: Side-by-side embeddings at different parameters.
Label coloring: If you have labels, color by them to check cluster preservation.
Feature hover: Interactive plots showing original features on hover.
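As a concrete example of the first diagnostic above, the sketch below draws a Shepard-style plot of pairwise distances before and after embedding. Pair subsampling and the variable names (`X`, `embedding`) are assumptions for illustration; for very large datasets, subsample rows of `X` before calling `pdist`:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

def shepard_plot(X, embedding, max_pairs=20000, seed=0):
    """Scatter original-space pairwise distances against embedding distances."""
    d_high = pdist(X)          # note: O(n^2) pairs; subsample rows of X if n is large
    d_low = pdist(embedding)
    rng = np.random.default_rng(seed)
    if len(d_high) > max_pairs:                      # subsample pairs for plotting speed
        idx = rng.choice(len(d_high), size=max_pairs, replace=False)
        d_high, d_low = d_high[idx], d_low[idx]
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(d_high, d_low, s=2, alpha=0.2)
    ax.set_xlabel('High-dimensional distance')
    ax.set_ylabel('Embedding distance')
    ax.set_title('Shepard diagram: a monotonic trend means distances are roughly preserved')
    return fig
```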
Some datasets and use cases require special consideration.
Very Large Datasets (> 1 million points):
- Tune parameters on a random subsample (e.g. 50,000-100,000 points), then fit the full dataset with the chosen settings
- Use a moderately larger n_neighbors (30-50) and fewer n_epochs (200 or less) to keep runtime manageable
- Set random_state only if you need exact reproducibility; fixing the seed forces slower, deterministic optimization
Very High Dimensions (> 10,000 features):
| Strategy | When To Use |
|---|---|
| PCA pre-reduction | Standard approach; reduce to 50-200 dims first |
| Feature selection | If only some features are relevant |
| Random projection | Fast dimensionality reduction when PCA is too slow |
| Use cosine metric | Often works better for very sparse high-D data |
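A common pattern for very wide data is the first row of this table: reduce to roughly 50 dimensions with PCA, then run UMAP on the result. A minimal sketch (`X` assumed to be the wide feature matrix):

```python
import umap
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),                       # put features on a comparable scale first
    PCA(n_components=50, random_state=42),  # strip noise dimensions, speed up neighbor search
    umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42),
)
embedding = pipeline.fit_transform(X)
```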
Supervised/Semi-Supervised UMAP:
UMAP can incorporate label information:
```python
# Supervised UMAP - uses labels to guide the embedding
reducer = umap.UMAP(target_metric='categorical')
embedding = reducer.fit_transform(X, y=labels)

# Semi-supervised - pass labels where available, with -1 marking unlabeled points
```
This produces embeddings that better separate known classes while still respecting unlabeled point relationships.
Unlike t-SNE, UMAP can embed new points without refitting: reducer.transform(X_new). This is invaluable for production systems, streaming data, and consistent embeddings across train/test splits. The transform uses the learned fuzzy graph structure to place new points consistently with the existing embedding.
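A short sketch of that workflow, fitting on training data and projecting held-out points into the same space (the train/test split and variable names are illustrative):

```python
import umap
from sklearn.model_selection import train_test_split

# X assumed to be the full (n_samples, n_features) dataset.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
train_embedding = reducer.fit_transform(X_train)   # learn the map from training data only
test_embedding = reducer.transform(X_test)         # place new points in the same space
```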
You now have comprehensive knowledge for tuning UMAP to achieve optimal embeddings. Let's consolidate the key insights:
| Scenario | n_neighbors | min_dist | Other |
|---|---|---|---|
| General visualization | 15 | 0.1 | Defaults work well |
| Large dataset | 30-50 | 0.1 | Reduce n_epochs |
| Find tight clusters | 10-15 | 0.0-0.05 | Use spectral init |
| Continuous trajectories | 30-50 | 0.3-0.5 | Consider higher n_epochs |
| Text data | 15-30 | 0.1 | Use cosine metric |
| Pre-processing for ML | Tune for task | 0.0 | Evaluate downstream |
Module Complete:
This concludes Module 6 on UMAP. You now possess a comprehensive understanding of UMAP—from its theoretical foundations in Riemannian geometry and algebraic topology, through its fuzzy topological representation and cross-entropy optimization, to practical comparison with t-SNE and hyperparameter tuning strategies.
Armed with this knowledge, you can confidently apply UMAP to your high-dimensional data, anticipate its behavior, interpret its outputs correctly, and tune it for optimal results in your specific domain.
Congratulations on completing Module 6: UMAP. You now have world-class knowledge of one of the most important dimensionality reduction algorithms in modern machine learning—its theory, implementation, comparison to alternatives, and practical tuning. Apply this knowledge to explore the hidden structure in your high-dimensional data.