Classical Isomap has a fundamental scaling problem. Computing all-pairs shortest paths in a graph with $N$ vertices requires $O(N^2 \log N)$ time using Dijkstra's algorithm from each vertex, or $O(N^3)$ using Floyd-Warshall. Storing the $N \times N$ distance matrix requires $O(N^2)$ memory. For a dataset with 100,000 points, this means storing 10 billion distances and performing trillions of operations.

Landmark Isomap solves this problem with an elegant idea: instead of computing geodesic distances between all pairs of points, compute distances only from a small set of landmark points to all other points. This reduces the complexity dramatically while preserving the essential geometric structure of the embedding.
By the end of this page, you will understand: (1) why classical Isomap doesn't scale and the computational bottlenecks, (2) the landmark selection strategies and their tradeoffs, (3) how landmark-based MDS (L-MDS) computes embeddings efficiently, (4) the theoretical guarantees and approximation quality of Landmark Isomap, and (5) practical implementation considerations for large-scale applications.
Let's analyze the computational complexity of each step in classical Isomap to understand where the bottlenecks lie.
| Step | Operation | Time Complexity | Space Complexity | Scalability |
|---|---|---|---|---|
| k-NN search (brute force) | $O(N^2 D)$ | $O(N^2)$ or $O(Nk)$ | Quadratic; acceptable only for moderate $N$ |
| k-NN search (tree-based) | $O(N \log N \cdot D)$ | $O(N)$ + $O(Nk)$ | Almost linear; good |
| Dijkstra from all sources | $O(N^2 \log N + NE)$ | $O(N^2)$ | Bottleneck |
| Floyd-Warshall | $O(N^3)$ | $O(N^2)$ | Bottleneck |
| Store all pairs | $O(N^2)$ | $O(N^2)$ | Memory bottleneck |
| Eigendecomposition of $N \times N$ matrix | $O(N^3)$ | $O(N^2)$ | Bottleneck |
**The Numbers Are Sobering**

For a dataset with $N = 100,000$ points:

- Shortest paths: $N^2 \log N \approx 10^{10} \cdot 17 \approx 1.7 \times 10^{11}$ operations
- Distance matrix: $N^2 = 10^{10}$ entries = 80 GB (double precision)
- Eigendecomposition: $N^3 = 10^{15}$ operations

Even with modern hardware, this is impractical. Classical Isomap typically becomes infeasible for $N > 10,000$ without significant computational resources.
Even if we could compute all-pairs shortest paths efficiently, storing the full N × N distance matrix is often impossible. For N = 1 million points, the distance matrix requires 8 TB of memory. This memory constraint is often more limiting than computational time for large-scale applications.
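These figures are simple arithmetic, and it can help to reproduce them yourself. The sketch below is a minimal back-of-the-envelope calculator; the function name `classical_isomap_cost` and the chosen dataset sizes are illustrative, and the outputs are order-of-magnitude operation counts, not benchmarks:

```python
import numpy as np

def classical_isomap_cost(N, bytes_per_entry=8):
    """Rough, order-of-magnitude cost estimates for classical Isomap."""
    ops_shortest_paths = N**2 * np.log2(N)    # Dijkstra from every vertex on a sparse graph
    ops_eigendecomp = N**3                    # dense eigendecomposition of the N x N matrix
    memory_gb = N**2 * bytes_per_entry / 1e9  # full distance matrix in double precision
    return ops_shortest_paths, ops_eigendecomp, memory_gb

for N in (10_000, 100_000, 1_000_000):
    sp, eig, mem = classical_isomap_cost(N)
    print(f"N={N:>9,}: shortest paths ~{sp:.1e} ops, "
          f"eigendecomposition ~{eig:.1e} ops, distance matrix ~{mem:,.0f} GB")
```

For $N = 100{,}000$ this reproduces the numbers above (~$1.7 \times 10^{11}$ path operations, 80 GB of storage), and for $N = 10^6$ the distance matrix alone reaches 8 TB.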
Landmark Isomap, introduced by de Silva and Tenenbaum (2003), exploits a key observation: if we know the distances from all points to a subset of well-chosen landmarks, we can approximately reconstruct the full distance structure.

**The Core Idea**

Select $L \ll N$ landmark points $\{\ell_1, \ldots, \ell_L\}$. Instead of computing the full $N \times N$ geodesic distance matrix, compute only:

$$\mathbf{D}_L = \left[\, d_G(\ell_i, \mathbf{x}_j) \,\right]_{i=1,\ldots,L;\; j=1,\ldots,N}$$

This is an $L \times N$ matrix, dramatically smaller than $N \times N$ when $L \ll N$.
**Complexity Reduction**

With $L$ landmarks:

| Operation | Classical | Landmark | Reduction Factor |
|-----------|-----------|----------|------------------|
| Shortest paths | $O(N^2 \log N)$ | $O(LN \log N)$ | $N/L$ |
| Distance storage | $O(N^2)$ | $O(LN)$ | $N/L$ |
| MDS embedding | $O(N^3)$ | $O(L^2 N + L^3)$ | $(N/L)^2$ to $(N/L)^3$ |

For $N = 100,000$ and $L = 1,000$, this represents a 100× to 10,000× speedup depending on the operation.
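In practice the $L \times N$ matrix falls out of a standard shortest-path routine simply by restricting the source vertices. A minimal sketch using SciPy (the helper name and parameter values are illustrative; `X` is assumed to be an `(N, D)` data array and `landmarks` a list of $L$ indices):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

def landmark_geodesic_distances(X, landmarks, n_neighbors=10):
    """Return the (L, N) matrix of graph shortest-path distances from the landmarks."""
    graph = kneighbors_graph(X, n_neighbors, mode='distance')  # sparse k-NN graph
    # Running Dijkstra only from the landmark rows yields an (L, N) result
    # instead of the full (N, N) all-pairs matrix.
    return dijkstra(graph, directed=False, indices=landmarks)

# Shape check:
# D_L = landmark_geodesic_distances(X, landmarks)   # D_L.shape == (len(landmarks), len(X))
```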
Think of landmarks as GPS satellites. If you know your distance to several well-positioned satellites, you can determine your position without knowing distances to all other points on Earth. Similarly, knowing distances to landmarks allows us to determine each point's position in the embedding space.
The choice of landmarks significantly impacts embedding quality. Several strategies exist, each with distinct properties:
| Strategy | Method | Complexity | Pros | Cons |
|---|---|---|---|---|
| Random | Uniform random sample | $O(L)$ | Simple, unbiased | May miss sparse regions |
| MaxMin | Iteratively add farthest point | $O(LN)$ | Good coverage | Somewhat expensive |
| K-means++ | K-means++ initialization | $O(LN)$ | Density-aware spread | Biased toward dense regions |
| Farthest Point Sampling | Greedy max-min distance | $O(L N \log N)$ | Optimal spread | Expensive for large L |
| Column Sampling | Based on distance matrix leverage scores | $O(N^2)$ | Theoretically optimal | Requires full matrix (defeats purpose) |
```python
import numpy as np
from scipy.spatial.distance import cdist


def random_landmarks(X, n_landmarks, seed=None):
    """
    Select landmarks uniformly at random.

    Parameters:
        X: (N, D) array of data points
        n_landmarks: Number of landmarks L
        seed: Random seed for reproducibility

    Returns:
        landmark_indices: Array of L indices into X
    """
    rng = np.random.RandomState(seed)
    return rng.choice(X.shape[0], size=n_landmarks, replace=False)


def maxmin_landmarks(X, n_landmarks, seed=None):
    """
    Select landmarks using the MaxMin (Gonzalez) algorithm.

    Iteratively selects the point farthest from all current landmarks.
    Provides good coverage of the data distribution.

    Parameters:
        X: (N, D) array of data points
        n_landmarks: Number of landmarks L
        seed: Random seed for first landmark

    Returns:
        landmark_indices: Array of L indices into X
    """
    N = X.shape[0]
    rng = np.random.RandomState(seed)

    # Initialize with a random first landmark
    landmarks = [rng.randint(N)]

    # Track minimum distance to any landmark for each point
    min_distances = np.full(N, np.inf)

    for _ in range(n_landmarks - 1):
        # Update min distances with the last added landmark
        last_landmark = landmarks[-1]
        dists = np.linalg.norm(X - X[last_landmark], axis=1)
        min_distances = np.minimum(min_distances, dists)

        # Exclude existing landmarks
        min_distances[landmarks] = -np.inf

        # Select the point with maximum min-distance
        next_landmark = np.argmax(min_distances)
        landmarks.append(next_landmark)

    return np.array(landmarks)


def kmeansplusplus_landmarks(X, n_landmarks, seed=None):
    """
    Select landmarks using k-means++ initialization.

    Probabilistically selects points proportional to squared
    distance from the nearest existing landmark.

    Parameters:
        X: (N, D) array of data points
        n_landmarks: Number of landmarks L
        seed: Random seed

    Returns:
        landmark_indices: Array of L indices into X
    """
    N = X.shape[0]
    rng = np.random.RandomState(seed)

    # Initialize with a random first landmark
    landmarks = [rng.randint(N)]

    for _ in range(n_landmarks - 1):
        # Compute squared distances to the nearest landmark
        landmark_points = X[landmarks]
        dists = cdist(X, landmark_points, metric='sqeuclidean')
        min_sq_dists = dists.min(axis=1)

        # Zero out existing landmarks
        min_sq_dists[landmarks] = 0

        # Sample proportional to squared distance
        probs = min_sq_dists / min_sq_dists.sum()
        next_landmark = rng.choice(N, p=probs)
        landmarks.append(next_landmark)

    return np.array(landmarks)


def compare_landmark_strategies(X, n_landmarks, n_trials=10):
    """
    Compare landmark selection strategies by measuring coverage.

    Coverage is measured as the average distance from each point
    to its nearest landmark (lower is better).
    """
    results = {}

    for name, func in [('Random', random_landmarks),
                       ('MaxMin', maxmin_landmarks),
                       ('K-means++', kmeansplusplus_landmarks)]:
        coverages = []
        for trial in range(n_trials):
            landmarks = func(X, n_landmarks, seed=trial)
            landmark_points = X[landmarks]
            dists = cdist(X, landmark_points)
            min_dists = dists.min(axis=1)
            coverages.append(min_dists.mean())

        results[name] = {
            'mean_coverage': np.mean(coverages),
            'std_coverage': np.std(coverages)
        }

    return results
```

For most applications, MaxMin provides the best tradeoff between quality and simplicity. Random sampling works surprisingly well when L is large enough (L ≥ 200). K-means++ is preferred when the data has significant density variations. The original Landmark Isomap paper used random selection, which is the default in many implementations.
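A quick usage sketch, assuming the functions above are in scope; the synthetic data and parameter values are purely illustrative:

```python
import numpy as np

# Synthetic data just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))

landmarks = maxmin_landmarks(X, n_landmarks=100, seed=0)
print(landmarks.shape)  # (100,)

# Compare coverage of the three strategies
for name, stats in compare_landmark_strategies(X, n_landmarks=100, n_trials=5).items():
    print(f"{name:10s} mean coverage: {stats['mean_coverage']:.3f} "
          f"(± {stats['std_coverage']:.3f})")
```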
Once we have geodesic distances from landmarks to all points, we need an embedding algorithm that works with this partial distance information. Landmark MDS (L-MDS), also known as the Nyström extension for MDS, provides the solution.

**The Two-Phase Approach**

1. **Phase 1:** Embed the landmarks using classical MDS on the $L \times L$ landmark-to-landmark distance matrix
2. **Phase 2:** Embed all other points by projecting them into the landmark embedding space using the distance-to-landmark information
**Phase 1: Landmark Embedding via Classical MDS**

Let $\mathbf{D}_{LL}$ be the $L \times L$ matrix of geodesic distances between landmarks. Classical MDS finds coordinates $\mathbf{Y}_L = [\mathbf{y}_1, \ldots, \mathbf{y}_L]^\top$ such that Euclidean distances in the embedding approximate geodesic distances:

$$\|\mathbf{y}_i - \mathbf{y}_j\| \approx d_G(\ell_i, \ell_j)$$

The classical MDS algorithm:

1. Compute the squared distance matrix $\mathbf{D}_{LL}^{(2)}$, where $D_{ij}^{(2)} = d_G(\ell_i, \ell_j)^2$

2. Double-center the matrix: $\mathbf{B} = -\frac{1}{2} \mathbf{H} \mathbf{D}_{LL}^{(2)} \mathbf{H}$, where $\mathbf{H} = \mathbf{I} - \frac{1}{L}\mathbf{1}\mathbf{1}^\top$ is the centering matrix

3. Eigendecompose: $\mathbf{B} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top$

4. Landmark coordinates: $\mathbf{Y}_L = \mathbf{V}_d \boldsymbol{\Lambda}_d^{1/2}$, where the subscript $d$ denotes the top $d$ eigenpairs
**Phase 2: Out-of-Sample Extension (Nyström Method)**

For each non-landmark point $\mathbf{x}_n$, we have distances to all landmarks: $\mathbf{d}_n = [d_G(\mathbf{x}_n, \ell_1), \ldots, d_G(\mathbf{x}_n, \ell_L)]^\top$.

The Nyström extension projects this point into the embedding space:

$$\mathbf{y}_n = \frac{1}{2} \boldsymbol{\Lambda}_d^{-1/2} \mathbf{V}_d^\top \left( \bar{\mathbf{d}}_L - \mathbf{d}_n^{(2)} \right)$$

where:

- $\mathbf{d}_n^{(2)} = [d_G(\mathbf{x}_n, \ell_1)^2, \ldots, d_G(\mathbf{x}_n, \ell_L)^2]^\top$ is the vector of squared distances
- $\bar{\mathbf{d}}_L = \frac{1}{L} \sum_j \mathbf{D}_{LL}^{(2)}[:, j]$ is the column mean of the landmark squared distance matrix

This formula positions $\mathbf{x}_n$ to best preserve its distances to the landmarks in the embedding space.
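One way to build intuition for this formula is to apply it to a landmark itself: because $\mathbf{B}$ is double-centered, the extension exactly reproduces that landmark's Phase 1 coordinate. Below is a minimal numerical check on a synthetic configuration (variable names are illustrative, not part of any library API):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 20, 2
Y_true = rng.normal(size=(L, d))                                   # planted landmark layout
D_sq = ((Y_true[:, None, :] - Y_true[None, :, :]) ** 2).sum(-1)    # squared distances

# Phase 1: classical MDS on the landmarks
H = np.eye(L) - np.ones((L, L)) / L
B = -0.5 * H @ D_sq @ H
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:d]
lam, Vd = w[idx], V[:, idx]
Y_mds = Vd * np.sqrt(lam)

# Phase 2: Nyström formula applied to landmark 0's own squared-distance column
d_bar = D_sq.mean(axis=1)
y0 = 0.5 * (Vd / np.sqrt(lam)).T @ (d_bar - D_sq[:, 0])

print(np.allclose(y0, Y_mds[0]))  # True: the extension recovers the MDS coordinate
```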
```python
import numpy as np
from scipy.linalg import eigh


def landmark_mds(D_landmarks, D_all, n_components):
    """
    Perform Landmark MDS embedding.

    Parameters:
        D_landmarks: (L, L) geodesic distance matrix between landmarks
        D_all: (L, N) geodesic distances from landmarks to all points
        n_components: Target embedding dimension d

    Returns:
        embedding: (N, n_components) embedded coordinates
        eigenvalues: Eigenvalues (for explained variance)
    """
    L, N = D_all.shape

    # Phase 1: Embed landmarks using classical MDS
    # Step 1: Square the distances
    D_sq = D_landmarks ** 2

    # Step 2: Double centering
    # B = -0.5 * H @ D_sq @ H, where H = I - (1/L) * 1 @ 1.T
    row_mean = D_sq.mean(axis=1, keepdims=True)
    col_mean = D_sq.mean(axis=0, keepdims=True)
    total_mean = D_sq.mean()
    B = -0.5 * (D_sq - row_mean - col_mean + total_mean)

    # Step 3: Eigendecomposition
    # Use 'eigh' for symmetric matrices; it returns eigenvalues in ascending order
    eigenvalues, eigenvectors = eigh(B)

    # Select top n_components (largest eigenvalues)
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Handle potential negative eigenvalues (due to non-Euclidean distances)
    eigenvalues = np.maximum(eigenvalues, 0)

    # Step 4: Landmark coordinates
    Y_landmarks = eigenvectors * np.sqrt(eigenvalues)

    # Phase 2: Embed all points using the Nyström extension
    # Squared distances from all points to landmarks
    D_all_sq = D_all ** 2  # (L, N)

    # Column mean of the landmark squared distance matrix
    d_bar = D_sq.mean(axis=1)  # (L,)

    # Nyström extension: y_n = (1/2) * Lambda^(-1/2) @ V.T @ (d_bar - d_n^2)
    # Rearranged in matrix form: embedding = -0.5 * (D_all_sq.T - d_bar) @ V @ Lambda^(-1/2)
    # For numerical stability, zero out components with (near-)zero eigenvalues
    Lambda_inv_sqrt = np.where(eigenvalues > 1e-10, 1.0 / np.sqrt(eigenvalues), 0)
    embedding = -0.5 * (D_all_sq.T - d_bar) @ (eigenvectors * Lambda_inv_sqrt)

    return embedding, eigenvalues


def explained_variance_ratio(eigenvalues):
    """Compute explained variance ratio for each component."""
    eigenvalues = np.maximum(eigenvalues, 0)  # Handle negative eigenvalues
    total = eigenvalues.sum()
    if total > 0:
        return eigenvalues / total
    return np.zeros_like(eigenvalues)


class LandmarkIsomap:
    """
    Complete Landmark Isomap implementation.

    Combines:
    1. k-NN graph construction
    2. Dijkstra shortest paths from landmarks
    3. Landmark MDS embedding
    """

    def __init__(self, n_neighbors=5, n_landmarks=100, n_components=2,
                 landmark_method='maxmin'):
        self.n_neighbors = n_neighbors
        self.n_landmarks = n_landmarks
        self.n_components = n_components
        self.landmark_method = landmark_method

    def fit_transform(self, X):
        """Fit and transform data."""
        from scipy.sparse.csgraph import dijkstra
        from sklearn.neighbors import kneighbors_graph

        N = X.shape[0]
        L = min(self.n_landmarks, N)

        # Step 1: Build a symmetric k-NN graph with distance weights
        adjacency = kneighbors_graph(X, self.n_neighbors, mode='distance',
                                     include_self=False)
        adjacency = (adjacency + adjacency.T) / 2  # Symmetrize

        # Step 2: Select landmarks
        if self.landmark_method == 'maxmin':
            landmarks = maxmin_landmarks(X, L)
        elif self.landmark_method == 'random':
            landmarks = random_landmarks(X, L)
        else:
            landmarks = kmeansplusplus_landmarks(X, L)
        self.landmarks_ = landmarks

        # Step 3: Compute geodesic distances from landmarks
        D_all = np.zeros((L, N))
        for i, landmark in enumerate(landmarks):
            D_all[i] = dijkstra(adjacency, indices=landmark)

        # Handle infinite distances (disconnected components)
        if np.isinf(D_all).any():
            print("Warning: Graph is disconnected. Using finite subgraph.")
            D_all = np.where(np.isinf(D_all),
                             D_all[np.isfinite(D_all)].max() * 2,
                             D_all)

        # Step 4: Extract landmark-to-landmark distances
        D_landmarks = D_all[:, landmarks]

        # Step 5: Landmark MDS
        self.embedding_, self.eigenvalues_ = landmark_mds(
            D_landmarks, D_all, self.n_components
        )
        return self.embedding_
```

The Nyström method is a general technique for approximating large kernel matrices using a subset of columns. In the context of MDS, the 'kernel matrix' is the doubly-centered squared distance matrix. The Nyström extension provides a principled way to embed out-of-sample points using the landmark structure—a technique also used in kernel methods and Gaussian processes for scalability.
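A short usage sketch on synthetic data, assuming the class and helpers above are in scope; the Swiss-roll dataset from scikit-learn and the parameter values are illustrative choices, not prescribed settings:

```python
from sklearn.datasets import make_swiss_roll

# Generate a 3-D Swiss roll and embed it in 2-D with Landmark Isomap
X, color = make_swiss_roll(n_samples=5000, noise=0.05, random_state=0)

iso = LandmarkIsomap(n_neighbors=10, n_landmarks=300, n_components=2,
                     landmark_method='maxmin')
Y = iso.fit_transform(X)

print(Y.shape)                                      # (5000, 2)
print(explained_variance_ratio(iso.eigenvalues_))   # share of variance per component
```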
A natural question: how does the Landmark Isomap embedding compare to classical Isomap? Several theoretical results provide reassurance.
**Approximation Error Bounds**

Let $\mathbf{Y}$ be the classical Isomap embedding and $\tilde{\mathbf{Y}}$ be the Landmark Isomap embedding. The error depends on:

1. **Number of landmarks $L$:** More landmarks → better approximation
2. **Landmark distribution:** Well-spread landmarks → lower error
3. **Data eigenspectrum:** Fast-decaying eigenvalues → better approximation

Under mild assumptions, the Nyström approximation error satisfies:

$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq \|\mathbf{K} - \mathbf{K}_k\|_F + \epsilon(L)$$

where $\mathbf{K}$ is the kernel (Gram) matrix from classical MDS, $\mathbf{K}_k$ is its best rank-$k$ approximation, and $\epsilon(L)$ is the Nyström approximation error, which decreases with $L$.
| Property | Classical Isomap | Landmark Isomap | Note |
|---|---|---|---|
| Exact distance preservation | Within embedding dimension limits | Approximate | Error bounded by Nyström theory |
| Eigenvector accuracy | Exact (up to precision) | $O(1/\sqrt{L})$ error | Converges as $L \to N$ |
| Out-of-sample extension | Requires recomputation | Natural via Nyström | Same formula used for all points |
| Residual variance | True residual | Approximate residual | Lower bound on true residual |
**Practical Error Levels**

Empirical studies show that Landmark Isomap typically achieves:

- Near-identical embeddings for $L \geq 0.1N$ (10% of data as landmarks)
- Visually indistinguishable results for $L \geq 200$–$500$ regardless of $N$
- Qualitatively correct embeddings even for $L \approx 100$

The quality depends heavily on the intrinsic dimensionality: low-dimensional manifolds require fewer landmarks to capture the structure.
A good rule of thumb: L = max(200, sqrt(N)). This scales sublinearly with data size while providing sufficient landmarks for accurate embedding. For critical applications, compare embeddings with different L values and check stability. If the embedding changes significantly when increasing L, more landmarks are needed.
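One concrete way to run that stability check is to compare embeddings computed with increasing $L$ after Procrustes alignment, which removes translation, scale, and rotation before measuring the difference. This is a minimal sketch, assuming the `LandmarkIsomap` class defined earlier on this page is in scope; the function name and the `L_values` schedule are illustrative:

```python
from scipy.spatial import procrustes

def embedding_stability(X, L_values, n_neighbors=10, n_components=2):
    """Procrustes disparity between embeddings at consecutive landmark counts.

    A disparity that keeps shrinking and levels off suggests L is large enough.
    """
    disparities = {}
    prev = None
    for L in L_values:
        Y = LandmarkIsomap(n_neighbors=n_neighbors, n_landmarks=L,
                           n_components=n_components).fit_transform(X)
        if prev is not None:
            # procrustes aligns the two configurations before comparing them
            _, _, disparity = procrustes(prev, Y)
            disparities[L] = disparity
        prev = Y
    return disparities

# Example (hypothetical data array X):
# print(embedding_stability(X, L_values=[100, 200, 400, 800]))
```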
Let's perform a detailed complexity analysis of Landmark Isomap compared to classical Isomap.
| Step | Classical Isomap | Landmark Isomap | Speedup |
|---|---|---|---|
| k-NN graph | $O(N^2 D)$ or $O(N \log N \cdot D)$ | Same | 1× |
| Shortest paths | $O(N^2 \log N)$ | $O(L N \log N)$ | $N/L$ |
| Distance storage | $O(N^2)$ | $O(LN)$ | $N/L$ |
| MDS: centering | $O(N^2)$ | $O(L^2)$ | $(N/L)^2$ |
| MDS: eigendecomp | $O(N^3)$ | $O(L^3)$ | $(N/L)^3$ |
| Nyström extension | — | $O(Nd L)$ | New step |
| Total | $O(N^3)$ | $O(L N \log N + L^3)$ | Huge |
**Concrete Numbers**

For $N = 100,000$, $L = 1,000$, $k = 10$, $d = 2$:

| Step | Classical | Landmark | Ratio |
|------|-----------|----------|-------|
| k-NN graph | ~10M ops | ~10M ops | 1× |
| Shortest paths | ~10¹¹ ops | ~10⁹ ops | 100× |
| Distance storage | 80 GB | 800 MB | 100× |
| MDS eigendecomp | ~10¹⁵ ops | ~10⁹ ops | 10⁶× |
| Nyström extension | — | ~2×10⁸ ops | — |
| Total runtime | Days | Minutes | ~1000× |
Landmark Isomap transforms a method that's impractical for N > 10,000 into one that comfortably handles N = 1,000,000+. The speedup comes primarily from (1) computing shortest paths from only L sources instead of N, and (2) performing eigendecomposition on an L × L matrix instead of N × N. Memory savings are equally dramatic.
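The shortest-path savings are easy to observe directly: restricting Dijkstra to $L$ source vertices instead of all $N$ scales the runtime by roughly $L/N$. Below is a rough timing sketch on a small synthetic dataset; the absolute numbers depend entirely on hardware, and $N$ is kept small only so the all-sources run finishes quickly:

```python
import time
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

X, _ = make_swiss_roll(n_samples=3000, random_state=0)
G = kneighbors_graph(X, 10, mode='distance')

t0 = time.perf_counter()
dijkstra(G, directed=False)                            # all-pairs: N source vertices
t_all = time.perf_counter() - t0

t0 = time.perf_counter()
dijkstra(G, directed=False, indices=np.arange(100))    # landmark case: L = 100 sources
t_land = time.perf_counter() - t0

print(f"all sources: {t_all:.2f}s, 100 landmarks: {t_land:.2f}s, "
      f"speedup ~{t_all / t_land:.0f}x")
```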
Several practical issues arise when implementing Landmark Isomap for real-world datasets:
```python
import numpy as np
from scipy.sparse.csgraph import dijkstra, connected_components
from scipy.linalg import eigh


def robust_landmark_isomap(X, n_neighbors=5, n_landmarks=100,
                           n_components=2, eigenvalue_tol=1e-10):
    """
    Robust Landmark Isomap with error handling.

    Handles:
    - Disconnected graphs
    - Negative eigenvalues
    - Numerical stability issues
    """
    from sklearn.neighbors import kneighbors_graph

    N = X.shape[0]
    L = min(n_landmarks, N)

    # Build k-NN graph
    adjacency = kneighbors_graph(X, n_neighbors, mode='distance',
                                 include_self=False)
    adjacency = (adjacency + adjacency.T) / 2  # Symmetrize

    # Check connectivity
    n_components_graph, labels = connected_components(
        adjacency, directed=False, return_labels=True
    )

    if n_components_graph > 1:
        # Use largest component only
        sizes = np.bincount(labels)
        largest = np.argmax(sizes)
        mask = labels == largest
        print(f"Graph has {n_components_graph} components. "
              f"Using largest ({sizes[largest]}/{N} points).")

        X_sub = X[mask]
        adjacency_sub = adjacency[mask][:, mask]
        N_sub = X_sub.shape[0]

        # Recompute landmarks on the connected subset
        landmarks = maxmin_landmarks(X_sub, min(L, N_sub))
    else:
        X_sub = X
        adjacency_sub = adjacency
        N_sub = N
        landmarks = maxmin_landmarks(X, L)
        mask = np.ones(N, dtype=bool)

    L = len(landmarks)

    # Compute geodesic distances from landmarks
    D_all = np.zeros((L, N_sub))
    for i, landmark in enumerate(landmarks):
        D_all[i] = dijkstra(adjacency_sub, indices=landmark)

    # Check for remaining infinite distances
    inf_mask = np.isinf(D_all)
    if inf_mask.any():
        max_finite = D_all[~inf_mask].max() if (~inf_mask).any() else 1.0
        D_all = np.where(inf_mask, max_finite * 2, D_all)
        print(f"Warning: {inf_mask.sum()} infinite distances replaced.")

    # Extract landmark-to-landmark distances
    D_landmarks = D_all[:, landmarks]

    # Landmark MDS with robust eigenvalue handling
    D_sq = D_landmarks ** 2
    row_mean = D_sq.mean(axis=1, keepdims=True)
    col_mean = D_sq.mean(axis=0, keepdims=True)
    total_mean = D_sq.mean()
    B = -0.5 * (D_sq - row_mean - col_mean + total_mean)

    eigenvalues, eigenvectors = eigh(B)

    # Sort descending and select top components
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Handle negative eigenvalues
    n_negative = (eigenvalues < 0).sum()
    if n_negative > 0:
        print(f"Clipping {n_negative} negative eigenvalues to zero.")
        eigenvalues = np.maximum(eigenvalues, 0)

    # Nyström extension with numerical stability
    Lambda_inv_sqrt = np.where(
        eigenvalues > eigenvalue_tol,
        1.0 / np.sqrt(eigenvalues),
        0
    )
    d_bar = D_sq.mean(axis=1)
    D_all_sq = D_all ** 2
    embedding = -0.5 * (D_all_sq.T - d_bar) @ (eigenvectors * Lambda_inv_sqrt)

    # Compute residual variance for diagnostics
    residual_variance = compute_residual_variance(D_all, embedding, landmarks)

    return embedding, eigenvalues, residual_variance, mask


def compute_residual_variance(D_true, embedding, landmark_indices):
    """
    Compute residual variance as a diagnostic.

    1 - R² between geodesic and embedded distances.
    """
    L, N = D_true.shape

    # Compute embedded distances to landmarks
    landmark_embedding = embedding[landmark_indices]
    D_embedded = np.zeros((L, N))
    for i in range(L):
        D_embedded[i] = np.linalg.norm(embedding - landmark_embedding[i], axis=1)

    # Compute R²
    D_true_flat = D_true.flatten()
    D_embedded_flat = D_embedded.flatten()

    ss_res = np.sum((D_true_flat - D_embedded_flat) ** 2)
    ss_tot = np.sum((D_true_flat - D_true_flat.mean()) ** 2)

    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0
    residual_variance = 1 - r_squared

    return residual_variance
```

We've covered Landmark Isomap, the critical optimization that makes Isomap practical for large datasets. Let's consolidate the key insights:

- Classical Isomap's all-pairs shortest paths, $O(N^2)$ distance storage, and $N \times N$ eigendecomposition make it impractical beyond roughly $N = 10,000$.
- Computing geodesic distances only from $L \ll N$ landmarks cuts time and memory by factors of $N/L$ or better.
- Landmark MDS embeds the landmarks with classical MDS and places every other point via the Nyström extension.
- Well-spread landmarks (MaxMin) and a budget of roughly $L = \max(200, \sqrt{N})$ typically give embeddings close to the classical result; stability across increasing $L$ is the practical check.
The next page covers Assumptions and Limitations—a critical examination of when Isomap works and when it fails. We'll discuss the convexity requirement, the short-circuit problem, sensitivity to noise and sampling, and how to diagnose failure modes. Understanding these limitations is essential for knowing when to use Isomap versus alternative methods.