Classical Isomap has a fundamental scaling problem. Computing all-pairs shortest paths in a graph with $N$ vertices requires $O(N^2 \log N)$ time using Dijkstra's algorithm from each vertex, or $O(N^3)$ using Floyd-Warshall. Storing the $N \times N$ distance matrix requires $O(N^2)$ memory. For a dataset with 100,000 points, this means storing 10 billion distances and performing trillions of operations.

Landmark Isomap solves this problem with an elegant idea: instead of computing geodesic distances between all pairs of points, compute distances only from a small set of landmark points to all other points. This reduces the complexity dramatically while preserving the essential geometric structure of the embedding.
By the end of this page, you will understand: (1) why classical Isomap doesn't scale and the computational bottlenecks, (2) the landmark selection strategies and their tradeoffs, (3) how landmark-based MDS (L-MDS) computes embeddings efficiently, (4) the theoretical guarantees and approximation quality of Landmark Isomap, and (5) practical implementation considerations for large-scale applications.
Let's analyze the computational complexity of each step in classical Isomap to understand where the bottlenecks lie.
| Step | Operation | Time Complexity | Space Complexity | Scalability |
|---|---|---|---|---|
| k-NN search (brute force) | $O(N^2 D)$ | $O(N^2)$ or $O(Nk)$ | Quadratic; acceptable only for moderate $N$ |
| k-NN search (tree-based) | $O(N \log N \cdot D)$ | $O(N)$ + $O(Nk)$ | Almost linear; good |
| Dijkstra from all sources | $O(N^2 \log N + NE)$ | $O(N^2)$ | Bottleneck |
| Floyd-Warshall | $O(N^3)$ | $O(N^2)$ | Bottleneck |
| Store all pairs | $O(N^2)$ | $O(N^2)$ | Memory bottleneck |
| Eigendecomposition of $N \times N$ matrix | $O(N^3)$ | $O(N^2)$ | Bottleneck |
**The Numbers Are Sobering**

For a dataset with $N = 100,000$ points:

- Shortest paths: $N^2 \log N \approx 10^{10} \cdot 17 \approx 1.7 \times 10^{11}$ operations
- Distance matrix: $N^2 = 10^{10}$ entries = 80 GB (double precision)
- Eigendecomposition: $N^3 = 10^{15}$ operations

Even with modern hardware, this is impractical. Classical Isomap typically becomes infeasible for $N > 10,000$ without significant computational resources.
Even if we could compute all-pairs shortest paths efficiently, storing the full N × N distance matrix is often impossible. For N = 1 million points, the distance matrix requires 8 TB of memory. This memory constraint is often more limiting than computational time for large-scale applications.
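These figures are simple arithmetic, and it can help to reproduce them yourself. The sketch below is a minimal back-of-the-envelope calculator; the function name `classical_isomap_cost` and the chosen dataset sizes are illustrative, and the outputs are order-of-magnitude operation counts, not benchmarks:

```python
import numpy as np

def classical_isomap_cost(N, bytes_per_entry=8):
    """Rough, order-of-magnitude cost estimates for classical Isomap."""
    ops_shortest_paths = N**2 * np.log2(N)    # Dijkstra from every vertex on a sparse graph
    ops_eigendecomp = N**3                    # dense eigendecomposition of the N x N matrix
    memory_gb = N**2 * bytes_per_entry / 1e9  # full distance matrix in double precision
    return ops_shortest_paths, ops_eigendecomp, memory_gb

for N in (10_000, 100_000, 1_000_000):
    sp, eig, mem = classical_isomap_cost(N)
    print(f"N={N:>9,}: shortest paths ~{sp:.1e} ops, "
          f"eigendecomposition ~{eig:.1e} ops, distance matrix ~{mem:,.0f} GB")
```

For $N = 100{,}000$ this reproduces the numbers above (~$1.7 \times 10^{11}$ path operations, 80 GB of storage), and for $N = 10^6$ the distance matrix alone reaches 8 TB.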
Landmark Isomap, introduced by de Silva and Tenenbaum (2003), exploits a key observation: if we know the distances from all points to a subset of well-chosen landmarks, we can approximately reconstruct the full distance structure.

**The Core Idea**

Select $L \ll N$ landmark points $\{\ell_1, \ldots, \ell_L\}$. Instead of computing the full $N \times N$ geodesic distance matrix, compute only:

$$\mathbf{D}_L = \left[\, d_G(\ell_i, \mathbf{x}_j) \,\right]_{i=1,\ldots,L;\; j=1,\ldots,N}$$

This is an $L \times N$ matrix, dramatically smaller than $N \times N$ when $L \ll N$.
**Complexity Reduction**

With $L$ landmarks:

| Operation | Classical | Landmark | Reduction Factor |
|-----------|-----------|----------|------------------|
| Shortest paths | $O(N^2 \log N)$ | $O(LN \log N)$ | $N/L$ |
| Distance storage | $O(N^2)$ | $O(LN)$ | $N/L$ |
| MDS embedding | $O(N^3)$ | $O(L^2 N + L^3)$ | $(N/L)^2$ to $(N/L)^3$ |

For $N = 100,000$ and $L = 1,000$, this represents a 100× to 10,000× speedup depending on the operation.
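In practice the $L \times N$ matrix falls out of a standard shortest-path routine simply by restricting the source vertices. A minimal sketch using SciPy (the helper name and parameter values are illustrative; `X` is assumed to be an `(N, D)` data array and `landmarks` a list of $L$ indices):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

def landmark_geodesic_distances(X, landmarks, n_neighbors=10):
    """Return the (L, N) matrix of graph shortest-path distances from the landmarks."""
    graph = kneighbors_graph(X, n_neighbors, mode='distance')  # sparse k-NN graph
    # Running Dijkstra only from the landmark rows yields an (L, N) result
    # instead of the full (N, N) all-pairs matrix.
    return dijkstra(graph, directed=False, indices=landmarks)

# Shape check:
# D_L = landmark_geodesic_distances(X, landmarks)   # D_L.shape == (len(landmarks), len(X))
```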
Think of landmarks as GPS satellites. If you know your distance to several well-positioned satellites, you can determine your position without knowing distances to all other points on Earth. Similarly, knowing distances to landmarks allows us to determine each point's position in the embedding space.
The choice of landmarks significantly impacts embedding quality. Several strategies exist, each with distinct properties:
| Strategy | Method | Complexity | Pros | Cons |
|---|---|---|---|---|
| Random | Uniform random sample | $O(L)$ | Simple, unbiased | May miss sparse regions |
| MaxMin | Iteratively add farthest point | $O(LN)$ | Good coverage | Somewhat expensive |
| K-means++ | K-means++ initialization | $O(LN)$ | Density-aware spread | Biased toward dense regions |
| Farthest Point Sampling | Greedy max-min distance | $O(L N \log N)$ | Optimal spread | Expensive for large L |
| Column Sampling | Based on distance matrix leverage scores | $O(N^2)$ | Theoretically optimal | Requires full matrix (defeats purpose) |
```python
import numpy as np
from scipy.spatial.distance import cdist


def random_landmarks(X, n_landmarks, seed=None):
    """
    Select landmarks uniformly at random.

    Parameters:
        X: (N, D) array of data points
        n_landmarks: Number of landmarks L
        seed: Random seed for reproducibility

    Returns:
        landmark_indices: Array of L indices into X
    """
    rng = np.random.RandomState(seed)
    return rng.choice(X.shape[0], size=n_landmarks, replace=False)


def maxmin_landmarks(X, n_landmarks, seed=None):
    """
    Select landmarks using the MaxMin (Gonzalez) algorithm.

    Iteratively selects the point farthest from all current landmarks.
    Provides good coverage of the data distribution.

    Parameters:
        X: (N, D) array of data points
        n_landmarks: Number of landmarks L
        seed: Random seed for first landmark

    Returns:
        landmark_indices: Array of L indices into X
    """
    N = X.shape[0]
    rng = np.random.RandomState(seed)

    # Initialize with a random first landmark
    landmarks = [rng.randint(N)]

    # Track minimum distance to any landmark for each point
    min_distances = np.full(N, np.inf)

    for _ in range(n_landmarks - 1):
        # Update min distances with the last added landmark
        last_landmark = landmarks[-1]
        dists = np.linalg.norm(X - X[last_landmark], axis=1)
        min_distances = np.minimum(min_distances, dists)

        # Exclude existing landmarks
        min_distances[landmarks] = -np.inf

        # Select the point with maximum min-distance
        next_landmark = np.argmax(min_distances)
        landmarks.append(next_landmark)

    return np.array(landmarks)


def kmeansplusplus_landmarks(X, n_landmarks, seed=None):
    """
    Select landmarks using k-means++ initialization.

    Probabilistically selects points proportional to squared
    distance from the nearest existing landmark.

    Parameters:
        X: (N, D) array of data points
        n_landmarks: Number of landmarks L
        seed: Random seed

    Returns:
        landmark_indices: Array of L indices into X
    """
    N = X.shape[0]
    rng = np.random.RandomState(seed)

    # Initialize with a random first landmark
    landmarks = [rng.randint(N)]

    for _ in range(n_landmarks - 1):
        # Compute squared distances to the nearest landmark
        landmark_points = X[landmarks]
        dists = cdist(X, landmark_points, metric='sqeuclidean')
        min_sq_dists = dists.min(axis=1)

        # Zero out existing landmarks
        min_sq_dists[landmarks] = 0

        # Sample proportional to squared distance
        probs = min_sq_dists / min_sq_dists.sum()
        next_landmark = rng.choice(N, p=probs)
        landmarks.append(next_landmark)

    return np.array(landmarks)


def compare_landmark_strategies(X, n_landmarks, n_trials=10):
    """
    Compare landmark selection strategies by measuring coverage.

    Coverage is measured as the average distance from each point
    to its nearest landmark (lower is better).
    """
    results = {}

    for name, func in [('Random', random_landmarks),
                       ('MaxMin', maxmin_landmarks),
                       ('K-means++', kmeansplusplus_landmarks)]:
        coverages = []
        for trial in range(n_trials):
            landmarks = func(X, n_landmarks, seed=trial)
            landmark_points = X[landmarks]
            dists = cdist(X, landmark_points)
            min_dists = dists.min(axis=1)
            coverages.append(min_dists.mean())

        results[name] = {
            'mean_coverage': np.mean(coverages),
            'std_coverage': np.std(coverages)
        }

    return results
```

For most applications, MaxMin provides the best tradeoff between quality and simplicity. Random sampling works surprisingly well when L is large enough (L ≥ 200). K-means++ is preferred when the data has significant density variations. The original Landmark Isomap paper used random selection, which is the default in many implementations.
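A quick usage sketch, assuming the functions above are in scope; the synthetic data and parameter values are purely illustrative:

```python
import numpy as np

# Synthetic data just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))

landmarks = maxmin_landmarks(X, n_landmarks=100, seed=0)
print(landmarks.shape)  # (100,)

# Compare coverage of the three strategies
for name, stats in compare_landmark_strategies(X, n_landmarks=100, n_trials=5).items():
    print(f"{name:10s} mean coverage: {stats['mean_coverage']:.3f} "
          f"(± {stats['std_coverage']:.3f})")
```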
Once we have geodesic distances from landmarks to all points, we need an embedding algorithm that works with this partial distance information. Landmark MDS (L-MDS), also known as the Nyström extension for MDS, provides the solution.

**The Two-Phase Approach**

1. **Phase 1:** Embed the landmarks using classical MDS on the $L \times L$ landmark-to-landmark distance matrix
2. **Phase 2:** Embed all other points by projecting them into the landmark embedding space using the distance-to-landmark information
**Phase 1: Landmark Embedding via Classical MDS**

Let $\mathbf{D}_{LL}$ be the $L \times L$ matrix of geodesic distances between landmarks. Classical MDS finds coordinates $\mathbf{Y}_L = [\mathbf{y}_1, \ldots, \mathbf{y}_L]^\top$ such that Euclidean distances in the embedding approximate geodesic distances:

$$\|\mathbf{y}_i - \mathbf{y}_j\| \approx d_G(\ell_i, \ell_j)$$

The classical MDS algorithm:

1. Compute the squared distance matrix $\mathbf{D}_{LL}^{(2)}$, where $D_{ij}^{(2)} = d_G(\ell_i, \ell_j)^2$

2. Double-center the matrix: $\mathbf{B} = -\frac{1}{2} \mathbf{H} \mathbf{D}_{LL}^{(2)} \mathbf{H}$, where $\mathbf{H} = \mathbf{I} - \frac{1}{L}\mathbf{1}\mathbf{1}^\top$ is the centering matrix

3. Eigendecompose: $\mathbf{B} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top$

4. Landmark coordinates: $\mathbf{Y}_L = \mathbf{V}_d \boldsymbol{\Lambda}_d^{1/2}$, where the subscript $d$ denotes the top $d$ eigenpairs
**Phase 2: Out-of-Sample Extension (Nyström Method)**

For each non-landmark point $\mathbf{x}_n$, we have distances to all landmarks: $\mathbf{d}_n = [d_G(\mathbf{x}_n, \ell_1), \ldots, d_G(\mathbf{x}_n, \ell_L)]^\top$.

The Nyström extension projects this point into the embedding space:

$$\mathbf{y}_n = \frac{1}{2} \boldsymbol{\Lambda}_d^{-1/2} \mathbf{V}_d^\top \left( \bar{\mathbf{d}}_L - \mathbf{d}_n^{(2)} \right)$$

where:

- $\mathbf{d}_n^{(2)} = [d_G(\mathbf{x}_n, \ell_1)^2, \ldots, d_G(\mathbf{x}_n, \ell_L)^2]^\top$ is the vector of squared distances
- $\bar{\mathbf{d}}_L = \frac{1}{L} \sum_j \mathbf{D}_{LL}^{(2)}[:, j]$ is the column mean of the landmark squared distance matrix

This formula positions $\mathbf{x}_n$ to best preserve its distances to the landmarks in the embedding space.
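One way to build intuition for this formula is to apply it to a landmark itself: because $\mathbf{B}$ is double-centered, the extension exactly reproduces that landmark's Phase 1 coordinate. Below is a minimal numerical check on a synthetic configuration (variable names are illustrative, not part of any library API):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 20, 2
Y_true = rng.normal(size=(L, d))                                   # planted landmark layout
D_sq = ((Y_true[:, None, :] - Y_true[None, :, :]) ** 2).sum(-1)    # squared distances

# Phase 1: classical MDS on the landmarks
H = np.eye(L) - np.ones((L, L)) / L
B = -0.5 * H @ D_sq @ H
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:d]
lam, Vd = w[idx], V[:, idx]
Y_mds = Vd * np.sqrt(lam)

# Phase 2: Nyström formula applied to landmark 0's own squared-distance column
d_bar = D_sq.mean(axis=1)
y0 = 0.5 * (Vd / np.sqrt(lam)).T @ (d_bar - D_sq[:, 0])

print(np.allclose(y0, Y_mds[0]))  # True: the extension recovers the MDS coordinate
```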
```python
import numpy as np
from scipy.linalg import eigh


def landmark_mds(D_landmarks, D_all, n_components):
    """
    Perform Landmark MDS embedding.

    Parameters:
        D_landmarks: (L, L) geodesic distance matrix between landmarks
        D_all: (L, N) geodesic distances from landmarks to all points
        n_components: Target embedding dimension d

    Returns:
        embedding: (N, n_components) embedded coordinates
        eigenvalues: Eigenvalues (for explained variance)
    """
    L, N = D_all.shape

    # Phase 1: Embed landmarks using classical MDS
    # Step 1: Square the distances
    D_sq = D_landmarks ** 2

    # Step 2: Double centering
    # B = -0.5 * H @ D_sq @ H, where H = I - (1/L) * 1 @ 1.T
    row_mean = D_sq.mean(axis=1, keepdims=True)
    col_mean = D_sq.mean(axis=0, keepdims=True)
    total_mean = D_sq.mean()
    B = -0.5 * (D_sq - row_mean - col_mean + total_mean)

    # Step 3: Eigendecomposition
    # Use 'eigh' for symmetric matrices; it returns eigenvalues in ascending order
    eigenvalues, eigenvectors = eigh(B)

    # Select top n_components (largest eigenvalues)
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Handle potential negative eigenvalues (due to non-Euclidean distances)
    eigenvalues = np.maximum(eigenvalues, 0)

    # Step 4: Landmark coordinates
    Y_landmarks = eigenvectors * np.sqrt(eigenvalues)

    # Phase 2: Embed all points using the Nyström extension
    # Squared distances from all points to landmarks
    D_all_sq = D_all ** 2  # (L, N)

    # Column mean of the landmark squared distance matrix
    d_bar = D_sq.mean(axis=1)  # (L,)

    # Nyström extension: y_n = (1/2) * Lambda^(-1/2) @ V.T @ (d_bar - d_n^2)
    # Rearranged in matrix form: embedding = -0.5 * (D_all_sq.T - d_bar) @ V @ Lambda^(-1/2)
    # For numerical stability, zero out components with (near-)zero eigenvalues
    Lambda_inv_sqrt = np.where(eigenvalues > 1e-10, 1.0 / np.sqrt(eigenvalues), 0)
    embedding = -0.5 * (D_all_sq.T - d_bar) @ (eigenvectors * Lambda_inv_sqrt)

    return embedding, eigenvalues


def explained_variance_ratio(eigenvalues):
    """Compute explained variance ratio for each component."""
    eigenvalues = np.maximum(eigenvalues, 0)  # Handle negative eigenvalues
    total = eigenvalues.sum()
    if total > 0:
        return eigenvalues / total
    return np.zeros_like(eigenvalues)


class LandmarkIsomap:
    """
    Complete Landmark Isomap implementation.

    Combines:
    1. k-NN graph construction
    2. Dijkstra shortest paths from landmarks
    3. Landmark MDS embedding
    """

    def __init__(self, n_neighbors=5, n_landmarks=100, n_components=2,
                 landmark_method='maxmin'):
        self.n_neighbors = n_neighbors
        self.n_landmarks = n_landmarks
        self.n_components = n_components
        self.landmark_method = landmark_method

    def fit_transform(self, X):
        """Fit and transform data."""
        from scipy.sparse.csgraph import dijkstra
        from sklearn.neighbors import kneighbors_graph

        N = X.shape[0]
        L = min(self.n_landmarks, N)

        # Step 1: Build a symmetric k-NN graph with distance weights
        adjacency = kneighbors_graph(X, self.n_neighbors, mode='distance',
                                     include_self=False)
        adjacency = (adjacency + adjacency.T) / 2  # Symmetrize

        # Step 2: Select landmarks
        if self.landmark_method == 'maxmin':
            landmarks = maxmin_landmarks(X, L)
        elif self.landmark_method == 'random':
            landmarks = random_landmarks(X, L)
        else:
            landmarks = kmeansplusplus_landmarks(X, L)
        self.landmarks_ = landmarks

        # Step 3: Compute geodesic distances from landmarks
        D_all = np.zeros((L, N))
        for i, landmark in enumerate(landmarks):
            D_all[i] = dijkstra(adjacency, indices=landmark)

        # Handle infinite distances (disconnected components)
        if np.isinf(D_all).any():
            print("Warning: Graph is disconnected. Using finite subgraph.")
            D_all = np.where(np.isinf(D_all),
                             D_all[np.isfinite(D_all)].max() * 2,
                             D_all)

        # Step 4: Extract landmark-to-landmark distances
        D_landmarks = D_all[:, landmarks]

        # Step 5: Landmark MDS
        self.embedding_, self.eigenvalues_ = landmark_mds(
            D_landmarks, D_all, self.n_components
        )
        return self.embedding_
```

The Nyström method is a general technique for approximating large kernel matrices using a subset of columns. In the context of MDS, the 'kernel matrix' is the doubly-centered squared distance matrix. The Nyström extension provides a principled way to embed out-of-sample points using the landmark structure—a technique also used in kernel methods and Gaussian processes for scalability.
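A short usage sketch on synthetic data, assuming the class and helpers above are in scope; the Swiss-roll dataset from scikit-learn and the parameter values are illustrative choices, not prescribed settings:

```python
from sklearn.datasets import make_swiss_roll

# Generate a 3-D Swiss roll and embed it in 2-D with Landmark Isomap
X, color = make_swiss_roll(n_samples=5000, noise=0.05, random_state=0)

iso = LandmarkIsomap(n_neighbors=10, n_landmarks=300, n_components=2,
                     landmark_method='maxmin')
Y = iso.fit_transform(X)

print(Y.shape)                                      # (5000, 2)
print(explained_variance_ratio(iso.eigenvalues_))   # share of variance per component
```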
A natural question: how does the Landmark Isomap embedding compare to classical Isomap? Several theoretical results provide reassurance.
**Approximation Error Bounds**

Let $\mathbf{Y}$ be the classical Isomap embedding and $\tilde{\mathbf{Y}}$ be the Landmark Isomap embedding. The error depends on:

1. **Number of landmarks $L$:** More landmarks → better approximation
2. **Landmark distribution:** Well-spread landmarks → lower error
3. **Data eigenspectrum:** Fast-decaying eigenvalues → better approximation

Under mild assumptions, the Nyström approximation error satisfies:

$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq \|\mathbf{K} - \mathbf{K}_k\|_F + \epsilon(L)$$

where $\mathbf{K}$ is the kernel (Gram) matrix from classical MDS, $\mathbf{K}_k$ is its best rank-$k$ approximation, and $\epsilon(L)$ is the Nyström approximation error, which decreases with $L$.
| Property | Classical Isomap | Landmark Isomap | Note |
|---|---|---|---|
| Exact distance preservation | Within embedding dimension limits | Approximate | Error bounded by Nyström theory |
| Eigenvector accuracy | Exact (up to precision) | $O(1/\sqrt{L})$ error | Converges as $L \to N$ |
| Out-of-sample extension | Requires recomputation | Natural via Nyström | Same formula used for all points |
| Residual variance | True residual | Approximate residual | Lower bound on true residual |
**Practical Error Levels**

Empirical studies show that Landmark Isomap typically achieves:

- Near-identical embeddings for $L \geq 0.1N$ (10% of data as landmarks)
- Visually indistinguishable results for $L \geq 200$–$500$ regardless of $N$
- Qualitatively correct embeddings even for $L \approx 100$

The quality depends heavily on the intrinsic dimensionality: low-dimensional manifolds require fewer landmarks to capture the structure.
A good rule of thumb: L = max(200, sqrt(N)). This scales sublinearly with data size while providing sufficient landmarks for accurate embedding. For critical applications, compare embeddings with different L values and check stability. If the embedding changes significantly when increasing L, more landmarks are needed.
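One concrete way to run that stability check is to compare embeddings computed with increasing $L$ after Procrustes alignment, which removes translation, scale, and rotation before measuring the difference. This is a minimal sketch, assuming the `LandmarkIsomap` class defined earlier on this page is in scope; the function name and the `L_values` schedule are illustrative:

```python
from scipy.spatial import procrustes

def embedding_stability(X, L_values, n_neighbors=10, n_components=2):
    """Procrustes disparity between embeddings at consecutive landmark counts.

    A disparity that keeps shrinking and levels off suggests L is large enough.
    """
    disparities = {}
    prev = None
    for L in L_values:
        Y = LandmarkIsomap(n_neighbors=n_neighbors, n_landmarks=L,
                           n_components=n_components).fit_transform(X)
        if prev is not None:
            # procrustes aligns the two configurations before comparing them
            _, _, disparity = procrustes(prev, Y)
            disparities[L] = disparity
        prev = Y
    return disparities

# Example (hypothetical data array X):
# print(embedding_stability(X, L_values=[100, 200, 400, 800]))
```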
Let's perform a detailed complexity analysis of Landmark Isomap compared to classical Isomap.
| Step | Classical Isomap | Landmark Isomap | Speedup |
|---|---|---|---|
| k-NN graph | $O(N^2 D)$ or $O(N \log N \cdot D)$ | Same | 1× |
| Shortest paths | $O(N^2 \log N)$ | $O(L N \log N)$ | $N/L$ |
| Distance storage | $O(N^2)$ | $O(LN)$ | $N/L$ |
| MDS: centering | $O(N^2)$ | $O(L^2)$ | $(N/L)^2$ |
| MDS: eigendecomp | $O(N^3)$ | $O(L^3)$ | $(N/L)^3$ |
| Nyström extension | — | $O(Nd L)$ | New step |
| Total | $O(N^3)$ | $O(L N \log N + L^3)$ | Huge |
**Concrete Numbers**

For $N = 100,000$, $L = 1,000$, $k = 10$, $d = 2$:

| Step | Classical | Landmark | Ratio |
|------|-----------|----------|-------|
| k-NN graph | ~10M ops | ~10M ops | 1× |
| Shortest paths | ~10¹¹ ops | ~10⁹ ops | 100× |
| Distance storage | 80 GB | 800 MB | 100× |
| MDS eigendecomp | ~10¹⁵ ops | ~10⁹ ops | 10⁶× |
| Nyström extension | — | ~2×10⁸ ops | — |
| Total runtime | Days | Minutes | ~1000× |
Landmark Isomap transforms a method that's impractical for N > 10,000 into one that comfortably handles N = 1,000,000+. The speedup comes primarily from (1) computing shortest paths from only L sources instead of N, and (2) performing eigendecomposition on an L × L matrix instead of N × N. Memory savings are equally dramatic.
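The shortest-path savings are easy to observe directly: restricting Dijkstra to $L$ source vertices instead of all $N$ scales the runtime by roughly $L/N$. Below is a rough timing sketch on a small synthetic dataset; the absolute numbers depend entirely on hardware, and $N$ is kept small only so the all-sources run finishes quickly:

```python
import time
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

X, _ = make_swiss_roll(n_samples=3000, random_state=0)
G = kneighbors_graph(X, 10, mode='distance')

t0 = time.perf_counter()
dijkstra(G, directed=False)                            # all-pairs: N source vertices
t_all = time.perf_counter() - t0

t0 = time.perf_counter()
dijkstra(G, directed=False, indices=np.arange(100))    # landmark case: L = 100 sources
t_land = time.perf_counter() - t0

print(f"all sources: {t_all:.2f}s, 100 landmarks: {t_land:.2f}s, "
      f"speedup ~{t_all / t_land:.0f}x")
```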
Several practical issues arise when implementing Landmark Isomap for real-world datasets:
```python
import numpy as np
from scipy.sparse.csgraph import dijkstra, connected_components
from scipy.linalg import eigh


def robust_landmark_isomap(X, n_neighbors=5, n_landmarks=100,
                           n_components=2, eigenvalue_tol=1e-10):
    """
    Robust Landmark Isomap with error handling.

    Handles:
    - Disconnected graphs
    - Negative eigenvalues
    - Numerical stability issues
    """
    from sklearn.neighbors import kneighbors_graph

    N = X.shape[0]
    L = min(n_landmarks, N)

    # Build k-NN graph
    adjacency = kneighbors_graph(X, n_neighbors, mode='distance',
                                 include_self=False)
    adjacency = (adjacency + adjacency.T) / 2  # Symmetrize

    # Check connectivity
    n_components_graph, labels = connected_components(
        adjacency, directed=False, return_labels=True
    )

    if n_components_graph > 1:
        # Use largest component only
        sizes = np.bincount(labels)
        largest = np.argmax(sizes)
        mask = labels == largest
        print(f"Graph has {n_components_graph} components. "
              f"Using largest ({sizes[largest]}/{N} points).")

        X_sub = X[mask]
        adjacency_sub = adjacency[mask][:, mask]
        N_sub = X_sub.shape[0]

        # Recompute landmarks on the connected subset
        landmarks = maxmin_landmarks(X_sub, min(L, N_sub))
    else:
        X_sub = X
        adjacency_sub = adjacency
        N_sub = N
        landmarks = maxmin_landmarks(X, L)
        mask = np.ones(N, dtype=bool)

    L = len(landmarks)

    # Compute geodesic distances from landmarks
    D_all = np.zeros((L, N_sub))
    for i, landmark in enumerate(landmarks):
        D_all[i] = dijkstra(adjacency_sub, indices=landmark)

    # Check for remaining infinite distances
    inf_mask = np.isinf(D_all)
    if inf_mask.any():
        max_finite = D_all[~inf_mask].max() if (~inf_mask).any() else 1.0
        D_all = np.where(inf_mask, max_finite * 2, D_all)
        print(f"Warning: {inf_mask.sum()} infinite distances replaced.")

    # Extract landmark-to-landmark distances
    D_landmarks = D_all[:, landmarks]

    # Landmark MDS with robust eigenvalue handling
    D_sq = D_landmarks ** 2
    row_mean = D_sq.mean(axis=1, keepdims=True)
    col_mean = D_sq.mean(axis=0, keepdims=True)
    total_mean = D_sq.mean()
    B = -0.5 * (D_sq - row_mean - col_mean + total_mean)

    eigenvalues, eigenvectors = eigh(B)

    # Sort descending and select top components
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Handle negative eigenvalues
    n_negative = (eigenvalues < 0).sum()
    if n_negative > 0:
        print(f"Clipping {n_negative} negative eigenvalues to zero.")
        eigenvalues = np.maximum(eigenvalues, 0)

    # Nyström extension with numerical stability
    Lambda_inv_sqrt = np.where(
        eigenvalues > eigenvalue_tol,
        1.0 / np.sqrt(eigenvalues),
        0
    )
    d_bar = D_sq.mean(axis=1)
    D_all_sq = D_all ** 2
    embedding = -0.5 * (D_all_sq.T - d_bar) @ (eigenvectors * Lambda_inv_sqrt)

    # Compute residual variance for diagnostics
    residual_variance = compute_residual_variance(D_all, embedding, landmarks)

    return embedding, eigenvalues, residual_variance, mask


def compute_residual_variance(D_true, embedding, landmark_indices):
    """
    Compute residual variance as a diagnostic.

    1 - R² between geodesic and embedded distances.
    """
    L, N = D_true.shape

    # Compute embedded distances to landmarks
    landmark_embedding = embedding[landmark_indices]
    D_embedded = np.zeros((L, N))
    for i in range(L):
        D_embedded[i] = np.linalg.norm(embedding - landmark_embedding[i], axis=1)

    # Compute R²
    D_true_flat = D_true.flatten()
    D_embedded_flat = D_embedded.flatten()

    ss_res = np.sum((D_true_flat - D_embedded_flat) ** 2)
    ss_tot = np.sum((D_true_flat - D_true_flat.mean()) ** 2)

    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0
    residual_variance = 1 - r_squared

    return residual_variance
```

We've covered Landmark Isomap, the critical optimization that makes Isomap practical for large datasets. Let's consolidate the key insights:

- Classical Isomap's all-pairs shortest paths, $O(N^2)$ distance storage, and $N \times N$ eigendecomposition make it impractical beyond roughly $N = 10,000$.
- Computing geodesic distances only from $L \ll N$ landmarks cuts time and memory by factors of $N/L$ or better.
- Landmark MDS embeds the landmarks with classical MDS and places every other point via the Nyström extension.
- Well-spread landmarks (MaxMin) and a budget of roughly $L = \max(200, \sqrt{N})$ typically give embeddings close to the classical result; stability across increasing $L$ is the practical check.
The next page covers Assumptions and Limitations—a critical examination of when Isomap works and when it fails. We'll discuss the convexity requirement, the short-circuit problem, sensitivity to noise and sampling, and how to diagnose failure modes. Understanding these limitations is essential for knowing when to use Isomap versus alternative methods.