Imagine you're an archaeologist who has discovered a collection of ancient artifacts at various excavation sites. You don't have GPS coordinates—only a table recording the physical distances between every pair of sites, painstakingly measured by foot. Your goal: reconstruct the spatial layout of these sites on a map. Given only pairwise distances, can you recover the original positions?
This is precisely the problem that Multidimensional Scaling (MDS) solves. It reconstructs a configuration of points in low-dimensional space such that the distances between points match, as closely as possible, a given set of observed dissimilarities. MDS is one of the oldest and most elegant dimensionality reduction techniques, predating PCA's modern formulation and serving as the conceptual ancestor of manifold learning methods like Isomap.
Classical MDS, also known as Principal Coordinate Analysis (PCoA) or Torgerson Scaling, represents the purest form of this idea: when distances are Euclidean, the reconstruction is exact and can be computed via a simple eigenvalue decomposition.
By the end of this page, you will understand the complete mathematical derivation of Classical MDS, master the double centering technique that converts distances to inner products, derive the eigendecomposition solution, and recognize when Classical MDS provides an exact reconstruction versus an approximation.
Multidimensional Scaling emerged from the intersection of psychophysics and statistics in the mid-20th century. Psychologists sought to understand how humans perceive similarity—between colors, sounds, concepts, or products. Unlike physical measurements, psychological similarity doesn't yield direct coordinates. Researchers could only ask subjects to rate how similar or different pairs of stimuli were, producing dissimilarity matrices rather than feature vectors.
The fundamental question: Given only pairwise dissimilarities, can we infer an underlying geometric structure?
The breakthrough came from Warren Torgerson in 1952, who showed that when dissimilarities correspond to Euclidean distances, the original configuration can be exactly recovered (up to rotation, reflection, and translation). This result, now called Classical MDS, remains the theoretical foundation for all subsequent MDS variants.
Key insight: Classical MDS doesn't require raw features—only distances. This makes it applicable in domains where no natural coordinate representation exists and only pairwise dissimilarities can be measured, such as perceptual similarity ratings or other comparison-based data.
When applied to Euclidean distances computed from centered data, Classical MDS produces exactly the same coordinates as PCA. This equivalence reveals deep connections between distance-based and variance-maximizing perspectives on dimensionality reduction.
Let's formalize the Classical MDS problem with mathematical precision.
Given: an $n \times n$ dissimilarity matrix $D = (d_{ij})$, symmetric with zero diagonal and non-negative entries.
Goal: find points $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^k$ (with $k$ small, typically 2 or 3) whose pairwise Euclidean distances reproduce the dissimilarities:
$$|\mathbf{x}_i - \mathbf{x}_j|_2 \approx d_{ij}$$
The special case: When the dissimilarities are actually Euclidean distances—meaning there exists some configuration $Z$ in $\mathbb{R}^p$ (possibly high-dimensional) such that $d_{ij} = |\mathbf{z}_i - \mathbf{z}_j|_2$—then the matrix $D$ is called Euclidean. For Euclidean distance matrices, Classical MDS can recover $Z$ exactly (up to the intrinsic dimensionality).
Not every symmetric matrix with zero diagonal is Euclidean. For example, dissimilarities that violate the triangle inequality—say $d_{12} = d_{13} = 1$ but $d_{23} = 3$—cannot arise from any configuration of points in Euclidean space. Classical MDS works optimally when the input is truly Euclidean; otherwise, it provides the best Euclidean approximation (possibly with negative eigenvalues).
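As a quick check, here is a minimal sketch (using NumPy, with an illustrative $3 \times 3$ dissimilarity matrix that violates the triangle inequality) showing how a negative eigenvalue of the double-centered matrix signals a non-Euclidean input:

```python
import numpy as np

# Hypothetical dissimilarities violating the triangle inequality:
# d(1,3) = 3 > d(1,2) + d(2,3) = 2, so no Euclidean configuration exists.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])

D2 = D ** 2
H = np.eye(3) - np.ones((3, 3)) / 3   # centering matrix
B = -0.5 * H @ D2 @ H                 # double centering (explained below)
print(np.linalg.eigvalsh(B))          # one eigenvalue is negative
```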
Key mathematical observation:
The relationship between distances and inner products is the key to Classical MDS. If we know the pairwise distances, can we recover the inner products (the Gram matrix)? The answer is yes, through a technique called double centering.
Let $X \in \mathbb{R}^{n \times p}$ be a data matrix (rows are observations). The squared distance between points $i$ and $j$ is:
$$d_{ij}^2 = |\mathbf{x}_i - \mathbf{x}_j|^2 = |\mathbf{x}_i|^2 - 2\mathbf{x}_i^T\mathbf{x}_j + |\mathbf{x}_j|^2$$
This equation links squared distances to inner products. If we had a way to isolate the inner product term $\mathbf{x}_i^T\mathbf{x}_j$, we could recover the Gram matrix $B = XX^T$, and from $B$, we can extract coordinates via eigendecomposition.
The double centering transformation is the mathematical heart of Classical MDS. It converts a squared distance matrix into a matrix of inner products, enabling eigendecomposition.
Setup: Let $D^{(2)} \in \mathbb{R}^{n \times n}$ be the squared distance matrix with entries $d_{ij}^2$.
The centering matrix: Define the centering matrix:
$$H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$$
where $I_n$ is the $n \times n$ identity matrix and $\mathbf{1}$ is a vector of ones. The matrix $H$ is symmetric and idempotent ($H^2 = H$). When applied to a data matrix, it centers the columns to have zero mean.
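A quick numerical sanity check of these properties (a minimal NumPy sketch; the sizes are arbitrary):

```python
import numpy as np

n = 5
H = np.eye(n) - np.ones((n, n)) / n

print(np.allclose(H @ H, H))    # idempotent: H^2 = H
print(np.allclose(H, H.T))      # symmetric

X = np.random.randn(n, 3)
Xc = H @ X                      # applying H centers the columns
print(np.allclose(Xc.mean(axis=0), 0.0))
```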
The double centering transformation:
$$B = -\frac{1}{2} H D^{(2)} H$$
Theorem (Torgerson, 1952): If $D^{(2)}$ is a squared Euclidean distance matrix derived from centered data $X$ (with $X^T\mathbf{1} = \mathbf{0}$), then:
$$B = XX^T$$
That is, double centering recovers the Gram matrix—the matrix of all pairwise inner products.
Double centering removes the norm terms that pollute the distance-to-inner-product relationship. By subtracting row means, column means, and adding the grand mean, we isolate the cross-term that contains the inner product information. It's analogous to how we center data to compute covariance from raw second moments.
Derivation of Double Centering:
Let's prove why double centering works. Start with the squared distance:
$$d_{ij}^2 = |\mathbf{x}_i|^2 - 2\mathbf{x}_i^T\mathbf{x}_j + |\mathbf{x}_j|^2$$
Define $a_i = |\mathbf{x}_i|^2$. Then:
$$d_{ij}^2 = a_i + a_j - 2b_{ij}$$
where $b_{ij} = \mathbf{x}_i^T\mathbf{x}_j$ is the $(i,j)$ entry of the Gram matrix.
Row mean of squared distances: $$\bar{d}_{i\cdot}^2 = \frac{1}{n}\sum_j d_{ij}^2 = a_i + \bar{a} - \frac{2}{n}\sum_j b_{ij}$$
Since the data is centered, $\sum_j \mathbf{x}_j = \mathbf{0}$, so $\sum_j b_{ij} = \mathbf{x}_i^T \sum_j \mathbf{x}_j = 0$. Thus: $$\bar{d}_{i\cdot}^2 = a_i + \bar{a}$$
Column mean (by symmetry): $\bar{d}_{\cdot j}^2 = a_j + \bar{a}$
Grand mean: $\bar{d}_{\cdot\cdot}^2 = 2\bar{a}$
Apply double centering: $$-\frac{1}{2}(d_{ij}^2 - \bar{d}_{i\cdot}^2 - \bar{d}_{\cdot j}^2 + \bar{d}_{\cdot\cdot}^2)$$ $$= -\frac{1}{2}[(a_i + a_j - 2b_{ij}) - (a_i + \bar{a}) - (a_j + \bar{a}) + 2\bar{a}]$$ $$= -\frac{1}{2}[-2b_{ij}] = b_{ij}$$
This proves that double centering recovers the Gram matrix entry $b_{ij}$.
```python
import numpy as np


def double_centering(D_squared):
    """
    Apply double centering to a squared distance matrix.

    Given a squared distance matrix D^(2), computes:
        B = -0.5 * H @ D^(2) @ H
    where H is the centering matrix.

    Parameters:
    -----------
    D_squared : ndarray of shape (n, n)
        Squared distance matrix where D_squared[i,j] = d_{ij}^2

    Returns:
    --------
    B : ndarray of shape (n, n)
        The (estimated) Gram matrix
    """
    n = D_squared.shape[0]

    # Method 1: Explicit construction of centering matrix
    # H = np.eye(n) - np.ones((n, n)) / n
    # B = -0.5 * H @ D_squared @ H

    # Method 2: Efficient computation (avoids forming n x n centering matrix)
    # Compute row means, column means, and grand mean
    row_means = D_squared.mean(axis=1, keepdims=True)   # Shape (n, 1)
    col_means = D_squared.mean(axis=0, keepdims=True)   # Shape (1, n)
    grand_mean = D_squared.mean()

    # Double centering formula
    B = -0.5 * (D_squared - row_means - col_means + grand_mean)

    return B


def verify_double_centering():
    """
    Verify that double centering recovers the Gram matrix
    from Euclidean distances.
    """
    np.random.seed(42)

    # Generate random centered data
    n, p = 50, 10
    X = np.random.randn(n, p)
    X = X - X.mean(axis=0)  # Center the data

    # Compute Gram matrix directly
    B_true = X @ X.T

    # Compute squared distance matrix
    # d_ij^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i^T x_j
    norms_sq = np.sum(X**2, axis=1)
    D_squared = norms_sq[:, np.newaxis] + norms_sq[np.newaxis, :] - 2 * X @ X.T

    # Apply double centering
    B_recovered = double_centering(D_squared)

    # Check recovery
    max_error = np.max(np.abs(B_true - B_recovered))
    print(f"Max absolute error: {max_error:.2e}")
    print(f"Recovery successful: {max_error < 1e-10}")

    return B_true, B_recovered


if __name__ == "__main__":
    verify_double_centering()
```

Once we have the Gram matrix $B = XX^T$, we can extract coordinates through eigendecomposition. This step reveals the intimate connection between Classical MDS and PCA.
The Gram matrix is symmetric and positive semi-definite (when $D$ is Euclidean), so it has a real eigendecomposition:
$$B = V \Lambda V^T$$
where $V = [\mathbf{v}_1, \ldots, \mathbf{v}_n]$ is orthogonal (its columns are the eigenvectors of $B$) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$ holds the eigenvalues, ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$.
Coordinate recovery: Since $B = XX^T$, we can write:
$$X = V \Lambda^{1/2}$$
where $\Lambda^{1/2} = \text{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n})$.
Dimensionality reduction: To embed in $k$ dimensions, we take only the top $k$ eigenvector-eigenvalue pairs:
$$X_k = V_k \Lambda_k^{1/2}$$
where $V_k \in \mathbb{R}^{n \times k}$ contains the first $k$ eigenvectors and $\Lambda_k$ is the $k \times k$ diagonal matrix of corresponding eigenvalues.
When distances come from centered Euclidean data, Classical MDS produces the same coordinates as PCA, but computed from the Gram matrix B = XX^T rather than the covariance matrix X^T X / n. The eigenvalues of B are n times those of the covariance matrix. This duality—between the n×n Gram matrix and the p×p covariance matrix—underlies the 'kernel trick' in machine learning.
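A small sketch verifying this scaling relationship numerically (assuming centered data and the covariance convention $X^T X / n$ used above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
X -= X.mean(axis=0)                                  # center the data

gram_eigs = np.linalg.eigvalsh(X @ X.T)[::-1][:4]    # top eigenvalues of XX^T
cov_eigs = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]

print(np.allclose(gram_eigs, len(X) * cov_eigs))     # True: factor of n
```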
Choosing the embedding dimensionality:
The eigenvalues tell us how much 'variance' each dimension captures. To choose $k$, we examine:
Scree plot: Plot eigenvalues in decreasing order. Look for an 'elbow' where eigenvalues drop sharply.
Proportion of variance explained: $$\text{PoV}(k) = \frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^n \lambda_i}$$
A common heuristic is to choose $k$ such that PoV$(k) \geq 0.8$ or $0.9$.
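A minimal sketch of this heuristic (assuming eigenvalues sorted in decreasing order, e.g., as returned by the `classical_mds` implementation below; `choose_k` is an illustrative name):

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose cumulative proportion of variance reaches the threshold.

    Assumes eigenvalues are sorted in decreasing order; negative eigenvalues
    (non-Euclidean input) are clipped to zero here.
    """
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0, None)
    pov = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(pov, threshold) + 1)

# Example: a spectrum where two dimensions dominate
print(choose_k([5.0, 3.0, 0.5, 0.3, 0.2], threshold=0.8))  # -> 2
```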
Handling negative eigenvalues:
If the input dissimilarities are not exactly Euclidean, $B$ may have negative eigenvalues. Common strategies: (1) truncation—keep only the components with positive eigenvalues, as the implementation below does; (2) an additive-constant correction—add a constant to the (squared) off-diagonal dissimilarities large enough to make $B$ positive semi-definite (a sketch of this option follows below); (3) switch to Metric or Non-metric MDS, which optimize a stress criterion directly and do not require a positive semi-definite Gram matrix.
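As a sketch of option (2), the following adds a constant to the squared off-diagonal dissimilarities just large enough to lift the most negative eigenvalue of $B$ to zero. This is one simple variant of the additive-constant idea, not necessarily the smallest possible constant; `additive_constant_correction` is an illustrative name.

```python
import numpy as np

def additive_constant_correction(D):
    """Return a corrected distance matrix whose double-centered form is PSD.

    Adds a constant to every squared off-diagonal dissimilarity so that the
    most negative eigenvalue of B is lifted to zero (illustrative sketch).
    """
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (D ** 2) @ H
    lam_min = np.linalg.eigvalsh(B).min()
    if lam_min >= 0:
        return D.copy()                       # already Euclidean
    c = -lam_min
    D2_corr = D ** 2 + 2 * c * (1 - np.eye(n))   # add 2c off the diagonal
    return np.sqrt(D2_corr)

# Non-Euclidean example (violates the triangle inequality)
D = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 1.0], [3.0, 1.0, 0.0]])
D_fixed = additive_constant_correction(D)
H = np.eye(3) - np.ones((3, 3)) / 3
print(np.linalg.eigvalsh(-0.5 * H @ D_fixed**2 @ H).min())  # ~0: PSD up to rounding
```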
```python
import numpy as np
from scipy.linalg import eigh


def classical_mds(D, n_components=2):
    """
    Classical Multidimensional Scaling (Torgerson Scaling / PCoA).

    Given a distance matrix D, finds a k-dimensional embedding such that
    Euclidean distances in the embedding approximate the input distances.

    Parameters:
    -----------
    D : ndarray of shape (n, n)
        Distance matrix (not squared)
    n_components : int
        Target dimensionality

    Returns:
    --------
    X : ndarray of shape (n, n_components)
        The embedded coordinates
    eigenvalues : ndarray
        All eigenvalues (for scree plot analysis)
    variance_explained : ndarray
        Proportion of variance explained by each component
    """
    n = D.shape[0]

    # Step 1: Square the distances
    D_squared = D ** 2

    # Step 2: Double centering to get Gram matrix
    row_means = D_squared.mean(axis=1, keepdims=True)
    col_means = D_squared.mean(axis=0, keepdims=True)
    grand_mean = D_squared.mean()
    B = -0.5 * (D_squared - row_means - col_means + grand_mean)

    # Step 3: Eigendecomposition
    # eigh returns eigenvalues in ascending order
    eigenvalues, eigenvectors = eigh(B)

    # Reverse to get descending order
    eigenvalues = eigenvalues[::-1]
    eigenvectors = eigenvectors[:, ::-1]

    # Step 4: Handle negative eigenvalues
    # Count and warn about negative eigenvalues
    n_negative = np.sum(eigenvalues < 0)
    if n_negative > 0:
        print(f"Warning: {n_negative} negative eigenvalues detected.")
        print("The input may not be a valid Euclidean distance matrix.")

    # Step 5: Compute embedding using top k positive eigenvalues
    # Only use positive eigenvalues
    positive_mask = eigenvalues > 1e-10
    eigenvalues_positive = eigenvalues[positive_mask]
    eigenvectors_positive = eigenvectors[:, positive_mask]

    # Compute proportion of variance explained
    total_variance = np.sum(np.abs(eigenvalues))  # Use abs for robustness
    variance_explained = eigenvalues_positive / total_variance

    # Take top n_components
    k = min(n_components, len(eigenvalues_positive))
    eigenvalues_k = eigenvalues_positive[:k]
    eigenvectors_k = eigenvectors_positive[:, :k]

    # Compute coordinates: X = V * sqrt(Lambda)
    X = eigenvectors_k * np.sqrt(eigenvalues_k)

    return X, eigenvalues, variance_explained[:k]


def demonstrate_classical_mds():
    """
    Demonstrate Classical MDS on synthetic data and compare with PCA.
    """
    from sklearn.decomposition import PCA
    from sklearn.metrics import pairwise_distances
    import matplotlib.pyplot as plt

    np.random.seed(42)

    # Generate data on a 2D plane embedded in 10D
    n = 200
    t = np.linspace(0, 4*np.pi, n)
    X_true = np.column_stack([
        t * np.cos(t),
        t * np.sin(t)
    ])

    # Embed in higher dimensions with some noise
    W = np.random.randn(2, 10)
    X_high = X_true @ W + 0.1 * np.random.randn(n, 10)
    X_high = X_high - X_high.mean(axis=0)  # Center

    # Compute distance matrix
    D = pairwise_distances(X_high, metric='euclidean')

    # Apply Classical MDS
    X_mds, eigenvalues, var_explained = classical_mds(D, n_components=2)

    # Apply PCA for comparison
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_high)

    # Plot results
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # MDS embedding
    axes[0].scatter(X_mds[:, 0], X_mds[:, 1], c=t, cmap='viridis', s=20)
    axes[0].set_title('Classical MDS Embedding')
    axes[0].set_xlabel('MDS 1')
    axes[0].set_ylabel('MDS 2')

    # PCA embedding
    axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=t, cmap='viridis', s=20)
    axes[1].set_title('PCA Embedding')
    axes[1].set_xlabel('PC 1')
    axes[1].set_ylabel('PC 2')

    # Scree plot
    n_show = min(20, len(eigenvalues))
    axes[2].bar(range(1, n_show+1), eigenvalues[:n_show])
    axes[2].axhline(y=0, color='r', linestyle='--', alpha=0.5)
    axes[2].set_xlabel('Component')
    axes[2].set_ylabel('Eigenvalue')
    axes[2].set_title('Scree Plot')

    plt.tight_layout()
    plt.savefig('classical_mds_demo.png', dpi=150)
    plt.close()

    print(f"Variance explained by first 2 MDS components: {var_explained[:2].sum():.3f}")
    print(f"Variance explained by first 2 PCA components: {pca.explained_variance_ratio_[:2].sum():.3f}")

    return X_mds, X_pca, eigenvalues


if __name__ == "__main__":
    demonstrate_classical_mds()
```

One of the most beautiful results in multivariate statistics is the equivalence between Classical MDS and PCA when applied to Euclidean distances from centered data. Understanding this equivalence deepens our appreciation for both methods and reveals why they appear in different contexts despite solving related problems.
Setup: Let $X \in \mathbb{R}^{n \times p}$ be a centered data matrix, and let $D$ be the matrix of pairwise Euclidean distances between its rows.
PCA perspective: eigendecompose the covariance matrix $C = X^TX/n$ and project the data onto the top $k$ eigenvectors $W_k$, giving scores $Z_{\text{PCA}} = XW_k$.
Classical MDS perspective: double-center the squared distances to obtain the Gram matrix $B = XX^T$, eigendecompose it as $B = V\Lambda V^T$, and set $Z_{\text{MDS}} = V_k\Lambda_k^{1/2}$.
The equivalence theorem:
$$Z_{\text{MDS}} = Z_{\text{PCA}} \quad \text{(up to orthogonal transformation)}$$
The key insight is the singular value decomposition. If X = UΣW^T, then B = XX^T = UΣ²U^T and C = X^TX/n = WΣ²W^T/n. The MDS coordinates V√Λ_B equal UΣ, which are exactly the PCA scores XW = UΣW^TW = UΣ.
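A short numerical check of this identity (a sketch assuming centered data; individual columns may differ by a sign, since eigenvectors are determined only up to sign):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
X -= X.mean(axis=0)

# PCA scores via SVD: X = U S W^T, scores = U S
U, S, Wt = np.linalg.svd(X, full_matrices=False)
scores_pca = U * S

# Classical MDS from the Gram matrix B = X X^T
lam, V = np.linalg.eigh(X @ X.T)
lam, V = lam[::-1][:5], V[:, ::-1][:, :5]
scores_mds = V * np.sqrt(np.maximum(lam, 0))

# Equal up to a sign per column
print(np.allclose(np.abs(scores_pca), np.abs(scores_mds), atol=1e-8))
```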
Why does this equivalence matter?
Computational choice: you can work with whichever matrix is smaller—the $p \times p$ covariance matrix when $p \ll n$, or the $n \times n$ Gram matrix when $n \ll p$—and obtain the same embedding.
Conceptual unification: maximizing retained variance and preserving pairwise Euclidean distances are two views of the same eigendecomposition.
Extension to kernels: replacing the Gram matrix with a general (centered) kernel matrix yields Kernel PCA, and replacing Euclidean distances with geodesic distances yields Isomap—both direct descendants of the Classical MDS machinery.
When do they differ?
| Aspect | PCA | Classical MDS |
|---|---|---|
| Input | Feature matrix X (n × p) | Distance matrix D (n × n) |
| Key matrix | Covariance X^TX/n (p × p) | Gram matrix XX^T (n × n) |
| Objective | Maximize variance | Preserve pairwise distances |
| Computation | O(p³ + np²) for eigendecomposition | O(n³) for eigendecomposition |
| Better when | p << n (many samples) | n << p (few samples, high dim) |
| Kernel extension | Requires kernel formulation | Natural (distances → kernels) |
| Non-Euclidean | Not directly applicable | Can approximate with caveats |
After computing a $k$-dimensional embedding, we need to assess how well it preserves the original distance structure. Several metrics quantify this.
1. Proportion of Variance Explained (PoV)
The ratio of eigenvalue sum determines how much geometric information is captured:
$$\text{PoV}(k) = \frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^n |\lambda_i|}$$
For Euclidean inputs, all eigenvalues are non-negative, and this equals the fraction of total variance. Higher PoV indicates better preservation.
2. Residual Sum of Squares (RSS)
Compare original distances to embedded distances:
$$\text{RSS} = \sum_{i < j} (d_{ij} - \hat{d}_{ij})^2$$
where $\hat{d}_{ij} = |\mathbf{x}_i - \mathbf{x}_j|$ in the embedding. Lower is better.
3. Stress (in the MDS sense)
$$\text{Stress} = \sqrt{\frac{\sum_{i<j}(d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}}$$
Normalized measure of distortion. A common rule of thumb (Kruskal's guidelines): stress above 0.2 is poor, around 0.1 is fair, around 0.05 is good, and 0.025 or below is excellent.
When input distances are truly Euclidean, Classical MDS gives the optimal k-dimensional embedding in the least-squares sense—strictly speaking, it minimizes the 'strain' loss $|B - \hat{B}|_F$ on the Gram matrix (by the Eckart–Young theorem) rather than the stress on the distances. No iterative optimization is needed: the eigendecomposition is a closed-form solution.
4. Shepard Diagram
A Shepard diagram plots original distances $d_{ij}$ against embedded distances $\hat{d}_{ij}$. For a perfect embedding, all points lie on the diagonal. Deviations reveal systematic distortions: points above the diagonal correspond to pairs whose embedded distance exceeds the original (stretching), while points below the diagonal correspond to compressed pairs.
5. Procrustes Analysis
If you have ground truth coordinates (e.g., in simulation), Procrustes analysis finds the optimal rotation/reflection/scaling to align the embedding with truth, then measures residual error:
$$\text{Procrustes error} = \min_{s > 0,\ R \in \mathcal{O}(k)} |X_{\text{true}} - s\, X_{\text{MDS}} R|_F$$
where $\mathcal{O}(k)$ is the set of $k \times k$ orthogonal matrices and $s$ is a scale factor.
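SciPy's `scipy.spatial.procrustes` performs this alignment (it standardizes both configurations, finds the best orthogonal transformation, and returns a disparity score). A minimal sketch with synthetic ground truth:

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
X_true = rng.standard_normal((30, 2))

# Simulate an MDS result: the true layout, arbitrarily rotated and reflected
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]]) @ np.diag([1, -1])
X_mds = X_true @ R

# Procrustes aligns the two configurations; disparity ~ 0 here
_, _, disparity = procrustes(X_true, X_mds)
print(f"Procrustes disparity: {disparity:.2e}")
```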
```python
import numpy as np
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt


def compute_mds_quality_metrics(D_original, X_embedded):
    """
    Compute various quality metrics for an MDS embedding.

    Parameters:
    -----------
    D_original : ndarray of shape (n, n)
        Original distance matrix
    X_embedded : ndarray of shape (n, k)
        Embedded coordinates

    Returns:
    --------
    metrics : dict
        Dictionary containing various quality metrics
    """
    # Compute embedded distances
    D_embedded = cdist(X_embedded, X_embedded, metric='euclidean')

    # Extract upper triangular (exclude diagonal)
    n = D_original.shape[0]
    upper_tri_idx = np.triu_indices(n, k=1)
    d_orig = D_original[upper_tri_idx]
    d_emb = D_embedded[upper_tri_idx]

    # 1. Residual Sum of Squares
    rss = np.sum((d_orig - d_emb) ** 2)

    # 2. Kruskal's Stress-1
    stress = np.sqrt(np.sum((d_orig - d_emb)**2) / np.sum(d_orig**2))

    # 3. Normalized Stress
    normalized_stress = np.sqrt(np.sum((d_orig - d_emb)**2) / np.sum(d_emb**2))

    # 4. Correlation between distances
    correlation = np.corrcoef(d_orig, d_emb)[0, 1]

    # 5. Mean Absolute Error
    mae = np.mean(np.abs(d_orig - d_emb))

    # 6. Root Mean Squared Error
    rmse = np.sqrt(np.mean((d_orig - d_emb)**2))

    metrics = {
        'rss': rss,
        'stress': stress,
        'normalized_stress': normalized_stress,
        'correlation': correlation,
        'mae': mae,
        'rmse': rmse
    }

    return metrics, d_orig, d_emb


def plot_shepard_diagram(d_original, d_embedded, ax=None):
    """
    Create a Shepard diagram comparing original and embedded distances.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))

    ax.scatter(d_original, d_embedded, alpha=0.3, s=5)

    # Add diagonal reference line
    max_d = max(d_original.max(), d_embedded.max())
    ax.plot([0, max_d], [0, max_d], 'r--', linewidth=2, label='Perfect fit')

    # Compute correlation
    corr = np.corrcoef(d_original, d_embedded)[0, 1]

    ax.set_xlabel('Original Distance', fontsize=12)
    ax.set_ylabel('Embedded Distance', fontsize=12)
    ax.set_title(f'Shepard Diagram (r = {corr:.4f})', fontsize=14)
    ax.legend()
    ax.set_aspect('equal')

    return ax


def demonstrate_quality_metrics():
    """
    Demonstrate quality metrics on embeddings of different dimensionality.
    """
    np.random.seed(42)

    # Generate 5D data
    n = 100
    X = np.random.randn(n, 5)
    X = X - X.mean(axis=0)

    # True distance matrix
    D = cdist(X, X, metric='euclidean')

    # Embed at different dimensions
    from classical_mds import classical_mds

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    for dim, ax in zip([1, 2, 3, 5], axes):
        X_emb, eigenvals, var_exp = classical_mds(D, n_components=dim)
        metrics, d_orig, d_emb = compute_mds_quality_metrics(D, X_emb)

        plot_shepard_diagram(d_orig, d_emb, ax=ax)
        ax.set_title(f'{dim}D Embedding\nStress={metrics["stress"]:.4f}')

    plt.tight_layout()
    plt.savefig('mds_quality_comparison.png', dpi=150)
    plt.close()

    # Print metrics table
    print("\nQuality Metrics by Embedding Dimension:")
    print("-" * 60)
    print(f"{'Dim':>4} {'Stress':>10} {'Corr':>10} {'RMSE':>10} {'PoV':>10}")
    print("-" * 60)

    for dim in [1, 2, 3, 4, 5]:
        X_emb, eigenvals, var_exp = classical_mds(D, n_components=dim)
        metrics, _, _ = compute_mds_quality_metrics(D, X_emb)
        pov = eigenvals[:dim].sum() / eigenvals.sum()
        print(f"{dim:>4} {metrics['stress']:>10.4f} {metrics['correlation']:>10.4f} "
              f"{metrics['rmse']:>10.4f} {pov:>10.4f}")


if __name__ == "__main__":
    demonstrate_quality_metrics()
```

Classical MDS is computationally elegant but faces scalability challenges for large datasets. Let's analyze the computational requirements.
Time Complexity:
Total: $O(n^2 p + n^3)$—$O(n^2 p)$ to compute the pairwise distances from $p$-dimensional features, $O(n^2)$ for double centering, and $O(n^3)$ for a full eigendecomposition of the $n \times n$ Gram matrix.
For $n = 10{,}000$ points, the eigendecomposition alone requires on the order of $10^{12}$ operations—minutes to hours of computation, depending on hardware.
Space Complexity:
Total: $O(n^2)$—the distance matrix, the squared distance matrix, and the Gram matrix each require $n \times n$ storage.
For $n = 50,000$ with float64: $50000^2 \times 8 \approx 20$ GB per matrix. This quickly exceeds RAM.
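The arithmetic behind this estimate, as a quick helper (a sketch; the function name and the sizes shown are illustrative):

```python
def dense_matrix_gigabytes(n, bytes_per_entry=8):
    """Memory for one dense n x n matrix (float64 by default), in GB."""
    return n * n * bytes_per_entry / 1e9

for n in (10_000, 50_000):
    print(f"n = {n:>6}: {dense_matrix_gigabytes(n):.1f} GB per matrix")
# n = 10000: 0.8 GB; n = 50000: 20.0 GB -- and Classical MDS holds several such matrices.
```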
Classical MDS is typically practical for n < 10,000 on standard hardware. For larger datasets, consider Landmark MDS, Nyström approximation, or out-of-sample extensions.
Optimization strategies:
1. Exploit sparsity in distance computation: compute and store only the distances you actually need (for example, distances to a subset of reference points) instead of the full dense $n \times n$ matrix.
2. Partial eigendecomposition: only the top $k$ eigenpairs of $B$ are required, so iterative (Lanczos-type) solvers avoid the full $O(n^3)$ decomposition—see the sketch after this list.
3. Landmark MDS (L-MDS): run exact Classical MDS on $m \ll n$ landmark points, then place the remaining points using their distances to the landmarks (implemented below).
4. Nyström approximation: approximate the eigenvectors of the full Gram matrix from a sampled subset of its columns.
5. Out-of-sample extension: embed new points into an existing MDS solution without recomputing the eigendecomposition.
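For instance, the partial-eigendecomposition strategy (item 2) can be sketched with `scipy.sparse.linalg.eigsh`, which computes only the top $k$ eigenpairs. This assumes the distance matrix still fits in memory; `classical_mds_topk` is an illustrative name, not a library function.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def classical_mds_topk(D, n_components=2):
    """Classical MDS computing only the top-k eigenpairs of B (Lanczos-based)."""
    D2 = D ** 2
    B = -0.5 * (D2 - D2.mean(axis=1, keepdims=True)
                   - D2.mean(axis=0, keepdims=True) + D2.mean())
    # 'LA' = largest algebraic eigenvalues; avoids the full O(n^3) decomposition
    lam, V = eigsh(B, k=n_components, which='LA')
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    return V * np.sqrt(np.maximum(lam, 0.0))
```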
```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def landmark_mds(D, n_landmarks, n_components=2, random_state=None):
    """
    Landmark MDS for scalable embedding.

    Uses a subset of points (landmarks) to compute exact MDS,
    then projects remaining points.

    Parameters:
    -----------
    D : ndarray of shape (n, n)
        Full distance matrix (or can be modified to accept
        only distances to landmarks)
    n_landmarks : int
        Number of landmark points
    n_components : int
        Target dimensionality
    random_state : int, optional
        Random seed for reproducibility

    Returns:
    --------
    X : ndarray of shape (n, n_components)
        Embedded coordinates for all points
    """
    n = D.shape[0]
    rng = np.random.RandomState(random_state)

    # Step 1: Select landmarks (random selection, could also use k-means++)
    landmark_idx = rng.choice(n, size=n_landmarks, replace=False)
    non_landmark_idx = np.setdiff1d(np.arange(n), landmark_idx)

    # Step 2: Extract landmark distance submatrix
    D_landmarks = D[np.ix_(landmark_idx, landmark_idx)]

    # Step 3: Classical MDS on landmarks
    D_sq = D_landmarks ** 2
    row_means = D_sq.mean(axis=1, keepdims=True)
    col_means = D_sq.mean(axis=0, keepdims=True)
    grand_mean = D_sq.mean()
    B_landmarks = -0.5 * (D_sq - row_means - col_means + grand_mean)

    eigenvalues, eigenvectors = eigh(B_landmarks)
    eigenvalues = eigenvalues[::-1]
    eigenvectors = eigenvectors[:, ::-1]

    # Take top k eigenvalues (clipped at zero)
    k = n_components
    eigenvalues_k = eigenvalues[:k]
    eigenvectors_k = eigenvectors[:, :k]

    # Landmark coordinates
    X_landmarks = eigenvectors_k * np.sqrt(np.maximum(eigenvalues_k, 0))

    # Step 4: Project non-landmark points (Gower's method)
    # Distance from non-landmarks to landmarks
    D_to_landmarks = D[np.ix_(non_landmark_idx, landmark_idx)]
    D_sq_to_landmarks = D_to_landmarks ** 2

    # Mean squared distance from landmarks (used in centering)
    mean_sq_dist_landmarks = D_sq.mean(axis=1)  # Mean over landmarks

    # Project using Gower's formula:
    #   x_new = -0.5 * Lambda_k^(-1/2) * V_k^T * (d_new^2 - mu)
    # where mu is the mean of squared distances among landmarks.
    # The Lambda_k^(-1/2) scaling makes projected points consistent with
    # the landmark coordinates V_k * Lambda_k^(1/2).
    Delta_sq = D_sq_to_landmarks - mean_sq_dist_landmarks
    Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(eigenvalues_k, 1e-10)))
    X_non_landmarks = -0.5 * Delta_sq @ eigenvectors_k @ Lambda_inv_sqrt

    # Step 5: Combine coordinates
    X = np.zeros((n, n_components))
    X[landmark_idx] = X_landmarks
    X[non_landmark_idx] = X_non_landmarks

    return X


def demo_landmark_mds():
    """Compare Landmark MDS with full Classical MDS."""
    from sklearn.metrics import pairwise_distances
    import time

    np.random.seed(42)

    # Generate data
    n = 2000
    X_true = np.random.randn(n, 50)  # 50D data
    D = pairwise_distances(X_true)

    # Full Classical MDS
    start = time.time()
    from classical_mds import classical_mds
    X_full, _, _ = classical_mds(D, n_components=2)
    time_full = time.time() - start

    # Landmark MDS with varying landmarks
    for n_landmarks in [50, 100, 200, 500]:
        start = time.time()
        X_lmds = landmark_mds(D, n_landmarks=n_landmarks,
                              n_components=2, random_state=42)
        time_lmds = time.time() - start

        # Compute correlation with full MDS
        corr = np.corrcoef(X_full.flatten(), X_lmds.flatten())[0, 1]

        print(f"Landmarks: {n_landmarks:>4} | "
              f"Time: {time_lmds:.3f}s (vs {time_full:.3f}s) | "
              f"Correlation: {abs(corr):.4f}")


if __name__ == "__main__":
    demo_landmark_mds()
```

Classical Multidimensional Scaling provides a foundational approach to understanding and visualizing data from pairwise distances. Let's consolidate the key concepts:

- Double centering converts a squared Euclidean distance matrix into the Gram matrix $B = XX^T$.
- Eigendecomposition of $B$ yields coordinates $X_k = V_k \Lambda_k^{1/2}$; for Euclidean inputs the reconstruction is exact up to rotation, reflection, and translation.
- On centered Euclidean data, Classical MDS and PCA produce the same embedding.
- Negative eigenvalues signal non-Euclidean input; truncation or an additive-constant correction handles them.
- The $O(n^3)$ eigendecomposition and $O(n^2)$ memory limit Classical MDS to moderate $n$; Landmark MDS and Nyström-style approximations scale it up.
What's next:
Classical MDS assumes that dissimilarities are Euclidean distances and optimizes a specific (implicit) objective. In the next page, we explore Metric MDS, which explicitly optimizes a stress function, allowing us to handle a broader class of dissimilarities and to weight distance preservation according to importance. Metric MDS introduces iterative optimization, providing a more flexible framework at the cost of computational efficiency.
You now understand the complete mathematical machinery of Classical MDS: from pairwise distances to inner products via double centering, from Gram matrices to coordinates via eigendecomposition, and the deep connection to PCA. Next, we'll see how Metric MDS generalizes this with explicit optimization.