Classical MDS provides an elegant, closed-form solution when distances are exactly Euclidean. But real-world dissimilarities are messier: they may be non-Euclidean, corrupted by noise, only partially observed, or more trustworthy for some pairs than for others.
Metric MDS addresses these needs by casting the embedding problem as an explicit optimization: find the configuration that minimizes a stress function measuring distance discrepancy. Unlike Classical MDS, Metric MDS works iteratively, refining the embedding until convergence.
This transition from closed-form to optimization-based characterizes the evolution of dimensionality reduction—from the analytical simplicity of PCA/Classical MDS to the flexibility of modern methods like t-SNE and UMAP.
By the end of this page, you will understand the stress function formulations used in Metric MDS, master the SMACOF optimization algorithm, learn how weights can prioritize certain distances, and recognize when to choose Metric MDS over Classical MDS.
Recall the goal of MDS: given dissimilarities $\delta_{ij}$ between $n$ objects, find coordinates $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^k$ such that the Euclidean distances $d_{ij}(X) = \|\mathbf{x}_i - \mathbf{x}_j\|$ approximate the dissimilarities.
Classical MDS approach: double-center the squared dissimilarities and take an eigendecomposition, a one-shot algebraic solution.
Metric MDS approach: define a stress function that measures the mismatch between $\delta_{ij}$ and $d_{ij}(X)$, then minimize it iteratively over the configuration $X$.
Why the name 'Metric'?
Metric MDS preserves the actual numerical values of dissimilarities (the 'metric' information). This contrasts with Non-metric MDS, which only preserves the rank ordering of dissimilarities (covered in the next page).
Metric MDS was developed by Joseph Kruskal and Roger Shepard in the 1960s. Kruskal's 1964 papers introduced the stress function and the iterative gradient descent approach that remains foundational. The term 'stress' comes from Kruskal's original terminology, representing the 'strain' on the configuration when distances don't match.
| Aspect | Classical MDS | Metric MDS |
|---|---|---|
| Objective | Implicit (eigenvalue-based) | Explicit (stress minimization) |
| Input assumption | Euclidean distances | Any dissimilarities |
| Solution method | Closed-form (eigendecomposition) | Iterative optimization |
| Computational complexity | O(n³) one-time | O(n² × iterations) |
| Local minima | No (unique solution) | Yes (multiple possible) |
| Weighting | Not supported | Supported (weighted stress) |
| Guarantees | Global optimum for Euclidean input | Local optimum (depends on initialization) |
The heart of Metric MDS is the stress function—a measure of how poorly the embedded distances match the target dissimilarities. Several variants exist, each with different properties.
Notation: $\delta_{ij}$ are the target dissimilarities, $d_{ij}(X)$ are the Euclidean distances in the embedding $X$, and $w_{ij} \geq 0$ are optional weights (equal to 1 when unweighted).
Raw Stress:
$$\sigma_{\text{raw}}(X) = \sum_{i < j} w_{ij} \left( \delta_{ij} - d_{ij}(X) \right)^2$$
This is the sum of squared differences between target and embedded distances. Minimizing this finds a configuration where distances match as closely as possible.
Normalized Stress-1:
$$\sigma_1(X) = \sqrt{\frac{\sum_{i<j} w_{ij}(\delta_{ij} - d_{ij})^2}{\sum_{i<j} w_{ij} \delta_{ij}^2}}$$
Normalizing by the sum of squared target distances makes stress comparable across datasets of different scales.
Kruskal provided rules of thumb for stress values: < 0.05 excellent, 0.05-0.10 good, 0.10-0.20 fair, > 0.20 poor. However, expected stress varies with n (more points → higher baseline stress) and k (fewer dimensions → higher stress). Always compare relative to your specific problem.
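To see why the thresholds are only rough guides, here is a minimal sketch (assuming scikit-learn's `MDS` is available; the helper `stress1` implements the normalized Stress-1 defined above) that embeds random 10-dimensional data for a couple of values of $n$ and $k$ and prints the resulting baseline stress:

```python
# Sketch: baseline Stress-1 depends on n and k, even for structureless data.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.manifold import MDS

def stress1(D, X):
    """Kruskal's Stress-1 for embedding X against target distances D."""
    d = cdist(X, X)
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(np.sum((D[iu] - d[iu]) ** 2) / np.sum(D[iu] ** 2))

rng = np.random.default_rng(0)
for n in (50, 200):
    for k in (2, 5):
        Y = rng.standard_normal((n, 10))     # random 10-D data (no structure)
        D = cdist(Y, Y)
        mds = MDS(n_components=k, metric=True,
                  dissimilarity='precomputed', random_state=0)
        X = mds.fit_transform(D)
        print(f"n={n:3d}, k={k}: Stress-1 = {stress1(D, X):.3f}")
```

Comparing the printed values shows that "acceptable" stress is relative: more points and fewer target dimensions push the achievable stress upward even when nothing is wrong with the embedding.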
Stress-2 (Squared Normalization):
$$\sigma_2(X) = \sqrt{\frac{\sum_{i<j} (\delta_{ij} - d_{ij})^2}{\sum_{i<j} d_{ij}^2}}$$
Normalizes by embedded distances rather than target distances. Less commonly used.
Sammon's Stress (used in Sammon Mapping):
$$E_{\text{Sammon}}(X) = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(\delta_{ij} - d_{ij})^2}{\delta_{ij}}$$
Sammon's stress weights errors by $1/\delta_{ij}$, giving more importance to preserving small distances. This emphasizes local structure preservation.
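As a small illustrative sketch (the helper name is ours, not from a library, and it assumes all off-diagonal dissimilarities are strictly positive), Sammon's stress can be computed directly from a target dissimilarity matrix and an embedding:

```python
# Sketch: Sammon's stress, weighting each squared error by 1/delta_ij.
import numpy as np
from scipy.spatial.distance import cdist

def sammon_stress(D, X):
    """Sammon's error for embedding X against dissimilarities D."""
    d = cdist(X, X)
    iu = np.triu_indices_from(D, k=1)     # each pair counted once
    delta, dist = D[iu], d[iu]
    # Assumes delta > 0 for all distinct pairs (otherwise the weight is undefined).
    return np.sum((delta - dist) ** 2 / delta) / np.sum(delta)
```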
Why square the differences? Squaring yields a smooth, differentiable objective, penalizes large errors more heavily than small ones, and (crucially for SMACOF) admits a simple quadratic majorizer.
SMACOF (Scaling by MAjorizing a COmplicated Function) is the workhorse algorithm for Metric MDS. It's a majorization-minimization (MM) algorithm that iteratively improves the embedding by solving a series of simpler problems.
The challenge: Raw stress is a non-convex function with many local minima. Direct gradient descent can be slow and may oscillate. SMACOF provides a guaranteed monotonic decrease in stress at each iteration.
Key insight: The stress function can be bounded above by a simpler quadratic function (the majorizer). Minimizing the majorizer is easy and guaranteed to decrease the original stress.
Mathematical setup:
Define the stress with weights $w_{ij}$:
$$\sigma(X) = \sum_{i<j} w_{ij}(\delta_{ij} - d_{ij}(X))^2$$
This can be rewritten using matrix notation. Let $X \in \mathbb{R}^{n \times k}$ be the coordinate matrix. Define:
$$d_{ij}(X) = \sqrt{\sum_{s=1}^k (x_{is} - x_{js})^2}$$
The stress can be decomposed as:
$$\sigma(X) = \eta_\delta^2 + \eta^2(X) - 2\rho(X)$$
where:
- $\eta_\delta^2 = \sum_{i<j} w_{ij}\,\delta_{ij}^2$ is a constant that does not depend on $X$
- $\eta^2(X) = \sum_{i<j} w_{ij}\, d_{ij}^2(X)$ is a quadratic function of the coordinates
- $\rho(X) = \sum_{i<j} w_{ij}\, \delta_{ij}\, d_{ij}(X)$ is the cross term that the majorization step bounds
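Because the decomposition is just an algebraic expansion of the square, it can be verified numerically. A self-contained sketch with uniform weights $w_{ij} = 1$ on a small random configuration:

```python
# Sketch: verify sigma(X) = eta_delta^2 + eta^2(X) - 2*rho(X) numerically.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))      # "true" high-dimensional points
X = rng.standard_normal((6, 2))      # an arbitrary 2-D configuration
delta = cdist(A, A)                  # target dissimilarities
d = cdist(X, X)                      # embedded distances
iu = np.triu_indices(6, k=1)

sigma  = np.sum((delta[iu] - d[iu]) ** 2)   # raw stress
eta_d2 = np.sum(delta[iu] ** 2)             # eta_delta^2 (constant)
eta2   = np.sum(d[iu] ** 2)                 # eta^2(X)
rho    = np.sum(delta[iu] * d[iu])          # rho(X)

assert np.isclose(sigma, eta_d2 + eta2 - 2 * rho)
```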
Majorization is like building a ceiling that touches your function at the current point but lies above it everywhere else. By minimizing the ceiling, you're guaranteed to reduce the original function. SMACOF constructs such a ceiling that's quadratic in X, making minimization easy.
The SMACOF update:
At iteration $t$, given current configuration $X^{(t)}$, the update is:
$$X^{(t+1)} = V^+ B(X^{(t)}) X^{(t)}$$
where:
$V$ is the weighted Laplacian matrix: $$V_{ij} = \begin{cases} -w_{ij} & i \neq j \\ \sum_{k \neq i} w_{ik} & i = j \end{cases}$$
$V^+$ is the Moore-Penrose pseudoinverse of $V$
$B(X)$ is the Guttman transform matrix: $$B(X)_{ij} = \begin{cases} -w_{ij}\delta_{ij}/d_{ij}(X) & i \neq j,\ d_{ij}(X) > 0 \\ 0 & i \neq j,\ d_{ij}(X) = 0 \\ -\sum_{k \neq i} B(X)_{ik} & i = j \end{cases}$$
Key properties of SMACOF: the stress decreases monotonically at every iteration; the iterates converge to a stationary point (in practice a local minimum); after a one-time pseudoinverse of $V$, each iteration costs $O(n^2 k)$; and weights, including zero weights for missing dissimilarities, are handled naturally through $V$ and $B(X)$.
```python
import numpy as np
from scipy.spatial.distance import cdist


def compute_stress(D_target, X, weights=None):
    """
    Compute raw stress for current configuration.

    Parameters:
    -----------
    D_target : ndarray of shape (n, n)
        Target dissimilarity matrix
    X : ndarray of shape (n, k)
        Current embedding coordinates
    weights : ndarray of shape (n, n), optional
        Weight matrix (default: all ones)

    Returns:
    --------
    stress : float
        Raw stress value
    """
    n = X.shape[0]
    D_current = cdist(X, X, metric='euclidean')
    if weights is None:
        weights = np.ones((n, n))
    # Only sum over upper triangle
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    diff_sq = (D_target - D_current) ** 2
    stress = np.sum(weights[mask] * diff_sq[mask])
    return stress


def compute_normalized_stress(D_target, X, weights=None):
    """Compute Kruskal's Stress-1 (normalized)."""
    n = X.shape[0]
    D_current = cdist(X, X, metric='euclidean')
    if weights is None:
        weights = np.ones((n, n))
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    numerator = np.sum(weights[mask] * (D_target[mask] - D_current[mask])**2)
    denominator = np.sum(weights[mask] * D_target[mask]**2)
    return np.sqrt(numerator / denominator)


def smacof(D_target, n_components=2, weights=None, init=None,
           max_iter=300, eps=1e-6, random_state=None, verbose=False):
    """
    SMACOF algorithm for Metric MDS.

    Parameters:
    -----------
    D_target : ndarray of shape (n, n)
        Target dissimilarity matrix
    n_components : int
        Target dimensionality
    weights : ndarray of shape (n, n), optional
        Weight matrix (default: uniform weights)
    init : ndarray of shape (n, k), optional
        Initial configuration (default: Classical MDS solution)
    max_iter : int
        Maximum number of iterations
    eps : float
        Convergence threshold on stress improvement
    random_state : int, optional
        Random seed
    verbose : bool
        Print progress information

    Returns:
    --------
    X : ndarray of shape (n, n_components)
        Final embedding
    stress : float
        Final stress value
    n_iter : int
        Number of iterations performed
    """
    rng = np.random.RandomState(random_state)
    n = D_target.shape[0]
    k = n_components

    # Initialize weights
    if weights is None:
        weights = np.ones((n, n))
    np.fill_diagonal(weights, 0)  # No self-weights

    # Initialize configuration
    if init is None:
        # Use Classical MDS as initialization
        D_sq = D_target ** 2
        row_means = D_sq.mean(axis=1, keepdims=True)
        col_means = D_sq.mean(axis=0, keepdims=True)
        grand_mean = D_sq.mean()
        B = -0.5 * (D_sq - row_means - col_means + grand_mean)
        eigenvalues, eigenvectors = np.linalg.eigh(B)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx][:k]
        eigenvectors = eigenvectors[:, idx][:, :k]
        # Handle negative eigenvalues
        eigenvalues = np.maximum(eigenvalues, 0)
        X = eigenvectors * np.sqrt(eigenvalues)
    else:
        X = init.copy()

    # Compute V matrix (weighted Laplacian)
    V = np.diag(weights.sum(axis=1)) - weights
    # Pseudoinverse of V (for centering)
    # Note: V is singular (rows sum to 0), use pseudoinverse
    V_pinv = np.linalg.pinv(V)

    # Initial stress
    stress_prev = compute_stress(D_target, X, weights)
    if verbose:
        print(f"Initial stress: {stress_prev:.6f}")

    for iteration in range(max_iter):
        # Compute current distances
        D_current = cdist(X, X, metric='euclidean')

        # Compute B matrix (Guttman transform)
        # B_ij = -w_ij * delta_ij / d_ij for i != j and d_ij > 0
        with np.errstate(divide='ignore', invalid='ignore'):
            B = -weights * D_target / D_current
        B[~np.isfinite(B)] = 0  # Handle division by zero
        np.fill_diagonal(B, 0)
        np.fill_diagonal(B, -B.sum(axis=1))  # Row sums to zero

        # SMACOF update: X_new = V^+ @ B @ X
        X_new = V_pinv @ B @ X

        # Compute new stress
        stress_new = compute_stress(D_target, X_new, weights)

        if verbose and iteration % 10 == 0:
            print(f"Iteration {iteration}: stress = {stress_new:.6f}")

        # Check convergence
        if stress_prev - stress_new < eps:
            if verbose:
                print(f"Converged at iteration {iteration}")
            X = X_new
            break

        X = X_new
        stress_prev = stress_new

    final_stress = compute_normalized_stress(D_target, X, weights)
    return X, final_stress, iteration + 1


def demonstrate_smacof():
    """Demonstrate SMACOF on synthetic data."""
    np.random.seed(42)

    # Generate 2D spiral data
    n = 150
    t = np.linspace(0, 3*np.pi, n)
    X_true = np.column_stack([t * np.cos(t), t * np.sin(t)])

    # Compute true distances
    D = cdist(X_true, X_true, metric='euclidean')

    # Add some noise to distances (makes them non-exactly-Euclidean)
    noise = np.random.randn(n, n) * 0.1
    D_noisy = D + (noise + noise.T) / 2  # Keep symmetric
    D_noisy = np.maximum(D_noisy, 0)     # Keep non-negative
    np.fill_diagonal(D_noisy, 0)

    # Run SMACOF
    X_mds, stress, n_iter = smacof(
        D_noisy, n_components=2, max_iter=300, verbose=True
    )

    print(f"\nFinal normalized stress: {stress:.4f}")
    print(f"Iterations: {n_iter}")

    # Compare reconstructed distances with noisy distances
    D_reconstructed = cdist(X_mds, X_mds, metric='euclidean')
    corr = np.corrcoef(D_noisy.flatten(), D_reconstructed.flatten())[0, 1]
    print(f"Correlation with noisy input distances: {corr:.4f}")

    return X_mds, D_noisy, D_reconstructed


if __name__ == "__main__":
    demonstrate_smacof()
```

A key advantage of Metric MDS is the ability to assign weights to different pairwise dissimilarities. This enables emphasizing local structure, encoding the reliability of individual measurements, and handling missing dissimilarities by setting their weights to zero.
The weighted stress function:
$$\sigma(X) = \sum_{i<j} w_{ij}(\delta_{ij} - d_{ij}(X))^2$$
The SMACOF algorithm naturally incorporates weights through the $V$ and $B$ matrices.
- Uniform: $w_{ij} = 1$ for all pairs (standard Metric MDS)
- Distance-based: $w_{ij} = 1/\delta_{ij}$ (emphasize small distances, like Sammon mapping)
- Distance-based: $w_{ij} = 1/\delta_{ij}^2$ (even stronger local emphasis)
- K-nearest neighbor: $w_{ij} = 1$ if $j \in \text{KNN}(i)$, else 0 (only preserve local structure)
- Binary reliability: $w_{ij} = 1$ if the pair is reliable, 0 if uncertain
Sammon Mapping as Weighted MDS:
Sammon mapping (1969) is a special case of weighted Metric MDS with $w_{ij} = 1/\delta_{ij}$:
$$E_{\text{Sammon}} = \frac{1}{\sum_{i<j}\delta_{ij}} \sum_{i<j} \frac{(\delta_{ij} - d_{ij})^2}{\delta_{ij}}$$
Interpretation: Small distances are weighted more heavily, so local neighborhoods are preserved at the expense of global structure. This makes Sammon mapping a precursor to modern methods like t-SNE.
Local MDS:
For extreme local emphasis, use k-nearest neighbor weights:
$$w_{ij} = \begin{cases} 1 & \text{if } j \in \text{KNN}(i) \text{ or } i \in \text{KNN}(j) \\ 0 & \text{otherwise} \end{cases}$$
This only tries to preserve distances between nearby points, ignoring long-range structure. The result is an embedding focused purely on local neighborhoods—conceptually similar to LLE and t-SNE.
```python
import numpy as np
from scipy.spatial.distance import cdist
from smacof import smacof


def create_weight_matrix(D, scheme='uniform', k=None):
    """
    Create weight matrix for Weighted MDS.

    Parameters:
    -----------
    D : ndarray of shape (n, n)
        Distance or dissimilarity matrix
    scheme : str
        Weighting scheme: 'uniform', 'inverse', 'inverse_sq', 'knn', 'sammon'
    k : int
        Number of neighbors for 'knn' scheme

    Returns:
    --------
    W : ndarray of shape (n, n)
        Weight matrix
    """
    n = D.shape[0]

    if scheme == 'uniform':
        W = np.ones((n, n))

    elif scheme == 'inverse' or scheme == 'sammon':
        # w_ij = 1 / delta_ij
        with np.errstate(divide='ignore'):
            W = 1.0 / D
        W[~np.isfinite(W)] = 0  # Handle division by zero

    elif scheme == 'inverse_sq':
        # w_ij = 1 / delta_ij^2
        with np.errstate(divide='ignore'):
            W = 1.0 / (D ** 2)
        W[~np.isfinite(W)] = 0

    elif scheme == 'knn':
        if k is None:
            k = min(10, n - 1)
        W = np.zeros((n, n))
        for i in range(n):
            # Find k nearest neighbors (excluding self)
            dists = D[i].copy()
            dists[i] = np.inf
            knn_idx = np.argsort(dists)[:k]
            W[i, knn_idx] = 1
        # Symmetrize
        W = np.maximum(W, W.T)

    else:
        raise ValueError(f"Unknown weighting scheme: {scheme}")

    np.fill_diagonal(W, 0)  # No self-weights
    return W


def compare_weighting_schemes():
    """Compare different weighting schemes on the same data."""
    np.random.seed(42)

    # Generate clustered data in 5D
    n_per_cluster = 50
    n_clusters = 3
    clusters = []
    for i in range(n_clusters):
        center = np.random.randn(5) * 10
        cluster = center + np.random.randn(n_per_cluster, 5)
        clusters.append(cluster)
    X_true = np.vstack(clusters)
    labels = np.repeat(range(n_clusters), n_per_cluster)
    n = len(labels)

    # Compute distances
    D = cdist(X_true, X_true, metric='euclidean')

    # Different weighting schemes
    schemes = ['uniform', 'inverse', 'inverse_sq', 'knn']
    results = {}

    for scheme in schemes:
        if scheme == 'knn':
            W = create_weight_matrix(D, scheme=scheme, k=10)
        else:
            W = create_weight_matrix(D, scheme=scheme)

        X_mds, stress, n_iter = smacof(
            D, n_components=2, weights=W,
            max_iter=500, random_state=42
        )

        results[scheme] = {
            'embedding': X_mds,
            'stress': stress,
            'iterations': n_iter
        }
        print(f"Scheme: {scheme:12s} | Stress: {stress:.4f} | Iterations: {n_iter}")

    return results, labels


def demonstrate_missing_data_mds():
    """Demonstrate MDS with missing (unknown) distances."""
    np.random.seed(42)

    # Generate 2D data
    n = 100
    X_true = np.random.randn(n, 2)
    D_true = cdist(X_true, X_true, metric='euclidean')

    # Randomly mask 30% of pairwise distances
    mask = np.random.rand(n, n) > 0.3   # Keep 70%
    mask = np.maximum(mask, mask.T)     # Keep symmetric
    np.fill_diagonal(mask, True)        # Always keep diagonal

    # Create weight matrix: 1 for known distances, 0 for unknown
    W = mask.astype(float)
    np.fill_diagonal(W, 0)

    print(f"Known distances: {mask.sum() / n**2 * 100:.1f}%")

    # Run weighted MDS
    X_mds, stress, n_iter = smacof(
        D_true, n_components=2, weights=W,
        max_iter=500, random_state=42
    )

    # Compare with full-information MDS
    X_mds_full, stress_full, _ = smacof(
        D_true, n_components=2,
        max_iter=500, random_state=42
    )

    print(f"Missing data MDS stress: {stress:.4f}")
    print(f"Full data MDS stress: {stress_full:.4f}")

    # Compute correlation of recovered coordinates
    # (need to align via Procrustes first)
    from scipy.linalg import orthogonal_procrustes
    R, _ = orthogonal_procrustes(X_mds, X_mds_full)
    X_mds_aligned = X_mds @ R

    corr = np.corrcoef(X_mds_full.flatten(), X_mds_aligned.flatten())[0, 1]
    print(f"Correlation with full-data embedding: {corr:.4f}")

    return X_mds, X_mds_full, W


if __name__ == "__main__":
    print("Comparing weighting schemes:")
    print("-" * 50)
    compare_weighting_schemes()
    print()
    print("Missing data example:")
    print("-" * 50)
    demonstrate_missing_data_mds()
```

SMACOF and other iterative MDS algorithms converge to local minima, not necessarily the global minimum. The quality of the final embedding depends strongly on initialization.
The local minima problem:
Consider embedding 4 points with equal pairwise distances (a tetrahedron in 3D) into 2D. There are many valid 2D configurations that all have the same stress—the optimization landscape has multiple equivalent minima. For more complex data, there may be qualitatively different minima with different stress values.
Common initialization strategies:
1. Classical MDS (recommended default)
2. Random configurations (used for multiple restarts, as in the tip below)
3. PCA of the distance matrix
4. Warm start from a previous solution
For critical applications, run SMACOF multiple times with different initializations (e.g., Classical MDS + 10 random starts). Keep the solution with lowest stress. This increases computation but provides more confidence in the result.
Diagnosing local minima:
Signs that you may be stuck in a bad local minimum: stress is noticeably higher than in other runs on the same data, the embedding changes qualitatively between restarts, or points known to be similar end up far apart in the plot.
Remedies: restart from several random initializations and keep the lowest-stress solution, initialize from Classical MDS, tighten the convergence tolerance, or increase the maximum number of iterations.
```python
import numpy as np
from scipy.spatial.distance import cdist
from smacof import smacof, compute_normalized_stress


def multistart_mds(D, n_components=2, n_starts=10, max_iter=300,
                   eps=1e-6, random_state=None, verbose=False):
    """
    Multi-start Metric MDS to avoid local minima.

    Parameters:
    -----------
    D : ndarray of shape (n, n)
        Dissimilarity matrix
    n_components : int
        Target dimensionality
    n_starts : int
        Number of random initializations to try
    max_iter : int
        Maximum iterations per start
    eps : float
        Convergence tolerance
    random_state : int
        Master random seed
    verbose : bool
        Print progress

    Returns:
    --------
    X_best : ndarray of shape (n, n_components)
        Best embedding found
    stress_best : float
        Stress of best embedding
    all_stresses : list
        Stress values for all starts
    """
    rng = np.random.RandomState(random_state)
    n = D.shape[0]

    all_stresses = []
    X_best = None
    stress_best = np.inf

    # Start 0: Classical MDS initialization
    X, stress, n_iter = smacof(
        D, n_components=n_components,
        init=None,  # Uses Classical MDS
        max_iter=max_iter, eps=eps,
        random_state=rng.randint(100000)
    )
    all_stresses.append(stress)
    if verbose:
        print(f"Start 0 (Classical MDS init): stress = {stress:.6f}")
    if stress < stress_best:
        stress_best = stress
        X_best = X.copy()

    # Starts 1 to n_starts-1: Random initializations
    for start in range(1, n_starts):
        # Random initialization scaled to match data
        mean_dist = D[np.triu_indices(n, k=1)].mean()
        init = rng.randn(n, n_components) * mean_dist / np.sqrt(n_components)

        X, stress, n_iter = smacof(
            D, n_components=n_components,
            init=init,
            max_iter=max_iter, eps=eps,
            random_state=rng.randint(100000)
        )
        all_stresses.append(stress)
        if verbose:
            print(f"Start {start:2d} (random init): stress = {stress:.6f}")
        if stress < stress_best:
            stress_best = stress
            X_best = X.copy()

    if verbose:
        print(f"\nBest stress: {stress_best:.6f}")
        print(f"Stress range: [{min(all_stresses):.6f}, {max(all_stresses):.6f}]")

    return X_best, stress_best, all_stresses


def analyze_local_minima():
    """Analyze the local minima landscape for different data types."""
    np.random.seed(42)

    # Dataset 1: Well-separated clusters (unique global minimum)
    print("Dataset 1: Well-separated clusters")
    n_per_cluster = 30
    clusters_1 = [
        np.random.randn(n_per_cluster, 5) + np.array([10, 0, 0, 0, 0]),
        np.random.randn(n_per_cluster, 5) + np.array([0, 10, 0, 0, 0]),
        np.random.randn(n_per_cluster, 5) + np.array([0, 0, 10, 0, 0]),
    ]
    X_1 = np.vstack(clusters_1)
    D_1 = cdist(X_1, X_1, metric='euclidean')

    X_best_1, stress_1, all_stresses_1 = multistart_mds(
        D_1, n_components=2, n_starts=10, verbose=True
    )
    print(f"Stress std: {np.std(all_stresses_1):.6f}")
    print()

    # Dataset 2: Highly symmetric (many equivalent minima)
    print("Dataset 2: Regular simplex (symmetric)")
    n_2 = 20
    # Points on a hypersphere (highly symmetric)
    theta = np.linspace(0, 2*np.pi, n_2, endpoint=False)
    X_2 = np.column_stack([np.cos(theta), np.sin(theta)])
    D_2 = cdist(X_2, X_2, metric='euclidean')

    X_best_2, stress_2, all_stresses_2 = multistart_mds(
        D_2, n_components=2, n_starts=10, verbose=True
    )
    print(f"Stress std: {np.std(all_stresses_2):.6f}")
    print()

    # Dataset 3: Noisy Swiss roll (challenging manifold)
    print("Dataset 3: Swiss roll (manifold)")
    n_3 = 200
    t = 1.5 * np.pi * (1 + 2 * np.random.rand(n_3))
    X_3 = np.column_stack([
        t * np.cos(t),
        20 * np.random.rand(n_3),
        t * np.sin(t)
    ])
    D_3 = cdist(X_3, X_3, metric='euclidean')

    X_best_3, stress_3, all_stresses_3 = multistart_mds(
        D_3, n_components=2, n_starts=10, verbose=True
    )
    print(f"Stress std: {np.std(all_stresses_3):.6f}")

    return {
        'clusters': (X_best_1, all_stresses_1),
        'simplex': (X_best_2, all_stresses_2),
        'swiss_roll': (X_best_3, all_stresses_3)
    }


if __name__ == "__main__":
    results = analyze_local_minima()
```

Successfully applying Metric MDS requires attention to several practical details.
Choosing the target dimensionality $k$: run MDS for several values of $k$ and plot stress against $k$; look for an elbow where adding dimensions stops reducing stress appreciably. For visualization, $k = 2$ or $3$ is standard even if the stress is somewhat higher.
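A small sketch of this stress "scree plot" idea, assuming the `smacof()` function from the listing above is importable as in the earlier examples (data and parameter values here are illustrative):

```python
# Sketch: stress vs. target dimensionality k, to locate the elbow.
import numpy as np
from scipy.spatial.distance import cdist
from smacof import smacof

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 6))        # data with 6 intrinsic dimensions
D = cdist(Y, Y)

for k in range(1, 7):
    _, stress, _ = smacof(D, n_components=k, random_state=0)
    print(f"k = {k}: normalized stress = {stress:.3f}")
# Stress should drop sharply up to the intrinsic dimensionality, then flatten.
```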
Preprocessing dissimilarities: ensure the matrix is symmetric with a zero diagonal and non-negative entries; if the data arrive as similarities, convert them to dissimilarities first; and consider rescaling or transforming (e.g., square roots) if a few very large dissimilarities dominate the stress.
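A minimal sketch of the basic cleanup steps (the helper name is ours, not from a library):

```python
# Sketch: enforce symmetry, zero diagonal, and non-negativity before MDS.
import numpy as np

def clean_dissimilarities(D):
    """Return a symmetric, non-negative dissimilarity matrix with zero diagonal."""
    D = np.asarray(D, dtype=float)
    D = 0.5 * (D + D.T)            # symmetrize
    np.fill_diagonal(D, 0.0)       # self-dissimilarity is zero
    return np.maximum(D, 0.0)      # clip small negative entries from noise
```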
After MDS, consider: (1) Centering the embedding for interpretability, (2) Rotating to align with meaningful axes (Procrustes if you have a reference), (3) Flipping to match expected orientation (MDS is invariant to reflections).
Software implementations:
- Python: `sklearn.manifold.MDS` with `metric=True`
- R: `cmdscale()` for Classical MDS; `isoMDS()` and `sammon()` from the MASS package
- MATLAB: `mdscale()` with various stress criteria

Using sklearn:
```python
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances
from sklearn.datasets import load_digits
import numpy as np

# Load example data
digits = load_digits(n_class=5)
X = digits.data[:300]   # Use subset for speed
y = digits.target[:300]

# Compute distance matrix
D = pairwise_distances(X, metric='euclidean')

# Metric MDS (defaults to SMACOF)
mds_metric = MDS(
    n_components=2,
    metric=True,                   # Metric MDS (vs non-metric)
    dissimilarity='precomputed',   # We provide the distance matrix
    n_init=4,                      # Number of random starts
    max_iter=300,
    eps=1e-6,
    random_state=42,
    n_jobs=-1                      # Parallelize over starts
)

X_mds = mds_metric.fit_transform(D)

print(f"Final stress: {mds_metric.stress_:.4f}")
print(f"Embedding shape: {X_mds.shape}")

# Alternatively, from features directly (uses Euclidean distance internally)
mds_from_features = MDS(
    n_components=2,
    metric=True,
    dissimilarity='euclidean',   # Compute distances from features
    n_init=4,
    random_state=42
)

X_mds_alt = mds_from_features.fit_transform(X)
print(f"Stress (from features): {mds_from_features.stress_:.4f}")
```

Metric MDS has spawned several important variants and extensions.
1. SSTRESS (Squared Stress)
Instead of optimizing $(\delta_{ij} - d_{ij})^2$, optimize $(\delta_{ij}^2 - d_{ij}^2)^2$:
$$\text{SSTRESS}(X) = \sum_{i<j} (\delta_{ij}^2 - d_{ij}(X)^2)^2$$
Advantage: The squared distance $d_{ij}^2 = \sum_s (x_{is} - x_{js})^2$ is a polynomial in coordinates, making gradients simpler. Used in some implementations.
2. Robust MDS
Replace squared error with robust loss functions:
$$\sigma_{\text{robust}}(X) = \sum_{i<j} \rho(\delta_{ij} - d_{ij})$$
where $\rho$ is a robust loss (Huber, Tukey bisquare). This reduces sensitivity to outlier distances.
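As a minimal sketch of the idea (the Huber threshold `c` here is an assumed tuning parameter, and the helper names are illustrative rather than from any library):

```python
# Sketch: a robust stress using the Huber loss instead of the squared error.
import numpy as np
from scipy.spatial.distance import cdist

def huber(r, c=1.0):
    """Huber loss: quadratic near zero, linear for large residuals."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * (a - 0.5 * c))

def robust_stress(D, X, c=1.0):
    """Sum of Huber losses over residuals delta_ij - d_ij(X)."""
    d = cdist(X, X)
    iu = np.triu_indices_from(D, k=1)
    return np.sum(huber(D[iu] - d[iu], c))
```

Large residuals contribute only linearly to this objective, so a few badly measured dissimilarities cannot dominate the embedding the way they do under the squared loss.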
3. Procrustes MDS
When you have multiple dissimilarity matrices from different sources (e.g., different subjects rating the same stimuli), Procrustes MDS finds a common configuration that best fits all matrices.
4. Individual Differences Scaling (INDSCAL)
Extends MDS to handle multiple dissimilarity matrices by assuming a common configuration with subject-specific dimension weights. Reveals which dimensions are important for different subjects.
Metric MDS sits at the foundation of many modern methods: Isomap uses MDS on geodesic distances, t-SNE optimizes a related (but non-metric) objective, and kernel PCA can be viewed as MDS on kernel-induced distances. Understanding MDS illuminates all of these.
| Method | What It Preserves | Key Feature |
|---|---|---|
| Classical MDS | Euclidean distances (exactly) | Closed-form, O(n³) |
| Metric MDS (SMACOF) | Metric distances (approximately) | Iterative, weighted, flexible |
| Non-metric MDS | Distance rankings only | Rank-based, robust to measurement scale |
| Sammon Mapping | Local distances (emphasis) | Weighted by 1/δ |
| Isomap | Geodesic distances | MDS on graph shortest paths |
| Landmark MDS | Distances to landmarks | Scalable O(nm²) instead of O(n³) |
Metric MDS extends Classical MDS with explicit optimization, providing flexibility at the cost of iterative computation. Key points to consolidate: stress functions measure the mismatch between target dissimilarities and embedded distances; SMACOF minimizes stress with a guaranteed monotonic decrease via majorization; weights let you emphasize local structure, encode measurement reliability, or ignore missing pairs; and because the optimum is only local, initialization and multiple restarts matter.
What's next:
Metric MDS assumes that the numerical values of dissimilarities are meaningful. But what if dissimilarities come from subjective judgments where only the ranking matters? The next page covers Non-metric MDS, which abandons metric preservation in favor of monotonic (rank-preserving) embeddings—a more robust approach for ordinal data.
You now understand Metric MDS: the stress function formulation, SMACOF optimization, weighting schemes, and practical considerations. This forms the bridge between Classical MDS and the more flexible Non-metric MDS that follows.