The first phase of Locally Linear Embedding transforms intuition into computation: we must find reconstruction weights that quantify how each data point relates to its neighbors. These weights are not arbitrary numerical values—they encode the precise geometric relationships in the local tangent space.
Computing optimal weights is a constrained optimization problem with elegant mathematical structure. The solution involves linear algebra concepts—Gram matrices, constrained least squares, and regularization—that reveal deep connections between geometry and algebra. Done correctly, the weights capture all information needed to preserve local structure in the low-dimensional embedding.
This page provides a comprehensive treatment of weight computation, from the optimization formulation to practical numerical implementation.
By the end of this page, you will:
• Derive the closed-form solution for optimal reconstruction weights
• Understand the role of the local Gram matrix in weight computation
• Master regularization techniques for ill-conditioned problems
• Implement efficient weight computation algorithms
• Recognize failure modes and diagnostic techniques
For each point $\mathbf{x}_i$ in our dataset, we seek weights $\{w_{ij}\}_{j \in \mathcal{N}_i}$ that minimize the reconstruction error subject to constraints.
The Formal Problem:
$$\min_{\mathbf{w}_i} \left\| \mathbf{x}_i - \sum_{j \in \mathcal{N}_i} w_{ij} \mathbf{x}_j \right\|^2$$
subject to: $$\sum_{j \in \mathcal{N}_i} w_{ij} = 1$$
where $\mathcal{N}_i$ denotes the set of $k$ nearest neighbors of point $i$.
Expanding the Objective:
Let us denote the $k$ neighbors of $\mathbf{x}_i$ by $\{\mathbf{x}_{j_1}, \ldots, \mathbf{x}_{j_k}\}$. Collecting the weights into a vector $\mathbf{w}_i \in \mathbb{R}^k$, the objective becomes:
$$\varepsilon_i = \left\| \mathbf{x}_i - \sum_{l=1}^k w_{il} \mathbf{x}_{j_l} \right\|^2$$
Introducing Difference Vectors:
Define $\boldsymbol{\eta}_l = \mathbf{x}_i - \mathbf{x}_{j_l}$ for $l = 1, \ldots, k$. Using the constraint $\sum_l w_{il} = 1$:
$$\mathbf{x}_i - \sum_l w_{il} \mathbf{x}_{j_l} = \sum_l w_{il} \mathbf{x}_i - \sum_l w_{il} \mathbf{x}_{j_l} = \sum_l w_{il} (\mathbf{x}_i - \mathbf{x}_{j_l}) = \sum_l w_{il} \boldsymbol{\eta}_l$$
Therefore: $$\varepsilon_i = \left\| \sum_l w_{il} \boldsymbol{\eta}_l \right\|^2 = \mathbf{w}_i^T \mathbf{G}^{(i)} \mathbf{w}_i$$
where $\mathbf{G}^{(i)}_{lm} = \boldsymbol{\eta}_l^T \boldsymbol{\eta}_m$ is the local Gram matrix.
The Gram matrix G⁽ⁱ⁾ ∈ ℝᵏˣᵏ captures all pairwise inner products between the difference vectors (x_i - x_j). Its (l,m) entry is:
G_{lm}⁽ⁱ⁾ = (x_i - x_{j_l})ᵀ(x_i - x_{j_m})
This matrix is symmetric and positive semi-definite. Its rank equals the dimension of the affine hull spanned by x_i and its neighbors.
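These properties are easy to verify numerically; a minimal sketch with a hypothetical point and three neighbors in 2D (not data from the text):

```python
import numpy as np

# Hypothetical check: one point and three neighbors in 2D
x_i = np.array([0.0, 0.0])
neighbors = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])

# Difference vectors eta_l = x_i - x_{j_l}, stacked as rows
eta = x_i - neighbors               # shape (3, 2)

# Local Gram matrix: G[l, m] = eta_l^T eta_m
G = eta @ eta.T                     # shape (3, 3)

# Symmetric and positive semi-definite
assert np.allclose(G, G.T)
assert np.all(np.linalg.eigvalsh(G) >= -1e-12)

# Rank equals the dimension of the affine hull (here 2, since D = 2 < k = 3)
assert np.linalg.matrix_rank(G) == 2
```

With $k = 3$ neighbors in $D = 2$ dimensions the Gram matrix is necessarily rank-deficient, foreshadowing the need for regularization discussed later.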
We now seek to minimize the quadratic objective $\mathbf{w}_i^T \mathbf{G}^{(i)} \mathbf{w}_i$ subject to the linear constraint $\mathbf{1}^T \mathbf{w}_i = 1$.
Method: Lagrange Multipliers
The Lagrangian is: $$\mathcal{L}(\mathbf{w}_i, \lambda) = \mathbf{w}_i^T \mathbf{G}^{(i)} \mathbf{w}_i - \lambda(\mathbf{1}^T \mathbf{w}_i - 1)$$
First-Order Conditions:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_i} = 2\mathbf{G}^{(i)} \mathbf{w}_i - \lambda \mathbf{1} = \mathbf{0}$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = \mathbf{1}^T \mathbf{w}_i - 1 = 0$$
Solving for Weights:
From the first condition (assuming $\mathbf{G}^{(i)}$ is invertible): $$\mathbf{w}_i = \frac{\lambda}{2} (\mathbf{G}^{(i)})^{-1} \mathbf{1}$$
Substituting into the constraint: $$\mathbf{1}^T \mathbf{w}_i = \frac{\lambda}{2} \mathbf{1}^T (\mathbf{G}^{(i)})^{-1} \mathbf{1} = 1$$
Therefore: $$\lambda = \frac{2}{\mathbf{1}^T (\mathbf{G}^{(i)})^{-1} \mathbf{1}}$$
The Final Solution:
$$\mathbf{w}_i = \frac{(\mathbf{G}^{(i)})^{-1} \mathbf{1}}{\mathbf{1}^T (\mathbf{G}^{(i)})^{-1} \mathbf{1}}$$
This elegant formula shows that optimal weights are proportional to $\mathbf{G}^{-1}\mathbf{1}$, normalized to sum to one.
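A quick numerical sanity check of the closed form (hypothetical random neighborhood, not from the text): the closed-form weights should beat any other weight vector that satisfies the sum-to-one constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical neighborhood: one point with k = 3 neighbors in 3D
x_i = rng.normal(size=3)
neighbors = rng.normal(size=(3, 3))

eta = x_i - neighbors
G = eta @ eta.T
ones = np.ones(3)

# Closed-form optimal weights: w = G^{-1} 1 / (1^T G^{-1} 1)
w_star = np.linalg.solve(G, ones)
w_star /= w_star.sum()
assert np.isclose(w_star.sum(), 1.0)

# Any other weights summing to 1 give an equal or larger objective w^T G w
def eps(w):
    return w @ G @ w

for _ in range(200):
    w = rng.random(3)
    w /= w.sum()
    assert eps(w_star) <= eps(w) + 1e-10
```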
```python
import numpy as np
from scipy.spatial import distance_matrix


def compute_lle_weights(X, k=10, reg=1e-3):
    """
    Compute LLE reconstruction weights for all points.

    Parameters
    ----------
    X : ndarray of shape (N, D)
        Data matrix with N points in D dimensions
    k : int
        Number of nearest neighbors
    reg : float
        Regularization parameter for ill-conditioned Gram matrices

    Returns
    -------
    W : ndarray of shape (N, N)
        Weight matrix where W[i, j] = w_ij
    """
    N, D = X.shape

    # Compute pairwise distances
    dist = distance_matrix(X, X)

    # Initialize the weight matrix (dense here; the efficient
    # implementation later uses sparse storage)
    W = np.zeros((N, N))

    for i in range(N):
        # Find k nearest neighbors (excluding self)
        neighbor_indices = np.argsort(dist[i])[1:k + 1]

        # Difference vectors: each row is (x_i - x_j)
        x_i = X[i]
        X_neighbors = X[neighbor_indices]
        eta = x_i - X_neighbors          # Shape: (k, D)

        # Local Gram matrix: G[l, m] = eta_l^T @ eta_m
        G = eta @ eta.T                  # Shape: (k, k)

        # Regularization for numerical stability
        trace_G = np.trace(G)
        if trace_G > 0:
            G += reg * trace_G * np.eye(k)
        else:
            G += reg * np.eye(k)

        # Solve for weights: w = G^{-1} @ 1 / (1^T @ G^{-1} @ 1)
        ones = np.ones(k)
        try:
            # Solve G @ w_unnorm = ones, then normalize to sum to 1
            w_unnorm = np.linalg.solve(G, ones)
            w = w_unnorm / np.sum(w_unnorm)
        except np.linalg.LinAlgError:
            # Fallback: uniform weights if the Gram matrix is singular
            w = np.ones(k) / k
            print(f"Warning: Singular Gram matrix at point {i}")

        W[i, neighbor_indices] = w

    return W


def compute_reconstruction_error(X, W):
    """
    Compute the total reconstruction error for given weights.

    Error = sum_i ||x_i - sum_j w_ij * x_j||^2
    """
    N = X.shape[0]
    error = 0.0
    for i in range(N):
        reconstruction = W[i] @ X  # Weighted sum of all points
        residual = X[i] - reconstruction
        error += np.dot(residual, residual)
    return error


# Example usage
if __name__ == "__main__":
    # Generate Swiss Roll data
    np.random.seed(42)
    n_samples = 500
    t = 1.5 * np.pi * (1 + 2 * np.random.rand(n_samples))
    x = t * np.cos(t)
    y = 10 * np.random.rand(n_samples)
    z = t * np.sin(t)
    X = np.column_stack([x, y, z])

    # Compute weights with different k values
    for k in [5, 10, 15, 20]:
        W = compute_lle_weights(X, k=k)
        error = compute_reconstruction_error(X, W)
        print(f"k={k:2d}: Reconstruction error = {error:.4f}")
```

Instead of explicitly computing G⁻¹, we can solve the linear system G @ w = 1 for w and then normalize. This is numerically more stable and is the preferred implementation approach. The np.linalg.solve function uses LU decomposition, which is efficient and stable.
The closed-form solution $\mathbf{w}_i = \mathbf{G}^{-1}\mathbf{1} / (\mathbf{1}^T \mathbf{G}^{-1}\mathbf{1})$ has deep geometric meaning that illuminates how LLE captures local structure.
Interpretation 1: Barycentric Coordinates
When neighbors lie exactly in a $d$-dimensional affine subspace containing $\mathbf{x}_i$, the weights are the barycentric coordinates of $\mathbf{x}_i$ with respect to the neighbors: they sum to one and reconstruct the point exactly as an affine combination of the neighbor positions.
This connects LLE weights to classical computational geometry.
Interpretation 2: Projection onto Tangent Space
The weights describe how $\mathbf{x}_i$ projects onto the affine hull of its neighbors. If the neighbors span the local tangent space, the weights encode the local coordinate representation of $\mathbf{x}_i$.
Interpretation 3: Local Linear Regression
Finding weights that minimize reconstruction error is equivalent to solving a local linear regression problem: $$\min_\mathbf{w} \left\| \mathbf{X}_\text{neighbors}^T \mathbf{w} - \mathbf{x}_i \right\|^2 \quad \text{s.t. } \mathbf{1}^T \mathbf{w} = 1$$
This is constrained least squares in the local neighborhood.
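One way to confirm the equivalence numerically is to eliminate the constraint with a change of variables and compare against the Gram-matrix solution. A sketch with hypothetical data; the basis matrix `B` (columns summing to zero) is an illustrative construction, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
x_i = rng.normal(size=3)
N_mat = rng.normal(size=(3, 3))    # rows are the k = 3 neighbors

# Route 1: Gram-matrix closed form
eta = x_i - N_mat
G = eta @ eta.T
w_gram = np.linalg.solve(G, np.ones(3))
w_gram /= w_gram.sum()

# Route 2: constrained regression. Write w = e_k + B u, where the columns
# of B span {w : 1^T w = 0}, so the constraint holds for every u; then
# minimize ||N^T w - x_i|| as unconstrained least squares in u.
e_k = np.array([0.0, 0.0, 1.0])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])       # each column sums to zero
A = N_mat.T @ B                    # shape (D, k-1)
b = x_i - N_mat.T @ e_k
u = np.linalg.lstsq(A, b, rcond=None)[0]
w_reg = e_k + B @ u

# Both routes solve the same constrained least-squares problem
assert np.allclose(w_gram, w_reg, atol=1e-8)
assert np.isclose(w_reg.sum(), 1.0)
```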
| Weight Pattern | Geometric Meaning | Implication |
|---|---|---|
| Weights near 1/k (uniform) | x_i is near the centroid of neighbors | Point is in a 'flat' region |
| One weight ≈ 1, others ≈ 0 | x_i is very close to one neighbor | Possibly redundant data point |
| Negative weights | x_i requires extrapolation | Point is on the 'edge' of the local cloud |
| Large magnitude weights | Ill-conditioned local geometry | May need regularization |
| Weights decay with distance | Closer neighbors contribute more | Expected for well-sampled manifolds |
Negative Weights:
Unlike in some methods (e.g., kernel smoothing), LLE weights can be negative. This occurs when $\mathbf{x}_i$ is not in the convex hull of its neighbors—requiring extrapolation rather than interpolation.
Consider a simple 1D example: if neighbors are at positions 1 and 3, and $\mathbf{x}_i$ is at position 5, we need: $$w_1 \cdot 1 + w_2 \cdot 3 = 5 \quad \text{with } w_1 + w_2 = 1$$ Solving gives $w_1 = -1, w_2 = 2$—negative weight for the farther neighbor.
Negative weights are not errors; they naturally arise when the point lies outside the convex hull of its neighbors. However, many negative weights may indicate poor neighborhood selection.
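The 1D example above can be checked directly with the Gram-matrix machinery; a tiny trace-scaled ridge term handles the rank-1 Gram matrix:

```python
import numpy as np

# 1D example from the text: neighbors at 1 and 3, point at 5
neighbors = np.array([[1.0], [3.0]])
x_i = np.array([5.0])

eta = x_i - neighbors                      # [[4], [2]]
G = eta @ eta.T                            # [[16, 8], [8, 4]], rank 1
G += 1e-9 * np.trace(G) * np.eye(2)        # tiny ridge for invertibility

w = np.linalg.solve(G, np.ones(2))
w /= w.sum()

# Negative weight for the farther neighbor, as derived in the text
assert np.allclose(w, [-1.0, 2.0], atol=1e-6)
```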
Many negative weights can signal:
• $k$ too large for the local curvature, so the neighborhood is no longer approximately flat
• neighborhoods that reach across gaps or sparsely sampled regions of the manifold
• ill-conditioned local geometry that amplifies weight magnitudes
Monitoring weight distributions helps diagnose these LLE issues.
The local Gram matrix $\mathbf{G}^{(i)}$ can be ill-conditioned or singular, leading to numerical instability. This is especially common when $k > D$ (more neighbors than input dimensions), when neighbors are nearly collinear or coplanar, or when the data lie on a manifold of dimension much smaller than $k$.
The Regularization Approach:
Add a small value to the diagonal of the Gram matrix: $$\tilde{\mathbf{G}}^{(i)} = \mathbf{G}^{(i)} + \alpha \cdot \text{tr}(\mathbf{G}^{(i)}) \cdot \mathbf{I}$$
where $\alpha$ is a small regularization parameter (typically $10^{-3}$ to $10^{-5}$).
Why This Works:
Regularization is equivalent to adding a penalty term to the optimization: $$\min_\mathbf{w} \left( \mathbf{w}^T \mathbf{G} \mathbf{w} + \gamma \|\mathbf{w}\|^2 \right) \quad \text{s.t. } \mathbf{1}^T \mathbf{w} = 1$$
The penalty discourages weights with large magnitude, improving stability without significantly biasing the solution.
Scaling with Trace:
Multiplying the regularization by $\text{tr}(\mathbf{G})$ makes the regularization scale-invariant. This ensures consistent behavior regardless of the data's absolute scale.
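A short check of this scale invariance, using a hypothetical helper `lle_weights_local` (an illustration, not part of the earlier listing): rescaling the data scales $\mathbf{G}$ and its trace by the same factor, so the normalized weights are unchanged.

```python
import numpy as np

def lle_weights_local(x_i, neighbors, alpha=1e-3):
    """Trace-scaled regularized LLE weights for one neighborhood (sketch)."""
    k = len(neighbors)
    eta = x_i - neighbors
    G = eta @ eta.T
    G += alpha * np.trace(G) / k * np.eye(k)   # scale-invariant ridge
    w = np.linalg.solve(G, np.ones(k))
    return w / w.sum()

rng = np.random.default_rng(2)
x_i = rng.normal(size=3)
nbrs = rng.normal(size=(5, 3))       # k = 5 > D = 3: G singular without ridge

w1 = lle_weights_local(x_i, nbrs)
w2 = lle_weights_local(1000.0 * x_i, 1000.0 * nbrs)  # rescaled data

# Trace scaling makes the regularized weights identical under rescaling
assert np.allclose(w1, w2)
```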
```python
import numpy as np


def analyze_gram_conditioning(X, neighbor_indices):
    """
    Analyze the conditioning of local Gram matrices.
    Returns condition numbers and eigenvalue analysis.
    """
    condition_numbers = []
    min_eigenvalues = []
    max_eigenvalues = []

    for i, neighbors in enumerate(neighbor_indices):
        x_i = X[i]
        X_neighbors = X[neighbors]

        # Difference vectors and local Gram matrix
        eta = x_i - X_neighbors
        G = eta @ eta.T

        # Eigenvalue analysis
        eigvals = np.linalg.eigvalsh(G)
        eigvals_sorted = np.sort(eigvals)[::-1]  # Descending
        min_eigval = eigvals_sorted[-1]
        max_eigval = eigvals_sorted[0]

        # Condition number (ratio of largest to smallest eigenvalue)
        if min_eigval > 1e-15:
            cond = max_eigval / min_eigval
        else:
            cond = np.inf

        condition_numbers.append(cond)
        min_eigenvalues.append(min_eigval)
        max_eigenvalues.append(max_eigval)

    return {
        'condition_numbers': np.array(condition_numbers),
        'min_eigenvalues': np.array(min_eigenvalues),
        'max_eigenvalues': np.array(max_eigenvalues),
    }


def choose_regularization(G, method='trace'):
    """
    Automatically choose regularization strength.

    Methods:
    - 'trace': reg = alpha * trace(G) / k
    - 'eigval': reg based on eigenvalue gap analysis
    - 'fixed': use a fixed small value
    """
    k = G.shape[0]

    if method == 'trace':
        # Standard LLE regularization
        alpha = 1e-3
        trace_G = np.trace(G)
        reg = alpha * trace_G / k if trace_G > 0 else 1e-6
    elif method == 'eigval':
        # Analyze the eigenvalue spectrum; target a condition number of 1e6
        eigvals = np.linalg.eigvalsh(G)
        eigvals_sorted = np.sort(eigvals)[::-1]
        target_cond = 1e6
        min_eigval = max(eigvals_sorted[0] / target_cond, 1e-10)
        reg = max(0, min_eigval - eigvals_sorted[-1])
    elif method == 'fixed':
        reg = 1e-6
    else:
        raise ValueError(f"Unknown method: {method}")

    return reg


def solve_weights_robust(G, reg=None):
    """
    Solve for weights with robust handling of ill-conditioning.
    """
    k = G.shape[0]
    ones = np.ones(k)

    if reg is None:
        reg = choose_regularization(G)
    G_reg = G + reg * np.eye(k)

    # Use Cholesky decomposition for symmetric positive definite matrices
    try:
        L = np.linalg.cholesky(G_reg)
        # Solve L @ y = ones, then L.T @ w = y
        y = np.linalg.solve(L, ones)
        w_unnorm = np.linalg.solve(L.T, y)
    except np.linalg.LinAlgError:
        # Fall back to a least-squares solver
        w_unnorm = np.linalg.lstsq(G_reg, ones, rcond=None)[0]

    # Normalize to sum to 1
    return w_unnorm / np.sum(w_unnorm)
```

Having computed weights for each point individually, we assemble them into a global weight matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$.
Structure of W:
$$W_{ij} = \begin{cases} w_{ij} & \text{if } j \in \mathcal{N}_i \\ 0 & \text{otherwise} \end{cases}$$
Properties of W:
• Sparse: each row has at most $k$ nonzero entries
• Each row sums to one ($\mathbf{W}\mathbf{1} = \mathbf{1}$), though individual entries may be negative
• Generally asymmetric: $j \in \mathcal{N}_i$ does not imply $i \in \mathcal{N}_j$
The Reconstruction Cost in Matrix Form:
The total reconstruction error can be written: $$\mathcal{E} = \|(\mathbf{I} - \mathbf{W})\mathbf{X}\|_F^2 = \text{tr}\left( \mathbf{X}^T (\mathbf{I} - \mathbf{W})^T (\mathbf{I} - \mathbf{W}) \mathbf{X} \right)$$
Define $\mathbf{M} = (\mathbf{I} - \mathbf{W})^T (\mathbf{I} - \mathbf{W})$, which we'll see is crucial for the embedding phase.
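The per-point, Frobenius-norm, and trace forms of the cost all agree, which a small numerical check confirms (hypothetical weight matrix built ad hoc for the test):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, k = 30, 3, 5
X = rng.normal(size=(N, D))

# Hypothetical weight matrix with rows summing to one
W = np.zeros((N, N))
for i in range(N):
    dist = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(dist)[1:k + 1]
    w = rng.random(k)
    W[i, nbrs] = w / w.sum()

# Per-point sum of squared residuals...
per_point = sum(np.sum((X[i] - W[i] @ X) ** 2) for i in range(N))

# ...equals the matrix form ||(I - W) X||_F^2 ...
matrix_form = np.linalg.norm((np.eye(N) - W) @ X, 'fro') ** 2

# ...and the trace form tr(X^T M X) with M = (I - W)^T (I - W)
M = (np.eye(N) - W).T @ (np.eye(N) - W)
trace_form = np.trace(X.T @ M @ X)

assert np.allclose(per_point, matrix_form)
assert np.allclose(per_point, trace_form)
```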
The Matrix M:
$$\mathbf{M} = (\mathbf{I} - \mathbf{W})^T (\mathbf{I} - \mathbf{W})$$
This matrix has important properties:
• Symmetric: $\mathbf{M}^T = \mathbf{M}$
• Positive semi-definite: $\mathbf{v}^T \mathbf{M} \mathbf{v} = \|(\mathbf{I} - \mathbf{W})\mathbf{v}\|^2 \geq 0$
• Sparse, inheriting the neighborhood structure of $\mathbf{W}$
• Rows sum to zero: $\mathbf{M}\mathbf{1} = \mathbf{0}$, since $\mathbf{W}\mathbf{1} = \mathbf{1}$
The last property implies $\mathbf{1}$ is an eigenvector with eigenvalue 0. This is the same structure as graph Laplacians, connecting LLE to spectral graph theory.
Explicit Form of M:
$$M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_l W_{li} W_{lj}$$
where $\delta_{ij}$ is the Kronecker delta.
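Both the entrywise formula and the zero row-sum property can be confirmed numerically (the weight matrix below is a hypothetical stand-in whose rows sum to one):

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 12, 3

# Hypothetical weight matrix with rows summing to one
W = np.zeros((N, N))
for i in range(N):
    nbrs = rng.choice([j for j in range(N) if j != i], size=k, replace=False)
    w = rng.random(k)
    W[i, nbrs] = w / w.sum()

M = (np.eye(N) - W).T @ (np.eye(N) - W)

# Entrywise formula: M_ij = delta_ij - W_ij - W_ji + sum_l W_li W_lj
M_explicit = np.eye(N) - W - W.T + W.T @ W
assert np.allclose(M, M_explicit)

# Since each row of W sums to one, M annihilates the constant vector
assert np.allclose(M @ np.ones(N), 0.0)
```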
We don't need to explicitly compute the dense matrix M. In the embedding phase, we only need to apply (I - W)ᵀ(I - W) to vectors, which can be done efficiently using the sparse structure of W: first apply (I - W), then (I - W)ᵀ, each a sparse matrix-vector product over the O(Nk) nonzeros.
Total cost: O(Nk) per matrix-vector product.
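A sketch of this matrix-free application, checked against the dense computation (the random sparse matrix here is a stand-in for a real LLE weight matrix):

```python
import numpy as np
from scipy.sparse import random as sparse_random

def apply_M(W, v):
    """Apply M = (I - W)^T (I - W) to v using only sparse products."""
    u = v - W @ v        # (I - W) v    : O(Nk)
    return u - W.T @ u   # (I - W)^T u  : O(Nk)

# Hypothetical sparse matrix for the check
N = 50
W = sparse_random(N, N, density=0.1, format='csr', random_state=5)
v = np.random.default_rng(5).normal(size=N)

# Matrix-free result matches the explicit dense computation
M_dense = (np.eye(N) - W.toarray()).T @ (np.eye(N) - W.toarray())
assert np.allclose(apply_M(W, v), M_dense @ v)
```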
The quality of LLE depends critically on how we select neighbors. Poor neighborhood choice can violate the local linearity assumption.
Method 1: k-Nearest Neighbors (k-NN)
The most common approach: for each point $\mathbf{x}_i$, select the $k$ points closest in Euclidean distance.
Advantages:
• Every point gets exactly $k$ neighbors, so the local Gram matrix always has a fixed size
• A single, intuitive integer parameter
• The neighborhood radius adapts automatically to local sampling density
Disadvantages:
• In sparse regions, the $k$-th neighbor may be far away, violating the locality assumption
• The neighbor relation is asymmetric
Method 2: ε-Neighborhoods
Select all points within Euclidean distance $\epsilon$ of $\mathbf{x}_i$.
Advantages:
• Respects a natural geometric length scale
• The neighbor relation is symmetric: $j \in \mathcal{N}_i$ iff $i \in \mathcal{N}_j$
Disadvantages:
• Neighborhood sizes vary: too few neighbors in sparse regions, too many in dense ones
• $\epsilon$ is scale-dependent and harder to tune
| Aspect | k-NN | ε-Neighborhood | Recommendation |
|---|---|---|---|
| Neighborhood size | Fixed at k | Variable | Start with k-NN |
| Parameter tuning | Discrete, intuitive | Continuous, scale-dependent | k is easier to tune |
| Sparse regions | May include distant points | May have too few neighbors | Increase k or ε |
| Dense regions | May miss nearby points | May have too many neighbors | k-NN more stable |
| Short-circuit risk | Lower for small k | Lower for small ε | Validate neighborhood graph |
Choosing k:
The neighborhood size $k$ should satisfy:
• $k > d$, so the neighbors can span the local tangent space of the $d$-dimensional manifold
• $k$ small enough that the neighborhood remains approximately linear
• $k \ll N$, so neighborhoods stay local
Practical Guidelines:
| Dataset Size | Typical k Range | Notes |
|---|---|---|
| N < 100 | 3-10 | Very sensitive to k |
| N ∈ [100, 1000] | 5-20 | Most common regime |
| N ∈ [1000, 10000] | 10-50 | Can use larger k |
| N > 10000 | 10-100 | Computational considerations |
Validation Strategies:
• Sweep $k$ over a range and monitor the mean reconstruction error
• Track weight statistics (negative-weight counts, weight magnitudes) as diagnostics
• Inspect the neighborhood graph for short-circuit edges across manifold folds
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt


def analyze_neighborhood_selection(X, k_values):
    """
    Analyze reconstruction error and weight statistics
    across different neighborhood sizes.
    """
    results = []

    for k in k_values:
        # Find k nearest neighbors (k+1 because self is included)
        nn = NearestNeighbors(n_neighbors=k + 1)
        nn.fit(X)
        distances, indices = nn.kneighbors(X)

        # Remove self-connections
        neighbor_indices = indices[:, 1:]

        N = X.shape[0]
        reconstruction_errors = []
        weight_stats = []

        for i in range(N):
            neighbors = neighbor_indices[i]
            x_i = X[i]
            X_neighbors = X[neighbors]

            # Local Gram matrix with trace-scaled regularization
            eta = x_i - X_neighbors
            G = eta @ eta.T
            reg = 1e-3 * np.trace(G) / k
            G_reg = G + reg * np.eye(k)

            # Solve for normalized weights
            ones = np.ones(k)
            w = np.linalg.solve(G_reg, ones)
            w = w / np.sum(w)

            # Reconstruction error for this point
            reconstruction = X_neighbors.T @ w
            reconstruction_errors.append(np.linalg.norm(x_i - reconstruction))

            weight_stats.append({
                'min': np.min(w),
                'max': np.max(w),
                'negative_count': np.sum(w < 0),
                'magnitude': np.linalg.norm(w),
            })

        results.append({
            'k': k,
            'mean_error': np.mean(reconstruction_errors),
            'std_error': np.std(reconstruction_errors),
            'max_error': np.max(reconstruction_errors),
            'mean_negative_weights': np.mean(
                [s['negative_count'] for s in weight_stats]),
            'mean_weight_magnitude': np.mean(
                [s['magnitude'] for s in weight_stats]),
        })

    return results


def plot_k_analysis(results):
    """
    Visualize how LLE behavior changes with k.
    """
    k_values = [r['k'] for r in results]
    mean_errors = [r['mean_error'] for r in results]
    neg_weights = [r['mean_negative_weights'] for r in results]

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    axes[0].plot(k_values, mean_errors, 'o-')
    axes[0].set_xlabel('Number of Neighbors (k)')
    axes[0].set_ylabel('Mean Reconstruction Error')
    axes[0].set_title('Reconstruction Error vs. k')
    axes[0].grid(True, alpha=0.3)

    axes[1].plot(k_values, neg_weights, 's-', color='orange')
    axes[1].set_xlabel('Number of Neighbors (k)')
    axes[1].set_ylabel('Mean Number of Negative Weights')
    axes[1].set_title('Negative Weights vs. k')
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig
```

There is a deep connection between LLE's weight computation and Local PCA (computing PCA in each neighborhood). Understanding this connection provides additional insight into what the weights capture.
Local Covariance Matrix:
For point $i$ with neighbors $\{\mathbf{x}_{j_l}\}$, define the local covariance: $$\boldsymbol{\Sigma}_i = \frac{1}{k} \sum_{l=1}^k (\mathbf{x}_{j_l} - \bar{\mathbf{x}}_i)(\mathbf{x}_{j_l} - \bar{\mathbf{x}}_i)^T$$
where $\bar{\mathbf{x}}_i$ is the centroid of the neighborhood.
The Local Gram Matrix Revisited:
Recall the LLE Gram matrix: $$G^{(i)}_{lm} = (\mathbf{x}_i - \mathbf{x}_{j_l})^T (\mathbf{x}_i - \mathbf{x}_{j_m})$$
If we let $\boldsymbol{\eta}_l = \mathbf{x}_i - \mathbf{x}_{j_l}$, then: $$\mathbf{G}^{(i)} = \mathbf{H} \mathbf{H}^T$$
where $\mathbf{H}$ is the matrix with rows $\boldsymbol{\eta}_l^T$.
Eigenstructure:
The eigenvalues of $\mathbf{G}^{(i)}$ are the squared singular values of $\mathbf{H}$. These relate to the local covariance eigenvalues, revealing the local dimensionality structure.
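This eigenvalue-singular value correspondence is straightforward to confirm with a hypothetical difference-vector matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
k, D = 4, 6

# H stacks the difference vectors eta_l^T as rows
H = rng.normal(size=(k, D))
G = H @ H.T

# Eigenvalues of G are the squared singular values of H
eigvals = np.sort(np.linalg.eigvalsh(G))[::-1]
svals = np.linalg.svd(H, compute_uv=False)
assert np.allclose(eigvals, svals ** 2)
```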
Local PCA for Understanding Weights:
If we perform PCA on the local neighborhood, the leading eigenvectors of $\boldsymbol{\Sigma}_i$ estimate the tangent directions, while the trailing eigenvalues measure how far the neighborhood deviates from the tangent plane.
The reconstruction weights implicitly use this structure: they place more emphasis on neighbors that lie along principal directions.
If the local Gram matrix has rank r < k, it means the neighbors lie in an r-dimensional affine subspace. This happens when: • k > D (more neighbors than input dimensions) • Neighbors are collinear or coplanar • The local manifold has dimension less than k
This is why regularization is essential: it ensures the optimization has a unique solution.
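A minimal demonstration of this rank deficiency and its regularized fix, with hypothetical data where $k > D$:

```python
import numpy as np

rng = np.random.default_rng(7)
D, k = 2, 5                        # more neighbors than input dimensions

x_i = rng.normal(size=D)
neighbors = rng.normal(size=(k, D))

eta = x_i - neighbors
G = eta @ eta.T

# Rank is at most D < k, so G is singular
assert np.linalg.matrix_rank(G) <= D

# Trace-scaled regularization restores full rank, making the
# constrained optimum unique
G_reg = G + 1e-3 * np.trace(G) / k * np.eye(k)
assert np.linalg.matrix_rank(G_reg) == k

w = np.linalg.solve(G_reg, np.ones(k))
w /= w.sum()
assert np.isclose(w.sum(), 1.0)
```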
LTSA: A Closely Related Method
Local Tangent Space Alignment (LTSA) makes the connection to local PCA explicit: it computes a tangent-space basis in each neighborhood via local PCA and then aligns the resulting local coordinates into a single global embedding. LTSA and LLE have similar asymptotic behavior but differ in how they encode local structure: LTSA represents it through explicit tangent coordinates, while LLE encodes it implicitly through reconstruction weights.
When the intrinsic dimension is known and the manifold is well-sampled, LTSA can be more accurate. When intrinsic dimension is uncertain, LLE's implicit approach is more robust.
Implementing LLE efficiently requires attention to computational complexity and memory usage.
Computational Complexity:
| Step | Naive | Optimized |
|---|---|---|
| Neighbor search | O(N²D) | O(N log N · D) with KD-tree |
| Gram matrix computation | O(Nk²D) | O(Nk²D) — already optimal |
| Solving for weights | O(Nk³) | O(Nk³) — each solve is O(k³) |
| Total weight computation | O(N²D + Nk³) | O(N log N · D + Nk³) |
For large N with moderate k, the bottleneck is neighbor finding, which can be accelerated with spatial data structures.
Memory Considerations:
• The weight matrix has only $Nk$ nonzeros; store it in sparse format (e.g., CSR) rather than as a dense $N \times N$ array
• Each per-point solve needs only the $k \times k$ Gram matrix and the $k \times D$ difference vectors
Parallelization:
Weight computation is embarrassingly parallel: each point's weights depend only on its own neighborhood, so the loop over points can be distributed across cores or machines with no communication.
```python
import warnings

import numpy as np
from scipy.sparse import csr_matrix, identity, lil_matrix
from sklearn.neighbors import NearestNeighbors


class EfficientLLEWeights:
    """
    Efficient, production-ready LLE weight computation.

    Features:
    - KD-tree for O(N log N) neighbor search
    - Sparse weight matrix storage
    - Parallel neighbor search via n_jobs
    - Robust numerical handling
    """

    def __init__(self, n_neighbors=10, reg=1e-3, n_jobs=-1):
        """
        Parameters
        ----------
        n_neighbors : int
            Number of neighbors for each point
        reg : float
            Regularization strength
        n_jobs : int
            Number of parallel jobs (-1 for all cores)
        """
        self.n_neighbors = n_neighbors
        self.reg = reg
        self.n_jobs = n_jobs

    def fit(self, X):
        """
        Compute LLE weights for dataset X of shape (N, D).
        Returns self, with the sparse weight matrix in self.W_.
        """
        N, D = X.shape
        k = self.n_neighbors

        # Step 1: Find k nearest neighbors using a spatial tree
        nn = NearestNeighbors(
            n_neighbors=k + 1,   # +1 because self is included
            algorithm='auto',    # Chooses KD-tree or ball tree
            n_jobs=self.n_jobs,
        )
        nn.fit(X)
        _, indices = nn.kneighbors(X)

        # Remove self-connections
        self.neighbor_indices_ = indices[:, 1:]  # Shape: (N, k)

        # Step 2: Compute weights (sparse storage)
        W = lil_matrix((N, N))
        for i in range(N):
            neighbors = self.neighbor_indices_[i]
            w = self._compute_weights_single(X, i, neighbors)
            W[i, neighbors] = w

        # Convert to CSR for efficient arithmetic
        self.W_ = csr_matrix(W)
        return self

    def _compute_weights_single(self, X, i, neighbors):
        """Compute weights for a single point."""
        k = len(neighbors)
        x_i = X[i]
        X_neighbors = X[neighbors]

        # Difference vectors and local Gram matrix
        eta = x_i - X_neighbors   # Shape: (k, D)
        G = eta @ eta.T           # Shape: (k, k)

        # Trace-scaled regularization
        trace_G = np.trace(G)
        reg_value = self.reg * trace_G / k if trace_G > 0 else self.reg
        G += reg_value * np.eye(k)

        # Solve for weights and normalize
        ones = np.ones(k)
        try:
            w = np.linalg.solve(G, ones)
        except np.linalg.LinAlgError:
            warnings.warn(f"Singular Gram matrix at point {i}")
            w = np.ones(k)
        return w / np.sum(w)

    def get_cost_matrix(self):
        """
        Return the cost matrix M = (I - W)^T (I - W),
        computed entirely with sparse products.
        """
        if not hasattr(self, 'W_'):
            raise ValueError("Must call fit() first")
        I_minus_W = identity(self.W_.shape[0], format='csr') - self.W_
        return I_minus_W.T @ I_minus_W

    def reconstruction_error(self, X):
        """
        Compute total reconstruction error ||X - WX||_F^2.
        """
        if not hasattr(self, 'W_'):
            raise ValueError("Must call fit() first")
        residual = X - self.W_ @ X
        return np.sum(residual ** 2)


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_swiss_roll

    # Generate data
    X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

    # Compute weights
    lle_weights = EfficientLLEWeights(n_neighbors=12, reg=1e-3)
    lle_weights.fit(X)

    # Analyze
    W = lle_weights.W_
    print(f"Weight matrix shape: {W.shape}")
    print(f"Non-zero entries: {W.nnz}")
    print(f"Sparsity: {1 - W.nnz / np.prod(W.shape):.4f}")
    print(f"Reconstruction error: {lle_weights.reconstruction_error(X):.4f}")
```

We have developed a complete understanding of how LLE computes reconstruction weights that encode local linear geometry.
What's Next:
With weights computed, we move to the second phase of LLE: embedding optimization. We'll see how the weights guide discovery of low-dimensional coordinates that preserve local geometry. The optimization involves solving an eigenvalue problem on the cost matrix M = (I - W)ᵀ(I - W), connecting LLE to spectral methods and providing the final embedding.
You now have a comprehensive understanding of LLE weight computation—from the optimization formulation through closed-form solution, regularization, and efficient implementation. These weights are the bridge carrying local geometric information from high-dimensional to low-dimensional space.