Traditional clustering algorithms like K-Means operate in a geometric paradigm: they assume clusters are compact, convex regions in feature space that can be characterized by centroids or prototypes. This assumption works beautifully for well-separated, spherical clusters—but catastrophically fails when clusters have complex shapes, manifold structures, or non-convex boundaries.
Spectral clustering represents a fundamental paradigm shift. Instead of treating data points as coordinates in Euclidean space, we view them as nodes in a graph where edges encode pairwise similarities. Clustering then becomes graph partitioning: finding cuts that separate the graph into coherent subgraphs while minimizing the connections severed.
This reformulation unlocks remarkable capabilities. Spectral methods can discover clusters that wrap around each other, separate interlocking rings, and detect communities in complex networks—all tasks where geometric methods fail completely.
By the end of this page, you will understand: (1) Why clustering can be reformulated as a graph partitioning problem, (2) How similarity graphs encode the structure of data, (3) The different types of similarity graphs and their construction, (4) How graph cuts provide a mathematical objective for clustering, and (5) The intuition behind why graph-based methods succeed where geometric methods fail.
To understand why spectral clustering exists, we must first confront the fundamental limitations of geometric clustering approaches. Consider the K-Means objective:
$$\min_{\mu_1, \ldots, \mu_k} \sum_{i=1}^{n} \min_{j} \|x_i - \mu_j\|^2$$
This objective minimizes the sum of squared distances from each point to its nearest centroid. The implicit assumption is devastating: clusters are Voronoi cells—convex polytopes defined by proximity to centroids. Any cluster that cannot be expressed as a Voronoi cell is fundamentally invisible to K-Means.
The Two Moons Problem illustrates this limitation dramatically. Two crescent-shaped clusters interlock like parentheses. Geometrically, points from different clusters may be closer to each other than points within the same cluster. K-Means fails spectacularly, typically splitting each moon in half rather than separating them.
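To see this concretely, here is a minimal sketch (using scikit-learn, which the code examples later on this page also rely on) that runs K-Means on a two-moons sample and scores it against the true labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means with k=2: each centroid typically sits near the middle of one half
# of the plane, so its Voronoi cell slices through both moons.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Adjusted Rand Index: 1.0 would mean perfect agreement with the true moons.
print(f"ARI of K-Means on two moons: {adjusted_rand_score(y_true, labels):.2f}")
# Typically far below 1.0 -- each moon gets split roughly in half.
```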
The insight behind spectral clustering is profound: what matters for clustering is not absolute position, but connectivity patterns. If we construct a graph where similar points are connected, the cluster structure becomes visible as graph communities—densely connected subgraphs with sparse connections between them.
| Aspect | Geometric Paradigm (K-Means) | Graph Paradigm (Spectral) |
|---|---|---|
| Data representation | Points in ℝᵈ | Nodes in a graph G = (V, E) |
| Similarity concept | Distance to centroids | Edge weights between nodes |
| Cluster assumption | Convex, compact regions | Densely connected subgraphs |
| Objective function | Minimize within-cluster variance | Minimize cut weight between clusters |
| Handles non-convex shapes | No | Yes |
| Computational approach | Iterative centroid updates | Eigendecomposition of Laplacian |
| Parameter sensitivity | Sensitive to initialization | Sensitive to similarity metric |
The graph perspective reveals hidden structure. When we translate the two moons problem into a graph where each point connects to its nearest neighbors, a clear picture emerges: points within each moon form densely connected chains, while only a few edges bridge between moons. The "correct" clustering is now obvious: cut those few bridge edges.
This intuition generalizes powerfully. Any clustering problem can be reframed as: Given a similarity graph, find a partition that keeps similar points together while separating dissimilar ones. The mathematical machinery of spectral graph theory then provides elegant tools to find optimal partitions.
In spectral clustering, we never cluster directly in the original feature space. Instead, we: (1) build a similarity graph that encodes local neighborhood relationships, (2) analyze the graph's spectral properties (the eigenvalues and eigenvectors of its Laplacian), and (3) cluster in the resulting spectral embedding space. The spectral embedding "unfolds" complex structures into spaces where simple methods like K-Means work perfectly.
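As a sketch of this pipeline end to end, scikit-learn's `SpectralClustering` estimator (assumed available here, as in the other examples on this page) performs these steps internally: it builds a KNN similarity graph, computes the spectral embedding, and runs K-Means in that embedding:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# affinity='nearest_neighbors' builds the similarity graph internally;
# assign_labels='kmeans' clusters the rows of the spectral embedding.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, assign_labels='kmeans', random_state=42)
labels = sc.fit_predict(X)

print(f"ARI of spectral clustering on two moons: {adjusted_rand_score(y_true, labels):.2f}")
# Typically 1.00 -- the two moons are recovered exactly.
```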
The first step in spectral clustering is constructing a similarity graph $G = (V, E, W)$, where the vertices $V$ are the data points, the edges $E$ connect pairs of similar points, and the weights $W_{ij} \geq 0$ quantify how similar $x_i$ and $x_j$ are.
The choice of how to construct this graph profoundly influences clustering results. Three main approaches dominate practice, each with distinct characteristics and trade-offs.
Understanding the Gaussian Kernel Weight Function:
The Gaussian (RBF) similarity is the most widely used:
$$W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
This function has elegant properties: weights lie strictly between 0 and 1 for distinct points, decay smoothly with distance, and are governed by a single scale parameter $\sigma$. When $\|x_i - x_j\| \ll \sigma$, the weight is close to 1 (highly similar); when $\|x_i - x_j\| \gg \sigma$, the weight approaches 0 (effectively disconnected). A useful heuristic is to set $\sigma$ based on the distribution of pairwise distances—often the median or a percentile of the distance distribution.
```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import lil_matrix
from sklearn.neighbors import NearestNeighbors


class SimilarityGraph:
    """
    Construct different types of similarity graphs for spectral clustering.

    This class implements three fundamental graph construction strategies,
    each encoding different notions of local neighborhood structure.
    """

    def __init__(self, X):
        """Initialize with data matrix X of shape (n_samples, n_features)."""
        self.X = X
        self.n = X.shape[0]
        self.distances = None  # Cached pairwise distances

    def _compute_distances(self):
        """Compute and cache pairwise Euclidean distances."""
        if self.distances is None:
            self.distances = cdist(self.X, self.X, metric='euclidean')
        return self.distances

    def epsilon_neighborhood(self, epsilon):
        """
        Construct ε-neighborhood graph.

        Connect points i and j if ||x_i - x_j|| < epsilon.
        Returns an unweighted (binary) adjacency matrix.

        Properties:
        - Simple and intuitive
        - May create disconnected components if epsilon is too small
        - All neighborhoods share the same radius (may not suit varying density)
        """
        D = self._compute_distances()
        W = (D < epsilon).astype(float)
        np.fill_diagonal(W, 0)  # No self-loops
        return W

    def knn_graph(self, k, mode='connectivity', mutual=False):
        """
        Construct K-nearest neighbor graph.

        Parameters:
        -----------
        k : int
            Number of nearest neighbors
        mode : str
            'connectivity' for binary edges, 'distance' for weighted
        mutual : bool
            If True, connect i and j only if each is among the other's KNN.
            If False, connect if either is among the other's KNN (symmetric KNN).

        Properties:
        - Adapts to local density naturally
        - Guarantees each point has at least k neighbors
        - Produces sparse graphs (efficient for large n)
        """
        nn = NearestNeighbors(n_neighbors=k + 1)  # +1 so k true neighbors remain even if a point matches itself
        nn.fit(self.X)

        if mode == 'connectivity':
            A = nn.kneighbors_graph(mode='connectivity')  # Binary adjacency
        else:
            A = nn.kneighbors_graph(mode='distance')      # Distance-weighted

        # Remove any self-connections
        A = lil_matrix(A)
        A.setdiag(0)
        A = A.tocsr()

        if mutual:
            # Mutual KNN: keep an edge only if both directions are present
            W = A.minimum(A.T)
        else:
            # Symmetric KNN: keep an edge if either direction is present
            W = A.maximum(A.T)

        return W.toarray()

    def gaussian_kernel(self, sigma=None, sigma_percentile=50):
        """
        Construct fully connected graph with Gaussian (RBF) kernel weights.

        W_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))

        Parameters:
        -----------
        sigma : float or None
            Kernel bandwidth. If None, computed from the distance distribution.
        sigma_percentile : int
            If sigma is None, use this percentile of distances as sigma.

        Properties:
        - Soft notion of similarity (no hard cutoffs)
        - Dense matrix (O(n^2) storage)
        - Smooth, differentiable with respect to the data
        """
        D = self._compute_distances()

        if sigma is None:
            # Heuristic: use a percentile of the non-zero distances
            nonzero_dists = D[D > 0]
            sigma = np.percentile(nonzero_dists, sigma_percentile)

        # Gaussian kernel
        W = np.exp(-D**2 / (2 * sigma**2))
        np.fill_diagonal(W, 0)  # No self-loops
        return W, sigma

    def local_scaling_kernel(self, k=7):
        """
        Construct graph with local scaling (Zelnik-Manor & Perona, 2004).

        Uses local statistics to define point-specific scales:
        W_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j))
        where sigma_i is the distance to the k-th nearest neighbor of x_i.

        Properties:
        - Automatically adapts to local data density
        - No global bandwidth parameter to tune
        - Handles clusters of different scales
        """
        D = self._compute_distances()

        # Compute local scales (distance to the k-th neighbor)
        nn = NearestNeighbors(n_neighbors=k + 1)
        nn.fit(self.X)
        distances, _ = nn.kneighbors(self.X)
        sigma_local = distances[:, k]  # k-th neighbor distance

        # Outer product of local scales
        sigma_matrix = np.outer(sigma_local, sigma_local)
        sigma_matrix[sigma_matrix == 0] = 1e-10  # Avoid division by zero

        # Local scaling kernel
        W = np.exp(-D**2 / sigma_matrix)
        np.fill_diagonal(W, 0)
        return W, sigma_local


def demonstrate_graph_construction():
    """Demonstrate different graph constructions on the two moons dataset."""
    from sklearn.datasets import make_moons

    # Generate two moons data
    X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
    sg = SimilarityGraph(X)

    print("Graph Construction Comparison")
    print("=" * 50)

    # Epsilon neighborhood
    for eps in [0.2, 0.4, 0.6]:
        W = sg.epsilon_neighborhood(eps)
        n_edges = np.sum(W > 0) // 2
        no_isolated = np.all(np.sum(W, axis=1) > 0)
        print(f"ε-neighborhood (ε={eps}): {n_edges} edges, no isolated points={no_isolated}")
    print()

    # KNN graph
    for k in [5, 10, 15]:
        W = sg.knn_graph(k, mutual=False)
        n_edges = np.sum(W > 0) // 2
        print(f"KNN graph (k={k}): {n_edges} edges")
    print()

    # Gaussian kernel
    for sigma in [0.1, 0.3, 0.5]:
        W, _ = sg.gaussian_kernel(sigma=sigma)
        effective_edges = np.sum(W > 0.01) // 2  # Count "significant" edges
        print(f"Gaussian (σ={sigma}): {effective_edges} edges with W>0.01")

    return X, y_true, sg


if __name__ == "__main__":
    demonstrate_graph_construction()
```

The quality of spectral clustering depends heavily on the similarity graph. Poor graph construction can obscure cluster structure entirely. Key guidelines: (1) The graph should be connected—at minimum, contain a path between any two nodes; disconnected components become automatic clusters. (2) Edge weights should reflect a meaningful notion of similarity for your domain. (3) When in doubt, KNN graphs are often the most robust choice, adapting naturally to varying density.
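To check the first guideline programmatically, count the connected components of the graph you just built. A short sketch using SciPy and the `SimilarityGraph` class above (the specific k values are illustrative):

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
sg = SimilarityGraph(X)  # class defined in the code above

for k in [2, 5, 10]:
    W = sg.knn_graph(k)
    n_components, _ = connected_components(csr_matrix(W), directed=False)
    print(f"k={k}: {n_components} connected component(s)")

# If the count exceeds the number of clusters you expect, increase k (or ε / σ),
# or accept that each extra component will necessarily form its own cluster.
```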
With a similarity graph constructed, clustering becomes graph partitioning: finding a division of vertices into groups such that within-group edges have high weight and between-group edges have low weight.
The Minimum Cut Problem:
Given a partition of vertices into disjoint sets $A$ and $B$ (where $A \cup B = V$ and $A \cap B = \emptyset$), the cut is the total weight of edges crossing between them:
$$\text{cut}(A, B) = \sum_{i \in A, j \in B} W_{ij}$$
A natural objective is to minimize this cut—sever as few connections as possible. For multiple clusters $C_1, \ldots, C_k$, we minimize:
$$\text{cut}(C_1, \ldots, C_k) = \frac{1}{2} \sum_{i=1}^{k} W(C_i, \bar{C_i})$$
where $\bar{C_i}$ denotes the complement of cluster $C_i$ and $W(A, B) = \sum_{i \in A, j \in B} W_{ij}$.
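To make the definition concrete, the following short sketch evaluates the cut weight of a two-way partition directly from the weight matrix (the toy graph and partition are illustrative):

```python
import numpy as np

def cut_weight(W, in_A):
    """Total weight of edges crossing from A to its complement.

    W    : symmetric (n, n) weight matrix
    in_A : boolean array of length n, True for vertices in A
    """
    return W[np.ix_(in_A, ~in_A)].sum()

# Tiny example: two triangles (vertices 0-2 and 3-5) joined by one weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1  # the bridge

in_A = np.array([True, True, True, False, False, False])
print(cut_weight(W, in_A))  # 0.1 -- only the bridge edge is severed
```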
The Problem with Minimum Cut:
Unfortunately, naive minimum cut produces degenerate solutions. Consider a graph with 100 nodes and one outlier loosely connected to the main mass. The minimum cut would isolate that single outlier—technically minimizing cut weight but producing a useless "cluster" of size 1.
The minimum cut objective is biased toward cutting off small, isolated portions of the graph rather than finding balanced partitions. We need objective functions that account for cluster sizes.
Ratio Cut and Normalized Cut:
Two classical solutions normalize the cut by cluster size:
Ratio Cut (RatioCut): $$\text{RatioCut}(C_1, \ldots, C_k) = \sum_{i=1}^{k} \frac{\text{cut}(C_i, \bar{C_i})}{|C_i|}$$
Divides by the number of vertices in each cluster. Penalizes small clusters because $|C_i|$ appears in the denominator.
Normalized Cut (NCut): $$\text{NCut}(C_1, \ldots, C_k) = \sum_{i=1}^{k} \frac{\text{cut}(C_i, \bar{C_i})}{\text{vol}(C_i)}$$
where $\text{vol}(C) = \sum_{i \in C} d_i$ and $d_i = \sum_j W_{ij}$ is the degree of node $i$.
NCut normalizes by the total edge weight (volume) incident to each cluster, accounting for the connectivity of nodes rather than just their count. This is generally preferred and connects more naturally to random walk interpretations.
| Objective | Formula | Normalization | Properties |
|---|---|---|---|
| Minimum Cut | Σ cut(Cᵢ, C̄ᵢ) | None | Favors isolating small groups; often degenerate |
| Ratio Cut | Σ cut(Cᵢ, C̄ᵢ) / \|Cᵢ\| | Cluster size (# vertices) | Balanced in vertex count; ignores vertex degrees |
| Normalized Cut | Σ cut(Cᵢ, C̄ᵢ)/vol(Cᵢ) | Cluster volume (sum of degrees) | Balanced in connectivity; relates to random walks |
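To see these normalizations in action, the sketch below scores two candidate partitions of a toy graph: one that cuts a weak bridge between two dense groups, and one that isolates a loosely attached outlier (the graph and weights are illustrative):

```python
import numpy as np

def partition_objectives(W, labels):
    """Compute MinCut, RatioCut, and NCut for a partition given by integer labels."""
    degrees = W.sum(axis=1)
    mincut = ratiocut = ncut = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        cut_c = W[np.ix_(in_c, ~in_c)].sum()   # cut(C, C-bar)
        mincut += 0.5 * cut_c                  # the 1/2 avoids double-counting edges
        ratiocut += cut_c / in_c.sum()         # normalize by |C|
        ncut += cut_c / degrees[in_c].sum()    # normalize by vol(C)
    return mincut, ratiocut, ncut

# Toy graph: two 5-node cliques joined by a weak bridge, plus one loose outlier.
n = 11
W = np.zeros((n, n))
W[:5, :5] = 1.0
W[5:10, 5:10] = 1.0                            # dense intra-cluster weights
np.fill_diagonal(W, 0)
W[4, 5] = W[5, 4] = 0.1                        # bridge between the cliques
W[0, 10] = W[10, 0] = 0.05                     # outlier attached to one clique

balanced = np.array([0]*5 + [1]*5 + [0])       # split along the bridge
isolate  = np.array([0]*10 + [1])              # split off the outlier

for name, labels in [("bridge split", balanced), ("isolate outlier", isolate)]:
    print(name, np.round(partition_objectives(W, labels), 4))
# The raw cut is smaller for isolating the outlier (0.05 < 0.1),
# but RatioCut and NCut both favor the bridge split.
```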
Computational Complexity:
Here's the critical challenge: finding the exact minimum of RatioCut or NCut is NP-hard. We cannot efficiently solve these problems by exhaustive search over all possible partitions—the number of partitions grows exponentially with $n$.
This is where spectral methods enter. By relaxing the discrete optimization problem (partitions) to a continuous one (real-valued vectors), we transform an intractable combinatorial problem into an eigenvalue problem solvable in polynomial time. The eigenvectors of graph Laplacian matrices provide approximate solutions to RatioCut and NCut that are, in practice, remarkably effective.
This connection—between graph cuts and eigenvalue problems—is the mathematical heart of spectral clustering and the subject of the following pages.
The key insight enabling spectral clustering: (1) Express the cut objective in matrix form using indicator vectors, (2) Show this equals a Rayleigh quotient involving the graph Laplacian, (3) Relax discrete indicators to continuous vectors, (4) Solve the relaxed problem via eigendecomposition, (5) Discretize the continuous solution back to cluster assignments. This elegant strategy converts NP-hard combinatorics into tractable linear algebra.
To build deep intuition, let's examine the simplest case: partitioning a graph into exactly two clusters. This reveals the core mathematics before generalizing to k clusters.
Indicator Vector Representation:
We encode a partition $\{A, B\}$ using an indicator vector $f \in \mathbb{R}^n$:
$$f_i = \begin{cases} \sqrt{|B|/|A|} & \text{if } i \in A \\ -\sqrt{|A|/|B|} & \text{if } i \in B \end{cases}$$
This specific scaling ensures that $f$ is orthogonal to the constant vector and has squared norm $\|f\|^2 = n$. For the special case of equal-sized clusters it reduces to $f_i = +1$ if $i \in A$ and $f_i = -1$ if $i \in B$.
Connection to RatioCut:
A remarkable fact emerges: the RatioCut objective can be written as:
$$\text{RatioCut}(A, B) = \frac{f^T L f}{f^T f}$$
where $L$ is the graph Laplacian matrix (to be developed in the next page). This is a Rayleigh quotient—a ratio whose minimization over vectors is a classic eigenvalue problem!
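This identity is easy to verify numerically. The sketch below builds the scaled indicator vector for a two-way partition of a small graph and compares the Rayleigh quotient (using $L = D - W$, formally developed on the next page) with RatioCut computed directly:

```python
import numpy as np

# Small graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.2

L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian

in_A = np.array([True, True, True, False, False, False])
nA, nB = in_A.sum(), (~in_A).sum()

# Scaled indicator vector from the definition above
f = np.where(in_A, np.sqrt(nB / nA), -np.sqrt(nA / nB))

rayleigh = f @ L @ f / (f @ f)
ratiocut = W[np.ix_(in_A, ~in_A)].sum() * (1 / nA + 1 / nB)

print(f"Rayleigh quotient: {rayleigh:.4f}")   # both print the same value
print(f"RatioCut:          {ratiocut:.4f}")
```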
Why This Matters:
Minimizing the Rayleigh quotient $f^T L f / f^T f$ over all vectors $f$ orthogonal to the constant vector yields the eigenvector associated with the second-smallest eigenvalue of $L$—called the Fiedler vector.
The Fiedler vector provides a real-valued "ordering" of vertices that respects cluster structure:
For two moons data, the Fiedler vector might assign positive values to all points in one moon and negative values to all points in the other—even though the moons geometrically interlock!
The Spectral Clustering Recipe (Two Clusters): (1) construct a similarity graph and its weight matrix $W$, (2) form the unnormalized Laplacian $L = D - W$ from the degree matrix $D$, (3) compute the Fiedler vector (the eigenvector for the second-smallest eigenvalue of $L$), and (4) assign each point to a cluster according to the sign of its Fiedler-vector entry.
This elegant procedure transforms a combinatorial nightmare into straightforward linear algebra.
```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt


def spectral_two_clusters(W):
    """
    Spectral clustering for exactly two clusters using the Fiedler vector.

    This function demonstrates the core spectral clustering idea:
    the second eigenvector of the graph Laplacian encodes cluster structure.

    Parameters:
    -----------
    W : ndarray, shape (n, n)
        Symmetric weight/adjacency matrix with non-negative entries.
        W[i, j] represents the similarity between points i and j.

    Returns:
    --------
    labels : ndarray, shape (n,)
        Binary cluster assignments (0 or 1)
    fiedler_vector : ndarray, shape (n,)
        The second eigenvector (Fiedler vector)
    eigenvalues : ndarray
        First few eigenvalues of the Laplacian
    """
    n = W.shape[0]

    # Step 1: Compute the degree matrix
    # d_i = sum of all edge weights incident to node i
    d = np.sum(W, axis=1)
    D = np.diag(d)

    # Step 2: Compute the unnormalized graph Laplacian L = D - W
    # Key properties of L:
    # - Symmetric, positive semi-definite
    # - Smallest eigenvalue is 0, with a constant eigenvector
    # - Second-smallest eigenvalue is the algebraic connectivity
    L = D - W

    # Step 3: Eigendecomposition
    # We only need the smallest eigenvalues/eigenvectors;
    # scipy.sparse.linalg.eigsh would be more efficient for large sparse matrices
    eigenvalues, eigenvectors = eigh(L)

    # Step 4: Extract the Fiedler vector (2nd eigenvector)
    # Note: eigenvalues are sorted in ascending order
    fiedler_vector = eigenvectors[:, 1]

    # Step 5: Cluster assignment based on sign
    # Non-negative entries -> cluster 1, negative entries -> cluster 0
    labels = (fiedler_vector >= 0).astype(int)

    return labels, fiedler_vector, eigenvalues[:5]


def visualize_fiedler_clustering():
    """
    Demonstrate spectral clustering on the two moons dataset.

    This visualization shows:
    1. The original data with true labels
    2. The Fiedler vector values for each point
    3. The spectral clustering result
    """
    # Generate two moons dataset
    X, y_true = make_moons(n_samples=200, noise=0.06, random_state=42)

    # Construct similarity graph using a Gaussian kernel
    from scipy.spatial.distance import cdist
    distances = cdist(X, X)
    sigma = np.median(distances[distances > 0])
    W = np.exp(-distances**2 / (2 * sigma**2))
    np.fill_diagonal(W, 0)

    # Apply spectral clustering
    labels, fiedler, eigenvalues = spectral_two_clusters(W)

    # Calculate accuracy (accounting for a possible label flip)
    acc1 = np.mean(labels == y_true)
    acc2 = np.mean(labels != y_true)
    accuracy = max(acc1, acc2)

    print("Spectral Clustering Results (Two Moons)")
    print("=" * 50)
    print(f"First 5 eigenvalues: {eigenvalues}")
    print(f"Fiedler vector range: [{fiedler.min():.3f}, {fiedler.max():.3f}]")
    print(f"Clustering accuracy: {accuracy:.1%}")

    # The eigenvalue gap indicates cluster separation quality
    spectral_gap = eigenvalues[2] - eigenvalues[1]
    print(f"Spectral gap (λ₃ - λ₂): {spectral_gap:.4f}")

    # Create visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Plot 1: True labels
    axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='coolwarm', s=50, alpha=0.7)
    axes[0].set_title("True Labels")
    axes[0].set_aspect('equal')

    # Plot 2: Fiedler vector values
    scatter = axes[1].scatter(X[:, 0], X[:, 1], c=fiedler, cmap='RdBu', s=50, alpha=0.7)
    plt.colorbar(scatter, ax=axes[1], label='Fiedler value')
    axes[1].set_title("Fiedler Vector Values")
    axes[1].set_aspect('equal')

    # Plot 3: Spectral clustering result
    axes[2].scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm', s=50, alpha=0.7)
    axes[2].set_title(f"Spectral Clustering (Acc: {accuracy:.1%})")
    axes[2].set_aspect('equal')

    plt.tight_layout()
    plt.savefig('fiedler_clustering.png', dpi=150)
    plt.close()

    return X, y_true, labels, fiedler


if __name__ == "__main__":
    visualize_fiedler_clustering()
```

Notice what happened: the Fiedler vector transformed a problem where K-Means fails completely (non-convex, interlocking clusters) into a problem where simple sign-based thresholding achieves perfect separation. The spectral embedding has "unfolded" the data structure. This is the magic of spectral clustering—it finds representations where cluster structure becomes linearly separable.
The success of graph-based methods isn't accidental—it reflects deep mathematical principles about how graphs encode geometric and topological information.
Local Connectivity Preserves Global Structure:
Consider a point $x$ on one moon of the two moons dataset. Its K-nearest neighbors are all on the same moon (assuming noise is moderate). This point connects to its neighbors, those neighbors connect to their neighbors, and so on—forming a chain of connectivity tracing along the moon's shape.
The graph doesn't "know" about the moon's crescent shape in any explicit way. But the pattern of local connections implicitly encodes the manifold structure. When we analyze the graph spectrally, this structure is revealed and exploited.
Random Walk Interpretation:
A beautiful alternative view of spectral clustering comes from random walks. Imagine a random walker hopping between nodes, choosing the next node with probability proportional to edge weights.
The key insight: a random walker tends to stay within clusters if between-cluster edges are sparse. The walker will bounce around within densely connected regions before occasionally escaping to another.
The normalized cut objective exactly measures how quickly a random walk escapes from a cluster. Minimizing NCut finds partitions where random walks are "trapped" within clusters the longest—precisely what we want!
In probabilistic terms (NCut as an escape probability):
$$\frac{\text{cut}(A, \bar{A})}{\text{vol}(A)} = P(\text{transition out of } A \mid \text{currently in } A)$$
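This identity can be checked numerically: with transition matrix $P = D^{-1}W$ and stationary distribution $\pi_i = d_i / \text{vol}(V)$, the one-step probability of leaving $A$, conditioned on currently being in $A$, equals $\text{cut}(A, \bar{A})/\text{vol}(A)$. A small sketch on a random weighted graph:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric weight matrix on 8 nodes (weights in [0, 1))
W = rng.random((8, 8))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)

d = W.sum(axis=1)
P = W / d[:, None]              # row-stochastic transition matrix D^{-1} W
pi = d / d.sum()                # stationary distribution of the random walk

in_A = np.array([True] * 4 + [False] * 4)

# Left-hand side: cut(A, A-bar) / vol(A)
lhs = W[np.ix_(in_A, ~in_A)].sum() / d[in_A].sum()

# Right-hand side: P(next step leaves A | walk is currently in A) at stationarity
rhs = (pi[in_A, None] * P[np.ix_(in_A, ~in_A)]).sum() / pi[in_A].sum()

print(f"{lhs:.6f}  {rhs:.6f}")  # identical up to floating-point error
```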
Manifold Structure and Geodesic Distance:
Graph-based clustering also connects to manifold learning. When data lies on a curved manifold embedded in high-dimensional space, Euclidean distance can be misleading—two points might be close in ambient space but far apart along the manifold.
Similarity graphs with local connectivity (like KNN graphs) approximate the manifold's topology. The geodesic distance (shortest path along the manifold) is approximated by the graph distance (shortest path along edges). Spectral clustering thus respects manifold structure in ways Euclidean methods cannot.
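A small sketch of this effect on the two-moons data: find the Euclidean-closest pair of points lying on different moons, then compare their Euclidean distance with their shortest-path distance through a KNN graph (the neighborhood size is illustrative):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

euclid = cdist(X, X)
knn = kneighbors_graph(X, n_neighbors=10, mode='distance')
graph_dist = shortest_path(knn, directed=False)    # approximates geodesic distance

# Closest cross-moon pair in Euclidean terms
cross = np.where(y[:, None] != y[None, :], euclid, np.inf)
i, j = np.unravel_index(np.argmin(cross), cross.shape)

print(f"Closest cross-moon pair: Euclidean {euclid[i, j]:.2f}, "
      f"graph distance {graph_dist[i, j]:.2f}")
# Typically the Euclidean distance is small (the moons interlock) while the
# graph distance is much larger, or infinite if no cross-moon edges exist.
```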
Multi-Scale Information:
Eigenvectors of the graph Laplacian capture structure at different scales: the eigenvectors associated with the smallest eigenvalues encode coarse, global partitions (the Fiedler vector gives the coarsest two-way split), while eigenvectors for larger eigenvalues capture progressively finer subdivisions and local detail.
Using multiple eigenvectors for clustering ($k$ eigenvectors for $k$ clusters) leverages multi-scale information simultaneously, finding partitions that are optimal at both global and local levels.
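A sketch of that general recipe, using the unnormalized Laplacian $L = D - W$ for simplicity: take the eigenvectors for the $k$ smallest eigenvalues, treat their rows as new coordinates, and run K-Means in that embedding:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

def spectral_embedding_clustering(W, k):
    """Cluster a similarity graph into k groups via its Laplacian eigenvectors."""
    L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian L = D - W
    _, vecs = eigh(L)                       # eigenvalues in ascending order
    embedding = vecs[:, :k]                 # row i = spectral coordinates of point i
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

# Example: two moons with a Gaussian-kernel graph, as constructed earlier on this page
X, y = make_moons(n_samples=200, noise=0.06, random_state=42)
D = cdist(X, X)
sigma = np.median(D[D > 0])
W = np.exp(-D**2 / (2 * sigma**2))
np.fill_diagonal(W, 0)
labels = spectral_embedding_clustering(W, k=2)
```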
Spectral clustering works because it asks the right question: 'Which points belong together based on their pattern of connections?' rather than 'Which points are geometrically close?' This shift in perspective—from absolute position to relative connectivity—unlocks structure invisible to geometric methods.
Graph-based clustering extends far beyond standard data clustering, connecting to diverse fields and applications:
Image Segmentation:
Treat pixels as graph nodes, with edges connecting neighboring pixels weighted by color/texture similarity. Spectral clustering finds segments—contiguous regions of similar appearance. The normalized cut formulation was originally proposed for this application (Shi & Malik, 2000).
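A minimal sketch of this pipeline with scikit-learn's image utilities (`img_to_graph` and `spectral_clustering`, loosely following the pattern of the library's documented coin-segmentation example; the synthetic image and β value are illustrative):

```python
import numpy as np
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction.image import img_to_graph

# Tiny synthetic grayscale image: a bright square on a dark background
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
img += 0.05 * np.random.default_rng(0).standard_normal(img.shape)

# Nodes = pixels; edge weights derived from intensity gradients between neighbors
graph = img_to_graph(img)
beta = 10.0                                   # sharpness of the similarity
graph.data = np.exp(-beta * graph.data / graph.data.std())

labels = spectral_clustering(graph, n_clusters=2, eigen_solver='arpack', random_state=0)
segmentation = labels.reshape(img.shape)      # each pixel assigned to a segment
```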
Community Detection in Networks:
Social networks, citation networks, and biological networks are naturally graphs. Spectral clustering discovers communities—groups of nodes more connected internally than externally. This is the same mathematical problem as data clustering!
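As a sketch, the classic Zachary karate-club network (available through the `networkx` package, assumed installed) can be partitioned by passing its adjacency matrix to spectral clustering as a precomputed affinity:

```python
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering

G = nx.karate_club_graph()                     # 34 members, edges = social ties
A = nx.to_numpy_array(G)                       # adjacency matrix = similarity graph

labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(A)

# Compare against the club split recorded in the node attributes
true_split = np.array([G.nodes[i]['club'] == 'Mr. Hi' for i in G.nodes])
agreement = max(np.mean(labels == true_split), np.mean(labels != true_split))
print(f"Agreement with the real-world split: {agreement:.0%}")
```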
Natural Language Processing:
Document similarity graphs enable document clustering. Word co-occurrence graphs support semantic analysis. Spectral methods find topics and categories without explicit labels.
| Domain | Nodes | Edges (Similarity) | Application |
|---|---|---|---|
| Computer Vision | Pixels/superpixels | Color/texture similarity | Image segmentation, object detection |
| Social Networks | Users | Friendships/interactions | Community detection, influencer identification |
| Bioinformatics | Genes/proteins | Co-expression/interaction | Gene module discovery, protein complex detection |
| NLP | Documents/words | Semantic similarity | Topic modeling, document organization |
| Recommendation | Users/items | Interaction patterns | User segmentation, item categorization |
| Neuroscience | Brain regions | Functional connectivity | Brain network parcellation |
Connections to Other Methods:
Kernel K-Means: Spectral clustering is equivalent to kernel K-Means with specific kernels derived from the graph. The spectral embedding is the implicit feature space.
Laplacian Eigenmaps: A dimensionality reduction technique that uses the same eigenvector computation. Spectral clustering adds the final clustering step after embedding.
Diffusion Maps: Uses powers of the graph's transition matrix. Spectral clustering is related to the limit of diffusion as time goes to infinity.
Graph Neural Networks: Modern GNNs can be seen as learnable versions of spectral methods, with message passing replacing fixed Laplacian eigenvectors.
Understanding spectral clustering provides deep insight into this entire family of graph-based machine learning methods.
We have established the conceptual foundation for spectral clustering—the shift from geometric to graph-theoretic thinking about cluster structure. The key insights to carry forward: (1) geometric methods like K-Means assume convex, centroid-shaped clusters and fail on manifold-structured data; (2) similarity graphs (ε-neighborhood, KNN, Gaussian kernel) re-express clustering as graph partitioning; (3) the raw minimum cut degenerates into isolating outliers, so RatioCut and NCut normalize by cluster size or volume; (4) these normalized objectives are NP-hard, but relaxing them leads to an eigenvalue problem on the graph Laplacian; and (5) for two clusters, the sign pattern of the Fiedler vector already recovers structure invisible to geometric methods.
What's Next:
This page established the why and what of graph-based clustering. The next page dives into the how: the graph Laplacian matrix. We will formally define the Laplacian, derive its fundamental properties, and understand why its eigenvectors encode cluster structure. This mathematical foundation is essential for truly mastering spectral clustering.
You now understand the conceptual shift from geometric to graph-based clustering. The key insight is that cluster structure can be defined by connectivity patterns rather than geometric proximity. This perspective enables discovery of clusters invisible to methods like K-Means. Next, we formalize this intuition through the mathematical machinery of graph Laplacians.