Label propagation treats the graph as fixed infrastructure through which labels flow. But what if we could simultaneously learn what to predict and how to use the graph structure? This leads to a richer family of graph-based methods that integrate graph regularization directly into the learning objective.
The key insight: high-dimensional data often lies on a low-dimensional manifold. A 28×28 image has 784 dimensions, but valid handwritten digits occupy a tiny subspace. Graph-based methods approximate this manifold with a nearest-neighbor graph and enforce that predictions vary smoothly along it.
By the end of this page, you will master manifold regularization, understand the semi-supervised SVM, explore graph neural networks for SSL, and learn about methods that adaptively construct graphs from data.
Manifold regularization (Belkin et al., 2006) unifies supervised learning with graph-based smoothness. The objective combines three terms:
$$\min_f \underbrace{\sum_{i \in L} \mathcal{L}(f(x_i), y_i)}_{\text{Supervised Loss}} + \underbrace{\lambda_A \|f\|^2_{\mathcal{H}}}_{\text{Ambient Regularization}} + \underbrace{\lambda_I \sum_{i,j} w_{ij}(f(x_i) - f(x_j))^2}_{\text{Intrinsic Regularization}}$$
Terms Explained:

- Supervised loss: the standard loss $\mathcal{L}$ evaluated only on the labeled examples $i \in L$.
- Ambient regularization: $\lambda_A \|f\|^2_{\mathcal{H}}$ penalizes the complexity of $f$ in the ambient RKHS, just as in ordinary kernel methods.
- Intrinsic regularization: $\lambda_I \sum_{i,j} w_{ij}(f(x_i) - f(x_j))^2$ penalizes predictions that change rapidly between neighboring points on the graph. This is where the unlabeled data enters: the weights $w_{ij}$ are computed over labeled and unlabeled points alike.
Graph Laplacian Formulation:
The intrinsic term can be written using the graph Laplacian: $$\sum_{i,j} w_{ij}(f_i - f_j)^2 = 2 \cdot f^T L f$$
where $L = D - W$ is the unnormalized Laplacian. This quadratic form measures the "smoothness" of $f$ over the graph—zero if all connected nodes have identical predictions.
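As a quick numerical check of this identity, the sketch below compares both sides on a small random graph (the graph and prediction vector are purely illustrative):

```python
import numpy as np

# Illustrative check: the pairwise smoothness sum equals 2 * f^T L f.
rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2                 # symmetric edge weights
np.fill_diagonal(W, 0)            # no self-loops

L = np.diag(W.sum(axis=1)) - W    # unnormalized Laplacian L = D - W
f = rng.standard_normal(n)        # arbitrary prediction vector

lhs = sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))
assert np.isclose(lhs, 2 * f @ L @ f)
```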
Laplacian SVM (LapSVM):
Applying manifold regularization to SVMs: $$\min_{f \in \mathcal{H}_K} \frac{1}{|L|} \sum_{i \in L} \max(0, 1 - y_i f(x_i)) + \lambda_A \|f\|_K^2 + \lambda_I f^T L f$$
This semi-supervised SVM enforces margin on labeled data while keeping predictions smooth on the manifold.
```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.base import BaseEstimator, ClassifierMixin


class ManifoldRegularizedClassifier(BaseEstimator, ClassifierMixin):
    """
    Manifold regularization for semi-supervised classification.
    Uses closed-form solution for squared loss.
    """

    def __init__(self, lambda_a=1.0, lambda_i=1.0, k_neighbors=10,
                 kernel='rbf', gamma=1.0):
        self.lambda_a = lambda_a  # Ambient regularization
        self.lambda_i = lambda_i  # Intrinsic (graph) regularization
        self.k_neighbors = k_neighbors
        self.kernel = kernel
        self.gamma = gamma

    def _build_graph_laplacian(self, X):
        """Build kNN graph Laplacian."""
        n = len(X)
        distances = cdist(X, X, metric='euclidean')

        # k-NN affinity
        W = np.zeros((n, n))
        for i in range(n):
            neighbors = np.argsort(distances[i])[1:self.k_neighbors + 1]
            W[i, neighbors] = np.exp(-self.gamma * distances[i, neighbors]**2)

        W = (W + W.T) / 2  # Symmetrize
        D = np.diag(W.sum(axis=1))
        L = D - W
        return L

    def _compute_kernel(self, X1, X2):
        """RBF kernel matrix."""
        distances = cdist(X1, X2, metric='euclidean')
        return np.exp(-self.gamma * distances**2)

    def fit(self, X_labeled, y_labeled, X_unlabeled):
        """
        Fit with manifold regularization.
        Uses closed-form solution for squared loss.
        """
        X_all = np.vstack([X_labeled, X_unlabeled])
        n_labeled = len(X_labeled)
        n_total = len(X_all)

        # Build graph over all data
        L = self._build_graph_laplacian(X_all)

        # Kernel matrix
        K = self._compute_kernel(X_all, X_all)

        # Label vector (0 for unlabeled)
        Y = np.zeros(n_total)
        Y[:n_labeled] = y_labeled

        # Diagonal matrix: 1 for labeled, 0 for unlabeled
        J = np.zeros((n_total, n_total))
        np.fill_diagonal(J[:n_labeled, :n_labeled], 1)

        # Closed-form: alpha = (JK + lambda_a*I + lambda_i*LK)^{-1} Y
        A = J @ K + self.lambda_a * np.eye(n_total) + self.lambda_i * L @ K
        self.alpha_ = np.linalg.solve(A, Y)
        self.X_train_ = X_all
        return self

    def predict(self, X):
        """Predict using learned weights."""
        K = self._compute_kernel(X, self.X_train_)
        scores = K @ self.alpha_
        return np.sign(scores)

    def decision_function(self, X):
        K = self._compute_kernel(X, self.X_train_)
        return K @ self.alpha_
```
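A usage sketch for the classifier above on synthetic two-moons data; the dataset, the ±1 label encoding, and the hyperparameter values are illustrative choices, not part of the method:

```python
import numpy as np
from sklearn.datasets import make_moons

# Two-moons data: a handful of labeled points, the rest unlabeled.
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
y = 2 * y - 1                                  # map {0, 1} -> {-1, +1} for np.sign

labeled_idx = np.random.RandomState(0).choice(len(X), size=10, replace=False)
unlabeled_mask = np.ones(len(X), dtype=bool)
unlabeled_mask[labeled_idx] = False

clf = ManifoldRegularizedClassifier(lambda_a=1e-3, lambda_i=1e-2,
                                    k_neighbors=10, gamma=5.0)
clf.fit(X[labeled_idx], y[labeled_idx], X[unlabeled_mask])

accuracy = np.mean(clf.predict(X) == y)        # transductive accuracy over all points
print(f"Accuracy: {accuracy:.2f}")
```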
The Transductive SVM (Joachims, 1999) extends the max-margin principle to unlabeled data. The objective finds a labeling of unlabeled points that maximizes the margin:

$$\min_{w, b, \hat{y}_U} \frac{1}{2}\|w\|^2 + C_L \sum_{i \in L} \xi_i + C_U \sum_{j \in U} \xi_j$$
subject to $y_i(w^\top x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for labeled points $i \in L$, and $\hat{y}_j(w^\top x_j + b) \geq 1 - \xi_j$, $\xi_j \geq 0$, $\hat{y}_j \in \{-1, +1\}$ for unlabeled points $j \in U$.
The labels $\hat{y}_U$ are optimization variables! We search for the labeling that yields maximum margin.
TSVM is NP-hard because we're optimizing over discrete labels. Practical algorithms use iterative approaches: start with initial labels from a standard SVM, then refine by swapping labels that would increase the margin. Convergence to the global optimum is not guaranteed.
The Low-Density Separation Principle:
TSVM implements the assumption that decision boundaries should pass through low-density regions. By maximizing margin on unlabeled data, we push the boundary away from all data points, labeled or unlabeled, naturally placing it where data is sparse.
Balancing Constraint:
Without constraints, TSVM might assign all unlabeled points to one class. A balancing constraint ensures: $$\frac{1}{|U|} \sum_{j \in U} \hat{y}_j \approx \text{class prior}$$
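A minimal sketch of such an iterative scheme, using scikit-learn's SVC with sample weights as the base learner; the helper name, the annealing schedule, and the quantile-based balancing step are simplifications of Joachims' original algorithm:

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_heuristic(X_l, y_l, X_u, C_l=1.0, C_u_final=1.0, n_steps=5, prior=0.5):
    """Approximate TSVM: pseudo-label unlabeled data while respecting the class
    prior, then retrain as the unlabeled weight C_u is gradually increased."""
    svm = SVC(kernel='linear', C=C_l).fit(X_l, y_l)

    # Initial pseudo-labels: threshold decision values so that roughly
    # `prior` of the unlabeled points are assigned +1 (balancing constraint).
    scores = svm.decision_function(X_u)
    threshold = np.quantile(scores, 1 - prior)
    y_u = np.where(scores >= threshold, 1, -1)

    for step in range(1, n_steps + 1):
        C_u = C_u_final * step / n_steps          # anneal unlabeled influence
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        weights = np.concatenate([np.full(len(y_l), C_l),
                                  np.full(len(y_u), C_u)])
        svm = SVC(kernel='linear', C=1.0).fit(X_all, y_all, sample_weight=weights)

        # Re-label the unlabeled points, again enforcing the balance constraint.
        scores = svm.decision_function(X_u)
        threshold = np.quantile(scores, 1 - prior)
        y_u = np.where(scores >= threshold, 1, -1)

    return svm, y_u
```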
Graph Neural Networks (GNNs) learn to propagate and transform information through graph structure. For semi-supervised node classification, they naturally leverage unlabeled nodes:
Graph Convolutional Network (GCN):
Kipf & Welling (2016) proposed the layer-wise propagation rule: $$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $W^{(l)}$ is the layer's learnable weight matrix.
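The propagation rule depends on the symmetrically normalized adjacency, which is typically precomputed once. A small sketch (dense tensors for clarity; production code would use sparse operations):

```python
import torch

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Compute D^{-1/2} (A + I) D^{-1/2} for the GCN propagation rule."""
    A_tilde = A + torch.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(dim=1)                 # degrees are >= 1 thanks to self-loops
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt
```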
Why GCNs Work for SSL:

Although only labeled nodes enter the loss, every node participates in message passing: unlabeled neighbors contribute their features to the representations of labeled nodes, and the weights learned from the labeled loss are shared across the entire graph. The repeated neighborhood averaging also acts as a smoothing operator, so connected nodes tend to receive similar predictions, echoing the explicit Laplacian penalty from manifold regularization.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """Single Graph Convolutional Layer."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, X, A_norm):
        """
        Args:
            X: Node features (n_nodes, in_features)
            A_norm: Normalized adjacency (D^{-1/2} A D^{-1/2})
        """
        # Aggregate: A_norm @ X
        # Transform: linear projection
        return self.linear(A_norm @ X)


class GCN(nn.Module):
    """2-layer GCN for semi-supervised node classification."""

    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GCNLayer(n_features, n_hidden)
        self.gc2 = GCNLayer(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, X, A_norm):
        H = F.relu(self.gc1(X, A_norm))
        H = F.dropout(H, p=self.dropout, training=self.training)
        return self.gc2(H, A_norm)


def train_gcn_ssl(model, X, A_norm, y_train, train_mask, epochs=200):
    """
    Train GCN with semi-supervised loss.
    Only labeled nodes contribute to loss,
    but all nodes participate in message passing.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()

        # Forward pass uses ALL nodes
        logits = model(X, A_norm)

        # Loss only on labeled nodes
        loss = F.cross_entropy(logits[train_mask], y_train[train_mask])
        loss.backward()
        optimizer.step()

    return model
```

Several GNN variants extend this recipe for semi-supervised node classification:

| Method | Key Innovation | Aggregation |
|---|---|---|
| GCN | Spectral convolutions simplified | Mean of neighbors |
| GAT | Attention-weighted aggregation | Learned attention weights |
| GraphSAGE | Sampling for scalability | Sample + aggregate |
| SGC | Remove non-linearities | Pre-computed k-hop |
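As a concrete illustration of the last row, SGC precomputes the propagation so that only a linear classifier needs training. A minimal sketch, assuming a dense `A_norm` and the same tensors as in the GCN example above:

```python
import torch.nn as nn
import torch.nn.functional as F

def sgc_precompute(X, A_norm, k=2):
    """SGC preprocessing: apply k hops of propagation once, before any training."""
    for _ in range(k):
        X = A_norm @ X
    return X

# Usage sketch (X, A_norm, y_train, train_mask as in the GCN example above):
# X_prop = sgc_precompute(X, A_norm, k=2)
# linear = nn.Linear(X_prop.shape[1], n_classes)
# loss = F.cross_entropy(linear(X_prop)[train_mask], y_train[train_mask])
```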
So far, we've treated the graph as given (constructed from distances). But what if the graph is suboptimal? Recent methods learn the graph jointly with the classifier:
Graph Learning for SSL:
$$\min_{f, W} \mathcal{L}_{\text{supervised}}(f) + \lambda_1 \cdot \text{smoothness}(f, W) + \lambda_2 \cdot \text{graph regularization}(W)$$
The graph $W$ is now an optimization variable, learned to make the smoothness assumption hold.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableGraphSSL(nn.Module):
    """
    Joint learning of graph structure and classifier.
    """

    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()

        # Graph learning parameters
        self.edge_mlp = nn.Sequential(
            nn.Linear(n_features * 2, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
            nn.Sigmoid()
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes)
        )

    def compute_graph(self, X, k=10):
        """Learn edge weights between all pairs."""
        n = X.shape[0]

        # Compute pairwise edge features
        X_i = X.unsqueeze(1).expand(n, n, -1)
        X_j = X.unsqueeze(0).expand(n, n, -1)
        edge_features = torch.cat([X_i, X_j], dim=-1)

        # Edge weights from MLP
        W = self.edge_mlp(edge_features).squeeze(-1)

        # Sparsify: keep top-k per node
        _, topk_idx = W.topk(k, dim=1)
        mask = torch.zeros_like(W)
        mask.scatter_(1, topk_idx, 1)
        W = W * mask

        # Symmetrize
        W = (W + W.T) / 2
        return W

    def forward(self, X, train_mask, y_train, lambda_smooth=0.1):
        # Learn graph
        W = self.compute_graph(X)

        # Graph Laplacian
        D = torch.diag(W.sum(dim=1))
        L = D - W

        # Predictions
        logits = self.classifier(X)
        probs = F.softmax(logits, dim=1)

        # Supervised loss
        loss_sup = F.cross_entropy(logits[train_mask], y_train[train_mask])

        # Smoothness loss: predictions should be similar for high W
        loss_smooth = 0
        for c in range(logits.shape[1]):
            f = probs[:, c]
            loss_smooth += f @ L @ f

        return logits, loss_sup + lambda_smooth * loss_smooth
```
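A minimal training-loop sketch for the model above; the optimizer settings, epoch count, and input tensors (`X`, `y_train`, `train_mask`) are assumptions for illustration:

```python
import torch

# Assumed inputs: X is a (n_nodes, n_features) float tensor, y_train a (n_nodes,)
# long tensor of labels, and train_mask a boolean tensor marking labeled nodes.
model = LearnableGraphSSL(n_features=X.shape[1], n_hidden=64,
                          n_classes=int(y_train[train_mask].max()) + 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    logits, loss = model(X, train_mask, y_train, lambda_smooth=0.1)
    loss.backward()                     # gradients flow into both the edge MLP
    optimizer.step()                    # and the classifier

model.eval()
with torch.no_grad():
    logits, _ = model(X, train_mask, y_train)
    predictions = logits.argmax(dim=1)  # labels for every node, incl. unlabeled
```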
Learn the graph when: (1) features are high-dimensional and Euclidean distance is unreliable, (2) the task-relevant similarity differs from generic feature similarity, (3) you have enough labeled data to guide graph learning. For small labeled sets, fixed kNN graphs often work better.

What's Next:
We've explored discriminative approaches to semi-supervised learning. But there's another powerful paradigm: generative approaches that model the joint distribution $P(X, Y)$. By modeling how data is generated, we can leverage unlabeled data to improve our understanding of the feature space, which in turn improves classification.
You now understand advanced graph-based methods: manifold regularization, transductive SVMs, GNNs for SSL, and learned graph construction. These techniques represent the state of the art in leveraging structure for semi-supervised learning.