Label propagation treats the graph as fixed infrastructure through which labels flow. But what if we could simultaneously learn what to predict and how to use the graph structure? This leads to a richer family of graph-based methods that integrate graph regularization directly into the learning objective.
The key insight: high-dimensional data often lies on a low-dimensional manifold. A 28×28 image has 784 dimensions, but valid handwritten digits occupy a tiny subspace. Graph-based methods approximate this manifold with a nearest-neighbor graph and enforce that predictions vary smoothly along it.
By the end of this page, you will master manifold regularization, understand the semi-supervised SVM, explore graph neural networks for SSL, and learn about methods that adaptively construct graphs from data.
Manifold regularization (Belkin et al., 2006) unifies supervised learning with graph-based smoothness. The objective combines three terms:
$$\min_f \underbrace{\sum_{i \in L} \mathcal{L}(f(x_i), y_i)}_{\text{Supervised Loss}} + \underbrace{\lambda_A \|f\|^2_{\mathcal{H}}}_{\text{Ambient Regularization}} + \underbrace{\lambda_I \sum_{i,j} w_{ij}(f(x_i) - f(x_j))^2}_{\text{Intrinsic Regularization}}$$
Terms Explained:

- Supervised loss: the standard loss $\mathcal{L}$ evaluated only on the labeled examples $i \in L$.
- Ambient regularization: $\lambda_A \|f\|^2_{\mathcal{H}}$ penalizes the complexity of $f$ in the ambient RKHS, just as in ordinary kernel methods.
- Intrinsic regularization: $\lambda_I \sum_{i,j} w_{ij}(f(x_i) - f(x_j))^2$ penalizes predictions that change rapidly between neighboring points on the graph. This is where the unlabeled data enters: the weights $w_{ij}$ are computed over labeled and unlabeled points alike.
Graph Laplacian Formulation:
The intrinsic term can be written using the graph Laplacian: $$\sum_{i,j} w_{ij}(f_i - f_j)^2 = 2 \cdot f^T L f$$
where $L = D - W$ is the unnormalized Laplacian. This quadratic form measures the "smoothness" of $f$ over the graph—zero if all connected nodes have identical predictions.
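As a quick numerical check of this identity, the sketch below compares both sides on a small random graph (the graph and prediction vector are purely illustrative):

```python
import numpy as np

# Illustrative check: the pairwise smoothness sum equals 2 * f^T L f.
rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2                 # symmetric edge weights
np.fill_diagonal(W, 0)            # no self-loops

L = np.diag(W.sum(axis=1)) - W    # unnormalized Laplacian L = D - W
f = rng.standard_normal(n)        # arbitrary prediction vector

lhs = sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))
assert np.isclose(lhs, 2 * f @ L @ f)
```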
Laplacian SVM (LapSVM):
Applying manifold regularization to SVMs: $$\min_{f \in \mathcal{H}_K} \frac{1}{|L|} \sum_{i \in L} \max(0, 1 - y_i f(x_i)) + \lambda_A \|f\|_K^2 + \lambda_I f^T L f$$
This semi-supervised SVM enforces margin on labeled data while keeping predictions smooth on the manifold.
```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.base import BaseEstimator, ClassifierMixin


class ManifoldRegularizedClassifier(BaseEstimator, ClassifierMixin):
    """
    Manifold regularization for semi-supervised classification.
    Uses closed-form solution for squared loss.
    """

    def __init__(self, lambda_a=1.0, lambda_i=1.0, k_neighbors=10,
                 kernel='rbf', gamma=1.0):
        self.lambda_a = lambda_a  # Ambient regularization
        self.lambda_i = lambda_i  # Intrinsic (graph) regularization
        self.k_neighbors = k_neighbors
        self.kernel = kernel
        self.gamma = gamma

    def _build_graph_laplacian(self, X):
        """Build kNN graph Laplacian."""
        n = len(X)
        distances = cdist(X, X, metric='euclidean')

        # k-NN affinity
        W = np.zeros((n, n))
        for i in range(n):
            neighbors = np.argsort(distances[i])[1:self.k_neighbors + 1]
            W[i, neighbors] = np.exp(-self.gamma * distances[i, neighbors]**2)

        W = (W + W.T) / 2  # Symmetrize
        D = np.diag(W.sum(axis=1))
        L = D - W
        return L

    def _compute_kernel(self, X1, X2):
        """RBF kernel matrix."""
        distances = cdist(X1, X2, metric='euclidean')
        return np.exp(-self.gamma * distances**2)

    def fit(self, X_labeled, y_labeled, X_unlabeled):
        """
        Fit with manifold regularization.
        Uses closed-form solution for squared loss.
        """
        X_all = np.vstack([X_labeled, X_unlabeled])
        n_labeled = len(X_labeled)
        n_total = len(X_all)

        # Build graph over all data
        L = self._build_graph_laplacian(X_all)

        # Kernel matrix
        K = self._compute_kernel(X_all, X_all)

        # Label vector (0 for unlabeled)
        Y = np.zeros(n_total)
        Y[:n_labeled] = y_labeled

        # Diagonal matrix: 1 for labeled, 0 for unlabeled
        J = np.zeros((n_total, n_total))
        np.fill_diagonal(J[:n_labeled, :n_labeled], 1)

        # Closed-form: alpha = (JK + lambda_a*I + lambda_i*LK)^{-1} Y
        A = J @ K + self.lambda_a * np.eye(n_total) + self.lambda_i * L @ K
        self.alpha_ = np.linalg.solve(A, Y)
        self.X_train_ = X_all
        return self

    def predict(self, X):
        """Predict using learned weights."""
        K = self._compute_kernel(X, self.X_train_)
        scores = K @ self.alpha_
        return np.sign(scores)

    def decision_function(self, X):
        K = self._compute_kernel(X, self.X_train_)
        return K @ self.alpha_
```
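A usage sketch for the classifier above on synthetic two-moons data; the dataset, the ±1 label encoding, and the hyperparameter values are illustrative choices, not part of the method:

```python
import numpy as np
from sklearn.datasets import make_moons

# Two-moons data: a handful of labeled points, the rest unlabeled.
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
y = 2 * y - 1                                  # map {0, 1} -> {-1, +1} for np.sign

labeled_idx = np.random.RandomState(0).choice(len(X), size=10, replace=False)
unlabeled_mask = np.ones(len(X), dtype=bool)
unlabeled_mask[labeled_idx] = False

clf = ManifoldRegularizedClassifier(lambda_a=1e-3, lambda_i=1e-2,
                                    k_neighbors=10, gamma=5.0)
clf.fit(X[labeled_idx], y[labeled_idx], X[unlabeled_mask])

accuracy = np.mean(clf.predict(X) == y)        # transductive accuracy over all points
print(f"Accuracy: {accuracy:.2f}")
```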
The Transductive SVM (Joachims, 1999) extends the max-margin principle to unlabeled data. The objective finds a labeling of unlabeled points that maximizes the margin:

$$\min_{w, b, \hat{y}_U} \frac{1}{2}\|w\|^2 + C_L \sum_{i \in L} \xi_i + C_U \sum_{j \in U} \xi_j$$
subject to $y_i(w^\top x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for labeled points $i \in L$, and $\hat{y}_j(w^\top x_j + b) \geq 1 - \xi_j$, $\xi_j \geq 0$, $\hat{y}_j \in \{-1, +1\}$ for unlabeled points $j \in U$.
The labels $\hat{y}_U$ are optimization variables! We search for the labeling that yields maximum margin.
TSVM is NP-hard because we're optimizing over discrete labels. Practical algorithms use iterative approaches: start with initial labels from a standard SVM, then refine by swapping labels that would increase the margin. Convergence to the global optimum is not guaranteed.
The Low-Density Separation Principle:
TSVM implements the assumption that decision boundaries should pass through low-density regions. By maximizing margin on unlabeled data, we push the boundary away from all data points, labeled or unlabeled, naturally placing it where data is sparse.
Balancing Constraint:
Without constraints, TSVM might assign all unlabeled points to one class. A balancing constraint ensures: $$\frac{1}{|U|} \sum_{j \in U} \hat{y}_j \approx \text{class prior}$$
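A minimal sketch of such an iterative scheme, using scikit-learn's SVC with sample weights as the base learner; the helper name, the annealing schedule, and the quantile-based balancing step are simplifications of Joachims' original algorithm:

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_heuristic(X_l, y_l, X_u, C_l=1.0, C_u_final=1.0, n_steps=5, prior=0.5):
    """Approximate TSVM: pseudo-label unlabeled data while respecting the class
    prior, then retrain as the unlabeled weight C_u is gradually increased."""
    svm = SVC(kernel='linear', C=C_l).fit(X_l, y_l)

    # Initial pseudo-labels: threshold decision values so that roughly
    # `prior` of the unlabeled points are assigned +1 (balancing constraint).
    scores = svm.decision_function(X_u)
    threshold = np.quantile(scores, 1 - prior)
    y_u = np.where(scores >= threshold, 1, -1)

    for step in range(1, n_steps + 1):
        C_u = C_u_final * step / n_steps          # anneal unlabeled influence
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        weights = np.concatenate([np.full(len(y_l), C_l),
                                  np.full(len(y_u), C_u)])
        svm = SVC(kernel='linear', C=1.0).fit(X_all, y_all, sample_weight=weights)

        # Re-label the unlabeled points, again enforcing the balance constraint.
        scores = svm.decision_function(X_u)
        threshold = np.quantile(scores, 1 - prior)
        y_u = np.where(scores >= threshold, 1, -1)

    return svm, y_u
```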
Graph Neural Networks (GNNs) learn to propagate and transform information through graph structure. For semi-supervised node classification, they naturally leverage unlabeled nodes:
Graph Convolutional Network (GCN):
Kipf & Welling (2016) proposed the layer-wise propagation rule: $$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $W^{(l)}$ is the layer's learnable weight matrix.
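The propagation rule depends on the symmetrically normalized adjacency, which is typically precomputed once. A small sketch (dense tensors for clarity; production code would use sparse operations):

```python
import torch

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Compute D^{-1/2} (A + I) D^{-1/2} for the GCN propagation rule."""
    A_tilde = A + torch.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(dim=1)                 # degrees are >= 1 thanks to self-loops
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt
```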
Why GCNs Work for SSL:

Although only labeled nodes enter the loss, every node participates in message passing: unlabeled neighbors contribute their features to the representations of labeled nodes, and the weights learned from the labeled loss are shared across the entire graph. The repeated neighborhood averaging also acts as a smoothing operator, so connected nodes tend to receive similar predictions, echoing the explicit Laplacian penalty from manifold regularization.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """Single Graph Convolutional Layer."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, X, A_norm):
        """
        Args:
            X: Node features (n_nodes, in_features)
            A_norm: Normalized adjacency (D^{-1/2} A D^{-1/2})
        """
        # Aggregate: A_norm @ X
        # Transform: linear projection
        return self.linear(A_norm @ X)


class GCN(nn.Module):
    """2-layer GCN for semi-supervised node classification."""

    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GCNLayer(n_features, n_hidden)
        self.gc2 = GCNLayer(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, X, A_norm):
        H = F.relu(self.gc1(X, A_norm))
        H = F.dropout(H, p=self.dropout, training=self.training)
        return self.gc2(H, A_norm)


def train_gcn_ssl(model, X, A_norm, y_train, train_mask, epochs=200):
    """
    Train GCN with semi-supervised loss.
    Only labeled nodes contribute to loss,
    but all nodes participate in message passing.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()

        # Forward pass uses ALL nodes
        logits = model(X, A_norm)

        # Loss only on labeled nodes
        loss = F.cross_entropy(logits[train_mask], y_train[train_mask])
        loss.backward()
        optimizer.step()

    return model
```

Several GNN variants extend this recipe for semi-supervised node classification:

| Method | Key Innovation | Aggregation |
|---|---|---|
| GCN | Spectral convolutions simplified | Mean of neighbors |
| GAT | Attention-weighted aggregation | Learned attention weights |
| GraphSAGE | Sampling for scalability | Sample + aggregate |
| SGC | Remove non-linearities | Pre-computed k-hop |
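As a concrete illustration of the last row, SGC precomputes the propagation so that only a linear classifier needs training. A minimal sketch, assuming a dense `A_norm` and the same tensors as in the GCN example above:

```python
import torch.nn as nn
import torch.nn.functional as F

def sgc_precompute(X, A_norm, k=2):
    """SGC preprocessing: apply k hops of propagation once, before any training."""
    for _ in range(k):
        X = A_norm @ X
    return X

# Usage sketch (X, A_norm, y_train, train_mask as in the GCN example above):
# X_prop = sgc_precompute(X, A_norm, k=2)
# linear = nn.Linear(X_prop.shape[1], n_classes)
# loss = F.cross_entropy(linear(X_prop)[train_mask], y_train[train_mask])
```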
So far, we've treated the graph as given (constructed from distances). But what if the graph is suboptimal? Recent methods learn the graph jointly with the classifier:
Graph Learning for SSL:
$$\min_{f, W} \mathcal{L}_{\text{supervised}}(f) + \lambda_1 \cdot \text{smoothness}(f, W) + \lambda_2 \cdot \text{graph regularization}(W)$$
The graph $W$ is now an optimization variable, learned to make the smoothness assumption hold.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableGraphSSL(nn.Module):
    """
    Joint learning of graph structure and classifier.
    """

    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()

        # Graph learning parameters
        self.edge_mlp = nn.Sequential(
            nn.Linear(n_features * 2, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
            nn.Sigmoid()
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes)
        )

    def compute_graph(self, X, k=10):
        """Learn edge weights between all pairs."""
        n = X.shape[0]

        # Compute pairwise edge features
        X_i = X.unsqueeze(1).expand(n, n, -1)
        X_j = X.unsqueeze(0).expand(n, n, -1)
        edge_features = torch.cat([X_i, X_j], dim=-1)

        # Edge weights from MLP
        W = self.edge_mlp(edge_features).squeeze(-1)

        # Sparsify: keep top-k per node
        _, topk_idx = W.topk(k, dim=1)
        mask = torch.zeros_like(W)
        mask.scatter_(1, topk_idx, 1)
        W = W * mask

        # Symmetrize
        W = (W + W.T) / 2
        return W

    def forward(self, X, train_mask, y_train, lambda_smooth=0.1):
        # Learn graph
        W = self.compute_graph(X)

        # Graph Laplacian
        D = torch.diag(W.sum(dim=1))
        L = D - W

        # Predictions
        logits = self.classifier(X)
        probs = F.softmax(logits, dim=1)

        # Supervised loss
        loss_sup = F.cross_entropy(logits[train_mask], y_train[train_mask])

        # Smoothness loss: predictions should be similar for high W
        loss_smooth = 0
        for c in range(logits.shape[1]):
            f = probs[:, c]
            loss_smooth += f @ L @ f

        return logits, loss_sup + lambda_smooth * loss_smooth
```
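A minimal training-loop sketch for the model above; the optimizer settings, epoch count, and input tensors (`X`, `y_train`, `train_mask`) are assumptions for illustration:

```python
import torch

# Assumed inputs: X is a (n_nodes, n_features) float tensor, y_train a (n_nodes,)
# long tensor of labels, and train_mask a boolean tensor marking labeled nodes.
model = LearnableGraphSSL(n_features=X.shape[1], n_hidden=64,
                          n_classes=int(y_train[train_mask].max()) + 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    logits, loss = model(X, train_mask, y_train, lambda_smooth=0.1)
    loss.backward()                     # gradients flow into both the edge MLP
    optimizer.step()                    # and the classifier

model.eval()
with torch.no_grad():
    logits, _ = model(X, train_mask, y_train)
    predictions = logits.argmax(dim=1)  # labels for every node, incl. unlabeled
```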
Learn the graph when: (1) features are high-dimensional and Euclidean distance is unreliable, (2) the task-relevant similarity differs from generic feature similarity, (3) you have enough labeled data to guide graph learning. For small labeled sets, fixed kNN graphs often work better.

What's Next:
We've explored discriminative approaches to semi-supervised learning. But there's another powerful paradigm: generative approaches that model the joint distribution $P(X, Y)$. By modeling how data is generated, we can leverage unlabeled data to improve our understanding of the feature space, which in turn improves classification.
You now understand advanced graph-based methods: manifold regularization, transductive SVMs, GNNs for SSL, and learned graph construction. These techniques represent the state of the art in leveraging structure for semi-supervised learning.