In the previous pages, we established that semi-supervised learning requires assumptions linking the marginal distribution P(X) to the conditional P(Y|X). Without such assumptions, unlabeled data provides no information about the labeling function. This page examines these assumptions in depth.
The assumptions of semi-supervised learning are not arbitrary mathematical conveniences—they encode our beliefs about how the world is structured. When these beliefs hold, SSL methods can dramatically outperform supervised learning. When they fail, SSL can actually hurt performance.
Understanding these assumptions transforms you from a user of SSL methods into a practitioner who can predict when SSL will help, diagnose failures when it doesn't, and select methods matched to your data's structure.
This page provides comprehensive coverage of SSL assumptions. You will understand: (1) The smoothness assumption and its implications, (2) The cluster assumption for classification, (3) Low-density separation and decision boundaries, (4) The manifold assumption for high-dimensional data, and (5) How to verify whether assumptions hold for your data.
The most fundamental assumption in semi-supervised learning is smoothness: if two points x and x' are close in feature space, their labels y and y' should be similar.
Smoothness Assumption:
If x₁ and x₂ are close in a high-density region of P(X), then the corresponding outputs y₁ and y₂ should also be close.
Mathematically, for a function f: 𝒳 → 𝒴:
$$|f(x_1) - f(x_2)| \leq L \cdot d(x_1, x_2)$$
when both x₁ and x₂ lie in high-density regions (p(x₁), p(x₂) > threshold).
Here, L is the Lipschitz constant—a measure of how 'smooth' f can be. Smaller L means smoother functions.
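To make this concrete, here is a small sketch (my own illustration, not from the original text) that estimates empirical Lipschitz-style ratios |f(xᵢ) − f(xⱼ)| / d(xᵢ, xⱼ) over nearest-neighbor pairs; for a smooth labeling function these ratios stay bounded by the Lipschitz constant L:

```python
import numpy as np

def empirical_smoothness_ratios(X, y, k=5):
    """Ratio |y_i - y_j| / ||x_i - x_j|| over each point's k nearest
    neighbors: a crude local probe of the Lipschitz constant."""
    ratios = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        for j in np.argsort(dists)[1:k + 1]:  # skip self at position 0
            if dists[j] > 1e-12:
                ratios.append(abs(y[i] - y[j]) / dists[j])
    return np.asarray(ratios)

# A smooth labeling function keeps the ratios bounded (here by 2*pi)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0])  # Lipschitz with constant L = 2*pi
ratios = empirical_smoothness_ratios(X, y)
print(ratios.mean(), ratios.max())
```

On real data, computing such ratios within high-density regions (and comparing them to ratios across sparse regions) is a quick check of whether the smoothness assumption is plausible.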
The caveat 'in high-density regions' is crucial. The smoothness assumption makes no claims about low-density regions. This distinction leads to the low-density separation principle: decision boundaries should lie in low-density regions where the smoothness constraint doesn't apply.
Consider two points on opposite sides of a decision boundary: if the boundary lies in a low-density region, at least one of the points sits where the smoothness constraint does not apply, so their labels are free to differ. This asymmetry allows for sharp boundaries while demanding smooth behavior where data concentrates.
Unlabeled data reveals the density structure of P(X). Given this structure, we know where smoothness should be enforced (high-density regions) and where the decision boundary is free to pass (low-density regions).
Methods exploiting smoothness:
Consistency Regularization: $$\mathcal{L}_{smooth} = \mathbb{E}_{x \sim D_U}\left[\|f(x) - f(x + \eta)\|^2\right]$$
where η is small noise. This directly encourages the function to be smooth locally.
Graph Laplacian Regularization: $$\mathcal{L}_{graph} = \frac{1}{2}\sum_{i,j} w_{ij}\|f(x_i) - f(x_j)\|^2 = f^T L f$$
where L is the graph Laplacian and w_ij encodes similarity. This penalizes prediction differences between similar points.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothnessRegularizer:
    """
    Implements smoothness-based regularization for SSL.
    """

    @staticmethod
    def local_smoothness_loss(model: nn.Module,
                              x_unlabeled: torch.Tensor,
                              noise_std: float = 0.1,
                              num_perturbations: int = 2) -> torch.Tensor:
        """
        Local smoothness loss: predictions should be similar for x and x + noise.

        L = E[||f(x) - f(x + ε)||²] where ε ~ N(0, σ²)
        """
        model.eval()  # Use eval mode for base prediction
        with torch.no_grad():
            base_pred = model(x_unlabeled)
        model.train()

        smoothness_loss = 0.0
        for _ in range(num_perturbations):
            # Add small Gaussian noise
            noise = torch.randn_like(x_unlabeled) * noise_std
            perturbed_x = x_unlabeled + noise
            perturbed_pred = model(perturbed_x)
            # MSE between base and perturbed predictions
            smoothness_loss += F.mse_loss(perturbed_pred, base_pred.detach())

        return smoothness_loss / num_perturbations

    @staticmethod
    def graph_laplacian_loss(embeddings: torch.Tensor,
                             adjacency: torch.Tensor,
                             predictions: torch.Tensor) -> torch.Tensor:
        """
        Graph Laplacian regularization: f^T L f

        Args:
            embeddings: (n, d) feature embeddings for graph construction
            adjacency: (n, n) precomputed adjacency matrix (or None to compute)
            predictions: (n, c) soft predictions from model
        """
        n = predictions.shape[0]

        if adjacency is None:
            # Compute k-NN adjacency from embeddings
            k = min(10, n - 1)
            distances = torch.cdist(embeddings, embeddings)
            _, indices = torch.topk(distances, k + 1, largest=False)
            adjacency = torch.zeros(n, n, device=embeddings.device)
            for i in range(n):
                adjacency[i, indices[i, 1:]] = 1.0
            adjacency = (adjacency + adjacency.T) / 2  # Symmetrize

        # Compute Laplacian: L = D - W
        degree = adjacency.sum(dim=1)
        laplacian = torch.diag(degree) - adjacency

        # Quadratic form: sum over classes
        loss = 0.0
        for c in range(predictions.shape[1]):
            f_c = predictions[:, c]
            loss += f_c @ laplacian @ f_c

        return loss / predictions.shape[1]  # Average over classes
```

Smoothness is a reasonable assumption when: (1) features are meaningful representations (not raw pixels without preprocessing), (2) classes correspond to natural concepts with consistent appearance, and (3) there is no adversarial structure in the data. Smoothness often fails for adversarial examples, where small perturbations cause large label changes.
The cluster assumption is perhaps the most intuitive premise of semi-supervised learning: data points naturally form clusters, and points in the same cluster tend to share the same label.
Cluster Assumption:
The data distribution P(X) forms clusters, and points within the same cluster are more likely to share the same class label.
More precisely: if two points can be connected by a path that passes only through high-density regions, they are likely to share the same label.
The cluster assumption is actually a special case of the smoothness assumption:
However, the cluster assumption goes further by asserting that the structure is explicitly clustered, not just smooth. This stronger assumption enables more powerful methods but also fails more catastrophically when wrong.
1. Cluster-then-Label:
The simplest approach:
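A minimal sketch of cluster-then-label (my own illustration; names and the toy data are hypothetical): cluster all points, then assign each cluster the majority label among its labeled members.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X, y, labeled_mask, n_clusters):
    """Cluster all points, then give each cluster the majority label
    among its labeled members (-1 marks clusters with no labeled points)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    y_pred = np.full(len(X), -1)
    for c in range(n_clusters):
        in_c = km.labels_ == c
        votes = y[in_c & labeled_mask]
        if votes.size > 0:
            y_pred[in_c] = np.bincount(votes).argmax()
    return y_pred

# Toy usage: two well-separated blobs, only two labeled points per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
labeled = np.zeros(100, dtype=bool)
labeled[[0, 1, 50, 51]] = True
y_hat = cluster_then_label(X, y, labeled, n_clusters=2)
print((y_hat == y).mean())
```

When clusters align with classes, as in this toy example, a handful of labels per cluster suffices to label the whole dataset; when they don't, the same procedure propagates errors to entire clusters.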
2. Transductive SVM (Decision Boundary in Low-Density):
TSVM explicitly seeks decision boundaries that correctly classify the labeled points while passing through low-density regions of the unlabeled data, maximizing the margin over labeled and unlabeled examples alike.
3. Entropy Minimization:
Encourages confident predictions on unlabeled data:
$$\mathcal{L}_{ent} = -\frac{1}{u}\sum_{j=1}^{u}\sum_{c=1}^{C} p_c(x_j) \log p_c(x_j)$$
Low entropy means predictions are confident (near 0 or 1), which encourages points within the same cluster to be assigned to the same class confidently.
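The entropy term above is a few lines of PyTorch (a sketch of the standard formulation; the example logits are my own):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits: torch.Tensor) -> torch.Tensor:
    """L_ent = -(1/u) * sum_j sum_c p_c(x_j) log p_c(x_j)."""
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)  # numerically stable log p
    return -(p * log_p).sum(dim=1).mean()

# Confident predictions give near-zero loss; uniform ones give log C
confident = torch.tensor([[10.0, -10.0], [-10.0, 10.0]])
uniform = torch.zeros(2, 2)
print(entropy_minimization_loss(confident).item())  # near 0
print(entropy_minimization_loss(uniform).item())    # log 2 ≈ 0.693
```

Note the loss is minimized by confident predictions regardless of correctness, which is why entropy minimization is typically combined with a supervised loss rather than used alone.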
A critical subtlety: the cluster assumption requires that clusters correspond to classes. This is not always true:
Case 1: Clusters = Classes (Ideal)
Case 2: Classes ⊂ Clusters (Manageable)
Case 3: Clusters ⊂ Classes (Problematic)
Case 4: Clusters ⊥ Classes (Disastrous)
| Scenario | Example | SSL Effect | Mitigation |
|---|---|---|---|
| Classes = Clusters | Distinct species in images | Strong positive | Use standard SSL |
| Classes form sub-clusters | Object poses/viewpoints | Moderate positive | Ensure labeled samples span sub-clusters |
| Clusters contain mixed classes | Similar-looking different objects | Possible negative | Careful class-balanced pseudo-labeling |
| Clusters orthogonal to classes | Brightness clusters, semantic labels | Strong negative | Learn representations first, then cluster |
Raw features often cluster by non-semantic attributes (lighting, background, image quality) rather than semantic classes. This is why modern SSL methods use learned representations: neural networks trained with appropriate regularization learn features where cluster structure aligns with class structure.
The low-density separation principle restates the cluster assumption from the boundary's perspective: decision boundaries should pass through regions where few data points reside.
Low-Density Separation:
The decision boundary of the classifier should lie in low-density regions of P(X).
Equivalently:
Class posterior P(Y|X) should change mainly in regions where P(X) is small.
We can formalize this as seeking a decision boundary B ⊂ 𝒳 that minimizes:
$$\int_{x \in B} p(x) dx$$
subject to correctly classifying labeled points.
This integral measures the 'density traversed' by the boundary. A boundary through empty space has cost 0; a boundary through a cluster has high cost.
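As a toy illustration of this integral (my own sketch, not from the text): in one dimension a decision boundary is a single threshold t, the integral reduces to p(t), and minimizing it places the boundary in the valley between modes.

```python
import numpy as np

def gaussian(x, mean):
    return np.exp(-(x - mean) ** 2 / 2) / np.sqrt(2 * np.pi)

def p(x):
    """1-D density: equal mixture of Gaussians centered at 0 and 4."""
    return 0.5 * gaussian(x, 0.0) + 0.5 * gaussian(x, 4.0)

# In 1-D the "density traversed" by a threshold t is just p(t);
# scan candidate thresholds between the two class modes
candidates = np.linspace(0.0, 4.0, 401)
best_t = candidates[np.argmin(p(candidates))]
print(best_t)  # ≈ 2, the low-density valley between the modes
```

In higher dimensions the same idea applies, except the boundary is a surface and the integral accumulates density along it.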
Virtual Adversarial Training (VAT) operationalizes low-density separation through local perturbation analysis:
1. Find Adversarial Direction:
For each point x, find the perturbation r that maximally changes predictions:
$$r_{adv} = \arg\max_{\|r\| \leq \epsilon} D_{KL}\left(p(y|x) \,\|\, p(y|x+r)\right)$$
This direction points toward the nearest decision boundary.
2. Penalize Prediction Change:
$$\mathcal{L}_{VAT}(x) = D_{KL}\left(p(y|x) \,\|\, p(y|x+r_{adv})\right)$$
3. Effect:
Minimizing this loss pushes the decision boundary away from x. Since we apply this to all unlabeled points, boundaries are pushed to regions devoid of data—achieving low-density separation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def virtual_adversarial_loss(model: nn.Module,
                             x: torch.Tensor,
                             xi: float = 1e-6,
                             epsilon: float = 1.0,
                             num_power_iterations: int = 1) -> torch.Tensor:
    """
    Virtual Adversarial Training (VAT) loss.

    Finds the perturbation that maximally changes model predictions,
    then penalizes that change to push boundaries to low-density regions.

    Args:
        model: Neural network classifier
        x: Input batch (typically unlabeled data)
        xi: Small constant for numerical gradient estimation
        epsilon: Maximum perturbation norm
        num_power_iterations: Iterations for power method to find r_adv

    Returns:
        VAT loss (scalar)
    """
    model.eval()
    with torch.no_grad():
        pred = F.softmax(model(x), dim=1)
    model.train()

    # Initialize random perturbation
    d = torch.rand_like(x) - 0.5
    d = F.normalize(d.view(d.shape[0], -1), dim=1).view_as(x)

    # Power iteration to find adversarial direction
    for _ in range(num_power_iterations):
        d.requires_grad_(True)
        pred_perturbed = F.softmax(model(x + xi * d), dim=1)
        # KL divergence: D_KL(pred || pred_perturbed)
        kl_div = F.kl_div(
            pred_perturbed.log(),
            pred.detach(),
            reduction='batchmean'
        )
        # Gradient of KL w.r.t. perturbation
        kl_div.backward()
        d = d.grad.detach()
        d = F.normalize(d.view(d.shape[0], -1), dim=1).view_as(x)

    # Compute adversarial perturbation with final direction
    r_adv = epsilon * d.detach()

    # VAT loss: KL divergence at adversarial point
    pred_adv = F.softmax(model(x + r_adv), dim=1)
    vat_loss = F.kl_div(
        pred_adv.log(),
        pred.detach(),
        reduction='batchmean'
    )
    return vat_loss

# Example training loop integration
def train_step_with_vat(model, optimizer, x_labeled, y_labeled,
                        x_unlabeled, alpha_vat=1.0):
    optimizer.zero_grad()

    # Supervised loss
    pred_labeled = model(x_labeled)
    sup_loss = F.cross_entropy(pred_labeled, y_labeled)

    # VAT loss on unlabeled data
    vat_loss = virtual_adversarial_loss(model, x_unlabeled)

    # Total loss
    total_loss = sup_loss + alpha_vat * vat_loss
    total_loss.backward()
    optimizer.step()

    return sup_loss.item(), vat_loss.item()
```

Low-density separation doesn't mean boundaries are pushed infinitely far from data. It means boundaries prefer regions with fewer samples. In well-separated clusters, this is the gap between clusters. In overlapping classes, the boundary will still traverse some moderate-density regions—the assumption provides soft guidance, not hard constraints.
The manifold assumption addresses high-dimensional data by positing that despite living in high-dimensional space, data actually lies on a lower-dimensional manifold.
Manifold Assumption:
The data lies on a low-dimensional manifold ℳ embedded in the high-dimensional input space 𝒳. The class labels vary smoothly along this manifold.
Mathematically: the data concentrates near a manifold ℳ of intrinsic dimension k embedded in a d-dimensional ambient space with k ≪ d, and the labeling function varies smoothly along ℳ.
Consider image data:
Estimated intrinsic dimensions:
If we knew the manifold ℳ, supervised learning would be a k-dimensional problem instead of a d-dimensional one—dramatically reducing sample complexity.
Sample complexity scaling:
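As a back-of-the-envelope illustration of this scaling (a standard covering-number heuristic; the specific dimensions below are my own example): the number of ε-balls needed to cover a unit cube grows exponentially in dimension, so moving from the ambient dimension d to the intrinsic dimension k changes the exponent.

```python
def covering_number(eps: float, dim: int) -> float:
    """Rough count of eps-balls needed to cover a unit cube of given dim."""
    return (1.0 / eps) ** dim

# Covering a 50-D ambient space vs. a 10-D manifold at resolution 0.1
print(covering_number(0.1, 50))  # ~1e50: hopeless
print(covering_number(0.1, 10))  # ~1e10: large but vastly smaller
```

This is why discovering the manifold with cheap unlabeled data, and then learning labels on it, can beat learning directly in the ambient space.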
Unlabeled data helps discover the manifold:
A key insight of manifold-based SSL: Euclidean distance can be misleading.
Consider a curved manifold like a Swiss roll: two points on adjacent coils can be close in Euclidean distance yet far apart along the manifold, so they should not be assumed to share a label.
Graph-based methods approximate geodesic distance:
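The graph approximation can be sketched in a few lines (my own illustration using scikit-learn's Swiss roll generator): build a k-NN graph with Euclidean edge weights, then take shortest paths through the graph as approximate geodesics.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Swiss roll: points far apart along the roll can be close in Euclidean space
X, t = make_swiss_roll(n_samples=500, random_state=0)

# k-NN graph with Euclidean edge weights; shortest paths approximate geodesics
knn = kneighbors_graph(X, n_neighbors=10, mode='distance')
geodesic = shortest_path(knn, directed=False)

i, j = np.argmin(t), np.argmax(t)  # the two ends of the roll
euclid = np.linalg.norm(X[i] - X[j])
print(f"Euclidean: {euclid:.1f}, graph geodesic: {geodesic[i, j]:.1f}")
# The along-manifold distance between the ends exceeds the Euclidean one
```

This is exactly the construction behind Isomap and behind graph Laplacian regularization: the graph encodes the manifold's geometry, so penalties defined on it respect geodesic rather than Euclidean proximity.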
Modern deep learning methods implicitly learn manifold structure:
Autoencoders: compress data through a low-dimensional bottleneck, implicitly learning coordinates on the manifold.
Contrastive Learning: pulls augmented views of the same example together in embedding space, so the learned representation follows within-class manifold directions.
Data Augmentation as Manifold Exploration:
Data augmentations can be viewed as defining local manifold tangent directions. When we augment an image (flip, rotate, crop) and require consistent predictions, we're saying 'these directions are within the same class region of the manifold.' The choice of augmentations encodes domain knowledge about manifold structure.
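This view can be sketched directly as a consistency loss (my own minimal illustration; the tiny linear model and single flip augmentation are stand-ins for a real network and augmentation pipeline): require the prediction to be invariant along the "augmentation direction" of the manifold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def augmentation_consistency_loss(model: nn.Module,
                                  x: torch.Tensor) -> torch.Tensor:
    """Require similar predictions for an image and its horizontal flip,
    i.e. treat the flip direction as within-class manifold movement."""
    with torch.no_grad():
        p_orig = F.softmax(model(x), dim=1)
    x_flipped = torch.flip(x, dims=[3])  # flip width axis of (N, C, H, W)
    log_p_aug = F.log_softmax(model(x_flipped), dim=1)
    return F.kl_div(log_p_aug, p_orig, reduction='batchmean')

# Toy usage with a tiny linear classifier on 8x8 "images"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4))
x = torch.randn(2, 3, 8, 8)
loss = augmentation_consistency_loss(model, x)
print(loss.item())  # non-negative KL divergence
```

The choice of `torch.flip` here is the domain-knowledge step: for digits like 6 and 9, a flip is not label-preserving, and this loss would encode the wrong manifold structure.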
The four assumptions we've discussed are interconnected. Understanding their relationships helps in selecting and combining methods.
Smoothness is the most general: it only requires that predictions vary slowly in high-density regions, and the other assumptions can be read as refinements of it.
Cluster is a special case: it additionally asserts that the high-density regions form discrete groups aligned with classes.
Low-density separation is the boundary perspective: the same cluster structure described from the side of the decision boundary rather than the clusters themselves.
Manifold is about representation: it relocates smoothness and cluster structure from the ambient space onto a learned low-dimensional geometry.
In some datasets, assumptions can conflict:
Smoothness vs. Cluster:
Manifold vs. Cluster:
Low-density vs. Reality:
Modern SSL methods combine assumptions:
FixMatch = Smoothness + Cluster + Pseudo-labels:
Graph Networks = Manifold + Smoothness:
Contrastive Learning = Manifold + Cluster:
Methods that exploit multiple assumptions are more robust. If one assumption fails partially, others provide backup signal. This is why FixMatch and MixMatch—which combine consistency, entropy minimization, and pseudo-labeling—often outperform single-assumption methods.
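The combination can be seen in a condensed sketch of the FixMatch-style unlabeled loss (my own paraphrase of the published recipe; the weak/strong augmentation pipelines that produce the two logit tensors are omitted): confident pseudo-labels exploit the cluster/entropy side, and consistency between views exploits smoothness.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak: torch.Tensor,
                            logits_strong: torch.Tensor,
                            threshold: float = 0.95) -> torch.Tensor:
    """Pseudo-label from the weakly augmented view, keep only confident
    ones, and enforce consistency on the strongly augmented view."""
    probs = F.softmax(logits_weak.detach(), dim=1)
    confidence, pseudo_labels = probs.max(dim=1)
    mask = (confidence >= threshold).float()  # drop unconfident examples
    per_example = F.cross_entropy(logits_strong, pseudo_labels,
                                  reduction='none')
    return (per_example * mask).mean()

# First row is confident and consistent; second is unconfident (masked out)
weak = torch.tensor([[8.0, -8.0], [0.2, 0.0]])
strong = torch.tensor([[8.0, -8.0], [0.0, 0.2]])
print(fixmatch_unlabeled_loss(weak, strong).item())  # close to 0
```

The confidence threshold is the safety valve: when the cluster assumption fails for some points, they simply never cross the threshold and contribute nothing.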
Before applying SSL, you should verify that relevant assumptions hold for your dataset. Here are practical diagnostics for each assumption.
import numpy as npfrom sklearn.neighbors import NearestNeighborsfrom sklearn.cluster import KMeansfrom sklearn.metrics import silhouette_scorefrom collections import Counter def diagnose_cluster_assumption(X, y_true, n_neighbors=10, n_clusters=None): """ Diagnostic tests for the cluster assumption. Args: X: (n, d) feature matrix y_true: (n,) true labels (from labeled subset) n_neighbors: k for k-NN consistency check n_clusters: number of clusters for k-means (defaults to num classes) Returns: Dictionary of diagnostic metrics """ n_classes = len(np.unique(y_true)) n_clusters = n_clusters or n_classes # 1. k-NN class consistency nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X) _, indices = nn.kneighbors(X) consistency_scores = [] for i, neighbors in enumerate(indices): neighbor_labels = y_true[neighbors[1:]] # Exclude self mode_label = Counter(neighbor_labels).most_common(1)[0][0] consistency_scores.append(1.0 if y_true[i] == mode_label else 0.0) knn_consistency = np.mean(consistency_scores) # 2. K-means cluster purity kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X) cluster_labels = kmeans.labels_ purity = 0.0 for cluster_id in range(n_clusters): cluster_mask = cluster_labels == cluster_id if cluster_mask.sum() > 0: cluster_true_labels = y_true[cluster_mask] mode_count = Counter(cluster_true_labels).most_common(1)[0][1] purity += mode_count purity /= len(y_true) # 3. Silhouette score (using true labels) silhouette = silhouette_score(X, y_true) return { 'knn_consistency': knn_consistency, 'cluster_purity': purity, 'silhouette_score': silhouette, 'recommendation': 'SSL likely beneficial' if knn_consistency > 0.7 and purity > 0.7 else 'SSL may not help' if knn_consistency < 0.5 else 'SSL results may vary' } def estimate_intrinsic_dimension(X, k=10): """ Estimate intrinsic dimension using Two-NN method. Based on: "Estimating the intrinsic dimension of datasets" by Facco et al. 
(2017) """ nn = NearestNeighbors(n_neighbors=k+1).fit(X) distances, _ = nn.kneighbors(X) # Ratio of second to first neighbor distance r1 = distances[:, 1] # Distance to 1st neighbor r2 = distances[:, 2] # Distance to 2nd neighbor # Avoid division by zero valid = r1 > 1e-10 mu = r2[valid] / r1[valid] # MLE estimate of intrinsic dimension # Based on mu following a Pareto distribution with parameter d d_estimate = 1.0 / np.mean(np.log(mu)) return { 'intrinsic_dimension': d_estimate, 'ambient_dimension': X.shape[1], 'dimension_ratio': d_estimate / X.shape[1], 'manifold_likely': d_estimate < X.shape[1] * 0.1 # Less than 10% of ambient }Watch for these indicators that assumptions don't hold:
The ultimate test of SSL assumptions is empirical performance. Always train a supervised baseline on the same labeled data. If SSL doesn't outperform it, either the assumptions don't hold for your data or your method is poorly tuned (or both). Never blindly trust SSL—verify.
Real-world data rarely perfectly satisfies SSL assumptions. Here we discuss strategies for handling partial or complete assumption violations.
If cluster structure doesn't align with classes in raw features:
This approach has become standard in practice. Modern SSL pipelines almost always involve:
Raw Data → Self-Supervised Pretraining → Fine-tuning with SSL → Final Model
Some methods are designed to be robust to assumption violations:
When uncertain about assumptions, consider approaches that make minimal assumptions:
Self-Training with High Threshold:
Representation Learning Only:
Ensemble of Experts:
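A minimal sketch of the first option, self-training with a high confidence threshold (my own illustration; the toy blob data and `self_train` helper are hypothetical): only pseudo-labels above the threshold ever enter the training set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X, y, labeled_mask, threshold=0.95, max_rounds=10):
    """Iteratively pseudo-label only very confident points, then refit."""
    y_work = y.copy()
    mask = labeled_mask.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X[mask], y_work[mask])
        if mask.all():
            return clf, mask
        probs = clf.predict_proba(X[~mask])
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            return clf, mask  # stop rather than accept shaky labels
        idx = np.where(~mask)[0][confident]
        y_work[idx] = clf.classes_[probs[confident].argmax(axis=1)]
        mask[idx] = True
    clf.fit(X[mask], y_work[mask])  # final refit with all pseudo-labels
    return clf, mask

# Toy usage: two separated blobs, one labeled point per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
mask0 = np.zeros(100, dtype=bool)
mask0[[0, 50]] = True
y_hidden = np.where(mask0, y, -1)  # unlabeled targets hidden
clf, final_mask = self_train(X, y_hidden, mask0)
print(clf.score(X, y))
```

The conservative design choice is the early stop: if nothing clears the threshold, the method degrades gracefully to the supervised baseline instead of amplifying noisy guesses.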
If standard augmentations don't respect your data's structure: design domain-specific augmentations that encode only invariances you know to be label-preserving.
When in doubt, be conservative: high confidence thresholds, representation learning before SSL, and always compare to supervised baseline. A 5% improvement from SSL is worse than no improvement if SSL degrades 20% of cases—reliability matters more than peak performance.
We have examined the fundamental assumptions that enable semi-supervised learning. These assumptions are not optional—they are the theoretical bedrock on which all SSL methods stand.
What's Next:
With assumptions understood, the next page examines the evaluation challenges unique to semi-supervised learning. We'll explore how to properly evaluate SSL methods, the pitfalls of naive evaluation, and best practices for rigorous experimental design in the low-label regime.
You now understand the four fundamental assumptions of semi-supervised learning: smoothness, cluster, low-density separation, and manifold. You can identify which assumptions your data satisfies, select methods that exploit those assumptions, and apply diagnostics to verify SSL appropriateness. This knowledge transforms SSL from a black box into a principled tool you can apply with confidence.