The Label Scarcity Problem - Learning Module

Loading content...

0/278

Assumptions: The Foundations of Semi-Supervised Learning

Why Assumptions Are Everything in SSL

In the previous pages, we established that semi-supervised learning requires assumptions linking the marginal distribution P(X) to the conditional P(Y|X). Without such assumptions, unlabeled data provides no information about the labeling function. This page examines these assumptions in depth.

The assumptions of semi-supervised learning are not arbitrary mathematical conveniences—they encode our beliefs about how the world is structured. When these beliefs hold, SSL methods can dramatically outperform supervised learning. When they fail, SSL can actually hurt performance.

Understanding these assumptions transforms you from a user of SSL methods into someone who can:

Diagnose why SSL fails on specific datasets
Design methods tailored to your domain's structure
Verify whether SSL is appropriate for your problem
Combine assumptions for maximum benefit

What You Will Learn

This page provides comprehensive coverage of SSL assumptions. You will understand: (1) The smoothness assumption and its implications, (2) The cluster assumption for classification, (3) Low-density separation and decision boundaries, (4) The manifold assumption for high-dimensional data, and (5) How to verify whether assumptions hold for your data.

The Smoothness Assumption

The most fundamental assumption in semi-supervised learning is smoothness: if two points x and x' are close in feature space, their labels y and y' should be similar.

Formal Definition

Smoothness Assumption:

If x₁ and x₂ are close in a high-density region of P(X), then the corresponding outputs y₁ and y₂ should also be close.

Mathematically, for a function f: 𝒳 → 𝒴:

$$|f(x_1) - f(x_2)| \leq L \cdot d(x_1, x_2)$$

when both x₁ and x₂ lie in high-density regions (p(x₁), p(x₂) > threshold).

Here, L is the Lipschitz constant—a measure of how 'smooth' f can be. Smaller L means smoother functions.

Why High-Density Matters

The caveat 'in high-density regions' is crucial. The smoothness assumption makes no claims about low-density regions. This distinction leads to the low-density separation principle: decision boundaries should lie in low-density regions where the smoothness constraint doesn't apply.

Consider two points on opposite sides of a decision boundary:

If they're in a low-density region (few samples nearby), smoothness doesn't require their labels match
If they're in a high-density region, the boundary violates smoothness

This asymmetry allows for sharp boundaries while demanding smooth behavior where data concentrates.

How Unlabeled Data Exploits Smoothness

Unlabeled data reveals the density structure of P(X). Given this structure:

Identify high-density regions: Clusters, modes, or regions with many samples
Enforce smoothness within regions: Similar predictions for nearby points
Allow boundaries in low-density gaps: Where smoothness constraint is relaxed

Methods exploiting smoothness:

Consistency Regularization: $$\mathcal{L}{smooth} = \mathbb{E}{x \sim D_U}\left[|f(x) - f(x + \eta)|^2\right]$$

where η is small noise. This directly encourages the function to be smooth locally.

Graph Laplacian Regularization: $$\mathcal{L}{graph} = \sum{i,j} w_{ij}|f(x_i) - f(x_j)|^2 = f^T L f$$

where L is the graph Laplacian and w_ij encodes similarity. This penalizes prediction differences between similar points.

smoothness_regularization.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
import torch
import torch.nn as nn
import torch.nn.functional as F
 
class SmoothnessRegularizer:
    """
    Implements smoothness-based regularization for SSL.
    """
    
    @staticmethod
    def local_smoothness_loss(model: nn.Module, 
                               x_unlabeled: torch.Tensor, 
                               noise_std: float = 0.1,
                               num_perturbations: int = 2) -> torch.Tensor:
        """
        Local smoothness loss: predictions should be similar 
        for x and x + noise.
        
        L = E[||f(x) - f(x + ε)||²]  where ε ~ N(0, σ²)
        """
        model.eval()  # Use eval mode for base prediction
        with torch.no_grad():
            base_pred = model(x_unlabeled)
        
        model.train()
        smoothness_loss = 0.0
        
        for _ in range(num_perturbations):
            # Add small Gaussian noise
            noise = torch.randn_like(x_unlabeled) * noise_std
            perturbed_x = x_unlabeled + noise
            perturbed_pred = model(perturbed_x)
            
            # MSE between base and perturbed predictions
            smoothness_loss += F.mse_loss(perturbed_pred, base_pred.detach())
        
        return smoothness_loss / num_perturbations
    
    @staticmethod
    def graph_laplacian_loss(embeddings: torch.Tensor,
                             adjacency: torch.Tensor,
                             predictions: torch.Tensor) -> torch.Tensor:
        """
        Graph Laplacian regularization: f^T L f
        
        Args:
            embeddings: (n, d) feature embeddings for graph construction
            adjacency: (n, n) precomputed adjacency matrix (or None to compute)
            predictions: (n, c) soft predictions from model
        """
        n = predictions.shape[0]
        
        if adjacency is None:
            # Compute k-NN adjacency from embeddings
            k = min(10, n - 1)
            distances = torch.cdist(embeddings, embeddings)
            _, indices = torch.topk(distances, k + 1, largest=False)
            
            adjacency = torch.zeros(n, n, device=embeddings.device)
            for i in range(n):
                adjacency[i, indices[i, 1:]] = 1.0
            adjacency = (adjacency + adjacency.T) / 2  # Symmetrize
        
        # Compute Laplacian: L = D - W
        degree = adjacency.sum(dim=1)
        laplacian = torch.diag(degree) - adjacency
        
        # Quadratic form: sum over classes
        loss = 0.0
        for c in range(predictions.shape[1]):
            f_c = predictions[:, c]
            loss += f_c @ laplacian @ f_c
        
        return loss / predictions.shape[1]  # Average over classes

When Smoothness Holds

Smoothness is a reasonable assumption when: (1) Features are meaningful representations (not raw pixels without preprocessing), (2) Classes correspond to natural concepts with consistent appearance, (3) There's no adversarial structure in the data. Smoothness often fails for adversarial examples, where small perturbations cause large label changes.

The Cluster Assumption

The cluster assumption is perhaps the most intuitive premise of semi-supervised learning: data points naturally form clusters, and points in the same cluster tend to share the same label.

Formal Definition

Cluster Assumption:

The data distribution P(X) forms clusters, and points within the same cluster are more likely to share the same class label.

More precisely:

P(X) can be decomposed into K clusters: P(X) = Σₖ πₖ P(X|cluster=k)
P(Y|cluster=k) is concentrated on one or few classes

Relationship to Smoothness

The cluster assumption is actually a special case of the smoothness assumption:

Within-cluster, points are densely connected
Between-cluster, there is low density (gaps)
Smoothness forces similar labels within clusters
Low density allows different labels between clusters

However, the cluster assumption goes further by asserting that the structure is explicitly clustered, not just smooth. This stronger assumption enables more powerful methods but also fails more catastrophically when wrong.

Methods Exploiting the Cluster Assumption

1. Cluster-then-Label:

The simplest approach:

Cluster all data (labeled + unlabeled) using k-means, DBSCAN, etc.
Assign cluster labels based on majority vote of labeled points in each cluster
Propagate labels to all points in each cluster

2. Transductive SVM (Decision Boundary in Low-Density):

TSVM explicitly seeks decision boundaries that:

Correctly classify labeled data
Pass through low-density regions
Maximize margin on both labeled and unlabeled data

3. Entropy Minimization:

Encourages confident predictions on unlabeled data:

$$\mathcal{L}{ent} = -\frac{1}{u}\sum{j=1}^{u}\sum_{c=1}^{C} p_c(x_j) \log p_c(x_j)$$

Low entropy means predictions are confident (near 0 or 1), which encourages points within the same cluster to be assigned to the same class confidently.

The Class-Cluster Correspondence Problem

A critical subtlety: the cluster assumption requires that clusters correspond to classes. This is not always true:

Case 1: Clusters = Classes (Ideal)

Each cluster contains points from a single class
SSL works perfectly using cluster structure

Case 2: Classes ⊂ Clusters (Manageable)

Each class forms multiple sub-clusters
SSL can still help, though some labeled data from each sub-cluster is needed

Case 3: Clusters ⊂ Classes (Problematic)

Multiple classes mixed within some clusters
Cluster structure misleads SSL
Can cause worse performance than supervised baseline

Case 4: Clusters ⊥ Classes (Disastrous)

Cluster structure is orthogonal to class structure
E.g., images cluster by brightness but classes are objects
SSL will aggressively harm performance

Cluster-Class Correspondence Scenarios
Scenario	Example	SSL Effect	Mitigation
Classes = Clusters	Distinct species in images	Strong positive	Use standard SSL
Classes form sub-clusters	Object poses/viewpoints	Moderate positive	Ensure labeled samples span sub-clusters
Clusters contain mixed classes	Similar-looking different objects	Possible negative	Careful class-balanced pseudo-labeling
Clusters orthogonal to classes	Brightness clusters, semantic labels	Strong negative	Learn representations first, then cluster

The Cluster Assumption Trap

Raw features often cluster by non-semantic attributes (lighting, background, image quality) rather than semantic classes. This is why modern SSL methods use learned representations: neural networks trained with appropriate regularization learn features where cluster structure aligns with class structure.

Low-Density Separation

The low-density separation principle is the contrapositive of the cluster assumption: decision boundaries should pass through regions where few data points reside.

Formal Definition

Low-Density Separation:

The decision boundary of the classifier should lie in low-density regions of P(X).

Equivalently:

Class posterior P(Y|X) should change mainly in regions where P(X) is small.

Mathematical Formulation

We can formalize this as seeking a decision boundary B ⊂ 𝒳 that minimizes:

$$\int_{x \in B} p(x) dx$$

subject to correctly classifying labeled points.

This integral measures the 'density traversed' by the boundary. A boundary through empty space has cost 0; a boundary through a cluster has high cost.

Methods Implementing Low-Density Separation

Low-Density Separation Methods

•Transductive SVM: Maximizes margin on unlabeled points, pushing boundary to low-density regions where unlabeled points are scarce.
•Entropy Minimization: High-confidence predictions at all points means the model 'commits' everywhere, placing boundaries only where it must (low-density regions).
•Virtual Adversarial Training (VAT): Finds adversarial perturbations that maximally change predictions, then penalizes such changes. Boundaries become stable (low-density) regions.
•MixUp and CutMix: Interpolations between samples encourage linear behavior between class regions, implicitly pushing boundaries to sparse interpolated regions.

Virtual Adversarial Training (VAT) Deep Dive

VAT operationalizes low-density separation through local perturbation analysis:

1. Find Adversarial Direction:

For each point x, find the perturbation r that maximally changes predictions:

$$r_{adv} = \arg\max_{|r| \leq \epsilon} D_{KL}(p(y|x) | p(y|x+r))$$

This direction points toward the nearest decision boundary.

2. Penalize Prediction Change:

$$\mathcal{L}{VAT}(x) = D{KL}(p(y|x) | p(y|x+r_{adv}))$$

3. Effect:

Minimizing this loss pushes the decision boundary away from x. Since we apply this to all unlabeled points, boundaries are pushed to regions devoid of data—achieving low-density separation.

virtual_adversarial_training.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
import torch
import torch.nn as nn
import torch.nn.functional as F
 
def virtual_adversarial_loss(model: nn.Module, 
                              x: torch.Tensor, 
                              xi: float = 1e-6,
                              epsilon: float = 1.0,
                              num_power_iterations: int = 1) -> torch.Tensor:
    """
    Virtual Adversarial Training (VAT) loss.
    
    Finds the perturbation that maximally changes model predictions,
    then penalizes that change to push boundaries to low-density regions.
    
    Args:
        model: Neural network classifier
        x: Input batch (typically unlabeled data)
        xi: Small constant for numerical gradient estimation
        epsilon: Maximum perturbation norm
        num_power_iterations: Iterations for power method to find r_adv
    
    Returns:
        VAT loss (scalar)
    """
    model.eval()
    with torch.no_grad():
        pred = F.softmax(model(x), dim=1)
    model.train()
    
    # Initialize random perturbation
    d = torch.rand_like(x) - 0.5
    d = F.normalize(d.view(d.shape[0], -1), dim=1).view_as(x)
    
    # Power iteration to find adversarial direction
    for _ in range(num_power_iterations):
        d.requires_grad_(True)
        pred_perturbed = F.softmax(model(x + xi * d), dim=1)
        
        # KL divergence: D_KL(pred || pred_perturbed)
        kl_div = F.kl_div(
            pred_perturbed.log(), 
            pred.detach(),
            reduction='batchmean'
        )
        
        # Gradient of KL w.r.t. perturbation
        kl_div.backward()
        d = d.grad.detach()
        d = F.normalize(d.view(d.shape[0], -1), dim=1).view_as(x)
    
    # Compute adversarial perturbation with final direction
    r_adv = epsilon * d.detach()
    
    # VAT loss: KL divergence at adversarial point
    pred_adv = F.softmax(model(x + r_adv), dim=1)
    vat_loss = F.kl_div(
        pred_adv.log(),
        pred.detach(),
        reduction='batchmean'
    )
    
    return vat_loss
 
 
# Example training loop integration
def train_step_with_vat(model, optimizer, x_labeled, y_labeled, x_unlabeled,
                        alpha_vat=1.0):
    optimizer.zero_grad()
    
    # Supervised loss
    pred_labeled = model(x_labeled)
    sup_loss = F.cross_entropy(pred_labeled, y_labeled)
    
    # VAT loss on unlabeled data
    vat_loss = virtual_adversarial_loss(model, x_unlabeled)
    
    # Total loss
    total_loss = sup_loss + alpha_vat * vat_loss
    total_loss.backward()
    optimizer.step()
    
    return sup_loss.item(), vat_loss.item()

Low-Density ≠ Nowhere

Low-density separation doesn't mean boundaries are pushed infinitely far from data. It means boundaries prefer regions with fewer samples. In well-separated clusters, this is the gap between clusters. In overlapping classes, the boundary will still traverse some moderate-density regions—the assumption provides soft guidance, not hard constraints.

The Manifold Assumption

The manifold assumption addresses high-dimensional data by positing that despite living in high-dimensional space, data actually lies on a lower-dimensional manifold.

Formal Definition

Manifold Assumption:

The data lies on a low-dimensional manifold ℳ embedded in the high-dimensional input space 𝒳. The class labels vary smoothly along this manifold.

Mathematically:

Data lives on ℳ ⊂ ℝᵈ where dim(ℳ) = k << d
The labeling function f: ℳ → 𝒴 is smooth on ℳ

Why Manifolds?

Consider image data:

A 256×256 RGB image has 196,608 dimensions
But the set of 'natural images' occupies a tiny fraction of this space
Random pixel values look like noise; natural images have structure
This structure defines a manifold of much lower intrinsic dimension

Estimated intrinsic dimensions:

MNIST digits: ~10-15 dimensions (despite 784 pixel input)
Natural images: ~100-500 dimensions (despite millions of pixels)
Text embeddings: ~50-100 dimensions (despite 768+ embedding dims)

The Manifold-Aware Learning Principle

If we knew the manifold ℳ, supervised learning would be a k-dimensional problem instead of a d-dimensional one—dramatically reducing sample complexity.

Sample complexity scaling:

Without manifold: O(d/ε²) labeled samples
With known manifold: O(k/ε²) labeled samples
Improvement ratio: d/k (can be 100x or more)

Unlabeled data helps discover the manifold:

Many unlabeled points trace out ℳ's shape
Local neighborhoods reveal manifold tangent structure
Geodesic (on-manifold) distances replace Euclidean distances

Geodesic Distance vs. Euclidean Distance

A key insight of manifold-based SSL: Euclidean distance can be misleading.

Consider a curved manifold (like a Swiss roll):

Two points may be close in Euclidean ℝᵈ distance
But far apart along the manifold (geodesic distance)
Smoothness should apply along the manifold, not through 'shortcuts'

Graph-based methods approximate geodesic distance:

Connect nearby points (local Euclidean distance is valid)
Use shortest path on graph (approximates geodesic)
Apply smoothness with geodesic-based similarity

Methods Exploiting the Manifold Assumption

Manifold-Based SSL Methods

•Laplacian Eigenmaps: Compute eigenvectors of graph Laplacian to find low-dimensional manifold coordinates. Use these for classification.
•Isomap: Approximate geodesic distances via shortest paths, then apply MDS to embed in low dimensions.
•Semi-Supervised Laplacian Regularization: Learn classifier f while penalizing: $f^T L f$ where L is manifold Laplacian.
•Manifold Regularization Loss: $\mathcal{L}{manifold} = |f|{\mathcal{H}_K}^2$ penalizes function complexity in an RKHS defined by manifold geometry.
•Tangent Propagation: Penalize prediction changes along manifold tangent directions (estimated locally).

Neural Networks and Implicit Manifold Learning

Modern deep learning methods implicitly learn manifold structure:

Autoencoders:

Compress data to low-dimensional bottleneck
Bottleneck must capture manifold structure to enable reconstruction
Semi-supervised autoencoders combine reconstruction + classification losses

Contrastive Learning:

Augmented views of same image should have similar representations
Augmentations explore the local manifold around each point
The representation space encodes manifold structure

Data Augmentation as Manifold Exploration:

Augmentations (rotation, crop, color jitter) move along the image manifold
Consistency regularization enforces smoothness along these directions
Strong augmentations explore further along manifold tangents

The Augmentation Connection

Data augmentations can be viewed as defining local manifold tangent directions. When we augment an image (flip, rotate, crop) and require consistent predictions, we're saying 'these directions are within the same class region of the manifold.' The choice of augmentations encodes domain knowledge about manifold structure.

Relationships Between Assumptions

The four assumptions we've discussed are interconnected. Understanding their relationships helps in selecting and combining methods.

Hierarchy of Assumptions

Smoothness is the most general:

Cluster → implies local smoothness
Low-density separation → smoothness holds in high-density regions
Manifold → smoothness on manifold

Cluster is a special case:

Smoothness + multi-modal P(X) ≈ Cluster assumption
Cluster implies low-density separation (gaps between clusters)

Low-density separation is the boundary perspective:

Dual of cluster assumption
Smoothness implies avoiding high-density boundaries

Manifold is about representation:

Operates on embedded representation
Other assumptions apply on manifold, not ambient space

Converting Mermaid diagram...

When Assumptions Conflict

In some datasets, assumptions can conflict:

Smoothness vs. Cluster:

Smoothness wants gradual transitions
Cluster wants hard boundaries at gaps
Resolution: smoothness applies within clusters; step changes at boundaries

Manifold vs. Cluster:

Manifold is continuous (single connected component)
Clusters are disconnected
Resolution: Multiple manifolds (one per cluster) or manifold with gaps

Low-density vs. Reality:

Sometimes classes genuinely overlap in feature space
Low-density separation is impossible
Resolution: Accept soft boundaries; use probabilistic predictions

Combining Assumptions in Practice

Modern SSL methods combine assumptions:

FixMatch = Smoothness + Cluster + Pseudo-labels:

Consistency regularization (smoothness)
Confidence threshold (cluster: confident => pure cluster)
Pseudo-labels (propagate decisions)

Graph Networks = Manifold + Smoothness:

Graph encodes manifold structure
Message passing enforces smoothness on graph

Contrastive Learning = Manifold + Cluster:

Augmentations define local manifold
Negative pairs push apart different clusters

The Multi-Assumption Advantage

Methods that exploit multiple assumptions are more robust. If one assumption fails partially, others provide backup signal. This is why FixMatch and MixMatch—which combine consistency, entropy minimization, and pseudo-labeling—often outperform single-assumption methods.

Verifying Assumptions on Your Data

Before applying SSL, you should verify that relevant assumptions hold for your dataset. Here are practical diagnostics for each assumption.

Checking the Cluster Assumption

Cluster Assumption Diagnostics

•Visualization: Apply t-SNE or UMAP to your data. Do labeled points from same class cluster together? Do clusters align with classes?
•Cluster purity: Run clustering (k-means, DBSCAN) on all data. Measure purity: fraction of cluster members sharing majority label.
•Silhouette analysis: Compute silhouette scores using class labels. High scores indicate class-aligned clustering.
•k-NN class consistency: For each point, check if its k nearest neighbors share its class. High consistency supports cluster assumption.

Checking the Manifold Assumption

Manifold Assumption Diagnostics

•Intrinsic dimension estimation: Use methods like MLE, correlation dimension, or two-NN to estimate intrinsic dimension. If dim << ambient dim, manifold assumption is plausible.
•Local linearity: PCA on local neighborhoods. If first few components explain most variance, local manifold is approximately linear.
•Reconstruction from embeddings: Compress data with autoencoder. Reconstruction quality indicates whether low-dimensional representation captures structure.
•Geodesic path visualization: Interpolate between points along graph shortest paths. Do interpolations look like valid data?

assumption_diagnostics.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from collections import Counter
 
def diagnose_cluster_assumption(X, y_true, n_neighbors=10, n_clusters=None):
    """
    Diagnostic tests for the cluster assumption.
    
    Args:
        X: (n, d) feature matrix
        y_true: (n,) true labels (from labeled subset)
        n_neighbors: k for k-NN consistency check
        n_clusters: number of clusters for k-means (defaults to num classes)
    
    Returns:
        Dictionary of diagnostic metrics
    """
    n_classes = len(np.unique(y_true))
    n_clusters = n_clusters or n_classes
    
    # 1. k-NN class consistency
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, indices = nn.kneighbors(X)
    
    consistency_scores = []
    for i, neighbors in enumerate(indices):
        neighbor_labels = y_true[neighbors[1:]]  # Exclude self
        mode_label = Counter(neighbor_labels).most_common(1)[0][0]
        consistency_scores.append(1.0 if y_true[i] == mode_label else 0.0)
    
    knn_consistency = np.mean(consistency_scores)
    
    # 2. K-means cluster purity
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    cluster_labels = kmeans.labels_
    
    purity = 0.0
    for cluster_id in range(n_clusters):
        cluster_mask = cluster_labels == cluster_id
        if cluster_mask.sum() > 0:
            cluster_true_labels = y_true[cluster_mask]
            mode_count = Counter(cluster_true_labels).most_common(1)[0][1]
            purity += mode_count
    purity /= len(y_true)
    
    # 3. Silhouette score (using true labels)
    silhouette = silhouette_score(X, y_true)
    
    return {
        'knn_consistency': knn_consistency,
        'cluster_purity': purity,
        'silhouette_score': silhouette,
        'recommendation': 'SSL likely beneficial' if knn_consistency > 0.7 and purity > 0.7 
                         else 'SSL may not help' if knn_consistency < 0.5 
                         else 'SSL results may vary'
    }
 
 
def estimate_intrinsic_dimension(X, k=10):
    """
    Estimate intrinsic dimension using Two-NN method.
    
    Based on: "Estimating the intrinsic dimension of datasets" 
    by Facco et al. (2017)
    """
    nn = NearestNeighbors(n_neighbors=k+1).fit(X)
    distances, _ = nn.kneighbors(X)
    
    # Ratio of second to first neighbor distance
    r1 = distances[:, 1]  # Distance to 1st neighbor
    r2 = distances[:, 2]  # Distance to 2nd neighbor
    
    # Avoid division by zero
    valid = r1 > 1e-10
    mu = r2[valid] / r1[valid]
    
    # MLE estimate of intrinsic dimension
    # Based on mu following a Pareto distribution with parameter d
    d_estimate = 1.0 / np.mean(np.log(mu))
    
    return {
        'intrinsic_dimension': d_estimate,
        'ambient_dimension': X.shape[1],
        'dimension_ratio': d_estimate / X.shape[1],
        'manifold_likely': d_estimate < X.shape[1] * 0.1  # Less than 10% of ambient
    }

Warning Signs: When SSL Might Fail

Watch for these indicators that assumptions don't hold:

SSL Failure Warning Signs

•Low k-NN consistency (<0.5): Classes are interleaved; cluster assumption fails.
•t-SNE shows no structure: Data may not have exploitable manifold or cluster structure.
•High intrinsic dimension: Manifold assumption provides little benefit.
•SSL performance ≤ supervised baseline: Assumptions are actively violated; unlabeled data is misleading.
•Pseudo-label accuracy deteriorates over training: Confirmation bias; initial errors are reinforced.

Always Compare to Supervised Baseline

The ultimate test of SSL assumptions is empirical performance. Always train a supervised baseline on the same labeled data. If SSL doesn't outperform it, assumptions don't hold for your data, your method is poorly tuned, or both. Never blindly trust SSL—verify.

Dealing with Assumption Violations

Real-world data rarely perfectly satisfies SSL assumptions. Here we discuss strategies for handling partial or complete assumption violations.

Strategy 1: Improve Representations

If cluster structure doesn't align with classes in raw features:

Pre-train representations: Use self-supervised learning (SimCLR, BYOL) to learn features where similar instances are nearby.
Fine-tune with supervision: The few labels guide the representation toward class-relevant structure.
Then apply SSL: In the improved representation space, assumptions are more likely to hold.

This approach has become standard in practice. Modern SSL pipelines almost always involve:

Raw Data → Self-Supervised Pretraining → Fine-tuning with SSL → Final Model

Strategy 2: Robust SSL Methods

Some methods are designed to be robust to assumption violations:

Robust SSL Approaches

•Confidence thresholding: Only use pseudo-labels above a high confidence threshold (e.g., 0.95). Low-confidence predictions in ambiguous regions are ignored.
•Class-balanced pseudo-labeling: Weight pseudo-labels to maintain class balance, preventing majority class dominance.
•Curriculum pseudo-labeling: Start with only highest-confidence pseudo-labels, gradually include more as training progresses.
•Negative learning: Use negative pseudo-labels ('not this class') which are more robust to errors than positive labels.
•Meta-learning for weighting: Learn to weight pseudo-labels based on their quality (Meta-Weight-Net, MentorNet).

Strategy 3: Assumption-Free Baselines

When uncertain about assumptions, consider approaches that make minimal assumptions:

Self-Training with High Threshold:

Only propagate labels with >0.99 confidence
Very conservative; may not use much unlabeled data
But avoids error propagation from wrong pseudo-labels

Representation Learning Only:

Use contrastive learning to improve representations
Train final classifier purely supervised on labeled data
Doesn't use pseudo-labels; avoids assumption-dependent SSL losses

Ensemble of Experts:

Train multiple models with different random seeds
Only pseudo-label points where all models agree
Consensus reduces impact of any single model's biases

Strategy 4: Domain-Adaptive Augmentations

If standard augmentations don't respect your data's structure:

Domain knowledge: Design augmentations that preserve class labels in your domain
Learned augmentations: Use AutoAugment or RandAugment tuned for your task
Careful validation: Verify augmented samples visually; do they look like valid members of the same class?

The Conservative Approach

When in doubt, be conservative: high confidence thresholds, representation learning before SSL, and always compare to supervised baseline. A 5% improvement from SSL is worse than no improvement if SSL degrades 20% of cases—reliability matters more than peak performance.

Summary: Assumptions as the Foundation of SSL

We have examined the fundamental assumptions that enable semi-supervised learning. These assumptions are not optional—they are the theoretical bedrock on which all SSL methods stand. Let's consolidate the key insights:

Key Takeaways

•Assumptions are necessary: Without assumptions linking P(X) to P(Y|X), unlabeled data cannot help. Every SSL method makes such assumptions, explicitly or implicitly.
•Smoothness: If nearby points have similar labels, consistency regularization and graph methods can exploit local structure.
•Cluster assumption: If data forms class-aligned clusters, pseudo-labeling and entropy minimization work well.
•Low-density separation: If decision boundaries belong in sparse regions, TSVM and VAT push boundaries appropriately.
•Manifold assumption: If data lies on a low-dimensional manifold, the effective sample complexity is reduced dramatically.
•Verify before trusting: Use diagnostics (k-NN consistency, cluster purity, dimension estimates) to check assumptions before deploying SSL.

What's Next:

With assumptions understood, the next page examines the evaluation challenges unique to semi-supervised learning. We'll explore how to properly evaluate SSL methods, the pitfalls of naive evaluation, and best practices for rigorous experimental design in the low-label regime.

Page Complete

You now understand the four fundamental assumptions of semi-supervised learning: smoothness, cluster, low-density separation, and manifold. You can identify which assumptions your data satisfies, select methods that exploit those assumptions, and apply diagnostics to verify SSL appropriateness. This knowledge transforms SSL from a black box into a principled tool you can apply with confidence.

Assumptions: The Foundations of Semi-Supervised Learning

Why Assumptions Are Everything in SSL

Understanding these assumptions transforms you from a user of SSL methods into someone who can:

Diagnose why SSL fails on specific datasets
Design methods tailored to your domain's structure
Verify whether SSL is appropriate for your problem
Combine assumptions for maximum benefit

What You Will Learn

The Smoothness Assumption

The most fundamental assumption in semi-supervised learning is smoothness: if two points x and x' are close in feature space, their labels y and y' should be similar.

Formal Definition

Smoothness Assumption:

If x₁ and x₂ are close in a high-density region of P(X), then the corresponding outputs y₁ and y₂ should also be close.

Mathematically, for a function f: 𝒳 → 𝒴:

$$|f(x_1) - f(x_2)| \leq L \cdot d(x_1, x_2)$$

when both x₁ and x₂ lie in high-density regions (p(x₁), p(x₂) > threshold).

Here, L is the Lipschitz constant—a measure of how 'smooth' f can be. Smaller L means smoother functions.

Why High-Density Matters

Consider two points on opposite sides of a decision boundary:

If they're in a low-density region (few samples nearby), smoothness doesn't require their labels match
If they're in a high-density region, the boundary violates smoothness

This asymmetry allows for sharp boundaries while demanding smooth behavior where data concentrates.

How Unlabeled Data Exploits Smoothness

Unlabeled data reveals the density structure of P(X). Given this structure:

Identify high-density regions: Clusters, modes, or regions with many samples
Enforce smoothness within regions: Similar predictions for nearby points
Allow boundaries in low-density gaps: Where smoothness constraint is relaxed

Methods exploiting smoothness:

Consistency Regularization: $$\mathcal{L}{smooth} = \mathbb{E}{x \sim D_U}\left[|f(x) - f(x + \eta)|^2\right]$$

where η is small noise. This directly encourages the function to be smooth locally.

Graph Laplacian Regularization: $$\mathcal{L}{graph} = \sum{i,j} w_{ij}|f(x_i) - f(x_j)|^2 = f^T L f$$

where L is the graph Laplacian and w_ij encodes similarity. This penalizes prediction differences between similar points.

smoothness_regularization.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
import torch
import torch.nn as nn
import torch.nn.functional as F
 
class SmoothnessRegularizer:
    """
    Implements smoothness-based regularization for SSL.
    """
    
    @staticmethod
    def local_smoothness_loss(model: nn.Module, 
                               x_unlabeled: torch.Tensor, 
                               noise_std: float = 0.1,
                               num_perturbations: int = 2) -> torch.Tensor:
        """
        Local smoothness loss: predictions should be similar 
        for x and x + noise.
        
        L = E[||f(x) - f(x + ε)||²]  where ε ~ N(0, σ²)
        """
        model.eval()  # Use eval mode for base prediction
        with torch.no_grad():
            base_pred = model(x_unlabeled)
        
        model.train()
        smoothness_loss = 0.0
        
        for _ in range(num_perturbations):
            # Add small Gaussian noise
            noise = torch.randn_like(x_unlabeled) * noise_std
            perturbed_x = x_unlabeled + noise
            perturbed_pred = model(perturbed_x)
            
            # MSE between base and perturbed predictions
            smoothness_loss += F.mse_loss(perturbed_pred, base_pred.detach())
        
        return smoothness_loss / num_perturbations
    
    @staticmethod
    def graph_laplacian_loss(embeddings: torch.Tensor,
                             adjacency: torch.Tensor,
                             predictions: torch.Tensor) -> torch.Tensor:
        """
        Graph Laplacian regularization: f^T L f
        
        Args:
            embeddings: (n, d) feature embeddings for graph construction
            adjacency: (n, n) precomputed adjacency matrix (or None to compute)
            predictions: (n, c) soft predictions from model
        """
        n = predictions.shape[0]
        
        if adjacency is None:
            # Compute k-NN adjacency from embeddings
            k = min(10, n - 1)
            distances = torch.cdist(embeddings, embeddings)
            _, indices = torch.topk(distances, k + 1, largest=False)
            
            adjacency = torch.zeros(n, n, device=embeddings.device)
            for i in range(n):
                adjacency[i, indices[i, 1:]] = 1.0
            adjacency = (adjacency + adjacency.T) / 2  # Symmetrize
        
        # Compute Laplacian: L = D - W
        degree = adjacency.sum(dim=1)
        laplacian = torch.diag(degree) - adjacency
        
        # Quadratic form: sum over classes
        loss = 0.0
        for c in range(predictions.shape[1]):
            f_c = predictions[:, c]
            loss += f_c @ laplacian @ f_c
        
        return loss / predictions.shape[1]  # Average over classes

When Smoothness Holds

The Cluster Assumption

The cluster assumption is perhaps the most intuitive premise of semi-supervised learning: data points naturally form clusters, and points in the same cluster tend to share the same label.

Formal Definition

Cluster Assumption:

The data distribution P(X) forms clusters, and points within the same cluster are more likely to share the same class label.

More precisely:

P(X) can be decomposed into K clusters: P(X) = Σₖ πₖ P(X|cluster=k)
P(Y|cluster=k) is concentrated on one or few classes

Relationship to Smoothness

The cluster assumption is actually a special case of the smoothness assumption:

Within-cluster, points are densely connected
Between-cluster, there is low density (gaps)
Smoothness forces similar labels within clusters
Low density allows different labels between clusters

Methods Exploiting the Cluster Assumption

1. Cluster-then-Label:

The simplest approach:

Cluster all data (labeled + unlabeled) using k-means, DBSCAN, etc.
Assign cluster labels based on majority vote of labeled points in each cluster
Propagate labels to all points in each cluster

2. Transductive SVM (Decision Boundary in Low-Density):

TSVM explicitly seeks decision boundaries that:

Correctly classify labeled data
Pass through low-density regions
Maximize margin on both labeled and unlabeled data

3. Entropy Minimization:

Encourages confident predictions on unlabeled data:

$$\mathcal{L}{ent} = -\frac{1}{u}\sum{j=1}^{u}\sum_{c=1}^{C} p_c(x_j) \log p_c(x_j)$$

Low entropy means predictions are confident (near 0 or 1), which encourages points within the same cluster to be assigned to the same class confidently.

The Class-Cluster Correspondence Problem

A critical subtlety: the cluster assumption requires that clusters correspond to classes. This is not always true:

Case 1: Clusters = Classes (Ideal)

Each cluster contains points from a single class
SSL works perfectly using cluster structure

Case 2: Classes ⊂ Clusters (Manageable)

Each class forms multiple sub-clusters
SSL can still help, though some labeled data from each sub-cluster is needed

Case 3: Clusters ⊂ Classes (Problematic)

Multiple classes mixed within some clusters
Cluster structure misleads SSL
Can cause worse performance than supervised baseline

Case 4: Clusters ⊥ Classes (Disastrous)

Cluster structure is orthogonal to class structure
E.g., images cluster by brightness but classes are objects
SSL will aggressively harm performance

Cluster-Class Correspondence Scenarios
Scenario	Example	SSL Effect	Mitigation
Classes = Clusters	Distinct species in images	Strong positive	Use standard SSL
Classes form sub-clusters	Object poses/viewpoints	Moderate positive	Ensure labeled samples span sub-clusters
Clusters contain mixed classes	Similar-looking different objects	Possible negative	Careful class-balanced pseudo-labeling
Clusters orthogonal to classes	Brightness clusters, semantic labels	Strong negative	Learn representations first, then cluster

The Cluster Assumption Trap

Low-Density Separation

The low-density separation principle is the contrapositive of the cluster assumption: decision boundaries should pass through regions where few data points reside.

Formal Definition

Low-Density Separation:

The decision boundary of the classifier should lie in low-density regions of P(X).

Equivalently:

Class posterior P(Y|X) should change mainly in regions where P(X) is small.

Mathematical Formulation

We can formalize this as seeking a decision boundary B ⊂ 𝒳 that minimizes:

$$\int_{x \in B} p(x) dx$$

subject to correctly classifying labeled points.

This integral measures the 'density traversed' by the boundary. A boundary through empty space has cost 0; a boundary through a cluster has high cost.

Methods Implementing Low-Density Separation

Low-Density Separation Methods

•Transductive SVM: Maximizes margin on unlabeled points, pushing boundary to low-density regions where unlabeled points are scarce.
•Entropy Minimization: High-confidence predictions at all points means the model 'commits' everywhere, placing boundaries only where it must (low-density regions).
•Virtual Adversarial Training (VAT): Finds adversarial perturbations that maximally change predictions, then penalizes such changes. Boundaries become stable (low-density) regions.
•MixUp and CutMix: Interpolations between samples encourage linear behavior between class regions, implicitly pushing boundaries to sparse interpolated regions.

Virtual Adversarial Training (VAT) Deep Dive

VAT operationalizes low-density separation through local perturbation analysis:

1. Find Adversarial Direction:

For each point x, find the perturbation r that maximally changes predictions:

$$r_{adv} = \arg\max_{|r| \leq \epsilon} D_{KL}(p(y|x) | p(y|x+r))$$

This direction points toward the nearest decision boundary.

2. Penalize Prediction Change:

$$\mathcal{L}{VAT}(x) = D{KL}(p(y|x) | p(y|x+r_{adv}))$$

3. Effect:

Minimizing this loss pushes the decision boundary away from x. Since we apply this to all unlabeled points, boundaries are pushed to regions devoid of data—achieving low-density separation.

virtual_adversarial_training.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
import torch
import torch.nn as nn
import torch.nn.functional as F
 
def virtual_adversarial_loss(model: nn.Module, 
                              x: torch.Tensor, 
                              xi: float = 1e-6,
                              epsilon: float = 1.0,
                              num_power_iterations: int = 1) -> torch.Tensor:
    """
    Virtual Adversarial Training (VAT) loss.
    
    Finds the perturbation that maximally changes model predictions,
    then penalizes that change to push boundaries to low-density regions.
    
    Args:
        model: Neural network classifier
        x: Input batch (typically unlabeled data)
        xi: Small constant for numerical gradient estimation
        epsilon: Maximum perturbation norm
        num_power_iterations: Iterations for power method to find r_adv
    
    Returns:
        VAT loss (scalar)
    """
    model.eval()
    with torch.no_grad():
        pred = F.softmax(model(x), dim=1)
    model.train()
    
    # Initialize random perturbation
    d = torch.rand_like(x) - 0.5
    d = F.normalize(d.view(d.shape[0], -1), dim=1).view_as(x)
    
    # Power iteration to find adversarial direction
    for _ in range(num_power_iterations):
        d.requires_grad_(True)
        pred_perturbed = F.softmax(model(x + xi * d), dim=1)
        
        # KL divergence: D_KL(pred || pred_perturbed)
        kl_div = F.kl_div(
            pred_perturbed.log(), 
            pred.detach(),
            reduction='batchmean'
        )
        
        # Gradient of KL w.r.t. perturbation
        kl_div.backward()
        d = d.grad.detach()
        d = F.normalize(d.view(d.shape[0], -1), dim=1).view_as(x)
    
    # Compute adversarial perturbation with final direction
    r_adv = epsilon * d.detach()
    
    # VAT loss: KL divergence at adversarial point
    pred_adv = F.softmax(model(x + r_adv), dim=1)
    vat_loss = F.kl_div(
        pred_adv.log(),
        pred.detach(),
        reduction='batchmean'
    )
    
    return vat_loss
 
 
# Example training loop integration
def train_step_with_vat(model, optimizer, x_labeled, y_labeled, x_unlabeled,
                        alpha_vat=1.0):
    optimizer.zero_grad()
    
    # Supervised loss
    pred_labeled = model(x_labeled)
    sup_loss = F.cross_entropy(pred_labeled, y_labeled)
    
    # VAT loss on unlabeled data
    vat_loss = virtual_adversarial_loss(model, x_unlabeled)
    
    # Total loss
    total_loss = sup_loss + alpha_vat * vat_loss
    total_loss.backward()
    optimizer.step()
    
    return sup_loss.item(), vat_loss.item()

Low-Density ≠ Nowhere

The Manifold Assumption

The manifold assumption addresses high-dimensional data by positing that despite living in high-dimensional space, data actually lies on a lower-dimensional manifold.

Formal Definition

Manifold Assumption:

The data lies on a low-dimensional manifold ℳ embedded in the high-dimensional input space 𝒳. The class labels vary smoothly along this manifold.

Mathematically:

Data lives on ℳ ⊂ ℝᵈ where dim(ℳ) = k << d
The labeling function f: ℳ → 𝒴 is smooth on ℳ

Why Manifolds?

Consider image data:

A 256×256 RGB image has 196,608 dimensions
But the set of 'natural images' occupies a tiny fraction of this space
Random pixel values look like noise; natural images have structure
This structure defines a manifold of much lower intrinsic dimension

Estimated intrinsic dimensions:

MNIST digits: ~10-15 dimensions (despite 784 pixel input)
Natural images: ~100-500 dimensions (despite millions of pixels)
Text embeddings: ~50-100 dimensions (despite 768+ embedding dims)

The Manifold-Aware Learning Principle

If we knew the manifold ℳ, supervised learning would be a k-dimensional problem instead of a d-dimensional one—dramatically reducing sample complexity.

Sample complexity scaling:

Without manifold: O(d/ε²) labeled samples
With known manifold: O(k/ε²) labeled samples
Improvement ratio: d/k (can be 100x or more)

Unlabeled data helps discover the manifold:

Many unlabeled points trace out ℳ's shape
Local neighborhoods reveal manifold tangent structure
Geodesic (on-manifold) distances replace Euclidean distances

Geodesic Distance vs. Euclidean Distance

A key insight of manifold-based SSL: Euclidean distance can be misleading.

Consider a curved manifold (like a Swiss roll):

Two points may be close in Euclidean ℝᵈ distance
But far apart along the manifold (geodesic distance)
Smoothness should apply along the manifold, not through 'shortcuts'

Graph-based methods approximate geodesic distance:

Connect nearby points (local Euclidean distance is valid)
Use shortest path on graph (approximates geodesic)
Apply smoothness with geodesic-based similarity

Methods Exploiting the Manifold Assumption

Manifold-Based SSL Methods

•Laplacian Eigenmaps: Compute eigenvectors of graph Laplacian to find low-dimensional manifold coordinates. Use these for classification.
•Isomap: Approximate geodesic distances via shortest paths, then apply MDS to embed in low dimensions.
•Semi-Supervised Laplacian Regularization: Learn classifier f while penalizing: $f^T L f$ where L is manifold Laplacian.
•Manifold Regularization Loss: $\mathcal{L}{manifold} = |f|{\mathcal{H}_K}^2$ penalizes function complexity in an RKHS defined by manifold geometry.
•Tangent Propagation: Penalize prediction changes along manifold tangent directions (estimated locally).

Neural Networks and Implicit Manifold Learning

Modern deep learning methods implicitly learn manifold structure:

Autoencoders:

Compress data to low-dimensional bottleneck
Bottleneck must capture manifold structure to enable reconstruction
Semi-supervised autoencoders combine reconstruction + classification losses

Contrastive Learning:

Augmented views of same image should have similar representations
Augmentations explore the local manifold around each point
The representation space encodes manifold structure

Data Augmentation as Manifold Exploration:

Augmentations (rotation, crop, color jitter) move along the image manifold
Consistency regularization enforces smoothness along these directions
Strong augmentations explore further along manifold tangents

The Augmentation Connection

Relationships Between Assumptions

The four assumptions we've discussed are interconnected. Understanding their relationships helps in selecting and combining methods.

Hierarchy of Assumptions

Smoothness is the most general:

Cluster → implies local smoothness
Low-density separation → smoothness holds in high-density regions
Manifold → smoothness on manifold

Cluster is a special case:

Smoothness + multi-modal P(X) ≈ Cluster assumption
Cluster implies low-density separation (gaps between clusters)

Low-density separation is the boundary perspective:

Dual of cluster assumption
Smoothness implies avoiding high-density boundaries

Manifold is about representation:

Operates on embedded representation
Other assumptions apply on manifold, not ambient space

Converting Mermaid diagram...

When Assumptions Conflict

In some datasets, assumptions can conflict:

Smoothness vs. Cluster:

Smoothness wants gradual transitions
Cluster wants hard boundaries at gaps
Resolution: smoothness applies within clusters; step changes at boundaries

Manifold vs. Cluster:

Manifold is continuous (single connected component)
Clusters are disconnected
Resolution: Multiple manifolds (one per cluster) or manifold with gaps

Low-density vs. Reality:

Sometimes classes genuinely overlap in feature space
Low-density separation is impossible
Resolution: Accept soft boundaries; use probabilistic predictions

Combining Assumptions in Practice

Modern SSL methods combine assumptions:

FixMatch = Smoothness + Cluster + Pseudo-labels:

Consistency regularization (smoothness)
Confidence threshold (cluster: confident => pure cluster)
Pseudo-labels (propagate decisions)

Graph Networks = Manifold + Smoothness:

Graph encodes manifold structure
Message passing enforces smoothness on graph

Contrastive Learning = Manifold + Cluster:

Augmentations define local manifold
Negative pairs push apart different clusters

The Multi-Assumption Advantage

Verifying Assumptions on Your Data

Before applying SSL, you should verify that relevant assumptions hold for your dataset. Here are practical diagnostics for each assumption.

Checking the Cluster Assumption

Cluster Assumption Diagnostics

•Visualization: Apply t-SNE or UMAP to your data. Do labeled points from same class cluster together? Do clusters align with classes?
•Cluster purity: Run clustering (k-means, DBSCAN) on all data. Measure purity: fraction of cluster members sharing majority label.
•Silhouette analysis: Compute silhouette scores using class labels. High scores indicate class-aligned clustering.
•k-NN class consistency: For each point, check if its k nearest neighbors share its class. High consistency supports cluster assumption.

Checking the Manifold Assumption

Manifold Assumption Diagnostics

•Intrinsic dimension estimation: Use methods like MLE, correlation dimension, or two-NN to estimate intrinsic dimension. If dim << ambient dim, manifold assumption is plausible.
•Local linearity: PCA on local neighborhoods. If first few components explain most variance, local manifold is approximately linear.
•Reconstruction from embeddings: Compress data with autoencoder. Reconstruction quality indicates whether low-dimensional representation captures structure.
•Geodesic path visualization: Interpolate between points along graph shortest paths. Do interpolations look like valid data?

assumption_diagnostics.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from collections import Counter
 
def diagnose_cluster_assumption(X, y_true, n_neighbors=10, n_clusters=None):
    """
    Diagnostic tests for the cluster assumption.
    
    Args:
        X: (n, d) feature matrix
        y_true: (n,) true labels (from labeled subset)
        n_neighbors: k for k-NN consistency check
        n_clusters: number of clusters for k-means (defaults to num classes)
    
    Returns:
        Dictionary of diagnostic metrics
    """
    n_classes = len(np.unique(y_true))
    n_clusters = n_clusters or n_classes
    
    # 1. k-NN class consistency
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, indices = nn.kneighbors(X)
    
    consistency_scores = []
    for i, neighbors in enumerate(indices):
        neighbor_labels = y_true[neighbors[1:]]  # Exclude self
        mode_label = Counter(neighbor_labels).most_common(1)[0][0]
        consistency_scores.append(1.0 if y_true[i] == mode_label else 0.0)
    
    knn_consistency = np.mean(consistency_scores)
    
    # 2. K-means cluster purity
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    cluster_labels = kmeans.labels_
    
    purity = 0.0
    for cluster_id in range(n_clusters):
        cluster_mask = cluster_labels == cluster_id
        if cluster_mask.sum() > 0:
            cluster_true_labels = y_true[cluster_mask]
            mode_count = Counter(cluster_true_labels).most_common(1)[0][1]
            purity += mode_count
    purity /= len(y_true)
    
    # 3. Silhouette score (using true labels)
    silhouette = silhouette_score(X, y_true)
    
    return {
        'knn_consistency': knn_consistency,
        'cluster_purity': purity,
        'silhouette_score': silhouette,
        'recommendation': 'SSL likely beneficial' if knn_consistency > 0.7 and purity > 0.7 
                         else 'SSL may not help' if knn_consistency < 0.5 
                         else 'SSL results may vary'
    }
 
 
def estimate_intrinsic_dimension(X, k=10):
    """
    Estimate intrinsic dimension using Two-NN method.
    
    Based on: "Estimating the intrinsic dimension of datasets" 
    by Facco et al. (2017)
    """
    nn = NearestNeighbors(n_neighbors=k+1).fit(X)
    distances, _ = nn.kneighbors(X)
    
    # Ratio of second to first neighbor distance
    r1 = distances[:, 1]  # Distance to 1st neighbor
    r2 = distances[:, 2]  # Distance to 2nd neighbor
    
    # Avoid division by zero
    valid = r1 > 1e-10
    mu = r2[valid] / r1[valid]
    
    # MLE estimate of intrinsic dimension
    # Based on mu following a Pareto distribution with parameter d
    d_estimate = 1.0 / np.mean(np.log(mu))
    
    return {
        'intrinsic_dimension': d_estimate,
        'ambient_dimension': X.shape[1],
        'dimension_ratio': d_estimate / X.shape[1],
        'manifold_likely': d_estimate < X.shape[1] * 0.1  # Less than 10% of ambient
    }

Warning Signs: When SSL Might Fail

Watch for these indicators that assumptions don't hold:

SSL Failure Warning Signs

•Low k-NN consistency (<0.5): Classes are interleaved; cluster assumption fails.
•t-SNE shows no structure: Data may not have exploitable manifold or cluster structure.
•High intrinsic dimension: Manifold assumption provides little benefit.
•SSL performance ≤ supervised baseline: Assumptions are actively violated; unlabeled data is misleading.
•Pseudo-label accuracy deteriorates over training: Confirmation bias; initial errors are reinforced.

Always Compare to Supervised Baseline

Dealing with Assumption Violations

Real-world data rarely perfectly satisfies SSL assumptions. Here we discuss strategies for handling partial or complete assumption violations.

Strategy 1: Improve Representations

If cluster structure doesn't align with classes in raw features:

Pre-train representations: Use self-supervised learning (SimCLR, BYOL) to learn features where similar instances are nearby.
Fine-tune with supervision: The few labels guide the representation toward class-relevant structure.
Then apply SSL: In the improved representation space, assumptions are more likely to hold.

This approach has become standard in practice. Modern SSL pipelines almost always involve:

Raw Data → Self-Supervised Pretraining → Fine-tuning with SSL → Final Model

Strategy 2: Robust SSL Methods

Some methods are designed to be robust to assumption violations:

Robust SSL Approaches

•Confidence thresholding: Only use pseudo-labels above a high confidence threshold (e.g., 0.95). Low-confidence predictions in ambiguous regions are ignored.
•Class-balanced pseudo-labeling: Weight pseudo-labels to maintain class balance, preventing majority class dominance.
•Curriculum pseudo-labeling: Start with only highest-confidence pseudo-labels, gradually include more as training progresses.
•Negative learning: Use negative pseudo-labels ('not this class') which are more robust to errors than positive labels.
•Meta-learning for weighting: Learn to weight pseudo-labels based on their quality (Meta-Weight-Net, MentorNet).

Strategy 3: Assumption-Free Baselines

When uncertain about assumptions, consider approaches that make minimal assumptions:

Self-Training with High Threshold:

Only propagate labels with >0.99 confidence
Very conservative; may not use much unlabeled data
But avoids error propagation from wrong pseudo-labels

Representation Learning Only:

Use contrastive learning to improve representations
Train final classifier purely supervised on labeled data
Doesn't use pseudo-labels; avoids assumption-dependent SSL losses

Ensemble of Experts:

Train multiple models with different random seeds
Only pseudo-label points where all models agree
Consensus reduces impact of any single model's biases

Strategy 4: Domain-Adaptive Augmentations

If standard augmentations don't respect your data's structure:

Domain knowledge: Design augmentations that preserve class labels in your domain
Learned augmentations: Use AutoAugment or RandAugment tuned for your task
Careful validation: Verify augmented samples visually; do they look like valid members of the same class?

The Conservative Approach

Summary: Assumptions as the Foundation of SSL

Key Takeaways

•Assumptions are necessary: Without assumptions linking P(X) to P(Y|X), unlabeled data cannot help. Every SSL method makes such assumptions, explicitly or implicitly.
•Smoothness: If nearby points have similar labels, consistency regularization and graph methods can exploit local structure.
•Cluster assumption: If data forms class-aligned clusters, pseudo-labeling and entropy minimization work well.
•Low-density separation: If decision boundaries belong in sparse regions, TSVM and VAT push boundaries appropriately.
•Manifold assumption: If data lies on a low-dimensional manifold, the effective sample complexity is reduced dramatically.
•Verify before trusting: Use diagnostics (k-NN consistency, cluster purity, dimension estimates) to check assumptions before deploying SSL.

What's Next:

Page Complete