Having established the practical importance of label scarcity in the previous page, we now shift to a rigorous mathematical treatment of the semi-supervised learning problem. This formalization is not mere academic exercise—it provides the precise language needed to state assumptions, prove guarantees, and understand when semi-supervised methods can and cannot help.
The core question we address: Under what conditions can unlabeled data improve learning, and how do we formalize this mathematically?
This page provides a complete formal treatment of the semi-supervised setting. You will understand: (1) The mathematical notation and data model, (2) How SSL differs from supervised and unsupervised learning, (3) The learning objectives in SSL, (4) The distinction between inductive and transductive settings, and (5) The fundamental limits and possibilities of learning from unlabeled data.
We begin by establishing precise notation that will be used throughout this chapter and the field more broadly.
Assume data is drawn from an underlying joint distribution P(X, Y) over a feature space 𝒳 and label space 𝒴. This joint distribution factors as:
$$P(X, Y) = P(Y | X) \cdot P(X) = P(X | Y) \cdot P(Y)$$
In supervised learning, we observe samples from P(X, Y) directly. In semi-supervised learning, we observe only a small set of labeled pairs together with many samples from the marginal P(X). Let us define the semi-supervised learning problem precisely:
Given: (1) A labeled dataset D_L = {(x₁, y₁), ..., (x_l, y_l)} where (x_i, y_i) ~ P(X, Y), (2) An unlabeled dataset D_U = {x_{l+1}, ..., x_{l+u}} where x_j ~ P(X), and (3) A hypothesis class ℋ of functions h: 𝒳 → 𝒴.
Find: A function f ∈ ℋ that minimizes the expected risk: R(f) = 𝔼_{(X,Y)~P}[L(f(X), Y)] where L is a loss function.
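A minimal sketch of this observation model (the two-Gaussian synthetic distribution and all variable names are illustrative assumptions): draw n pairs from P(X, Y), then reveal labels for only the first l.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic joint distribution P(X, Y): two Gaussian classes in R^2.
n, l = 1000, 20                     # total samples, labeled samples
y_all = rng.integers(0, 2, size=n)  # Y ~ P(Y), uniform over {0, 1}
X_all = rng.normal(loc=y_all[:, None] * 3.0, scale=1.0, size=(n, 2))  # X | Y

# Semi-supervised observation: l labeled pairs, u = n - l unlabeled points.
D_L = (X_all[:l], y_all[:l])        # {(x_i, y_i)} ~ P(X, Y)
D_U = X_all[l:]                     # {x_j} ~ P(X), labels hidden
u = len(D_U)
```

The learner sees only D_L and D_U; the hidden labels y_all[l:] exist solely for evaluation.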
We use the following notation throughout:
| Symbol | Meaning | Typical Values |
|---|---|---|
| n = l + u | Total number of samples | 10³ - 10⁹ |
| l | Number of labeled samples | 10¹ - 10⁶ |
| u | Number of unlabeled samples | 10³ - 10⁹ |
| r = l/n | Label ratio | 0.001 - 0.1 |
| 𝒳 | Feature space | ℝᵈ, images, text, etc. |
| 𝒴 | Label space | {0,1}, {1,...,K}, ℝ |
| P(X,Y) | Joint distribution | Unknown, to be learned |
| P(X) | Marginal distribution | Observable from D_U |
| P(Y|X) | Conditional (target) | To be learned |
| ℋ | Hypothesis class | Neural networks, etc. |
| L(ŷ, y) | Loss function | Cross-entropy, MSE, etc. |
The core challenge in SSL is that we observe P(X) well (through many unlabeled samples) but P(Y|X) poorly (through few labeled samples). The question is whether knowledge of P(X) can help in learning P(Y|X).
In full generality, the answer is no—knowing the marginal P(X) tells us nothing about the conditional P(Y|X). This is the famous semi-supervised learning impossibility result:
Theorem (Informal): Without additional assumptions linking P(X) and P(Y|X), no consistent estimator can benefit from unlabeled data for estimating P(Y|X).
This theorem is both discouraging and clarifying: it tells us that assumptions are necessary. The entire field of semi-supervised learning is fundamentally about identifying and exploiting assumptions that link the marginal and conditional distributions.
Every semi-supervised algorithm, implicitly or explicitly, makes assumptions about how P(X) relates to P(Y|X). Without such assumptions, unlabeled data provides zero information about the classification problem. Understanding these assumptions—and verifying they hold for your data—is critical for successful SSL deployment.
Semi-supervised learning occupies a specific position in the landscape of machine learning paradigms. Understanding its relationship to neighboring settings clarifies what SSL is and isn't.
| Paradigm | Labeled Data | Unlabeled Data | Goal | Key Challenge |
|---|---|---|---|---|
| Supervised Learning | All | None | Learn P(Y|X) | Label cost, overfitting |
| Semi-Supervised Learning | Few | Many | Learn P(Y|X) | Leveraging P(X) |
| Unsupervised Learning | None | All | Learn P(X) structure | No task guidance |
| Self-Supervised Learning | None (pseudo) | All | Learn representations | Pretext design |
| Weakly-Supervised | Noisy/Partial | Varies | Learn P(Y|X) | Label noise |
| Active Learning | Interactive | Pool | Learn P(Y|X) | Query selection |
| Transfer Learning | Different task | Target task | Adapt knowledge | Domain gap |
In supervised learning, all n samples are labeled. The standard empirical risk minimization (ERM) objective is:
$$\hat{f}_{SL} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$$
In semi-supervised learning, we only have l << n labels. The naive approach—ignoring unlabeled data—yields:
$$\hat{f}_{\text{naive}} = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} L(f(x_i), y_i)$$
This suffers from high variance due to small sample size. SSL methods add regularization terms derived from unlabeled data:
$$\hat{f}_{SSL} = \arg\min_{f \in \mathcal{H}} \underbrace{\frac{1}{l} \sum_{i=1}^{l} L(f(x_i), y_i)}_{\text{supervised loss}} + \lambda \cdot \underbrace{R(f; D_U)}_{\text{unlabeled regularization}}$$
where R(f; D_U) is a regularization term computed over unlabeled data.
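As a concrete (and deliberately toy) instance, the sketch below uses entropy minimization over unlabeled predictions as R(f; D_U); the choice of regularizer, the logits, and λ are illustrative assumptions, not a specific published method.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ssl_objective(logits_l, y_l, logits_u, lam=1.0):
    """Supervised cross-entropy plus lambda * R(f; D_U), with R taken
    to be the mean entropy of the unlabeled predictions."""
    p_l = softmax(logits_l)
    sup = -np.mean(np.log(p_l[np.arange(len(y_l)), y_l] + 1e-12))
    p_u = softmax(logits_u)
    unsup = -np.mean(np.sum(p_u * np.log(p_u + 1e-12), axis=1))
    return sup + lam * unsup

# Confident (low-entropy) unlabeled predictions lower the objective:
logits_l = np.array([[2.0, -1.0], [-1.5, 1.0]])
y_l = np.array([0, 1])
sharp = np.array([[4.0, -4.0]])  # confident unlabeled prediction
flat = np.array([[0.0, 0.0]])    # maximally uncertain prediction
assert ssl_objective(logits_l, y_l, sharp) < ssl_objective(logits_l, y_l, flat)
```

Setting lam=0 recovers the naive supervised-only estimator.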
Unsupervised learning seeks to understand P(X) without any labels—clustering, density estimation, dimensionality reduction. SSL uses unsupervised learning as a means to an end: we care about P(X) only insofar as it helps predict Y.
The key distinction: unsupervised learning treats the structure of P(X) as the goal in itself, whereas SSL exploits that structure only as evidence about P(Y|X). This matters practically. A beautiful cluster structure in X is useless for SSL if these clusters don't correspond to the label structure; SSL methods must align unsupervised structure with the supervised task.
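When a few labels are available, this alignment can be checked directly by measuring the majority-label purity of each cluster over the labeled points (a diagnostic sketch; the helper name is hypothetical, and cluster assignments could come from any clustering algorithm):

```python
from collections import Counter

def cluster_label_purity(cluster_ids, labels):
    """Fraction of labeled points whose label matches the majority label
    of their cluster: near 1.0 means clusters align with classes,
    near 1/K means the cluster structure is useless for the task."""
    total = 0
    for c in set(cluster_ids):
        members = [y for k, y in zip(cluster_ids, labels) if k == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)

aligned = cluster_label_purity([0, 0, 1, 1], [0, 0, 1, 1])  # clusters = classes
mixed = cluster_label_purity([0, 0, 1, 1], [0, 1, 0, 1])    # clusters mix classes
assert aligned == 1.0 and mixed == 0.5
```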
Self-supervised learning creates pseudo-labels from the data itself through pretext tasks (e.g., predicting masked words, image rotation, next-frame prediction). The learned representations are then fine-tuned on downstream tasks.
The relationship is subtle: self-supervised learning uses no true labels at all and optimizes a pretext objective, whereas semi-supervised learning optimizes the target task directly using both labeled and unlabeled data. Modern approaches often combine both: self-supervised pretraining on a large unlabeled corpus, followed by supervised or semi-supervised fine-tuning on the small labeled set.
This hybrid approach (exemplified by BERT, GPT, etc.) has proven extraordinarily effective.
The field uses 'SSL' to abbreviate both semi-supervised and self-supervised learning. Context usually clarifies, but be attentive. In this chapter, we use 'semi-supervised' explicitly to avoid ambiguity.
Semi-supervised learning methods can be understood through a unified objective function framework. While specific methods differ in their regularization approaches, they share a common structure.
The prototypical semi-supervised objective takes the form:
$$\mathcal{L}_{SSL}(f; D_L, D_U) = \underbrace{\mathcal{L}_{sup}(f; D_L)}_{\text{Supervised Term}} + \lambda \cdot \underbrace{\mathcal{L}_{unsup}(f; D_U)}_{\text{Unsupervised Term}}$$
where ℒ_sup is the supervised loss computed on D_L, ℒ_unsup is an unsupervised regularization term computed on D_U, and λ ≥ 0 balances the two.
The supervised term is standard classification/regression loss:
$$\mathcal{L}_{sup}(f; D_L) = \frac{1}{l} \sum_{i=1}^{l} L(f(x_i), y_i)$$
For classification, L is typically cross-entropy: $$L(\hat{y}, y) = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
For regression, L is often mean squared error: $$L(\hat{y}, y) = (\hat{y} - y)^2$$
The unsupervised term is where SSL methods innovate. Common formulations include entropy minimization, consistency regularization under input perturbations, pseudo-label (self-training) losses, and graph-based smoothness penalties.
The hyperparameter λ controls the relative importance of labeled vs. unlabeled data. Its optimal value depends on the label ratio r, the reliability of the unsupervised signal, and the stage of training.
The λ warmup schedule is crucial in practice. Training with large λ from the start can cause the model to converge to poor solutions before seeing enough labeled signal to calibrate predictions.
```python
import numpy as np

def linear_warmup(epoch: int, warmup_epochs: int = 10, max_lambda: float = 1.0) -> float:
    """Linear warmup schedule for the SSL λ hyperparameter."""
    return min(epoch / warmup_epochs, 1.0) * max_lambda

def sigmoid_warmup(epoch: int, midpoint: int = 10, steepness: float = 0.5,
                   max_lambda: float = 1.0) -> float:
    """Sigmoid warmup schedule - smoother transition."""
    return max_lambda / (1 + np.exp(-steepness * (epoch - midpoint)))

def step_warmup(epoch: int, step_epoch: int = 5, max_lambda: float = 1.0) -> float:
    """Step function - no unsupervised loss initially."""
    return max_lambda if epoch >= step_epoch else 0.0

# Example usage in training loop:
# for epoch in range(num_epochs):
#     lambda_u = sigmoid_warmup(epoch, midpoint=10, max_lambda=1.0)
#     loss = supervised_loss + lambda_u * unsupervised_loss
#     optimizer.step()
```

In practice, sigmoid warmup over 5-20% of total training typically works well. The key insight: early in training, model predictions are near-random and provide no useful signal. Warmup allows the model to learn basic patterns from labeled data before trusting its own predictions on unlabeled data.
A fundamental distinction in semi-supervised learning is between transductive and inductive settings. This distinction affects what guarantees we can provide and what methods are appropriate.
In the transductive setting, the unlabeled data D_U is known at training time, and our goal is specifically to predict labels for these samples. We don't need to generalize to unseen data.
Formal Definition:
Given D_L and D_U, transductive learning aims to output predictions {ŷ_{l+1}, ..., ŷ_{l+u}} for the unlabeled points, minimizing: $$\sum_{j=l+1}^{l+u} L(\hat{y}_j, y_j)$$
Key characteristics: the test points are known in advance, no out-of-sample extension is required, and methods may optimize the u predictions directly rather than learning a function over all of 𝒳.
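A classic transductive method is graph-based label propagation: build an affinity graph over all n points and iteratively spread the l known labels, clamping them at each step. The sketch below is a toy 1-D version; the Gaussian affinity, bandwidth, and iteration count are illustrative choices.

```python
import numpy as np

def label_propagation(X, y_l, n_iter=50, sigma=1.0):
    """Propagate the labels of X[:len(y_l)] to the remaining points.
    Transductive: the output is tied to these specific unlabeled points."""
    l, n = len(y_l), len(X)
    K = len(set(y_l))
    # Gaussian affinity between all pairs, row-normalized.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    W /= W.sum(axis=1, keepdims=True)
    # One-hot label scores; labeled rows are clamped every iteration.
    F = np.zeros((n, K))
    F[np.arange(l), y_l] = 1.0
    for _ in range(n_iter):
        F = W @ F
        F[:l] = 0.0
        F[np.arange(l), y_l] = 1.0
    return F[l:].argmax(axis=1)

# Two well-separated 1-D clusters, one labeled point in each.
X = np.array([[0.0], [10.0], [0.5], [1.0], [9.5], [9.0]])
y_l = np.array([0, 1])  # labels for X[0] and X[1]
preds = label_propagation(X, y_l)
assert preds.tolist() == [0, 0, 1, 1]
```

Note that the output is a set of predictions for these particular unlabeled points, not a reusable function.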
In the inductive setting, we aim to learn a function f that generalizes to any new sample from P(X), not just the unlabeled samples observed during training.
Formal Definition:
Given D_L and D_U, inductive learning aims to learn f: 𝒳 → 𝒴 that minimizes the expected risk: $$R(f) = \mathbb{E}_{(X,Y) \sim P}[L(f(X), Y)]$$
Key characteristics: a full function f is returned, new samples can be scored without retraining, and performance guarantees are stated in terms of expected risk under P.
An important insight is that any inductive learner can be used for transduction (just apply f to D_U), but not vice versa. This creates a hierarchy:
$$\text{Transductive Methods} \supset \text{Inductive Methods}$$
However, converting transductive methods to inductive ones is possible: for example, by training an inductive model on the transductive predictions treated as pseudo-labels, or by extending predictions to new points via nearest-neighbor interpolation.
Use Transductive when the full set of points to be labeled is available at training time and no future predictions are needed (e.g., labeling a fixed corpus).
Use Inductive when the model must serve predictions on new, unseen samples (e.g., a deployed classifier).
The transductive vs. inductive distinction was emphasized by Vladimir Vapnik, who argued that transduction solves a more specific problem than induction and should therefore be easier. His transductive SVM (TSVM) was an early influential semi-supervised method. Modern deep learning methods are inherently inductive but achieve strong performance by scaling to massive unlabeled datasets.
The basic semi-supervised setting admits several variations that arise in practice. Understanding these variants helps select appropriate methods for specific applications.
The setting we've described so far: single-label classification with 𝒴 = {1, ..., K}, a small labeled set, and a large unlabeled set drawn from the same distribution.
This is the most studied setting, with most benchmark comparisons focused here.
When 𝒴 = ℝ (or ℝᵈ), the problem becomes semi-supervised regression. Challenge: without discrete classes, concepts like 'confident predictions' are less clear. Threshold-based pseudo-labeling needs adaptation (e.g., based on prediction variance).
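One hedged adaptation is to use disagreement across an ensemble as the confidence signal, keeping only unlabeled targets whose prediction variance is small; the function name, the toy predictions, and the threshold below are illustrative assumptions.

```python
import numpy as np

def select_confident_targets(ensemble_preds, var_threshold=0.05):
    """ensemble_preds: (n_models, n_unlabeled) regression predictions on D_U.
    Keep the ensemble mean as a pseudo-target only where the
    across-model variance falls below the threshold."""
    mean = ensemble_preds.mean(axis=0)
    var = ensemble_preds.var(axis=0)
    mask = var < var_threshold
    return mask, mean[mask]

# Three hypothetical models agree on point 0 and disagree on point 1.
preds = np.array([[1.00, 0.2],
                  [1.01, 0.9],
                  [0.99, 1.6]])
mask, targets = select_confident_targets(preds)
assert mask.tolist() == [True, False]  # only the agreed-upon point survives
```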
When P(Y) is highly skewed: rare classes receive few or no labels, and naive pseudo-labeling tends to reinforce the majority classes.
Typical approach: Use class-balanced supervised loss plus distribution-aware pseudo-labeling:
$$\mathcal{L}_{sup} = \sum_{c=1}^{C} w_c \cdot \frac{1}{n_c} \sum_{i: y_i = c} L(f(x_i), y_i)$$
where w_c are class weights inversely proportional to frequency.
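A minimal sketch of computing such weights (rescaling so the mean weight is 1 is a common but assumed convention):

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """w_c proportional to 1/n_c, rescaled so the average weight is 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / np.maximum(counts, 1.0)  # guard against empty classes
    return w * n_classes / w.sum()

# Skewed labels: class 0 appears 8 times, class 1 only twice.
y = np.array([0] * 8 + [1] * 2)
w = inverse_frequency_weights(y, n_classes=2)
assert w[1] > w[0]  # the rare class is up-weighted
```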
When each sample can have multiple labels (Y ⊆ {1, ..., K}): softmax-based confidence measures no longer apply directly, and per-class thresholds replace a single argmax decision.
When outputs have structure (sequences, trees, graphs), predictions must be consistent as whole objects, not merely per position. Challenge: consistency regularization must respect output structure. A prediction of 'I-PER' following 'O' is inconsistent in BIO tagging; the model should enforce such constraints.
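Such constraints can be checked mechanically. Below is a minimal BIO-validity check that could be used to filter structurally inconsistent pseudo-labeled sequences (the function name and the filtering use are hypothetical):

```python
def is_valid_bio(tags):
    """A BIO sequence is valid iff every I-X tag directly follows
    a B-X or I-X tag with the same entity type X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in (f"B-{etype}", f"I-{etype}"):
                return False
        prev = tag
    return True

assert is_valid_bio(["O", "B-PER", "I-PER", "O"])
assert not is_valid_bio(["O", "I-PER"])      # I- tag without a preceding B-
assert not is_valid_bio(["B-ORG", "I-PER"])  # entity type switches mid-span
```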
| Variant | Label Space | Key Challenge | Common Methods |
|---|---|---|---|
| Classification | {1,...,K} | Class imbalance, confidence calibration | FixMatch, MixMatch, UDA |
| Regression | ℝ or ℝᵈ | Uncertainty estimation | Mean Teacher + variance |
| Multi-Label | Subsets of {1,...,K} | Label dependencies | Pseudo-labeling with thresholds |
| Sequence | Sequences Y* | Structural consistency | Self-training, CRF regularization |
| Segmentation | Pixel labels | Spatial consistency | CutMix, spatial pseudo-labels |
| Metric Learning | Similarity structure | Pair/triplet construction | Contrastive learning |
Several orthogonal extensions combine with semi-supervised learning: (1) Domain adaptation when unlabeled data comes from a different distribution, (2) Open-set when unlabeled data may contain unseen classes, (3) Long-tailed when classes have extreme frequency imbalance, and (4) Noisy labels when the few available labels may be incorrect.
The theory of semi-supervised learning addresses fundamental questions: When can unlabeled data help? How much can it help? What assumptions do we need?
We mentioned earlier that without assumptions, unlabeled data cannot help. Let's make this precise.
Theorem (Ben-David et al., 2008): For any learning algorithm A, there exist two distributions P₁ and P₂ with identical marginals P(X) but different conditionals P(Y|X); since A cannot distinguish them from unlabeled data, it must incur high error on at least one of them.
Implication: Since we can only observe P(X) from unlabeled data, and different P(Y|X) can share the same P(X), unlabeled data alone cannot determine the correct labeling function.
From a sample complexity view, the question becomes: how many labeled samples are needed with vs. without unlabeled data?
Let m_SL(ε, δ) be the labeled sample complexity of supervised learning to achieve error ≤ε with probability ≥1-δ.
Let m_SSL(ε, δ, u) be the labeled sample complexity of semi-supervised learning with u unlabeled samples.
Goal: Show that m_SSL(ε, δ, u) << m_SL(ε, δ) when assumptions hold.
Key Result (Singh et al., 2008): Under the cluster assumption with well-separated clusters, semi-supervised learning can achieve: $$m_{SSL} = O\left(\frac{1}{\epsilon} \log \frac{1}{\delta}\right)$$
compared to supervised learning's: $$m_{SL} = O\left(\frac{d}{\epsilon^2} \log \frac{1}{\delta}\right)$$
where d is the input dimension. This represents a potentially exponential improvement in labeled sample complexity.
Theoretically, unlabeled data can help in three ways:
1. Hypothesis Space Reduction
Unlabeled data, under appropriate assumptions, can eliminate hypotheses inconsistent with the structure of P(X). Under the cluster assumption, for example, decision boundaries that cut through high-density regions of P(X) can be discarded.
2. Regularization
Even without reducing the hypothesis space, unlabeled data can provide implicit regularization that reduces variance: consistency penalties discourage functions that change sharply near observed unlabeled points.
3. Representation Learning
In deep learning, unlabeled data helps learn better representations: features shaped by the unlabeled distribution place semantically similar inputs close together, so fewer labels suffice to separate the classes.
Theory also tells us when semi-supervised learning can degrade performance: (1) When assumptions don't hold, unlabeled data provides misleading signal. (2) When the model class is already well-matched to the labeled data. (3) When pseudo-labels from early training lock in errors. Understanding these failure modes is as important as understanding success cases.
Moving from theory to practice, semi-supervised learning involves numerous implementation decisions. Here we address common practical questions.
A fundamental challenge in semi-supervised learning is hyperparameter selection when labeled data is scarce. Holding out validation data from an already-small labeled set further reduces training data.
Strategies:
Use all labels for training, validate on pseudo-labels: Monitor consistency loss on unlabeled data as a proxy (imperfect but practical)
Cross-validation on labels: K-fold cross-validation uses all labeled data for both training and validation, averaging over folds
Default hyperparameters: Well-tuned defaults from published work transfer surprisingly well (e.g., FixMatch's τ=0.95, λ_u=1.0)
Sensitivity analysis: Test a few key hyperparameters on a small subset before full experiment
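The cross-validation strategy above can be sketched in a few lines; the dummy scoring function is a placeholder assumption, standing in for training the full SSL model on each fold and evaluating on the held-out labels.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n labeled indices and split them into k folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(score_fn, n_labels, k=5):
    """Average score_fn(train_idx, val_idx) over k folds, so every
    label is used for both training and validation."""
    folds = kfold_indices(n_labels, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(score_fn(train, val))
    return float(np.mean(scores))

# Dummy scorer just checks the plumbing: each fold sees all 20 labels.
dummy = cross_validate(lambda tr, va: len(tr) + len(va), n_labels=20, k=5)
assert dummy == 20.0
```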
SSL training can fail in subtle ways. Common debugging approaches include comparing against a supervised-only baseline, monitoring pseudo-label accuracy against held-out labels, and tracking the ratio of supervised to unsupervised loss during training.
Before trying sophisticated SSL methods, verify that: (1) Your supervised baseline works correctly, (2) Your augmentations don't destroy label information, (3) Pseudo-labels from a trained supervised model have reasonable accuracy. If these checks fail, SSL will likely underperform.
We have established the formal mathematical framework for semi-supervised learning. The key concepts: SSL observes l labeled and u unlabeled samples with l << u; unlabeled data helps only under assumptions linking P(X) and P(Y|X); SSL objectives combine a supervised loss with a λ-weighted unsupervised regularizer; and transductive methods predict only on D_U, while inductive methods learn a function over all of 𝒳.
What's Next:
With the formal framework established, the next page examines the core assumptions that enable semi-supervised learning. We'll study the smoothness assumption, cluster assumption, low-density separation, and manifold assumption in detail—understanding when they hold, how to test them, and which methods exploit each assumption.
You now understand the mathematical framework of semi-supervised learning: the notation, objective functions, transductive vs. inductive settings, problem variants, and theoretical foundations. This formal grounding is essential for understanding why specific methods work and when they are appropriate.