Having established the practical importance of label scarcity in the previous page, we now shift to a rigorous mathematical treatment of the semi-supervised learning problem. This formalization is not mere academic exercise—it provides the precise language needed to state assumptions, prove guarantees, and understand when semi-supervised methods can and cannot help.
The core question we address: Under what conditions can unlabeled data improve learning, and how do we formalize this mathematically?
This page provides a complete formal treatment of the semi-supervised setting. You will understand: (1) The mathematical notation and data model, (2) How SSL differs from supervised and unsupervised learning, (3) The learning objectives in SSL, (4) The distinction between inductive and transductive settings, and (5) The fundamental limits and possibilities of learning from unlabeled data.
We begin by establishing precise notation that will be used throughout this chapter and the field more broadly.
Assume data is drawn from an underlying joint distribution P(X, Y) over a feature space 𝒳 and label space 𝒴. This joint distribution factors as:
$$P(X, Y) = P(Y | X) \cdot P(X) = P(X | Y) \cdot P(Y)$$
In supervised learning, we observe samples from P(X, Y) directly. In semi-supervised learning, we observe only a small set of labeled pairs together with many samples from the marginal P(X). Let us define the semi-supervised learning problem precisely:
Given: (1) A labeled dataset D_L = {(x₁, y₁), ..., (x_l, y_l)} where (x_i, y_i) ~ P(X, Y), (2) An unlabeled dataset D_U = {x_{l+1}, ..., x_{l+u}} where x_j ~ P(X), and (3) A hypothesis class ℋ of functions h: 𝒳 → 𝒴.
Find: A function f ∈ ℋ that minimizes the expected risk: R(f) = 𝔼_{(X,Y)~P}[L(f(X), Y)] where L is a loss function.
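A minimal sketch of this observation model (the two-Gaussian synthetic distribution and all variable names are illustrative assumptions): draw n pairs from P(X, Y), then reveal labels for only the first l.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic joint distribution P(X, Y): two Gaussian classes in R^2.
n, l = 1000, 20                     # total samples, labeled samples
y_all = rng.integers(0, 2, size=n)  # Y ~ P(Y), uniform over {0, 1}
X_all = rng.normal(loc=y_all[:, None] * 3.0, scale=1.0, size=(n, 2))  # X | Y

# Semi-supervised observation: l labeled pairs, u = n - l unlabeled points.
D_L = (X_all[:l], y_all[:l])        # {(x_i, y_i)} ~ P(X, Y)
D_U = X_all[l:]                     # {x_j} ~ P(X), labels hidden
u = len(D_U)
```

The learner sees only D_L and D_U; the hidden labels y_all[l:] exist solely for evaluation.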
We use the following notation throughout:
| Symbol | Meaning | Typical Values |
|---|---|---|
| n = l + u | Total number of samples | 10³ - 10⁹ |
| l | Number of labeled samples | 10¹ - 10⁶ |
| u | Number of unlabeled samples | 10³ - 10⁹ |
| r = l/n | Label ratio | 0.001 - 0.1 |
| 𝒳 | Feature space | ℝᵈ, images, text, etc. |
| 𝒴 | Label space | {0,1}, {1,...,K}, ℝ |
| P(X,Y) | Joint distribution | Unknown, to be learned |
| P(X) | Marginal distribution | Observable from D_U |
| P(Y|X) | Conditional (target) | To be learned |
| ℋ | Hypothesis class | Neural networks, etc. |
| L(ŷ, y) | Loss function | Cross-entropy, MSE, etc. |
The core challenge in SSL is that we observe P(X) well (through many unlabeled samples) but P(Y|X) poorly (through few labeled samples). The question is whether knowledge of P(X) can help in learning P(Y|X).
In full generality, the answer is no—knowing the marginal P(X) tells us nothing about the conditional P(Y|X). This is the famous semi-supervised learning impossibility result:
Theorem (Informal): Without additional assumptions linking P(X) and P(Y|X), no consistent estimator can benefit from unlabeled data for estimating P(Y|X).
This theorem is both discouraging and clarifying: it tells us that assumptions are necessary. The entire field of semi-supervised learning is fundamentally about identifying and exploiting assumptions that link the marginal and conditional distributions.
Every semi-supervised algorithm, implicitly or explicitly, makes assumptions about how P(X) relates to P(Y|X). Without such assumptions, unlabeled data provides zero information about the classification problem. Understanding these assumptions—and verifying they hold for your data—is critical for successful SSL deployment.
Semi-supervised learning occupies a specific position in the landscape of machine learning paradigms. Understanding its relationship to neighboring settings clarifies what SSL is and isn't.
| Paradigm | Labeled Data | Unlabeled Data | Goal | Key Challenge |
|---|---|---|---|---|
| Supervised Learning | All | None | Learn P(Y|X) | Label cost, overfitting |
| Semi-Supervised Learning | Few | Many | Learn P(Y|X) | Leveraging P(X) |
| Unsupervised Learning | None | All | Learn P(X) structure | No task guidance |
| Self-Supervised Learning | None (pseudo) | All | Learn representations | Pretext design |
| Weakly-Supervised | Noisy/Partial | Varies | Learn P(Y|X) | Label noise |
| Active Learning | Interactive | Pool | Learn P(Y|X) | Query selection |
| Transfer Learning | Different task | Target task | Adapt knowledge | Domain gap |
In supervised learning, all n samples are labeled. The standard empirical risk minimization (ERM) objective is:
$$\hat{f}_{SL} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$$
In semi-supervised learning, we only have l << n labels. The naive approach—ignoring unlabeled data—yields:
$$\hat{f}_{\text{naive}} = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} L(f(x_i), y_i)$$
This suffers from high variance due to small sample size. SSL methods add regularization terms derived from unlabeled data:
$$\hat{f}_{SSL} = \arg\min_{f \in \mathcal{H}} \underbrace{\frac{1}{l} \sum_{i=1}^{l} L(f(x_i), y_i)}_{\text{supervised loss}} + \lambda \cdot \underbrace{R(f; D_U)}_{\text{unlabeled regularization}}$$
where R(f; D_U) is a regularization term computed over unlabeled data.
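As a concrete (and deliberately toy) instance, the sketch below uses entropy minimization over unlabeled predictions as R(f; D_U); the choice of regularizer, the logits, and λ are illustrative assumptions, not a specific published method.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ssl_objective(logits_l, y_l, logits_u, lam=1.0):
    """Supervised cross-entropy plus lambda * R(f; D_U), with R taken
    to be the mean entropy of the unlabeled predictions."""
    p_l = softmax(logits_l)
    sup = -np.mean(np.log(p_l[np.arange(len(y_l)), y_l] + 1e-12))
    p_u = softmax(logits_u)
    unsup = -np.mean(np.sum(p_u * np.log(p_u + 1e-12), axis=1))
    return sup + lam * unsup

# Confident (low-entropy) unlabeled predictions lower the objective:
logits_l = np.array([[2.0, -1.0], [-1.5, 1.0]])
y_l = np.array([0, 1])
sharp = np.array([[4.0, -4.0]])  # confident unlabeled prediction
flat = np.array([[0.0, 0.0]])    # maximally uncertain prediction
assert ssl_objective(logits_l, y_l, sharp) < ssl_objective(logits_l, y_l, flat)
```

Setting lam=0 recovers the naive supervised-only estimator.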
Unsupervised learning seeks to understand P(X) without any labels—clustering, density estimation, dimensionality reduction. SSL uses unsupervised learning as a means to an end: we care about P(X) only insofar as it helps predict Y.
The key distinction: unsupervised learning treats the structure of P(X) as the goal in itself, whereas SSL exploits that structure only as evidence about P(Y|X). This matters practically. A beautiful cluster structure in X is useless for SSL if these clusters don't correspond to the label structure; SSL methods must align unsupervised structure with the supervised task.
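When a few labels are available, this alignment can be checked directly by measuring the majority-label purity of each cluster over the labeled points (a diagnostic sketch; the helper name is hypothetical, and cluster assignments could come from any clustering algorithm):

```python
from collections import Counter

def cluster_label_purity(cluster_ids, labels):
    """Fraction of labeled points whose label matches the majority label
    of their cluster: near 1.0 means clusters align with classes,
    near 1/K means the cluster structure is useless for the task."""
    total = 0
    for c in set(cluster_ids):
        members = [y for k, y in zip(cluster_ids, labels) if k == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)

aligned = cluster_label_purity([0, 0, 1, 1], [0, 0, 1, 1])  # clusters = classes
mixed = cluster_label_purity([0, 0, 1, 1], [0, 1, 0, 1])    # clusters mix classes
assert aligned == 1.0 and mixed == 0.5
```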
Self-supervised learning creates pseudo-labels from the data itself through pretext tasks (e.g., predicting masked words, image rotation, next-frame prediction). The learned representations are then fine-tuned on downstream tasks.
The relationship is subtle: self-supervised learning uses no true labels at all and optimizes a pretext objective, whereas semi-supervised learning optimizes the target task directly using both labeled and unlabeled data. Modern approaches often combine both: self-supervised pretraining on a large unlabeled corpus, followed by supervised or semi-supervised fine-tuning on the small labeled set.
This hybrid approach (exemplified by BERT, GPT, etc.) has proven extraordinarily effective.
The field uses 'SSL' to abbreviate both semi-supervised and self-supervised learning. Context usually clarifies, but be attentive. In this chapter, we use 'semi-supervised' explicitly to avoid ambiguity.
Semi-supervised learning methods can be understood through a unified objective function framework. While specific methods differ in their regularization approaches, they share a common structure.
The prototypical semi-supervised objective takes the form:
$$\mathcal{L}_{SSL}(f; D_L, D_U) = \underbrace{\mathcal{L}_{sup}(f; D_L)}_{\text{Supervised Term}} + \lambda \cdot \underbrace{\mathcal{L}_{unsup}(f; D_U)}_{\text{Unsupervised Term}}$$
where ℒ_sup is the supervised loss computed on D_L, ℒ_unsup is an unsupervised regularization term computed on D_U, and λ ≥ 0 balances the two.
The supervised term is standard classification/regression loss:
$$\mathcal{L}_{sup}(f; D_L) = \frac{1}{l} \sum_{i=1}^{l} L(f(x_i), y_i)$$
For classification, L is typically cross-entropy: $$L(\hat{y}, y) = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
For regression, L is often mean squared error: $$L(\hat{y}, y) = (\hat{y} - y)^2$$
The unsupervised term is where SSL methods innovate. Common formulations include entropy minimization, consistency regularization under input perturbations, pseudo-label (self-training) losses, and graph-based smoothness penalties.
The hyperparameter λ controls the relative importance of labeled vs. unlabeled data. Its optimal value depends on the label ratio r, the reliability of the unsupervised signal, and the stage of training.
The λ warmup schedule is crucial in practice. Training with large λ from the start can cause the model to converge to poor solutions before seeing enough labeled signal to calibrate predictions.
```python
import numpy as np

def linear_warmup(epoch: int, warmup_epochs: int = 10, max_lambda: float = 1.0) -> float:
    """Linear warmup schedule for the SSL λ hyperparameter."""
    return min(epoch / warmup_epochs, 1.0) * max_lambda

def sigmoid_warmup(epoch: int, midpoint: int = 10, steepness: float = 0.5,
                   max_lambda: float = 1.0) -> float:
    """Sigmoid warmup schedule - smoother transition."""
    return max_lambda / (1 + np.exp(-steepness * (epoch - midpoint)))

def step_warmup(epoch: int, step_epoch: int = 5, max_lambda: float = 1.0) -> float:
    """Step function - no unsupervised loss initially."""
    return max_lambda if epoch >= step_epoch else 0.0

# Example usage in training loop:
# for epoch in range(num_epochs):
#     lambda_u = sigmoid_warmup(epoch, midpoint=10, max_lambda=1.0)
#     loss = supervised_loss + lambda_u * unsupervised_loss
#     optimizer.step()
```

In practice, sigmoid warmup over 5-20% of total training typically works well. The key insight: early in training, model predictions are near-random and provide no useful signal. Warmup allows the model to learn basic patterns from labeled data before trusting its own predictions on unlabeled data.
A fundamental distinction in semi-supervised learning is between transductive and inductive settings. This distinction affects what guarantees we can provide and what methods are appropriate.
In the transductive setting, the unlabeled data D_U is known at training time, and our goal is specifically to predict labels for these samples. We don't need to generalize to unseen data.
Formal Definition:
Given D_L and D_U, transductive learning aims to output predictions {ŷ_{l+1}, ..., ŷ_{l+u}} for the unlabeled points, minimizing: $$\sum_{j=l+1}^{l+u} L(\hat{y}_j, y_j)$$
Key characteristics: the test points are known in advance, no out-of-sample extension is required, and methods may optimize the u predictions directly rather than learning a function over all of 𝒳.
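A classic transductive method is graph-based label propagation: build an affinity graph over all n points and iteratively spread the l known labels, clamping them at each step. The sketch below is a toy 1-D version; the Gaussian affinity, bandwidth, and iteration count are illustrative choices.

```python
import numpy as np

def label_propagation(X, y_l, n_iter=50, sigma=1.0):
    """Propagate the labels of X[:len(y_l)] to the remaining points.
    Transductive: the output is tied to these specific unlabeled points."""
    l, n = len(y_l), len(X)
    K = len(set(y_l))
    # Gaussian affinity between all pairs, row-normalized.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    W /= W.sum(axis=1, keepdims=True)
    # One-hot label scores; labeled rows are clamped every iteration.
    F = np.zeros((n, K))
    F[np.arange(l), y_l] = 1.0
    for _ in range(n_iter):
        F = W @ F
        F[:l] = 0.0
        F[np.arange(l), y_l] = 1.0
    return F[l:].argmax(axis=1)

# Two well-separated 1-D clusters, one labeled point in each.
X = np.array([[0.0], [10.0], [0.5], [1.0], [9.5], [9.0]])
y_l = np.array([0, 1])  # labels for X[0] and X[1]
preds = label_propagation(X, y_l)
assert preds.tolist() == [0, 0, 1, 1]
```

Note that the output is a set of predictions for these particular unlabeled points, not a reusable function.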
In the inductive setting, we aim to learn a function f that generalizes to any new sample from P(X), not just the unlabeled samples observed during training.
Formal Definition:
Given D_L and D_U, inductive learning aims to learn f: 𝒳 → 𝒴 that minimizes the expected risk: $$R(f) = \mathbb{E}_{(X,Y) \sim P}[L(f(X), Y)]$$
Key characteristics: a full function f is returned, new samples can be scored without retraining, and performance guarantees are stated in terms of expected risk under P.
An important insight is that any inductive learner can be used for transduction (just apply f to D_U), but not vice versa. This creates a hierarchy:
$$\text{Transductive Methods} \supset \text{Inductive Methods}$$
However, converting transductive methods to inductive ones is possible: for example, by training an inductive model on the transductive predictions treated as pseudo-labels, or by extending predictions to new points via nearest-neighbor interpolation.
Use Transductive when the full set of points to be labeled is available at training time and no future predictions are needed (e.g., labeling a fixed corpus).
Use Inductive when the model must serve predictions on new, unseen samples (e.g., a deployed classifier).
The transductive vs. inductive distinction was emphasized by Vladimir Vapnik, who argued that transduction solves a more specific problem than induction and should therefore be easier. His transductive SVM (TSVM) was an early influential semi-supervised method. Modern deep learning methods are inherently inductive but achieve strong performance by scaling to massive unlabeled datasets.
The basic semi-supervised setting admits several variations that arise in practice. Understanding these variants helps select appropriate methods for specific applications.
The setting we've described so far: single-label classification with 𝒴 = {1, ..., K}, a small labeled set, and a large unlabeled set drawn from the same distribution.
This is the most studied setting, with most benchmark comparisons focused here.
When 𝒴 = ℝ (or ℝᵈ), the problem becomes semi-supervised regression. Challenge: without discrete classes, concepts like 'confident predictions' are less clear. Threshold-based pseudo-labeling needs adaptation (e.g., based on prediction variance).
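One hedged adaptation is to use disagreement across an ensemble as the confidence signal, keeping only unlabeled targets whose prediction variance is small; the function name, the toy predictions, and the threshold below are illustrative assumptions.

```python
import numpy as np

def select_confident_targets(ensemble_preds, var_threshold=0.05):
    """ensemble_preds: (n_models, n_unlabeled) regression predictions on D_U.
    Keep the ensemble mean as a pseudo-target only where the
    across-model variance falls below the threshold."""
    mean = ensemble_preds.mean(axis=0)
    var = ensemble_preds.var(axis=0)
    mask = var < var_threshold
    return mask, mean[mask]

# Three hypothetical models agree on point 0 and disagree on point 1.
preds = np.array([[1.00, 0.2],
                  [1.01, 0.9],
                  [0.99, 1.6]])
mask, targets = select_confident_targets(preds)
assert mask.tolist() == [True, False]  # only the agreed-upon point survives
```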
When P(Y) is highly skewed: rare classes receive few or no labels, and naive pseudo-labeling tends to reinforce the majority classes.
Typical approach: Use class-balanced supervised loss plus distribution-aware pseudo-labeling:
$$\mathcal{L}_{sup} = \sum_{c=1}^{C} w_c \cdot \frac{1}{n_c} \sum_{i: y_i = c} L(f(x_i), y_i)$$
where w_c are class weights inversely proportional to frequency.
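A minimal sketch of computing such weights (rescaling so the mean weight is 1 is a common but assumed convention):

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """w_c proportional to 1/n_c, rescaled so the average weight is 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / np.maximum(counts, 1.0)  # guard against empty classes
    return w * n_classes / w.sum()

# Skewed labels: class 0 appears 8 times, class 1 only twice.
y = np.array([0] * 8 + [1] * 2)
w = inverse_frequency_weights(y, n_classes=2)
assert w[1] > w[0]  # the rare class is up-weighted
```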
When each sample can have multiple labels (Y ⊆ {1, ..., K}): softmax-based confidence measures no longer apply directly, and per-class thresholds replace a single argmax decision.
When outputs have structure (sequences, trees, graphs), predictions must be consistent as whole objects, not merely per position. Challenge: consistency regularization must respect output structure. A prediction of 'I-PER' following 'O' is inconsistent in BIO tagging; the model should enforce such constraints.
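Such constraints can be checked mechanically. Below is a minimal BIO-validity check that could be used to filter structurally inconsistent pseudo-labeled sequences (the function name and the filtering use are hypothetical):

```python
def is_valid_bio(tags):
    """A BIO sequence is valid iff every I-X tag directly follows
    a B-X or I-X tag with the same entity type X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in (f"B-{etype}", f"I-{etype}"):
                return False
        prev = tag
    return True

assert is_valid_bio(["O", "B-PER", "I-PER", "O"])
assert not is_valid_bio(["O", "I-PER"])      # I- tag without a preceding B-
assert not is_valid_bio(["B-ORG", "I-PER"])  # entity type switches mid-span
```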
| Variant | Label Space | Key Challenge | Common Methods |
|---|---|---|---|
| Classification | {1,...,K} | Class imbalance, confidence calibration | FixMatch, MixMatch, UDA |
| Regression | ℝ or ℝᵈ | Uncertainty estimation | Mean Teacher + variance |
| Multi-Label | Subsets of {1,...,K} | Label dependencies | Pseudo-labeling with thresholds |
| Sequence | Sequences Y* | Structural consistency | Self-training, CRF regularization |
| Segmentation | Pixel labels | Spatial consistency | CutMix, spatial pseudo-labels |
| Metric Learning | Similarity structure | Pair/triplet construction | Contrastive learning |
Several orthogonal extensions combine with semi-supervised learning: (1) Domain adaptation when unlabeled data comes from a different distribution, (2) Open-set when unlabeled data may contain unseen classes, (3) Long-tailed when classes have extreme frequency imbalance, and (4) Noisy labels when the few available labels may be incorrect.
The theory of semi-supervised learning addresses fundamental questions: When can unlabeled data help? How much can it help? What assumptions do we need?
We mentioned earlier that without assumptions, unlabeled data cannot help. Let's make this precise.
Theorem (Ben-David et al., 2008): For any learning algorithm A, there exist two distributions P₁ and P₂ with identical marginals P(X) but different conditionals P(Y|X); since A cannot distinguish them from unlabeled data, it must incur high error on at least one of them.
Implication: Since we can only observe P(X) from unlabeled data, and different P(Y|X) can share the same P(X), unlabeled data alone cannot determine the correct labeling function.
From a sample complexity view, the question becomes: how many labeled samples are needed with vs. without unlabeled data?
Let m_SL(ε, δ) be the labeled sample complexity of supervised learning to achieve error ≤ε with probability ≥1-δ.
Let m_SSL(ε, δ, u) be the labeled sample complexity of semi-supervised learning with u unlabeled samples.
Goal: Show that m_SSL(ε, δ, u) << m_SL(ε, δ) when assumptions hold.
Key Result (Singh et al., 2008): Under the cluster assumption with well-separated clusters, semi-supervised learning can achieve: $$m_{SSL} = O\left(\frac{1}{\epsilon} \log \frac{1}{\delta}\right)$$
compared to supervised learning's: $$m_{SL} = O\left(\frac{d}{\epsilon^2} \log \frac{1}{\delta}\right)$$
where d is the input dimension. This represents a potentially exponential improvement in labeled sample complexity.
Theoretically, unlabeled data can help in three ways:
1. Hypothesis Space Reduction
Unlabeled data, under appropriate assumptions, can eliminate hypotheses inconsistent with the structure of P(X). Under the cluster assumption, for example, decision boundaries that cut through high-density regions of P(X) can be discarded.
2. Regularization
Even without reducing the hypothesis space, unlabeled data can provide implicit regularization that reduces variance: consistency penalties discourage functions that change sharply near observed unlabeled points.
3. Representation Learning
In deep learning, unlabeled data helps learn better representations: features shaped by the unlabeled distribution place semantically similar inputs close together, so fewer labels suffice to separate the classes.
Theory also tells us when semi-supervised learning can degrade performance: (1) When assumptions don't hold, unlabeled data provides misleading signal. (2) When the model class is already well-matched to the labeled data. (3) When pseudo-labels from early training lock in errors. Understanding these failure modes is as important as understanding success cases.
Moving from theory to practice, semi-supervised learning involves numerous implementation decisions. Here we address common practical questions.
A fundamental challenge in semi-supervised learning is hyperparameter selection when labeled data is scarce. Holding out validation data from an already-small labeled set further reduces training data.
Strategies:
Use all labels for training, validate on pseudo-labels: Monitor consistency loss on unlabeled data as a proxy (imperfect but practical)
Cross-validation on labels: K-fold cross-validation uses all labeled data for both training and validation, averaging over folds
Default hyperparameters: Well-tuned defaults from published work transfer surprisingly well (e.g., FixMatch's τ=0.95, λ_u=1.0)
Sensitivity analysis: Test a few key hyperparameters on a small subset before full experiment
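The cross-validation strategy above can be sketched in a few lines; the dummy scoring function is a placeholder assumption, standing in for training the full SSL model on each fold and evaluating on the held-out labels.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n labeled indices and split them into k folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(score_fn, n_labels, k=5):
    """Average score_fn(train_idx, val_idx) over k folds, so every
    label is used for both training and validation."""
    folds = kfold_indices(n_labels, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(score_fn(train, val))
    return float(np.mean(scores))

# Dummy scorer just checks the plumbing: each fold sees all 20 labels.
dummy = cross_validate(lambda tr, va: len(tr) + len(va), n_labels=20, k=5)
assert dummy == 20.0
```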
SSL training can fail in subtle ways. Common debugging approaches include comparing against a supervised-only baseline, monitoring pseudo-label accuracy against held-out labels, and tracking the ratio of supervised to unsupervised loss during training.
Before trying sophisticated SSL methods, verify that: (1) Your supervised baseline works correctly, (2) Your augmentations don't destroy label information, (3) Pseudo-labels from a trained supervised model have reasonable accuracy. If these checks fail, SSL will likely underperform.
We have established the formal mathematical framework for semi-supervised learning. The key concepts: SSL observes l labeled and u unlabeled samples with l << u; unlabeled data helps only under assumptions linking P(X) and P(Y|X); SSL objectives combine a supervised loss with a λ-weighted unsupervised regularizer; and transductive methods predict only on D_U, while inductive methods learn a function over all of 𝒳.
What's Next:
With the formal framework established, the next page examines the core assumptions that enable semi-supervised learning. We'll study the smoothness assumption, cluster assumption, low-density separation, and manifold assumption in detail—understanding when they hold, how to test them, and which methods exploit each assumption.
You now understand the mathematical framework of semi-supervised learning: the notation, objective functions, transductive vs. inductive settings, problem variants, and theoretical foundations. This formal grounding is essential for understanding why specific methods work and when they are appropriate.