Pseudo-labeling is one of the oldest and most intuitive techniques in semi-supervised learning. The idea is simple: use the model's own predictions on unlabeled data as training labels, effectively "labeling" the unlabeled data with the model's best guesses.
Despite its simplicity—or perhaps because of it—pseudo-labeling remains a core component of modern methods like FixMatch and UDA. Understanding pseudo-labeling in depth provides crucial insight into why modern methods work, when they fail, and how to diagnose issues in semi-supervised training.
By the end of this page, you will understand: the core pseudo-labeling algorithm and its variants, the theoretical basis for why pseudo-labeling works, confidence thresholding and its critical role, failure modes including confirmation bias and class collapse, advanced techniques like curriculum pseudo-labeling and self-training, and how modern methods like FixMatch extend basic pseudo-labeling.
At its core, pseudo-labeling is remarkably straightforward:
Basic Pseudo-labeling Algorithm:

1. Train the model on the labeled data.
2. Predict labels for the unlabeled data.
3. Keep predictions whose confidence exceeds a threshold $\tau$ as pseudo-labels.
4. Train on the labeled data plus the pseudo-labeled data.
5. Repeat from step 2 as the model improves.
This can be expressed mathematically. Given:

- a labeled set $\{(x_i, y_i)\}_{i=1}^{n_l}$,
- an unlabeled set $\{u_j\}_{j=1}^{n_u}$,
- a classifier $f_\theta$ that outputs class probabilities,
- a confidence threshold $\tau$ and an unlabeled-loss weight $\lambda$,

the pseudo-labeling loss is:
$$\mathcal{L} = \underbrace{\frac{1}{n_l} \sum_{i=1}^{n_l} \ell(f_\theta(x_i), y_i)}_{\text{Supervised loss}} + \lambda \underbrace{\frac{1}{n_u} \sum_{j=1}^{n_u} \mathbf{1}[\max(f_\theta(u_j)) \geq \tau] \cdot \ell(f_\theta(u_j), \hat{y}_j)}_{\text{Pseudo-label loss}}$$

where $\hat{y}_j = \arg\max f_\theta(u_j)$ is the pseudo-label and $\tau$ is the confidence threshold.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Dict


def pseudo_label_loss(
    model: nn.Module,
    unlabeled: torch.Tensor,
    threshold: float = 0.95,
    hard_labels: bool = True,
) -> Tuple[torch.Tensor, Dict[str, float]]:
    """
    Basic pseudo-labeling loss computation.

    Args:
        model: The neural network classifier
        unlabeled: Batch of unlabeled data [B, C, H, W]
        threshold: Confidence threshold for including samples
        hard_labels: If True, use argmax; if False, use soft labels

    Returns:
        (loss, metrics_dict)
    """
    # Get model predictions
    with torch.no_grad():
        logits = model(unlabeled)
        probs = F.softmax(logits, dim=-1)

    # Compute confidence (max probability)
    max_probs, pseudo_labels = probs.max(dim=-1)

    # Create mask for high-confidence predictions
    mask = max_probs >= threshold

    # If no samples exceed threshold, return zero loss
    if mask.sum() == 0:
        return torch.tensor(0.0, device=unlabeled.device), {
            "mask_ratio": 0.0,
            "avg_confidence": max_probs.mean().item(),
            "loss": 0.0,
        }

    # Get predictions for masked samples (with gradient)
    logits_masked = model(unlabeled[mask])

    if hard_labels:
        # Hard pseudo-labels: cross-entropy with argmax
        loss = F.cross_entropy(logits_masked, pseudo_labels[mask])
    else:
        # Soft pseudo-labels: cross-entropy with full distribution
        soft_targets = probs[mask]
        log_probs = F.log_softmax(logits_masked, dim=-1)
        loss = -(soft_targets * log_probs).sum(dim=-1).mean()

    metrics = {
        "mask_ratio": mask.float().mean().item(),
        "avg_confidence": max_probs.mean().item(),
        "avg_confidence_masked": max_probs[mask].mean().item(),
        "loss": loss.item(),
    }
    return loss, metrics


class PseudoLabelTrainer:
    """
    Complete pseudo-labeling training loop.

    Handles the full training process with pseudo-labels,
    including curriculum threshold scheduling.
    """

    def __init__(
        self,
        model: nn.Module,
        optimizer: torch.optim.Optimizer,
        threshold: float = 0.95,
        lambda_u: float = 1.0,
        threshold_warmup: int = 0,   # Steps to linearly increase threshold
        min_threshold: float = 0.5,  # Starting threshold during warmup
    ):
        self.model = model
        self.optimizer = optimizer
        self.base_threshold = threshold
        self.lambda_u = lambda_u
        self.threshold_warmup = threshold_warmup
        self.min_threshold = min_threshold
        self.step = 0

    def get_threshold(self) -> float:
        """Get current threshold with optional warmup."""
        if self.threshold_warmup <= 0 or self.step >= self.threshold_warmup:
            return self.base_threshold
        # Linear ramp from min_threshold to base_threshold
        progress = self.step / self.threshold_warmup
        return self.min_threshold + progress * (self.base_threshold - self.min_threshold)

    def train_step(
        self,
        x_labeled: torch.Tensor,
        y_labeled: torch.Tensor,
        x_unlabeled: torch.Tensor,
    ) -> Dict[str, float]:
        """
        Single training step with pseudo-labeling.
        """
        self.model.train()
        self.optimizer.zero_grad()

        # Supervised loss
        logits_l = self.model(x_labeled)
        loss_sup = F.cross_entropy(logits_l, y_labeled)

        # Pseudo-label loss
        threshold = self.get_threshold()
        loss_pseudo, pseudo_metrics = pseudo_label_loss(
            self.model, x_unlabeled, threshold
        )

        # Combined loss
        loss_total = loss_sup + self.lambda_u * loss_pseudo

        loss_total.backward()
        self.optimizer.step()
        self.step += 1

        return {
            "loss_sup": loss_sup.item(),
            "loss_pseudo": loss_pseudo.item() if isinstance(loss_pseudo, torch.Tensor) else loss_pseudo,
            "loss_total": loss_total.item(),
            "threshold": threshold,
            **pseudo_metrics,
        }
```

Pseudo-labeling seems almost circular: we use the model's predictions to train the model. Why doesn't this just reinforce whatever the model already believes?
Understanding when and why pseudo-labeling works reveals fundamental insights about semi-supervised learning.
Entropy Minimization Perspective:
Pseudo-labeling can be understood as a form of entropy minimization. When we train on pseudo-labels, we encourage the model to be more confident:
$$\ell(f_\theta(x), \arg\max_y f_\theta(x)) = -\log \max_y p_\theta(y \mid x)$$

Minimizing this loss pushes the model toward lower-entropy (more confident) predictions, in line with the entropy minimization principle:
The decision boundary should pass through low-density regions of the input space.
By making confident predictions, the model moves its decision boundaries away from data points.
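To see the link explicitly, compare the hard pseudo-label loss with the predictive entropy (same notation as above):

$$H\big(p_\theta(\cdot \mid x)\big) = -\sum_{y} p_\theta(y \mid x)\log p_\theta(y \mid x) \;\geq\; -\log \max_{y} p_\theta(y \mid x) = \ell(f_\theta(x), \hat{y})$$

Both quantities are non-negative and vanish exactly when the prediction is one-hot, so driving the pseudo-label loss toward zero also drives the predictive entropy toward zero.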
Pseudo-labeling works when the cluster assumption holds: data points form clusters, and points within a cluster share the same label. High-confidence predictions occur for points clearly within a cluster. By training on these, the model learns to recognize cluster membership, which then improves predictions for points at cluster boundaries.
The Bootstrap Effect:
Pseudo-labeling creates a positive feedback loop:

1. The model, trained on the labeled data, makes confident predictions on the easiest unlabeled examples.
2. Training on those pseudo-labels sharpens the model's representation of each class cluster.
3. The improved model becomes confident on additional unlabeled examples, which become new pseudo-labels, and the cycle repeats.
This bootstrapping effect is why pseudo-labeling can dramatically outperform supervised-only training, even though it uses the model's own predictions.
Why the Threshold Matters:
The confidence threshold is crucial because, without it, the bootstrap effect works in reverse for incorrectly classified examples: a wrong prediction becomes a training target, the model grows more confident in that error, and nearby unlabeled points inherit the same wrong label.
The threshold filters out low-confidence (potentially wrong) predictions, allowing only reliable pseudo-labels to contribute.
| Threshold | Mask Ratio* | Pseudo-Label Accuracy* | Final Test Error* |
|---|---|---|---|
| No threshold (0.0) | 100% | ~68% | 18.2% |
| 0.5 | ~85% | ~78% | 12.4% |
| 0.8 | ~55% | ~89% | 8.7% |
| 0.95 | ~25% | ~96% | 5.8% |
| 0.99 | ~8% | ~99% | 7.2% |
*Approximate values for CIFAR-10 with 250 labels. The optimal threshold balances pseudo-label quality (higher threshold) with quantity (lower threshold). 0.95 is typically optimal for image classification.
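One practical way to choose the threshold for a new dataset is to sweep it on a held-out labeled set and measure the quality/quantity trade-off directly. Below is a minimal sketch of such a sweep; the trained `model`, the labeled `val_loader`, and the `device` argument are placeholders, not objects defined elsewhere on this page.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sweep_thresholds(model, val_loader, thresholds=(0.0, 0.5, 0.8, 0.95, 0.99), device="cuda"):
    """For each candidate threshold, report what fraction of samples would
    pass (mask ratio) and how accurate their pseudo-labels would be."""
    model.eval()
    all_conf, all_correct = [], []
    for x, y in val_loader:
        probs = F.softmax(model(x.to(device)), dim=-1)
        conf, pred = probs.max(dim=-1)
        all_conf.append(conf.cpu())
        all_correct.append(pred.cpu() == y)
    conf = torch.cat(all_conf)
    correct = torch.cat(all_correct)

    results = {}
    for t in thresholds:
        mask = conf >= t
        results[t] = {
            "mask_ratio": mask.float().mean().item(),
            "pseudo_label_acc": correct[mask].float().mean().item() if mask.any() else float("nan"),
        }
    return results
```

A threshold where pseudo-label accuracy is high while the mask ratio is still non-trivial is usually a good starting point.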
The Curriculum Effect:
Pseudo-labeling implicitly creates a curriculum: early in training, only the easiest, most prototypical unlabeled examples clear the confidence threshold; as the model improves, progressively harder examples near cluster boundaries start to contribute.
This natural progression from easy to hard examples aligns with curriculum learning principles, providing a structured learning experience without explicit design.
Despite its effectiveness, pseudo-labeling can fail in predictable ways. Understanding these failure modes is essential for diagnosing issues and choosing appropriate remedies.
```python
import torch
import torch.nn.functional as F
from collections import Counter
from typing import Any, Dict, List


class PseudoLabelMonitor:
    """
    Monitor pseudo-labeling training for failure modes.

    Tracks metrics that indicate potential problems:
    - Class distribution of pseudo-labels
    - Confidence distribution over time
    - Mask ratio trends
    - Pseudo-label accuracy (requires validation set)
    """

    def __init__(self, num_classes: int, window_size: int = 100):
        self.num_classes = num_classes
        self.window_size = window_size

        # Tracking buffers
        self.class_counts = Counter()
        self.confidence_history: List[float] = []
        self.mask_ratio_history: List[float] = []
        self.step = 0

    def update(
        self,
        pseudo_labels: torch.Tensor,
        confidences: torch.Tensor,
        mask: torch.Tensor,
    ):
        """Update monitoring statistics."""
        self.step += 1

        # Track class distribution of pseudo-labels
        for label in pseudo_labels[mask].cpu().tolist():
            self.class_counts[label] += 1

        # Track confidence
        self.confidence_history.append(confidences.mean().item())
        if len(self.confidence_history) > self.window_size:
            self.confidence_history.pop(0)

        # Track mask ratio
        self.mask_ratio_history.append(mask.float().mean().item())
        if len(self.mask_ratio_history) > self.window_size:
            self.mask_ratio_history.pop(0)

    def detect_class_collapse(
        self, imbalance_threshold: float = 0.8
    ) -> bool:
        """
        Detect if pseudo-labels are dominated by single class.

        Args:
            imbalance_threshold: Fraction of labels from top class
                that indicates collapse

        Returns:
            True if class collapse is likely occurring
        """
        total = sum(self.class_counts.values())
        if total == 0:
            return False

        # Most common class
        most_common_count = self.class_counts.most_common(1)[0][1]
        return (most_common_count / total) > imbalance_threshold

    def detect_confirmation_bias(
        self,
        confidence_trend_threshold: float = 0.01,
    ) -> bool:
        """
        Detect rapidly increasing confidence (potential confirmation bias).

        Returns True if confidence is increasing faster than expected.
        """
        if len(self.confidence_history) < self.window_size:
            return False

        # Check if confidence is increasing rapidly
        recent = self.confidence_history[-20:]
        earlier = self.confidence_history[-40:-20]
        recent_avg = sum(recent) / len(recent)
        earlier_avg = sum(earlier) / len(earlier)

        return (recent_avg - earlier_avg) > confidence_trend_threshold

    def detect_threshold_deadzone(
        self,
        min_mask_ratio: float = 0.01,
    ) -> bool:
        """
        Detect if threshold is too high (no samples contributing).
        """
        if len(self.mask_ratio_history) < 10:
            return False

        recent_avg = sum(self.mask_ratio_history[-10:]) / 10
        return recent_avg < min_mask_ratio

    def get_diagnostics(self) -> Dict[str, Any]:
        """Get comprehensive diagnostics."""
        total = sum(self.class_counts.values())

        # Class distribution
        class_dist = {
            k: v / total if total > 0 else 0
            for k, v in self.class_counts.items()
        }

        # Class imbalance metrics
        if total > 0:
            max_frac = max(class_dist.values())
            min_frac = min(class_dist.values()) if class_dist else 0
            imbalance_ratio = max_frac / (min_frac + 1e-8)
        else:
            imbalance_ratio = 0

        avg_confidence = (
            sum(self.confidence_history) / len(self.confidence_history)
            if self.confidence_history else 0
        )
        avg_mask_ratio = (
            sum(self.mask_ratio_history) / len(self.mask_ratio_history)
            if self.mask_ratio_history else 0
        )

        return {
            "class_distribution": class_dist,
            "imbalance_ratio": imbalance_ratio,
            "avg_confidence": avg_confidence,
            "avg_mask_ratio": avg_mask_ratio,
            "class_collapse_warning": self.detect_class_collapse(),
            "confirmation_bias_warning": self.detect_confirmation_bias(),
            "deadzone_warning": self.detect_threshold_deadzone(),
        }
```

Remedies for each failure mode:

- Confirmation bias: increase the threshold, use an EMA teacher, add strong augmentation.
- Class collapse: use distribution alignment, class-balanced sampling, or lower λ_u early in training.
- Threshold dead zone: use a curriculum threshold that starts lower and increases, or use soft pseudo-labels.
- Distribution mismatch: use domain adaptation techniques or filter the unlabeled data.
A key design choice in pseudo-labeling is whether to use hard (one-hot) or soft (probability distribution) pseudo-labels. Each has distinct characteristics and use cases.
```python
import torch
import torch.nn.functional as F


def compare_pseudo_label_gradients():
    """
    Demonstrate gradient differences between hard and soft pseudo-labels.
    """
    # Example: model predicts roughly [0.7, 0.2, 0.1] for a sample
    logits = torch.tensor([[1.4, 0.0, -0.5]], requires_grad=True)
    probs = F.softmax(logits, dim=-1)  # probs ≈ [0.72, 0.18, 0.11]

    # Hard pseudo-label: argmax = class 0
    hard_label = torch.tensor([0])
    loss_hard = F.cross_entropy(logits, hard_label)

    # Soft pseudo-label: full distribution
    soft_label = probs.detach()
    log_probs = F.log_softmax(logits, dim=-1)
    loss_soft = -(soft_label * log_probs).sum(dim=-1).mean()

    # Compare gradients
    loss_hard.backward(retain_graph=True)
    grad_hard = logits.grad.clone()
    logits.grad.zero_()

    loss_soft.backward()
    grad_soft = logits.grad.clone()

    print(f"Probs: {probs}")
    print(f"Hard gradient: {grad_hard}")  # Pushes strongly toward class 0
    print(f"Soft gradient: {grad_soft}")  # Zero here: the soft target equals the current prediction

    # Key insight: Hard gradients are larger in magnitude
    print(f"Hard gradient magnitude: {grad_hard.norm():.4f}")
    print(f"Soft gradient magnitude: {grad_soft.norm():.4f}")


def sharpened_soft_labels(
    probs: torch.Tensor, temperature: float = 0.5
) -> torch.Tensor:
    """
    Create sharpened soft pseudo-labels (used in UDA, MixMatch).

    Interpolates between soft (T=1) and hard (T→0) labels.
    """
    # Sharpen: raise to power 1/T and renormalize
    sharpened = probs.pow(1.0 / temperature)
    return sharpened / sharpened.sum(dim=-1, keepdim=True)


# Example: Effect of sharpening
probs = torch.tensor([[0.6, 0.25, 0.15]])

print("Original soft label:", probs)
print("Sharpened (T=0.5):", sharpened_soft_labels(probs, 0.5))
print("Sharpened (T=0.25):", sharpened_soft_labels(probs, 0.25))
print("Sharpened (T=0.1):", sharpened_soft_labels(probs, 0.1))  # Nearly hard

# Output (approximately):
# Original soft label: tensor([[0.6000, 0.2500, 0.1500]])
# Sharpened (T=0.5):   tensor([[0.8090, 0.1404, 0.0506]])
# Sharpened (T=0.25):  tensor([[0.9671, 0.0292, 0.0038]])
# Sharpened (T=0.1):   tensor([[0.9998, 0.0002, 0.0000]])  # Essentially hard
```

When to Use Each:
| Scenario | Recommended | Reason |
|---|---|---|
| High confidence threshold (≥0.95) | Hard | Predictions are reliable; hard labels give stronger signal |
| Lower threshold (0.5-0.8) | Soft (sharpened) | Preserve uncertainty; reduce confirmation bias |
| Class overlap present | Soft | Ambiguous samples shouldn't be forced to single class |
| Knowledge distillation | Soft | Student should learn teacher's uncertainty |
| Fast convergence needed | Hard | Stronger gradients accelerate learning |
Pseudo-labeling is an instance of self-training, a broader semi-supervised learning paradigm. Understanding self-training provides context for various pseudo-labeling variants and extensions.
Self-Training Paradigm:

1. Train a teacher model on the labeled data.
2. Use the teacher to assign labels to the unlabeled data.
3. Select which pseudo-labeled examples to keep (for example, by confidence).
4. Train a student model on the labeled data plus the selected pseudo-labels.
5. Optionally repeat, with the student becoming the next teacher.
Key Variants:
| Variant | Teacher | Selection Criterion | Student Update |
|---|---|---|---|
| Basic Pseudo-labeling | Current model | Confidence threshold | Same model (online) |
| Iterative Self-Training | Previous iteration | Confidence threshold | New model from scratch |
| Noisy Student | Pre-trained teacher | No filtering (all pseudo-labels) | Add noise/augmentation |
| Mean Teacher | EMA of student | No explicit threshold | Consistency loss |
| Meta Pseudo Labels | Teacher network | Meta-learning selection | Student with meta-gradients |
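The EMA teacher in the Mean Teacher row (also one of the confirmation-bias remedies listed above) is simple enough to spell out. A minimal sketch, assuming `teacher` and `student` share the same architecture:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999):
    """Move each teacher parameter a small step toward the student.

    The teacher lags the student, so its pseudo-labels change slowly
    and are less likely to lock in the student's most recent mistakes.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)  # e.g., batch-norm running statistics
```

Calling `ema_update(teacher, student)` once per optimizer step keeps the teacher as a temporally averaged copy of the student.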
Noisy Student Training:
An influential variant is Noisy Student (Xie et al., 2020), which achieved state-of-the-art ImageNet accuracy at the time by:

1. Training a teacher on the labeled data.
2. Using the teacher to pseudo-label a much larger unlabeled set, with no confidence filtering.
3. Training an equal-or-larger student on all of the data with heavy noise (strong augmentation, dropout, stochastic depth).
4. Making the student the new teacher and repeating.
Key insight: The "noise" (augmentation + dropout) prevents the student from merely copying the teacher, forcing it to learn robust representations.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class NoisyStudentTrainer:
    """
    Noisy Student Training implementation.

    Key differences from basic pseudo-labeling:
    1. Teacher is fixed (not updated during student training)
    2. No confidence threshold (all pseudo-labels used)
    3. Strong noise added to student (augmentation, dropout)
    4. Iterative: student becomes teacher for next round
    """

    def __init__(
        self,
        teacher: nn.Module,
        student: nn.Module,
        optimizer: torch.optim.Optimizer,
        augment_fn: callable,
        dropout_rate: float = 0.5,
    ):
        self.teacher = teacher
        self.student = student
        self.optimizer = optimizer
        self.augment_fn = augment_fn
        self.dropout_rate = dropout_rate

        # Freeze teacher
        for param in self.teacher.parameters():
            param.requires_grad = False
        self.teacher.eval()

    def enable_student_noise(self):
        """Enable all noise sources in student."""
        self.student.train()  # Enable dropout, batch norm training mode

    def generate_pseudo_labels(
        self, unlabeled: torch.Tensor
    ) -> torch.Tensor:
        """
        Generate pseudo-labels from teacher.

        Note: No confidence filtering - all samples get pseudo-labels.
        """
        with torch.no_grad():
            # Teacher sees clean (unaugmented) images
            logits = self.teacher(unlabeled)
            pseudo_labels = logits.argmax(dim=-1)
        return pseudo_labels

    def train_step(
        self,
        x_labeled: torch.Tensor,
        y_labeled: torch.Tensor,
        x_unlabeled: torch.Tensor,
        lambda_u: float = 1.0,
    ) -> dict:
        """
        Single Noisy Student training step.
        """
        self.enable_student_noise()
        self.optimizer.zero_grad()

        # Get pseudo-labels for unlabeled data (from clean teacher)
        pseudo_labels = self.generate_pseudo_labels(x_unlabeled)

        # Student sees augmented versions with noise
        x_labeled_aug = self.augment_fn(x_labeled)
        x_unlabeled_aug = self.augment_fn(x_unlabeled)

        # Forward pass through noisy student
        logits_l = self.student(x_labeled_aug)
        logits_u = self.student(x_unlabeled_aug)

        # Supervised loss
        loss_sup = F.cross_entropy(logits_l, y_labeled)

        # Pseudo-label loss (no threshold!)
        loss_pseudo = F.cross_entropy(logits_u, pseudo_labels)

        # Combined loss
        loss_total = loss_sup + lambda_u * loss_pseudo

        loss_total.backward()
        self.optimizer.step()

        return {
            "loss_sup": loss_sup.item(),
            "loss_pseudo": loss_pseudo.item(),
            "loss_total": loss_total.item(),
        }

    def update_teacher(self):
        """
        Make current student the new teacher for next iteration.
        """
        # Copy student parameters to teacher
        self.teacher.load_state_dict(self.student.state_dict())

        # Re-freeze teacher
        for param in self.teacher.parameters():
            param.requires_grad = False
        self.teacher.eval()
```

Curriculum pseudo-labeling adapts the threshold or selection strategy over the course of training, typically starting easier and becoming more aggressive. This aligns with curriculum learning principles and can significantly improve performance.
Motivation:
Early in training, the model is unreliable. Using a high threshold means very few pseudo-labels contribute. But if we wait for the model to improve before using pseudo-labels, we lose the benefit of unlabeled data.
Curriculum Strategies (both appear in the implementation below): a single global threshold that ramps up on a fixed schedule (linear, exponential, or cosine warmup), and class-adaptive thresholds that track each class's learning status, as in FlexMatch.
```python
import torch
import numpy as np
from typing import Dict, Optional


class FlexMatch:
    """
    FlexMatch: Curriculum pseudo-labeling with flexible thresholds.

    Paper: "FlexMatch: Boosting Semi-Supervised Learning with
    Curriculum Pseudo Labeling"

    Key idea: Use class-specific thresholds that adapt based on
    each class's learning status (confidence distribution).
    """

    def __init__(
        self,
        num_classes: int,
        base_threshold: float = 0.95,
        warmup_steps: int = 16000,
    ):
        self.num_classes = num_classes
        self.base_threshold = base_threshold
        self.warmup_steps = warmup_steps

        # Class-specific "learning status" (σ in the paper)
        # Tracks max confidence seen for each class
        self.class_max_conf = torch.zeros(num_classes)

        # Momentum for updating class stats
        self.momentum = 0.999

    def update_class_stats(
        self,
        probs: torch.Tensor,
        pseudo_labels: torch.Tensor,
    ):
        """
        Update per-class learning status based on current batch.

        Args:
            probs: Model probability predictions [B, K]
            pseudo_labels: Argmax pseudo-labels [B]
        """
        max_probs = probs.max(dim=-1)[0]

        for c in range(self.num_classes):
            class_mask = pseudo_labels == c
            if class_mask.sum() > 0:
                # Keep stats on CPU regardless of where the batch lives
                class_max = max_probs[class_mask].max().detach().cpu()
                # EMA update
                self.class_max_conf[c] = (
                    self.momentum * self.class_max_conf[c]
                    + (1 - self.momentum) * class_max
                )

    def get_class_thresholds(self, step: int) -> torch.Tensor:
        """
        Compute per-class thresholds based on learning status.

        Classes that the model has "learned" (high max confidence)
        get higher thresholds. Classes still being learned get
        lower thresholds.
        """
        # Normalize class max confidences to [0, 1]
        if self.class_max_conf.max() > 0:
            normalized = self.class_max_conf / self.class_max_conf.max()
        else:
            normalized = torch.ones(self.num_classes)

        # Threshold = base * normalized_learning_status
        # Well-learned classes: threshold close to base (0.95)
        # Poorly-learned classes: lower threshold (more samples)
        thresholds = self.base_threshold * normalized

        # Warmup: linearly increase from uniform threshold
        if step < self.warmup_steps:
            warmup_factor = step / self.warmup_steps
            uniform_threshold = self.base_threshold * 0.5  # Starting threshold
            thresholds = (
                (1 - warmup_factor) * uniform_threshold
                + warmup_factor * thresholds
            )

        return thresholds

    def create_mask(
        self,
        probs: torch.Tensor,
        pseudo_labels: torch.Tensor,
        step: int,
    ) -> torch.Tensor:
        """
        Create per-sample mask using class-specific thresholds.

        Args:
            probs: Probability predictions [B, K]
            pseudo_labels: Pseudo-labels [B]
            step: Current training step

        Returns:
            Boolean mask [B]
        """
        # Update class stats
        self.update_class_stats(probs, pseudo_labels)

        # Get per-class thresholds (moved to the batch's device)
        class_thresholds = self.get_class_thresholds(step).to(probs.device)

        # Get confidence
        max_probs = probs.max(dim=-1)[0]

        # Per-sample threshold based on pseudo-label class
        sample_thresholds = class_thresholds[pseudo_labels]

        # Create mask
        mask = max_probs >= sample_thresholds
        return mask


class CurriculumPseudoLabel:
    """
    Simple curriculum pseudo-labeling with adaptive threshold.
    """

    def __init__(
        self,
        min_threshold: float = 0.5,
        max_threshold: float = 0.95,
        warmup_steps: int = 10000,
        schedule: str = "linear",  # 'linear', 'exp', 'cos'
    ):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.warmup_steps = warmup_steps
        self.schedule = schedule

    def get_threshold(self, step: int) -> float:
        """Get threshold at current step."""
        if step >= self.warmup_steps:
            return self.max_threshold

        progress = step / self.warmup_steps

        if self.schedule == "linear":
            threshold = self.min_threshold + progress * (
                self.max_threshold - self.min_threshold
            )
        elif self.schedule == "exp":
            # Slow start, fast increase
            threshold = self.min_threshold + (progress ** 2) * (
                self.max_threshold - self.min_threshold
            )
        elif self.schedule == "cos":
            # Cosine schedule
            threshold = self.min_threshold + 0.5 * (
                1 - np.cos(np.pi * progress)
            ) * (self.max_threshold - self.min_threshold)
        else:
            raise ValueError(f"Unknown schedule: {self.schedule}")

        return threshold
```

FlexMatch achieves significant improvements over FixMatch in imbalanced settings. On CIFAR-10 with 40 labels: FixMatch 4.26% → FlexMatch 4.51% (similar). But on STL-10 with 40 labels: FixMatch 32.72% → FlexMatch 16.59% (2× improvement). The per-class adaptive thresholds help minority classes get pseudo-labels earlier.
Understanding how pseudo-labeling connects to methods like FixMatch, UDA, and MixMatch reveals the shared principles that make them work.
FixMatch as Enhanced Pseudo-labeling:
FixMatch can be viewed as pseudo-labeling with two key enhancements: a fixed high confidence threshold ($\tau = 0.95$) and a weak-to-strong augmentation scheme, in which pseudo-labels computed on a weakly augmented view supervise predictions on a strongly augmented view of the same image. Schematically:
$$\text{FixMatch} = \text{PseudoLabel}(\text{hard}, \tau=0.95) + \text{StrongAug}$$
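As a rough sketch of that recipe (not the reference implementation), the unlabeled part of the loss looks like the following; `weak_augment` and `strong_augment` are placeholder augmentation functions:

```python
import torch
import torch.nn.functional as F


def fixmatch_unlabeled_loss(model, x_unlabeled, weak_augment, strong_augment, threshold=0.95):
    """FixMatch-style unlabeled loss: the weak view labels the strong view."""
    # Pseudo-labels from the weakly augmented view (no gradient)
    with torch.no_grad():
        probs_weak = F.softmax(model(weak_augment(x_unlabeled)), dim=-1)
        conf, pseudo_labels = probs_weak.max(dim=-1)
        mask = (conf >= threshold).float()

    # Prediction on the strongly augmented view must match the pseudo-label
    logits_strong = model(strong_augment(x_unlabeled))
    per_sample = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_sample * mask).mean()
```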
UDA as Soft Pseudo-labeling:
UDA uses soft (sharpened) pseudo-labels with a KL-divergence consistency loss, plus Training Signal Annealing (TSA) on the supervised loss:
$$\text{UDA} = \text{PseudoLabel}(\text{soft}, \text{sharpen}, \text{TSA}) + \text{StrongAug}$$
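A corresponding UDA-style sketch, reusing the `sharpened_soft_labels` helper defined earlier and replacing hard cross-entropy with a KL-divergence consistency term (TSA on the supervised loss is omitted; `strong_augment` is again a placeholder):

```python
import torch
import torch.nn.functional as F


def uda_unlabeled_loss(model, x_unlabeled, strong_augment, temperature=0.4, threshold=0.8):
    """UDA-style consistency loss with sharpened soft pseudo-labels."""
    # Soft target from the clean view, sharpened with a temperature
    with torch.no_grad():
        probs_clean = F.softmax(model(x_unlabeled), dim=-1)
        target = sharpened_soft_labels(probs_clean, temperature)
        mask = (probs_clean.max(dim=-1)[0] >= threshold).float()

    # KL divergence between the target and the strongly augmented prediction
    log_probs_aug = F.log_softmax(model(strong_augment(x_unlabeled)), dim=-1)
    kl = F.kl_div(log_probs_aug, target, reduction="none").sum(dim=-1)
    return (kl * mask).mean()
```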
| Method | Pseudo-label Type | Threshold | Key Addition |
|---|---|---|---|
| Basic Pseudo-label | Hard | Variable | None |
| Mean Teacher | Soft (from EMA) | None (all samples) | EMA teacher |
| UDA | Soft (sharpened) | ~0.8 | TSA + Strong Aug |
| FixMatch | Hard | 0.95 | Weak→Strong Aug |
| MixMatch | Soft (sharpened, averaged) | None | MixUp interpolation |
| FlexMatch | Hard | Class-adaptive | Curriculum thresholds |
| Noisy Student | Hard | None | Noise in student |
Unifying View:
All these methods share the core pseudo-labeling insight: use model predictions to supervise training on unlabeled data. They differ in the form of the pseudo-label (hard vs. soft, sharpened or not), how pseudo-labels are selected (fixed threshold, class-adaptive threshold, or none), which model produces them (the current model, an EMA teacher, or a frozen teacher), and what perturbation the student sees (weak-to-strong augmentation, dropout, MixUp).
Understanding these dimensions helps you design new methods or adapt existing ones to your specific problem.
FixMatch's 0.95 threshold seems to discard most unlabeled samples. Why does it still work so well? Because strong augmentation makes each included sample extremely valuable—the model must learn truly invariant features to be consistent under aggressive perturbations. Quality over quantity: a few high-quality pseudo-labels with strong augmentation beat many low-quality ones.
Pseudo-labeling is the foundation upon which modern semi-supervised learning is built. Understanding it deeply—its mechanisms, failure modes, and variations—provides crucial insight into why methods like FixMatch and UDA work.
Module Complete:
This concludes Module 3: Consistency Regularization. We've covered the consistency-based and pseudo-labeling techniques that power modern semi-supervised methods such as Mean Teacher, UDA, MixMatch, and FixMatch.
These techniques form the backbone of semi-supervised learning today. The next module will explore Self-Supervised Learning—learning representations without any labels at all.
You now have a comprehensive understanding of consistency regularization methods for semi-supervised learning. From theoretical foundations to practical implementations, you're equipped to apply these techniques to real-world problems and understand why they work.