A child sees three pictures of an okapi, a forest relative of the giraffe with zebra-striped legs, and immediately recognizes a fourth one at the zoo. A security system trained on millions of faces cannot identify a new employee from three enrollment photos. This stark contrast between human and machine learning illustrates the few-shot learning challenge: how can we build systems that learn effectively from minimal data?
Few-shot learning is the ability to learn new concepts from just a few examples (typically 1-5). It is one of the most practically important applications of meta-learning, directly addressing scenarios where large labeled datasets are impractical or impossible to collect: diagnosing rare diseases, spotting new defect types on a production line, or personalizing a model to an individual user.
In this page, we'll formally define few-shot learning, understand why it's fundamentally difficult, explore benchmark datasets and evaluation protocols, and set the stage for the algorithmic solutions covered in subsequent pages.
By completing this page, you will understand: (1) The precise formulation of N-way K-shot learning, (2) Why traditional machine learning fails with few examples, (3) Standard benchmark datasets (Omniglot, miniImageNet, tieredImageNet), (4) Evaluation protocols and metrics, (5) The fundamental challenges that make few-shot learning hard, and (6) Real-world applications where few-shot learning is essential.
Few-shot learning problems are formalized using a standardized N-way K-shot structure that precisely defines the evaluation scenario.
Definition: N-Way K-Shot Classification
Given:

- N classes ("N-way"), sampled from a larger pool of available classes
- K labeled examples per class ("K-shot"), which together form the support set
- A set of unlabeled query examples drawn from the same N classes
The task: Use the support set to learn a classifier that correctly labels the query examples.
The Episode Structure:
Each evaluation (and training) episode consists of:
$$\mathcal{E} = \{\mathcal{S}, \mathcal{Q}\}$$
where:

- $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \cdot K}$ is the support set: the K labeled examples for each of the N classes
- $\mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{N \cdot Q}$ is the query set: held-out examples from the same N classes (Q per class), used to evaluate the adapted classifier
Critically, the classes in each episode are relabeled from 0 to N-1, so the model cannot rely on memorizing class identities from training.
| Setting | N (Classes) | K (Support/Class) | Total Support | Difficulty | Use Cases |
|---|---|---|---|---|---|
| 5-way 1-shot | 5 | 1 | 5 | Extreme | Testing rapid learning ability |
| 5-way 5-shot | 5 | 5 | 25 | Challenging | Standard benchmark |
| 5-way 20-shot | 5 | 20 | 100 | Moderate | Testing scalability |
| 20-way 1-shot | 20 | 1 | 20 | Very Hard | Many-class few-shot |
| 20-way 5-shot | 20 | 5 | 100 | Hard | Realistic scenarios |
```python
import torch
import numpy as np
from typing import Dict, List, Tuple
from dataclasses import dataclass


@dataclass
class FewShotEpisode:
    """
    A complete few-shot learning episode.

    Attributes:
        n_way: Number of classes in this episode
        k_shot: Number of support examples per class
        q_query: Number of query examples per class
        support_images: Tensor of shape [n_way * k_shot, C, H, W]
        support_labels: Tensor of shape [n_way * k_shot], values in [0, n_way)
        query_images: Tensor of shape [n_way * q_query, C, H, W]
        query_labels: Tensor of shape [n_way * q_query], values in [0, n_way)
        original_classes: Original class IDs before relabeling (for debugging)
    """
    n_way: int
    k_shot: int
    q_query: int
    support_images: torch.Tensor
    support_labels: torch.Tensor
    query_images: torch.Tensor
    query_labels: torch.Tensor
    original_classes: List[int]

    def __post_init__(self):
        # Validate dimensions
        assert self.support_images.shape[0] == self.n_way * self.k_shot
        assert self.query_images.shape[0] == self.n_way * self.q_query
        assert len(set(self.support_labels.tolist())) == self.n_way
        assert len(set(self.query_labels.tolist())) == self.n_way

    @property
    def support_size(self) -> int:
        return self.n_way * self.k_shot

    @property
    def query_size(self) -> int:
        return self.n_way * self.q_query

    def to(self, device: torch.device) -> 'FewShotEpisode':
        """Move tensors to specified device."""
        return FewShotEpisode(
            n_way=self.n_way,
            k_shot=self.k_shot,
            q_query=self.q_query,
            support_images=self.support_images.to(device),
            support_labels=self.support_labels.to(device),
            query_images=self.query_images.to(device),
            query_labels=self.query_labels.to(device),
            original_classes=self.original_classes
        )


class FewShotEpisodeSampler:
    """
    Samples few-shot episodes from a dataset.

    Key principle: Each episode contains novel class combinations,
    with fresh support/query splits. The meta-learner sees many such
    episodes during training, learning to generalize across them.
    """

    def __init__(
        self,
        dataset: Dict[int, List[torch.Tensor]],  # class_id -> list of images
        n_way: int = 5,
        k_shot: int = 5,
        q_query: int = 15
    ):
        self.dataset = dataset  # Pre-organized by class
        self.all_classes = list(dataset.keys())
        self.n_way = n_way
        self.k_shot = k_shot
        self.q_query = q_query

        # Validate sufficient examples per class
        min_examples = k_shot + q_query
        for cls, examples in dataset.items():
            if len(examples) < min_examples:
                raise ValueError(
                    f"Class {cls} has {len(examples)} examples, "
                    f"need at least {min_examples} for {k_shot}-shot + {q_query} queries"
                )

    def sample_episode(self) -> FewShotEpisode:
        """
        Sample a complete episode.

        Steps:
            1. Randomly select n_way classes
            2. For each class, sample k_shot + q_query examples
            3. Split into support and query
            4. Relabel classes to [0, n_way)
            5. Shuffle support and query separately
        """
        # Step 1: Select classes
        selected_classes = np.random.choice(
            self.all_classes, size=self.n_way, replace=False
        )

        support_images, support_labels = [], []
        query_images, query_labels = [], []

        # Steps 2-4: Sample and split for each class
        for new_label, original_class in enumerate(selected_classes):
            class_examples = self.dataset[original_class]

            # Random selection without replacement
            indices = np.random.choice(
                len(class_examples),
                size=self.k_shot + self.q_query,
                replace=False
            )
            selected_examples = [class_examples[i] for i in indices]

            # Split into support and query
            support_examples = selected_examples[:self.k_shot]
            query_examples = selected_examples[self.k_shot:]

            # Add to lists with relabeled class IDs
            support_images.extend(support_examples)
            support_labels.extend([new_label] * self.k_shot)
            query_images.extend(query_examples)
            query_labels.extend([new_label] * self.q_query)

        # Step 5: Shuffle to remove class ordering
        support_perm = np.random.permutation(len(support_labels))
        query_perm = np.random.permutation(len(query_labels))

        return FewShotEpisode(
            n_way=self.n_way,
            k_shot=self.k_shot,
            q_query=self.q_query,
            support_images=torch.stack([support_images[i] for i in support_perm]),
            support_labels=torch.tensor([support_labels[i] for i in support_perm]),
            query_images=torch.stack([query_images[i] for i in query_perm]),
            query_labels=torch.tensor([query_labels[i] for i in query_perm]),
            original_classes=list(selected_classes)
        )
```

Classes are relabeled to 0, 1, ..., N-1 in each episode. This prevents the model from 'cheating' by memorizing class identities from training. The model must genuinely learn to classify based on the support examples, not recognize familiar categories.
Few-shot learning isn't merely 'small-data machine learning'—it presents fundamental statistical and computational challenges that standard approaches cannot overcome. Understanding these challenges clarifies why specialized solutions are necessary.
Challenge 1: High Variance, Low Bias
With few examples, any classifier has high variance—small changes in the training set dramatically change the learned function. Consider fitting a decision boundary with 5 points versus 5,000: the 5-point version is wildly unstable.
Mathematically, for a model with $d$ parameters and $n$ training examples:
$$\text{Generalization Error} \approx \text{Bias}^2 + \text{Variance} + \text{Noise}$$
Traditional ML reduces variance by increasing $n$. Few-shot learning requires reducing variance through informative priors—which is exactly what meta-learning provides.
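To see this concretely, here is a minimal simulation (illustrative only; the Gaussian class model, dimensionality, and trial counts are arbitrary choices, not from the text above). It fits a nearest-class-mean classifier from either 5 or 5,000 examples per class and measures how much test accuracy fluctuates when the training sample is redrawn:

```python
import numpy as np


def run_trial(n_per_class: int, rng: np.random.Generator, dim: int = 20) -> float:
    """Estimate class means from n_per_class samples and report test accuracy
    of the resulting nearest-class-mean classifier."""
    mu0, mu1 = np.zeros(dim), np.full(dim, 0.5)  # two overlapping Gaussian classes

    def sample(mu, n):
        return rng.normal(mu, 1.0, size=(n, dim))

    # "Training": estimate each class mean from the available examples
    c0 = sample(mu0, n_per_class).mean(axis=0)
    c1 = sample(mu1, n_per_class).mean(axis=0)

    # Fixed-size test set: 1,000 points per class
    x = np.vstack([sample(mu0, 1000), sample(mu1, 1000)])
    y = np.array([0] * 1000 + [1] * 1000)

    # Predict by nearest estimated class mean
    d0 = np.linalg.norm(x - c0, axis=1)
    d1 = np.linalg.norm(x - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y).mean())


rng = np.random.default_rng(0)
for n in (5, 5000):
    accs = [run_trial(n, rng) for _ in range(100)]
    print(f"n={n:>4} per class: accuracy {np.mean(accs):.3f} ± {np.std(accs):.3f} across re-draws")
```

With 5 examples per class the accuracy fluctuates noticeably from draw to draw, because the estimated class means, and therefore the decision boundary, move with every resample; with 5,000 examples it is essentially fixed. Meta-learning aims to supply constraints from related tasks so the small-sample regime behaves more like the large-sample one.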
| Learning Setting | Typical Examples | Parameter Count | Ratio (Params/Examples) | Effective Constraints |
|---|---|---|---|---|
| ImageNet training | 1,200,000 | 25,000,000 | ~20:1 | Well-constrained |
| Fine-tuning | 10,000 | 25,000,000 | ~2500:1 | Regularization-dependent |
| 5-way 5-shot | 25 | 25,000,000 | ~1,000,000:1 | Catastrophically underdetermined |
| 5-way 1-shot | 5 | 25,000,000 | ~5,000,000:1 | Impossible without priors |
Why Standard Fine-Tuning Fails:
One might think: 'Just fine-tune a pre-trained model on the few examples.' This intuition is partially correct—pre-training helps—but naive fine-tuning still fails:
Catastrophic overfitting: With 5 examples, a few gradient steps drive training accuracy to 100% while test accuracy plummets.
Feature destruction: Aggressive fine-tuning overwrites useful pre-trained representations with noise from few examples.
No class-specific adaptation: The model learns 'these 5 images' rather than 'the concept of this class.'
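To make the overfitting point tangible, here is a small sketch (an illustration on synthetic, assumed inputs, not a recipe from this page): a linear classification head is trained on 25 random 512-dimensional "embeddings" with randomly assigned labels. Because 25 points in 512 dimensions are essentially always linearly separable, training accuracy reaches 100% even though there is nothing to learn:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 25 "support embeddings" (a 5-way 5-shot set) with *random* labels:
# there is no signal to learn, only noise to memorize.
features = torch.randn(25, 512)
labels = torch.randint(0, 5, (25,))

head = nn.Linear(512, 5)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(300):
    logits = head(features)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

train_acc = (head(features).argmax(dim=1) == labels).float().mean().item()
print(f"Training accuracy on random labels: {train_acc:.0%}")  # typically 100%
```

A real 5-way 5-shot support set is no harder to memorize, so training accuracy says nothing about whether the class concept was captured. That is why strong priors, constrained adaptation, or early stopping are unavoidable.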
The Meta-Learning Solution:
Meta-learning addresses these challenges by:

- Learning feature representations across many training episodes, so that a few examples of a novel class already land in a well-structured embedding space
- Learning initializations that adapt in a handful of gradient steps without destroying what was learned before (the idea behind MAML)
- Learning comparison procedures, as in metric-based methods such as Matching and Prototypical Networks, that require no per-class parameter fitting at all

In each case, the few examples only have to constrain a small, well-chosen part of the model; the rest is a prior learned from related tasks.
In 1-shot learning, your single example determines everything. If you happen to get an atypical example (a Chihuahua for 'dog', a limousine for 'car'), classification will fail. Meta-learning can't eliminate this variance, but it can learn representations where typical and atypical examples are closer together.
Few-shot learning research relies on standardized benchmarks that enable fair comparison across methods. Understanding these datasets is essential for both evaluating existing methods and designing new ones.
The Train/Validation/Test Split Philosophy:
Unlike traditional ML, where we split examples, few-shot learning splits classes:

- Meta-training classes: used to sample training episodes
- Meta-validation classes: used to sample episodes for hyperparameter tuning and model selection
- Meta-test classes: never seen during training; used to sample the evaluation episodes

The three class pools are disjoint, so every test episode asks the model to learn genuinely novel categories from its support set.
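The sketch below shows what such a class-level split looks like in code; the class IDs, sizes, and helper name `split_classes` are made up for illustration (the real benchmarks ship with fixed, published splits):

```python
import random
from typing import Dict, List


def split_classes(
    class_ids: List[int],
    n_train: int,
    n_val: int,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Partition class IDs into disjoint meta-train / meta-val / meta-test pools.

    Episodes for each phase are sampled only from that phase's classes,
    so meta-test episodes always contain classes never seen in training.
    """
    rng = random.Random(seed)
    shuffled = class_ids[:]
    rng.shuffle(shuffled)
    return {
        "meta_train": shuffled[:n_train],
        "meta_val": shuffled[n_train:n_train + n_val],
        "meta_test": shuffled[n_train + n_val:],
    }


# Example: 100 classes split 64 / 16 / 20 (the standard miniImageNet proportions)
splits = split_classes(list(range(100)), n_train=64, n_val=16)
assert not set(splits["meta_train"]) & set(splits["meta_test"])
print({k: len(v) for k, v in splits.items()})
```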
Omniglot: The 'MNIST of Few-Shot Learning'
Omniglot consists of 1,623 handwritten characters from 50 different alphabets, with 20 examples per character. Each character was drawn by 20 different people.
Dataset Structure:

- 50 alphabets, ranging from real scripts such as Latin and Greek to invented ones such as Futurama
- 1,623 character classes in total, each with exactly 20 examples (one per drawer)
- Simple binary stroke images, originally 105×105 pixels and conventionally downsampled to 28×28 for few-shot experiments

Standard Splits:

- The original split by Lake et al. uses 30 'background' alphabets for training and the remaining 20 'evaluation' alphabets for testing
- The widely used Vinyals et al. split instead takes 1,200 characters for meta-training and the remaining 423 for meta-testing, with 90° rotations added as extra classes

Why Omniglot Works for Few-Shot:

- Hundreds of classes with only 20 examples each, which naturally forces an episodic treatment
- Characters across alphabets share low-level structure (strokes), giving a meta-learner something transferable to exploit
- Images are small and simple, so experiments run quickly on modest hardware
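As a concrete starting point, torchvision provides an Omniglot loader. The sketch below builds the class-indexed dictionary that the FewShotEpisodeSampler defined earlier expects; the 28×28 resize and the use of the background set for meta-training are common but assumed choices, not requirements:

```python
from collections import defaultdict
from typing import Dict, List

import torch
from torchvision import datasets, transforms

# 28x28 grayscale tensors, as in most Omniglot few-shot setups
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

# background=True loads the alphabets typically used for meta-training;
# background=False loads the evaluation alphabets used for meta-testing.
omniglot = datasets.Omniglot(root="./data", background=True,
                             transform=transform, download=True)

# Group images by character class so FewShotEpisodeSampler can index them
by_class: Dict[int, List[torch.Tensor]] = defaultdict(list)
for image, label in omniglot:
    by_class[label].append(image)

print(f"{len(by_class)} classes, "
      f"{min(len(v) for v in by_class.values())} examples in the smallest class")

# sampler = FewShotEpisodeSampler(dict(by_class), n_way=5, k_shot=1, q_query=15)
```

Swapping `background=False` yields the evaluation alphabets for sampling meta-test episodes.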
| Method | 5-way 1-shot (%) | 5-way 5-shot (%) | 20-way 1-shot (%) | 20-way 5-shot (%) |
|---|---|---|---|---|
| Nearest Neighbor | 41.1 | 69.2 | 20.3 | 52.8 |
| Matching Networks | 98.1 | 98.9 | 93.8 | 98.5 |
| Prototypical Networks | 98.8 | 99.7 | 96.0 | 98.9 |
| MAML | 98.7 | 99.9 | 95.8 | 98.9 |
Omniglot has become 'solved'—many methods achieve near-perfect accuracy. It remains useful for initial validation but doesn't discriminate between state-of-the-art methods. Harder benchmarks are now standard for comparing advances.
Proper evaluation in few-shot learning requires careful attention to protocols that weren't necessary in traditional ML. The high variance in few-shot scenarios means that a single test run is statistically meaningless.
Standard Evaluation Protocol:

1. Sample a large number of episodes (600 or more is standard) from the meta-test classes.
2. For each episode, adapt to the support set and compute accuracy on the query set.
3. Report the mean accuracy across episodes together with a 95% confidence interval.
The confidence interval is critical—it indicates whether apparent differences between methods are statistically significant.
```python
import numpy as np
from typing import Tuple, List
import torch
from scipy import stats


def evaluate_few_shot(
    model,
    episode_sampler,
    n_episodes: int = 600,
    confidence: float = 0.95
) -> Tuple[float, float, List[float]]:
    """
    Standard few-shot evaluation protocol.

    Args:
        model: Few-shot classifier
        episode_sampler: Samples (support, query) episodes
        n_episodes: Number of test episodes
        confidence: Confidence level for interval

    Returns:
        mean_accuracy: Average accuracy across episodes
        confidence_interval: Half-width of confidence interval
        all_accuracies: Per-episode accuracies for analysis
    """
    model.eval()
    accuracies = []

    with torch.no_grad():
        for episode_idx in range(n_episodes):
            episode = episode_sampler.sample_episode()

            # Model adapts to the support set and predicts on the query set
            predictions = model(
                support_x=episode.support_images,
                support_y=episode.support_labels,
                query_x=episode.query_images
            )

            # Compute accuracy for this episode
            correct = (predictions == episode.query_labels).float()
            accuracies.append(correct.mean().item())

            if (episode_idx + 1) % 100 == 0:
                running_mean = np.mean(accuracies)
                running_std = np.std(accuracies)
                print(f"Episode {episode_idx + 1}/{n_episodes}: "
                      f"Accuracy = {running_mean:.2%} ± {running_std:.4f}")

    # Compute statistics
    mean_acc = np.mean(accuracies)
    std_acc = np.std(accuracies, ddof=1)  # sample standard deviation

    # Confidence interval using the t-distribution; for large n_episodes
    # this approximates the normal-based interval
    t_value = stats.t.ppf((1 + confidence) / 2, n_episodes - 1)
    ci_half_width = t_value * (std_acc / np.sqrt(n_episodes))

    print(f"\nFinal Results ({n_episodes} episodes):")
    print(f"  Mean Accuracy: {mean_acc:.2%}")
    print(f"  Std Deviation: {std_acc:.4f}")
    print(f"  {int(confidence * 100)}% CI: "
          f"[{mean_acc - ci_half_width:.2%}, {mean_acc + ci_half_width:.2%}]")

    return mean_acc, ci_half_width, accuracies


def compare_methods(
    results_a: Tuple[float, float],
    results_b: Tuple[float, float],
    n_episodes: int
) -> bool:
    """
    Test if method A is significantly better than method B.

    Uses confidence-interval overlap as a conservative proxy for a proper
    statistical test. Returns True only if A is clearly better.
    """
    mean_a, ci_a = results_a
    mean_b, ci_b = results_b

    # If confidence intervals don't overlap, the difference is significant.
    # This is conservative: overlapping CIs can still be significant.
    overlap = (mean_a - ci_a < mean_b + ci_b) and (mean_b - ci_b < mean_a + ci_a)

    if not overlap:
        return mean_a > mean_b

    # A proper paired test would require the per-episode accuracies;
    # this simplification declares overlapping results inconclusive.
    return False


# Metric variation for imbalanced scenarios
def evaluate_imbalanced_few_shot(
    model,
    episode_sampler,
    n_episodes: int = 600
) -> dict:
    """
    Evaluation for imbalanced few-shot scenarios.

    In realistic settings, classes may have different numbers of support
    or query examples, so class-balanced (macro-averaged) metrics are reported.
    """
    model.eval()
    per_class_accuracies = []

    with torch.no_grad():
        for _ in range(n_episodes):
            episode = episode_sampler.sample_variable_episode()
            predictions = model(
                support_x=episode.support_images,
                support_y=episode.support_labels,
                query_x=episode.query_images
            )

            # Per-class accuracy (balanced across classes)
            class_accs = []
            for c in range(episode.n_way):
                mask = episode.query_labels == c
                if mask.sum() > 0:
                    class_acc = (predictions[mask] == c).float().mean().item()
                    class_accs.append(class_acc)

            # Macro-averaged (balanced) accuracy for this episode
            balanced_acc = np.mean(class_accs) if class_accs else 0.0
            per_class_accuracies.append(balanced_acc)

    return {
        'balanced_accuracy': np.mean(per_class_accuracies),
        'balanced_std': np.std(per_class_accuracies),
    }
```

Common mistakes in few-shot evaluation: (1) Using too few episodes (100 is insufficient for significance), (2) Not reporting confidence intervals, (3) Comparing methods with different backbones or data augmentation, (4) Using training classes in test episodes, (5) Not controlling for random seed effects.
An important distinction in few-shot learning is whether the method operates inductively or transductively. This choice significantly impacts what information is available during classification.
Inductive Inference:

The classifier is built from the support set alone, and each query example is labeled independently. Nothing about the rest of the query set influences any individual prediction.

Transductive Inference:

The classifier receives the entire (unlabeled) query set at once and may exploit its statistics, such as batch statistics or cluster structure, when making predictions.
Transductive methods typically outperform inductive ones (by 1-3% accuracy) because they leverage query set statistics. However, they require all queries to be available simultaneously, which isn't always realistic. For fair comparison, papers should clearly label which setting they use.
Transductive Mechanisms:
How do transductive methods exploit query information?
Batch normalization: Computing mean/variance across the query batch provides domain-specific normalization.
Prototype refinement: Initial prototypes from support can be updated using high-confidence query predictions.
Label propagation: Graph-based methods propagate labels from support through query, exploiting cluster structure.
Entropy minimization: Encourage confident predictions across the query set, pushing decision boundaries away from dense regions.
These techniques effectively use the unlabeled query set as additional information, improving classification accuracy at the cost of requiring batch access.
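As one concrete illustration, the sketch below implements a generic soft k-means style prototype refinement (a common transductive pattern rather than any specific published algorithm; the function name, temperature, and iteration count are arbitrary). Prototypes computed from the support set are pulled toward the query points currently assigned to them:

```python
import torch
import torch.nn.functional as F


def refine_prototypes(
    support_emb: torch.Tensor,     # [N*K, D] embedded support examples
    support_labels: torch.Tensor,  # [N*K] labels in [0, N)
    query_emb: torch.Tensor,       # [Q, D] embedded (unlabeled) query examples
    n_way: int,
    n_iters: int = 3,
    temperature: float = 10.0,
) -> torch.Tensor:
    """Transductive prototype refinement via soft assignment of the query set."""
    # Initial prototypes: per-class mean of the support embeddings (inductive part)
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_way)
    ])

    # Per-class support sums and counts, reused at every refinement step
    support_sum = torch.stack([
        support_emb[support_labels == c].sum(dim=0) for c in range(n_way)
    ])
    support_count = torch.stack([
        (support_labels == c).sum() for c in range(n_way)
    ]).float().unsqueeze(1)

    for _ in range(n_iters):
        # Softly assign each query to the prototypes (negative squared distances)
        dists = torch.cdist(query_emb, prototypes) ** 2        # [Q, N]
        soft_assign = F.softmax(-dists / temperature, dim=1)   # [Q, N]

        # Recompute prototypes from support (weight 1) plus softly assigned queries
        weighted_query = soft_assign.t() @ query_emb            # [N, D]
        query_weight = soft_assign.sum(dim=0, keepdim=True).t() # [N, 1]
        prototypes = (support_sum + weighted_query) / (support_count + query_weight)

    return prototypes  # classify queries by nearest refined prototype
```

Classification then proceeds exactly as in the inductive case, by assigning each query to its nearest prototype, but the prototypes now reflect the query batch's cluster structure, which is where the typical 1-3% transductive gain comes from.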
Few-shot learning isn't solely an academic pursuit—it addresses genuine practical challenges where labeled data is scarce, expensive, or impossible to obtain at scale. Understanding these applications motivates the importance of the techniques we'll study.
| Domain | Why Data is Scarce | Typical K | Deployment Constraints |
|---|---|---|---|
| Medical diagnosis | Rare diseases, privacy | 1-5 | High accuracy required |
| Drug discovery | Experimental cost | 5-20 | False negatives costly |
| Industrial inspection | New defect types | 1-10 | Real-time inference |
| Personalization | Privacy, user burden | 3-10 | User experience |
| E-commerce search | Long-tail products | 1-5 | Latency-critical |
| Document processing | New form types | 1-3 | Enterprise deployment |
While benchmarks focus on classification, few-shot learning extends to detection, segmentation, regression, and reinforcement learning. Few-shot object detection identifies new object categories; few-shot RL learns new tasks from a few demonstrations. The principles transfer across problem types.
We've now established the full context of few-shot learning:

- The N-way K-shot formulation and the episodic support/query structure
- Why a handful of examples leaves standard training and naive fine-tuning hopelessly underdetermined
- Benchmark datasets such as Omniglot and the class-level train/validation/test splits behind them
- Evaluation protocols: many test episodes, mean accuracy, and confidence intervals
- The inductive versus transductive distinction
- Real-world domains where labeled data is inherently scarce
The remaining question: How do we actually solve few-shot problems?
You now understand the few-shot learning problem in depth—its formulation, challenges, benchmarks, and practical importance. In the next pages, we'll dive into the algorithmic solutions that have achieved remarkable success, starting with Model-Agnostic Meta-Learning (MAML).
Coming Next: Page 2 dives into Model-Agnostic Meta-Learning (MAML)—the seminal algorithm that showed how learning a good initialization enables rapid adaptation to new tasks.