A child sees three pictures of an okapi, a forest relative of the giraffe with zebra-striped legs, and immediately recognizes a fourth one at the zoo. A security system trained on millions of faces cannot identify a new employee from three enrollment photos. This stark contrast between human and machine learning illustrates the few-shot learning challenge: how can we build systems that learn effectively from minimal data?
Few-shot learning is the ability to learn new concepts from just a few examples (typically 1-5). It is one of the most practically important applications of meta-learning, directly addressing scenarios where large labeled datasets are impractical or impossible to collect: diagnosing rare diseases, spotting new defect types on a production line, or personalizing a model to an individual user.
In this page, we'll formally define few-shot learning, understand why it's fundamentally difficult, explore benchmark datasets and evaluation protocols, and set the stage for the algorithmic solutions covered in subsequent pages.
By completing this page, you will understand: (1) The precise formulation of N-way K-shot learning, (2) Why traditional machine learning fails with few examples, (3) Standard benchmark datasets (Omniglot, miniImageNet, tieredImageNet), (4) Evaluation protocols and metrics, (5) The fundamental challenges that make few-shot learning hard, and (6) Real-world applications where few-shot learning is essential.
Few-shot learning problems are formalized using a standardized N-way K-shot structure that precisely defines the evaluation scenario.
Definition: N-Way K-Shot Classification
Given:

- N classes ("N-way"), sampled from a larger pool of available classes
- K labeled examples per class ("K-shot"), which together form the support set
- A set of unlabeled query examples drawn from the same N classes
The task: Use the support set to learn a classifier that correctly labels the query examples.
The Episode Structure:
Each evaluation (and training) episode consists of:
$$\mathcal{E} = \{\mathcal{S}, \mathcal{Q}\}$$
where:

- $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \cdot K}$ is the support set: the K labeled examples for each of the N classes
- $\mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{N \cdot Q}$ is the query set: held-out examples from the same N classes (Q per class), used to evaluate the adapted classifier
Critically, the classes in each episode are relabeled from 0 to N-1, so the model cannot rely on memorizing class identities from training.
| Setting | N (Classes) | K (Support/Class) | Total Support | Difficulty | Use Cases |
|---|---|---|---|---|---|
| 5-way 1-shot | 5 | 1 | 5 | Extreme | Testing rapid learning ability |
| 5-way 5-shot | 5 | 5 | 25 | Challenging | Standard benchmark |
| 5-way 20-shot | 5 | 20 | 100 | Moderate | Testing scalability |
| 20-way 1-shot | 20 | 1 | 20 | Very Hard | Many-class few-shot |
| 20-way 5-shot | 20 | 5 | 100 | Hard | Realistic scenarios |
```python
import torch
import numpy as np
from typing import Dict, List, Tuple
from dataclasses import dataclass


@dataclass
class FewShotEpisode:
    """
    A complete few-shot learning episode.

    Attributes:
        n_way: Number of classes in this episode
        k_shot: Number of support examples per class
        q_query: Number of query examples per class
        support_images: Tensor of shape [n_way * k_shot, C, H, W]
        support_labels: Tensor of shape [n_way * k_shot], values in [0, n_way)
        query_images: Tensor of shape [n_way * q_query, C, H, W]
        query_labels: Tensor of shape [n_way * q_query], values in [0, n_way)
        original_classes: Original class IDs before relabeling (for debugging)
    """
    n_way: int
    k_shot: int
    q_query: int
    support_images: torch.Tensor
    support_labels: torch.Tensor
    query_images: torch.Tensor
    query_labels: torch.Tensor
    original_classes: List[int]

    def __post_init__(self):
        # Validate dimensions
        assert self.support_images.shape[0] == self.n_way * self.k_shot
        assert self.query_images.shape[0] == self.n_way * self.q_query
        assert len(set(self.support_labels.tolist())) == self.n_way
        assert len(set(self.query_labels.tolist())) == self.n_way

    @property
    def support_size(self) -> int:
        return self.n_way * self.k_shot

    @property
    def query_size(self) -> int:
        return self.n_way * self.q_query

    def to(self, device: torch.device) -> 'FewShotEpisode':
        """Move tensors to specified device."""
        return FewShotEpisode(
            n_way=self.n_way,
            k_shot=self.k_shot,
            q_query=self.q_query,
            support_images=self.support_images.to(device),
            support_labels=self.support_labels.to(device),
            query_images=self.query_images.to(device),
            query_labels=self.query_labels.to(device),
            original_classes=self.original_classes
        )


class FewShotEpisodeSampler:
    """
    Samples few-shot episodes from a dataset.

    Key principle: Each episode contains novel class combinations,
    with fresh support/query splits. The meta-learner sees many such
    episodes during training, learning to generalize across them.
    """

    def __init__(
        self,
        dataset: Dict[int, List[torch.Tensor]],  # class_id -> list of images
        n_way: int = 5,
        k_shot: int = 5,
        q_query: int = 15
    ):
        self.dataset = dataset  # Pre-organized by class
        self.all_classes = list(dataset.keys())
        self.n_way = n_way
        self.k_shot = k_shot
        self.q_query = q_query

        # Validate sufficient examples per class
        min_examples = k_shot + q_query
        for cls, examples in dataset.items():
            if len(examples) < min_examples:
                raise ValueError(
                    f"Class {cls} has {len(examples)} examples, "
                    f"need at least {min_examples} for {k_shot}-shot + {q_query} queries"
                )

    def sample_episode(self) -> FewShotEpisode:
        """
        Sample a complete episode.

        Steps:
            1. Randomly select n_way classes
            2. For each class, sample k_shot + q_query examples
            3. Split into support and query
            4. Relabel classes to [0, n_way)
            5. Shuffle support and query separately
        """
        # Step 1: Select classes
        selected_classes = np.random.choice(
            self.all_classes, size=self.n_way, replace=False
        )

        support_images, support_labels = [], []
        query_images, query_labels = [], []

        # Steps 2-4: Sample and split for each class
        for new_label, original_class in enumerate(selected_classes):
            class_examples = self.dataset[original_class]

            # Random selection without replacement
            indices = np.random.choice(
                len(class_examples),
                size=self.k_shot + self.q_query,
                replace=False
            )
            selected_examples = [class_examples[i] for i in indices]

            # Split into support and query
            support_examples = selected_examples[:self.k_shot]
            query_examples = selected_examples[self.k_shot:]

            # Add to lists with relabeled class IDs
            support_images.extend(support_examples)
            support_labels.extend([new_label] * self.k_shot)
            query_images.extend(query_examples)
            query_labels.extend([new_label] * self.q_query)

        # Step 5: Shuffle to remove class ordering
        support_perm = np.random.permutation(len(support_labels))
        query_perm = np.random.permutation(len(query_labels))

        return FewShotEpisode(
            n_way=self.n_way,
            k_shot=self.k_shot,
            q_query=self.q_query,
            support_images=torch.stack([support_images[i] for i in support_perm]),
            support_labels=torch.tensor([support_labels[i] for i in support_perm]),
            query_images=torch.stack([query_images[i] for i in query_perm]),
            query_labels=torch.tensor([query_labels[i] for i in query_perm]),
            original_classes=list(selected_classes)
        )
```

Classes are relabeled to 0, 1, ..., N-1 in each episode. This prevents the model from 'cheating' by memorizing class identities from training. The model must genuinely learn to classify based on the support examples, not recognize familiar categories.
Few-shot learning isn't merely 'small-data machine learning'—it presents fundamental statistical and computational challenges that standard approaches cannot overcome. Understanding these challenges clarifies why specialized solutions are necessary.
Challenge 1: High Variance, Low Bias
With few examples, any classifier has high variance—small changes in the training set dramatically change the learned function. Consider fitting a decision boundary with 5 points versus 5,000: the 5-point version is wildly unstable.
Mathematically, for a model with $d$ parameters and $n$ training examples:
$$\text{Generalization Error} \approx \text{Bias}^2 + \text{Variance} + \text{Noise}$$
Traditional ML reduces variance by increasing $n$. Few-shot learning requires reducing variance through informative priors—which is exactly what meta-learning provides.
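To see this concretely, here is a minimal simulation (illustrative only; the Gaussian class model, dimensionality, and trial counts are arbitrary choices, not from the text above). It fits a nearest-class-mean classifier from either 5 or 5,000 examples per class and measures how much test accuracy fluctuates when the training sample is redrawn:

```python
import numpy as np


def run_trial(n_per_class: int, rng: np.random.Generator, dim: int = 20) -> float:
    """Estimate class means from n_per_class samples and report test accuracy
    of the resulting nearest-class-mean classifier."""
    mu0, mu1 = np.zeros(dim), np.full(dim, 0.5)  # two overlapping Gaussian classes

    def sample(mu, n):
        return rng.normal(mu, 1.0, size=(n, dim))

    # "Training": estimate each class mean from the available examples
    c0 = sample(mu0, n_per_class).mean(axis=0)
    c1 = sample(mu1, n_per_class).mean(axis=0)

    # Fixed-size test set: 1,000 points per class
    x = np.vstack([sample(mu0, 1000), sample(mu1, 1000)])
    y = np.array([0] * 1000 + [1] * 1000)

    # Predict by nearest estimated class mean
    d0 = np.linalg.norm(x - c0, axis=1)
    d1 = np.linalg.norm(x - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y).mean())


rng = np.random.default_rng(0)
for n in (5, 5000):
    accs = [run_trial(n, rng) for _ in range(100)]
    print(f"n={n:>4} per class: accuracy {np.mean(accs):.3f} ± {np.std(accs):.3f} across re-draws")
```

With 5 examples per class the accuracy fluctuates noticeably from draw to draw, because the estimated class means, and therefore the decision boundary, move with every resample; with 5,000 examples it is essentially fixed. Meta-learning aims to supply constraints from related tasks so the small-sample regime behaves more like the large-sample one.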
| Learning Setting | Typical Examples | Parameter Count | Ratio (Params/Examples) | Effective Constraints |
|---|---|---|---|---|
| ImageNet training | 1,200,000 | 25,000,000 | ~20:1 | Well-constrained |
| Fine-tuning | 10,000 | 25,000,000 | ~2500:1 | Regularization-dependent |
| 5-way 5-shot | 25 | 25,000,000 | ~1,000,000:1 | Catastrophically underdetermined |
| 5-way 1-shot | 5 | 25,000,000 | ~5,000,000:1 | Impossible without priors |
Why Standard Fine-Tuning Fails:
One might think: 'Just fine-tune a pre-trained model on the few examples.' This intuition is partially correct—pre-training helps—but naive fine-tuning still fails:
Catastrophic overfitting: With 5 examples, a few gradient steps drive training accuracy to 100% while test accuracy plummets.
Feature destruction: Aggressive fine-tuning overwrites useful pre-trained representations with noise from few examples.
No class-specific adaptation: The model learns 'these 5 images' rather than 'the concept of this class.'
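To make the overfitting point tangible, here is a small sketch (an illustration on synthetic, assumed inputs, not a recipe from this page): a linear classification head is trained on 25 random 512-dimensional "embeddings" with randomly assigned labels. Because 25 points in 512 dimensions are essentially always linearly separable, training accuracy reaches 100% even though there is nothing to learn:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 25 "support embeddings" (a 5-way 5-shot set) with *random* labels:
# there is no signal to learn, only noise to memorize.
features = torch.randn(25, 512)
labels = torch.randint(0, 5, (25,))

head = nn.Linear(512, 5)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(300):
    logits = head(features)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

train_acc = (head(features).argmax(dim=1) == labels).float().mean().item()
print(f"Training accuracy on random labels: {train_acc:.0%}")  # typically 100%
```

A real 5-way 5-shot support set is no harder to memorize, so training accuracy says nothing about whether the class concept was captured. That is why strong priors, constrained adaptation, or early stopping are unavoidable.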
The Meta-Learning Solution:
Meta-learning addresses these challenges by:

- Learning feature representations across many training episodes, so that a few examples of a novel class already land in a well-structured embedding space
- Learning initializations that adapt in a handful of gradient steps without destroying what was learned before (the idea behind MAML)
- Learning comparison procedures, as in metric-based methods such as Matching and Prototypical Networks, that require no per-class parameter fitting at all

In each case, the few examples only have to constrain a small, well-chosen part of the model; the rest is a prior learned from related tasks.
In 1-shot learning, your single example determines everything. If you happen to get an atypical example (a Chihuahua for 'dog', a limousine for 'car'), classification will fail. Meta-learning can't eliminate this variance, but it can learn representations where typical and atypical examples are closer together.
Few-shot learning research relies on standardized benchmarks that enable fair comparison across methods. Understanding these datasets is essential for both evaluating existing methods and designing new ones.
The Train/Validation/Test Split Philosophy:
Unlike traditional ML, where we split examples, few-shot learning splits classes:

- Meta-training classes: used to sample training episodes
- Meta-validation classes: used to sample episodes for hyperparameter tuning and model selection
- Meta-test classes: never seen during training; used to sample the evaluation episodes

The three class pools are disjoint, so every test episode asks the model to learn genuinely novel categories from its support set.
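The sketch below shows what such a class-level split looks like in code; the class IDs, sizes, and helper name `split_classes` are made up for illustration (the real benchmarks ship with fixed, published splits):

```python
import random
from typing import Dict, List


def split_classes(
    class_ids: List[int],
    n_train: int,
    n_val: int,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Partition class IDs into disjoint meta-train / meta-val / meta-test pools.

    Episodes for each phase are sampled only from that phase's classes,
    so meta-test episodes always contain classes never seen in training.
    """
    rng = random.Random(seed)
    shuffled = class_ids[:]
    rng.shuffle(shuffled)
    return {
        "meta_train": shuffled[:n_train],
        "meta_val": shuffled[n_train:n_train + n_val],
        "meta_test": shuffled[n_train + n_val:],
    }


# Example: 100 classes split 64 / 16 / 20 (the standard miniImageNet proportions)
splits = split_classes(list(range(100)), n_train=64, n_val=16)
assert not set(splits["meta_train"]) & set(splits["meta_test"])
print({k: len(v) for k, v in splits.items()})
```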
Omniglot: The 'MNIST of Few-Shot Learning'
Omniglot consists of 1,623 handwritten characters from 50 different alphabets, with 20 examples per character. Each character was drawn by 20 different people.
Dataset Structure:

- 50 alphabets, ranging from real scripts such as Latin and Greek to invented ones such as Futurama
- 1,623 character classes in total, each with exactly 20 examples (one per drawer)
- Simple binary stroke images, originally 105×105 pixels and conventionally downsampled to 28×28 for few-shot experiments

Standard Splits:

- The original split by Lake et al. uses 30 'background' alphabets for training and the remaining 20 'evaluation' alphabets for testing
- The widely used Vinyals et al. split instead takes 1,200 characters for meta-training and the remaining 423 for meta-testing, with 90° rotations added as extra classes

Why Omniglot Works for Few-Shot:

- Hundreds of classes with only 20 examples each, which naturally forces an episodic treatment
- Characters across alphabets share low-level structure (strokes), giving a meta-learner something transferable to exploit
- Images are small and simple, so experiments run quickly on modest hardware
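As a concrete starting point, torchvision provides an Omniglot loader. The sketch below builds the class-indexed dictionary that the FewShotEpisodeSampler defined earlier expects; the 28×28 resize and the use of the background set for meta-training are common but assumed choices, not requirements:

```python
from collections import defaultdict
from typing import Dict, List

import torch
from torchvision import datasets, transforms

# 28x28 grayscale tensors, as in most Omniglot few-shot setups
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

# background=True loads the alphabets typically used for meta-training;
# background=False loads the evaluation alphabets used for meta-testing.
omniglot = datasets.Omniglot(root="./data", background=True,
                             transform=transform, download=True)

# Group images by character class so FewShotEpisodeSampler can index them
by_class: Dict[int, List[torch.Tensor]] = defaultdict(list)
for image, label in omniglot:
    by_class[label].append(image)

print(f"{len(by_class)} classes, "
      f"{min(len(v) for v in by_class.values())} examples in the smallest class")

# sampler = FewShotEpisodeSampler(dict(by_class), n_way=5, k_shot=1, q_query=15)
```

Swapping `background=False` yields the evaluation alphabets for sampling meta-test episodes.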
| Method | 5-way 1-shot (%) | 5-way 5-shot (%) | 20-way 1-shot (%) | 20-way 5-shot (%) |
|---|---|---|---|---|
| Nearest Neighbor | 41.1 | 69.2 | 20.3 | 52.8 |
| Matching Networks | 98.1 | 98.9 | 93.8 | 98.5 |
| Prototypical Networks | 98.8 | 99.7 | 96.0 | 98.9 |
| MAML | 98.7 | 99.9 | 95.8 | 98.9 |
Omniglot has become 'solved'—many methods achieve near-perfect accuracy. It remains useful for initial validation but doesn't discriminate between state-of-the-art methods. Harder benchmarks are now standard for comparing advances.
Proper evaluation in few-shot learning requires careful attention to protocols that weren't necessary in traditional ML. The high variance in few-shot scenarios means that a single test run is statistically meaningless.
Standard Evaluation Protocol:

1. Sample a large number of episodes (600 or more is standard) from the meta-test classes.
2. For each episode, adapt to the support set and compute accuracy on the query set.
3. Report the mean accuracy across episodes together with a 95% confidence interval.
The confidence interval is critical—it indicates whether apparent differences between methods are statistically significant.
```python
import numpy as np
from typing import Tuple, List
import torch
from scipy import stats


def evaluate_few_shot(
    model,
    episode_sampler,
    n_episodes: int = 600,
    confidence: float = 0.95
) -> Tuple[float, float, List[float]]:
    """
    Standard few-shot evaluation protocol.

    Args:
        model: Few-shot classifier
        episode_sampler: Samples (support, query) episodes
        n_episodes: Number of test episodes
        confidence: Confidence level for interval

    Returns:
        mean_accuracy: Average accuracy across episodes
        confidence_interval: Half-width of confidence interval
        all_accuracies: Per-episode accuracies for analysis
    """
    model.eval()
    accuracies = []

    with torch.no_grad():
        for episode_idx in range(n_episodes):
            episode = episode_sampler.sample_episode()

            # Model adapts to the support set and predicts on the query set
            predictions = model(
                support_x=episode.support_images,
                support_y=episode.support_labels,
                query_x=episode.query_images
            )

            # Compute accuracy for this episode
            correct = (predictions == episode.query_labels).float()
            accuracies.append(correct.mean().item())

            if (episode_idx + 1) % 100 == 0:
                running_mean = np.mean(accuracies)
                running_std = np.std(accuracies)
                print(f"Episode {episode_idx + 1}/{n_episodes}: "
                      f"Accuracy = {running_mean:.2%} ± {running_std:.4f}")

    # Compute statistics
    mean_acc = np.mean(accuracies)
    std_acc = np.std(accuracies, ddof=1)  # sample standard deviation

    # Confidence interval using the t-distribution; for large n_episodes
    # this approximates the normal-based interval
    t_value = stats.t.ppf((1 + confidence) / 2, n_episodes - 1)
    ci_half_width = t_value * (std_acc / np.sqrt(n_episodes))

    print(f"\nFinal Results ({n_episodes} episodes):")
    print(f"  Mean Accuracy: {mean_acc:.2%}")
    print(f"  Std Deviation: {std_acc:.4f}")
    print(f"  {int(confidence * 100)}% CI: "
          f"[{mean_acc - ci_half_width:.2%}, {mean_acc + ci_half_width:.2%}]")

    return mean_acc, ci_half_width, accuracies


def compare_methods(
    results_a: Tuple[float, float],
    results_b: Tuple[float, float],
    n_episodes: int
) -> bool:
    """
    Test if method A is significantly better than method B.

    Uses confidence-interval overlap as a conservative proxy for a proper
    statistical test. Returns True only if A is clearly better.
    """
    mean_a, ci_a = results_a
    mean_b, ci_b = results_b

    # If confidence intervals don't overlap, the difference is significant.
    # This is conservative: overlapping CIs can still be significant.
    overlap = (mean_a - ci_a < mean_b + ci_b) and (mean_b - ci_b < mean_a + ci_a)

    if not overlap:
        return mean_a > mean_b

    # A proper paired test would require the per-episode accuracies;
    # this simplification declares overlapping results inconclusive.
    return False


# Metric variation for imbalanced scenarios
def evaluate_imbalanced_few_shot(
    model,
    episode_sampler,
    n_episodes: int = 600
) -> dict:
    """
    Evaluation for imbalanced few-shot scenarios.

    In realistic settings, classes may have different numbers of support
    or query examples, so class-balanced (macro-averaged) metrics are reported.
    """
    model.eval()
    per_class_accuracies = []

    with torch.no_grad():
        for _ in range(n_episodes):
            episode = episode_sampler.sample_variable_episode()
            predictions = model(
                support_x=episode.support_images,
                support_y=episode.support_labels,
                query_x=episode.query_images
            )

            # Per-class accuracy (balanced across classes)
            class_accs = []
            for c in range(episode.n_way):
                mask = episode.query_labels == c
                if mask.sum() > 0:
                    class_acc = (predictions[mask] == c).float().mean().item()
                    class_accs.append(class_acc)

            # Macro-averaged (balanced) accuracy for this episode
            balanced_acc = np.mean(class_accs) if class_accs else 0.0
            per_class_accuracies.append(balanced_acc)

    return {
        'balanced_accuracy': np.mean(per_class_accuracies),
        'balanced_std': np.std(per_class_accuracies),
    }
```

Common mistakes in few-shot evaluation: (1) Using too few episodes (100 is insufficient for significance), (2) Not reporting confidence intervals, (3) Comparing methods with different backbones or data augmentation, (4) Using training classes in test episodes, (5) Not controlling for random seed effects.
An important distinction in few-shot learning is whether the method operates inductively or transductively. This choice significantly impacts what information is available during classification.
Inductive Inference:

The classifier is built from the support set alone, and each query example is labeled independently. Nothing about the rest of the query set influences any individual prediction.

Transductive Inference:

The classifier receives the entire (unlabeled) query set at once and may exploit its statistics, such as batch statistics or cluster structure, when making predictions.
Transductive methods typically outperform inductive ones (by 1-3% accuracy) because they leverage query set statistics. However, they require all queries to be available simultaneously, which isn't always realistic. For fair comparison, papers should clearly label which setting they use.
Transductive Mechanisms:
How do transductive methods exploit query information?
Batch normalization: Computing mean/variance across the query batch provides domain-specific normalization.
Prototype refinement: Initial prototypes from support can be updated using high-confidence query predictions.
Label propagation: Graph-based methods propagate labels from support through query, exploiting cluster structure.
Entropy minimization: Encourage confident predictions across the query set, pushing decision boundaries away from dense regions.
These techniques effectively use the unlabeled query set as additional information, improving classification accuracy at the cost of requiring batch access.
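As one concrete illustration, the sketch below implements a generic soft k-means style prototype refinement (a common transductive pattern rather than any specific published algorithm; the function name, temperature, and iteration count are arbitrary). Prototypes computed from the support set are pulled toward the query points currently assigned to them:

```python
import torch
import torch.nn.functional as F


def refine_prototypes(
    support_emb: torch.Tensor,     # [N*K, D] embedded support examples
    support_labels: torch.Tensor,  # [N*K] labels in [0, N)
    query_emb: torch.Tensor,       # [Q, D] embedded (unlabeled) query examples
    n_way: int,
    n_iters: int = 3,
    temperature: float = 10.0,
) -> torch.Tensor:
    """Transductive prototype refinement via soft assignment of the query set."""
    # Initial prototypes: per-class mean of the support embeddings (inductive part)
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_way)
    ])

    # Per-class support sums and counts, reused at every refinement step
    support_sum = torch.stack([
        support_emb[support_labels == c].sum(dim=0) for c in range(n_way)
    ])
    support_count = torch.stack([
        (support_labels == c).sum() for c in range(n_way)
    ]).float().unsqueeze(1)

    for _ in range(n_iters):
        # Softly assign each query to the prototypes (negative squared distances)
        dists = torch.cdist(query_emb, prototypes) ** 2        # [Q, N]
        soft_assign = F.softmax(-dists / temperature, dim=1)   # [Q, N]

        # Recompute prototypes from support (weight 1) plus softly assigned queries
        weighted_query = soft_assign.t() @ query_emb            # [N, D]
        query_weight = soft_assign.sum(dim=0, keepdim=True).t() # [N, 1]
        prototypes = (support_sum + weighted_query) / (support_count + query_weight)

    return prototypes  # classify queries by nearest refined prototype
```

Classification then proceeds exactly as in the inductive case, by assigning each query to its nearest prototype, but the prototypes now reflect the query batch's cluster structure, which is where the typical 1-3% transductive gain comes from.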
Few-shot learning isn't solely an academic pursuit—it addresses genuine practical challenges where labeled data is scarce, expensive, or impossible to obtain at scale. Understanding these applications motivates the importance of the techniques we'll study.
| Domain | Why Data is Scarce | Typical K | Deployment Constraints |
|---|---|---|---|
| Medical diagnosis | Rare diseases, privacy | 1-5 | High accuracy required |
| Drug discovery | Experimental cost | 5-20 | False negatives costly |
| Industrial inspection | New defect types | 1-10 | Real-time inference |
| Personalization | Privacy, user burden | 3-10 | User experience |
| E-commerce search | Long-tail products | 1-5 | Latency-critical |
| Document processing | New form types | 1-3 | Enterprise deployment |
While benchmarks focus on classification, few-shot learning extends to detection, segmentation, regression, and reinforcement learning. Few-shot object detection identifies new object categories; few-shot RL learns new tasks from a few demonstrations. The principles transfer across problem types.
We've now established the full context of few-shot learning:

- The N-way K-shot formulation and the episodic support/query structure
- Why a handful of examples leaves standard training and naive fine-tuning hopelessly underdetermined
- Benchmark datasets such as Omniglot and the class-level train/validation/test splits behind them
- Evaluation protocols: many test episodes, mean accuracy, and confidence intervals
- The inductive versus transductive distinction
- Real-world domains where labeled data is inherently scarce
The remaining question: How do we actually solve few-shot problems?
You now understand the few-shot learning problem in depth—its formulation, challenges, benchmarks, and practical importance. In the next pages, we'll dive into the algorithmic solutions that have achieved remarkable success, starting with Model-Agnostic Meta-Learning (MAML).
Coming Next: Page 2 dives into Model-Agnostic Meta-Learning (MAML)—the seminal algorithm that showed how learning a good initialization enables rapid adaptation to new tasks.