Evaluating continual learning systems is fundamentally more complex than evaluating standard machine learning models. In traditional ML, we measure performance on a held-out test set after training completes. In continual learning, we must measure performance dynamics over time, across multiple tasks, with interdependent metrics that can trade off against each other.

The Core Questions:

- How much does the model forget? (stability)
- How well does it learn new tasks? (plasticity)
- Does old knowledge help new learning? (forward transfer)
- Does new knowledge improve old tasks? (backward transfer)
- How does performance evolve over the task sequence?

These questions cannot be answered with a single accuracy number. Proper evaluation requires a suite of metrics, temporal tracking, and careful experimental design.
By the end of this page, you will understand the standard metrics for continual learning (accuracy, forgetting, forward/backward transfer), how to construct and interpret accuracy matrices, standard benchmarks and evaluation scenarios, proper experimental design including baselines and statistical testing, and how to avoid common evaluation pitfalls.
All continual learning metrics derive from the accuracy matrix $A$, where $A_{i,j}$ represents the accuracy (or other performance metric) on task $T_i$ immediately after training on task $T_j$.

Matrix Structure:

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n,1} & A_{n,2} & \cdots & A_{n,n} \end{bmatrix}$$

Key Regions:

- Diagonal ($A_{i,i}$): Performance on task $i$ right after training on it
- Upper Triangle ($A_{i,j}$ for $i < j$): Performance on old tasks after training on later tasks—shows forgetting
- Lower Triangle ($A_{i,j}$ for $i > j$): Performance on future tasks before training on them—shows forward transfer (zero-shot or pre-existing capability)
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Callable
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class ContinualEvaluator:
    """
    Comprehensive evaluation framework for continual learning.

    Builds and analyzes the accuracy matrix from which all
    standard metrics are derived.
    """

    def __init__(self, n_tasks: int):
        """
        Args:
            n_tasks: Total number of tasks in sequence
        """
        self.n_tasks = n_tasks

        # Accuracy matrix: A[i,j] = accuracy on task i after training on task j
        # Initialize with NaN to distinguish "not evaluated" from "zero accuracy"
        self.accuracy_matrix = np.full((n_tasks, n_tasks), np.nan)

        # Random baseline for comparison
        self.random_baseline = np.zeros(n_tasks)

        # Joint training baseline (upper bound)
        self.joint_baseline = np.zeros(n_tasks)

    def record(
        self,
        model: nn.Module,
        task_dataloaders: List[DataLoader],
        current_task: int,
        device: torch.device
    ) -> None:
        """
        Record accuracies after training on current_task.

        Evaluates on ALL tasks (past, present, future) to capture
        full transfer dynamics.
        """
        model.eval()

        with torch.no_grad():
            for task_id, loader in enumerate(task_dataloaders):
                correct = 0
                total = 0

                for inputs, targets in loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    outputs = model(inputs)
                    _, predicted = outputs.max(1)
                    correct += predicted.eq(targets).sum().item()
                    total += targets.size(0)

                self.accuracy_matrix[task_id, current_task] = correct / total

    def get_matrix(self) -> np.ndarray:
        """Return the full accuracy matrix."""
        return self.accuracy_matrix.copy()

    def visualize(
        self,
        save_path: str = None,
        title: str = "Continual Learning Accuracy Matrix"
    ) -> None:
        """
        Visualize accuracy matrix as heatmap.

        Color coding:
        - Diagonal: Task trained at this step
        - Upper triangle: Potential forgetting region
        - Lower triangle: Forward transfer region
        """
        fig, ax = plt.subplots(figsize=(10, 8))

        # Plot heatmap
        im = ax.imshow(self.accuracy_matrix, cmap='RdYlGn', vmin=0, vmax=1)

        # Add colorbar
        cbar = ax.figure.colorbar(im, ax=ax)
        cbar.ax.set_ylabel("Accuracy", rotation=-90, va="bottom")

        # Labels
        ax.set_xticks(np.arange(self.n_tasks))
        ax.set_yticks(np.arange(self.n_tasks))
        ax.set_xticklabels([f'After T{i+1}' for i in range(self.n_tasks)])
        ax.set_yticklabels([f'Task {i+1}' for i in range(self.n_tasks)])

        # Rotate x labels
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

        # Add text annotations
        for i in range(self.n_tasks):
            for j in range(self.n_tasks):
                val = self.accuracy_matrix[i, j]
                if not np.isnan(val):
                    color = 'white' if val < 0.5 else 'black'
                    ax.text(j, i, f'{val:.2f}',
                            ha='center', va='center', color=color)

        ax.set_title(title)
        ax.set_xlabel("Training Progression")
        ax.set_ylabel("Evaluation Task")

        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()


def explain_matrix_regions():
    """Visual explanation of accuracy matrix regions."""
    print("ACCURACY MATRIX INTERPRETATION")
    print("=" * 60)
    print()
    print("Matrix A[i,j] = Accuracy on Task i after training on Task j")
    print()
    print("        After T1  After T2  After T3  After T4  After T5")
    print("       ┌─────────┬─────────┬─────────┬─────────┬─────────┐")
    print("Task1  │  DIAG   │  UPPER  │  UPPER  │  UPPER  │  UPPER  │")
    print("       │ (peak)  │ (forg)  │ (forg)  │ (forg)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task2  │  LOWER  │  DIAG   │  UPPER  │  UPPER  │  UPPER  │")
    print("       │ (fwd)   │ (peak)  │ (forg)  │ (forg)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task3  │  LOWER  │  LOWER  │  DIAG   │  UPPER  │  UPPER  │")
    print("       │ (fwd)   │ (fwd)   │ (peak)  │ (forg)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task4  │  LOWER  │  LOWER  │  LOWER  │  DIAG   │  UPPER  │")
    print("       │ (fwd)   │ (fwd)   │ (fwd)   │ (peak)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task5  │  LOWER  │  LOWER  │  LOWER  │  LOWER  │  DIAG   │")
    print("       │ (fwd)   │ (fwd)   │ (fwd)   │ (fwd)   │ (peak)  │")
    print("       └─────────┴─────────┴─────────┴─────────┴─────────┘")
    print()
    print("DIAG  = Diagonal: Peak accuracy right after training")
    print("UPPER = Upper triangle: Shows forgetting over time")
    print("LOWER = Lower triangle: Shows forward transfer (often 0)")


explain_matrix_regions()
```

From the accuracy matrix, we derive several standard metrics that capture different aspects of continual learning performance.
```python
import numpy as np
from typing import Dict, Optional
from dataclasses import dataclass


@dataclass
class ContinualMetrics:
    """Container for all continual learning metrics."""
    average_accuracy: float
    average_forgetting: float
    learning_accuracy: float
    forward_transfer: float
    backward_transfer: float
    final_accuracies: np.ndarray
    forgetting_per_task: np.ndarray

    def summary(self) -> str:
        """Human-readable summary of metrics."""
        return f"""Continual Learning Metrics Summary
===================================
Average Accuracy (ACC):  {self.average_accuracy:.4f}
Average Forgetting (F):  {self.average_forgetting:.4f}
Learning Accuracy (LA):  {self.learning_accuracy:.4f}
Forward Transfer (FWT):  {self.forward_transfer:.4f}
Backward Transfer (BWT): {self.backward_transfer:.4f}

Per-Task Final Accuracies:
{self._format_array(self.final_accuracies)}

Per-Task Forgetting:
{self._format_array(self.forgetting_per_task)}
"""

    def _format_array(self, arr: np.ndarray) -> str:
        return "  " + "  ".join([f"T{i+1}: {v:.3f}" for i, v in enumerate(arr)])


class MetricsCalculator:
    """
    Calculate all standard continual learning metrics from accuracy matrix.

    All formulas follow the conventions from:
    - López-Paz & Ranzato, "Gradient Episodic Memory" (2017)
    - Díaz-Rodríguez et al., "Don't forget, there is more..." (2018)
    """

    @staticmethod
    def compute_all(
        accuracy_matrix: np.ndarray,
        random_baseline: Optional[np.ndarray] = None
    ) -> ContinualMetrics:
        """
        Compute all metrics from accuracy matrix.

        Args:
            accuracy_matrix: A[i,j] = accuracy on task i after training on task j
            random_baseline: Baseline accuracy per task (defaults to 0)

        Returns:
            ContinualMetrics containing all computed values
        """
        n = accuracy_matrix.shape[0]

        if random_baseline is None:
            random_baseline = np.zeros(n)

        # Average Accuracy: mean of final column
        final_accuracies = accuracy_matrix[:, -1]
        average_accuracy = np.mean(final_accuracies)

        # Learning Accuracy: mean of diagonal
        learning_accuracy = np.mean(np.diag(accuracy_matrix))

        # Forgetting: peak accuracy minus final accuracy
        forgetting_per_task = np.zeros(n - 1)
        for i in range(n - 1):
            # Peak accuracy on task i after it was trained (excluding the final step)
            peak = np.max(accuracy_matrix[i, i:-1])
            final = accuracy_matrix[i, -1]
            forgetting_per_task[i] = max(0, peak - final)
        average_forgetting = np.mean(forgetting_per_task)

        # Forward Transfer: performance on task i before training on it
        forward_transfers = []
        for i in range(1, n):
            # A[i, i-1] is accuracy on task i right before training on it
            # Compare to baseline
            fwt_i = accuracy_matrix[i, i-1] - random_baseline[i]
            forward_transfers.append(fwt_i)
        forward_transfer = np.mean(forward_transfers) if forward_transfers else 0.0

        # Backward Transfer: change in task i's accuracy after training on later tasks
        backward_transfers = []
        for i in range(n - 1):
            # Final accuracy minus accuracy right after training
            bwt_i = accuracy_matrix[i, -1] - accuracy_matrix[i, i]
            backward_transfers.append(bwt_i)
        backward_transfer = np.mean(backward_transfers) if backward_transfers else 0.0

        return ContinualMetrics(
            average_accuracy=average_accuracy,
            average_forgetting=average_forgetting,
            learning_accuracy=learning_accuracy,
            forward_transfer=forward_transfer,
            backward_transfer=backward_transfer,
            final_accuracies=final_accuracies,
            forgetting_per_task=forgetting_per_task
        )

    @staticmethod
    def compute_area_under_curve(accuracy_matrix: np.ndarray) -> float:
        """
        Compute Area Under the Accuracy Curve (AUAC).

        An alternative to final accuracy that accounts for the
        full trajectory of learning.

        AUAC = (1/n) * Σ_t Average_Accuracy_at_step_t
        """
        n = accuracy_matrix.shape[0]
        auac = 0.0
        for t in range(n):
            # Average accuracy on tasks 1..t+1 after training on task t+1
            avg_at_t = np.mean(accuracy_matrix[:t+1, t])
            auac += avg_at_t
        return auac / n

    @staticmethod
    def compute_intransigence(
        accuracy_matrix: np.ndarray,
        joint_training_accuracy: np.ndarray
    ) -> float:
        """
        Compute Intransigence: inability to learn new tasks.

        Intransigence = Joint_accuracy - CL_accuracy_on_each_task

        High intransigence means CL struggles to learn tasks that
        joint training can learn well.
        """
        n = accuracy_matrix.shape[0]
        intransigence = np.zeros(n)
        for i in range(n):
            # Compare diagonal (CL peak) to joint training
            intransigence[i] = joint_training_accuracy[i] - accuracy_matrix[i, i]
        return np.mean(np.maximum(0, intransigence))


def demonstrate_metrics():
    """Demonstrate metric calculations on example matrix."""
    # Example accuracy matrix (5 tasks)
    # Showing typical forgetting pattern
    A = np.array([
        [0.95, 0.72, 0.55, 0.42, 0.35],  # Task 1 degrades over time
        [0.00, 0.93, 0.78, 0.65, 0.52],  # Task 2 (0 before trained)
        [0.00, 0.00, 0.91, 0.75, 0.63],  # Task 3
        [0.00, 0.00, 0.00, 0.94, 0.78],  # Task 4
        [0.00, 0.00, 0.00, 0.00, 0.96],  # Task 5 (current)
    ])

    metrics = MetricsCalculator.compute_all(A)
    print(metrics.summary())

    # Also compute AUAC
    auac = MetricsCalculator.compute_area_under_curve(A)
    print(f"Area Under Accuracy Curve: {auac:.4f}")


demonstrate_metrics()
```

A method can excel on one metric while failing on another. High learning accuracy (LA) with strongly negative backward transfer (BWT) suggests the model learns each task well but forgets it quickly. Conversely, a low forgetting score paired with low LA can simply mean the model never learned each task well in the first place, so there was little to forget. Always report multiple metrics.
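To make this concrete, here is a small illustrative sketch. The two matrices are invented for illustration, and the snippet assumes the `MetricsCalculator` class defined above is in scope: two 3-task runs with similar final average accuracy can have very different learning-accuracy and forgetting profiles.

```python
import numpy as np

# Two hypothetical 3-task accuracy matrices with similar final average
# accuracy but very different dynamics (values invented for illustration).
A_plastic = np.array([
    [0.95, 0.60, 0.50],   # learns each task well, then forgets heavily
    [0.00, 0.94, 0.62],
    [0.00, 0.00, 0.93],
])
A_stable = np.array([
    [0.70, 0.69, 0.68],   # learns less, but barely forgets
    [0.00, 0.71, 0.70],
    [0.00, 0.00, 0.72],
])

# MetricsCalculator is defined in the block above.
for name, M in [("plastic", A_plastic), ("stable", A_stable)]:
    m = MetricsCalculator.compute_all(M)
    print(f"{name:>8}: ACC={m.average_accuracy:.3f}  "
          f"LA={m.learning_accuracy:.3f}  "
          f"F={m.average_forgetting:.3f}  "
          f"BWT={m.backward_transfer:.3f}")
```

Both runs end with roughly 0.68-0.70 average accuracy, yet one has high LA and high forgetting while the other has low LA and almost none, which is exactly the distinction a single final number hides.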
The field has converged on several standard benchmarks that enable fair comparison across methods. Understanding these benchmarks is essential for interpreting the literature.
| Benchmark | Base Dataset | Tasks | Scenario | Challenge Level |
|---|---|---|---|---|
| Permuted MNIST | MNIST | 10-20 permutations | Domain-IL | Moderate |
| Rotated MNIST | MNIST | 20 rotations | Domain-IL | Moderate |
| Split MNIST | MNIST | 5 (2 classes each) | Class-IL/Task-IL | Easy |
| Split CIFAR-100 | CIFAR-100 | 10-20 (5-10 classes each) | Class-IL | Hard |
| Split TinyImageNet | TinyImageNet | 10 (20 classes each) | Class-IL | Very Hard |
| CORe50 | Object images | 50 classes, 11 sessions | Multiple | Realistic |
| Split MiniImageNet | MiniImageNet | 20 (5 classes each) | Class-IL | Hard |
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, Subset
from torchvision import datasets, transforms
import numpy as np
from typing import List, Tuple, Dict


class PermutedMNIST:
    """
    Permuted MNIST benchmark.

    Each task applies a different fixed permutation to pixel positions.
    All tasks have the same class structure (0-9), but entirely
    different visual patterns.

    Tests: Ability to learn with no shared visual features
    Scenario: Domain-incremental (same output, different input distribution)
    """

    def __init__(
        self,
        n_tasks: int = 10,
        seed: int = 42,
        data_dir: str = './data'
    ):
        self.n_tasks = n_tasks
        self.seed = seed

        # Generate fixed permutations for each task
        rng = np.random.RandomState(seed)
        self.permutations = [
            torch.LongTensor(rng.permutation(784))
            for _ in range(n_tasks)
        ]

        # Load MNIST once
        self.transform = transforms.ToTensor()
        self.train_data = datasets.MNIST(
            data_dir, train=True, download=True, transform=self.transform
        )
        self.test_data = datasets.MNIST(
            data_dir, train=False, transform=self.transform
        )

    def get_task_data(
        self,
        task_id: int,
        train: bool = True
    ) -> Dataset:
        """Get dataset for specific task with permutation applied."""
        base_data = self.train_data if train else self.test_data
        permutation = self.permutations[task_id]
        return PermutedDataset(base_data, permutation)


class PermutedDataset(Dataset):
    """Wrapper that applies permutation to a base dataset."""

    def __init__(self, base_dataset: Dataset, permutation: torch.Tensor):
        self.base = base_dataset
        self.permutation = permutation

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]
        # Flatten, permute, reshape
        flat = img.view(-1)
        permuted = flat[self.permutation]
        return permuted.view(1, 28, 28), label


class SplitDataset:
    """
    Split dataset benchmark (works with MNIST, CIFAR, etc.).

    Divides classes into disjoint subsets, each becoming a task.

    Tests: Class-incremental learning
    Scenario: Task-IL (if task ID given) or Class-IL (if not)
    """

    def __init__(
        self,
        base_dataset: Dataset,
        n_tasks: int,
        classes_per_task: int = None,
        shuffle_classes: bool = True,
        seed: int = 42
    ):
        self.base_dataset = base_dataset
        self.n_tasks = n_tasks

        # Get all unique labels
        all_labels = set()
        for _, label in base_dataset:
            all_labels.add(label if isinstance(label, int) else label.item())
        all_labels = sorted(list(all_labels))

        n_classes = len(all_labels)
        classes_per_task = classes_per_task or n_classes // n_tasks

        # Optionally shuffle class order
        if shuffle_classes:
            rng = np.random.RandomState(seed)
            all_labels = rng.permutation(all_labels).tolist()

        # Assign classes to tasks
        self.task_classes: List[List[int]] = []
        for t in range(n_tasks):
            start = t * classes_per_task
            end = start + classes_per_task
            self.task_classes.append(all_labels[start:end])

        # Pre-compute indices per task
        self.task_indices = self._compute_task_indices()

    def _compute_task_indices(self) -> Dict[int, List[int]]:
        """Pre-compute which dataset indices belong to each task."""
        indices = {t: [] for t in range(self.n_tasks)}

        for idx in range(len(self.base_dataset)):
            _, label = self.base_dataset[idx]
            label = label if isinstance(label, int) else label.item()

            for t, classes in enumerate(self.task_classes):
                if label in classes:
                    indices[t].append(idx)
                    break

        return indices

    def get_task_data(self, task_id: int) -> Subset:
        """Get subset of data for specific task."""
        return Subset(self.base_dataset, self.task_indices[task_id])

    def get_label_mapping(self, task_id: int) -> Dict[int, int]:
        """
        Get mapping from original labels to task-local labels.

        For Task-IL, we typically remap to 0..k-1 within each task.
        """
        mapping = {}
        for new_label, original_label in enumerate(self.task_classes[task_id]):
            mapping[original_label] = new_label
        return mapping


class ContinualLearningScenarios:
    """
    The three standard continual learning scenarios.

    These differ in what information is available at test time.
    """

    @staticmethod
    def task_incremental(description=True):
        """
        Task-Incremental Learning (Task-IL):
        - Task identity provided at both train and test time
        - Model knows which 'head' to use for each sample
        - Easiest scenario

        Example: Multi-head classifier with task-specific output layers
        """
        if description:
            return """
            Task-IL: Task identity available at test time.

            Setup: Separate output head per task, or task embedding input
            Test: Model told "this is task 3" → use head 3
            Challenge: Must learn tasks well; forgetting within shared layers
            Metrics: Per-task accuracy is natural
            """

    @staticmethod
    def domain_incremental(description=True):
        """
        Domain-Incremental Learning (Domain-IL):
        - Task identity NOT required at test (tasks share output structure)
        - New domains for the same underlying problem

        Example: Sentiment analysis on different text sources
        """
        if description:
            return """
            Domain-IL: Same output space, different input distributions.

            Setup: Single output head, shared across domains
            Test: Model must work on any domain without being told which
            Challenge: Handle distribution shift without forgetting
            Metrics: Average accuracy across domains
            """

    @staticmethod
    def class_incremental(description=True):
        """
        Class-Incremental Learning (Class-IL):
        - Task identity NOT provided at test
        - Output space grows (new classes added)
        - Must distinguish ALL classes ever seen

        Example: Object recognition that learns new categories over time
        """
        if description:
            return """
            Class-IL: New classes added over time, no task labels at test.

            Setup: Output layer expands as classes are added
            Test: Model must classify into ALL seen classes
            Challenge: Distinguish old from new without task oracle
            Metrics: Accuracy on unified classification problem

            THIS IS THE HARDEST SCENARIO.
            """


# Quick benchmark creation utilities
def create_split_mnist(n_tasks: int = 5) -> SplitDataset:
    """Create Split MNIST benchmark."""
    mnist = datasets.MNIST('./data', train=True, download=True,
                           transform=transforms.ToTensor())
    return SplitDataset(mnist, n_tasks=n_tasks, classes_per_task=2)


def create_split_cifar100(n_tasks: int = 20) -> SplitDataset:
    """Create Split CIFAR-100 benchmark."""
    cifar = datasets.CIFAR100('./data', train=True, download=True,
                              transform=transforms.ToTensor())
    return SplitDataset(cifar, n_tasks=n_tasks, classes_per_task=5)
```

Methods often report dramatically different results on Task-IL vs Class-IL. A method achieving 90% on Split CIFAR-100 (Task-IL) might achieve only 40% on the same data in Class-IL mode. Always check which scenario is being evaluated when comparing methods.
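Much of that gap comes down to what the evaluation allows the model to predict: in Task-IL, predictions are restricted to the known task's classes, while in Class-IL the model must choose among every class seen so far. The sketch below illustrates the difference with plain PyTorch tensor operations; the `logits`, `task_classes`, and `seen_classes` values are hypothetical and not tied to any specific benchmark code above.

```python
import torch


def predict_task_il(logits: torch.Tensor, task_classes: list) -> torch.Tensor:
    """Task-IL: task identity is known, so mask logits to that task's classes."""
    masked = torch.full_like(logits, float('-inf'))
    masked[:, task_classes] = logits[:, task_classes]
    return masked.argmax(dim=1)


def predict_class_il(logits: torch.Tensor, seen_classes: list) -> torch.Tensor:
    """Class-IL: no task oracle; choose among every class seen so far."""
    masked = torch.full_like(logits, float('-inf'))
    masked[:, seen_classes] = logits[:, seen_classes]
    return masked.argmax(dim=1)


# Hypothetical logits over 10 classes for a batch of 2 samples.
logits = torch.randn(2, 10)
print(predict_task_il(logits, task_classes=[4, 5]))           # only classes 4-5 possible
print(predict_class_il(logits, seen_classes=list(range(6))))  # classes 0-5 possible
```

Reporting a Task-IL number as if it were Class-IL (or vice versa) is one of the most common sources of inflated comparisons in the literature.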
Rigorous experimental design is crucial in continual learning research, where many factors can confound results.
Critical Experimental Controls:

1. Multiple Seeds: Run each experiment with 3-5 different random seeds. Report mean and standard deviation. Continual learning results can be highly variable.

2. Task Order Sensitivity: Some methods are sensitive to task order. Consider evaluating on multiple random orderings (see the sketch after this list).

3. Hyperparameter Fairness: Each method may have optimal hyperparameters. Either tune all methods fairly (same budget) or use published values consistently.

4. Architectural Consistency: Use the same base architecture across methods. Architecture differences can confound comparisons.

5. Epoch Standardization: Train for the same number of epochs per task, or use validation-based early stopping with the same protocol.
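As a minimal sketch of control 2, one might generate several random task orderings and repeat the full run for each (seed, order) pair, then report mean ± std across all of them. `run_sequence` here is a placeholder for whatever training loop you use, not a function defined in this module.

```python
import numpy as np


def make_task_orders(n_tasks: int, n_orders: int, seed: int = 0) -> list:
    """Generate several random task orderings to probe order sensitivity."""
    rng = np.random.RandomState(seed)
    return [rng.permutation(n_tasks).tolist() for _ in range(n_orders)]


seeds = [1, 42, 123]
orders = make_task_orders(n_tasks=5, n_orders=3)

for seed in seeds:
    for order in orders:
        # run_sequence(seed, order) would train on the tasks in this order and
        # return an accuracy matrix; it is a hypothetical placeholder here.
        print(f"seed={seed}, task order={order}")
```

The framework below puts several of these controls (seeds, baselines, statistical testing) into a reusable structure.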
```python
import numpy as np
from typing import List, Dict, Callable, Any
from dataclasses import dataclass
import json
from scipy import stats


@dataclass
class ExperimentConfig:
    """Configuration for a rigorous CL experiment."""
    # Method identification
    method_name: str
    method_params: Dict[str, Any]

    # Data configuration
    benchmark: str
    n_tasks: int
    scenario: str  # 'task_il', 'class_il', 'domain_il'

    # Training configuration
    epochs_per_task: int
    batch_size: int
    learning_rate: float

    # Architecture
    model_class: str
    model_params: Dict[str, Any]

    # Reproducibility
    seeds: List[int]

    def to_json(self) -> str:
        return json.dumps(self.__dict__, indent=2)


class RigorousExperiment:
    """
    Framework for running rigorous continual learning experiments.

    Ensures proper baselines, multiple seeds, and statistical testing.
    """

    def __init__(self, config: ExperimentConfig):
        self.config = config
        self.results: Dict[int, Dict[str, Any]] = {}  # seed -> results

    def run_single_seed(
        self,
        seed: int,
        method_fn: Callable,
        baseline_fns: Dict[str, Callable]
    ) -> Dict[str, Any]:
        """
        Run experiment with single seed.

        Returns dict with method results and all baseline results.
        """
        import torch
        import random

        # Set all random seeds
        torch.manual_seed(seed)
        np.random.seed(seed)
        random.seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(seed)

        results = {'seed': seed}

        # Run main method
        results['method'] = method_fn(self.config)

        # Run all baselines with same seed
        for baseline_name, baseline_fn in baseline_fns.items():
            results[baseline_name] = baseline_fn(self.config)

        return results

    def run_all_seeds(
        self,
        method_fn: Callable,
        baseline_fns: Dict[str, Callable]
    ) -> None:
        """Run experiment across all configured seeds."""
        for seed in self.config.seeds:
            print(f"  Running seed {seed}...")
            self.results[seed] = self.run_single_seed(
                seed, method_fn, baseline_fns
            )

    def aggregate_results(self) -> Dict[str, Dict[str, float]]:
        """
        Aggregate results across seeds.

        Returns mean ± std for each metric for each method.
        """
        aggregated = {}

        # Collect all method names
        methods = list(self.results[self.config.seeds[0]].keys())
        methods = [m for m in methods if m != 'seed']

        for method in methods:
            method_results = []
            for seed in self.config.seeds:
                if method in self.results[seed]:
                    method_results.append(self.results[seed][method])

            # Aggregate each metric
            if method_results:
                aggregated[method] = self._aggregate_metric_dict(method_results)

        return aggregated

    def _aggregate_metric_dict(
        self,
        results_list: List[Dict]
    ) -> Dict[str, str]:
        """Aggregate a list of metric dicts into mean±std."""
        if not results_list:
            return {}

        aggregated = {}
        for key in results_list[0].keys():
            if isinstance(results_list[0][key], (int, float)):
                values = [r[key] for r in results_list]
                mean = np.mean(values)
                std = np.std(values)
                aggregated[key] = f"{mean:.4f} ± {std:.4f}"

        return aggregated

    def statistical_comparison(
        self,
        method1: str,
        method2: str,
        metric: str
    ) -> Dict[str, Any]:
        """
        Perform statistical test comparing two methods.

        Uses paired t-test since same seeds used for both.
        """
        values1 = [self.results[s][method1][metric] for s in self.config.seeds]
        values2 = [self.results[s][method2][metric] for s in self.config.seeds]

        t_stat, p_value = stats.ttest_rel(values1, values2)

        return {
            'method1_mean': np.mean(values1),
            'method2_mean': np.mean(values2),
            'difference': np.mean(values1) - np.mean(values2),
            't_statistic': t_stat,
            'p_value': p_value,
            'significant_at_005': p_value < 0.05,
            'significant_at_001': p_value < 0.01
        }


def example_experiment_report():
    """Example of a proper experiment report."""
    print("""
===============================================================================
EXPERIMENT REPORT: EWC vs Baseline Methods on Split CIFAR-100
===============================================================================

Configuration:
- Benchmark: Split CIFAR-100 (20 tasks, 5 classes each)
- Scenario: Class-Incremental Learning
- Architecture: ResNet-18
- Training: 20 epochs/task, batch size 64, LR 0.001
- Seeds: 1, 42, 123, 456, 789

Results (mean ± std over 5 seeds):
---------------------------------------------------------------------------
Method            | Avg Acc ↑    | Forgetting ↓ | Learn Acc ↑  | BWT ↑
---------------------------------------------------------------------------
Fine-tuning       | 0.2156±0.021 | 0.6523±0.034 | 0.7234±0.018 | -0.5078±0.029
EWC (λ=1000)      | 0.3892±0.033 | 0.4012±0.041 | 0.6678±0.025 | -0.2786±0.035
EWC (λ=5000)      | 0.4234±0.028 | 0.3456±0.037 | 0.6345±0.031 | -0.2111±0.024
Replay (2000 buf) | 0.5123±0.019 | 0.2234±0.023 | 0.6789±0.022 | -0.1666±0.018
Joint Training    | 0.7456±0.011 | 0.0000±0.000 | 0.7456±0.011 | +0.0000±0.000
---------------------------------------------------------------------------

Statistical Tests (EWC λ=5000 vs Fine-tuning):
- Average Accuracy: p < 0.001 (significant improvement)
- Forgetting: p < 0.001 (significant reduction)

Key Findings:
1. EWC significantly outperforms fine-tuning baseline
2. Replay still superior to EWC with same memory budget
3. Gap to joint training (upper bound) remains substantial
4. Higher λ reduces forgetting but also reduces learning accuracy

Reproducibility:
- Code: github.com/example/continual-learning
- Weights: Available upon request
- Random seeds ensure reproducibility
""")


example_experiment_report()
```

Many published results suffer from subtle evaluation issues that make comparison difficult. Being aware of these pitfalls helps in both interpreting the literature and designing your own experiments.
In class-incremental learning, the full test set contains classes the model has not yet been trained on. Some evaluation protocols inadvertently expose these future classes, for example by evaluating mid-sequence on the complete label set. Ensure your evaluation code never gives the model access to data or labels from tasks it has not yet seen.
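One defensive pattern, sketched below under the assumption that `task_classes[t]` lists the classes introduced at task `t` (as in the `SplitDataset` above), is to evaluate at step `t` only on test samples whose labels belong to tasks already seen.

```python
from typing import List


def seen_classes_at_step(task_classes: List[List[int]], current_task: int) -> List[int]:
    """Classes introduced in tasks 0..current_task (inclusive)."""
    seen = []
    for t in range(current_task + 1):
        seen.extend(task_classes[t])
    return seen


def filter_eval_indices(labels: List[int], seen: List[int]) -> List[int]:
    """Keep only test samples whose label has already been introduced."""
    seen_set = set(seen)
    return [i for i, y in enumerate(labels) if y in seen_set]


# Example with 3 tasks of 2 classes each (labels invented for illustration).
task_classes = [[0, 1], [2, 3], [4, 5]]
test_labels = [0, 3, 5, 1, 4, 2]

seen = seen_classes_at_step(task_classes, current_task=1)  # [0, 1, 2, 3]
print(filter_eval_indices(test_labels, seen))              # indices of samples with labels 0-3
```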
The standard accuracy-based metrics capture only part of what matters in real continual learning systems. Production deployments have additional concerns, such as computational cost, memory growth as tasks accumulate, and robustness to task order and hyperparameter choices.
Toward Comprehensive Evaluation:

A truly comprehensive evaluation would include:

1. Standard metrics (accuracy, forgetting, transfer) on multiple benchmarks
2. Multiple scenarios (Task-IL, Class-IL, Domain-IL)
3. Computational cost analysis (FLOPs, wall time, memory)—see the sketch after this list
4. Scalability analysis (performance vs. number of tasks)
5. Robustness analysis (task order sensitivity, hyperparameter sensitivity)
6. Statistical significance testing
7. Comparison to proper baselines (fine-tuning, joint training)

Few papers achieve all of this, but striving toward comprehensive evaluation improves the field.
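As one small step toward item 3, the sketch below records wall-clock time and a rough memory footprint after each task. The `train_one_task` call and `replay_buffer_bytes` argument are hypothetical placeholders rather than parts of any established API.

```python
import time
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def track_cost(model: nn.Module, n_tasks: int, replay_buffer_bytes: int = 0) -> list:
    """Record wall time and memory footprint after each task (sketch)."""
    records = []
    for t in range(n_tasks):
        start = time.perf_counter()
        # train_one_task(model, t)  # placeholder for the actual training loop
        elapsed = time.perf_counter() - start
        records.append({
            'task': t,
            'wall_time_s': elapsed,
            'n_parameters': count_parameters(model),  # grows for dynamic architectures
            'buffer_bytes': replay_buffer_bytes,      # grows for replay methods
        })
    return records


# Example with a small stand-in model.
costs = track_cost(nn.Linear(784, 10), n_tasks=3)
print(costs[0])
```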
We have covered the foundations of rigorous evaluation in continual learning: the accuracy matrix and the metrics derived from it, standard benchmarks and the three evaluation scenarios, experimental design with proper baselines and statistical testing, and the common pitfalls that undermine comparisons.
Congratulations! You have completed the comprehensive module on Continual Learning. You now understand catastrophic forgetting and the stability-plasticity dilemma, regularization approaches (EWC, SI, MAS, LwF), replay methods (experience replay, generative replay, DER), dynamic architectures (progressive networks, parameter isolation, sparse networks), and rigorous evaluation protocols. This knowledge positions you to both understand the research literature and build practical continual learning systems.