Evaluating continual learning systems is fundamentally more complex than evaluating standard machine learning models. In traditional ML, we measure performance on a held-out test set after training completes. In continual learning, we must measure performance dynamics over time, across multiple tasks, with interdependent metrics that can trade off against each other.

The Core Questions:

- How much does the model forget? (stability)
- How well does it learn new tasks? (plasticity)
- Does old knowledge help new learning? (forward transfer)
- Does new knowledge improve old tasks? (backward transfer)
- How does performance evolve over the task sequence?

These questions cannot be answered with a single accuracy number. Proper evaluation requires a suite of metrics, temporal tracking, and careful experimental design.
By the end of this page, you will understand the standard metrics for continual learning (accuracy, forgetting, forward/backward transfer), how to construct and interpret accuracy matrices, standard benchmarks and evaluation scenarios, proper experimental design including baselines and statistical testing, and how to avoid common evaluation pitfalls.
All continual learning metrics derive from the accuracy matrix $A$, where $A_{i,j}$ represents the accuracy (or other performance metric) on task $T_i$ immediately after training on task $T_j$.

Matrix Structure:

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n,1} & A_{n,2} & \cdots & A_{n,n} \end{bmatrix}$$

Key Regions:

- Diagonal ($A_{i,i}$): Performance on task $i$ right after training on it
- Upper Triangle ($A_{i,j}$ for $i < j$): Performance on old tasks after training on later tasks—shows forgetting
- Lower Triangle ($A_{i,j}$ for $i > j$): Performance on future tasks before training on them—shows forward transfer (zero-shot or pre-existing capability)
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Callable
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class ContinualEvaluator:
    """
    Comprehensive evaluation framework for continual learning.

    Builds and analyzes the accuracy matrix from which all
    standard metrics are derived.
    """

    def __init__(self, n_tasks: int):
        """
        Args:
            n_tasks: Total number of tasks in sequence
        """
        self.n_tasks = n_tasks

        # Accuracy matrix: A[i,j] = accuracy on task i after training on task j
        # Initialize with NaN to distinguish "not evaluated" from "zero accuracy"
        self.accuracy_matrix = np.full((n_tasks, n_tasks), np.nan)

        # Random baseline for comparison
        self.random_baseline = np.zeros(n_tasks)

        # Joint training baseline (upper bound)
        self.joint_baseline = np.zeros(n_tasks)

    def record(
        self,
        model: nn.Module,
        task_dataloaders: List[DataLoader],
        current_task: int,
        device: torch.device
    ) -> None:
        """
        Record accuracies after training on current_task.

        Evaluates on ALL tasks (past, present, future) to capture
        full transfer dynamics.
        """
        model.eval()

        with torch.no_grad():
            for task_id, loader in enumerate(task_dataloaders):
                correct = 0
                total = 0

                for inputs, targets in loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    outputs = model(inputs)
                    _, predicted = outputs.max(1)
                    correct += predicted.eq(targets).sum().item()
                    total += targets.size(0)

                self.accuracy_matrix[task_id, current_task] = correct / total

    def get_matrix(self) -> np.ndarray:
        """Return the full accuracy matrix."""
        return self.accuracy_matrix.copy()

    def visualize(
        self,
        save_path: str = None,
        title: str = "Continual Learning Accuracy Matrix"
    ) -> None:
        """
        Visualize accuracy matrix as heatmap.

        Color coding:
        - Diagonal: Task trained at this step
        - Upper triangle: Potential forgetting region
        - Lower triangle: Forward transfer region
        """
        fig, ax = plt.subplots(figsize=(10, 8))

        # Plot heatmap
        im = ax.imshow(self.accuracy_matrix, cmap='RdYlGn', vmin=0, vmax=1)

        # Add colorbar
        cbar = ax.figure.colorbar(im, ax=ax)
        cbar.ax.set_ylabel("Accuracy", rotation=-90, va="bottom")

        # Labels
        ax.set_xticks(np.arange(self.n_tasks))
        ax.set_yticks(np.arange(self.n_tasks))
        ax.set_xticklabels([f'After T{i+1}' for i in range(self.n_tasks)])
        ax.set_yticklabels([f'Task {i+1}' for i in range(self.n_tasks)])

        # Rotate x labels
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

        # Add text annotations
        for i in range(self.n_tasks):
            for j in range(self.n_tasks):
                val = self.accuracy_matrix[i, j]
                if not np.isnan(val):
                    color = 'white' if val < 0.5 else 'black'
                    ax.text(j, i, f'{val:.2f}',
                            ha='center', va='center', color=color)

        ax.set_title(title)
        ax.set_xlabel("Training Progression")
        ax.set_ylabel("Evaluation Task")

        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()


def explain_matrix_regions():
    """Visual explanation of accuracy matrix regions."""
    print("ACCURACY MATRIX INTERPRETATION")
    print("=" * 60)
    print()
    print("Matrix A[i,j] = Accuracy on Task i after training on Task j")
    print()
    print("        After T1  After T2  After T3  After T4  After T5")
    print("       ┌─────────┬─────────┬─────────┬─────────┬─────────┐")
    print("Task1  │  DIAG   │  UPPER  │  UPPER  │  UPPER  │  UPPER  │")
    print("       │ (peak)  │ (forg)  │ (forg)  │ (forg)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task2  │  LOWER  │  DIAG   │  UPPER  │  UPPER  │  UPPER  │")
    print("       │ (fwd)   │ (peak)  │ (forg)  │ (forg)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task3  │  LOWER  │  LOWER  │  DIAG   │  UPPER  │  UPPER  │")
    print("       │ (fwd)   │ (fwd)   │ (peak)  │ (forg)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task4  │  LOWER  │  LOWER  │  LOWER  │  DIAG   │  UPPER  │")
    print("       │ (fwd)   │ (fwd)   │ (fwd)   │ (peak)  │ (forg)  │")
    print("       ├─────────┼─────────┼─────────┼─────────┼─────────┤")
    print("Task5  │  LOWER  │  LOWER  │  LOWER  │  LOWER  │  DIAG   │")
    print("       │ (fwd)   │ (fwd)   │ (fwd)   │ (fwd)   │ (peak)  │")
    print("       └─────────┴─────────┴─────────┴─────────┴─────────┘")
    print()
    print("DIAG  = Diagonal: Peak accuracy right after training")
    print("UPPER = Upper triangle: Shows forgetting over time")
    print("LOWER = Lower triangle: Shows forward transfer (often 0)")


explain_matrix_regions()
```

From the accuracy matrix, we derive several standard metrics that capture different aspects of continual learning performance.
```python
import numpy as np
from typing import Dict, Optional
from dataclasses import dataclass


@dataclass
class ContinualMetrics:
    """Container for all continual learning metrics."""
    average_accuracy: float
    average_forgetting: float
    learning_accuracy: float
    forward_transfer: float
    backward_transfer: float
    final_accuracies: np.ndarray
    forgetting_per_task: np.ndarray

    def summary(self) -> str:
        """Human-readable summary of metrics."""
        return f"""Continual Learning Metrics Summary
===================================
Average Accuracy (ACC):  {self.average_accuracy:.4f}
Average Forgetting (F):  {self.average_forgetting:.4f}
Learning Accuracy (LA):  {self.learning_accuracy:.4f}
Forward Transfer (FWT):  {self.forward_transfer:.4f}
Backward Transfer (BWT): {self.backward_transfer:.4f}

Per-Task Final Accuracies:
{self._format_array(self.final_accuracies)}

Per-Task Forgetting:
{self._format_array(self.forgetting_per_task)}
"""

    def _format_array(self, arr: np.ndarray) -> str:
        return "  " + "  ".join([f"T{i+1}: {v:.3f}" for i, v in enumerate(arr)])


class MetricsCalculator:
    """
    Calculate all standard continual learning metrics from accuracy matrix.

    All formulas follow the conventions from:
    - López-Paz & Ranzato, "Gradient Episodic Memory" (2017)
    - Díaz-Rodríguez et al., "Don't forget, there is more..." (2018)
    """

    @staticmethod
    def compute_all(
        accuracy_matrix: np.ndarray,
        random_baseline: Optional[np.ndarray] = None
    ) -> ContinualMetrics:
        """
        Compute all metrics from accuracy matrix.

        Args:
            accuracy_matrix: A[i,j] = accuracy on task i after training on task j
            random_baseline: Baseline accuracy per task (defaults to 0)

        Returns:
            ContinualMetrics containing all computed values
        """
        n = accuracy_matrix.shape[0]

        if random_baseline is None:
            random_baseline = np.zeros(n)

        # Average Accuracy: mean of final column
        final_accuracies = accuracy_matrix[:, -1]
        average_accuracy = np.mean(final_accuracies)

        # Learning Accuracy: mean of diagonal
        learning_accuracy = np.mean(np.diag(accuracy_matrix))

        # Forgetting: peak accuracy minus final accuracy
        forgetting_per_task = np.zeros(n - 1)
        for i in range(n - 1):
            # Peak accuracy on task i after it was trained (excluding the final step)
            peak = np.max(accuracy_matrix[i, i:-1])
            final = accuracy_matrix[i, -1]
            forgetting_per_task[i] = max(0, peak - final)
        average_forgetting = np.mean(forgetting_per_task)

        # Forward Transfer: performance on task i before training on it
        forward_transfers = []
        for i in range(1, n):
            # A[i, i-1] is accuracy on task i right before training on it
            # Compare to baseline
            fwt_i = accuracy_matrix[i, i-1] - random_baseline[i]
            forward_transfers.append(fwt_i)
        forward_transfer = np.mean(forward_transfers) if forward_transfers else 0.0

        # Backward Transfer: change in task i's accuracy after training on later tasks
        backward_transfers = []
        for i in range(n - 1):
            # Final accuracy minus accuracy right after training
            bwt_i = accuracy_matrix[i, -1] - accuracy_matrix[i, i]
            backward_transfers.append(bwt_i)
        backward_transfer = np.mean(backward_transfers) if backward_transfers else 0.0

        return ContinualMetrics(
            average_accuracy=average_accuracy,
            average_forgetting=average_forgetting,
            learning_accuracy=learning_accuracy,
            forward_transfer=forward_transfer,
            backward_transfer=backward_transfer,
            final_accuracies=final_accuracies,
            forgetting_per_task=forgetting_per_task
        )

    @staticmethod
    def compute_area_under_curve(accuracy_matrix: np.ndarray) -> float:
        """
        Compute Area Under the Accuracy Curve (AUAC).

        An alternative to final accuracy that accounts for the
        full trajectory of learning.

        AUAC = (1/n) * Σ_t Average_Accuracy_at_step_t
        """
        n = accuracy_matrix.shape[0]
        auac = 0.0
        for t in range(n):
            # Average accuracy on tasks 1..t+1 after training on task t+1
            avg_at_t = np.mean(accuracy_matrix[:t+1, t])
            auac += avg_at_t
        return auac / n

    @staticmethod
    def compute_intransigence(
        accuracy_matrix: np.ndarray,
        joint_training_accuracy: np.ndarray
    ) -> float:
        """
        Compute Intransigence: inability to learn new tasks.

        Intransigence = Joint_accuracy - CL_accuracy_on_each_task

        High intransigence means CL struggles to learn tasks that
        joint training can learn well.
        """
        n = accuracy_matrix.shape[0]
        intransigence = np.zeros(n)
        for i in range(n):
            # Compare diagonal (CL peak) to joint training
            intransigence[i] = joint_training_accuracy[i] - accuracy_matrix[i, i]
        return np.mean(np.maximum(0, intransigence))


def demonstrate_metrics():
    """Demonstrate metric calculations on example matrix."""
    # Example accuracy matrix (5 tasks)
    # Showing typical forgetting pattern
    A = np.array([
        [0.95, 0.72, 0.55, 0.42, 0.35],  # Task 1 degrades over time
        [0.00, 0.93, 0.78, 0.65, 0.52],  # Task 2 (0 before trained)
        [0.00, 0.00, 0.91, 0.75, 0.63],  # Task 3
        [0.00, 0.00, 0.00, 0.94, 0.78],  # Task 4
        [0.00, 0.00, 0.00, 0.00, 0.96],  # Task 5 (current)
    ])

    metrics = MetricsCalculator.compute_all(A)
    print(metrics.summary())

    # Also compute AUAC
    auac = MetricsCalculator.compute_area_under_curve(A)
    print(f"Area Under Accuracy Curve: {auac:.4f}")


demonstrate_metrics()
```

A method can excel on one metric while failing on another. High learning accuracy (LA) with strongly negative backward transfer (BWT) suggests the model learns each task well but forgets it quickly. Conversely, a low forgetting score paired with low LA can simply mean the model never learned each task well in the first place, so there was little to forget. Always report multiple metrics.
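To make this concrete, here is a small illustrative sketch. The two matrices are invented for illustration, and the snippet assumes the `MetricsCalculator` class defined above is in scope: two 3-task runs with similar final average accuracy can have very different learning-accuracy and forgetting profiles.

```python
import numpy as np

# Two hypothetical 3-task accuracy matrices with similar final average
# accuracy but very different dynamics (values invented for illustration).
A_plastic = np.array([
    [0.95, 0.60, 0.50],   # learns each task well, then forgets heavily
    [0.00, 0.94, 0.62],
    [0.00, 0.00, 0.93],
])
A_stable = np.array([
    [0.70, 0.69, 0.68],   # learns less, but barely forgets
    [0.00, 0.71, 0.70],
    [0.00, 0.00, 0.72],
])

# MetricsCalculator is defined in the block above.
for name, M in [("plastic", A_plastic), ("stable", A_stable)]:
    m = MetricsCalculator.compute_all(M)
    print(f"{name:>8}: ACC={m.average_accuracy:.3f}  "
          f"LA={m.learning_accuracy:.3f}  "
          f"F={m.average_forgetting:.3f}  "
          f"BWT={m.backward_transfer:.3f}")
```

Both runs end with roughly 0.68-0.70 average accuracy, yet one has high LA and high forgetting while the other has low LA and almost none, which is exactly the distinction a single final number hides.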
The field has converged on several standard benchmarks that enable fair comparison across methods. Understanding these benchmarks is essential for interpreting the literature.
| Benchmark | Base Dataset | Tasks | Scenario | Challenge Level |
|---|---|---|---|---|
| Permuted MNIST | MNIST | 10-20 permutations | Domain-IL | Moderate |
| Rotated MNIST | MNIST | 20 rotations | Domain-IL | Moderate |
| Split MNIST | MNIST | 5 (2 classes each) | Class-IL/Task-IL | Easy |
| Split CIFAR-100 | CIFAR-100 | 10-20 (5-10 classes each) | Class-IL | Hard |
| Split TinyImageNet | TinyImageNet | 10 (20 classes each) | Class-IL | Very Hard |
| CORe50 | Object images | 50 classes, 11 sessions | Multiple | Realistic |
| Split MiniImageNet | MiniImageNet | 20 (5 classes each) | Class-IL | Hard |
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, Subset
from torchvision import datasets, transforms
import numpy as np
from typing import List, Tuple, Dict


class PermutedMNIST:
    """
    Permuted MNIST benchmark.

    Each task applies a different fixed permutation to pixel positions.
    All tasks have the same class structure (0-9), but entirely
    different visual patterns.

    Tests: Ability to learn with no shared visual features
    Scenario: Domain-incremental (same output, different input distribution)
    """

    def __init__(
        self,
        n_tasks: int = 10,
        seed: int = 42,
        data_dir: str = './data'
    ):
        self.n_tasks = n_tasks
        self.seed = seed

        # Generate fixed permutations for each task
        rng = np.random.RandomState(seed)
        self.permutations = [
            torch.LongTensor(rng.permutation(784))
            for _ in range(n_tasks)
        ]

        # Load MNIST once
        self.transform = transforms.ToTensor()
        self.train_data = datasets.MNIST(
            data_dir, train=True, download=True, transform=self.transform
        )
        self.test_data = datasets.MNIST(
            data_dir, train=False, transform=self.transform
        )

    def get_task_data(
        self,
        task_id: int,
        train: bool = True
    ) -> Dataset:
        """Get dataset for specific task with permutation applied."""
        base_data = self.train_data if train else self.test_data
        permutation = self.permutations[task_id]
        return PermutedDataset(base_data, permutation)


class PermutedDataset(Dataset):
    """Wrapper that applies permutation to a base dataset."""

    def __init__(self, base_dataset: Dataset, permutation: torch.Tensor):
        self.base = base_dataset
        self.permutation = permutation

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]
        # Flatten, permute, reshape
        flat = img.view(-1)
        permuted = flat[self.permutation]
        return permuted.view(1, 28, 28), label


class SplitDataset:
    """
    Split dataset benchmark (works with MNIST, CIFAR, etc.).

    Divides classes into disjoint subsets, each becoming a task.

    Tests: Class-incremental learning
    Scenario: Task-IL (if task ID given) or Class-IL (if not)
    """

    def __init__(
        self,
        base_dataset: Dataset,
        n_tasks: int,
        classes_per_task: int = None,
        shuffle_classes: bool = True,
        seed: int = 42
    ):
        self.base_dataset = base_dataset
        self.n_tasks = n_tasks

        # Get all unique labels
        all_labels = set()
        for _, label in base_dataset:
            all_labels.add(label if isinstance(label, int) else label.item())
        all_labels = sorted(list(all_labels))

        n_classes = len(all_labels)
        classes_per_task = classes_per_task or n_classes // n_tasks

        # Optionally shuffle class order
        if shuffle_classes:
            rng = np.random.RandomState(seed)
            all_labels = rng.permutation(all_labels).tolist()

        # Assign classes to tasks
        self.task_classes: List[List[int]] = []
        for t in range(n_tasks):
            start = t * classes_per_task
            end = start + classes_per_task
            self.task_classes.append(all_labels[start:end])

        # Pre-compute indices per task
        self.task_indices = self._compute_task_indices()

    def _compute_task_indices(self) -> Dict[int, List[int]]:
        """Pre-compute which dataset indices belong to each task."""
        indices = {t: [] for t in range(self.n_tasks)}

        for idx in range(len(self.base_dataset)):
            _, label = self.base_dataset[idx]
            label = label if isinstance(label, int) else label.item()

            for t, classes in enumerate(self.task_classes):
                if label in classes:
                    indices[t].append(idx)
                    break

        return indices

    def get_task_data(self, task_id: int) -> Subset:
        """Get subset of data for specific task."""
        return Subset(self.base_dataset, self.task_indices[task_id])

    def get_label_mapping(self, task_id: int) -> Dict[int, int]:
        """
        Get mapping from original labels to task-local labels.

        For Task-IL, we typically remap to 0..k-1 within each task.
        """
        mapping = {}
        for new_label, original_label in enumerate(self.task_classes[task_id]):
            mapping[original_label] = new_label
        return mapping


class ContinualLearningScenarios:
    """
    The three standard continual learning scenarios.

    These differ in what information is available at test time.
    """

    @staticmethod
    def task_incremental(description=True):
        """
        Task-Incremental Learning (Task-IL):
        - Task identity provided at both train and test time
        - Model knows which 'head' to use for each sample
        - Easiest scenario

        Example: Multi-head classifier with task-specific output layers
        """
        if description:
            return """
            Task-IL: Task identity available at test time.

            Setup: Separate output head per task, or task embedding input
            Test: Model told "this is task 3" → use head 3
            Challenge: Must learn tasks well; forgetting within shared layers
            Metrics: Per-task accuracy is natural
            """

    @staticmethod
    def domain_incremental(description=True):
        """
        Domain-Incremental Learning (Domain-IL):
        - Task identity NOT required at test (tasks share output structure)
        - New domains for the same underlying problem

        Example: Sentiment analysis on different text sources
        """
        if description:
            return """
            Domain-IL: Same output space, different input distributions.

            Setup: Single output head, shared across domains
            Test: Model must work on any domain without being told which
            Challenge: Handle distribution shift without forgetting
            Metrics: Average accuracy across domains
            """

    @staticmethod
    def class_incremental(description=True):
        """
        Class-Incremental Learning (Class-IL):
        - Task identity NOT provided at test
        - Output space grows (new classes added)
        - Must distinguish ALL classes ever seen

        Example: Object recognition that learns new categories over time
        """
        if description:
            return """
            Class-IL: New classes added over time, no task labels at test.

            Setup: Output layer expands as classes are added
            Test: Model must classify into ALL seen classes
            Challenge: Distinguish old from new without task oracle
            Metrics: Accuracy on unified classification problem

            THIS IS THE HARDEST SCENARIO.
            """


# Quick benchmark creation utilities
def create_split_mnist(n_tasks: int = 5) -> SplitDataset:
    """Create Split MNIST benchmark."""
    mnist = datasets.MNIST('./data', train=True, download=True,
                           transform=transforms.ToTensor())
    return SplitDataset(mnist, n_tasks=n_tasks, classes_per_task=2)


def create_split_cifar100(n_tasks: int = 20) -> SplitDataset:
    """Create Split CIFAR-100 benchmark."""
    cifar = datasets.CIFAR100('./data', train=True, download=True,
                              transform=transforms.ToTensor())
    return SplitDataset(cifar, n_tasks=n_tasks, classes_per_task=5)
```

Methods often report dramatically different results on Task-IL vs Class-IL. A method achieving 90% on Split CIFAR-100 (Task-IL) might achieve only 40% on the same data in Class-IL mode. Always check which scenario is being evaluated when comparing methods.
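Much of that gap comes down to what the evaluation allows the model to predict: in Task-IL, predictions are restricted to the known task's classes, while in Class-IL the model must choose among every class seen so far. The sketch below illustrates the difference with plain PyTorch tensor operations; the `logits`, `task_classes`, and `seen_classes` values are hypothetical and not tied to any specific benchmark code above.

```python
import torch


def predict_task_il(logits: torch.Tensor, task_classes: list) -> torch.Tensor:
    """Task-IL: task identity is known, so mask logits to that task's classes."""
    masked = torch.full_like(logits, float('-inf'))
    masked[:, task_classes] = logits[:, task_classes]
    return masked.argmax(dim=1)


def predict_class_il(logits: torch.Tensor, seen_classes: list) -> torch.Tensor:
    """Class-IL: no task oracle; choose among every class seen so far."""
    masked = torch.full_like(logits, float('-inf'))
    masked[:, seen_classes] = logits[:, seen_classes]
    return masked.argmax(dim=1)


# Hypothetical logits over 10 classes for a batch of 2 samples.
logits = torch.randn(2, 10)
print(predict_task_il(logits, task_classes=[4, 5]))           # only classes 4-5 possible
print(predict_class_il(logits, seen_classes=list(range(6))))  # classes 0-5 possible
```

Reporting a Task-IL number as if it were Class-IL (or vice versa) is one of the most common sources of inflated comparisons in the literature.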
Rigorous experimental design is crucial in continual learning research, where many factors can confound results.
Critical Experimental Controls:

1. Multiple Seeds: Run each experiment with 3-5 different random seeds. Report mean and standard deviation. Continual learning results can be highly variable.

2. Task Order Sensitivity: Some methods are sensitive to task order. Consider evaluating on multiple random orderings (see the sketch after this list).

3. Hyperparameter Fairness: Each method may have optimal hyperparameters. Either tune all methods fairly (same budget) or use published values consistently.

4. Architectural Consistency: Use the same base architecture across methods. Architecture differences can confound comparisons.

5. Epoch Standardization: Train for the same number of epochs per task, or use validation-based early stopping with the same protocol.
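As a minimal sketch of control 2, one might generate several random task orderings and repeat the full run for each (seed, order) pair, then report mean ± std across all of them. `run_sequence` here is a placeholder for whatever training loop you use, not a function defined in this module.

```python
import numpy as np


def make_task_orders(n_tasks: int, n_orders: int, seed: int = 0) -> list:
    """Generate several random task orderings to probe order sensitivity."""
    rng = np.random.RandomState(seed)
    return [rng.permutation(n_tasks).tolist() for _ in range(n_orders)]


seeds = [1, 42, 123]
orders = make_task_orders(n_tasks=5, n_orders=3)

for seed in seeds:
    for order in orders:
        # run_sequence(seed, order) would train on the tasks in this order and
        # return an accuracy matrix; it is a hypothetical placeholder here.
        print(f"seed={seed}, task order={order}")
```

The framework below puts several of these controls (seeds, baselines, statistical testing) into a reusable structure.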
```python
import numpy as np
from typing import List, Dict, Callable, Any
from dataclasses import dataclass
import json
from scipy import stats


@dataclass
class ExperimentConfig:
    """Configuration for a rigorous CL experiment."""
    # Method identification
    method_name: str
    method_params: Dict[str, Any]

    # Data configuration
    benchmark: str
    n_tasks: int
    scenario: str  # 'task_il', 'class_il', 'domain_il'

    # Training configuration
    epochs_per_task: int
    batch_size: int
    learning_rate: float

    # Architecture
    model_class: str
    model_params: Dict[str, Any]

    # Reproducibility
    seeds: List[int]

    def to_json(self) -> str:
        return json.dumps(self.__dict__, indent=2)


class RigorousExperiment:
    """
    Framework for running rigorous continual learning experiments.

    Ensures proper baselines, multiple seeds, and statistical testing.
    """

    def __init__(self, config: ExperimentConfig):
        self.config = config
        self.results: Dict[int, Dict[str, Any]] = {}  # seed -> results

    def run_single_seed(
        self,
        seed: int,
        method_fn: Callable,
        baseline_fns: Dict[str, Callable]
    ) -> Dict[str, Any]:
        """
        Run experiment with single seed.

        Returns dict with method results and all baseline results.
        """
        import torch
        import random

        # Set all random seeds
        torch.manual_seed(seed)
        np.random.seed(seed)
        random.seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(seed)

        results = {'seed': seed}

        # Run main method
        results['method'] = method_fn(self.config)

        # Run all baselines with same seed
        for baseline_name, baseline_fn in baseline_fns.items():
            results[baseline_name] = baseline_fn(self.config)

        return results

    def run_all_seeds(
        self,
        method_fn: Callable,
        baseline_fns: Dict[str, Callable]
    ) -> None:
        """Run experiment across all configured seeds."""
        for seed in self.config.seeds:
            print(f"  Running seed {seed}...")
            self.results[seed] = self.run_single_seed(
                seed, method_fn, baseline_fns
            )

    def aggregate_results(self) -> Dict[str, Dict[str, float]]:
        """
        Aggregate results across seeds.

        Returns mean ± std for each metric for each method.
        """
        aggregated = {}

        # Collect all method names
        methods = list(self.results[self.config.seeds[0]].keys())
        methods = [m for m in methods if m != 'seed']

        for method in methods:
            method_results = []
            for seed in self.config.seeds:
                if method in self.results[seed]:
                    method_results.append(self.results[seed][method])

            # Aggregate each metric
            if method_results:
                aggregated[method] = self._aggregate_metric_dict(method_results)

        return aggregated

    def _aggregate_metric_dict(
        self,
        results_list: List[Dict]
    ) -> Dict[str, str]:
        """Aggregate a list of metric dicts into mean±std."""
        if not results_list:
            return {}

        aggregated = {}
        for key in results_list[0].keys():
            if isinstance(results_list[0][key], (int, float)):
                values = [r[key] for r in results_list]
                mean = np.mean(values)
                std = np.std(values)
                aggregated[key] = f"{mean:.4f} ± {std:.4f}"

        return aggregated

    def statistical_comparison(
        self,
        method1: str,
        method2: str,
        metric: str
    ) -> Dict[str, Any]:
        """
        Perform statistical test comparing two methods.

        Uses paired t-test since same seeds used for both.
        """
        values1 = [self.results[s][method1][metric] for s in self.config.seeds]
        values2 = [self.results[s][method2][metric] for s in self.config.seeds]

        t_stat, p_value = stats.ttest_rel(values1, values2)

        return {
            'method1_mean': np.mean(values1),
            'method2_mean': np.mean(values2),
            'difference': np.mean(values1) - np.mean(values2),
            't_statistic': t_stat,
            'p_value': p_value,
            'significant_at_005': p_value < 0.05,
            'significant_at_001': p_value < 0.01
        }


def example_experiment_report():
    """Example of a proper experiment report."""
    print("""
===============================================================================
EXPERIMENT REPORT: EWC vs Baseline Methods on Split CIFAR-100
===============================================================================

Configuration:
- Benchmark: Split CIFAR-100 (20 tasks, 5 classes each)
- Scenario: Class-Incremental Learning
- Architecture: ResNet-18
- Training: 20 epochs/task, batch size 64, LR 0.001
- Seeds: 1, 42, 123, 456, 789

Results (mean ± std over 5 seeds):
---------------------------------------------------------------------------
Method            | Avg Acc ↑    | Forgetting ↓ | Learn Acc ↑  | BWT ↑
---------------------------------------------------------------------------
Fine-tuning       | 0.2156±0.021 | 0.6523±0.034 | 0.7234±0.018 | -0.5078±0.029
EWC (λ=1000)      | 0.3892±0.033 | 0.4012±0.041 | 0.6678±0.025 | -0.2786±0.035
EWC (λ=5000)      | 0.4234±0.028 | 0.3456±0.037 | 0.6345±0.031 | -0.2111±0.024
Replay (2000 buf) | 0.5123±0.019 | 0.2234±0.023 | 0.6789±0.022 | -0.1666±0.018
Joint Training    | 0.7456±0.011 | 0.0000±0.000 | 0.7456±0.011 | +0.0000±0.000
---------------------------------------------------------------------------

Statistical Tests (EWC λ=5000 vs Fine-tuning):
- Average Accuracy: p < 0.001 (significant improvement)
- Forgetting: p < 0.001 (significant reduction)

Key Findings:
1. EWC significantly outperforms fine-tuning baseline
2. Replay still superior to EWC with same memory budget
3. Gap to joint training (upper bound) remains substantial
4. Higher λ reduces forgetting but also reduces learning accuracy

Reproducibility:
- Code: github.com/example/continual-learning
- Weights: Available upon request
- Random seeds ensure reproducibility
""")


example_experiment_report()
```

Many published results suffer from subtle evaluation issues that make comparison difficult. Being aware of these pitfalls helps in both interpreting the literature and designing your own experiments.
In class-incremental learning, the full test set contains classes the model has not yet been trained on. Some evaluation protocols inadvertently expose these future classes, for example by evaluating mid-sequence on the complete label set. Ensure your evaluation code never gives the model access to data or labels from tasks it has not yet seen.
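One defensive pattern, sketched below under the assumption that `task_classes[t]` lists the classes introduced at task `t` (as in the `SplitDataset` above), is to evaluate at step `t` only on test samples whose labels belong to tasks already seen.

```python
from typing import List


def seen_classes_at_step(task_classes: List[List[int]], current_task: int) -> List[int]:
    """Classes introduced in tasks 0..current_task (inclusive)."""
    seen = []
    for t in range(current_task + 1):
        seen.extend(task_classes[t])
    return seen


def filter_eval_indices(labels: List[int], seen: List[int]) -> List[int]:
    """Keep only test samples whose label has already been introduced."""
    seen_set = set(seen)
    return [i for i, y in enumerate(labels) if y in seen_set]


# Example with 3 tasks of 2 classes each (labels invented for illustration).
task_classes = [[0, 1], [2, 3], [4, 5]]
test_labels = [0, 3, 5, 1, 4, 2]

seen = seen_classes_at_step(task_classes, current_task=1)  # [0, 1, 2, 3]
print(filter_eval_indices(test_labels, seen))              # indices of samples with labels 0-3
```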
The standard accuracy-based metrics capture only part of what matters in real continual learning systems. Production deployments have additional concerns, such as computational cost, memory growth as tasks accumulate, and robustness to task order and hyperparameter choices.
Toward Comprehensive Evaluation:

A truly comprehensive evaluation would include:

1. Standard metrics (accuracy, forgetting, transfer) on multiple benchmarks
2. Multiple scenarios (Task-IL, Class-IL, Domain-IL)
3. Computational cost analysis (FLOPs, wall time, memory)—see the sketch after this list
4. Scalability analysis (performance vs. number of tasks)
5. Robustness analysis (task order sensitivity, hyperparameter sensitivity)
6. Statistical significance testing
7. Comparison to proper baselines (fine-tuning, joint training)

Few papers achieve all of this, but striving toward comprehensive evaluation improves the field.
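As one small step toward item 3, the sketch below records wall-clock time and a rough memory footprint after each task. The `train_one_task` call and `replay_buffer_bytes` argument are hypothetical placeholders rather than parts of any established API.

```python
import time
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def track_cost(model: nn.Module, n_tasks: int, replay_buffer_bytes: int = 0) -> list:
    """Record wall time and memory footprint after each task (sketch)."""
    records = []
    for t in range(n_tasks):
        start = time.perf_counter()
        # train_one_task(model, t)  # placeholder for the actual training loop
        elapsed = time.perf_counter() - start
        records.append({
            'task': t,
            'wall_time_s': elapsed,
            'n_parameters': count_parameters(model),  # grows for dynamic architectures
            'buffer_bytes': replay_buffer_bytes,      # grows for replay methods
        })
    return records


# Example with a small stand-in model.
costs = track_cost(nn.Linear(784, 10), n_tasks=3)
print(costs[0])
```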
We have covered the foundations of rigorous evaluation in continual learning: the accuracy matrix and the metrics derived from it, standard benchmarks and the three evaluation scenarios, experimental design with proper baselines and statistical testing, and the common pitfalls that undermine comparisons.
Congratulations! You have completed the comprehensive module on Continual Learning. You now understand catastrophic forgetting and the stability-plasticity dilemma, regularization approaches (EWC, SI, MAS, LwF), replay methods (experience replay, generative replay, DER), dynamic architectures (progressive networks, parameter isolation, sparse networks), and rigorous evaluation protocols. This knowledge positions you to both understand the research literature and build practical continual learning systems.