When working with time series data—stock prices, weather patterns, user behavior logs, sensor readings, or any data where observations are ordered chronologically—standard cross-validation methods like k-fold CV become fundamentally flawed. The core issue is temporal data leakage: using future information to predict the past.
Forward chaining (also called walk-forward validation or time series split) is the foundational solution to this problem. It is the most principled approach to validating predictive models on temporal data, ensuring that your performance estimates reflect how the model will actually perform when deployed.
This page provides a comprehensive treatment of forward chaining—from theoretical foundations to implementation details, covering why it works, when to use it, and how to avoid common pitfalls that invalidate your temporal validation.
By the end of this page, you will understand the fundamental problem with standard CV on time series, master the forward chaining algorithm, learn to configure fold sizes and growth strategies, and recognize scenarios where forward chaining is essential versus optional.
Standard k-fold cross-validation randomly shuffles data and splits it into folds. Each fold takes a turn as the validation set while the remaining folds form the training set. This works beautifully for i.i.d. (independent and identically distributed) data—but time series data is explicitly not i.i.d.
The fundamental violations:
**1. Temporal Dependence (Autocorrelation).** Time series observations are correlated with nearby observations. Today's stock price is correlated with yesterday's; this hour's temperature is correlated with the previous hour's. When random shuffling mixes temporally adjacent points into different folds, training and validation sets share information through these correlations—inflating performance estimates.

**2. Future Information Leakage.** In random k-fold, a training set might include observations from January and March while validating on February. The model literally trains on the future to predict the past. Any model that captures trend or seasonality will exploit this leakage, producing validation scores that cannot be reproduced in production.

**3. Non-Stationarity.** Many time series exhibit changing statistical properties over time—means shift, variances change, relationships evolve. A randomly shuffled CV set will smooth over these changes, training on a mixture of regimes that never exist simultaneously in practice.
| Aspect | Standard K-Fold Assumption | Time Series Reality |
|---|---|---|
| Data ordering | Observations are exchangeable | Order carries critical information |
| Independence | Observations are independent | Strong temporal autocorrelation |
| Distribution | Stationary distribution | Often non-stationary, regime changes |
| Information flow | No directional dependency | Past predicts future, not reverse |
| Leakage risk | Low (data is i.i.d.) | High (future in training set) |
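The information-flow violation in the last two rows can be checked mechanically from the split indices alone. This sketch uses scikit-learn's `KFold` and `TimeSeriesSplit` to test whether any fold's training set contains observations that come after the start of that fold's validation window:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n = 100
X = np.zeros((n, 1))  # values don't matter here; only the index ordering does

def trains_on_the_future(splitter, X) -> bool:
    """True if any fold's training set contains indices later than the start
    of that fold's validation set."""
    return any(train.max() > test.min() for train, test in splitter.split(X))

shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
ordered = TimeSeriesSplit(n_splits=5)

print(trains_on_the_future(shuffled, X))  # True: training sees the future
print(trains_on_the_future(ordered, X))   # False: training always precedes validation
```

Shuffled k-fold fails this check by construction: any training fold that excludes the last observation necessarily contains indices past its own validation window.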
Models validated with standard CV on time series routinely show 20-50% better performance than they achieve in production. This isn't a minor calibration issue—it's a fundamental methodological failure that leads to deploying models that lose money, miss anomalies, or produce systematically wrong predictions.
A concrete example of the disaster:
Consider predicting daily stock returns. You have data from January 1 to December 31. With shuffled 5-fold CV, each training fold mixes dates from across the whole year: the model may train on May and July observations while being validated on June.

Your model learns that "when the 30-day moving average on June 21 looks like X, the June 22 return is Y." But if returns are computed forward-looking (the return at day t defined from the price at day t+1, a common alignment bug), the 30-day moving average on June 21 silently includes June 22's price. The model is being trained on information it cannot have in production.
The validation score looks excellent. The production performance is catastrophic.
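The alignment bug is easy to demonstrate on a hypothetical price series. The sketch below computes a "30-day moving average ending at day t" from forward-aligned returns, then perturbs only the next day's price and shows that the day-t feature changes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(0.01 * rng.standard_normal(60))))

def ma_of_forward_returns(p: pd.Series, t: int, window: int = 30) -> float:
    """Moving average 'ending at day t' of forward-aligned returns.

    fwd[t] = (p[t+1] - p[t]) / p[t], so the feature at day t already
    depends on day t+1's price -- the leakage described in the text."""
    fwd = p.shift(-1) / p - 1.0
    return fwd.iloc[t - window + 1 : t + 1].mean()

t = 40
before = ma_of_forward_returns(prices, t)

prices2 = prices.copy()
prices2.iloc[t + 1] *= 1.10  # perturb ONLY the next day's price

after = ma_of_forward_returns(prices2, t)
print(before != after)  # True: a "day t" feature moved when day t+1 changed
```

A correctly aligned feature (`p.pct_change()`, which uses only prices up to day t) would be unchanged by the same perturbation.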
Forward chaining enforces a simple but powerful constraint: the training set always precedes the validation set in time. This mirrors how models are actually used—trained on historical data to predict future observations.
The Algorithm:
Given a time series of T observations ordered chronologically:

1. Choose an initial training size n_min and a validation size h.
2. Fold 1: train on observations 1…n_min; validate on observations n_min+1…n_min+h.
3. Fold 2: extend the training set to observations 1…n_min+h; validate on the next h observations.
4. Continue, growing the training window by h each fold, until the series is exhausted.
The key insight: Each fold simulates a realistic deployment scenario. The model only sees what it would see in production—past data—and is evaluated on genuinely future observations.
```python
import numpy as np
from typing import Generator, Tuple

def forward_chain_splits(
    n_samples: int,
    n_splits: int = 5,
    min_train_size: int = None,
    test_size: int = None,
    gap: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Generate forward chaining (time series) cross-validation splits.

    Parameters
    ----------
    n_samples : int
        Total number of observations in the time series
    n_splits : int
        Number of validation folds to generate
    min_train_size : int, optional
        Minimum size of training set for first fold.
        Defaults to n_samples // (n_splits + 1)
    test_size : int, optional
        Size of each validation set.
        Defaults to n_samples // (n_splits + 1)
    gap : int
        Number of observations to exclude between train and test
        (embargo period for preventing leakage)

    Yields
    ------
    train_indices, test_indices : tuple of np.ndarray
        Indices for training and validation sets for each fold
    """
    if min_train_size is None:
        min_train_size = n_samples // (n_splits + 1)
    if test_size is None:
        test_size = n_samples // (n_splits + 1)

    # Validate parameters
    if min_train_size + test_size + gap > n_samples:
        raise ValueError(
            f"Not enough samples ({n_samples}) for min_train_size={min_train_size}, "
            f"test_size={test_size}, gap={gap}"
        )

    indices = np.arange(n_samples)

    for split_idx in range(n_splits):
        # Calculate training end point (grows with each split)
        train_end = min_train_size + split_idx * test_size

        # Ensure we don't exceed available data
        if train_end + gap + test_size > n_samples:
            break

        # Training set: all observations up to train_end
        train_indices = indices[:train_end]

        # Test set: observations after gap, of size test_size
        test_start = train_end + gap
        test_end = test_start + test_size
        test_indices = indices[test_start:test_end]

        yield train_indices, test_indices


def demonstrate_forward_chaining():
    """Visualize forward chaining splits."""
    n_samples = 100
    n_splits = 5

    print("Forward Chaining Visualization")
    print("=" * 60)
    print("Legend: [###] = Training | (***) = Test | ... = Unused")
    print("=" * 60)

    for fold, (train_idx, test_idx) in enumerate(
        forward_chain_splits(n_samples, n_splits), 1
    ):
        # Create visual representation
        visual = ['.'] * n_samples
        for i in train_idx:
            visual[i] = '#'
        for i in test_idx:
            visual[i] = '*'

        # Compress for display (show every 2nd position, max 30 chars)
        compressed = ''.join(visual[::2][:30]) if n_samples > 60 else ''.join(visual)

        print(f"Fold {fold}: Train[0:{train_idx[-1]+1}] Test[{test_idx[0]}:{test_idx[-1]+1}]")
        print(f"        {compressed}")
        print()

# Output:
# Forward Chaining Visualization
# ============================================================
# Legend: [###] = Training | (***) = Test | ... = Unused
# ============================================================
# Fold 1: Train[0:16] Test[16:32]
#         ########********..............
#
# Fold 2: Train[0:32] Test[32:48]
#         ################********......
# ...
```

Scikit-learn provides `TimeSeriesSplit`, which implements forward chaining. However, understanding the algorithm deeply—as shown above—is essential for customizing window sizes, adding embargo periods, and debugging validation pipelines.
The quality of forward chaining depends critically on how you configure the training and validation sizes. Poor choices lead to either unstable estimates (too few validation points) or unrealistic scenarios (training sets too small to learn patterns).
Key Configuration Parameters:
**1. Minimum Training Size (n_min).** The initial training set must be large enough to estimate the model's parameters stably and to contain at least one full repetition of any pattern (seasonal cycle, weekly rhythm) the model is expected to learn.
Rule of thumb: Start with at least 2-3 seasonal cycles or √(T) observations, whichever is larger.
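This rule of thumb can be written down directly (a sketch; the two-cycle default and the √T floor come from the guideline above):

```python
import math
from typing import Optional

def min_train_heuristic(n_obs: int, seasonal_period: Optional[int] = None,
                        n_cycles: int = 2) -> int:
    """Smallest reasonable initial training window: n_cycles full seasonal
    periods (when the period is known) or sqrt(T), whichever is larger."""
    floor = math.isqrt(n_obs)  # integer sqrt(T)
    if seasonal_period is None:
        return floor
    return max(n_cycles * seasonal_period, floor)

print(min_train_heuristic(1000, seasonal_period=12))              # max(24, 31) -> 31
print(min_train_heuristic(1000, seasonal_period=24, n_cycles=3))  # max(72, 31) -> 72
```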
**2. Validation/Test Size (h).** The validation window should match your actual prediction horizon: if the deployed model forecasts 7 days ahead, validate on 7-day windows.

Critical: a mismatch between h and the actual use case invalidates your performance estimates.

**3. Number of Folds (k).** More folds mean more validation points and more stable estimates, but also more training runs, and, for a fixed amount of data, a smaller initial training window.
| Data Characteristic | Recommendation | Rationale |
|---|---|---|
| Strong seasonality | min_train ≥ 2 seasonal periods | Model must see pattern repeat to learn it |
| High-frequency data (minute/tick) | Large min_train, small test | Need substantial history; predict short-term |
| Monthly/quarterly data | min_train ≥ 24+ months | Long cycles need long history |
| Regime changes expected | Smaller min_train | Don't over-anchor on old regimes |
| Stable, stationary series | Larger min_train, fewer folds | More data improves estimates |
```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ForwardChainConfig:
    """Configuration for forward chaining cross-validation."""
    min_train_size: int
    test_size: int
    n_splits: int
    gap: int = 0

    def validate(self, n_samples: int) -> None:
        """Validate configuration against data size."""
        required = self.min_train_size + self.n_splits * self.test_size + self.gap
        if required > n_samples:
            raise ValueError(
                f"Configuration requires {required} samples, but only {n_samples} available. "
                f"Reduce n_splits, test_size, or min_train_size."
            )

        # Warn about potential issues
        if self.min_train_size < 30:
            print("Warning: Very small initial training set may produce unstable models")
        if self.test_size < 10:
            print("Warning: Very small test size may produce high-variance fold scores")

        final_train_size = self.min_train_size + (self.n_splits - 1) * self.test_size
        print(f"Fold 1 training size: {self.min_train_size}")
        print(f"Fold {self.n_splits} training size: {final_train_size}")
        print(f"Training size growth ratio: {final_train_size / self.min_train_size:.2f}x")


def auto_configure(
    n_samples: int,
    seasonal_period: int = None,
    forecast_horizon: int = 1,
    desired_folds: int = 5
) -> ForwardChainConfig:
    """
    Automatically configure forward chaining based on data characteristics.

    Parameters
    ----------
    n_samples : int
        Total number of observations
    seasonal_period : int, optional
        Length of seasonal cycle (e.g., 12 for monthly data with annual seasonality)
    forecast_horizon : int
        How far ahead the model predicts
    desired_folds : int
        Target number of cross-validation folds

    Returns
    -------
    ForwardChainConfig : Configured validation setup
    """
    # Determine minimum training size
    if seasonal_period:
        # At least 2 seasonal cycles for pattern learning
        min_train = max(seasonal_period * 2, int(np.sqrt(n_samples)))
    else:
        # Default: 20% of data or sqrt(n), whichever is larger
        min_train = max(int(0.2 * n_samples), int(np.sqrt(n_samples)))

    # Test size matches forecast horizon (or multiples for efficiency)
    test_size = max(forecast_horizon, n_samples // (desired_folds + 3))

    # Calculate achievable folds
    remaining = n_samples - min_train
    n_splits = min(desired_folds, remaining // test_size)

    if n_splits < 2:
        raise ValueError(
            f"Insufficient data for meaningful cross-validation. "
            f"Need at least {min_train + 2 * test_size} samples."
        )

    return ForwardChainConfig(
        min_train_size=min_train,
        test_size=test_size,
        n_splits=n_splits
    )


# Example usage
config = auto_configure(
    n_samples=500,
    seasonal_period=12,   # Monthly data, annual seasonality
    forecast_horizon=3,   # 3-month ahead prediction
    desired_folds=5
)
config.validate(500)

# Output:
# Warning: Very small initial training set may produce unstable models
# Fold 1 training size: 24
# Fold 5 training size: 272
# Training size growth ratio: 11.33x
```

Forward chaining introduces a unique bias-variance tradeoff not present in standard k-fold CV. Understanding this tradeoff is essential for interpreting validation results correctly.
The Growing Training Set Problem:
In forward chaining, each successive fold has a larger training set. This creates systematic differences between folds: early folds train on little data and tend to score worse, while later folds train on longer history and better approximate the model you will actually deploy.
Why this matters: If you simply average all fold scores, you're mixing estimates from very different training regimes. The average may not represent performance when deployed with your full training set.
The Temporal Representativeness Problem:
Later folds validate on more recent data. If your time series has a trend, drifting seasonality, or regime changes, each validation window is drawn from different conditions than the ones before it.
This means fold scores are not exchangeable—they measure performance under different conditions.
Mitigating the Bias:
1. Weighted averaging: Weight fold scores by training set size to emphasize folds that better approximate deployment conditions
2. Report fold-level metrics: Show all fold scores, not just the mean, to reveal temporal patterns and variance
3. Anchored analysis: Compare fold scores to identify trend or regime effects—if performance degrades chronologically, investigate
4. Validation set resampling: Within each validation set, bootstrap to estimate variance of that fold's score
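Point 4 can be sketched as follows: resample one fold's prediction errors with replacement and recompute the score each time (a plain i.i.d. bootstrap shown here for RMSE; for strongly autocorrelated residuals a block bootstrap would be more faithful):

```python
import numpy as np

def bootstrap_fold_rmse(y_true, y_pred, n_boot: int = 1000, seed: int = 0):
    """Bootstrap mean and standard error of one fold's validation RMSE."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(y_true) - np.asarray(y_pred)
    n = len(errors)
    # Resample error indices with replacement, recompute RMSE per replicate
    idx = rng.integers(0, n, size=(n_boot, n))
    rmses = np.sqrt((errors[idx] ** 2).mean(axis=1))
    return rmses.mean(), rmses.std()

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_pred = np.array([1.1, 1.8, 3.2, 3.9, 5.3, 5.8])
mean_rmse, se_rmse = bootstrap_fold_rmse(y_true, y_pred)
print(f"fold RMSE ~ {mean_rmse:.3f} +/- {se_rmse:.3f}")
```

Comparing this within-fold standard error to the spread of scores across folds helps separate estimation noise from genuine temporal non-stationarity.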
```python
import numpy as np
from typing import List, Tuple

def weighted_cv_score(
    fold_scores: List[float],
    train_sizes: List[int],
    weighting: str = "linear"
) -> Tuple[float, float]:
    """
    Compute weighted cross-validation score emphasizing larger training sets.

    Parameters
    ----------
    fold_scores : List[float]
        Validation score for each fold
    train_sizes : List[int]
        Training set size for each fold
    weighting : str
        'linear'  - weight proportional to train size
        'sqrt'    - weight proportional to sqrt(train size)
        'uniform' - standard unweighted average

    Returns
    -------
    weighted_mean, weighted_std : Tuple[float, float]
    """
    scores = np.array(fold_scores)
    sizes = np.array(train_sizes)

    if weighting == "linear":
        weights = sizes / sizes.sum()
    elif weighting == "sqrt":
        weights = np.sqrt(sizes) / np.sqrt(sizes).sum()
    elif weighting == "uniform":
        weights = np.ones(len(sizes)) / len(sizes)
    else:
        raise ValueError(f"Unknown weighting: {weighting}")

    weighted_mean = np.sum(weights * scores)

    # Weighted standard deviation
    variance = np.sum(weights * (scores - weighted_mean) ** 2)
    weighted_std = np.sqrt(variance)

    return weighted_mean, weighted_std


def analyze_fold_progression(
    fold_scores: List[float],
    train_sizes: List[int]
) -> dict:
    """
    Analyze how performance changes across forward chaining folds.

    Detects trends, regime changes, and variance patterns.
    """
    n_folds = len(fold_scores)

    # Linear regression to detect trend
    x = np.arange(n_folds)
    slope, intercept = np.polyfit(x, fold_scores, 1)

    # Correlation with training size
    size_correlation = np.corrcoef(train_sizes, fold_scores)[0, 1]

    # Detect potential regime change (large jump between consecutive folds)
    diffs = np.diff(fold_scores)
    max_jump_idx = np.argmax(np.abs(diffs))
    max_jump = diffs[max_jump_idx]

    return {
        "mean_score": np.mean(fold_scores),
        "std_score": np.std(fold_scores),
        "trend_slope": slope,
        "trend_direction": "improving" if slope > 0 else "degrading",
        "size_correlation": size_correlation,
        "potential_regime_change": {
            "between_folds": (max_jump_idx + 1, max_jump_idx + 2),
            "score_change": max_jump
        },
        "early_vs_late": {
            "early_mean": np.mean(fold_scores[:n_folds // 2]),
            "late_mean": np.mean(fold_scores[n_folds // 2:])
        }
    }


# Example analysis
fold_scores = [0.72, 0.75, 0.74, 0.78, 0.81]
train_sizes = [100, 150, 200, 250, 300]

uniform_mean, uniform_std = weighted_cv_score(fold_scores, train_sizes, "uniform")
weighted_mean, weighted_std = weighted_cv_score(fold_scores, train_sizes, "linear")

print(f"Uniform average:  {uniform_mean:.4f} ± {uniform_std:.4f}")
print(f"Weighted average: {weighted_mean:.4f} ± {weighted_std:.4f}")

analysis = analyze_fold_progression(fold_scores, train_sizes)
print(f"Trend: {analysis['trend_direction']} (slope: {analysis['trend_slope']:.4f})")
print(f"Train size correlation: {analysis['size_correlation']:.4f}")
```

Not all temporal data requires forward chaining. Understanding when it's essential versus when standard CV is acceptable helps you make efficient methodological choices.
Forward Chaining is ESSENTIAL when:

- Features include lagged values, rolling statistics, or any other function of past observations
- The series shows meaningful autocorrelation, trend, or seasonality
- The data-generating process is non-stationary (regime changes, drift, evolving behavior)
- The deployed model will genuinely predict the future from the past

Forward Chaining may be OPTIONAL when:

- Observations are effectively independent despite carrying timestamps (e.g., cross-sectional records that merely accumulate over time)
- Features contain no temporal information and the process is demonstrably stationary
- You have explicitly verified the absence of autocorrelation in both features and target
Forward chaining is always valid for temporal data; standard CV is only valid under strict conditions that are hard to verify. The cost of unnecessary forward chaining is computational; the cost of inappropriate random shuffling is systematically wrong conclusions. Default to forward chaining.
| Domain | Typical Approach | Rationale |
|---|---|---|
| Financial markets | Forward chaining required | Strong autocorrelation, regime changes, look-ahead bias is fatal |
| Weather/climate | Forward chaining required | Temporal dynamics are the prediction target |
| E-commerce demand | Forward chaining required | Seasonality, trends, and promotional effects |
| Medical diagnosis | Often standard CV acceptable | Patients usually independent; validate absence of autocorrelation |
| Fraud detection | Forward chaining recommended | Fraud patterns evolve; temporal adaptation needed |
| NLP sentiment | Depends on features | If using temporal context/trends, forward chain; pure text features may not need it |
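For borderline cases like the medical-diagnosis row, the table's advice is to verify the absence of autocorrelation before trusting standard CV. A minimal check is the lag-1 sample autocorrelation (a sketch; in practice you would inspect several lags, e.g. with statsmodels' `acf`):

```python
import numpy as np

def lag1_autocorr(x) -> float:
    """Sample correlation of a series with its own previous value."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(42)

iid = rng.standard_normal(2000)               # independent draws
walk = np.cumsum(rng.standard_normal(2000))   # random walk: heavily autocorrelated

print(f"i.i.d. noise: {lag1_autocorr(iid):+.3f}")   # near 0
print(f"random walk:  {lag1_autocorr(walk):+.3f}")  # near 1
```

If this statistic (on the target and on model residuals) is near zero across lags, standard CV becomes defensible; anything substantial argues for forward chaining.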
Let's bring everything together with a production-ready forward chaining implementation that handles feature engineering, prevents leakage, and provides comprehensive diagnostics.
```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from dataclasses import dataclass
from typing import Dict, List, Callable

@dataclass
class FoldResult:
    """Results from a single forward chaining fold."""
    fold_idx: int
    train_start: int
    train_end: int
    test_start: int
    test_end: int
    train_size: int
    test_size: int
    metrics: Dict[str, float]
    predictions: np.ndarray
    actuals: np.ndarray


class ForwardChainingCV:
    """
    Production-ready forward chaining cross-validation.

    Features:
    - Configurable training strategy (expanding vs fixed window)
    - Gap/embargo period support
    - Comprehensive metrics and diagnostics
    """

    def __init__(
        self,
        n_splits: int = 5,
        min_train_size: int = None,
        test_size: int = None,
        gap: int = 0,
        expanding: bool = True,
        metrics: Dict[str, Callable] = None
    ):
        self.n_splits = n_splits
        self.min_train_size = min_train_size
        self.test_size = test_size
        self.gap = gap
        self.expanding = expanding
        self.metrics = metrics or {
            'rmse': lambda y, p: np.sqrt(mean_squared_error(y, p)),
            'mae': lambda y, p: mean_absolute_error(y, p),
            'r2': lambda y, p: r2_score(y, p)
        }
        self.fold_results_: List[FoldResult] = []

    def split(self, X: np.ndarray):
        """Generate train/test indices."""
        n_samples = len(X)

        # Auto-configure if not specified
        min_train = self.min_train_size or n_samples // (self.n_splits + 1)
        test_sz = self.test_size or n_samples // (self.n_splits + 1)

        indices = np.arange(n_samples)

        for fold in range(self.n_splits):
            if self.expanding:
                train_start = 0
            else:
                # Fixed window: slide forward
                train_start = fold * test_sz

            train_end = min_train + fold * test_sz
            test_start = train_end + self.gap
            test_end = test_start + test_sz

            if test_end > n_samples:
                break

            yield (
                indices[train_start:train_end],
                indices[test_start:test_end]
            )

    def evaluate(
        self,
        model: BaseEstimator,
        X: np.ndarray,
        y: np.ndarray,
        fit_params: dict = None
    ) -> Dict:
        """Run forward chaining CV and return detailed results."""
        fit_params = fit_params or {}
        self.fold_results_ = []

        for fold_idx, (train_idx, test_idx) in enumerate(self.split(X)):
            # Clone model to ensure independence between folds
            fold_model = clone(model)

            # Split data
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            # Fit and predict
            fold_model.fit(X_train, y_train, **fit_params)
            predictions = fold_model.predict(X_test)

            # Calculate metrics
            fold_metrics = {
                name: metric_fn(y_test, predictions)
                for name, metric_fn in self.metrics.items()
            }

            # Store results
            self.fold_results_.append(FoldResult(
                fold_idx=fold_idx,
                train_start=train_idx[0],
                train_end=train_idx[-1],
                test_start=test_idx[0],
                test_end=test_idx[-1],
                train_size=len(train_idx),
                test_size=len(test_idx),
                metrics=fold_metrics,
                predictions=predictions,
                actuals=y_test
            ))

        return self._aggregate_results()

    def _aggregate_results(self) -> Dict:
        """Aggregate fold results into summary statistics."""
        all_metrics = {name: [] for name in self.metrics.keys()}
        train_sizes = []

        for result in self.fold_results_:
            train_sizes.append(result.train_size)
            for name, value in result.metrics.items():
                all_metrics[name].append(value)

        summary = {
            'n_folds': len(self.fold_results_),
            'fold_results': self.fold_results_,
        }

        # Per-metric statistics
        for name, values in all_metrics.items():
            values = np.array(values)
            weights = np.array(train_sizes) / sum(train_sizes)
            summary[f'{name}_mean'] = np.mean(values)
            summary[f'{name}_std'] = np.std(values)
            summary[f'{name}_weighted_mean'] = np.sum(weights * values)
            summary[f'{name}_min'] = np.min(values)
            summary[f'{name}_max'] = np.max(values)
            summary[f'{name}_by_fold'] = values.tolist()

        return summary


# Usage example
if __name__ == "__main__":
    from sklearn.linear_model import Ridge

    # Generate sample time series
    np.random.seed(42)
    n = 500
    X = np.random.randn(n, 5)
    y = X @ np.array([1, 2, -1, 0.5, 0.3]) + 0.5 * np.random.randn(n)

    # Configure forward chaining
    cv = ForwardChainingCV(
        n_splits=5,
        min_train_size=100,
        test_size=50,
        gap=5,  # Embargo period
        expanding=True
    )

    # Evaluate model
    results = cv.evaluate(Ridge(alpha=1.0), X, y)

    print("Forward Chaining CV Results")
    print("=" * 50)
    print(f"RMSE: {results['rmse_mean']:.4f} ± {results['rmse_std']:.4f}")
    print(f"R²:   {results['r2_mean']:.4f} ± {results['r2_std']:.4f}")
    print(f"Fold-by-fold RMSE: {results['rmse_by_fold']}")
```

Forward chaining is the foundation of time series cross-validation—a simple yet powerful constraint that transforms unreliable validation into trustworthy performance estimation.
Forward chaining is the foundational technique, but it uses an expanding training window. The next pages explore sliding windows (fixed training size) and expanding windows in detail, followed by embargo periods and purging strategies for preventing subtle forms of data leakage in financial and high-frequency applications.
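As a preview, scikit-learn's `TimeSeriesSplit` already exposes the sliding-window variant through its `max_train_size` parameter: capping the training window turns the expanding scheme into a fixed one.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((100, 1))

expanding = TimeSeriesSplit(n_splits=4)                     # training window grows
sliding = TimeSeriesSplit(n_splits=4, max_train_size=20)    # training window is capped

print("expanding:", [len(tr) for tr, _ in expanding.split(X)])  # [20, 40, 60, 80]
print("sliding:  ", [len(tr) for tr, _ in sliding.split(X)])    # [20, 20, 20, 20]
```

When and why to prefer one over the other is the subject of the next pages.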