While forward chaining (expanding window) grows the training set with each fold, sliding window cross-validation maintains a fixed training set size that moves forward through time. This approach trades training data volume for recency—prioritizing recent observations over distant history.
Sliding window validation is particularly powerful for non-stationary time series where older data may be misleading: financial markets that undergo regime changes, user behavior that evolves with platform updates, or sensor data from systems that degrade over time.
This page provides a comprehensive treatment of sliding window cross-validation—when to prefer it over expanding windows, how to configure window sizes, and how to diagnose whether your data favors recency over volume.
By the end of this page, you will master sliding window mechanics, understand the recency vs. volume tradeoff, learn optimal window sizing strategies, and know how to compare sliding vs. expanding windows empirically.
The Sliding Window Algorithm:
Unlike expanding window where the training start remains fixed at the beginning of the series, sliding window moves both the start and end of the training window forward:
Key Difference from Expanding Window:
| Aspect | Expanding Window | Sliding Window |
|---|---|---|
| Training start | Fixed at beginning | Moves forward |
| Training size | Grows each fold | Fixed at W |
| Early data | Always included | Eventually dropped |
| Computational cost | Increases with folds | Constant per fold |
```python
import numpy as np
from typing import Generator, Tuple, List


def sliding_window_splits(
    n_samples: int,
    window_size: int,
    test_size: int,
    step_size: int = None,
    gap: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Generate sliding window cross-validation splits.

    Parameters:
    -----------
    n_samples : int
        Total number of observations
    window_size : int
        Fixed size of each training window
    test_size : int
        Size of each validation set
    step_size : int, optional
        How far to slide between folds.
        Defaults to test_size (non-overlapping validation sets)
    gap : int
        Embargo period between train and test

    Yields:
    -------
    train_indices, test_indices : tuple of np.ndarray
    """
    if step_size is None:
        step_size = test_size

    # Validate parameters
    if window_size + gap + test_size > n_samples:
        raise ValueError(
            f"Window ({window_size}) + gap ({gap}) + test ({test_size}) "
            f"exceeds data size ({n_samples})"
        )

    indices = np.arange(n_samples)
    train_start = 0

    while train_start + window_size + gap + test_size <= n_samples:
        train_end = train_start + window_size
        test_start = train_end + gap
        test_end = test_start + test_size

        yield (
            indices[train_start:train_end],
            indices[test_start:test_end]
        )

        train_start += step_size


def visualize_sliding_vs_expanding(n_samples: int = 100):
    """Compare sliding and expanding window visually."""
    window_size = 20
    test_size = 10

    print("SLIDING WINDOW (fixed training size)")
    print("=" * 50)

    for fold, (train, test) in enumerate(
        sliding_window_splits(n_samples, window_size, test_size), 1
    ):
        if fold > 5:
            print("...")
            break
        bar = ['.'] * 50  # Compressed view
        scale = n_samples / 50
        for i in train:
            bar[int(i / scale)] = '#'
        for i in test:
            bar[int(i / scale)] = '*'
        print(f"Fold {fold}: {''.join(bar[:40])}")

    print("\nEXPANDING WINDOW (growing training size)")
    print("=" * 50)

    # Show first 5 folds of expanding for comparison
    from page_0 import forward_chain_splits  # Assuming previous implementation

    for fold, (train, test) in enumerate(
        forward_chain_splits(n_samples, 5, window_size, test_size), 1
    ):
        bar = ['.'] * 50
        scale = n_samples / 50
        for i in train:
            bar[int(i / scale)] = '#'
        for i in test:
            bar[int(i / scale)] = '*'
        print(f"Fold {fold}: {''.join(bar[:40])}")


# Demonstration output (schematic):
# SLIDING WINDOW (fixed training size)
# ==================================================
# Fold 1: ##########**........
# Fold 2: ..########**........
# Fold 3: ....########**......
# Fold 4: ......########**....
# Fold 5: ........########**..
#
# EXPANDING WINDOW (growing training size)
# ==================================================
# Fold 1: ##########**........
# Fold 2: ##############**....
# Fold 3: ##################**
# Fold 4: ######################**
# Fold 5: ##########################**
```

The choice between sliding and expanding windows is fundamentally a bet about data relevance over time. Sliding window assumes older data is less valuable—or even harmful—for predicting current outcomes.
Sliding Window is Preferred When:
- The series is non-stationary and older observations no longer reflect current dynamics (regime changes, evolving user behavior, degrading sensors).
- CV performance plateaus or degrades once the training window grows beyond a certain size.
- You want constant per-fold training cost, mirroring a production system that retrains on a fixed lookback.
Expanding Window is Preferred When:
- The series is close to stationary, so older observations remain representative and more history genuinely helps.
- Data is scarce and the model cannot afford to discard any observations.
- CV performance keeps improving as the training set grows.
Don't guess—test both approaches. Plot CV performance against training window size. If performance plateaus or degrades beyond a certain window size, sliding window is indicated. If performance consistently improves with more data, use expanding window.
Window size (W) is the critical hyperparameter for sliding window CV. Too small, and the model lacks sufficient data to learn patterns. Too large, and you lose the recency benefits that motivated sliding window in the first place.
Factors Influencing Optimal Window Size:
1. Seasonality Period: the window must include at least 1-2 complete seasonal cycles to capture repeating patterns.
2. Regime/Concept Drift Speed: faster drift calls for smaller windows, so that training data stays within the current regime.
3. Model Complexity: models with more parameters need more training data, which sets a floor on the window size.
4. Feature Engineering Requirements: technical indicators and lagged features consume observations at the start of each window, so the window must also cover their lookback. The sketch after this list combines these factors into a rough lower bound.
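These factors can be folded into a crude starting point for the window-size search that follows. This is a sketch only: `heuristic_window_size`, the roughly 20-observations-per-parameter rule of thumb, and the example numbers are illustrative assumptions, not prescriptions from this page; drift speed (factor 2) is best handled by capping the largest candidate window rather than by a formula.

```python
def heuristic_window_size(
    seasonal_period: int,        # length of one seasonal cycle in observations (0 if none)
    n_parameters: int,           # approximate number of model parameters
    max_feature_lookback: int,   # longest lag / rolling window used by features
    obs_per_parameter: int = 20  # assumed rule of thumb: ~20 observations per parameter
) -> int:
    """Rough lower bound on the sliding window size W from the factors above."""
    # Need 1-2 full seasonal cycles AND enough observations for the model...
    core = max(2 * seasonal_period, obs_per_parameter * n_parameters)
    # ...plus the lookback consumed by lagged/rolling features at the window start.
    return core + max_feature_lookback


# Example: daily data with weekly seasonality, ~10 parameters, 30-day feature lookback
print(heuristic_window_size(seasonal_period=7, n_parameters=10, max_feature_lookback=30))
# -> 230; use this as the smallest candidate window size, and let expected drift
#    speed cap the largest candidate you are willing to test.
```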
```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.metrics import mean_squared_error
from typing import List, Dict, Tuple
import matplotlib.pyplot as plt


def find_optimal_window_size(
    X: np.ndarray,
    y: np.ndarray,
    model: BaseEstimator,
    window_sizes: List[int],
    test_size: int,
    min_folds: int = 3
) -> Dict:
    """
    Find optimal sliding window size by testing multiple sizes.

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (time-ordered)
    y : np.ndarray
        Target variable
    model : BaseEstimator
        Model to evaluate
    window_sizes : List[int]
        Window sizes to test
    test_size : int
        Validation set size for each fold
    min_folds : int
        Minimum number of folds required for valid comparison

    Returns:
    --------
    Dict with optimal window size and performance by window size
    """
    n_samples = len(X)
    results = {}

    for W in window_sizes:
        # Check if this window size allows enough folds
        max_folds = (n_samples - W) // test_size
        if max_folds < min_folds:
            print(f"Window {W}: Skipped (only {max_folds} folds possible)")
            continue

        fold_scores = []

        # Run sliding window CV
        train_start = 0
        while train_start + W + test_size <= n_samples:
            train_end = train_start + W
            test_start = train_end
            test_end = test_start + test_size

            X_train = X[train_start:train_end]
            y_train = y[train_start:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]

            fold_model = clone(model)
            fold_model.fit(X_train, y_train)
            predictions = fold_model.predict(X_test)

            rmse = np.sqrt(mean_squared_error(y_test, predictions))
            fold_scores.append(rmse)

            train_start += test_size

        results[W] = {
            'n_folds': len(fold_scores),
            'mean_rmse': np.mean(fold_scores),
            'std_rmse': np.std(fold_scores),
            'fold_scores': fold_scores
        }

    # Find optimal window (lowest mean RMSE)
    optimal_W = min(results.keys(), key=lambda w: results[w]['mean_rmse'])

    return {
        'optimal_window_size': optimal_W,
        'optimal_rmse': results[optimal_W]['mean_rmse'],
        'all_results': results
    }


def plot_window_size_analysis(results: Dict):
    """Visualize performance vs window size."""
    window_sizes = sorted(results['all_results'].keys())
    means = [results['all_results'][w]['mean_rmse'] for w in window_sizes]
    stds = [results['all_results'][w]['std_rmse'] for w in window_sizes]

    plt.figure(figsize=(10, 6))
    plt.errorbar(window_sizes, means, yerr=stds, marker='o', capsize=5)
    plt.axvline(
        results['optimal_window_size'],
        color='red', linestyle='--',
        label=f"Optimal: {results['optimal_window_size']}"
    )
    plt.xlabel('Window Size')
    plt.ylabel('RMSE')
    plt.title('Sliding Window CV: Performance vs Window Size')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()


# Example usage
from sklearn.linear_model import Ridge

# Generate sample non-stationary data
np.random.seed(42)
n = 500
t = np.arange(n)

# Regime change at t=250
X = np.random.randn(n, 3)
y = np.where(
    t < 250,
    X @ np.array([1, 0.5, 0.2]) + 0.1 * np.random.randn(n),
    X @ np.array([0.2, 0.5, 1.5]) + 0.1 * np.random.randn(n)  # Changed coefficients
)

# Find optimal window
results = find_optimal_window_size(
    X, y,
    model=Ridge(alpha=1.0),
    window_sizes=[50, 100, 150, 200, 300, 400],
    test_size=25,
    min_folds=5
)

print(f"Optimal window size: {results['optimal_window_size']}")
print(f"RMSE at optimal: {results['optimal_rmse']:.4f}")

# With regime change, smaller windows should perform better
# because they don't mix data from different regimes
```

Optimizing window size on the same data you'll later evaluate creates a subtle form of leakage. Ideally, use a holdout test set that wasn't used in window selection, or use nested CV where the inner loop selects window size.
The step size (S) determines how far the window slides between consecutive folds. This parameter controls the overlap between successive training sets and affects both the number of folds and the diversity of validation scenarios.
Step Size Options:
1. Non-overlapping (S = test_size): each test point appears in exactly one validation fold.
2. Sliding by 1 (S = 1): maximum overlap between consecutive training sets, producing many highly correlated folds.
3. Partial overlap (S = test_size / 2): a middle ground between fold count and fold independence.
| Step Size | Number of Folds | Fold Independence | Use Case |
|---|---|---|---|
| S = test_size | Low (~5-10 typical) | High (no overlap) | Standard validation |
| S = 1 | Very high (~n) | Very low (adjacent folds nearly identical) | Temporal analysis, concept drift detection |
| S = test_size/k | Moderate | Moderate | Variance reduction, ensemble training |
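As a quick sanity check on the table above, the snippet below uses the `sliding_window_splits` generator defined earlier on this page to count folds and measure training-set overlap for a few step sizes; the specific sizes (n=200, W=60, test=20) are made-up illustrative values, not taken from the examples above.

```python
import numpy as np
# Assumes sliding_window_splits from the first code block on this page is in scope.

n, W, test = 200, 60, 20  # illustrative sizes

for step in (20, 10, 1):
    folds = list(sliding_window_splits(n, W, test, step_size=step))
    first_train, _ = folds[0]
    second_train, _ = folds[1]
    # Fraction of the first training window reused by the next fold
    overlap = len(np.intersect1d(first_train, second_train)) / W
    print(f"step={step:>2}: {len(folds):>3} folds, consecutive train overlap = {overlap:.0%}")

# step=20:   7 folds, consecutive train overlap = 67%
# step=10:  13 folds, consecutive train overlap = 83%
# step= 1: 121 folds, consecutive train overlap = 98%
```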
```python
import numpy as np
from typing import Dict, List
from scipy import stats


def analyze_step_size_impact(
    n_samples: int,
    window_size: int,
    test_size: int,
    step_sizes: List[int]
) -> None:
    """Analyze impact of different step sizes."""

    print("Step Size Impact Analysis")
    print("=" * 60)
    print(f"Data size: {n_samples}, Window: {window_size}, Test: {test_size}")
    print("=" * 60)

    for step in step_sizes:
        n_folds = (n_samples - window_size - test_size) // step + 1

        # Calculate training set overlap between consecutive folds
        train_overlap = max(0, window_size - step) / window_size

        # Calculate test set overlap
        test_overlap = max(0, test_size - step) / test_size if step < test_size else 0

        print(f"\nStep size = {step}:")
        print(f"  Number of folds: {n_folds}")
        print(f"  Training overlap: {train_overlap:.1%}")
        print(f"  Test overlap: {test_overlap:.1%}")
        print(f"  Effective sample size: {n_folds * test_size * (1 - test_overlap):.0f}")


def compute_fold_correlation(
    fold_scores: List[float],
    step_size: int,
    window_size: int
) -> Dict:
    """
    Estimate correlation between adjacent fold scores.

    High correlation indicates fold scores are not independent
    and simple averaging may underestimate variance.
    """
    if len(fold_scores) < 3:
        return np.nan

    scores = np.array(fold_scores)

    # Lag-1 autocorrelation of fold scores
    autocorr = np.corrcoef(scores[:-1], scores[1:])[0, 1]

    # Theoretical overlap-based correlation
    training_overlap = max(0, (window_size - step_size) / window_size)

    return {
        'empirical_autocorr': autocorr,
        'training_overlap': training_overlap,
        'expected_corr_lower': training_overlap * 0.5,  # Rough lower bound
        'effective_n_folds': len(fold_scores) * (1 - abs(autocorr))
    }


def adjusted_standard_error(
    fold_scores: List[float],
    autocorrelation: float
) -> float:
    """
    Compute standard error adjusted for fold correlation.

    When folds are correlated, naive SE understates uncertainty.
    This applies a Newey-West style correction.
    """
    n = len(fold_scores)
    naive_se = np.std(fold_scores) / np.sqrt(n)

    # Correction factor for lag-1 autocorrelation
    # More sophisticated methods use HAC estimators
    correction = np.sqrt(1 + 2 * autocorrelation) if autocorrelation > 0 else 1

    return naive_se * correction


# Example
analyze_step_size_impact(
    n_samples=500,
    window_size=100,
    test_size=20,
    step_sizes=[20, 10, 5, 1]
)

# Example output:
# Step size = 20:
#   Number of folds: 20
#   Training overlap: 80.0%
#   Test overlap: 0.0%
#
# Step size = 1:
#   Number of folds: 381
#   Training overlap: 99.0%
#   Test overlap: 95.0%
```

Anchored walk-forward is a hybrid between pure sliding window and expanding window. Instead of a fixed start that expands or a sliding start with fixed width, anchored walk-forward uses a rolling anchor point that periodically resets.
The Algorithm:
1. Place an anchor at the start of the data and train on a window that expands forward from the anchor, validating on the block immediately after it.
2. When the training window reaches a maximum size (or a fixed anchor period elapses), move the anchor forward and restart from the minimum training size.
3. Repeat until the end of the series is reached.
Use Cases:
- Production pipelines that periodically retrain from scratch on a bounded history rather than updating incrementally.
- Series where a moderate amount of history helps but very old data eventually degrades performance.
```python
import numpy as np
from typing import Generator, Tuple


def anchored_walk_forward(
    n_samples: int,
    anchor_period: int,
    min_train_size: int,
    max_train_size: int,
    test_size: int,
    gap: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Anchored walk-forward with periodic anchor resets.

    Parameters:
    -----------
    n_samples : int
        Total observations
    anchor_period : int
        How often to reset the anchor point (in observations)
    min_train_size : int
        Minimum training size after anchor reset
    max_train_size : int
        Maximum training size before anchor resets
    test_size : int
        Validation set size
    gap : int
        Embargo/gap between train and test

    Yields:
    -------
    train_indices, test_indices for each fold
    """
    indices = np.arange(n_samples)
    anchor = 0

    while anchor + min_train_size + gap + test_size <= n_samples:
        # Expansion phase: grow from anchor
        train_end = anchor + min_train_size

        while train_end + gap + test_size <= n_samples:
            # Check if we've hit max training size
            current_train_size = train_end - anchor
            if current_train_size > max_train_size:
                break

            test_start = train_end + gap
            test_end = test_start + test_size

            yield (
                indices[anchor:train_end],
                indices[test_start:test_end]
            )

            train_end += test_size

        # Reset anchor forward
        anchor += anchor_period


def visualize_anchored_walk_forward():
    """Visualize anchored walk-forward pattern."""
    n = 120

    print("Anchored Walk-Forward Visualization")
    print("=" * 60)
    print("Each '|' marks an anchor reset")
    print()

    folds = list(anchored_walk_forward(
        n_samples=n,
        anchor_period=40,
        min_train_size=20,
        max_train_size=40,
        test_size=10
    ))

    current_anchor = 0
    for fold_idx, (train, test) in enumerate(folds):
        # Detect anchor reset
        new_anchor = train[0]
        reset_marker = " |RESET|" if new_anchor != current_anchor else ""
        current_anchor = new_anchor

        visual = ['.'] * (n // 2)
        for i in train:
            visual[i // 2] = '#'
        for i in test:
            visual[i // 2] = '*'

        print(f"Fold {fold_idx+1}: {''.join(visual[:50])}{reset_marker}")
        print(f"  Train[{train[0]}:{train[-1]+1}] Test[{test[0]}:{test[-1]+1}]")


visualize_anchored_walk_forward()

# Output shows training windows that grow from each anchor, then reset:
# Fold 1: Train[0:20]  Test[20:30]
# Fold 2: Train[0:30]  Test[30:40]
# Fold 3: Train[0:40]  Test[40:50]
# Fold 4: Train[40:60] Test[60:70]   |RESET|
# ...
```

Anchored walk-forward is less common than pure sliding or expanding windows but valuable in production scenarios where you periodically retrain from scratch (e.g., monthly complete retraining) rather than incremental updates. It also helps when you suspect that too much old data would swamp recent patterns and hold the model in an obsolete regime.
Rather than guessing which approach suits your data, run a head-to-head comparison. This empirical approach reveals whether your time series benefits from recency (sliding) or volume (expanding).
Comparison Protocol:
1. Fix a common set of validation blocks so both strategies are scored on exactly the same time points.
2. For each block, train one model on the expanding window (all data from the start) and one on the sliding window (the most recent W observations), and score both.
3. Compare the matched fold scores with a paired t-test and an effect size (Cohen's d) to judge whether the difference is real and meaningful.
```python
import numpy as np
from sklearn.base import clone, BaseEstimator
from sklearn.metrics import mean_squared_error
from scipy import stats
from typing import Dict, List


def compare_window_strategies(
    X: np.ndarray,
    y: np.ndarray,
    model: BaseEstimator,
    window_size: int,  # Used as initial/fixed size
    test_size: int,
    n_folds: int = 5
) -> Dict:
    """
    Compare sliding vs expanding window strategies.

    Both strategies validate on the same time points
    for fair comparison.
    """
    n_samples = len(X)

    sliding_scores = []
    expanding_scores = []
    sliding_preds_all = []
    expanding_preds_all = []
    actuals_all = []

    # Generate matched fold positions
    fold_positions = []
    for i in range(n_folds):
        train_end_expanding = window_size + i * test_size
        test_start = train_end_expanding
        test_end = test_start + test_size

        if test_end > n_samples:
            break

        fold_positions.append({
            'expanding_train': (0, train_end_expanding),
            'sliding_train': (train_end_expanding - window_size, train_end_expanding),
            'test': (test_start, test_end)
        })

    for fold in fold_positions:
        test_start, test_end = fold['test']
        X_test = X[test_start:test_end]
        y_test = y[test_start:test_end]
        actuals_all.extend(y_test)

        # Expanding window
        exp_start, exp_end = fold['expanding_train']
        X_train_exp = X[exp_start:exp_end]
        y_train_exp = y[exp_start:exp_end]

        model_exp = clone(model)
        model_exp.fit(X_train_exp, y_train_exp)
        preds_exp = model_exp.predict(X_test)
        expanding_scores.append(mean_squared_error(y_test, preds_exp))
        expanding_preds_all.extend(preds_exp)

        # Sliding window
        slide_start, slide_end = fold['sliding_train']
        X_train_slide = X[slide_start:slide_end]
        y_train_slide = y[slide_start:slide_end]

        model_slide = clone(model)
        model_slide.fit(X_train_slide, y_train_slide)
        preds_slide = model_slide.predict(X_test)
        sliding_scores.append(mean_squared_error(y_test, preds_slide))
        sliding_preds_all.extend(preds_slide)

    # Statistical comparison
    sliding_mean = np.mean(sliding_scores)
    expanding_mean = np.mean(expanding_scores)

    # Paired t-test (folds are matched)
    t_stat, p_value = stats.ttest_rel(sliding_scores, expanding_scores)

    # Effect size (Cohen's d)
    diff = np.array(sliding_scores) - np.array(expanding_scores)
    cohens_d = np.mean(diff) / np.std(diff) if np.std(diff) > 0 else 0

    return {
        'sliding': {
            'mean_mse': sliding_mean,
            'std_mse': np.std(sliding_scores),
            'rmse': np.sqrt(sliding_mean),
            'fold_scores': sliding_scores
        },
        'expanding': {
            'mean_mse': expanding_mean,
            'std_mse': np.std(expanding_scores),
            'rmse': np.sqrt(expanding_mean),
            'fold_scores': expanding_scores
        },
        'comparison': {
            'difference_mse': sliding_mean - expanding_mean,
            'winner': 'sliding' if sliding_mean < expanding_mean else 'expanding',
            'relative_improvement': abs(sliding_mean - expanding_mean) / max(sliding_mean, expanding_mean),
            'p_value': p_value,
            'is_significant': p_value < 0.05,
            'cohens_d': cohens_d,
            'interpretation': interpret_comparison(cohens_d, p_value)
        }
    }


def interpret_comparison(cohens_d: float, p_value: float) -> str:
    """Provide interpretation of statistical comparison."""
    if p_value >= 0.05:
        return "No significant difference - either approach acceptable"

    effect = abs(cohens_d)
    if effect < 0.2:
        magnitude = "negligible"
    elif effect < 0.5:
        magnitude = "small"
    elif effect < 0.8:
        magnitude = "medium"
    else:
        magnitude = "large"

    better = "sliding" if cohens_d < 0 else "expanding"
    return f"{better} window is significantly better (p={p_value:.4f}, {magnitude} effect)"


# Example with regime change (should favor sliding)
np.random.seed(42)
n = 500
t = np.arange(n)
X = np.random.randn(n, 3)
y = np.where(
    t < 250,
    X @ [1, 0.5, 0.2],
    X @ [0.2, 0.5, 1.5]  # Regime change
) + 0.2 * np.random.randn(n)

from sklearn.linear_model import Ridge

results = compare_window_strategies(
    X, y, Ridge(alpha=1.0),
    window_size=100,
    test_size=30,
    n_folds=8
)

print("Window Strategy Comparison")
print("=" * 50)
print(f"Sliding Window RMSE: {results['sliding']['rmse']:.4f}")
print(f"Expanding Window RMSE: {results['expanding']['rmse']:.4f}")
print(f"\nStatistical Test: {results['comparison']['interpretation']}")
```

Sliding window cross-validation provides the recency-focused alternative to expanding window approaches, trading training data volume for temporal relevance.
The next page explores expanding window cross-validation in greater depth, including strategies for handling the growing training set's computational and statistical implications. We'll also cover when to combine sliding and expanding approaches in a multi-resolution validation strategy.