Even with proper forward chaining (expanding or sliding window), a subtle but devastating form of information leakage can occur: temporal proximity leakage. When training and test sets are immediately adjacent in time, features computed from the training set may contain information that "bleeds" into the test period through autocorrelation, overlapping computation windows, or lagged effects.
Embargo periods (also called gaps or buffers) address this by inserting a temporal buffer zone between the training set's end and the test set's start. Observations within the embargo are excluded from both training and testing, creating a clean separation.
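In index terms, the idea looks like this (the sizes below are invented for illustration): observations inside the embargo zone belong to neither the training nor the test set.

```python
train_end, embargo, test_size = 60, 10, 20

train_idx = list(range(0, train_end))                        # 0..59: training
embargo_idx = list(range(train_end, train_end + embargo))    # 60..69: used by neither set
test_idx = list(range(train_end + embargo,
                      train_end + embargo + test_size))      # 70..89: testing

# The gap between the last training point and the first test point is the embargo
print(test_idx[0] - train_idx[-1] - 1)  # 10
```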
This technique is particularly critical in domains like quantitative finance, where even small amounts of look-ahead bias can produce models that appear profitable in backtesting but fail catastrophically in live trading.
By the end of this page, you will understand why simple adjacency creates leakage, how to determine appropriate embargo lengths, how to implement embargo in your CV pipelines, and when embargo is essential versus optional.
Consider a forward chaining setup where the training set ends at time t and the test set begins immediately at t+1. Several mechanisms can cause information from the test period to contaminate the training process:
Mechanism 1: Overlapping Technical Indicators
Many time series features involve rolling computations (moving averages, rolling volatility, momentum indicators). A 30-day moving average computed at time t includes observations from t-29 to t. If the test set starts at t+1, the moving average at that first test point includes observations from t-28 to t+1, so the first 29 test features share raw observations with the training period; every feature computed near the boundary straddles both sets.
More critically, if you compute features using the entire dataset before splitting (a common mistake), test period values directly leak into training features.
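A minimal sketch of that compute-before-splitting mistake, using full-sample standardization as the leaky feature (the series and split point below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(loc=5.0, scale=2.0, size=100)
split = 70  # train = [0, 70), test = [70, 100)

# WRONG: standardize with statistics of the ENTIRE series before splitting;
# the test-period mean and std contaminate every training feature
z_leaky = (series - series.mean()) / series.std()

# RIGHT: fit the statistics on the training portion only, then apply to both
train = series[:split]
z_clean_train = (train - train.mean()) / train.std()
z_clean_test = (series[split:] - train.mean()) / train.std()

# The training features differ between the two versions — that difference is leakage
print(np.allclose(z_leaky[:split], z_clean_train))  # False
```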
Mechanism 2: Autocorrelation Spillover
Most time series exhibit autocorrelation—observations near each other are correlated. If today's value predicts tomorrow's (lag-1 autocorrelation = 0.8), and the training set ends today, then information about tomorrow (the test point) is partially encoded in today's observation, which is in the training set.
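This spillover is easy to measure on a simulated AR(1) process (the coefficient, series length, and gap below are illustrative): correlation across a gap of g steps decays roughly as ρ^g, so a 10-step gap shrinks ρ = 0.8 to about 0.11.

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho = 20_000, 0.8
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()

# Adjacent observations (training end vs an immediately adjacent test start)
adjacent = np.corrcoef(x[:-1], x[1:])[0, 1]
# Observations separated by a 10-step embargo: theory predicts rho**10 ≈ 0.11
gapped = np.corrcoef(x[:-10], x[10:])[0, 1]
print(f"lag-1 correlation:  {adjacent:.2f}")
print(f"lag-10 correlation: {gapped:.2f}")
```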
Mechanism 3: Label/Target Leakage
For multi-step ahead predictions, the target variable may look multiple steps into the future. Without proper embargo, observations whose targets overlap with the test period can appear in training.
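A quick sketch of the overlap check (the horizon and split point are made up): a training observation at time t whose label is the value at t + horizon leaks whenever that label lands inside the test period.

```python
import numpy as np

horizon = 5        # the target at time t is the value 5 steps ahead
test_start = 70    # test period begins at index 70

# A training observation at time t leaks whenever its label lands in the test
# period, i.e. whenever t + horizon >= test_start
train_times = np.arange(test_start)
overlapping = train_times[train_times + horizon >= test_start]
print(overlapping)  # [65 66 67 68 69]
```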
| Leakage Source | Example | Required Embargo |
|---|---|---|
| Rolling features | 30-day moving average | ≥ Rolling window length |
| Multi-step targets | Predict 5-day return | ≥ Forecast horizon |
| Autocorrelation | AR(1) with ρ = 0.8 | ≥ Autocorrelation decay time |
| Label smoothing | 7-day average target | ≥ Smoothing window |
| Event windows | Earnings ±3 days | ≥ Event window half-width |
In quantitative finance, studies have shown that backtests without proper embargo can overestimate performance by 20-40%. A strategy appearing to generate 15% annual returns may actually produce 0% or negative returns when embargo is properly applied. This isn't a minor calibration issue—it's the difference between a profitable and unprofitable strategy.
Embargo length is not one-size-fits-all. It depends on your feature engineering, prediction horizon, and data characteristics. Here's a systematic approach to determining the right embargo:
Step 1: Inventory Feature Dependencies
List every rolling or lagged feature together with its lookback window. The longest window sets the feature-based floor for the embargo.
Step 2: Identify Target Dependencies
Note the forecast horizon and any target aggregation window (for example, a 5-day forward return looks five steps into the future).
Step 3: Measure Autocorrelation Decay
Compute autocorrelation function (ACF) and identify where correlation drops below threshold (e.g., 0.1):
```python
import numpy as np
from typing import List, Dict
from statsmodels.tsa.stattools import acf


def calculate_required_embargo(
    feature_windows: List[int],
    forecast_horizon: int,
    target_window: int = 1,
    series: np.ndarray = None,
    acf_threshold: float = 0.1,
    max_acf_lag: int = 50
) -> Dict:
    """
    Calculate required embargo based on feature engineering and data properties.

    Parameters:
    -----------
    feature_windows : List[int]
        Lookback windows for all rolling features
    forecast_horizon : int
        How many steps ahead you're predicting
    target_window : int
        If target is aggregated (e.g., 5-day return), the window size
    series : np.ndarray, optional
        Time series data for autocorrelation analysis
    acf_threshold : float
        ACF threshold below which correlation is considered negligible
    max_acf_lag : int
        Maximum lag to check for autocorrelation

    Returns:
    --------
    Dict with embargo recommendation and breakdown
    """
    components = {
        'max_feature_window': max(feature_windows) if feature_windows else 0,
        'forecast_horizon': forecast_horizon,
        'target_window': target_window,
    }

    # Feature-based embargo (prevent feature overlap)
    feature_embargo = components['max_feature_window']

    # Target-based embargo (prevent target leakage)
    target_embargo = forecast_horizon + target_window - 1

    # Autocorrelation-based embargo
    acf_embargo = 0
    if series is not None:
        # Compute ACF
        acf_values = acf(series, nlags=max_acf_lag, fft=False)

        # Find lag where ACF drops below threshold
        for lag, value in enumerate(acf_values):
            if abs(value) < acf_threshold:
                acf_embargo = lag
                break
        else:
            acf_embargo = max_acf_lag  # Never dropped below threshold

        components['acf_decay_lag'] = acf_embargo
        components['acf_at_recommended'] = acf_values[min(acf_embargo, len(acf_values) - 1)]

    # Total recommended embargo is maximum of all components
    recommended = max(feature_embargo, target_embargo, acf_embargo)

    # Add safety margin (20% extra, minimum 1)
    safety_margin = max(1, int(recommended * 0.2))
    final_recommendation = recommended + safety_margin

    return {
        'components': components,
        'feature_embargo': feature_embargo,
        'target_embargo': target_embargo,
        'acf_embargo': acf_embargo,
        'base_recommendation': recommended,
        'safety_margin': safety_margin,
        'final_recommendation': final_recommendation,
        'binding_constraint': (
            'feature_windows' if feature_embargo >= target_embargo and feature_embargo >= acf_embargo
            else 'target_horizon' if target_embargo >= acf_embargo
            else 'autocorrelation'
        )
    }


def validate_embargo(
    embargo: int,
    series: np.ndarray,
    n_simulations: int = 1000
) -> Dict:
    """
    Validate embargo length through simulation.

    Simulates whether the embargo successfully decorrelates train/test.
    """
    n = len(series)
    correlations = []

    for _ in range(n_simulations):
        # Random train/test split point
        split = np.random.randint(embargo + 10, n - 10)
        # Last training observation
        train_end_value = series[split - embargo - 1]
        # First test observation
        test_start_value = series[split]
        correlations.append((train_end_value, test_start_value))

    train_vals = [c[0] for c in correlations]
    test_vals = [c[1] for c in correlations]
    correlation = np.corrcoef(train_vals, test_vals)[0, 1]

    return {
        'embargo': embargo,
        'train_test_correlation': correlation,
        'is_sufficient': abs(correlation) < 0.1,
        'recommendation': (
            'Embargo is sufficient' if abs(correlation) < 0.1
            else 'Consider increasing embargo' if abs(correlation) < 0.2
            else 'Embargo is too short'
        )
    }


# Example usage
np.random.seed(42)

# Simulate an AR(1) process with strong autocorrelation
n = 1000
ar_coef = 0.9
series = np.zeros(n)
series[0] = np.random.randn()
for t in range(1, n):
    series[t] = ar_coef * series[t - 1] + np.random.randn() * 0.5

# Calculate required embargo
embargo_info = calculate_required_embargo(
    feature_windows=[5, 10, 20, 30],  # Various rolling windows
    forecast_horizon=5,               # Predict 5 days ahead
    target_window=5,                  # Target is 5-day return
    series=series,
    acf_threshold=0.1
)

print("Embargo Calculation")
print("=" * 50)
print(f"Feature-based embargo: {embargo_info['feature_embargo']}")
print(f"Target-based embargo: {embargo_info['target_embargo']}")
print(f"ACF-based embargo: {embargo_info['acf_embargo']}")
print(f"\nBinding constraint: {embargo_info['binding_constraint']}")
print(f"Final recommendation: {embargo_info['final_recommendation']} periods")

# Validate the embargo
validation = validate_embargo(embargo_info['final_recommendation'], series)
print(f"\nValidation: {validation['recommendation']}")
print(f"Train-test correlation with embargo: {validation['train_test_correlation']:.4f}")
```

Implementing embargo correctly requires careful attention to how the gap is applied. The embargo creates a "dead zone" between training and test sets where observations are excluded from both.
Key Implementation Considerations:
```python
import numpy as np
from typing import Generator, Tuple, Dict
from dataclasses import dataclass


@dataclass
class EmbargoConfig:
    """Configuration for embargo in time series CV."""
    forward_embargo: int        # Gap after training ends
    backward_embargo: int = 0   # Gap before test starts (for lookback in test targets)

    @property
    def total_gap(self) -> int:
        return self.forward_embargo + self.backward_embargo

    def describe(self) -> str:
        return (f"Forward embargo: {self.forward_embargo}, "
                f"Backward embargo: {self.backward_embargo}, "
                f"Total gap: {self.total_gap}")


def time_series_cv_with_embargo(
    n_samples: int,
    n_splits: int = 5,
    min_train_size: int = None,
    test_size: int = None,
    embargo: int = 0,
    expanding: bool = True
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Time series CV with embargo period between train and test.

    Parameters:
    -----------
    n_samples : int
        Total number of observations
    n_splits : int
        Maximum number of folds
    min_train_size : int
        Initial training size (defaults to n_samples // (n_splits + 2))
    test_size : int
        Test set size (defaults to n_samples // (n_splits + 2))
    embargo : int
        Number of observations to exclude between train and test
    expanding : bool
        If True, expanding window; if False, sliding window

    Yields:
    -------
    train_indices, test_indices
    """
    if min_train_size is None:
        min_train_size = n_samples // (n_splits + 2)
    if test_size is None:
        test_size = n_samples // (n_splits + 2)

    indices = np.arange(n_samples)

    for split in range(n_splits):
        # Calculate training window
        if expanding:
            train_start = 0
        else:
            train_start = split * test_size
        train_end = min_train_size + split * test_size

        # Apply embargo: test starts after embargo gap
        test_start = train_end + embargo
        test_end = test_start + test_size

        # Check if we have enough data
        if test_end > n_samples:
            break

        yield (
            indices[train_start:train_end],
            indices[test_start:test_end]
        )


def visualize_embargo(n_samples: int = 100, embargo: int = 10):
    """Visualize embargo gap in cross-validation."""
    print(f"TIME SERIES CV WITH EMBARGO = {embargo}")
    print("=" * 60)
    print("Legend: [###] Train | [   ] Embargo | [***] Test")
    print("=" * 60)

    min_train = 20
    test_size = 15

    for fold, (train, test) in enumerate(
        time_series_cv_with_embargo(
            n_samples, n_splits=5, min_train_size=min_train,
            test_size=test_size, embargo=embargo
        ), 1
    ):
        bar = ['.'] * 50
        scale = n_samples / 50

        # Mark training
        for i in train:
            bar[int(i / scale)] = '#'

        # Mark embargo (between train end and test start)
        embargo_start = train[-1] + 1
        embargo_end = test[0]
        for i in range(embargo_start, embargo_end):
            idx = int(i / scale)
            if idx < 50:
                bar[idx] = ' '

        # Mark test
        for i in test:
            idx = int(i / scale)
            if idx < 50:
                bar[idx] = '*'

        print(f"Fold {fold}: |{''.join(bar)}|")
        print(f"  Train[0:{train[-1]+1}] Embargo[{embargo_start}:{embargo_end}] Test[{test[0]}:{test[-1]+1}]")
        print()


def calculate_data_efficiency(
    n_samples: int,
    n_splits: int,
    min_train_size: int,
    test_size: int,
    embargo: int
) -> Dict:
    """
    Calculate how much data is "wasted" in embargo zones.
    """
    total_embargo_waste = n_splits * embargo

    # Maximum test end
    max_test_end = min_train_size + (n_splits - 1) * test_size + embargo + test_size

    # Efficiency
    if max_test_end <= n_samples:
        usable_fraction = 1 - (total_embargo_waste / (n_splits * (test_size + embargo)))
    else:
        usable_fraction = (n_samples - total_embargo_waste) / n_samples

    return {
        'total_embargo_observations': total_embargo_waste,
        'data_efficiency': usable_fraction,
        'effective_test_observations': n_splits * test_size,
        'observation_waste_fraction': total_embargo_waste / n_samples
    }


# Visualize the impact
visualize_embargo(n_samples=100, embargo=10)

efficiency = calculate_data_efficiency(
    n_samples=1000, n_splits=5, min_train_size=200,
    test_size=100, embargo=20
)
print(f"\nData Efficiency Analysis:")
print(f"Total observations in embargo zones: {efficiency['total_embargo_observations']}")
print(f"Fraction of data wasted to embargo: {efficiency['observation_waste_fraction']:.1%}")
```

Standard embargo creates a gap after the training set to prevent training features from incorporating test information. However, some scenarios require bidirectional embargo—gaps on both sides of the test set.
When Bidirectional Embargo is Needed:
1. Overlapping Labels on Both Sides If your labels use future information (e.g., the target is the return over the next 30 days) and the test set is short, labels attached to late training observations may overlap with test period outcomes.
2. Event Studies When studying events (earnings announcements, product launches), windows around events may span train/test boundaries in both directions.
3. Purging in Addition to Embargo Some frameworks (notably in finance) use "purging" to remove training observations whose labels overlap with any test observation, which is effectively bidirectional exclusion.
```python
import numpy as np
from typing import Generator, Tuple


def bidirectional_embargo_splits(
    n_samples: int,
    n_splits: int,
    min_train_size: int,
    test_size: int,
    forward_embargo: int,
    backward_embargo: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Time series CV with bidirectional embargo.

    Parameters:
    -----------
    n_samples : int
        Total observations
    n_splits : int
        Number of folds
    min_train_size : int
        Initial training size
    test_size : int
        Test set size
    forward_embargo : int
        Gap after training ends (before test starts)
    backward_embargo : int
        Gap after test ends (before next training can include those obs).
        In practice, this removes observations from training whose labels
        would overlap with test.

    Note: backward_embargo is more relevant for purging (covered in next page)
    """
    indices = np.arange(n_samples)
    total_gap = forward_embargo + backward_embargo

    for split in range(n_splits):
        train_end = min_train_size + split * (test_size + total_gap)
        test_start = train_end + forward_embargo
        test_end = test_start + test_size

        if test_end > n_samples:
            break

        # Standard training indices (may need purging - next page)
        train_indices = indices[:train_end]
        test_indices = indices[test_start:test_end]

        yield train_indices, test_indices


def apply_label_purging(
    train_indices: np.ndarray,
    test_indices: np.ndarray,
    label_horizon: int
) -> Tuple[np.ndarray, int]:
    """
    Remove training observations whose labels overlap with test period.

    If we're predicting k-step-ahead returns, a training observation at
    time t has a label that spans [t+1, t+k]. If this overlaps with the
    test period, we must exclude that training observation.

    Parameters:
    -----------
    train_indices : np.ndarray
        Original training indices
    test_indices : np.ndarray
        Test indices
    label_horizon : int
        How many steps ahead the label looks (e.g., 5 for 5-day return)

    Returns:
    --------
    Tuple[np.ndarray, int] : Purged training indices and the number purged
    """
    test_start = test_indices[0]

    # Training observation at time t has label ending at t + label_horizon
    # Remove if t + label_horizon >= test_start
    # i.e., keep only if t < test_start - label_horizon
    purge_cutoff = test_start - label_horizon
    purged_train = train_indices[train_indices < purge_cutoff]
    n_purged = len(train_indices) - len(purged_train)

    return purged_train, n_purged


def comprehensive_embargo_cv(
    n_samples: int,
    n_splits: int,
    min_train_size: int,
    test_size: int,
    forward_embargo: int,
    label_horizon: int
) -> Generator[Tuple[np.ndarray, np.ndarray, dict], None, None]:
    """
    Comprehensive CV with forward embargo AND label purging.

    Yields train_indices, test_indices, and metadata about what was removed.
    """
    indices = np.arange(n_samples)

    for split in range(n_splits):
        train_end = min_train_size + split * test_size
        test_start = train_end + forward_embargo
        test_end = test_start + test_size

        if test_end > n_samples:
            break

        # Initial training set (before purging)
        initial_train = indices[:train_end]
        test_indices = indices[test_start:test_end]

        # Apply purging based on label horizon
        purged_train, n_purged = apply_label_purging(
            initial_train, test_indices, label_horizon
        )

        metadata = {
            'fold': split + 1,
            'initial_train_size': len(initial_train),
            'purged_train_size': len(purged_train),
            'observations_purged': n_purged,
            'purge_fraction': n_purged / len(initial_train),
            'embargo_size': forward_embargo,
            'effective_gap': forward_embargo + n_purged  # Total separation
        }

        yield purged_train, test_indices, metadata


# Example
print("Comprehensive Embargo + Purging Example")
print("=" * 60)

for train, test, meta in comprehensive_embargo_cv(
    n_samples=500, n_splits=5, min_train_size=100,
    test_size=50, forward_embargo=10,
    label_horizon=20  # Predicting 20-step-ahead returns
):
    print(f"Fold {meta['fold']}:")
    print(f"  Initial train: {meta['initial_train_size']}")
    print(f"  After purging: {meta['purged_train_size']} ({meta['observations_purged']} purged)")
    print(f"  Purge rate: {meta['purge_fraction']:.1%}")
    print(f"  Effective gap: {meta['effective_gap']} observations")
    print()
```

Embargo excludes a fixed gap of observations between train and test. Purging dynamically removes training observations whose labels overlap with the test period. In practice, you often need both: embargo for feature leakage, purging for label leakage. The next page covers purging in comprehensive detail.
Different domains have established best practices for embargo periods. Here are guidelines from practitioners across major time series application areas:
| Domain | Typical Embargo | Rationale | Special Considerations |
|---|---|---|---|
| Equity trading (daily) | 5-21 trading days | Autocorrelation + momentum spillover | Avoid earnings announcement periods |
| High-frequency trading | Minutes to hours | Order book dynamics decay quickly | May need microsecond precision |
| Macro/economic forecasting | 1-3 months | Economic indicators have long lags | Publication delays matter |
| Weather prediction | Hours to days | Atmospheric autocorrelation | Match operational forecast horizon |
| Demand forecasting (retail) | 7-14 days | Weekly seasonality + promotions | Exclude holiday periods from embargo |
| Healthcare outcomes | Often unnecessary | Patient observations may be independent | Verify no temporal dependencies first |
| Fraud detection | 1+ event window | Fraud patterns evolve | Account for detection delay |
In quantitative finance, insufficient embargo is one of the leading causes of backtest overfitting. Practitioners often use 5× the maximum feature lookback or 2× the forecast horizon, whichever is larger. When in doubt, use more embargo—the cost is data efficiency, but the benefit is avoiding false discoveries.
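That heuristic can be written as a one-line helper (the function name and its inputs are our own illustration, not a standard API):

```python
def rule_of_thumb_embargo(max_feature_lookback: int, forecast_horizon: int) -> int:
    """Practitioner heuristic from the text: 5x the longest feature lookback
    or 2x the forecast horizon, whichever is larger."""
    return max(5 * max_feature_lookback, 2 * forecast_horizon)

# 30-day moving-average features, 5-day-ahead target: the lookback term binds
print(rule_of_thumb_embargo(30, 5))   # 150
# Short lookbacks, long horizon: the horizon term binds
print(rule_of_thumb_embargo(2, 40))   # 80
```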
Rather than relying solely on theoretical embargo calculations, validate empirically that your embargo successfully decorrelates training and test sets.
Validation Approaches:
1. Train-Test Residual Correlation After fitting on training data, compute residuals. If test period residuals are correlated with late training residuals, leakage is present.
2. Permutation Testing Shuffle the test period labels; if shuffled performance is comparable to actual performance, your model may be exploiting leakage rather than genuine patterns.
3. Incremental Embargo Testing Vary embargo from 0 to 2× your theoretical value; plot CV performance. Genuine signal should be relatively stable; leakage shows steep decline as embargo increases.
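Approach 2 can be sketched on synthetic data (the Ridge model, sizes, and data-generating process below are assumptions for illustration): fit once, then compare test error against error on shuffled test labels. Genuine skill shows up as actual error well below the shuffled errors; comparable errors mean the model has no real signal on the test period.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: contemporaneous signal in X plus an AR(1) noise component
n = 400
X = rng.normal(size=(n, 3))
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + X[t] @ np.array([0.5, 0.3, 0.1]) + rng.normal() * 0.2

X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
preds = model.predict(X_te)
actual_mse = mean_squared_error(y_te, preds)

# Shuffle the TEST labels many times and rescore; the gap between actual and
# shuffled error measures how much of the score reflects genuine structure
perm_mses = [mean_squared_error(rng.permutation(y_te), preds) for _ in range(200)]
print(f"actual MSE: {actual_mse:.3f}, mean shuffled MSE: {np.mean(perm_mses):.3f}")
```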
```python
import numpy as np
from sklearn.base import clone, BaseEstimator
from sklearn.metrics import mean_squared_error
from typing import List, Dict


def validate_embargo_empirically(
    X: np.ndarray,
    y: np.ndarray,
    model: BaseEstimator,
    embargo_values: List[int],
    n_splits: int = 5,
    min_train_size: int = None,
    test_size: int = None
) -> Dict:
    """
    Test whether embargo is sufficient by measuring performance vs embargo length.

    If performance drops sharply as embargo increases, leakage was present.
    If performance is stable, embargo is sufficient (or leakage isn't the issue).
    """
    n_samples = len(X)
    if min_train_size is None:
        min_train_size = n_samples // (n_splits + 2)
    if test_size is None:
        test_size = n_samples // (n_splits + 2)

    results = {}

    for embargo in embargo_values:
        fold_scores = []

        for split in range(n_splits):
            train_end = min_train_size + split * test_size
            test_start = train_end + embargo
            test_end = test_start + test_size

            if test_end > n_samples:
                break

            X_train = X[:train_end]
            y_train = y[:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]

            fold_model = clone(model)
            fold_model.fit(X_train, y_train)
            predictions = fold_model.predict(X_test)
            rmse = np.sqrt(mean_squared_error(y_test, predictions))
            fold_scores.append(rmse)

        results[embargo] = {
            'mean_rmse': np.mean(fold_scores),
            'std_rmse': np.std(fold_scores),
            'n_folds': len(fold_scores)
        }

    # Analyze pattern
    embargo_list = sorted(results.keys())
    performance_list = [results[e]['mean_rmse'] for e in embargo_list]

    # Check for significant degradation (sign of leakage at lower embargo)
    if len(embargo_list) >= 2:
        low_embargo_perf = performance_list[0]
        high_embargo_perf = performance_list[-1]
        degradation = (high_embargo_perf - low_embargo_perf) / low_embargo_perf
    else:
        degradation = 0

    return {
        'by_embargo': results,
        'degradation_fraction': degradation,
        'leakage_detected': degradation > 0.1,  # >10% degradation suggests leakage
        'recommendation': (
            f'Leakage detected: performance degrades {degradation:.1%} from embargo=0. '
            f'Use embargo≥{embargo_list[-1]}'
            if degradation > 0.1
            else f'No significant leakage detected. Embargo={embargo_list[0]} appears sufficient.'
        )
    }


def residual_correlation_test(
    X: np.ndarray,
    y: np.ndarray,
    model: BaseEstimator,
    train_end: int,
    test_start: int,
    test_end: int,
    n_residual_lags: int = 10
) -> Dict:
    """
    Test for leakage by checking if test residuals correlate with training residuals.
    """
    # Fit model
    model = clone(model)
    X_train, y_train = X[:train_end], y[:train_end]
    X_test, y_test = X[test_start:test_end], y[test_start:test_end]
    model.fit(X_train, y_train)

    # Training residuals (last few before embargo)
    train_preds = model.predict(X_train)
    train_residuals = y_train - train_preds
    late_train_residuals = train_residuals[-n_residual_lags:]

    # Test residuals
    test_preds = model.predict(X_test)
    test_residuals = y_test - test_preds
    early_test_residuals = test_residuals[:n_residual_lags]

    # Cross-correlation
    if len(late_train_residuals) == len(early_test_residuals):
        correlation = np.corrcoef(late_train_residuals, early_test_residuals)[0, 1]
    else:
        correlation = np.nan

    return {
        'residual_correlation': correlation,
        'leakage_suspected': abs(correlation) > 0.2,
        'embargo_gap': test_start - train_end,
        'interpretation': (
            'Residual correlation is low—embargo appears adequate'
            if abs(correlation) < 0.2
            else f'Residual correlation={correlation:.2f} suggests information spillover'
        )
    }


# Example usage
from sklearn.linear_model import Ridge

np.random.seed(42)

# Create data with autocorrelation
n = 500
X = np.random.randn(n, 5)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t-1] + X[t] @ np.array([0.5, 0.3, 0.2, 0.1, 0.0]) + np.random.randn() * 0.2

# Test different embargo values
validation = validate_embargo_empirically(
    X, y,
    model=Ridge(alpha=1.0),
    embargo_values=[0, 5, 10, 20, 30],
    n_splits=5
)

print("Embargo Validation Results")
print("=" * 50)
for embargo, stats in validation['by_embargo'].items():
    print(f"Embargo = {embargo:2d}: RMSE = {stats['mean_rmse']:.4f} ± {stats['std_rmse']:.4f}")

print(f"\n{validation['recommendation']}")
```

Embargo periods are a critical but often overlooked component of time series cross-validation, preventing subtle temporal leakage that can dramatically inflate performance estimates.
The next page covers purging—a complementary technique that removes training observations whose labels overlap with the test period. Together, embargo and purging form a comprehensive defense against temporal leakage in sophisticated time series applications.