Even with proper forward chaining (expanding or sliding window), a subtle but devastating form of information leakage can occur: temporal proximity leakage. When training and test sets are immediately adjacent in time, features computed from the training set may contain information that "bleeds" into the test period through autocorrelation, overlapping computation windows, or lagged effects.
Embargo periods (also called gaps or buffers) address this by inserting a temporal buffer zone between the training set's end and the test set's start. Observations within the embargo are excluded from both training and testing, creating a clean separation.
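In index terms, the idea looks like this (the sizes below are invented for illustration): observations inside the embargo zone belong to neither the training nor the test set.

```python
train_end, embargo, test_size = 60, 10, 20

train_idx = list(range(0, train_end))                        # 0..59: training
embargo_idx = list(range(train_end, train_end + embargo))    # 60..69: used by neither set
test_idx = list(range(train_end + embargo,
                      train_end + embargo + test_size))      # 70..89: testing

# The gap between the last training point and the first test point is the embargo
print(test_idx[0] - train_idx[-1] - 1)  # 10
```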
This technique is particularly critical in domains like quantitative finance, where even small amounts of look-ahead bias can produce models that appear profitable in backtesting but fail catastrophically in live trading.
By the end of this page, you will understand why simple adjacency creates leakage, how to determine appropriate embargo lengths, how to implement embargo in your CV pipelines, and when embargo is essential versus optional.
Consider a forward chaining setup where the training set ends at time t and the test set begins immediately at t+1. Several mechanisms can cause information from the test period to contaminate the training process:
Mechanism 1: Overlapping Technical Indicators
Many time series features involve rolling computations (moving averages, rolling volatility, momentum indicators). A 30-day moving average computed at time t includes observations from t-29 to t. If the test set starts at t+1, the moving average at that first test point includes observations from t-28 to t+1, so the first 29 test features share raw observations with the training period; every feature computed near the boundary straddles both sets.
More critically, if you compute features using the entire dataset before splitting (a common mistake), test period values directly leak into training features.
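A minimal sketch of that compute-before-splitting mistake, using full-sample standardization as the leaky feature (the series and split point below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(loc=5.0, scale=2.0, size=100)
split = 70  # train = [0, 70), test = [70, 100)

# WRONG: standardize with statistics of the ENTIRE series before splitting;
# the test-period mean and std contaminate every training feature
z_leaky = (series - series.mean()) / series.std()

# RIGHT: fit the statistics on the training portion only, then apply to both
train = series[:split]
z_clean_train = (train - train.mean()) / train.std()
z_clean_test = (series[split:] - train.mean()) / train.std()

# The training features differ between the two versions — that difference is leakage
print(np.allclose(z_leaky[:split], z_clean_train))  # False
```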
Mechanism 2: Autocorrelation Spillover
Most time series exhibit autocorrelation—observations near each other are correlated. If today's value predicts tomorrow's (lag-1 autocorrelation = 0.8), and the training set ends today, then information about tomorrow (the test point) is partially encoded in today's observation, which is in the training set.
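This spillover is easy to measure on a simulated AR(1) process (the coefficient, series length, and gap below are illustrative): correlation across a gap of g steps decays roughly as ρ^g, so a 10-step gap shrinks ρ = 0.8 to about 0.11.

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho = 20_000, 0.8
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()

# Adjacent observations (training end vs an immediately adjacent test start)
adjacent = np.corrcoef(x[:-1], x[1:])[0, 1]
# Observations separated by a 10-step embargo: theory predicts rho**10 ≈ 0.11
gapped = np.corrcoef(x[:-10], x[10:])[0, 1]
print(f"lag-1 correlation:  {adjacent:.2f}")
print(f"lag-10 correlation: {gapped:.2f}")
```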
Mechanism 3: Label/Target Leakage
For multi-step ahead predictions, the target variable may look multiple steps into the future. Without proper embargo, observations whose targets overlap with the test period can appear in training.
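A quick sketch of the overlap check (the horizon and split point are made up): a training observation at time t whose label is the value at t + horizon leaks whenever that label lands inside the test period.

```python
import numpy as np

horizon = 5        # the target at time t is the value 5 steps ahead
test_start = 70    # test period begins at index 70

# A training observation at time t leaks whenever its label lands in the test
# period, i.e. whenever t + horizon >= test_start
train_times = np.arange(test_start)
overlapping = train_times[train_times + horizon >= test_start]
print(overlapping)  # [65 66 67 68 69]
```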
| Leakage Source | Example | Required Embargo |
|---|---|---|
| Rolling features | 30-day moving average | ≥ Rolling window length |
| Multi-step targets | Predict 5-day return | ≥ Forecast horizon |
| Autocorrelation | AR(1) with ρ = 0.8 | ≥ Autocorrelation decay time |
| Label smoothing | 7-day average target | ≥ Smoothing window |
| Event windows | Earnings ±3 days | ≥ Event window half-width |
In quantitative finance, studies have shown that backtests without proper embargo can overestimate performance by 20-40%. A strategy appearing to generate 15% annual returns may actually produce 0% or negative returns when embargo is properly applied. This isn't a minor calibration issue—it's the difference between a profitable and unprofitable strategy.
Embargo length is not one-size-fits-all. It depends on your feature engineering, prediction horizon, and data characteristics. Here's a systematic approach to determining the right embargo:
Step 1: Inventory Feature Dependencies
List every rolling or lagged feature together with its lookback window. The longest window sets the feature-based floor for the embargo.
Step 2: Identify Target Dependencies
Note the forecast horizon and any target aggregation window (for example, a 5-day forward return looks five steps into the future).
Step 3: Measure Autocorrelation Decay
Compute autocorrelation function (ACF) and identify where correlation drops below threshold (e.g., 0.1):
```python
import numpy as np
from typing import List, Dict
from statsmodels.tsa.stattools import acf


def calculate_required_embargo(
    feature_windows: List[int],
    forecast_horizon: int,
    target_window: int = 1,
    series: np.ndarray = None,
    acf_threshold: float = 0.1,
    max_acf_lag: int = 50
) -> Dict:
    """
    Calculate required embargo based on feature engineering and data properties.

    Parameters:
    -----------
    feature_windows : List[int]
        Lookback windows for all rolling features
    forecast_horizon : int
        How many steps ahead you're predicting
    target_window : int
        If target is aggregated (e.g., 5-day return), the window size
    series : np.ndarray, optional
        Time series data for autocorrelation analysis
    acf_threshold : float
        ACF threshold below which correlation is considered negligible
    max_acf_lag : int
        Maximum lag to check for autocorrelation

    Returns:
    --------
    Dict with embargo recommendation and breakdown
    """
    components = {
        'max_feature_window': max(feature_windows) if feature_windows else 0,
        'forecast_horizon': forecast_horizon,
        'target_window': target_window,
    }

    # Feature-based embargo (prevent feature overlap)
    feature_embargo = components['max_feature_window']

    # Target-based embargo (prevent target leakage)
    target_embargo = forecast_horizon + target_window - 1

    # Autocorrelation-based embargo
    acf_embargo = 0
    if series is not None:
        # Compute ACF
        acf_values = acf(series, nlags=max_acf_lag, fft=False)

        # Find lag where ACF drops below threshold
        for lag, value in enumerate(acf_values):
            if abs(value) < acf_threshold:
                acf_embargo = lag
                break
        else:
            acf_embargo = max_acf_lag  # Never dropped below threshold

        components['acf_decay_lag'] = acf_embargo
        components['acf_at_recommended'] = acf_values[min(acf_embargo, len(acf_values) - 1)]

    # Total recommended embargo is maximum of all components
    recommended = max(feature_embargo, target_embargo, acf_embargo)

    # Add safety margin (20% extra, minimum 1)
    safety_margin = max(1, int(recommended * 0.2))
    final_recommendation = recommended + safety_margin

    return {
        'components': components,
        'feature_embargo': feature_embargo,
        'target_embargo': target_embargo,
        'acf_embargo': acf_embargo,
        'base_recommendation': recommended,
        'safety_margin': safety_margin,
        'final_recommendation': final_recommendation,
        'binding_constraint': (
            'feature_windows' if feature_embargo >= target_embargo and feature_embargo >= acf_embargo
            else 'target_horizon' if target_embargo >= acf_embargo
            else 'autocorrelation'
        )
    }


def validate_embargo(
    embargo: int,
    series: np.ndarray,
    n_simulations: int = 1000
) -> Dict:
    """
    Validate embargo length through simulation.

    Simulates whether the embargo successfully decorrelates train/test.
    """
    n = len(series)
    correlations = []

    for _ in range(n_simulations):
        # Random train/test split point
        split = np.random.randint(embargo + 10, n - 10)
        # Last training observation
        train_end_value = series[split - embargo - 1]
        # First test observation
        test_start_value = series[split]
        correlations.append((train_end_value, test_start_value))

    train_vals = [c[0] for c in correlations]
    test_vals = [c[1] for c in correlations]
    correlation = np.corrcoef(train_vals, test_vals)[0, 1]

    return {
        'embargo': embargo,
        'train_test_correlation': correlation,
        'is_sufficient': abs(correlation) < 0.1,
        'recommendation': (
            'Embargo is sufficient' if abs(correlation) < 0.1
            else 'Consider increasing embargo' if abs(correlation) < 0.2
            else 'Embargo is too short'
        )
    }


# Example usage
np.random.seed(42)

# Simulate an AR(1) process with strong autocorrelation
n = 1000
ar_coef = 0.9
series = np.zeros(n)
series[0] = np.random.randn()
for t in range(1, n):
    series[t] = ar_coef * series[t - 1] + np.random.randn() * 0.5

# Calculate required embargo
embargo_info = calculate_required_embargo(
    feature_windows=[5, 10, 20, 30],  # Various rolling windows
    forecast_horizon=5,               # Predict 5 days ahead
    target_window=5,                  # Target is 5-day return
    series=series,
    acf_threshold=0.1
)

print("Embargo Calculation")
print("=" * 50)
print(f"Feature-based embargo: {embargo_info['feature_embargo']}")
print(f"Target-based embargo: {embargo_info['target_embargo']}")
print(f"ACF-based embargo: {embargo_info['acf_embargo']}")
print(f"\nBinding constraint: {embargo_info['binding_constraint']}")
print(f"Final recommendation: {embargo_info['final_recommendation']} periods")

# Validate the embargo
validation = validate_embargo(embargo_info['final_recommendation'], series)
print(f"\nValidation: {validation['recommendation']}")
print(f"Train-test correlation with embargo: {validation['train_test_correlation']:.4f}")
```

Implementing embargo correctly requires careful attention to how the gap is applied. The embargo creates a "dead zone" between training and test sets where observations are excluded from both.
Key Implementation Considerations:
```python
import numpy as np
from typing import Generator, Tuple, Dict
from dataclasses import dataclass


@dataclass
class EmbargoConfig:
    """Configuration for embargo in time series CV."""
    forward_embargo: int        # Gap after training ends
    backward_embargo: int = 0   # Gap before test starts (for lookback in test targets)

    @property
    def total_gap(self) -> int:
        return self.forward_embargo + self.backward_embargo

    def describe(self) -> str:
        return (f"Forward embargo: {self.forward_embargo}, "
                f"Backward embargo: {self.backward_embargo}, "
                f"Total gap: {self.total_gap}")


def time_series_cv_with_embargo(
    n_samples: int,
    n_splits: int = 5,
    min_train_size: int = None,
    test_size: int = None,
    embargo: int = 0,
    expanding: bool = True
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Time series CV with embargo period between train and test.

    Parameters:
    -----------
    n_samples : int
        Total number of observations
    n_splits : int
        Maximum number of folds
    min_train_size : int
        Initial training size (defaults to n_samples // (n_splits + 2))
    test_size : int
        Test set size (defaults to n_samples // (n_splits + 2))
    embargo : int
        Number of observations to exclude between train and test
    expanding : bool
        If True, expanding window; if False, sliding window

    Yields:
    -------
    train_indices, test_indices
    """
    if min_train_size is None:
        min_train_size = n_samples // (n_splits + 2)
    if test_size is None:
        test_size = n_samples // (n_splits + 2)

    indices = np.arange(n_samples)

    for split in range(n_splits):
        # Calculate training window
        if expanding:
            train_start = 0
        else:
            train_start = split * test_size
        train_end = min_train_size + split * test_size

        # Apply embargo: test starts after embargo gap
        test_start = train_end + embargo
        test_end = test_start + test_size

        # Check if we have enough data
        if test_end > n_samples:
            break

        yield (
            indices[train_start:train_end],
            indices[test_start:test_end]
        )


def visualize_embargo(n_samples: int = 100, embargo: int = 10):
    """Visualize embargo gap in cross-validation."""
    print(f"TIME SERIES CV WITH EMBARGO = {embargo}")
    print("=" * 60)
    print("Legend: [###] Train | [   ] Embargo | [***] Test")
    print("=" * 60)

    min_train = 20
    test_size = 15

    for fold, (train, test) in enumerate(
        time_series_cv_with_embargo(
            n_samples, n_splits=5, min_train_size=min_train,
            test_size=test_size, embargo=embargo
        ), 1
    ):
        bar = ['.'] * 50
        scale = n_samples / 50

        # Mark training
        for i in train:
            bar[int(i / scale)] = '#'

        # Mark embargo (between train end and test start)
        embargo_start = train[-1] + 1
        embargo_end = test[0]
        for i in range(embargo_start, embargo_end):
            idx = int(i / scale)
            if idx < 50:
                bar[idx] = ' '

        # Mark test
        for i in test:
            idx = int(i / scale)
            if idx < 50:
                bar[idx] = '*'

        print(f"Fold {fold}: |{''.join(bar)}|")
        print(f"  Train[0:{train[-1]+1}] Embargo[{embargo_start}:{embargo_end}] Test[{test[0]}:{test[-1]+1}]")
        print()


def calculate_data_efficiency(
    n_samples: int,
    n_splits: int,
    min_train_size: int,
    test_size: int,
    embargo: int
) -> Dict:
    """
    Calculate how much data is "wasted" in embargo zones.
    """
    total_embargo_waste = n_splits * embargo

    # Maximum test end
    max_test_end = min_train_size + (n_splits - 1) * test_size + embargo + test_size

    # Efficiency
    if max_test_end <= n_samples:
        usable_fraction = 1 - (total_embargo_waste / (n_splits * (test_size + embargo)))
    else:
        usable_fraction = (n_samples - total_embargo_waste) / n_samples

    return {
        'total_embargo_observations': total_embargo_waste,
        'data_efficiency': usable_fraction,
        'effective_test_observations': n_splits * test_size,
        'observation_waste_fraction': total_embargo_waste / n_samples
    }


# Visualize the impact
visualize_embargo(n_samples=100, embargo=10)

efficiency = calculate_data_efficiency(
    n_samples=1000, n_splits=5, min_train_size=200,
    test_size=100, embargo=20
)
print(f"\nData Efficiency Analysis:")
print(f"Total observations in embargo zones: {efficiency['total_embargo_observations']}")
print(f"Fraction of data wasted to embargo: {efficiency['observation_waste_fraction']:.1%}")
```

Standard embargo creates a gap after the training set to prevent training features from incorporating test information. However, some scenarios require bidirectional embargo—gaps on both sides of the test set.
When Bidirectional Embargo is Needed:
1. Overlapping Labels on Both Sides If your labels use future information (e.g., the target is the return over the next 30 days) and the test set is short, labels attached to late training observations may overlap with test period outcomes.
2. Event Studies When studying events (earnings announcements, product launches), windows around events may span train/test boundaries in both directions.
3. Purging in Addition to Embargo Some frameworks (notably in finance) use "purging" to remove training observations whose labels overlap with any test observation, which is effectively bidirectional exclusion.
```python
import numpy as np
from typing import Generator, Tuple


def bidirectional_embargo_splits(
    n_samples: int,
    n_splits: int,
    min_train_size: int,
    test_size: int,
    forward_embargo: int,
    backward_embargo: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Time series CV with bidirectional embargo.

    Parameters:
    -----------
    n_samples : int
        Total observations
    n_splits : int
        Number of folds
    min_train_size : int
        Initial training size
    test_size : int
        Test set size
    forward_embargo : int
        Gap after training ends (before test starts)
    backward_embargo : int
        Gap after test ends (before next training can include those obs).
        In practice, this removes observations from training whose labels
        would overlap with test.

    Note: backward_embargo is more relevant for purging (covered in next page)
    """
    indices = np.arange(n_samples)
    total_gap = forward_embargo + backward_embargo

    for split in range(n_splits):
        train_end = min_train_size + split * (test_size + total_gap)
        test_start = train_end + forward_embargo
        test_end = test_start + test_size

        if test_end > n_samples:
            break

        # Standard training indices (may need purging - next page)
        train_indices = indices[:train_end]
        test_indices = indices[test_start:test_end]

        yield train_indices, test_indices


def apply_label_purging(
    train_indices: np.ndarray,
    test_indices: np.ndarray,
    label_horizon: int
) -> Tuple[np.ndarray, int]:
    """
    Remove training observations whose labels overlap with test period.

    If we're predicting k-step-ahead returns, a training observation at
    time t has a label that spans [t+1, t+k]. If this overlaps with the
    test period, we must exclude that training observation.

    Parameters:
    -----------
    train_indices : np.ndarray
        Original training indices
    test_indices : np.ndarray
        Test indices
    label_horizon : int
        How many steps ahead the label looks (e.g., 5 for 5-day return)

    Returns:
    --------
    Tuple[np.ndarray, int] : Purged training indices and the number purged
    """
    test_start = test_indices[0]

    # Training observation at time t has label ending at t + label_horizon
    # Remove if t + label_horizon >= test_start
    # i.e., keep only if t < test_start - label_horizon
    purge_cutoff = test_start - label_horizon
    purged_train = train_indices[train_indices < purge_cutoff]
    n_purged = len(train_indices) - len(purged_train)

    return purged_train, n_purged


def comprehensive_embargo_cv(
    n_samples: int,
    n_splits: int,
    min_train_size: int,
    test_size: int,
    forward_embargo: int,
    label_horizon: int
) -> Generator[Tuple[np.ndarray, np.ndarray, dict], None, None]:
    """
    Comprehensive CV with forward embargo AND label purging.

    Yields train_indices, test_indices, and metadata about what was removed.
    """
    indices = np.arange(n_samples)

    for split in range(n_splits):
        train_end = min_train_size + split * test_size
        test_start = train_end + forward_embargo
        test_end = test_start + test_size

        if test_end > n_samples:
            break

        # Initial training set (before purging)
        initial_train = indices[:train_end]
        test_indices = indices[test_start:test_end]

        # Apply purging based on label horizon
        purged_train, n_purged = apply_label_purging(
            initial_train, test_indices, label_horizon
        )

        metadata = {
            'fold': split + 1,
            'initial_train_size': len(initial_train),
            'purged_train_size': len(purged_train),
            'observations_purged': n_purged,
            'purge_fraction': n_purged / len(initial_train),
            'embargo_size': forward_embargo,
            'effective_gap': forward_embargo + n_purged  # Total separation
        }

        yield purged_train, test_indices, metadata


# Example
print("Comprehensive Embargo + Purging Example")
print("=" * 60)

for train, test, meta in comprehensive_embargo_cv(
    n_samples=500, n_splits=5, min_train_size=100,
    test_size=50, forward_embargo=10,
    label_horizon=20  # Predicting 20-step-ahead returns
):
    print(f"Fold {meta['fold']}:")
    print(f"  Initial train: {meta['initial_train_size']}")
    print(f"  After purging: {meta['purged_train_size']} ({meta['observations_purged']} purged)")
    print(f"  Purge rate: {meta['purge_fraction']:.1%}")
    print(f"  Effective gap: {meta['effective_gap']} observations")
    print()
```

Embargo excludes a fixed gap of observations between train and test. Purging dynamically removes training observations whose labels overlap with the test period. In practice, you often need both: embargo for feature leakage, purging for label leakage. The next page covers purging in comprehensive detail.
Different domains have established best practices for embargo periods. Here are guidelines from practitioners across major time series application areas:
| Domain | Typical Embargo | Rationale | Special Considerations |
|---|---|---|---|
| Equity trading (daily) | 5-21 trading days | Autocorrelation + momentum spillover | Avoid earnings announcement periods |
| High-frequency trading | Minutes to hours | Order book dynamics decay quickly | May need microsecond precision |
| Macro/economic forecasting | 1-3 months | Economic indicators have long lags | Publication delays matter |
| Weather prediction | Hours to days | Atmospheric autocorrelation | Match operational forecast horizon |
| Demand forecasting (retail) | 7-14 days | Weekly seasonality + promotions | Exclude holiday periods from embargo |
| Healthcare outcomes | Often unnecessary | Patient observations may be independent | Verify no temporal dependencies first |
| Fraud detection | 1+ event window | Fraud patterns evolve | Account for detection delay |
In quantitative finance, insufficient embargo is one of the leading causes of backtest overfitting. Practitioners often use 5× the maximum feature lookback or 2× the forecast horizon, whichever is larger. When in doubt, use more embargo—the cost is data efficiency, but the benefit is avoiding false discoveries.
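That heuristic can be written as a one-line helper (the function name and its inputs are our own illustration, not a standard API):

```python
def rule_of_thumb_embargo(max_feature_lookback: int, forecast_horizon: int) -> int:
    """Practitioner heuristic from the text: 5x the longest feature lookback
    or 2x the forecast horizon, whichever is larger."""
    return max(5 * max_feature_lookback, 2 * forecast_horizon)

# 30-day moving-average features, 5-day-ahead target: the lookback term binds
print(rule_of_thumb_embargo(30, 5))   # 150
# Short lookbacks, long horizon: the horizon term binds
print(rule_of_thumb_embargo(2, 40))   # 80
```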
Rather than relying solely on theoretical embargo calculations, validate empirically that your embargo successfully decorrelates training and test sets.
Validation Approaches:
1. Train-Test Residual Correlation After fitting on training data, compute residuals. If test period residuals are correlated with late training residuals, leakage is present.
2. Permutation Testing Shuffle the test period labels; if shuffled performance is comparable to actual performance, your model may be exploiting leakage rather than genuine patterns.
3. Incremental Embargo Testing Vary embargo from 0 to 2× your theoretical value; plot CV performance. Genuine signal should be relatively stable; leakage shows steep decline as embargo increases.
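Approach 2 can be sketched on synthetic data (the Ridge model, sizes, and data-generating process below are assumptions for illustration): fit once, then compare test error against error on shuffled test labels. Genuine skill shows up as actual error well below the shuffled errors; comparable errors mean the model has no real signal on the test period.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: contemporaneous signal in X plus an AR(1) noise component
n = 400
X = rng.normal(size=(n, 3))
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + X[t] @ np.array([0.5, 0.3, 0.1]) + rng.normal() * 0.2

X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
preds = model.predict(X_te)
actual_mse = mean_squared_error(y_te, preds)

# Shuffle the TEST labels many times and rescore; the gap between actual and
# shuffled error measures how much of the score reflects genuine structure
perm_mses = [mean_squared_error(rng.permutation(y_te), preds) for _ in range(200)]
print(f"actual MSE: {actual_mse:.3f}, mean shuffled MSE: {np.mean(perm_mses):.3f}")
```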
```python
import numpy as np
from sklearn.base import clone, BaseEstimator
from sklearn.metrics import mean_squared_error
from typing import List, Dict


def validate_embargo_empirically(
    X: np.ndarray,
    y: np.ndarray,
    model: BaseEstimator,
    embargo_values: List[int],
    n_splits: int = 5,
    min_train_size: int = None,
    test_size: int = None
) -> Dict:
    """
    Test whether embargo is sufficient by measuring performance vs embargo length.

    If performance drops sharply as embargo increases, leakage was present.
    If performance is stable, embargo is sufficient (or leakage isn't the issue).
    """
    n_samples = len(X)
    if min_train_size is None:
        min_train_size = n_samples // (n_splits + 2)
    if test_size is None:
        test_size = n_samples // (n_splits + 2)

    results = {}

    for embargo in embargo_values:
        fold_scores = []

        for split in range(n_splits):
            train_end = min_train_size + split * test_size
            test_start = train_end + embargo
            test_end = test_start + test_size

            if test_end > n_samples:
                break

            X_train = X[:train_end]
            y_train = y[:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]

            fold_model = clone(model)
            fold_model.fit(X_train, y_train)
            predictions = fold_model.predict(X_test)
            rmse = np.sqrt(mean_squared_error(y_test, predictions))
            fold_scores.append(rmse)

        results[embargo] = {
            'mean_rmse': np.mean(fold_scores),
            'std_rmse': np.std(fold_scores),
            'n_folds': len(fold_scores)
        }

    # Analyze pattern
    embargo_list = sorted(results.keys())
    performance_list = [results[e]['mean_rmse'] for e in embargo_list]

    # Check for significant degradation (sign of leakage at lower embargo)
    if len(embargo_list) >= 2:
        low_embargo_perf = performance_list[0]
        high_embargo_perf = performance_list[-1]
        degradation = (high_embargo_perf - low_embargo_perf) / low_embargo_perf
    else:
        degradation = 0

    return {
        'by_embargo': results,
        'degradation_fraction': degradation,
        'leakage_detected': degradation > 0.1,  # >10% degradation suggests leakage
        'recommendation': (
            f'Leakage detected: performance degrades {degradation:.1%} from embargo=0. '
            f'Use embargo≥{embargo_list[-1]}'
            if degradation > 0.1
            else f'No significant leakage detected. Embargo={embargo_list[0]} appears sufficient.'
        )
    }


def residual_correlation_test(
    X: np.ndarray,
    y: np.ndarray,
    model: BaseEstimator,
    train_end: int,
    test_start: int,
    test_end: int,
    n_residual_lags: int = 10
) -> Dict:
    """
    Test for leakage by checking if test residuals correlate with training residuals.
    """
    # Fit model
    model = clone(model)
    X_train, y_train = X[:train_end], y[:train_end]
    X_test, y_test = X[test_start:test_end], y[test_start:test_end]
    model.fit(X_train, y_train)

    # Training residuals (last few before embargo)
    train_preds = model.predict(X_train)
    train_residuals = y_train - train_preds
    late_train_residuals = train_residuals[-n_residual_lags:]

    # Test residuals
    test_preds = model.predict(X_test)
    test_residuals = y_test - test_preds
    early_test_residuals = test_residuals[:n_residual_lags]

    # Cross-correlation
    if len(late_train_residuals) == len(early_test_residuals):
        correlation = np.corrcoef(late_train_residuals, early_test_residuals)[0, 1]
    else:
        correlation = np.nan

    return {
        'residual_correlation': correlation,
        'leakage_suspected': abs(correlation) > 0.2,
        'embargo_gap': test_start - train_end,
        'interpretation': (
            'Residual correlation is low—embargo appears adequate'
            if abs(correlation) < 0.2
            else f'Residual correlation={correlation:.2f} suggests information spillover'
        )
    }


# Example usage
from sklearn.linear_model import Ridge

np.random.seed(42)

# Create data with autocorrelation
n = 500
X = np.random.randn(n, 5)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t-1] + X[t] @ np.array([0.5, 0.3, 0.2, 0.1, 0.0]) + np.random.randn() * 0.2

# Test different embargo values
validation = validate_embargo_empirically(
    X, y,
    model=Ridge(alpha=1.0),
    embargo_values=[0, 5, 10, 20, 30],
    n_splits=5
)

print("Embargo Validation Results")
print("=" * 50)
for embargo, stats in validation['by_embargo'].items():
    print(f"Embargo = {embargo:2d}: RMSE = {stats['mean_rmse']:.4f} ± {stats['std_rmse']:.4f}")

print(f"\n{validation['recommendation']}")
```

Embargo periods are a critical but often overlooked component of time series cross-validation, preventing subtle temporal leakage that can dramatically inflate performance estimates.
The next page covers purging—a complementary technique that removes training observations whose labels overlap with the test period. Together, embargo and purging form a comprehensive defense against temporal leakage in sophisticated time series applications.