Among all forms of domain shift, covariate shift is the most tractable and best understood. It occurs when the marginal distribution of inputs changes between training and deployment, but the conditional relationship between inputs and outputs remains constant.
Mathematically: $$P_S(X) \neq P_T(X) \quad \text{but} \quad P_S(Y|X) = P_T(Y|X)$$
This assumption—that the "rules" mapping inputs to outputs don't change—enables principled correction through importance weighting. If we knew exactly how the input distribution shifted, we could reweight training samples to match the target distribution and obtain an optimal classifier.
This page covers the theory of covariate shift, importance weighting for correction, density ratio estimation methods, practical challenges, and when covariate shift assumptions hold.
The key insight is that training on reweighted source samples can recover the optimal target classifier.
Risk Under Target Distribution:
$$R_T(h) = \mathbb{E}_{(x,y) \sim P_T}[\ell(h(x), y)]$$
We want to minimize this but only have samples from $P_S$. Using importance sampling:
$$R_T(h) = \mathbb{E}_{(x,y) \sim P_S}\left[\frac{P_T(x,y)}{P_S(x,y)} \ell(h(x), y)\right]$$
Under covariate shift ($P_S(Y|X) = P_T(Y|X)$):
$$\frac{P_T(x,y)}{P_S(x,y)} = \frac{P_T(x)P_T(y|x)}{P_S(x)P_S(y|x)} = \frac{P_T(x)}{P_S(x)} = w(x)$$
The importance weight $w(x)$ only depends on input distributions!
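This identity can be sanity-checked numerically on a toy 1-D problem where both densities are known Gaussians (the means, the choice of $f$, and the sample sizes below are illustrative, not from any particular benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_s, mu_t, sigma = 0.0, 1.0, 1.0       # assumed toy setup: two 1-D Gaussians
x_s = rng.normal(mu_s, sigma, 100_000)  # samples from the source P_S

def gauss_pdf(x, mu, sd):
    return np.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

# Importance weight w(x) = P_T(x) / P_S(x), known in closed form here
w = gauss_pdf(x_s, mu_t, sigma) / gauss_pdf(x_s, mu_s, sigma)

f = lambda x: x ** 2                    # stand-in for the per-sample loss

est_target = np.mean(w * f(x_s))        # weighted source estimate of E_T[f]
true_target = np.mean(f(rng.normal(mu_t, sigma, 100_000)))  # direct target estimate
# Both approximate E_T[x^2] = mu_t^2 + sigma^2 = 2 up to Monte Carlo error
```

In practice $w(x)$ is not available in closed form and must be estimated, which is the subject of the rest of this page.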
Given weights $w(x_i)$, we minimize the weighted empirical loss $\sum_i w(x_i)\,\ell(h(x_i), y_i)$. If the weights are accurate, this recovers the optimal target classifier.
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ImportanceWeightedClassifier(BaseEstimator, ClassifierMixin):
    """
    Classifier that uses importance weighting for covariate shift.

    Reweights training samples by P_T(x)/P_S(x) to correct for
    distributional mismatch between training and target.
    """

    def __init__(self, base_classifier, density_ratio_estimator):
        self.base_classifier = base_classifier
        self.density_ratio_estimator = density_ratio_estimator

    def fit(self, X_source, y_source, X_target):
        """
        Fit classifier using importance-weighted samples.

        Args:
            X_source: Labeled source domain samples
            y_source: Source domain labels
            X_target: Unlabeled target domain samples
        """
        # Estimate importance weights w(x) = P_T(x) / P_S(x)
        self.density_ratio_estimator.fit(X_source, X_target)
        weights = self.density_ratio_estimator.predict(X_source)

        # Normalize weights for stability
        weights = weights / weights.mean()

        # Clip extreme weights to reduce variance
        weights = np.clip(weights, 0.1, 10.0)

        # Fit base classifier with sample weights
        self.base_classifier.fit(X_source, y_source, sample_weight=weights)
        return self

    def predict(self, X):
        return self.base_classifier.predict(X)

    def predict_proba(self, X):
        return self.base_classifier.predict_proba(X)
```

The success of importance weighting depends on accurately estimating $w(x) = P_T(x)/P_S(x)$. This is surprisingly challenging.
Naive Approach: estimate $\hat{P}_S(x)$ and $\hat{P}_T(x)$ separately, then compute their ratio. This fails because:

- Density estimation is itself hard in high dimensions, often harder than the downstream prediction task.
- Errors in the estimated denominator $\hat{P}_S(x)$ are amplified wherever $P_S(x)$ is small, which is exactly where accurate weights matter most.
KLIEP (Kullback-Leibler Importance Estimation Procedure): directly estimate the ratio without estimating either density:
$$\min_w \mathrm{KL}(P_T \,\|\, \hat{P}_T) \quad \text{where} \quad \hat{P}_T(x) = w(x)P_S(x)$$
Subject to: $\mathbb{E}_{P_S}[w(x)] = 1$ (normalization)
uLSIF (unconstrained Least-Squares Importance Fitting): minimize the squared error $\frac{1}{2}\mathbb{E}_{P_S}[(w(x) - P_T(x)/P_S(x))^2]$ between the estimated and true ratio, which expands (up to a constant) to:
$$\min_w \frac{1}{2}\mathbb{E}_{P_S}[w(x)^2] - \mathbb{E}_{P_T}[w(x)]$$
Because the objective is quadratic in the parameters, it has a closed-form solution for models that are linear in their parameters (e.g., kernel expansions).
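A minimal sketch of that closed form, modeling $w(x)$ as a non-negative combination of Gaussian kernels centered on target samples (the bandwidth `sigma`, ridge penalty `lam`, and kernel count are assumed hyperparameters, normally chosen by cross-validation):

```python
import numpy as np

def ulsif_weights(X_source, X_target, sigma=1.0, lam=1e-3, n_kernels=50, seed=0):
    """uLSIF sketch: ridge-regularized least-squares fit of w(x) = P_T(x)/P_S(x).

    Returns estimated importance weights for the source samples.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_target), min(n_kernels, len(X_target)), replace=False)
    centers = X_target[idx]                       # kernel centers from target

    def kernel(X):
        d = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        return np.exp(-d / (2 * sigma ** 2))

    Phi_s, Phi_t = kernel(X_source), kernel(X_target)
    H = Phi_s.T @ Phi_s / len(X_source)           # empirical E_S[phi phi^T]
    h = Phi_t.mean(axis=0)                        # empirical E_T[phi]
    # Closed form: alpha = (H + lam*I)^{-1} h -- one linear solve
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.maximum(Phi_s @ alpha, 0)           # clip negatives (a known uLSIF quirk)
```

The final clipping step reflects the "may be negative" drawback listed in the comparison below: nothing in the least-squares objective forces the fitted ratio to be non-negative.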
| Method | Objective | Pros | Cons |
|---|---|---|---|
| KLIEP | KL divergence | Asymptotically optimal | Optimization required |
| uLSIF | Least squares | Closed-form solution | May be negative |
| KMM | MMD matching | Convex problem | Kernel selection crucial |
| Classifier | Domain discrimination | Uses deep learning | Needs probability calibration |
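KMM (kernel mean matching) chooses bounded weights so that the weighted source sample matches the target sample in a Gaussian RKHS. A simplified sketch of the convex problem (solving the box-constrained quadratic with L-BFGS-B and replacing the usual sum constraint with post-hoc normalization; the bandwidth `sigma` and weight bound `B` are assumed choices):

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X_source, X_target, sigma=1.0, B=10.0):
    """Kernel mean matching (simplified sketch)."""
    n_s, n_t = len(X_source), len(X_target)

    def gram(A, C):
        d = np.sum((A[:, None, :] - C[None, :, :]) ** 2, axis=-1)
        return np.exp(-d / (2 * sigma ** 2))

    K = gram(X_source, X_source)                          # source Gram matrix
    kappa = (n_s / n_t) * gram(X_source, X_target).sum(axis=1)

    # Convex quadratic 0.5 w^T K w - kappa^T w under box constraints w in [0, B]
    res = minimize(lambda w: 0.5 * w @ K @ w - kappa @ w,
                   np.ones(n_s),
                   jac=lambda w: K @ w - kappa,
                   method='L-BFGS-B',
                   bounds=[(0.0, B)] * n_s)
    return res.x / res.x.mean()                           # normalize to mean 1
```

As the table notes, kernel selection is crucial: a poorly chosen `sigma` can make the Gram matrix nearly degenerate and the weights uninformative.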
```python
import numpy as np
from scipy.optimize import minimize


class KLIEPDensityRatio:
    """
    KLIEP: Kullback-Leibler Importance Estimation Procedure.

    Directly estimates density ratios by minimizing KL divergence
    between true target density and reweighted source density.
    """

    def __init__(self, n_kernels=100, sigma=1.0):
        self.n_kernels = n_kernels
        self.sigma = sigma
        self.alpha = None
        self.centers = None

    def fit(self, X_source, X_target):
        n_s, n_t = len(X_source), len(X_target)

        # Select kernel centers from target samples
        idx = np.random.choice(n_t, min(self.n_kernels, n_t), replace=False)
        self.centers = X_target[idx]

        # Compute kernel matrices
        Phi_s = self._compute_kernel_matrix(X_source)  # (n_s, n_kernels)
        Phi_t = self._compute_kernel_matrix(X_target)  # (n_t, n_kernels)

        # KLIEP objective: maximize sum(log(Phi_t @ alpha))
        # subject to mean(Phi_s @ alpha) = 1
        def objective(alpha):
            w_t = Phi_t @ alpha
            w_t = np.maximum(w_t, 1e-10)
            return -np.mean(np.log(w_t))

        def constraint(alpha):
            return np.mean(Phi_s @ alpha) - 1.0

        # Initialize and optimize
        alpha0 = np.ones(len(self.centers)) / len(self.centers)
        result = minimize(
            objective, alpha0, method='SLSQP',
            constraints={'type': 'eq', 'fun': constraint},
            bounds=[(0, None)] * len(self.centers)
        )
        self.alpha = result.x
        return self

    def predict(self, X):
        Phi = self._compute_kernel_matrix(X)
        return np.maximum(Phi @ self.alpha, 0)

    def _compute_kernel_matrix(self, X):
        dists = np.sum((X[:, None, :] - self.centers[None, :, :]) ** 2, axis=-1)
        return np.exp(-dists / (2 * self.sigma ** 2))


class ClassifierDensityRatio:
    """
    Estimate density ratio using domain classification.

    P_T(x)/P_S(x) = P(D=T|x)/P(D=S|x) * P(D=S)/P(D=T)
    """

    def __init__(self, classifier):
        self.classifier = classifier
        self.prior_ratio = 1.0

    def fit(self, X_source, X_target):
        # Create domain labels: 0 = source, 1 = target
        X = np.vstack([X_source, X_target])
        y = np.array([0] * len(X_source) + [1] * len(X_target))

        # Shuffle and fit
        perm = np.random.permutation(len(X))
        self.classifier.fit(X[perm], y[perm])

        # Prior ratio P(D=S)/P(D=T), estimated from sample sizes
        self.prior_ratio = len(X_source) / len(X_target)
        return self

    def predict(self, X):
        probs = self.classifier.predict_proba(X)
        p_target = np.clip(probs[:, 1], 0.01, 0.99)
        p_source = 1 - p_target
        return (p_target / p_source) * self.prior_ratio
```

Importance weighting is theoretically elegant but faces practical challenges.
When source and target distributions have low overlap, some samples receive very large weights. This inflates the variance of the weighted estimator:
$$\text{Var}[\hat{R}_{T,w}] = \frac{1}{n}\text{Var}_{P_S}[w(x)\ell(h(x), y)]$$
Effective Sample Size (ESS): $$\text{ESS} = \frac{(\sum_i w_i)^2}{\sum_i w_i^2}$$
When ESS is much smaller than n, most information comes from a few highly-weighted samples.
If source and target have little overlap (different supports), weights become extreme or undefined. No amount of reweighting can create information about regions unseen in training. In these cases, covariate shift methods provide no benefit.
Before applying correction, assess whether covariate shift is present and correctable: train a classifier to distinguish source from target inputs (near-chance accuracy indicates minimal shift), and compute the effective sample size of the estimated weights (a very low ESS means reweighting will be high-variance).
Given these challenges, practitioners have developed several robust strategies:
Self-Normalized Estimator: divide by the sum of the weights rather than the sample size:
$$\hat{R}_T = \frac{\sum_i w(x_i) \ell_i}{\sum_i w(x_i)}$$
This estimator is biased but has lower variance, and it remains valid when the weights are only known up to a constant factor, i.e., when their normalization is estimated poorly.
Doubly Robust Estimation: combines importance weighting with an outcome model $\hat{\ell}(x)$ that predicts the per-sample loss directly from the input:
$$\hat{R}_{DR} = \frac{1}{n}\sum_i w(x_i)(\ell_i - \hat{\ell}(x_i)) + \frac{1}{m}\sum_j \hat{\ell}(x_j)$$
If either the weights or the outcome model is correct, the estimator is consistent.
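A minimal sketch of this estimator, using a linear regression as the (assumed, illustrative) outcome model $\hat{\ell}$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def doubly_robust_risk(X_source, losses, weights, X_target):
    """Doubly robust target-risk estimate (sketch).

    Fits an outcome model ell_hat(x) to the observed per-sample losses
    (a linear model here -- an assumed choice), then combines a
    weighted residual correction on source samples with the direct
    outcome-model average on target samples.
    """
    outcome_model = LinearRegression().fit(X_source, losses)
    resid = losses - outcome_model.predict(X_source)
    # (1/n) sum_i w(x_i)(l_i - ell_hat(x_i)) + (1/m) sum_j ell_hat(x_j)
    return np.mean(weights * resid) + np.mean(outcome_model.predict(X_target))
```

Note the double robustness at work: if the outcome model is exact, the residuals vanish and even badly misspecified weights do no harm.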
Propensity Stratification: instead of weighting each sample individually, stratify samples by propensity score and estimate the risk within each stratum. This is more robust to extreme weights.
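A minimal sketch of stratified estimation, assuming quantile bins of the propensity score $P(D{=}T|x)$ and a stratum-level weight given by the mean odds within each bin (both assumed design choices, not the only way to stratify):

```python
import numpy as np

def stratified_risk(propensity, losses, n_strata=5):
    """Propensity-stratified risk estimate (sketch).

    Bins source samples by propensity score, averages the loss within
    each stratum, then combines strata using one coarse weight per
    stratum (mean odds) instead of volatile per-sample weights.
    """
    edges = np.quantile(propensity, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, propensity, side='right') - 1,
                     0, n_strata - 1)
    total, norm = 0.0, 0.0
    for s in range(n_strata):
        mask = strata == s
        if not mask.any():
            continue
        # One stratum-level weight replaces the per-sample weights
        odds = np.mean(propensity[mask] / (1 - propensity[mask]))
        total += odds * losses[mask].mean()
        norm += odds
    return total / norm
```

Because each stratum contributes a single averaged weight, a handful of extreme per-sample weights cannot dominate the estimate the way they can in plain importance weighting.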
```python
import numpy as np
from sklearn.model_selection import cross_val_score


def self_normalized_weights(weights):
    """Self-normalized importance weights (lower variance)."""
    return weights / weights.sum() * len(weights)


def effective_sample_size(weights):
    """Compute ESS to assess weight quality."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)


def weight_trimming(weights, percentile=95):
    """Trim extreme weights to reduce variance."""
    threshold = np.percentile(weights, percentile)
    return np.minimum(weights, threshold)


def assess_covariate_shift(X_source, X_target, classifier):
    """
    Assess whether covariate shift correction is feasible.

    Returns:
        Dictionary with shift severity and recommendation
    """
    X = np.vstack([X_source, X_target])
    y = np.array([0] * len(X_source) + [1] * len(X_target))

    # Domain classifier accuracy
    acc = cross_val_score(classifier, X, y, cv=5).mean()

    # Estimate weights and ESS
    classifier.fit(X, y)
    probs = classifier.predict_proba(X_source)[:, 1]
    probs = np.clip(probs, 0.05, 0.95)
    weights = (probs / (1 - probs)) * (len(X_source) / len(X_target))
    ess = effective_sample_size(weights)

    # Recommendations
    if acc < 0.55:
        severity = "minimal"
        recommendation = "No correction needed"
    elif ess > len(X_source) * 0.3:
        severity = "moderate"
        recommendation = "Apply importance weighting"
    else:
        severity = "severe"
        recommendation = "Consider domain adaptation instead"

    return {
        'domain_accuracy': acc,
        'effective_sample_size': ess,
        'ess_ratio': ess / len(X_source),
        'severity': severity,
        'recommendation': recommendation
    }
```

The next page covers distribution matching—learning representations where source and target domains become indistinguishable. This complements importance weighting by creating shared feature spaces.