Among all forms of domain shift, covariate shift is the most tractable and best understood. It occurs when the marginal distribution of inputs changes between training and deployment, but the conditional relationship between inputs and outputs remains constant.
Mathematically: $$P_S(X) \neq P_T(X) \quad \text{but} \quad P_S(Y|X) = P_T(Y|X)$$
This assumption—that the "rules" mapping inputs to outputs don't change—enables principled correction through importance weighting. If we knew exactly how the input distribution shifted, we could reweight training samples to match the target distribution and obtain an optimal classifier.
This page covers the theory of covariate shift, importance weighting for correction, density ratio estimation methods, practical challenges, and when covariate shift assumptions hold.
The key insight is that training on reweighted source samples can recover the optimal target classifier.
Risk Under Target Distribution:
$$R_T(h) = \mathbb{E}_{(x,y) \sim P_T}[\ell(h(x), y)]$$
We want to minimize this but only have samples from $P_S$. Using importance sampling:
$$R_T(h) = \mathbb{E}_{(x,y) \sim P_S}\left[\frac{P_T(x,y)}{P_S(x,y)} \ell(h(x), y)\right]$$
Under covariate shift ($P_S(Y|X) = P_T(Y|X)$):
$$\frac{P_T(x,y)}{P_S(x,y)} = \frac{P_T(x)P_T(y|x)}{P_S(x)P_S(y|x)} = \frac{P_T(x)}{P_S(x)} = w(x)$$
The importance weight $w(x)$ only depends on input distributions!
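This identity can be sanity-checked numerically on a toy 1-D problem where both densities are known Gaussians (the means, the choice of $f$, and the sample sizes below are illustrative, not from any particular benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_s, mu_t, sigma = 0.0, 1.0, 1.0       # assumed toy setup: two 1-D Gaussians
x_s = rng.normal(mu_s, sigma, 100_000)  # samples from the source P_S

def gauss_pdf(x, mu, sd):
    return np.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

# Importance weight w(x) = P_T(x) / P_S(x), known in closed form here
w = gauss_pdf(x_s, mu_t, sigma) / gauss_pdf(x_s, mu_s, sigma)

f = lambda x: x ** 2                    # stand-in for the per-sample loss

est_target = np.mean(w * f(x_s))        # weighted source estimate of E_T[f]
true_target = np.mean(f(rng.normal(mu_t, sigma, 100_000)))  # direct target estimate
# Both approximate E_T[x^2] = mu_t^2 + sigma^2 = 2 up to Monte Carlo error
```

In practice $w(x)$ is not available in closed form and must be estimated, which is the subject of the rest of this page.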
Given weights $w(x_i)$, we minimize the weighted empirical loss $\sum_i w(x_i)\,\ell(h(x_i), y_i)$. If the weights are accurate, this recovers the optimal target classifier.
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ImportanceWeightedClassifier(BaseEstimator, ClassifierMixin):
    """
    Classifier that uses importance weighting for covariate shift.

    Reweights training samples by P_T(x)/P_S(x) to correct for
    distributional mismatch between training and target.
    """

    def __init__(self, base_classifier, density_ratio_estimator):
        self.base_classifier = base_classifier
        self.density_ratio_estimator = density_ratio_estimator

    def fit(self, X_source, y_source, X_target):
        """
        Fit classifier using importance-weighted samples.

        Args:
            X_source: Labeled source domain samples
            y_source: Source domain labels
            X_target: Unlabeled target domain samples
        """
        # Estimate importance weights w(x) = P_T(x) / P_S(x)
        self.density_ratio_estimator.fit(X_source, X_target)
        weights = self.density_ratio_estimator.predict(X_source)

        # Normalize weights for stability
        weights = weights / weights.mean()

        # Clip extreme weights to reduce variance
        weights = np.clip(weights, 0.1, 10.0)

        # Fit base classifier with sample weights
        self.base_classifier.fit(X_source, y_source, sample_weight=weights)
        return self

    def predict(self, X):
        return self.base_classifier.predict(X)

    def predict_proba(self, X):
        return self.base_classifier.predict_proba(X)
```

The success of importance weighting depends on accurately estimating $w(x) = P_T(x)/P_S(x)$. This is surprisingly challenging.
Naive Approach: estimate $\hat{P}_S(x)$ and $\hat{P}_T(x)$ separately, then compute their ratio. This fails because:

- Density estimation is itself hard in high dimensions, often harder than the downstream prediction task.
- Errors in the estimated denominator $\hat{P}_S(x)$ are amplified wherever $P_S(x)$ is small, which is exactly where accurate weights matter most.
KLIEP (Kullback-Leibler Importance Estimation Procedure): directly estimate the ratio without estimating either density:
$$\min_w \mathrm{KL}(P_T \,\|\, \hat{P}_T) \quad \text{where} \quad \hat{P}_T(x) = w(x)P_S(x)$$
Subject to: $\mathbb{E}_{P_S}[w(x)] = 1$ (normalization)
uLSIF (unconstrained Least-Squares Importance Fitting): minimize the squared error $\frac{1}{2}\mathbb{E}_{P_S}[(w(x) - P_T(x)/P_S(x))^2]$ between the estimated and true ratio, which expands (up to a constant) to:
$$\min_w \frac{1}{2}\mathbb{E}_{P_S}[w(x)^2] - \mathbb{E}_{P_T}[w(x)]$$
Because the objective is quadratic in the parameters, it has a closed-form solution for models that are linear in their parameters (e.g., kernel expansions).
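A minimal sketch of that closed form, modeling $w(x)$ as a non-negative combination of Gaussian kernels centered on target samples (the bandwidth `sigma`, ridge penalty `lam`, and kernel count are assumed hyperparameters, normally chosen by cross-validation):

```python
import numpy as np

def ulsif_weights(X_source, X_target, sigma=1.0, lam=1e-3, n_kernels=50, seed=0):
    """uLSIF sketch: ridge-regularized least-squares fit of w(x) = P_T(x)/P_S(x).

    Returns estimated importance weights for the source samples.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_target), min(n_kernels, len(X_target)), replace=False)
    centers = X_target[idx]                       # kernel centers from target

    def kernel(X):
        d = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        return np.exp(-d / (2 * sigma ** 2))

    Phi_s, Phi_t = kernel(X_source), kernel(X_target)
    H = Phi_s.T @ Phi_s / len(X_source)           # empirical E_S[phi phi^T]
    h = Phi_t.mean(axis=0)                        # empirical E_T[phi]
    # Closed form: alpha = (H + lam*I)^{-1} h -- one linear solve
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.maximum(Phi_s @ alpha, 0)           # clip negatives (a known uLSIF quirk)
```

The final clipping step reflects the "may be negative" drawback listed in the comparison below: nothing in the least-squares objective forces the fitted ratio to be non-negative.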
| Method | Objective | Pros | Cons |
|---|---|---|---|
| KLIEP | KL divergence | Asymptotically optimal | Optimization required |
| uLSIF | Least squares | Closed-form solution | May be negative |
| KMM | MMD matching | Convex problem | Kernel selection crucial |
| Classifier | Domain discrimination | Uses deep learning | Needs probability calibration |
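KMM (kernel mean matching) chooses bounded weights so that the weighted source sample matches the target sample in a Gaussian RKHS. A simplified sketch of the convex problem (solving the box-constrained quadratic with L-BFGS-B and replacing the usual sum constraint with post-hoc normalization; the bandwidth `sigma` and weight bound `B` are assumed choices):

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X_source, X_target, sigma=1.0, B=10.0):
    """Kernel mean matching (simplified sketch)."""
    n_s, n_t = len(X_source), len(X_target)

    def gram(A, C):
        d = np.sum((A[:, None, :] - C[None, :, :]) ** 2, axis=-1)
        return np.exp(-d / (2 * sigma ** 2))

    K = gram(X_source, X_source)                          # source Gram matrix
    kappa = (n_s / n_t) * gram(X_source, X_target).sum(axis=1)

    # Convex quadratic 0.5 w^T K w - kappa^T w under box constraints w in [0, B]
    res = minimize(lambda w: 0.5 * w @ K @ w - kappa @ w,
                   np.ones(n_s),
                   jac=lambda w: K @ w - kappa,
                   method='L-BFGS-B',
                   bounds=[(0.0, B)] * n_s)
    return res.x / res.x.mean()                           # normalize to mean 1
```

As the table notes, kernel selection is crucial: a poorly chosen `sigma` can make the Gram matrix nearly degenerate and the weights uninformative.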
```python
import numpy as np
from scipy.optimize import minimize


class KLIEPDensityRatio:
    """
    KLIEP: Kullback-Leibler Importance Estimation Procedure.

    Directly estimates density ratios by minimizing KL divergence
    between true target density and reweighted source density.
    """

    def __init__(self, n_kernels=100, sigma=1.0):
        self.n_kernels = n_kernels
        self.sigma = sigma
        self.alpha = None
        self.centers = None

    def fit(self, X_source, X_target):
        n_s, n_t = len(X_source), len(X_target)

        # Select kernel centers from target samples
        idx = np.random.choice(n_t, min(self.n_kernels, n_t), replace=False)
        self.centers = X_target[idx]

        # Compute kernel matrices
        Phi_s = self._compute_kernel_matrix(X_source)  # (n_s, n_kernels)
        Phi_t = self._compute_kernel_matrix(X_target)  # (n_t, n_kernels)

        # KLIEP objective: maximize sum(log(Phi_t @ alpha))
        # subject to mean(Phi_s @ alpha) = 1
        def objective(alpha):
            w_t = Phi_t @ alpha
            w_t = np.maximum(w_t, 1e-10)
            return -np.mean(np.log(w_t))

        def constraint(alpha):
            return np.mean(Phi_s @ alpha) - 1.0

        # Initialize and optimize
        alpha0 = np.ones(len(self.centers)) / len(self.centers)
        result = minimize(
            objective, alpha0, method='SLSQP',
            constraints={'type': 'eq', 'fun': constraint},
            bounds=[(0, None)] * len(self.centers)
        )
        self.alpha = result.x
        return self

    def predict(self, X):
        Phi = self._compute_kernel_matrix(X)
        return np.maximum(Phi @ self.alpha, 0)

    def _compute_kernel_matrix(self, X):
        dists = np.sum((X[:, None, :] - self.centers[None, :, :]) ** 2, axis=-1)
        return np.exp(-dists / (2 * self.sigma ** 2))


class ClassifierDensityRatio:
    """
    Estimate density ratio using domain classification.

    P_T(x)/P_S(x) = P(D=T|x)/P(D=S|x) * P(D=S)/P(D=T)
    """

    def __init__(self, classifier):
        self.classifier = classifier
        self.prior_ratio = 1.0

    def fit(self, X_source, X_target):
        # Create domain labels: 0 = source, 1 = target
        X = np.vstack([X_source, X_target])
        y = np.array([0] * len(X_source) + [1] * len(X_target))

        # Shuffle and fit
        perm = np.random.permutation(len(X))
        self.classifier.fit(X[perm], y[perm])

        # Prior ratio P(D=S)/P(D=T), estimated from sample sizes
        self.prior_ratio = len(X_source) / len(X_target)
        return self

    def predict(self, X):
        probs = self.classifier.predict_proba(X)
        p_target = np.clip(probs[:, 1], 0.01, 0.99)
        p_source = 1 - p_target
        return (p_target / p_source) * self.prior_ratio
```

Importance weighting is theoretically elegant but faces practical challenges.
When source and target distributions have low overlap, some samples receive very large weights. This inflates the variance of the weighted estimator:
$$\text{Var}[\hat{R}_{T,w}] = \frac{1}{n}\text{Var}_{P_S}[w(x)\ell(h(x), y)]$$
Effective Sample Size (ESS): $$\text{ESS} = \frac{(\sum_i w_i)^2}{\sum_i w_i^2}$$
When ESS is much smaller than n, most information comes from a few highly-weighted samples.
If source and target have little overlap (different supports), weights become extreme or undefined. No amount of reweighting can create information about regions unseen in training. In these cases, covariate shift methods provide no benefit.
Before applying correction, assess whether covariate shift is present and correctable: train a classifier to distinguish source from target inputs (near-chance accuracy indicates minimal shift), and compute the effective sample size of the estimated weights (a very low ESS means reweighting will be high-variance).
Given these challenges, practitioners have developed several robust strategies:
Self-Normalized Estimator: divide by the sum of the weights rather than the sample size:
$$\hat{R}_T = \frac{\sum_i w(x_i) \ell_i}{\sum_i w(x_i)}$$
This estimator is biased but has lower variance, and it remains valid when the weights are only known up to a constant factor, i.e., when their normalization is estimated poorly.
Doubly Robust Estimation: combines importance weighting with an outcome model $\hat{\ell}(x)$ that predicts the per-sample loss directly from the input:
$$\hat{R}_{DR} = \frac{1}{n}\sum_i w(x_i)(\ell_i - \hat{\ell}(x_i)) + \frac{1}{m}\sum_j \hat{\ell}(x_j)$$
If either the weights or the outcome model is correct, the estimator is consistent.
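A minimal sketch of this estimator, using a linear regression as the (assumed, illustrative) outcome model $\hat{\ell}$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def doubly_robust_risk(X_source, losses, weights, X_target):
    """Doubly robust target-risk estimate (sketch).

    Fits an outcome model ell_hat(x) to the observed per-sample losses
    (a linear model here -- an assumed choice), then combines a
    weighted residual correction on source samples with the direct
    outcome-model average on target samples.
    """
    outcome_model = LinearRegression().fit(X_source, losses)
    resid = losses - outcome_model.predict(X_source)
    # (1/n) sum_i w(x_i)(l_i - ell_hat(x_i)) + (1/m) sum_j ell_hat(x_j)
    return np.mean(weights * resid) + np.mean(outcome_model.predict(X_target))
```

Note the double robustness at work: if the outcome model is exact, the residuals vanish and even badly misspecified weights do no harm.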
Propensity Stratification: instead of weighting each sample individually, stratify samples by propensity score and estimate the risk within each stratum. This is more robust to extreme weights.
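A minimal sketch of stratified estimation, assuming quantile bins of the propensity score $P(D{=}T|x)$ and a stratum-level weight given by the mean odds within each bin (both assumed design choices, not the only way to stratify):

```python
import numpy as np

def stratified_risk(propensity, losses, n_strata=5):
    """Propensity-stratified risk estimate (sketch).

    Bins source samples by propensity score, averages the loss within
    each stratum, then combines strata using one coarse weight per
    stratum (mean odds) instead of volatile per-sample weights.
    """
    edges = np.quantile(propensity, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, propensity, side='right') - 1,
                     0, n_strata - 1)
    total, norm = 0.0, 0.0
    for s in range(n_strata):
        mask = strata == s
        if not mask.any():
            continue
        # One stratum-level weight replaces the per-sample weights
        odds = np.mean(propensity[mask] / (1 - propensity[mask]))
        total += odds * losses[mask].mean()
        norm += odds
    return total / norm
```

Because each stratum contributes a single averaged weight, a handful of extreme per-sample weights cannot dominate the estimate the way they can in plain importance weighting.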
```python
import numpy as np
from sklearn.model_selection import cross_val_score


def self_normalized_weights(weights):
    """Self-normalized importance weights (lower variance)."""
    return weights / weights.sum() * len(weights)


def effective_sample_size(weights):
    """Compute ESS to assess weight quality."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)


def weight_trimming(weights, percentile=95):
    """Trim extreme weights to reduce variance."""
    threshold = np.percentile(weights, percentile)
    return np.minimum(weights, threshold)


def assess_covariate_shift(X_source, X_target, classifier):
    """
    Assess whether covariate shift correction is feasible.

    Returns:
        Dictionary with shift severity and recommendation
    """
    X = np.vstack([X_source, X_target])
    y = np.array([0] * len(X_source) + [1] * len(X_target))

    # Domain classifier accuracy
    acc = cross_val_score(classifier, X, y, cv=5).mean()

    # Estimate weights and ESS
    classifier.fit(X, y)
    probs = classifier.predict_proba(X_source)[:, 1]
    probs = np.clip(probs, 0.05, 0.95)
    weights = (probs / (1 - probs)) * (len(X_source) / len(X_target))
    ess = effective_sample_size(weights)

    # Recommendations
    if acc < 0.55:
        severity = "minimal"
        recommendation = "No correction needed"
    elif ess > len(X_source) * 0.3:
        severity = "moderate"
        recommendation = "Apply importance weighting"
    else:
        severity = "severe"
        recommendation = "Consider domain adaptation instead"

    return {
        'domain_accuracy': acc,
        'effective_sample_size': ess,
        'ess_ratio': ess / len(X_source),
        'severity': severity,
        'recommendation': recommendation
    }
```

The next page covers distribution matching—learning representations where source and target domains become indistinguishable. This complements importance weighting by creating shared feature spaces.