Transfer learning success fundamentally depends on the relationship between source and target domains. When domains are similar, pre-trained features transfer well; when they differ significantly, transfer may fail or even hurt performance.
Domain discrepancy quantifies this difference. Understanding and measuring it is essential for choosing a transfer strategy, anticipating whether transfer will succeed, and deciding how much adaptation the target task needs.
This page provides a rigorous treatment of domain discrepancy: formal definitions, measurement techniques, and practical implications for transfer learning.
In this page: formal definitions of domain shift, methods for measuring distribution divergence, practical tools for estimating domain discrepancy, and how to use these measurements to guide transfer learning decisions.
Domain shift occurs when the source distribution $P_S(X, Y)$ differs from the target distribution $P_T(X, Y)$. This can happen in several ways:
Covariate shift: $P_S(X) \neq P_T(X)$ but $P(Y|X)$ remains the same.
The input distribution changes, but the labeling function is constant. Example: training on daytime images, deploying on nighttime images—a car is still a car, but its appearance differs.
Label shift (prior shift): $P_S(Y) \neq P_T(Y)$ but $P(X|Y)$ remains the same.
Class proportions change. Example: training on balanced medical data, deploying where disease is rare.
Concept shift: $P_S(Y|X) \neq P_T(Y|X)$.
The meaning of labels changes. Example: "spam" patterns evolve over time.
Full domain shift: Both marginals and conditionals differ.
The most challenging case—common when source and target are fundamentally different domains.
| Shift Type | What Changes | Example | Adaptation Difficulty |
|---|---|---|---|
| Covariate | P(X) | Day→Night photos | Moderate |
| Label | P(Y) | Balanced→Imbalanced classes | Low-Moderate |
| Concept | P(Y|X) | Evolving spam patterns | High |
| Full | Everything | Natural→Medical images | Very High |
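For the label-shift row in particular, a standard correction is to re-weight the classifier's posteriors by the ratio of target to source class priors (Bayes' rule), assuming $P(X|Y)$ is unchanged. A minimal sketch, where `adjust_for_label_shift` and the prior values are illustrative:

```python
import numpy as np

def adjust_for_label_shift(probs: np.ndarray,
                           source_priors, target_priors) -> np.ndarray:
    """Re-weight posteriors p_S(y|x) by the target/source prior ratio."""
    w = np.asarray(target_priors) / np.asarray(source_priors)
    adjusted = probs * w  # scale each class column by its prior ratio
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Classifier trained on balanced data (priors 0.5/0.5), deployed where
# class 1 is rare (priors 0.9/0.1) -- the "disease is rare" scenario.
probs = np.array([[0.4, 0.6],
                  [0.8, 0.2]])
adjusted = adjust_for_label_shift(probs, [0.5, 0.5], [0.9, 0.1])
```

Note how the first row, originally predicted as class 1, flips toward the now-dominant class 0 once the rare prior is accounted for.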
In practice, covariate shift is the most frequently encountered scenario in transfer learning. The assumption that P(Y|X) is shared allows us to focus on aligning input distributions—the basis for most domain adaptation methods.
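Under the covariate-shift assumption, a common remedy is importance weighting: re-weight source samples by $w(x) = p_T(x)/p_S(x)$, with the density ratio estimated from a source-vs-target classifier. A minimal sketch on synthetic data; `importance_weights` is an illustrative helper, not a library function:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_source: np.ndarray, X_target: np.ndarray) -> np.ndarray:
    """Estimate w(x) = p_T(x)/p_S(x) for source samples via a domain classifier."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    p_target = clf.predict_proba(X_source)[:, 1]
    # The odds ratio approximates the density ratio; correct for sample counts
    w = (p_target / (1.0 - p_target)) * (len(X_source) / len(X_target))
    return w / w.mean()  # normalize to mean 1 for stable re-weighted training

# Synthetic covariate shift: same labeling function, shifted inputs
rng = np.random.default_rng(0)
X_s = rng.normal(0.0, 1.0, size=(500, 2))
X_t = rng.normal(1.0, 1.0, size=(500, 2))
weights = importance_weights(X_s, X_t)
```

Source points lying toward the target region receive weights above 1, so a loss re-weighted by `weights` emphasizes the part of the source data that resembles the target.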
To quantify domain discrepancy, we need measures of distribution divergence. Several metrics are commonly used:
Maximum Mean Discrepancy (MMD):
The MMD measures the distance between distributions in a reproducing kernel Hilbert space (RKHS):
$$\text{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|^2_{\mathcal{H}}$$
where $\mu_P$ and $\mu_Q$ are kernel mean embeddings. With samples:
$$\widehat{\text{MMD}}^2 = \frac{1}{n^2}\sum_{i,j} k(x_i, x_j) + \frac{1}{m^2}\sum_{i,j} k(y_i, y_j) - \frac{2}{nm}\sum_{i,j} k(x_i, y_j)$$
MMD is zero iff distributions are identical (for characteristic kernels).
Proxy A-distance:
Based on the ability of a classifier to distinguish domains:
$$d_A(S, T) = 2(1 - 2\epsilon)$$
where $\epsilon$ is the error of a classifier trained to distinguish source from target samples. If the domains are identical, $\epsilon = 0.5$ and $d_A = 0$; if they are easily separable, $\epsilon \to 0$ and $d_A \to 2$.
Fréchet Distance (FID-like):
Assumes Gaussian distribution and measures:
$$d_F^2 = \|\mu_S - \mu_T\|^2 + \text{Tr}(\Sigma_S + \Sigma_T - 2(\Sigma_S \Sigma_T)^{1/2})$$
Widely used for comparing image distributions (Fréchet Inception Distance).
KL Divergence:
$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$
Asymmetric; useful for density estimation scenarios but requires density estimates.
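The asymmetry is easy to see with discrete distributions, where `scipy.stats.entropy` computes the KL divergence when given two arguments:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pq = entropy(p, q)  # D_KL(P || Q), in nats
kl_qp = entropy(q, p)  # D_KL(Q || P) -- generally a different number
```

KL is zero only when the distributions are identical, and it diverges wherever $q(x) = 0$ but $p(x) > 0$, which is one reason it is fragile with empirical samples.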
Wasserstein Distance (Earth Mover's Distance):
$$W(P, Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$
Measures the minimum cost to transport one distribution to another. Has nice geometric properties.
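For one-dimensional samples the exact distance is cheap to compute with `scipy.stats.wasserstein_distance`. A quick sketch with synthetic Gaussians (for equal-variance Gaussians, $W_1$ equals the mean shift):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=1000)
target = rng.normal(2.0, 1.0, size=1000)

# Empirical W1 should land near the true mean shift of 2.0
w = wasserstein_distance(source, target)
```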
| Measure | Symmetric | Bounded | Computation | Sample Efficiency |
|---|---|---|---|---|
| MMD | Yes | No | O(n²) | Moderate |
| Proxy A-distance | Yes | Yes [0,2] | O(n) + classifier | Good |
| Fréchet | Yes | No | O(nd²) | Requires many samples |
| KL Divergence | No | No | Requires densities | Poor |
| Wasserstein | Yes | No | O(n³) exact | Good |
In practice, we measure domain discrepancy in representation space, not input space: pre-trained features are what actually transfer, the lower-dimensional feature space makes divergence estimates far more reliable, and distances there reflect semantic differences rather than raw pixel statistics.
Measurement workflow: extract features for both source and target data with the frozen pre-trained backbone, then apply one or more of the measures above to the resulting feature matrices.
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from scipy.linalg import sqrtm


def compute_mmd(X: np.ndarray, Y: np.ndarray,
                kernel: str = "rbf", gamma: float = None) -> float:
    """Compute Maximum Mean Discrepancy between two sample sets."""
    from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

    if gamma is None:
        gamma = 1.0 / X.shape[1]

    if kernel == "rbf":
        k = lambda a, b: rbf_kernel(a, b, gamma=gamma)
    else:
        k = linear_kernel

    XX = k(X, X).mean()
    YY = k(Y, Y).mean()
    XY = k(X, Y).mean()

    mmd_squared = XX + YY - 2 * XY
    return np.sqrt(max(0, mmd_squared))


def compute_proxy_a_distance(X_source: np.ndarray,
                             X_target: np.ndarray) -> float:
    """
    Compute Proxy A-distance using a domain classifier.

    Returns a value in [0, 2]. Higher = more separable = larger shift.
    """
    # Create domain labels: 0 = source, 1 = target
    y = np.concatenate([
        np.zeros(len(X_source)),
        np.ones(len(X_target))
    ])
    X = np.vstack([X_source, X_target])

    # Train domain classifier
    clf = SVC(kernel='linear', C=1.0)

    # Cross-validation accuracy
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    accuracy = scores.mean()

    # Convert to A-distance
    error = 1 - accuracy
    a_distance = 2 * (1 - 2 * error)
    return max(0, a_distance)  # Clamp to [0, 2]


def compute_frechet_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Compute Fréchet Distance assuming Gaussian distributions."""
    mu_X, mu_Y = X.mean(axis=0), Y.mean(axis=0)
    cov_X = np.cov(X, rowvar=False)
    cov_Y = np.cov(Y, rowvar=False)

    # Mean difference term
    diff = mu_X - mu_Y
    mean_term = np.dot(diff, diff)

    # Covariance term (discard numerical imaginary residue from sqrtm)
    cov_sqrt = sqrtm(cov_X @ cov_Y)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real
    cov_term = np.trace(cov_X + cov_Y - 2 * cov_sqrt)

    return np.sqrt(max(0, mean_term + cov_term))


class DomainDiscrepancyAnalyzer:
    """Complete toolkit for analyzing domain discrepancy."""

    def __init__(self, source_features: np.ndarray,
                 target_features: np.ndarray):
        self.source = source_features
        self.target = target_features

    def full_analysis(self) -> dict:
        """Run all discrepancy measures and return a summary."""
        results = {
            'mmd_rbf': compute_mmd(self.source, self.target, 'rbf'),
            'mmd_linear': compute_mmd(self.source, self.target, 'linear'),
            'proxy_a_distance': compute_proxy_a_distance(
                self.source, self.target
            ),
            'frechet': compute_frechet_distance(self.source, self.target),
        }

        # Interpretation
        pad = results['proxy_a_distance']
        if pad < 0.5:
            results['interpretation'] = "Low shift: frozen features likely sufficient"
        elif pad < 1.0:
            results['interpretation'] = "Moderate shift: consider fine-tuning"
        else:
            results['interpretation'] = "High shift: domain adaptation recommended"

        return results
```

Visualization helps build intuition for domain discrepancy. Common techniques:
t-SNE/UMAP visualization: Project high-dimensional features to 2D, coloring by domain. Overlapping clusters indicate low discrepancy; separated clusters indicate high discrepancy.
Feature distribution comparison: Plot histograms or kernel density estimates for individual feature dimensions. Look for shifts in mean, variance, or modality.
Nearest neighbor analysis: For each target sample, check if its nearest neighbors come from target or source. High source-neighbor ratio indicates good domain alignment.
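The feature-distribution comparison can also be quantified per dimension with a two-sample Kolmogorov-Smirnov test; large KS statistics flag the dimensions that shifted. A small sketch with synthetic features (`per_dimension_shift` is an illustrative helper):

```python
import numpy as np
from scipy.stats import ks_2samp

def per_dimension_shift(source_feats: np.ndarray,
                        target_feats: np.ndarray) -> np.ndarray:
    """KS statistic per feature dimension; large values flag shifted dims."""
    return np.array([
        ks_2samp(source_feats[:, d], target_feats[:, d]).statistic
        for d in range(source_feats.shape[1])
    ])

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 3))
tgt = rng.normal(0.0, 1.0, size=(500, 3))
tgt[:, 0] += 2.0  # inject a mean shift in dimension 0 only

stats = per_dimension_shift(src, tgt)
```

The shifted dimension should dominate the KS statistics, pointing directly at where the histograms or KDE plots are worth inspecting.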
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def visualize_domain_shift(source_features: np.ndarray,
                           target_features: np.ndarray,
                           method: str = "tsne",
                           save_path: str = None):
    """Visualize source and target domains in 2D."""
    # Combine for a joint embedding
    combined = np.vstack([source_features, target_features])
    labels = np.array(['Source'] * len(source_features) +
                      ['Target'] * len(target_features))

    # Dimensionality reduction
    if method == "tsne":
        reducer = TSNE(n_components=2, perplexity=30, random_state=42)
    elif method == "umap":
        from umap import UMAP
        reducer = UMAP(n_components=2, random_state=42)

    embedded = reducer.fit_transform(combined)

    # Plot
    fig, ax = plt.subplots(figsize=(10, 8))
    source_idx = labels == 'Source'
    target_idx = labels == 'Target'

    ax.scatter(embedded[source_idx, 0], embedded[source_idx, 1],
               c='blue', alpha=0.5, label='Source', s=20)
    ax.scatter(embedded[target_idx, 0], embedded[target_idx, 1],
               c='red', alpha=0.5, label='Target', s=20)

    ax.legend()
    ax.set_title(f'Domain Visualization ({method.upper()})')

    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')

    return fig


def nearest_neighbor_domain_ratio(source_features: np.ndarray,
                                  target_features: np.ndarray,
                                  k: int = 5) -> float:
    """
    For each target point, compute the fraction of its k-NN from source.

    Higher ratio = better domain alignment.
    """
    combined = np.vstack([source_features, target_features])

    # Labels: 0 = source, 1 = target
    domain_labels = np.array(
        [0] * len(source_features) + [1] * len(target_features)
    )

    # Fit kNN on the combined set
    knn = NearestNeighbors(n_neighbors=k + 1).fit(combined)

    # Query target points
    n_source = len(source_features)
    target_indices = np.arange(n_source, len(combined))
    _, neighbors = knn.kneighbors(combined[target_indices])

    # Exclude self (first neighbor)
    neighbors = neighbors[:, 1:]

    # Count source neighbors
    neighbor_domains = domain_labels[neighbors]
    source_neighbor_ratio = (neighbor_domains == 0).mean()

    return source_neighbor_ratio
```

In t-SNE plots, if source and target points are interleaved (mixed colors), domain discrepancy is low. If they form separate clusters, discrepancy is high. Be cautious: t-SNE can create artificial clusters, so combine it with quantitative measures.
Understanding the relationship between discrepancy and transfer enables better decision-making.
The transfer bound:
Theory provides bounds relating source performance, discrepancy, and target performance:
$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda$$
where:

- $\epsilon_T(h)$ and $\epsilon_S(h)$ are the target and source errors of hypothesis $h$,
- $d_{\mathcal{H}\Delta\mathcal{H}}(S, T)$ is the $\mathcal{H}\Delta\mathcal{H}$-divergence between the source and target domains,
- $\lambda$ is the error of the ideal joint hypothesis that performs well on both domains.
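To get a feel for the bound, plug in representative numbers. All three values below are illustrative assumptions, and the proxy A-distance is used only as a rough, heuristic stand-in for the intractable $d_{\mathcal{H}\Delta\mathcal{H}}$ term:

```python
# Illustrative numbers only, not measured values.
source_error = 0.05      # epsilon_S(h): error on the source task
proxy_a_distance = 0.8   # heuristic stand-in for the divergence term
lam = 0.02               # lambda: assumed ideal-joint-hypothesis error

target_error_bound = source_error + 0.5 * proxy_a_distance + lam
```

Even a modest discrepancy of 0.8 degrades the guarantee by 0.4 over the source error. The bound is loose in practice, but it makes the linear cost of discrepancy explicit.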
Practical implications:
| Proxy A-distance | Discrepancy Level | Recommended Approach |
|---|---|---|
| < 0.3 | Very Low | Frozen features, linear probe |
| 0.3 - 0.7 | Low | Frozen features, possibly small MLP head |
| 0.7 - 1.2 | Moderate | Fine-tuning recommended |
| 1.2 - 1.6 | High | Domain adaptation methods needed |
| > 1.6 | Very High | Specialized pre-training or train from scratch |
When domain discrepancy is very high, transfer can actually hurt performance compared to training from scratch. The pre-trained features encode patterns that confuse rather than help the target task. Always compare against a from-scratch baseline for high-discrepancy scenarios.
Domain discrepancy varies across network layers, providing insight into which layers need adaptation:
General pattern: early layers show low discrepancy because they capture generic features (edges, textures) that transfer across domains, while discrepancy grows with depth as features become increasingly specific to the source task. The depth at which discrepancy spikes suggests where fine-tuning should begin.
Using layer-wise analysis:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt


def layerwise_discrepancy_analysis(
    model: torch.nn.Module,
    source_loader,
    target_loader,
    layer_names: list,
    device: str = "cuda",
) -> dict:
    """Analyze domain discrepancy at each layer of the network."""
    # compute_proxy_a_distance as defined in the measurement toolkit above
    # (assumed saved as domain_discrepancy.py)
    from domain_discrepancy import compute_proxy_a_distance

    model = model.to(device).eval()
    layer_discrepancies = {}

    for layer_name in layer_names:
        # Extract features at this layer
        source_feats = extract_at_layer(model, source_loader, layer_name, device)
        target_feats = extract_at_layer(model, target_loader, layer_name, device)

        # Compute discrepancy
        pad = compute_proxy_a_distance(source_feats, target_feats)
        layer_discrepancies[layer_name] = pad
        print(f"{layer_name}: A-distance = {pad:.3f}")

    return layer_discrepancies


def extract_at_layer(model, dataloader, layer_name, device):
    """Extract and flatten features from a specific layer."""
    features = []

    # Register a forward hook on the requested layer
    target_layer = dict(model.named_modules())[layer_name]

    def hook(module, input, output):
        if output.dim() == 4:
            output = output.mean(dim=[2, 3])  # Global avg pool
        features.append(output.detach().cpu())

    handle = target_layer.register_forward_hook(hook)

    with torch.no_grad():
        for images, _ in dataloader:
            _ = model(images.to(device))

    handle.remove()
    return torch.cat(features).numpy()


def recommend_freeze_depth(layer_discrepancies: dict,
                           threshold: float = 0.8) -> str:
    """Recommend which layers to freeze based on discrepancy."""
    for layer, disc in layer_discrepancies.items():
        if disc > threshold:
            return (f"Freeze layers before '{layer}', "
                    f"fine-tune from '{layer}' onward")
    return "Low discrepancy throughout - frozen features may suffice"
```

What's next: The final page explores feature adaptation—techniques to reduce domain discrepancy and improve feature transferability when frozen features aren't sufficient.
You now understand how to measure and interpret domain discrepancy, enabling principled decisions about transfer learning strategies.