In-processing methods take a fundamentally different stance from pre-processing: rather than fixing the data before training, we embed fairness directly into the learning algorithm. The model learns to be fair as it learns to be accurate, treating fairness as a first-class optimization objective alongside predictive performance.
This approach has profound implications. It enables fine-grained control over the fairness-accuracy tradeoff through explicit constraints or penalty terms. It can achieve stronger fairness guarantees by directly optimizing for fairness criteria. And it can adapt fairness interventions to the specific model being trained, rather than applying generic data transformations.
By the end of this page, you will be able to: (1) Formulate fairness-constrained optimization problems, (2) Implement regularization-based approaches to fairness, (3) Apply adversarial debiasing to neural networks, (4) Understand reduction-based approaches that convert fairness constraints to cost-sensitive learning, (5) Evaluate tradeoffs between different in-processing methods.
The Constrained Optimization Perspective:
Classical machine learning optimizes a single objective—typically empirical risk:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i)$$
In-processing reframes this as constrained optimization:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i) \quad \text{subject to} \quad \mathcal{F}(h) \leq \epsilon$$
where $\mathcal{F}(h)$ quantifies unfairness and $\epsilon$ is the maximum tolerable unfairness. Alternatively, we can optimize a Lagrangian:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i) + \lambda \cdot \mathcal{F}(h) \right]$$
where $\lambda$ controls the fairness-accuracy tradeoff.
Before examining specific algorithms, we must formalize what fairness constraints look like mathematically. Different fairness definitions translate to different constraint formulations.
Demographic Parity (Statistical Parity): Positive predictions should be independent of protected attribute.
$$P(\hat{Y} = 1 | A = 0) = P(\hat{Y} = 1 | A = 1)$$
As a constraint: $|P(\hat{Y} = 1 | A = 0) - P(\hat{Y} = 1 | A = 1)| \leq \epsilon$
Equalized Odds: True positive rate and false positive rate should be equal across groups.
$$P(\hat{Y} = 1 | Y = y, A = 0) = P(\hat{Y} = 1 | Y = y, A = 1) \quad \forall y \in \{0, 1\}$$
Equal Opportunity: True positive rate should be equal (relaxation of equalized odds).
$$P(\hat{Y} = 1 | Y = 1, A = 0) = P(\hat{Y} = 1 | Y = 1, A = 1)$$
| Fairness Criterion | Mathematical Constraint | Intuition |
|---|---|---|
| Demographic Parity | $\lvert \mathbb{E}[\hat{Y} \mid A=0] - \mathbb{E}[\hat{Y} \mid A=1] \rvert \leq \epsilon$ | Equal selection rates across groups |
| Equal Opportunity | $|TPR_0 - TPR_1| \leq \epsilon$ | Equal chance for qualified individuals |
| Equalized Odds | $|TPR_0 - TPR_1| + |FPR_0 - FPR_1| \leq \epsilon$ | Equal error rates for both outcomes |
| Predictive Parity | $|PPV_0 - PPV_1| \leq \epsilon$ | Equal precision across groups |
| Calibration | $P(Y=1 \mid \hat{p}=p, A=a)=p$ for all $a$ | Predicted probabilities match actual rates |
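To make these criteria concrete, here is a minimal NumPy sketch that computes the group gaps from the table for a binary classifier and a binary protected attribute. The function name `fairness_report` is illustrative (not from a library), and it assumes each group contains both classes and at least one positive prediction.

```python
import numpy as np

def fairness_report(y_true: np.ndarray, y_pred: np.ndarray, a: np.ndarray) -> dict:
    """Group-gap metrics from the table, for binary labels/predictions and A in {0, 1}."""
    sel, tpr, fpr, ppv = {}, {}, {}, {}
    for g in (0, 1):
        m = a == g
        sel[g] = y_pred[m].mean()                  # selection rate
        tpr[g] = y_pred[m & (y_true == 1)].mean()  # true positive rate
        fpr[g] = y_pred[m & (y_true == 0)].mean()  # false positive rate
        ppv[g] = y_true[m & (y_pred == 1)].mean()  # precision
    return {
        'demographic_parity_gap': abs(sel[0] - sel[1]),
        'equal_opportunity_gap': abs(tpr[0] - tpr[1]),
        'equalized_odds_gap': abs(tpr[0] - tpr[1]) + abs(fpr[0] - fpr[1]),
        'predictive_parity_gap': abs(ppv[0] - ppv[1]),
    }
```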
Chouldechova (2017) and Kleinberg et al. (2016) proved that except in degenerate cases, it's impossible to simultaneously satisfy calibration, equal false positive rates, and equal false negative rates when base rates differ across groups. This means perfect fairness across all criteria is usually unachievable—we must choose which fairness properties matter most.
Converting Discrete Constraints to Continuous Losses:
Most fairness constraints are defined in terms of hard classifications, i.e., indicator functions, which are non-differentiable. For gradient-based optimization, we replace them with continuous relaxations:
Soft Demographic Parity: Replace $\hat{Y} \in \{0, 1\}$ with prediction score $s(x) \in [0, 1]$: $$\mathcal{F}_{DP}(h) = |\mathbb{E}[s(X) | A = 0] - \mathbb{E}[s(X) | A = 1]|^2$$
Covariance-Based Relaxation: For a binary classifier and binary protected attribute, demographic parity is equivalent to zero covariance between predictions and the protected attribute: $$\mathcal{F}_{cov}(h) = |\text{Cov}(\hat{Y}, A)|^2 = |\mathbb{E}[\hat{Y} \cdot A] - \mathbb{E}[\hat{Y}]\mathbb{E}[A]|^2$$
These relaxations enable end-to-end gradient-based training with fairness objectives.
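A minimal PyTorch sketch of the two relaxations (function names are mine); either penalty can be added to an ordinary differentiable training loss:

```python
import torch

def dp_penalty(scores: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Soft demographic parity: squared gap between the groups' mean scores."""
    return (scores[a == 0].mean() - scores[a == 1].mean()) ** 2

def cov_penalty(scores: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Covariance relaxation: squared covariance between scores and protected attribute."""
    a = a.float()
    return ((scores * a).mean() - scores.mean() * a.mean()) ** 2
```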
The simplest in-processing approach adds a fairness regularization term to the standard loss function. Just as L2 regularization penalizes large weights to prevent overfitting, fairness regularization penalizes unfair predictions.
General Formulation: $$L_{total} = L_{task}(h; X, Y) + \lambda \cdot R_{fairness}(h; X, A)$$
where $L_{task}$ is the standard predictive loss (e.g., cross-entropy), $R_{fairness}$ is a differentiable penalty measuring unfairness on the training data, and $\lambda \geq 0$ sets how heavily unfairness is penalized.
Correlation-Based Regularization:
Zafar et al. (2017) proposed penalizing the correlation between predictions and protected attributes:
$$R_{corr}(h) = \left( \frac{1}{n} \sum_{i=1}^n (s(x_i) - \bar{s})(a_i - \bar{a}) \right)^2$$
For a classifier with decision boundary $w^T x + b = 0$, this becomes: $$R_{corr}(w) = \left( \frac{1}{n} \sum_{i=1}^n d(x_i)(a_i - \bar{a}) \right)^2$$
where $d(x_i) = w^T x_i + b$ is the signed distance from the decision boundary.
Key Insight: Minimizing covariance with $A$ while maximizing accuracy pushes the model toward decision boundaries that are orthogonal to protected attribute patterns.
```python
import numpy as np
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, ClassifierMixin
from typing import Optional


class FairLogisticRegression(BaseEstimator, ClassifierMixin):
    """
    Logistic regression with fairness regularization.

    Adds a penalty term for correlation between predictions
    and protected attribute to standard log-loss.
    """

    def __init__(self, fairness_weight: float = 1.0,
                 l2_weight: float = 0.01,
                 fairness_type: str = 'demographic_parity'):
        """
        Args:
            fairness_weight: Lambda controlling fairness vs accuracy tradeoff
            l2_weight: L2 regularization strength
            fairness_type: 'demographic_parity' or 'equal_opportunity'
        """
        self.fairness_weight = fairness_weight
        self.l2_weight = l2_weight
        self.fairness_type = fairness_type
        self.weights_ = None
        self.bias_ = None

    def _sigmoid(self, z):
        """Numerically stable sigmoid."""
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),
            np.exp(z) / (1 + np.exp(z))
        )

    def _predict_proba(self, X, weights, bias):
        """Predict probabilities."""
        return self._sigmoid(X @ weights + bias)

    def _log_loss(self, probs, y):
        """Binary cross-entropy loss."""
        eps = 1e-15
        probs = np.clip(probs, eps, 1 - eps)
        return -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))

    def _fairness_penalty_dp(self, probs, protected):
        """Demographic parity penalty: |E[p|A=0] - E[p|A=1]|^2"""
        mean_0 = np.mean(probs[protected == 0])
        mean_1 = np.mean(probs[protected == 1])
        return (mean_0 - mean_1) ** 2

    def _fairness_penalty_eo(self, probs, protected, y):
        """
        Equal opportunity penalty: difference in TPR.
        Only considers positive examples (y=1).
        """
        pos_mask = y == 1
        if pos_mask.sum() == 0:
            return 0.0
        probs_pos = probs[pos_mask]
        protected_pos = protected[pos_mask]
        mask_0 = protected_pos == 0
        mask_1 = protected_pos == 1
        if mask_0.sum() == 0 or mask_1.sum() == 0:
            return 0.0
        tpr_0 = np.mean(probs_pos[mask_0])  # Expected TPR for group 0
        tpr_1 = np.mean(probs_pos[mask_1])  # Expected TPR for group 1
        return (tpr_0 - tpr_1) ** 2

    def _objective(self, params, X, y, protected):
        """Combined objective: log loss + L2 + fairness penalty."""
        weights = params[:-1]
        bias = params[-1]
        probs = self._predict_proba(X, weights, bias)

        # Task loss
        loss = self._log_loss(probs, y)

        # L2 regularization
        l2_penalty = self.l2_weight * np.sum(weights ** 2)

        # Fairness penalty
        if self.fairness_type == 'demographic_parity':
            fairness_penalty = self._fairness_penalty_dp(probs, protected)
        elif self.fairness_type == 'equal_opportunity':
            fairness_penalty = self._fairness_penalty_eo(probs, protected, y)
        else:
            raise ValueError(f"Unknown fairness type: {self.fairness_type}")

        return loss + l2_penalty + self.fairness_weight * fairness_penalty

    def fit(self, X, y, protected):
        """
        Fit the fair logistic regression model.

        Args:
            X: Feature matrix (n_samples, n_features)
            y: Binary labels (n_samples,)
            protected: Protected attribute (n_samples,)
        """
        X = np.asarray(X)
        y = np.asarray(y)
        protected = np.asarray(protected)

        n_features = X.shape[1]

        # Initialize parameters
        init_params = np.zeros(n_features + 1)

        # Optimize
        result = minimize(
            self._objective,
            init_params,
            args=(X, y, protected),
            method='L-BFGS-B',
            options={'maxiter': 1000}
        )

        self.weights_ = result.x[:-1]
        self.bias_ = result.x[-1]
        return self

    def predict_proba(self, X):
        """Predict probabilities."""
        probs = self._predict_proba(X, self.weights_, self.bias_)
        return np.column_stack([1 - probs, probs])

    def predict(self, X, threshold=0.5):
        """Predict binary labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# Demonstration
if __name__ == "__main__":
    from sklearn.metrics import accuracy_score

    np.random.seed(42)
    n = 2000

    # Biased data generation
    protected = np.random.binomial(1, 0.4, n)
    X = np.random.randn(n, 3)
    X[:, 0] += protected  # Feature 0 correlates with protected

    # Labels biased by protected attribute
    logits = X[:, 0] + X[:, 1] + 0.5 * protected
    y = (logits + np.random.randn(n) * 0.5 > 0.5).astype(int)

    # Train fair model
    fair_model = FairLogisticRegression(
        fairness_weight=10.0,
        fairness_type='demographic_parity'
    )
    fair_model.fit(X, y, protected)

    # Evaluate
    preds = fair_model.predict(X)
    print(f"Accuracy: {accuracy_score(y, preds):.3f}")
    print(f"P(Y=1|A=0): {np.mean(preds[protected==0]):.3f}")
    print(f"P(Y=1|A=1): {np.mean(preds[protected==1]):.3f}")
```

The regularization strength λ controls the fairness-accuracy tradeoff. Too small: little fairness improvement. Too large: accuracy degrades severely. In practice, plot a Pareto frontier by varying λ and select a point that balances your specific requirements. There's no universal 'correct' value.
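As a sketch of that workflow, the helper below (names and λ grid are illustrative) sweeps λ over the `FairLogisticRegression` class above and records accuracy and the demographic-parity gap; in practice you would evaluate each point on a held-out validation split rather than the training data.

```python
from sklearn.metrics import accuracy_score

def pareto_sweep(X, y, protected,
                 lambdas=(0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0)):
    """Trace (lambda, accuracy, DP gap) points using FairLogisticRegression above."""
    points = []
    for lam in lambdas:
        model = FairLogisticRegression(fairness_weight=lam,
                                       fairness_type='demographic_parity')
        model.fit(X, y, protected)
        preds = model.predict(X)  # prefer a held-out split in practice
        gap = abs(preds[protected == 0].mean() - preds[protected == 1].mean())
        points.append((lam, accuracy_score(y, preds), gap))
    return points
```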
Adversarial debiasing, introduced by Zhang et al. (2018), applies adversarial training principles to fairness. The core idea is elegant: if an adversary cannot predict the protected attribute from the model's predictions or internal representations, then the model isn't using protected information.
Architecture: Two networks are trained jointly: a predictor $f_\theta$ that maps inputs $X$ to task predictions, and an adversary $g_\phi$ that tries to recover the protected attribute $A$ from the predictor's outputs or internal representations.
Training Objective:
The predictor minimizes task loss while maximizing the adversary's loss: $$\min_\theta \max_\phi \left[ L_{task}(f_\theta(X), Y) - \lambda \cdot L_{adv}(g_\phi(f_\theta(X)), A) \right]$$
The predictor 'fights' the adversary by learning representations from which $A$ cannot be recovered.
Mathematical Analysis:
Let $Z$ be the representation learned by the predictor (e.g., the last hidden layer). The adversarial objective pushes toward:
$$I(Z; A) \rightarrow 0$$
where $I(Z; A)$ is the mutual information between the representation and the protected attribute.
Why This Works:
If even the best adversary of the form $g_\phi$ cannot predict $A$ better than random guessing, then $Z$ (and thus $\hat{Y}$) carries essentially no usable information about $A$, and the model's ability to discriminate on $A$ through its representation has been removed. In practice the adversary is a finite network, so this is an approximation rather than a guarantee.
Training Dynamics:
This creates a minimax game similar to GANs, with the predictor trying to 'fool' the adversary about group membership.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from typing import Tuple


class AdversarialDebiasing(nn.Module):
    """
    Adversarial debiasing for fair classification.

    Uses a predictor and adversary trained in competition:
    - Predictor tries to predict Y while hiding A
    - Adversary tries to recover A from predictor's representations
    """

    def __init__(self, input_dim: int, hidden_dim: int = 64,
                 adversary_weight: float = 1.0):
        super().__init__()
        self.adversary_weight = adversary_weight

        # Main predictor network
        self.predictor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.predictor_head = nn.Linear(hidden_dim, 1)

        # Adversary network (predicts protected attribute from representation)
        self.adversary = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass.

        Returns:
            y_pred: Task predictions (logits)
            a_pred: Adversary predictions (logits for protected attribute)
        """
        # Shared representation
        representation = self.predictor(x)

        # Task prediction
        y_logits = self.predictor_head(representation)

        # Adversary prediction
        a_logits = self.adversary(representation)

        return y_logits.squeeze(), a_logits.squeeze()

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        """Predict task labels (inference only)."""
        representation = self.predictor(x)
        y_logits = self.predictor_head(representation)
        return torch.sigmoid(y_logits).squeeze()


class GradientReversalLayer(torch.autograd.Function):
    """
    Gradient reversal layer for domain adaptation / adversarial debiasing.
    Forward: identity, Backward: negate gradients
    """

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None


def train_adversarial_debiasing(
    model: AdversarialDebiasing,
    X: torch.Tensor,
    y: torch.Tensor,
    protected: torch.Tensor,
    epochs: int = 200,
    batch_size: int = 256,
    lr_predictor: float = 0.001,
    lr_adversary: float = 0.001,
    adversary_steps: int = 1
) -> None:
    """
    Train adversarial debiasing model.

    Training alternates between:
    1. Training adversary to predict protected attribute
    2. Training predictor to predict target while fooling adversary
    """
    # Separate optimizers
    predictor_params = (
        list(model.predictor.parameters())
        + list(model.predictor_head.parameters())
    )
    adversary_params = list(model.adversary.parameters())

    predictor_optimizer = Adam(predictor_params, lr=lr_predictor)
    adversary_optimizer = Adam(adversary_params, lr=lr_adversary)

    n_samples = len(X)
    n_batches = (n_samples + batch_size - 1) // batch_size

    for epoch in range(epochs):
        # Shuffle data
        perm = torch.randperm(n_samples)
        X_shuffled = X[perm]
        y_shuffled = y[perm]
        protected_shuffled = protected[perm]

        epoch_pred_loss = 0.0
        epoch_adv_loss = 0.0

        for batch_idx in range(n_batches):
            start = batch_idx * batch_size
            end = min(start + batch_size, n_samples)

            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            a_batch = protected_shuffled[start:end]

            # Step 1: Train adversary to predict protected attribute
            for _ in range(adversary_steps):
                adversary_optimizer.zero_grad()
                with torch.no_grad():
                    representation = model.predictor(X_batch)
                a_logits = model.adversary(representation)
                adv_loss = F.binary_cross_entropy_with_logits(
                    a_logits.squeeze(), a_batch.float()
                )
                adv_loss.backward()
                adversary_optimizer.step()

            # Step 2: Train predictor to predict Y while maximizing adversary loss
            predictor_optimizer.zero_grad()
            y_logits, a_logits = model(X_batch)

            # Task loss
            pred_loss = F.binary_cross_entropy_with_logits(
                y_logits, y_batch.float()
            )

            # Adversary loss (we want to maximize this, so subtract)
            adv_loss = F.binary_cross_entropy_with_logits(
                a_logits, a_batch.float()
            )

            # Combined loss: minimize task loss, maximize adversary loss
            total_loss = pred_loss - model.adversary_weight * adv_loss
            total_loss.backward()
            predictor_optimizer.step()

            epoch_pred_loss += pred_loss.item()
            epoch_adv_loss += adv_loss.item()

        if (epoch + 1) % 50 == 0:
            print(f"Epoch {epoch+1}: Pred Loss={epoch_pred_loss/n_batches:.4f}, "
                  f"Adv Loss={epoch_adv_loss/n_batches:.4f}")


# Demonstration
if __name__ == "__main__":
    torch.manual_seed(42)

    # Generate biased data
    n = 3000
    protected = torch.bernoulli(torch.full((n,), 0.4))
    X = torch.randn(n, 5)
    X[:, 0] += protected  # Encode protected in feature

    # Biased labels
    logits = X[:, 0] + X[:, 1] + 0.5 * protected
    y = (logits + torch.randn(n) * 0.5 > 0.5).float()

    # Train
    model = AdversarialDebiasing(input_dim=5, adversary_weight=2.0)
    train_adversarial_debiasing(model, X, y, protected, epochs=200)

    # Evaluate
    with torch.no_grad():
        preds = (model.predict(X) > 0.5).float()
    print(f"\nP(Ŷ=1|A=0): {preds[protected==0].mean():.3f}")
    print(f"P(Ŷ=1|A=1): {preds[protected==1].mean():.3f}")
```

Like GANs, adversarial debiasing can suffer from training instability. The predictor and adversary may oscillate, or the adversary may become too strong and prevent the predictor from learning anything useful. Techniques from GAN training (gradient clipping, learning rate scheduling, alternating update frequencies) can help stabilize training.
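The `GradientReversalLayer` defined in the listing is not actually invoked by the alternating-update loop above. A common alternative, sketched below with an illustrative helper `grl_forward`, routes the adversary's input through the reversal layer so that a single optimizer over all parameters can be used; the reversal supplies the minus sign for the encoder.

```python
import torch.nn.functional as F

def grl_forward(model: AdversarialDebiasing, x, lambda_: float = 1.0):
    """Forward pass that routes the adversary through the gradient reversal layer."""
    representation = model.predictor(x)
    y_logits = model.predictor_head(representation).squeeze()
    # Identity in the forward pass; gradients flowing back into the encoder are negated.
    a_logits = model.adversary(
        GradientReversalLayer.apply(representation, lambda_)
    ).squeeze()
    return y_logits, a_logits

# With this forward pass, one optimizer over all parameters can simply minimize
#   F.binary_cross_entropy_with_logits(y_logits, y)
#   + F.binary_cross_entropy_with_logits(a_logits, a)
# and the reversal layer handles the adversarial sign flip for the encoder.
```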
Rather than converting fairness to a soft penalty, we can treat it as a hard constraint using techniques from constrained optimization. This approach provides stronger guarantees—if optimization succeeds, the fairness constraint is satisfied, not just approximately minimized.
Lagrangian Approach:
The constrained problem: $$\min_h L(h) \quad \text{s.t.} \quad g(h) \leq 0$$
where $g(h) = \mathcal{F}(h) - \epsilon$ is the fairness constraint, becomes the Lagrangian:
$$\mathcal{L}(h, \lambda) = L(h) + \lambda \cdot g(h)$$
Primal-Dual Optimization:
We solve for the saddle point: $$\min_h \max_{\lambda \geq 0} \mathcal{L}(h, \lambda)$$
Alternating updates: a primal step adjusts the model parameters to decrease the Lagrangian (gradient descent on $h$), and a dual step adjusts $\lambda$ in proportion to the current constraint violation (projected gradient ascent that keeps $\lambda \geq 0$), as written out below.
This naturally adapts the fairness penalty: if constraints are violated, $\lambda$ increases; if satisfied, $\lambda$ decreases.
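Written out, the alternating updates take the following form (a sketch; the step sizes $\eta_h$ and $\eta_\lambda$ are assumed hyperparameters):

$$h_{t+1} = h_t - \eta_h \, \nabla_h \mathcal{L}(h_t, \lambda_t), \qquad \lambda_{t+1} = \max\bigl(0,\ \lambda_t + \eta_\lambda \, g(h_t)\bigr)$$

The projection $\max(0, \cdot)$ keeps the multiplier non-negative, which mirrors the softplus trick used in the implementation later in this section.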
Exponentiated Gradient Method (Agarwal et al., 2018):
This influential approach casts fairness-constrained learning as a two-player game between a learner, which picks classifiers, and an auditor (the $\lambda$-player), which picks which fairness constraint to penalize most heavily.
The learner solves a cost-sensitive classification problem at each round: $$h_t = \arg\min_h \sum_i c_i^t L(h(x_i), y_i)$$
where costs $c_i^t$ are derived from constraint violations.
The final classifier is a (possibly randomized) mixture of classifiers from all rounds: $$\hat{h} = \sum_t \alpha_t h_t$$
Key Insight: This reduces fair classification to a sequence of cost-sensitive classifications, which can use any off-the-shelf learner.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict


class LagrangianFairClassifier(nn.Module):
    """
    Fair classifier using Lagrangian dual optimization.

    Maintains dual variables for fairness constraints and
    updates them based on constraint violations.
    """

    def __init__(self, input_dim: int, hidden_dim: int = 32,
                 num_constraints: int = 1, epsilon: float = 0.05):
        super().__init__()
        self.epsilon = epsilon

        # Classifier network
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Dual variables (Lagrange multipliers) - learnable
        # One for each constraint (e.g., demographic parity)
        self.lambdas = nn.Parameter(torch.ones(num_constraints))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.network(x)).squeeze()

    def compute_dp_constraint(self, probs: torch.Tensor,
                              protected: torch.Tensor) -> torch.Tensor:
        """
        Compute demographic parity constraint violation.

        g(h) = |P(Ŷ=1|A=0) - P(Ŷ=1|A=1)| - epsilon

        Returns positive value if constraint is violated.
        """
        rate_0 = probs[protected == 0].mean()
        rate_1 = probs[protected == 1].mean()
        violation = torch.abs(rate_0 - rate_1) - self.epsilon
        return violation

    def compute_lagrangian(self, probs: torch.Tensor, labels: torch.Tensor,
                           protected: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        Compute Lagrangian: L = task_loss + lambda * g(h)
        """
        # Task loss
        task_loss = F.binary_cross_entropy(probs, labels.float())

        # Constraint violation
        constraint_violation = self.compute_dp_constraint(probs, protected)

        # Lagrangian
        # Ensure lambda is positive
        lambda_pos = F.softplus(self.lambdas[0])
        lagrangian = task_loss + lambda_pos * constraint_violation

        return {
            'lagrangian': lagrangian,
            'task_loss': task_loss,
            'constraint_violation': constraint_violation,
            'lambda': lambda_pos
        }


def train_lagrangian_fair(
    model: LagrangianFairClassifier,
    X: torch.Tensor,
    y: torch.Tensor,
    protected: torch.Tensor,
    epochs: int = 300,
    lr_primal: float = 0.01,
    lr_dual: float = 0.1
) -> None:
    """
    Train using primal-dual optimization.

    - Primal update: minimize Lagrangian w.r.t. network parameters
    - Dual update: maximize Lagrangian w.r.t. lambda (increase if violated)
    """
    # Separate parameters
    network_params = list(model.network.parameters())
    dual_params = [model.lambdas]

    primal_optimizer = torch.optim.Adam(network_params, lr=lr_primal)
    dual_optimizer = torch.optim.SGD(dual_params, lr=lr_dual)

    for epoch in range(epochs):
        # Primal step: minimize Lagrangian over network params
        primal_optimizer.zero_grad()
        probs = model(X)
        losses = model.compute_lagrangian(probs, y, protected)
        # For primal, we minimize Lagrangian
        losses['lagrangian'].backward()
        primal_optimizer.step()

        # Dual step: maximize Lagrangian over lambda
        # This is equivalent to increasing lambda when constraint is violated
        dual_optimizer.zero_grad()
        probs = model(X)
        losses = model.compute_lagrangian(probs, y, protected)
        # For dual, we maximize (minimize negative)
        (-losses['lagrangian']).backward()
        dual_optimizer.step()

        # Project lambda to be non-negative (handled by softplus)

        if (epoch + 1) % 50 == 0:
            print(f"Epoch {epoch+1}: Loss={losses['task_loss']:.4f}, "
                  f"Violation={losses['constraint_violation']:.4f}, "
                  f"λ={losses['lambda']:.4f}")


# Demo
if __name__ == "__main__":
    torch.manual_seed(42)
    n = 2000

    # Biased data
    protected = torch.bernoulli(torch.full((n,), 0.4))
    X = torch.randn(n, 4)
    X[:, 0] += protected
    logits = X[:, 0] + X[:, 1] + 0.3 * protected
    y = (logits + torch.randn(n) * 0.3 > 0.3).float()

    # Train
    model = LagrangianFairClassifier(input_dim=4, epsilon=0.02)
    train_lagrangian_fair(model, X, y, protected, epochs=300)

    # Final evaluation
    with torch.no_grad():
        preds = (model(X) > 0.5).float()
    print(f"\nFinal Demographics:")
    print(f"  P(Ŷ=1|A=0): {preds[protected==0].mean():.3f}")
    print(f"  P(Ŷ=1|A=1): {preds[protected==1].mean():.3f}")
    print(f"  Gap: {abs(preds[protected==0].mean() - preds[protected==1].mean()):.3f}")
```

The Fairlearn library (fairlearn.org) provides production-ready implementations of constrained optimization methods, including the exponentiated gradient algorithm. It supports various fairness constraints (demographic parity, equalized odds, etc.) and works with any scikit-learn compatible classifier.
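For example, a reduction-based fit with Fairlearn might look like the sketch below; check the current Fairlearn documentation in case the API has changed, and note that `X`, `y`, and `A` are assumed to be your features, labels, and protected attribute.

```python
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

base = LogisticRegression(solver='liblinear')   # any sklearn-compatible estimator
mitigator = ExponentiatedGradient(base,
                                  constraints=DemographicParity(),
                                  eps=0.02)     # tolerated constraint violation

mitigator.fit(X, y, sensitive_features=A)       # X, y, A: your data
y_pred = mitigator.predict(X)
```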
Ensemble methods offer a natural framework for fairness: instead of modifying a single model, we combine multiple models in ways that achieve fair overall predictions.
Fair Boosting:
Modifications to AdaBoost that incorporate fairness. Instead of reweighting based only on prediction errors, also consider fairness violations:
$$w_i^{(t+1)} \propto w_i^{(t)} \cdot \exp(\alpha_t \mathbb{1}[h_t(x_i) \neq y_i]) \cdot \textcolor{blue}{\exp(\beta_t \cdot \text{FairnessViolation}(x_i, a_i, h_t))}$$
The additional term increases weights for samples that contribute to unfairness, forcing subsequent weak learners to focus on reducing disparity.
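A minimal sketch of one such reweighting step follows; the specific fairness-violation term used here is one illustrative choice among many, and the function name is mine.

```python
import numpy as np

def fairness_aware_reweight(w, y, a, y_pred, alpha_t, beta_t):
    """One boosting-round weight update combining error and unfairness contributions."""
    err = (y_pred != y).astype(float)
    # Illustrative DP-style violation: members of the group currently receiving
    # fewer positive predictions who were denied a positive prediction.
    rates = {g: y_pred[a == g].mean() for g in np.unique(a)}
    disadvantaged = min(rates, key=rates.get)
    violation = ((a == disadvantaged) & (y_pred == 0)).astype(float)
    w_new = w * np.exp(alpha_t * err) * np.exp(beta_t * violation)
    return w_new / w_new.sum()
```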
Fair Stacking:
Train multiple base models and learn a fair meta-classifier that combines their predictions. The meta-classifier's own training objective carries the fairness penalty or constraint (any of the formulations above), while the base models remain unconstrained.
This separates predictive power (base models) from fairness (meta-classifier).
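One way to realize this, sketched below under the assumption that the `FairLogisticRegression` class defined earlier on this page is available, is to feed out-of-fold base-model scores into that fair meta-learner. The choice of base models and hyperparameters is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_fair_stack(X, y, protected):
    """Unconstrained base models + fair meta-classifier over their scores."""
    bases = [GradientBoostingClassifier(), LogisticRegression(max_iter=1000)]
    # Out-of-fold scores keep base-model overfitting out of the meta-learner.
    meta_features = np.column_stack([
        cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
        for m in bases
    ])
    meta = FairLogisticRegression(fairness_weight=5.0,
                                  fairness_type='demographic_parity')
    meta.fit(meta_features, y, protected)
    return [m.fit(X, y) for m in bases], meta
```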
Ensemble Calibration:
Train an unconstrained ensemble, then learn group-specific calibration functions to adjust predictions:
$$\hat{y}_{fair}(x, a) = c_a(\hat{y}_{ensemble}(x))$$
where $c_a$ is a monotonic calibration function learned to satisfy fairness constraints for group $a$.
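A minimal sketch of per-group monotonic calibration using scikit-learn's isotonic regression (function names are mine): fitting each group's map on held-out data pushes the groups toward equal calibration; targeting a different fairness criterion would require a further adjustment on top of this.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_group_calibrators(scores, y, a):
    """Fit one monotonic (isotonic) calibration map per protected group."""
    calibrators = {}
    for g in np.unique(a):
        m = a == g
        cal = IsotonicRegression(out_of_bounds='clip')
        cal.fit(scores[m], y[m])
        calibrators[g] = cal
    return calibrators

def apply_group_calibration(calibrators, scores, a):
    """Apply each group's calibration map to that group's scores."""
    out = np.empty_like(scores, dtype=float)
    for g, cal in calibrators.items():
        m = a == g
        out[m] = cal.predict(scores[m])
    return out
```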
Deep learning offers unique opportunities for in-processing fairness through architectural modifications, specialized layers, and representation-level interventions.
Fairness-Specific Architectures:
1. Disentangled Representations: Learn separate representations for task-relevant and protected-attribute-related information:
$$z = [z_{task}, z_{protected}]$$
Then use only $z_{task}$ for prediction. This requires auxiliary losses to ensure proper disentanglement.
2. Group-Specific Batch Normalization: Since batch normalization learns running statistics, use separate BN layers for each protected group:
$$\text{BN}_a(x) = \gamma_a \frac{x - \mu_a}{\sigma_a} + \beta_a$$
This allows group-specific normalization while sharing other parameters.
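A minimal PyTorch sketch of such a layer (the class name `GroupBatchNorm1d` is mine); it assumes the group label is available at inference time and that each group has more than one sample per training batch.

```python
import torch
import torch.nn as nn

class GroupBatchNorm1d(nn.Module):
    """Separate BatchNorm1d statistics and affine parameters per protected group."""

    def __init__(self, num_features: int, num_groups: int = 2):
        super().__init__()
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(num_features) for _ in range(num_groups)]
        )

    def forward(self, x: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        for g, bn in enumerate(self.norms):
            mask = group == g
            if mask.any():
                out[mask] = bn(x[mask])  # each group normalized with its own stats
        return out
```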
Gradient-Based Interventions:
Gradient Projection: Project gradients to remove components that would increase unfairness:
$$g_{fair} = g - \text{Proj}_{\nabla \mathcal{F}(h)}(g)$$
where $g$ is the gradient of the task loss and we project out the direction that increases the fairness violation.
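A minimal sketch over flattened gradient vectors (the function name is mine):

```python
import torch

def project_out_unfairness(task_grad: torch.Tensor,
                           fairness_grad: torch.Tensor) -> torch.Tensor:
    """g_fair = g - proj_u(g), with g and u given as flattened gradient vectors."""
    u = fairness_grad
    coef = torch.dot(task_grad, u) / (torch.dot(u, u) + 1e-12)
    # A common variant applies the projection only when coef > 0, i.e., only when
    # the task gradient would actually increase the fairness violation.
    return task_grad - coef * u
```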
Gradient Reversal: For adversarial debiasing, the gradient reversal layer inverts gradients during backpropagation:
$$\frac{\partial L_{adv}}{\partial \theta_{encoder}} \rightarrow -\lambda \frac{\partial L_{adv}}{\partial \theta_{encoder}}$$
This causes the encoder to learn representations that maximize adversary confusion.
Multi-Task Learning with Fairness: Treat fairness as an auxiliary task with negative weight:
$$L = L_{task} - \lambda \cdot L_{A}$$
where $L_{A}$ is the loss for predicting the protected attribute from intermediate representations.
Recent work explores fairness in foundation models and LLMs through techniques like constitution-guided training, preference learning with fairness constraints, and prompt-based debiasing. As models become larger and more general, fairness interventions are increasingly applied during fine-tuning rather than training from scratch.
| Method | Fairness Guarantee | Computational Cost | Interpretability | Best For |
|---|---|---|---|---|
| Regularization | Soft (penalty-based) | Low | High | Simple models, quick experiments |
| Adversarial Debiasing | Soft (information-theoretic) | High | Low | Neural networks, representation learning |
| Constrained Optimization | Hard (constraint-based) | Medium | Medium | When strict fairness bounds needed |
| Reduction Methods | Provable bounds | Medium | High | Off-the-shelf learners |
| Ensemble Methods | Depends on design | High | Medium | When base models exist |
Selection Guidelines:
Use Regularization when: you want a simple, interpretable baseline, the model is trained by gradient-based or convex optimization, and a soft fairness penalty is acceptable.
Use Adversarial Debiasing when: you are training neural networks and want fair intermediate representations, not just fair final predictions, and can manage adversarial training dynamics.
Use Constrained Optimization when: you need explicit, tunable bounds on fairness violations and can afford primal-dual training.
Use Reduction Methods when: you want to keep an off-the-shelf learner (e.g., via Fairlearn) and value provable fairness-accuracy guarantees.
There is no universally best in-processing method. Performance depends on the data, model, fairness definition, and computational resources. Always compare multiple approaches on your specific problem with held-out validation sets.
What's Next:
The next page explores post-processing methods—techniques applied after model training to adjust predictions for fairness. These methods are particularly valuable when models cannot be retrained or when different fairness thresholds are needed for different deployment contexts.
You now understand the major in-processing approaches to bias mitigation—from simple regularization to sophisticated constrained optimization. These techniques provide powerful tools for building models that are fair by design, not just fair by accident.