In-processing methods take a fundamentally different stance from pre-processing: rather than fixing the data before training, we embed fairness directly into the learning algorithm. The model learns to be fair as it learns to be accurate, treating fairness as a first-class optimization objective alongside predictive performance.
This approach has profound implications. It enables fine-grained control over the fairness-accuracy tradeoff through explicit constraints or penalty terms. It can achieve stronger fairness guarantees by directly optimizing for fairness criteria. And it can adapt fairness interventions to the specific model being trained, rather than applying generic data transformations.
By the end of this page, you will be able to: (1) Formulate fairness-constrained optimization problems, (2) Implement regularization-based approaches to fairness, (3) Apply adversarial debiasing to neural networks, (4) Understand reduction-based approaches that convert fairness constraints to cost-sensitive learning, (5) Evaluate tradeoffs between different in-processing methods.
The Constrained Optimization Perspective:
Classical machine learning optimizes a single objective—typically empirical risk:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i)$$
In-processing reframes this as constrained optimization:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i) \quad \text{subject to} \quad \mathcal{F}(h) \leq \epsilon$$
where $\mathcal{F}(h)$ quantifies unfairness and $\epsilon$ is the maximum tolerable unfairness. Alternatively, we can optimize a Lagrangian:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i) + \lambda \cdot \mathcal{F}(h) \right]$$
where $\lambda$ controls the fairness-accuracy tradeoff.
Before examining specific algorithms, we must formalize what fairness constraints look like mathematically. Different fairness definitions translate to different constraint formulations.
Demographic Parity (Statistical Parity): Positive predictions should be independent of protected attribute.
$$P(\hat{Y} = 1 | A = 0) = P(\hat{Y} = 1 | A = 1)$$
As a constraint: $|P(\hat{Y} = 1 | A = 0) - P(\hat{Y} = 1 | A = 1)| \leq \epsilon$
Equalized Odds: True positive rate and false positive rate should be equal across groups.
$$P(\hat{Y} = 1 | Y = y, A = 0) = P(\hat{Y} = 1 | Y = y, A = 1) \quad \forall y \in \{0, 1\}$$
Equal Opportunity: True positive rate should be equal (relaxation of equalized odds).
$$P(\hat{Y} = 1 | Y = 1, A = 0) = P(\hat{Y} = 1 | Y = 1, A = 1)$$
| Fairness Criterion | Mathematical Constraint | Intuition |
|---|---|---|
| Demographic Parity | $\lvert \mathbb{E}[\hat{Y} \mid A=0] - \mathbb{E}[\hat{Y} \mid A=1] \rvert \leq \epsilon$ | Equal selection rates across groups |
| Equal Opportunity | $|TPR_0 - TPR_1| \leq \epsilon$ | Equal chance for qualified individuals |
| Equalized Odds | $|TPR_0 - TPR_1| + |FPR_0 - FPR_1| \leq \epsilon$ | Equal error rates for both outcomes |
| Predictive Parity | $|PPV_0 - PPV_1| \leq \epsilon$ | Equal precision across groups |
| Calibration | $P(Y=1 \mid \hat{p}=p, A=a)=p$ for all $a$ | Predicted probabilities match actual rates |
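To make these criteria concrete, here is a minimal NumPy sketch that computes the group gaps from the table for a binary classifier and a binary protected attribute. The function name `fairness_report` is illustrative (not from a library), and it assumes each group contains both classes and at least one positive prediction.

```python
import numpy as np

def fairness_report(y_true: np.ndarray, y_pred: np.ndarray, a: np.ndarray) -> dict:
    """Group-gap metrics from the table, for binary labels/predictions and A in {0, 1}."""
    sel, tpr, fpr, ppv = {}, {}, {}, {}
    for g in (0, 1):
        m = a == g
        sel[g] = y_pred[m].mean()                  # selection rate
        tpr[g] = y_pred[m & (y_true == 1)].mean()  # true positive rate
        fpr[g] = y_pred[m & (y_true == 0)].mean()  # false positive rate
        ppv[g] = y_true[m & (y_pred == 1)].mean()  # precision
    return {
        'demographic_parity_gap': abs(sel[0] - sel[1]),
        'equal_opportunity_gap': abs(tpr[0] - tpr[1]),
        'equalized_odds_gap': abs(tpr[0] - tpr[1]) + abs(fpr[0] - fpr[1]),
        'predictive_parity_gap': abs(ppv[0] - ppv[1]),
    }
```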
Chouldechova (2017) and Kleinberg et al. (2016) proved that except in degenerate cases, it's impossible to simultaneously satisfy calibration, equal false positive rates, and equal false negative rates when base rates differ across groups. This means perfect fairness across all criteria is usually unachievable—we must choose which fairness properties matter most.
Converting Discrete Constraints to Continuous Losses:
Most fairness constraints are defined in terms of hard classifications, i.e., indicator functions, which are non-differentiable. For gradient-based optimization, we replace them with continuous relaxations:
Soft Demographic Parity: Replace $\hat{Y} \in \{0, 1\}$ with prediction score $s(x) \in [0, 1]$: $$\mathcal{F}_{DP}(h) = |\mathbb{E}[s(X) | A = 0] - \mathbb{E}[s(X) | A = 1]|^2$$
Covariance-Based Relaxation: For a binary classifier and binary protected attribute, demographic parity is equivalent to zero covariance between predictions and the protected attribute: $$\mathcal{F}_{cov}(h) = |\text{Cov}(\hat{Y}, A)|^2 = |\mathbb{E}[\hat{Y} \cdot A] - \mathbb{E}[\hat{Y}]\mathbb{E}[A]|^2$$
These relaxations enable end-to-end gradient-based training with fairness objectives.
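A minimal PyTorch sketch of the two relaxations (function names are mine); either penalty can be added to an ordinary differentiable training loss:

```python
import torch

def dp_penalty(scores: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Soft demographic parity: squared gap between the groups' mean scores."""
    return (scores[a == 0].mean() - scores[a == 1].mean()) ** 2

def cov_penalty(scores: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Covariance relaxation: squared covariance between scores and protected attribute."""
    a = a.float()
    return ((scores * a).mean() - scores.mean() * a.mean()) ** 2
```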
The simplest in-processing approach adds a fairness regularization term to the standard loss function. Just as L2 regularization penalizes large weights to prevent overfitting, fairness regularization penalizes unfair predictions.
General Formulation: $$L_{total} = L_{task}(h; X, Y) + \lambda \cdot R_{fairness}(h; X, A)$$
where $L_{task}$ is the standard predictive loss (e.g., cross-entropy), $R_{fairness}$ is a differentiable penalty measuring unfairness on the training data, and $\lambda \geq 0$ sets how heavily unfairness is penalized.
Correlation-Based Regularization:
Zafar et al. (2017) proposed penalizing the correlation between predictions and protected attributes:
$$R_{corr}(h) = \left( \frac{1}{n} \sum_{i=1}^n (s(x_i) - \bar{s})(a_i - \bar{a}) \right)^2$$
For a classifier with decision boundary $w^T x + b = 0$, this becomes: $$R_{corr}(w) = \left( \frac{1}{n} \sum_{i=1}^n d(x_i)(a_i - \bar{a}) \right)^2$$
where $d(x_i) = w^T x_i + b$ is the signed distance from the decision boundary.
Key Insight: Minimizing covariance with $A$ while maximizing accuracy pushes the model toward decision boundaries that are orthogonal to protected attribute patterns.
```python
import numpy as np
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, ClassifierMixin
from typing import Optional


class FairLogisticRegression(BaseEstimator, ClassifierMixin):
    """
    Logistic regression with fairness regularization.

    Adds a penalty term for correlation between predictions
    and protected attribute to standard log-loss.
    """

    def __init__(self, fairness_weight: float = 1.0,
                 l2_weight: float = 0.01,
                 fairness_type: str = 'demographic_parity'):
        """
        Args:
            fairness_weight: Lambda controlling fairness vs accuracy tradeoff
            l2_weight: L2 regularization strength
            fairness_type: 'demographic_parity' or 'equal_opportunity'
        """
        self.fairness_weight = fairness_weight
        self.l2_weight = l2_weight
        self.fairness_type = fairness_type
        self.weights_ = None
        self.bias_ = None

    def _sigmoid(self, z):
        """Numerically stable sigmoid."""
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),
            np.exp(z) / (1 + np.exp(z))
        )

    def _predict_proba(self, X, weights, bias):
        """Predict probabilities."""
        return self._sigmoid(X @ weights + bias)

    def _log_loss(self, probs, y):
        """Binary cross-entropy loss."""
        eps = 1e-15
        probs = np.clip(probs, eps, 1 - eps)
        return -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))

    def _fairness_penalty_dp(self, probs, protected):
        """Demographic parity penalty: |E[p|A=0] - E[p|A=1]|^2"""
        mean_0 = np.mean(probs[protected == 0])
        mean_1 = np.mean(probs[protected == 1])
        return (mean_0 - mean_1) ** 2

    def _fairness_penalty_eo(self, probs, protected, y):
        """
        Equal opportunity penalty: difference in TPR.
        Only considers positive examples (y=1).
        """
        pos_mask = y == 1
        if pos_mask.sum() == 0:
            return 0.0
        probs_pos = probs[pos_mask]
        protected_pos = protected[pos_mask]
        mask_0 = protected_pos == 0
        mask_1 = protected_pos == 1
        if mask_0.sum() == 0 or mask_1.sum() == 0:
            return 0.0
        tpr_0 = np.mean(probs_pos[mask_0])  # Expected TPR for group 0
        tpr_1 = np.mean(probs_pos[mask_1])  # Expected TPR for group 1
        return (tpr_0 - tpr_1) ** 2

    def _objective(self, params, X, y, protected):
        """Combined objective: log loss + L2 + fairness penalty."""
        weights = params[:-1]
        bias = params[-1]
        probs = self._predict_proba(X, weights, bias)

        # Task loss
        loss = self._log_loss(probs, y)

        # L2 regularization
        l2_penalty = self.l2_weight * np.sum(weights ** 2)

        # Fairness penalty
        if self.fairness_type == 'demographic_parity':
            fairness_penalty = self._fairness_penalty_dp(probs, protected)
        elif self.fairness_type == 'equal_opportunity':
            fairness_penalty = self._fairness_penalty_eo(probs, protected, y)
        else:
            raise ValueError(f"Unknown fairness type: {self.fairness_type}")

        return loss + l2_penalty + self.fairness_weight * fairness_penalty

    def fit(self, X, y, protected):
        """
        Fit the fair logistic regression model.

        Args:
            X: Feature matrix (n_samples, n_features)
            y: Binary labels (n_samples,)
            protected: Protected attribute (n_samples,)
        """
        X = np.asarray(X)
        y = np.asarray(y)
        protected = np.asarray(protected)

        n_features = X.shape[1]

        # Initialize parameters
        init_params = np.zeros(n_features + 1)

        # Optimize
        result = minimize(
            self._objective,
            init_params,
            args=(X, y, protected),
            method='L-BFGS-B',
            options={'maxiter': 1000}
        )

        self.weights_ = result.x[:-1]
        self.bias_ = result.x[-1]
        return self

    def predict_proba(self, X):
        """Predict probabilities."""
        probs = self._predict_proba(X, self.weights_, self.bias_)
        return np.column_stack([1 - probs, probs])

    def predict(self, X, threshold=0.5):
        """Predict binary labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# Demonstration
if __name__ == "__main__":
    from sklearn.metrics import accuracy_score

    np.random.seed(42)
    n = 2000

    # Biased data generation
    protected = np.random.binomial(1, 0.4, n)
    X = np.random.randn(n, 3)
    X[:, 0] += protected  # Feature 0 correlates with protected

    # Labels biased by protected attribute
    logits = X[:, 0] + X[:, 1] + 0.5 * protected
    y = (logits + np.random.randn(n) * 0.5 > 0.5).astype(int)

    # Train fair model
    fair_model = FairLogisticRegression(
        fairness_weight=10.0,
        fairness_type='demographic_parity'
    )
    fair_model.fit(X, y, protected)

    # Evaluate
    preds = fair_model.predict(X)
    print(f"Accuracy: {accuracy_score(y, preds):.3f}")
    print(f"P(Y=1|A=0): {np.mean(preds[protected==0]):.3f}")
    print(f"P(Y=1|A=1): {np.mean(preds[protected==1]):.3f}")
```

The regularization strength λ controls the fairness-accuracy tradeoff. Too small: little fairness improvement. Too large: accuracy degrades severely. In practice, plot a Pareto frontier by varying λ and select a point that balances your specific requirements. There's no universal 'correct' value.
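As a sketch of that workflow, the helper below (names and λ grid are illustrative) sweeps λ over the `FairLogisticRegression` class above and records accuracy and the demographic-parity gap; in practice you would evaluate each point on a held-out validation split rather than the training data.

```python
from sklearn.metrics import accuracy_score

def pareto_sweep(X, y, protected,
                 lambdas=(0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0)):
    """Trace (lambda, accuracy, DP gap) points using FairLogisticRegression above."""
    points = []
    for lam in lambdas:
        model = FairLogisticRegression(fairness_weight=lam,
                                       fairness_type='demographic_parity')
        model.fit(X, y, protected)
        preds = model.predict(X)  # prefer a held-out split in practice
        gap = abs(preds[protected == 0].mean() - preds[protected == 1].mean())
        points.append((lam, accuracy_score(y, preds), gap))
    return points
```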
Adversarial debiasing, introduced by Zhang et al. (2018), applies adversarial training principles to fairness. The core idea is elegant: if an adversary cannot predict the protected attribute from the model's predictions or internal representations, then the model isn't using protected information.
Architecture: Two networks are trained jointly: a predictor $f_\theta$ that maps inputs $X$ to task predictions, and an adversary $g_\phi$ that tries to recover the protected attribute $A$ from the predictor's outputs or internal representations.
Training Objective:
The predictor minimizes task loss while maximizing the adversary's loss: $$\min_\theta \max_\phi \left[ L_{task}(f_\theta(X), Y) - \lambda \cdot L_{adv}(g_\phi(f_\theta(X)), A) \right]$$
The predictor 'fights' the adversary by learning representations from which $A$ cannot be recovered.
Mathematical Analysis:
Let $Z$ be the representation learned by the predictor (e.g., the last hidden layer). The adversarial objective pushes toward:
$$I(Z; A) \rightarrow 0$$
where $I(Z; A)$ is the mutual information between the representation and the protected attribute.
Why This Works:
If even the best adversary of the form $g_\phi$ cannot predict $A$ better than random guessing, then $Z$ (and thus $\hat{Y}$) carries essentially no usable information about $A$, and the model's ability to discriminate on $A$ through its representation has been removed. In practice the adversary is a finite network, so this is an approximation rather than a guarantee.
Training Dynamics:
This creates a minimax game similar to GANs, with the predictor trying to 'fool' the adversary about group membership.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from typing import Tuple


class AdversarialDebiasing(nn.Module):
    """
    Adversarial debiasing for fair classification.

    Uses a predictor and adversary trained in competition:
    - Predictor tries to predict Y while hiding A
    - Adversary tries to recover A from predictor's representations
    """

    def __init__(self, input_dim: int, hidden_dim: int = 64,
                 adversary_weight: float = 1.0):
        super().__init__()
        self.adversary_weight = adversary_weight

        # Main predictor network
        self.predictor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.predictor_head = nn.Linear(hidden_dim, 1)

        # Adversary network (predicts protected attribute from representation)
        self.adversary = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass.

        Returns:
            y_pred: Task predictions (logits)
            a_pred: Adversary predictions (logits for protected attribute)
        """
        # Shared representation
        representation = self.predictor(x)

        # Task prediction
        y_logits = self.predictor_head(representation)

        # Adversary prediction
        a_logits = self.adversary(representation)

        return y_logits.squeeze(), a_logits.squeeze()

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        """Predict task labels (inference only)."""
        representation = self.predictor(x)
        y_logits = self.predictor_head(representation)
        return torch.sigmoid(y_logits).squeeze()


class GradientReversalLayer(torch.autograd.Function):
    """
    Gradient reversal layer for domain adaptation / adversarial debiasing.
    Forward: identity, Backward: negate gradients
    """

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None


def train_adversarial_debiasing(
    model: AdversarialDebiasing,
    X: torch.Tensor,
    y: torch.Tensor,
    protected: torch.Tensor,
    epochs: int = 200,
    batch_size: int = 256,
    lr_predictor: float = 0.001,
    lr_adversary: float = 0.001,
    adversary_steps: int = 1
) -> None:
    """
    Train adversarial debiasing model.

    Training alternates between:
    1. Training adversary to predict protected attribute
    2. Training predictor to predict target while fooling adversary
    """
    # Separate optimizers
    predictor_params = (
        list(model.predictor.parameters())
        + list(model.predictor_head.parameters())
    )
    adversary_params = list(model.adversary.parameters())

    predictor_optimizer = Adam(predictor_params, lr=lr_predictor)
    adversary_optimizer = Adam(adversary_params, lr=lr_adversary)

    n_samples = len(X)
    n_batches = (n_samples + batch_size - 1) // batch_size

    for epoch in range(epochs):
        # Shuffle data
        perm = torch.randperm(n_samples)
        X_shuffled = X[perm]
        y_shuffled = y[perm]
        protected_shuffled = protected[perm]

        epoch_pred_loss = 0.0
        epoch_adv_loss = 0.0

        for batch_idx in range(n_batches):
            start = batch_idx * batch_size
            end = min(start + batch_size, n_samples)

            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            a_batch = protected_shuffled[start:end]

            # Step 1: Train adversary to predict protected attribute
            for _ in range(adversary_steps):
                adversary_optimizer.zero_grad()
                with torch.no_grad():
                    representation = model.predictor(X_batch)
                a_logits = model.adversary(representation)
                adv_loss = F.binary_cross_entropy_with_logits(
                    a_logits.squeeze(), a_batch.float()
                )
                adv_loss.backward()
                adversary_optimizer.step()

            # Step 2: Train predictor to predict Y while maximizing adversary loss
            predictor_optimizer.zero_grad()
            y_logits, a_logits = model(X_batch)

            # Task loss
            pred_loss = F.binary_cross_entropy_with_logits(
                y_logits, y_batch.float()
            )

            # Adversary loss (we want to maximize this, so subtract)
            adv_loss = F.binary_cross_entropy_with_logits(
                a_logits, a_batch.float()
            )

            # Combined loss: minimize task loss, maximize adversary loss
            total_loss = pred_loss - model.adversary_weight * adv_loss
            total_loss.backward()
            predictor_optimizer.step()

            epoch_pred_loss += pred_loss.item()
            epoch_adv_loss += adv_loss.item()

        if (epoch + 1) % 50 == 0:
            print(f"Epoch {epoch+1}: Pred Loss={epoch_pred_loss/n_batches:.4f}, "
                  f"Adv Loss={epoch_adv_loss/n_batches:.4f}")


# Demonstration
if __name__ == "__main__":
    torch.manual_seed(42)

    # Generate biased data
    n = 3000
    protected = torch.bernoulli(torch.full((n,), 0.4))
    X = torch.randn(n, 5)
    X[:, 0] += protected  # Encode protected in feature

    # Biased labels
    logits = X[:, 0] + X[:, 1] + 0.5 * protected
    y = (logits + torch.randn(n) * 0.5 > 0.5).float()

    # Train
    model = AdversarialDebiasing(input_dim=5, adversary_weight=2.0)
    train_adversarial_debiasing(model, X, y, protected, epochs=200)

    # Evaluate
    with torch.no_grad():
        preds = (model.predict(X) > 0.5).float()
    print(f"\nP(Ŷ=1|A=0): {preds[protected==0].mean():.3f}")
    print(f"P(Ŷ=1|A=1): {preds[protected==1].mean():.3f}")
```

Like GANs, adversarial debiasing can suffer from training instability. The predictor and adversary may oscillate, or the adversary may become too strong and prevent the predictor from learning anything useful. Techniques from GAN training (gradient clipping, learning rate scheduling, alternating update frequencies) can help stabilize training.
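The `GradientReversalLayer` defined in the listing is not actually invoked by the alternating-update loop above. A common alternative, sketched below with an illustrative helper `grl_forward`, routes the adversary's input through the reversal layer so that a single optimizer over all parameters can be used; the reversal supplies the minus sign for the encoder.

```python
import torch.nn.functional as F

def grl_forward(model: AdversarialDebiasing, x, lambda_: float = 1.0):
    """Forward pass that routes the adversary through the gradient reversal layer."""
    representation = model.predictor(x)
    y_logits = model.predictor_head(representation).squeeze()
    # Identity in the forward pass; gradients flowing back into the encoder are negated.
    a_logits = model.adversary(
        GradientReversalLayer.apply(representation, lambda_)
    ).squeeze()
    return y_logits, a_logits

# With this forward pass, one optimizer over all parameters can simply minimize
#   F.binary_cross_entropy_with_logits(y_logits, y)
#   + F.binary_cross_entropy_with_logits(a_logits, a)
# and the reversal layer handles the adversarial sign flip for the encoder.
```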
Rather than converting fairness to a soft penalty, we can treat it as a hard constraint using techniques from constrained optimization. This approach provides stronger guarantees—if optimization succeeds, the fairness constraint is satisfied, not just approximately minimized.
Lagrangian Approach:
The constrained problem: $$\min_h L(h) \quad \text{s.t.} \quad g(h) \leq 0$$
where $g(h) = \mathcal{F}(h) - \epsilon$ is the fairness constraint, becomes the Lagrangian:
$$\mathcal{L}(h, \lambda) = L(h) + \lambda \cdot g(h)$$
Primal-Dual Optimization:
We solve for the saddle point: $$\min_h \max_{\lambda \geq 0} \mathcal{L}(h, \lambda)$$
Alternating updates: a primal step adjusts the model parameters to decrease the Lagrangian (gradient descent on $h$), and a dual step adjusts $\lambda$ in proportion to the current constraint violation (projected gradient ascent that keeps $\lambda \geq 0$), as written out below.
This naturally adapts the fairness penalty: if constraints are violated, $\lambda$ increases; if satisfied, $\lambda$ decreases.
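Written out, the alternating updates take the following form (a sketch; the step sizes $\eta_h$ and $\eta_\lambda$ are assumed hyperparameters):

$$h_{t+1} = h_t - \eta_h \, \nabla_h \mathcal{L}(h_t, \lambda_t), \qquad \lambda_{t+1} = \max\bigl(0,\ \lambda_t + \eta_\lambda \, g(h_t)\bigr)$$

The projection $\max(0, \cdot)$ keeps the multiplier non-negative, which mirrors the softplus trick used in the implementation later in this section.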
Exponentiated Gradient Method (Agarwal et al., 2018):
This influential approach casts fairness-constrained learning as a two-player game between a learner, which picks classifiers, and an auditor (the $\lambda$-player), which picks which fairness constraint to penalize most heavily.
The learner solves a cost-sensitive classification problem at each round: $$h_t = \arg\min_h \sum_i c_i^t L(h(x_i), y_i)$$
where costs $c_i^t$ are derived from constraint violations.
The final classifier is a (possibly randomized) mixture of classifiers from all rounds: $$\hat{h} = \sum_t \alpha_t h_t$$
Key Insight: This reduces fair classification to a sequence of cost-sensitive classifications, which can use any off-the-shelf learner.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict


class LagrangianFairClassifier(nn.Module):
    """
    Fair classifier using Lagrangian dual optimization.

    Maintains dual variables for fairness constraints and
    updates them based on constraint violations.
    """

    def __init__(self, input_dim: int, hidden_dim: int = 32,
                 num_constraints: int = 1, epsilon: float = 0.05):
        super().__init__()
        self.epsilon = epsilon

        # Classifier network
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Dual variables (Lagrange multipliers) - learnable
        # One for each constraint (e.g., demographic parity)
        self.lambdas = nn.Parameter(torch.ones(num_constraints))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.network(x)).squeeze()

    def compute_dp_constraint(self, probs: torch.Tensor,
                              protected: torch.Tensor) -> torch.Tensor:
        """
        Compute demographic parity constraint violation.

        g(h) = |P(Ŷ=1|A=0) - P(Ŷ=1|A=1)| - epsilon

        Returns positive value if constraint is violated.
        """
        rate_0 = probs[protected == 0].mean()
        rate_1 = probs[protected == 1].mean()
        violation = torch.abs(rate_0 - rate_1) - self.epsilon
        return violation

    def compute_lagrangian(self, probs: torch.Tensor, labels: torch.Tensor,
                           protected: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        Compute Lagrangian: L = task_loss + lambda * g(h)
        """
        # Task loss
        task_loss = F.binary_cross_entropy(probs, labels.float())

        # Constraint violation
        constraint_violation = self.compute_dp_constraint(probs, protected)

        # Lagrangian
        # Ensure lambda is positive
        lambda_pos = F.softplus(self.lambdas[0])
        lagrangian = task_loss + lambda_pos * constraint_violation

        return {
            'lagrangian': lagrangian,
            'task_loss': task_loss,
            'constraint_violation': constraint_violation,
            'lambda': lambda_pos
        }


def train_lagrangian_fair(
    model: LagrangianFairClassifier,
    X: torch.Tensor,
    y: torch.Tensor,
    protected: torch.Tensor,
    epochs: int = 300,
    lr_primal: float = 0.01,
    lr_dual: float = 0.1
) -> None:
    """
    Train using primal-dual optimization.

    - Primal update: minimize Lagrangian w.r.t. network parameters
    - Dual update: maximize Lagrangian w.r.t. lambda (increase if violated)
    """
    # Separate parameters
    network_params = list(model.network.parameters())
    dual_params = [model.lambdas]

    primal_optimizer = torch.optim.Adam(network_params, lr=lr_primal)
    dual_optimizer = torch.optim.SGD(dual_params, lr=lr_dual)

    for epoch in range(epochs):
        # Primal step: minimize Lagrangian over network params
        primal_optimizer.zero_grad()
        probs = model(X)
        losses = model.compute_lagrangian(probs, y, protected)
        # For primal, we minimize Lagrangian
        losses['lagrangian'].backward()
        primal_optimizer.step()

        # Dual step: maximize Lagrangian over lambda
        # This is equivalent to increasing lambda when constraint is violated
        dual_optimizer.zero_grad()
        probs = model(X)
        losses = model.compute_lagrangian(probs, y, protected)
        # For dual, we maximize (minimize negative)
        (-losses['lagrangian']).backward()
        dual_optimizer.step()

        # Project lambda to be non-negative (handled by softplus)

        if (epoch + 1) % 50 == 0:
            print(f"Epoch {epoch+1}: Loss={losses['task_loss']:.4f}, "
                  f"Violation={losses['constraint_violation']:.4f}, "
                  f"λ={losses['lambda']:.4f}")


# Demo
if __name__ == "__main__":
    torch.manual_seed(42)
    n = 2000

    # Biased data
    protected = torch.bernoulli(torch.full((n,), 0.4))
    X = torch.randn(n, 4)
    X[:, 0] += protected
    logits = X[:, 0] + X[:, 1] + 0.3 * protected
    y = (logits + torch.randn(n) * 0.3 > 0.3).float()

    # Train
    model = LagrangianFairClassifier(input_dim=4, epsilon=0.02)
    train_lagrangian_fair(model, X, y, protected, epochs=300)

    # Final evaluation
    with torch.no_grad():
        preds = (model(X) > 0.5).float()
    print(f"\nFinal Demographics:")
    print(f"  P(Ŷ=1|A=0): {preds[protected==0].mean():.3f}")
    print(f"  P(Ŷ=1|A=1): {preds[protected==1].mean():.3f}")
    print(f"  Gap: {abs(preds[protected==0].mean() - preds[protected==1].mean()):.3f}")
```

The Fairlearn library (fairlearn.org) provides production-ready implementations of constrained optimization methods, including the exponentiated gradient algorithm. It supports various fairness constraints (demographic parity, equalized odds, etc.) and works with any scikit-learn compatible classifier.
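For example, a reduction-based fit with Fairlearn might look like the sketch below; check the current Fairlearn documentation in case the API has changed, and note that `X`, `y`, and `A` are assumed to be your features, labels, and protected attribute.

```python
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

base = LogisticRegression(solver='liblinear')   # any sklearn-compatible estimator
mitigator = ExponentiatedGradient(base,
                                  constraints=DemographicParity(),
                                  eps=0.02)     # tolerated constraint violation

mitigator.fit(X, y, sensitive_features=A)       # X, y, A: your data
y_pred = mitigator.predict(X)
```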
Ensemble methods offer a natural framework for fairness: instead of modifying a single model, we combine multiple models in ways that achieve fair overall predictions.
Fair Boosting:
Modifications to AdaBoost that incorporate fairness. Instead of reweighting based only on prediction errors, also consider fairness violations:
$$w_i^{(t+1)} \propto w_i^{(t)} \cdot \exp(\alpha_t \mathbb{1}[h_t(x_i) \neq y_i]) \cdot \textcolor{blue}{\exp(\beta_t \cdot \text{FairnessViolation}(x_i, a_i, h_t))}$$
The additional term increases weights for samples that contribute to unfairness, forcing subsequent weak learners to focus on reducing disparity.
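A minimal sketch of one such reweighting step follows; the specific fairness-violation term used here is one illustrative choice among many, and the function name is mine.

```python
import numpy as np

def fairness_aware_reweight(w, y, a, y_pred, alpha_t, beta_t):
    """One boosting-round weight update combining error and unfairness contributions."""
    err = (y_pred != y).astype(float)
    # Illustrative DP-style violation: members of the group currently receiving
    # fewer positive predictions who were denied a positive prediction.
    rates = {g: y_pred[a == g].mean() for g in np.unique(a)}
    disadvantaged = min(rates, key=rates.get)
    violation = ((a == disadvantaged) & (y_pred == 0)).astype(float)
    w_new = w * np.exp(alpha_t * err) * np.exp(beta_t * violation)
    return w_new / w_new.sum()
```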
Fair Stacking:
Train multiple base models and learn a fair meta-classifier that combines their predictions. The meta-classifier's own training objective carries the fairness penalty or constraint (any of the formulations above), while the base models remain unconstrained.
This separates predictive power (base models) from fairness (meta-classifier).
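One way to realize this, sketched below under the assumption that the `FairLogisticRegression` class defined earlier on this page is available, is to feed out-of-fold base-model scores into that fair meta-learner. The choice of base models and hyperparameters is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_fair_stack(X, y, protected):
    """Unconstrained base models + fair meta-classifier over their scores."""
    bases = [GradientBoostingClassifier(), LogisticRegression(max_iter=1000)]
    # Out-of-fold scores keep base-model overfitting out of the meta-learner.
    meta_features = np.column_stack([
        cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
        for m in bases
    ])
    meta = FairLogisticRegression(fairness_weight=5.0,
                                  fairness_type='demographic_parity')
    meta.fit(meta_features, y, protected)
    return [m.fit(X, y) for m in bases], meta
```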
Ensemble Calibration:
Train an unconstrained ensemble, then learn group-specific calibration functions to adjust predictions:
$$\hat{y}_{fair}(x, a) = c_a(\hat{y}_{ensemble}(x))$$
where $c_a$ is a monotonic calibration function learned to satisfy fairness constraints for group $a$.
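A minimal sketch of per-group monotonic calibration using scikit-learn's isotonic regression (function names are mine): fitting each group's map on held-out data pushes the groups toward equal calibration; targeting a different fairness criterion would require a further adjustment on top of this.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_group_calibrators(scores, y, a):
    """Fit one monotonic (isotonic) calibration map per protected group."""
    calibrators = {}
    for g in np.unique(a):
        m = a == g
        cal = IsotonicRegression(out_of_bounds='clip')
        cal.fit(scores[m], y[m])
        calibrators[g] = cal
    return calibrators

def apply_group_calibration(calibrators, scores, a):
    """Apply each group's calibration map to that group's scores."""
    out = np.empty_like(scores, dtype=float)
    for g, cal in calibrators.items():
        m = a == g
        out[m] = cal.predict(scores[m])
    return out
```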
Deep learning offers unique opportunities for in-processing fairness through architectural modifications, specialized layers, and representation-level interventions.
Fairness-Specific Architectures:
1. Disentangled Representations: Learn separate representations for task-relevant and protected-attribute-related information:
$$z = [z_{task}, z_{protected}]$$
Then use only $z_{task}$ for prediction. This requires auxiliary losses to ensure proper disentanglement.
2. Group-Specific Batch Normalization: Since batch normalization learns running statistics, use separate BN layers for each protected group:
$$\text{BN}_a(x) = \gamma_a \frac{x - \mu_a}{\sigma_a} + \beta_a$$
This allows group-specific normalization while sharing other parameters.
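A minimal PyTorch sketch of such a layer (the class name `GroupBatchNorm1d` is mine); it assumes the group label is available at inference time and that each group has more than one sample per training batch.

```python
import torch
import torch.nn as nn

class GroupBatchNorm1d(nn.Module):
    """Separate BatchNorm1d statistics and affine parameters per protected group."""

    def __init__(self, num_features: int, num_groups: int = 2):
        super().__init__()
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(num_features) for _ in range(num_groups)]
        )

    def forward(self, x: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        for g, bn in enumerate(self.norms):
            mask = group == g
            if mask.any():
                out[mask] = bn(x[mask])  # each group normalized with its own stats
        return out
```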
Gradient-Based Interventions:
Gradient Projection: Project gradients to remove components that would increase unfairness:
$$g_{fair} = g - \text{Proj}_{\nabla \mathcal{F}(h)}(g)$$
where $g$ is the gradient of the task loss and we project out the direction that increases the fairness violation.
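A minimal sketch over flattened gradient vectors (the function name is mine):

```python
import torch

def project_out_unfairness(task_grad: torch.Tensor,
                           fairness_grad: torch.Tensor) -> torch.Tensor:
    """g_fair = g - proj_u(g), with g and u given as flattened gradient vectors."""
    u = fairness_grad
    coef = torch.dot(task_grad, u) / (torch.dot(u, u) + 1e-12)
    # A common variant applies the projection only when coef > 0, i.e., only when
    # the task gradient would actually increase the fairness violation.
    return task_grad - coef * u
```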
Gradient Reversal: For adversarial debiasing, the gradient reversal layer inverts gradients during backpropagation:
$$\frac{\partial L_{adv}}{\partial \theta_{encoder}} \rightarrow -\lambda \frac{\partial L_{adv}}{\partial \theta_{encoder}}$$
This causes the encoder to learn representations that maximize adversary confusion.
Multi-Task Learning with Fairness: Treat fairness as an auxiliary task with negative weight:
$$L = L_{task} - \lambda \cdot L_{A}$$
where $L_{A}$ is the loss for predicting the protected attribute from intermediate representations.
Recent work explores fairness in foundation models and LLMs through techniques like constitution-guided training, preference learning with fairness constraints, and prompt-based debiasing. As models become larger and more general, fairness interventions are increasingly applied during fine-tuning rather than training from scratch.
| Method | Fairness Guarantee | Computational Cost | Interpretability | Best For |
|---|---|---|---|---|
| Regularization | Soft (penalty-based) | Low | High | Simple models, quick experiments |
| Adversarial Debiasing | Soft (information-theoretic) | High | Low | Neural networks, representation learning |
| Constrained Optimization | Hard (constraint-based) | Medium | Medium | When strict fairness bounds needed |
| Reduction Methods | Provable bounds | Medium | High | Off-the-shelf learners |
| Ensemble Methods | Depends on design | High | Medium | When base models exist |
Selection Guidelines:
Use Regularization when: you want a simple, interpretable baseline, the model is trained by gradient-based or convex optimization, and a soft fairness penalty is acceptable.
Use Adversarial Debiasing when: you are training neural networks and want fair intermediate representations, not just fair final predictions, and can manage adversarial training dynamics.
Use Constrained Optimization when: you need explicit, tunable bounds on fairness violations and can afford primal-dual training.
Use Reduction Methods when: you want to keep an off-the-shelf learner (e.g., via Fairlearn) and value provable fairness-accuracy guarantees.
There is no universally best in-processing method. Performance depends on the data, model, fairness definition, and computational resources. Always compare multiple approaches on your specific problem with held-out validation sets.
What's Next:
The next page explores post-processing methods—techniques applied after model training to adjust predictions for fairness. These methods are particularly valuable when models cannot be retrained or when different fairness thresholds are needed for different deployment contexts.
You now understand the major in-processing approaches to bias mitigation—from simple regularization to sophisticated constrained optimization. These techniques provide powerful tools for building models that are fair by design, not just fair by accident.