Imagine a world where hospitals could collaboratively train a diagnostic AI model using patient data from millions of records—without any single hospital ever sharing its sensitive medical data with others. Or consider financial institutions detecting fraud patterns across the entire banking ecosystem while keeping their customers' transaction histories completely private. This isn't science fiction; it's the promise of Federated Learning (FL).
Federated learning represents a paradigm shift in how we think about distributed machine learning. Instead of the traditional approach where data is centralized in a single location for training, federated learning keeps data in place and brings the model to the data. This fundamental inversion of the training pipeline has profound implications for privacy, scalability, and the future of collaborative AI development.
By the end of this page, you will understand the core concepts of distributed training, the motivations behind federated learning, the fundamental architecture of federated systems, and the key differences between federated learning and traditional distributed machine learning. You'll develop an intuition for when and why federated learning is the right choice.
To understand federated learning, we must first understand the landscape of distributed machine learning that preceded it. As datasets grew beyond what could fit in a single machine's memory and models became too large for individual GPUs, distributed training became essential.
The Traditional Distributed Training Paradigm:
Conventional distributed training assumes all data can be centralized in a data center or cloud environment. The distribution occurs at the computational level—splitting either the data (data parallelism) or the model (model parallelism) across multiple workers—but all data ultimately resides under a single administrative domain.
| Era | Paradigm | Data Location | Key Limitation |
|---|---|---|---|
| Pre-2010 | Single-machine training | Local disk/memory | Memory and compute bounded |
| 2010-2016 | Distributed data parallelism | Centralized data center | Network bandwidth bottleneck |
| 2016-2018 | Large-scale distributed (Ring AllReduce) | Centralized cloud storage | Data must be moved to compute |
| 2017+ | Federated Learning | Distributed across edge devices/silos | Data never leaves source; computation moves |
Data Parallelism vs. Model Parallelism:
Data parallelism partitions the training dataset across workers. Each worker maintains a complete copy of the model and computes gradients on its local data partition. Gradients are then synchronized (typically via AllReduce) to update a global model.
Model parallelism partitions the model itself across workers, with each worker responsible for computing only a portion of the forward and backward passes. This is essential for models too large to fit on a single device but introduces complex synchronization dependencies.
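To make the contrast concrete, here is a minimal, framework-free sketch of data parallelism on a toy linear model with synthetic data (all names and numbers are illustrative): each worker holds the full model, computes a gradient on its own shard, and the gradients are averaged exactly as an AllReduce would before every replica applies the same update. Notice how this presumes that one piece of code can see and partition the entire dataset.

```python
# Minimal sketch of data parallelism: each worker holds a full copy of the
# model and a shard of the data; gradients are averaged (as AllReduce would)
# before every update. All names here are illustrative, not a real framework.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
w = np.zeros(10)                                 # shared model replica
shards = np.array_split(np.arange(1000), 4)      # 4 workers, 4 data shards

for step in range(100):
    grads = []
    for shard in shards:                         # each worker: local gradient
        Xs, ys = X[shard], y[shard]
        grads.append(2 * Xs.T @ (Xs @ w - ys) / len(shard))
    g = np.mean(grads, axis=0)                   # "AllReduce": average gradients
    w -= 0.01 * g                                # every replica applies the same update
```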
Both approaches share a critical assumption: all training data is accessible from a unified computational environment. This assumption fundamentally breaks in many real-world scenarios.
Traditional distributed training assumes data can be consolidated. But when data is legally protected (GDPR, HIPAA), commercially sensitive (competitive banking data), or practically immovable (IoT sensors generating terabytes/day), this assumption fails. Federated learning emerges from the need to train models when data cannot move.
Federated Learning was formally introduced by Google in 2016 for training keyboard prediction models on mobile devices. The foundational paper, Communication-Efficient Learning of Deep Networks from Decentralized Data (McMahan et al., 2017), established the Federated Averaging (FedAvg) algorithm that remains the cornerstone of federated learning systems.
The Federated Learning Definition:
Federated learning is a machine learning paradigm where:

- multiple clients (edge devices or organizations) collaboratively train a shared model under the coordination of a central server;
- each client's raw training data never leaves that client;
- only model updates (weights or gradients) are communicated; and
- the server aggregates those updates into a new global model.

The FedAvg pseudocode below makes this procedure concrete:
```python
# FedAvg: The Foundational Federated Learning Algorithm
# McMahan et al., 2017 - "Communication-Efficient Learning of Deep Networks
# from Decentralized Data"
#
# Note: Model, Dataset, create_model, SGD, DataLoader, compute_loss,
# compute_gradients, and evaluate_global_model are assumed framework helpers.

import random
from typing import List

import numpy as np


def federated_averaging(
    global_model: Model,
    clients: List[Client],
    num_rounds: int,
    clients_per_round: int,    # number of clients sampled per round (≈ C·K)
    local_epochs: int,         # E epochs of local training
    local_batch_size: int,     # B batch size
    learning_rate: float,      # η learning rate
    eval_frequency: int = 10   # evaluate the global model every N rounds
) -> Model:
    """
    Federated Averaging Algorithm (FedAvg)

    The core insight: instead of communicating gradients every batch,
    allow clients to perform multiple local SGD updates before
    communicating. This dramatically reduces communication costs.

    Args:
        global_model: Initial model weights w₀
        clients: Collection of K clients, each with local dataset Dₖ
        num_rounds: Number of federated communication rounds T
        clients_per_round: Number of clients sampled per round
        local_epochs: Number of local training epochs per round
        local_batch_size: Batch size for local SGD
        learning_rate: Learning rate for local optimization
        eval_frequency: How often (in rounds) to evaluate the global model

    Returns:
        Trained global model after T rounds
    """
    for round_t in range(num_rounds):
        # Step 1: Server samples a subset of clients
        # Typically C=0.1 to C=0.3 fraction of all clients
        selected_clients = random.sample(clients, clients_per_round)

        # Step 2: Server broadcasts current global model to selected clients
        for client in selected_clients:
            client.receive_model(global_model.get_weights())

        # Step 3: Each client performs local training
        client_updates = []
        client_weights = []  # For weighted averaging by dataset size

        for client in selected_clients:
            # Client trains on local data for E epochs
            local_weights = client.local_train(
                epochs=local_epochs,
                batch_size=local_batch_size,
                learning_rate=learning_rate
            )
            # Client computes update: Δwₖ = wₖ - w_global
            update = local_weights - global_model.get_weights()
            client_updates.append(update)
            client_weights.append(client.dataset_size)

        # Step 4: Server aggregates updates (weighted by dataset size)
        # w_{t+1} = Σₖ (nₖ/n) * wₖ  where nₖ = |Dₖ|, n = Σₖ nₖ
        total_samples = sum(client_weights)
        aggregated_update = sum(
            (nk / total_samples) * update
            for nk, update in zip(client_weights, client_updates)
        )

        # Step 5: Server updates global model
        global_model.apply_update(aggregated_update)

        # Optional: Evaluate global model on held-out test set
        if round_t % eval_frequency == 0:
            evaluate_global_model(global_model)

    return global_model


class Client:
    """
    Federated Learning Client

    Represents a data silo (device, organization) participating in
    federated training. Maintains local data and performs local
    model training.
    """

    def __init__(self, client_id: str, local_data: Dataset):
        self.client_id = client_id
        self.local_data = local_data
        self.dataset_size = len(local_data)
        self.local_model = None

    def receive_model(self, global_weights: np.ndarray):
        """Initialize local model with global weights."""
        self.local_model = create_model()
        self.local_model.set_weights(global_weights)

    def local_train(
        self,
        epochs: int,
        batch_size: int,
        learning_rate: float
    ) -> np.ndarray:
        """
        Perform local SGD on client's private data.

        Key insight: Multiple local epochs before communication reduces
        communication rounds but may increase local drift (divergence
        from global optimum due to local data bias).
        """
        optimizer = SGD(learning_rate=learning_rate)

        for epoch in range(epochs):
            for batch in DataLoader(self.local_data, batch_size):
                # Standard mini-batch SGD update
                loss = compute_loss(self.local_model, batch)
                gradients = compute_gradients(loss, self.local_model)
                optimizer.step(gradients)

        return self.local_model.get_weights()
```

A production federated learning system involves multiple coordinated components working across a distributed environment. Understanding this architecture is crucial for designing robust FL systems.
The Server-Client Architecture:
Federated learning typically follows a hub-and-spoke topology where a central aggregation server coordinates training across multiple client participants. While fully decentralized (peer-to-peer) variants exist, the centralized coordinator pattern dominates practical deployments due to its simplicity and established security properties.
Cross-Silo vs. Cross-Device Federated Learning:
Federated learning manifests in two primary settings with vastly different system characteristics:
| Characteristic | Cross-Silo FL | Cross-Device FL |
|---|---|---|
| Clients | Organizations (hospitals, banks, companies) | Edge devices (phones, IoT sensors) |
| Number of clients | Tens to hundreds | Millions to billions |
| Data per client | Large (millions of samples) | Small (hundreds of samples) |
| Availability | Always online, reliable | Intermittent, unreliable |
| Compute capacity | High (datacenter-class) | Low (mobile CPU/GPU) |
| Communication | High bandwidth, low latency | Limited bandwidth, high latency |
| Client identity | Known, accountable | Anonymous, untrusted |
| Example | Medical records across hospitals | Keyboard prediction on phones |
Cross-device FL demands extreme efficiency (compression, quantization) and robustness to client dropout. Cross-silo FL can tolerate richer communication but faces complex governance, data quality, and liability questions. The choice fundamentally shapes your system design.
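As a rough illustration of the efficiency techniques mentioned above, the sketch below shows two common ways a cross-device client might shrink its update before upload: top-k sparsification and uniform quantization. The function names and parameters are hypothetical, not from any particular FL framework.

```python
# Illustrative sketches of two ways cross-device FL shrinks model updates.
# Names and parameters are hypothetical, not from a specific library.
import numpy as np

def top_k_sparsify(update: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries; zero out the rest."""
    sparse = np.zeros_like(update)
    idx = np.argsort(np.abs(update))[-k:]
    sparse[idx] = update[idx]
    return sparse

def quantize_uniform(update: np.ndarray, bits: int = 8) -> np.ndarray:
    """Uniformly quantize the update to 2**bits levels, then dequantize."""
    lo, hi = update.min(), update.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((update - lo) / scale)
    return q * scale + lo  # what the server reconstructs

update = np.random.default_rng(1).normal(size=10_000)
print(top_k_sparsify(update, k=100).nonzero()[0].size)          # 100 entries survive
print(np.abs(quantize_uniform(update, bits=8) - update).max())  # small reconstruction error
```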
A complete federated learning pipeline involves much more than the core FedAvg loop. Let's trace through the full lifecycle of a federated training run, understanding what happens at each stage.
Phase 1: Initialization and Configuration
Before training begins, the system must be configured and all parties aligned:
```python
# Complete Federated Learning Lifecycle Implementation
from dataclasses import dataclass
from typing import List, Dict, Optional
import numpy as np


@dataclass
class FLConfiguration:
    """Configuration for a federated learning training run."""
    # Model specification
    model_architecture: str
    initial_weights: Optional[np.ndarray] = None

    # Training hyperparameters
    num_rounds: int = 100
    clients_per_round: int = 10
    local_epochs: int = 1
    local_batch_size: int = 32
    learning_rate: float = 0.01

    # Client selection
    min_clients_required: int = 5
    client_timeout_seconds: int = 300

    # Privacy settings
    use_secure_aggregation: bool = True
    differential_privacy_epsilon: Optional[float] = None
    clip_norm: float = 1.0

    # Communication
    compression_method: Optional[str] = "none"  # or "top_k", "random_k"
    quantization_bits: int = 32


class FederatedTrainingOrchestrator:
    """
    Complete federated training orchestrator handling the full lifecycle.

    This class manages:
    - Client registration and selection
    - Training round coordination
    - Aggregation and model updates
    - Monitoring and checkpointing
    """

    def __init__(self, config: FLConfiguration):
        self.config = config
        self.global_model = self._initialize_model()
        self.registered_clients: Dict[str, ClientProxy] = {}
        self.training_history: List[Dict] = []
        self.current_round = 0

    def _initialize_model(self) -> Model:
        """Initialize the global model with specified architecture."""
        model = create_model(self.config.model_architecture)
        if self.config.initial_weights is not None:
            model.set_weights(self.config.initial_weights)
        return model

    # ==========================================
    # PHASE 1: Client Registration
    # ==========================================

    def register_client(
        self,
        client_id: str,
        client_metadata: Dict
    ) -> bool:
        """
        Register a new client for federated training.

        Validates client eligibility and establishes secure channel.
        """
        # Validate client meets requirements
        if not self._validate_client(client_metadata):
            return False

        # Establish secure communication channel
        client_proxy = ClientProxy(
            client_id=client_id,
            metadata=client_metadata,
            secure_channel=self._create_secure_channel(client_id)
        )

        self.registered_clients[client_id] = client_proxy
        return True

    # ==========================================
    # PHASE 2: Training Round Execution
    # ==========================================

    def execute_training_round(self) -> Dict[str, float]:
        """
        Execute a single federated training round.

        Returns metrics from this round.
        """
        round_metrics = {}

        # Step 1: Client Selection
        selected_clients = self._select_clients()
        round_metrics['selected_clients'] = len(selected_clients)

        if len(selected_clients) < self.config.min_clients_required:
            raise InsufficientClientsError(
                f"Only {len(selected_clients)} clients available, "
                f"need {self.config.min_clients_required}"
            )

        # Step 2: Broadcast Global Model
        model_broadcast = ModelBroadcast(
            round_number=self.current_round,
            model_weights=self.global_model.get_weights(),
            training_config=self._get_round_config()
        )

        for client in selected_clients:
            client.send_model(model_broadcast)

        # Step 3: Collect Client Updates (with timeout handling)
        client_updates: List[ClientUpdate] = []
        for client in selected_clients:
            try:
                update = client.receive_update(
                    timeout=self.config.client_timeout_seconds
                )
                # Validate update integrity
                if self._validate_update(update):
                    client_updates.append(update)
            except ClientTimeoutError:
                # Client dropped out - continue with remaining
                round_metrics['client_dropouts'] = round_metrics.get('client_dropouts', 0) + 1

        round_metrics['successful_clients'] = len(client_updates)

        # Step 4: Aggregate Updates
        if self.config.use_secure_aggregation:
            aggregated_update = self._secure_aggregate(client_updates)
        else:
            aggregated_update = self._weighted_average(client_updates)

        # Step 5: Apply Differential Privacy (if configured)
        if self.config.differential_privacy_epsilon:
            aggregated_update = self._apply_dp_noise(
                aggregated_update,
                epsilon=self.config.differential_privacy_epsilon
            )

        # Step 6: Update Global Model
        self.global_model.apply_update(aggregated_update)
        self.current_round += 1

        # Step 7: Evaluate and Log
        round_metrics['global_accuracy'] = self._evaluate_global_model()
        self.training_history.append(round_metrics)

        return round_metrics

    # ==========================================
    # PHASE 3: Aggregation Strategies
    # ==========================================

    def _weighted_average(
        self,
        updates: List[ClientUpdate]
    ) -> np.ndarray:
        """
        Compute weighted average of client updates.

        Weights by dataset size: w_{t+1} = Σₖ (nₖ/n) * wₖ
        """
        total_samples = sum(u.num_samples for u in updates)

        aggregated = np.zeros_like(updates[0].weights)
        for update in updates:
            weight = update.num_samples / total_samples
            aggregated += weight * update.weights

        return aggregated

    def _secure_aggregate(
        self,
        updates: List[ClientUpdate]
    ) -> np.ndarray:
        """
        Secure aggregation ensuring server only sees sum.

        Uses SecAgg protocol (Bonawitz et al., 2017):
        - Each client adds random mask (sum of masks = 0)
        - Server sees sum of masked updates
        - Masks cancel, revealing only true sum
        """
        # Implementation uses cryptographic secure aggregation
        return secure_aggregation_protocol.aggregate(
            [u.encrypted_weights for u in updates],
            threshold=len(updates) // 2  # Byzantine tolerance
        )

    # ==========================================
    # PHASE 4: Convergence and Termination
    # ==========================================

    def run_training(self) -> Model:
        """Execute complete federated training procedure."""
        for round_num in range(self.config.num_rounds):
            metrics = self.execute_training_round()

            # Log progress
            print(f"Round {round_num}: "
                  f"Accuracy={metrics['global_accuracy']:.4f}, "
                  f"Clients={metrics['successful_clients']}")

            # Early stopping check
            if self._check_convergence():
                print(f"Converged at round {round_num}")
                break

            # Checkpoint periodically
            if round_num % 10 == 0:
                self._save_checkpoint()

        return self.global_model
```

Federated learning introduces significant complexity compared to centralized training. Understanding when this complexity is justified is crucial for practical ML engineering.
Federated Learning is Appropriate When:
Data Cannot Be Centralized — Legal regulations (GDPR, HIPAA), competitive concerns, or data sovereignty requirements prevent data movement.
Data is Naturally Distributed — Edge devices generate data locally (phones, IoT), and centralization would require prohibitive bandwidth.
Privacy is Paramount — Domains like healthcare, finance, or personal communication where data leakage has severe consequences.
Value in Collaboration — Multiple parties benefit from a shared model but lack trust or legal authority to share data directly.
Don't use federated learning unless you need it. FL introduces challenges in debugging (you can't inspect client data), hyperparameter tuning (client heterogeneity), and convergence (non-IID data). If centralized training is possible and privacy isn't a concern, it remains the simpler, faster choice.
While both federated learning and traditional distributed training spread computation across multiple nodes, they differ fundamentally in assumptions, challenges, and system design. Understanding these differences clarifies when each approach is appropriate.
| Dimension | Traditional Distributed | Federated Learning |
|---|---|---|
| Data access | Centralized, server can inspect | Decentralized, server never sees data |
| Data distribution | IID (shuffled across workers) | Non-IID (local distribution varies) |
| Synchronization | Synchronous or asynchronous | Rounds (semi-synchronous) |
| Client availability | Always available, homogeneous | Intermittent, heterogeneous |
| Communication cost | High bandwidth, low latency | Limited bandwidth, high latency |
| Trust model | Workers are trusted | Clients may be adversarial |
| Privacy guarantees | Data at rest encryption | Data never transmitted |
| Debugging | Full data access | Cannot inspect client data |
| Update frequency | Every batch | Every N local epochs |
| Client count | Tens to hundreds | Thousands to millions |
The Non-IID Challenge:
Perhaps the most significant technical difference is data heterogeneity. In traditional distributed training, data is typically shuffled and randomly partitioned across workers, ensuring each worker sees a representative sample (IID—Independent and Identically Distributed).
In federated learning, each client's data reflects its own context: a phone's keyboard history captures one user's vocabulary and language, a hospital's records reflect its local patient population and equipment, and an IoT sensor logs a single physical environment.
This non-IID data causes gradients from different clients to point in different directions, leading to slower convergence and potential divergence. Managing non-IID distributions is one of the hardest open problems in federated learning.
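To build intuition for what "non-IID" looks like in practice, the following sketch simulates label skew the way many FL experiments do, by splitting each class across clients with Dirichlet weights. The client count, class count, and alpha value are illustrative choices, not prescriptions.

```python
# A small simulation of label skew, the most common form of non-IID data in
# FL experiments: each client draws its class mix from a Dirichlet prior.
import numpy as np

rng = np.random.default_rng(42)
num_classes, num_clients, alpha = 10, 5, 0.3
labels = rng.integers(0, num_classes, size=10_000)   # pooled labels

# For each class, split its samples across clients with Dirichlet weights.
client_indices = [[] for _ in range(num_clients)]
for c in range(num_classes):
    idx = np.flatnonzero(labels == c)
    rng.shuffle(idx)
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
    for client, part in enumerate(np.split(idx, cuts)):
        client_indices[client].extend(part)

# Small alpha => each client is dominated by a few classes (highly non-IID).
for client, idx in enumerate(client_indices):
    counts = np.bincount(labels[idx], minlength=num_classes)
    print(f"client {client}: {counts}")
```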
When client A only has cat images and client B only has dog images, naive averaging produces a model that confuses both. Techniques like FedProx (adding a proximal term), client clustering, and personalization layers address this, but no universal solution exists.
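As a concrete example of one of those techniques, here is a minimal sketch of the FedProx idea (Li et al., 2020): the client's local objective gains a proximal term (mu/2) * ||w - w_global||^2, which appears as an extra pull toward the global weights in every local step. The gradient input and hyperparameters below are placeholders, not a real FedProx API.

```python
# Sketch of the FedProx idea (Li et al., 2020): each local step adds a
# proximal pull toward the current global weights, limiting client drift.
# The gradient input and hyperparameters are placeholders.
import numpy as np

def fedprox_local_step(w_local: np.ndarray,
                       w_global: np.ndarray,
                       grad_local_loss: np.ndarray,
                       lr: float = 0.01,
                       mu: float = 0.1) -> np.ndarray:
    """One local SGD step on F_k(w) + (mu/2) * ||w - w_global||^2."""
    proximal_grad = mu * (w_local - w_global)   # pull toward the global model
    return w_local - lr * (grad_local_loss + proximal_grad)
```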
Federated learning has moved from research papers to production systems processing billions of training samples daily. Let's examine notable deployments:
Google Gboard (2017-present):
Google's mobile keyboard was the first large-scale FL deployment, training next-word prediction, emoji suggestion, and search query models on data from hundreds of millions of Android devices. Key innovations included training only when a device is idle, charging, and on an unmetered network, and pairing FedAvg with secure aggregation so the server only ever sees summed updates.
| Organization | Application | Scale | Key Innovation |
|---|---|---|---|
| Google | Gboard keyboard prediction | 500M+ devices | First production FL system |
| Apple | Siri voice recognition | 1B+ devices | On-device personalization + FL |
| NVIDIA | CLARA for medical imaging | Hospitals worldwide | Cross-silo healthcare FL |
| WeBank | Credit risk modeling | Cross-institution | FATE framework development |
| Financial industry | Fraud detection | Multi-bank consortiums | Regulatory-compliant FL |
| Owkin | Drug discovery | Pharma partnerships | Privacy-preserving biomedical ML |
Apple's Approach:
Apple combines on-device learning with federated techniques for features like QuickType suggestions, Siri improvements, and health feature calibration. Their approach emphasizes keeping computation on the device and combining federated updates with differential privacy, so raw user data never leaves the phone.
Healthcare Applications:
NVIDIA's CLARA platform enables federated learning across hospital networks, training diagnostic models on CT scans, X-rays, and pathology images without sharing patient data. The COVID-19 pandemic accelerated adoption, with consortiums using FL to train chest X-ray screening models across international healthcare systems.
These deployments prove FL is not just academically interesting—it's production-ready at scale. The challenges are real but solvable. If Google can coordinate training across 500 million phones, federated learning can likely address your use case.
We've covered the foundational concepts necessary to understand and build federated learning systems. Let's consolidate the key insights:

- Federated learning inverts the traditional pipeline: data stays in place and the computation moves to the data.
- FedAvg reduces communication by letting clients run multiple local epochs before the server aggregates their weighted updates.
- Cross-silo and cross-device settings differ dramatically in client count, reliability, compute, and trust, and they lead to very different system designs.
- Non-IID client data is the central technical challenge, slowing convergence and motivating methods like FedProx, client clustering, and personalization.
- Use federated learning only when data genuinely cannot be centralized; otherwise centralized training remains the simpler, faster choice.
What's Next:
With distributed training fundamentals established, we'll dive deep into Privacy Preservation in the next page. You'll learn about the privacy threats in federated learning (yes, even model updates can leak information!), and the techniques—differential privacy, secure aggregation, trusted execution—that provide rigorous privacy guarantees.
You now understand the foundations of distributed training and federated learning. You can articulate when FL is appropriate, describe the FedAvg algorithm, and distinguish between cross-silo and cross-device settings. Next, we strengthen privacy guarantees with formal techniques.