When a request arrives at a load balancer, a critical decision must be made in microseconds: which backend server should handle this request? This decision, repeated millions of times per second across the internet, is governed by load balancing algorithms.
The choice of algorithm profoundly impacts system behavior. A poor algorithm choice can create hot spots where some servers are overwhelmed while others sit idle. It can break session stickiness, cause cache inefficiency, or lead to uneven resource utilization. Conversely, the right algorithm ensures optimal distribution, maintains consistency when needed, and adapts to backend health and capacity.
This page provides a comprehensive examination of load balancing algorithms, from simple static policies to sophisticated adaptive techniques used by the world's largest services.
By the end of this page, you will understand: the taxonomy of load balancing algorithms (static vs dynamic, stateful vs stateless), the mechanics of each major algorithm, consistency guarantees and their importance, how to select the right algorithm for your use case, and the mathematical foundations underlying modern algorithms.
Load balancing algorithms can be classified along several dimensions. Understanding this taxonomy helps in selecting the right approach.
Dimension 1: Static vs Dynamic
Static Algorithms: decide using only fixed configuration (the backend list and any weights), with no runtime feedback. Round-robin, random, and hash-based policies fall here.
Dynamic Algorithms: incorporate real-time signals such as active connections, response times, or queue depth, and adapt routing as backend conditions change. Least connections and least response time are examples.
Dimension 2: Stateless vs Stateful
Stateless Algorithms: select a backend from the request (and static configuration) alone, so any load balancer instance reaches the same decision without shared runtime state.
Stateful Algorithms: maintain per-backend runtime state, such as active connection counts, which must be tracked continuously and, in multi-instance deployments, kept reasonably consistent.
| Algorithm | Static/Dynamic | Stateless/Stateful | Complexity |
|---|---|---|---|
| Round-Robin | Static | Stateless* | Low |
| Weighted Round-Robin | Static | Stateless* | Low |
| Random | Static | Stateless | Very Low |
| IP Hash | Static | Stateless | Low |
| Least Connections | Dynamic | Stateful | Medium |
| Weighted Least Connections | Dynamic | Stateful | Medium |
| Least Response Time | Dynamic | Stateful | High |
| Consistent Hashing | Static | Stateless | Medium |
| Maglev Hashing | Static | Stateless | Medium |
| Random Two Choices | Dynamic | Stateless | Low |
Round-robin is often classified as 'stateless' even though it maintains a counter for the current position. This state is trivial (a single integer) and losing it merely randomizes the starting position. True stateful algorithms like Least Connections require tracking active connections per backend.
Round-Robin is the simplest and most widely used load balancing algorithm. It distributes requests sequentially across all available backends.
How Round-Robin Works:
Backends: [A, B, C]
Counter: 0
Request 1 → Backend[0 % 3] = A, Counter = 1
Request 2 → Backend[1 % 3] = B, Counter = 2
Request 3 → Backend[2 % 3] = C, Counter = 3
Request 4 → Backend[3 % 3] = A, Counter = 4
...
Implementation (Pseudocode):
```python
from threading import Lock

class RoundRobinLB:
    def __init__(self, backends):
        self.backends = backends
        self.counter = 0
        self.lock = Lock()  # Thread safety for concurrent selection

    def select_backend(self):
        with self.lock:
            backend = self.backends[self.counter % len(self.backends)]
            self.counter += 1
            return backend
```
Characteristics: O(1) selection, no runtime feedback, perfectly even request counts across backends, and no awareness of request cost or backend load. Round-robin performs well when backends are homogeneous and requests are roughly uniform in cost.
Weighted Round-Robin:
When backends have different capacities, weighted round-robin assigns more requests to higher-capacity servers.
Example:
Backend A: weight=3 (handles 3x more traffic)
Backend B: weight=2
Backend C: weight=1
Total weight = 6
Distribution: A, A, A, B, B, C, A, A, A, B, B, C, ...
Smooth Weighted Round-Robin (SWRR):
Naive weighted round-robin creates bursts (all 'A' requests, then 'B', etc.). SWRR distributes more evenly:
Initial: A(3), B(2), C(1) → A, B, A, C, B, A, A, B, A, C, B, A, ...
Pattern: More interleaved distribution
Nginx SWRR Implementation:
```python
def smooth_weighted_round_robin(backends, weights):
    current_weights = [0] * len(backends)
    total = sum(weights)

    def select():
        # Add each backend's configured weight to its current weight
        for i in range(len(backends)):
            current_weights[i] += weights[i]
        # Select the backend with the highest current weight
        max_idx = current_weights.index(max(current_weights))
        # Reduce the selected backend's current weight by the total
        current_weights[max_idx] -= total
        return backends[max_idx]

    return select
```
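As a quick check of the smoothing behavior, the generator can be exercised over one full cycle. This is a minimal sketch; the function is restated here only so the snippet runs standalone:

```python
def smooth_weighted_round_robin(backends, weights):
    # Restated from above so this snippet is self-contained
    current_weights = [0] * len(backends)
    total = sum(weights)

    def select():
        for i in range(len(backends)):
            current_weights[i] += weights[i]
        max_idx = current_weights.index(max(current_weights))
        current_weights[max_idx] -= total
        return backends[max_idx]

    return select

select = smooth_weighted_round_robin(["A", "B", "C"], [3, 2, 1])
sequence = [select() for _ in range(6)]
print(sequence)  # One full cycle: ['A', 'B', 'A', 'C', 'B', 'A']
```

After six selections (the total weight), the current weights return to zero and the cycle repeats, with A appearing three times, B twice, and C once, interleaved rather than bursty.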
When to Use Weighted Round-Robin:
| Scenario | Weight Assignment |
|---|---|
| Different CPU cores | Weight = number of cores |
| Different memory | Weight = relative memory capacity |
| Canary deployments | New version = 1, old version = 9 (10% canary) |
| Graceful degradation | Unhealthy but responsive = reduced weight |
Setting optimal weights is challenging. Static weights don't adapt to changing conditions—a powerful server running expensive queries might need fewer requests than a less powerful server handling simple requests. Consider dynamic algorithms if workload varies significantly.
Least Connections is a dynamic algorithm that routes requests to the backend with the fewest active connections, naturally adapting to backend capacity and request duration.
How Least Connections Works:
Backends at time T:
A: 5 active connections
B: 3 active connections ← Selected (lowest)
C: 8 active connections
Request arrives → Routed to B
Backends after:
A: 5 active connections
B: 4 active connections
C: 8 active connections
Implementation (Pseudocode):
```python
from threading import Lock

class LeastConnectionsLB:
    def __init__(self, backends):
        self.backends = backends
        self.connections = {b: 0 for b in backends}
        self.lock = Lock()

    def select_backend(self):
        with self.lock:
            # Find the backend with the minimum active connections
            backend = min(self.connections, key=self.connections.get)
            self.connections[backend] += 1
            return backend

    def release_connection(self, backend):
        with self.lock:
            self.connections[backend] -= 1
```
Why Least Connections is Adaptive: a backend bogged down by slow requests holds connections open longer, so its count stays high and it automatically receives fewer new requests. Fast backends drain connections quickly and attract more traffic, all without explicit weight tuning.
Weighted Least Connections:
Combines backend weights with connection tracking:
Score = Active Connections / Weight
Backend A: 6 connections, weight=3 → Score = 6/3 = 2.0
Backend B: 4 connections, weight=2 → Score = 4/2 = 2.0
Backend C: 3 connections, weight=1 → Score = 3/1 = 3.0
Select: A or B (tied), typically first encountered
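The weighted scoring above can be sketched directly; this is a minimal illustration using the same backend names and counts as the example:

```python
def weighted_least_connections(connections, weights):
    # Score each backend by connections normalized by its weight;
    # lower score means more spare capacity. On a tie, min() returns
    # the first backend encountered in iteration order.
    return min(connections, key=lambda b: connections[b] / weights[b])

connections = {"A": 6, "B": 4, "C": 3}
weights = {"A": 3, "B": 2, "C": 1}
print(weighted_least_connections(connections, weights))  # A (tied with B at 2.0)
```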
Least Response Time:
More sophisticated than Least Connections, considers actual response times:
Score = Active Connections × Average Response Time
Backend A: 5 conns, 20ms avg → Score = 100
Backend B: 8 conns, 10ms avg → Score = 80 ← Selected (lowest)
Backend C: 3 conns, 50ms avg → Score = 150
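A corresponding sketch for the connections × latency score, using the numbers from the example (illustrative only; real implementations smooth the latency measurement):

```python
def least_response_time(stats):
    # stats maps backend -> (active_connections, avg_response_ms).
    # Score = connections * average latency; lower is better.
    return min(stats, key=lambda b: stats[b][0] * stats[b][1])

stats = {"A": (5, 20.0), "B": (8, 10.0), "C": (3, 50.0)}
print(least_response_time(stats))  # B (score 80 vs 100 and 150)
```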
Implementation Challenges: response times must be measured and smoothed (typically with an exponentially weighted moving average) to avoid overreacting to outliers; with multiple load balancer instances, each sees only its own traffic, so local statistics can diverge; and newly added backends have no history, requiring a warm-up policy before they receive full traffic.
NGINX Least Time Configuration:
```nginx
upstream backend {
    least_time header;           # Time to receive response headers
    # OR: least_time last_byte;  # Time to receive full response
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
```
| Algorithm | Metric | Pros | Cons |
|---|---|---|---|
| Least Connections | Active connections | Simple, self-adaptive | Doesn't consider capacity |
| Weighted Least Conn. | Connections / weight | Accounts for capacity | Weights need configuration |
| Least Response Time | Connections × latency | Most accurate | Complex, needs measurement |
| Least Pending | Queued requests | Queue-aware | Requires queue visibility |
Least Connections excels when request durations vary significantly (some are fast, some are slow). Round-robin assumes all requests are equal; Least Connections adapts to reality. For uniform requests with similar durations, round-robin is simpler and equally effective.
Hashing algorithms provide consistent backend selection for requests with the same key. This is essential for session affinity, caching efficiency, and stateful services.
IP Hash:
Routes requests from the same client IP to the same backend:
backend_index = hash(client_ip) % num_backends
Client 203.0.113.10 → hash("203.0.113.10") % 3 = 1 → Backend B
Client 198.51.100.5 → hash("198.51.100.5") % 3 = 0 → Backend A
# Same client always hits same backend (unless backends change)
Implementation:
```python
import hashlib

def ip_hash(client_ip, backends):
    # Use a stable hash: Python's built-in hash() is randomized per
    # process for strings, which would break consistency across
    # load balancer restarts or instances.
    # Production: use a fast non-cryptographic hash like xxHash.
    hash_val = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return backends[hash_val % len(backends)]
```
Limitations of Simple Modulo Hashing:
When backends change (add/remove), most mappings change:
Before (3 backends): hash(ip) % 3
After (4 backends): hash(ip) % 4
Client with hash=5:
Before: 5 % 3 = 2 → Backend C
After: 5 % 4 = 1 → Backend B ← Different!
Going from 3 to 4 backends remaps roughly (N-1)/N = 3/4 = 75% of clients. In general, any change to the backend count under modulo hashing remaps almost all keys.
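The scale of the disruption is easy to measure empirically. A minimal sketch using synthetic client IPs and a stable hash:

```python
import hashlib

def stable_hash(s):
    # Deterministic hash (Python's built-in hash() is randomized per process)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
before = [stable_hash(ip) % 3 for ip in ips]  # 3 backends
after = [stable_hash(ip) % 4 for ip in ips]   # add a 4th backend
moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / len(ips):.0%} of clients remapped")  # roughly 75%
```

Only keys whose hash is congruent to the same value mod 3 and mod 4 (i.e., hash mod 12 in {0, 1, 2}, a 25% chance) keep their backend; the other ~75% move.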
URL/Header/Cookie Hashing:
Hash on application-layer attributes for more specific affinity:
| Hash Key | Use Case | Example |
|---|---|---|
| URL Path | Cache locality | Same path → same cached server |
| Query Parameter | User sharding | ?user_id=123 → consistent backend |
| Cookie Value | Session stickiness | Session cookie → owning server |
| HTTP Header | Tenant routing | X-Tenant-ID → tenant's backend pool |
NGINX Hash Configuration:
```nginx
upstream backend {
    hash $remote_addr consistent;         # IP hash with consistent hashing
    # OR:
    # hash $request_uri consistent;       # URL hash
    # hash $cookie_session_id consistent; # Cookie hash
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
```
Why Session Affinity Matters: if a backend holds in-memory session data, a warm local cache, or a long-lived connection (e.g., a WebSocket), routing a client's subsequent requests elsewhere loses that context, causing re-authentication, cache misses, or broken connections.
Session affinity can create hot spots. If many users share a hash (e.g., behind same NAT), one backend is overloaded. Modern architectures prefer stateless backends with external session stores (Redis, database) to avoid this tradeoff.
Consistent hashing solves the remapping problem of simple modulo hashing. When backends change, only a minimal number of keys need to be remapped.
The Concept: Hash Ring
              0 / 2^32
           .-----○-----.
          /             \
     B3 ○                 ○ B1
         |                |
     B2 ○                 ○ B4 (new)
          \             /
           '-----------'

○ = a backend's position on the ring (the hash of its identifier);
key positions 0 .. 2^32 - 1 wrap around the circle.
Key K: hash(K) → Position P → Walk clockwise → First backend = handler
Adding a Backend:
When B4 is added, only the keys whose positions fall between B4's predecessor on the ring and B4 itself remap: those keys previously walked clockwise past B4's position to the next backend, and now stop at B4. All other mappings are unchanged.
Expected Remapping: K/N keys (K = total keys, N = backends)
Compare to modulo hashing: (N-1)/N × K keys remapped.
Virtual Nodes (VNodes):
With few backends, the ring can be unbalanced. Virtual nodes create multiple positions per backend:
Backend A → VNodes: A1, A2, A3, A4 (each at different ring positions)
Backend B → VNodes: B1, B2, B3, B4
Backend C → VNodes: C1, C2, C3, C4
Benefits: load spreads far more evenly across backends; when a backend is removed, its keys scatter across many survivors instead of piling onto a single neighbor; and capacity weights can be expressed by assigning larger backends more virtual nodes.
Implementation (Pseudocode):
```python
import bisect
import hashlib

def stable_hash(s):
    # Deterministic hash; Python's built-in hash() is randomized per process
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, backends, vnodes=100):
        self.ring = {}           # position -> backend
        self.sorted_keys = []    # sorted ring positions, for binary search
        self.vnodes = vnodes
        for backend in backends:
            self.add_backend(backend)

    def add_backend(self, backend):
        for i in range(self.vnodes):
            position = stable_hash(f"{backend}-{i}")
            self.ring[position] = backend
            bisect.insort(self.sorted_keys, position)

    def remove_backend(self, backend):
        for i in range(self.vnodes):
            position = stable_hash(f"{backend}-{i}")
            del self.ring[position]
            self.sorted_keys.remove(position)

    def get_backend(self, key):
        if not self.ring:
            return None
        position = stable_hash(key)
        # Binary search: first ring position >= key position (wrap to start)
        idx = bisect.bisect_left(self.sorted_keys, position)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]
```
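The minimal-remapping property can be verified empirically. This compact standalone sketch builds rings with 3 and then 4 backends (A, B, C share identical vnode positions in both, since the hash is deterministic) and counts how many keys move:

```python
import bisect
import hashlib

def h(s):
    # Deterministic hash for reproducibility across processes
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(backends, vnodes=100):
    # Sorted list of (position, backend) pairs: vnodes per backend
    return sorted((h(f"{b}-{i}"), b) for b in backends for i in range(vnodes))

def lookup(ring, key):
    # Linear rebuild of the position list per call: fine for a sketch,
    # a real implementation would cache it
    positions = [p for p, _ in ring]
    idx = bisect.bisect_left(positions, h(key))
    return ring[idx % len(ring)][1]  # wrap around past the last position

keys = [f"key-{i}" for i in range(5_000)]
ring3 = build_ring(["A", "B", "C"])
ring4 = build_ring(["A", "B", "C", "D"])
moved = sum(lookup(ring3, k) != lookup(ring4, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # close to 1/4, vs ~75% for modulo
```

The moved keys are exactly those landing in D's arcs, so the fraction approximates D's fair share 1/N = 25% (with some variance from the finite vnode count).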
Jump Consistent Hash:
An alternative approach that achieves consistent hashing without a ring:
```python
def jump_consistent_hash(key, num_buckets):
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & (2**64 - 1)
        j = int((b + 1) * (2**31 / ((key >> 33) + 1)))
    return b
```
Benefits: No storage, O(log N) expected time, near-perfect distribution. Limitation: buckets must be numbered sequentially (0..N-1), so the backend set can only grow or shrink at the end; removing an arbitrary backend requires an extra indirection layer.
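Jump hash's consistency property, that growing from N to N+1 buckets moves only ~1/(N+1) of keys, and moves them only into the new bucket, can be checked directly. The function is restated so the snippet runs standalone:

```python
def jump_consistent_hash(key, num_buckets):
    # Restated from above so this snippet is self-contained
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & (2**64 - 1)
        j = int((b + 1) * (2**31 / ((key >> 33) + 1)))
    return b

keys = range(10_000)
before = [jump_consistent_hash(k, 10) for k in keys]
after = [jump_consistent_hash(k, 11) for k in keys]  # add an 11th bucket
moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / 10_000:.1%} of keys moved")  # close to 1/11 ≈ 9.1%
```

Every moved key lands in the new bucket (index 10); no key is shuffled between existing buckets.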
Consistent hashing is essential when backend changes are frequent AND session/cache continuity matters. Use cases: distributed caches (Redis Cluster, Memcached), sharded databases, content delivery networks. If backend changes are rare or continuity doesn't matter, simpler hashing suffices.
Maglev hashing is a consistent hashing algorithm developed by Google for their Maglev load balancers. It provides better distribution and faster lookups than ring-based consistent hashing.
The Problem with Ring-Based Consistent Hashing: achieving tight load balance requires many virtual nodes per backend (memory grows as N × VNodes), each lookup costs an O(log N) binary search, and even then backends can deviate noticeably from their fair share of the keyspace.
Maglev's Approach: Lookup Table
Instead of a ring, Maglev builds a fixed-size lookup table where each slot maps to a backend:
Table size M = 65537 (prime, larger than backends)
+---+---+---+---+---+---+---+---+---+---+
| A | B | C | A | C | B | A | B | C | A | ...
+---+---+---+---+---+---+---+---+---+---+
0 1 2 3 4 5 6 7 8 9
Lookup: backend = table[hash(key) % M]
Benefits: O(1) lookup (one hash plus one array index), near-perfect load distribution (each backend owns within a few slots of M/N table entries), and minimal disruption when the backend set changes.
How Maglev Builds the Table:
Each backend derives a preference order over table slots from two hashes, offset and skip:

```python
import hashlib

def _hash(s, seed):
    # Deterministic stand-in for the two independent hash functions
    return int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)

def build_maglev_table(backends, table_size):
    # table_size should be prime and much larger than len(backends)
    table = [None] * table_size
    filled = 0
    # Each backend's preference list is a permutation of all slots,
    # generated from offset (start) and skip (stride). A prime
    # table_size guarantees the stride visits every slot.
    preferences = {}
    for backend in backends:
        offset = _hash(backend, 1) % table_size
        skip = _hash(backend, 2) % (table_size - 1) + 1
        preferences[backend] = [(offset + i * skip) % table_size
                                for i in range(table_size)]
    # Backends take turns claiming their next preferred empty slot, so
    # each ends up with roughly table_size / len(backends) slots
    next_pref = {b: 0 for b in backends}
    while filled < table_size:
        for backend in backends:
            if filled == table_size:
                break
            # Advance to this backend's next unfilled preferred slot;
            # slots never empty again, so this index only moves forward
            while table[preferences[backend][next_pref[backend]]] is not None:
                next_pref[backend] += 1
            table[preferences[backend][next_pref[backend]]] = backend
            filled += 1
    return table
```
Disruption Analysis:
When one backend is removed: its roughly M/N slots are redistributed among the survivors, and the vast majority of the remaining slots keep their existing backend. The Maglev paper reports only a small additional disruption beyond the unavoidable minimum when M is large.
When one backend is added: it claims roughly M/N slots, taken approximately evenly from the existing backends, and all other slots are untouched or nearly so.
| Aspect | Ring-Based Consistent Hash | Maglev |
|---|---|---|
| Lookup Time | O(log N) | O(1) |
| Memory | O(N × VNodes) | O(M) fixed |
| Distribution | Approximate | Near-perfect |
| Disruption | Optimal | Near-optimal |
| Table Build | O(N × VNodes) | O(M × N) |
| Implementation | Simpler | More complex |
Maglev is used in Google's network load balancers serving Google Search, YouTube, and other services. Similar algorithms are used in Envoy Proxy, Katran (Facebook's L4 LB), and modern service meshes. If you're building high-performance infrastructure, Maglev is worth understanding.
The Power of Two Random Choices (often abbreviated "P2C"; the general form is "JSQ(d)", join the shortest of d randomly sampled queues) is an elegant algorithm that achieves near-optimal load distribution with minimal overhead.
The Insight:
Pure random selection has high variance—some backends get many more requests than others. Least Connections is optimal but requires global state.
P2C combines the best of both: pick two backends uniformly at random, then send the request to whichever of the two has fewer active connections.
This simple change exponentially reduces the maximum load imbalance.
Mathematical Foundation:
| Algorithm | Max Load (expected) |
|---|---|
| Random | O(log N / log log N) |
| Two Random Choices | O(log log N) |
For N=1000 backends (assigning N requests): pure random selection leaves the busiest backend at roughly log N / log log N ≈ 3 to 4 times the average load, while two random choices keeps the busiest backend within about log log N ≈ 2 of the average, an exponential tightening.
How It Works:
```python
import random

class PowerOfTwoLB:
    def __init__(self, backends):
        self.backends = backends
        self.connections = {b: 0 for b in backends}

    def select_backend(self):
        # Randomly sample 2 distinct backends
        choice1, choice2 = random.sample(self.backends, 2)
        # Select the one with fewer active connections
        if self.connections[choice1] <= self.connections[choice2]:
            selected = choice1
        else:
            selected = choice2
        self.connections[selected] += 1
        return selected

    def release_connection(self, backend):
        # Decrement when a request completes (as with Least Connections)
        self.connections[backend] -= 1
```
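The variance reduction is striking in simulation. This illustrative sketch (parameters are arbitrary) assigns requests that never complete, so the final counts expose cumulative imbalance:

```python
import random

def simulate(num_backends=100, num_requests=10_000, choices=1):
    # Assign each request to a backend; with choices=2, sample two
    # candidates and take the less-loaded one (P2C)
    load = [0] * num_backends
    for _ in range(num_requests):
        if choices == 1:
            pick = random.randrange(num_backends)
        else:
            a, b = random.sample(range(num_backends), 2)
            pick = a if load[a] <= load[b] else b
        load[pick] += 1
    return max(load)

random.seed(42)
# Average load is 10_000 / 100 = 100 requests per backend
print("random max load:", simulate(choices=1))  # noticeably above 100
print("P2C max load:   ", simulate(choices=2))  # very close to 100
```

With pure random assignment the busiest backend typically ends up 20-35% above the average; P2C keeps the maximum within a few requests of it.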
Weighted P2C:
For weighted backends, compare normalized load:
```python
import random

def weighted_p2c_select(backends, weights, connections):
    # Sample two distinct indices and compare weight-normalized load
    choice1, choice2 = random.sample(range(len(backends)), 2)
    load1 = connections[choice1] / weights[choice1]
    load2 = connections[choice2] / weights[choice2]
    return backends[choice1 if load1 <= load2 else choice2]
```
Advantages: O(1) work per request with no scan over all backends, good tolerance of stale or locally tracked load information, no coordination needed between load balancer instances, and none of the herd behavior that comes from always picking the single least-loaded backend.
Use in Envoy Proxy:
```yaml
clusters:
- name: my_service
  lb_policy: LEAST_REQUEST
  least_request_lb_config:
    choice_count: 2  # Power of two choices
```
Interestingly, going from 1 → 2 choices provides exponential improvement. Going from 2 → 3 or more provides only marginal additional benefit. Two choices hits the sweet spot of simplicity and effectiveness.
Choosing the right algorithm requires matching your application's characteristics to algorithm strengths.
| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Homogeneous backends, uniform requests | Round-Robin | Simplicity; equal distribution |
| Backends with different capacity | Weighted Round-Robin | Respects capacity differences |
| Variable request duration | Least Connections | Adapts to actual load |
| Session stickiness required | IP/Cookie Hash | Consistent routing per client |
| Distributed caching | Consistent Hashing | Maintains cache locality |
| High-performance infrastructure | Maglev | O(1) lookup, minimal disruption |
| Stateless with good distribution | Power of Two Choices | Simple, distributed, effective |
| Microservices with varying latency | Least Response Time | Routes to fastest responders |
What's Next:
Algorithms determine which backend is selected, but how do we know if a backend is healthy enough to receive traffic? Next, we'll explore health checking mechanisms that monitor backend availability and integrate with load balancing algorithms to route traffic only to responsive servers.
You now understand the major load balancing algorithms, from simple round-robin to sophisticated consistent hashing. You can analyze the tradeoffs and select the right algorithm for your specific requirements. Next, we'll explore health checking mechanisms.