Round Robin and Weighted Round Robin distribute requests evenly, but they're blind to what happens after a request is sent. Consider this scenario:
The Problem:
You have 3 servers receiving round-robin traffic. Suppose the first three requests have very different costs (illustrative durations):

- Request 1 → Server A: a fast lookup (0.1s)
- Request 2 → Server B: a heavy report (5s)
- Request 3 → Server C: a typical query (0.5s)

After distributing 3 requests, round robin considers its job done: each server got one. But Server B is now overloaded, processing a long request, while Server A completed its work 4.9 seconds ago and sits idle.
The next 3 requests will again go equally to A, B, and C, despite B still processing its first request. Server B accumulates connections while A and C remain underutilized.
Least Connections solves this by tracking the actual number of active connections to each server and routing new requests to the server with the fewest current connections. It's load-aware, not just request-aware.
By the end of this page, you will understand Least Connections' mechanics and variants, its significant advantages for variable workloads, implementation complexities including connection tracking overhead, and precisely when it outperforms round-robin approaches.
The Least Connections algorithm is conceptually simple:
Core Algorithm:

1. Maintain a counter of active connections for each backend server.
2. When a request arrives, route it to the healthy server with the fewest active connections and increment that server's counter.
3. When the request completes, decrement the counter.
Definition of "Connection":
The term "connection" means different things at different layers:

- Layer 4 (TCP): an open TCP connection to the backend, whether or not data is currently flowing.
- Layer 7 (HTTP): an in-flight request. With keep-alive or HTTP/2, one TCP connection can carry many requests, so counting TCP connections can understate actual load.

Most implementations track HTTP requests or active TCP connections depending on the load balancer type.
```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerState:
    """Tracks connection state for a single server."""
    address: str
    active_connections: int = 0
    healthy: bool = True


class LeastConnectionsLoadBalancer:
    """
    Least Connections load balancer implementation.

    Uses a simple linear scan to find the server with fewest connections.
    For small server counts (< 100), this is efficient.
    """

    def __init__(self, servers: list[str]):
        """
        Initialize with list of backend server addresses.

        Args:
            servers: List of server identifiers
        """
        self._servers = {
            addr: ServerState(address=addr) for addr in servers
        }
        self._lock = threading.Lock()

    def get_next_server(self) -> Optional[str]:
        """
        Select server with fewest active connections.

        Time Complexity: O(n) where n is number of servers

        Returns:
            Server address with minimum connections, or None if all unhealthy
        """
        with self._lock:
            # Filter to healthy servers only
            healthy_servers = [
                s for s in self._servers.values() if s.healthy
            ]
            if not healthy_servers:
                return None

            # Find server with minimum connections
            # Use address as tiebreaker for determinism
            best = min(
                healthy_servers,
                key=lambda s: (s.active_connections, s.address)
            )

            # Increment connection count
            best.active_connections += 1
            return best.address

    def release_connection(self, address: str) -> None:
        """
        Called when a connection completes.

        CRITICAL: This must be called for every connection,
        or the load balancer state becomes corrupted!
        """
        with self._lock:
            if address in self._servers:
                server = self._servers[address]
                server.active_connections = max(0, server.active_connections - 1)

    def mark_unhealthy(self, address: str) -> None:
        """Mark a server as unhealthy."""
        with self._lock:
            if address in self._servers:
                self._servers[address].healthy = False
                # Reset connections - existing requests will still complete
                self._servers[address].active_connections = 0

    def mark_healthy(self, address: str) -> None:
        """Mark a server as healthy."""
        with self._lock:
            if address in self._servers:
                self._servers[address].healthy = True

    def get_connection_counts(self) -> dict[str, int]:
        """Get current connection counts for monitoring."""
        with self._lock:
            return {
                addr: s.active_connections
                for addr, s in self._servers.items()
            }


# Context manager for clean connection lifecycle
class ConnectionContext:
    """
    Ensures connection is properly released after use.

    Usage:
        with ConnectionContext(lb) as server:
            # Make request to server
            response = make_request(server)
    """

    def __init__(self, lb: LeastConnectionsLoadBalancer):
        self._lb = lb
        self._server = None

    def __enter__(self) -> Optional[str]:
        self._server = self._lb.get_next_server()
        return self._server

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._server:
            self._lb.release_connection(self._server)
        return False  # Don't suppress exceptions


# Example usage
if __name__ == "__main__":
    import time
    import random

    servers = ["10.0.1.1:8080", "10.0.1.2:8080", "10.0.1.3:8080"]
    lb = LeastConnectionsLoadBalancer(servers)

    # Simulate variable workload
    print("Simulating variable request durations:")
    print("-" * 50)

    for i in range(10):
        with ConnectionContext(lb) as server:
            # Simulate variable processing time
            duration = random.uniform(0.1, 1.0)
            print(f"Request {i+1}: {server} "
                  f"(counts: {lb.get_connection_counts()})")
            time.sleep(duration)

    print("-" * 50)
    print(f"Final state: {lb.get_connection_counts()}")
```

The most common bug in Least Connections implementations is failing to decrement the connection count when requests complete. Every 'get' must have a matching 'release'. Use RAII patterns, context managers, or defer statements to ensure cleanup happens even when requests fail.
To understand when Least Connections excels, we need to examine workload variability—the differences in processing time across requests.
Workload Variability Classifications:
| Pattern | Characteristics | Example | Best Algorithm |
|---|---|---|---|
| Uniform | All requests ~same duration | Health checks, simple APIs | Round Robin (simpler) |
| Low Variance | Most requests similar, occasional outliers | CRUD operations | Round Robin acceptable |
| High Variance | Wide range of durations | Report generation, batch processing | Least Connections |
| Bimodal | Two distinct duration classes | Quick reads + slow writes | Least Connections or queue separation |
| Long-tail | Most fast, few very slow | Search with complex queries | Least Connections essential |
Mathematical Illustration:

Consider 2 servers receiving requests with these processing times (illustrative numbers): 90% of requests take 100ms, 10% take 5 seconds.

With Round Robin (1000 requests): each server receives exactly 500 requests, but the slow requests land wherever the rotation happens to place them. When several slow requests cluster on one server, its in-flight work piles up while the other server drains. Total busy time is roughly equal, but instantaneous load is not.

With Least Connections: a server tied up with a 5-second request stops receiving new traffic until its connection count drops back below its peer's. Slow requests spread themselves across both servers, and in-flight counts stay close to balanced.
Least Connections is self-balancing: servers that are slow (for any reason) automatically receive less traffic. This is powerful for heterogeneous backends, variable workloads, or when server performance degrades due to garbage collection, I/O issues, or noisy neighbors in cloud environments.
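The self-balancing behavior can be checked with a toy event-driven simulation. This sketch is mine, not from a library; the workload is deliberately adversarial to round robin (every third request is slow, aligning exactly with a 3-server rotation), and all durations are illustrative:

```python
import heapq

def simulate(num_servers, num_requests, pick, gap=0.01):
    """Toy event-driven simulation; returns peak in-flight requests per server."""
    active = [0] * num_servers       # in-flight requests per server
    peak = [0] * num_servers
    completions = []                 # min-heap of (finish_time, server_index)
    now = 0.0
    for i in range(num_requests):
        now += gap
        # Retire requests that finished before this arrival
        while completions and completions[0][0] <= now:
            _, s = heapq.heappop(completions)
            active[s] -= 1
        # Adversarial bimodal workload: every 3rd request is slow (2s), rest fast (10ms)
        duration = 2.0 if i % 3 == 0 else 0.01
        s = pick(i, active)
        active[s] += 1
        peak[s] = max(peak[s], active[s])
        heapq.heappush(completions, (now + duration, s))
    return peak

round_robin = lambda i, active: i % len(active)
least_conn = lambda i, active: min(range(len(active)), key=active.__getitem__)

print("Round Robin peaks:", simulate(3, 3000, round_robin))
print("Least Conn peaks: ", simulate(3, 3000, least_conn))
```

Under this alignment, round robin sends every slow request to server 0, whose in-flight count climbs into the dozens while the other two servers idle; least connections keeps all three servers' in-flight counts within a few of each other.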
Standard Least Connections assumes all servers have equal capacity. Weighted Least Connections combines connection tracking with capacity weighting.
The Algorithm:
Instead of selecting the server with minimum connections, select the server with minimum connections / weight.
This ratio represents the load relative to capacity—a server with weight 10 can handle 10 connections as easily as a weight-1 server handles 1.
Example:
Server A: weight=10, connections=3 → ratio = 0.3
Server B: weight=5, connections=2 → ratio = 0.4
Server C: weight=2, connections=0 → ratio = 0.0
Select Server C (lowest ratio).
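The selection rule can be sketched in a few lines. This helper (the name and dict layout are mine) mirrors the worked example above:

```python
def pick_weighted_least_conn(servers: dict[str, tuple[int, int]]) -> str:
    """servers maps address -> (weight, active_connections).

    Returns the address with the lowest connections/weight ratio,
    skipping weight-0 servers (which should never be selected).
    """
    return min(
        (addr for addr, (w, _) in servers.items() if w > 0),
        key=lambda addr: servers[addr][1] / servers[addr][0],
    )

servers = {"A": (10, 3), "B": (5, 2), "C": (2, 0)}
print(pick_weighted_least_conn(servers))  # C (ratio 0/2 = 0.0)
```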
```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeightedServer:
    """Server with weight and connection tracking."""
    address: str
    weight: int
    active_connections: int = 0
    healthy: bool = True

    @property
    def load_ratio(self) -> float:
        """
        Calculate load relative to capacity.

        Lower ratio = server has more available capacity.
        Weight of 0 returns infinity to avoid selection.
        """
        if self.weight == 0:
            return float('inf')
        return self.active_connections / self.weight


class WeightedLeastConnectionsLB:
    """
    Weighted Least Connections load balancer.

    Selects the server with the lowest connections/weight ratio,
    effectively choosing the server with most available capacity.
    """

    def __init__(self, servers: dict[str, int]):
        """
        Initialize with servers and their weights.

        Args:
            servers: Dict mapping server address to weight
        """
        self._servers = {
            addr: WeightedServer(address=addr, weight=w)
            for addr, w in servers.items()
        }
        self._lock = threading.Lock()

    def get_next_server(self) -> Optional[str]:
        """
        Select server with lowest load ratio.

        Time Complexity: O(n)

        Returns:
            Server address, or None if no healthy servers
        """
        with self._lock:
            healthy_servers = [
                s for s in self._servers.values()
                if s.healthy and s.weight > 0
            ]
            if not healthy_servers:
                return None

            # Select by minimum load ratio
            # Use address as tiebreaker for determinism
            best = min(
                healthy_servers,
                key=lambda s: (s.load_ratio, s.address)
            )
            best.active_connections += 1
            return best.address

    def release_connection(self, address: str) -> None:
        """Release a connection when request completes."""
        with self._lock:
            if address in self._servers:
                server = self._servers[address]
                server.active_connections = max(0, server.active_connections - 1)

    def update_weight(self, address: str, new_weight: int) -> None:
        """
        Update server weight dynamically.

        Useful for:
        - Responding to detected capacity changes
        - Gradual scaling up/down
        - Manual traffic shifts
        """
        with self._lock:
            if address in self._servers:
                self._servers[address].weight = max(0, new_weight)

    def get_states(self) -> list[dict]:
        """Get current state for monitoring/debugging."""
        with self._lock:
            return [
                {
                    "address": s.address,
                    "weight": s.weight,
                    "connections": s.active_connections,
                    "load_ratio": round(s.load_ratio, 3),
                    "healthy": s.healthy,
                }
                for s in self._servers.values()
            ]


# Demonstration
if __name__ == "__main__":
    # Heterogeneous cluster: different capacity servers
    servers = {
        "large-instance": 10,   # High capacity
        "medium-instance": 5,   # Medium capacity
        "small-instance": 2,    # Lower capacity
    }

    lb = WeightedLeastConnectionsLB(servers)

    print("Initial state:")
    for state in lb.get_states():
        print(f"  {state}")

    print("\nDistributing 17 requests (proportional to weights 10:5:2):")
    selections = []
    for i in range(17):
        server = lb.get_next_server()
        selections.append(server)

    # Count selections
    from collections import Counter
    counts = Counter(selections)
    print(f"\nDistribution: {dict(counts)}")
    print("Expected ratio: 10:5:2")

    # Show final state (all connections still active)
    print("\nFinal state (connections active):")
    for state in lb.get_states():
        print(f"  {state}")
```

Use Weighted Least Connections when you have heterogeneous backends (different instance sizes, hardware generations). Use unweighted Least Connections when all servers are identical; the weight tracking adds complexity without benefit.
Let's examine how production load balancers implement Least Connections.
```nginx
# NGINX Least Connections Configuration

upstream backend_pool {
    # Enable least_conn algorithm
    least_conn;

    # Server list - NGINX tracks connections to each server
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;

    # With health checking (NGINX Plus)
    # health_check interval=5s fails=3 passes=2;
}

upstream weighted_least_conn {
    # Combine least_conn with weights
    # NGINX uses weighted least connections when weights differ
    least_conn;

    # High-capacity servers
    server 10.0.1.1:8080 weight=10;
    server 10.0.1.2:8080 weight=10;

    # Medium capacity
    server 10.0.2.1:8080 weight=5;

    # Lower capacity
    server 10.0.3.1:8080 weight=2;

    # Connection limits per server (optional)
    # Prevents any single server from being overwhelmed
    # server 10.0.1.1:8080 weight=10 max_conns=1000;
}

upstream with_keepalive {
    least_conn;
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;

    # Keep idle connections to backends
    # Reduces connection overhead for HTTP/1.1
    keepalive 32;

    # For HTTP/2 backends
    # keepalive_requests 1000;
    # keepalive_timeout 60s;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend_pool;

        # Required for keepalive to work
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Timeouts affect connection counts
        # Long timeouts = connections held longer
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
    }

    # Status page showing connection distribution (NGINX Plus)
    location /api/status {
        # api write=off;
        # allow 127.0.0.1;
        # deny all;
    }
}
```

```haproxy
# HAProxy Least Connections Configuration

global
    maxconn 50000

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httplog

backend app_servers_leastconn
    # Least connections balancing
    balance leastconn

    server app1 10.0.1.1:8080 check maxconn 2000
    server app2 10.0.1.2:8080 check maxconn 2000
    server app3 10.0.1.3:8080 check maxconn 2000

    # maxconn: Maximum connections per server
    # Queues requests when limit reached

backend weighted_leastconn
    # Weighted least connections
    balance leastconn

    # Weight affects the connection/weight ratio calculation
    server large1 10.0.1.1:8080 weight 100 check
    server large2 10.0.1.2:8080 weight 100 check
    server medium1 10.0.2.1:8080 weight 50 check
    server small1 10.0.3.1:8080 weight 25 check

backend with_queue_management
    balance leastconn

    # fullconn: Expected max connections for queue calculation
    fullconn 5000

    server app1 10.0.1.1:8080 weight 100 check maxconn 1000
    server app2 10.0.1.2:8080 weight 100 check maxconn 1000

    # Queue management
    # When all servers at maxconn, requests queue here
    timeout queue 30s

    # Retries on connection failure
    retries 3
    option redispatch   # Allow retry on different server

frontend http_front
    bind *:80
    default_backend app_servers_leastconn

# Very useful: stats page shows live connection counts
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 5s

    # Shows per-server:
    # - Current connections
    # - Connection rate
    # - Bytes in/out
    # - Response times
```

The maxconn setting in HAProxy and max_conns in NGINX Plus provide crucial backpressure. Without limits, a slow server accumulates unbounded connections. With limits, excess requests queue at the load balancer, and health checks can detect the overload condition.
Least Connections introduces complexity that Round Robin avoids. Understanding these challenges is essential for correct implementation.
The Cold Start Problem: a newly added or newly recovered server starts with zero connections, so Least Connections immediately routes every new request to it. This thundering herd can overwhelm the server before its caches, connection pools, and runtime optimizations have warmed up. The standard mitigation is slow start: ramp the server's effective weight from a small fraction to its full value over a configured period.

Solving the Cold Start Problem:
```python
import time
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerWithSlowStart:
    """Server with slow-start support to prevent thundering herd."""
    address: str
    weight: int = 1
    active_connections: int = 0
    healthy: bool = True
    # Slow start tracking
    _startup_time: Optional[float] = None
    _slow_start_duration: float = 60.0  # seconds

    def mark_healthy(self):
        """Called when server becomes healthy."""
        self.healthy = True
        self._startup_time = time.time()

    @property
    def effective_weight(self) -> float:
        """
        Weight adjusted for slow-start period.

        Gradually increases from 10% to 100% over slow_start_duration.
        """
        if not self._startup_time:
            return float(self.weight)

        elapsed = time.time() - self._startup_time
        if elapsed >= self._slow_start_duration:
            # Slow start complete
            self._startup_time = None
            return float(self.weight)

        # Linear ramp from 0.1 to 1.0
        progress = elapsed / self._slow_start_duration
        multiplier = 0.1 + (0.9 * progress)
        return self.weight * multiplier

    @property
    def load_ratio(self) -> float:
        """Load ratio using effective weight."""
        ew = self.effective_weight
        if ew == 0:
            return float('inf')
        return self.active_connections / ew


class SlowStartLeastConnections:
    """
    Least Connections with slow-start support.

    New/recovered servers receive gradually increasing traffic
    over the slow-start period, preventing thundering herd.
    """

    def __init__(self, servers: dict[str, int], slow_start_seconds: float = 60.0):
        self._slow_start_duration = slow_start_seconds
        self._servers = {}
        self._lock = threading.Lock()

        for addr, weight in servers.items():
            server = ServerWithSlowStart(
                address=addr,
                weight=weight,
                healthy=True,
            )
            server._slow_start_duration = slow_start_seconds
            # Existing servers don't need slow start
            server._startup_time = None
            self._servers[addr] = server

    def add_server(self, address: str, weight: int = 1) -> None:
        """Add a new server with slow-start."""
        with self._lock:
            server = ServerWithSlowStart(
                address=address,
                weight=weight,
                healthy=False,  # Will be marked healthy with slow-start
            )
            server._slow_start_duration = self._slow_start_duration
            self._servers[address] = server
            # Trigger slow-start
            server.mark_healthy()

    def get_next_server(self) -> Optional[str]:
        """Select server with lowest load ratio (with slow-start adjustment)."""
        with self._lock:
            healthy = [s for s in self._servers.values() if s.healthy]
            if not healthy:
                return None

            best = min(healthy, key=lambda s: s.load_ratio)
            best.active_connections += 1
            return best.address

    def release_connection(self, address: str) -> None:
        """Release connection when complete."""
        with self._lock:
            if address in self._servers:
                s = self._servers[address]
                s.active_connections = max(0, s.active_connections - 1)
```

Both HAProxy (slowstart parameter) and NGINX Plus have built-in slow-start support. HAProxy linearly increases the effective weight from 0 to configured weight over the slowstart period.
Several variants of Least Connections exist, each optimizing for different scenarios.
| Variant | Selection Criteria | Use Case | Trade-off |
|---|---|---|---|
| Basic Least Conn | Min connections | Homogeneous servers | Simple, widely supported |
| Weighted Least Conn | Min connections/weight | Heterogeneous servers | Accounts for capacity differences |
| Least Outstanding Requests | Min pending requests (L7) | HTTP with keep-alive | More accurate than connections for HTTP/2 |
| Least Response Time | Min (connections × response_time) | Backend performance varies | Requires response time tracking |
| Peak EWMA | Exponentially weighted response time | Twitter Finagle approach | Smooth response to latency changes |
| P2C + Least Conn | Pick two random, choose lowest | Very large server pools | O(1) with good distribution |
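As a rough sketch of the latency-aware variants in the table (class name, alpha value, and scoring formula are mine, simplified from the Peak EWMA idea rather than Finagle's actual API), a server score can combine a smoothed latency average with the in-flight count:

```python
class LatencyAwareServer:
    """Scores a backend by EWMA latency x (in-flight + 1).

    A simplified illustration of latency-aware selection: lower score
    means the server is both fast lately and lightly loaded.
    """

    def __init__(self, address: str, alpha: float = 0.3):
        self.address = address
        self.alpha = alpha           # smoothing factor; higher = reacts faster
        self.ewma_latency = 0.0      # smoothed response time, seconds
        self.in_flight = 0

    def observe_latency(self, seconds: float) -> None:
        # Standard EWMA update: recent samples dominate older ones
        self.ewma_latency = self.alpha * seconds + (1 - self.alpha) * self.ewma_latency

    def score(self) -> float:
        # Lower is better; +1 keeps idle servers comparable to each other
        return self.ewma_latency * (self.in_flight + 1)


servers = [LatencyAwareServer("a"), LatencyAwareServer("b")]
servers[0].observe_latency(0.500)   # "a" has been slow lately
servers[1].observe_latency(0.050)   # "b" has been fast
best = min(servers, key=LatencyAwareServer.score)
print(best.address)  # picks "b"
```

A server that slows down (garbage collection pause, noisy neighbor) sees its EWMA rise and its score worsen, so it sheds traffic even when its raw connection count looks normal.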
Power of Two Random Choices (P2C):
For very large server pools, scanning all servers becomes expensive. The P2C algorithm provides O(1) selection with provably good load distribution:

1. Pick two servers uniformly at random.
2. Route the request to whichever of the two has fewer active connections.
Mathematically, this achieves near-optimal distribution. The intuition: even random selection between just two options dramatically reduces the chance of selecting an overloaded server.
```python
import random
import threading
from typing import Optional


class P2CLeastConnections:
    """
    Power of Two Random Choices + Least Connections.

    Provides O(1) selection with near-optimal load distribution.
    Especially valuable for large server pools (100+ servers).
    """

    def __init__(self, servers: list[str]):
        self._servers = list(servers)
        self._connections = {s: 0 for s in servers}
        self._lock = threading.Lock()

    def get_next_server(self) -> Optional[str]:
        """
        Select server using P2C algorithm.

        Time Complexity: O(1)
        Distribution: Provably near-optimal for large pools
        """
        with self._lock:
            if len(self._servers) < 2:
                if self._servers:
                    server = self._servers[0]
                    self._connections[server] += 1
                    return server
                return None

            # Pick two random servers
            a, b = random.sample(self._servers, 2)

            # Choose the one with fewer connections
            if self._connections[a] <= self._connections[b]:
                self._connections[a] += 1
                return a
            else:
                self._connections[b] += 1
                return b

    def release_connection(self, address: str) -> None:
        """Release connection."""
        with self._lock:
            if address in self._connections:
                self._connections[address] = max(0, self._connections[address] - 1)


# Benchmark comparison
if __name__ == "__main__":
    import time

    # Large server pool
    servers = [f"server-{i}" for i in range(1000)]

    # P2C is O(1)
    p2c_lb = P2CLeastConnections(servers)

    # Time 100,000 selections
    start = time.perf_counter()
    for _ in range(100_000):
        server = p2c_lb.get_next_server()
        p2c_lb.release_connection(server)
    duration = time.perf_counter() - start

    print(f"P2C: 100,000 selections in {duration:.3f}s")
    print(f"  {100_000/duration:.0f} selections/second")
```

Least Connections is more complex than Round Robin. Use it when the benefits justify the complexity.
For most web applications with API backends, Least Connections is a reasonable default. It handles workload variance gracefully without significant downsides. Start with Least Connections unless you have specific reasons to prefer Round Robin's simplicity or need algorithms with different properties (like consistent hashing for session affinity).
What's Next:
We've seen unweighted and weighted Least Connections. The next page combines these insights into Weighted Least Connections, diving deeper into the algorithm that handles both heterogeneous hardware AND variable workloads simultaneously.
You now understand Least Connections deeply—why it matters for variable workloads, how to implement it correctly, common pitfalls to avoid, and when to choose it over simpler algorithms. You can make informed decisions about load-aware distribution in your systems.