Rate limiting on a single server is straightforward—a simple in-memory data structure suffices. But modern systems don't run on single servers. Your API is served by dozens of application instances behind a load balancer. Your edge network spans continents. Your microservices communicate across data centers.
In this distributed reality, rate limiting becomes a coordination problem. How do 50 servers agree that a client has exceeded their limit? How do you enforce limits when network partitions prevent servers from communicating? How do you balance consistency against latency?
Distributed rate limiting sits at the intersection of distributed systems theory and practical engineering constraints. The solutions involve tradeoffs between accuracy, performance, availability, and complexity—tradeoffs that every principal engineer must understand.
By the end of this page, you will understand the fundamental challenges of distributed rate limiting, the consistency models available, centralized vs. decentralized architectures, synchronization strategies, and practical implementation patterns used by high-scale systems including Redis-based coordination and edge-local enforcement.
Consider a simple scenario: You're running 10 API servers behind a load balancer, and you want to enforce a limit of 100 requests per minute per user. A user makes requests that are distributed across all 10 servers.
The naive approach fails:
If each server enforces a 100 request/minute limit independently, the user gets 100 × 10 = 1000 requests per minute—10x your intended limit. This completely defeats the purpose of rate limiting.
Possible approaches:
Sticky sessions — Route all requests from a user to the same server. But this defeats load balancing benefits and fails when servers are added/removed.
Divide the limit — Give each server 10 requests/minute (100 ÷ 10). But this is fragile—it assumes perfect distribution and fails if one server handles more traffic. (A minimal sketch of this approach appears after this list.)
Shared state — Synchronize counters across servers. This is the general solution, but introduces its own challenges.
Accept inaccuracy — Design for eventual consistency, accepting that limits may be temporarily exceeded. Often the practical choice.
The right approach depends on your consistency requirements, acceptable latency overhead, and infrastructure constraints.
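As a rough illustration of the "divide the limit" option, here is a minimal sketch of a per-node limiter that enforces its share of the global budget. The class name, node count, and window handling are assumptions for the example, not part of any particular system.

```typescript
// Minimal sketch of the "divide the limit" approach: each node enforces
// globalLimit / nodeCount locally. It assumes traffic is evenly balanced
// across nodes, which is exactly the fragility noted above.
class DividedLimitRateLimiter {
  private counts = new Map<string, { count: number; windowStart: number }>();

  constructor(
    private readonly globalLimit: number,  // e.g. 100 requests/minute
    private readonly nodeCount: number,    // e.g. 10 app servers
    private readonly windowMs: number = 60_000
  ) {}

  private get localLimit(): number {
    return Math.floor(this.globalLimit / this.nodeCount); // 100 / 10 = 10
  }

  isAllowed(clientId: string): boolean {
    const now = Date.now();
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const entry = this.counts.get(clientId);

    if (!entry || entry.windowStart !== windowStart) {
      this.counts.set(clientId, { count: 1, windowStart }); // new window
      return true;
    }
    if (entry.count < this.localLimit) {
      entry.count++;
      return true;
    }
    return false; // this node's share of the global budget is exhausted
  }
}
```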
Distributed rate limiting is subject to the CAP theorem: you cannot have perfect consistency, 100% availability, and partition tolerance all at once, so when the network partitions you must choose between consistency and availability. Most systems choose availability with eventual consistency (AP) because the cost of rate limiting being briefly inaccurate is lower than the cost of blocking legitimate requests during network issues.
Before examining architectures, let's understand the consistency spectrum. Different applications have different tolerance for rate limiting inaccuracy.
| Model | Guarantee | Latency Cost | Best For |
|---|---|---|---|
| Strong Consistency | Exactly N requests allowed globally | High (synchronous coordination) | Financial transactions, one-time operations |
| Sequential Consistency | Same order seen by all nodes | Medium-High | Audit-sensitive operations |
| Causal Consistency | Causally related ops in order | Medium | Complex multi-step workflows |
| Eventual Consistency | Converges eventually, may overshoot short-term | Low | Most API rate limiting (recommended) |
| Best-Effort | No guarantees, local only | None | Edge caching, rough limits |
Why eventual consistency usually wins:
For most rate limiting use cases, temporarily allowing N+X requests instead of exactly N causes minimal harm: a handful of extra API calls costs little, and the limit still curbs sustained abuse.
When strong consistency matters, it is for cases like financial transactions and one-time operations, where even a single extra request has real cost. For these cases, pay the latency cost of synchronous coordination or use compensating alternatives (e.g., optimistic execution + rollback).
In practice, roughly 80% of rate limiting can use eventual consistency with generous limits (e.g., 100 req/min). About 15% benefits from tighter eventual consistency (sync within 100-500 ms). Only around 5% truly needs strong consistency. Design your architecture to support all three, applying stronger guarantees only where needed.
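One way to make that split concrete is to tag each limit with a consistency tier in configuration and pick the enforcement path per tier. The tier names, keys, and numbers below are illustrative assumptions, not a prescribed schema.

```typescript
// Hypothetical configuration mapping rate limits to consistency tiers.
// Tier names, keys, and limits are illustrative only.
type ConsistencyTier = 'eventual' | 'tight-eventual' | 'strong';

interface RateLimitRule {
  key: string;            // what the limit applies to
  limit: number;          // requests per window
  windowSeconds: number;
  tier: ConsistencyTier;  // how accurately it must be enforced
}

const rules: RateLimitRule[] = [
  // ~80%: generous limits, plain eventual consistency is fine
  { key: 'api:read',        limit: 100, windowSeconds: 60, tier: 'eventual' },
  // ~15%: tighter eventual consistency (sync within 100-500 ms)
  { key: 'api:write',       limit: 30,  windowSeconds: 60, tier: 'tight-eventual' },
  // ~5%: strong consistency, pay the synchronous-coordination cost
  { key: 'payments:charge', limit: 5,   windowSeconds: 60, tier: 'strong' },
];
```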
The most straightforward approach to distributed rate limiting is centralization: use a single, shared data store that all application servers consult before processing requests.
The architecture:
```
Clients → Load Balancer → App Servers → Rate Limit Check → Central Store
                                                                 ↓
                                                         (Redis/Memcached)
```
All rate limit state lives in a single Redis or Memcached cluster. Every request triggers a check against this store. If the limit isn't exceeded, the request proceeds and the counter increments atomically.
```typescript
/**
 * Centralized Rate Limiter using Redis
 *
 * All application instances share rate limit state in Redis.
 * Uses a Lua script for atomic check-and-increment.
 */
import Redis from 'ioredis';

class CentralizedRateLimiter {
  private redis: Redis;
  private readonly limit: number;
  private readonly windowSeconds: number;

  // Lua script for atomic sliding window counter
  // Runs entirely on Redis - atomic and fast
  private readonly luaScript = `
    local key = KEYS[1]
    local limit = tonumber(ARGV[1])
    local window_size = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local current_window = math.floor(now / window_size) * window_size
    local previous_window = current_window - window_size

    local current_key = key .. ":" .. current_window
    local previous_key = key .. ":" .. previous_window

    local current_count = tonumber(redis.call("GET", current_key) or 0)
    local previous_count = tonumber(redis.call("GET", previous_key) or 0)

    local window_elapsed = (now - current_window) / window_size
    local previous_weight = 1 - window_elapsed
    local weighted_count = current_count + (previous_count * previous_weight)

    if weighted_count < limit then
      redis.call("INCR", current_key)
      redis.call("EXPIRE", current_key, window_size * 2)
      return {1, limit - weighted_count - 1, window_size - (now - current_window)}
    end

    return {0, 0, window_size - (now - current_window)}
  `;

  constructor(redisUrl: string, limit: number, windowSeconds: number) {
    this.redis = new Redis(redisUrl);
    this.limit = limit;
    this.windowSeconds = windowSeconds;
  }

  /**
   * Check if request is allowed
   * Returns: { allowed, remaining, resetIn }
   */
  async checkLimit(clientId: string): Promise<{
    allowed: boolean;
    remaining: number;
    resetIn: number;
  }> {
    const key = `ratelimit:${clientId}`;
    const now = Math.floor(Date.now() / 1000);

    try {
      const result = await this.redis.eval(
        this.luaScript,
        1,
        key,
        this.limit,
        this.windowSeconds,
        now
      ) as [number, number, number];

      return {
        allowed: result[0] === 1,
        remaining: Math.max(0, Math.floor(result[1])),
        resetIn: Math.ceil(result[2]),
      };
    } catch (error) {
      // Fail open: if Redis is unavailable, allow the request
      // This is a policy decision - some systems should fail closed
      console.error('Rate limit check failed:', error);
      return { allowed: true, remaining: this.limit, resetIn: this.windowSeconds };
    }
  }
}

// Usage in Express middleware
import { Request, Response, NextFunction } from 'express';

const limiter = new CentralizedRateLimiter(
  'redis://localhost:6379',
  100, // 100 requests
  60   // per 60 seconds
);

async function rateLimitMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const clientId = req.ip || (req.headers['x-forwarded-for'] as string);

  const result = await limiter.checkLimit(clientId);

  // Set standard rate limit headers
  res.setHeader('X-RateLimit-Limit', '100');
  res.setHeader('X-RateLimit-Remaining', result.remaining.toString());
  res.setHeader('X-RateLimit-Reset', result.resetIn.toString());

  if (!result.allowed) {
    res.setHeader('Retry-After', result.resetIn.toString());
    return res.status(429).json({
      error: 'Too Many Requests',
      message: 'Rate limit exceeded. Please retry later.',
      retryAfter: result.resetIn,
    });
  }

  next();
}
```

When your rate limit store is unavailable, you face a critical decision: fail open (allow all requests) or fail closed (reject all requests). Most systems fail open for general API limits, reasoning that it is better to risk some abuse than to block all legitimate traffic. Fail closed for security-critical limits such as authentication attempts.
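If you want that fail-open/fail-closed choice to be explicit rather than buried in a catch block, one option is to pass the policy in as a parameter. This is a small sketch under that assumption, not an extension of any specific library.

```typescript
// Sketch: make the failure policy an explicit, per-limiter choice.
type FailurePolicy = 'open' | 'closed';

function onStoreError(policy: FailurePolicy, limit: number, windowSeconds: number) {
  if (policy === 'open') {
    // General API limits: allow the request rather than block all traffic
    return { allowed: true, remaining: limit, resetIn: windowSeconds };
  }
  // Security-critical limits (e.g. login attempts): reject while the store is down
  return { allowed: false, remaining: 0, resetIn: windowSeconds };
}
```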
Decentralized rate limiting removes the single point of failure by distributing rate limit state across application nodes themselves. Each node maintains local state and periodically synchronizes with peers or a coordinator.
Key approaches:
Local limits with global synchronization — Each node enforces a fraction of the global limit locally, with periodic synchronization to adjust each node's share
Gossip-based coordination — Nodes share counters via gossip protocol, eventually converging on global counts
Consistent hashing — Route decisions for each client to a specific 'owner' node responsible for that client's limits
Local enforcement with periodic flush — Track locally, push to central store asynchronously
```typescript
/**
 * Decentralized Rate Limiter with Periodic Sync
 *
 * Each node tracks requests locally and periodically syncs
 * to a central store. This reduces latency while maintaining
 * approximate global limits.
 */
import Redis from 'ioredis';

interface LocalState {
  count: number;
  windowStart: number;
  lastSync: number;
}

class DecentralizedRateLimiter {
  private redis: Redis;
  private nodeId: string;
  private localState: Map<string, LocalState> = new Map();
  private readonly globalLimit: number;
  private readonly localLimit: number; // globalLimit / expectedNodeCount
  private readonly windowMs: number;
  private readonly syncIntervalMs: number;
  private readonly safetyMultiplier: number; // Allow some slack before sync

  constructor(
    redisUrl: string,
    nodeId: string,
    globalLimit: number,
    expectedNodeCount: number,
    windowSeconds: number,
    syncIntervalMs: number = 100
  ) {
    this.redis = new Redis(redisUrl);
    this.nodeId = nodeId;
    this.globalLimit = globalLimit;
    // Conservative local limit: each node gets a fraction of the global
    // limit, multiplied by a safety factor to limit overshoot between syncs
    this.safetyMultiplier = 0.8;
    this.localLimit = Math.ceil((globalLimit / expectedNodeCount) * this.safetyMultiplier);
    this.windowMs = windowSeconds * 1000;
    this.syncIntervalMs = syncIntervalMs;

    // Start background sync
    this.startBackgroundSync();
  }

  /**
   * Fast local check - no network call in hot path
   */
  isAllowedLocal(clientId: string): boolean {
    const now = Date.now();
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;

    let state = this.localState.get(clientId);

    // Reset on new window
    if (!state || state.windowStart !== windowStart) {
      state = { count: 0, windowStart, lastSync: now };
      this.localState.set(clientId, state);
    }

    // Check against local limit
    if (state.count < this.localLimit) {
      state.count++;
      return true;
    }

    // Local limit reached - need to check global
    return false;
  }

  /**
   * Full check with Redis - use when local limit reached
   * or when an accurate count is needed
   */
  async isAllowedGlobal(clientId: string): Promise<boolean> {
    const now = Date.now();
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const key = `ratelimit:${clientId}:${windowStart}`;

    // Atomic increment and check
    const count = await this.redis.incr(key);

    // Set expiry if new key
    if (count === 1) {
      await this.redis.pexpire(key, this.windowMs);
    }

    return count <= this.globalLimit;
  }

  /**
   * Combined check: fast local, fallback to global
   */
  async isAllowed(clientId: string): Promise<boolean> {
    // Try local first - microseconds
    if (this.isAllowedLocal(clientId)) {
      return true;
    }

    // Local limit reached - check global
    return this.isAllowedGlobal(clientId);
  }

  /**
   * Background sync: push local counts to Redis periodically
   */
  private startBackgroundSync(): void {
    setInterval(async () => {
      const now = Date.now();
      const windowStart = Math.floor(now / this.windowMs) * this.windowMs;

      for (const [clientId, state] of this.localState.entries()) {
        // Only sync current window
        if (state.windowStart !== windowStart) {
          this.localState.delete(clientId);
          continue;
        }

        // Nothing new to push since the last sync
        if (state.count === 0) {
          continue;
        }

        const key = `ratelimit:${clientId}:${windowStart}`;

        try {
          // Push local delta to Redis and read the global count back
          const pipeline = this.redis.pipeline();
          pipeline.incrby(key, state.count);
          pipeline.pexpire(key, this.windowMs);
          pipeline.get(key);
          const results = await pipeline.exec();
          const globalCount = parseInt((results?.[2]?.[1] as string) || '0');

          // Reset local delta after sync
          state.count = 0;
          state.lastSync = now;

          // How much of the global budget remains. An adaptive variant
          // would shrink this node's local allowance as this approaches
          // zero to prevent overshoot across nodes; that refinement is
          // omitted here.
          const remaining = this.globalLimit - globalCount;
        } catch (error) {
          console.error(`Sync failed for ${clientId}:`, error);
          // Continue with local limits if sync fails
        }
      }
    }, this.syncIntervalMs);
  }
}
```

Token Bucket with Periodic Refill from Central Store:
Another elegant pattern is to treat Redis as a 'bank' that allocates tokens to nodes: each node withdraws a batch of tokens, spends them locally without touching Redis, then returns unused tokens or requests more at the next sync. This dramatically reduces Redis load while maintaining reasonable accuracy:
```
Global Limit: 1000/min
Node A: gets 200 tokens, spends 150, returns 50
Node B: gets 200 tokens, spends 200, requests 100 more
Node C: gets 200 tokens, spends 50, returns 150
...
```
The tradeoff is that tokens allocated to idle nodes aren't available to busy nodes until the next sync period.
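A minimal sketch of the token-bank idea follows, assuming one Redis counter per client per window and a fixed batch size. The key layout, batch size, and class name are assumptions for illustration, and returning unused tokens at the end of a window is omitted for brevity.

```typescript
import Redis from 'ioredis';

// Sketch: Redis acts as the token "bank". Each node withdraws tokens in
// batches and serves requests from its local batch without touching Redis.
class TokenBankLimiter {
  private localTokens = new Map<string, number>();

  constructor(
    private readonly redis: Redis,
    private readonly globalLimit: number,   // e.g. 1000 per window
    private readonly windowSeconds: number, // e.g. 60
    private readonly batchSize: number = 50
  ) {}

  async isAllowed(clientId: string): Promise<boolean> {
    const tokens = this.localTokens.get(clientId) ?? 0;
    if (tokens > 0) {
      this.localTokens.set(clientId, tokens - 1); // spend from the local batch
      return true;
    }
    const granted = await this.withdraw(clientId); // batch exhausted: visit the bank
    if (granted <= 0) return false;
    this.localTokens.set(clientId, granted - 1);
    return true;
  }

  // Atomically withdraw up to batchSize tokens from the client's window budget.
  private async withdraw(clientId: string): Promise<number> {
    const window = Math.floor(Date.now() / 1000 / this.windowSeconds);
    const key = `tokenbank:${clientId}:${window}`;
    const script = `
      local spent = tonumber(redis.call("GET", KEYS[1]) or 0)
      local available = tonumber(ARGV[1]) - spent
      if available <= 0 then return 0 end
      local grant = math.min(available, tonumber(ARGV[2]))
      redis.call("INCRBY", KEYS[1], grant)
      redis.call("EXPIRE", KEYS[1], tonumber(ARGV[3]) * 2)
      return grant
    `;
    return await this.redis.eval(
      script, 1, key, this.globalLimit, this.batchSize, this.windowSeconds
    ) as number;
  }
}
```

With a batch size of 50 and a 1,000/min limit, a node touches Redis at most once per 50 requests per client, which is exactly the load reduction this pattern is after.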
Improve token allocation by tracking each node's request rate. Nodes handling more traffic get larger token allocations. This work-stealing approach ensures tokens go where they're needed without excessive synchronization.
Global services face an additional challenge: rate limiting across data centers separated by tens or hundreds of milliseconds of network latency. Synchronous coordination becomes impractical when every check adds 100ms+ of latency.
The multi-region reality: round trips between regions take tens to hundreds of milliseconds, so synchronous Redis checks across regions would add unacceptable latency to every request. Different architectures are needed.
| Strategy | Consistency | Latency | Complexity | Use Case |
|---|---|---|---|---|
| Regional Independence | None across regions | Lowest | Low | Regional limits only; users stick to one region |
| Async Replication | Eventual (seconds) | Low | Medium | Where brief overshoot is acceptable |
| Leader Region | Strong via one region | High for non-leader | Medium | When one region dominates traffic |
| CRDTs | Eventual (auto-merge) | Low | High | When you can use specialized data structures |
| Hierarchical | Local + global async | Low-Medium | High | Tiered limits (local strict, global eventual) |
Hierarchical Rate Limiting (Recommended):
The most practical pattern for geo-distributed systems is hierarchical rate limiting: strict, fast limits at each node, a regional limit per data center, and a looser global limit synchronized asynchronously. Each tier catches different abuse patterns:
```
               ┌──────────────────┐
               │   Global Limit   │   (10,000/day)
               │    Async Sync    │
               └─────────┬────────┘
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   US-East    │  │   EU-West    │  │   AP-Tokyo   │
│ Region Limit │  │ Region Limit │  │ Region Limit │   (60/min)
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
    ┌──┴──┐           ┌──┴──┐           ┌──┴──┐
    ▼     ▼           ▼     ▼           ▼     ▼
  Node   Node       Node   Node       Node   Node     (10/sec)
```
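One way to wire these tiers together is to check the cheapest, strictest tier first and only consult the slower tiers when the cheaper ones pass. The Limiter interface and the tier wiring below are illustrative assumptions rather than a specific library's API.

```typescript
// Sketch of hierarchical enforcement: node -> region -> global.
// Tiers are checked in order of cost; any rejection short-circuits.
interface Limiter {
  isAllowed(clientId: string): Promise<boolean> | boolean;
}

class HierarchicalRateLimiter {
  constructor(
    private readonly nodeLimiter: Limiter,    // in-process, e.g. 10/sec
    private readonly regionLimiter: Limiter,  // regional Redis, e.g. 60/min
    private readonly globalLimiter: Limiter   // async-synced, e.g. 10,000/day
  ) {}

  async isAllowed(clientId: string): Promise<boolean> {
    // Cheapest first: a node-level rejection never touches the network
    if (!(await this.nodeLimiter.isAllowed(clientId))) return false;
    // Regional limit: one low-latency call within the data center
    if (!(await this.regionLimiter.isAllowed(clientId))) return false;
    // Global limit: enforced against asynchronously replicated counts,
    // so brief overshoot across regions is possible by design
    return await this.globalLimiter.isAllowed(clientId);
  }
}
```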
Some systems assign users to 'cells' (regional data center clusters) and perform all operations, including rate limiting, within that cell. This provides strong consistency within a cell while isolating failures. When a user must access another cell, the request is either proxied (with latency) or given a separate limit for that operation.
Modern CDN and edge computing platforms (Cloudflare Workers, AWS CloudFront Functions, Fastly Compute@Edge) enable rate limiting at points of presence (PoPs) close to users—often within 50ms of the end user.
Benefits of edge rate limiting: abusive traffic is rejected before it ever reaches your origin, the check adds almost no latency because the PoP sits close to the user, and your origin infrastructure is shielded from volumetric attacks.
The challenge:
Edge PoPs (there might be 200+ globally) can't synchronously coordinate. Each PoP might see a fraction of a user's traffic, making accurate global limits difficult.
```typescript
/**
 * Edge Rate Limiting with Cloudflare Workers
 *
 * Uses Durable Objects for per-object consistency
 * or Workers KV for eventual consistency.
 */

// Binding types come from @cloudflare/workers-types
interface Env {
  RATE_LIMITS: KVNamespace;
}

// Simple approach using Workers KV (eventual consistency)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ip = request.headers.get('CF-Connecting-IP') || 'unknown';
    const now = Math.floor(Date.now() / 1000);
    const windowStart = Math.floor(now / 60) * 60; // 1-minute windows
    const key = `rate:${ip}:${windowStart}`;

    // Read current count from KV (eventually consistent)
    const currentCount = parseInt(await env.RATE_LIMITS.get(key) || '0');

    if (currentCount >= 100) {
      return new Response('Rate limit exceeded', {
        status: 429,
        headers: {
          'Retry-After': String(60 - (now - windowStart)),
          'X-RateLimit-Limit': '100',
          'X-RateLimit-Remaining': '0',
        },
      });
    }

    // Increment count (best-effort, not atomic)
    await env.RATE_LIMITS.put(key, String(currentCount + 1), {
      expirationTtl: 120, // 2 minutes
    });

    // Continue to origin
    const response = await fetch(request);

    // Add rate limit headers to response
    const newResponse = new Response(response.body, response);
    newResponse.headers.set('X-RateLimit-Limit', '100');
    newResponse.headers.set('X-RateLimit-Remaining', String(99 - currentCount));

    return newResponse;
  },
};

// More accurate approach using Durable Objects (per-object consistency)
// Each client's rate limit is handled by a single Durable Object
export class RateLimiter implements DurableObject {
  private count: number = 0;
  private windowStart: number = 0;
  private readonly limit: number = 100;
  private readonly windowMs: number = 60000;

  async fetch(request: Request): Promise<Response> {
    const now = Date.now();
    const currentWindowStart = Math.floor(now / this.windowMs) * this.windowMs;

    // Reset on new window
    if (currentWindowStart !== this.windowStart) {
      this.count = 0;
      this.windowStart = currentWindowStart;
    }

    if (this.count >= this.limit) {
      const resetIn = this.windowStart + this.windowMs - now;
      return new Response(JSON.stringify({
        allowed: false,
        remaining: 0,
        resetIn: Math.ceil(resetIn / 1000),
      }), {
        status: 429,
        headers: { 'Content-Type': 'application/json' },
      });
    }

    this.count++;

    return new Response(JSON.stringify({
      allowed: true,
      remaining: this.limit - this.count,
      resetIn: Math.ceil((this.windowStart + this.windowMs - now) / 1000),
    }), {
      headers: { 'Content-Type': 'application/json' },
    });
  }
}
```

Edge PoP Coordination Patterns:
Independent PoP limits — Each PoP enforces its own limit (say 100/minute). A user accessing 10 PoPs gets 1000/minute globally. Simple but leaky.
Anycast convergence — Route users to consistent PoPs using consistent hashing on the client IP. Same user → same PoP → accurate limit. (A mapping sketch follows this list.)
Background sync — PoPs sync counts periodically (every 5-10 seconds). Eventually consistent but can handle most abuse.
Durable Objects (Cloudflare) — Each rate limit is handled by a single Durable Object, which can be globally consistent but adds latency when the object is in a different region.
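For the anycast-convergence idea, here is a small sketch of deterministically mapping a client IP to a PoP with rendezvous (highest-random-weight) hashing. The PoP names are placeholders, and real platforms handle this at the routing layer rather than in application code.

```typescript
import { createHash } from 'node:crypto';

// Sketch: the same client IP always maps to the same PoP, so that PoP sees
// all of the client's traffic and can enforce an accurate limit locally.
// PoP identifiers are placeholders.
const POPS = ['iad', 'fra', 'nrt', 'syd', 'gru'];

function score(clientIp: string, pop: string): bigint {
  const digest = createHash('sha256').update(`${clientIp}:${pop}`).digest();
  return digest.readBigUInt64BE(0); // first 8 bytes of the hash as the score
}

function popForClient(clientIp: string, pops: string[] = POPS): string {
  // Rendezvous hashing: pick the PoP with the largest score for this client.
  // Adding or removing a PoP only remaps the clients that scored it highest.
  return pops.reduce((best, pop) =>
    score(clientIp, pop) > score(clientIp, best) ? pop : best
  );
}

// popForClient('203.0.113.7') returns the same PoP on every call
```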
Practical recommendation: Use aggressive edge limits (strict per-IP burst limits) combined with more accurate origin limits. Edge catches the obvious attacks; origin handles the subtle abuse.
The best architecture combines edge and origin rate limiting. Edge (loose, fast): block obvious abuse, e.g. 1,000/min per IP. Origin (strict, accurate): enforce precise limits per user or API key. This way, the edge absorbs volume attacks, and the origin handles sophisticated abuse that slips through.
Redis is the de facto standard for centralized rate limiting due to its speed, atomic operations, and built-in expiration. Let's explore production-grade Redis patterns for rate limiting.
```lua
-- Production-grade sliding window counter with detailed response
-- Returns: [allowed (0/1), remaining, resetMs, retryAfterMs]
--
-- KEYS[1] = base rate limit key
-- ARGV[1] = limit
-- ARGV[2] = window size in seconds
-- ARGV[3] = current timestamp in milliseconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_size_ms = tonumber(ARGV[2]) * 1000
local now = tonumber(ARGV[3])

-- Calculate window boundaries
local current_window = math.floor(now / window_size_ms) * window_size_ms
local previous_window = current_window - window_size_ms

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

-- Get both counts in one round-trip
local counts = redis.call("MGET", current_key, previous_key)
local current_count = tonumber(counts[1]) or 0
local previous_count = tonumber(counts[2]) or 0

-- Calculate sliding window weight
local window_elapsed = (now - current_window) / window_size_ms
local previous_weight = 1 - window_elapsed

-- Weighted count
local weighted_count = current_count + (previous_count * previous_weight)

-- Calculate reset time (when the current fixed window ends)
local reset_ms = current_window + window_size_ms - now

if weighted_count < limit then
  -- Allow: increment current window counter
  local new_count = redis.call("INCR", current_key)

  -- Set TTL if this is a new key (only on first increment in window)
  if new_count == 1 then
    redis.call("PEXPIRE", current_key, window_size_ms * 2)
  end

  local remaining = math.max(0, limit - (weighted_count + 1))
  return {1, remaining, reset_ms, 0}
end

-- Reject: calculate when client can retry
-- They can retry when weighted_count drops below limit
-- This happens as previous_weight approaches 0
local retry_after_ms = 0
if previous_count > 0 then
  -- Calculate when enough of previous window falls off
  local excess = weighted_count - limit + 1
  local time_needed = (excess / previous_count) * window_size_ms
  retry_after_ms = math.max(0, math.ceil(time_needed))
else
  -- No previous window, just wait for reset
  retry_after_ms = reset_ms
end

return {0, 0, reset_ms, retry_after_ms}
```

Redis Cluster Considerations:
Key distribution — In Redis Cluster, keys are sharded across nodes. Ensure related keys hash to the same node using hash tags: {user:123}:ratelimit:api and {user:123}:ratelimit:auth will land on the same shard. (A key-construction sketch follows this list.)
Lua script restrictions — In Cluster mode, Lua scripts can only access keys on the same shard. Design your keys accordingly.
Failover handling — Redis Sentinel or Cluster handles failover automatically, but there's a brief period where writes may be lost. Accept this for rate limiting.
Read replicas — Don't use replicas for rate limit reads—stale data leads to allowing too many requests. Only read from primary.
Connection pooling — Rate limiting generates many small requests. Use connection pooling to avoid connection overhead.
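To make the hash-tag point concrete, here is a quick sketch of key construction. The braces syntax is standard Redis Cluster hash-tag behavior; the key layout itself is just an example.

```typescript
// Redis Cluster hashes only the content inside {...}, so both keys below
// map to the same hash slot and can be used together in one Lua script.
function rateLimitKey(userId: string, scope: 'api' | 'auth'): string {
  return `{user:${userId}}:ratelimit:${scope}`;
}

rateLimitKey('123', 'api');  // "{user:123}:ratelimit:api"
rateLimitKey('123', 'auth'); // "{user:123}:ratelimit:auth"  -> same slot
```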
A single Redis instance can handle ~100,000+ operations/second. But at very high scale, even this becomes a bottleneck. Solutions: shard by client ID across multiple Redis instances, use local caching with periodic sync, or consider purpose-built rate limiting services like Envoy's ratelimit or Kong's rate limiting plugin.
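A minimal sketch of the shard-by-client-ID idea mentioned above, assuming a fixed list of Redis endpoints; the instance URLs are placeholders.

```typescript
import Redis from 'ioredis';
import { createHash } from 'node:crypto';

// Sketch: spread rate-limit state across several Redis instances by hashing
// the client ID, so no single instance sees all rate-limit traffic.
// Instance URLs are placeholders.
const SHARDS = [
  new Redis('redis://ratelimit-0:6379'),
  new Redis('redis://ratelimit-1:6379'),
  new Redis('redis://ratelimit-2:6379'),
];

function shardFor(clientId: string): Redis {
  const digest = createHash('sha1').update(clientId).digest();
  return SHARDS[digest.readUInt32BE(0) % SHARDS.length];
}

// A limiter then runs its check against shardFor(clientId) instead of a
// single shared Redis client.
```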
Distributed rate limiting adds significant complexity, but it is essential for any system running on multiple servers. The key insights: eventual consistency is good enough for most limits; a centralized Redis store is the simplest accurate approach; local enforcement with periodic sync trades a little accuracy for much lower latency; geo-distributed systems call for hierarchical limits; and when the limit store fails, fail open for general APIs and fail closed for security-sensitive operations.
What's next:
With an understanding of algorithms and distributed architectures, we now need to address a fundamental question: Who is making this request? The next page covers Client Identification—the techniques for identifying and distinguishing clients including IP addresses, API keys, user accounts, device fingerprinting, and handling challenges like NAT and proxy networks.
You now understand the challenges and solutions for distributed rate limiting. Whether building a simple Redis-backed limiter or a sophisticated multi-layer geo-distributed system, these patterns provide the foundation for protecting your services at scale.