Your API serves users across the globe from data centers in Virginia, Frankfurt, Tokyo, and São Paulo. Each region runs multiple gateway instances. When a user's request lands on a server in Tokyo, how does that server know the user has already made 999 requests this hour through servers in Frankfurt?
Distributed rate limiting is one of the hardest problems in API gateway design. It requires balancing strict consistency (accurate limits) with availability and latency (fast responses). Get it wrong, and you either over-admit requests (threatening system stability) or over-reject them (frustrating users).
This page explores the strategies, trade-offs, and production-proven patterns for rate limiting at global scale.
By the end of this page, you will understand the CAP theorem's implications for rate limiting, synchronization strategies from local-first to strictly consistent, production architectures using Redis Cluster, and how major platforms solve this problem.
When rate limit state is distributed across multiple nodes, fundamental distributed systems challenges emerge.
The CAP Theorem Reality:
The CAP theorem states that a distributed system can provide at most two of: Consistency, Availability, Partition tolerance. Since network partitions are inevitable, you must choose between CP (reject requests you cannot count accurately, sacrificing availability) and AP (keep serving requests with possibly stale counts, sacrificing strict accuracy).
Most rate limiting systems choose AP with eventual consistency—slight over-admission during brief partitions is acceptable compared to rejecting legitimate traffic.
There are several approaches to keeping rate limit state synchronized across distributed nodes, each with different consistency/latency trade-offs.
| Strategy | Consistency | Latency | Best For |
|---|---|---|---|
| Local Only | None | Lowest | Stateless limits (per-request) |
| Sticky Sessions | Per-session | Low | Session-scoped limits |
| Async Sync | Eventual | Low | Soft limits, high throughput |
| Central Store | Strong | Medium | Strict limits, single region |
| Gossip Protocol | Eventual | Low | Multi-region, high availability |
| Consensus-Based | Strong | High | Critical limits, low volume |
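The simplest row in the table, local-only limiting, also serves as the fallback when coordination fails (the Redis architecture later in this page degrades to a local limiter). A minimal sketch, assuming the global limit is divided evenly across a known number of nodes (class and parameter names here are illustrative, not from a specific library):

```typescript
// Minimal local-only token bucket: each node enforces its share of the
// global limit independently, with no coordination between nodes.
class LocalTokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,       // burst size allowed on this node
    private refillPerSecond: number // steady-state rate for this node
  ) {
    this.tokens = capacity;
    this.lastRefillMs = Date.now();
  }

  tryConsume(nowMs: number = Date.now()): boolean {
    // Refill based on elapsed time, capped at capacity
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = nowMs;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The catch is the "divide evenly" assumption: if the load balancer sends a user's traffic unevenly, hot nodes reject requests while the global limit still has headroom, which is exactly why the synchronized strategies below exist.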
Local Counters with Async Synchronization:
Each node maintains local counters and periodically syncs with a central store. This provides excellent latency with eventual consistency.
```typescript
class LocalAsyncRateLimiter {
  private localCounts: Map<string, number>;     // per-key deltas since last sync
  private globalEstimates: Map<string, number>; // last-known global counts, refreshed on sync
  private centralStore: RedisClient;
  private syncIntervalMs: number;
  private maxLocalDrift: number;

  constructor(config: {
    centralStore: RedisClient;
    syncIntervalMs: number;
    maxLocalDrift: number;
  }) {
    this.localCounts = new Map();
    this.globalEstimates = new Map();
    this.centralStore = config.centralStore;
    this.syncIntervalMs = config.syncIntervalMs;
    this.maxLocalDrift = config.maxLocalDrift;

    // Periodically flush local deltas to the central store
    setInterval(() => this.syncToCentral(), this.syncIntervalMs);
  }

  async tryConsume(key: string, limit: number): Promise<boolean> {
    // Fast path: estimate from last-known global count plus unsynced local delta
    const localDelta = this.localCounts.get(key) ?? 0;
    const estimated = (this.globalEstimates.get(key) ?? 0) + localDelta;

    // If the estimate exceeds limit + drift tolerance, definitely reject
    if (estimated >= limit + this.maxLocalDrift) {
      return false;
    }

    // Well under the limit: allow using only local state
    if (estimated < limit * 0.8) {
      this.localCounts.set(key, localDelta + 1);
      return true;
    }

    // Near the limit: consult the central store for an accurate count
    const globalCount = await this.centralStore.get(key);
    this.globalEstimates.set(key, globalCount);
    if (globalCount + localDelta >= limit) {
      return false;
    }
    this.localCounts.set(key, localDelta + 1);
    return true;
  }

  private async syncToCentral(): Promise<void> {
    for (const [key, localDelta] of this.localCounts) {
      if (localDelta > 0) {
        // Atomic increment; the return value is the new global total
        const globalCount = await this.centralStore.incrBy(key, localDelta);
        this.globalEstimates.set(key, globalCount);
        this.localCounts.set(key, 0);
      }
    }
  }
}
```

Redis is the most common backend for distributed rate limiting due to its speed, atomic operations, and cluster support. Here's a production-grade architecture.
```typescript
class ResilientRedisRateLimiter {
  private redis: RedisCluster;
  private localFallback: LocalRateLimiter;
  private healthChecker: HealthChecker;
  private metrics: MetricsClient;
  private config: { failOpen: boolean };
  private scriptSha: string; // SHA of the rate-limit Lua script, loaded at startup

  async tryConsume(key: string, limit: number): Promise<RateLimitResult> {
    // Degrade to local-only limiting when Redis is known to be unhealthy
    if (!this.healthChecker.isRedisHealthy()) {
      this.metrics.increment('rate_limit.fallback.local');
      return this.localFallback.tryConsume(key, limit);
    }

    try {
      const result = await this.withTimeout(
        this.checkRedis(key, limit),
        50 // 50ms timeout: a slow limiter is worse than an approximate one
      );
      this.metrics.timing('rate_limit.redis.latency', result.latencyMs);
      return result;
    } catch (error) {
      this.metrics.increment('rate_limit.redis.error');
      this.healthChecker.recordFailure();

      // Fail open or closed based on configuration
      if (this.config.failOpen) {
        this.metrics.increment('rate_limit.fail_open');
        return { allowed: true, remaining: 0, uncertain: true };
      } else {
        this.metrics.increment('rate_limit.fail_closed');
        return { allowed: false, remaining: 0, uncertain: true };
      }
    }
  }

  private async checkRedis(key: string, limit: number): Promise<RateLimitResult> {
    const pipeline = this.redis.pipeline();

    // Multi-window check in one round-trip; the {ratelimit:...} hash tag
    // keeps all windows for a key on the same cluster slot
    const windows = ['minute', 'hour', 'day'];
    for (const window of windows) {
      const windowKey = `{ratelimit:${key}}:${window}`;
      pipeline.evalsha(this.scriptSha, 1, windowKey, this.getLimitForWindow(limit, window));
    }

    const results = await pipeline.exec();
    return this.aggregateResults(results);
  }
}
```

When Redis is unavailable, you must choose: fail open (allow requests, risk overload) or fail closed (reject requests, risk availability). Most systems fail open for non-critical limits and fail closed for security-critical limits (like login attempts).
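The `checkRedis` method executes a server-side script via `evalsha`, but the script body isn't shown. Conceptually, that script must perform a check-and-increment as a single atomic step. The in-memory TypeScript sketch below mirrors that logic for a fixed-window variant (the key naming and window handling are assumptions for illustration, not the actual script):

```typescript
// In-memory model of the atomic check-and-increment that a rate-limit
// Lua script performs server-side in Redis (fixed-window variant).
type WindowEntry = { count: number; expiresAtMs: number };

function tryConsumeWindow(
  store: Map<string, WindowEntry>,
  windowKey: string,
  limit: number,
  windowMs: number,
  nowMs: number
): { allowed: boolean; remaining: number } {
  let entry = store.get(windowKey);

  // Expired or missing window: start a fresh one (in Redis: a key with a TTL)
  if (!entry || entry.expiresAtMs <= nowMs) {
    entry = { count: 0, expiresAtMs: nowMs + windowMs };
    store.set(windowKey, entry);
  }

  if (entry.count >= limit) {
    return { allowed: false, remaining: 0 };
  }

  entry.count += 1; // in Redis: INCR
  return { allowed: true, remaining: limit - entry.count };
}
```

Pushing this into a Lua script matters because Redis runs scripts atomically: a client-side GET followed by INCR would race against other gateway instances checking the same key.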
Global systems face the challenge of synchronizing rate limits across continents with 100ms+ latency between regions.
| Approach | Consistency | Latency | Complexity |
|---|---|---|---|
| Single Global Store | Strong | High (cross-region) | Low |
| Per-Region Stores + Sync | Eventual | Low (local) | Medium |
| Partitioned by User Location | Strong (for user) | Low | Medium |
| Hierarchical (Region + Global) | Hybrid | Low + Periodic | High |
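The "partitioned by user location" row deserves a note: if each user has a single home region that owns their counter, that region can enforce the limit with strong consistency and no cross-region coordination on the hot path. A minimal routing sketch (the hash function and region list are illustrative assumptions):

```typescript
// Assign each user a stable "home" region that owns their rate-limit
// state; other regions forward limit checks there. Illustrative only.
const REGIONS = ['us-east', 'eu-central', 'ap-northeast', 'sa-east'];

function homeRegionFor(userId: string, regions: string[] = REGIONS): string {
  // Simple stable hash (FNV-1a); consistent hashing would reduce
  // key remapping when the region list changes.
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return regions[hash % regions.length];
}
```

The trade-off: the counter is strongly consistent for each user, but a user traveling far from their home region pays cross-region latency on every request. One design option is to pin the home region to where the user's traffic usually originates rather than to a pure hash.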
```typescript
class MultiRegionRateLimiter {
  private localRedis: RedisClient;    // Same-region Redis
  private globalSyncer: GlobalSyncer; // Cross-region sync service
  private region: string;

  async tryConsume(userId: string, limit: number): Promise<RateLimitResult> {
    // Step 1: Increment this region's count
    const localKey = `ratelimit:${userId}:${this.region}`;
    const localCount = await this.localRedis.incr(localKey);

    // Step 2: Get the estimated count from other regions
    // (eventually consistent but fast: no cross-region round-trip)
    const remoteEstimate = this.globalSyncer.getEstimatedGlobalCount(userId, this.region);

    // Step 3: Calculate the effective global count
    const effectiveCount = localCount + remoteEstimate;

    // Step 4: Apply the limit with a buffer for sync lag
    const effectiveLimit = limit * 0.9; // 10% headroom for delayed sync

    if (effectiveCount > effectiveLimit) {
      // Roll back the local increment
      await this.localRedis.decr(localKey);
      return { allowed: false, remaining: 0 };
    }

    // Step 5: Asynchronously report usage to the global syncer
    this.globalSyncer.reportUsage(userId, this.region, 1);

    return {
      allowed: true,
      remaining: Math.max(0, limit - effectiveCount),
    };
  }
}

class GlobalSyncer {
  private regionCounts: Map<string, Map<string, number>>; // userId -> region -> count

  // Called by each region periodically (e.g., every 5 seconds)
  async syncFromRegion(region: string, counts: Map<string, number>): Promise<void> {
    for (const [userId, count] of counts) {
      if (!this.regionCounts.has(userId)) {
        this.regionCounts.set(userId, new Map());
      }
      this.regionCounts.get(userId)!.set(region, count);
    }
  }

  getEstimatedGlobalCount(userId: string, excludeRegion?: string): number {
    const userCounts = this.regionCounts.get(userId);
    if (!userCounts) return 0;

    let total = 0;
    for (const [region, count] of userCounts) {
      // Skip the caller's own region: it already has the exact local count
      if (region !== excludeRegion) {
        total += count;
      }
    }
    return total;
  }
}
```

With eventual consistency, set effective limits at 90% of the actual limit. The 10% buffer absorbs sync lag. Users experience the stated limit, while the system has headroom for delayed synchronization.
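The size of that buffer can be derived rather than guessed: worst-case over-admission between syncs is bounded by (regions − 1) × peak per-region request rate × sync interval, since that is how much traffic the other regions can admit before this region hears about it. A quick sketch (the numbers are illustrative):

```typescript
// Worst-case over-admission under eventual consistency: every other
// region admits at its peak rate for one full sync interval before
// this region learns about it.
function worstCaseDrift(
  regionCount: number,
  peakPerRegionPerSec: number,
  syncIntervalSec: number
): number {
  return (regionCount - 1) * peakPerRegionPerSec * syncIntervalSec;
}

// Example: 4 regions, one user peaking at 5 req/s per region, 5s sync
// interval -> up to 75 extra requests admitted versus a strongly
// consistent limiter. Against a 1000/hour limit, a 10% buffer
// (100 requests) comfortably absorbs that drift.
const drift = worstCaseDrift(4, 5, 5);
```

If the computed drift exceeds your buffer (many regions, long sync intervals, bursty users), shorten the sync interval or widen the buffer accordingly.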
Major platforms that rate limit at global scale converge on the same design choice: they accept eventual consistency. None require strict global consistency, because the trade-off (high latency, added complexity, reduced availability) isn't worth it for rate limiting. Slight over-admission is acceptable; strict global accuracy is not worth its cost.
Module Complete!
You've now completed the Rate Limiting at Gateway module. You understand why rate limiting matters, the core algorithms (token bucket, sliding window), multi-dimensional limiting strategies, and how to implement rate limiting at global scale.
This knowledge equips you to design rate limiting systems that protect your infrastructure, enable fair usage, and support business monetization, all while maintaining the availability and performance your users expect. Congratulations! From algorithms to distributed systems, you're now equipped to design and implement production-grade rate limiting at any scale.