Your API serves users across the globe from data centers in Virginia, Frankfurt, Tokyo, and São Paulo. Each region runs multiple gateway instances. When a user's request lands on a server in Tokyo, how does that server know the user has already made 999 requests this hour through servers in Frankfurt?
Distributed rate limiting is one of the hardest problems in API gateway design. It requires balancing strict consistency (accurate limits) with availability and latency (fast responses). Get it wrong, and you either over-admit requests (threatening system stability) or over-reject them (frustrating users).
This page explores the strategies, trade-offs, and production-proven patterns for rate limiting at global scale.
By the end of this page, you will understand the CAP theorem's implications for rate limiting, synchronization strategies from local-first to strictly consistent, production architectures using Redis Cluster, and how major platforms solve this problem.
When rate limit state is distributed across multiple nodes, fundamental distributed systems challenges emerge.
The CAP Theorem Reality:
The CAP theorem states that a distributed system can provide at most two of: Consistency, Availability, Partition tolerance. Since network partitions are inevitable, you must choose between CP (reject requests you cannot count accurately, sacrificing availability) and AP (keep serving requests with possibly stale counts, sacrificing strict accuracy).
Most rate limiting systems choose AP with eventual consistency—slight over-admission during brief partitions is acceptable compared to rejecting legitimate traffic.
There are several approaches to keeping rate limit state synchronized across distributed nodes, each with different consistency/latency trade-offs.
| Strategy | Consistency | Latency | Best For |
|---|---|---|---|
| Local Only | None | Lowest | Stateless limits (per-request) |
| Sticky Sessions | Per-session | Low | Session-scoped limits |
| Async Sync | Eventual | Low | Soft limits, high throughput |
| Central Store | Strong | Medium | Strict limits, single region |
| Gossip Protocol | Eventual | Low | Multi-region, high availability |
| Consensus-Based | Strong | High | Critical limits, low volume |
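The simplest row in the table, local-only limiting, also serves as the fallback when coordination fails (the Redis architecture later in this page degrades to a local limiter). A minimal sketch, assuming the global limit is divided evenly across a known number of nodes (class and parameter names here are illustrative, not from a specific library):

```typescript
// Minimal local-only token bucket: each node enforces its share of the
// global limit independently, with no coordination between nodes.
class LocalTokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,       // burst size allowed on this node
    private refillPerSecond: number // steady-state rate for this node
  ) {
    this.tokens = capacity;
    this.lastRefillMs = Date.now();
  }

  tryConsume(nowMs: number = Date.now()): boolean {
    // Refill based on elapsed time, capped at capacity
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = nowMs;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The catch is the "divide evenly" assumption: if the load balancer sends a user's traffic unevenly, hot nodes reject requests while the global limit still has headroom, which is exactly why the synchronized strategies below exist.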
Local Counters with Async Synchronization:
Each node maintains local counters and periodically syncs with a central store. This provides excellent latency with eventual consistency.
```typescript
class LocalAsyncRateLimiter {
  private localCounts: Map<string, number>;     // per-key deltas since last sync
  private globalEstimates: Map<string, number>; // last-known global counts, refreshed on sync
  private centralStore: RedisClient;
  private syncIntervalMs: number;
  private maxLocalDrift: number;

  constructor(config: {
    centralStore: RedisClient;
    syncIntervalMs: number;
    maxLocalDrift: number;
  }) {
    this.localCounts = new Map();
    this.globalEstimates = new Map();
    this.centralStore = config.centralStore;
    this.syncIntervalMs = config.syncIntervalMs;
    this.maxLocalDrift = config.maxLocalDrift;

    // Periodically flush local deltas to the central store
    setInterval(() => this.syncToCentral(), this.syncIntervalMs);
  }

  async tryConsume(key: string, limit: number): Promise<boolean> {
    // Fast path: estimate from last-known global count plus unsynced local delta
    const localDelta = this.localCounts.get(key) ?? 0;
    const estimated = (this.globalEstimates.get(key) ?? 0) + localDelta;

    // If the estimate exceeds limit + drift tolerance, definitely reject
    if (estimated >= limit + this.maxLocalDrift) {
      return false;
    }

    // Well under the limit: allow using only local state
    if (estimated < limit * 0.8) {
      this.localCounts.set(key, localDelta + 1);
      return true;
    }

    // Near the limit: consult the central store for an accurate count
    const globalCount = await this.centralStore.get(key);
    this.globalEstimates.set(key, globalCount);
    if (globalCount + localDelta >= limit) {
      return false;
    }
    this.localCounts.set(key, localDelta + 1);
    return true;
  }

  private async syncToCentral(): Promise<void> {
    for (const [key, localDelta] of this.localCounts) {
      if (localDelta > 0) {
        // Atomic increment; the return value is the new global total
        const globalCount = await this.centralStore.incrBy(key, localDelta);
        this.globalEstimates.set(key, globalCount);
        this.localCounts.set(key, 0);
      }
    }
  }
}
```

Redis is the most common backend for distributed rate limiting due to its speed, atomic operations, and cluster support. Here's a production-grade architecture.
```typescript
class ResilientRedisRateLimiter {
  private redis: RedisCluster;
  private localFallback: LocalRateLimiter;
  private healthChecker: HealthChecker;
  private metrics: MetricsClient;
  private config: { failOpen: boolean };
  private scriptSha: string; // SHA of the rate-limit Lua script, loaded at startup

  async tryConsume(key: string, limit: number): Promise<RateLimitResult> {
    // Degrade to local-only limiting when Redis is known to be unhealthy
    if (!this.healthChecker.isRedisHealthy()) {
      this.metrics.increment('rate_limit.fallback.local');
      return this.localFallback.tryConsume(key, limit);
    }

    try {
      const result = await this.withTimeout(
        this.checkRedis(key, limit),
        50 // 50ms timeout: a slow limiter is worse than an approximate one
      );
      this.metrics.timing('rate_limit.redis.latency', result.latencyMs);
      return result;
    } catch (error) {
      this.metrics.increment('rate_limit.redis.error');
      this.healthChecker.recordFailure();

      // Fail open or closed based on configuration
      if (this.config.failOpen) {
        this.metrics.increment('rate_limit.fail_open');
        return { allowed: true, remaining: 0, uncertain: true };
      } else {
        this.metrics.increment('rate_limit.fail_closed');
        return { allowed: false, remaining: 0, uncertain: true };
      }
    }
  }

  private async checkRedis(key: string, limit: number): Promise<RateLimitResult> {
    const pipeline = this.redis.pipeline();

    // Multi-window check in one round-trip; the {ratelimit:...} hash tag
    // keeps all windows for a key on the same cluster slot
    const windows = ['minute', 'hour', 'day'];
    for (const window of windows) {
      const windowKey = `{ratelimit:${key}}:${window}`;
      pipeline.evalsha(this.scriptSha, 1, windowKey, this.getLimitForWindow(limit, window));
    }

    const results = await pipeline.exec();
    return this.aggregateResults(results);
  }
}
```

When Redis is unavailable, you must choose: fail open (allow requests, risk overload) or fail closed (reject requests, risk availability). Most systems fail open for non-critical limits and fail closed for security-critical limits (like login attempts).
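The `checkRedis` method executes a server-side script via `evalsha`, but the script body isn't shown. Conceptually, that script must perform a check-and-increment as a single atomic step. The in-memory TypeScript sketch below mirrors that logic for a fixed-window variant (the key naming and window handling are assumptions for illustration, not the actual script):

```typescript
// In-memory model of the atomic check-and-increment that a rate-limit
// Lua script performs server-side in Redis (fixed-window variant).
type WindowEntry = { count: number; expiresAtMs: number };

function tryConsumeWindow(
  store: Map<string, WindowEntry>,
  windowKey: string,
  limit: number,
  windowMs: number,
  nowMs: number
): { allowed: boolean; remaining: number } {
  let entry = store.get(windowKey);

  // Expired or missing window: start a fresh one (in Redis: a key with a TTL)
  if (!entry || entry.expiresAtMs <= nowMs) {
    entry = { count: 0, expiresAtMs: nowMs + windowMs };
    store.set(windowKey, entry);
  }

  if (entry.count >= limit) {
    return { allowed: false, remaining: 0 };
  }

  entry.count += 1; // in Redis: INCR
  return { allowed: true, remaining: limit - entry.count };
}
```

Pushing this into a Lua script matters because Redis runs scripts atomically: a client-side GET followed by INCR would race against other gateway instances checking the same key.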
Global systems face the challenge of synchronizing rate limits across continents with 100ms+ latency between regions.
| Approach | Consistency | Latency | Complexity |
|---|---|---|---|
| Single Global Store | Strong | High (cross-region) | Low |
| Per-Region Stores + Sync | Eventual | Low (local) | Medium |
| Partitioned by User Location | Strong (for user) | Low | Medium |
| Hierarchical (Region + Global) | Hybrid | Low + Periodic | High |
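The "partitioned by user location" row deserves a note: if each user has a single home region that owns their counter, that region can enforce the limit with strong consistency and no cross-region coordination on the hot path. A minimal routing sketch (the hash function and region list are illustrative assumptions):

```typescript
// Assign each user a stable "home" region that owns their rate-limit
// state; other regions forward limit checks there. Illustrative only.
const REGIONS = ['us-east', 'eu-central', 'ap-northeast', 'sa-east'];

function homeRegionFor(userId: string, regions: string[] = REGIONS): string {
  // Simple stable hash (FNV-1a); consistent hashing would reduce
  // key remapping when the region list changes.
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return regions[hash % regions.length];
}
```

The trade-off: the counter is strongly consistent for each user, but a user traveling far from their home region pays cross-region latency on every request. One design option is to pin the home region to where the user's traffic usually originates rather than to a pure hash.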
```typescript
class MultiRegionRateLimiter {
  private localRedis: RedisClient;    // Same-region Redis
  private globalSyncer: GlobalSyncer; // Cross-region sync service
  private region: string;

  async tryConsume(userId: string, limit: number): Promise<RateLimitResult> {
    // Step 1: Increment this region's count
    const localKey = `ratelimit:${userId}:${this.region}`;
    const localCount = await this.localRedis.incr(localKey);

    // Step 2: Get the estimated count from other regions
    // (eventually consistent but fast: no cross-region round-trip)
    const remoteEstimate = this.globalSyncer.getEstimatedGlobalCount(userId, this.region);

    // Step 3: Calculate the effective global count
    const effectiveCount = localCount + remoteEstimate;

    // Step 4: Apply the limit with a buffer for sync lag
    const effectiveLimit = limit * 0.9; // 10% headroom for delayed sync

    if (effectiveCount > effectiveLimit) {
      // Roll back the local increment
      await this.localRedis.decr(localKey);
      return { allowed: false, remaining: 0 };
    }

    // Step 5: Asynchronously report usage to the global syncer
    this.globalSyncer.reportUsage(userId, this.region, 1);

    return {
      allowed: true,
      remaining: Math.max(0, limit - effectiveCount),
    };
  }
}

class GlobalSyncer {
  private regionCounts: Map<string, Map<string, number>>; // userId -> region -> count

  // Called by each region periodically (e.g., every 5 seconds)
  async syncFromRegion(region: string, counts: Map<string, number>): Promise<void> {
    for (const [userId, count] of counts) {
      if (!this.regionCounts.has(userId)) {
        this.regionCounts.set(userId, new Map());
      }
      this.regionCounts.get(userId)!.set(region, count);
    }
  }

  getEstimatedGlobalCount(userId: string, excludeRegion?: string): number {
    const userCounts = this.regionCounts.get(userId);
    if (!userCounts) return 0;

    let total = 0;
    for (const [region, count] of userCounts) {
      // Skip the caller's own region: it already has the exact local count
      if (region !== excludeRegion) {
        total += count;
      }
    }
    return total;
  }
}
```

With eventual consistency, set effective limits at 90% of the actual limit. The 10% buffer absorbs sync lag. Users experience the stated limit, while the system has headroom for delayed synchronization.
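The size of that buffer can be derived rather than guessed: worst-case over-admission between syncs is bounded by (regions − 1) × peak per-region request rate × sync interval, since that is how much traffic the other regions can admit before this region hears about it. A quick sketch (the numbers are illustrative):

```typescript
// Worst-case over-admission under eventual consistency: every other
// region admits at its peak rate for one full sync interval before
// this region learns about it.
function worstCaseDrift(
  regionCount: number,
  peakPerRegionPerSec: number,
  syncIntervalSec: number
): number {
  return (regionCount - 1) * peakPerRegionPerSec * syncIntervalSec;
}

// Example: 4 regions, one user peaking at 5 req/s per region, 5s sync
// interval -> up to 75 extra requests admitted versus a strongly
// consistent limiter. Against a 1000/hour limit, a 10% buffer
// (100 requests) comfortably absorbs that drift.
const drift = worstCaseDrift(4, 5, 5);
```

If the computed drift exceeds your buffer (many regions, long sync intervals, bursty users), shorten the sync interval or widen the buffer accordingly.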
Major platforms that rate limit at global scale converge on the same design choice: they accept eventual consistency. None require strict global consistency, because the trade-off (high latency, added complexity, reduced availability) isn't worth it for rate limiting. Slight over-admission is acceptable; strict global accuracy is not worth its cost.
Module Complete!
You've now completed the Rate Limiting at Gateway module. You understand why rate limiting matters, the core algorithms (token bucket, sliding window), multi-dimensional limiting strategies, and how to implement rate limiting at global scale.
This knowledge equips you to design rate limiting systems that protect your infrastructure, enable fair usage, and support business monetization, all while maintaining the availability and performance your users expect. Congratulations! From algorithms to distributed systems, you're now equipped to design and implement production-grade rate limiting at any scale.