Rate limiting on a single server is straightforward—a simple in-memory data structure suffices. But modern systems don't run on single servers. Your API is served by dozens of application instances behind a load balancer. Your edge network spans continents. Your microservices communicate across data centers.
In this distributed reality, rate limiting becomes a coordination problem. How do 50 servers agree that a client has exceeded their limit? How do you enforce limits when network partitions prevent servers from communicating? How do you balance consistency against latency?
Distributed rate limiting sits at the intersection of distributed systems theory and practical engineering constraints. The solutions involve tradeoffs between accuracy, performance, availability, and complexity—tradeoffs that every principal engineer must understand.
By the end of this page, you will understand the fundamental challenges of distributed rate limiting, the consistency models available, centralized vs. decentralized architectures, synchronization strategies, and practical implementation patterns used by high-scale systems including Redis-based coordination and edge-local enforcement.
Consider a simple scenario: You're running 10 API servers behind a load balancer, and you want to enforce a limit of 100 requests per minute per user. A user makes requests that are distributed across all 10 servers.
The naive approach fails:
If each server enforces a 100 request/minute limit independently, the user gets 100 × 10 = 1000 requests per minute—10x your intended limit. This completely defeats the purpose of rate limiting.
Possible approaches:
Sticky sessions — Route all requests from a user to the same server. But this defeats load balancing benefits and fails when servers are added/removed.
Divide the limit — Give each server 10 requests/minute (100 ÷ 10). But this is fragile—it assumes perfect distribution and fails if one server handles more traffic. (A minimal sketch of this approach appears after this list.)
Shared state — Synchronize counters across servers. This is the general solution, but introduces its own challenges.
Accept inaccuracy — Design for eventual consistency, accepting that limits may be temporarily exceeded. Often the practical choice.
The right approach depends on your consistency requirements, acceptable latency overhead, and infrastructure constraints.
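As a rough illustration of the "divide the limit" option, here is a minimal sketch of a per-node limiter that enforces its share of the global budget. The class name, node count, and window handling are assumptions for the example, not part of any particular system.

```typescript
// Minimal sketch of the "divide the limit" approach: each node enforces
// globalLimit / nodeCount locally. It assumes traffic is evenly balanced
// across nodes, which is exactly the fragility noted above.
class DividedLimitRateLimiter {
  private counts = new Map<string, { count: number; windowStart: number }>();

  constructor(
    private readonly globalLimit: number,  // e.g. 100 requests/minute
    private readonly nodeCount: number,    // e.g. 10 app servers
    private readonly windowMs: number = 60_000
  ) {}

  private get localLimit(): number {
    return Math.floor(this.globalLimit / this.nodeCount); // 100 / 10 = 10
  }

  isAllowed(clientId: string): boolean {
    const now = Date.now();
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const entry = this.counts.get(clientId);

    if (!entry || entry.windowStart !== windowStart) {
      this.counts.set(clientId, { count: 1, windowStart }); // new window
      return true;
    }
    if (entry.count < this.localLimit) {
      entry.count++;
      return true;
    }
    return false; // this node's share of the global budget is exhausted
  }
}
```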
Distributed rate limiting is subject to the CAP theorem: you cannot have perfect consistency, 100% availability, and partition tolerance all at once, so when the network partitions you must choose between consistency and availability. Most systems choose availability with eventual consistency (AP) because the cost of rate limiting being briefly inaccurate is lower than the cost of blocking legitimate requests during network issues.
Before examining architectures, let's understand the consistency spectrum. Different applications have different tolerance for rate limiting inaccuracy.
| Model | Guarantee | Latency Cost | Best For |
|---|---|---|---|
| Strong Consistency | Exactly N requests allowed globally | High (synchronous coordination) | Financial transactions, one-time operations |
| Sequential Consistency | Same order seen by all nodes | Medium-High | Audit-sensitive operations |
| Causal Consistency | Causally related ops in order | Medium | Complex multi-step workflows |
| Eventual Consistency | Converges eventually, may overshoot short-term | Low | Most API rate limiting (recommended) |
| Best-Effort | No guarantees, local only | None | Edge caching, rough limits |
Why eventual consistency usually wins:
For most rate limiting use cases, temporarily allowing N+X requests instead of exactly N causes minimal harm: a handful of extra API calls costs little, and the limit still curbs sustained abuse.
When strong consistency matters, it is for cases like financial transactions and one-time operations, where even a single extra request has real cost. For these cases, pay the latency cost of synchronous coordination or use compensating alternatives (e.g., optimistic execution + rollback).
In practice, roughly 80% of rate limiting can use eventual consistency with generous limits (e.g., 100 req/min). About 15% benefits from tighter eventual consistency (sync within 100-500 ms). Only around 5% truly needs strong consistency. Design your architecture to support all three, applying stronger guarantees only where needed.
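One way to make that split concrete is to tag each limit with a consistency tier in configuration and pick the enforcement path per tier. The tier names, keys, and numbers below are illustrative assumptions, not a prescribed schema.

```typescript
// Hypothetical configuration mapping rate limits to consistency tiers.
// Tier names, keys, and limits are illustrative only.
type ConsistencyTier = 'eventual' | 'tight-eventual' | 'strong';

interface RateLimitRule {
  key: string;            // what the limit applies to
  limit: number;          // requests per window
  windowSeconds: number;
  tier: ConsistencyTier;  // how accurately it must be enforced
}

const rules: RateLimitRule[] = [
  // ~80%: generous limits, plain eventual consistency is fine
  { key: 'api:read',        limit: 100, windowSeconds: 60, tier: 'eventual' },
  // ~15%: tighter eventual consistency (sync within 100-500 ms)
  { key: 'api:write',       limit: 30,  windowSeconds: 60, tier: 'tight-eventual' },
  // ~5%: strong consistency, pay the synchronous-coordination cost
  { key: 'payments:charge', limit: 5,   windowSeconds: 60, tier: 'strong' },
];
```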
The most straightforward approach to distributed rate limiting is centralization: use a single, shared data store that all application servers consult before processing requests.
The architecture:
```
Clients → Load Balancer → App Servers → Rate Limit Check → Central Store
                                                                 ↓
                                                         (Redis/Memcached)
```
All rate limit state lives in a single Redis or Memcached cluster. Every request triggers a check against this store. If the limit isn't exceeded, the request proceeds and the counter increments atomically.
```typescript
/**
 * Centralized Rate Limiter using Redis
 *
 * All application instances share rate limit state in Redis.
 * Uses a Lua script for atomic check-and-increment.
 */
import Redis from 'ioredis';

class CentralizedRateLimiter {
  private redis: Redis;
  private readonly limit: number;
  private readonly windowSeconds: number;

  // Lua script for atomic sliding window counter
  // Runs entirely on Redis - atomic and fast
  private readonly luaScript = `
    local key = KEYS[1]
    local limit = tonumber(ARGV[1])
    local window_size = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local current_window = math.floor(now / window_size) * window_size
    local previous_window = current_window - window_size

    local current_key = key .. ":" .. current_window
    local previous_key = key .. ":" .. previous_window

    local current_count = tonumber(redis.call("GET", current_key) or 0)
    local previous_count = tonumber(redis.call("GET", previous_key) or 0)

    local window_elapsed = (now - current_window) / window_size
    local previous_weight = 1 - window_elapsed
    local weighted_count = current_count + (previous_count * previous_weight)

    if weighted_count < limit then
      redis.call("INCR", current_key)
      redis.call("EXPIRE", current_key, window_size * 2)
      return {1, limit - weighted_count - 1, window_size - (now - current_window)}
    end

    return {0, 0, window_size - (now - current_window)}
  `;

  constructor(redisUrl: string, limit: number, windowSeconds: number) {
    this.redis = new Redis(redisUrl);
    this.limit = limit;
    this.windowSeconds = windowSeconds;
  }

  /**
   * Check if request is allowed
   * Returns: { allowed, remaining, resetIn }
   */
  async checkLimit(clientId: string): Promise<{
    allowed: boolean;
    remaining: number;
    resetIn: number;
  }> {
    const key = `ratelimit:${clientId}`;
    const now = Math.floor(Date.now() / 1000);

    try {
      const result = await this.redis.eval(
        this.luaScript,
        1,
        key,
        this.limit,
        this.windowSeconds,
        now
      ) as [number, number, number];

      return {
        allowed: result[0] === 1,
        remaining: Math.max(0, Math.floor(result[1])),
        resetIn: Math.ceil(result[2]),
      };
    } catch (error) {
      // Fail open: if Redis is unavailable, allow the request
      // This is a policy decision - some systems should fail closed
      console.error('Rate limit check failed:', error);
      return { allowed: true, remaining: this.limit, resetIn: this.windowSeconds };
    }
  }
}

// Usage in Express middleware
import { Request, Response, NextFunction } from 'express';

const limiter = new CentralizedRateLimiter(
  'redis://localhost:6379',
  100, // 100 requests
  60   // per 60 seconds
);

async function rateLimitMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const clientId = req.ip || (req.headers['x-forwarded-for'] as string);

  const result = await limiter.checkLimit(clientId);

  // Set standard rate limit headers
  res.setHeader('X-RateLimit-Limit', '100');
  res.setHeader('X-RateLimit-Remaining', result.remaining.toString());
  res.setHeader('X-RateLimit-Reset', result.resetIn.toString());

  if (!result.allowed) {
    res.setHeader('Retry-After', result.resetIn.toString());
    return res.status(429).json({
      error: 'Too Many Requests',
      message: 'Rate limit exceeded. Please retry later.',
      retryAfter: result.resetIn,
    });
  }

  next();
}
```

When your rate limit store is unavailable, you face a critical decision: fail open (allow all requests) or fail closed (reject all requests). Most systems fail open for general API limits, reasoning that it is better to risk some abuse than to block all legitimate traffic. Fail closed for security-critical limits such as authentication attempts.
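If you want that fail-open/fail-closed choice to be explicit rather than buried in a catch block, one option is to pass the policy in as a parameter. This is a small sketch under that assumption, not an extension of any specific library.

```typescript
// Sketch: make the failure policy an explicit, per-limiter choice.
type FailurePolicy = 'open' | 'closed';

function onStoreError(policy: FailurePolicy, limit: number, windowSeconds: number) {
  if (policy === 'open') {
    // General API limits: allow the request rather than block all traffic
    return { allowed: true, remaining: limit, resetIn: windowSeconds };
  }
  // Security-critical limits (e.g. login attempts): reject while the store is down
  return { allowed: false, remaining: 0, resetIn: windowSeconds };
}
```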
Decentralized rate limiting removes the single point of failure by distributing rate limit state across application nodes themselves. Each node maintains local state and periodically synchronizes with peers or a coordinator.
Key approaches:
Local limits with global synchronization — Each node enforces a fraction of the global limit locally, with periodic synchronization to adjust each node's share
Gossip-based coordination — Nodes share counters via gossip protocol, eventually converging on global counts
Consistent hashing — Route decisions for each client to a specific 'owner' node responsible for that client's limits
Local enforcement with periodic flush — Track locally, push to central store asynchronously
```typescript
/**
 * Decentralized Rate Limiter with Periodic Sync
 *
 * Each node tracks requests locally and periodically syncs
 * to a central store. This reduces latency while maintaining
 * approximate global limits.
 */
import Redis from 'ioredis';

interface LocalState {
  count: number;
  windowStart: number;
  lastSync: number;
}

class DecentralizedRateLimiter {
  private redis: Redis;
  private nodeId: string;
  private localState: Map<string, LocalState> = new Map();
  private readonly globalLimit: number;
  private readonly localLimit: number; // globalLimit / expectedNodeCount
  private readonly windowMs: number;
  private readonly syncIntervalMs: number;
  private readonly safetyMultiplier: number; // Allow some slack before sync

  constructor(
    redisUrl: string,
    nodeId: string,
    globalLimit: number,
    expectedNodeCount: number,
    windowSeconds: number,
    syncIntervalMs: number = 100
  ) {
    this.redis = new Redis(redisUrl);
    this.nodeId = nodeId;
    this.globalLimit = globalLimit;
    // Conservative local limit: each node gets a fraction of the global
    // limit, multiplied by a safety factor to limit overshoot between syncs
    this.safetyMultiplier = 0.8;
    this.localLimit = Math.ceil((globalLimit / expectedNodeCount) * this.safetyMultiplier);
    this.windowMs = windowSeconds * 1000;
    this.syncIntervalMs = syncIntervalMs;

    // Start background sync
    this.startBackgroundSync();
  }

  /**
   * Fast local check - no network call in hot path
   */
  isAllowedLocal(clientId: string): boolean {
    const now = Date.now();
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;

    let state = this.localState.get(clientId);

    // Reset on new window
    if (!state || state.windowStart !== windowStart) {
      state = { count: 0, windowStart, lastSync: now };
      this.localState.set(clientId, state);
    }

    // Check against local limit
    if (state.count < this.localLimit) {
      state.count++;
      return true;
    }

    // Local limit reached - need to check global
    return false;
  }

  /**
   * Full check with Redis - use when local limit reached
   * or when an accurate count is needed
   */
  async isAllowedGlobal(clientId: string): Promise<boolean> {
    const now = Date.now();
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const key = `ratelimit:${clientId}:${windowStart}`;

    // Atomic increment and check
    const count = await this.redis.incr(key);

    // Set expiry if new key
    if (count === 1) {
      await this.redis.pexpire(key, this.windowMs);
    }

    return count <= this.globalLimit;
  }

  /**
   * Combined check: fast local, fallback to global
   */
  async isAllowed(clientId: string): Promise<boolean> {
    // Try local first - microseconds
    if (this.isAllowedLocal(clientId)) {
      return true;
    }

    // Local limit reached - check global
    return this.isAllowedGlobal(clientId);
  }

  /**
   * Background sync: push local counts to Redis periodically
   */
  private startBackgroundSync(): void {
    setInterval(async () => {
      const now = Date.now();
      const windowStart = Math.floor(now / this.windowMs) * this.windowMs;

      for (const [clientId, state] of this.localState.entries()) {
        // Only sync current window
        if (state.windowStart !== windowStart) {
          this.localState.delete(clientId);
          continue;
        }

        // Nothing new to push since the last sync
        if (state.count === 0) {
          continue;
        }

        const key = `ratelimit:${clientId}:${windowStart}`;

        try {
          // Push local delta to Redis and read the global count back
          const pipeline = this.redis.pipeline();
          pipeline.incrby(key, state.count);
          pipeline.pexpire(key, this.windowMs);
          pipeline.get(key);
          const results = await pipeline.exec();
          const globalCount = parseInt((results?.[2]?.[1] as string) || '0');

          // Reset local delta after sync
          state.count = 0;
          state.lastSync = now;

          // How much of the global budget remains. An adaptive variant
          // would shrink this node's local allowance as this approaches
          // zero to prevent overshoot across nodes; that refinement is
          // omitted here.
          const remaining = this.globalLimit - globalCount;
        } catch (error) {
          console.error(`Sync failed for ${clientId}:`, error);
          // Continue with local limits if sync fails
        }
      }
    }, this.syncIntervalMs);
  }
}
```

Token Bucket with Periodic Refill from Central Store:
Another elegant pattern is to treat Redis as a 'bank' that allocates tokens to nodes: each node withdraws a batch of tokens, spends them locally without touching Redis, then returns unused tokens or requests more at the next sync. This dramatically reduces Redis load while maintaining reasonable accuracy:
```
Global Limit: 1000/min
Node A: gets 200 tokens, spends 150, returns 50
Node B: gets 200 tokens, spends 200, requests 100 more
Node C: gets 200 tokens, spends 50, returns 150
...
```
The tradeoff is that tokens allocated to idle nodes aren't available to busy nodes until the next sync period.
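A minimal sketch of the token-bank idea follows, assuming one Redis counter per client per window and a fixed batch size. The key layout, batch size, and class name are assumptions for illustration, and returning unused tokens at the end of a window is omitted for brevity.

```typescript
import Redis from 'ioredis';

// Sketch: Redis acts as the token "bank". Each node withdraws tokens in
// batches and serves requests from its local batch without touching Redis.
class TokenBankLimiter {
  private localTokens = new Map<string, number>();

  constructor(
    private readonly redis: Redis,
    private readonly globalLimit: number,   // e.g. 1000 per window
    private readonly windowSeconds: number, // e.g. 60
    private readonly batchSize: number = 50
  ) {}

  async isAllowed(clientId: string): Promise<boolean> {
    const tokens = this.localTokens.get(clientId) ?? 0;
    if (tokens > 0) {
      this.localTokens.set(clientId, tokens - 1); // spend from the local batch
      return true;
    }
    const granted = await this.withdraw(clientId); // batch exhausted: visit the bank
    if (granted <= 0) return false;
    this.localTokens.set(clientId, granted - 1);
    return true;
  }

  // Atomically withdraw up to batchSize tokens from the client's window budget.
  private async withdraw(clientId: string): Promise<number> {
    const window = Math.floor(Date.now() / 1000 / this.windowSeconds);
    const key = `tokenbank:${clientId}:${window}`;
    const script = `
      local spent = tonumber(redis.call("GET", KEYS[1]) or 0)
      local available = tonumber(ARGV[1]) - spent
      if available <= 0 then return 0 end
      local grant = math.min(available, tonumber(ARGV[2]))
      redis.call("INCRBY", KEYS[1], grant)
      redis.call("EXPIRE", KEYS[1], tonumber(ARGV[3]) * 2)
      return grant
    `;
    return await this.redis.eval(
      script, 1, key, this.globalLimit, this.batchSize, this.windowSeconds
    ) as number;
  }
}
```

With a batch size of 50 and a 1,000/min limit, a node touches Redis at most once per 50 requests per client, which is exactly the load reduction this pattern is after.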
Improve token allocation by tracking each node's request rate. Nodes handling more traffic get larger token allocations. This work-stealing approach ensures tokens go where they're needed without excessive synchronization.
Global services face an additional challenge: rate limiting across data centers separated by tens or hundreds of milliseconds of network latency. Synchronous coordination becomes impractical when every check adds 100ms+ of latency.
The multi-region reality: round trips between regions take tens to hundreds of milliseconds, so synchronous Redis checks across regions would add unacceptable latency to every request. Different architectures are needed.
| Strategy | Consistency | Latency | Complexity | Use Case |
|---|---|---|---|---|
| Regional Independence | None across regions | Lowest | Low | Regional limits only; users stick to one region |
| Async Replication | Eventual (seconds) | Low | Medium | Where brief overshoot is acceptable |
| Leader Region | Strong via one region | High for non-leader | Medium | When one region dominates traffic |
| CRDTs | Eventual (auto-merge) | Low | High | When you can use specialized data structures |
| Hierarchical | Local + global async | Low-Medium | High | Tiered limits (local strict, global eventual) |
Hierarchical Rate Limiting (Recommended):
The most practical pattern for geo-distributed systems is hierarchical rate limiting: strict, fast limits at each node, a regional limit per data center, and a looser global limit synchronized asynchronously. Each tier catches different abuse patterns:
```
               ┌──────────────────┐
               │   Global Limit   │   (10,000/day)
               │    Async Sync    │
               └─────────┬────────┘
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   US-East    │  │   EU-West    │  │   AP-Tokyo   │
│ Region Limit │  │ Region Limit │  │ Region Limit │   (60/min)
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
    ┌──┴──┐           ┌──┴──┐           ┌──┴──┐
    ▼     ▼           ▼     ▼           ▼     ▼
  Node   Node       Node   Node       Node   Node     (10/sec)
```
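One way to wire these tiers together is to check the cheapest, strictest tier first and only consult the slower tiers when the cheaper ones pass. The Limiter interface and the tier wiring below are illustrative assumptions rather than a specific library's API.

```typescript
// Sketch of hierarchical enforcement: node -> region -> global.
// Tiers are checked in order of cost; any rejection short-circuits.
interface Limiter {
  isAllowed(clientId: string): Promise<boolean> | boolean;
}

class HierarchicalRateLimiter {
  constructor(
    private readonly nodeLimiter: Limiter,    // in-process, e.g. 10/sec
    private readonly regionLimiter: Limiter,  // regional Redis, e.g. 60/min
    private readonly globalLimiter: Limiter   // async-synced, e.g. 10,000/day
  ) {}

  async isAllowed(clientId: string): Promise<boolean> {
    // Cheapest first: a node-level rejection never touches the network
    if (!(await this.nodeLimiter.isAllowed(clientId))) return false;
    // Regional limit: one low-latency call within the data center
    if (!(await this.regionLimiter.isAllowed(clientId))) return false;
    // Global limit: enforced against asynchronously replicated counts,
    // so brief overshoot across regions is possible by design
    return await this.globalLimiter.isAllowed(clientId);
  }
}
```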
Some systems assign users to 'cells' (regional data center clusters) and perform all operations, including rate limiting, within that cell. This provides strong consistency within a cell while isolating failures. When a user must access another cell, the request is either proxied (with latency) or given a separate limit for that operation.
Modern CDN and edge computing platforms (Cloudflare Workers, AWS CloudFront Functions, Fastly Compute@Edge) enable rate limiting at points of presence (PoPs) close to users—often within 50ms of the end user.
Benefits of edge rate limiting: abusive traffic is rejected before it ever reaches your origin, the check adds almost no latency because the PoP sits close to the user, and your origin infrastructure is shielded from volumetric attacks.
The challenge:
Edge PoPs (there might be 200+ globally) can't synchronously coordinate. Each PoP might see a fraction of a user's traffic, making accurate global limits difficult.
```typescript
/**
 * Edge Rate Limiting with Cloudflare Workers
 *
 * Uses Durable Objects for per-object consistency
 * or Workers KV for eventual consistency.
 */

// Binding types come from @cloudflare/workers-types
interface Env {
  RATE_LIMITS: KVNamespace;
}

// Simple approach using Workers KV (eventual consistency)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ip = request.headers.get('CF-Connecting-IP') || 'unknown';
    const now = Math.floor(Date.now() / 1000);
    const windowStart = Math.floor(now / 60) * 60; // 1-minute windows
    const key = `rate:${ip}:${windowStart}`;

    // Read current count from KV (eventually consistent)
    const currentCount = parseInt(await env.RATE_LIMITS.get(key) || '0');

    if (currentCount >= 100) {
      return new Response('Rate limit exceeded', {
        status: 429,
        headers: {
          'Retry-After': String(60 - (now - windowStart)),
          'X-RateLimit-Limit': '100',
          'X-RateLimit-Remaining': '0',
        },
      });
    }

    // Increment count (best-effort, not atomic)
    await env.RATE_LIMITS.put(key, String(currentCount + 1), {
      expirationTtl: 120, // 2 minutes
    });

    // Continue to origin
    const response = await fetch(request);

    // Add rate limit headers to response
    const newResponse = new Response(response.body, response);
    newResponse.headers.set('X-RateLimit-Limit', '100');
    newResponse.headers.set('X-RateLimit-Remaining', String(99 - currentCount));

    return newResponse;
  },
};

// More accurate approach using Durable Objects (per-object consistency)
// Each client's rate limit is handled by a single Durable Object
export class RateLimiter implements DurableObject {
  private count: number = 0;
  private windowStart: number = 0;
  private readonly limit: number = 100;
  private readonly windowMs: number = 60000;

  async fetch(request: Request): Promise<Response> {
    const now = Date.now();
    const currentWindowStart = Math.floor(now / this.windowMs) * this.windowMs;

    // Reset on new window
    if (currentWindowStart !== this.windowStart) {
      this.count = 0;
      this.windowStart = currentWindowStart;
    }

    if (this.count >= this.limit) {
      const resetIn = this.windowStart + this.windowMs - now;
      return new Response(JSON.stringify({
        allowed: false,
        remaining: 0,
        resetIn: Math.ceil(resetIn / 1000),
      }), {
        status: 429,
        headers: { 'Content-Type': 'application/json' },
      });
    }

    this.count++;

    return new Response(JSON.stringify({
      allowed: true,
      remaining: this.limit - this.count,
      resetIn: Math.ceil((this.windowStart + this.windowMs - now) / 1000),
    }), {
      headers: { 'Content-Type': 'application/json' },
    });
  }
}
```

Edge PoP Coordination Patterns:
Independent PoP limits — Each PoP enforces its own limit (say 100/minute). A user accessing 10 PoPs gets 1000/minute globally. Simple but leaky.
Anycast convergence — Route users to consistent PoPs using consistent hashing on the client IP. Same user → same PoP → accurate limit. (A mapping sketch follows this list.)
Background sync — PoPs sync counts periodically (every 5-10 seconds). Eventually consistent but can handle most abuse.
Durable Objects (Cloudflare) — Each rate limit is handled by a single Durable Object, which can be globally consistent but adds latency when the object is in a different region.
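For the anycast-convergence idea, here is a small sketch of deterministically mapping a client IP to a PoP with rendezvous (highest-random-weight) hashing. The PoP names are placeholders, and real platforms handle this at the routing layer rather than in application code.

```typescript
import { createHash } from 'node:crypto';

// Sketch: the same client IP always maps to the same PoP, so that PoP sees
// all of the client's traffic and can enforce an accurate limit locally.
// PoP identifiers are placeholders.
const POPS = ['iad', 'fra', 'nrt', 'syd', 'gru'];

function score(clientIp: string, pop: string): bigint {
  const digest = createHash('sha256').update(`${clientIp}:${pop}`).digest();
  return digest.readBigUInt64BE(0); // first 8 bytes of the hash as the score
}

function popForClient(clientIp: string, pops: string[] = POPS): string {
  // Rendezvous hashing: pick the PoP with the largest score for this client.
  // Adding or removing a PoP only remaps the clients that scored it highest.
  return pops.reduce((best, pop) =>
    score(clientIp, pop) > score(clientIp, best) ? pop : best
  );
}

// popForClient('203.0.113.7') returns the same PoP on every call
```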
Practical recommendation: Use aggressive edge limits (strict per-IP burst limits) combined with more accurate origin limits. Edge catches the obvious attacks; origin handles the subtle abuse.
The best architecture combines edge and origin rate limiting. Edge (loose, fast): block obvious abuse, e.g. 1,000/min per IP. Origin (strict, accurate): enforce precise limits per user or API key. This way, the edge absorbs volume attacks, and the origin handles sophisticated abuse that slips through.
Redis is the de facto standard for centralized rate limiting due to its speed, atomic operations, and built-in expiration. Let's explore production-grade Redis patterns for rate limiting.
```lua
-- Production-grade sliding window counter with detailed response
-- Returns: [allowed (0/1), remaining, resetMs, retryAfterMs]
--
-- KEYS[1] = base rate limit key
-- ARGV[1] = limit
-- ARGV[2] = window size in seconds
-- ARGV[3] = current timestamp in milliseconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_size_ms = tonumber(ARGV[2]) * 1000
local now = tonumber(ARGV[3])

-- Calculate window boundaries
local current_window = math.floor(now / window_size_ms) * window_size_ms
local previous_window = current_window - window_size_ms

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

-- Get both counts in one round-trip
local counts = redis.call("MGET", current_key, previous_key)
local current_count = tonumber(counts[1]) or 0
local previous_count = tonumber(counts[2]) or 0

-- Calculate sliding window weight
local window_elapsed = (now - current_window) / window_size_ms
local previous_weight = 1 - window_elapsed

-- Weighted count
local weighted_count = current_count + (previous_count * previous_weight)

-- Calculate reset time (when the current fixed window ends)
local reset_ms = current_window + window_size_ms - now

if weighted_count < limit then
  -- Allow: increment current window counter
  local new_count = redis.call("INCR", current_key)

  -- Set TTL if this is a new key (only on first increment in window)
  if new_count == 1 then
    redis.call("PEXPIRE", current_key, window_size_ms * 2)
  end

  local remaining = math.max(0, limit - (weighted_count + 1))
  return {1, remaining, reset_ms, 0}
end

-- Reject: calculate when client can retry
-- They can retry when weighted_count drops below limit
-- This happens as previous_weight approaches 0
local retry_after_ms = 0
if previous_count > 0 then
  -- Calculate when enough of previous window falls off
  local excess = weighted_count - limit + 1
  local time_needed = (excess / previous_count) * window_size_ms
  retry_after_ms = math.max(0, math.ceil(time_needed))
else
  -- No previous window, just wait for reset
  retry_after_ms = reset_ms
end

return {0, 0, reset_ms, retry_after_ms}
```

Redis Cluster Considerations:
Key distribution — In Redis Cluster, keys are sharded across nodes. Ensure related keys hash to the same node using hash tags: {user:123}:ratelimit:api and {user:123}:ratelimit:auth will land on the same shard. (A key-construction sketch follows this list.)
Lua script restrictions — In Cluster mode, Lua scripts can only access keys on the same shard. Design your keys accordingly.
Failover handling — Redis Sentinel or Cluster handles failover automatically, but there's a brief period where writes may be lost. Accept this for rate limiting.
Read replicas — Don't use replicas for rate limit reads—stale data leads to allowing too many requests. Only read from primary.
Connection pooling — Rate limiting generates many small requests. Use connection pooling to avoid connection overhead.
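To make the hash-tag point concrete, here is a quick sketch of key construction. The braces syntax is standard Redis Cluster hash-tag behavior; the key layout itself is just an example.

```typescript
// Redis Cluster hashes only the content inside {...}, so both keys below
// map to the same hash slot and can be used together in one Lua script.
function rateLimitKey(userId: string, scope: 'api' | 'auth'): string {
  return `{user:${userId}}:ratelimit:${scope}`;
}

rateLimitKey('123', 'api');  // "{user:123}:ratelimit:api"
rateLimitKey('123', 'auth'); // "{user:123}:ratelimit:auth"  -> same slot
```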
A single Redis instance can handle ~100,000+ operations/second. But at very high scale, even this becomes a bottleneck. Solutions: shard by client ID across multiple Redis instances, use local caching with periodic sync, or consider purpose-built rate limiting services like Envoy's ratelimit or Kong's rate limiting plugin.
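A minimal sketch of the shard-by-client-ID idea mentioned above, assuming a fixed list of Redis endpoints; the instance URLs are placeholders.

```typescript
import Redis from 'ioredis';
import { createHash } from 'node:crypto';

// Sketch: spread rate-limit state across several Redis instances by hashing
// the client ID, so no single instance sees all rate-limit traffic.
// Instance URLs are placeholders.
const SHARDS = [
  new Redis('redis://ratelimit-0:6379'),
  new Redis('redis://ratelimit-1:6379'),
  new Redis('redis://ratelimit-2:6379'),
];

function shardFor(clientId: string): Redis {
  const digest = createHash('sha1').update(clientId).digest();
  return SHARDS[digest.readUInt32BE(0) % SHARDS.length];
}

// A limiter then runs its check against shardFor(clientId) instead of a
// single shared Redis client.
```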
Distributed rate limiting adds significant complexity, but it is essential for any system running on multiple servers. The key insights: eventual consistency is good enough for most limits; a centralized Redis store is the simplest accurate approach; local enforcement with periodic sync trades a little accuracy for much lower latency; geo-distributed systems call for hierarchical limits; and when the limit store fails, fail open for general APIs and fail closed for security-sensitive operations.
What's next:
With an understanding of algorithms and distributed architectures, we now need to address a fundamental question: Who is making this request? The next page covers Client Identification—the techniques for identifying and distinguishing clients including IP addresses, API keys, user accounts, device fingerprinting, and handling challenges like NAT and proxy networks.
You now understand the challenges and solutions for distributed rate limiting. Whether building a simple Redis-backed limiter or a sophisticated multi-layer geo-distributed system, these patterns provide the foundation for protecting your services at scale.