When engineers at AWS, Stripe, and Cloudflare needed to rate limit billions of API requests, they reached for the same algorithm: the Token Bucket. This elegant algorithm, originally designed for network traffic shaping in the 1980s, has become the industry standard for API rate limiting due to its perfect balance of simplicity, efficiency, and flexibility.
The Token Bucket's genius lies in its intuitive metaphor: imagine a bucket that slowly fills with tokens at a steady rate. Each request consumes a token. If the bucket is empty, the request must wait. This simple mental model captures complex behaviors—allowing bursts while enforcing long-term averages—in a single, unified framework.
In this page, we'll master the Token Bucket algorithm from its mathematical foundations to production implementation, exploring the nuances that separate textbook descriptions from battle-tested systems.
By the end of this page, you'll understand: (1) The mathematical model behind token buckets, (2) Implementation with constant-time O(1) operations, (3) How to tune buckets for different traffic patterns, (4) The atomic token bucket for concurrent systems, and (5) Production considerations including edge cases and monitoring.
Before diving into implementation, let's build a robust mental model of how the token bucket works. This understanding will guide every design decision.
The Physical Analogy:
Imagine a physical bucket with these properties:
- It holds at most b tokens (its capacity); once full, newly added tokens overflow and are lost.
- Tokens drip in at a steady rate of r per second.
- Each incoming request removes a token (or several, for expensive requests).
- If there aren't enough tokens, the request is denied or made to wait.
This simple model enables sophisticated behaviors:
| Scenario | Bucket State | Behavior | Result |
|---|---|---|---|
| Steady traffic at rate limit | Tokens consumed as fast as added | Sustainable | All requests allowed |
| Traffic burst after idle | Bucket full, has accumulated tokens | Burst allowed | Smooth handling of spikes |
| Sustained traffic over limit | Bucket drains to zero | Limiting engaged | Excess requests denied |
| Traffic returns to normal | Bucket slowly refills | Recovery | Future bursts possible again |
Key Parameters:
Every token bucket is defined by two fundamental parameters:
Rate (r) — Tokens added per second
Bucket Size/Burst (b) — Maximum tokens the bucket can hold
Rate vs. Burst Trade-off:
The relationship between rate and burst size determines traffic characteristics:
| Configuration | Behavior | Use Case |
|---|---|---|
| Rate: 100/s, Burst: 100 | Very smooth, minimal bursting | Real-time systems |
| Rate: 100/s, Burst: 500 | Absorbs 5s worth of burst | Web APIs |
| Rate: 100/s, Burst: 3600 | Absorbs 36s worth of burst in one shot | Batch processing support |
| Rate: 10/s, Burst: 1 | Strict 1 request per 100ms | Critical resource protection |
A common heuristic: set burst = rate × acceptable_burst_duration. For an API where 2-second bursts are acceptable with a 100/sec rate, use burst = 200. This allows clients to 'bank' 2 seconds of unused quota. Too small a burst creates friction; too large defeats rate limiting.
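As a quick illustration of that heuristic, the sketch below computes burst sizes for a couple of example profiles; the tier names and numbers are illustrative, not taken from any particular system.

```typescript
// burst = rate × acceptable_burst_duration (illustrative numbers)
function burstFor(ratePerSecond: number, acceptableBurstSeconds: number): number {
  return ratePerSecond * acceptableBurstSeconds;
}

const webApiBurst = burstFor(100, 2);    // 200 tokens: clients can bank 2s of quota
const batchApiBurst = burstFor(50, 30);  // 1,500 tokens: tolerant of bursty batch jobs
```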
Understanding the mathematics behind token buckets enables optimized implementation and correct reasoning about edge cases. Let's formalize the algorithm.
Formal Definition:
At any time t, the number of tokens in the bucket is given by:
tokens(t) = min(b, tokens(t₀) + r × (t - t₀))
Where b is the bucket capacity, r is the refill rate in tokens per second, t₀ is the time of the last update, and tokens(t₀) is the token count recorded at that time.
Request Admission:
A request requiring c tokens (cost) at time t is admitted if:
tokens(t) >= c
If admitted, the new token count becomes:
tokens_new = tokens(t) - c
Key Insight: Lazy Evaluation
We don't need to continuously update the token count. We only need to compute it when a request arrives. This is the crucial optimization that makes token buckets efficient.
Time-to-Next-Token:
When a request is denied, we can calculate exactly when the next token will be available:
time_to_token = (c - tokens(t)) / r
This enables the Retry-After header, telling clients exactly when to retry.
Average Rate Guarantee:
Token bucket provides a mathematical guarantee about long-term rate:
Over any time period T, the maximum requests allowed is:
max_requests(T) = b + r × T
For large T, this approaches r × T, so the long-term average rate approaches r. For example, with r = 100/s and b = 500, any 60-second window admits at most 500 + 100 × 60 = 6,500 requests, an average of roughly 108 requests per second.
Burst Characteristics:
After the bucket has been full and idle, a client can issue up to b requests immediately and is then throttled to the refill rate r. At an incoming request rate R, that accumulated burst is exhausted in roughly b / R seconds (ignoring refill during the burst), which is what burstDuration computes in the code below.
```typescript
// Mathematical token bucket computations
interface BucketMath {
  rate: number;       // r: tokens per second
  capacity: number;   // b: bucket size
  tokens: number;     // current tokens
  lastRefill: number; // timestamp of last calculation
}

/**
 * Calculate current tokens at time 't'
 * tokens(t) = min(capacity, tokens(t₀) + rate × (t - t₀))
 */
function tokensAt(bucket: BucketMath, currentTime: number): number {
  const elapsed = currentTime - bucket.lastRefill;
  const tokensToAdd = bucket.rate * elapsed;
  return Math.min(bucket.capacity, bucket.tokens + tokensToAdd);
}

/**
 * Calculate time until 'cost' tokens are available
 * Returns 0 if tokens are already available
 */
function timeToTokens(bucket: BucketMath, cost: number, currentTime: number): number {
  const available = tokensAt(bucket, currentTime);
  if (available >= cost) return 0;
  const tokensNeeded = cost - available;
  return tokensNeeded / bucket.rate;
}

/**
 * Calculate maximum requests allowed over time period T
 * max_requests(T) = capacity + rate × T
 */
function maxRequestsOverTime(bucket: BucketMath, seconds: number): number {
  return bucket.capacity + (bucket.rate * seconds);
}

/**
 * After how many seconds does rate limiting effectively start?
 * (time to exhaust a full bucket at maximum rate)
 */
function burstDuration(bucket: BucketMath, requestRate: number): number {
  return bucket.capacity / requestRate;
}
```

The beauty of the token bucket is that it can be implemented with O(1) time complexity per request: no loops, no complex data structures, just simple arithmetic. Let's build a production-quality implementation.
```typescript
/**
 * Production-grade Token Bucket implementation
 * O(1) time and O(1) space per bucket
 */
class TokenBucket {
  private tokens: number;
  private lastRefillTime: number;

  constructor(
    private readonly capacity: number,   // Maximum tokens (burst size)
    private readonly refillRate: number, // Tokens per second
    initialTokens?: number               // Starting tokens (default: full)
  ) {
    this.tokens = initialTokens ?? capacity;
    this.lastRefillTime = Date.now() / 1000; // Current time in seconds
  }

  /**
   * Attempt to consume 'cost' tokens
   * Returns true if successful, false if insufficient tokens
   */
  tryConsume(cost: number = 1): boolean {
    this.refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }

  /**
   * Consume tokens, returning detailed result
   */
  consume(cost: number = 1): TokenBucketResult {
    this.refill();

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return {
        allowed: true,
        remaining: Math.floor(this.tokens),
        limit: this.capacity,
        resetAt: this.calculateResetTime(),
        retryAfter: 0
      };
    }

    const retryAfter = (cost - this.tokens) / this.refillRate;
    return {
      allowed: false,
      remaining: 0,
      limit: this.capacity,
      resetAt: this.calculateResetTime(),
      retryAfter: Math.ceil(retryAfter)
    };
  }

  /**
   * Refill bucket based on elapsed time
   * This is the key optimization: lazy evaluation
   */
  private refill(): void {
    const now = Date.now() / 1000;
    const elapsed = now - this.lastRefillTime;

    // Calculate tokens to add (may be fractional)
    const tokensToAdd = elapsed * this.refillRate;

    // Update tokens, capped at capacity
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefillTime = now;
  }

  /**
   * Calculate when bucket will be full again
   */
  private calculateResetTime(): number {
    const tokensNeeded = this.capacity - this.tokens;
    const secondsToFull = tokensNeeded / this.refillRate;
    return Date.now() + (secondsToFull * 1000);
  }

  /**
   * Get current bucket status without consuming tokens
   */
  getStatus(): TokenBucketStatus {
    this.refill();
    return {
      tokens: this.tokens,
      capacity: this.capacity,
      refillRate: this.refillRate
    };
  }
}

interface TokenBucketResult {
  allowed: boolean;
  remaining: number;
  limit: number;
  resetAt: number;    // Unix timestamp (milliseconds)
  retryAfter: number; // Seconds until retry
}

interface TokenBucketStatus {
  tokens: number;
  capacity: number;
  refillRate: number;
}
```

Notice we use floating-point for tokens. This allows fractional tokens from partial refills. When reporting to clients, we floor to integers (you can't consume 0.3 of a token), but internally, maintaining precision prevents accumulation errors over time.
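To show how the result maps onto an HTTP response, here is a minimal usage sketch. The per-client bucket map, the 429 handling, and the X-RateLimit-*/Retry-After header names follow common API conventions and are assumptions, not part of the class above.

```typescript
// Minimal usage sketch (illustrative): one bucket per client, with the
// consume result translated into a status code and rate-limit headers.
const buckets = new Map<string, TokenBucket>();

function handleRequest(clientId: string): { status: number; headers: Record<string, string> } {
  let bucket = buckets.get(clientId);
  if (!bucket) {
    bucket = new TokenBucket(200, 100); // burst 200, 100 tokens/second (example values)
    buckets.set(clientId, bucket);
  }

  const result = bucket.consume(1);
  const headers: Record<string, string> = {
    'X-RateLimit-Limit': String(result.limit),
    'X-RateLimit-Remaining': String(result.remaining),
    'X-RateLimit-Reset': String(Math.ceil(result.resetAt / 1000)),
  };

  if (!result.allowed) {
    headers['Retry-After'] = String(result.retryAfter);
    return { status: 429, headers };
  }
  return { status: 200, headers };
}
```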
Each token bucket requires minimal memory:
TokenBucket:
tokens: 8 bytes (double)
lastRefillTime: 8 bytes (double)
capacity: 8 bytes (double, or 4 bytes if int)
refillRate: 8 bytes (double)
----------------------------
Total: 32 bytes per bucket
For 10 million clients: 10,000,000 × 32 bytes ≈ 320 MB of bucket state (plus hash-map overhead).
This fits comfortably in memory on a single server, enabling ultra-fast lookups.
In real-world systems, multiple threads or processes handle requests concurrently. A naive implementation has race conditions that allow clients to exceed their limits. Let's build a thread-safe version.
The Race Condition:
Thread A: reads tokens = 1
Thread B: reads tokens = 1
Thread A: tokens >= 1, allows request
Thread B: tokens >= 1, allows request
Thread A: writes tokens = 0
Thread B: writes tokens = 0
Result: Two requests allowed, but only one token was available!
This race condition becomes severe at high concurrency. With 1000 concurrent requests, clients could massively exceed their limits.
```typescript
import { Mutex } from 'async-mutex';

/**
 * Thread-safe Token Bucket using atomic operations
 * Suitable for multi-threaded environments
 */
class AtomicTokenBucket {
  private tokens: number;
  private lastRefillTime: number;
  private readonly mutex = new Mutex();

  constructor(
    private readonly capacity: number,
    private readonly refillRate: number
  ) {
    this.tokens = capacity;
    this.lastRefillTime = Date.now() / 1000;
  }

  /**
   * Thread-safe consume with mutex
   * Ensures atomic read-modify-write operations
   */
  async consume(cost: number = 1): Promise<TokenBucketResult> {
    const release = await this.mutex.acquire();
    try {
      // Critical section: all operations are atomic
      this.refill();

      if (this.tokens >= cost) {
        this.tokens -= cost;
        return {
          allowed: true,
          remaining: Math.floor(this.tokens),
          limit: this.capacity,
          resetAt: this.calculateResetTime(),
          retryAfter: 0
        };
      }

      const retryAfter = (cost - this.tokens) / this.refillRate;
      return {
        allowed: false,
        remaining: 0,
        limit: this.capacity,
        resetAt: this.calculateResetTime(),
        retryAfter: Math.ceil(retryAfter)
      };
    } finally {
      release();
    }
  }

  private refill(): void {
    const now = Date.now() / 1000;
    const elapsed = now - this.lastRefillTime;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefillTime = now;
  }

  private calculateResetTime(): number {
    const tokensNeeded = this.capacity - this.tokens;
    const secondsToFull = tokensNeeded / this.refillRate;
    return Date.now() + (secondsToFull * 1000);
  }
}
```

For distributed systems, we need coordination across nodes. Redis provides atomic operations that enable distributed token buckets with EVAL scripts:
```lua
-- Redis Lua script for atomic token bucket
-- KEYS[1] = bucket key
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = cost (tokens to consume)
-- ARGV[4] = current timestamp (seconds)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

-- Get current bucket state
local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now

-- Calculate tokens to add (time-based refill)
local elapsed = now - lastRefill
local tokensToAdd = elapsed * refillRate
tokens = math.min(capacity, tokens + tokensToAdd)

-- Attempt to consume
local allowed = tokens >= cost
local retryAfter = 0

if allowed then
  tokens = tokens - cost
else
  retryAfter = math.ceil((cost - tokens) / refillRate)
end

-- Update bucket state
redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', now)

-- Set expiry to prevent orphan keys (2x window size or 24h)
redis.call('EXPIRE', key, 86400)

-- Return: allowed (0/1), remaining, limit, retryAfter
return {allowed and 1 or 0, math.floor(tokens), capacity, retryAfter}
```

Redis executes Lua scripts atomically: no other command runs between the start and end of the script. This eliminates race conditions without application-level locking, and because the script runs server-side, each decision costs only a single network round trip. Many production rate limiters, including those described publicly by Stripe and GitHub, use this pattern.
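Calling the script from application code might look like the sketch below, which assumes the ioredis client and that the Lua source above is saved to a token_bucket.lua file; both are assumptions, not part of the script itself.

```typescript
import Redis from 'ioredis';
import { readFileSync } from 'fs';

const redis = new Redis();

// The Lua script above, loaded from disk (file name is an assumption)
const TOKEN_BUCKET_SCRIPT = readFileSync('token_bucket.lua', 'utf8');

async function consumeDistributed(
  clientId: string,
  capacity: number,
  refillRate: number,
  cost = 1
): Promise<{ allowed: boolean; remaining: number; limit: number; retryAfter: number }> {
  const now = Date.now() / 1000;
  // eval: one key (the bucket), then the script's ARGV values in order
  const [allowed, remaining, limit, retryAfter] = (await redis.eval(
    TOKEN_BUCKET_SCRIPT,
    1,
    `ratelimit:${clientId}`,
    capacity,
    refillRate,
    cost,
    now
  )) as number[];

  return { allowed: allowed === 1, remaining, limit, retryAfter };
}
```

In production you would typically register the script once with SCRIPT LOAD and invoke it via EVALSHA so the script body isn't resent on every request.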
The basic token bucket can be enhanced for specific use cases. Let's explore important variations used in production systems.
Often confused, these are related but distinct algorithms:
Token Bucket: tokens accumulate at the refill rate up to the bucket capacity; a request that finds enough tokens is admitted immediately, so short bursts pass through at full speed.
Leaky Bucket: requests enter a queue that drains ("leaks") at a constant rate; bursts are smoothed rather than admitted, at the cost of added queueing delay.
| Metric | Token Bucket | Leaky Bucket |
|---|---|---|
| Bursts | Allowed | Queued/smoothed |
| Latency | Consistent | Increases with burst |
| Implementation | Counter-based | Queue-based |
| Memory | O(1) per client | O(burst) per client |
| Use case | API rate limiting | Traffic shaping |
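For contrast with the table above, here is a minimal queue-based leaky bucket sketch. It is illustrative only; the interval-based drain loop and promise-based API are assumptions, not a production implementation.

```typescript
// Minimal leaky-bucket sketch: requests queue up and are drained at a fixed rate.
// Unlike the token bucket, a burst is delayed rather than admitted immediately.
class LeakyBucket {
  private queue: Array<() => void> = [];

  constructor(
    private readonly maxQueueLength: number, // "burst" becomes queued work: O(burst) memory
    drainRatePerSecond: number
  ) {
    // Release one queued request per leak interval
    setInterval(() => {
      const next = this.queue.shift();
      if (next) next();
    }, 1000 / drainRatePerSecond);
  }

  /**
   * Resolves when the request is allowed to proceed,
   * or rejects immediately if the queue is already full.
   */
  submit(): Promise<void> {
    if (this.queue.length >= this.maxQueueLength) {
      return Promise.reject(new Error('queue full'));
    }
    return new Promise<void>(resolve => this.queue.push(() => resolve()));
  }
}
```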
Some requests are more expensive than others. Weighted token buckets assign different costs:
// Different endpoints consume different token amounts
const costs: Record<string, number> = {
'GET /api/simple': 1, // Cheap, cached
'POST /api/upload': 10, // Expensive, I/O intensive
'GET /api/search': 5, // Moderate, DB query
'POST /api/process': 50, // Very expensive, ML inference
};
function rateLimitRequest(bucket: TokenBucket, endpoint: string) {
const cost = costs[endpoint] || 1;
return bucket.consume(cost);
}
This ensures expensive operations consume proportionally more quota, protecting backend resources equitably.
Complex rate limiting policies often require multi-level token buckets:
Organization Bucket (100,000 tokens/hour)
├── User A Bucket (10,000 tokens/hour)
│ ├── Read Bucket (8,000 tokens/hour)
│ └── Write Bucket (2,000 tokens/hour)
├── User B Bucket (10,000 tokens/hour)
└── User C Bucket (10,000 tokens/hour)
A request must pass through all levels of the hierarchy. Even if User A has tokens remaining, if the organization bucket is empty, the request is denied.
This prevents a single user from exhausting the entire organization's quota, and write-heavy traffic from starving reads within one user's allocation.
```typescript
/**
 * Hierarchical Token Bucket that checks multiple levels
 */
class HierarchicalTokenBucket {
  private buckets: Map<string, TokenBucket> = new Map();

  constructor(private hierarchy: BucketConfig[]) {
    for (const config of hierarchy) {
      this.buckets.set(config.id, new TokenBucket(
        config.capacity,
        config.refillRate
      ));
    }
  }

  /**
   * Consume from all buckets in hierarchy
   * Returns first failure or success if all pass
   */
  consume(bucketIds: string[], cost: number = 1): HierarchicalResult {
    // First pass: check all buckets have capacity
    const statuses: BucketCheck[] = [];

    for (const id of bucketIds) {
      const bucket = this.buckets.get(id);
      if (!bucket) throw new Error(`Unknown bucket: ${id}`);

      const status = bucket.getStatus();
      statuses.push({
        id,
        hasCapacity: status.tokens >= cost,
        tokens: status.tokens
      });
    }

    // Find first bucket without capacity
    const blocker = statuses.find(s => !s.hasCapacity);
    if (blocker) {
      return {
        allowed: false,
        blockedBy: blocker.id,
        remaining: Math.floor(blocker.tokens)
      };
    }

    // Second pass: consume from all buckets atomically
    for (const id of bucketIds) {
      const bucket = this.buckets.get(id)!;
      bucket.tryConsume(cost);
    }

    // Return the most constrained bucket's remaining tokens
    const minRemaining = Math.min(
      ...statuses.map(s => s.tokens - cost)
    );

    return {
      allowed: true,
      remaining: Math.floor(minRemaining)
    };
  }
}

interface BucketConfig {
  id: string;
  capacity: number;
  refillRate: number;
}

interface BucketCheck {
  id: string;
  hasCapacity: boolean;
  tokens: number;
}

interface HierarchicalResult {
  allowed: boolean;
  blockedBy?: string;
  remaining: number;
}
```

Moving from implementation to production requires handling edge cases, monitoring, and operational concerns. Let's address the issues that matter in real-world deployments.
Production token buckets require comprehensive monitoring:
Key Metrics:
rate_limit_allowed_total{client, endpoint} # Counter of allowed requests
rate_limit_denied_total{client, endpoint} # Counter of denied requests
rate_limit_remaining{client} # Gauge of remaining tokens
rate_limit_latency_seconds{quantile} # Histogram of decision time
rate_limit_bucket_count # Number of active buckets
rate_limit_memory_bytes # Memory consumption
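As one way to wire these up, the sketch below assumes the prom-client library; the metric and label names mirror the list above, but the wiring itself is an assumption, not part of any implementation on this page.

```typescript
import { Counter, Histogram } from 'prom-client';

// Counters and a histogram matching the metric names above (prom-client assumed)
const allowedTotal = new Counter({
  name: 'rate_limit_allowed_total',
  help: 'Requests allowed by the rate limiter',
  labelNames: ['client', 'endpoint'],
});

const deniedTotal = new Counter({
  name: 'rate_limit_denied_total',
  help: 'Requests denied by the rate limiter',
  labelNames: ['client', 'endpoint'],
});

const decisionLatency = new Histogram({
  name: 'rate_limit_latency_seconds',
  help: 'Time taken to make a rate limit decision',
  buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01],
});

// Call this after every rate limit decision
function recordDecision(clientId: string, endpoint: string, allowed: boolean, seconds: number): void {
  (allowed ? allowedTotal : deniedTotal).inc({ client: clientId, endpoint });
  decisionLatency.observe(seconds);
}
```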
Alerting Thresholds: alert when a client's denial rate jumps sharply (a sign of abuse or of limits set too low), and when limiter decision latency or memory usage grows beyond expected bounds.
Dashboards: per-client allow/deny rates, the top denied clients, remaining-token distributions, and decision latency over time.
Rate limit configurations change over time. Production systems need:
Dynamic Updates:
interface RateLimitConfig {
id: string;
selector: ClientSelector; // Which clients does this apply to?
capacity: number;
refillRate: number;
version: number; // For tracking config changes
effectiveAt: Date; // When this config activates
}
// Config service watches for changes and updates buckets
class ConfigService {
async updateConfig(config: RateLimitConfig): Promise<void> {
// Validate new config
// Update in config store
// Notify all rate limiter nodes
// Take effect at specified time
}
}
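One detail the outline leaves open is how a node applies a new config to buckets that already exist. The sketch below carries over each client's relative fill level so a capacity change doesn't suddenly grant or revoke a burst; this carry-over policy is an assumption, not something specified above.

```typescript
// Apply a new capacity/rate to an existing bucket, preserving the
// proportion of tokens the client currently has (policy is illustrative).
function applyNewConfig(
  old: { tokens: number; capacity: number },
  next: { capacity: number; refillRate: number }
): TokenBucket {
  const fillRatio = old.capacity > 0 ? old.tokens / old.capacity : 1;
  const carriedTokens = next.capacity * fillRatio;
  return new TokenBucket(next.capacity, next.refillRate, carriedTokens);
}
```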
Gradual Rollout: apply a changed limit to a small slice of clients first, watch the denial rate, then roll it out broadly.
Emergency Overrides: keep an operator path for temporarily raising, lowering, or disabling a specific client's limit during incidents, without waiting for a config deploy.
Some clients generate vastly more traffic than others (viral users, large enterprises). These 'hot keys' can overwhelm a single Redis shard. Mitigation: use consistent hashing to spread hot keys, or implement local rate limiting before distributed checks to reduce load on central storage.
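One way to realize the "spread hot keys" idea is to split a hot client's logical bucket into several sub-buckets, each owning a share of the rate, so requests land on different Redis shards. The shard count, key scheme, and numbers below are assumptions for illustration.

```typescript
// Split one logical limit into N sub-buckets so a hot client's traffic
// spreads across Redis shards. Each sub-bucket gets 1/N of the rate and burst.
function shardedBucketKey(clientId: string, shardCount: number): string {
  const shard = Math.floor(Math.random() * shardCount); // random pick spreads load evenly
  return `ratelimit:${clientId}:${shard}`;
}

// Example: a client allowed 10,000 req/s overall, spread over 8 sub-buckets,
// is checked against one randomly chosen sub-bucket of 1,250 req/s.
const key = shardedBucketKey('enterprise-customer-42', 8);
const perShardRate = 10_000 / 8;
const perShardBurst = 20_000 / 8;
```

The price is approximation: a client whose traffic momentarily concentrates on a few sub-buckets can be throttled early or slip slightly over the limit, which fits the "approximate quota" framing used elsewhere on this page.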
For systems handling millions of requests per second, every microsecond counts. Let's explore advanced optimizations.
```typescript
/**
 * High-performance rate limiter with local caching
 * Provides sub-millisecond decisions with eventual consistency
 */
class OptimizedRateLimiter {
  private localCache: Map<string, CachedDecision> = new Map();
  private localBuckets: Map<string, TokenBucket> = new Map();
  private pendingSyncs: Set<string> = new Set();
  private readonly CACHE_TTL_MS = 100;     // Decision cache lifetime
  private readonly SYNC_INTERVAL_MS = 500; // Distributed sync frequency

  constructor(
    private readonly redis: RedisClient,
    private readonly config: RateLimitConfig // assumed here to also carry a nodeCount field
  ) {
    // Periodic sync with distributed state
    setInterval(() => this.syncWithRedis(), this.SYNC_INTERVAL_MS);
  }

  /**
   * Fast path: check local cache and bucket
   * Slow path: only on cache miss or first request
   */
  async checkLimit(clientId: string): Promise<RateLimitDecision> {
    // Check if we have a cached decision
    const cached = this.localCache.get(clientId);
    if (cached && cached.expiresAt > Date.now()) {
      return cached.decision;
    }

    // Local rate limit check (fast, per-node)
    let localBucket = this.localBuckets.get(clientId);
    if (!localBucket) {
      localBucket = new TokenBucket(
        this.config.capacity / this.config.nodeCount,  // Per-node share
        this.config.refillRate / this.config.nodeCount
      );
      this.localBuckets.set(clientId, localBucket);
    }

    // Quick local check
    const localResult = localBucket.consume();
    if (!localResult.allowed) {
      // Definitely denied - no need for distributed check
      this.cacheDecision(clientId, localResult);
      return localResult;
    }

    // Allowed locally; record for asynchronous sync with Redis
    this.markForSync(clientId);

    // Return optimistic allow
    this.cacheDecision(clientId, localResult);
    return localResult;
  }

  private cacheDecision(clientId: string, decision: RateLimitDecision): void {
    this.localCache.set(clientId, {
      decision,
      expiresAt: Date.now() + this.CACHE_TTL_MS
    });
  }

  private markForSync(clientId: string): void {
    this.pendingSyncs.add(clientId);
  }

  private async syncWithRedis(): Promise<void> {
    if (this.pendingSyncs.size === 0) return;

    const clientIds = Array.from(this.pendingSyncs);
    this.pendingSyncs.clear();

    // Batch update all pending syncs using the Lua script from earlier
    const pipeline = this.redis.pipeline();
    for (const clientId of clientIds) {
      pipeline.evalsha(
        TOKEN_BUCKET_SCRIPT,   // SHA of the loaded Lua script
        1,
        clientId,
        this.config.capacity,
        this.config.refillRate,
        1,                     // cost
        Date.now() / 1000      // current time in seconds
      );
    }
    await pipeline.exec();
  }
}

// In this sketch a decision is simply the TokenBucketResult defined earlier
type RateLimitDecision = TokenBucketResult;

interface CachedDecision {
  decision: RateLimitDecision;
  expiresAt: number;
}
```

Local caching can allow slightly more requests than configured (up to nodeCount × limit in extreme cases). For most APIs, this trade-off is acceptable: the limits are approximate quotas, not strict guarantees. For financial or security-critical rate limiting, use synchronous distributed checks.
The Token Bucket algorithm is the workhorse of rate limiting in production systems. Let's consolidate our understanding:
- A bucket is defined by two parameters, refill rate r and capacity (burst) b, with tokens(t) = min(b, tokens(t₀) + r × (t − t₀)).
- Lazy refill makes every decision O(1) in time, with roughly 32 bytes of state per client.
- Burst size controls how much unused quota a client can bank; the long-term average rate stays bounded by r.
- Concurrency requires atomic read-modify-write: a mutex within a process, a Lua script in Redis for distributed limits.
- Variations such as weighted costs and hierarchical buckets express richer policies without changing the core algorithm.
What's Next:
While the token bucket handles most rate limiting needs, some scenarios require more precise time window enforcement. In the next page, we'll explore the Sliding Window algorithm, which provides exact rate limiting without the boundary issues of fixed windows while maintaining efficiency.
You now have a deep understanding of the Token Bucket algorithm—from mathematical foundations to production implementation. You can implement, tune, and optimize token buckets for any scale. Next, we'll explore the Sliding Window algorithm for scenarios requiring exact time-window rate limiting.