When engineers at AWS, Stripe, and Cloudflare needed to rate limit billions of API requests, they reached for the same algorithm: the Token Bucket. This elegant algorithm, originally designed for network traffic shaping in the 1980s, has become the industry standard for API rate limiting due to its perfect balance of simplicity, efficiency, and flexibility.
The Token Bucket's genius lies in its intuitive metaphor: imagine a bucket that slowly fills with tokens at a steady rate. Each request consumes a token. If the bucket is empty, the request must wait. This simple mental model captures complex behaviors—allowing bursts while enforcing long-term averages—in a single, unified framework.
In this page, we'll master the Token Bucket algorithm from its mathematical foundations to production implementation, exploring the nuances that separate textbook descriptions from battle-tested systems.
By the end of this page, you'll understand: (1) The mathematical model behind token buckets, (2) Implementation with constant-time O(1) operations, (3) How to tune buckets for different traffic patterns, (4) The atomic token bucket for concurrent systems, and (5) Production considerations including edge cases and monitoring.
Before diving into implementation, let's build a robust mental model of how the token bucket works. This understanding will guide every design decision.
The Physical Analogy:
Imagine a physical bucket with these properties:
- It holds at most b tokens (its capacity); once full, newly added tokens overflow and are lost.
- Tokens drip in at a steady rate of r per second.
- Each incoming request removes a token (or several, for expensive requests).
- If there aren't enough tokens, the request is denied or made to wait.
This simple model enables sophisticated behaviors:
| Scenario | Bucket State | Behavior | Result |
|---|---|---|---|
| Steady traffic at rate limit | Tokens consumed as fast as added | Sustainable | All requests allowed |
| Traffic burst after idle | Bucket full, has accumulated tokens | Burst allowed | Smooth handling of spikes |
| Sustained traffic over limit | Bucket drains to zero | Limiting engaged | Excess requests denied |
| Traffic returns to normal | Bucket slowly refills | Recovery | Future bursts possible again |
Key Parameters:
Every token bucket is defined by two fundamental parameters:
Rate (r) — Tokens added per second
Bucket Size/Burst (b) — Maximum tokens the bucket can hold
Rate vs. Burst Trade-off:
The relationship between rate and burst size determines traffic characteristics:
| Configuration | Behavior | Use Case |
|---|---|---|
| Rate: 100/s, Burst: 100 | Very smooth, minimal bursting | Real-time systems |
| Rate: 100/s, Burst: 500 | Absorbs 5s worth of burst | Web APIs |
| Rate: 100/s, Burst: 3600 | Absorbs 36s worth of burst in one shot | Batch processing support |
| Rate: 10/s, Burst: 1 | Strict 1 request per 100ms | Critical resource protection |
A common heuristic: set burst = rate × acceptable_burst_duration. For an API where 2-second bursts are acceptable with a 100/sec rate, use burst = 200. This allows clients to 'bank' 2 seconds of unused quota. Too small a burst creates friction; too large defeats rate limiting.
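As a quick illustration of that heuristic, the sketch below computes burst sizes for a couple of example profiles; the tier names and numbers are illustrative, not taken from any particular system.

```typescript
// burst = rate × acceptable_burst_duration (illustrative numbers)
function burstFor(ratePerSecond: number, acceptableBurstSeconds: number): number {
  return ratePerSecond * acceptableBurstSeconds;
}

const webApiBurst = burstFor(100, 2);    // 200 tokens: clients can bank 2s of quota
const batchApiBurst = burstFor(50, 30);  // 1,500 tokens: tolerant of bursty batch jobs
```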
Understanding the mathematics behind token buckets enables optimized implementation and correct reasoning about edge cases. Let's formalize the algorithm.
Formal Definition:
At any time t, the number of tokens in the bucket is given by:
tokens(t) = min(b, tokens(t₀) + r × (t - t₀))
Where b is the bucket capacity, r is the refill rate in tokens per second, t₀ is the time of the last update, and tokens(t₀) is the token count recorded at that time.
Request Admission:
A request requiring c tokens (cost) at time t is admitted if:
tokens(t) >= c
If admitted, the new token count becomes:
tokens_new = tokens(t) - c
Key Insight: Lazy Evaluation
We don't need to continuously update the token count. We only need to compute it when a request arrives. This is the crucial optimization that makes token buckets efficient.
Time-to-Next-Token:
When a request is denied, we can calculate exactly when the next token will be available:
time_to_token = (c - tokens(t)) / r
This enables the Retry-After header, telling clients exactly when to retry.
Average Rate Guarantee:
Token bucket provides a mathematical guarantee about long-term rate:
Over any time period T, the maximum requests allowed is:
max_requests(T) = b + r × T
For large T, this approaches r × T, so the long-term average rate approaches r. For example, with r = 100/s and b = 500, any 60-second window admits at most 500 + 100 × 60 = 6,500 requests, an average of roughly 108 requests per second.
Burst Characteristics:
After the bucket has been full and idle, a client can issue up to b requests immediately and is then throttled to the refill rate r. At an incoming request rate R, that accumulated burst is exhausted in roughly b / R seconds (ignoring refill during the burst), which is what burstDuration computes in the code below.
```typescript
// Mathematical token bucket computations
interface BucketMath {
  rate: number;       // r: tokens per second
  capacity: number;   // b: bucket size
  tokens: number;     // current tokens
  lastRefill: number; // timestamp of last calculation
}

/**
 * Calculate current tokens at time 't'
 * tokens(t) = min(capacity, tokens(t₀) + rate × (t - t₀))
 */
function tokensAt(bucket: BucketMath, currentTime: number): number {
  const elapsed = currentTime - bucket.lastRefill;
  const tokensToAdd = bucket.rate * elapsed;
  return Math.min(bucket.capacity, bucket.tokens + tokensToAdd);
}

/**
 * Calculate time until 'cost' tokens are available
 * Returns 0 if tokens are already available
 */
function timeToTokens(bucket: BucketMath, cost: number, currentTime: number): number {
  const available = tokensAt(bucket, currentTime);
  if (available >= cost) return 0;
  const tokensNeeded = cost - available;
  return tokensNeeded / bucket.rate;
}

/**
 * Calculate maximum requests allowed over time period T
 * max_requests(T) = capacity + rate × T
 */
function maxRequestsOverTime(bucket: BucketMath, seconds: number): number {
  return bucket.capacity + (bucket.rate * seconds);
}

/**
 * After how many seconds does rate limiting effectively start?
 * (time to exhaust a full bucket at maximum rate)
 */
function burstDuration(bucket: BucketMath, requestRate: number): number {
  return bucket.capacity / requestRate;
}
```

The beauty of the token bucket is that it can be implemented with O(1) time complexity per request: no loops, no complex data structures, just simple arithmetic. Let's build a production-quality implementation.
```typescript
/**
 * Production-grade Token Bucket implementation
 * O(1) time and O(1) space per bucket
 */
class TokenBucket {
  private tokens: number;
  private lastRefillTime: number;

  constructor(
    private readonly capacity: number,   // Maximum tokens (burst size)
    private readonly refillRate: number, // Tokens per second
    initialTokens?: number               // Starting tokens (default: full)
  ) {
    this.tokens = initialTokens ?? capacity;
    this.lastRefillTime = Date.now() / 1000; // Current time in seconds
  }

  /**
   * Attempt to consume 'cost' tokens
   * Returns true if successful, false if insufficient tokens
   */
  tryConsume(cost: number = 1): boolean {
    this.refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }

  /**
   * Consume tokens, returning detailed result
   */
  consume(cost: number = 1): TokenBucketResult {
    this.refill();

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return {
        allowed: true,
        remaining: Math.floor(this.tokens),
        limit: this.capacity,
        resetAt: this.calculateResetTime(),
        retryAfter: 0
      };
    }

    const retryAfter = (cost - this.tokens) / this.refillRate;
    return {
      allowed: false,
      remaining: 0,
      limit: this.capacity,
      resetAt: this.calculateResetTime(),
      retryAfter: Math.ceil(retryAfter)
    };
  }

  /**
   * Refill bucket based on elapsed time
   * This is the key optimization: lazy evaluation
   */
  private refill(): void {
    const now = Date.now() / 1000;
    const elapsed = now - this.lastRefillTime;

    // Calculate tokens to add (may be fractional)
    const tokensToAdd = elapsed * this.refillRate;

    // Update tokens, capped at capacity
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefillTime = now;
  }

  /**
   * Calculate when bucket will be full again
   */
  private calculateResetTime(): number {
    const tokensNeeded = this.capacity - this.tokens;
    const secondsToFull = tokensNeeded / this.refillRate;
    return Date.now() + (secondsToFull * 1000);
  }

  /**
   * Get current bucket status without consuming tokens
   */
  getStatus(): TokenBucketStatus {
    this.refill();
    return {
      tokens: this.tokens,
      capacity: this.capacity,
      refillRate: this.refillRate
    };
  }
}

interface TokenBucketResult {
  allowed: boolean;
  remaining: number;
  limit: number;
  resetAt: number;    // Unix timestamp (milliseconds)
  retryAfter: number; // Seconds until retry
}

interface TokenBucketStatus {
  tokens: number;
  capacity: number;
  refillRate: number;
}
```

Notice we use floating-point for tokens. This allows fractional tokens from partial refills. When reporting to clients, we floor to integers (you can't consume 0.3 of a token), but internally, maintaining precision prevents accumulation errors over time.
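To show how the result maps onto an HTTP response, here is a minimal usage sketch. The per-client bucket map, the 429 handling, and the X-RateLimit-*/Retry-After header names follow common API conventions and are assumptions, not part of the class above.

```typescript
// Minimal usage sketch (illustrative): one bucket per client, with the
// consume result translated into a status code and rate-limit headers.
const buckets = new Map<string, TokenBucket>();

function handleRequest(clientId: string): { status: number; headers: Record<string, string> } {
  let bucket = buckets.get(clientId);
  if (!bucket) {
    bucket = new TokenBucket(200, 100); // burst 200, 100 tokens/second (example values)
    buckets.set(clientId, bucket);
  }

  const result = bucket.consume(1);
  const headers: Record<string, string> = {
    'X-RateLimit-Limit': String(result.limit),
    'X-RateLimit-Remaining': String(result.remaining),
    'X-RateLimit-Reset': String(Math.ceil(result.resetAt / 1000)),
  };

  if (!result.allowed) {
    headers['Retry-After'] = String(result.retryAfter);
    return { status: 429, headers };
  }
  return { status: 200, headers };
}
```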
Each token bucket requires minimal memory:
TokenBucket:
tokens: 8 bytes (double)
lastRefillTime: 8 bytes (double)
capacity: 8 bytes (double, or 4 bytes if int)
refillRate: 8 bytes (double)
----------------------------
Total: 32 bytes per bucket
For 10 million clients: 10,000,000 × 32 bytes ≈ 320 MB of bucket state (plus hash-map overhead).
This fits comfortably in memory on a single server, enabling ultra-fast lookups.
In real-world systems, multiple threads or processes handle requests concurrently. A naive implementation has race conditions that allow clients to exceed their limits. Let's build a thread-safe version.
The Race Condition:
Thread A: reads tokens = 1
Thread B: reads tokens = 1
Thread A: tokens >= 1, allows request
Thread B: tokens >= 1, allows request
Thread A: writes tokens = 0
Thread B: writes tokens = 0
Result: Two requests allowed, but only one token was available!
This race condition becomes severe at high concurrency. With 1000 concurrent requests, clients could massively exceed their limits.
```typescript
import { Mutex } from 'async-mutex';

/**
 * Thread-safe Token Bucket using atomic operations
 * Suitable for multi-threaded environments
 */
class AtomicTokenBucket {
  private tokens: number;
  private lastRefillTime: number;
  private readonly mutex = new Mutex();

  constructor(
    private readonly capacity: number,
    private readonly refillRate: number
  ) {
    this.tokens = capacity;
    this.lastRefillTime = Date.now() / 1000;
  }

  /**
   * Thread-safe consume with mutex
   * Ensures atomic read-modify-write operations
   */
  async consume(cost: number = 1): Promise<TokenBucketResult> {
    const release = await this.mutex.acquire();
    try {
      // Critical section: all operations are atomic
      this.refill();

      if (this.tokens >= cost) {
        this.tokens -= cost;
        return {
          allowed: true,
          remaining: Math.floor(this.tokens),
          limit: this.capacity,
          resetAt: this.calculateResetTime(),
          retryAfter: 0
        };
      }

      const retryAfter = (cost - this.tokens) / this.refillRate;
      return {
        allowed: false,
        remaining: 0,
        limit: this.capacity,
        resetAt: this.calculateResetTime(),
        retryAfter: Math.ceil(retryAfter)
      };
    } finally {
      release();
    }
  }

  private refill(): void {
    const now = Date.now() / 1000;
    const elapsed = now - this.lastRefillTime;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefillTime = now;
  }

  private calculateResetTime(): number {
    const tokensNeeded = this.capacity - this.tokens;
    const secondsToFull = tokensNeeded / this.refillRate;
    return Date.now() + (secondsToFull * 1000);
  }
}
```

For distributed systems, we need coordination across nodes. Redis provides atomic operations that enable distributed token buckets with EVAL scripts:
```lua
-- Redis Lua script for atomic token bucket
-- KEYS[1] = bucket key
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = cost (tokens to consume)
-- ARGV[4] = current timestamp (seconds)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

-- Get current bucket state
local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now

-- Calculate tokens to add (time-based refill)
local elapsed = now - lastRefill
local tokensToAdd = elapsed * refillRate
tokens = math.min(capacity, tokens + tokensToAdd)

-- Attempt to consume
local allowed = tokens >= cost
local retryAfter = 0

if allowed then
  tokens = tokens - cost
else
  retryAfter = math.ceil((cost - tokens) / refillRate)
end

-- Update bucket state
redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', now)

-- Set expiry to prevent orphan keys (2x window size or 24h)
redis.call('EXPIRE', key, 86400)

-- Return: allowed (0/1), remaining, limit, retryAfter
return {allowed and 1 or 0, math.floor(tokens), capacity, retryAfter}
```

Redis executes Lua scripts atomically: no other command runs between the start and end of the script. This eliminates race conditions without application-level locking, and because the script runs server-side, each decision costs only a single network round trip. Many production rate limiters, including those described publicly by Stripe and GitHub, use this pattern.
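Calling the script from application code might look like the sketch below, which assumes the ioredis client and that the Lua source above is saved to a token_bucket.lua file; both are assumptions, not part of the script itself.

```typescript
import Redis from 'ioredis';
import { readFileSync } from 'fs';

const redis = new Redis();

// The Lua script above, loaded from disk (file name is an assumption)
const TOKEN_BUCKET_SCRIPT = readFileSync('token_bucket.lua', 'utf8');

async function consumeDistributed(
  clientId: string,
  capacity: number,
  refillRate: number,
  cost = 1
): Promise<{ allowed: boolean; remaining: number; limit: number; retryAfter: number }> {
  const now = Date.now() / 1000;
  // eval: one key (the bucket), then the script's ARGV values in order
  const [allowed, remaining, limit, retryAfter] = (await redis.eval(
    TOKEN_BUCKET_SCRIPT,
    1,
    `ratelimit:${clientId}`,
    capacity,
    refillRate,
    cost,
    now
  )) as number[];

  return { allowed: allowed === 1, remaining, limit, retryAfter };
}
```

In production you would typically register the script once with SCRIPT LOAD and invoke it via EVALSHA so the script body isn't resent on every request.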
The basic token bucket can be enhanced for specific use cases. Let's explore important variations used in production systems.
Often confused, these are related but distinct algorithms:
Token Bucket: tokens accumulate at the refill rate up to the bucket capacity; a request that finds enough tokens is admitted immediately, so short bursts pass through at full speed.
Leaky Bucket: requests enter a queue that drains ("leaks") at a constant rate; bursts are smoothed rather than admitted, at the cost of added queueing delay.
| Metric | Token Bucket | Leaky Bucket |
|---|---|---|
| Bursts | Allowed | Queued/smoothed |
| Latency | Consistent | Increases with burst |
| Implementation | Counter-based | Queue-based |
| Memory | O(1) per client | O(burst) per client |
| Use case | API rate limiting | Traffic shaping |
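For contrast with the table above, here is a minimal queue-based leaky bucket sketch. It is illustrative only; the interval-based drain loop and promise-based API are assumptions, not a production implementation.

```typescript
// Minimal leaky-bucket sketch: requests queue up and are drained at a fixed rate.
// Unlike the token bucket, a burst is delayed rather than admitted immediately.
class LeakyBucket {
  private queue: Array<() => void> = [];

  constructor(
    private readonly maxQueueLength: number, // "burst" becomes queued work: O(burst) memory
    drainRatePerSecond: number
  ) {
    // Release one queued request per leak interval
    setInterval(() => {
      const next = this.queue.shift();
      if (next) next();
    }, 1000 / drainRatePerSecond);
  }

  /**
   * Resolves when the request is allowed to proceed,
   * or rejects immediately if the queue is already full.
   */
  submit(): Promise<void> {
    if (this.queue.length >= this.maxQueueLength) {
      return Promise.reject(new Error('queue full'));
    }
    return new Promise<void>(resolve => this.queue.push(() => resolve()));
  }
}
```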
Some requests are more expensive than others. Weighted token buckets assign different costs:
// Different endpoints consume different token amounts
const costs: Record<string, number> = {
'GET /api/simple': 1, // Cheap, cached
'POST /api/upload': 10, // Expensive, I/O intensive
'GET /api/search': 5, // Moderate, DB query
'POST /api/process': 50, // Very expensive, ML inference
};
function rateLimitRequest(bucket: TokenBucket, endpoint: string) {
const cost = costs[endpoint] || 1;
return bucket.consume(cost);
}
This ensures expensive operations consume proportionally more quota, protecting backend resources equitably.
Complex rate limiting policies often require multi-level token buckets:
Organization Bucket (100,000 tokens/hour)
├── User A Bucket (10,000 tokens/hour)
│ ├── Read Bucket (8,000 tokens/hour)
│ └── Write Bucket (2,000 tokens/hour)
├── User B Bucket (10,000 tokens/hour)
└── User C Bucket (10,000 tokens/hour)
A request must pass through all levels of the hierarchy. Even if User A has tokens remaining, if the organization bucket is empty, the request is denied.
This prevents a single user from exhausting the entire organization's quota, and write-heavy traffic from starving reads within one user's allocation.
```typescript
/**
 * Hierarchical Token Bucket that checks multiple levels
 */
class HierarchicalTokenBucket {
  private buckets: Map<string, TokenBucket> = new Map();

  constructor(private hierarchy: BucketConfig[]) {
    for (const config of hierarchy) {
      this.buckets.set(config.id, new TokenBucket(
        config.capacity,
        config.refillRate
      ));
    }
  }

  /**
   * Consume from all buckets in hierarchy
   * Returns first failure or success if all pass
   */
  consume(bucketIds: string[], cost: number = 1): HierarchicalResult {
    // First pass: check all buckets have capacity
    const statuses: BucketCheck[] = [];

    for (const id of bucketIds) {
      const bucket = this.buckets.get(id);
      if (!bucket) throw new Error(`Unknown bucket: ${id}`);

      const status = bucket.getStatus();
      statuses.push({
        id,
        hasCapacity: status.tokens >= cost,
        tokens: status.tokens
      });
    }

    // Find first bucket without capacity
    const blocker = statuses.find(s => !s.hasCapacity);
    if (blocker) {
      return {
        allowed: false,
        blockedBy: blocker.id,
        remaining: Math.floor(blocker.tokens)
      };
    }

    // Second pass: consume from all buckets atomically
    for (const id of bucketIds) {
      const bucket = this.buckets.get(id)!;
      bucket.tryConsume(cost);
    }

    // Return the most constrained bucket's remaining tokens
    const minRemaining = Math.min(
      ...statuses.map(s => s.tokens - cost)
    );

    return {
      allowed: true,
      remaining: Math.floor(minRemaining)
    };
  }
}

interface BucketConfig {
  id: string;
  capacity: number;
  refillRate: number;
}

interface BucketCheck {
  id: string;
  hasCapacity: boolean;
  tokens: number;
}

interface HierarchicalResult {
  allowed: boolean;
  blockedBy?: string;
  remaining: number;
}
```

Moving from implementation to production requires handling edge cases, monitoring, and operational concerns. Let's address the issues that matter in real-world deployments.
Production token buckets require comprehensive monitoring:
Key Metrics:
rate_limit_allowed_total{client, endpoint} # Counter of allowed requests
rate_limit_denied_total{client, endpoint} # Counter of denied requests
rate_limit_remaining{client} # Gauge of remaining tokens
rate_limit_latency_seconds{quantile} # Histogram of decision time
rate_limit_bucket_count # Number of active buckets
rate_limit_memory_bytes # Memory consumption
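As one way to wire these up, the sketch below assumes the prom-client library; the metric and label names mirror the list above, but the wiring itself is an assumption, not part of any implementation on this page.

```typescript
import { Counter, Histogram } from 'prom-client';

// Counters and a histogram matching the metric names above (prom-client assumed)
const allowedTotal = new Counter({
  name: 'rate_limit_allowed_total',
  help: 'Requests allowed by the rate limiter',
  labelNames: ['client', 'endpoint'],
});

const deniedTotal = new Counter({
  name: 'rate_limit_denied_total',
  help: 'Requests denied by the rate limiter',
  labelNames: ['client', 'endpoint'],
});

const decisionLatency = new Histogram({
  name: 'rate_limit_latency_seconds',
  help: 'Time taken to make a rate limit decision',
  buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01],
});

// Call this after every rate limit decision
function recordDecision(clientId: string, endpoint: string, allowed: boolean, seconds: number): void {
  (allowed ? allowedTotal : deniedTotal).inc({ client: clientId, endpoint });
  decisionLatency.observe(seconds);
}
```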
Alerting Thresholds: alert when a client's denial rate jumps sharply (a sign of abuse or of limits set too low), and when limiter decision latency or memory usage grows beyond expected bounds.
Dashboards: per-client allow/deny rates, the top denied clients, remaining-token distributions, and decision latency over time.
Rate limit configurations change over time. Production systems need:
Dynamic Updates:
interface RateLimitConfig {
id: string;
selector: ClientSelector; // Which clients does this apply to?
capacity: number;
refillRate: number;
version: number; // For tracking config changes
effectiveAt: Date; // When this config activates
}
// Config service watches for changes and updates buckets
class ConfigService {
async updateConfig(config: RateLimitConfig): Promise<void> {
// Validate new config
// Update in config store
// Notify all rate limiter nodes
// Take effect at specified time
}
}
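One detail the outline leaves open is how a node applies a new config to buckets that already exist. The sketch below carries over each client's relative fill level so a capacity change doesn't suddenly grant or revoke a burst; this carry-over policy is an assumption, not something specified above.

```typescript
// Apply a new capacity/rate to an existing bucket, preserving the
// proportion of tokens the client currently has (policy is illustrative).
function applyNewConfig(
  old: { tokens: number; capacity: number },
  next: { capacity: number; refillRate: number }
): TokenBucket {
  const fillRatio = old.capacity > 0 ? old.tokens / old.capacity : 1;
  const carriedTokens = next.capacity * fillRatio;
  return new TokenBucket(next.capacity, next.refillRate, carriedTokens);
}
```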
Gradual Rollout: apply a changed limit to a small slice of clients first, watch the denial rate, then roll it out broadly.
Emergency Overrides: keep an operator path for temporarily raising, lowering, or disabling a specific client's limit during incidents, without waiting for a config deploy.
Some clients generate vastly more traffic than others (viral users, large enterprises). These 'hot keys' can overwhelm a single Redis shard. Mitigation: use consistent hashing to spread hot keys, or implement local rate limiting before distributed checks to reduce load on central storage.
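One way to realize the "spread hot keys" idea is to split a hot client's logical bucket into several sub-buckets, each owning a share of the rate, so requests land on different Redis shards. The shard count, key scheme, and numbers below are assumptions for illustration.

```typescript
// Split one logical limit into N sub-buckets so a hot client's traffic
// spreads across Redis shards. Each sub-bucket gets 1/N of the rate and burst.
function shardedBucketKey(clientId: string, shardCount: number): string {
  const shard = Math.floor(Math.random() * shardCount); // random pick spreads load evenly
  return `ratelimit:${clientId}:${shard}`;
}

// Example: a client allowed 10,000 req/s overall, spread over 8 sub-buckets,
// is checked against one randomly chosen sub-bucket of 1,250 req/s.
const key = shardedBucketKey('enterprise-customer-42', 8);
const perShardRate = 10_000 / 8;
const perShardBurst = 20_000 / 8;
```

The price is approximation: a client whose traffic momentarily concentrates on a few sub-buckets can be throttled early or slip slightly over the limit, which fits the "approximate quota" framing used elsewhere on this page.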
For systems handling millions of requests per second, every microsecond counts. Let's explore advanced optimizations.
```typescript
/**
 * High-performance rate limiter with local caching
 * Provides sub-millisecond decisions with eventual consistency
 */
class OptimizedRateLimiter {
  private localCache: Map<string, CachedDecision> = new Map();
  private localBuckets: Map<string, TokenBucket> = new Map();
  private pendingSyncs: Set<string> = new Set();
  private readonly CACHE_TTL_MS = 100;     // Decision cache lifetime
  private readonly SYNC_INTERVAL_MS = 500; // Distributed sync frequency

  constructor(
    private readonly redis: RedisClient,
    private readonly config: RateLimitConfig // assumed here to also carry a nodeCount field
  ) {
    // Periodic sync with distributed state
    setInterval(() => this.syncWithRedis(), this.SYNC_INTERVAL_MS);
  }

  /**
   * Fast path: check local cache and bucket
   * Slow path: only on cache miss or first request
   */
  async checkLimit(clientId: string): Promise<RateLimitDecision> {
    // Check if we have a cached decision
    const cached = this.localCache.get(clientId);
    if (cached && cached.expiresAt > Date.now()) {
      return cached.decision;
    }

    // Local rate limit check (fast, per-node)
    let localBucket = this.localBuckets.get(clientId);
    if (!localBucket) {
      localBucket = new TokenBucket(
        this.config.capacity / this.config.nodeCount,  // Per-node share
        this.config.refillRate / this.config.nodeCount
      );
      this.localBuckets.set(clientId, localBucket);
    }

    // Quick local check
    const localResult = localBucket.consume();
    if (!localResult.allowed) {
      // Definitely denied - no need for distributed check
      this.cacheDecision(clientId, localResult);
      return localResult;
    }

    // Allowed locally; record for asynchronous sync with Redis
    this.markForSync(clientId);

    // Return optimistic allow
    this.cacheDecision(clientId, localResult);
    return localResult;
  }

  private cacheDecision(clientId: string, decision: RateLimitDecision): void {
    this.localCache.set(clientId, {
      decision,
      expiresAt: Date.now() + this.CACHE_TTL_MS
    });
  }

  private markForSync(clientId: string): void {
    this.pendingSyncs.add(clientId);
  }

  private async syncWithRedis(): Promise<void> {
    if (this.pendingSyncs.size === 0) return;

    const clientIds = Array.from(this.pendingSyncs);
    this.pendingSyncs.clear();

    // Batch update all pending syncs using the Lua script from earlier
    const pipeline = this.redis.pipeline();
    for (const clientId of clientIds) {
      pipeline.evalsha(
        TOKEN_BUCKET_SCRIPT,   // SHA of the loaded Lua script
        1,
        clientId,
        this.config.capacity,
        this.config.refillRate,
        1,                     // cost
        Date.now() / 1000      // current time in seconds
      );
    }
    await pipeline.exec();
  }
}

// In this sketch a decision is simply the TokenBucketResult defined earlier
type RateLimitDecision = TokenBucketResult;

interface CachedDecision {
  decision: RateLimitDecision;
  expiresAt: number;
}
```

Local caching can allow slightly more requests than configured (up to nodeCount × limit in extreme cases). For most APIs, this trade-off is acceptable: the limits are approximate quotas, not strict guarantees. For financial or security-critical rate limiting, use synchronous distributed checks.
The Token Bucket algorithm is the workhorse of rate limiting in production systems. Let's consolidate our understanding:
- A bucket is defined by two parameters, refill rate r and capacity (burst) b, with tokens(t) = min(b, tokens(t₀) + r × (t − t₀)).
- Lazy refill makes every decision O(1) in time, with roughly 32 bytes of state per client.
- Burst size controls how much unused quota a client can bank; the long-term average rate stays bounded by r.
- Concurrency requires atomic read-modify-write: a mutex within a process, a Lua script in Redis for distributed limits.
- Variations such as weighted costs and hierarchical buckets express richer policies without changing the core algorithm.
What's Next:
While the token bucket handles most rate limiting needs, some scenarios require more precise time window enforcement. In the next page, we'll explore the Sliding Window algorithm, which provides exact rate limiting without the boundary issues of fixed windows while maintaining efficiency.
You now have a deep understanding of the Token Bucket algorithm—from mathematical foundations to production implementation. You can implement, tune, and optimize token buckets for any scale. Next, we'll explore the Sliding Window algorithm for scenarios requiring exact time-window rate limiting.