Imagine you're managing a popular concert venue. You want to control the flow of people entering, but you also understand that people arrive in groups—not one by one in perfect intervals. A rigid "exactly one person per second" rule would create frustrating queues, while "unlimited entry" would overwhelm your venue.
The solution? Give each person a token at the door. Tokens accumulate up to a maximum (the bucket size), and each person entering consumes one token. If tokens are available, entry is immediate. If the bucket is empty, wait for more tokens to accumulate.
This is the Token Bucket Algorithm—and it has been the industry standard for rate limiting for decades. Used by AWS, Google Cloud, Stripe, and virtually every major API provider, it elegantly balances burst accommodation with sustained rate enforcement.
By the end of this page, you will understand the token bucket algorithm's mechanics, mathematical properties, implementation patterns, and practical considerations. You'll know how to configure buckets for your specific use cases and understand the algorithm's behavior under various traffic patterns.
The token bucket algorithm operates on a beautifully simple principle: tokens are added to a bucket at a constant rate, and each request consumes tokens from the bucket. Let's formalize the mechanics.
The Algorithm:
When a request arrives:
1. Compute the tokens accumulated since the last update: `elapsed_time × refill_rate`
2. Update the bucket, capping at capacity: `tokens = min(tokens + accumulated, bucket_size)`
3. If `tokens >= cost`, allow the request and deduct `tokens -= cost`; otherwise, reject it.

This lazy evaluation approach is efficient—we only update the bucket when requests arrive, not continuously.
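The steps above can be sketched in a few lines. This is a minimal illustration (names like `makeBucket` are ours, not a standard API) that takes the current time as a parameter so its behavior is deterministic; a fuller, production-oriented implementation appears later on this page.

```typescript
// Minimal token bucket: state is just a token count and a timestamp.
// Time (in seconds) is passed in so the behavior is easy to test.
function makeBucket(bucketSize: number, refillRate: number, start: number) {
  let tokens = bucketSize; // start with a full bucket
  let lastRefill = start;

  return {
    tryConsume(now: number, cost = 1): boolean {
      // Step 1: tokens accumulated since the last update
      const accumulated = (now - lastRefill) * refillRate;
      // Step 2: add them, capped at the bucket size
      tokens = Math.min(bucketSize, tokens + accumulated);
      lastRefill = now;
      // Step 3: consume if enough tokens remain, otherwise reject
      if (tokens >= cost) {
        tokens -= cost;
        return true;
      }
      return false;
    },
  };
}
```

Note that the bucket only needs two numbers of state, and all the work happens inside `tryConsume`—there is no background process.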
Understanding the mathematical properties of the token bucket helps you configure it correctly and predict its behavior under various conditions.
| Property | Formula | Interpretation |
|---|---|---|
| Maximum burst | b (bucket size) | Maximum requests that can be processed instantaneously when bucket is full |
| Sustained rate | r (refill rate) | Long-term average rate the system can sustain |
| Time to refill | b / r | Time to refill an empty bucket to full |
| Burst interval | b / r | Minimum time between maximum bursts |
| Tokens after time T | min(b, t₀ + r×T) | Where t₀ is initial tokens; capped at bucket size |
| Max requests in time T | b + r×T | Initial burst plus sustained rate over time |
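The formulas in the table are easy to sanity-check in code. Here is a quick sketch using assumed values of b = 100 and r = 10 (the same configuration used in the implementation example on this page):

```typescript
// Sanity-checking the token bucket formulas with b = 100, r = 10.
const b = 100; // bucket size (maximum burst)
const r = 10;  // refill rate (tokens per second)

// Time to refill an empty bucket to full: b / r
const refillTime = b / r; // seconds

// Tokens after T seconds, starting from t0 tokens: min(b, t0 + r*T)
const tokensAfter = (t0: number, T: number) => Math.min(b, t0 + r * T);

// Max requests in T seconds, starting from a full bucket: b + r*T
const maxRequests = (T: number) => b + r * T;
```

For example, an empty bucket reaches 30 tokens after 3 seconds, and a client starting with a full bucket can get at most 700 requests through in a 60-second window.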
A larger bucket allows bigger bursts but also means longer recovery time after a burst. A smaller bucket provides tighter control but may reject legitimate bursty traffic. The key is matching bucket size to expected legitimate burst patterns—like page loads that trigger multiple API calls.
Example Scenario:

Configure a bucket with:
- Bucket size b = 100 tokens
- Refill rate r = 10 tokens/second

Behavior:
- A full bucket absorbs an instantaneous burst of 100 requests.
- After the burst, traffic is throttled to the sustained rate of 10 requests/second.
- An empty bucket takes b / r = 10 seconds to refill completely, so any window of T seconds that starts with a full bucket admits at most 100 + 10×T requests.
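This burst-then-throttle behavior can be simulated directly. The sketch below (assuming a bucket of 100 tokens refilling at 10/second, with simulated time so the run is deterministic) shows a client firing 150 requests at once, then retrying a second later:

```typescript
// Simulate a burst against a token bucket (bucketSize = 100, refillRate = 10/s).
const bucketSize = 100;
const refillRate = 10;
let tokens = bucketSize;
let lastRefill = 0;

function tryConsume(now: number): boolean {
  tokens = Math.min(bucketSize, tokens + (now - lastRefill) * refillRate);
  lastRefill = now;
  if (tokens >= 1) {
    tokens -= 1;
    return true;
  }
  return false;
}

// 150 requests arrive at t = 0: the first 100 drain the bucket, the rest fail.
let allowed = 0;
for (let i = 0; i < 150; i++) if (tryConsume(0)) allowed++;

// By t = 1, ten tokens have accumulated, so ten more requests succeed
// before the bucket runs dry again.
let allowedLater = 0;
for (let i = 0; i < 20; i++) if (tryConsume(1)) allowedLater++;
```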
Let's build a production-ready token bucket implementation. We'll start with a single-node version, then discuss distributed considerations.
```typescript
interface TokenBucketConfig {
  bucketSize: number;       // Maximum tokens (burst capacity)
  refillRate: number;       // Tokens per second
  costPerRequest?: number;  // Default: 1
}

interface BucketState {
  tokens: number;
  lastRefillTime: number; // Unix timestamp in ms
}

class TokenBucket {
  private config: Required<TokenBucketConfig>;
  private state: BucketState;

  constructor(config: TokenBucketConfig) {
    this.config = {
      ...config,
      costPerRequest: config.costPerRequest ?? 1,
    };
    // Start with a full bucket
    this.state = {
      tokens: this.config.bucketSize,
      lastRefillTime: Date.now(),
    };
  }

  /**
   * Attempt to consume tokens for a request.
   * Returns true if allowed, false if rate limited.
   */
  tryConsume(cost: number = this.config.costPerRequest): boolean {
    this.refill();
    if (this.state.tokens >= cost) {
      this.state.tokens -= cost;
      return true;
    }
    return false;
  }

  /**
   * Get current bucket state (for rate limit headers)
   */
  getState(): { remaining: number; resetTime: number } {
    this.refill();
    const timeToFullRefill =
      (this.config.bucketSize - this.state.tokens) / this.config.refillRate;
    return {
      remaining: Math.floor(this.state.tokens),
      resetTime: Date.now() + timeToFullRefill * 1000,
    };
  }

  /**
   * Refill tokens based on elapsed time (lazy evaluation)
   */
  private refill(): void {
    const now = Date.now();
    const elapsedSeconds = (now - this.state.lastRefillTime) / 1000;
    const tokensToAdd = elapsedSeconds * this.config.refillRate;
    this.state.tokens = Math.min(
      this.config.bucketSize,
      this.state.tokens + tokensToAdd
    );
    this.state.lastRefillTime = now;
  }
}

// Usage example
const limiter = new TokenBucket({
  bucketSize: 100, // Allow burst of 100 requests
  refillRate: 10,  // Sustain 10 requests/second
});

function handleRequest(req: Request): Response {
  if (!limiter.tryConsume()) {
    const state = limiter.getState();
    return new Response(
      JSON.stringify({
        error: "rate_limit_exceeded",
        retry_after_ms: state.resetTime - Date.now(),
      }),
      {
        status: 429,
        headers: {
          "X-RateLimit-Remaining": String(state.remaining),
          "Retry-After": String(Math.ceil((state.resetTime - Date.now()) / 1000)),
        },
      }
    );
  }
  // Process request normally...
  return processRequest(req);
}
```

Notice we don't run a background timer to add tokens. Instead, we calculate accumulated tokens when a request arrives. This lazy evaluation is more efficient and avoids timer overhead, especially when managing millions of buckets (one per user).
In practice, you need separate buckets for different clients—identified by API key, user ID, IP address, or other attributes. This requires managing a collection of buckets.
```typescript
interface RateLimiterConfig {
  bucketSize: number;
  refillRate: number;
  keyExtractor: (req: Request) => string;
  cleanupIntervalMs?: number; // How often to purge stale buckets
  bucketTTLMs?: number;       // How long to keep idle buckets
}

class RateLimiterManager {
  private buckets: Map<string, { bucket: TokenBucket; lastAccess: number }>;
  private config: Required<RateLimiterConfig>;
  private cleanupInterval: NodeJS.Timeout;

  constructor(config: RateLimiterConfig) {
    this.config = {
      ...config,
      cleanupIntervalMs: config.cleanupIntervalMs ?? 60000, // 1 minute
      bucketTTLMs: config.bucketTTLMs ?? 3600000,           // 1 hour
    };
    this.buckets = new Map();

    // Periodic cleanup to prevent memory leaks
    this.cleanupInterval = setInterval(
      () => this.cleanup(),
      this.config.cleanupIntervalMs
    );
  }

  tryConsume(req: Request, cost: number = 1): boolean {
    const key = this.config.keyExtractor(req);
    const entry = this.getOrCreateBucket(key);
    entry.lastAccess = Date.now();
    return entry.bucket.tryConsume(cost);
  }

  private getOrCreateBucket(key: string) {
    let entry = this.buckets.get(key);
    if (!entry) {
      entry = {
        bucket: new TokenBucket({
          bucketSize: this.config.bucketSize,
          refillRate: this.config.refillRate,
        }),
        lastAccess: Date.now(),
      };
      this.buckets.set(key, entry);
    }
    return entry;
  }

  private cleanup(): void {
    const now = Date.now();
    for (const [key, entry] of this.buckets) {
      if (now - entry.lastAccess > this.config.bucketTTLMs) {
        this.buckets.delete(key);
      }
    }
  }

  destroy(): void {
    clearInterval(this.cleanupInterval);
  }
}

// Usage: Rate limit by API key
const apiKeyLimiter = new RateLimiterManager({
  bucketSize: 1000,
  refillRate: 100,
  keyExtractor: (req) => req.headers.get("X-API-Key") ?? "anonymous",
});

// Usage: Rate limit by IP address
const ipLimiter = new RateLimiterManager({
  bucketSize: 60,
  refillRate: 1,
  keyExtractor: (req) =>
    req.headers.get("X-Forwarded-For")?.split(",")[0] ?? "unknown",
});
```

Without cleanup, bucket maps grow unbounded. If you have millions of unique clients, you'll exhaust memory. The cleanup mechanism purges buckets that haven't been accessed recently—they'll be recreated fresh if the client returns.
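TTL-based purging is one strategy; another is to cap the number of buckets and evict the least-recently-used entry when the cap is reached, which bounds memory regardless of traffic patterns. Here is a sketch of that alternative (the `LruBuckets` class and `maxBuckets` setting are illustrative, not part of the manager above), relying on the fact that a JavaScript `Map` iterates in insertion order:

```typescript
// LRU-style bucket store: re-inserting a key on access moves it to the end
// of the Map's iteration order, so the first key is always the stalest.
class LruBuckets<V> {
  private entries = new Map<string, V>();

  constructor(private maxBuckets: number) {}

  get(key: string, create: () => V): V {
    let value = this.entries.get(key);
    if (value !== undefined) {
      this.entries.delete(key); // refresh recency
    } else {
      value = create();
      if (this.entries.size >= this.maxBuckets) {
        // Evict the least-recently-used (first) entry
        const oldest = this.entries.keys().next().value!;
        this.entries.delete(oldest);
      }
    }
    this.entries.set(key, value);
    return value;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

The trade-off: evicting an active client's bucket effectively grants it a fresh, full bucket on its next request, so the cap should be sized well above the expected number of concurrently active clients.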
Not all API endpoints are equal. A simple health check consumes far fewer resources than a complex search query or data export. Token buckets support this naturally through variable token costs.
| Endpoint | Token Cost | Rationale |
|---|---|---|
| GET /health | 0 | No-cost monitoring endpoints |
| GET /users/:id | 1 | Simple database lookup |
| GET /users | 5 | Paginated list query |
| POST /search | 10 | Full-text search across indexes |
| POST /export | 50 | Expensive data export operation |
| POST /batch | N | Dynamic cost based on batch size |
```typescript
// Define endpoint costs
const ENDPOINT_COSTS: Record<string, number> = {
  "GET:/health": 0,
  "GET:/users/:id": 1,
  "GET:/users": 5,
  "POST:/search": 10,
  "POST:/export": 50,
};

function getEndpointCost(method: string, path: string): number {
  // Normalize path (e.g., /users/123 -> /users/:id)
  const normalizedPath = normalizePath(path);
  const key = `${method}:${normalizedPath}`;
  return ENDPOINT_COSTS[key] ?? 1; // Default cost of 1
}

function handleRequest(req: Request): Response {
  const cost = getEndpointCost(req.method, req.url);

  // Health checks bypass rate limiting entirely
  if (cost === 0) {
    return processRequest(req);
  }

  if (!limiter.tryConsume(req, cost)) {
    return rateLimitResponse(req);
  }

  return processRequest(req);
}

// For batch endpoints, calculate dynamic cost
function handleBatchRequest(req: Request): Response {
  const items = req.body.items;
  const cost = Math.min(items.length, 100); // Cap at 100 to prevent abuse

  if (!limiter.tryConsume(req, cost)) {
    return rateLimitResponse(req);
  }

  return processBatch(req);
}
```

Profile your endpoints to understand actual resource consumption. A search endpoint that queries Elasticsearch and processes results might consume 10x the resources of a simple GET. Your cost weights should reflect this reality.
The token bucket is often confused with the leaky bucket algorithm. While related, they have distinct behaviors that suit different use cases.
Key Difference:
The token bucket controls the average rate while allowing bursts. The leaky bucket controls the instantaneous rate, smoothing bursts into a constant flow.
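To make the contrast concrete, here is a minimal leaky-bucket sketch (illustrative names, with simulated time for determinism). Instead of granting tokens, it models a fixed-capacity queue that drains at a constant rate—accepted requests wait their turn rather than being served in a burst:

```typescript
// Leaky bucket as a queue: requests join a FIFO queue (capacity = queueSize)
// that drains at a fixed drainRate, smoothing bursts into a steady flow.
class LeakyBucket {
  private queueDepth = 0;
  private lastDrain = 0;

  constructor(private queueSize: number, private drainRate: number) {}

  // Returns true if the request fits in the queue (to be processed later,
  // at the constant drain rate); false if the queue overflows.
  offer(now: number): boolean {
    const drained = (now - this.lastDrain) * this.drainRate;
    this.queueDepth = Math.max(0, this.queueDepth - drained);
    this.lastDrain = now;
    if (this.queueDepth < this.queueSize) {
      this.queueDepth += 1;
      return true;
    }
    return false;
  }
}
```

Note the behavioral difference: a token bucket with capacity 5 would *serve* 5 burst requests immediately, while this leaky bucket merely *admits* 5 requests into a queue and releases them one at a time at `drainRate`.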
For API rate limiting, the token bucket is almost always preferred because:

- Real clients are naturally bursty—a single page load can trigger a dozen API calls at once, and the bucket absorbs that without queuing or rejecting them.
- It still enforces the long-term average rate precisely, keeping capacity planning simple.
- It requires only constant state per client (a token count and a timestamp) and constant work per request.
The token bucket algorithm is the workhorse of API rate limiting. Let's consolidate what we've learned:

- Tokens refill at a constant rate r up to a maximum of b; each request consumes tokens, so b bounds the burst size and r bounds the sustained average rate.
- Lazy refill on request arrival avoids background timers and scales to millions of per-client buckets.
- Per-client bucket collections need a cleanup strategy, or memory grows without bound.
- Variable token costs let a single limiter reflect the real resource cost of each endpoint.
- Unlike the leaky bucket, the token bucket accommodates bursts while still enforcing the average rate.
What's Next:
The token bucket is excellent for controlling average rates with burst accommodation. But what if you need more precise time-window control? The next page explores the Sliding Window Algorithm—an approach that provides smoother rate limiting without the "boundary problem" of fixed windows.
You now have a deep understanding of the token bucket algorithm and can implement it in production. This knowledge forms the foundation for more advanced rate limiting strategies we'll explore next.