Every system has limits. Databases can handle only so many connections. Compute resources are finite. Network bandwidth is constrained. When clients exceed these limits—whether through legitimate traffic spikes, misbehaving applications, or malicious attacks—systems fail.
Rate limiting is the mechanism that protects your system from being overwhelmed. It controls how many requests a client can make in a given time period, ensuring fair access for all users while protecting backend services from abuse.
At the API Gateway, rate limiting is implemented at the edge—before requests reach your services. This is the most efficient place to reject excess traffic, preventing wasted resources on requests that would eventually fail anyway. This page provides a comprehensive exploration of rate limiting strategies, algorithms, and implementation patterns for production-grade systems.
By the end of this page, you will understand rate limiting algorithms (token bucket, leaky bucket, sliding window), distributed rate limiting challenges, client identification strategies, response handling best practices, and how to design rate limiting policies that balance protection with user experience.
Rate limiting serves multiple purposes, from operational protection to business policy enforcement.
Protection Goals: keep backend services healthy under traffic spikes, contain misbehaving applications, and blunt malicious traffic such as denial-of-service attacks.
Business Goals: enforce usage tiers and quotas, support monetization of API access, and guarantee fair usage across customers.
Rejecting a request at the gateway costs microseconds and trivial resources. Rejecting it after authentication, authorization, and partial processing costs orders of magnitude more. Rate limiting at the edge is the most efficient protection.
Several algorithms exist for implementing rate limiting, each with different characteristics. Understanding these algorithms helps you choose the right one for your use case.
Algorithm Comparison:
| Algorithm | How It Works | Pros | Cons |
|---|---|---|---|
| Token Bucket | Tokens added at fixed rate; each request consumes token | Allows bursts, smooth average rate | Memory for bucket state per client |
| Leaky Bucket | Requests enter queue; processed at fixed rate | Smooth output rate, no bursts | Latency in queue; may drop requests |
| Fixed Window | Count requests in fixed time windows (e.g., per minute) | Simple, low memory | Burst at window edges |
| Sliding Window Log | Track timestamp of each request; count in sliding window | Accurate, no edge bursts | Memory intensive (stores all timestamps) |
| Sliding Window Counter | Weighted average of current and previous window | Memory efficient, smooth | Slight approximation |
```typescript
// Token Bucket Algorithm
interface TokenBucket {
  tokens: number;
  lastRefill: number;  // Timestamp
  capacity: number;    // Maximum tokens
  refillRate: number;  // Tokens per second
}

function checkRateLimit(bucket: TokenBucket): boolean {
  const now = Date.now();
  const elapsed = (now - bucket.lastRefill) / 1000;

  // Refill tokens based on elapsed time
  bucket.tokens = Math.min(
    bucket.capacity,
    bucket.tokens + elapsed * bucket.refillRate
  );
  bucket.lastRefill = now;

  // Check if we have a token to consume
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true;  // Request allowed
  }
  return false;   // Rate limited
}

// Example: 100 requests/minute with burst capacity of 20
const bucket: TokenBucket = {
  tokens: 20,            // Initial burst capacity
  lastRefill: Date.now(),
  capacity: 20,          // Max burst
  refillRate: 100 / 60,  // ~1.67 tokens per second
};
```

Token bucket is the most commonly used algorithm because it naturally allows short bursts (improving user experience) while maintaining a consistent average rate. It's used by AWS API Gateway, Kong, and most commercial API gateways.
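The sliding window counter from the comparison table can be sketched just as compactly. The version below is illustrative (the interface and field names are assumptions, not a specific gateway's implementation): it estimates the count in the sliding window by weighting the previous window's count by how much of it still overlaps.

```typescript
// Sliding window counter: weight the previous window's count by its overlap
// with the sliding window, then add the current window's count.
interface SlidingWindowCounter {
  windowMs: number;      // e.g. 60_000 for a per-minute limit
  currentStart: number;  // start timestamp of the current window
  currentCount: number;
  previousCount: number;
}

function allowRequest(w: SlidingWindowCounter, limit: number): boolean {
  const now = Date.now();

  // Roll the window forward if the current one has expired
  if (now - w.currentStart >= w.windowMs) {
    const windowsPassed = Math.floor((now - w.currentStart) / w.windowMs);
    w.previousCount = windowsPassed === 1 ? w.currentCount : 0;
    w.currentCount = 0;
    w.currentStart += windowsPassed * w.windowMs;
  }

  // Fraction of the previous window still inside the sliding window
  const overlap = 1 - (now - w.currentStart) / w.windowMs;
  const estimated = w.previousCount * overlap + w.currentCount;

  if (estimated < limit) {
    w.currentCount += 1;
    return true;  // Request allowed
  }
  return false;   // Rate limited
}
```

The approximation assumes requests were evenly spread across the previous window, which is why the table lists it as memory efficient but slightly inexact.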
In production, you don't have one gateway instance—you have many, often across multiple data centers. Each instance must enforce the same rate limits, which requires shared state.
The Challenge:
Imagine a client with a limit of 100 requests/minute hitting a system with 10 gateway instances. If each instance tracks limits locally, the client could make 1000 requests/minute (100 per instance). This defeats the purpose entirely.
Solutions: centralize counters in a shared store such as Redis (the most common approach, shown below), or keep approximate local counters that sync in the background, trading some accuracy for lower latency.
```yaml
# Kong rate limiting with Redis backend
plugins:
  - name: rate-limiting
    config:
      # Rate limit: 100 requests per minute
      minute: 100

      # Use Redis for distributed state
      policy: redis
      redis_host: redis-cluster.internal
      redis_port: 6379
      redis_password: ${REDIS_PASSWORD}
      redis_database: 0
      redis_timeout: 2000

      # How to identify clients
      limit_by: consumer  # or: ip, header, credential

      # Fault tolerance: if Redis is down
      fault_tolerant: true  # Allow requests if Redis unavailable

      # Include rate limit headers in response
      hide_client_headers: false

      # Redis connection pooling
      redis_pool_size: 10

---
# Per-consumer override
plugins:
  - name: rate-limiting
    consumer: premium-client
    config:
      minute: 10000
      policy: redis
```

Every rate limit check adds latency. With Redis, expect 0.5-2ms per check. For high-throughput systems, consider local caching with background sync, or accept slightly less accurate limits for lower latency. Always measure the impact on P99 latency.
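Under the hood, a distributed check boils down to an atomic counter in the shared store. Here is a minimal sketch of a fixed-window check using the ioredis client; the key scheme and client setup are assumptions for illustration, not Kong's internal implementation.

```typescript
import Redis from "ioredis";

const redis = new Redis({ host: "redis-cluster.internal", port: 6379 });

// Fixed-window counter shared by every gateway instance.
// All instances increment the same key, so the limit holds globally.
async function isAllowed(
  clientId: string,
  limit: number,
  windowSec: number
): Promise<boolean> {
  const windowStart = Math.floor(Date.now() / 1000 / windowSec);
  const key = `ratelimit:${clientId}:${windowStart}`; // hypothetical key scheme

  // INCR is atomic; the first increment in a window also sets the expiry
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, windowSec);
  }
  return count <= limit;
}
```

A production implementation would typically wrap the increment and expiry in a Lua script or MULTI/EXEC so they are atomic, and decide what happens when Redis is unreachable (the fault_tolerant behavior above).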
Rate limits are per-client, but who is the 'client'? Identifying clients correctly is crucial—too granular and you waste resources, too coarse and bad actors affect legitimate users.
Identification Methods:
| Method | How to Identify | Best For | Limitations |
|---|---|---|---|
| IP Address | Request source IP | Unauthenticated traffic, basic protection | NAT, proxies can share IPs; easy to rotate IPs |
| API Key | Header: X-API-Key | Partner APIs, machine clients | Keys can be shared; revocation overhead |
| User ID | From authenticated token | Per-user limits, fair usage | Requires authentication first |
| Tenant ID | From JWT or header | Multi-tenant SaaS | Shared tenant users affected together |
| Fingerprint | Combination of IP + User-Agent + etc. | Bot detection, abuse prevention | Sophisticated actors can rotate fingerprints |
| Custom Header | X-Client-ID or similar | Flexible identification | Clients must cooperate; can be spoofed |
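In practice a gateway resolves identity in priority order, falling back to coarser identifiers when the more specific ones are absent. A hedged sketch follows; the request shape and lowercase header keys are assumptions for illustration.

```typescript
interface IncomingRequest {
  headers: Record<string, string | undefined>; // assumed lowercased header names
  sourceIp: string;
  jwtClaims?: { sub?: string; tenant_id?: string };
}

// Resolve the rate-limit key from the most specific identity available.
function rateLimitKey(req: IncomingRequest): string {
  const apiKey = req.headers["x-api-key"];
  if (apiKey) return `key:${apiKey}`;                          // Partner / machine clients

  if (req.jwtClaims?.sub) return `user:${req.jwtClaims.sub}`;  // Authenticated users

  const clientId = req.headers["x-client-id"];
  if (clientId) return `client:${clientId}`;                   // Cooperative custom header

  return `ip:${req.sourceIp}`;                                 // Fallback: unauthenticated traffic
}
```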
```yaml
# Multi-level rate limiting strategy
# Apply limits from most specific to least specific
rate_limits:
  # Level 1: Global protection (all traffic)
  - scope: global
    limit: 100000
    window: 1s
    action: reject

  # Level 2: Per-IP (unauthenticated traffic)
  - scope: ip
    identify_by: source_ip
    limit: 100
    window: 1m
    action: reject

  # Level 3: Per-API-Key (authenticated partners)
  - scope: api_key
    identify_by: header:X-API-Key
    default_limit: 1000
    window: 1m
    overrides:
      "key_premium_123": 10000
      "key_trial_456": 100

  # Level 4: Per-User (authenticated users)
  - scope: user
    identify_by: jwt:sub
    default_limit: 100
    window: 1m

  # Level 5: Per-Endpoint (expensive operations)
  - scope: endpoint
    match: POST:/api/export
    identify_by: jwt:sub
    limit: 5
    window: 1h
    message: "Export limit reached. Try again in {retry_after}."

# Request must pass ALL applicable levels
```

Apply rate limits at multiple levels: global (system protection), per-IP (unauthenticated protection), per-client (fair usage), and per-endpoint (resource protection). A well-designed system has all of these working together.
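Evaluating such a policy amounts to checking every applicable level and rejecting on the first one that is exhausted. A minimal sketch of that loop (the Limiter interface and return shape are illustrative assumptions, not a particular gateway's API):

```typescript
// A request must pass ALL applicable levels; the first exhausted level rejects it.
interface Limiter<Req> {
  scope: string;                          // e.g. "global", "ip", "user", "endpoint"
  appliesTo(req: Req): boolean;           // does this level match the request?
  tryConsume(req: Req): Promise<boolean>; // true if the request is under the limit
}

async function checkAllLevels<Req>(
  req: Req,
  limiters: Limiter<Req>[]
): Promise<{ allowed: boolean; rejectedBy?: string }> {
  for (const limiter of limiters) {
    if (!limiter.appliesTo(req)) continue;
    if (!(await limiter.tryConsume(req))) {
      return { allowed: false, rejectedBy: limiter.scope };
    }
  }
  return { allowed: true };
}
```

One caveat with this naive loop: earlier levels consume budget even when a later level rejects the request. Gateways often order checks from cheapest to strictest, or roll back earlier consumption, to limit that effect.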
When a client exceeds their rate limit, how you respond matters. Good response handling helps legitimate clients adapt while discouraging abuse.
Standard Response Headers:
```http
HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 100         # Maximum requests allowed
X-RateLimit-Remaining: 42      # Requests remaining in window
X-RateLimit-Reset: 1735689600  # Unix timestamp when limit resets
RateLimit-Limit: 100           # IETF draft standard
RateLimit-Remaining: 42
RateLimit-Reset: 45            # Seconds until reset

{"data": {...}}
```

When a client exceeds the limit, return 429 Too Many Requests with a Retry-After header and a machine-readable error body: error codes such as RATE_LIMIT_EXCEEDED are easier for clients to handle than free-form messages.

If you return 429 without Retry-After, many clients will immediately retry, amplifying the load. The Retry-After header spreads retries over time. Consider adding jitter to the suggested retry time to prevent synchronized retries.
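Putting those pieces together on the rejection path, here is a framework-agnostic sketch of what the gateway might return when a limit is exceeded; the helper's shape is an assumption for illustration.

```typescript
// Build a 429 response with the headers clients need to back off correctly.
function buildRateLimitedResponse(limit: number, resetEpochSec: number) {
  const retryAfterSec = Math.max(1, resetEpochSec - Math.floor(Date.now() / 1000));

  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSec),   // Seconds the client should wait
      "X-RateLimit-Limit": String(limit),
      "X-RateLimit-Remaining": "0",
      "X-RateLimit-Reset": String(resetEpochSec),
    },
    body: {
      error: {
        code: "RATE_LIMIT_EXCEEDED",          // Machine-readable error code
        message: `Rate limit exceeded. Retry after ${retryAfterSec} seconds.`,
        retry_after: retryAfterSec,
      },
    },
  };
}
```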
Beyond basic request counting, sophisticated systems implement advanced patterns for nuanced traffic management.
Adaptive Rate Limiting:
```typescript
// Adaptive rate limiting based on system health
interface SystemHealth {
  latencyP99: number;      // milliseconds
  errorRate: number;       // 0-1
  queueDepth: number;
  cpuUtilization: number;
}

function calculateDynamicLimit(
  baseLimit: number,
  health: SystemHealth
): number {
  let multiplier = 1.0;

  // Reduce limit as latency increases
  if (health.latencyP99 > 500) {
    multiplier *= 0.8;
  }
  if (health.latencyP99 > 1000) {
    multiplier *= 0.6;
  }

  // Reduce limit on high error rate
  if (health.errorRate > 0.01) { // > 1% errors
    multiplier *= 0.5;
  }

  // Reduce limit on deep queues
  if (health.queueDepth > 1000) {
    multiplier *= 0.7;
  }

  // Reduce limit on high CPU
  if (health.cpuUtilization > 0.8) {
    multiplier *= 0.8;
  }

  // Never go below 10% of base limit
  multiplier = Math.max(multiplier, 0.1);

  return Math.floor(baseLimit * multiplier);
}
```

Not all requests are equal. A simple GET costs less than a complex search, which costs less than an export. Cost-based limiting lets you give users a 'budget' that reflects actual resource consumption, rather than treating all requests equally.
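A cost-based variant of the token bucket simply charges more tokens for more expensive operations. The sketch below reuses the bucket idea from earlier; the cost table and operation keys are illustrative assumptions.

```typescript
// Cost-based limiting: expensive operations consume more of the client's budget.
const OPERATION_COSTS: Record<string, number> = {
  "GET:/api/items": 1,     // cheap read
  "GET:/api/search": 5,    // complex query
  "POST:/api/export": 25,  // heavy, long-running job
};

interface Budget {
  tokens: number;      // remaining budget
  capacity: number;    // maximum budget
  refillRate: number;  // tokens added per second
  lastRefill: number;  // timestamp of last refill
}

function consumeBudget(budget: Budget, operation: string): boolean {
  const cost = OPERATION_COSTS[operation] ?? 1;
  const now = Date.now();

  // Refill proportionally to elapsed time, capped at capacity
  const elapsed = (now - budget.lastRefill) / 1000;
  budget.tokens = Math.min(budget.capacity, budget.tokens + elapsed * budget.refillRate);
  budget.lastRefill = now;

  if (budget.tokens >= cost) {
    budget.tokens -= cost;
    return true;  // Allowed: charge the operation's cost
  }
  return false;   // Not enough budget for this operation
}
```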
Rate limiting is a form of graceful degradation—protecting the system by limiting access. But within rate limiting, you can implement further graceful degradation strategies.
Degradation Strategies:
```yaml
# Progressive degradation based on system state
degradation:
  triggers:
    - condition: "rate_limit_utilization > 80%"
      actions:
        - disable_feature: autocomplete
        - reduce_page_size: 50%

    - condition: "rate_limit_utilization > 90%"
      actions:
        - serve_from_cache: true
        - cache_ttl_override: 300s
        - disable_feature: recommendations
        - disable_feature: real_time_updates

    - condition: "rate_limit_utilization > 95%"
      actions:
        - response_mode: minimal  # Only essential fields
        - disable_feature: pagination
        - reject_new_connections: true

  fallbacks:
    # When rate limited, try these fallbacks
    - endpoint: /api/search
      fallback:
        - serve_cached_popular_results
        - serve_static_suggestions
        - return_503_with_retry

    - endpoint: /api/feed
      fallback:
        - serve_cached_feed
        - serve_trending_content
        - return_503_with_retry
```

Client-Side Considerations:
Good clients implement their own rate limit handling:
| Client Behavior | Description |
|---|---|
| Respect Retry-After | Wait the specified time before retrying |
| Exponential Backoff | Increase delay between retries |
| Jitter | Add randomness to prevent synchronized retries |
| Circuit Breaker | Stop trying after repeated failures |
| Proactive Throttling | Slow down before hitting limits using X-RateLimit-Remaining |
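These behaviors combine naturally into one retry helper. A minimal client-side sketch using the fetch API (available in browsers and Node 18+); the retry bounds are illustrative assumptions.

```typescript
// Retry a request, honoring Retry-After and adding exponential backoff with jitter.
async function fetchWithBackoff(url: string, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url);

    if (response.status !== 429 || attempt >= maxRetries) {
      return response;
    }

    // Prefer the server's Retry-After; otherwise back off exponentially (capped at 30s)
    const retryAfter = Number(response.headers.get("Retry-After"));
    const baseDelayMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.min(30_000, 1000 * 2 ** attempt);

    // Full jitter prevents clients from retrying in lockstep
    const delayMs = Math.random() * baseDelayMs;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```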
When serving degraded responses, communicate this to clients. Headers like X-Degraded-Mode: true or response fields indicating freshness help clients understand they're not getting full functionality. This is better than silently returning stale data.
Rate limiting is essential for protecting systems and ensuring fair access. The key insights: choose an algorithm that matches your traffic (token bucket handles bursts gracefully and is the most common choice), share counter state across gateway instances so limits hold globally, identify clients at the right granularity and layer limits from global down to per-endpoint, return standard headers with Retry-After so well-behaved clients can back off, and degrade gracefully rather than failing hard when limits are reached.
What's Next:
The final page in this module covers Service Composition: how the API Gateway orchestrates calls to multiple backend services, aggregates responses, and satisfies complex client requirements with a single API call.
You now understand rate limiting comprehensively—from algorithms to distributed implementations, from client identification to graceful degradation. You're equipped to design rate limiting systems that protect your infrastructure while providing excellent developer experience.