On November 24, 2014, a simple API endpoint at a major cloud provider received an unusual surge of traffic. What started as a legitimate marketing campaign cascaded into billions of requests per minute. Without effective rate limiting, the authentication service collapsed, taking down the entire platform for hours. Millions in revenue evaporated.
Rate limiting is the practice of controlling the frequency of events—requests, packets, operations—to protect systems from overload, ensure fair resource allocation, and implement business policies. It's the difference between a system that gracefully handles abuse and one that collapses under pressure.
By the end of this page, you will understand rate limiting strategies beyond basic algorithms—including multi-dimensional limiting, distributed rate limiting, graceful degradation patterns, implementation challenges, and best practices from production systems. You'll be equipped to design rate limiting for any scale.
Rate limiting is the application of traffic shaping principles to control the rate of operations in a system. While leaky bucket and token bucket provide the algorithms, rate limiting encompasses the broader practice of deciding what to limit, where, and how.
Why Rate Limit?
Rate limiting serves multiple purposes across different domains:
| Purpose | Description | Example |
|---|---|---|
| Protection | Prevent system overload and resource exhaustion | Limit API calls to prevent database connection pool exhaustion |
| Fairness | Ensure equitable resource access across users | Limit per-user bandwidth so no user can monopolize |
| Cost Control | Limit expensive operations to control costs | Limit LLM API calls to stay within budget |
| Security | Mitigate abuse, DDoS, and brute force attacks | Limit login attempts to prevent credential stuffing |
| Compliance | Meet SLA commitments and contractual obligations | Shape traffic to match purchased bandwidth tier |
| Business | Implement tiered service and monetization | Free tier: 100 requests/day; Pro: 10,000/day |
Rate Limiting Dimensions:
Modern rate limiting operates across multiple dimensions simultaneously:
| Dimension | Description | Example |
|---|---|---|
| Time | Events per time window | 100 requests per minute |
| Identity | Per user, API key, IP address | Each user: 1000 req/hour |
| Resource | Per endpoint, operation type | /search: 10/min, /upload: 5/hour |
| Concurrency | Simultaneous active operations | Max 5 concurrent connections |
| Data Volume | Bytes transferred | 1 GB download per day |
| Cost | Weighted by operation expense | 100 'credits' per minute |
Multi-dimensional Example:
User "alice" limits:
- Global: 10,000 requests/hour
- Per endpoint:
  - /api/search: 100 requests/minute
  - /api/export: 10 requests/hour (expensive)
- Concurrent: max 20 simultaneous requests
- Data: 5 GB egress/day
A request that passes the global rate limit might still be rejected for exceeding an endpoint-specific limit. Production systems typically enforce limits across multiple dimensions simultaneously, with 'OR' semantics—exceed any limit, and the request is throttled.
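As a sketch of these 'OR' semantics, the helper below rejects a request as soon as any dimension's limit is exceeded. The `limiter` object, key names, and limit values are illustrative assumptions, not a real API:

```python
def check_all_dimensions(limiter, user_id: str, endpoint: str) -> bool:
    """Reject the request if ANY dimension's limit is exceeded (OR semantics)."""
    checks = [
        limiter.allow(f"user:{user_id}", window=3600, limit=10_000),        # global per-user
        limiter.allow(f"user:{user_id}:{endpoint}", window=60, limit=100),  # per-endpoint
    ]
    if endpoint == "/api/export":                                           # expensive operation
        checks.append(limiter.allow(f"user:{user_id}:export", window=3600, limit=10))
    # Concurrency and data-volume dimensions typically need separate primitives
    # (semaphores, byte budgets) rather than a request-rate check.
    return all(checks)
```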
Beyond leaky and token bucket, several algorithms address different rate limiting needs:
1. Fixed Window Counter:
Simplest approach—count events in fixed time windows.
```
Every minute window:
  - Count requests in current window
  - Allow if count < limit
  - Reset count at window boundary
```
Problem: Boundary bursts—100 requests at 11:59:59 + 100 at 12:00:00 = 200 requests in 2 seconds.
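For comparison with the algorithms below, here is a minimal fixed-window counter sketch (the class name and structure are illustrative); note that two bursts straddling a window boundary would both be admitted:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal fixed-window counter (illustrative sketch).

    O(1) work per request and trivially simple, but allows up to 2x the
    limit across a window boundary, as described above. A real
    implementation would also prune old window buckets.
    """

    def __init__(self, window_seconds: int, limit: int):
        self.window = window_seconds
        self.limit = limit
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str) -> bool:
        window_id = int(time.time() // self.window)  # current window bucket
        bucket = (key, window_id)
        if self.counts[bucket] < self.limit:
            self.counts[bucket] += 1
            return True
        return False
```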
| Algorithm | Burst Handling | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| Fixed Window | Poor (boundary bursts) | O(1) | Low | Simple, non-critical limits |
| Sliding Window Log | Excellent | O(n) per user | Highest | When precision is critical |
| Sliding Window Counter | Good | O(1) | High | Best general-purpose balance |
| Token Bucket | Excellent | O(1) | High | Network shaping, APIs |
| Leaky Bucket | Excellent (smoothed) | O(1) | High | Constant-rate output |
2. Sliding Window Log:
Store timestamp of each request, count those within the sliding window.
```
On request at time T:
  - Remove all timestamps older than (T - window_size)
  - If count < limit:
      - Add timestamp T
      - Allow request
  - Else: reject
```
Advantage: Perfect accuracy—no boundary effects. Disadvantage: Memory proportional to request count.
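A minimal sliding window log sketch using a deque of timestamps (illustrative, not the implementation of any particular product); the memory cost is visible in the per-key deque:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding-window limiter: stores one timestamp per admitted request."""

    def __init__(self, window_seconds: float, limit: int):
        self.window = window_seconds
        self.limit = limit
        self.log: dict[str, deque] = {}

    def allow(self, key: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        timestamps = self.log.setdefault(key, deque())
        # Drop timestamps that have slid out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) < self.limit:
            timestamps.append(now)
            return True
        return False
```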
3. Sliding Window Counter (Hybrid):
Combines fixed window counting with sliding window semantics:
```
Parameters:
  window_size: 60 seconds
  limit: 100 requests

On request at time T:
  current_window  = floor(T / window_size)
  previous_window = current_window - 1

  # Weight of previous window (sliding portion)
  weight = 1 - ((T % window_size) / window_size)

  # Weighted count
  count = (previous_window_count × weight) + current_window_count

  if count < limit:
    increment current_window_count
    allow
  else:
    reject
```
This provides sliding window behavior with O(1) memory per key.
```python
import time
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class WindowState:
    """State for sliding window counter."""
    window_id: int = 0
    prev_count: int = 0
    curr_count: int = 0


class SlidingWindowCounter:
    """
    Sliding Window Counter rate limiter.

    Provides approximate sliding window semantics
    with fixed memory per key (O(1)).
    """

    def __init__(self, window_seconds: int, limit: int):
        """
        Initialize the rate limiter.

        Args:
            window_seconds: Size of the window in seconds
            limit: Maximum requests allowed per window
        """
        self.window_size = window_seconds
        self.limit = limit
        self.state: Dict[str, WindowState] = {}

    def _get_window(self, timestamp: float) -> Tuple[int, float]:
        """Get current window ID and position within window."""
        window_id = int(timestamp // self.window_size)
        position = (timestamp % self.window_size) / self.window_size
        return window_id, position

    def allow(self, key: str, timestamp: float = None) -> Tuple[bool, dict]:
        """
        Check if request is allowed and record it.

        Args:
            key: Rate limit key (user ID, IP, etc.)
            timestamp: Request time (defaults to now)

        Returns:
            (allowed, info_dict) tuple
        """
        if timestamp is None:
            timestamp = time.time()

        window_id, position = self._get_window(timestamp)

        # Initialize state if needed
        if key not in self.state:
            self.state[key] = WindowState()
        state = self.state[key]

        # Roll over windows if needed
        if window_id > state.window_id:
            if window_id == state.window_id + 1:
                # Move current to previous
                state.prev_count = state.curr_count
            else:
                # Gap of 2+ windows, previous is empty
                state.prev_count = 0
            state.curr_count = 0
            state.window_id = window_id

        # Calculate weighted count
        prev_weight = 1.0 - position  # Weight of previous window
        weighted_count = (state.prev_count * prev_weight) + state.curr_count

        # Check limit
        if weighted_count < self.limit:
            state.curr_count += 1
            return True, {
                "remaining": int(self.limit - weighted_count - 1),
                "limit": self.limit,
                "window": self.window_size,
                "reset": (window_id + 1) * self.window_size,
            }
        else:
            return False, {
                "remaining": 0,
                "limit": self.limit,
                "window": self.window_size,
                "reset": (window_id + 1) * self.window_size,
                "retry_after": (1 - position) * self.window_size,
            }


# Example usage
if __name__ == "__main__":
    limiter = SlidingWindowCounter(window_seconds=10, limit=5)

    print("=== Rapid Fire Test ===")
    for i in range(8):
        allowed, info = limiter.allow("user:alice")
        status = "✓ Allowed" if allowed else "✗ Limited"
        print(f"Request {i+1}: {status}, remaining: {info['remaining']}")

    print("\n=== After 5 second wait (half window) ===")
    time.sleep(5)
    for i in range(4):
        allowed, info = limiter.allow("user:alice")
        status = "✓ Allowed" if allowed else "✗ Limited"
        print(f"Request {i+1}: {status}, remaining: {info['remaining']}")
```

Sliding window counter is the most common algorithm in production API rate limiters. It provides very good accuracy (typically within 0.003% of true sliding window) with minimal memory and simple implementation. Companies like Cloudflare, Stripe, and GitHub use variants of this approach.
In distributed systems, rate limiting becomes significantly more challenging. Requests for a single user may arrive at different servers, requiring coordination.
The Coordination Problem:
Imagine a user has a limit of 100 requests/minute and you have 5 API servers. Each server sees only a fraction of that user's traffic, so no single server can enforce the global limit on its own; the servers must either share counter state or accept that the effective limit is per-server rather than global.
Approaches to Distributed Rate Limiting:
| Approach | Consistency | Latency | Complexity | Use Case |
|---|---|---|---|---|
| Centralized Store (Redis) | Strong | ~1-5ms | Low | Most common, works at scale |
| Sticky Sessions | Local strong | 0ms | Medium | When user affinity is acceptable |
| Gossip Protocol | Eventual | Variable | High | Extreme scale, eventual OK |
| Cell-Based | Cell-strong | 0ms + cell sync | Medium | Geographic distribution |
| Local + Sync | Approximate | 0ms + async | Medium | High-perf with tolerance |
1. Centralized Store (Redis):
The most common approach is for all servers to share rate limit state via Redis or a similar store.
Advantages:
- Exact global state
- Simple implementation (Redis has atomic operations)
- Works with any request routing
Challenges:
- Redis becomes single point of failure
- Added latency per request (1-5ms)
- A network partition forces a choice: allow all requests or deny all (fail open vs. fail closed)
2. Local Limits with Async Aggregation:
Each server maintains local counters, periodically syncing:
Implementation:
1. Each server has local rate limiter
2. Local limits set to global_limit / num_servers × N
(N>1 provides slack for uneven distribution)
3. Background process aggregates across servers
4. If global limit reached, notify all servers
Advantages:
- No synchronous latency
- Resilient to coordinator failure
Challenges:
- Can temporarily exceed limits
- Uneven distribution wastes capacity
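A rough sketch of this local-plus-sync pattern, assuming a hypothetical `report_and_fetch_total` callback that pushes the local count to a shared store and returns the cluster-wide total (window rollover and error handling are omitted):

```python
import threading
import time

class LocalLimiterWithSync:
    """Local counting with periodic aggregation (illustrative sketch).

    Each server enforces roughly global_limit / num_servers * slack locally,
    while a background thread periodically reports its count and learns the
    cluster-wide total via the injected `report_and_fetch_total` callback.
    """

    def __init__(self, global_limit: int, num_servers: int,
                 slack: float, report_and_fetch_total,
                 sync_interval: float = 1.0):
        self.local_limit = int(global_limit / num_servers * slack)
        self.global_limit = global_limit
        self.report_and_fetch_total = report_and_fetch_total
        self.sync_interval = sync_interval
        self.count = 0
        self.globally_exhausted = False
        self.lock = threading.Lock()
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def allow(self) -> bool:
        with self.lock:
            if self.globally_exhausted or self.count >= self.local_limit:
                return False
            self.count += 1
            return True

    def _sync_loop(self):
        while True:
            time.sleep(self.sync_interval)
            with self.lock:
                local = self.count
            # Push our count and learn the cluster-wide total (network call).
            total = self.report_and_fetch_total(local)
            with self.lock:
                self.globally_exhausted = total >= self.global_limit
```

A full centralized (Redis-based) implementation of the first approach follows.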
"""Distributed Rate Limiter using Redis with Lua for atomicity."""import redisimport timefrom typing import Tuple class DistributedRateLimiter: """ Distributed sliding window counter using Redis. Uses Lua scripting for atomic operations. """ # Lua script for atomic rate limit check and update LUA_SCRIPT = """ local key = KEYS[1] local window = tonumber(ARGV[1]) local limit = tonumber(ARGV[2]) local now = tonumber(ARGV[3]) -- Calculate current and previous window local window_id = math.floor(now / window) local window_start = window_id * window local position = (now - window_start) / window local prev_key = key .. ":" .. (window_id - 1) local curr_key = key .. ":" .. window_id -- Get counts local prev_count = tonumber(redis.call("GET", prev_key) or "0") local curr_count = tonumber(redis.call("GET", curr_key) or "0") -- Calculate weighted count local weighted = (prev_count * (1 - position)) + curr_count if weighted < limit then -- Increment current window redis.call("INCR", curr_key) redis.call("EXPIRE", curr_key, window * 2) return {1, limit - weighted - 1, window_start + window} else return {0, 0, window_start + window} end """ def __init__( self, redis_client: redis.Redis, prefix: str = "ratelimit" ): self.redis = redis_client self.prefix = prefix self.script = self.redis.register_script(self.LUA_SCRIPT) def allow( self, key: str, window_seconds: int, limit: int ) -> Tuple[bool, dict]: """ Check if request is allowed. Args: key: Rate limit key (e.g., "user:12345") window_seconds: Window size in seconds limit: Maximum requests per window Returns: (allowed, info) tuple """ full_key = f"{self.prefix}:{key}" now = time.time() try: result = self.script( keys=[full_key], args=[window_seconds, limit, now] ) allowed, remaining, reset = result return bool(allowed), { "allowed": bool(allowed), "remaining": int(remaining), "limit": limit, "reset": int(reset), "retry_after": max(0, reset - now) if not allowed else 0, } except redis.RedisError as e: # Fail open: allow request if Redis unavailable # (Alternative: fail closed for security-critical limits) return True, { "allowed": True, "remaining": -1, "limit": limit, "error": str(e), "fail_open": True, } # Multi-tier rate limiterclass MultiTierRateLimiter: """ Rate limiter with multiple limit tiers. E.g., per-minute AND per-hour limits. """ def __init__(self, redis_client: redis.Redis): self.limiter = DistributedRateLimiter(redis_client) def allow( self, key: str, limits: list[tuple[int, int]] # [(window_secs, limit), ...] ) -> Tuple[bool, list[dict]]: """ Check all limit tiers. Reject if any tier exceeded. Args: key: Rate limit key limits: List of (window_seconds, limit) tuples Returns: (allowed, info_list) tuple """ results = [] allowed = True for window, limit in limits: tier_key = f"{key}:w{window}" tier_allowed, info = self.limiter.allow(tier_key, window, limit) results.append({ "window": window, **info }) if not tier_allowed: allowed = False return allowed, results # Example usageif __name__ == "__main__": r = redis.Redis() limiter = MultiTierRateLimiter(r) # 5 per second AND 20 per minute limits = [ (1, 5), # 5 per second (60, 20), # 20 per minute ] for i in range(25): allowed, info = limiter.allow("user:alice", limits) status = "✓" if allowed else "✗" print(f"Request {i+1}: {status}") if i % 5 == 4: print(" (sleeping 1s)") time.sleep(1)When the rate limiting system fails (Redis down, network partition), you must choose: Fail Open (allow all requests) maintains availability but loses protection. 
Fail Closed (deny all requests) maintains protection but hurts availability. The right choice depends on your use case—abuse prevention might fail closed, while general API limits might fail open.
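If you want that choice to be explicit rather than buried in an exception handler, a small wrapper can make the policy configurable. This sketch assumes a limiter variant that raises on backend errors instead of failing open internally; the `fail_mode` parameter and function name are illustrative:

```python
import redis

def allow_with_fail_mode(limiter, key: str, window: int, limit: int,
                         fail_mode: str = "open") -> bool:
    """Run a rate limit check with an explicit failure policy.

    fail_mode="open"   -> allow the request when the limiter backend is down
    fail_mode="closed" -> deny the request when the limiter backend is down
    """
    try:
        allowed, _info = limiter.allow(key, window, limit)
        return allowed
    except redis.RedisError:
        return fail_mode == "open"
```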
How you respond when limits are exceeded significantly impacts user experience and system behavior.
Response Actions:
| Action | Description | HTTP Code | Use Case |
|---|---|---|---|
| Hard Reject | Immediately reject the request | 429 Too Many Requests | Standard API rate limiting |
| Queue (Delay) | Accept but delay processing | 202 Accepted | When delay is acceptable |
| Degrade | Serve reduced-quality response | 200 OK (degraded) | When partial service is useful |
| Shed Load | Randomly drop requests | 503 Service Unavailable | Emergency overload protection |
| Throttle | Slow down the connection | N/A (network level) | Network-level shaping |
| Redirect | Send to alternative service | 307/302 Redirect | Geographic or tier-based routing |
429 Response Best Practices:
When rejecting with HTTP 429, provide helpful information:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699920000
X-RateLimit-Policy: 100;w=60

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit. Please retry after 30 seconds.",
    "retry_after": 30,
    "limit": 100,
    "window": 60,
    "documentation_url": "https://docs.example.com/rate-limits"
  }
}
```
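A framework-agnostic sketch of building such a response on the server side (the function name and return shape are illustrative assumptions):

```python
import json
import time

def build_429_response(limit: int, window: int, reset_epoch: int) -> tuple[int, dict, str]:
    """Build a (status, headers, body) triple for a rate-limited request."""
    retry_after = max(0, int(reset_epoch - time.time()))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
        "X-RateLimit-Policy": f"{limit};w={window}",
    }
    body = json.dumps({
        "error": {
            "code": "rate_limit_exceeded",
            "message": f"You have exceeded the rate limit. Please retry after {retry_after} seconds.",
            "retry_after": retry_after,
            "limit": limit,
            "window": window,
        }
    })
    return 429, headers, body
```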
Header Standards:
| Header | Description | Standard |
|---|---|---|
| Retry-After | Seconds until retry is safe | RFC 7231 (now RFC 9110) |
| X-RateLimit-Limit | Request limit | De facto standard |
| X-RateLimit-Remaining | Requests remaining | De facto standard |
| X-RateLimit-Reset | Unix timestamp when limit resets | De facto standard |
| RateLimit-* | IETF standardized versions | IETF draft (httpapi working group) |
Well-behaved clients should implement exponential backoff when rate limited. First retry after 1s, then 2s, 4s, 8s, etc., with jitter. This prevents thundering herd problems where many clients retry simultaneously when a limit window resets.
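A client-side sketch of backoff with full jitter that also honors the server's Retry-After hint (it uses the third-party `requests` library; the URL handling and retry cap are illustrative):

```python
import random
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5):
    """Retry on 429 with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's hint when present (assumed to be seconds),
        # otherwise back off exponentially: 1s, 2s, 4s, 8s, ...
        retry_after = response.headers.get("Retry-After")
        base_delay = float(retry_after) if retry_after else 2 ** attempt
        time.sleep(random.uniform(0, base_delay))  # full jitter avoids synchronized retries
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")
```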
Graceful Degradation Example:
Instead of rejecting, provide reduced service:
```python
def get_user_recommendations(user_id: str):
    # Check rate limits for expensive operations
    if not rate_limiter.allow(f"recommendations:{user_id}",
                              window=60, limit=10):
        # Degraded: return cached/simpler recommendations
        return get_cached_recommendations(user_id)

    # Full service: compute personalized recommendations
    return compute_ml_recommendations(user_id)
```
This maintains perceived availability while protecting expensive backends.
Choosing the right identity for rate limiting is crucial—too broad and abusers affect legitimate users; too narrow and abusers create many identities.
Identity Dimensions:
| Identity | Pros | Cons | Best For |
|---|---|---|---|
| IP Address | No auth required, catches bots | NAT shares IPs, VPNs bypass | Anonymous/unauthenticated endpoints |
| User ID | Accurate per-user limits | Only works for authenticated | API quotas, user-facing features |
| API Key | Per-application limits | Keys can be shared/leaked | Developer APIs, B2B services |
| Device Fingerprint | Tracks across identities | Privacy concerns, cat-and-mouse | Fraud prevention, abuse detection |
| Organization/Tenant | Org-level quotas | Unfair to larger orgs | Multi-tenant SaaS |
| Compound | Multi-dimensional | Complex to configure | Defense in depth |
Layered Identity Strategy:
Production systems typically use multiple identity layers:
```python
def check_rate_limits(request):
    """
    Layered rate limiting by multiple identities.
    Request is allowed only if ALL limits pass.
    """
    checks = []

    # Layer 1: Global limit (protect service)
    checks.append(limiter.allow(
        key="global",
        limit=1_000_000,
        window=60
    ))

    # Layer 2: Per-IP limit (stop unauthenticated abuse)
    checks.append(limiter.allow(
        key=f"ip:{request.ip}",
        limit=1000,
        window=60
    ))

    # Layer 3: Per-user limit (if authenticated)
    if request.user_id:
        user_tier = get_user_tier(request.user_id)
        checks.append(limiter.allow(
            key=f"user:{request.user_id}",
            limit=TIER_LIMITS[user_tier],
            window=60
        ))

    # Layer 4: Per-endpoint limit (protect expensive operations)
    endpoint_limit = ENDPOINT_LIMITS.get(request.endpoint)
    if endpoint_limit:
        checks.append(limiter.allow(
            key=f"endpoint:{request.endpoint}:{request.user_id or request.ip}",
            limit=endpoint_limit,
            window=60
        ))

    # All must pass
    return all(check[0] for check in checks)
```
Handling Shared IP Addresses:
Large organizations (universities, enterprises) often place thousands of users behind a single NAT IP, so a strict per-IP limit can block legitimate users. Where possible, combine IP-based limits with per-user or per-session identity, or grant known shared IPs higher thresholds.
IPv6 gives each device effectively unlimited addresses. Rate limiting by /128 (single address) is ineffective—limit by /56 or /48 (ISP-assigned prefix) instead. Monitor for clients rotating through large IPv6 blocks to evade limits.
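One way to apply this, sketched below with Python's ipaddress module, is to normalize IPv6 clients to their delegated prefix before building the rate limit key (the /56 default is an assumption; use whatever prefix length matches your threat model):

```python
import ipaddress

def rate_limit_key_for_ip(ip: str, v6_prefix: int = 56) -> str:
    """Derive a rate-limit key from a client IP (illustrative sketch).

    IPv4 addresses are used as-is; IPv6 addresses are collapsed to their
    ISP-assigned prefix so a client cannot evade limits by rotating through
    addresses within one delegated block.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version == 6:
        network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
        return f"ip6:{network.network_address}/{v6_prefix}"
    return f"ip4:{addr}"
```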
Fixed rate limits are simple but wasteful—they must be set for worst-case load, leaving capacity unused during normal operation. Adaptive rate limiting adjusts limits based on current system state.
AIMD (Additive Increase, Multiplicative Decrease):
Borrowed from TCP congestion control:
```
AIMD Rate Limiting:

State:
  current_limit: starts at initial_limit

Every adjustment_period:
  success_rate = successful_requests / total_requests

  if success_rate > target_success_rate:
    # System healthy, increase limits
    current_limit = min(max_limit, current_limit + additive_increase)
  else:
    # System stressed, back off
    current_limit = max(min_limit, current_limit × multiplicative_decrease)
```
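A compact sketch of such an AIMD controller (the parameter defaults are illustrative); it would sit beside the rate limiter and periodically recompute the limit the limiter enforces:

```python
class AIMDLimitController:
    """Adjust a rate limit with additive increase / multiplicative decrease.

    Call record(success) per request and adjust() once per adjustment period.
    """

    def __init__(self, initial_limit: float, min_limit: float = 10,
                 max_limit: float = 10_000, additive_increase: float = 10,
                 multiplicative_decrease: float = 0.5,
                 target_success_rate: float = 0.99):
        self.current_limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.additive_increase = additive_increase
        self.multiplicative_decrease = multiplicative_decrease
        self.target_success_rate = target_success_rate
        self.successes = 0
        self.total = 0

    def record(self, success: bool) -> None:
        self.total += 1
        self.successes += int(success)

    def adjust(self) -> float:
        if self.total:
            success_rate = self.successes / self.total
            if success_rate > self.target_success_rate:
                # Healthy: probe for more capacity, additively.
                self.current_limit = min(self.max_limit,
                                         self.current_limit + self.additive_increase)
            else:
                # Stressed: back off quickly, multiplicatively.
                self.current_limit = max(self.min_limit,
                                         self.current_limit * self.multiplicative_decrease)
        self.successes = 0
        self.total = 0
        return self.current_limit
```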
Load Shedding at Scale:
When systems approach overload, intelligent load shedding preserves function:
```python
import random


class AdaptiveLoadShedder:
    """
    Sheds load based on system health metrics.
    """

    def __init__(self, target_latency_ms: int = 100):
        self.target_latency = target_latency_ms
        self.shed_probability = 0.0

    def update(self, current_latency_ms: float):
        """Update shed probability based on latency."""
        if current_latency_ms > self.target_latency * 2:
            # Way over target: aggressive shedding
            self.shed_probability = min(0.9, self.shed_probability + 0.1)
        elif current_latency_ms > self.target_latency:
            # Over target: gradual increase
            self.shed_probability = min(0.5, self.shed_probability + 0.05)
        else:
            # Under target: gradual recovery
            self.shed_probability = max(0.0, self.shed_probability - 0.01)

    def should_shed(self, request_priority: int = 0) -> bool:
        """
        Determine if request should be shed.
        Higher priority = less likely to shed.
        """
        adjusted_probability = self.shed_probability / (request_priority + 1)
        return random.random() < adjusted_probability
```
Priority-Based Shedding:
Not all requests are equal—shed low-priority first:
| Priority | Examples | Shed Order |
|---|---|---|
| Critical | Health checks, auth tokens | Never shed |
| High | User-facing, purchased tier | Shed last |
| Normal | Standard API calls | Normal shedding |
| Low | Analytics, background sync | Shed first |
| Best-effort | Prefetch, speculative | Shed aggressively |
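As an illustrative use of the AdaptiveLoadShedder above, the priority numbers here are assumptions chosen so that higher tiers divide the shed probability more aggressively:

```python
# Illustrative tier-to-priority mapping (not a standard); higher numbers
# shrink the shed probability in should_shed().
PRIORITY = {"critical": 100, "high": 3, "normal": 1, "low": 0}

shedder = AdaptiveLoadShedder(target_latency_ms=100)
shedder.update(current_latency_ms=250)   # well over target: shedding ramps up

for tier, prio in PRIORITY.items():
    dropped = sum(shedder.should_shed(request_priority=prio) for _ in range(10_000))
    print(f"{tier:>9}: shed {dropped / 100:.1f}% of requests")
```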
Libraries like Google's Guava RateLimiter smooth request admission; adaptive load shedders go further with techniques inspired by CoDel (Controlled Delay) queue management. Rather than just limiting rate, they ensure queued requests don't wait too long. Old requests are dropped because their clients have likely given up—processing them wastes resources and delivers unusable responses.
Production experience has identified patterns that work well and anti-patterns to avoid.
Pattern: Defense in Depth
Apply rate limiting at multiple layers: edge/CDN, load balancer or API gateway, individual services, and the data tier. Traffic that slips past one layer is caught by the next, and each layer can enforce limits appropriate to what it protects.
Pattern: Token Budget System
Instead of simple request counts, assign costs to operations:
```python
OPERATION_COSTS = {
    'GET /users': 1,
    'GET /search': 5,            # More expensive
    'POST /export': 50,          # Very expensive
    'POST /ml-inference': 100,   # Resource-intensive
}

def check_budget(user_id: str, operation: str) -> bool:
    cost = OPERATION_COSTS.get(operation, 1)
    # User has 1000 tokens per minute
    return token_bucket.consume(
        key=f"budget:{user_id}",
        tokens=cost,
        bucket_size=1000,
        refill_rate=1000/60
    )
```
This ensures expensive operations are limited more aggressively while allowing many cheap operations.
Pattern: Separate Rate Limit Types
Different limit categories serve different purposes and should be configured, monitored, and adjusted independently; for example, abuse-prevention limits (per IP), business quotas (per plan tier), and capacity-protection limits (per endpoint).
Track what percentage of your capacity is actually used. If limits are rarely hit, they may be too high (wasting resources) or traffic patterns changed. If limits are constantly hit, they may be too low or you have a scaling problem. Good limits should be hit occasionally—enough to provide protection without hampering normal use.
Examining how major platforms implement rate limiting provides practical insights.
Case Study 1: GitHub API
GitHub implements sophisticated per-resource rate limiting:
Authenticated users:
- Core API: 5,000 requests per hour
- Search API: 30 requests per minute
- GraphQL: 5,000 points per hour (complexity-based)
OAuth Apps:
- 5,000 requests per hour per user
- Can request higher limits
GitHub Actions:
- Separate limits per workflow
- Concurrency limits per repository
Key insight: Different resources have different limits—search is expensive so gets stricter limits.
Case Study 2: Cloudflare Rate Limiting
Cloudflare operates at the edge, processing millions of requests per second.
Key lesson: At extreme scale, exact counting is impossible—sampling and estimation become necessary.
Case Study 3: Netflix Circuit Breakers
Netflix combines rate limiting with circuit breakers (the Hystrix pattern): when a backend degrades, the circuit opens and callers fail fast or fall back instead of piling more work onto a struggling service.
This defense-in-depth approach handles both traffic quantity and backend health.
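A minimal sketch of how the two mechanisms compose (this is not Hystrix itself; the thresholds and helper names are illustrative): the rate limiter rejects excess traffic, while the circuit breaker fails fast when the backend looks unhealthy.

```python
import time

class SimpleCircuitBreaker:
    """Minimal circuit breaker sketch."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_timeout:
            return True          # half-open: let a trial request through
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()


def call_backend(limiter, breaker, key: str, do_request):
    """Combine a rate limit check (traffic quantity) with a circuit breaker (backend health)."""
    if not limiter.allow(key, window_seconds=1, limit=100)[0]:
        raise RuntimeError("rate limited")    # too much traffic
    if not breaker.allow():
        raise RuntimeError("circuit open")    # backend unhealthy, fail fast
    try:
        result = do_request()
        breaker.record(success=True)
        return result
    except Exception:
        breaker.record(success=False)
        raise
```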
The best rate limiting implementations are transparent. Companies like Stripe, GitHub, and Twilio publish their limits, explain the rationale, provide clear error messages, and offer ways to increase limits. This transparency builds trust and helps developers build resilient applications.
We've comprehensively explored rate limiting—the practice of applying traffic shaping principles to protect systems, ensure fairness, and implement business policies.
What's Next:
With rate limiting mastered, we'll complete our traffic shaping journey by exploring QoS (Quality of Service) Implementation—how traffic shaping technologies are combined to deliver differentiated service quality across network infrastructure.
You now understand rate limiting comprehensively—from algorithms and distribution challenges to response strategies and real-world implementations. This knowledge is directly applicable to designing resilient, fair, and secure distributed systems.