On November 24, 2014, a simple API endpoint at a major cloud provider received an unusual surge of traffic. What started as a legitimate marketing campaign cascaded into billions of requests per minute. Without effective rate limiting, the authentication service collapsed, taking down the entire platform for hours. Millions in revenue evaporated.
Rate limiting is the practice of controlling the frequency of events—requests, packets, operations—to protect systems from overload, ensure fair resource allocation, and implement business policies. It's the difference between a system that gracefully handles abuse and one that collapses under pressure.
By the end of this page, you will understand rate limiting strategies beyond basic algorithms—including multi-dimensional limiting, distributed rate limiting, graceful degradation patterns, implementation challenges, and best practices from production systems. You'll be equipped to design rate limiting for any scale.
Rate limiting is the application of traffic shaping principles to control the rate of operations in a system. While leaky bucket and token bucket provide the algorithms, rate limiting encompasses the broader practice of deciding what to limit, where, and how.
Why Rate Limit?
Rate limiting serves multiple purposes across different domains:
| Purpose | Description | Example |
|---|---|---|
| Protection | Prevent system overload and resource exhaustion | Limit API calls to prevent database connection pool exhaustion |
| Fairness | Ensure equitable resource access across users | Limit per-user bandwidth so no user can monopolize |
| Cost Control | Limit expensive operations to control costs | Limit LLM API calls to stay within budget |
| Security | Mitigate abuse, DDoS, and brute force attacks | Limit login attempts to prevent credential stuffing |
| Compliance | Meet SLA commitments and contractual obligations | Shape traffic to match purchased bandwidth tier |
| Business | Implement tiered service and monetization | Free tier: 100 requests/day; Pro: 10,000/day |
Rate Limiting Dimensions:
Modern rate limiting operates across multiple dimensions simultaneously:
| Dimension | Description | Example |
|---|---|---|
| Time | Events per time window | 100 requests per minute |
| Identity | Per user, API key, IP address | Each user: 1000 req/hour |
| Resource | Per endpoint, operation type | /search: 10/min, /upload: 5/hour |
| Concurrency | Simultaneous active operations | Max 5 concurrent connections |
| Data Volume | Bytes transferred | 1 GB download per day |
| Cost | Weighted by operation expense | 100 'credits' per minute |
Multi-dimensional Example:
User "alice" limits:
- Global: 10,000 requests/hour
- Per endpoint:
  - /api/search: 100 requests/minute
  - /api/export: 10 requests/hour (expensive)
- Concurrent: max 20 simultaneous requests
- Data: 5 GB egress/day
A request that passes the global rate limit might still be rejected for exceeding an endpoint-specific limit. Production systems typically enforce limits across multiple dimensions simultaneously, with 'OR' semantics—exceed any limit, and the request is throttled.
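As a sketch of these 'OR' semantics, the helper below rejects a request as soon as any dimension's limit is exceeded. The `limiter` object, key names, and limit values are illustrative assumptions, not a real API:

```python
def check_all_dimensions(limiter, user_id: str, endpoint: str) -> bool:
    """Reject the request if ANY dimension's limit is exceeded (OR semantics)."""
    checks = [
        limiter.allow(f"user:{user_id}", window=3600, limit=10_000),        # global per-user
        limiter.allow(f"user:{user_id}:{endpoint}", window=60, limit=100),  # per-endpoint
    ]
    if endpoint == "/api/export":                                           # expensive operation
        checks.append(limiter.allow(f"user:{user_id}:export", window=3600, limit=10))
    # Concurrency and data-volume dimensions typically need separate primitives
    # (semaphores, byte budgets) rather than a request-rate check.
    return all(checks)
```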
Beyond leaky and token bucket, several algorithms address different rate limiting needs:
1. Fixed Window Counter:
Simplest approach—count events in fixed time windows.
```
Every minute window:
  - Count requests in current window
  - Allow if count < limit
  - Reset count at window boundary
```
Problem: Boundary bursts—100 requests at 11:59:59 + 100 at 12:00:00 = 200 requests in 2 seconds.
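For comparison with the algorithms below, here is a minimal fixed-window counter sketch (the class name and structure are illustrative); note that two bursts straddling a window boundary would both be admitted:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal fixed-window counter (illustrative sketch).

    O(1) work per request and trivially simple, but allows up to 2x the
    limit across a window boundary, as described above. A real
    implementation would also prune old window buckets.
    """

    def __init__(self, window_seconds: int, limit: int):
        self.window = window_seconds
        self.limit = limit
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str) -> bool:
        window_id = int(time.time() // self.window)  # current window bucket
        bucket = (key, window_id)
        if self.counts[bucket] < self.limit:
            self.counts[bucket] += 1
            return True
        return False
```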
| Algorithm | Burst Handling | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| Fixed Window | Poor (boundary bursts) | O(1) | Low | Simple, non-critical limits |
| Sliding Window Log | Excellent | O(n) per user | Highest | When precision is critical |
| Sliding Window Counter | Good | O(1) | High | Best general-purpose balance |
| Token Bucket | Excellent | O(1) | High | Network shaping, APIs |
| Leaky Bucket | Excellent (smoothed) | O(1) | High | Constant-rate output |
2. Sliding Window Log:
Store timestamp of each request, count those within the sliding window.
```
On request at time T:
  - Remove all timestamps older than (T - window_size)
  - If count < limit:
      - Add timestamp T
      - Allow request
  - Else: reject
```
Advantage: Perfect accuracy—no boundary effects. Disadvantage: Memory proportional to request count.
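A minimal sliding window log sketch using a deque of timestamps (illustrative, not the implementation of any particular product); the memory cost is visible in the per-key deque:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding-window limiter: stores one timestamp per admitted request."""

    def __init__(self, window_seconds: float, limit: int):
        self.window = window_seconds
        self.limit = limit
        self.log: dict[str, deque] = {}

    def allow(self, key: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        timestamps = self.log.setdefault(key, deque())
        # Drop timestamps that have slid out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) < self.limit:
            timestamps.append(now)
            return True
        return False
```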
3. Sliding Window Counter (Hybrid):
Combines fixed window counting with sliding window semantics:
```
Parameters:
  window_size: 60 seconds
  limit: 100 requests

On request at time T:
  current_window  = floor(T / window_size)
  previous_window = current_window - 1

  # Weight of previous window (sliding portion)
  weight = 1 - ((T % window_size) / window_size)

  # Weighted count
  count = (previous_window_count × weight) + current_window_count

  if count < limit:
    increment current_window_count
    allow
  else:
    reject
```
This provides sliding window behavior with O(1) memory per key.
```python
import time
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class WindowState:
    """State for sliding window counter."""
    window_id: int = 0
    prev_count: int = 0
    curr_count: int = 0


class SlidingWindowCounter:
    """
    Sliding Window Counter rate limiter.

    Provides approximate sliding window semantics
    with fixed memory per key (O(1)).
    """

    def __init__(self, window_seconds: int, limit: int):
        """
        Initialize the rate limiter.

        Args:
            window_seconds: Size of the window in seconds
            limit: Maximum requests allowed per window
        """
        self.window_size = window_seconds
        self.limit = limit
        self.state: Dict[str, WindowState] = {}

    def _get_window(self, timestamp: float) -> Tuple[int, float]:
        """Get current window ID and position within window."""
        window_id = int(timestamp // self.window_size)
        position = (timestamp % self.window_size) / self.window_size
        return window_id, position

    def allow(self, key: str, timestamp: float = None) -> Tuple[bool, dict]:
        """
        Check if request is allowed and record it.

        Args:
            key: Rate limit key (user ID, IP, etc.)
            timestamp: Request time (defaults to now)

        Returns:
            (allowed, info_dict) tuple
        """
        if timestamp is None:
            timestamp = time.time()

        window_id, position = self._get_window(timestamp)

        # Initialize state if needed
        if key not in self.state:
            self.state[key] = WindowState()
        state = self.state[key]

        # Roll over windows if needed
        if window_id > state.window_id:
            if window_id == state.window_id + 1:
                # Move current to previous
                state.prev_count = state.curr_count
            else:
                # Gap of 2+ windows, previous is empty
                state.prev_count = 0
            state.curr_count = 0
            state.window_id = window_id

        # Calculate weighted count
        prev_weight = 1.0 - position  # Weight of previous window
        weighted_count = (state.prev_count * prev_weight) + state.curr_count

        # Check limit
        if weighted_count < self.limit:
            state.curr_count += 1
            return True, {
                "remaining": int(self.limit - weighted_count - 1),
                "limit": self.limit,
                "window": self.window_size,
                "reset": (window_id + 1) * self.window_size,
            }
        else:
            return False, {
                "remaining": 0,
                "limit": self.limit,
                "window": self.window_size,
                "reset": (window_id + 1) * self.window_size,
                "retry_after": (1 - position) * self.window_size,
            }


# Example usage
if __name__ == "__main__":
    limiter = SlidingWindowCounter(window_seconds=10, limit=5)

    print("=== Rapid Fire Test ===")
    for i in range(8):
        allowed, info = limiter.allow("user:alice")
        status = "✓ Allowed" if allowed else "✗ Limited"
        print(f"Request {i+1}: {status}, remaining: {info['remaining']}")

    print("\n=== After 5 second wait (half window) ===")
    time.sleep(5)
    for i in range(4):
        allowed, info = limiter.allow("user:alice")
        status = "✓ Allowed" if allowed else "✗ Limited"
        print(f"Request {i+1}: {status}, remaining: {info['remaining']}")
```

Sliding window counter is the most common algorithm in production API rate limiters. It provides very good accuracy (typically within 0.003% of true sliding window) with minimal memory and simple implementation. Companies like Cloudflare, Stripe, and GitHub use variants of this approach.
In distributed systems, rate limiting becomes significantly more challenging. Requests for a single user may arrive at different servers, requiring coordination.
The Coordination Problem:
Imagine a user has a limit of 100 requests/minute and you have 5 API servers. Each server sees only a fraction of that user's traffic, so no single server can enforce the global limit on its own; the servers must either share counter state or accept that the effective limit is per-server rather than global.
Approaches to Distributed Rate Limiting:
| Approach | Consistency | Latency | Complexity | Use Case |
|---|---|---|---|---|
| Centralized Store (Redis) | Strong | ~1-5ms | Low | Most common, works at scale |
| Sticky Sessions | Local strong | 0ms | Medium | When user affinity is acceptable |
| Gossip Protocol | Eventual | Variable | High | Extreme scale, eventual OK |
| Cell-Based | Cell-strong | 0ms + cell sync | Medium | Geographic distribution |
| Local + Sync | Approximate | 0ms + async | Medium | High-perf with tolerance |
1. Centralized Store (Redis):
The most common approach is for all servers to share rate limit state via Redis or a similar store.
Advantages:
- Exact global state
- Simple implementation (Redis has atomic operations)
- Works with any request routing
Challenges:
- Redis becomes single point of failure
- Added latency per request (1-5ms)
- A network partition forces a choice: allow all requests or deny all (fail open vs. fail closed)
2. Local Limits with Async Aggregation:
Each server maintains local counters, periodically syncing:
Implementation:
1. Each server has local rate limiter
2. Local limits set to global_limit / num_servers × N
(N>1 provides slack for uneven distribution)
3. Background process aggregates across servers
4. If global limit reached, notify all servers
Advantages:
- No synchronous latency
- Resilient to coordinator failure
Challenges:
- Can temporarily exceed limits
- Uneven distribution wastes capacity
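A rough sketch of this local-plus-sync pattern, assuming a hypothetical `report_and_fetch_total` callback that pushes the local count to a shared store and returns the cluster-wide total (window rollover and error handling are omitted):

```python
import threading
import time

class LocalLimiterWithSync:
    """Local counting with periodic aggregation (illustrative sketch).

    Each server enforces roughly global_limit / num_servers * slack locally,
    while a background thread periodically reports its count and learns the
    cluster-wide total via the injected `report_and_fetch_total` callback.
    """

    def __init__(self, global_limit: int, num_servers: int,
                 slack: float, report_and_fetch_total,
                 sync_interval: float = 1.0):
        self.local_limit = int(global_limit / num_servers * slack)
        self.global_limit = global_limit
        self.report_and_fetch_total = report_and_fetch_total
        self.sync_interval = sync_interval
        self.count = 0
        self.globally_exhausted = False
        self.lock = threading.Lock()
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def allow(self) -> bool:
        with self.lock:
            if self.globally_exhausted or self.count >= self.local_limit:
                return False
            self.count += 1
            return True

    def _sync_loop(self):
        while True:
            time.sleep(self.sync_interval)
            with self.lock:
                local = self.count
            # Push our count and learn the cluster-wide total (network call).
            total = self.report_and_fetch_total(local)
            with self.lock:
                self.globally_exhausted = total >= self.global_limit
```

A full centralized (Redis-based) implementation of the first approach follows.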
"""Distributed Rate Limiter using Redis with Lua for atomicity."""import redisimport timefrom typing import Tuple class DistributedRateLimiter: """ Distributed sliding window counter using Redis. Uses Lua scripting for atomic operations. """ # Lua script for atomic rate limit check and update LUA_SCRIPT = """ local key = KEYS[1] local window = tonumber(ARGV[1]) local limit = tonumber(ARGV[2]) local now = tonumber(ARGV[3]) -- Calculate current and previous window local window_id = math.floor(now / window) local window_start = window_id * window local position = (now - window_start) / window local prev_key = key .. ":" .. (window_id - 1) local curr_key = key .. ":" .. window_id -- Get counts local prev_count = tonumber(redis.call("GET", prev_key) or "0") local curr_count = tonumber(redis.call("GET", curr_key) or "0") -- Calculate weighted count local weighted = (prev_count * (1 - position)) + curr_count if weighted < limit then -- Increment current window redis.call("INCR", curr_key) redis.call("EXPIRE", curr_key, window * 2) return {1, limit - weighted - 1, window_start + window} else return {0, 0, window_start + window} end """ def __init__( self, redis_client: redis.Redis, prefix: str = "ratelimit" ): self.redis = redis_client self.prefix = prefix self.script = self.redis.register_script(self.LUA_SCRIPT) def allow( self, key: str, window_seconds: int, limit: int ) -> Tuple[bool, dict]: """ Check if request is allowed. Args: key: Rate limit key (e.g., "user:12345") window_seconds: Window size in seconds limit: Maximum requests per window Returns: (allowed, info) tuple """ full_key = f"{self.prefix}:{key}" now = time.time() try: result = self.script( keys=[full_key], args=[window_seconds, limit, now] ) allowed, remaining, reset = result return bool(allowed), { "allowed": bool(allowed), "remaining": int(remaining), "limit": limit, "reset": int(reset), "retry_after": max(0, reset - now) if not allowed else 0, } except redis.RedisError as e: # Fail open: allow request if Redis unavailable # (Alternative: fail closed for security-critical limits) return True, { "allowed": True, "remaining": -1, "limit": limit, "error": str(e), "fail_open": True, } # Multi-tier rate limiterclass MultiTierRateLimiter: """ Rate limiter with multiple limit tiers. E.g., per-minute AND per-hour limits. """ def __init__(self, redis_client: redis.Redis): self.limiter = DistributedRateLimiter(redis_client) def allow( self, key: str, limits: list[tuple[int, int]] # [(window_secs, limit), ...] ) -> Tuple[bool, list[dict]]: """ Check all limit tiers. Reject if any tier exceeded. Args: key: Rate limit key limits: List of (window_seconds, limit) tuples Returns: (allowed, info_list) tuple """ results = [] allowed = True for window, limit in limits: tier_key = f"{key}:w{window}" tier_allowed, info = self.limiter.allow(tier_key, window, limit) results.append({ "window": window, **info }) if not tier_allowed: allowed = False return allowed, results # Example usageif __name__ == "__main__": r = redis.Redis() limiter = MultiTierRateLimiter(r) # 5 per second AND 20 per minute limits = [ (1, 5), # 5 per second (60, 20), # 20 per minute ] for i in range(25): allowed, info = limiter.allow("user:alice", limits) status = "✓" if allowed else "✗" print(f"Request {i+1}: {status}") if i % 5 == 4: print(" (sleeping 1s)") time.sleep(1)When the rate limiting system fails (Redis down, network partition), you must choose: Fail Open (allow all requests) maintains availability but loses protection. 
Fail Closed (deny all requests) maintains protection but hurts availability. The right choice depends on your use case—abuse prevention might fail closed, while general API limits might fail open.
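If you want that choice to be explicit rather than buried in an exception handler, a small wrapper can make the policy configurable. This sketch assumes a limiter variant that raises on backend errors instead of failing open internally; the `fail_mode` parameter and function name are illustrative:

```python
import redis

def allow_with_fail_mode(limiter, key: str, window: int, limit: int,
                         fail_mode: str = "open") -> bool:
    """Run a rate limit check with an explicit failure policy.

    fail_mode="open"   -> allow the request when the limiter backend is down
    fail_mode="closed" -> deny the request when the limiter backend is down
    """
    try:
        allowed, _info = limiter.allow(key, window, limit)
        return allowed
    except redis.RedisError:
        return fail_mode == "open"
```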
How you respond when limits are exceeded significantly impacts user experience and system behavior.
Response Actions:
| Action | Description | HTTP Code | Use Case |
|---|---|---|---|
| Hard Reject | Immediately reject the request | 429 Too Many Requests | Standard API rate limiting |
| Queue (Delay) | Accept but delay processing | 202 Accepted | When delay is acceptable |
| Degrade | Serve reduced-quality response | 200 OK (degraded) | When partial service is useful |
| Shed Load | Randomly drop requests | 503 Service Unavailable | Emergency overload protection |
| Throttle | Slow down the connection | N/A (network level) | Network-level shaping |
| Redirect | Send to alternative service | 307/302 Redirect | Geographic or tier-based routing |
429 Response Best Practices:
When rejecting with HTTP 429, provide helpful information:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699920000
X-RateLimit-Policy: 100;w=60

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit. Please retry after 30 seconds.",
    "retry_after": 30,
    "limit": 100,
    "window": 60,
    "documentation_url": "https://docs.example.com/rate-limits"
  }
}
```
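A framework-agnostic sketch of building such a response on the server side (the function name and return shape are illustrative assumptions):

```python
import json
import time

def build_429_response(limit: int, window: int, reset_epoch: int) -> tuple[int, dict, str]:
    """Build a (status, headers, body) triple for a rate-limited request."""
    retry_after = max(0, int(reset_epoch - time.time()))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
        "X-RateLimit-Policy": f"{limit};w={window}",
    }
    body = json.dumps({
        "error": {
            "code": "rate_limit_exceeded",
            "message": f"You have exceeded the rate limit. Please retry after {retry_after} seconds.",
            "retry_after": retry_after,
            "limit": limit,
            "window": window,
        }
    })
    return 429, headers, body
```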
Header Standards:
| Header | Description | Standard |
|---|---|---|
| Retry-After | Seconds until retry is safe | RFC 7231 (now RFC 9110) |
| X-RateLimit-Limit | Request limit | De facto standard |
| X-RateLimit-Remaining | Requests remaining | De facto standard |
| X-RateLimit-Reset | Unix timestamp when limit resets | De facto standard |
| RateLimit-* | IETF standardized versions | IETF draft (httpapi working group) |
Well-behaved clients should implement exponential backoff when rate limited. First retry after 1s, then 2s, 4s, 8s, etc., with jitter. This prevents thundering herd problems where many clients retry simultaneously when a limit window resets.
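A client-side sketch of backoff with full jitter that also honors the server's Retry-After hint (it uses the third-party `requests` library; the URL handling and retry cap are illustrative):

```python
import random
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5):
    """Retry on 429 with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's hint when present (assumed to be seconds),
        # otherwise back off exponentially: 1s, 2s, 4s, 8s, ...
        retry_after = response.headers.get("Retry-After")
        base_delay = float(retry_after) if retry_after else 2 ** attempt
        time.sleep(random.uniform(0, base_delay))  # full jitter avoids synchronized retries
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")
```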
Graceful Degradation Example:
Instead of rejecting, provide reduced service:
```python
def get_user_recommendations(user_id: str):
    # Check rate limits for expensive operations
    if not rate_limiter.allow(f"recommendations:{user_id}",
                              window=60, limit=10):
        # Degraded: return cached/simpler recommendations
        return get_cached_recommendations(user_id)

    # Full service: compute personalized recommendations
    return compute_ml_recommendations(user_id)
```
This maintains perceived availability while protecting expensive backends.
Choosing the right identity for rate limiting is crucial—too broad and abusers affect legitimate users; too narrow and abusers create many identities.
Identity Dimensions:
| Identity | Pros | Cons | Best For |
|---|---|---|---|
| IP Address | No auth required, catches bots | NAT shares IPs, VPNs bypass | Anonymous/unauthenticated endpoints |
| User ID | Accurate per-user limits | Only works for authenticated | API quotas, user-facing features |
| API Key | Per-application limits | Keys can be shared/leaked | Developer APIs, B2B services |
| Device Fingerprint | Tracks across identities | Privacy concerns, cat-and-mouse | Fraud prevention, abuse detection |
| Organization/Tenant | Org-level quotas | Unfair to larger orgs | Multi-tenant SaaS |
| Compound | Multi-dimensional | Complex to configure | Defense in depth |
Layered Identity Strategy:
Production systems typically use multiple identity layers:
```python
def check_rate_limits(request):
    """
    Layered rate limiting by multiple identities.
    Request is allowed only if ALL limits pass.
    """
    checks = []

    # Layer 1: Global limit (protect service)
    checks.append(limiter.allow(
        key="global",
        limit=1_000_000,
        window=60
    ))

    # Layer 2: Per-IP limit (stop unauthenticated abuse)
    checks.append(limiter.allow(
        key=f"ip:{request.ip}",
        limit=1000,
        window=60
    ))

    # Layer 3: Per-user limit (if authenticated)
    if request.user_id:
        user_tier = get_user_tier(request.user_id)
        checks.append(limiter.allow(
            key=f"user:{request.user_id}",
            limit=TIER_LIMITS[user_tier],
            window=60
        ))

    # Layer 4: Per-endpoint limit (protect expensive operations)
    endpoint_limit = ENDPOINT_LIMITS.get(request.endpoint)
    if endpoint_limit:
        checks.append(limiter.allow(
            key=f"endpoint:{request.endpoint}:{request.user_id or request.ip}",
            limit=endpoint_limit,
            window=60
        ))

    # All must pass
    return all(check[0] for check in checks)
```
Handling Shared IP Addresses:
Large organizations (universities, enterprises) often place thousands of users behind a single NAT IP, so a strict per-IP limit can block legitimate users. Where possible, combine IP-based limits with per-user or per-session identity, or grant known shared IPs higher thresholds.
IPv6 gives each device effectively unlimited addresses. Rate limiting by /128 (single address) is ineffective—limit by /56 or /48 (ISP-assigned prefix) instead. Monitor for clients rotating through large IPv6 blocks to evade limits.
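One way to apply this, sketched below with Python's ipaddress module, is to normalize IPv6 clients to their delegated prefix before building the rate limit key (the /56 default is an assumption; use whatever prefix length matches your threat model):

```python
import ipaddress

def rate_limit_key_for_ip(ip: str, v6_prefix: int = 56) -> str:
    """Derive a rate-limit key from a client IP (illustrative sketch).

    IPv4 addresses are used as-is; IPv6 addresses are collapsed to their
    ISP-assigned prefix so a client cannot evade limits by rotating through
    addresses within one delegated block.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version == 6:
        network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
        return f"ip6:{network.network_address}/{v6_prefix}"
    return f"ip4:{addr}"
```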
Fixed rate limits are simple but wasteful—they must be set for worst-case load, leaving capacity unused during normal operation. Adaptive rate limiting adjusts limits based on current system state.
AIMD (Additive Increase, Multiplicative Decrease):
Borrowed from TCP congestion control:
```
AIMD Rate Limiting:

State:
  current_limit: starts at initial_limit

Every adjustment_period:
  success_rate = successful_requests / total_requests

  if success_rate > target_success_rate:
    # System healthy, increase limits
    current_limit = min(max_limit, current_limit + additive_increase)
  else:
    # System stressed, back off
    current_limit = max(min_limit, current_limit × multiplicative_decrease)
```
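A compact sketch of such an AIMD controller (the parameter defaults are illustrative); it would sit beside the rate limiter and periodically recompute the limit the limiter enforces:

```python
class AIMDLimitController:
    """Adjust a rate limit with additive increase / multiplicative decrease.

    Call record(success) per request and adjust() once per adjustment period.
    """

    def __init__(self, initial_limit: float, min_limit: float = 10,
                 max_limit: float = 10_000, additive_increase: float = 10,
                 multiplicative_decrease: float = 0.5,
                 target_success_rate: float = 0.99):
        self.current_limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.additive_increase = additive_increase
        self.multiplicative_decrease = multiplicative_decrease
        self.target_success_rate = target_success_rate
        self.successes = 0
        self.total = 0

    def record(self, success: bool) -> None:
        self.total += 1
        self.successes += int(success)

    def adjust(self) -> float:
        if self.total:
            success_rate = self.successes / self.total
            if success_rate > self.target_success_rate:
                # Healthy: probe for more capacity, additively.
                self.current_limit = min(self.max_limit,
                                         self.current_limit + self.additive_increase)
            else:
                # Stressed: back off quickly, multiplicatively.
                self.current_limit = max(self.min_limit,
                                         self.current_limit * self.multiplicative_decrease)
        self.successes = 0
        self.total = 0
        return self.current_limit
```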
Load Shedding at Scale:
When systems approach overload, intelligent load shedding preserves function:
```python
import random


class AdaptiveLoadShedder:
    """
    Sheds load based on system health metrics.
    """

    def __init__(self, target_latency_ms: int = 100):
        self.target_latency = target_latency_ms
        self.shed_probability = 0.0

    def update(self, current_latency_ms: float):
        """Update shed probability based on latency."""
        if current_latency_ms > self.target_latency * 2:
            # Way over target: aggressive shedding
            self.shed_probability = min(0.9, self.shed_probability + 0.1)
        elif current_latency_ms > self.target_latency:
            # Over target: gradual increase
            self.shed_probability = min(0.5, self.shed_probability + 0.05)
        else:
            # Under target: gradual recovery
            self.shed_probability = max(0.0, self.shed_probability - 0.01)

    def should_shed(self, request_priority: int = 0) -> bool:
        """
        Determine if request should be shed.
        Higher priority = less likely to shed.
        """
        adjusted_probability = self.shed_probability / (request_priority + 1)
        return random.random() < adjusted_probability
```
Priority-Based Shedding:
Not all requests are equal—shed low-priority first:
| Priority | Examples | Shed Order |
|---|---|---|
| Critical | Health checks, auth tokens | Never shed |
| High | User-facing, purchased tier | Shed last |
| Normal | Standard API calls | Normal shedding |
| Low | Analytics, background sync | Shed first |
| Best-effort | Prefetch, speculative | Shed aggressively |
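As an illustrative use of the AdaptiveLoadShedder above, the priority numbers here are assumptions chosen so that higher tiers divide the shed probability more aggressively:

```python
# Illustrative tier-to-priority mapping (not a standard); higher numbers
# shrink the shed probability in should_shed().
PRIORITY = {"critical": 100, "high": 3, "normal": 1, "low": 0}

shedder = AdaptiveLoadShedder(target_latency_ms=100)
shedder.update(current_latency_ms=250)   # well over target: shedding ramps up

for tier, prio in PRIORITY.items():
    dropped = sum(shedder.should_shed(request_priority=prio) for _ in range(10_000))
    print(f"{tier:>9}: shed {dropped / 100:.1f}% of requests")
```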
Libraries like Google's Guava RateLimiter smooth request admission; adaptive load shedders go further with techniques inspired by CoDel (Controlled Delay) queue management. Rather than just limiting rate, they ensure queued requests don't wait too long. Old requests are dropped because their clients have likely given up—processing them wastes resources and delivers unusable responses.
Production experience has identified patterns that work well and anti-patterns to avoid.
Pattern: Defense in Depth
Apply rate limiting at multiple layers: edge/CDN, load balancer or API gateway, individual services, and the data tier. Traffic that slips past one layer is caught by the next, and each layer can enforce limits appropriate to what it protects.
Pattern: Token Budget System
Instead of simple request counts, assign costs to operations:
```python
OPERATION_COSTS = {
    'GET /users': 1,
    'GET /search': 5,            # More expensive
    'POST /export': 50,          # Very expensive
    'POST /ml-inference': 100,   # Resource-intensive
}

def check_budget(user_id: str, operation: str) -> bool:
    cost = OPERATION_COSTS.get(operation, 1)
    # User has 1000 tokens per minute
    return token_bucket.consume(
        key=f"budget:{user_id}",
        tokens=cost,
        bucket_size=1000,
        refill_rate=1000/60
    )
```
This ensures expensive operations are limited more aggressively while allowing many cheap operations.
Pattern: Separate Rate Limit Types
Different limit categories serve different purposes and should be configured, monitored, and adjusted independently; for example, abuse-prevention limits (per IP), business quotas (per plan tier), and capacity-protection limits (per endpoint).
Track what percentage of your capacity is actually used. If limits are rarely hit, they may be too high (wasting resources) or traffic patterns changed. If limits are constantly hit, they may be too low or you have a scaling problem. Good limits should be hit occasionally—enough to provide protection without hampering normal use.
Examining how major platforms implement rate limiting provides practical insights.
Case Study 1: GitHub API
GitHub implements sophisticated per-resource rate limiting:
Authenticated users:
- Core API: 5,000 requests per hour
- Search API: 30 requests per minute
- GraphQL: 5,000 points per hour (complexity-based)
OAuth Apps:
- 5,000 requests per hour per user
- Can request higher limits
GitHub Actions:
- Separate limits per workflow
- Concurrency limits per repository
Key insight: Different resources have different limits—search is expensive so gets stricter limits.
Case Study 2: Cloudflare Rate Limiting
Cloudflare operates at the edge, processing millions of requests per second.
Key lesson: At extreme scale, exact counting is impossible—sampling and estimation become necessary.
Case Study 3: Netflix Circuit Breakers
Netflix combines rate limiting with circuit breakers (the Hystrix pattern): when a backend degrades, the circuit opens and callers fail fast or fall back instead of piling more work onto a struggling service.
This defense-in-depth approach handles both traffic quantity and backend health.
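A minimal sketch of how the two mechanisms compose (this is not Hystrix itself; the thresholds and helper names are illustrative): the rate limiter rejects excess traffic, while the circuit breaker fails fast when the backend looks unhealthy.

```python
import time

class SimpleCircuitBreaker:
    """Minimal circuit breaker sketch."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_timeout:
            return True          # half-open: let a trial request through
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()


def call_backend(limiter, breaker, key: str, do_request):
    """Combine a rate limit check (traffic quantity) with a circuit breaker (backend health)."""
    if not limiter.allow(key, window_seconds=1, limit=100)[0]:
        raise RuntimeError("rate limited")    # too much traffic
    if not breaker.allow():
        raise RuntimeError("circuit open")    # backend unhealthy, fail fast
    try:
        result = do_request()
        breaker.record(success=True)
        return result
    except Exception:
        breaker.record(success=False)
        raise
```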
The best rate limiting implementations are transparent. Companies like Stripe, GitHub, and Twilio publish their limits, explain the rationale, provide clear error messages, and offer ways to increase limits. This transparency builds trust and helps developers build resilient applications.
We've comprehensively explored rate limiting—the practice of applying traffic shaping principles to protect systems, ensure fairness, and implement business policies.
What's Next:
With rate limiting mastered, we'll complete our traffic shaping journey by exploring QoS (Quality of Service) Implementation—how traffic shaping technologies are combined to deliver differentiated service quality across network infrastructure.
You now understand rate limiting comprehensively—from algorithms and distribution challenges to response strategies and real-world implementations. This knowledge is directly applicable to designing resilient, fair, and secure distributed systems.