Every successful API in production has an invisible guardian standing between clients and servers—a Rate Limiter. This critical component silently defends against abuse, ensures fair resource distribution, and maintains system stability even when millions of requests surge simultaneously.
Without rate limiting, a single misbehaving client could monopolize server resources, denial-of-service attacks would be trivially easy, and runaway scripts could generate bills that bankrupt companies. Rate limiting isn't optional for production systems—it's existential.
In this module, we'll design a production-grade rate limiter from first principles, covering the algorithms, architecture, and operational considerations that separate toy implementations from systems protecting real-world APIs at companies like Stripe, GitHub, and Cloudflare.
By the end of this page, you'll understand: (1) Why rate limiting is essential for any production API, (2) The complete functional requirements for a rate limiter, (3) Non-functional requirements including latency, throughput, and fault tolerance, (4) Back-of-envelope estimations for scale, and (5) The key design decisions that shape rate limiter architecture.
Rate limiting controls how frequently clients can make requests to an API within a given time window. This seemingly simple concept addresses multiple critical concerns that every production system must handle. Let's understand the fundamental motivations driving rate limiter design.
In 2019, a startup's developer accidentally deployed a script with an infinite loop calling their cloud provider's API. Without rate limiting on their own internal services, the script ran for 6 hours before detection, generating a $72,000 bill from API and compute charges. Proper rate limiting would have stopped this within seconds.
The economics of rate limiting:
Consider a simple example: An API server can handle 10,000 requests per second (rps) sustainably. During normal operation, you have 1,000 clients each making 5 rps = 5,000 rps total (50% utilization, healthy headroom).
Now imagine a bug ships to 100 of those clients, causing each to send 100 rps instead of 5. Total load jumps to 900 × 5 + 100 × 100 = 14,500 rps, which is 145% of capacity, and the system falls over.
Rate limiting at 10 rps per client means those 100 buggy clients together consume at most 1,000 rps; the system remains healthy while you notify affected clients to fix their code.
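As a sanity check, here is the same arithmetic as a runnable sketch; every number is an assumed value from the scenario above:

```typescript
// Back-of-envelope load model; all numbers are assumptions from the text.
const CAPACITY_RPS = 10_000;
const HEALTHY_CLIENTS = 900;   // 1,000 clients total, 100 of them buggy
const BUGGY_CLIENTS = 100;
const NORMAL_RATE = 5;         // rps per well-behaved client
const BUGGY_RATE = 100;        // rps per buggy client
const PER_CLIENT_CAP = 10;     // enforced by the rate limiter

const uncapped = HEALTHY_CLIENTS * NORMAL_RATE + BUGGY_CLIENTS * BUGGY_RATE;
const capped =
  HEALTHY_CLIENTS * Math.min(NORMAL_RATE, PER_CLIENT_CAP) +
  BUGGY_CLIENTS * Math.min(BUGGY_RATE, PER_CLIENT_CAP);

console.log(`uncapped: ${uncapped} rps = ${(100 * uncapped) / CAPACITY_RPS}% of capacity`);
// => uncapped: 14500 rps = 145% of capacity (overloaded)
console.log(`capped:   ${capped} rps = ${(100 * capped) / CAPACITY_RPS}% of capacity`);
// => capped:   5500 rps = 55% of capacity (healthy)
```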
Before designing any system, we must precisely define what it needs to do. For a rate limiter, the functional requirements establish the core behaviors and capabilities that the system must provide. We'll analyze each requirement in depth, considering variations and edge cases.
| Requirement | Description | Example |
|---|---|---|
| Limit Request Rate | Restrict number of requests per time window per identity | Max 1000 requests per hour per API key |
| Identify Clients | Determine who is making requests for rate limit tracking | By API key, user ID, IP address, or combination |
| Reject/Allow Decision | Make a binary decision for each request | Return 200 OK or 429 Too Many Requests |
| Communicate Limits | Inform clients of their current state and limits | X-RateLimit-Remaining: 423 response header |
| Support Multiple Limits | Apply different limits at different granularities | 100/min AND 5000/hour per user |
| Reset Windows | Clear counters when time windows expire | Counter resets at top of each minute |
The rate limiter must accurately count requests and enforce configured limits. This sounds trivial but involves subtle complexity:
Time Window Definition: Is the window fixed (the counter resets at the top of each minute) or sliding (always covering the trailing 60 seconds)? Fixed windows are cheap to implement but can admit up to twice the limit across a window boundary; sliding windows are smoother but more expensive to track.
Counting Accuracy: Under concurrent requests, two servers must not both read a count of 999 and both admit a request against a limit of 1,000. Increments need to be atomic, which shapes the storage design.
Request Attribution: Which requests count against the limit: all of them, only successful ones, or only specific methods?
Most production rate limiters count all requests regardless of outcome. Counting only successful requests would allow attackers to make unlimited malformed requests. Counting only specific methods (like POST) would leave other methods unprotected. The safest default is counting everything, with the option to exclude health checks and monitoring endpoints.
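A minimal sketch of this "count everything, exclude only infrastructure endpoints" policy; the paths and function name are illustrative, not part of any real framework:

```typescript
// Count every request regardless of method or outcome; exclude only
// infrastructure endpoints. Paths here are hypothetical examples.
const EXCLUDED_PATHS = new Set(["/healthz", "/metrics"]);

function isCountable(path: string): boolean {
  return !EXCLUDED_PATHS.has(path);
}

console.log(isCountable("/api/search")); // true: counts toward the limit
console.log(isCountable("/healthz"));    // false: monitoring traffic excluded
```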
Rate limiting requires identifying 'who' is making requests. The choice of identity is critical:
By API Key: The natural identity for authenticated, server-to-server traffic; stable, explicit, and tied to a billing or tier relationship.
By User ID: Limits an end user consistently across devices and sessions, but only applies after authentication.
By IP Address: The only option for unauthenticated traffic; weak on its own, because NAT and corporate proxies put many users behind one IP while attackers rotate IPs.
By Composite Key: Combinations such as user + endpoint or API key + region, for fine-grained control over expensive resources.
Identity Hierarchy: Production systems often apply multiple limits simultaneously: a per-IP limit on unauthenticated requests, a per-key limit after authentication, and an organization-wide limit across all of a customer's keys. A request must pass every applicable level.
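One way to express this is a key-derivation function that returns every identity a request is limited under. The request shape and key format below are assumptions for illustration:

```typescript
interface IncomingRequest {
  apiKey?: string; // present on authenticated server-to-server calls
  userId?: string; // present once an end user is authenticated
  ip: string;      // always available
  path: string;
}

// Return every identity this request is limited under, most specific first.
// Emitting several keys per request is what implements the hierarchy:
// the request must pass the limit attached to each key.
function rateLimitKeys(req: IncomingRequest): string[] {
  const keys: string[] = [];
  if (req.apiKey) keys.push(`key:${req.apiKey}:${req.path}`); // composite: key + endpoint
  if (req.userId) keys.push(`user:${req.userId}`);
  keys.push(`ip:${req.ip}`); // fallback for unauthenticated traffic
  return keys;
}

console.log(rateLimitKeys({ apiKey: "ak_123", ip: "203.0.113.7", path: "/api/search" }));
// => [ "key:ak_123:/api/search", "ip:203.0.113.7" ]
```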
Real-world APIs need limits at multiple time scales and dimensions:
Time Dimensions:
Per-second: 50 requests/second (prevent bursting)
Per-minute: 1,000 requests/minute (sustained rate)
Per-hour: 20,000 requests/hour (capacity planning)
Per-day: 100,000 requests/day (quota management)
Resource Dimensions:
/api/search: 60 requests/minute (expensive computation)
/api/data: 1,000 requests/minute (cached reads)
/api/upload: 10 requests/minute (storage-intensive)
/api/webhook: 100 requests/hour (third-party integration)
Combining Limits: A single request might be evaluated against multiple limits: the per-second burst limit, the per-minute sustained limit, and an endpoint-specific limit. The request is allowed only if all of them pass; in effect, the most restrictive limit wins.
Limit Inheritance: Some systems support hierarchical limits where child entities inherit parent limits. An organization limit of 100K/hour applies to all users, with individual users getting 10K/hour within that.
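Here is a sketch of how such multi-dimensional rules might be declared and evaluated together. The rule values are the assumed examples from above, and reporting the first exhausted rule is one possible design:

```typescript
interface LimitRule {
  id: string;
  windowSeconds: number;
  maxRequests: number;
}

// One client's rules, mirroring the time dimensions listed above.
const rulesForKey: LimitRule[] = [
  { id: "per-second", windowSeconds: 1, maxRequests: 50 },
  { id: "per-minute", windowSeconds: 60, maxRequests: 1_000 },
  { id: "per-hour", windowSeconds: 3_600, maxRequests: 20_000 },
  { id: "per-day", windowSeconds: 86_400, maxRequests: 100_000 },
];

// A request passes only if every rule still has quota; the first exhausted
// rule is reported so the client knows which window to back off on.
function evaluate(
  counts: Map<string, number>,
  rules: LimitRule[],
): { allowed: boolean; violated?: string } {
  for (const rule of rules) {
    if ((counts.get(rule.id) ?? 0) >= rule.maxRequests) {
      return { allowed: false, violated: rule.id };
    }
  }
  return { allowed: true };
}

const counts = new Map([["per-second", 50], ["per-minute", 480]]);
console.log(evaluate(counts, rulesForKey)); // { allowed: false, violated: "per-second" }
```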
Non-functional requirements (NFRs) define how well the system performs its functions. For a rate limiter, NFRs are exceptionally important because the rate limiter sits in the critical path of every API request. Any deficiency in the rate limiter directly impacts the entire system.
The rate limiter is in the hot path of every API request. Consider the impact of latency:
| Rate Limiter Latency | Requests/sec | Aggregate Delay Added per Second of Traffic |
|---|---|---|
| 1ms | 1M | 1,000 seconds added delay |
| 0.1ms | 1M | 100 seconds added delay |
| 0.01ms | 1M | 10 seconds added delay |
Target: p50 < 0.1ms, p99 < 1ms, p99.9 < 5ms
To achieve this, the decision must be served from local memory: counters live in-process or in a co-located cache, the hot path makes no synchronous network calls, and cross-node synchronization happens asynchronously in the background.
At enterprise scale, rate limiters must handle enormous throughput:
Scale Targets: 1M+ decisions per second per node (the per-node budget in the NFR table below), scaling linearly as nodes are added.
Scaling Approach: Keep all counter state in memory and shard it by client ID so each client's counters live on exactly one node; add nodes as traffic grows, as sketched below.
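A sketch of the sharding idea under these assumptions, using simple hash-mod placement; a production deployment would use consistent hashing to minimize re-mapping when nodes join or leave:

```typescript
import { createHash } from "node:crypto";

// Place each client's counters on a fixed shard so all increments for one
// client stay on one node and never need cross-node coordination.
function shardFor(clientId: string, shardCount: number): number {
  const digest = createHash("sha256").update(clientId).digest();
  return digest.readUInt32BE(0) % shardCount;
}

console.log(shardFor("ak_123", 5)); // deterministic: same client, same shard
```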
What happens when the rate limiter itself has problems? This is a critical design decision:
Option A: Fail Open (Allow all traffic). Availability is preserved, but protection disappears exactly when the system may already be under stress.
Option B: Fail Closed (Reject all traffic). Protection is preserved, but a rate limiter outage becomes a full API outage.
Option C: Fail with Cached State (Allow based on last known state). A middle ground: decisions degrade gracefully but may be stale.
Option D: Local Fallback. Each application node enforces its own per-node limits whenever the central service is unreachable.
Most production systems use fail-open for distributed rate limiting but maintain local (per-node) rate limits as a fallback. This ensures that even during central rate limiter failures, no single client can overwhelm any individual application node. Services remain available while still protected.
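A sketch of this hybrid policy, with a hypothetical central check and a deliberately coarse per-node cap (window expiry omitted for brevity):

```typescript
interface Decision {
  allowed: boolean;
  source: "central" | "local-fallback";
}

// Coarse per-node cap used only when the central limiter is unreachable.
// A real fallback would also reset these counts periodically.
const localCounts = new Map<string, number>();
const LOCAL_CAP = 100;

async function checkWithFallback(
  clientId: string,
  centralCheck: (id: string) => Promise<boolean>,
): Promise<Decision> {
  try {
    return { allowed: await centralCheck(clientId), source: "central" };
  } catch {
    // Fail open, but bounded: no single client can overwhelm this node.
    const count = (localCounts.get(clientId) ?? 0) + 1;
    localCounts.set(clientId, count);
    return { allowed: count <= LOCAL_CAP, source: "local-fallback" };
  }
}
```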
| Requirement | Target | Rationale |
|---|---|---|
| p50 Latency | < 0.1ms | Imperceptible overhead on requests |
| p99 Latency | < 1ms | Tail latency still negligible |
| Availability | 99.99% | Better availability than APIs it protects |
| Throughput | 1M+ decisions/sec/node | Handle peak traffic with headroom |
| Memory per client | < 100 bytes | Support 100M clients with 10GB RAM |
| Recovery time | < 10 seconds | Fast recovery from node failures |
| Data staleness | < 1 second | Near-real-time accuracy |
Before designing the system, let's estimate the scale we need to support. We'll consider a rate limiter for a large API platform serving global traffic—similar to Stripe, Twilio, or GitHub's API infrastructure.
We're designing a rate limiter for a large API platform with: 10 million active API keys, 100 million API calls per hour at peak, serving requests globally across 10 geographic regions, with a 99.99% availability requirement.
Request Volume:
Peak requests per hour: 100,000,000
Peak requests per second: 100M / 3600 = ~28,000 rps
With 3x headroom for spikes: 84,000 rps target capacity
Per region (10 regions): 8,400 rps per region
Rate Limit Decisions: Each request requires rate limit evaluation:
Decisions per second: 28,000 (1 per request)
With multiple rules per decision (avg 5): 140,000 rule evaluations/sec
Per region: 14,000 rule evaluations/sec
Update Frequency: Each decision updates counters:
Counter updates per second: 28,000
With distributed sync (100ms batching): 10 sync batches/sec, ~2,800 updates per batch
Per-Client Storage (Token Bucket):
Client ID (hash): 8 bytes
Last refill time: 8 bytes
Token count: 8 bytes
Bucket size/rate: 8 bytes
Total per client: ~32 bytes
With overhead: ~50 bytes per bucket
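The token bucket algorithm itself is covered in the following pages; purely to make this storage layout concrete, here is a sketch of how the four fields might be used (field names assumed):

```typescript
// One bucket per (client, rule); the four numeric fields match the
// ~32-byte layout estimated above.
interface TokenBucket {
  lastRefillMs: number; // last refill timestamp
  tokens: number;       // current token count
  capacity: number;     // bucket size
  refillPerSec: number; // refill rate
}

// Lazy refill: tokens accrue as a function of elapsed time, so the state
// is touched only when a request arrives; no per-bucket timers needed.
function tryConsume(b: TokenBucket, nowMs: number, cost = 1): boolean {
  const elapsedSec = (nowMs - b.lastRefillMs) / 1_000;
  b.tokens = Math.min(b.capacity, b.tokens + elapsedSec * b.refillPerSec);
  b.lastRefillMs = nowMs;
  if (b.tokens < cost) return false;
  b.tokens -= cost;
  return true;
}
```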
Multiple Buckets per Client:
Average buckets per API key: 3 (by time window)
Endpoint-specific buckets: 5 common endpoints
Total buckets per API key: ~8 buckets
Storage per API key: 8 × 50 = 400 bytes
Total Active Storage:
Active API keys: 10,000,000
Storage per key: 400 bytes
Total: 10M × 400 = 4 GB
With 2x headroom: 8 GB
Per Region:
Total keys (replicated): 10,000,000
Storage: 8 GB per region
10 regions: 80 GB total across infrastructure
Synchronization Traffic:
Counter updates per second: 28,000
Update message size: ~50 bytes (key + delta)
Inter-region sync frequency: 100ms batches
Updates per batch: 2,800
Batch size: 2,800 × 50 = 140 KB per region
Sync with 9 other regions: 1.26 MB per batch
Sync bandwidth: 1.26 MB × 10/sec = 12.6 MB/sec
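A sketch of the delta-batching idea: accumulate increments locally, then flush them to peer regions every 100ms. Region names and the transport callback are placeholders:

```typescript
// Accumulate counter deltas locally; flush one batched message per peer
// region every 100ms instead of one message per request.
const pendingDeltas = new Map<string, number>();

function recordIncrement(counterKey: string, delta = 1): void {
  pendingDeltas.set(counterKey, (pendingDeltas.get(counterKey) ?? 0) + delta);
}

function flushToPeers(
  send: (region: string, batch: Record<string, number>) => void,
  peers: string[],
): void {
  if (pendingDeltas.size === 0) return;
  const batch = Object.fromEntries(pendingDeltas); // snapshot the deltas
  pendingDeltas.clear();
  for (const region of peers) send(region, batch);
}

// 28,000 updates/sec become ~10 batches/sec of ~2,800 deltas each.
setInterval(
  () => flushToPeers((_region, _batch) => { /* transport stub */ }, ["eu-west", "ap-south"]),
  100,
);
```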
Client Communication:
Rate limit headers: ~100 bytes per response
At 28,000 rps: 2.8 MB/sec additional header overhead
This is well within network capacity for modern infrastructure.
| Metric | Estimate | Notes |
|---|---|---|
| Peak RPS | 28,000 (84K with headroom) | Globally distributed |
| Rule evaluations/sec | 140,000 | Average 5 rules per request |
| Active clients | 10 million | API keys with recent activity |
| Storage per region | 8 GB | In-memory for speed |
| Cross-region sync | 12.6 MB/sec | Batched every 100ms |
| Decision latency budget | < 1ms p99 | Critical path |
| Nodes per region | 3-5 | For redundancy and load |
Based on our requirements and estimations, several key design decisions shape the rate limiter architecture: where limiting state lives (local to each node versus a shared store), how much counter accuracy to trade for latency, and how the system behaves when the limiter itself fails.
Based on these decisions, our rate limiter architecture will consist of:
Local Rate Limiter (per application node): in-memory checks on the hot path, doubling as the per-node fallback described earlier
Distributed Counter Service: the shared source of truth for counters, synchronized across regions in 100ms batches
Configuration Service: stores limit rules (per key, per endpoint, per tier) and hot-reloads them without redeployment
Analytics Pipeline: streams rate limit decisions for monitoring, alerting, and abuse detection, kept off the hot path
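Sketching the seams between these components as interfaces (all names and signatures are assumptions, not a final design):

```typescript
interface Rule { windowSeconds: number; maxRequests: number }

interface LocalLimiter   { check(key: string): boolean }                 // in-memory, hot path + fallback
interface CounterService { increment(key: string): Promise<number> }     // shared counts, synced in batches
interface ConfigService  { rulesFor(key: string): Rule[] }               // hot-reloadable limit rules
interface Analytics      { record(key: string, allowed: boolean): void } // async, off the hot path

// The request path touches only LocalLimiter state in memory; counter
// replication and analytics run asynchronously so the <1ms p99 budget holds.
```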
In the following pages, we'll dive deep into the algorithms (token bucket, sliding window), distributed rate limiting strategies, and client communication patterns.
Before implementation, let's define the interfaces our rate limiter will expose. These APIs serve both internal rate-checking and external configuration management.
```typescript
// Core rate limiting interface
interface RateLimiter {
  // Check if request is allowed and consume quota
  // Returns decision with metadata for headers
  checkLimit(request: RateLimitRequest): Promise<RateLimitDecision>;

  // Preview limit status without consuming quota
  getStatus(clientId: string, ruleId: string): Promise<LimitStatus>;

  // Administrative operations
  resetLimit(clientId: string, ruleId?: string): Promise<void>;
  overrideLimit(clientId: string, override: LimitOverride): Promise<void>;
}

interface RateLimitRequest {
  clientId: string;   // API key, user ID, or IP
  resource: string;   // Endpoint or resource being accessed
  cost?: number;      // Request weight (default 1)
  timestamp?: number; // Request time (default now)
}

interface RateLimitDecision {
  allowed: boolean;    // Is request permitted?
  remaining: number;   // Requests remaining in window
  limit: number;       // Total limit for this window
  resetAt: number;     // When window resets (Unix ms)
  retryAfter?: number; // Seconds to wait if denied
  rule: string;        // Which rule triggered limit
}

interface LimitStatus {
  current: number;    // Current count in window
  limit: number;      // Maximum allowed
  remaining: number;  // Requests remaining
  resetAt: number;    // Window reset time
  windowSize: number; // Window duration in seconds
}

interface LimitOverride {
  limit?: number;     // Override limit value
  expiresAt?: number; // When override expires
  reason: string;     // Audit trail
}
```

Following industry standards (RFC 6585, RFC 7231), our rate limiter communicates limits via HTTP headers:
Standard Headers:
```
X-RateLimit-Limit: 1000        # Maximum requests in window
X-RateLimit-Remaining: 423     # Requests remaining
X-RateLimit-Reset: 1609459200  # Unix timestamp of window reset
Retry-After: 37                # Seconds to wait (on 429)
```
Extended Headers (Optional):
```
X-RateLimit-Policy: 1000;w=3600;burst=50  # Policy details
X-RateLimit-Scope: user                   # What identity is limited
X-RateLimit-Resource: /api/search         # Which endpoint
```
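To tie the headers back to the RateLimitDecision interface above, here is a sketch of response decoration; the Response shape is a minimal stand-in, not a real framework API:

```typescript
interface RateLimitDecision {
  allowed: boolean;
  remaining: number;
  limit: number;
  resetAt: number;     // Unix ms, per the interface above
  retryAfter?: number; // seconds
}

// Minimal stand-in for a framework response object.
interface Response {
  setHeader(name: string, value: string): void;
  status(code: number): void;
}

function applyRateLimitHeaders(decision: RateLimitDecision, res: Response): void {
  res.setHeader("X-RateLimit-Limit", String(decision.limit));
  res.setHeader("X-RateLimit-Remaining", String(decision.remaining));
  res.setHeader("X-RateLimit-Reset", String(Math.floor(decision.resetAt / 1000))); // Unix seconds
  if (!decision.allowed) {
    res.setHeader("Retry-After", String(decision.retryAfter ?? 1));
    res.status(429); // Too Many Requests
  }
}
```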
These headers enable clients to monitor their remaining quota, back off proactively before hitting a limit, and schedule retries precisely after a 429 instead of hammering the API.
We've established a comprehensive understanding of what a production-grade rate limiter requires. The key takeaways: rate limiting is existential for production APIs; limits are enforced per identity across multiple time windows and resources; the limiter must add sub-millisecond latency and fail safely; and at our target scale (roughly 28,000 rps and 10 million clients), all state fits in memory with modest sync bandwidth.
What's Next:
In the following pages, we'll dive into the core algorithms that power rate limiting (token bucket, sliding window, and their variants), along with distributed rate limiting strategies and client communication patterns.
You now understand the comprehensive requirements for building a production-grade rate limiter. You can articulate why rate limiting is essential, define functional and non-functional requirements, estimate scale, and identify key design decisions. Next, we'll implement the core algorithms.