On February 28, 2017, a single misconfigured command at a major cloud provider triggered a cascading failure that took down a significant portion of the internet. Websites from Trello to Quora became unreachable. The root cause? A single incorrect input to a routine maintenance script, which pushed the system far beyond what it could handle.
This incident illustrates a fundamental truth about distributed systems: without rate limiting, a single misbehaving client—whether malicious or accidental—can bring down services that millions depend upon.
Rate limiting is the invisible shield that stands between your API and chaos. It is not merely a defensive mechanism; it is a foundational architectural pattern that enables fair resource allocation, predictable system behavior, and sustainable service operation at scale.
By the end of this page, you will understand why rate limiting is non-negotiable for production systems, the threats it mitigates, the business value it provides, and the core principles that guide effective rate limiting design. This foundation prepares you for the algorithmic deep-dives in subsequent pages.
At its core, rate limiting is the practice of controlling the rate at which clients can make requests to a service. This seemingly simple concept has profound implications for system reliability, security, and economics.
The Fundamental Problem:
Every system has finite capacity. Whether it's CPU cycles, memory, database connections, network bandwidth, or downstream service capacity—resources are bounded. When demand exceeds capacity, systems degrade or fail entirely.
Without rate limiting, you face a tragedy of the commons: each client, acting in their own interest, may consume more resources than their fair share, ultimately harming all users—including themselves.
When services become slow or return errors, well-intentioned retry logic kicks in. But retries add load to an already stressed system. Without rate limiting, retries from thousands of clients can turn a minor slowdown into a complete outage. This is why rate limiting is often the first line of defense—it prevents the initial overload that triggers the retry cascade.
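The retry-cascade problem has a client-side counterpart: well-behaved clients should back off rather than hammer a struggling service. The sketch below (an illustrative example, not from the source) shows exponential backoff with full jitter that honors the server's `Retry-After` hint when one is provided:

```python
import random
import time

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)], so retries from many clients
    spread out instead of arriving in synchronized waves."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(do_request, max_attempts=5):
    """do_request() is assumed to return (status, retry_after_seconds_or_None, body).
    Retries on 429 and 5xx; returns the final (status, body)."""
    for attempt in range(max_attempts):
        status, retry_after, body = do_request()
        if status != 429 and status < 500:
            return status, body
        # Honor the server's Retry-After if present; otherwise back off with jitter.
        delay = retry_after if retry_after is not None else backoff_delay(attempt)
        time.sleep(delay)
    return status, body
```

Jitter matters here: without it, thousands of clients that failed at the same instant would all retry at the same instant, recreating the very spike that caused the failure.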
Rate limiting addresses a spectrum of threats, from accidental misuse to sophisticated attacks. Understanding these threats helps you design appropriate rate limiting strategies for each.
| Threat Category | Description | How Rate Limiting Helps |
|---|---|---|
| Volumetric DoS | Overwhelming the system with sheer request volume | Caps total requests, rejecting excess before they consume resources |
| Application-Layer DoS | Targeting expensive endpoints (login, search, checkout) | Per-endpoint limits protect resource-intensive operations |
| Credential Stuffing | Automated attempts to log in with stolen credentials | Limits login attempts per IP/user, slowing attackers |
| Web Scraping | Automated extraction of data at rates harmful to the service | Throttles request rates to prevent bulk data extraction |
| API Abuse | Clients exceeding fair usage, intentionally or not | Enforces contractual limits and fair resource sharing |
| Brute Force Attacks | Systematic attempts to guess passwords or keys | Throttling attempts makes brute force impractically slow |
| Resource Exhaustion | Consuming finite resources (connections, memory, CPU) | Ensures capacity is reserved for legitimate traffic |
Defense in Depth:
Rate limiting is not a silver bullet. It works best as part of a layered security strategy, with enforcement at multiple levels of the stack, from the CDN edge down to the application itself.
The API gateway is the ideal location for rate limiting because it serves as the single entry point for all traffic, has visibility into request patterns, and can make decisions before requests consume backend resources.
Rate limiting isn't just a technical necessity—it delivers tangible business value across multiple dimensions. Understanding this helps justify investment in robust rate limiting infrastructure.
Without rate limiting, you must provision infrastructure for worst-case traffic, which might be 100x your average. With rate limiting, you provision for your defined limits plus a safety margin. This difference can represent millions of dollars in infrastructure costs for high-traffic services.
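To make the provisioning argument concrete, here is a back-of-the-envelope calculation. All numbers are illustrative assumptions, not figures from the source:

```python
# Illustrative capacity math (all numbers are assumed for the example)
avg_rps = 1_000                      # average traffic
worst_case_rps = 100 * avg_rps       # unthrottled spike: the "100x" worst case
enforced_limit_rps = 3 * avg_rps     # rate limit set at 3x average for headroom
safety_margin = 1.5                  # provision 50% above the enforced limit
server_capacity_rps = 500            # requests/second one server can handle

# Without rate limiting: provision for the worst case.
servers_unlimited = worst_case_rps / server_capacity_rps
# With rate limiting: provision for the enforced limit plus margin.
servers_limited = enforced_limit_rps * safety_margin / server_capacity_rps

print(servers_unlimited, servers_limited)  # 200.0 vs 9.0 servers
```

Even with generous headroom, the limited fleet is more than 20x smaller, which is where the infrastructure savings come from.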
Effective rate limiting is guided by principles that balance protection with usability. Violating these principles leads to systems that either fail to protect or frustrate legitimate users.
```http
# Standard rate-limit headers
# (the 429 status comes from RFC 6585; X-RateLimit-* is a widely used
#  convention, being standardized as the IETF draft RateLimit header fields)
# Include these in EVERY API response for transparency

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000        # Maximum requests allowed in window
X-RateLimit-Remaining: 847     # Requests remaining in current window
X-RateLimit-Reset: 1609459200  # Unix timestamp when window resets
Retry-After: 3600              # Seconds to wait (only on 429 responses)

# Example 429 response
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1609459200
Retry-After: 60

{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded the rate limit of 1000 requests per hour",
  "retry_after_seconds": 60,
  "documentation_url": "https://api.example.com/docs/rate-limits"
}
```

Rate limiting can be implemented at various layers of the stack. The API gateway is often the optimal location, though understanding the trade-offs helps you make informed architectural decisions.
| Layer | Advantages | Disadvantages |
|---|---|---|
| CDN/Edge | Stops attacks before they reach your infrastructure; global distribution | Limited visibility into application context; coarse-grained |
| API Gateway | Central enforcement point; full request visibility; rich policy support | Single point of failure if not designed for HA; added latency |
| Application | Full business context; can implement complex rules | Each service must implement; requests already consumed resources |
| Database | Protects the most critical resource | Too late—request has already traversed the entire stack |
The Gateway Sweet Spot:
The API gateway occupies the ideal position for rate limiting: it is the single entry point for all traffic, it has full visibility into each request (user, endpoint, headers), and it can reject excess requests before they consume any backend resources.
Best practice is to implement rate limiting at multiple layers. The CDN handles volumetric attacks, the gateway enforces application-level limits, and services implement business-specific throttling. Each layer catches what the previous layer missed.
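At the gateway or application layer, the enforcement step usually boils down to "check a counter, then either pass the request through or return 429 with the standard headers." A minimal fixed-window sketch of that check (illustrative, in-memory; header names follow the X-RateLimit-* convention shown earlier):

```python
import time

class WindowLimiter:
    """Minimal fixed-window limiter that also reports the standard
    rate-limit headers alongside each allow/deny decision."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.windows = {}  # key -> [window_start, count]

    def check(self, key, now=None):
        """Returns (allowed, headers). On denial, headers include Retry-After."""
        now = time.time() if now is None else now
        start, count = self.windows.get(key, (now, 0))
        if now - start >= self.window:       # window expired: start a new one
            start, count = now, 0
        allowed = count < self.limit
        if allowed:
            count += 1
        self.windows[key] = [start, count]
        reset = int(start + self.window)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(self.limit - count, 0)),
            "X-RateLimit-Reset": str(reset),
        }
        if not allowed:
            headers["Retry-After"] = str(max(reset - int(now), 0))
        return allowed, headers
```

A gateway would call `check()` once per request, keyed by IP, user, or endpoint, and copy the returned headers onto the response either way, so clients can self-throttle before ever hitting the limit.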
Setting rate limits is both art and science. Limits that are too strict frustrate legitimate users; limits that are too generous fail to protect the system. Here's a systematic approach to determining appropriate limits.
```yaml
# Example rate limit configuration
# Demonstrates multi-dimensional rate limiting

rate_limits:
  # Global limits protect overall system capacity
  global:
    requests_per_second: 10000
    burst_size: 15000

  # Per-IP limits catch automated abuse
  per_ip:
    anonymous:
      requests_per_minute: 60
      burst_size: 20
    authenticated:
      requests_per_minute: 300
      burst_size: 50

  # Per-user limits enforce fair usage
  per_user:
    free_tier:
      requests_per_hour: 1000
      requests_per_day: 10000
    pro_tier:
      requests_per_hour: 10000
      requests_per_day: 100000
    enterprise:
      requests_per_hour: 100000
      custom_daily_limit: true

  # Per-endpoint limits protect expensive operations
  endpoints:
    "/api/search":
      requests_per_minute: 30
      cost_weight: 5            # Counts as 5 requests toward user limit
    "/api/export":
      requests_per_hour: 10
      cost_weight: 50
    "/api/login":
      per_ip_per_minute: 5      # Strict limit on auth endpoints
    "/api/v1/*":
      requests_per_second: 100  # Catch-all for standard endpoints
```

It's easier to increase limits than to decrease them. Start with conservative limits, monitor for legitimate users hitting them, and adjust upward based on data. Decreasing limits after users depend on them often causes complaints and integration breakage.
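The `cost_weight` idea in the configuration deserves a closer look: expensive endpoints consume multiple "request units" from the same budget. A sketch of how a gateway might account for this, with hypothetical weights mirroring the config (names and numbers are illustrative):

```python
# Hypothetical per-endpoint weights mirroring the configuration above;
# endpoints not listed cost 1 unit.
COST_WEIGHTS = {"/api/search": 5, "/api/export": 50}

class WeightedBudget:
    """Charges each request's cost weight against a per-user hourly budget
    (e.g. free tier: 1000 request-units per hour)."""
    def __init__(self, hourly_budget):
        self.hourly_budget = hourly_budget
        self.spent = {}  # (user, hour_bucket) -> units consumed so far

    def charge(self, user, endpoint, now):
        bucket = (user, int(now // 3600))      # which hour the request falls in
        cost = COST_WEIGHTS.get(endpoint, 1)
        if self.spent.get(bucket, 0) + cost > self.hourly_budget:
            return False                       # would exceed the budget: 429
        self.spent[bucket] = self.spent.get(bucket, 0) + cost
        return True
```

This lets one limit cover heterogeneous traffic: a user can make many cheap calls or a few expensive exports, but not both at full rate.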
We've established the foundational case for rate limiting. Let's consolidate the key insights:

- Every system has finite capacity; when demand exceeds it, systems degrade or fail, and retry storms can turn a slowdown into an outage.
- Rate limiting mitigates a spectrum of threats, from accidental misuse and API abuse to volumetric DoS, credential stuffing, and scraping.
- It delivers business value: you provision for your defined limits plus a safety margin rather than for worst-case traffic.
- Enforce limits at multiple layers, with the API gateway as the central enforcement point.
- Communicate limits transparently through standard headers and clear 429 responses; start with conservative limits and adjust upward based on data.
What's Next:
Now that we understand why rate limiting matters, we'll explore how to implement it. The next page dives deep into the Token Bucket Algorithm—a battle-tested approach that elegantly handles both sustained rates and burst traffic patterns.
You now understand the critical importance of rate limiting in API gateway architecture. It's not just about security: it's about building sustainable, predictable, and economically viable services at scale.