Every system has limits. Databases can handle only so many connections. Compute resources are finite. Network bandwidth is constrained. When clients exceed these limits—whether through legitimate traffic spikes, misbehaving applications, or malicious attacks—systems fail.
Rate limiting is the mechanism that protects your system from being overwhelmed. It controls how many requests a client can make in a given time period, ensuring fair access for all users while protecting backend services from abuse.
At the API Gateway, rate limiting is implemented at the edge—before requests reach your services. This is the most efficient place to reject excess traffic, preventing wasted resources on requests that would eventually fail anyway. This page provides a comprehensive exploration of rate limiting strategies, algorithms, and implementation patterns for production-grade systems.
By the end of this page, you will understand rate limiting algorithms (token bucket, leaky bucket, sliding window), distributed rate limiting challenges, client identification strategies, response handling best practices, and how to design rate limiting policies that balance protection with user experience.
Rate limiting serves multiple purposes, from operational protection to business policy enforcement.
Protection Goals: keep backend services healthy under traffic spikes, contain misbehaving applications, and blunt malicious traffic such as denial-of-service attacks.
Business Goals: enforce usage tiers and quotas, support monetization of API access, and guarantee fair usage across customers.
Rejecting a request at the gateway costs microseconds and trivial resources. Rejecting it after authentication, authorization, and partial processing costs orders of magnitude more. Rate limiting at the edge is the most efficient protection.
Several algorithms exist for implementing rate limiting, each with different characteristics. Understanding these algorithms helps you choose the right one for your use case.
Algorithm Comparison:
| Algorithm | How It Works | Pros | Cons |
|---|---|---|---|
| Token Bucket | Tokens added at fixed rate; each request consumes token | Allows bursts, smooth average rate | Memory for bucket state per client |
| Leaky Bucket | Requests enter queue; processed at fixed rate | Smooth output rate, no bursts | Latency in queue; may drop requests |
| Fixed Window | Count requests in fixed time windows (e.g., per minute) | Simple, low memory | Burst at window edges |
| Sliding Window Log | Track timestamp of each request; count in sliding window | Accurate, no edge bursts | Memory intensive (stores all timestamps) |
| Sliding Window Counter | Weighted average of current and previous window | Memory efficient, smooth | Slight approximation |
```typescript
// Token Bucket Algorithm
interface TokenBucket {
  tokens: number;
  lastRefill: number;  // Timestamp
  capacity: number;    // Maximum tokens
  refillRate: number;  // Tokens per second
}

function checkRateLimit(bucket: TokenBucket): boolean {
  const now = Date.now();
  const elapsed = (now - bucket.lastRefill) / 1000;

  // Refill tokens based on elapsed time
  bucket.tokens = Math.min(
    bucket.capacity,
    bucket.tokens + elapsed * bucket.refillRate
  );
  bucket.lastRefill = now;

  // Check if we have a token to consume
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true;  // Request allowed
  }
  return false;   // Rate limited
}

// Example: 100 requests/minute with burst capacity of 20
const bucket: TokenBucket = {
  tokens: 20,            // Initial burst capacity
  lastRefill: Date.now(),
  capacity: 20,          // Max burst
  refillRate: 100 / 60,  // ~1.67 tokens per second
};
```

Token bucket is the most commonly used algorithm because it naturally allows short bursts (improving user experience) while maintaining a consistent average rate. It's used by AWS API Gateway, Kong, and most commercial API gateways.
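The sliding window counter from the comparison table can be sketched just as compactly. The version below is illustrative (the interface and field names are assumptions, not a specific gateway's implementation): it estimates the count in the sliding window by weighting the previous window's count by how much of it still overlaps.

```typescript
// Sliding window counter: weight the previous window's count by its overlap
// with the sliding window, then add the current window's count.
interface SlidingWindowCounter {
  windowMs: number;      // e.g. 60_000 for a per-minute limit
  currentStart: number;  // start timestamp of the current window
  currentCount: number;
  previousCount: number;
}

function allowRequest(w: SlidingWindowCounter, limit: number): boolean {
  const now = Date.now();

  // Roll the window forward if the current one has expired
  if (now - w.currentStart >= w.windowMs) {
    const windowsPassed = Math.floor((now - w.currentStart) / w.windowMs);
    w.previousCount = windowsPassed === 1 ? w.currentCount : 0;
    w.currentCount = 0;
    w.currentStart += windowsPassed * w.windowMs;
  }

  // Fraction of the previous window still inside the sliding window
  const overlap = 1 - (now - w.currentStart) / w.windowMs;
  const estimated = w.previousCount * overlap + w.currentCount;

  if (estimated < limit) {
    w.currentCount += 1;
    return true;  // Request allowed
  }
  return false;   // Rate limited
}
```

The approximation assumes requests were evenly spread across the previous window, which is why the table lists it as memory efficient but slightly inexact.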
In production, you don't have one gateway instance—you have many, often across multiple data centers. Each instance must enforce the same rate limits, which requires shared state.
The Challenge:
Imagine a client with a limit of 100 requests/minute hitting a system with 10 gateway instances. If each instance tracks limits locally, the client could make 1000 requests/minute (100 per instance). This defeats the purpose entirely.
Solutions: centralize counters in a shared store such as Redis (the most common approach, shown below), or keep approximate local counters that sync in the background, trading some accuracy for lower latency.
```yaml
# Kong rate limiting with Redis backend
plugins:
  - name: rate-limiting
    config:
      # Rate limit: 100 requests per minute
      minute: 100

      # Use Redis for distributed state
      policy: redis
      redis_host: redis-cluster.internal
      redis_port: 6379
      redis_password: ${REDIS_PASSWORD}
      redis_database: 0
      redis_timeout: 2000

      # How to identify clients
      limit_by: consumer  # or: ip, header, credential

      # Fault tolerance: if Redis is down
      fault_tolerant: true  # Allow requests if Redis unavailable

      # Include rate limit headers in response
      hide_client_headers: false

      # Redis connection pooling
      redis_pool_size: 10

---
# Per-consumer override
plugins:
  - name: rate-limiting
    consumer: premium-client
    config:
      minute: 10000
      policy: redis
```

Every rate limit check adds latency. With Redis, expect 0.5-2ms per check. For high-throughput systems, consider local caching with background sync, or accept slightly less accurate limits for lower latency. Always measure the impact on P99 latency.
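Under the hood, a distributed check boils down to an atomic counter in the shared store. Here is a minimal sketch of a fixed-window check using the ioredis client; the key scheme and client setup are assumptions for illustration, not Kong's internal implementation.

```typescript
import Redis from "ioredis";

const redis = new Redis({ host: "redis-cluster.internal", port: 6379 });

// Fixed-window counter shared by every gateway instance.
// All instances increment the same key, so the limit holds globally.
async function isAllowed(
  clientId: string,
  limit: number,
  windowSec: number
): Promise<boolean> {
  const windowStart = Math.floor(Date.now() / 1000 / windowSec);
  const key = `ratelimit:${clientId}:${windowStart}`; // hypothetical key scheme

  // INCR is atomic; the first increment in a window also sets the expiry
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, windowSec);
  }
  return count <= limit;
}
```

A production implementation would typically wrap the increment and expiry in a Lua script or MULTI/EXEC so they are atomic, and decide what happens when Redis is unreachable (the fault_tolerant behavior above).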
Rate limits are per-client, but who is the 'client'? Identifying clients correctly is crucial—too granular and you waste resources, too coarse and bad actors affect legitimate users.
Identification Methods:
| Method | How to Identify | Best For | Limitations |
|---|---|---|---|
| IP Address | Request source IP | Unauthenticated traffic, basic protection | NAT, proxies can share IPs; easy to rotate IPs |
| API Key | Header: X-API-Key | Partner APIs, machine clients | Keys can be shared; revocation overhead |
| User ID | From authenticated token | Per-user limits, fair usage | Requires authentication first |
| Tenant ID | From JWT or header | Multi-tenant SaaS | Shared tenant users affected together |
| Fingerprint | Combination of IP + User-Agent + etc. | Bot detection, abuse prevention | Sophisticated actors can rotate fingerprints |
| Custom Header | X-Client-ID or similar | Flexible identification | Clients must cooperate; can be spoofed |
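In practice a gateway resolves identity in priority order, falling back to coarser identifiers when the more specific ones are absent. A hedged sketch follows; the request shape and lowercase header keys are assumptions for illustration.

```typescript
interface IncomingRequest {
  headers: Record<string, string | undefined>; // assumed lowercased header names
  sourceIp: string;
  jwtClaims?: { sub?: string; tenant_id?: string };
}

// Resolve the rate-limit key from the most specific identity available.
function rateLimitKey(req: IncomingRequest): string {
  const apiKey = req.headers["x-api-key"];
  if (apiKey) return `key:${apiKey}`;                          // Partner / machine clients

  if (req.jwtClaims?.sub) return `user:${req.jwtClaims.sub}`;  // Authenticated users

  const clientId = req.headers["x-client-id"];
  if (clientId) return `client:${clientId}`;                   // Cooperative custom header

  return `ip:${req.sourceIp}`;                                 // Fallback: unauthenticated traffic
}
```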
```yaml
# Multi-level rate limiting strategy
# Apply limits from most specific to least specific
rate_limits:
  # Level 1: Global protection (all traffic)
  - scope: global
    limit: 100000
    window: 1s
    action: reject

  # Level 2: Per-IP (unauthenticated traffic)
  - scope: ip
    identify_by: source_ip
    limit: 100
    window: 1m
    action: reject

  # Level 3: Per-API-Key (authenticated partners)
  - scope: api_key
    identify_by: header:X-API-Key
    default_limit: 1000
    window: 1m
    overrides:
      "key_premium_123": 10000
      "key_trial_456": 100

  # Level 4: Per-User (authenticated users)
  - scope: user
    identify_by: jwt:sub
    default_limit: 100
    window: 1m

  # Level 5: Per-Endpoint (expensive operations)
  - scope: endpoint
    match: POST:/api/export
    identify_by: jwt:sub
    limit: 5
    window: 1h
    message: "Export limit reached. Try again in {retry_after}."

# Request must pass ALL applicable levels
```

Apply rate limits at multiple levels: global (system protection), per-IP (unauthenticated protection), per-client (fair usage), and per-endpoint (resource protection). A well-designed system has all of these working together.
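Evaluating such a policy amounts to checking every applicable level and rejecting on the first one that is exhausted. A minimal sketch of that loop (the Limiter interface and return shape are illustrative assumptions, not a particular gateway's API):

```typescript
// A request must pass ALL applicable levels; the first exhausted level rejects it.
interface Limiter<Req> {
  scope: string;                          // e.g. "global", "ip", "user", "endpoint"
  appliesTo(req: Req): boolean;           // does this level match the request?
  tryConsume(req: Req): Promise<boolean>; // true if the request is under the limit
}

async function checkAllLevels<Req>(
  req: Req,
  limiters: Limiter<Req>[]
): Promise<{ allowed: boolean; rejectedBy?: string }> {
  for (const limiter of limiters) {
    if (!limiter.appliesTo(req)) continue;
    if (!(await limiter.tryConsume(req))) {
      return { allowed: false, rejectedBy: limiter.scope };
    }
  }
  return { allowed: true };
}
```

One caveat with this naive loop: earlier levels consume budget even when a later level rejects the request. Gateways often order checks from cheapest to strictest, or roll back earlier consumption, to limit that effect.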
When a client exceeds their rate limit, how you respond matters. Good response handling helps legitimate clients adapt while discouraging abuse.
Standard Response Headers:
```http
HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 100         # Maximum requests allowed
X-RateLimit-Remaining: 42      # Requests remaining in window
X-RateLimit-Reset: 1735689600  # Unix timestamp when limit resets
RateLimit-Limit: 100           # IETF draft standard
RateLimit-Remaining: 42
RateLimit-Reset: 45            # Seconds until reset

{"data": {...}}
```

When a client exceeds the limit, return 429 Too Many Requests with a Retry-After header and a machine-readable error body: error codes such as RATE_LIMIT_EXCEEDED are easier for clients to handle than free-form messages.

If you return 429 without Retry-After, many clients will immediately retry, amplifying the load. The Retry-After header spreads retries over time. Consider adding jitter to the suggested retry time to prevent synchronized retries.
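Putting those pieces together on the rejection path, here is a framework-agnostic sketch of what the gateway might return when a limit is exceeded; the helper's shape is an assumption for illustration.

```typescript
// Build a 429 response with the headers clients need to back off correctly.
function buildRateLimitedResponse(limit: number, resetEpochSec: number) {
  const retryAfterSec = Math.max(1, resetEpochSec - Math.floor(Date.now() / 1000));

  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSec),   // Seconds the client should wait
      "X-RateLimit-Limit": String(limit),
      "X-RateLimit-Remaining": "0",
      "X-RateLimit-Reset": String(resetEpochSec),
    },
    body: {
      error: {
        code: "RATE_LIMIT_EXCEEDED",          // Machine-readable error code
        message: `Rate limit exceeded. Retry after ${retryAfterSec} seconds.`,
        retry_after: retryAfterSec,
      },
    },
  };
}
```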
Beyond basic request counting, sophisticated systems implement advanced patterns for nuanced traffic management.
Adaptive Rate Limiting:
```typescript
// Adaptive rate limiting based on system health
interface SystemHealth {
  latencyP99: number;      // milliseconds
  errorRate: number;       // 0-1
  queueDepth: number;
  cpuUtilization: number;
}

function calculateDynamicLimit(
  baseLimit: number,
  health: SystemHealth
): number {
  let multiplier = 1.0;

  // Reduce limit as latency increases
  if (health.latencyP99 > 500) {
    multiplier *= 0.8;
  }
  if (health.latencyP99 > 1000) {
    multiplier *= 0.6;
  }

  // Reduce limit on high error rate
  if (health.errorRate > 0.01) { // > 1% errors
    multiplier *= 0.5;
  }

  // Reduce limit on deep queues
  if (health.queueDepth > 1000) {
    multiplier *= 0.7;
  }

  // Reduce limit on high CPU
  if (health.cpuUtilization > 0.8) {
    multiplier *= 0.8;
  }

  // Never go below 10% of base limit
  multiplier = Math.max(multiplier, 0.1);

  return Math.floor(baseLimit * multiplier);
}
```

Not all requests are equal. A simple GET costs less than a complex search, which costs less than an export. Cost-based limiting lets you give users a 'budget' that reflects actual resource consumption, rather than treating all requests equally.
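A cost-based variant of the token bucket simply charges more tokens for more expensive operations. The sketch below reuses the bucket idea from earlier; the cost table and operation keys are illustrative assumptions.

```typescript
// Cost-based limiting: expensive operations consume more of the client's budget.
const OPERATION_COSTS: Record<string, number> = {
  "GET:/api/items": 1,     // cheap read
  "GET:/api/search": 5,    // complex query
  "POST:/api/export": 25,  // heavy, long-running job
};

interface Budget {
  tokens: number;      // remaining budget
  capacity: number;    // maximum budget
  refillRate: number;  // tokens added per second
  lastRefill: number;  // timestamp of last refill
}

function consumeBudget(budget: Budget, operation: string): boolean {
  const cost = OPERATION_COSTS[operation] ?? 1;
  const now = Date.now();

  // Refill proportionally to elapsed time, capped at capacity
  const elapsed = (now - budget.lastRefill) / 1000;
  budget.tokens = Math.min(budget.capacity, budget.tokens + elapsed * budget.refillRate);
  budget.lastRefill = now;

  if (budget.tokens >= cost) {
    budget.tokens -= cost;
    return true;  // Allowed: charge the operation's cost
  }
  return false;   // Not enough budget for this operation
}
```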
Rate limiting is a form of graceful degradation—protecting the system by limiting access. But within rate limiting, you can implement further graceful degradation strategies.
Degradation Strategies:
```yaml
# Progressive degradation based on system state
degradation:
  triggers:
    - condition: "rate_limit_utilization > 80%"
      actions:
        - disable_feature: autocomplete
        - reduce_page_size: 50%

    - condition: "rate_limit_utilization > 90%"
      actions:
        - serve_from_cache: true
        - cache_ttl_override: 300s
        - disable_feature: recommendations
        - disable_feature: real_time_updates

    - condition: "rate_limit_utilization > 95%"
      actions:
        - response_mode: minimal  # Only essential fields
        - disable_feature: pagination
        - reject_new_connections: true

  fallbacks:
    # When rate limited, try these fallbacks
    - endpoint: /api/search
      fallback:
        - serve_cached_popular_results
        - serve_static_suggestions
        - return_503_with_retry

    - endpoint: /api/feed
      fallback:
        - serve_cached_feed
        - serve_trending_content
        - return_503_with_retry
```

Client-Side Considerations:
Good clients implement their own rate limit handling:
| Client Behavior | Description |
|---|---|
| Respect Retry-After | Wait the specified time before retrying |
| Exponential Backoff | Increase delay between retries |
| Jitter | Add randomness to prevent synchronized retries |
| Circuit Breaker | Stop trying after repeated failures |
| Proactive Throttling | Slow down before hitting limits using X-RateLimit-Remaining |
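These behaviors combine naturally into one retry helper. A minimal client-side sketch using the fetch API (available in browsers and Node 18+); the retry bounds are illustrative assumptions.

```typescript
// Retry a request, honoring Retry-After and adding exponential backoff with jitter.
async function fetchWithBackoff(url: string, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url);

    if (response.status !== 429 || attempt >= maxRetries) {
      return response;
    }

    // Prefer the server's Retry-After; otherwise back off exponentially (capped at 30s)
    const retryAfter = Number(response.headers.get("Retry-After"));
    const baseDelayMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.min(30_000, 1000 * 2 ** attempt);

    // Full jitter prevents clients from retrying in lockstep
    const delayMs = Math.random() * baseDelayMs;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```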
When serving degraded responses, communicate this to clients. Headers like X-Degraded-Mode: true or response fields indicating freshness help clients understand they're not getting full functionality. This is better than silently returning stale data.
Rate limiting is essential for protecting systems and ensuring fair access. The key insights: choose an algorithm that matches your traffic (token bucket handles bursts gracefully and is the most common choice), share counter state across gateway instances so limits hold globally, identify clients at the right granularity and layer limits from global down to per-endpoint, return standard headers with Retry-After so well-behaved clients can back off, and degrade gracefully rather than failing hard when limits are reached.
What's Next:
The final page in this module covers Service Composition: how the API Gateway orchestrates calls to multiple backend services, aggregates responses, and satisfies complex client requirements with a single API call.
You now understand rate limiting comprehensively—from algorithms to distributed implementations, from client identification to graceful degradation. You're equipped to design rate limiting systems that protect your infrastructure while providing excellent developer experience.