We've designed individual components: timeline generation, tweet storage, trending topics. Now we must address the meta-challenge: How do all these components work together at global scale?
Twitter's operating reality: roughly 200 million daily active users producing around 500 million tweets per day, all expecting low-latency access from anywhere in the world.
This final page synthesizes everything into a cohesive, production-ready system architecture. We'll cover global distribution, caching strategies, capacity planning, graceful degradation, and operational patterns that enable Twitter-scale systems to run reliably.
By the end of this page, you'll understand multi-region architecture patterns, edge caching strategies, capacity planning methodologies, graceful degradation techniques, and operational best practices for systems serving hundreds of millions of users.
A single data center cannot serve a global user base with acceptable latency. Multi-region deployment is essential.
| Region | Location | Primary Role | User Base |
|---|---|---|---|
| US-East | Virginia | Primary write region for Americas | North/South America |
| US-West | Oregon | Read replicas, failover for US-East | West Coast Americas |
| Europe | Ireland/Frankfurt | Primary write region for EMEA | Europe, Middle East, Africa |
| Asia-Pacific | Singapore/Tokyo | Primary write region for APAC | Asia-Pacific, Oceania |
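As a rough sketch of how request routing might use this layout, the snippet below prefers the viewer's closest region and walks down a fallback list when that region is unhealthy. The proximity orderings and the `is_healthy` check are illustrative assumptions, not Twitter's actual routing logic.

```python
# Hypothetical sketch: latency-aware region selection with failover.
# The proximity orderings and the health checker are assumptions for illustration.
REGIONS_BY_PROXIMITY = {
    "americas": ["us-east", "us-west", "europe", "asia-pacific"],
    "emea": ["europe", "us-east", "asia-pacific", "us-west"],
    "apac": ["asia-pacific", "us-west", "europe", "us-east"],
}

def pick_serving_region(user_geo: str, is_healthy) -> str:
    """Prefer the closest region; fall back down the list if it is unhealthy."""
    for region in REGIONS_BY_PROXIMITY.get(user_geo, ["us-east"]):
        if is_healthy(region):
            return region
    return "us-east"  # Last resort: primary Americas region
```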
Critical question: Should each region have all data, or should data be partitioned by user location?
Twitter uses a hybrid model optimized for its access patterns:
"""Twitter's Data Distribution Strategy Different data types have different distribution patternsbased on their access characteristics.""" class DataDistributionStrategy: """ Data Distribution by Type ========================= 1. USER DATA (profile, settings, auth) - Primary: User's "home" region (based on signup location) - Replicated: All regions (read-only copies) - Rationale: Users mostly access their own profile; others need fast reads globally 2. TWEETS - Primary: Author's home region - Replicated: All regions asynchronously - Rationale: Tweets are immutable; eventual consistency OK Most reads are for recent tweets of followed users 3. TIMELINE CACHES - Primary: User's current region (not home region) - Not replicated: Generated locally from replicated data - Rationale: Timelines are computed, not stored Each region builds from its local tweet copy 4. SOCIAL GRAPH (follows) - Primary: User's home region - Replicated: All regions frequently - Rationale: Read-heavy, rarely changes Needs to be consistent for fanout 5. ENGAGEMENT COUNTERS (likes, retweets) - Primary: Sharded globally (not regional) - Updates: Async aggregation from all regions - Rationale: Write-heavy, eventual consistency OK Exact counts not critical """ def route_write(self, data_type: str, user_id: str) -> str: """Determine write region for a data type.""" home_region = self.get_user_home_region(user_id) if data_type in ["user", "tweet", "follow"]: return home_region elif data_type == "timeline_cache": return self.get_current_region() # Where user is now elif data_type == "counter": return self.get_counter_shard_region(user_id) return home_region def route_read(self, data_type: str, user_id: str) -> str: """Determine read region for a data type.""" current_region = self.get_current_region() # All reads prefer local region (replicas available) return current_region def should_wait_for_replication( self, data_type: str, operation: str ) -> bool: """ Should we wait for cross-region replication? Generally: Write locally, read eventually consistent. Exception: User's own writes should be visible immediately. """ if operation == "read_own_write": # User reading data they just wrote # Route to primary or use read-your-writes consistency return True return FalseEven in an eventually consistent system, users expect to see their own changes immediately. When Alice posts a tweet, she should see it in her timeline instantly—even if replication to other regions takes seconds. Achieve this by routing her reads to the same region that processed her write, or by reading from a cache populated at write time.
Caching is the single most important technique for achieving Twitter-scale performance. Every layer of the stack uses caching to reduce latency and load on downstream systems.
| Layer | Technology | TTL | Data Cached | Hit Rate Target |
|---|---|---|---|---|
| CDN Edge | Cloudflare/Fastly | 5-60s | Static assets, media, API responses | 95% |
| API Gateway | Nginx cache | 1-10s | Trending topics, public profiles | 80% |
| Application | Local in-memory | Request-scoped | Frequently accessed objects | 90% |
| Distributed | Redis Cluster | Minutes-hours | Timelines, tweet objects | 95% |
| Local Replicas | SSD-backed cache | Hours-days | User data, less hot content | 80% |
"""Multi-Layer Cache Architecture Each layer handles different access patterns and providesdifferent consistency guarantees.""" class CacheHierarchy: def __init__( self, local_cache, # In-process (Guava, Caffeine) redis_cache, # Distributed Redis cluster cdn_cache # Edge CDN cache ): self.local = local_cache self.redis = redis_cache self.cdn = cdn_cache async def get_timeline( self, user_id: str, viewer_id: str ) -> List[Tweet]: """ Get timeline with multi-layer cache lookup. Cache key strategy: - Private timeline: "tl:{user_id}:{viewer_id}" (viewer-specific) - Public profile: "tl:{user_id}:public" (shared across viewers) """ is_own_profile = (user_id == viewer_id) if is_own_profile: # Own timeline: check Redis directly (personalized) return await self._get_private_timeline(user_id) else: # Viewing other's profile: use multi-layer cache return await self._get_public_timeline(user_id, viewer_id) async def _get_private_timeline(self, user_id: str) -> List[Tweet]: """Private timeline - user's own home timeline.""" cache_key = f"home_tl:{user_id}" # Skip local cache (too personalized to share) # Check Redis cached = await self.redis.get(cache_key) if cached: return cached # Build timeline timeline = await self._build_home_timeline(user_id) # Cache in Redis (short TTL for freshness) await self.redis.set(cache_key, timeline, ex=60) return timeline async def _get_public_timeline( self, user_id: str, viewer_id: str ) -> List[Tweet]: """Public profile timeline - can be heavily cached.""" cache_key = f"profile_tl:{user_id}" # Check local cache first (hot profiles) cached = self.local.get(cache_key) if cached: return self._filter_for_viewer(cached, viewer_id) # Check Redis cached = await self.redis.get(cache_key) if cached: self.local.set(cache_key, cached, ttl=30) # Cache locally return self._filter_for_viewer(cached, viewer_id) # Build timeline timeline = await self._build_profile_timeline(user_id) # Cache at multiple layers await self.redis.set(cache_key, timeline, ex=300) self.local.set(cache_key, timeline, ttl=30) return self._filter_for_viewer(timeline, viewer_id) def _filter_for_viewer( self, tweets: List[Tweet], viewer_id: str ) -> List[Tweet]: """Apply viewer-specific filtering (blocks, mutes).""" blocked_users = self.get_blocked_muted(viewer_id) return [t for t in tweets if t.author_id not in blocked_users]12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182
"""Cache Invalidation Patterns The famous quote: "There are only two hard things in computer science:cache invalidation and naming things." Here's how we handle it.""" class CacheInvalidator: """ Invalidation Strategies by Data Type ===================================== 1. TIME-BASED (TTL) - Data naturally expires - Simple, no coordination needed - Use for: Trending topics, aggregate counts, public timelines 2. EVENT-BASED (Explicit Invalidation) - Writer invalidates cache after update - Requires coordination between services - Use for: Home timelines (on new tweet), user profiles (on edit) 3. VERSION-BASED (Cache Keys with Versions) - Include version number in cache key - Old keys naturally expire, new key used immediately - Use for: Data that changes infrequently but needs immediate update 4. BACKGROUND REFRESH (Stale-While-Revalidate) - Serve stale data while refreshing in background - Never blocks on cache miss - Use for: Non-critical data where freshness is acceptable """ async def invalidate_user_timeline(self, user_id: str): """ Invalidate a user's home timeline cache. Called when: new tweet from followed account, new follow, unfollow. """ cache_key = f"home_tl:{user_id}" # Option 1: Simple delete (next read will rebuild) await self.redis.delete(cache_key) # Option 2: Refresh in background (non-blocking) # await self.queue_timeline_refresh(user_id) async def invalidate_tweet(self, tweet_id: str, author_id: str): """ Invalidate caches when a tweet is deleted or modified. Challenge: Tweet might be cached in: - Author's profile timeline - All followers' home timelines - Quote tweet caches - Reply thread caches """ # Invalidate author's profile await self.redis.delete(f"profile_tl:{author_id}") # For followers' timelines: don't invalidate immediately # Instead, filter out deleted tweets at read time # (Eventually, TTL will expire old timeline caches) # Mark tweet as deleted in tweet cache await self.redis.hset(f"tweet:{tweet_id}", "deleted", "true") async def invalidate_all_user_caches(self, user_id: str): """ Nuclear option: invalidate all caches for a user. Used for: Account suspension, privacy changes, data deletion requests. """ patterns = [ f"home_tl:{user_id}", f"profile_tl:{user_id}", f"user:{user_id}:*", f"tweets:{user_id}:*", ] for pattern in patterns: keys = await self.redis.keys(pattern) if keys: await self.redis.delete(*keys)For very expensive invalidations (like removing a deleted tweet from millions of follower timelines), prefer lazy invalidation: filter out deleted content at read time, let TTL handle cache expiry. It's eventually consistent but much cheaper than eager invalidation.
Capacity planning ensures the system can handle expected load with headroom for growth and spikes.
"""Capacity Planning Framework Systematic approach to sizing infrastructure for Twitter-scale systems.""" class CapacityPlanner: def __init__( self, dau: int = 200_000_000, # Daily Active Users peak_multiplier: float = 3.0, # Peak vs average ratio growth_rate: float = 0.25, # 25% annual growth headroom: float = 0.30 # 30% safety margin ): self.dau = dau self.peak_multiplier = peak_multiplier self.growth_rate = growth_rate self.headroom = headroom def calculate_timeline_api_capacity(self) -> dict: """ Calculate required capacity for Timeline API servers. """ # Average requests per user per day timeline_loads_per_user = 20 # Total daily requests daily_requests = self.dau * timeline_loads_per_user # Average requests per second avg_rps = daily_requests / 86400 # Peak requests per second peak_rps = avg_rps * self.peak_multiplier # With headroom target_capacity = peak_rps * (1 + self.headroom) # Server sizing (assume 2000 RPS per server) rps_per_server = 2000 servers_needed = int(target_capacity / rps_per_server) + 1 return { "avg_rps": int(avg_rps), "peak_rps": int(peak_rps), "target_capacity": int(target_capacity), "servers_needed": servers_needed, "per_region": servers_needed // 4, # 4 regions } def calculate_redis_capacity(self) -> dict: """ Calculate required Redis cache capacity. """ # Timeline cache per user timeline_size_bytes = 800 * 100 # 800 tweet IDs × 100 bytes each # Active users have cached timelines active_fraction = 0.7 # 70% of DAU are active enough to cache cached_timelines = int(self.dau * active_fraction) timeline_storage = cached_timelines * timeline_size_bytes # Tweet cache (recent tweets) tweets_per_day = 500_000_000 tweet_cache_days = 7 tweet_size = 1000 # 1KB per tweet tweet_storage = tweets_per_day * tweet_cache_days * tweet_size # Hot user cache hot_users = 10_000_000 # 10M "hot" users user_size = 500 # 500 bytes per user object user_storage = hot_users * user_size total_bytes = timeline_storage + tweet_storage + user_storage total_gb = total_bytes / (1024**3) # With replication replication_factor = 3 required_gb = total_gb * replication_factor # Redis cluster sizing (assume 256GB per node max) gb_per_node = 200 # Leave headroom nodes_needed = int(required_gb / gb_per_node) + 1 return { "timeline_cache_gb": timeline_storage / (1024**3), "tweet_cache_gb": tweet_storage / (1024**3), "user_cache_gb": user_storage / (1024**3), "total_gb": total_gb, "with_replication_gb": required_gb, "redis_nodes_needed": nodes_needed } def calculate_database_capacity(self) -> dict: """ Calculate required database storage. """ tweets_per_day = 500_000_000 tweet_size_bytes = 1000 # 3 years of tweets years = 3 total_tweets = tweets_per_day * 365 * years tweet_storage = total_tweets * tweet_size_bytes # Index overhead (~30%) index_overhead = 1.3 total_with_indexes = tweet_storage * index_overhead # Replication factor replication = 3 total_replicated = total_with_indexes * replication total_pb = total_replicated / (1024**5) return { "total_tweets": total_tweets, "raw_storage_tb": tweet_storage / (1024**4), "with_indexes_tb": total_with_indexes / (1024**4), "with_replication_pb": total_pb, }Static capacity planning isn't enough—systems must scale dynamically:
Static capacity planning isn't enough—systems must scale dynamically:
```yaml
# Kubernetes Horizontal Pod Autoscaler Configuration
# For Timeline API Service

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: timeline-api-hpa
  namespace: twitter-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: timeline-api

  # Scaling bounds
  minReplicas: 50    # Never go below 50 pods per region
  maxReplicas: 500   # Can burst to 500 for major events

  # Scaling metrics
  metrics:
    # CPU-based scaling (primary)
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65   # Scale up at 65% CPU

    # RPS-based scaling (secondary)
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "1500"     # Scale up if RPS exceeds 1500/pod

    # Latency-based scaling (prevent degradation)
    - type: Pods
      pods:
        metric:
          name: response_time_p99_ms
        target:
          type: AverageValue
          averageValue: "200"      # Scale up if P99 exceeds 200ms

  # Scaling behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up more
      policies:
        - type: Percent
          value: 50                     # Add up to 50% more pods
          periodSeconds: 60
        - type: Pods
          value: 20                     # Or add up to 20 pods
          periodSeconds: 60
      selectPolicy: Max                 # Use whichever adds more
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 10                     # Remove at most 10% of pods
          periodSeconds: 120
      selectPolicy: Min                 # Conservative scale-down
```

Scaling down too aggressively causes oscillation (scale down → load spikes → scale up → repeat). Use longer stabilization windows for scale-down, and remove capacity gradually. It's better to pay for a few extra servers than to create instability.
When systems are overloaded, the goal is to degrade gracefully—maintain core functionality while sacrificing non-essential features.
| Level | Trigger | Actions | User Impact |
|---|---|---|---|
| Normal | All metrics healthy | Full feature set | None |
| Advisory | P99 > 200ms | Reduce non-essential API calls | Minimal |
| Warning | P99 > 400ms | Disable luxury features | Noticeable but acceptable |
| Critical | P99 > 1s or errors > 1% | Emergency load shedding | Significant |
| Survival | System at risk | Core features only | Major, but system survives |
"""Degradation Controller Coordinates graceful degradation across services based onreal-time system health indicators.""" from enum import IntEnumfrom dataclasses import dataclassfrom typing import Set class DegradationLevel(IntEnum): NORMAL = 0 ADVISORY = 1 WARNING = 2 CRITICAL = 3 SURVIVAL = 4 @dataclassclass FeatureFlag: name: str min_level: DegradationLevel # Feature disabled at this level and above description: str class DegradationController: # Features that can be disabled during degradation DEGRADABLE_FEATURES = [ FeatureFlag("algorithmic_timeline", DegradationLevel.WARNING, "Serve chronological timeline instead of ranked"), FeatureFlag("trending_topics", DegradationLevel.WARNING, "Hide trending topics section"), FeatureFlag("who_to_follow", DegradationLevel.ADVISORY, "Disable follow suggestions"), FeatureFlag("analytics_tracking", DegradationLevel.ADVISORY, "Stop non-essential analytics"), FeatureFlag("media_previews", DegradationLevel.WARNING, "Show links instead of previews"), FeatureFlag("real_time_updates", DegradationLevel.CRITICAL, "Disable WebSocket updates, require refresh"), FeatureFlag("search", DegradationLevel.CRITICAL, "Disable search, show cached trends"), FeatureFlag("new_tweets", DegradationLevel.SURVIVAL, "Disable posting, read-only mode"), ] def __init__(self, metrics_client, feature_flags): self.metrics = metrics_client self.flags = feature_flags self.current_level = DegradationLevel.NORMAL def evaluate_health(self) -> DegradationLevel: """ Evaluate current system health and return appropriate level. """ p99_latency = self.metrics.get("timeline_api.latency.p99") error_rate = self.metrics.get("timeline_api.error_rate") cpu_usage = self.metrics.get("cluster.cpu.avg") if error_rate > 0.05 or p99_latency > 2000: return DegradationLevel.SURVIVAL elif error_rate > 0.01 or p99_latency > 1000: return DegradationLevel.CRITICAL elif p99_latency > 400 or cpu_usage > 0.85: return DegradationLevel.WARNING elif p99_latency > 200 or cpu_usage > 0.75: return DegradationLevel.ADVISORY else: return DegradationLevel.NORMAL def get_disabled_features(self) -> Set[str]: """Get features that should be disabled at current level.""" disabled = set() for feature in self.DEGRADABLE_FEATURES: if self.current_level >= feature.min_level: disabled.add(feature.name) return disabled def update_and_notify(self): """ Update degradation level and notify relevant services. """ new_level = self.evaluate_health() if new_level != self.current_level: old_level = self.current_level self.current_level = new_level # Update feature flags disabled = self.get_disabled_features() for feature in self.DEGRADABLE_FEATURES: self.flags.set( feature.name, feature.name not in disabled ) # Alert operations team if new_level >= DegradationLevel.WARNING: self.alert_ops(old_level, new_level) # Log for analysis self.log_level_change(old_level, new_level) def is_feature_enabled(self, feature_name: str) -> bool: """Check if a feature should be enabled.""" disabled = self.get_disabled_features() return feature_name not in disabledWhen overwhelmed, the system must shed load to protect core functionality:
When overwhelmed, the system must shed load to protect core functionality:
"""Load Shedding Strategies When the system cannot handle all requests, strategicallyreject some to protect the rest.""" class LoadShedder: """ Load Shedding Priorities ======================== Priority 1 (Never Shed): Core functionality - User authentication - Tweet posting (with rate limit) - Critical notifications Priority 2 (Shed Last): Primary features - Home timeline - User profile views - Notifications Priority 3 (Shed Early): Secondary features - Search - Explore/Discovery - Analytics Priority 4 (Shed First): Luxury features - Who to follow - Trending topics - Related tweets """ PRIORITY_MAP = { "/auth/*": 1, "/tweets/create": 1, "/timeline/home": 2, "/users/*/profile": 2, "/notifications": 2, "/search/*": 3, "/explore/*": 3, "/trending": 4, "/suggestions/*": 4, } def __init__(self, target_rps: int, current_rps_getter): self.target_rps = target_rps self.get_current_rps = current_rps_getter def should_shed(self, request_path: str) -> bool: """ Determine if a request should be shed. """ current_rps = self.get_current_rps() if current_rps <= self.target_rps: return False # Under capacity, accept all overload_ratio = current_rps / self.target_rps priority = self._get_priority(request_path) # Higher overload + lower priority = more likely to shed shed_probability = self._calculate_shed_probability( overload_ratio, priority ) return random.random() < shed_probability def _get_priority(self, path: str) -> int: """Get priority for a request path.""" for pattern, priority in self.PRIORITY_MAP.items(): if self._match_pattern(path, pattern): return priority return 3 # Default to middle priority def _calculate_shed_probability( self, overload_ratio: float, priority: int ) -> float: """ Calculate probability of shedding this request. At 1.5x overload: - Priority 1: 0% shed - Priority 2: 10% shed - Priority 3: 30% shed - Priority 4: 50% shed At 2x overload: - Priority 1: 0% shed - Priority 2: 25% shed - Priority 3: 50% shed - Priority 4: 75% shed """ if priority == 1: return 0.0 # Never shed priority 1 base_shed = (overload_ratio - 1) * 0.5 # 50% at 2x overload priority_multiplier = (priority - 1) / 3 # 0 to 1 based on priority return min(0.95, base_shed * (1 + priority_multiplier))When features are degraded, communicate this to users. A banner saying 'Some features are temporarily limited' is better than confusing silent failures. Users are surprisingly understanding when you're transparent about issues.
Running a Twitter-scale system requires operational practices beyond the code itself.
| Dashboard | Key Metrics | Alert Thresholds |
|---|---|---|
| API Health | RPS, Latency P50/P99, Error Rate | P99 > 300ms, Errors > 0.1% |
| Infrastructure | CPU, Memory, Disk, Network | CPU > 80%, Memory > 85% |
| Database | QPS, Replication Lag, Connection Pool | Lag > 10s, Pool > 90% |
| Cache | Hit Rate, Evictions, Memory | Hit Rate < 90%, Memory > 90% |
| Fanout | Queue Depth, Processing Lag | Depth > 1M, Lag > 60s |
```markdown
# Timeline API Incident Response Runbook

## Severity Levels
- SEV1: Complete outage, >50% of users affected
- SEV2: Major degradation, >10% of users affected
- SEV3: Minor degradation, <10% of users affected
- SEV4: No user impact, potential risk

## Immediate Actions (First 5 minutes)

### 1. Assess Scope
- Check: Is this regional or global?
- Check: Which services are affected?
- Check: When did it start? (correlate with deploys)

### 2. Communicate
- SEV1/SEV2: Page incident commander immediately
- Update status page within 10 minutes
- Open incident Slack channel

### 3. Mitigate First, Debug Later
- Recent deploy? ROLLBACK IMMEDIATELY
- Traffic spike? Enable aggressive load shedding
- Single region? Failover to healthy region
- Database issue? Failover to replica

## Common Issues and Fixes

### Timeline API Latency Spike
1. Check Redis cache hit rate (dashboard: Cache Health)
2. If hit rate dropped → Check for cache invalidation storm
3. If hit rate OK → Check database latency
4. If database OK → Check downstream services (Fanout, User Service)

### Fanout Queue Backing Up
1. Check for celebrity tweet (high follower count)
2. Scale up fanout workers (kubectl scale --replicas=200)
3. If still backing up → Enable emergency fanout limits
4. Consider temporarily disabling push for high-follower accounts

### Database Connection Exhaustion
1. Check for connection leaks (long-running queries)
2. Kill long-running queries if safe
3. Scale database replicas
4. Enable connection pool shedding

## Post-Incident
- Blameless post-mortem within 48 hours
- Action items with owners and due dates
- Update runbooks with learnings
```

Chaos engineering isn't about breaking things—it's about building confidence that your systems handle failures gracefully. Run chaos experiments continuously in staging, and periodically in production during low-traffic periods. The bugs you find proactively are far less costly than the ones found during peak traffic.
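A chaos experiment can be as simple as wrapping a dependency client and injecting latency or failures at a controlled rate during a staging run. The sketch below is a generic illustration, not a specific chaos tool's API; the rates and latency values are assumptions.

```python
import asyncio
import random

class ChaosProxy:
    """Hypothetical wrapper that injects faults into calls to a dependency."""

    def __init__(self, client, failure_rate: float = 0.05, added_latency_ms: int = 200):
        self._client = client
        self.failure_rate = failure_rate
        self.added_latency_ms = added_latency_ms

    async def call(self, method: str, *args, **kwargs):
        # Randomly fail a fraction of calls to exercise retry/fallback paths.
        if random.random() < self.failure_rate:
            raise ConnectionError("chaos: injected dependency failure")
        # Add artificial latency to exercise timeouts and degradation logic.
        await asyncio.sleep(self.added_latency_ms / 1000)
        return await getattr(self._client, method)(*args, **kwargs)
```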
We've designed a complete Twitter-like system, from timeline generation and tweet storage to trending topics, caching, and global distribution, and seen how all the pieces fit together.
Throughout this module, we've applied fundamental system design principles:
- Trade-offs are inevitable: Every decision involves trade-offs. We chose availability over consistency, favored read performance over write simplicity, and balanced cost against latency.
- Design for scale from day one: The celebrity problem wasn't an afterthought—it shaped our entire architecture. Anticipate extreme cases.
- Caching is king: Multi-layer caching transformed impossible latency targets into achievable ones. Know your cache hit rates.
- Fail gracefully: From circuit breakers to load shedding to degradation levels, we planned for failure at every layer.
- Observability enables everything: You can't optimize what you can't measure. Comprehensive monitoring is as important as the features themselves.
Congratulations! You've completed a comprehensive deep-dive into designing a Twitter-like social media platform. You now have the knowledge to discuss Twitter's architecture at a principal engineer level—understanding not just what to build, but why each decision was made and what trade-offs were accepted. This foundation applies to any large-scale content distribution system.