The Storage Challenge:

Every tweet ever posted—over 500 billion and counting—must be stored, indexed, and retrievable in milliseconds. This is not a typical database problem. The scale, access patterns, and reliability requirements demand specialized solutions.

This page explores how to design storage that meets these requirements, examining data models, sharding strategies, and the multi-tier storage architecture that powers Twitter's tweet infrastructure.
By the end of this page, you'll understand tweet data modeling, ID generation strategies (Snowflake), sharding approaches for scalable storage, indexing for efficient queries, and multi-tier storage architectures that balance performance with cost.
Before choosing storage technologies, we must define exactly what a tweet contains. The data model impacts storage size, query patterns, and indexing strategies.
```protobuf
// Protocol Buffer schema for Tweet entity
syntax = "proto3";

message Tweet {
  // === Identity ===
  int64 id = 1;                       // Snowflake ID (see ID generation section)
  int64 author_id = 2;                // User who created this tweet

  // === Content ===
  string text = 3;                    // Tweet text (1-280 chars, 1-4000 premium)
  repeated int64 media_ids = 4;       // References to media attachments
  int64 quoted_tweet_id = 5;          // If this is a quote tweet
  int64 reply_to_tweet_id = 6;        // If this is a reply
  int64 reply_to_user_id = 7;         // Author of tweet being replied to
  int64 conversation_id = 8;          // Root tweet of the conversation thread

  // === Entities (parsed from text) ===
  repeated Mention mentions = 9;      // @username references
  repeated Hashtag hashtags = 10;     // #topic references
  repeated Url urls = 11;             // Embedded URLs

  // === Metadata ===
  int64 created_at_ms = 12;           // Unix timestamp in milliseconds
  string source = 13;                 // e.g., "Twitter for iPhone"
  string language = 14;               // Detected language (ISO 639-1)
  GeoLocation geo = 15;               // Optional geolocation

  // === Visibility ===
  bool is_protected = 16;             // From protected account
  ReplyRestriction reply_restriction = 17;
  bool is_deleted = 18;               // Soft delete flag

  // === Engagement Counters ===
  // Note: Often stored separately for better update performance
  TweetMetrics metrics = 19;
}

message TweetMetrics {
  int64 like_count = 1;
  int64 retweet_count = 2;
  int64 reply_count = 3;
  int64 quote_count = 4;
  int64 view_count = 5;
  int64 bookmark_count = 6;
}

message Mention {
  int64 user_id = 1;
  string username = 2;
  int32 start_index = 3;
  int32 end_index = 4;
}

message Hashtag {
  string tag = 1;
  int32 start_index = 2;
  int32 end_index = 3;
}

message Url {
  string url = 1;            // Original URL in tweet
  string expanded_url = 2;   // Full URL after expansion
  string display_url = 3;    // Shortened for display
  int32 start_index = 4;
  int32 end_index = 5;
}

// Assumed minimal shape for the geo field referenced above
message GeoLocation {
  double latitude = 1;
  double longitude = 2;
}

enum ReplyRestriction {
  EVERYONE = 0;
  FOLLOWING = 1;   // Only people author follows
  MENTIONED = 2;   // Only mentioned users
}
```

Understanding storage size per tweet guides capacity planning:
| Field | Typical Size | Notes |
|---|---|---|
| ID fields (id, author_id, etc.) | 64 bytes | 8 × 8-byte integers |
| Text content | 560 bytes | 280 chars × 2 (UTF-16 avg) |
| Entities (mentions, hashtags, urls) | 200 bytes | Variable, avg estimate |
| Metadata (timestamps, source, lang) | 50 bytes | Fixed overhead |
| Counters (if embedded) | 48 bytes | 6 × 8-byte integers |
| Protobuf overhead | 50 bytes | Field tags, length prefixes |
| Total per tweet | ~1 KB | Conservative estimate |
```text
Storage Capacity Planning
=========================

Daily tweet volume:   500 million tweets
Average tweet size:   1 KB

Daily storage growth:  500M × 1 KB = 500 GB/day
Annual storage growth: 500 GB × 365 = ~180 TB/year (tweet data only)

With 3x replication for durability:
  180 TB × 3 = 540 TB/year

10-year retention:
  540 TB × 10 = 5.4 PB

Plus indexes (estimate 30% of base):
  5.4 PB × 1.3 = ~7 PB total

This excludes:
- Media storage (images, videos) - stored separately
- Timeline caches - Redis tier
- Search indexes - Elasticsearch tier
```

Engagement counters (likes, retweets) are often stored separately from core tweet data. Counters change frequently; tweet content never changes after creation. Separating them allows different update patterns and caching strategies.
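The same arithmetic is easy to keep as a small script. The sketch below is not part of any production system; it simply re-runs the estimate above (500M tweets/day, ~1 KB per tweet, 3x replication, 30% index overhead, 10-year retention) so you can vary the inputs:

```python
def estimate_storage_pb(
    tweets_per_day: int = 500_000_000,   # figures from the capacity plan above
    bytes_per_tweet: int = 1_000,        # ~1 KB, decimal units to match the plan
    replication_factor: int = 3,
    index_overhead: float = 0.30,
    retention_years: int = 10,
) -> float:
    """Back-of-envelope total storage in petabytes."""
    daily_bytes = tweets_per_day * bytes_per_tweet
    yearly_bytes = daily_bytes * 365
    replicated = yearly_bytes * replication_factor * retention_years
    total = replicated * (1 + index_overhead)
    return total / 1e15  # bytes -> PB


if __name__ == "__main__":
    print(f"~{estimate_storage_pb():.1f} PB total")  # roughly 7 PB, matching the estimate above
```

Doubling the average tweet size or the retention period scales the result linearly, which is why the per-tweet size estimate matters.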
Every tweet needs a unique identifier. At Twitter's scale, simple auto-increment IDs don't work—they require centralized coordination that becomes a bottleneck. Twitter invented Snowflake to solve this problem.
A Snowflake ID is a 64-bit integer composed of three parts:
```text
Snowflake ID (64 bits total)
============================

┌──────┬──────────────────────────┬──────┬────────┬────────────┐
│ Sign │ Timestamp (41 bits)      │  DC  │ Worker │  Sequence  │
│ (1)  │ Milliseconds since epoch │ (5)  │  (5)   │    (12)    │
└──────┴──────────────────────────┴──────┴────────┴────────────┘

Bit allocation:
- 1 bit:   Sign (always 0 for positive IDs)
- 41 bits: Timestamp (milliseconds since custom epoch, e.g., 2010-11-04)
- 5 bits:  Datacenter ID (0-31 datacenters)
- 5 bits:  Worker/Machine ID (0-31 workers per datacenter)
- 12 bits: Sequence number (0-4095 per millisecond per worker)

Capacity:
- Timestamp: 2^41 ms ≈ 69 years from epoch
- Per worker: 4096 IDs per millisecond
- Total: 32 datacenters × 32 workers × 4096 IDs/ms ≈ 4.2 million IDs/ms

Example ID: 1628851200000000000
Binary:     0 | 10110100111001011110100011000000000 | 00001 | 00011 | 000000000001
         Sign |             Timestamp               |  DC   | Worker |  Sequence
```
"""Snowflake ID Generator Generates globally unique, roughly time-sorted 64-bit IDs withoutrequiring coordination between generators. Each machine runs itsown generator independently. Key properties:1. Unique: Datacenter + Worker + Sequence guarantees uniqueness2. Time-sorted: IDs are roughly ordered by creation time3. Fast: No network calls, no locks (per-worker)4. Compact: Fits in a 64-bit integer (8 bytes)""" import timeimport threading class SnowflakeGenerator: # Custom epoch: November 4, 2010 (Twitter's choice) EPOCH = 1288834974657 # milliseconds # Bit allocations DATACENTER_BITS = 5 WORKER_BITS = 5 SEQUENCE_BITS = 12 # Maximum values MAX_DATACENTER_ID = (1 << DATACENTER_BITS) - 1 # 31 MAX_WORKER_ID = (1 << WORKER_BITS) - 1 # 31 MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1 # 4095 # Bit shift amounts WORKER_SHIFT = SEQUENCE_BITS # 12 DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS # 17 TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS # 22 def __init__(self, datacenter_id: int, worker_id: int): if datacenter_id < 0 or datacenter_id > self.MAX_DATACENTER_ID: raise ValueError(f"Datacenter ID must be 0-{self.MAX_DATACENTER_ID}") if worker_id < 0 or worker_id > self.MAX_WORKER_ID: raise ValueError(f"Worker ID must be 0-{self.MAX_WORKER_ID}") self.datacenter_id = datacenter_id self.worker_id = worker_id self.sequence = 0 self.last_timestamp = -1 self.lock = threading.Lock() def generate(self) -> int: """Generate a new Snowflake ID.""" with self.lock: timestamp = self._current_millis() if timestamp < self.last_timestamp: # Clock moved backwards - this is a critical error raise ClockMovedBackwardsError( f"Clock moved backwards by " f"{self.last_timestamp - timestamp}ms" ) if timestamp == self.last_timestamp: # Same millisecond - increment sequence self.sequence = (self.sequence + 1) & self.MAX_SEQUENCE if self.sequence == 0: # Sequence exhausted - wait for next millisecond timestamp = self._wait_next_millis(timestamp) else: # New millisecond - reset sequence self.sequence = 0 self.last_timestamp = timestamp # Compose the ID snowflake_id = ( ((timestamp - self.EPOCH) << self.TIMESTAMP_SHIFT) | (self.datacenter_id << self.DATACENTER_SHIFT) | (self.worker_id << self.WORKER_SHIFT) | self.sequence ) return snowflake_id def _current_millis(self) -> int: return int(time.time() * 1000) def _wait_next_millis(self, last_ts: int) -> int: """Busy-wait until next millisecond.""" ts = self._current_millis() while ts <= last_ts: ts = self._current_millis() return ts @staticmethod def parse(snowflake_id: int) -> dict: """Extract components from a Snowflake ID.""" timestamp = (snowflake_id >> SnowflakeGenerator.TIMESTAMP_SHIFT) timestamp += SnowflakeGenerator.EPOCH datacenter = (snowflake_id >> SnowflakeGenerator.DATACENTER_SHIFT) datacenter &= SnowflakeGenerator.MAX_DATACENTER_ID worker = (snowflake_id >> SnowflakeGenerator.WORKER_SHIFT) worker &= SnowflakeGenerator.MAX_WORKER_ID sequence = snowflake_id & SnowflakeGenerator.MAX_SEQUENCE return { "timestamp_ms": timestamp, "datacenter_id": datacenter, "worker_id": worker, "sequence": sequence }Snowflake IDs are roughly time-sorted, meaning recent tweets have larger IDs. This property is crucial for efficient range queries: 'Get tweets from user X after ID Y' becomes a simple range scan on sorted storage. No need to query and filter by timestamp separately.
No single database can store hundreds of billions of tweets. We must shard the data across many machines. The sharding strategy fundamentally impacts query patterns and system complexity.
Twitter uses a hybrid approach that optimizes for the most common queries:
"""Hybrid Sharding Strategy Primary storage: Sharded by user_id- User's tweets stored together- Efficient user timeline queries Secondary index: Global tweet_id lookup- Maps tweet_id -> (shard, user_id)- Enables efficient single-tweet retrieval This provides best of both worlds:- User timeline: O(1) shard lookup- Single tweet: O(1) index lookup + O(1) shard lookup""" class TweetShardRouter: def __init__(self, num_shards: int = 1024): self.num_shards = num_shards self.tweet_index = TweetIdIndex() # Distributed cache def get_shard_for_user(self, user_id: int) -> int: """ Determine which shard stores a user's tweets. Uses consistent hashing for stable shard assignment. """ return self._consistent_hash(user_id) % self.num_shards def route_tweet_write(self, tweet: Tweet) -> int: """ Route a new tweet to its storage shard. Also updates the tweet_id index. """ shard = self.get_shard_for_user(tweet.author_id) # Write tweet to user's shard self._write_to_shard(shard, tweet) # Update tweet_id index self.tweet_index.set( tweet.id, {"shard": shard, "user_id": tweet.author_id} ) return shard def route_tweet_read(self, tweet_id: int) -> Optional[Tweet]: """ Retrieve a tweet by ID. Uses index to find the correct shard. """ # Look up shard from index index_entry = self.tweet_index.get(tweet_id) if not index_entry: return None shard = index_entry["shard"] return self._read_from_shard(shard, tweet_id) def route_user_timeline( self, user_id: int, limit: int = 50 ) -> List[Tweet]: """ Get a user's tweets. Single-shard operation - very efficient. """ shard = self.get_shard_for_user(user_id) return self._read_user_tweets(shard, user_id, limit) def _consistent_hash(self, key: int) -> int: """ Consistent hashing for stable shard assignment. Minimizes resharding when cluster size changes. """ # Using jump consistent hash for simplicity b, j = -1, 0 while j < self.num_shards: b = j key = ((key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF j = int((b + 1) * (1 << 31) / ((key >> 33) + 1)) return bEven with user-based sharding, some shards become hot due to popular users:
"""Hot Shard Detection and Mitigation Automatically detects and handles hot shards to preventperformance degradation and ensure fair resource distribution.""" class HotShardManager: def __init__( self, metrics_client, replica_manager, threshold_qps: int = 10000 ): self.metrics = metrics_client self.replicas = replica_manager self.threshold = threshold_qps self.hot_shards: Set[int] = set() async def monitor_and_scale(self): """ Continuously monitor shard traffic and scale as needed. """ while True: shard_qps = await self.metrics.get_shard_qps() for shard_id, qps in shard_qps.items(): if qps > self.threshold and shard_id not in self.hot_shards: # Shard became hot - add replicas await self._scale_up_shard(shard_id) self.hot_shards.add(shard_id) elif qps < self.threshold * 0.5 and shard_id in self.hot_shards: # Shard cooled down - remove extra replicas await self._scale_down_shard(shard_id) self.hot_shards.remove(shard_id) await asyncio.sleep(60) # Check every minute async def _scale_up_shard(self, shard_id: int): """Add read replicas to hot shard.""" current_replicas = await self.replicas.get_replica_count(shard_id) target_replicas = min(current_replicas * 2, 10) # Cap at 10 await self.replicas.set_replica_count(shard_id, target_replicas) logging.info( f"Scaled up shard {shard_id}: {current_replicas} -> {target_replicas}" )Changing the number of shards requires moving massive amounts of data. Use consistent hashing to minimize data movement, plan resharding during low-traffic periods, and always maintain enough capacity headroom to avoid emergency resharding.
Efficient retrieval requires carefully designed indexes. Different query patterns require different index structures.
| Index | Structure | Query Pattern | Notes |
|---|---|---|---|
| tweet_by_id | Hash index | GET tweet by ID | Primary key lookup |
| tweets_by_user | B-tree (user_id, created_at DESC) | User timeline | Range scan for pagination |
| tweets_by_conversation | B-tree (conversation_id, created_at) | Thread view | Replies to a tweet |
| tweets_by_reply_to | B-tree (reply_to_tweet_id) | Reply count | Count replies |
```sql
-- Core tweet table (per shard)
CREATE TABLE tweets (
    id                 BIGINT PRIMARY KEY,  -- Snowflake ID
    author_id          BIGINT NOT NULL,
    text               TEXT NOT NULL,
    created_at         TIMESTAMP NOT NULL,
    conversation_id    BIGINT,
    reply_to_tweet_id  BIGINT,
    reply_to_user_id   BIGINT,
    is_deleted         BOOLEAN DEFAULT FALSE
    -- ... other fields
);

-- Primary lookup by ID (automatic with PRIMARY KEY)
-- Already optimal: hash-based access

-- User timeline: All tweets by a user, newest first
CREATE INDEX idx_user_timeline
ON tweets (author_id, created_at DESC)
WHERE is_deleted = FALSE;

-- Conversation view: All tweets in a thread
CREATE INDEX idx_conversation
ON tweets (conversation_id, created_at ASC)
WHERE conversation_id IS NOT NULL
AND is_deleted = FALSE;

-- Reply lookup: Find replies to a specific tweet
CREATE INDEX idx_replies
ON tweets (reply_to_tweet_id, created_at ASC)
WHERE reply_to_tweet_id IS NOT NULL
AND is_deleted = FALSE;

-- Efficient queries:

-- User timeline
SELECT * FROM tweets
WHERE author_id = :user_id AND is_deleted = FALSE
ORDER BY created_at DESC
LIMIT 50;

-- Thread view
SELECT * FROM tweets
WHERE conversation_id = :root_tweet_id AND is_deleted = FALSE
ORDER BY created_at ASC;

-- Cursor-based pagination (more efficient than OFFSET)
SELECT * FROM tweets
WHERE author_id = :user_id
  AND created_at < :cursor_timestamp
  AND is_deleted = FALSE
ORDER BY created_at DESC
LIMIT 50;
```

Some queries span multiple shards and require global indexes:
"""Global Secondary Indexes Indexes that span all shards, enabling queries that wouldotherwise require scatter-gather across all shards. Stored separately from tweet data, often in a distributedkey-value store like Redis or a specialized index service.""" class GlobalTweetIndex: """ Global index mapping tweet_id to its location. Essential for single-tweet lookups in user-sharded system. """ def __init__(self, redis_cluster): self.redis = redis_cluster self.TTL = 86400 * 7 # 7 days (tweets rarely move) async def set_location(self, tweet_id: int, shard_id: int, user_id: int): """Record where a tweet is stored.""" key = f"tweet_loc:{tweet_id}" value = f"{shard_id}:{user_id}" await self.redis.set(key, value, ex=self.TTL) async def get_location(self, tweet_id: int) -> Optional[Tuple[int, int]]: """Look up tweet location.""" key = f"tweet_loc:{tweet_id}" value = await self.redis.get(key) if value: shard_id, user_id = value.split(":") return int(shard_id), int(user_id) return None async def batch_get_locations( self, tweet_ids: List[int] ) -> Dict[int, Tuple[int, int]]: """Batch lookup for efficiency.""" keys = [f"tweet_loc:{tid}" for tid in tweet_ids] values = await self.redis.mget(keys) result = {} for tid, value in zip(tweet_ids, values): if value: shard_id, user_id = value.split(":") result[tid] = (int(shard_id), int(user_id)) return result class MentionIndex: """ Index for finding tweets that mention a specific user. Enables @mention notifications and mention timelines. """ async def add_mention( self, mentioned_user_id: int, tweet_id: int, timestamp: float ): """Record a mention for a user.""" key = f"mentions:{mentioned_user_id}" await self.redis.zadd(key, {str(tweet_id): timestamp}) # Trim to last 1000 mentions await self.redis.zremrangebyrank(key, 0, -1001) async def get_mentions( self, user_id: int, limit: int = 50 ) -> List[int]: """Get recent tweets mentioning a user.""" key = f"mentions:{user_id}" tweet_ids = await self.redis.zrevrange(key, 0, limit - 1) return [int(tid) for tid in tweet_ids]Every index adds write overhead. Adding a tweet with 3 mentions requires 4 writes: 1 to tweet storage + 3 to mention index. Balance query performance against write amplification. Only create indexes for queries that are frequent and performance-critical.
Not all tweets are accessed equally. Today's viral tweet might get millions of reads; a 5-year-old tweet might get one read per month. A multi-tier storage architecture optimizes for this access pattern.
| Tier | Age Range | Storage Type | Latency | Cost/GB | Access Pattern |
|---|---|---|---|---|---|
| Hot | 0-7 days | Redis + RAM-MySQL | <5ms | $$$ | Very frequent |
| Warm | 7-90 days | SSD-MySQL | <20ms | $$ | Occasional |
| Cold | 90+ days | HDFS/S3 | <200ms | $ | Rare |
| Archive | 1+ years | Glacier/Tape | Hours | ¢ | Compliance only |
"""Tiered Storage Router Routes read and write requests to the appropriate storage tierbased on tweet age and access patterns.""" from datetime import datetime, timedeltafrom enum import Enum class StorageTier(Enum): HOT = "hot" # < 7 days WARM = "warm" # 7-90 days COLD = "cold" # > 90 days class TieredStorageRouter: HOT_THRESHOLD = timedelta(days=7) WARM_THRESHOLD = timedelta(days=90) def __init__( self, hot_storage, # Redis + RAM-optimized MySQL warm_storage, # SSD MySQL cold_storage # HDFS/S3 ): self.hot = hot_storage self.warm = warm_storage self.cold = cold_storage def determine_tier(self, created_at: datetime) -> StorageTier: """Determine storage tier based on tweet age.""" age = datetime.utcnow() - created_at if age < self.HOT_THRESHOLD: return StorageTier.HOT elif age < self.WARM_THRESHOLD: return StorageTier.WARM else: return StorageTier.COLD async def write_tweet(self, tweet: Tweet): """ Write new tweet to hot storage. Background job will age-out to cooler tiers. """ # All writes go to hot storage await self.hot.write(tweet) async def read_tweet(self, tweet_id: int) -> Optional[Tweet]: """ Read tweet, checking tiers in order. Promotes to hot tier if accessed from cold storage. """ # Check hot first (most common case) tweet = await self.hot.read(tweet_id) if tweet: return tweet # Check warm tweet = await self.warm.read(tweet_id) if tweet: # Promote to hot if frequently accessed await self._maybe_promote(tweet, StorageTier.WARM) return tweet # Check cold tweet = await self.cold.read(tweet_id) if tweet: # Promote to warm (never directly to hot) await self._maybe_promote(tweet, StorageTier.COLD) return tweet return None async def read_user_timeline( self, user_id: int, limit: int = 50 ) -> List[Tweet]: """ Read user's timeline, spanning tiers if necessary. Most timelines fit entirely in hot tier. """ tweets = [] # Start from hot tier hot_tweets = await self.hot.read_user_tweets(user_id, limit) tweets.extend(hot_tweets) if len(tweets) < limit: # Need more from warm tier remaining = limit - len(tweets) warm_tweets = await self.warm.read_user_tweets( user_id, remaining ) tweets.extend(warm_tweets) if len(tweets) < limit: # Need more from cold tier (unusual) remaining = limit - len(tweets) cold_tweets = await self.cold.read_user_tweets( user_id, remaining ) tweets.extend(cold_tweets) return tweets async def _maybe_promote(self, tweet: Tweet, from_tier: StorageTier): """ Promote tweet to hotter tier if access pattern warrants. Uses access counter to avoid promoting one-time accesses. """ access_count = await self._increment_access_count(tweet.id) if from_tier == StorageTier.COLD and access_count >= 5: # Accessed 5+ times - promote to warm await self.warm.write(tweet) elif from_tier == StorageTier.WARM and access_count >= 20: # Accessed 20+ times - promote to hot (going viral) await self.hot.write(tweet)12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182
"""Age-Out Background Job Moves tweets from hot to warm to cold storage based on age.Runs continuously, processing in small batches to minimizeimpact on production traffic.""" class AgeOutJob: def __init__( self, tiered_storage: TieredStorageRouter, batch_size: int = 1000, sleep_between_batches: float = 0.1 ): self.storage = tiered_storage self.batch_size = batch_size self.sleep_ms = sleep_between_batches async def run_hot_to_warm_migration(self): """ Continuously migrate tweets from hot to warm storage. """ while True: threshold = datetime.utcnow() - timedelta(days=7) # Find tweets older than 7 days in hot storage old_tweets = await self.storage.hot.scan_older_than( threshold, limit=self.batch_size ) if not old_tweets: # Nothing to migrate - sleep longer await asyncio.sleep(60) continue # Batch write to warm storage await self.storage.warm.batch_write(old_tweets) # Delete from hot storage tweet_ids = [t.id for t in old_tweets] await self.storage.hot.batch_delete(tweet_ids) self.metrics.increment( "age_out.hot_to_warm", len(old_tweets) ) # Rate limit to avoid impacting production await asyncio.sleep(self.sleep_ms) async def run_warm_to_cold_migration(self): """ Migrate tweets from warm to cold storage. Similar pattern, different thresholds. """ while True: threshold = datetime.utcnow() - timedelta(days=90) old_tweets = await self.storage.warm.scan_older_than( threshold, limit=self.batch_size ) if not old_tweets: await asyncio.sleep(300) # Check less frequently continue # Cold storage (HDFS/S3) uses different write pattern await self.storage.cold.batch_write(old_tweets) await self.storage.warm.batch_delete( [t.id for t in old_tweets] ) self.metrics.increment( "age_out.warm_to_cold", len(old_tweets) ) await asyncio.sleep(self.sleep_ms)Multi-tier storage reduces costs significantly—cold storage is 10-100x cheaper than hot storage. But it adds complexity: migration jobs, cross-tier queries, and promotion logic. For smaller systems, a single tier might be simpler and sufficient.
You've now covered the core concepts of tweet storage at scale: a compact tweet data model, Snowflake IDs for coordination-free and time-sorted identifiers, user-based sharding backed by a global tweet_id index, purpose-built indexes for timeline and conversation queries, and multi-tier storage that balances latency against cost.
What's Next:
With storage fundamentals covered, the next page explores Trending Topics—how to detect and surface what's happening in real-time across millions of tweets per hour. We'll cover streaming data pipelines, approximate counting algorithms, personalization, and anti-gaming measures.
You now understand how to store and retrieve hundreds of billions of tweets with sub-10ms latency. This knowledge applies beyond Twitter—to any system storing massive volumes of user-generated content with similar access patterns.