Instagram ingests hundreds of millions of new photos and videos every day. WhatsApp processes roughly 100 billion messages daily. Netflix stores thousands of hours of video in multiple resolutions across global data centers. Behind these staggering numbers lies a fundamental system design question: how much storage do we actually need?
Storage estimation is where theoretical capacity meets economic reality. Unlike traffic—which is transient—storage accumulates. Every message sent, every photo uploaded, every log line written occupies space until it is explicitly deleted. Underestimate traffic and responses get slower until you scale out; underestimate storage and disks fill, writes fail, and you face a catastrophic outage.
More critically, storage decisions are hard to reverse. Choosing the wrong database or storage tier is expensive to fix. Data migrations at scale can take months. Storage estimation isn't just about calculating numbers—it's about making architectural decisions with multi-year consequences.
By the end of this page, you will be able to: (1) Calculate storage requirements for different data types, (2) Project storage growth over multiple years, (3) Understand replication and backup overhead, (4) Optimize storage costs through tiered strategies, (5) Apply these principles in system design interviews.
Storage estimation follows a systematic framework. Every data point in your system flows through this chain:
The Storage Equation:
Total Storage = Objects Created × Size per Object × Retention Period × Replication Factor × Overhead
Let's decompose each component:
Objects Created: How many new data items are written per time unit? For Twitter, this is tweets per day. For Netflix, this is new video hours per month. For a banking system, this is transactions per day.
Size per Object: How large is each item? A tweet is a few kilobytes including metadata. A Netflix movie is potentially terabytes across all quality levels. Accurate size estimation requires understanding the data model.
Retention Period: How long do you keep data? Session logs might be kept for 30 days. Financial transactions for 7 years (regulatory). Social media posts forever (until user deletion).
Replication Factor: How many copies exist? Production databases typically replicate 3x. Backups add more. Cross-region redundancy doubles again.
Overhead: Indexes, metadata, filesystem overhead, and operational headroom. Typically 20-50% on top of raw data.
| Factor | Typical Multiplier | Reason |
|---|---|---|
| Database replication | 3x | Primary + 2 replicas for HA |
| Cross-region redundancy | 2x | DR in secondary region |
| Backup copies | 1.5-2x | Daily/weekly/monthly backups |
| Index overhead | 1.2-1.5x | B-tree indexes, secondary indexes |
| Filesystem overhead | 1.1-1.2x | Block allocation, metadata |
| Operational headroom | 1.3x | 30% free space for operations |
These factors multiply, not add. If your raw data is 100TB: 100TB × 3 (replication) × 2 (DR) × 1.5 (backups) × 1.2 (indexes) × 1.3 (headroom) = 1,404 TB ≈ 1.4 PB. Your 100TB of 'data' requires 1.4PB of actual storage infrastructure.
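The multiplier chain above is easy to encode as a helper. This is a minimal sketch; the default factor values are the illustrative numbers from the table, not universal constants:

```python
# Hedged sketch: compose the storage multipliers from the table above.
# Defaults are illustrative values, not universal constants.
def actual_storage_tb(raw_tb: float,
                      replication: float = 3.0,
                      cross_region: float = 2.0,
                      backups: float = 1.5,
                      indexes: float = 1.3,
                      headroom: float = 1.2) -> float:
    """Multiply (never add) every overhead factor onto the raw footprint."""
    return raw_tb * replication * cross_region * backups * indexes * headroom

print(f"{actual_storage_tb(100):,.0f} TB")  # 100 TB raw -> ~1,404 TB ≈ 1.4 PB
```

Because the factors compound, trimming any single one (say, dropping cross-region DR for non-critical data) cuts the final footprint proportionally.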
Different data types have vastly different storage characteristics. A senior engineer intuitively knows approximate sizes for common objects. Here's your reference guide:
| Data Type | Typical Size | Size Range | Storage Considerations |
|---|---|---|---|
| User ID (UUID) | 16-36 bytes | 16B binary, 36B string | Use binary UUIDs for ~55% space savings |
| Integer ID | 4-8 bytes | int32 vs int64 | int64 for >2B records |
| Timestamp | 8 bytes | 4-8 bytes | Unix epoch (4B) or precise datetime (8B) |
| Short text (username) | 20-50 bytes | Variable | VARCHAR, not fixed CHAR |
| Medium text (tweet) | 300-500 bytes | 140-280 chars + metadata | UTF-8 encoding varies by language |
| Long text (article) | 5-50 KB | Variable | Consider compression |
| JSON document | 1-10 KB | Variable | JSONB more compact than text JSON |
| Thumbnail image | 10-50 KB | Variable | Aggressive compression |
| Standard photo | 2-5 MB | 1-20 MB | Quality-dependent |
| HD video (1 min) | 50-150 MB | Variable | Highly codec-dependent |
| 4K video (1 min) | 200-500 MB | Variable | Multiple formats for adaptive streaming |
| Log entry | 200-500 bytes | Variable | Structured logs more compact |
| Metric data point | 8-32 bytes | Variable | Time-series optimized storage |
Character Encoding Matters:
Character size varies by encoding:
- ASCII: 1 byte per character
- UTF-8: 1 byte for ASCII, 2 bytes for most accented European scripts, 3 bytes for CJK characters, 4 bytes for emoji
- UTF-16: 2 or 4 bytes per character
A global platform with multi-language support should assume an average of 2 bytes per character for text content.
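You can verify these sizes directly in Python; the sample strings below are arbitrary examples:

```python
# The same count of "characters" occupies very different byte counts
# under UTF-8 depending on the script.
samples = [
    ("hello", "ASCII"),
    ("héllo", "accented Latin"),
    ("こんにちは", "Japanese"),
    ("😀😀😀😀😀", "emoji"),
]
for text, label in samples:
    encoded = text.encode("utf-8")
    print(f"{label}: {len(text)} chars -> {len(encoded)} bytes")
```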
Metadata Overhead:
Every object has metadata beyond its content: identifiers, timestamps, denormalized counters, flags, and foreign-key references.
A "simple" 280-character tweet carries roughly 200 additional bytes of metadata (see the breakdown below).
```python
# Detailed size calculation for a single tweet

class Tweet:
    """
    Tweet storage size breakdown.
    All sizes in bytes.
    """
    # Core identifiers
    tweet_id: int = 8           # int64
    user_id: int = 8            # int64

    # Content
    text_max: int = 560         # 280 chars × ~2 bytes (UTF-8 average)

    # Timestamps
    created_at: int = 8         # datetime
    updated_at: int = 8         # datetime

    # Engagement counters (denormalized for read performance)
    like_count: int = 4         # int32
    retweet_count: int = 4      # int32
    reply_count: int = 4        # int32
    quote_count: int = 4        # int32

    # Metadata
    language_code: int = 3      # 'en', 'es', 'jp', etc.
    source_app: int = 50        # 'Twitter for iPhone'

    # References
    reply_to_id: int = 8        # Nullable - original tweet ID
    quoted_tweet_id: int = 8    # Nullable - quoted tweet ID

    # Media references (actual media stored separately)
    media_ids: int = 32         # Up to 4 media items × 8 bytes

    # Location (optional)
    geo_lat: int = 8            # double
    geo_lng: int = 8            # double
    place_id: int = 24          # String reference

    # Flags
    is_sensitive: int = 1       # boolean
    is_reply: int = 1           # boolean
    has_media: int = 1          # boolean

    def total_size(self) -> int:
        return (
            self.tweet_id + self.user_id + self.text_max
            + self.created_at + self.updated_at
            + self.like_count + self.retweet_count
            + self.reply_count + self.quote_count
            + self.language_code + self.source_app
            + self.reply_to_id + self.quoted_tweet_id
            + self.media_ids
            + self.geo_lat + self.geo_lng + self.place_id
            + self.is_sensitive + self.is_reply + self.has_media
        )

tweet = Tweet()
print(f"Single tweet size: {tweet.total_size()} bytes ≈ {tweet.total_size()/1024:.2f} KB")

# Scale calculation
tweets_per_day = 500_000_000  # 500 million tweets/day
raw_daily_storage = tweets_per_day * tweet.total_size()
print(f"Daily tweet storage: {raw_daily_storage / (1024**4):.2f} TB (raw)")
print(f"With 3x replication: {raw_daily_storage * 3 / (1024**4):.2f} TB")
print(f"Yearly (365 days): {raw_daily_storage * 365 / (1024**5):.2f} PB")
```

Databases add significant overhead beyond raw data.
Understanding these overheads is crucial for accurate estimation.
Index Overhead:
Indexes trade space for query speed. Each B-tree index typically adds 10-30% of the table's size, and most production tables carry several.
Example: a users table with 100M rows × 500 bytes = 50GB raw data. Three indexes (a primary key, a unique email index, and a created_at index) might together add roughly 23GB.
Total: 50GB + 23GB = 73GB (46% overhead just from indexes)
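A quick sketch of where such a 46% figure might come from. The per-index fractions here are assumptions for illustration, not measurements:

```python
# Index overhead estimate for a hypothetical 100M-row users table.
# Per-index fractions are illustrative assumptions.
rows, row_bytes = 100_000_000, 500
raw_gb = rows * row_bytes / 1e9          # 50 GB of raw data

index_fraction_of_table = {
    "primary key (B-tree)": 0.15,
    "unique email index": 0.15,
    "created_at index": 0.16,
}
index_gb = raw_gb * sum(index_fraction_of_table.values())
print(f"Raw: {raw_gb:.0f} GB, indexes: {index_gb:.0f} GB, "
      f"total: {raw_gb + index_gb:.0f} GB")
```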
Write Amplification:
When you write 1KB of data, the database might write 10KB or more: the row itself, a write-ahead log entry, an update to every index on the table, copies shipped to each replica, and later rewrites during compaction or vacuuming.
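A back-of-envelope sketch of that fan-out. Every amplification factor here is an assumed, order-of-magnitude illustration, not a benchmark:

```python
# Order-of-magnitude write amplification for a 1 KB logical write.
# All factors below are assumed for illustration.
payload_kb = 1.0
extra_writes_kb = {
    "write-ahead log entry": 1.0,       # full record logged before commit
    "3 secondary index updates": 0.9,   # ~0.3 KB per index entry (assumed)
    "replication to 2 standbys": 2.0,   # payload shipped to each replica
    "compaction/vacuum rewrite": 2.0,   # row rewritten during maintenance
}
total_kb = payload_kb + sum(extra_writes_kb.values())
print(f"1 KB logical write -> ~{total_kb:.1f} KB of physical writes")
```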
| Database Type | Storage Efficiency | Index Overhead | Best For |
|---|---|---|---|
| PostgreSQL | High (TOAST compression) | 10-30% | General purpose, structured data |
| MySQL (InnoDB) | Medium | 15-40% | OLTP workloads |
| MongoDB | Medium (BSON) | 20-50% | Flexible schemas |
| Cassandra | Low (replication) | 5-15% | Write-heavy, wide-column |
| Redis | Low (in-memory) | 50-100% | Caching, sessions |
| ClickHouse | Very High (columnar) | 5-10% | Analytics, time-series |
| Elasticsearch | Low | 100-300% | Full-text search |
Elasticsearch Special Case:
Elasticsearch deserves special mention because its storage overhead often surprises engineers: the inverted index, doc values, and the stored `_source` document each keep their own representation of your data, and replica shards multiply all of it. A total footprint of 2-4x the raw data is common.
DynamoDB/Cassandra Distribution:
Distributed databases spread data across partitions, and each partition carries its own indexes, metadata, and (in Cassandra's case) tombstones for deleted rows. Partition keys rarely distribute perfectly, so some nodes hold noticeably more than the average.
When sizing, account for partition overhead and potential imbalance.
Text and JSON compress extremely well—often 5-10x reduction. Enable compression for archival storage. However, compression increases CPU usage. For hot data with frequent access, the CPU cost may outweigh storage savings. Compress cold data aggressively; keep hot data uncompressed for performance.
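A quick experiment with Python's built-in zlib shows why repetitive text and JSON compress so well; the exact ratio depends on the data, so no specific figure is claimed here:

```python
import json
import zlib

# Repetitive structured logs — typical of what lands in archival storage.
records = [
    {"user_id": i, "event": "page_view", "path": "/home", "ts": 1700000000 + i}
    for i in range(1000)
]
raw = json.dumps(records).encode("utf-8")
compressed = zlib.compress(raw, level=6)
print(f"Raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
```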
Media typically dominates storage in consumer applications. A single 4K video can consume more storage than millions of text records.
Image Storage:
Images are stored at multiple sizes for different use cases: a small thumbnail for lists, a medium rendition for feeds, a full-resolution version for detail views, and often the original upload.
Total storage per image = sum of all versions ≈ 3-5 MB on average.
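Summing assumed rendition sizes lands in that 3-5 MB range. The per-version sizes below are illustrative and vary with format and quality settings:

```python
# Assumed rendition sizes for one uploaded photo (KB); values are
# illustrative, not measurements.
versions_kb = {
    "thumbnail (150px)": 20,
    "feed (1080px)": 200,
    "full resolution": 2500,
    "original upload": 1500,
}
total_mb = sum(versions_kb.values()) / 1024
print(f"Storage per photo, all versions: {total_mb:.1f} MB")
```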
Video Storage:
Video requires multiple renditions for adaptive streaming (HLS/DASH): roughly 1.5 MB/min at 144p, 3 at 240p, 4 at 360p, 8 at 480p, 22 at 720p, 45 at 1080p, and 150 at 4K.
Total storage per minute of video (all qualities) ≈ 230 MB/min
```python
# Storage calculation for a YouTube-like platform

# Video upload assumptions
hours_uploaded_per_minute = 500  # YouTube actual stat: 500 hours/min
minutes_per_day = 60 * 24

# Minutes of video uploaded daily
video_minutes_daily = hours_uploaded_per_minute * 60 * minutes_per_day
print(f"Video minutes uploaded daily: {video_minutes_daily:,}")

# Storage per minute of video (all quality levels)
storage_per_minute_mb = {
    "144p": 1.5,
    "240p": 3,
    "360p": 4,
    "480p": 8,
    "720p": 22,
    "1080p": 45,
    "1440p": 90,
    "2160p (4K)": 150,
}

# Not all videos are encoded at all qualities
# Assume average encoding profile
average_mb_per_minute = 100  # Weighted average

# Daily storage (raw)
daily_storage_tb = video_minutes_daily * average_mb_per_minute / (1024 * 1024)
print(f"Daily raw video storage: {daily_storage_tb:,.0f} TB")

# With CDN distribution (multiple copies across edge locations)
cdn_copies = 3          # Minimum copies for global coverage
# Plus origin storage with replication
origin_replication = 3

# Actual storage (origin + some CDN)
# CDN typically caches popular 20% of content
cdn_cached_percentage = 0.20

total_daily = (daily_storage_tb * origin_replication
               + daily_storage_tb * cdn_cached_percentage * cdn_copies)
print(f"Daily storage with replication: {total_daily:,.0f} TB")
print(f"Annual storage growth: {total_daily * 365 / 1000:,.1f} PB/year")
```

| Media Type | Typical Size | Storage Strategy |
|---|---|---|
| Profile photo | 500 KB total (all sizes) | Cache aggressively, rarely changes |
| Social media photo | 3-5 MB (all sizes) | Hot storage for recent, cold for old |
| Short video (15 sec) | 50-100 MB (all qualities) | CDN caching, adaptive streaming |
| Standard video (10 min) | 1-3 GB (all qualities) | Tiered storage by view count |
| Movie (2 hours) | 20-50 GB (all qualities) | Origin + edge caching |
| User-generated document | 100 KB - 10 MB | Deduplicated storage |
| Audio track (3 min) | 10-30 MB (all qualities) | Cache popular tracks |
Many platforms see significant duplicate content—the same meme might be uploaded thousands of times. Content-addressable storage (using a hash of the content as the key) can reduce storage by 20-40% for UGC platforms. This is why content-addressed systems like IPFS, and deduplication strategies generally, matter at scale.
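A minimal sketch of the idea: the blob key is the SHA-256 of the content, so identical uploads store one physical copy. `BlobStore` is a hypothetical toy, not a real library:

```python
import hashlib

class BlobStore:
    """Toy content-addressable store: key = SHA-256 of the bytes."""
    def __init__(self):
        self.blobs = {}      # content hash -> bytes (stored once)
        self.refcount = {}   # content hash -> number of logical uploads

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data  # only the first upload pays for storage
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

store = BlobStore()
meme = b"the same image bytes uploaded over and over"
for _ in range(1000):
    store.put(meme)
print(f"Logical uploads: 1000, physical blobs stored: {len(store.blobs)}")
```

Deletion becomes reference counting: the blob is only reclaimed when its last logical reference goes away.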
Storage planning requires looking years into the future. You can't simply provision petabytes of new capacity or migrate storage systems mid-year when capacity runs out—lead times are measured in months.
The Compound Growth Formula:
Storage(year N) = Current Storage × (1 + Growth Rate)^N + Cumulative New Data
But this is simplistic. Real storage growth compounds along several axes: user growth, per-user engagement growth, growth in object size (e.g., higher-resolution media), and changes to retention policy.
Modeling Growth Rates:
```python
# 5-Year storage projection model

from dataclasses import dataclass
from typing import List

@dataclass
class YearlyProjection:
    year: int
    dau: int
    objects_per_user_day: float
    object_size_bytes: int
    retention_days: int
    new_storage_pb: float
    cumulative_storage_pb: float

def project_storage(
    initial_dau: int,
    dau_growth_rate: float,
    engagement_growth_rate: float,
    initial_objects_per_user: float,
    object_size_bytes: int,
    retention_days: int,
    years: int
) -> List[YearlyProjection]:
    projections = []
    cumulative_storage = 0

    for year in range(1, years + 1):
        # Calculate metrics for this year
        dau = int(initial_dau * (1 + dau_growth_rate) ** (year - 1))
        objects_per_user = (initial_objects_per_user
                            * (1 + engagement_growth_rate) ** (year - 1))

        # Annual data creation
        daily_objects = dau * objects_per_user
        annual_objects = daily_objects * 365
        annual_storage_bytes = annual_objects * object_size_bytes
        annual_storage_pb = annual_storage_bytes / (1024 ** 5)

        # Account for data retention
        # If retention is 365 days, we keep 1 year of data
        # If retention is unlimited, cumulative grows indefinitely
        if retention_days >= 365:
            cumulative_storage += annual_storage_pb
        else:
            # Rolling retention - only keep retention_days worth
            daily_bytes = dau * objects_per_user * object_size_bytes
            cumulative_storage = (daily_bytes * retention_days) / (1024 ** 5)

        projections.append(YearlyProjection(
            year=year,
            dau=dau,
            objects_per_user_day=objects_per_user,
            object_size_bytes=object_size_bytes,
            retention_days=retention_days,
            new_storage_pb=annual_storage_pb,
            cumulative_storage_pb=cumulative_storage
        ))

    return projections

# Example: Social media platform with photos
projections = project_storage(
    initial_dau=50_000_000,             # 50M DAU year 1
    dau_growth_rate=0.25,               # 25% YoY user growth
    engagement_growth_rate=0.10,        # 10% more photos per user each year
    initial_objects_per_user=3,         # 3 photos/day initially
    object_size_bytes=3 * 1024 * 1024,  # 3MB per photo (all sizes)
    retention_days=36500,               # Keep forever (100 years)
    years=5
)

print("5-Year Storage Projection (Photo Platform)")
print("=" * 70)
for p in projections:
    print(f"Year {p.year}: DAU={p.dau/1e6:.0f}M | "
          f"Photos/user={p.objects_per_user_day:.1f} | "
          f"New={p.new_storage_pb:.1f}PB | "
          f"Total={p.cumulative_storage_pb:.1f}PB")
```

| Year | DAU | New Data (PB) | Total Data (PB) | Storage Cost (est) |
|---|---|---|---|---|
| Year 1 | 50M | 153 PB | 153 PB | $3.7M/month |
| Year 2 | 62.5M | 210 PB | 363 PB | $8.8M/month |
| Year 3 | 78.1M | 289 PB | 653 PB | $15.7M/month |
| Year 4 | 97.7M | 398 PB | 1,050 PB | $25.3M/month |
| Year 5 | 122M | 547 PB | 1,597 PB | $38.5M/month |
Storage costs grow exponentially while revenue typically grows linearly or sub-linearly. A platform adding ~150PB in Year 1 is adding ~550PB a year by Year 5. Without tiered storage strategies (moving cold data to cheaper tiers), storage costs can consume unsustainable portions of revenue.
Not all data deserves the same storage class. Hot data needs fast access; cold data just needs to exist. Tiered storage is key to controlling costs at scale.
The Storage Temperature Model:
Hot Storage: Frequently accessed, low latency required (<10ms). SSDs, high-IOPS databases. Most expensive.
Warm Storage: Occasionally accessed, moderate latency acceptable (<100ms). HDDs, standard cloud storage.
Cold Storage: Rarely accessed, high latency acceptable (<1 hour). Archive storage, tape.
Glacier/Archive: Almost never accessed, retrieval takes hours. Compliance and disaster recovery.
| Tier | Use Case | Latency | Cost per GB/month | Retrieval Cost |
|---|---|---|---|---|
| S3 Standard | Frequent access | <10ms | $0.023 | Free |
| S3 Intelligent-Tiering | Variable access | <10ms | $0.0225 | Free |
| S3 Standard-IA | Infrequent access | <10ms | $0.0125 | $0.01/GB |
| S3 One Zone-IA | Recreatable data | <10ms | $0.01 | $0.01/GB |
| S3 Glacier Instant | Rare access | <10ms | $0.004 | $0.03/GB |
| S3 Glacier Flexible | Archive | Minutes-hours | $0.0036 | $0.03-0.05/GB |
| S3 Glacier Deep Archive | Long-term archive | 12-48 hours | $0.00099 | $0.02/GB |
Automatic Tiering Strategies:
Implement lifecycle policies based on access patterns:
Policy: Social Media Photos
- Days 0-30: S3 Standard (frequently viewed)
- Days 31-90: S3 Standard-IA (occasional viewing)
- Days 91-365: S3 Glacier Instant (rare viewing)
- Days 365+: S3 Glacier Deep Archive (almost never)
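The policy above can be expressed as an S3 lifecycle configuration (the shape boto3's `put_bucket_lifecycle_configuration` accepts). The bucket prefix and rule ID below are made-up examples:

```python
# Hedged sketch of the photo lifecycle policy as an S3 lifecycle
# configuration; prefix and rule ID are illustrative.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "photo-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "photos/"},
            "Transitions": [
                {"Days": 31, "StorageClass": "STANDARD_IA"},
                {"Days": 91, "StorageClass": "GLACIER_IR"},
                {"Days": 366, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

days = [t["Days"] for t in lifecycle_configuration["Rules"][0]["Transitions"]]
print(days)  # transitions must be in increasing order: [31, 91, 366]
```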
The 80/20 Rule of Storage:
In most systems, roughly 80% of reads hit the newest or most popular 20% of the data, while content older than about 90 days is touched rarely—often less than once a month.
By tiering aggressively, you can reduce costs by 60-80% while maintaining user experience for the actively accessed content.
```python
# Calculate tiered storage savings

class StorageTierAnalysis:
    def __init__(self, total_storage_pb: float, monthly_growth_pb: float):
        self.total_storage_pb = total_storage_pb
        self.monthly_growth_pb = monthly_growth_pb

    def calculate_flat_cost(self) -> float:
        """All data in S3 Standard"""
        cost_per_gb = 0.023
        gb = self.total_storage_pb * 1024 * 1024  # PB to GB
        return gb * cost_per_gb

    def calculate_tiered_cost(self) -> float:
        """
        Distribution:
        - 15% Hot (last 30 days of new data)
        - 25% Warm (31-90 days)
        - 30% Cold (91-365 days)
        - 30% Archive (365+ days)
        """
        gb = self.total_storage_pb * 1024 * 1024
        hot_pct, warm_pct, cold_pct, archive_pct = 0.15, 0.25, 0.30, 0.30

        hot_cost = gb * hot_pct * 0.023          # S3 Standard
        warm_cost = gb * warm_pct * 0.0125       # S3 Standard-IA
        cold_cost = gb * cold_pct * 0.004        # Glacier Instant
        archive_cost = gb * archive_pct * 0.001  # Glacier Deep Archive

        return hot_cost + warm_cost + cold_cost + archive_cost

    def savings_analysis(self) -> dict:
        flat = self.calculate_flat_cost()
        tiered = self.calculate_tiered_cost()
        savings = flat - tiered
        savings_pct = (savings / flat) * 100

        return {
            "flat_cost_monthly": flat,
            "tiered_cost_monthly": tiered,
            "monthly_savings": savings,
            "savings_percentage": savings_pct,
            "annual_savings": savings * 12
        }

# Example: 100PB photo storage platform
analysis = StorageTierAnalysis(total_storage_pb=100, monthly_growth_pb=5)
results = analysis.savings_analysis()

print("Storage Cost Analysis: 100PB Photo Platform")
print("=" * 50)
print(f"Flat pricing (all S3 Standard): ${results['flat_cost_monthly']:,.0f}/month")
print(f"Tiered pricing: ${results['tiered_cost_monthly']:,.0f}/month")
print(f"Monthly savings: ${results['monthly_savings']:,.0f}")
print(f"Savings percentage: {results['savings_percentage']:.1f}%")
print(f"Annual savings: ${results['annual_savings']/1e6:.1f}M")
```

For a 100PB storage footprint, proper tiering saves nearly $20 million annually versus flat S3 Standard pricing.
This is why every major platform has dedicated storage infrastructure teams focused on data lifecycle management.
Let's walk through complete storage estimations for three different system types:
Example 1: URL Shortening Service (Like bit.ly)
```python
# URL Shortening Service Storage Estimation

# Service assumptions
daily_url_creations = 100_000_000  # 100M URLs created/day
service_lifespan_years = 10        # URLs never expire

# Data model per URL
url_record = {
    "short_code": 7,      # 7 character code
    "original_url": 200,  # Average URL length
    "user_id": 8,         # int64 (nullable for anonymous)
    "created_at": 8,      # timestamp
    "click_count": 4,     # int32
    "last_clicked": 8,    # timestamp (nullable)
    "metadata": 50,       # custom tracking params, title
}
bytes_per_url = sum(url_record.values())
print(f"Bytes per URL: {bytes_per_url}")

# Analytics data (per click)
click_record = {
    "click_id": 8,
    "short_code": 7,
    "timestamp": 8,
    "ip_hash": 16,      # Anonymized
    "user_agent": 100,
    "referer": 100,
    "country": 2,
    "device_type": 1,
}
bytes_per_click = sum(click_record.values())
print(f"Bytes per click: {bytes_per_click}")

# Assume each URL gets clicked 50 times on average
clicks_per_url = 50

# Daily storage
daily_url_storage = daily_url_creations * bytes_per_url
daily_click_storage = daily_url_creations * clicks_per_url * bytes_per_click
daily_total = daily_url_storage + daily_click_storage

print(f"Daily URL storage: {daily_url_storage / 1e9:.1f} GB")
print(f"Daily click storage: {daily_click_storage / 1e9:.1f} GB")
print(f"Daily total: {daily_total / 1e9:.1f} GB")

# With replication and overhead (3x replication, 1.5x indexes/overhead)
storage_multiplier = 3 * 1.5
daily_actual = daily_total * storage_multiplier

# 10-year projection
ten_year_storage = daily_actual * 365 * service_lifespan_years
print(f"10-year storage: {ten_year_storage / 1e15:.1f} PB")
```

Example 2: Chat Application (Like Slack/Discord)
```python
# Chat Application Storage Estimation

# User assumptions
monthly_active_users = 20_000_000  # 20M MAU
dau_mau_ratio = 0.65               # High engagement
daily_active_users = monthly_active_users * dau_mau_ratio

# Message patterns
messages_per_user_per_day = 40  # Active messengers
direct_message_ratio = 0.4      # 40% DMs, 60% channels

# Message data model
text_message = {
    "message_id": 16,        # Snowflake ID
    "channel_id": 16,        # Where sent
    "user_id": 16,
    "content": 500,          # Average message (incl. emojis, links)
    "created_at": 8,
    "edited_at": 8,
    "attachments_meta": 50,  # References to files
    "reactions_count": 4,
    "is_pinned": 1,
    "thread_id": 16,         # If reply
}
bytes_per_message = sum(text_message.values())

# File attachments (images, documents)
messages_with_attachments_ratio = 0.15  # 15% have attachments
average_attachment_size_mb = 2

# Reactions (separate table for many-to-many)
reactions_per_message = 0.5  # Average reactions
reaction_record_bytes = 32   # message_id + user_id + emoji

# Daily calculations
daily_messages = daily_active_users * messages_per_user_per_day
daily_text_storage = daily_messages * bytes_per_message
daily_attachment_storage = (daily_messages * messages_with_attachments_ratio
                            * average_attachment_size_mb * 1e6)
daily_reaction_storage = daily_messages * reactions_per_message * reaction_record_bytes

print(f"Daily messages: {daily_messages/1e6:.0f}M")
print(f"Daily text storage: {daily_text_storage/1e9:.1f} GB")
print(f"Daily attachment storage: {daily_attachment_storage/1e9:.1f} GB")
print(f"Daily reaction storage: {daily_reaction_storage/1e9:.2f} GB")

total_daily = daily_text_storage + daily_attachment_storage + daily_reaction_storage
print(f"Total daily storage: {total_daily/1e9:.1f} GB")

# Retention: Keep messages forever, but tier attachments
# With 3x replication
annual_storage_tb = total_daily * 365 * 3 / 1e12
print(f"Annual storage (replicated): {annual_storage_tb:.0f} TB")
```

Example 3: E-Commerce Platform (Like Amazon)
```python
# E-Commerce Platform Storage Estimation

# Business scale
products = 500_000_000               # 500M product listings
daily_orders = 50_000_000            # 50M orders/day
daily_product_views = 5_000_000_000  # 5B product views/day

# Product catalog storage
product_record = {
    "product_id": 8,
    "seller_id": 8,
    "title": 200,
    "description": 2000,
    "category_ids": 32,     # Multiple categories
    "price": 8,
    "inventory": 4,
    "ratings_summary": 24,
    "attributes": 500,      # JSON attributes
    "created_at": 8,
    "updated_at": 8,
}
bytes_per_product = sum(product_record.values())

# Product images (separate storage)
images_per_product = 7          # Main + gallery
image_size_all_versions_mb = 3  # All resolutions

# Product catalog total
catalog_data = products * bytes_per_product
catalog_images = products * images_per_product * image_size_all_versions_mb * 1e6
print(f"Catalog data: {catalog_data/1e12:.1f} TB")
print(f"Catalog images: {catalog_images/1e15:.1f} PB")

# Order storage
order_record = {
    "order_id": 16,
    "user_id": 8,
    "total": 8,
    "status": 1,
    "payment_status": 1,
    "shipping_address": 500,
    "created_at": 8,
    "updated_at": 8,
}
order_item_record = {
    "order_item_id": 16,
    "order_id": 16,
    "product_id": 8,
    "quantity": 4,
    "price": 8,
}
bytes_per_order = sum(order_record.values())
bytes_per_order_item = sum(order_item_record.values())
items_per_order = 3  # Average

daily_order_storage = daily_orders * (bytes_per_order
                                      + items_per_order * bytes_per_order_item)
print(f"Daily order storage: {daily_order_storage/1e9:.1f} GB")

# View/click history for recommendations
view_event = 50  # bytes per event
daily_view_storage = daily_product_views * view_event
print(f"Daily view event storage: {daily_view_storage/1e12:.1f} TB")

# Retain views for 90 days (recommendation training)
rolling_view_storage = daily_view_storage * 90
print(f"90-day view history: {rolling_view_storage/1e15:.1f} PB")
```

You now have a comprehensive framework for estimating storage requirements. Let's consolidate the key principles:
| Formula | Usage |
|---|---|
| Raw Storage = Objects/Day × Size × Retention Days | Baseline calculation |
| Actual Storage = Raw × Replication (3x) × Overhead (1.5x) | Production sizing |
| Annual Cost = Storage GB × Tier Rate × 12 | Budget planning |
| Savings = Flat Cost - Tiered Cost | Optimization opportunity |
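The formulas in the table chain together end-to-end. A minimal worked example, with all inputs as arbitrary assumed values:

```python
# Applying the summary formulas with assumed example inputs.
objects_per_day = 1_000_000
size_bytes = 2_000
retention_days = 365
replication, overhead = 3, 1.5
tier_rate_per_gb_month = 0.023  # e.g. S3 Standard

raw_gb = objects_per_day * size_bytes * retention_days / 1e9  # baseline
actual_gb = raw_gb * replication * overhead                   # production sizing
annual_cost = actual_gb * tier_rate_per_gb_month * 12         # budget planning

print(f"Raw: {raw_gb:.0f} GB, actual: {actual_gb:.0f} GB, "
      f"annual cost: ${annual_cost:,.0f}")
```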
What's Next:
With traffic and storage estimation complete, we move to bandwidth estimation—calculating the network capacity needed to deliver your data to users. Bandwidth connects traffic (requests per second) with storage (data transferred per request) to determine network infrastructure requirements.
You now understand how to estimate storage for any system. Practice by analyzing products you use: How much data does a single Instagram post consume? How much storage does Netflix need for its movie catalog? Building this intuition makes system design interviews significantly easier.