Every photo and video uploaded to Instagram must be stored durably, replicated for availability, and served quickly to users worldwide. This is not a typical storage problem—it's one of the largest media storage systems ever built.
Instagram's Storage Scale:
| Metric | Scale | Implication |
|---|---|---|
| Daily new uploads | 2 billion | ~6 petabytes of new data daily |
| Total media objects | 100+ billion | Requires distributed, sharded storage |
| Total storage | 20+ exabytes | At this scale, even 1% waste is massive |
| Read volume | 10+ billion reads/day | Heavy CDN caching required |
| Durability requirement | 99.999999999% | Never lose a user's photo |
| Availability requirement | 99.99% | Always accessible |
At this scale, everything is a trade-off. Storage cost, access latency, durability, and operational complexity must all be balanced. This page explores how Instagram manages media storage at exabyte scale.
By the end of this page, you will understand: (1) Object storage architecture and why it's chosen for media, (2) Tiered storage strategies that balance cost and access latency, (3) Replication and erasure coding for durability, (4) Sharding strategies for massive object counts, (5) Cross-region distribution and disaster recovery, (6) Data lifecycle management and deletion, and (7) Cost optimization techniques at exabyte scale.
Instagram uses object storage (similar to Amazon S3, Google Cloud Storage, or Azure Blob Storage) for media rather than traditional file systems or databases. Understanding why requires examining the properties of media workloads.
Media Workload Characteristics:
Meta's Internal Object Storage:
Meta (Instagram's parent company) operates internal object storage systems specifically designed for social media workloads:
These systems are conceptually similar to S3/GCS but optimized for Meta's specific access patterns, hardware, and scale.
Object Key Design:
# Object key structure
/{shard}/media/{year}/{month}/{media_id}/{variant}.{format}
# Examples:
/shard-0042/media/2024/01/abc123def456/large_1080.jpg
/shard-0042/media/2024/01/abc123def456/thumb_150.webp
Key components:
Object storage systems often partition by key prefix. Naive sequential keys (upload_0001, upload_0002...) create hot partitions. Using random prefixes (hash of media_id) or including the shard number ensures even distribution across storage nodes.
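A key layout like the one above can be generated with a short helper. This is a sketch of the idea; the shard count and function names are illustrative, not Instagram's actual values:

```python
import hashlib

NUM_SHARDS = 4096  # illustrative shard count


def shard_for(media_id: str) -> str:
    """Derive a stable shard prefix from a hash of the media_id."""
    digest = hashlib.sha256(media_id.encode()).hexdigest()
    shard_num = int(digest[:8], 16) % NUM_SHARDS
    return f"shard-{shard_num:04d}"


def object_key(media_id: str, year: int, month: int, variant: str, fmt: str) -> str:
    """Build a full object key following the layout above."""
    return f"/{shard_for(media_id)}/media/{year}/{month:02d}/{media_id}/{variant}.{fmt}"
```

Because the shard is a function of `media_id`, all variants of one upload land on the same shard while unrelated uploads spread evenly across the keyspace.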
Not all photos are accessed equally. Recent photos from active users are accessed constantly; old photos from inactive accounts are rarely viewed. Tiered storage aligns storage cost with access patterns.
The Access Pattern Reality:
Instagram's access follows an extreme power law:
Storing all content on high-performance (expensive) storage wastes money. Tiered storage places data on the appropriate storage class based on access likelihood.
| Tier | Age | Storage Type | Cost | Access Time | % of Total Data |
|---|---|---|---|---|---|
| Hot | <7 days | SSD-backed object store | $$$$ | <10ms | ~5% |
| Warm | 7-90 days | HDD-backed object store | $$ | <50ms | ~15% |
| Cold | 90 days - 1 year | Archival storage | $ | Minutes | ~30% |
| Glacier/Archive | >1 year | Deep archive (tape) | ¢ | Hours | ~50% |
Tier Migration Logic:
class TierMigrationService:
"""Manages automatic migration of objects between storage tiers."""
# Tier thresholds
HOT_TO_WARM_DAYS = 7
WARM_TO_COLD_DAYS = 90
COLD_TO_ARCHIVE_DAYS = 365
async def evaluate_for_migration(self, object_key: str, metadata: ObjectMetadata):
"""
Determine if object should be migrated to a different tier.
Considers age and recent access patterns.
"""
age_days = (time.now() - metadata.created_at).days
last_access_days = (time.now() - metadata.last_accessed_at).days
current_tier = metadata.storage_tier
# Hot → Warm migration
if current_tier == 'hot' and age_days > self.HOT_TO_WARM_DAYS:
# Check if still being accessed frequently
if metadata.access_count_7d < 10: # Low access
await self.migrate_object(object_key, 'warm')
# Warm → Cold migration
elif current_tier == 'warm' and age_days > self.WARM_TO_COLD_DAYS:
if metadata.access_count_30d < 5: # Very low access
await self.migrate_object(object_key, 'cold')
# Cold → Archive migration
elif current_tier == 'cold' and age_days > self.COLD_TO_ARCHIVE_DAYS:
if last_access_days > 180: # No access in 6 months
await self.migrate_object(object_key, 'archive')
async def migrate_object(self, object_key: str, target_tier: str):
"""
Migrate object to target tier.
Uses copy + verify + delete to avoid data loss.
"""
metadata = await metadata_service.get(object_key)
source_tier = metadata.storage_tier
# Copy to target tier storage
await target_storage[target_tier].copy_from(
source_storage[source_tier],
object_key
)
# Verify copy before deleting anything
await self.verify_copy(object_key, target_tier)
# Update metadata to point to new location
await metadata_service.update_tier(object_key, target_tier)
# Delete from source tier (only after verification)
await source_storage[source_tier].delete(object_key)
# Log migration event
await log_migration(object_key, source_tier, target_tier)
Tier Promotion (Cold → Hot):
When cold content is accessed, it may be promoted back to warmer tiers:
async def handle_cold_access(object_key: str):
"""
Handle access to cold-tier content.
May trigger promotion to warmer tier.
"""
# Fetch from cold storage (slow)
data = await cold_storage.get(object_key)
# Track access
await metadata_service.record_access(object_key)
access_count = await metadata_service.get_recent_access_count(object_key)
# Consider promotion if suddenly popular
if access_count > PROMOTION_THRESHOLD:
# Async promotion (don't block the request)
asyncio.create_task(promote_to_warm(object_key, data))
return data
Cost Impact:
Proper tiering dramatically reduces costs:
| Scenario | Monthly Cost Estimate |
|---|---|
| All data in hot tier | $100M+ |
| Tiered storage | $20-30M |
| Savings | 70-80% |
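The savings follow directly from the tier table above. With illustrative relative prices (assumptions for the sketch, not Meta's actual rates), weighting each tier's share of data by its unit cost:

```python
# Relative monthly cost per unit of data, hot tier normalized to 1.0
# (illustrative ratios, roughly tracking public cloud tier pricing)
TIER_COST = {"hot": 1.0, "warm": 0.4, "cold": 0.15, "archive": 0.04}

# Share of total data in each tier, from the tier table above
TIER_SHARE = {"hot": 0.05, "warm": 0.15, "cold": 0.30, "archive": 0.50}

all_hot_cost = 1.0  # everything priced at the hot rate
tiered_cost = sum(TIER_SHARE[t] * TIER_COST[t] for t in TIER_COST)
savings = 1.0 - tiered_cost  # fraction saved vs. keeping everything hot
```

With these assumed ratios the blended cost is about 17.5% of the all-hot cost, consistent with the 70-80% savings range in the table.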
Migration decisions aren't purely age-based. Hot content from celebrities might stay in hot tier longer. Content that suddenly goes viral (reshared, appears in news) triggers automatic promotion. ML models can predict access likelihood to optimize tier placement.
Instagram promises users their photos will never be lost. Achieving 11-nines durability (99.999999999%) requires sophisticated replication strategies.
Durability Math:
What does 11-nines durability mean?
11-nines = 99.999999999% durability
= 0.000000001% chance of data loss
= 1 in 100 billion objects might be lost per year
With 100 billion objects stored:
→ Expect ~1 object lost per year
This is achieved through replication and other protections.
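The arithmetic above is easy to check directly:

```python
def expected_annual_loss(objects: int, durability: float) -> float:
    """Expected number of objects lost per year, given per-object annual durability."""
    return objects * (1.0 - durability)


# 100 billion objects at 11 nines of durability
loss = expected_annual_loss(100_000_000_000, 0.99999999999)
```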
Replication Strategies:
| Strategy | Copies | Storage Overhead | Durability | Use Case |
|---|---|---|---|---|
| 3-way replication | 3 full copies | 200% overhead | ~8 nines | Hot tier, highest performance |
| Cross-AZ replication | 3 copies in 3 AZs | 200% overhead | ~9 nines | Standard durability |
| Cross-region replication | 3+ copies across regions | 200%+ overhead | ~11 nines | Disaster recovery |
| Erasure coding (RS 6,3) | 9 fragments, any 6 recover | 50% overhead | ~9 nines | Cold storage cost efficiency |
| Erasure coding (RS 10,4) | 14 fragments, any 10 recover | 40% overhead | ~10 nines | Archive tier |
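The overhead column follows from the scheme parameters: n full copies cost n - 1 extra units of storage, while RS(k, m) costs (k + m) / k - 1. A quick check against the table:

```python
def replication_overhead(copies: int) -> float:
    """Extra storage beyond the data itself: 3 copies -> 2.0 (200%)."""
    return copies - 1.0


def erasure_overhead(k: int, m: int) -> float:
    """Extra storage for Reed-Solomon RS(k, m): (k + m) / k - 1."""
    return (k + m) / k - 1.0
```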
Erasure Coding Deep Dive:
For cold/archive data, full replication is too expensive. Erasure coding provides durability with less storage overhead:
import galois # Finite field math for Reed-Solomon
class ErasureCodingService:
"""
Implements Reed-Solomon erasure coding.
Splits data into k data fragments, generates m parity fragments.
Any k fragments can reconstruct the original data.
"""
def __init__(self, k=6, m=3):
"""
k=6, m=3 means:
- Split into 6 data fragments
- Generate 3 parity fragments
- Total 9 fragments stored across 9 nodes
- Any 6 fragments can reconstruct data
- Can tolerate loss of any 3 fragments
- Storage overhead: (6+3)/6 = 1.5x (50% overhead vs 200% for 3-copy)
"""
self.k = k
self.m = m
self.n = k + m
def encode(self, data: bytes) -> List[bytes]:
"""Split data into n fragments using Reed-Solomon encoding."""
# Pad data to be divisible by k
padded_data = self._pad_data(data)
# Split into k data fragments
fragment_size = len(padded_data) // self.k
data_fragments = [
padded_data[i*fragment_size:(i+1)*fragment_size]
for i in range(self.k)
]
# Generate m parity fragments using GF(2^8) math
parity_fragments = self._generate_parity(data_fragments)
return data_fragments + parity_fragments
def decode(self, fragments: List[bytes], available_indices: List[int]) -> bytes:
"""Reconstruct original data from any k fragments."""
if len(available_indices) < self.k:
raise InsufficientFragmentsError(
f"Need {self.k} fragments, only have {len(available_indices)}"
)
# Use first k available fragments
selected = list(zip(available_indices, fragments))[:self.k]
# Reconstruct using Galois field math
return self._reconstruct(selected)
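Production Reed-Solomon needs finite-field arithmetic (hence the `galois` dependency above), but the core idea of trading parity for copies can be shown with a single XOR parity fragment, which tolerates the loss of any one fragment. A runnable toy sketch:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_encode(data: bytes, k: int = 3) -> list:
    """Split data into k equal fragments plus one XOR parity fragment."""
    data += b"\x00" * ((-len(data)) % k)  # pad to a multiple of k
    size = len(data) // k
    frags = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = frags[0]
    for f in frags[1:]:
        parity = xor_bytes(parity, f)
    return frags + [parity]


def xor_recover(frags: list, lost_index: int) -> bytes:
    """Rebuild the fragment at lost_index by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(frags) if i != lost_index and f is not None]
    out = survivors[0]
    for f in survivors[1:]:
        out = xor_bytes(out, f)
    return out
```

This is RS(k, 1) in spirit: k data fragments, one parity fragment, any single loss recoverable. Reed-Solomon generalizes the same property to m parity fragments over GF(2^8).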
Fragment Placement:
Fragments must be placed to survive correlated failures:
| Failure Type | Mitigation |
|---|---|
| Disk failure | Fragments on different disks |
| Server failure | Fragments on different servers |
| Rack failure | Fragments in different racks |
| Data center failure | Fragments in different data centers |
| Region failure | Fragments in different geographic regions |
Durability Monitoring:
class DurabilityMonitor:
"""Continuously monitors and maintains data durability."""
async def scan_for_degraded_objects(self):
"""Find objects with fewer than required copies/fragments."""
for shard in all_shards():
async for object_key, fragment_status in shard.scan_objects():
available_count = sum(1 for f in fragment_status if f.is_available)
if available_count < self.min_fragments_for_recovery:
# CRITICAL: Object at risk
await self.alert_critical(object_key)
elif available_count < self.target_fragments:
# WARN: Need to re-replicate
await self.queue_repair(object_key)
async def repair_object(self, object_key: str):
"""Reconstruct missing fragments and restore target durability."""
metadata = await get_object_metadata(object_key)
# Read available fragments
available = await gather_available_fragments(object_key)
# Reconstruct original data
data = erasure_codec.decode(available)
# Generate all fragments fresh
fragments = erasure_codec.encode(data)
# Re-distribute to healthy nodes
await distribute_fragments(object_key, fragments)
Durability (data not lost) and availability (data accessible now) are different. An object in archive tier has high durability but low availability (access takes hours). Instagram optimizes each tier for its requirements—hot tier prioritizes availability; archive tier prioritizes durability per dollar.
No single storage node can hold 100 billion objects. Data must be sharded (partitioned) across thousands of nodes. The sharding strategy determines performance, scalability, and operational complexity.
Sharding Dimensions:
Consistent Hashing for Sharding:
Instagram uses consistent hashing to distribute objects across shards:
import hashlib
class ConsistentHashRing:
"""
Consistent hash ring for object sharding.
Minimizes data movement when adding/removing shards.
"""
def __init__(self, shards: List[str], virtual_nodes: int = 150):
"""
Initialize ring with shards.
Virtual nodes ensure even distribution.
"""
self.ring = {}
self.sorted_keys = []
for shard in shards:
for i in range(virtual_nodes):
# Create virtual node key
key = self._hash(f"{shard}:{i}")
self.ring[key] = shard
self.sorted_keys = sorted(self.ring.keys())
def get_shard(self, object_key: str) -> str:
"""Determine which shard stores this object."""
if not self.sorted_keys:
raise NoShardsError()
key_hash = self._hash(object_key)
# Find first shard with hash >= object hash
for shard_hash in self.sorted_keys:
if shard_hash >= key_hash:
return self.ring[shard_hash]
# Wrap around to first shard
return self.ring[self.sorted_keys[0]]
def _hash(self, key: str) -> int:
"""Hash function returning consistent integer."""
return int(hashlib.md5(key.encode()).hexdigest(), 16)
Shard Topology:
Instagram Storage Cluster
├── Region: US-West
│ ├── Zone: us-west-1a
│ │ ├── Shard-0001 ... Shard-0500
│ ├── Zone: us-west-1b
│ │ ├── Shard-0501 ... Shard-1000
│ └── Zone: us-west-1c
│ ├── Shard-1001 ... Shard-1500
├── Region: EU-West
│ └── [Similar structure]
└── Region: Asia-Pacific
└── [Similar structure]
Total shards: ~10,000+
Objects per shard: ~10 million
Avoiding Hot Shards:
Certain patterns create hot shards:
| Problem | Cause | Solution |
|---|---|---|
| Temporal clustering | Sequential IDs (upload_0001, upload_0002) | Random media_id generation |
| Celebrity uploads | All celebrity content on one shard | Hash by media_id, not user_id |
| Viral content | Single object gets millions of reads | CDN caching before storage |
| Large user | User with millions of photos | Spread across shards via hash |
Resharding Strategy:
When capacity is reached, more shards must be added:
async def expand_shard_capacity():
"""Add new shards to the cluster."""
# 1. Add new shards to the hash ring
new_shards = ['shard-1501', 'shard-1502', ...]
for shard in new_shards:
hash_ring.add_shard(shard)
# 2. In consistent hashing, only ~1/N data moves
# For 1500→1510 shards, ~0.6% of data migrates
# 3. Background migration
for old_shard in existing_shards:
async for object_key in old_shard.scan_objects():
new_shard = hash_ring.get_shard(object_key)
if new_shard != old_shard:
await migrate_object(object_key, old_shard, new_shard)
# 4. Update routing table atomically
await update_routing_table(hash_ring)
Without virtual nodes, consistent hashing can create uneven distribution (some shards get more data). Using 100-200 virtual nodes per physical shard creates near-uniform distribution. The trade-off is slightly more memory for the hash ring structure.
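The "only ~1/N of data moves" property is easy to verify empirically. The sketch below is a compact, bisect-based version of the ring above (same md5 hashing and virtual-node idea; shard names and counts are illustrative):

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Same md5-based hash as the ring class above."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def build_ring(shards, vnodes=150):
    """Map each shard to `vnodes` points on the ring."""
    ring = {_h(f"{s}:{i}"): s for s in shards for i in range(vnodes)}
    return sorted(ring), ring


def lookup(sorted_keys, ring, key):
    """First virtual node at or after the key's hash, wrapping around."""
    idx = bisect.bisect_left(sorted_keys, _h(key)) % len(sorted_keys)
    return ring[sorted_keys[idx]]


shards = [f"shard-{i:04d}" for i in range(50)]
keys = [f"media-{i}" for i in range(10_000)]

sk_old, ring_old = build_ring(shards)
before = {k: lookup(sk_old, ring_old, k) for k in keys}

# Add one shard; only keys claimed by the new shard's vnodes should move
sk_new, ring_new = build_ring(shards + ["shard-0050"])
after = {k: lookup(sk_new, ring_new, k) for k in keys}

moved = [k for k in keys if after[k] != before[k]]
fraction = len(moved) / len(keys)  # typically close to 1/51, about 2%
```

Every key that changes owner moves onto the newly added shard; the rest stay put, which is exactly why resharding at this scale is tractable.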
Instagram serves users worldwide and must survive regional disasters (data center fires, network outages, natural disasters). Cross-region distribution ensures both low latency and disaster recovery.
Multi-Region Architecture:
Replication Modes:
| Mode | Consistency | Latency | Use Case |
|---|---|---|---|
| Synchronous | Strong | High (cross-region RTT) | Critical data, metadata |
| Asynchronous | Eventual | Low (local ack) | Media objects (standard) |
| Semi-synchronous | Bounded staleness | Medium | Balanced approach |
Media uses asynchronous replication — writes are acknowledged after local durability, then replicated to other regions in the background:
async def write_object_with_replication(object_key: str, data: bytes, regions: List[str]):
"""Write object with cross-region replication."""
primary_region = get_primary_region(object_key)
secondary_regions = [r for r in regions if r != primary_region]
# Phase 1: Write to primary (synchronous)
await primary_storage[primary_region].put(object_key, data)
# Object is now durable in primary region
# Can acknowledge to client here
ack_to_client(object_key, status='written')
# Phase 2: Replicate to secondaries (asynchronous)
for region in secondary_regions:
asyncio.create_task(
replicate_to_region(object_key, data, region)
)
async def replicate_to_region(object_key: str, data: bytes, region: str):
"""Async replication to secondary region."""
try:
await secondary_storage[region].put(object_key, data)
await update_replication_status(object_key, region, 'replicated')
except ReplicationError as e:
# Queue for retry
await replication_queue.enqueue(object_key, region)
await alert_replication_lag(object_key, region)
Disaster Recovery Scenarios:
| Scenario | Impact | Recovery |
|---|---|---|
| Single disk failure | No impact | Automatic repair from redundant copies |
| Server failure | Brief latency increase | Requests route to other servers |
| Rack failure | Minor impact | Data spread across racks |
| AZ failure | ~33% capacity loss | Traffic shifts to other AZs |
| Region failure | Major event | Failover to secondary region |
| Multi-region failure | Catastrophic | Restore from globally distributed backups |
Regional Failover:
class RegionalFailover:
"""Manages failover between regions."""
async def detect_region_health(self, region: str) -> HealthStatus:
"""Check if region is healthy."""
checks = await asyncio.gather(
self.check_storage_health(region),
self.check_network_health(region),
self.check_compute_health(region),
)
return HealthStatus.aggregate(checks)
async def initiate_failover(self, failed_region: str, backup_region: str):
"""Failover traffic from failed region to backup."""
# 1. Update DNS to route to backup region
await dns_service.update_region_routing(
failed_region,
target=backup_region
)
# 2. Ensure backup region has capacity
await auto_scaler.scale_up(backup_region, factor=2.0)
# 3. Warm caches with likely-needed data
await cache_warmer.warm_region(backup_region, failed_region)
# 4. Monitor and alert
await alerting.send_failover_notification(failed_region, backup_region)
RTO and RPO:
RTO (Recovery Time Objective) is how long recovery takes; RPO (Recovery Point Objective) is how much recent data may be lost. Async replication means data written just before a failure might not have reached secondary regions. This 'replication lag' (typically seconds to minutes) is the effective RPO for media. Critical data (user accounts, financial) uses synchronous replication despite higher latency.
Data doesn't live forever. Users delete photos, accounts are terminated, and regulations require data removal. Managing the lifecycle of 100+ billion objects is an operational challenge.
Data Deletion Scenarios:
| Scenario | Trigger | Timeline | Complexity |
|---|---|---|---|
| User deletes post | User action | Immediate (soft) / 30 days (hard) | Low |
| Story expiration | 24-hour TTL | Automatic | Low |
| Account deletion | User request | 30-day grace, then permanent | High |
| Policy takedown | Policy violation | Immediate blocking, async deletion | Medium |
| GDPR/CCPA request | User right request | 30 days mandated | High |
| Copyright takedown | DMCA request | Prompt removal | Medium |
| Court order | Legal requirement | As specified | Variable |
class DeletionService:
"""Manages data deletion across all storage systems."""
# Grace period before permanent deletion
SOFT_DELETE_GRACE_DAYS = 30
async def delete_post(self, post_id: str, reason: DeletionReason):
"""
Delete a post and all associated media.
Uses soft delete with grace period for user deletions.
"""
# 1. Soft delete - hide from users immediately
await self.soft_delete(post_id)
# 2. Queue for hard deletion after grace period
if reason == DeletionReason.USER_INITIATED:
await deletion_queue.schedule(
post_id,
execution_time=time.now() + timedelta(days=self.SOFT_DELETE_GRACE_DAYS)
)
else:
# Policy violations: shorter grace or immediate
await deletion_queue.schedule(
post_id,
execution_time=time.now() + timedelta(hours=24)
)
async def soft_delete(self, post_id: str):
"""Mark as deleted but don't remove data yet."""
await db.update('posts', post_id, {
'is_deleted': True,
'deleted_at': time.now(),
'visible': False
})
# Remove from all caches
await cache.invalidate(f"post:{post_id}")
await cache.invalidate(f"media:{post_id}:*")
# Remove from CDN
await cdn.purge_urls(get_media_urls(post_id))
async def hard_delete(self, post_id: str):
"""Permanently remove all data."""
# 1. Delete all media variants from storage
media_keys = await get_media_keys(post_id)
for key in media_keys:
# Delete from all regions
for region in ALL_REGIONS:
await storage[region].delete(key)
# 2. Delete metadata
await db.delete('posts', post_id)
await db.delete('post_metadata', post_id)
# 3. Delete from search index
await search_index.delete(post_id)
# 4. Delete analytics data (may have separate retention)
await analytics.queue_deletion(post_id)
# 5. Log deletion for compliance
await audit_log.record_deletion(post_id)
Account Deletion Complexity:
Deleting an account requires removing data from dozens of systems:
Account deletion checklist:
□ Profile data (name, bio, settings)
□ All posts and stories
□ All media files (photos, videos)
□ Direct messages (both sides?)
□ Comments on others' posts
□ Likes and saves
□ Follower/following relationships
□ Notification history
□ Search history
□ Login sessions
□ Third-party app connections
□ Analytics and insights data
□ Ad targeting data
□ ML feature vectors
□ Recommendation embeddings
□ Payment information
□ Business account data (if applicable)
GDPR Right to Erasure:
European users can request complete data deletion under GDPR:
async def process_gdpr_erasure_request(user_id: str, request_id: str):
"""Process GDPR Article 17 erasure request."""
# Validate request authenticity
await verify_user_identity(user_id, request_id)
# Create deletion manifest
manifest = await generate_deletion_manifest(user_id)
# Execute deletion across all systems
deletion_tasks = []
for system, data_categories in manifest.items():
task = delete_from_system(system, user_id, data_categories)
deletion_tasks.append(task)
results = await asyncio.gather(*deletion_tasks, return_exceptions=True)
# Verify completion
failed = [r for r in results if isinstance(r, Exception)]
if failed:
await escalate_deletion_failures(user_id, failed)
# Generate compliance report
report = await generate_deletion_report(user_id, manifest, results)
# Retain deletion record for compliance (proving we deleted)
await compliance_log.record(request_id, report)
# Notify user
await notify_erasure_complete(user_id, request_id)
GDPR requires both deleting user data AND proving you deleted it. This creates a paradox: you must keep some record (the deletion audit log) to prove deletion occurred. The audit log contains minimal identifying information and is kept separately with strict access controls.
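One common way to resolve this paradox (a sketch of the general pattern, not Meta's actual scheme) is to log only a keyed hash of the user identifier, so the record proves a deletion occurred without retaining the identity in the clear. The key name and record fields below are illustrative:

```python
import hashlib
import hmac
import time

AUDIT_KEY = b"audit-log-secret"  # illustrative; real systems use managed key storage


def deletion_audit_record(request_id: str, user_id: str) -> dict:
    """Audit entry that proves a deletion happened without storing the raw user_id."""
    user_token = hmac.new(AUDIT_KEY, user_id.encode(), hashlib.sha256).hexdigest()
    return {
        "request_id": request_id,
        "user_token": user_token,  # keyed hash, not the identifier itself
        "deleted_at": int(time.time()),
        "action": "gdpr_erasure",
    }
```

Anyone holding the key can later verify that a specific user's deletion was logged, but the log itself never contains the identifier.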
At exabyte scale, every percentage of optimization saves millions of dollars annually. Instagram employs numerous techniques to minimize storage costs while maintaining performance and durability.
Cost Breakdown:
| Cost Component | Approximate % | Optimization Lever |
|---|---|---|
| Storage capacity | 60% | Tiering, compression, deduplication |
| Data transfer (egress) | 25% | CDN caching, regional routing |
| Operations (IOPS) | 10% | Caching, batch operations |
| Compute (processing) | 5% | Efficient codecs, batching |
Optimization Techniques:
Image Compression Optimization:
import cv2  # OpenCV, for edge detection
import numpy as np
class AdaptiveCompressor:
"""Optimize image compression based on content."""
def compress_image(self, image: Image) -> bytes:
"""
Compress image with content-aware quality selection.
"""
# Analyze image complexity
complexity = self.analyze_complexity(image)
# Simple images (flat colors, graphics) can be heavily compressed
# Complex images (textures, nature) need higher quality
if complexity < 0.3:
quality = 70 # Low complexity, high compression
elif complexity < 0.6:
quality = 80 # Medium complexity
else:
quality = 90 # High complexity, preserve detail
# Choose format based on content
if image.has_transparency:
return self.encode_webp(image, quality)
elif complexity < 0.5:
return self.encode_webp(image, quality) # WebP better for simple
else:
return self.encode_jpeg(image, quality) # JPEG competitive for complex
def analyze_complexity(self, image: Image) -> float:
"""Estimate image complexity 0-1 using edge detection."""
edges = cv2.Canny(image, 100, 200)
edge_density = np.mean(edges) / 255
return edge_density
Deduplication Strategy:
class PhotoDeduplicator:
"""Detect and deduplicate identical uploads."""
def __init__(self):
# Store perceptual hashes of all photos
self.hash_index = PerceptualHashIndex()
async def check_duplicate(self, image: bytes) -> Optional[str]:
"""
Check if this image is a duplicate of existing content.
Returns existing media_id if duplicate found.
"""
# Compute perceptual hash
phash = self.compute_perceptual_hash(image)
# Check for exact match
existing_id = await self.hash_index.lookup_exact(phash)
if existing_id:
return existing_id
# Check for near-duplicate (slight edits, different resolution)
near_matches = await self.hash_index.lookup_similar(phash, threshold=0.95)
if near_matches:
# Verify with content comparison
for match_id in near_matches:
if await self.verify_visual_match(image, match_id):
return match_id
return None
async def handle_duplicate_upload(self, new_upload_id: str, existing_id: str):
"""
Handle duplicate by referencing existing media.
"""
# Don't store new media - reference existing
await db.update('posts', new_upload_id, {
'media_id': existing_id,
'is_reference': True
})
# Increment reference count on original
await db.increment('media', existing_id, 'reference_count', 1)
# Log storage saved
await metrics.record('storage_saved_bytes', await get_media_size(existing_id))
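The perceptual hash itself can be sketched with an "average hash": downsample to a small grayscale grid, set one bit per pixel depending on whether it is above the grid's mean, and compare hashes by Hamming distance; visually similar images share most bits. The toy below takes a plain 2D list of pixel values to stay dependency-free (production systems use real image libraries and larger grids):

```python
def average_hash(pixels) -> int:
    """aHash: one bit per pixel, set if the pixel is >= the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Small pixel-level edits (recompression, minor brightness shifts) rarely flip a bit, which is what makes near-duplicate lookup with a threshold possible.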
| Optimization | Storage Savings | Estimated Annual Savings |
|---|---|---|
| Modern codecs (WebP/AVIF) | 30-40% | $20-30M |
| Tiered storage | 60-70% | $50-60M |
| Deduplication | 5-10% | $5-10M |
| Erasure coding vs replication | 50% (on cold) | $10-15M |
| Variant optimization | 10-15% | $5-10M |
At Meta's scale, building custom storage systems (Haystack, f4) is cost-effective. But for most companies, using cloud object storage (S3, GCS) is cheaper considering engineering cost. The break-even point is typically around 100PB+ where custom systems become economical.
We've explored how Instagram stores, replicates, and manages exabytes of media—perhaps the largest photo storage system in the world. Let's consolidate the key principles:
Module Complete: Instagram Photo Platform
You've now completed a comprehensive journey through Instagram's architecture:
These patterns—media processing pipelines, hybrid fanout, ML-powered ranking, tiered storage—are foundational to any large-scale content platform. The specific implementations vary, but the principles remain constant.
You now have a Principal Engineer-level understanding of Instagram's architecture. You can reason about trade-offs in media platforms, design content distribution systems, and architect storage for planetary scale. These patterns will serve you in any system design interview or real-world media platform challenge.