If you ask any experienced scaling engineer what component gives them the most sleepless nights, the answer is nearly universal: the database. While application servers can be replicated trivially and caches can be flushed and rebuilt, the database holds state—irreplaceable, precious, authoritative state.
Database scaling is uniquely challenging because the database is stateful: you cannot simply add interchangeable copies the way you can with stateless application servers, every schema or topology change must happen under live traffic, and mistakes risk permanent data loss rather than a brief outage.
This page maps the complete database scaling journey—from single instance to globally distributed shards—with practical guidance on when to take each step and how to navigate the transition safely.
By the end of this page, you will understand the full spectrum of database scaling strategies, from simple optimizations to complex sharding architectures. You'll learn to recognize the warning signs that indicate when it's time to progress to the next level, and you'll have practical playbooks for each transition.
Before adding complexity, exhaust the capabilities of your single database. This stage is often underestimated—a well-optimized single database can handle surprising scale. Companies with millions of users sometimes still run on a single (powerful) database instance.
Query Optimization: The Highest-Leverage Work
The single most impactful optimization is query efficiency. A query that takes 100ms instead of 1ms does roughly 100x the work per request, which means the same database can sustain roughly 100x fewer requests per second. Before scaling horizontally, ensure every query is optimal.
The EXPLAIN Ritual:
Every slow query should be analyzed with EXPLAIN, or with EXPLAIN ANALYZE when you can afford to actually execute the query and see real timings (both PostgreSQL and MySQL 8+ support it). This reveals whether the planner uses an index or falls back to a sequential scan, how estimated row counts compare to actual ones, and where the execution time is really going.
```sql
-- BEFORE: Slow query (no index on user_id + created_at)
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 12345 AND created_at > '2024-01-01'
ORDER BY created_at DESC;

-- Result: Seq Scan on orders (cost=0.00..185432.00 rows=50000)
--   Filter: (user_id = 12345 AND created_at > '2024-01-01')
--   Rows Removed by Filter: 4950000
--   Planning Time: 0.5ms
--   Execution Time: 4523ms  -- TERRIBLE!

-- Solution: Add composite index
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);

-- AFTER: Same query with index
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 12345 AND created_at > '2024-01-01'
ORDER BY created_at DESC;

-- Result: Index Scan using idx_orders_user_created (cost=0.42..52.00 rows=50)
--   Index Cond: (user_id = 12345 AND created_at > '2024-01-01')
--   Planning Time: 0.2ms
--   Execution Time: 0.8ms  -- 5000x FASTER!
```

80% of database load typically comes from 20% of queries. Identify your top 10 slowest and most frequent queries. Optimizing these often yields as much benefit as all other optimizations combined. Don't boil the ocean—focus on the hot spots.
When single-database optimization reaches its limits—typically signaled by sustained high CPU, query latency creeping up, or connection pool exhaustion—read replicas become the first horizontal scaling step.
The Read/Write Split:
The core concept is simple: one primary database handles all writes, while one or more replicas handle reads. Since most applications are read-heavy (often 90%+ of queries), this can provide substantial relief.
Implementation Architecture:
Replication Modes:
Asynchronous Replication (default for most systems): the primary acknowledges the write immediately and ships changes to replicas in the background. Lowest write latency, but a small window of committed writes can be lost if the primary fails.
Synchronous Replication (for critical data): the primary waits for replica confirmation before acknowledging the write. No data loss on failover, at the cost of higher write latency and sensitivity to replica outages.
Semi-synchronous Replication (middle ground): the primary waits for at least one replica to acknowledge receiving (though not necessarily applying) the change, bounding potential data loss while keeping latency close to asynchronous.
| Strategy | Use Case | Latency | Disaster Recovery |
|---|---|---|---|
| Same Availability Zone | Maximum performance | < 1ms replication lag | No AZ-level protection |
| Cross-AZ (same region) | Balanced approach | 1-5ms replication lag | AZ failure protection |
| Cross-Region | Disaster recovery + geo reads | 50-200ms replication lag | Full regional protection |
| Multi-Region Active | Global user base | 50-200ms, varies by region | Maximum resilience, complex consistency |
Application-Level Routing:
The application must know which queries to send where. Common approaches:
Connection-based routing — Separate connection pools for primary and replicas. Explicit choice at query time.
ORM-level routing — Many ORMs support read/write splitting. Django's DATABASE_ROUTERS, Rails' ActiveRecord multi-database support.
Proxy-based routing — A proxy (ProxySQL, PgPool) intercepts queries and routes based on query type. Transparent to application.
Critical consideration: Write-then-read scenarios require careful handling. If a user updates their profile and immediately views it, the read might hit a replica that hasn't received the write yet. Common solutions: route a session's reads to the primary for a short window after it writes (read-your-writes), pin critical read paths to the primary permanently, or wait for replicas to reach a known replication position before serving the read.
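One way to implement the read-your-writes solution at the connection-pool level, as a minimal sketch using node-postgres (the pool names, environment variables, and the 2-second pin window are illustrative assumptions, not prescriptions):

```typescript
import { Pool } from "pg"; // any client with an async query() method would work the same way

const primaryPool = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replicaPool = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

// Remember when each session last wrote, so its next reads can be pinned to the primary.
const lastWriteAt = new Map<string, number>();
const PIN_WINDOW_MS = 2000; // assumed upper bound on replication lag; tune from real lag metrics

export async function write(sessionId: string, sql: string, params: unknown[]) {
  const result = await primaryPool.query(sql, params);
  lastWriteAt.set(sessionId, Date.now());
  return result;
}

export async function read(sessionId: string, sql: string, params: unknown[]) {
  const recentlyWrote = Date.now() - (lastWriteAt.get(sessionId) ?? 0) < PIN_WINDOW_MS;
  // Read-your-writes: recent writers go to the primary, everyone else to a replica.
  const pool = recentlyWrote ? primaryPool : replicaPool;
  return pool.query(sql, params);
}
```

On a multi-server deployment the last-write map would need to live in a shared store (or a signed cookie carrying the write timestamp), and the pin window should be driven by measured replication lag rather than a constant.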
Each replica increases storage costs (full data copy), increases primary CPU (replication overhead), and adds operational complexity (monitoring, failover procedures). Don't add replicas preemptively. Add them when CPU or query latency indicates the primary is genuinely constrained—and ensure query optimization is already done.
Before diving into horizontal sharding—the most complex database scaling pattern—consider vertical partitioning: splitting the database by function rather than by row.
The insight:
Not all data needs to live in the same database. User profiles, order history, analytics, and product catalog serve different purposes with different access patterns. Separating them provides: independent scaling and hardware tuning per workload, isolation (a runaway analytics query can no longer slow down checkout), and smaller, faster backups and migrations for each database.
Before: Single Database
All tables in one database: users, orders, payments, analytics_events, notifications, and audit_logs share one instance, one connection pool, and one set of hardware.
A heavy analytics query on analytics_events can starve the connection pool, affecting the checkout flow.
After: Vertical Partitioning
Core Transaction DB: users, orders, payments (latency-sensitive, strongly consistent)
Analytics DB: analytics_events (write-heavy, scanned in bulk, tolerant of lag)
Notification DB: notifications and delivery status (high write volume, short retention)
Audit DB: audit_logs (append-only, rarely read, long retention)
Each DB scales independently.
Implementation Approach:
Step 1: Identify natural boundaries. Look for tables that are rarely joined with core transactional tables, belong to a distinct functional domain, and have noticeably different access patterns, growth rates, or retention needs.
Step 2: Establish clear APIs Once data is in separate databases, you can't JOIN across them. Services must expose APIs for cross-data-store queries. This is the beginning of service-oriented architecture.
Step 3: Migrate incrementally. Move one functional area at a time, starting with the lowest-risk candidate (analytics or logging is typical), dual-write during the transition, and verify before cutting over.
The join problem:
The most significant challenge of vertical partitioning is losing the ability to JOIN. A query like "show orders with product details" that was previously a simple JOIN now requires two separate queries, one against each data store (or a service call), followed by a join performed in application code, as sketched below.
This is more complex and often slower. Caching and denormalization become important tools to mitigate this cost.
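A rough sketch of that application-side join, assuming hypothetical ordersDb and catalogApi clients (the names and fields are illustrative, not from the original text):

```typescript
interface Order { id: string; productId: string; quantity: number; }
interface Product { id: string; name: string; price: number; }

async function getOrdersWithProducts(
  userId: string,
  ordersDb: { query<T>(sql: string, params: unknown[]): Promise<T[]> },
  catalogApi: { getProducts(ids: string[]): Promise<Product[]> },
) {
  // Step 1: fetch orders from the core transaction database.
  const orders = await ordersDb.query<Order>(
    "SELECT id, product_id AS \"productId\", quantity FROM orders WHERE user_id = $1",
    [userId],
  );

  // Step 2: batch-fetch product details from the catalog service (one call, not N).
  const productIds = [...new Set(orders.map(o => o.productId))];
  const products = await catalogApi.getProducts(productIds);
  const byId = new Map(products.map(p => [p.id, p]));

  // Step 3: join in application code, where a single SQL statement used to do it.
  return orders.map(o => ({ ...o, product: byId.get(o.productId) }));
}
```

Batching the second lookup and caching hot product records are the usual ways to keep this pattern from turning one query into dozens.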
Vertical partitioning is ideal when you have clear functional boundaries with minimal cross-boundary queries. Analytics, logging, and notifications are almost always good candidates—they're write-heavy, rarely joined with core data, and have different performance characteristics. Start here before considering horizontal sharding.
When vertical partitioning and replicas have been exhausted, and when a single table exceeds what can be handled by the beefiest available hardware, horizontal sharding becomes necessary. This is the most complex database scaling pattern—and the most powerful.
The concept:
Horizontal sharding splits a single table across multiple databases based on a shard key. The shard key determines which database stores a given row. All rows for user_id=12345 might live on shard_3, while user_id=67890 lives on shard_7.
Critical decisions: the shard key and the sharding strategy. Both are extremely difficult to change after data has been distributed, so they deserve more scrutiny than almost any other design decision in the system.
Sharding Strategies:
Range Sharding: each shard owns a contiguous range of key values (e.g., user IDs 1-1,000,000 on shard 1).
Hash Sharding: a hash of the key, taken modulo the number of shards, determines placement.
Consistent Hashing: shards and keys are placed on a hash ring, so adding or removing a shard moves only a fraction of the keys.
Directory-Based Sharding: a lookup service explicitly maps each key (or key range) to its shard.
| Strategy | Distribution | Range Queries | Adding Shards | Complexity |
|---|---|---|---|---|
| Range | Poor (hot spots) | Excellent | Simple | Low |
| Hash | Excellent | Poor (fan-out) | Expensive (rehash) | Medium |
| Consistent Hash | Good | Poor (fan-out) | Minimal data movement | High |
| Directory | Configurable | Depends on design | Flexible | High |
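The "Minimal data movement" entry for consistent hashing comes from placing shards on a hash ring instead of computing hash % N. A simplified sketch follows; the hash32 helper is a stand-in (real systems use MurmurHash or xxHash and many more virtual nodes):

```typescript
class ConsistentHashRing {
  // Sorted ring positions and the shard that owns each position.
  private ring: { position: number; shardId: string }[] = [];
  private readonly virtualNodes = 100; // virtual nodes smooth out the distribution

  addShard(shardId: string) {
    for (let v = 0; v < this.virtualNodes; v++) {
      this.ring.push({ position: hash32(`${shardId}#${v}`), shardId });
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  // A key is owned by the first ring position at or after its hash (wrapping around).
  getShard(key: string): string {
    if (this.ring.length === 0) throw new Error("no shards registered");
    const h = hash32(key);
    const owner = this.ring.find(node => node.position >= h) ?? this.ring[0];
    return owner.shardId;
  }
}

// Stand-in 32-bit string hash; production systems would use MurmurHash or xxHash.
function hash32(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = Math.imul(31, h) + key.charCodeAt(i);
  }
  return h >>> 0; // force unsigned 32-bit
}
```

Adding a shard claims only the ring positions of its own virtual nodes, so only the keys that hash into those arcs move; with hash % N, changing N reshuffles nearly every key.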
Once data is sharded, any query that doesn't include the shard key must hit all shards. SELECT * FROM orders WHERE created_at > yesterday becomes N queries (one per shard), results combined in application. Performance degrades as shards increase. Design your shard key based on query patterns, not data structure.
Implementing sharding requires careful orchestration across application code, data migration, and operational procedures. This section provides a practical implementation roadmap.
Application Layer Changes:
The application must become shard-aware. Every database query must route to the correct shard.
Pattern 1: Application-level routing The application directly calculates which shard to query and maintains connections to all shards.
```typescript
// Minimal stand-in for whatever database client library the application uses.
declare class DatabaseConnection {
  constructor(connectionString: string);
  query<T>(sql: string, params: any[]): Promise<T>;
}

// Shard-aware database access layer
interface ShardConfig {
  shardId: number;
  connectionString: string;
}

class ShardRouter {
  private shards: Map<number, DatabaseConnection>;
  private totalShards: number;

  constructor(shardConfigs: ShardConfig[]) {
    this.totalShards = shardConfigs.length;
    this.shards = new Map();
    for (const config of shardConfigs) {
      this.shards.set(config.shardId, new DatabaseConnection(config.connectionString));
    }
  }

  // Map a user ID to a shard via modulo hashing (simple, but adding shards later
  // means rehashing almost every key; contrast with the consistent hashing sketch earlier)
  getShardForUser(userId: string): number {
    const hash = this.hashFunction(userId);
    return hash % this.totalShards;
  }

  // Get connection for a specific shard
  getShard(shardId: number): DatabaseConnection {
    const shard = this.shards.get(shardId);
    if (!shard) throw new Error(`Invalid shard: ${shardId}`);
    return shard;
  }

  // Execute query on the correct shard
  async queryByUser<T>(userId: string, sql: string, params: any[]): Promise<T> {
    const shardId = this.getShardForUser(userId);
    const connection = this.getShard(shardId);
    return connection.query<T>(sql, params);
  }

  // Execute query across ALL shards (expensive - avoid when possible)
  async queryAllShards<T>(sql: string, params: any[]): Promise<T[]> {
    const promises = Array.from(this.shards.values()).map(
      shard => shard.query<T[]>(sql, params)
    );
    const results = await Promise.all(promises);
    return results.flat(); // merge the per-shard result sets
  }

  private hashFunction(key: string): number {
    // Simple deterministic string hash; production systems use MurmurHash, xxHash, etc.
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = ((hash << 5) - hash) + key.charCodeAt(i);
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }
}
```

Pattern 2: Proxy-based routing
A proxy (like Vitess, Citus, or ProxySQL) sits between the application and the databases, parsing queries and routing them automatically. Advantages: the application stays largely shard-unaware, routing logic lives in one place instead of in every service, and the proxy can also handle connection pooling and failover. The cost is an extra network hop and one more critical component to operate.
Data Migration Strategy:
Migrating to a sharded architecture requires careful choreography:
Phase 1: Dual-write. Every write goes to both the existing database and the new sharded cluster; the old database remains the source of truth (a sketch follows this list).
Phase 2: Backfill. Copy historical data into the shards, reconciling rows that changed while the backfill ran.
Phase 3: Shadow reads. Continue serving reads from the old database while issuing the same reads against the shards and comparing results to catch routing or data bugs.
Phase 4: Cut over. Switch reads, then writes, to the sharded cluster, ideally one cohort of traffic at a time with a tested rollback path.
Phase 5: Cleanup. Remove the dual-write code paths and decommission the old database once confidence is established.
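A minimal sketch of the dual-write phase, assuming the ShardRouter from the earlier example plus hypothetical legacyDb and metrics helpers (none of these names come from the original text):

```typescript
// Assumed to exist elsewhere: the legacy database client, the ShardRouter from the
// earlier example, and a simple metrics helper.
declare const legacyDb: { query(sql: string, params: unknown[]): Promise<unknown> };
declare const shardRouter: {
  queryByUser<T>(userId: string, sql: string, params: unknown[]): Promise<T>;
};
declare const metrics: { increment(name: string): void };

async function createOrder(order: { id: string; userId: string; total: number }) {
  const sql = "INSERT INTO orders (id, user_id, total) VALUES ($1, $2, $3)";
  const params = [order.id, order.userId, order.total];

  // The legacy single database remains the source of truth during dual-write.
  await legacyDb.query(sql, params);

  // Best-effort write to the sharded cluster: failures are counted and repaired
  // later by the backfill/reconciliation job, never surfaced to the user.
  try {
    await shardRouter.queryByUser(order.userId, sql, params);
  } catch (err) {
    metrics.increment("dual_write.shard_failure");
  }
}
```

Keeping the new path best-effort is what makes the phase safe: a problem in the sharded cluster degrades reconciliation, not production writes.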
Sharding isn't free. You pay with: no cross-shard transactions, no cross-shard JOINs, complex migrations, uneven shard sizes over time, and significantly increased operational burden. Accept this cost only when the alternatives are exhausted. Many successful companies defer sharding for years; Instagram, for example, stretched PostgreSQL a remarkably long way with aggressive optimization and intelligent archiving before investing in sharding.
Shards don't stay balanced. Over time, some shards grow larger than others due to skewed key distributions, a handful of disproportionately large or active tenants, and uneven growth across customer cohorts.
Monitoring shard health:
Track these metrics per shard: data size, queries per second, p99 query latency, connection count relative to the maximum, and replication lag. Comparing each shard against the fleet average surfaces hot spots early.
Rebalancing strategies:
Manual shard splitting: Identify a hot shard and split it in two. This is operationally complex: provision the new shard, copy roughly half the data while dual-writing ongoing changes, switch the routing configuration atomically, verify, and only then delete the moved rows from the original shard.
Automatic rebalancing (e.g., Vitess): Some sharding systems automatically detect imbalance and redistribute data. This is operationally simpler but requires sophisticated infrastructure.
Shard-nothing architecture: Rather than splitting shards, add more shards of smaller size. Works with consistent hashing where adding a shard naturally redistributes some data.
| Factor | Recommendation | Rationale |
|---|---|---|
| Max shard size | 500GB - 1TB | Larger shards are harder to backup, restore, and migrate |
| Initial shard count | At least 2x the expected eventual need | Adding shards is hard; start with headroom |
| Data per shard | Balanced to within ~20% of the fleet average | Unbalanced shards create hot spots |
| Connections per shard | < 80% of max | Leave headroom for spikes |
| Query latency | p99 < 100ms | Investigate if significantly higher than others |
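A sketch of how those thresholds might be checked programmatically, assuming a hypothetical ShardMetrics record produced by your monitoring pipeline:

```typescript
interface ShardMetrics {
  shardId: string;
  dataBytes: number;
  connectionUtilization: number; // 0..1: current connections / max connections
  p99LatencyMs: number;
}

// Thresholds mirroring the guidance in the table above.
const MAX_SHARD_BYTES = 1_000_000_000_000; // ~1 TB ceiling
const MAX_CONN_UTILIZATION = 0.8;
const MAX_P99_MS = 100;

function findUnhealthyShards(shards: ShardMetrics[]): string[] {
  const avgBytes = shards.reduce((sum, s) => sum + s.dataBytes, 0) / shards.length;
  return shards
    .filter(s =>
      s.dataBytes > MAX_SHARD_BYTES ||
      s.connectionUtilization > MAX_CONN_UTILIZATION ||
      s.p99LatencyMs > MAX_P99_MS ||
      // Flag shards whose size drifts more than 20% from the fleet average.
      Math.abs(s.dataBytes - avgBytes) / avgBytes > 0.2
    )
    .map(s => s.shardId);
}
```

Whatever tooling you use, the important part is that these checks run continuously and alert well before a shard hits the 70% crisis threshold described later.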
The shard key immutability challenge:
Once you've sharded by user_id, data for each user is locked to its shard. But what if you later need to query by product_id? You have several options, all with trade-offs:
1. Fan-out queries: Query all shards, aggregate results. Works for infrequent queries, but doesn't scale for common access patterns.
2. Denormalized copies: Maintain a separate table sharded by the secondary key. Requires dual-writes and eventual consistency.
3. Global indexes: A separate, small database maintains mappings from secondary keys to shards. Adds a lookup hop but enables efficient routing.
4. Change data capture (CDC): Stream changes to a secondary system (Elasticsearch, data warehouse) optimized for different query patterns.
The best approach depends on query frequency, latency requirements, and consistency needs. Often, a combination is used: primary queries on sharded database, secondary queries on specialized systems.
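As an illustration of option 3, a global index keeps a small, unsharded mapping from the secondary key to the shards that contain matching rows. In this sketch, globalIndexDb, product_shard_index, and getShard are illustrative names, not part of the original text:

```typescript
// Assumed small, unsharded database holding (product_id -> shard_id) mappings,
// plus the ShardRouter from the earlier example.
declare const globalIndexDb: {
  query(sql: string, params: unknown[]): Promise<{ shard_id: number }[]>;
};
declare const shardRouter: {
  getShard(shardId: number): { query<T>(sql: string, params: unknown[]): Promise<T[]> };
};

// Orders are sharded by user_id, but this query arrives by product_id.
async function getOrdersForProduct(productId: string) {
  // 1. One cheap lookup tells us which shards actually contain this product.
  const rows = await globalIndexDb.query(
    "SELECT DISTINCT shard_id FROM product_shard_index WHERE product_id = $1",
    [productId],
  );

  // 2. Query only those shards instead of fanning out to all of them.
  const perShard = await Promise.all(
    rows.map(r =>
      shardRouter.getShard(r.shard_id).query(
        "SELECT * FROM orders WHERE product_id = $1",
        [productId],
      ),
    ),
  );
  return perShard.flat();
}
```

The index itself must be updated on every write (or asynchronously via CDC), so reads through it are only as fresh as that update path.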
Before choosing a shard key, list every access pattern in your application. If you have 20 query patterns and only 5 can efficiently use your shard key, you need a secondary strategy for the other 15. This analysis should happen before sharding, not after—changing shard keys is extremely expensive.
Learning from failures is as valuable as studying successes. These anti-patterns have caused countless outages and migrations:
Anti-Pattern 1: Premature Sharding
Symptom: "We're going to be big, so let's shard from day one."
Reality: Sharding adds complexity that slows development velocity when you most need speed (early product development). The bottleneck at this stage is rarely database capacity—it's finding product-market fit.
Better approach: Use a single, well-optimized database until actual metrics indicate capacity strain.
Anti-Pattern 2: Wrong Shard Key
Symptom: Chose shard key based on data model, not query patterns.
Reality: A beautifully normalized shard key that doesn't appear in your common queries means every query fans out to all shards. Performance is worse than before sharding.
Better approach: Analyze actual query logs. Shard by the field that appears most frequently in WHERE clauses.
Anti-Pattern 3: Ignoring Replication Lag
Symptom: Read replicas seem to work in testing, bugs appear in production.
Reality: Replication lag is milliseconds in testing but can be seconds under production load. Users see stale data, which sometimes leads to serious integrity bugs (e.g., duplicate transactions).
Better approach: Design for replication lag from the start. Use read-your-writes consistency for critical paths. Monitor lag as a key metric.
A common failure mode: a shard fills up (disk, connections, or throughput) faster than expected, and there's no capacity to split it. Splitting requires temporary capacity to hold two copies. If the shard is at 100% of everything, you're in crisis mode. Always maintain headroom—never let shards exceed 70% of capacity on any dimension.
The database landscape has evolved significantly. Modern options reduce the operational burden of scaling:
Managed Services:
Amazon Aurora — MySQL/PostgreSQL compatible with up to 15 read replicas, automatic storage scaling to 128TB, and automatic failover. Significantly reduces operational burden while maintaining familiar interfaces.
Google Cloud Spanner — Globally distributed, strongly consistent relational database. Handles sharding transparently. Expensive but eliminates most scaling complexity.
CockroachDB — PostgreSQL-compatible distributed SQL. Handles sharding, replication, and rebalancing automatically. Can be self-hosted or managed.
PlanetScale — MySQL-compatible serverless database based on Vitess. Horizontal scaling with schema change workflow built in.
| Database | Scaling Model | Consistency | Best For | Consideration |
|---|---|---|---|---|
| Aurora | Managed read replicas | Strong (single write point) | Lift-and-shift PostgreSQL/MySQL | Still single-writer limited |
| Spanner | Auto-sharded, global | Strong (TrueTime) | Global apps needing consistency | High cost, GCP lock-in |
| CockroachDB | Auto-sharded | Serializable | Distributed SQL without NoSQL trade-offs | Newer, smaller ecosystem |
| PlanetScale | Vitess-based sharding | Eventual (by design) | High-scale MySQL workloads | Schema changes need workflow |
| TiDB | Auto-sharded, MySQL compatible | Snapshot isolation | MySQL replacement at scale | Complex operational model |
When to consider distributed databases: write volume genuinely exceeds what a single primary can sustain, you need strong consistency across regions, or you would otherwise end up building and operating sharding infrastructure yourself.
When traditional databases + smart architecture win: the working set fits on modern hardware, traffic is read-heavy and cacheable, and the team is small enough that operational simplicity matters more than theoretical headroom.
The best database is the one your team can operate reliably. A perfectly scaled complex system that your team doesn't understand is worse than a simpler system with occasional growing pains.
Despite the proliferation of distributed databases, PostgreSQL remains remarkably capable. With proper optimization, connection pooling (PgBouncer), partitioning (native table partitioning), and read replicas, PostgreSQL can handle millions of users. Companies like Notion, Figma, and Instagram scaled PostgreSQL to remarkable levels, sharding it only when truly necessary rather than switching to something exotic. Don't underestimate the boring choice.
Database scaling is the most consequential and challenging aspect of system scaling. Let's consolidate the journey: squeeze the single instance first (query optimization, indexes, hardware), add read replicas when read traffic dominates, partition vertically along functional boundaries, and reach for horizontal sharding only when a single table outgrows the largest available machine, or hand that complexity to a managed distributed database.
What's next:
With database scaling understood, we turn to the most effective latency-reduction pattern: caching layer introduction. Caching can reduce database load by 90%+ for read-heavy workloads, often deferring the need for database scaling entirely. The next page explores caching strategies, cache invalidation challenges, and the art of building effective cache hierarchies.
You now understand the complete database scaling journey—from single-instance optimization through read replicas, vertical partitioning, and horizontal sharding. You can recognize when each transition is appropriate and understand the trade-offs involved. This knowledge is foundational for any scaling engineer.