A database that performs beautifully at 1 million records may crumble at 100 million. A system that handles 1,000 writes per second might fail at 100,000. And a single-region deployment's simple architecture becomes untenable when users span the globe.
Scale is not just 'more of the same.' It introduces qualitative changes in how systems behave. Algorithms that were O(log n) effectively become O(n) when n exceeds memory. Network latencies that were negligible become dominant. Failure modes that were theoretical become daily realities.
Choosing a NoSQL database requires honest assessment of your scale requirements—not just current state, but realistic projections. The database that fits perfectly today may become an expensive migration project in two years if it can't scale with your growth.
By the end of this page, you will understand how to quantify your scale requirements across multiple dimensions (data volume, throughput, geographic distribution), how different NoSQL databases scale along each dimension, and how to anticipate scaling bottlenecks before they become production crises.
Scale isn't one-dimensional. Applications scale along multiple axes, each presenting different challenges and requiring different database capabilities.
Dimension 1: Data Volume
How much data will you store? This determines:
Dimension 2: Read Throughput
How many read operations per second? This determines:
Dimension 3: Write Throughput
How many write operations per second? This determines:
Dimension 4: Geographic Distribution
Where are your users? This determines:
| Dimension | Key-Value | Document | Wide-Column | Graph |
|---|---|---|---|---|
| Data Volume | Excellent (petabytes) | Good (terabytes) | Excellent (petabytes) | Moderate (limited by single-machine memory for traversals) |
| Read Throughput | Excellent (millions/sec) | Good (hundreds of thousands/sec) | Good (hundreds of thousands/sec) | Good (varies by query complexity) |
| Write Throughput | Excellent (millions/sec) | Good (hundreds of thousands/sec) | Excellent (millions/sec) | Moderate (relationship creation overhead) |
| Geographic Distribution | Excellent (simple partitioning) | Good (replica sets + sharding) | Excellent (multi-DC native) | Challenging (graph locality requirements) |
Most applications never need 'web-scale' databases. If your realistic 5-year projection is 10 million records with 1,000 requests/second, nearly any modern database will suffice. Don't over-engineer for hypothetical scale—but do choose a database with a clear scaling path if growth is genuinely anticipated.
Data volume is the most intuitive scale dimension: how much data will you store, and what happens as it grows?
Volume Thresholds and Their Implications:
| Volume | Infrastructure | Typical Challenges | Database Strategies |
|---|---|---|---|
| < 10 GB | Single node sufficient | Rarely limited by data volume | Any database works |
| 10-100 GB | Single node, SSD recommended | Index size, backup time | Optimize indexes, consider partitioning |
| 100 GB - 1 TB | High-memory nodes or sharding | Full scans become impractical, memory pressure | Sharding often becomes necessary |
| 1-10 TB | Distributed cluster | Backup/restore complexity, rebalancing | Native distributed databases essential |
| 10 TB - 1 PB | Large distributed clusters | Cost optimization, tiered storage | Purpose-built for scale (Cassandra, BigTable) |
| > 1 PB | Custom infrastructure often needed | Everything is hard at this scale | Specialized solutions, often custom |
Database Volume Capabilities:
Key-Value (Redis, DynamoDB)
Document (MongoDB)
Wide-Column (Cassandra, HBase)
Graph (Neo4j)
Calculating Your Volume Requirements:
// Estimation framework
const estimateStorage = {
  // Per-record size (average, including indexes)
  recordSizeBytes: 2 * 1024, // 2 KB average document

  // Growth projection
  initialRecords: 1_000_000,
  dailyNewRecords: 50_000,
  retentionDays: 365 * 3, // 3 years

  // Calculate
  totalRecords: function() {
    return this.initialRecords + (this.dailyNewRecords * this.retentionDays);
  },
  totalStorageGB: function() {
    return (this.totalRecords() * this.recordSizeBytes) / (1024 ** 3);
  },

  // With replication (3x for typical distributed setup)
  replicatedStorageGB: function() {
    return this.totalStorageGB() * 3;
  }
};

console.log(`3-year projection: ${estimateStorage.replicatedStorageGB().toFixed(0)} GB`);
// Output: 3-year projection: 319 GB (with 3x replication)
Storage isn't just data—it's data + indexes + replication + operational overhead (compaction, etc.). A 100 GB dataset with 5 indexes might consume 300 GB of storage. Factor in 2-3x overhead for realistic capacity planning. Additionally, many databases need indexes (and often the working set) to fit in RAM for good query performance, so memory requirements can exceed the raw data size.
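As a rough extension of the estimate above, you can apply an overhead multiplier up front. This is a sketch; the 2.5x factor is an assumption to replace with your own measured index and compaction overhead:
// Rough sketch: extend the volume estimate with index and operational overhead.
// The 2.5x overhead factor is an illustrative assumption, not a measured value.
const rawDataGB = 106;        // logical data size from the estimate above
const overheadFactor = 2.5;   // indexes + compaction/WAL overhead (assumed)
const replicationFactor = 3;  // typical distributed replication

const provisionedGB = rawDataGB * overheadFactor * replicationFactor;
console.log(`Provision roughly ${provisionedGB.toFixed(0)} GB of cluster storage`);
// Provision roughly 795 GB of cluster storage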
Throughput—the number of operations your database can handle per second—is often the binding constraint for growing applications. Unlike data volume (which grows predictably), throughput can spike unexpectedly with traffic patterns.
Understanding Throughput Metrics:
Throughput = Operations / Time
Read Throughput: Queries per second (QPS)
Write Throughput: Writes per second (WPS)
Key insight: Read and write throughput scale differently.
Most databases scale reads easily (replicas).
Writes are constrained by consistency/durability.
Throughput Scaling Strategies:
Read Scaling (Relatively Easy)
┌─────────────────┐
│ Application │
└────────┬────────┘
│
┌────────▼────────┐
│ Load Balancer │
└────────┬────────┘
┌──────────────┼──────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Replica 1 │ │ Replica 2 │ │ Replica 3 │
│ (reads) │ │ (reads) │ │ (reads) │
└───────────┘ └───────────┘ └───────────┘
10,000 QPS requirement ÷ 3 replicas ≈ 3,333 QPS per replica
Most databases support read replicas. Adding replicas linearly increases read throughput (with eventual consistency caveats).
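A quick way to turn this into a replica count: divide the target read QPS by what one node sustains at your latency target, with some headroom. The per-replica figure below is a placeholder for your own load-test result, not a vendor number:
// Back-of-the-envelope replica sizing. perReplicaQPS is whatever one node
// sustains at your latency target in a load test (assumed value here).
function replicasNeeded(targetReadQPS, perReplicaQPS, headroom = 0.7) {
  // Run each replica at ~70% of measured capacity to absorb spikes.
  return Math.ceil(targetReadQPS / (perReplicaQPS * headroom));
}

console.log(replicasNeeded(10_000, 5_000)); // 3 replicas for 10,000 QPS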
Write Scaling (Harder)
Writes must propagate to all replicas, so adding replicas does not increase write capacity. Scaling writes generally means sharding: partitioning the data so each node accepts writes only for its own subset of keys (see the sizing sketch after the table below).
| Database | Read QPS (Single Node) | Write TPS (Single Node) | Horizontal Scaling |
|---|---|---|---|
| Redis | 100,000+ | 100,000+ | Redis Cluster: millions/sec |
| DynamoDB | N/A (managed) | N/A (managed) | Unlimited with on-demand capacity |
| MongoDB | 10,000-50,000 | 5,000-20,000 | Linear with sharding |
| Cassandra | 10,000-30,000 | 10,000-50,000 | Linear with nodes |
| Neo4j | 10,000-100,000 (varies) | 1,000-10,000 | Limited (read replicas only) |
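As referenced above, write scaling is usually a sharding calculation. A minimal sketch, using the kind of per-node figures in the table (assumed, load-tested numbers rather than guarantees):
// Sketch: how many write-accepting shards a target write rate implies.
// perNodeWriteTPS is an assumed, load-tested figure (see the table above).
function shardsNeeded(targetWriteTPS, perNodeWriteTPS, headroom = 0.6) {
  return Math.ceil(targetWriteTPS / (perNodeWriteTPS * headroom));
}

console.log(shardsNeeded(50_000, 15_000)); // 6 shards for 50,000 writes/sec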
Calculating Throughput Requirements:
// Throughput estimation
const throughputCalc = {
  // Peak active users
  peakConcurrentUsers: 100_000,

  // Actions per user per minute
  readsPerUserPerMinute: 30,
  writesPerUserPerMinute: 5,

  // Calculate per second (with 2x safety margin)
  peakReadQPS: function() {
    return (this.peakConcurrentUsers * this.readsPerUserPerMinute / 60) * 2;
  },
  peakWriteTPS: function() {
    return (this.peakConcurrentUsers * this.writesPerUserPerMinute / 60) * 2;
  }
};

console.log(`Peak reads: ${Math.round(throughputCalc.peakReadQPS()).toLocaleString()} QPS`);
console.log(`Peak writes: ${Math.round(throughputCalc.peakWriteTPS()).toLocaleString()} TPS`);
// Output: Peak reads: 100,000 QPS
// Output: Peak writes: 16,667 TPS
High throughput often comes at the cost of latency. Under load, queue depths increase and response times rise. A database might handle 50,000 QPS at p99 latency of 5ms, but only 20,000 QPS at p99 latency of 2ms. Understand both your throughput AND latency requirements—they constrain each other.
For applications serving global users, geographic distribution becomes a critical scaling dimension. Physics imposes hard limits: light travels about 200 km per millisecond through fiber, and real routes are longer than straight lines. A round-trip from New York to Tokyo adds roughly 200ms of latency in practice.
The Geographic Challenge:
┌─────────────────────────────────────────┐
│ Atlantic Ocean: 80ms │
└─────────────────────────────────────────┘
🗽 New York 🗼 London
┌──────────┐ ┌──────────┐
│ User A │ ────── If single datacenter ──────── │ Database │
└──────────┘ in London: 160ms RTT └──────────┘
🏯 Tokyo
┌──────────┐
│ User B │ ────── To London: 280ms RTT ────────
└──────────┘ Unacceptable for interactive apps
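The fiber figure above gives a quick lower bound on round-trip time; real routes and routing equipment add more on top. A rough sketch, with approximate great-circle distances as the inputs:
// Lower-bound RTT from fiber distance (~200 km per millisecond, one way).
// Real paths are longer than great-circle distance and add routing delay.
function minRttMs(distanceKm) {
  return (2 * distanceKm) / 200; // out and back at ~200 km/ms
}

console.log(minRttMs(5_570));  // New York -> London: ~56 ms theoretical floor
console.log(minRttMs(10_850)); // New York -> Tokyo: ~109 ms theoretical floor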
Geographic Distribution Strategies:
1. Read Replicas per Region
Most common approach: writes go to primary region, reads served from local replicas.
US-EAST (Primary) EU-WEST (Replica) AP-NORTHEAST (Replica)
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Primary │ ──replicate─▶│ Replica │ ◀─replicate─│ Replica │
│ (R+W) │ │ (R only) │ │ (R only) │
└────────────┘ └────────────┘ └────────────┘
▲ ▲ ▲
│ │ │
US Users EU Users Asia Users
(fast R+W) (fast R, slow W) (fast R, slow W)
Trade-off: Writes from non-primary regions incur cross-region latency.
2. Multi-Primary (Active-Active)
All regions accept writes, with async replication between them.
US-EAST (Primary) EU-WEST (Primary) AP-NORTHEAST (Primary)
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Primary 1 │◀───sync───▶│ Primary 2 │◀───sync───▶│ Primary 3 │
│ (R+W) │ │ (R+W) │ │ (R+W) │
└────────────┘ └────────────┘ └────────────┘
All users get fast reads AND writes.
Trade-off: Conflict resolution required for concurrent writes.
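A common (if lossy) resolution strategy is last-write-wins, which DynamoDB Global Tables uses by default (see the table below). A minimal sketch of the idea:
// Minimal last-write-wins (LWW) merge: keep the version with the later
// timestamp and discard the other. Simple, but the losing write is silently lost.
function lwwMerge(versionA, versionB) {
  return versionA.updatedAt >= versionB.updatedAt ? versionA : versionB;
}

const fromUS = { value: 'shipped',   updatedAt: 1718000000500 };
const fromEU = { value: 'cancelled', updatedAt: 1718000000900 };
console.log(lwwMerge(fromUS, fromEU).value); // 'cancelled' wins; 'shipped' is dropped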
3. Data Locality (Shard by Region)
User data lives in their home region. Cross-region queries rare.
User in EU → Data stored in EU → Read/write in EU → No cross-region traffic
Trade-off: Users traveling see stale data or high latency.
Cross-regional features (EU user views US user profile) are slow.
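A sketch of how shard-by-region routing might look in application code; the homeRegion field and endpoint names are hypothetical:
// Sketch of shard-by-region routing. The homeRegion field and endpoint map
// are hypothetical; the point is that a user's requests are served entirely
// by their home region, and cross-region lookups are the explicit exception.
const regionEndpoints = {
  'us-east': 'db.us-east.internal',
  'eu-west': 'db.eu-west.internal',
  'ap-northeast': 'db.ap-northeast.internal',
};

function endpointForUser(user) {
  return regionEndpoints[user.homeRegion] ?? regionEndpoints['us-east']; // fallback region
}

console.log(endpointForUser({ id: 42, homeRegion: 'eu-west' })); // db.eu-west.internal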
| Database | Multi-Region Support | Global Strong Consistency | Typical Approach |
|---|---|---|---|
| Google Spanner | Native, first-class | Yes (TrueTime) | Globally distributed with synchronous replication |
| CockroachDB | Native multi-region | Yes (Raft) | Follow-the-workload, geo-partitioning |
| DynamoDB | Global Tables | Eventually consistent | Active-active with LWW conflict resolution |
| Cassandra | Multi-datacenter native | Tunable (LOCAL_QUORUM) | Multi-primary with configurable consistency |
| MongoDB | Zone sharding | With zone-aware reads | Primary region + read replicas |
True global strong consistency (Spanner-style) requires cross-region coordination for every write—adding 100-300ms to write latency. Most applications use region-local strong consistency with cross-region eventual consistency, accepting that a user in Tokyo may see a 2-second-old version of data written in London.
Before selecting a database for scale, understand the fundamental difference between vertical and horizontal scaling—and how different databases support each.
Vertical Scaling (Scale Up)
Add resources to a single node: more CPU, more RAM, more disk.
┌───────────────────┐ ┌─────────────────────────────┐
│ Small Server │ → │ Large Server │
│ 4 CPU, 16GB │ │ 64 CPU, 512GB │
│ 500 GB SSD │ │ 10 TB NVMe RAID │
└───────────────────┘ └─────────────────────────────┘
Advantages: Disadvantages:
- Simple - Hard limits (biggest available machine)
- No distributed logic - Single point of failure
- All data local - Expensive ($/GB increases)
- Strong consistency - Downtime for upgrades
Horizontal Scaling (Scale Out)
Add more nodes to a cluster, distributing data and load.
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 Node │ → │ Node 1 │ │ Node 2 │ │ Node 3 │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Advantages:                        Disadvantages:
- Theoretically unlimited capacity - Distributed system complexity
- Commodity hardware               - Network partitions possible
- No single point of failure       - Cross-node queries expensive
  (if replicated)                  - Consistency challenges
- Rolling upgrades                 - Operational complexity
The Scaling Decision Matrix:
| Your Situation | Recommended Approach |
|---|---|
| < 1 TB data, < 10K QPS | Vertical scaling (simpler) |
| > 1 TB data OR > 50K QPS | Horizontal scaling required |
| Unpredictable, bursty traffic | Managed services with auto-scaling |
| Steady, predictable growth | Self-managed with capacity planning |
| Multi-region latency requirements | Horizontally distributed from start |
| Strong consistency required | Vertical OR carefully designed horizontal |
Vertical scaling is simpler and often sufficient for longer than expected. A well-optimized single PostgreSQL instance handles remarkable workloads. Start with vertical scaling, but choose a database with a clear horizontal scaling path. Don't prematurely add distributed system complexity.
Beyond choosing a database, how you design your data model and access patterns profoundly impacts scaling behavior. Some patterns scale gracefully; others create bottlenecks.
Anti-Pattern 1: Hot Partitions
All traffic concentrates on one partition while others sit idle.
// Bad: Sequential IDs create time-based hot partitions
Partition Key: order_id (auto-increment)
99% of reads are for recent orders → all on latest partition
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Partition │ │ Partition │ │ Partition │ │ Partition │
│ 1-1M │ │ 1M-2M │ │ 2M-3M │ │ 3M-4M │
│ 0% │ │ 0% │ │ 1% │ │ 99% │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Hot partition!
// Better: Hash-based partitioning
Partition Key: SHA256(order_id)
Even distribution across partitions
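A small sketch of the hashed approach, using Node's built-in crypto module; the partition count is arbitrary for illustration:
// Sketch: hash order IDs to partitions so consecutive (recent) IDs spread
// across all partitions instead of piling onto the newest one.
const crypto = require('crypto');

function partitionFor(orderId, partitionCount) {
  const digest = crypto.createHash('sha256').update(String(orderId)).digest();
  return digest.readUInt32BE(0) % partitionCount; // first 4 bytes -> partition index
}

// Recent, sequential order IDs land on different partitions:
[1000001, 1000002, 1000003, 1000004].forEach((id) =>
  console.log(`order ${id} -> partition ${partitionFor(id, 4)}`)
);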
Anti-Pattern 2: Large Partitions
Unbounded partition growth creates operational problems.
-- Bad: All events for a user in one partition
PRIMARY KEY (user_id)
A power user with 10 million events → 10 GB partition
Slow queries, backup issues, rebalancing problems
-- Better: Bucket by time
PRIMARY KEY ((user_id, year_month), event_time)
Partitions bounded to ~1 month of data each
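A sketch of constructing the bucketed key in application code; monthly buckets are an assumed granularity, to be tuned against your database's partition-size guidance:
// Sketch: compose a (user_id, year_month) partition key so no single user's
// partition grows without bound. Monthly buckets are an assumed granularity.
function partitionKeyFor(userId, eventTime) {
  const d = new Date(eventTime);
  const yearMonth = `${d.getUTCFullYear()}-${String(d.getUTCMonth() + 1).padStart(2, '0')}`;
  return { userId, yearMonth };
}

console.log(partitionKeyFor('user-123', '2024-01-15T10:30:00Z'));
// { userId: 'user-123', yearMonth: '2024-01' }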
Anti-Pattern 3: Scatter-Gather Queries
Queries that touch all partitions don't scale.
-- Bad: Unfiltered aggregations
SELECT AVG(amount) FROM orders; -- Touches every partition
-- Better: Pre-computed rollups or time-bounded queries
SELECT avg_amount FROM daily_order_stats WHERE date = '2024-01-15';
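One way to get there is to maintain the rollup at write time. A minimal in-memory sketch of the idea (a real system would use the database's own counters or an aggregation job):
// Minimal sketch of a write-time rollup: each order updates a per-day summary,
// so the daily average becomes a single-partition point read instead of a scan.
const dailyStats = new Map(); // date -> { count, total }

function recordOrder(dateKey, amount) {
  const stats = dailyStats.get(dateKey) ?? { count: 0, total: 0 };
  stats.count += 1;
  stats.total += amount;
  dailyStats.set(dateKey, stats);
}

recordOrder('2024-01-15', 40);
recordOrder('2024-01-15', 60);
const day = dailyStats.get('2024-01-15');
console.log(`avg for 2024-01-15: ${(day.total / day.count).toFixed(2)}`); // 50.00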
Hot partitions and scatter-gather queries often aren't apparent at small scale. They emerge suddenly when traffic exceeds a threshold: 'Yesterday it worked fine, today the database is on fire.' Invest in load testing with realistic data distributions before production. Synthetic uniform data hides partition-related problems.
How do you translate business projections into database scaling requirements? Use this systematic framework.
Step 1: Define Your Time Horizon
Typically plan for 2-3 years. Beyond that, re-evaluation is prudent.
Step 2: Project Growth Across Dimensions
| Metric | Current | 1 Year | 2 Years | 3 Years |
|---|---|---|---|---|
| Active users | 10K | 50K | 200K | 500K |
| Data volume (GB) | 20 | 100 | 400 | 1,000 |
| Peak read QPS | 500 | 2,500 | 10,000 | 25,000 |
| Peak write TPS | 100 | 500 | 2,000 | 5,000 |
| Regions needed | 1 | 1 | 2 | 3 |
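A simple helper for producing projections like these is to compound an assumed annual growth rate; the rate below is an illustrative input, not a prediction:
// Sketch: compound an assumed annual growth rate out to the planning horizon.
// The growth rate is an illustrative input, not a prediction.
function projectGrowth(currentValue, annualGrowthRate, years) {
  return Array.from({ length: years + 1 }, (_, year) =>
    Math.round(currentValue * (1 + annualGrowthRate) ** year)
  );
}

console.log(projectGrowth(500, 2.0, 3)); // peak read QPS growing 200% per year
// [ 500, 1500, 4500, 13500 ]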
Step 3: Identify Scaling Ceiling for Current Choice
For each candidate database, determine where you'll hit limits:
MongoDB single replica set:
✓ 20 GB → easily handled
✓ 100 GB → comfortable
✓ 400 GB → approaching practical limits
✗ 1 TB → sharding required
MongoDB sharded cluster:
✓ 1 TB → managed with 3-shard cluster
✓ 10 TB → managed with larger cluster
Verdict: Need to implement sharding by Year 2.
Choose MongoDB with sharding-ready key design from Day 1.
Step 4: Factor in Operational Requirements
Step 5: Choose Based on Ceiling, Not Floor
Don't choose based on what works today. Choose based on what works at 10x scale. The database that barely handles your 3-year projection is wrong—unexpected growth happens.
For scale-uncertain applications, managed services (DynamoDB, MongoDB Atlas, Cosmos DB) provide elastic scaling without operational burden. You pay more per-unit, but you avoid the operational cost of managing infrastructure at scale. For startups with unpredictable growth, managed services often make the most economic sense.
Scale requirements are the fourth essential lens for database selection. A database might fit your data model, support your query patterns, and offer acceptable consistency—but if it can't scale to your projected needs, it's wrong for your use case.
The core insight: scale is multi-dimensional. Consider data volume, read throughput, write throughput, and geographic distribution separately. Different databases have different ceilings along each dimension.
What's next:
With all four lenses understood—data model fit, query patterns, consistency requirements, and scale requirements—the final page presents a comprehensive decision framework that synthesizes these considerations into a systematic database selection process. This framework transforms what seems like an overwhelming choice into a structured evaluation with clear criteria.
You now understand how to evaluate scale requirements as a critical database selection factor. The next step is projecting your application's growth across all scale dimensions and validating that candidate databases have adequate headroom. Choose based on where you'll be, not where you are.