You've estimated that your system needs to handle 100,000 requests per second. Now comes the critical question: How many servers do you need?
This isn't just a math problem. It's an engineering problem that involves understanding CPU capabilities, memory requirements, I/O constraints, and the complex interplay between software architecture and hardware capacity. A naive calculation might suggest 10 servers. A sophisticated analysis might reveal you need 50—or just 5—depending on your bottleneck.
Server estimation is where system design becomes infrastructure planning. Get it wrong in one direction, and you waste millions on idle machines. Get it wrong in the other, and your system crashes under real load. The goal is right-sizing—enough capacity for peak load with headroom for failures, but not so much that you're burning money.
By the end of this page, you will be able to: (1) Calculate server count from RPS and per-server capacity, (2) Identify CPU-bound vs I/O-bound workloads, (3) Size memory requirements for caching and working sets, (4) Apply appropriate headroom and redundancy factors, (5) Choose between scaling up and scaling out.
Server estimation starts with a fundamental equation:
The Core Formula:
Servers Needed = Peak RPS / RPS per Server × Safety Factor
But what determines "RPS per Server"? This is where engineering judgement meets measurement.
Factors Affecting Per-Server Capacity:
Request Complexity: A simple health check might handle 100K RPS. A complex database query might handle 100 RPS.
Latency Requirements: Targeting 99th percentile < 50ms limits concurrency differently than < 500ms.
Resource Bottleneck: Is the workload CPU-bound, memory-bound, I/O-bound, or network-bound?
Software Stack: Go or Rust services can often sustain roughly 10x the connections of equivalent Python or Ruby services on the same CPU.
Concurrency Model: Event-driven (Node.js) vs thread-per-request (traditional Java).
| Workload Type | RPS/Server | Bottleneck | Example |
|---|---|---|---|
| Static content serving | 10,000-50,000 | Network/disk I/O | Nginx serving files |
| Cached API responses | 5,000-20,000 | Network/memory | Redis-backed API |
| Simple CRUD operations | 1,000-5,000 | Database I/O | REST API with PostgreSQL |
| Complex business logic | 500-2,000 | CPU | Order processing, validations |
| ML inference | 100-500 | CPU/GPU | Recommendation serving |
| Image processing | 50-200 | CPU | Thumbnail generation |
| Video transcoding | 0.5-5 | CPU intensive | Real-time transcoding |
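The core formula above can be sketched directly. The workload numbers here are hypothetical; for real estimates, use measured per-server figures like those in the table:

```python
import math

def servers_needed(peak_rps: float, rps_per_server: float, safety_factor: float = 1.5) -> int:
    """Core formula: Servers = Peak RPS / RPS per Server x Safety Factor, rounded up."""
    return math.ceil(peak_rps / rps_per_server * safety_factor)

# Hypothetical workload: 100,000 RPS peak, 2,000 RPS per server, 1.5x safety factor
print(servers_needed(100_000, 2_000))       # 75
print(servers_needed(100_000, 2_000, 1.0))  # 50 (no safety margin)
```

Rounding up matters: 50.1 servers means 51 servers, not 50.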
Initial estimates are often off by 10x. Always measure actual per-server capacity with realistic traffic patterns before finalizing infrastructure. A "simple" API that touches 5 microservices and 3 databases performs very differently than a local benchmark.
CPU-bound workloads are limited by processing power. Computation dominates request handling time.
Identifying CPU-Bound Work: CPU utilization approaches 100% under load while memory, disk, and network sit mostly idle, and adding (or speeding up) cores raises throughput almost linearly.
CPU Capacity Calculation:
CPU Seconds Needed = Requests × CPU Time per Request
Servers Needed = CPU Seconds Needed / Available CPU Seconds per Server
```python
# CPU-bound server sizing example: API with business logic

# Traffic requirements
peak_rps = 10_000

# Request characteristics (measured via profiling)
avg_cpu_ms_per_request = 20  # 20ms CPU time per request

# Server specifications
cores_per_server = 16
# CPU efficiency: not all time is usable (context switching, GC, OS overhead)
usable_cpu_efficiency = 0.70  # 70% of CPU is actually usable

# Calculate capacity per server
usable_cores = cores_per_server * usable_cpu_efficiency
cpu_ms_per_second_per_server = usable_cores * 1000  # ms available per second
requests_per_server = cpu_ms_per_second_per_server / avg_cpu_ms_per_request

print(f"Per-server capacity:")
print(f"  Physical cores: {cores_per_server}")
print(f"  Usable cores (70%): {usable_cores}")
print(f"  CPU ms available/sec: {cpu_ms_per_second_per_server:,}")
print(f"  Max RPS/server: {requests_per_server:,.0f}")

# Calculate servers needed
servers_for_peak = peak_rps / requests_per_server
print(f"Servers for peak load ({peak_rps:,} RPS):")
print(f"  Minimum: {servers_for_peak:.1f}")

# Apply safety factors
n_plus_one_factor = 1.10   # Handle 1 server failure
headroom_factor = 1.30     # 30% headroom for spikes
deployment_factor = 1.10   # Rolling deployment overhead

total_servers = servers_for_peak * n_plus_one_factor * headroom_factor * deployment_factor
print(f"  With N+1 redundancy: {servers_for_peak * n_plus_one_factor:.1f}")
print(f"  With 30% headroom: {servers_for_peak * n_plus_one_factor * headroom_factor:.1f}")
print(f"  With deployment overhead: {total_servers:.0f}")

print(f"Final recommendation: {round(total_servers / 5) * 5} servers")  # Round to nearest 5
```

CPU Architecture Considerations:
Hyperthreading: 16 physical cores with HT = 32 logical cores, but only ~1.3x performance (not 2x)
Turbo Boost: Advertised 3.0 GHz might boost to 4.0 GHz under load, but only briefly
Thermal Throttling: Sustained high CPU causes frequency reduction
NUMA: Multi-socket servers have non-uniform memory access—cross-socket operations are 2x slower
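The hyperthreading figure above can be folded into the capacity math. A minimal sketch, using the ~1.3x factor from the list (the 16-core count is illustrative):

```python
def effective_cores(physical_cores: int, ht_enabled: bool = True, ht_gain: float = 1.3) -> float:
    """Hyperthreading roughly multiplies throughput by ~1.3x, not the 2x
    that the logical core count suggests."""
    return physical_cores * (ht_gain if ht_enabled else 1.0)

# 16 physical cores -> 32 logical cores, but only ~20.8 cores of throughput
print(effective_cores(16))         # ~20.8
print(effective_cores(16, False))  # 16.0
```

Using the effective core count rather than the logical core count keeps per-server RPS estimates from being optimistic by ~50%.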
Cloud Instance Considerations: Cloud vCPUs are not equal to physical cores. A vCPU is typically a single hyperthread, delivering roughly half to two-thirds of a physical core's throughput, and burstable instance families throttle sustained load.
Never estimate CPU time based on assumptions. Profile your actual code with realistic requests. A 'simple' JSON response might spend more CPU time in serialization than in business logic. Profiling often reveals surprising bottlenecks.
Memory-bound workloads are limited by RAM availability or memory bandwidth.
Identifying Memory-Bound Work: Resident memory grows with load or data size, GC pauses lengthen, and the process approaches its heap limit or triggers swapping while CPU utilization stays moderate.
Memory Calculation:
Total Memory = Base Memory + (Concurrent Requests × Memory per Request) + Cache Size
```python
# Memory-bound server sizing example: API gateway with caching

# Base requirements
os_and_runtime_mb = 2_000        # 2GB for OS + runtime
per_connection_kb = 50           # 50KB per active connection (buffers, state)
concurrent_connections = 10_000  # 10K concurrent during peak

# Application memory
application_heap_mb = 4_000  # 4GB heap for JVM/Go runtime
ml_model_mb = 2_000          # 2GB ML model in memory

# Caching layer (local cache for hot data)
cache_hit_target = 0.80  # 80% cache hit ratio needed
total_hot_data_gb = 100  # 100GB of frequently accessed data
cache_size_gb = total_hot_data_gb * cache_hit_target * 0.20  # 20% of hot data fits

# Calculate total memory per server
connection_memory_mb = (concurrent_connections * per_connection_kb) / 1024
cache_memory_mb = cache_size_gb * 1024

total_memory_mb = (
    os_and_runtime_mb
    + connection_memory_mb
    + application_heap_mb
    + ml_model_mb
    + cache_memory_mb
)

print("Memory breakdown per server:")
print(f"  OS + Runtime: {os_and_runtime_mb:,} MB")
print(f"  Connections ({concurrent_connections:,}): {connection_memory_mb:,.0f} MB")
print(f"  Application heap: {application_heap_mb:,} MB")
print(f"  ML model: {ml_model_mb:,} MB")
print(f"  Local cache: {cache_memory_mb:,.0f} MB")
print(f"  Total: {total_memory_mb:,.0f} MB ({total_memory_mb/1024:.1f} GB)")

# Size servers
available_instance_sizes = [
    ("m5.xlarge", 16_000, 4),     # 16GB RAM, 4 vCPU
    ("m5.2xlarge", 32_000, 8),    # 32GB RAM, 8 vCPU
    ("m5.4xlarge", 64_000, 16),   # 64GB RAM, 16 vCPU
    ("r5.4xlarge", 128_000, 16),  # 128GB RAM, 16 vCPU (memory optimized)
]

print("Instance sizing options:")
for name, ram_mb, vcpu in available_instance_sizes:
    usable_ram = ram_mb * 0.90  # 90% usable (reserve for OS spikes)
    fits = "✓" if usable_ram >= total_memory_mb else "✗"
    print(f"  {name}: {ram_mb/1024:.0f}GB RAM, {vcpu} vCPU → {fits}")
```

| Component | Memory Estimate | Notes |
|---|---|---|
| Linux OS baseline | 500MB - 1GB | Minimal, no desktop |
| JVM heap (small app) | 512MB - 2GB | Small microservice |
| JVM heap (typical) | 4GB - 16GB | Standard service |
| JVM heap (large) | 32GB - 128GB | Data-intensive |
| Per HTTP connection | 10KB - 100KB | Depends on framework |
| Per WebSocket | 50KB - 200KB | Higher due to state |
| Worker process (Python) | 100MB - 500MB | Per Gunicorn worker |
| Node.js default heap | ~1.5GB | V8 default limit |
Cloud instances share physical memory. Overcommitting (using more than available) triggers swapping, which destroys performance. Leave 10-20% headroom. When memory pressure hits, response times become unpredictable.
I/O-bound workloads spend most time waiting for external resources—disk, network, or downstream services.
Identifying I/O-Bound Work: CPU utilization stays low even at peak load, threads spend most of their time blocked on database, network, or disk calls, and request latency is dominated by downstream response times.
The Concurrency Model:
For I/O-bound work, the bottleneck isn't CPU—it's how many requests can wait simultaneously.
RPS = Concurrent Connections / Average Latency
Example: 1000 connections / 0.1 seconds = 10,000 RPS
```python
# I/O-bound server sizing: API calling multiple backends

# Request profile (typical microservice)
avg_latency_ms = 150  # Total request time (dominated by backend calls)
# Latency breakdown:
# - Database query: 50ms
# - Cache lookup: 5ms
# - External API: 80ms
# - Network overhead: 15ms

# Target RPS
peak_rps = 5_000

# Calculate required concurrency
concurrent_requests_needed = peak_rps * (avg_latency_ms / 1000)
print(f"Concurrent in-flight requests needed: {concurrent_requests_needed:,.0f}")

# Server connection limits
max_connections_per_server = 5_000  # Typical limit for event-driven server
# But we also need resources per connection
memory_per_connection_kb = 50
available_ram_mb = 16_000  # 16GB instance
connection_limit_by_ram = (available_ram_mb * 1024) / memory_per_connection_kb

effective_connection_limit = min(max_connections_per_server, connection_limit_by_ram)
print(f"Effective connection limit per server: {effective_connection_limit:,.0f}")

# Calculate RPS per server
rps_per_server = effective_connection_limit / (avg_latency_ms / 1000)
print(f"Max RPS per server: {rps_per_server:,.0f}")

# Servers needed
base_servers = peak_rps / rps_per_server
print(f"Servers for {peak_rps:,} RPS: {base_servers:.1f}")

# But we also need to consider downstream connection limits
# Each server opens connections to databases, caches, etc.
db_connections_per_server = 50
db_max_connections = 500  # PostgreSQL connection limit

servers_limited_by_db = db_max_connections / db_connections_per_server
print(f"Servers limited by DB connections: {servers_limited_by_db:.0f}")

# Thread pool exhaustion
thread_pool_size = 200  # Typical Java thread pool
max_concurrent_per_server = thread_pool_size  # Thread-per-request model
rps_if_thread_limited = max_concurrent_per_server / (avg_latency_ms / 1000)
print(f"RPS if thread-limited: {rps_if_thread_limited:,.0f}")

# Final calculation with all constraints
actual_rps_per_server = min(rps_per_server, rps_if_thread_limited)
actual_servers_needed = peak_rps / actual_rps_per_server

print(f"Actual capacity per server: {actual_rps_per_server:,.0f} RPS")
print(f"Actual servers needed: {actual_servers_needed:.0f}")
print(f"With redundancy (1.5x): {actual_servers_needed * 1.5:.0f}")
```

Little's Law:
Little's Law is fundamental for I/O-bound sizing:
L = λ × W
Where:
L = Average number of requests in system (concurrency)
λ = Average arrival rate (RPS)
W = Average time in system (latency)
To handle 10,000 RPS with 100ms latency, you need capacity for 1,000 concurrent requests.
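Little's Law is compact enough to sketch as a one-liner; the numbers below are the ones from the example above:

```python
def concurrency_needed(rps: float, latency_s: float) -> float:
    """Little's Law, L = lambda * W: in-flight requests = arrival rate x time in system."""
    return rps * latency_s

# 10,000 RPS at 100ms latency -> capacity for 1,000 concurrent requests
print(concurrency_needed(10_000, 0.100))  # 1000.0
```

Note that cutting latency in half also halves the concurrency you must provision for, which is one reason latency optimization is a capacity lever.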
Connection Pool Sizing:
Downstream connections often limit capacity:
For a service calling 5 backends, 50 connections each = 250 connections maintained per server.
A PostgreSQL instance handles roughly 500 connections effectively. With 50 app servers each holding 50 connections, you would demand 2,500 connections, five times what the database can serve. Solutions: connection pooling (PgBouncer), fewer connections per server, or database sharding. This is one of the most common scaling blockers.
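The arithmetic behind this bottleneck fits in a few lines. A sketch, with the ~500-connection ceiling treated as a rough assumption rather than a hard PostgreSQL limit:

```python
def total_db_connections(app_servers: int, pool_size_per_server: int) -> int:
    """Connections the database sees with per-server pools and no shared pooler."""
    return app_servers * pool_size_per_server

db_limit = 500  # rough effective PostgreSQL connection ceiling (assumption)

demand = total_db_connections(50, 50)
print(demand, demand > db_limit)  # 2500 True -> pooling or sharding required
```

A shared pooler such as PgBouncer sits in front of the database and multiplexes those 2,500 client connections onto a much smaller set of real server connections, so the database never sees the full demand.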
Once you know total required capacity, you must decide: fewer large servers or more small servers?
Scaling Up (Vertical): Increase resources per server
Scaling Out (Horizontal): Increase number of servers
Decision Framework:
| Criterion | Favor Scale Up | Favor Scale Out |
|---|---|---|
| State | Stateful workloads | Stateless workloads |
| Cost at target scale | <$10K/month | >$10K/month |
| Traffic variability | Stable | Highly variable |
| Failure tolerance | Acceptable downtime | Must survive failures |
| Team experience | Traditional ops | Cloud-native |
| Long-term growth | Bounded | Unbounded |
| Single point bottleneck | Network, specialized HW | CPU, memory |
```python
# Cost comparison: Scale Up vs Scale Out

# Scenario: Need 20,000 RPS capacity

# Scale Up Approach
large_instance = {
    "type": "m5.8xlarge",
    "vcpu": 32,
    "ram_gb": 128,
    "hourly_cost": 1.536,
    "rps_capacity": 5000,  # 5K RPS per large instance
}

large_count = 20_000 / large_instance["rps_capacity"]
large_total_hourly = large_count * large_instance["hourly_cost"]
large_with_redundancy = large_count * 1.5  # N+1, roughly

# Scale Out Approach
small_instance = {
    "type": "m5.xlarge",
    "vcpu": 4,
    "ram_gb": 16,
    "hourly_cost": 0.192,
    "rps_capacity": 800,  # 800 RPS per small instance
}

small_count = 20_000 / small_instance["rps_capacity"]
small_total_hourly = small_count * small_instance["hourly_cost"]
small_with_redundancy = small_count * 1.25  # Lower factor, more failure tolerance

print("Scaling Comparison for 20,000 RPS")
print("=" * 50)
print(f"Scale Up ({large_instance['type']}):")
print(f"  Instances needed: {large_count:.0f} (+ {large_count*0.5:.0f} redundancy)")
print(f"  Hourly cost: ${large_total_hourly * 1.5:.2f}")
print(f"  Monthly cost: ${large_total_hourly * 1.5 * 730:.0f}")
print(f"  Blast radius: 25% (if 1 of 4 fails)")

print(f"Scale Out ({small_instance['type']}):")
print(f"  Instances needed: {small_count:.0f} (+ {small_count*0.25:.0f} redundancy)")
print(f"  Hourly cost: ${small_total_hourly * 1.25:.2f}")
print(f"  Monthly cost: ${small_total_hourly * 1.25 * 730:.0f}")
print(f"  Blast radius: 4% (if 1 of 25 fails)")

# Break-even analysis
print(f"Cost difference: ${abs(large_total_hourly*1.5 - small_total_hourly*1.25)*730:.0f}/month")

# Operational complexity
print(f"Operational factors:")
print(f"  Scale Up: {large_count*1.5:.0f} instances to manage")
print(f"  Scale Out: {small_count*1.25:.0f} instances to manage")
```

Most cloud-native architectures default to scaling out. The tooling (Kubernetes, auto-scaling groups) is mature, and horizontal scaling enables progressive deployments, A/B testing, and canary releases. Scale up when you have a clear reason: a stateful workload, specialized hardware, or a preference for operational simplicity.
Production systems fail. Server estimation must account for failures, maintenance, and deployments.
Failure Scenarios to Plan For: individual server crashes, availability zone outages, rolling deployments taking capacity offline, and unexpected traffic spikes.
Redundancy Strategies:
| Strategy | Extra Capacity | Survives | Use Case |
|---|---|---|---|
| N+1 | 1 extra server | 1 server failure | Small deployments |
| N+2 | 2 extra servers | 2 server failures | Medium deployments |
| 2N | 100% extra | Half of servers failing | Critical systems |
| 2N+1 | 100% + 1 server | Half + maintenance | Mission critical |
| Multi-AZ (+50%) | ~50% extra | AZ failure | Most production |
| Multi-Region (2x) | 100% extra | Region failure | Global services |
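The strategies in the table differ only in how they inflate the base server count. A minimal sketch of that mapping (the base count of 20 is hypothetical):

```python
import math

def with_redundancy(base_servers: int, strategy: str) -> int:
    """Total servers for the redundancy strategies in the table above."""
    totals = {
        "N+1": base_servers + 1,
        "N+2": base_servers + 2,
        "2N": base_servers * 2,
        "2N+1": base_servers * 2 + 1,
        "multi_az": math.ceil(base_servers * 1.5),  # ~50% extra, survives 1 of 3 AZs
        "multi_region": base_servers * 2,           # full duplicate in a second region
    }
    return totals[strategy]

base = 20  # hypothetical base requirement
for s in ("N+1", "2N", "multi_az"):
    print(s, with_redundancy(base, s))  # N+1 21, 2N 40, multi_az 30
```

The jump from N+1 (5% extra here) to 2N (100% extra) is why the strategy choice is a business decision about acceptable downtime, not just an engineering one.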
Calculating Capacity with Failures:
```python
# Redundancy calculation for production deployment

# Base requirements
peak_rps = 50_000
rps_per_server = 2_000

# Base servers for peak
base_servers = peak_rps / rps_per_server
print(f"Base servers for {peak_rps:,} RPS: {base_servers:.0f}")

# Multi-AZ deployment (3 AZs, survive 1 AZ loss)
azs = 3
az_failure_tolerance = 1  # Survive losing 1 AZ

# Each AZ must handle full load when one fails,
# so 2 AZs must handle 100% at all times
servers_per_az_for_failure = base_servers / (azs - az_failure_tolerance)
total_with_az_redundancy = servers_per_az_for_failure * azs
az_overhead = (total_with_az_redundancy - base_servers) / base_servers * 100

print(f"Multi-AZ (3 AZs, survive 1 failure):")
print(f"  Servers per AZ: {servers_per_az_for_failure:.0f}")
print(f"  Total servers: {total_with_az_redundancy:.0f}")
print(f"  Overhead: {az_overhead:.0f}%")

# Additional headroom for rolling deployments:
# during a deploy, 10-20% of servers are being replaced
deployment_overhead = 1.15
with_deployment = total_with_az_redundancy * deployment_overhead

print(f"With deployment headroom (15%):")
print(f"  Total servers: {with_deployment:.0f}")

# Spike headroom (unexpected traffic bursts)
spike_headroom = 1.20
final_count = with_deployment * spike_headroom

print(f"With spike headroom (20%):")
print(f"  Total servers: {final_count:.0f}")

# Summary
total_overhead = (final_count - base_servers) / base_servers * 100
print(f"{'='*50}")
print(f"SUMMARY:")
print(f"  Base requirement: {base_servers:.0f} servers")
print(f"  Production deployment: {final_count:.0f} servers")
print(f"  Total overhead: {total_overhead:.0f}%")
print(f"Breakdown of {final_count:.0f} servers:")
print(f"  AZ-1: {final_count/3:.0f} servers")
print(f"  AZ-2: {final_count/3:.0f} servers")
print(f"  AZ-3: {final_count/3:.0f} servers")
```

Surviving an AZ failure typically requires 50% more servers; surviving a region failure requires 100% more (a full duplicate). Multi-region is expensive. Only deploy multi-region if your SLA truly requires it or regulations demand data locality.
Application servers aren't the only component. Databases, caches, and specialty systems have unique sizing requirements.
Database Server Sizing:
Databases are typically sized by:
```python
# Database server sizing example: PostgreSQL

# Data characteristics
total_data_gb = 500  # 500GB total database size
hot_data_pct = 0.20  # 20% is actively queried
working_set_gb = total_data_gb * hot_data_pct

# Query patterns
read_qps = 10_000  # Read queries per second
write_qps = 1_000  # Write queries per second

# Query complexity
avg_rows_per_read = 50  # Average rows returned
avg_index_lookups = 3   # Indexes checked per query

# IOPS estimation
# Reads: mostly from buffer pool if working set fits in RAM
# Writes: WAL + table + index updates
estimated_read_iops = read_qps * 0.1  # ~10% of reads miss cache and hit disk
estimated_write_iops = write_qps * 5  # 5 disk writes per transaction
total_iops = estimated_read_iops + estimated_write_iops

print("PostgreSQL Sizing:")
print(f"  Total data: {total_data_gb} GB")
print(f"  Working set: {working_set_gb} GB")
print(f"  Read QPS: {read_qps:,}")
print(f"  Write QPS: {write_qps:,}")
print(f"  Estimated IOPS: {total_iops:,.0f}")

# Memory sizing
# shared_buffers should be ~25% of RAM;
# working set should fit in effective_cache_size
required_shared_buffers_gb = working_set_gb * 0.5
required_ram_gb = required_shared_buffers_gb * 4  # 25% rule

print(f"Memory requirements:")
print(f"  shared_buffers: {required_shared_buffers_gb:.0f} GB")
print(f"  Minimum RAM: {required_ram_gb:.0f} GB")
print(f"  Recommended RAM: {required_ram_gb * 1.5:.0f} GB")

# Connection sizing
app_servers = 50
connections_per_app = 30
max_connections = app_servers * connections_per_app
# Each connection uses ~10MB
connection_memory_gb = max_connections * 10 / 1024

print(f"Connection requirements:")
print(f"  App servers: {app_servers}")
print(f"  Connections per app: {connections_per_app}")
print(f"  Max connections: {max_connections}")
print(f"  Connection overhead: {connection_memory_gb:.1f} GB")

# Final sizing
final_ram = required_ram_gb + connection_memory_gb + 16  # +16GB buffer
print(f"Recommended DB instance: {final_ram:.0f}GB RAM, {write_qps*10:.0f}+ IOPS")
```

| Server Type | Key Metric | Sizing Rule |
|---|---|---|
| Redis Cache | Hot data size | RAM = Data × 2 (overhead) × Replicas |
| Elasticsearch | Index size | RAM = Index size × 0.5 + 50% heap |
| MongoDB | Working set | RAM > Working set for good performance |
| Kafka | Throughput | Partition count × message rate × retention |
| ML Inference | Model size + batch | GPU VRAM > Model + batch tensors |
| Load Balancer | Connections | Size for peak concurrent connections |
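Applying the Redis row's rule of thumb (RAM = Data × 2 overhead × Replicas) with hypothetical numbers:

```python
def redis_cluster_ram_gb(hot_data_gb: float, overhead_factor: float = 2.0,
                         copies: int = 2) -> float:
    """Redis sizing rule from the table: RAM = Data x overhead x copies.

    overhead_factor covers per-key metadata, fragmentation, and fork-based
    persistence; copies counts primary + replicas holding the data."""
    return hot_data_gb * overhead_factor * copies

# 100GB of hot data, 2x overhead, primary + 1 replica -> 400GB cluster-wide
print(redis_cluster_ram_gb(100))  # 400.0
```

The 2x overhead factor is a conservative rule of thumb; measure actual memory per key for your data shapes before committing to instance sizes.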
Unlike stateless app servers, databases can't simply 'add more instances.' Sharding requires data redistribution. Replicas have replication lag. Connection pooling has limits. Database capacity planning requires more upfront investment and longer lead times.
In system design interviews, server estimation demonstrates your ability to translate abstract requirements into concrete infrastructure.
Framework for Interview Server Estimation:
```
// Sample interview server estimation (URL Shortener)

"Let me estimate the server requirements for our URL shortening service.

TRAFFIC RECAP (from earlier):
- 100M DAU
- 1 URL creation per user/day = 100M writes/day
- 100 clicks per URL = 10B reads/day

RPS CALCULATION:
- Writes: 100M / 86,400 ≈ 1,200 writes/sec average
- Reads: 10B / 86,400 ≈ 116K reads/sec average, ≈ 350,000 reads/sec peak (×3)

BOTTLENECK ANALYSIS:
- Writes are simple (store URL mapping) → DB-bound
- Reads are simple lookups → cache-bound, very fast

SERVER CAPACITY ASSUMPTIONS:
- Each app server: 5,000 RPS for simple lookups
- Assuming 90% cache hit rate on reads
- DB handles 5,000 writes/sec per instance

READ PATH (350K RPS):
- 350K RPS / 5K per server = 70 servers
- With 50% headroom: ~100 servers
- Distributed across 3 AZs: ~34 per AZ

WRITE PATH (1.2K RPS):
- 1.2K writes easily handled by the read servers
- DB: single primary handles 1.2K writes
- Need replicas for reads: 1 primary + 2 replicas

CACHE LAYER:
- Hot URLs (last 30 days): ~3B URLs × 100 bytes = 300GB
- Redis cluster: 4-6 instances with 64GB each

SUMMARY:
- Application servers: ~100 (34 per AZ)
- Redis cluster: 6 instances
- PostgreSQL: 1 primary + 2 replicas
- CDN / load balancers: managed service

This gives us headroom over current traffic with room to grow."
```

After calculating, compare to known numbers. Twitter has ~300M DAU and uses thousands of servers. Your 100M DAU estimate of ~100 servers should be a similar order of magnitude for a similar workload. If you calculate 10,000 servers for 1M users, something is wrong.
You now understand how to translate traffic requirements into concrete infrastructure specifications. Let's consolidate the key principles:
| Formula | Usage |
|---|---|
| Base Servers = Peak RPS / RPS per Server | Starting calculation |
| With AZ redundancy = Base × 1.5 | Multi-AZ deployment |
| With deployment = AZ × 1.15 | Rolling update capacity |
| Production = Deployment × 1.2 | Spike headroom |
| Concurrent Requests = RPS × Latency (s) | Little's Law |
What's Next:
With individual estimation skills complete—traffic, storage, bandwidth, and servers—we'll consolidate everything into a unified estimation framework. The final page provides quick-reference formulas, common pitfalls to avoid, and practice problems to solidify your skills.
You can now estimate server requirements for any system design. Remember: precision isn't the goal—order-of-magnitude accuracy is. A calculation showing you need 'roughly 50-100 servers' is far more valuable than one claiming exactly 73 servers, because the latter implies false precision.