Databases don't fail suddenly without warning—they fail gradually, then all at once. The gradual phase is marked by warning signs: storage filling up, query latencies creeping upward, CPU approaching saturation. The "all at once" phase is when users experience errors, transactions fail, and business operations halt.
Capacity planning is the discipline that transforms this inevitable growth into predictable, manageable infrastructure evolution. It bridges the gap between current operations and future requirements, ensuring that resources are available before they're desperately needed.
The best DBAs don't fight fires—they prevent them. Capacity planning is how prevention happens: by understanding growth patterns, projecting future needs, and implementing changes proactively rather than reactively.
By the end of this page, you will understand comprehensive capacity planning strategies, including growth analysis methods, resource projection techniques, scaling approaches, cost optimization, and proactive planning processes. You'll learn to anticipate database resource needs and implement sustainable scaling strategies.
Database capacity isn't a single dimension—it encompasses multiple interrelated resources, each of which can become a bottleneck:
Primary Capacity Dimensions:
| Dimension | What It Affects | Key Metrics | Scaling Approach |
|---|---|---|---|
| Storage | Data persistence, growth room | GB/TB used, growth rate, IOPS | Add disks, expand volumes, archive old data |
| Memory | Buffer cache, query processing | Buffer hit ratio, memory pressure | Add RAM, optimize queries, partition data |
| CPU | Query processing, concurrency | CPU utilization, wait time | Add cores, optimize queries, scale out |
| Network | Connection handling, replication | Bandwidth, latency, connections | Upgrade NICs, connection pooling |
| Connections | Concurrent users/applications | Active connections, wait queue | Pooling, scale out, async processing |
Capacity Interdependencies:
Capacity dimensions interact in complex ways, and addressing one bottleneck often reveals another: adding RAM to relieve memory pressure, for example, may simply expose the CPU or storage saturation that the larger cache was masking.
Capacity planning must consider the system holistically, not just individual resources.
As a general guideline, start capacity planning when any resource exceeds 80% utilization. This provides buffer for unexpected spikes and ensures time for orderly scaling. Reaching 95%+ is an emergency—at that point you're in crisis mode, not planning mode.
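The 80%/95% guideline can be folded into routine checks. A minimal Python sketch (the thresholds and the `capacity_status` helper are illustrative, not from any monitoring product):

```python
def capacity_status(used: float, total: float, daily_growth: float,
                    plan_at: float = 0.80, crisis_at: float = 0.95):
    """Classify a resource against planning thresholds and estimate
    days until the 80% planning threshold, assuming linear growth."""
    utilization = used / total
    if utilization >= crisis_at:
        status = "crisis"
    elif utilization >= plan_at:
        status = "plan now"
    else:
        status = "ok"
    if daily_growth > 0 and utilization < plan_at:
        days_to_plan = (total * plan_at - used) / daily_growth
    else:
        days_to_plan = 0.0
    return status, days_to_plan

# Hypothetical figures: 1160 GB used of 2000 GB, growing 2 GB/day
status, days = capacity_status(used=1160, total=2000, daily_growth=2.0)
print(status, round(days))  # → ok 220
```

The point of returning a time estimate rather than just a flag is that "ok" with 220 days of runway and "ok" with 20 days of runway call for very different responses.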
Effective capacity planning requires historical data. You cannot project future needs without understanding past and present consumption patterns.
What to Collect:
```sql
-- PostgreSQL: Capacity Planning Data Collection

-- Database sizes and growth
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS current_size,
    pg_database_size(datname) AS size_bytes
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY pg_database_size(datname) DESC;

-- Table sizes for growth tracking
CREATE TABLE IF NOT EXISTS capacity_table_sizes (
    captured_at TIMESTAMP DEFAULT NOW(),
    schema_name VARCHAR(100),
    table_name VARCHAR(100),
    row_count BIGINT,
    table_size_bytes BIGINT,
    index_size_bytes BIGINT,
    toast_size_bytes BIGINT
);

-- Capture table sizes (run daily)
INSERT INTO capacity_table_sizes
    (schema_name, table_name, row_count, table_size_bytes, index_size_bytes, toast_size_bytes)
SELECT
    n.nspname,
    c.relname,
    c.reltuples::bigint,
    pg_table_size(c.oid),
    pg_indexes_size(c.oid),
    pg_total_relation_size(c.oid) - pg_table_size(c.oid) - pg_indexes_size(c.oid)
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');

-- Growth analysis over time
SELECT
    table_name,
    MIN(captured_at) AS first_capture,
    MAX(captured_at) AS last_capture,
    MIN(table_size_bytes) AS initial_size,
    MAX(table_size_bytes) AS current_size,
    (MAX(table_size_bytes) - MIN(table_size_bytes))
        / NULLIF(EXTRACT(EPOCH FROM MAX(captured_at) - MIN(captured_at)) / 86400, 0)
        AS bytes_per_day
FROM capacity_table_sizes
GROUP BY table_name
HAVING COUNT(*) > 7  -- At least a week of data
ORDER BY bytes_per_day DESC;

-- Connection utilization over time
CREATE TABLE IF NOT EXISTS capacity_connection_snapshots (
    captured_at TIMESTAMP DEFAULT NOW(),
    total_connections INTEGER,
    active_connections INTEGER,
    idle_connections INTEGER,
    waiting_connections INTEGER,
    max_connections INTEGER
);

-- Capture connection state (run every 5 minutes)
INSERT INTO capacity_connection_snapshots
SELECT
    NOW(),
    COUNT(*),
    COUNT(*) FILTER (WHERE state = 'active'),
    COUNT(*) FILTER (WHERE state = 'idle'),
    COUNT(*) FILTER (WHERE wait_event IS NOT NULL),
    current_setting('max_connections')::int
FROM pg_stat_activity;
```

Collection Frequency:
Different metrics require different collection frequencies:
| Metric Type | Collection Frequency | Retention | Purpose |
|---|---|---|---|
| Real-time metrics (CPU, connections) | 1 minute or less | 7-30 days at high resolution | Incident analysis, daily patterns |
| Daily metrics (storage, table sizes) | Once daily | Years | Growth trends, capacity projection |
| Aggregate summaries (hourly/daily avg) | Hourly roll-up | Years | Long-term trend analysis |
| Transaction/query volumes | 15 minutes | 90 days | Workload patterns, seasonal analysis |
Manual metric collection is unsustainable. Use monitoring tools (Prometheus, Datadog, etc.) that automatically collect, store, and visualize capacity metrics. The investment in monitoring infrastructure pays dividends in capacity planning accuracy.
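The hourly roll-up strategy in the table above is a simple aggregation: bucket high-resolution samples by hour and keep only the average. A sketch with hypothetical one-minute CPU samples:

```python
from collections import defaultdict
from datetime import datetime

def hourly_rollup(samples):
    """Aggregate (timestamp, value) samples into hourly averages,
    so high-resolution data can be retained cheaply for years."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate each timestamp to the top of its hour
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}

# Hypothetical one-minute CPU utilization samples
samples = [
    (datetime(2024, 1, 1, 9, 0), 40.0),
    (datetime(2024, 1, 1, 9, 30), 60.0),
    (datetime(2024, 1, 1, 10, 15), 80.0),
]
rollup = hourly_rollup(samples)
print(rollup[datetime(2024, 1, 1, 9, 0)])   # → 50.0
print(rollup[datetime(2024, 1, 1, 10, 0)])  # → 80.0
```

Averaging loses peaks, so production roll-ups usually keep max (and sometimes percentiles) alongside the mean.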
With historical data collected, the next step is analyzing patterns and projecting future needs.
Growth Pattern Types:
Different systems exhibit different growth behaviors:
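One way to distinguish linear from exponential growth in collected size history is to compare the fit of a linear model against a log-linear one. A rough Python sketch (the residual-comparison heuristic is an illustrative assumption, not a standard statistical test):

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def classify_growth(sizes):
    """Compare squared error of a linear fit against an exponential
    (log-linear) fit and report whichever fits the history better."""
    xs = list(range(len(sizes)))
    a, b = fit_linear(xs, sizes)
    lin_err = sum((a + b * x - y) ** 2 for x, y in zip(xs, sizes))
    la, lb = fit_linear(xs, [math.log(y) for y in sizes])
    exp_err = sum((math.exp(la + lb * x) - y) ** 2 for x, y in zip(xs, sizes))
    return "linear" if lin_err <= exp_err else "exponential"

print(classify_growth([100, 110, 120, 130, 140]))  # → linear
print(classify_growth([100, 150, 225, 338, 507]))  # → exponential
```

The distinction matters for projection: extrapolating exponential growth with a linear model systematically underestimates future need.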
```sql
-- PostgreSQL: Storage Growth Projection

-- Calculate daily growth rate from historical data
WITH daily_sizes AS (
    SELECT
        date_trunc('day', captured_at) AS capture_day,
        SUM(table_size_bytes + index_size_bytes) AS total_size
    FROM capacity_table_sizes
    WHERE captured_at >= NOW() - INTERVAL '90 days'
    GROUP BY 1
    ORDER BY 1
),
growth_calc AS (
    SELECT
        capture_day,
        total_size,
        total_size - LAG(total_size) OVER (ORDER BY capture_day) AS daily_growth,
        (total_size - LAG(total_size) OVER (ORDER BY capture_day))::float
            / NULLIF(LAG(total_size) OVER (ORDER BY capture_day), 0) * 100 AS growth_percent
    FROM daily_sizes
)
SELECT
    -- Current state
    (SELECT pg_size_pretty(SUM(pg_database_size(datname)))
     FROM pg_database WHERE datname NOT LIKE 'template%') AS current_total,
    -- Average daily growth
    pg_size_pretty(AVG(daily_growth)::bigint) AS avg_daily_growth,
    -- Projection: 30 days
    pg_size_pretty(
        (SELECT SUM(pg_database_size(datname)) FROM pg_database
         WHERE datname NOT LIKE 'template%')
        + (AVG(daily_growth) * 30)::bigint
    ) AS projection_30_days,
    -- Projection: 90 days
    pg_size_pretty(
        (SELECT SUM(pg_database_size(datname)) FROM pg_database
         WHERE datname NOT LIKE 'template%')
        + (AVG(daily_growth) * 90)::bigint
    ) AS projection_90_days,
    -- Projection: 1 year
    pg_size_pretty(
        (SELECT SUM(pg_database_size(datname)) FROM pg_database
         WHERE datname NOT LIKE 'template%')
        + (AVG(daily_growth) * 365)::bigint
    ) AS projection_1_year
FROM growth_calc
WHERE daily_growth IS NOT NULL;

-- Days until storage threshold reached
WITH current_usage AS (
    SELECT SUM(pg_database_size(datname)) AS current_size
    FROM pg_database
    WHERE datname NOT LIKE 'template%'
),
growth_rate AS (
    SELECT AVG(daily_growth) AS daily_rate
    FROM (
        SELECT
            SUM(table_size_bytes + index_size_bytes)
                - LAG(SUM(table_size_bytes + index_size_bytes))
                    OVER (ORDER BY date_trunc('day', captured_at)) AS daily_growth
        FROM capacity_table_sizes
        WHERE captured_at >= NOW() - INTERVAL '30 days'
        GROUP BY date_trunc('day', captured_at)
    ) sub
)
SELECT
    pg_size_pretty(current_size) AS current_usage,
    pg_size_pretty(1000 * 1024 * 1024 * 1024::bigint) AS disk_capacity,  -- 1TB example
    ROUND(
        (1000 * 1024 * 1024 * 1024::bigint - current_size) / NULLIF(daily_rate, 0)
    ) AS days_until_full,
    NOW() + (
        ((1000 * 1024 * 1024 * 1024::bigint - current_size) / NULLIF(daily_rate, 0))
        * INTERVAL '1 day'
    ) AS estimated_full_date
FROM current_usage, growth_rate;
```

Incorporating Business Context:
Pure historical projection isn't enough. Business context dramatically affects future capacity:
Add 20-30% headroom to projections. Estimates are never perfect, spikes exceed averages, and procurement/scaling takes time. Running at projected capacity with no margin leaves you vulnerable to the unexpected.
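The headroom guidance reduces to a one-line calculation. An illustrative Python sketch (all figures hypothetical; 25% is simply a middle value in the 20-30% range above):

```python
def project_with_headroom(current_gb, daily_growth_gb, horizon_days, headroom=0.25):
    """Linear projection of storage need over a planning horizon,
    inflated by a safety margin for spikes and estimation error."""
    projected = current_gb + daily_growth_gb * horizon_days
    return projected * (1 + headroom)

# Hypothetical: 1160 GB today, growing 2 GB/day, planning one year out
need = project_with_headroom(1160, 2.0, 365)
print(round(need))  # 1160 + 730 = 1890 GB projected; with 25% headroom → 2362
```

Note that the headroom is applied to the projection, not the current size: the margin is there to absorb error in the forecast itself.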
When projections indicate capacity limits approaching, multiple scaling strategies are available. Each has trade-offs in complexity, cost, and effectiveness.
Vertical Scaling (Scale Up):
Increase resources on the existing database server—more CPU, RAM, or faster storage.
Horizontal Scaling (Scale Out):
Distribute workload across multiple database servers.
| Approach | Description | Complexity | Best For |
|---|---|---|---|
| Read Replicas | Route read queries to replicas; writes to primary | Low-Medium | Read-heavy workloads |
| Sharding | Partition data across multiple servers by key | High | Large datasets, write scaling |
| Federation | Different databases for different features/domains | Medium | Distinct, non-overlapping workloads |
| Distributed Database | Use databases designed for horizontal scale (CockroachDB, Spanner) | Medium | New deployments, cloud-native |
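Sharding's partition-by-key idea can be sketched with a stable hash over the partition key. A minimal Python example (shard names are hypothetical; real systems typically use consistent hashing or a lookup directory so that adding shards doesn't re-route most keys):

```python
import hashlib

# Hypothetical shard inventory
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Route a row to a shard by hashing its partition key.
    A stable hash (not Python's builtin hash(), which is salted
    per process) keeps routing consistent across restarts."""
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("customer:42") in SHARDS)                  # → True
print(shard_for("customer:42") == shard_for("customer:42"))  # → True (deterministic)
```

The hard part of sharding isn't routing, it's everything the routing breaks: cross-shard joins, transactions spanning shards, and rebalancing data when the shard count changes.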
Query Optimization:
Before scaling hardware, ensure queries are efficient. Poor queries waste capacity that optimization could reclaim:
```sql
-- Identify optimization opportunities before scaling

-- Unused indexes consuming storage and write overhead
SELECT
    schemaname,
    relname,
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- Tables with no index usage (might need indexes)
SELECT
    schemaname,
    relname,
    seq_scan AS sequential_scans,
    seq_tup_read AS rows_scanned_sequentially,
    n_live_tup AS row_count
FROM pg_stat_user_tables
WHERE idx_scan = 0
  AND seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Most expensive queries (candidates for optimization)
SELECT
    substring(query, 1, 60) AS query_preview,
    calls,
    total_exec_time / 1000 AS total_seconds,
    mean_exec_time AS avg_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Cache efficiency - low hit ratio tables
SELECT
    schemaname,
    relname,
    heap_blks_hit + heap_blks_read AS total_reads,
    CASE WHEN heap_blks_hit + heap_blks_read > 0
         THEN round(100.0 * heap_blks_hit / (heap_blks_hit + heap_blks_read), 2)
         ELSE 100
    END AS cache_hit_pct
FROM pg_statio_user_tables
WHERE heap_blks_read > 1000
ORDER BY cache_hit_pct ASC
LIMIT 20;
```

A query using 100x more resources than necessary will use 100x more on bigger hardware too. Optimization is often cheaper and more effective than scaling. Rule of thumb: if you haven't reviewed your top 20 slow queries in the last quarter, optimize before you scale.
Capacity planning is ultimately about spending money wisely. Understanding cost structures helps make informed decisions.
Cost Components:
| Component | On-Premises Cost | Cloud Cost Model |
|---|---|---|
| Compute (CPU/Memory) | Server hardware purchase/lease | Per-hour/vCPU pricing |
| Storage | Disk purchase, SAN/NAS licensing | Per-GB/month + IOPS charges |
| Licensing | Per-core, per-socket, or subscription | Often included or separate |
| Network | Switch/router hardware, bandwidth | Egress charges, private link fees |
| Backup Storage | Tape, disk, offsite facility | Per-GB storage + retrieval fees |
| Personnel | DBA salaries, training, tools | Same; potentially reduced with managed services |
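Comparing options on total cost of ownership rather than infrastructure cost alone can be made concrete. A sketch with entirely illustrative rates (nothing here reflects real vendor pricing):

```python
def one_year_tco(monthly_infra, dba_hours_per_month=0.0, dba_rate=100.0,
                 expected_downtime_hours=0.0, downtime_cost_per_hour=0.0):
    """12-month TCO: infrastructure + personnel time + expected
    cost of downtime. All rates are hypothetical assumptions."""
    infra = monthly_infra * 12
    personnel = dba_hours_per_month * dba_rate * 12
    downtime = expected_downtime_hours * downtime_cost_per_hour
    return infra + personnel + downtime

# Managed service: pricier compute, little hands-on time
managed = one_year_tco(monthly_infra=3000, dba_hours_per_month=5)
# Self-hosted: cheaper compute, more DBA time, some downtime risk
self_hosted = one_year_tco(monthly_infra=1800, dba_hours_per_month=40,
                           expected_downtime_hours=4, downtime_cost_per_hour=2000)
print(managed, self_hosted)  # → 42000.0 77600.0
```

With these particular (made-up) inputs the cheaper instance is the more expensive option over a year, which is exactly the effect TCO analysis is meant to surface.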
Cloud Cost Optimization:
Cloud databases offer flexibility but can become expensive without optimization:
```markdown
# Capacity Planning Cost Analysis Template

## Current State
- Instance Type: db.r6g.2xlarge (8 vCPU, 64GB RAM)
- Storage: 2TB gp3 @ 3000 IOPS
- Monthly Cost: $1,847

## Projected 1-Year Needs
Based on 15% monthly data growth and 20% annual user growth:
- Required RAM: 96GB (for 70% cache hit on working set)
- Required Storage: 5TB
- Required CPU: 12 vCPU at current efficiency

## Scaling Options

### Option A: Vertical Scale (db.r6g.4xlarge)
- Configuration: 16 vCPU, 128GB RAM, 5TB storage
- Monthly Cost: $3,124
- 1-Year TCO: $37,488
- Downtime: 10-15 minutes for resize
- Risk: Single point of failure remains

### Option B: Add Read Replica
- Primary: db.r6g.2xlarge (unchanged)
- Replica: db.r6g.xlarge (4 vCPU, 32GB)
- Monthly Cost: $2,386 (+$539)
- 1-Year TCO: $28,632
- Downtime: None (replica add is online)
- Risk: Replication lag for read-after-write

### Option C: Query Optimization First
- Estimated effort: 2 weeks DBA time
- Cost: ~$5,000 (DBA time + tools)
- Projected capacity gain: 25-40%
- May defer scaling 6-12 months
- 1-Year Savings: $8,000 - $15,000

## Recommendation
Pursue Option C first, then Option B when optimization gains are exhausted.
Monitor for 90 days post-optimization before committing.
```

Evaluate Total Cost of Ownership (TCO), not just compute costs. Include DBA time for management, developer time for schema changes, the opportunity cost of downtime, and the risk costs of capacity failures. Sometimes a higher infrastructure cost delivers a lower TCO.
Capacity planning shouldn't be ad hoc—it should follow a regular, documented process that engages stakeholders and produces actionable plans.
Planning Cycle:
Stakeholder Alignment:
Capacity planning requires input and buy-in from multiple stakeholders:
| Stakeholder | Input Provided | Approval Needed |
|---|---|---|
| Product Management | Product roadmap, feature launches, user growth targets | Planning assumptions |
| Engineering Leadership | Technical constraints, architecture decisions | Technical approach |
| Finance | Budget constraints, procurement timelines | Capital expenditure |
| Operations/SRE | Incident history, reliability requirements | Operational feasibility |
| Security | Compliance requirements, data classification | Security posture |
Capacity changes take time: procurement of hardware (weeks to months), cloud reservations (limited flexibility after commitment), application changes for scale-out (engineering sprints). Start planning early enough to execute without emergency measures.
Proactive alerts ensure you're notified before capacity limits are reached, not after.
Recommended Thresholds:
| Metric | Warning (Plan) | Critical (Act Now) | Action |
|---|---|---|---|
| Disk Space | 70% | 85% | Expand storage, archive data, clean temp files |
| CPU Utilization | 70% sustained | 90% sustained | Scale up, optimize queries, add replicas |
| Memory Utilization | 85% | 95% | Add memory, optimize queries, restart to clear fragmentation |
| Connection Utilization | 70% of max | 90% of max | Increase limit, implement pooling, review leaks |
| Replication Lag | 1 minute | 10 minutes | Check replica resources, network, long transactions |
| Transaction Log Growth | 50% of allocated | 80% of allocated | Increase allocation, check checkpoint frequency |
```yaml
groups:
  - name: database_capacity_alerts
    rules:
      # Storage capacity alerts
      - alert: DatabaseDiskSpaceWarning
        expr: |
          (pg_volume_stats_used_bytes / pg_volume_stats_capacity_bytes) > 0.70
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Database disk space exceeds 70%"
          description: "{{ $labels.instance }} disk usage is {{ $value | humanizePercentage }}"
          action: "Review capacity plan; schedule expansion if trending to critical"

      - alert: DatabaseDiskSpaceCritical
        expr: |
          (pg_volume_stats_used_bytes / pg_volume_stats_capacity_bytes) > 0.85
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Database disk space critical (>85%)"
          description: "{{ $labels.instance }} disk usage is {{ $value | humanizePercentage }}"
          action: "Immediate action required - expand storage or remove data"

      # Connection capacity
      - alert: DatabaseConnectionsWarning
        expr: |
          pg_stat_activity_count / pg_settings_max_connections > 0.70
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Connection usage exceeds 70%"
          description: "{{ $value | humanizePercentage }} of max connections in use"

      # Projected capacity exhaustion
      - alert: DatabaseDiskExhaustionProjected
        expr: |
          predict_linear(pg_volume_stats_used_bytes[7d], 30 * 24 * 3600)
            > pg_volume_stats_capacity_bytes
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Disk space projected to exhaust within 30 days"
          description: "Based on 7-day trend, disk will be full in ~30 days"
```

The most valuable capacity alerts are predictive: "At current growth rate, disk will be full in 30 days." This provides time to plan rather than react. Prometheus's predict_linear function and similar tools enable these projections.
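Under the hood, this kind of prediction is just least-squares extrapolation over recent samples. A self-contained Python sketch of the same idea, over hypothetical daily disk-usage samples:

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares extrapolation over (unix_ts, value) samples --
    the same idea as Prometheus's predict_linear()."""
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    mv = sum(v for _, v in samples) / n
    slope = (sum((t - mt) * (v - mv) for t, v in samples)
             / sum((t - mt) ** 2 for t, _ in samples))
    intercept = mv - slope * mt
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)

# Hypothetical daily samples: starting at 1000 GB, growing 2 GB/day
day = 86400
gb = 1024 ** 3
samples = [(i * day, (1000 + 2 * i) * gb) for i in range(8)]
projected = predict_linear(samples, 30 * day)
print(round(projected / gb))  # → 1074 (1014 GB today + 60 GB over 30 days)
```

A linear fit over a short window is deliberately simple: it reacts to recent trend changes but will mispredict bursty or seasonal growth, which is why the alert above requires the condition to hold for 6 hours before firing.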
Documentation captures current state, historical analysis, and future plans for reference and audit.
Essential Documentation:
```markdown
# Database Capacity Report - Q4 2024

## Executive Summary
The primary production database is projected to require storage expansion within 6 months and a CPU upgrade within 12 months based on current growth trends.

## Current State

### Resources
| Database | Type | vCPU | RAM | Storage | Current Use |
|----------|------|------|-----|---------|-------------|
| prod-primary | PostgreSQL 15 | 8 | 64GB | 2TB NVMe | Active |
| prod-replica | PostgreSQL 15 | 4 | 32GB | 2TB NVMe | Read distribution |

### Utilization (Q4 Average)
| Resource | Average | Peak | Trend |
|----------|---------|------|-------|
| CPU | 45% | 78% | +5%/quarter |
| Memory | 72% | 89% | +3%/quarter |
| Storage | 58% | 62% | +12%/quarter |
| Connections | 35% | 68% | Stable |

## Growth Analysis

### Storage
- Current: 1.16TB of 2TB (58%)
- 90-day growth: 180GB
- Daily average: 2GB
- Projected Q1 2025: 1.34TB (67%)
- Projected exhaustion: August 2025

### CPU
- Peak utilization during month-end processing
- Year-end expected to add 15-20% additional load

## Recommendations

1. **Short-term (Q1 2025):**
   - Optimize top 10 queries (estimated 15% efficiency gain)
   - Implement table partitioning for largest tables

2. **Medium-term (Q2 2025):**
   - Expand storage to 4TB
   - Budget: $150/month additional

3. **Long-term (Q4 2025):**
   - Upgrade primary to 16 vCPU, 128GB RAM
   - Convert to reserved instance (3-year)
   - Budget: $1,200/month increase

## Approval
- [ ] Product Management - Growth assumptions
- [ ] Engineering - Technical approach
- [ ] Finance - Budget allocation
```

Store capacity plans in version control alongside infrastructure-as-code. This creates an audit trail showing how plans evolved, which assumptions proved accurate, and how decisions were made over time.
Capacity planning transforms database operations from reactive firefighting to proactive management. By understanding current usage, projecting future needs, and implementing changes before crises occur, DBAs ensure databases scale gracefully with business growth.
Key Takeaways:
You have completed the DBA Responsibilities module, covering Installation and Configuration, Performance Monitoring, Security Management, Backup and Recovery, and Capacity Planning. These five core responsibilities form the foundation of professional database administration. Mastering them ensures databases operate reliably, securely, and efficiently throughout their operational lifecycle.