In the taxonomy of performance bottlenecks, the database occupies a special place of infamy. More production outages, slow requests, and scaling crises trace back to the database than to any other single component.
This isn't because databases are poorly engineered—modern databases are marvels of computer science. It's because databases sit at the center of nearly every system, handling the most precious resource: data. Every user action, every business transaction, every piece of state flows through the database layer. When the database struggles, the entire system struggles.
Understanding database bottlenecks is arguably the single most valuable performance skill for a backend engineer. This page provides an exhaustive framework for identifying, diagnosing, and addressing the full spectrum of database-related performance problems.
By the end of this page, you will understand the five major categories of database bottlenecks: query performance, connection management, lock contention, replication lag, and resource exhaustion. You'll learn diagnostic techniques for each category and understand when to optimize versus when to scale.
Before diving into specifics, let's establish the landscape. Database bottlenecks fall into five distinct categories, each with different symptoms, diagnostic approaches, and solutions:
| Category | Primary Symptom | Root Cause | Typical Resolution |
|---|---|---|---|
| Query Performance | Slow individual queries (100ms - 10s+) | Missing indexes, table scans, inefficient joins | Query optimization, indexing |
| Connection Management | Connection timeouts, pool exhaustion | Too few connections, connection leaks | Pool sizing, connection lifecycle management |
| Lock Contention | Queries block each other, deadlocks | Concurrent writes to same resources | Transaction design, isolation levels |
| Replication Lag | Stale reads from replicas, inconsistency | Writes outpace replica sync speed | Write reduction, schema optimization |
| Resource Exhaustion | CPU/memory/disk saturation on DB server | Workload exceeds server capacity | Vertical scaling, sharding, caching |
These categories aren't mutually exclusive—a system can suffer from multiple simultaneously. However, they require different diagnostic approaches and yield to different solutions. Correctly categorizing your bottleneck is the first step toward resolving it.
The diagnostic priority:
When investigating database issues, work through these categories in order of likelihood; for most systems that means the order they appear in the table above.
In most systems, 80% of database performance problems stem from query performance issues—missing indexes, full table scans, N+1 queries. Start there. Sophisticated solutions like sharding or read replicas are only necessary after query-level optimization is exhausted.
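If query performance is the first stop, a quick triage is to pull the most expensive statements from pg_stat_statements. Below is a minimal sketch, assuming PostgreSQL with the pg_stat_statements extension enabled and psycopg2 installed; the DSN is a placeholder:

```python
# Hypothetical triage helper: list the queries consuming the most total time.
import psycopg2

DSN = "postgresql://user:pass@db-host:5432/mydb"  # placeholder connection string

def top_queries(limit: int = 10):
    """Return the heaviest queries by cumulative execution time."""
    sql = """
        SELECT query, calls, total_time, mean_time
        FROM pg_stat_statements
        ORDER BY total_time DESC
        LIMIT %s
    """  # PostgreSQL 13+ renames these columns to total_exec_time / mean_exec_time
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(sql, (limit,))
            return cur.fetchall()

if __name__ == "__main__":
    for query, calls, total, mean in top_queries():
        print(f"{total:10.1f} ms total | {calls:8d} calls | {mean:8.2f} ms avg | {query[:80]}")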
Query performance problems are the bread and butter of database optimization. A single poorly-indexed query can bring a production system to its knees. Understanding why queries are slow and how to fix them is essential knowledge.
Common Query Performance Anti-Patterns:
- Unbounded result sets: SELECT * FROM logs with no WHERE clause or LIMIT can crash applications by pulling an entire table into memory.
- Functions on indexed columns: WHERE YEAR(created_at) = 2024 can't use an index on created_at; the database must scan all rows and apply the function. Rewrite it as a range (created_at >= '2024-01-01' AND created_at < '2025-01-01') so the index can be used.

Using EXPLAIN to Diagnose Query Performance:
The EXPLAIN (or EXPLAIN ANALYZE) command is your primary diagnostic tool. It shows exactly how the database will execute your query—what indexes it will use, how many rows it expects to scan, and where time is spent.
```sql
-- =====================================================
-- BASIC EXPLAIN: Shows query execution plan
-- =====================================================

EXPLAIN SELECT * FROM orders WHERE user_id = 12345;

-- PostgreSQL output example:
--                        QUERY PLAN
-- ----------------------------------------------------------
--  Index Scan using idx_orders_user_id on orders
--    (cost=0.43..8.45 rows=1 width=100)
--    Index Cond: (user_id = 12345)

-- KEY METRICS TO EXAMINE:
-- 1. Scan type: Index Scan (good) vs Seq Scan (usually bad)
-- 2. Cost: Lower is better; first number is startup cost
-- 3. Rows: Estimated rows to process
-- 4. Width: Bytes per row (SELECT * = wide)

-- =====================================================
-- EXPLAIN ANALYZE: Actually runs the query with timing
-- =====================================================

EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 12345 AND status = 'pending';

-- PostgreSQL output example:
--  Index Scan using idx_orders_user_id on orders
--    (cost=0.43..8.45 rows=1 width=100)
--    (actual time=0.025..0.027 rows=3 loops=1)
--    Index Cond: (user_id = 12345)
--    Filter: (status = 'pending')
--    Rows Removed by Filter: 47
--  Planning Time: 0.102 ms
--  Execution Time: 0.045 ms

-- CRITICAL OBSERVATIONS:
-- 1. "Filter: ..." means post-index filtering (wasteful)
-- 2. "Rows Removed by Filter: 47" means index isn't selective enough
-- 3. A composite index on (user_id, status) would be more efficient

-- =====================================================
-- RED FLAGS IN EXPLAIN OUTPUT
-- =====================================================

-- 1. SEQUENTIAL SCAN on large table
EXPLAIN SELECT * FROM orders WHERE email = 'user@example.com';
-- → Seq Scan on orders (cost=0.00..145782.00 rows=1 width=100)
-- FIX: Add index on email column

-- 2. NESTED LOOP with high row estimates
EXPLAIN SELECT * FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.created_at > '2024-01-01';
-- → Nested Loop (cost=...) (actual rows=5000000)
-- FIX: Ensure both sides of join have appropriate indexes

-- 3. SORT operation (expensive for large sets)
EXPLAIN SELECT * FROM products ORDER BY price DESC LIMIT 10;
-- → Sort (cost=12345.67...) Sort Method: external merge
-- "external merge" means data didn't fit in memory
-- FIX: Index on (price DESC) for top-N queries

-- 4. High ROWS REMOVED BY FILTER
-- Indicates index is being used but isn't selective
-- Consider composite index or different query structure
```
Indexing Strategies for Query Performance:
Indexes are the primary tool for query optimization. Understanding when and how to create effective indexes is crucial:
| Index Type | Best For | Trade-offs | Example |
|---|---|---|---|
| B-Tree (default) | Equality and range queries, sorting | General purpose; slower writes | idx ON orders(created_at) |
| Hash Index | Exact equality lookups only | Can't do ranges; rarely used | idx ON users USING hash(email) |
| Composite/Compound | Queries filtering on multiple columns | Column order matters (leftmost prefix) | idx ON orders(user_id, status, created_at) |
| Partial/Filtered | Queries that always filter on a condition | Smaller, faster; only indexes subset | idx ON orders(id) WHERE status = 'pending' |
| Covering | Queries that only need indexed columns | Avoids table lookup entirely | idx ON orders(user_id) INCLUDE (total, status) |
| GiST/GIN | Full-text search, arrays, JSON, geo | Specialized; slower to build | idx ON products USING gin(tags) |
```sql
-- =====================================================
-- COMPOSITE INDEX: Column order matters!
-- =====================================================

-- Query pattern: Find orders by user, then filter by status
SELECT * FROM orders
WHERE user_id = 123 AND status = 'pending'
ORDER BY created_at DESC;

-- GOOD: Matches query pattern (leftmost prefix rule)
CREATE INDEX idx_orders_user_status_created
ON orders(user_id, status, created_at DESC);
-- This index supports:
-- - WHERE user_id = ?
-- - WHERE user_id = ? AND status = ?
-- - WHERE user_id = ? AND status = ? ORDER BY created_at

-- BAD: Wrong column order
CREATE INDEX idx_orders_status_user
ON orders(status, user_id);
-- Cannot efficiently satisfy: WHERE user_id = 123
-- (would need to scan all statuses)

-- =====================================================
-- PARTIAL INDEX: Index only what you query
-- =====================================================

-- You only ever query pending orders
SELECT * FROM orders WHERE status = 'pending' AND user_id = 123;

-- Instead of indexing all orders:
CREATE INDEX idx_orders_pending
ON orders(user_id)
WHERE status = 'pending';

-- Benefits:
-- 1. Much smaller index (only pending orders)
-- 2. Cheaper to maintain (orders that leave 'pending' drop out of the index)
-- 3. Query planner uses it for matching queries

-- =====================================================
-- COVERING INDEX: Never touch the table
-- =====================================================

-- Query only needs user_id and total
SELECT user_id, total FROM orders WHERE user_id = 123;

-- Covering index includes queried columns
CREATE INDEX idx_orders_user_covering
ON orders(user_id)
INCLUDE (total, status, created_at);

-- Database reads ONLY from index, never accesses table
-- Massive performance improvement for read-heavy queries

-- =====================================================
-- INDEX ANALYSIS: Find missing and unused indexes
-- =====================================================

-- PostgreSQL: Find slow queries (pg_stat_statements extension)
-- (PostgreSQL 13+ renames these columns to total_exec_time / mean_exec_time)
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;

-- PostgreSQL: Find missing indexes (tables with seq scans)
SELECT relname, seq_scan, idx_scan,
       seq_scan - idx_scan AS delta
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY delta DESC;

-- PostgreSQL: Find unused indexes (candidates for removal)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE 'pg_%';
```
Every index speeds up reads but slows down writes. Each INSERT, UPDATE, and DELETE must update all affected indexes. A table with 20 indexes will have much slower write performance than one with 5. Only create indexes for actual query patterns, and periodically audit for unused indexes.
The N+1 Query Problem in Depth:
The N+1 problem deserves special attention because it's ubiquitous in applications using ORMs, and it can turn microsecond operations into seconds of latency:
```python
# =====================================================
# THE N+1 PROBLEM: Death by a thousand queries
# =====================================================

# ANTI-PATTERN: N+1 queries
def get_orders_with_items_bad(user_id: int):
    # 1 query to get orders
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        # N queries to get items (one per order!)
        items = db.query("SELECT * FROM order_items WHERE order_id = ?", order.id)
        order.items = items
    return orders

# If user has 100 orders: 1 + 100 = 101 queries
# Each query has network latency (~1ms) = 101ms minimum
# At 1000 orders: 1+ seconds just in query overhead

# SOLUTION 1: JOIN (single query)
def get_orders_with_items_join(user_id: int):
    return db.query("""
        SELECT o.*, oi.*
        FROM orders o
        LEFT JOIN order_items oi ON o.id = oi.order_id
        WHERE o.user_id = ?
    """, user_id)
# 1 query regardless of order count
# Trade-off: Duplicate order data per item row

# SOLUTION 2: Batch loading (two queries)
def get_orders_with_items_batch(user_id: int):
    # Query 1: Get all orders
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    order_ids = [o.id for o in orders]

    # Query 2: Get all items for all orders at once
    items = db.query(
        "SELECT * FROM order_items WHERE order_id IN (?)",
        order_ids
    )

    # Group items by order in application
    items_by_order = group_by(items, 'order_id')
    for order in orders:
        order.items = items_by_order.get(order.id, [])
    return orders
# Always exactly 2 queries, regardless of order count

# SOLUTION 3: ORM eager loading (framework-specific)
# SQLAlchemy example
def get_orders_with_items_eager(user_id: int):
    return (
        session.query(Order)
        .options(joinedload(Order.items))  # or subqueryload()
        .filter(Order.user_id == user_id)
        .all()
    )
# ORM handles the batching automatically

# =====================================================
# DETECTING N+1 QUERIES
# =====================================================

# 1. Query logging - look for repeated similar queries
# 2. APM tools (New Relic, Datadog) show query count per request
# 3. Database: pg_stat_statements shows query call counts
# 4. Rule of thumb: >10 queries per API request is suspicious
```
Database connections are expensive resources. Each connection consumes memory on both client and server, and establishing new connections involves multiple network round-trips plus authentication overhead. Poor connection management is a silent scalability killer.
Why Connections Are Expensive:
In PostgreSQL, for example, each connection is a dedicated server process that consumes several megabytes of memory, which is why max_connections is kept deliberately low. A database server with 200 max connections serving 50 application instances means only 4 connections per instance. Exceed this, and applications block waiting for connections.
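A back-of-the-envelope helper makes the constraint explicit. This is a sketch of the arithmetic only; the instance count, max_connections, and admin headroom are assumptions you would replace with your own numbers:

```python
def per_instance_pool_size(max_db_connections: int,
                           app_instances: int,
                           reserved_for_admin: int = 5) -> int:
    """Split the database's connection budget across app instances,
    keeping a few connections free for migrations and ad-hoc admin access."""
    usable = max_db_connections - reserved_for_admin
    return max(1, usable // app_instances)

# Example from the text: 200 max connections, 50 instances
# -> (200 - 5) // 50 = 3 connections per instance pool, with admin headroom
print(per_instance_pool_size(200, 50))  # 3
```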
Connection Pool Fundamentals:
Connection pools maintain a set of pre-established connections that are reused across requests. This eliminates connection establishment overhead and bounds resource usage.
```python
# =====================================================
# CONNECTION POOL CONFIGURATION PRINCIPLES
# =====================================================

# SIZING A CONNECTION POOL
#
# Key insight: Optimal pool size is usually SMALLER than you think.
# Formula from PostgreSQL wiki:
#   pool_size = ((core_count * 2) + effective_spindle_count)
#
# For a typical DB server with 8 cores and SSD:
#   pool_size = (8 * 2) + 1 = 17 connections
#
# For application-side pools (many app instances):
#   per_instance_pool = max_db_connections / app_instance_count

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Production connection pool configuration
engine = create_engine(
    "postgresql://user:pass@db-host:5432/mydb",

    # Core pool parameters
    poolclass=QueuePool,
    pool_size=10,        # Steady-state connections maintained
    max_overflow=5,      # Additional connections under load
                         # Total max = pool_size + max_overflow = 15

    # Connection lifecycle
    pool_timeout=30,     # Seconds to wait for available connection
    pool_recycle=1800,   # Recycle connections after 30 minutes
    pool_pre_ping=True,  # Verify connection liveness before use
)

# =====================================================
# DIAGNOSING CONNECTION ISSUES
# =====================================================

# Symptom: Connection timeout errors
# Possible causes:
# 1. Pool exhausted - queries holding connections too long
# 2. Connection leak - connections not returned to pool
# 3. Pool too small for load
# 4. Database max_connections reached

# PostgreSQL: Check current connections
#   SELECT count(*) FROM pg_stat_activity;
#   SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename;

# PostgreSQL: Check connection state
#   SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
#   States: active, idle, idle in transaction, idle in transaction (aborted)

# =====================================================
# CONNECTION LEAK DETECTION
# =====================================================

# A connection leak occurs when code acquires a connection
# but fails to return it (often due to unhandled exceptions)

# ANTI-PATTERN: Connection leak
def bad_query():
    conn = pool.get_connection()
    result = conn.execute("SELECT * FROM users")
    # What if execute() throws an exception?
    # Connection is never returned to pool!
    return result  # Connection implicitly returned... sometimes

# CORRECT: Always use context managers
def good_query():
    with pool.connect() as conn:
        result = conn.execute("SELECT * FROM users")
        return result
    # Connection ALWAYS returned, even if exception occurs

# SQLAlchemy scoped sessions (for web frameworks)
from sqlalchemy.orm import scoped_session, sessionmaker

Session = scoped_session(sessionmaker(bind=engine))

# In request handler:
def handle_request():
    try:
        result = Session.query(User).all()
        return result
    finally:
        Session.remove()  # Return connection to pool

# =====================================================
# MONITORING CONNECTION POOL HEALTH
# =====================================================

# Key metrics to monitor:
# 1. Pool utilization: connections_in_use / pool_size
#    - > 80% sustained: consider increasing pool
#    - 100% with waiters: pool exhaustion
#
# 2. Connection wait time: time blocked waiting for connection
#    - > 100ms average: pool pressure
#    - Spikes correlate with slow queries holding connections
#
# 3. Connection churn: connections created/destroyed per second
#    - High churn: pool too small or connection leaks
#
# 4. Idle connections: connections sitting unused
#    - Many idle in small pool: possibly wasted resources
```
Connection Poolers: PgBouncer and ProxySQL
For high-scale deployments, external connection poolers provide additional capabilities:
| Feature | PgBouncer (PostgreSQL) | ProxySQL (MySQL) |
|---|---|---|
| Pooling Modes | Session, Transaction, Statement | Connection multiplexing |
| Resource Usage | ~2KB per connection | ~4KB per connection |
| Connection Limit | Can handle 10,000+ client connections | Similar scale |
| Query Routing | Read/write splitting | Advanced query routing, caching |
| Typical Use | Many app instances → single pooler → database | Proxy layer for routing, failover |
| Transaction Mode | Connections returned after each transaction | Connection multiplexing |
PgBouncer's 'transaction mode' returns connections to the pool after each transaction, enabling massive connection multiplexing. However, it breaks some PostgreSQL features like prepared statements, LISTEN/NOTIFY, and session-level settings. Always test your application's compatibility.
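One common setup is to let PgBouncer own the pooling and keep the application-side pool minimal. Here is a sketch using SQLAlchemy's NullPool so the application does not pool on top of PgBouncer; the host, port (6432 is PgBouncer's conventional default), and credentials are placeholders:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Point the application at PgBouncer instead of PostgreSQL directly.
# With PgBouncer in transaction mode, avoid session state (SET, temp tables,
# LISTEN/NOTIFY) and let PgBouncer do the pooling: NullPool opens and closes a
# client connection per checkout, which PgBouncer multiplexes cheaply.
engine = create_engine(
    "postgresql+psycopg2://app_user:app_pass@pgbouncer-host:6432/mydb",
    poolclass=NullPool,
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```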
Databases use locks to ensure data consistency during concurrent access. When multiple transactions compete for the same resources, they must wait for each other—this is lock contention. In high-concurrency systems, lock contention can become a major bottleneck, serializing operations that should run in parallel.
Types of Database Locks:
| Lock Type | Scope | Blocks | Common Cause |
|---|---|---|---|
| Row-level lock | Single row | Other writes to same row | UPDATE, DELETE on specific row |
| Range lock | Range of rows | Inserts in range, reads depending on isolation | Range queries with locking |
| Table-level lock | Entire table | All operations on table | DDL, LOCK TABLE, some MySQL operations |
| Page-level lock | Database page (block) | Operations on same page | Some databases and storage engines (e.g., SQL Server); can cause hotspots |
| Advisory lock | Application-defined | Other advisory lock requests | Application-level mutual exclusion |
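The advisory-lock row above deserves a concrete example: PostgreSQL exposes pg_try_advisory_lock / pg_advisory_unlock for application-defined mutual exclusion without locking any table rows. A minimal sketch with psycopg2; the DSN and lock key are placeholders:

```python
import psycopg2

DSN = "postgresql://user:pass@db-host:5432/mydb"  # placeholder
REPORT_LOCK_KEY = 42  # arbitrary application-chosen 64-bit key

def run_nightly_report_once():
    """Use a PostgreSQL advisory lock so only one app instance runs the job."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_try_advisory_lock(%s)", (REPORT_LOCK_KEY,))
            if not cur.fetchone()[0]:
                print("Another instance holds the lock; skipping.")
                return
            try:
                # ... do the exclusive work here ...
                print("Running report")
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (REPORT_LOCK_KEY,))
    finally:
        conn.close()
```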
Diagnosing Lock Contention:
```sql
-- =====================================================
-- POSTGRESQL: Finding Lock Contention
-- =====================================================

-- View current locks and what's waiting
SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_query,
    blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Simpler view: long-running queries (potential lock holders)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
  AND state = 'active';

-- Check for "idle in transaction" connections (holding locks)
SELECT pid, usename, state, query,
       now() - xact_start AS transaction_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY transaction_duration DESC;

-- =====================================================
-- MYSQL: Finding Lock Contention
-- =====================================================
-- NOTE: These INFORMATION_SCHEMA views are MySQL 5.7; in MySQL 8.0 they were
-- removed in favor of performance_schema.data_locks and data_lock_waits.

-- Current InnoDB locks
SELECT * FROM information_schema.INNODB_LOCKS;

-- Lock waits
SELECT * FROM information_schema.INNODB_LOCK_WAITS;

-- Combined view: who is blocking whom
SELECT
    r.trx_id AS waiting_trx_id,
    r.trx_mysql_thread_id AS waiting_thread,
    r.trx_query AS waiting_query,
    b.trx_id AS blocking_trx_id,
    b.trx_mysql_thread_id AS blocking_thread,
    b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;
```
Common Lock Contention Patterns and Solutions:
```python
# =====================================================
# STRATEGIES TO REDUCE LOCK CONTENTION
# =====================================================

# 1. SHORTEN TRANSACTION DURATION
# =====================================

# ANTI-PATTERN: External calls inside transaction
def place_order_bad(order_data):
    with db.transaction():
        order = create_order(order_data)
        # Payment API call takes 2-3 seconds
        # Transaction holds locks the whole time!
        payment = payment_api.charge(order.total)
        order.payment_id = payment.id
        commit()

# CORRECT: External calls outside transaction
def place_order_good(order_data):
    # Create order in transaction
    with db.transaction():
        order = create_order(order_data)
        order.status = 'pending_payment'
        commit()

    # External call outside transaction (no locks held)
    try:
        payment = payment_api.charge(order.total)
    except PaymentError:
        with db.transaction():
            order.status = 'payment_failed'
            commit()
        raise

    # Update order in new transaction
    with db.transaction():
        order.payment_id = payment.id
        order.status = 'paid'
        commit()

# 2. OPTIMISTIC LOCKING (for read-modify-write)
# =====================================

# Instead of SELECT FOR UPDATE (pessimistic)
def update_balance_pessimistic(user_id, amount):
    with db.transaction():
        # Locks row until transaction ends
        account = db.query(
            "SELECT * FROM accounts WHERE user_id = ? FOR UPDATE",
            user_id
        )
        account.balance += amount
        commit()

# Use version-based optimistic locking
def update_balance_optimistic(user_id, amount):
    while True:
        account = db.query(
            "SELECT * FROM accounts WHERE user_id = ?", user_id
        )
        new_balance = account.balance + amount
        new_version = account.version + 1

        # Conditional update - fails if version changed
        result = db.execute("""
            UPDATE accounts
            SET balance = ?, version = ?
            WHERE user_id = ? AND version = ?
        """, new_balance, new_version, user_id, account.version)

        if result.rows_affected == 1:
            return  # Success
        # Retry if version mismatch (concurrent update)

# 3. DISTRIBUTED COUNTERS (for hot row counters)
# =====================================

# ANTI-PATTERN: Single counter row
#   UPDATE stats SET page_views = page_views + 1 WHERE page_id = 1;
#   Every request contends for same row

# SOLUTION: Sharded counters
def increment_page_view(page_id):
    # Spread across N shard rows
    shard_id = random.randint(1, 10)
    db.execute("""
        INSERT INTO page_view_shards (page_id, shard_id, count)
        VALUES (?, ?, 1)
        ON CONFLICT (page_id, shard_id)
        DO UPDATE SET count = count + 1
    """, page_id, shard_id)

def get_page_views(page_id):
    # Sum all shards when reading
    return db.query("""
        SELECT SUM(count) FROM page_view_shards WHERE page_id = ?
    """, page_id)
```
Lower isolation levels (READ COMMITTED vs SERIALIZABLE) reduce lock contention but allow more anomalies. Most web applications work fine with READ COMMITTED. Only escalate to SERIALIZABLE for financial transactions or when correctness absolutely requires it.
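If you do escalate to SERIALIZABLE, be prepared for the database to abort conflicting transactions, which the application must retry. Below is a minimal sketch with psycopg2, assuming conflicts surface as SerializationFailure (SQLSTATE 40001); the accounts table and DSN are placeholders:

```python
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

DSN = "postgresql://user:pass@db-host:5432/mydb"  # placeholder

def transfer(conn, from_id, to_id, amount, max_retries=3):
    """Run a transfer under SERIALIZABLE, retrying on serialization conflicts."""
    conn.set_session(isolation_level=ISOLATION_LEVEL_SERIALIZABLE)
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                            (amount, from_id))
                cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                            (amount, to_id))
            conn.commit()
            return
        except errors.SerializationFailure:
            conn.rollback()  # conflict detected by the database; retry
    raise RuntimeError("transfer failed after retries")
```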
In replicated database setups (primary-replica or leader-follower), replicas apply the primary's changes with some delay. This delay—replication lag—causes stale reads and consistency issues.
Why Replication Lag Occurs:
Replication is usually asynchronous: the primary streams its write-ahead log (or binlog) to each replica, which must receive and replay it. When writes outpace replay speed, whether from write bursts, large transactions, long-running queries on the replica, or under-provisioned replica hardware, lag grows.
Measuring Replication Lag:
```sql
-- =====================================================
-- POSTGRESQL: Measuring lag on replica
-- =====================================================

-- Lag in bytes (how far behind in WAL)
SELECT pg_wal_lsn_diff(
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn()
) AS lag_bytes;

-- Lag in time (requires logical replication or application tracking)
-- On primary: write timestamp to tracking table every N seconds
-- On replica: read timestamp and compute difference

-- Check replication slot lag (on primary)
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;

-- =====================================================
-- MYSQL: Measuring lag on replica
-- =====================================================

SHOW SLAVE STATUS\G
-- (MySQL 8.0.22+: SHOW REPLICA STATUS; the field is Seconds_Behind_Source)

-- Key field: Seconds_Behind_Master
--   0 = caught up
--   NULL = replication not running
--   >0 = seconds of lag

-- More accurate with heartbeat tables:
-- Primary writes timestamp every second
-- Replica reads and computes difference

-- =====================================================
-- APPLICATION-LEVEL LAG TRACKING
-- =====================================================

-- More reliable than DB-level metrics for application consistency

-- Pattern: Write version tag, read version tag
-- 1. Application writes to primary with version token
-- 2. Application reads from replica, checks if version present
-- 3. If version not present, lag hasn't caught up -> route to primary
```
Handling Replication Lag in Application Design:
```python
# =====================================================
# STRATEGIES FOR REPLICATION LAG TOLERANCE
# =====================================================

# 1. READ-YOUR-WRITES CONSISTENCY
# =====================================
# After writing, user's subsequent reads go to primary

class DatabaseRouter:
    def __init__(self, primary_conn, replica_conn):
        self.primary = primary_conn
        self.replica = replica_conn
        # Track which users recently wrote
        self.write_timestamps = {}  # user_id -> timestamp
        self.write_window = 5       # seconds

    def execute_write(self, user_id, query):
        result = self.primary.execute(query)
        self.write_timestamps[user_id] = time.time()
        return result

    def execute_read(self, user_id, query):
        # If user wrote recently, read from primary
        last_write = self.write_timestamps.get(user_id, 0)
        if time.time() - last_write < self.write_window:
            return self.primary.execute(query)
        else:
            return self.replica.execute(query)

# 2. CAUSAL CONSISTENCY WITH VERSION TOKENS
# =====================================

class VersionedDatabaseRouter:
    """
    Track logical version; route reads to primary until
    replica catches up to required version.
    """
    def execute_write(self, query) -> int:
        result = self.primary.execute(query)
        # Return current primary version (e.g., LSN or sequence)
        return self.primary.execute("SELECT current_version()")

    def execute_read(self, query, min_version: int = None):
        if min_version is None:
            return self.replica.execute(query)

        # Check if replica has caught up
        replica_version = self.replica.execute("SELECT current_version()")
        if replica_version >= min_version:
            return self.replica.execute(query)
        else:
            # Replica behind; fall back to primary
            return self.primary.execute(query)

# 3. APPLICATION-LEVEL EVENTUAL CONSISTENCY
# =====================================
# Accept that reads may be stale; design for it

def update_user_profile(user_id, new_data):
    # Write to primary
    db.primary.execute(
        "UPDATE users SET data = ? WHERE id = ?", new_data, user_id
    )
    # Also update cache immediately
    # Cache serves as "read-your-writes" buffer
    cache.set(f"user:{user_id}", new_data, ttl=60)

def get_user_profile(user_id):
    # Try cache first (includes recent writes)
    cached = cache.get(f"user:{user_id}")
    if cached:
        return cached
    # Fall back to replica (may be stale)
    return db.replica.execute(
        "SELECT * FROM users WHERE id = ?", user_id
    )

# 4. LAG-AWARE ROUTING
# =====================================
# Route to least-lagged replica

class LagAwareRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def get_healthiest_replica(self, max_lag_seconds=2):
        """Return replica with lag under threshold."""
        best = None
        best_lag = float('inf')
        for replica in self.replicas:
            lag = replica.get_lag_seconds()
            if lag < max_lag_seconds and lag < best_lag:
                best = replica
                best_lag = lag
        # Fall back to primary if all replicas lagged
        return best or self.primary
```
Some operations cannot tolerate stale reads: checking account balances before transfers, verifying inventory before purchasing, reading permissions before granting access. For these, always route to the primary database, regardless of lag.
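For those must-be-fresh paths, some teams make the routing decision explicit at the call site rather than relying on heuristics. Here is a sketch of a decorator that pins a function's reads to the primary; the Router class and its use_primary() context manager are hypothetical stand-ins for whatever routing layer you use:

```python
import functools
from contextlib import contextmanager

class Router:
    """Hypothetical routing layer with a primary and a replica connection."""
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self._force_primary = False

    @contextmanager
    def use_primary(self):
        # Temporarily route every read issued inside the block to the primary
        previous, self._force_primary = self._force_primary, True
        try:
            yield
        finally:
            self._force_primary = previous

    def read(self, query, *params):
        target = self.primary if self._force_primary else self.replica
        return target.execute(query, *params)

router = Router(primary=..., replica=...)  # placeholders for real connections

def reads_from_primary(func):
    """Decorator: run the wrapped function with reads pinned to the primary."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with router.use_primary():
            return func(*args, **kwargs)
    return wrapper

@reads_from_primary
def get_account_balance(account_id):
    # Balance checks must never see a stale replica
    return router.read("SELECT balance FROM accounts WHERE id = ?", account_id)
```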
Sometimes the database server's physical resources—CPU, memory, disk, or network—become the bottleneck. Often, though, resource saturation is just the visible symptom of query or design problems that should be addressed first.
| Resource | Symptoms | Common Causes | Solutions |
|---|---|---|---|
| CPU | High CPU%, slow queries | Complex queries, index-less scans, sorting large sets | Query optimization, add indexes, scale vertically |
| Memory | Swapping, OOM kills, slow buffer cache | Large working set, ineffective indexes | Increase RAM, optimize queries to read less data |
| Disk IOPS | High disk wait, slow random reads | Many index lookups, random access patterns | SSDs, covering indexes, query optimization |
| Disk Space | Out of space errors | Bloat, large tables, excessive logging | VACUUM (PostgreSQL), archive old data, compression |
| Network | Saturated network, high latency | Large result sets, many connections | Paginate results, connection pooling, read replicas |
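The disk-space row mentions bloat and VACUUM; in PostgreSQL you can watch dead-tuple counts per table to see whether autovacuum is keeping up. A minimal sketch with psycopg2 (the DSN is a placeholder):

```python
import psycopg2

DSN = "postgresql://user:pass@db-host:5432/mydb"  # placeholder

# Tables with many dead tuples relative to live ones are candidates for
# (auto)vacuum tuning or a manual VACUUM; sustained growth here eats disk.
SQL = """
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 15
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(SQL)
        for relname, live, dead, last_av in cur.fetchall():
            print(f"{relname:30s} live={live:>10d} dead={dead:>10d} last_autovacuum={last_av}")
```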
When to Scale vs When to Optimize:
A common mistake is to vertically scale the database (bigger server) without first exhausting optimization opportunities. Use this decision framework: profile the slowest queries with EXPLAIN, add the missing indexes, eliminate N+1 patterns, right-size connection pools, shorten transactions that hold locks, and cache the hottest reads.
Only after these optimizations are exhausted should you consider vertical scaling (a larger server), read replicas to offload reads, or sharding.
The reason: optimization often yields 10-100x improvements. Scaling yields 2-4x. And scaling is expensive, complex, and often introduces new problems (distributed transactions, data locality).
Before scaling infrastructure, ensure you can't get a 10x improvement through optimization. A missing index can make a query 1000x slower. Fixing it costs nothing. Buying a 10x bigger server costs 10x more in infrastructure bills—forever.
Database bottlenecks are the most common performance constraint in data-driven systems. Mastering their diagnosis and resolution is essential for any backend engineer working at scale.
What's Next:
With database bottlenecks covered, we'll examine network bottlenecks in the next page. You'll learn about bandwidth saturation, latency amplification, connection overhead, and the subtle ways network issues manifest as application slowness.
You now have a comprehensive framework for diagnosing database bottlenecks. From slow queries to connection pool exhaustion to locking issues, you can systematically identify and address the most common performance constraints in data-driven systems.