Every successful MySQL deployment eventually hits the wall.
The database that started as a simple PostgreSQL alternative—reliable, fast, well-understood—gradually becomes a bottleneck. First, you add read replicas to handle read traffic. Then you implement application-level sharding, splitting users across multiple database instances. Before long, your application is entangled with routing logic, cross-shard joins are impossible, and your infrastructure team spends more time managing databases than building features.
This is the scaling wall—the point where the single-node architecture of traditional databases becomes the limiting factor for your system's growth.
TiDB was purpose-built to eliminate this wall. By distributing data automatically across a cluster of nodes, TiDB scales reads and writes horizontally—just add more servers. The application sees a single logical database while TiDB handles the complexity of distribution, replication, and load balancing.
In this page, we'll dive deep into how TiDB achieves horizontal scalability: the architecture, the algorithms, and the operational practices that make it possible.
By the end of this page, you will understand TiDB's distributed storage architecture (TiKV and regions), how Raft consensus ensures data durability and consistency, how data is automatically distributed and rebalanced across nodes, and the practical implications for capacity planning and scaling operations.
At the heart of TiDB's scalability is TiKV—a distributed transactional key-value store. While TiDB handles SQL parsing and optimization, TiKV handles data storage, replication, and distributed transactions. Understanding TiKV is essential to understanding how TiDB scales.
Key-Value Foundation:
TiKV stores all data as key-value pairs. Table rows are encoded into keys using a specific format:
Table Row: t{table_id}_r{row_id} → {encoded_row_data}
Index Entry: t{table_id}_i{index_id}_{index_values} → {row_id}
For example, a row in a users table with id=1001 might be stored as:
Key: t45_r1001
Value: {"email":"alice@example.com","username":"alice",...}
This key encoding has important properties: all of a table's rows are contiguous in the key space and sorted by primary key, so primary-key range scans become contiguous key-range scans, and each index's entries are likewise clustered together under their own prefix.
RocksDB as the Storage Engine:
Each TiKV node uses RocksDB as its local storage engine. RocksDB is a high-performance embedded key-value store (originally developed by Facebook) optimized for SSDs and modern hardware:
TiKV DATA ORGANIZATION

Table encoding example (users table, table_id = 45)

Row data (ordered by primary key):
t45_r1    → {"email":"alice@email.com","username":"alice","status":"active"}
t45_r2    → {"email":"bob@email.com","username":"bob","status":"active"}
t45_r3    → {"email":"carol@email.com","username":"carol","status":"pending"}
...
t45_r1000 → {"email":"user1000@email.com","username":"user1000","status":"active"}

Secondary index (idx_username on users.username, index_id = 1):
t45_i1_alice → 1   (points to row 1)
t45_i1_bob   → 2   (points to row 2)
t45_i1_carol → 3   (points to row 3)
...

Secondary index (idx_status on users.status, index_id = 2):
t45_i2_active_1  → 1
t45_i2_active_2  → 2
t45_i2_pending_3 → 3
...

TiKV node structure

Each TiKV node hosts many regions (portions of the key space), some as Raft leaders and some as followers, for example Region 1 (leader, t1_r1 - t1_r99), Region 2 (follower, t45_r500 - t45_r599), and Region N (leader, t100_r1 - t100_r50). All regions on a node share a single local RocksDB instance: writes land in the memtable (write buffer) and are later flushed and compacted into SST files across levels L0 - L4.

Each TiKV node hosts multiple regions; each region is independently replicated via Raft.

Why Key-Value for a Relational Database?
Using a key-value foundation for a SQL database might seem odd, but it provides critical advantages for distribution: a sorted key space can be cut into ranges at any boundary, each range can be replicated, moved, and split independently, and the storage layer never needs to understand SQL semantics.
The SQL layer (TiDB) handles the translation between relational semantics and key-value operations. From the application's perspective, it's a SQL database. Under the hood, it's a distributed key-value store with transactional guarantees.
TiKV is designed as an independent project and can be used without TiDB. Companies use TiKV directly for applications that need a distributed key-value store with transactions but don't require SQL. This separation of concerns makes both systems more focused and maintainable.
The key innovation in TiDB's scalability is the Region—the fundamental unit of data distribution, replication, and scheduling. Understanding regions is essential to understanding how TiDB scales.
What is a Region?
A region is a contiguous range of keys. By default, a region holds roughly 96MB of data; when it grows past the split threshold, it is split into two regions.
For example, a table with 10GB of data might be divided into ~100 regions, each covering a portion of the key range:
Region 1: t45_r1 to t45_r10000
Region 2: t45_r10001 to t45_r20000
Region 3: t45_r20001 to t45_r30000
...
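If you want to see how TiDB has actually carved a table into regions, the SHOW TABLE ... REGIONS statement lists each region's key range, leader, and approximate size. A minimal sketch, reusing the users table from the encoding example; the exact output columns vary by version:

-- List the regions backing a table, with their key ranges and leaders
SHOW TABLE users REGIONS;

-- Typical columns include REGION_ID, START_KEY, END_KEY, LEADER_STORE_ID,
-- and APPROXIMATE_SIZE(MB); keys appear in the encoded form described above.

-- The same statement works for a specific index:
SHOW TABLE users INDEX idx_username REGIONS;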
Region Replication with Raft:
Each region is replicated across multiple TiKV nodes using the Raft consensus protocol. A typical configuration has 3 replicas per region:
Writes must be acknowledged by a majority (2 of 3) before they're considered committed. This ensures data durability even if one replica fails.
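You can observe the replica layout from SQL by joining two information_schema tables that TiDB exposes. A hedged sketch (the orders table name is illustrative here, and column sets can differ between versions): with 3-way replication, each region should report three peers, exactly one of which is the leader.

-- Count replicas (peers) and leaders per region for one table
SELECT s.region_id,
       COUNT(*)         AS replica_count,
       SUM(p.is_leader) AS leader_count
FROM information_schema.tikv_region_status s
JOIN information_schema.tikv_region_peers  p ON p.region_id = s.region_id
WHERE s.table_name = 'orders'
GROUP BY s.region_id
LIMIT 10;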
REGION DISTRIBUTION AND RAFT REPLICATION

Example: table "orders" with 500GB of data
Total regions: ~5,200 (500GB ÷ 96MB per region)

Region distribution across a 9-node TiKV cluster:
TiKV-1: R1 [L], R2 [F], R3 [F], ..., R580 [L]
TiKV-2: R1 [F], R2 [L], R3 [F], ..., R581 [L]
TiKV-3: R1 [F], R2 [F], R3 [L], ..., R582 [L]
TiKV-4: R4 [L], R5 [F], R6 [F], ...
TiKV-5: R4 [F], R5 [L], R6 [F], ...
TiKV-6: R4 [F], R5 [F], R6 [L], ...
TiKV-7: R7 [L], R8 [F], ...
TiKV-8: R7 [F], R8 [L], ...
TiKV-9: R7 [F], R8 [F], ...

[L] = Leader   [F] = Follower
Each region has exactly 1 leader + 2 followers.
Leaders are distributed across nodes for load balancing.

Raft replication flow (for Region 1):
1. The client sends the write request to the region leader (TiKV-1).
2. The leader appends the write to its Raft log.
3. The leader replicates the log entry to the followers (TiKV-2, TiKV-3).
4. Each follower appends the entry to its own log and acknowledges.
5. Once a majority (2 of 3) has acknowledged, the leader commits the entry, applies it, and responds to the client.

Why 96MB Regions?
The region size is a careful tradeoff:
Smaller regions (e.g., 32MB): finer-grained load balancing and faster region movement, but many more regions overall, which means more Raft groups, more heartbeats, and more metadata for PD to track.
Larger regions (e.g., 256MB): fewer regions and less per-region overhead, but coarser load balancing, larger Raft snapshots, and slower region movement.
The 96MB default balances these concerns for typical OLTP workloads. For very large sequential scans (analytics), larger regions might be appropriate.
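The split size is a per-node TiKV setting that you can inspect (and, on recent versions, adjust online) from SQL. A sketch, assuming the coprocessor.region-split-size key name, which has varied slightly across releases; confirm against your version before changing it:

-- Inspect the current region split size on each TiKV node
SHOW CONFIG WHERE type = 'tikv' AND name = 'coprocessor.region-split-size';

-- Raise it for scan-heavy workloads, if your version supports changing it online
-- (value format follows TiKV's size notation)
SET CONFIG tikv `coprocessor.region-split-size` = '256MB';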
Automatic Region Splitting:
When a region grows beyond the size threshold, TiKV automatically splits it: the split is proposed through the region's own Raft group, so every replica divides the range at the same key, producing two smaller regions in place without copying any data. PD can then rebalance the new regions if needed.
This happens automatically and transparently—applications don't notice splits occurring.
Regions can also split based on traffic patterns, not just size. If a region is receiving disproportionate traffic (a 'hot' region), PD can trigger a split to distribute the load. This is crucial for handling access patterns like inserting sequential IDs or time-series data.
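Two SQL-level tools are handy for dealing with hot spots: information_schema.TIDB_HOT_REGIONS for spotting hot regions, and SPLIT TABLE for manually pre-splitting a known-hot key range. A sketch, reusing the orders table and assuming it has an integer primary key; the key range and region count are illustrative:

-- See which regions PD currently considers hot for reads or writes
SELECT db_name, table_name, index_name, region_id, type, flow_bytes
FROM information_schema.tidb_hot_regions
ORDER BY flow_bytes DESC
LIMIT 10;

-- Manually pre-split a table's key range into 16 regions
SPLIT TABLE orders BETWEEN (0) AND (1000000) REGIONS 16;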
The Raft consensus protocol is the foundation of TiDB's data durability and consistency guarantees. Every region operates as an independent Raft group, ensuring that data is safely replicated before being acknowledged.
Raft Fundamentals:
Raft is a consensus algorithm designed to be understandable (compared to Paxos) while providing equivalent guarantees:
Leader Election: When a cluster starts or a leader fails, followers compete to become the new leader. A candidate must receive votes from a majority to become leader.
Log Replication: The leader receives all writes, appends them to its log, and replicates to followers. A write is committed when replicated to a majority.
Safety: Raft guarantees that committed entries are never lost—any elected leader will have all committed entries.
Multi-Raft in TiDB:
TiDB uses Multi-Raft—each region operates its own independent Raft group. This provides: parallel replication across thousands of independent Raft groups, failure isolation (a leader election in one region does not affect others), and write scalability, since leaders and their write load are spread across all TiKV nodes.
RAFT WRITE PATH - DETAILED FLOW

Transaction: UPDATE accounts SET balance = 950 WHERE id = 1;
1. TiDB determines row id=1 is in Region 42
2. Region 42's leader is on TiKV-3

Step 1 - PreWrite:
TiDB sends PreWrite(key=t5_r1, val=950) to the Region 42 leader on TiKV-3. The leader acquires a lock on the key and appends the write to its Raft log. The entry is replicated to the followers on TiKV-1 and TiKV-5; each appends it to its own log and acknowledges. Once a majority (2 of 3) has acknowledged, the leader returns PreWrite Success to TiDB.

Step 2 - Commit:
TiDB sends Commit(key=t5_r1, ts=X). The leader writes a commit record, again replicated via Raft, and returns Commit Success. The committed entries are applied to RocksDB asynchronously in the background.

Durability guarantee: once Commit Success is returned, the data is in the Raft logs on 2 of 3 replicas. Even if TiKV-3 crashes immediately, the data is safe: any surviving replica can replay the Raft log, and the data will eventually be persisted to RocksDB in the background.

Leader Election and Failover:
When a region leader fails, followers detect the absence of heartbeats and trigger an election: a follower whose election timeout expires becomes a candidate, requests votes from the other replicas, and takes over as the new leader once it receives a majority of votes.
During leader election, the region is temporarily unavailable for writes. However, only that single region is affected, all other regions continue serving traffic, and a new leader is typically elected within seconds.
Raft Snapshot:
When a follower falls too far behind (or is a new replica), the leader sends a snapshot—a complete copy of the region's data. This is more efficient than replaying thousands of log entries.
Snapshots are used for catching up followers that have fallen far behind, populating replicas newly added to a node during rebalancing, and recovering replicas after extended downtime.
Raft logs are kept in memory until applied to RocksDB. Very large transactions or high write throughput can cause memory pressure. TiDB has safeguards (transaction size limits, log truncation) to prevent unbounded log growth. Monitor Raft log size in production.
The Placement Driver (PD) is TiDB's coordination layer—the brain that makes scaling decisions. While TiKV handles data storage and replication, PD handles metadata, scheduling, and cluster-wide coordination.
PD's Responsibilities:
Timestamp Oracle (TSO): Provides globally unique, monotonically increasing timestamps for transactions. Every transaction starts by getting a timestamp from PD.
Region Metadata: Maintains the mapping of key ranges to regions and regions to TiKV nodes. TiDB queries PD to find which TiKV node holds a key.
Scheduling: Decides when to move regions, split/merge regions, and transfer leaders. This is how TiDB achieves load balancing.
Cluster Configuration: Stores cluster-wide configuration and coordinates administrative operations.
PD itself runs as a cluster (typically 3 or 5 nodes) for high availability, using its own Raft group for consensus.
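You can see the PD members (and the rest of the cluster topology) from SQL via information_schema.CLUSTER_INFO; a minimal sketch:

-- List PD members along with their versions and uptime
SELECT type, instance, version, start_time, uptime
FROM information_schema.cluster_info
WHERE type = 'pd';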
PD SCHEDULING OPERATIONS

PD monitors cluster state and issues scheduling commands:

1. Leader transfer (balance leader count across nodes)
   Before: TiKV-1: 150 leaders, TiKV-2: 50 leaders, TiKV-3: 100 leaders
   After:  TiKV-1: 100 leaders, TiKV-2: 100 leaders, TiKV-3: 100 leaders
   PD command: TransferLeader(Region45, TiKV-2)
   A fast operation (Raft leader handoff): no data movement, just a leadership change, completing in milliseconds.

2. Region transfer (balance storage across nodes)
   Scenario: TiKV-4 added to the cluster
   Before: TiKV-1/2/3: 800 regions each; TiKV-4: 0 regions (new)
   After:  TiKV-1/2/3/4: 600 regions each
   PD commands (issued gradually):
   AddReplica(Region100, TiKV-4), RemoveReplica(Region100, TiKV-1), ... repeated for 600 regions.
   Data is copied via Raft snapshot, then the old replica is removed.

3. Region split (handle growing regions)
   Trigger: region size > 96MB or high traffic (hot region)
   Before: Region 50 covers t10_r1 → t10_r50000, size 150MB
   After:  Region 50 covers t10_r1 → t10_r25000, size 75MB
           Region 51 (new) covers t10_r25001 → t10_r50000, size 75MB

4. Placement rules (compliance and locality)
   Example: ensure 1 replica per availability zone
   Rule: region.replicas = 3
         replica[0] in zone=us-east-1a
         replica[1] in zone=us-east-1b
         replica[2] in zone=us-east-1c
   PD enforces this by moving replicas to the correct zones, never placing two replicas in the same zone, and prioritizing zone diversity when rebalancing.

Scheduling Strategies:
PD continuously monitors cluster metrics and makes scheduling decisions based on multiple factors:
Balance Schedulers: even out leader count, region count, and region size across TiKV nodes so that no single node becomes a bottleneck for traffic or storage.
Repair Schedulers: detect regions with missing or unhealthy replicas (for example, after a node failure) and create replacement replicas on healthy nodes to restore the configured replication factor.
Constraint Schedulers: enforce placement rules, such as keeping one replica per availability zone or pinning data to labeled nodes, and move replicas that violate those rules.
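The pace of these scheduling operations is governed by PD limits that you can inspect, and cautiously tune, from SQL. A sketch, assuming the schedule.* key names used by recent PD versions:

-- How many concurrent scheduling operations PD allows
SHOW CONFIG WHERE type = 'pd' AND name LIKE 'schedule.%-schedule-limit';

-- Temporarily speed up rebalancing after adding nodes (revert afterwards)
SET CONFIG pd `schedule.region-schedule-limit` = 4096;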
TSO (Timestamp Oracle):
The TSO is critical for distributed transactions. Every transaction needs a start timestamp (start_ts) and a commit timestamp (commit_ts). PD's TSO provides globally unique, strictly increasing timestamps that establish a total order across all transactions in the cluster.
The TSO is often the component requiring the most careful attention in very high-throughput deployments. The PD leader handles all TSO requests, making it a potential bottleneck.
TiDB servers batch TSO requests from concurrent transactions into single PD requests. This dramatically reduces TSO latency under load. A single PD request might return 100+ timestamps for 100+ concurrent transactions starting simultaneously.
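From a client session you can see the TSO-issued timestamp your transaction is using and decode it into wall-clock time; a minimal sketch using the tidb_current_ts session variable and the TIDB_PARSE_TSO() function:

BEGIN;
-- The start_ts handed out by PD's TSO for this transaction
SELECT @@tidb_current_ts;
-- Decode the TSO timestamp into a human-readable time
SELECT TIDB_PARSE_TSO(@@tidb_current_ts);
ROLLBACK;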
Understanding the theory is important, but what does scaling TiDB actually look like in practice? Let's walk through common scaling scenarios.
Scaling Out (Adding Nodes):
Adding capacity to a TiDB cluster is straightforward: deploy a new TiKV node and point it at the existing PD cluster; PD registers the new store and gradually migrates regions and leaders onto it until the cluster is balanced.
The key insight: adding nodes is non-disruptive. Existing traffic continues normally while rebalancing happens in the background.
-- Monitor cluster capacity and distribution

-- View store (TiKV node) status
SELECT store_id, address, state, capacity, available,
       leader_count, region_count, start_ts
FROM information_schema.tikv_store_status;

/*
+----------+--------------------+-------+----------+-----------+--------------+--------------+---------------------+
| store_id | address            | state | capacity | available | leader_count | region_count | start_ts            |
+----------+--------------------+-------+----------+-----------+--------------+--------------+---------------------+
|        1 | tikv-1.local:20160 | Up    | 2TB      | 1.5TB     |         1500 |         4500 | 2024-01-01 00:00:00 |
|        2 | tikv-2.local:20160 | Up    | 2TB      | 1.5TB     |         1500 |         4500 | 2024-01-01 00:00:00 |
|        3 | tikv-3.local:20160 | Up    | 2TB      | 1.5TB     |         1500 |         4500 | 2024-01-01 00:00:00 |
|        4 | tikv-4.local:20160 | Up    | 2TB      | 2TB       |            0 |            0 | 2024-01-07 10:00:00 |  ← NEW NODE
+----------+--------------------+-------+----------+-----------+--------------+--------------+---------------------+
*/

-- After some time, PD rebalances:
/*
+----------+--------------------+-------+----------+-----------+--------------+--------------+
| store_id | address            | state | capacity | available | leader_count | region_count |
+----------+--------------------+-------+----------+-----------+--------------+--------------+
|        1 | tikv-1.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |
|        2 | tikv-2.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |
|        3 | tikv-3.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |
|        4 | tikv-4.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |  ← NOW BALANCED
+----------+--------------------+-------+----------+-----------+--------------+--------------+
*/

-- Monitor rebalancing progress
SHOW CONFIG WHERE name LIKE '%schedule%';

-- Check specific region distribution
SELECT region_id, leader_store_id, peer_store_ids
FROM information_schema.tikv_region_status
WHERE table_name = 'orders'
LIMIT 10;

Scaling the SQL Layer (TiDB Servers):
TiDB servers are stateless, making scaling trivial: start additional TiDB instances pointed at the same PD cluster and add them to your load balancer; they can begin serving queries immediately.
No coordination required—just add servers and update your load balancer. This is ideal for handling traffic spikes.
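After adding TiDB servers, it is worth confirming that they registered with PD and that connections are actually spreading across them. A sketch using cluster-level system tables; output shapes may vary by version:

-- All TiDB instances currently registered in the cluster
SELECT instance, version, start_time
FROM information_schema.cluster_info
WHERE type = 'tidb';

-- How client connections are distributed across TiDB instances
SELECT instance, COUNT(*) AS connections
FROM information_schema.cluster_processlist
GROUP BY instance;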
Handling Hot Spots:
Hot spots—where a small number of regions handle disproportionate traffic—are the enemy of horizontal scaling. TiDB provides several mechanisms:
1. AUTO_RANDOM for Primary Keys:
-- Problem: Sequential AUTO_INCREMENT creates a hotspot
-- All inserts go to the same region (the one with the highest keys)

CREATE TABLE events_bad (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,   -- BAD: hotspot
    event_type VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Solution 1: AUTO_RANDOM distributes across regions
CREATE TABLE events_good (
    id BIGINT PRIMARY KEY AUTO_RANDOM,      -- GOOD: random shard bits
    event_type VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- AUTO_RANDOM: high bits are random, preventing sequential clustering

-- Solution 2: SHARD_ROW_ID_BITS for tables keyed by the hidden _tidb_rowid
--             (no integer clustered primary key)
CREATE TABLE events_sharded (
    event_id VARCHAR(36) PRIMARY KEY,       -- Application-generated ID
    event_type VARCHAR(50)
) SHARD_ROW_ID_BITS = 4;
-- Scatters the hidden row ID across 2^4 = 16 shards of the key space

-- Solution 3: Partition tables for predictable workloads
CREATE TABLE time_series (
    ts TIMESTAMP,
    metric_name VARCHAR(100),
    value DOUBLE,
    PRIMARY KEY (ts, metric_name)
) PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
    PARTITION p0   VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01')),
    PARTITION p1   VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
    PARTITION p2   VALUES LESS THAN (UNIX_TIMESTAMP('2024-07-01')),
    PARTITION p3   VALUES LESS THAN (UNIX_TIMESTAMP('2024-10-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

Companies like JD.com run TiDB clusters with 100+ TiKV nodes, storing petabytes of data across millions of regions. The architecture scales because each region is independent—adding nodes adds capacity without fundamental architectural changes.
Effective capacity planning for TiDB requires understanding the resource requirements of each component and how they interact.
TiDB Server (SQL Layer) Resources:
TiDB servers are CPU and memory intensive: CPU for parsing, optimizing, and executing queries, and memory for sorting, joining, and buffering intermediate results.
Sizing guidance:
| Component | CPU | Memory | Storage | Network | Scaling Trigger |
|---|---|---|---|---|---|
| TiDB Server | 8-32 cores | 16-64 GB | Minimal (logs) | 10 Gbps+ | Query latency, connection count |
| TiKV | 8-32 cores | 32-128 GB | SSD, 500GB-4TB | 10 Gbps+ | Storage capacity, IOPS |
| PD | 4-8 cores | 8-16 GB | SSD, 20-100 GB | 1 Gbps | Rarely (3-5 nodes sufficient) |
| TiFlash | 16-48 cores | 64-256 GB | NVMe, 1-10 TB | 10 Gbps+ | Analytics query load |
TiKV Resource Requirements:
TiKV is storage and I/O intensive: it needs fast SSD or NVMe storage for RocksDB, substantial memory for the block cache and Raft state, and enough CPU for compactions, compression, and coprocessor reads.
Critical: Block Cache Sizing
RocksDB's block cache is the most important memory configuration for TiKV:
block_cache_size = total_memory × 0.45 (rough guideline)
A larger block cache means more data can be served from memory, reducing disk I/O. For read-heavy workloads, maximize block cache within memory constraints.
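The block cache capacity is a TiKV setting you can read, and on recent versions adjust online, from SQL. A sketch, assuming the storage.block-cache.capacity key name; confirm against your TiKV version before changing it:

-- Current block cache capacity on each TiKV node
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';

-- Raise it on a 64 GB node (roughly the 45% guideline above)
SET CONFIG tikv `storage.block-cache.capacity` = '28GB';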
Storage Capacity Planning:
TiKV storage usage exceeds raw data size due to Raft replication (three copies by default), RocksDB space amplification (multiple SST levels and data awaiting compaction), and retained MVCC versions and Raft logs that have not yet been garbage collected.
Formula:
Required Storage = Raw Data Size × Replication Factor × 1.5 × 1.2
= Raw Data Size × Replication Factor × 1.8
For 100GB of data with 3 replicas:
100GB × 3 × 1.8 = 540GB total across cluster
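You can sanity-check this estimate against a live cluster by comparing the logical (single-copy) data size reported by information_schema.TABLES with the space actually consumed on the TiKV stores. A rough sketch; the app schema name is illustrative, and both figures are approximations:

-- Logical data size for one schema, in GB (statistics-based estimate)
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS logical_gb
FROM information_schema.tables
WHERE table_schema = 'app';

-- Physical space per TiKV store (used space is capacity minus available)
SELECT store_id, capacity, available
FROM information_schema.tikv_store_status;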
CAPACITY PLANNING EXAMPLE

Requirements:
- 500 GB raw data size
- 10,000 QPS (mixed read/write)
- 99th percentile latency < 50ms
- High availability (survive 1 node failure)

1. Storage (TiKV)
   Raw data: 500 GB
   With 3x replication: 1,500 GB
   With space amplification (×1.8): 2,700 GB total
   Per TiKV node: 2,700 GB ÷ 3 nodes = 900 GB
   With 20% headroom: ~1,100 GB per node
   Recommendation: 3 TiKV nodes with 1.5 TB SSD each

2. TiKV memory
   Per-node data: ~900 GB
   Block cache (~45% of RAM): ~29 GB for a good cache hit ratio
   Raft logs, write buffers, other caches: ~10 GB
   OS and overhead: ~5 GB, plus headroom for peaks
   Recommendation: 64 GB RAM per TiKV node

3. TiDB servers
   10,000 QPS with <50ms latency
   A single TiDB server: ~5,000-10,000 QPS (depends on query complexity)
   For redundancy and headroom: 3 TiDB servers
   Recommendation: 3 TiDB servers with 16 cores, 32 GB RAM each

4. PD
   Always deploy 3 or 5 nodes for Raft consensus
   Lightweight compared to TiKV
   Recommendation: 3 PD nodes with 4 cores, 16 GB RAM each

5. Final cluster
   3 × TiDB server: 16 cores, 32 GB RAM each
   3 × TiKV node:   16 cores, 64 GB RAM, 1.5 TB SSD each
   3 × PD node:     4 cores, 16 GB RAM, 100 GB SSD each
   Optional: TiFlash for analytics (separate sizing)

   Total: ~108 cores, ~336 GB RAM, ~4.8 TB SSD

TiDB's horizontal scalability means you don't need to over-provision upfront. Start with a minimal production cluster (3 TiKV, 2-3 TiDB, 3 PD) and add nodes as needed. Monitoring will show when you're approaching capacity limits.
We've explored how TiDB achieves horizontal scalability—the architecture, the mechanisms, and the operational practices. The key principles: data is split into regions that are independently replicated with Raft, PD continuously schedules regions and leaders to keep the cluster balanced, the stateless SQL layer scales separately from storage, and hot spots are avoided through key design and automatic splitting.
What's Next:
Horizontal scalability for OLTP workloads is valuable, but TiDB offers something even more unique: the ability to run HTAP (Hybrid Transactional/Analytical Processing) workloads on the same database. In the next page, we'll explore TiFlash and how TiDB enables real-time analytics on transactional data without the traditional ETL pipeline.
You now understand how TiDB achieves horizontal scalability through its distributed architecture—TiKV, regions, Raft consensus, and PD scheduling. Next, we'll explore TiDB's HTAP capabilities and how TiFlash enables real-time analytics on transactional data.