Every successful MySQL deployment eventually hits the wall.
The database that started as a simple PostgreSQL alternative—reliable, fast, well-understood—gradually becomes a bottleneck. First, you add read replicas to handle read traffic. Then you implement application-level sharding, splitting users across multiple database instances. Before long, your application is entangled with routing logic, cross-shard joins are impossible, and your infrastructure team spends more time managing databases than building features.
This is the scaling wall—the point where the single-node architecture of traditional databases becomes the limiting factor for your system's growth.
TiDB was purpose-built to eliminate this wall. By distributing data automatically across a cluster of nodes, TiDB scales reads and writes horizontally—just add more servers. The application sees a single logical database while TiDB handles the complexity of distribution, replication, and load balancing.
In this page, we'll dive deep into how TiDB achieves horizontal scalability: the architecture, the algorithms, and the operational practices that make it possible.
By the end of this page, you will understand TiDB's distributed storage architecture (TiKV and regions), how Raft consensus ensures data durability and consistency, how data is automatically distributed and rebalanced across nodes, and the practical implications for capacity planning and scaling operations.
At the heart of TiDB's scalability is TiKV—a distributed transactional key-value store. While TiDB handles SQL parsing and optimization, TiKV handles data storage, replication, and distributed transactions. Understanding TiKV is essential to understanding how TiDB scales.
Key-Value Foundation:
TiKV stores all data as key-value pairs. Table rows are encoded into keys using a specific format:
Table Row: t{table_id}_r{row_id} → {encoded_row_data}
Index Entry: t{table_id}_i{index_id}_{index_values} → {row_id}
For example, a row in a users table with id=1001 might be stored as:
Key: t45_r1001
Value: {"email":"alice@example.com","username":"alice",...}
This key encoding has important properties: all of a table's rows are contiguous in the key space and sorted by primary key, so primary-key range scans become contiguous key-range scans, and each index's entries are likewise clustered together under their own prefix.
RocksDB as the Storage Engine:
Each TiKV node uses RocksDB as its local storage engine. RocksDB is a high-performance embedded key-value store (originally developed by Facebook) optimized for SSDs and modern hardware:
TiKV DATA ORGANIZATION

Table encoding example (users table, table_id = 45)

Row data (ordered by primary key):
t45_r1    → {"email":"alice@email.com","username":"alice","status":"active"}
t45_r2    → {"email":"bob@email.com","username":"bob","status":"active"}
t45_r3    → {"email":"carol@email.com","username":"carol","status":"pending"}
...
t45_r1000 → {"email":"user1000@email.com","username":"user1000","status":"active"}

Secondary index (idx_username on users.username, index_id = 1):
t45_i1_alice → 1   (points to row 1)
t45_i1_bob   → 2   (points to row 2)
t45_i1_carol → 3   (points to row 3)
...

Secondary index (idx_status on users.status, index_id = 2):
t45_i2_active_1  → 1
t45_i2_active_2  → 2
t45_i2_pending_3 → 3
...

TiKV node structure

Each TiKV node hosts many regions (portions of the key space), some as Raft leaders and some as followers, for example Region 1 (leader, t1_r1 - t1_r99), Region 2 (follower, t45_r500 - t45_r599), and Region N (leader, t100_r1 - t100_r50). All regions on a node share a single local RocksDB instance: writes land in the memtable (write buffer) and are later flushed and compacted into SST files across levels L0 - L4.

Each TiKV node hosts multiple regions; each region is independently replicated via Raft.

Why Key-Value for a Relational Database?
Using a key-value foundation for a SQL database might seem odd, but it provides critical advantages for distribution: a sorted key space can be cut into ranges at any boundary, each range can be replicated, moved, and split independently, and the storage layer never needs to understand SQL semantics.
The SQL layer (TiDB) handles the translation between relational semantics and key-value operations. From the application's perspective, it's a SQL database. Under the hood, it's a distributed key-value store with transactional guarantees.
TiKV is designed as an independent project and can be used without TiDB. Companies use TiKV directly for applications that need a distributed key-value store with transactions but don't require SQL. This separation of concerns makes both systems more focused and maintainable.
The key innovation in TiDB's scalability is the Region—the fundamental unit of data distribution, replication, and scheduling. Understanding regions is essential to understanding how TiDB scales.
What is a Region?
A region is a contiguous range of keys. By default, a region holds roughly 96MB of data; when it grows past the split threshold, it is split into two regions.
For example, a table with 10GB of data might be divided into ~100 regions, each covering a portion of the key range:
Region 1: t45_r1 to t45_r10000
Region 2: t45_r10001 to t45_r20000
Region 3: t45_r20001 to t45_r30000
...
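If you want to see how TiDB has actually carved a table into regions, the SHOW TABLE ... REGIONS statement lists each region's key range, leader, and approximate size. A minimal sketch, reusing the users table from the encoding example; the exact output columns vary by version:

-- List the regions backing a table, with their key ranges and leaders
SHOW TABLE users REGIONS;

-- Typical columns include REGION_ID, START_KEY, END_KEY, LEADER_STORE_ID,
-- and APPROXIMATE_SIZE(MB); keys appear in the encoded form described above.

-- The same statement works for a specific index:
SHOW TABLE users INDEX idx_username REGIONS;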
Region Replication with Raft:
Each region is replicated across multiple TiKV nodes using the Raft consensus protocol. A typical configuration has 3 replicas per region:
Writes must be acknowledged by a majority (2 of 3) before they're considered committed. This ensures data durability even if one replica fails.
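You can observe the replica layout from SQL by joining two information_schema tables that TiDB exposes. A hedged sketch (the orders table name is illustrative here, and column sets can differ between versions): with 3-way replication, each region should report three peers, exactly one of which is the leader.

-- Count replicas (peers) and leaders per region for one table
SELECT s.region_id,
       COUNT(*)         AS replica_count,
       SUM(p.is_leader) AS leader_count
FROM information_schema.tikv_region_status s
JOIN information_schema.tikv_region_peers  p ON p.region_id = s.region_id
WHERE s.table_name = 'orders'
GROUP BY s.region_id
LIMIT 10;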
REGION DISTRIBUTION AND RAFT REPLICATION

Example: table "orders" with 500GB of data
Total regions: ~5,200 (500GB ÷ 96MB per region)

Region distribution across a 9-node TiKV cluster:
TiKV-1: R1 [L], R2 [F], R3 [F], ..., R580 [L]
TiKV-2: R1 [F], R2 [L], R3 [F], ..., R581 [L]
TiKV-3: R1 [F], R2 [F], R3 [L], ..., R582 [L]
TiKV-4: R4 [L], R5 [F], R6 [F], ...
TiKV-5: R4 [F], R5 [L], R6 [F], ...
TiKV-6: R4 [F], R5 [F], R6 [L], ...
TiKV-7: R7 [L], R8 [F], ...
TiKV-8: R7 [F], R8 [L], ...
TiKV-9: R7 [F], R8 [F], ...

[L] = Leader   [F] = Follower
Each region has exactly 1 leader + 2 followers.
Leaders are distributed across nodes for load balancing.

Raft replication flow (for Region 1):
1. The client sends the write request to the region leader (TiKV-1).
2. The leader appends the write to its Raft log.
3. The leader replicates the log entry to the followers (TiKV-2, TiKV-3).
4. Each follower appends the entry to its own log and acknowledges.
5. Once a majority (2 of 3) has acknowledged, the leader commits the entry, applies it, and responds to the client.

Why 96MB Regions?
The region size is a careful tradeoff:
Smaller regions (e.g., 32MB): finer-grained load balancing and faster region movement, but many more regions overall, which means more Raft groups, more heartbeats, and more metadata for PD to track.
Larger regions (e.g., 256MB): fewer regions and less per-region overhead, but coarser load balancing, larger Raft snapshots, and slower region movement.
The 96MB default balances these concerns for typical OLTP workloads. For very large sequential scans (analytics), larger regions might be appropriate.
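The split size is a per-node TiKV setting that you can inspect (and, on recent versions, adjust online) from SQL. A sketch, assuming the coprocessor.region-split-size key name, which has varied slightly across releases; confirm against your version before changing it:

-- Inspect the current region split size on each TiKV node
SHOW CONFIG WHERE type = 'tikv' AND name = 'coprocessor.region-split-size';

-- Raise it for scan-heavy workloads, if your version supports changing it online
-- (value format follows TiKV's size notation)
SET CONFIG tikv `coprocessor.region-split-size` = '256MB';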
Automatic Region Splitting:
When a region grows beyond the size threshold, TiKV automatically splits it: the split is proposed through the region's own Raft group, so every replica divides the range at the same key, producing two smaller regions in place without copying any data. PD can then rebalance the new regions if needed.
This happens automatically and transparently—applications don't notice splits occurring.
Regions can also split based on traffic patterns, not just size. If a region is receiving disproportionate traffic (a 'hot' region), PD can trigger a split to distribute the load. This is crucial for handling access patterns like inserting sequential IDs or time-series data.
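Two SQL-level tools are handy for dealing with hot spots: information_schema.TIDB_HOT_REGIONS for spotting hot regions, and SPLIT TABLE for manually pre-splitting a known-hot key range. A sketch, reusing the orders table and assuming it has an integer primary key; the key range and region count are illustrative:

-- See which regions PD currently considers hot for reads or writes
SELECT db_name, table_name, index_name, region_id, type, flow_bytes
FROM information_schema.tidb_hot_regions
ORDER BY flow_bytes DESC
LIMIT 10;

-- Manually pre-split a table's key range into 16 regions
SPLIT TABLE orders BETWEEN (0) AND (1000000) REGIONS 16;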
The Raft consensus protocol is the foundation of TiDB's data durability and consistency guarantees. Every region operates as an independent Raft group, ensuring that data is safely replicated before being acknowledged.
Raft Fundamentals:
Raft is a consensus algorithm designed to be understandable (compared to Paxos) while providing equivalent guarantees:
Leader Election: When a cluster starts or a leader fails, followers compete to become the new leader. A candidate must receive votes from a majority to become leader.
Log Replication: The leader receives all writes, appends them to its log, and replicates to followers. A write is committed when replicated to a majority.
Safety: Raft guarantees that committed entries are never lost—any elected leader will have all committed entries.
Multi-Raft in TiDB:
TiDB uses Multi-Raft—each region operates its own independent Raft group. This provides: parallel replication across thousands of independent Raft groups, failure isolation (a leader election in one region does not affect others), and write scalability, since leaders and their write load are spread across all TiKV nodes.
RAFT WRITE PATH - DETAILED FLOW

Transaction: UPDATE accounts SET balance = 950 WHERE id = 1;
1. TiDB determines row id=1 is in Region 42
2. Region 42's leader is on TiKV-3

Step 1 - PreWrite:
TiDB sends PreWrite(key=t5_r1, val=950) to the Region 42 leader on TiKV-3. The leader acquires a lock on the key and appends the write to its Raft log. The entry is replicated to the followers on TiKV-1 and TiKV-5; each appends it to its own log and acknowledges. Once a majority (2 of 3) has acknowledged, the leader returns PreWrite Success to TiDB.

Step 2 - Commit:
TiDB sends Commit(key=t5_r1, ts=X). The leader writes a commit record, again replicated via Raft, and returns Commit Success. The committed entries are applied to RocksDB asynchronously in the background.

Durability guarantee: once Commit Success is returned, the data is in the Raft logs on 2 of 3 replicas. Even if TiKV-3 crashes immediately, the data is safe: any surviving replica can replay the Raft log, and the data will eventually be persisted to RocksDB in the background.

Leader Election and Failover:
When a region leader fails, followers detect the absence of heartbeats and trigger an election: a follower whose election timeout expires becomes a candidate, requests votes from the other replicas, and takes over as the new leader once it receives a majority of votes.
During leader election, the region is temporarily unavailable for writes. However, only that single region is affected, all other regions continue serving traffic, and a new leader is typically elected within seconds.
Raft Snapshot:
When a follower falls too far behind (or is a new replica), the leader sends a snapshot—a complete copy of the region's data. This is more efficient than replaying thousands of log entries.
Snapshots are used for catching up followers that have fallen far behind, populating replicas newly added to a node during rebalancing, and recovering replicas after extended downtime.
Raft logs are kept in memory until applied to RocksDB. Very large transactions or high write throughput can cause memory pressure. TiDB has safeguards (transaction size limits, log truncation) to prevent unbounded log growth. Monitor Raft log size in production.
The Placement Driver (PD) is TiDB's coordination layer—the brain that makes scaling decisions. While TiKV handles data storage and replication, PD handles metadata, scheduling, and cluster-wide coordination.
PD's Responsibilities:
Timestamp Oracle (TSO): Provides globally unique, monotonically increasing timestamps for transactions. Every transaction starts by getting a timestamp from PD.
Region Metadata: Maintains the mapping of key ranges to regions and regions to TiKV nodes. TiDB queries PD to find which TiKV node holds a key.
Scheduling: Decides when to move regions, split/merge regions, and transfer leaders. This is how TiDB achieves load balancing.
Cluster Configuration: Stores cluster-wide configuration and coordinates administrative operations.
PD itself runs as a cluster (typically 3 or 5 nodes) for high availability, using its own Raft group for consensus.
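You can see the PD members (and the rest of the cluster topology) from SQL via information_schema.CLUSTER_INFO; a minimal sketch:

-- List PD members along with their versions and uptime
SELECT type, instance, version, start_time, uptime
FROM information_schema.cluster_info
WHERE type = 'pd';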
PD SCHEDULING OPERATIONS

PD monitors cluster state and issues scheduling commands:

1. Leader transfer (balance leader count across nodes)
   Before: TiKV-1: 150 leaders, TiKV-2: 50 leaders, TiKV-3: 100 leaders
   After:  TiKV-1: 100 leaders, TiKV-2: 100 leaders, TiKV-3: 100 leaders
   PD command: TransferLeader(Region45, TiKV-2)
   A fast operation (Raft leader handoff): no data movement, just a leadership change, completing in milliseconds.

2. Region transfer (balance storage across nodes)
   Scenario: TiKV-4 added to the cluster
   Before: TiKV-1/2/3: 800 regions each; TiKV-4: 0 regions (new)
   After:  TiKV-1/2/3/4: 600 regions each
   PD commands (issued gradually):
   AddReplica(Region100, TiKV-4), RemoveReplica(Region100, TiKV-1), ... repeated for 600 regions.
   Data is copied via Raft snapshot, then the old replica is removed.

3. Region split (handle growing regions)
   Trigger: region size > 96MB or high traffic (hot region)
   Before: Region 50 covers t10_r1 → t10_r50000, size 150MB
   After:  Region 50 covers t10_r1 → t10_r25000, size 75MB
           Region 51 (new) covers t10_r25001 → t10_r50000, size 75MB

4. Placement rules (compliance and locality)
   Example: ensure 1 replica per availability zone
   Rule: region.replicas = 3
         replica[0] in zone=us-east-1a
         replica[1] in zone=us-east-1b
         replica[2] in zone=us-east-1c
   PD enforces this by moving replicas to the correct zones, never placing two replicas in the same zone, and prioritizing zone diversity when rebalancing.

Scheduling Strategies:
PD continuously monitors cluster metrics and makes scheduling decisions based on multiple factors:
Balance Schedulers: even out leader count, region count, and region size across TiKV nodes so that no single node becomes a bottleneck for traffic or storage.
Repair Schedulers: detect regions with missing or unhealthy replicas (for example, after a node failure) and create replacement replicas on healthy nodes to restore the configured replication factor.
Constraint Schedulers: enforce placement rules, such as keeping one replica per availability zone or pinning data to labeled nodes, and move replicas that violate those rules.
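The pace of these scheduling operations is governed by PD limits that you can inspect, and cautiously tune, from SQL. A sketch, assuming the schedule.* key names used by recent PD versions:

-- How many concurrent scheduling operations PD allows
SHOW CONFIG WHERE type = 'pd' AND name LIKE 'schedule.%-schedule-limit';

-- Temporarily speed up rebalancing after adding nodes (revert afterwards)
SET CONFIG pd `schedule.region-schedule-limit` = 4096;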
TSO (Timestamp Oracle):
The TSO is critical for distributed transactions. Every transaction needs a start timestamp (start_ts) and a commit timestamp (commit_ts). PD's TSO provides globally unique, strictly increasing timestamps that establish a total order across all transactions in the cluster.
The TSO is often the component requiring the most careful attention in very high-throughput deployments. The PD leader handles all TSO requests, making it a potential bottleneck.
TiDB servers batch TSO requests from concurrent transactions into single PD requests. This dramatically reduces TSO latency under load. A single PD request might return 100+ timestamps for 100+ concurrent transactions starting simultaneously.
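From a client session you can see the TSO-issued timestamp your transaction is using and decode it into wall-clock time; a minimal sketch using the tidb_current_ts session variable and the TIDB_PARSE_TSO() function:

BEGIN;
-- The start_ts handed out by PD's TSO for this transaction
SELECT @@tidb_current_ts;
-- Decode the TSO timestamp into a human-readable time
SELECT TIDB_PARSE_TSO(@@tidb_current_ts);
ROLLBACK;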
Understanding the theory is important, but what does scaling TiDB actually look like in practice? Let's walk through common scaling scenarios.
Scaling Out (Adding Nodes):
Adding capacity to a TiDB cluster is straightforward: deploy a new TiKV node and point it at the existing PD cluster; PD registers the new store and gradually migrates regions and leaders onto it until the cluster is balanced.
The key insight: adding nodes is non-disruptive. Existing traffic continues normally while rebalancing happens in the background.
-- Monitor cluster capacity and distribution

-- View store (TiKV node) status
SELECT store_id, address, state, capacity, available,
       leader_count, region_count, start_ts
FROM information_schema.tikv_store_status;

/*
+----------+--------------------+-------+----------+-----------+--------------+--------------+---------------------+
| store_id | address            | state | capacity | available | leader_count | region_count | start_ts            |
+----------+--------------------+-------+----------+-----------+--------------+--------------+---------------------+
|        1 | tikv-1.local:20160 | Up    | 2TB      | 1.5TB     |         1500 |         4500 | 2024-01-01 00:00:00 |
|        2 | tikv-2.local:20160 | Up    | 2TB      | 1.5TB     |         1500 |         4500 | 2024-01-01 00:00:00 |
|        3 | tikv-3.local:20160 | Up    | 2TB      | 1.5TB     |         1500 |         4500 | 2024-01-01 00:00:00 |
|        4 | tikv-4.local:20160 | Up    | 2TB      | 2TB       |            0 |            0 | 2024-01-07 10:00:00 |  ← NEW NODE
+----------+--------------------+-------+----------+-----------+--------------+--------------+---------------------+
*/

-- After some time, PD rebalances:
/*
+----------+--------------------+-------+----------+-----------+--------------+--------------+
| store_id | address            | state | capacity | available | leader_count | region_count |
+----------+--------------------+-------+----------+-----------+--------------+--------------+
|        1 | tikv-1.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |
|        2 | tikv-2.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |
|        3 | tikv-3.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |
|        4 | tikv-4.local:20160 | Up    | 2TB      | 1.6TB     |         1125 |         3375 |  ← NOW BALANCED
+----------+--------------------+-------+----------+-----------+--------------+--------------+
*/

-- Monitor rebalancing progress
SHOW CONFIG WHERE name LIKE '%schedule%';

-- Check specific region distribution
SELECT region_id, leader_store_id, peer_store_ids
FROM information_schema.tikv_region_status
WHERE table_name = 'orders'
LIMIT 10;

Scaling the SQL Layer (TiDB Servers):
TiDB servers are stateless, making scaling trivial: start additional TiDB instances pointed at the same PD cluster and add them to your load balancer; they can begin serving queries immediately.
No coordination required—just add servers and update your load balancer. This is ideal for handling traffic spikes.
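After adding TiDB servers, it is worth confirming that they registered with PD and that connections are actually spreading across them. A sketch using cluster-level system tables; output shapes may vary by version:

-- All TiDB instances currently registered in the cluster
SELECT instance, version, start_time
FROM information_schema.cluster_info
WHERE type = 'tidb';

-- How client connections are distributed across TiDB instances
SELECT instance, COUNT(*) AS connections
FROM information_schema.cluster_processlist
GROUP BY instance;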
Handling Hot Spots:
Hot spots—where a small number of regions handle disproportionate traffic—are the enemy of horizontal scaling. TiDB provides several mechanisms:
1. AUTO_RANDOM for Primary Keys:
-- Problem: Sequential AUTO_INCREMENT creates a hotspot
-- All inserts go to the same region (the one with the highest keys)

CREATE TABLE events_bad (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,   -- BAD: hotspot
    event_type VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Solution 1: AUTO_RANDOM distributes across regions
CREATE TABLE events_good (
    id BIGINT PRIMARY KEY AUTO_RANDOM,      -- GOOD: random shard bits
    event_type VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- AUTO_RANDOM: high bits are random, preventing sequential clustering

-- Solution 2: SHARD_ROW_ID_BITS for tables keyed by the hidden _tidb_rowid
--             (no integer clustered primary key)
CREATE TABLE events_sharded (
    event_id VARCHAR(36) PRIMARY KEY,       -- Application-generated ID
    event_type VARCHAR(50)
) SHARD_ROW_ID_BITS = 4;
-- Scatters the hidden row ID across 2^4 = 16 shards of the key space

-- Solution 3: Partition tables for predictable workloads
CREATE TABLE time_series (
    ts TIMESTAMP,
    metric_name VARCHAR(100),
    value DOUBLE,
    PRIMARY KEY (ts, metric_name)
) PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
    PARTITION p0   VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01')),
    PARTITION p1   VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
    PARTITION p2   VALUES LESS THAN (UNIX_TIMESTAMP('2024-07-01')),
    PARTITION p3   VALUES LESS THAN (UNIX_TIMESTAMP('2024-10-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

Companies like JD.com run TiDB clusters with 100+ TiKV nodes, storing petabytes of data across millions of regions. The architecture scales because each region is independent—adding nodes adds capacity without fundamental architectural changes.
Effective capacity planning for TiDB requires understanding the resource requirements of each component and how they interact.
TiDB Server (SQL Layer) Resources:
TiDB servers are CPU and memory intensive: CPU for parsing, optimizing, and executing queries, and memory for sorting, joining, and buffering intermediate results.
Sizing guidance:
| Component | CPU | Memory | Storage | Network | Scaling Trigger |
|---|---|---|---|---|---|
| TiDB Server | 8-32 cores | 16-64 GB | Minimal (logs) | 10 Gbps+ | Query latency, connection count |
| TiKV | 8-32 cores | 32-128 GB | SSD, 500GB-4TB | 10 Gbps+ | Storage capacity, IOPS |
| PD | 4-8 cores | 8-16 GB | SSD, 20-100 GB | 1 Gbps | Rarely (3-5 nodes sufficient) |
| TiFlash | 16-48 cores | 64-256 GB | NVMe, 1-10 TB | 10 Gbps+ | Analytics query load |
TiKV Resource Requirements:
TiKV is storage and I/O intensive: it needs fast SSD or NVMe storage for RocksDB, substantial memory for the block cache and Raft state, and enough CPU for compactions, compression, and coprocessor reads.
Critical: Block Cache Sizing
RocksDB's block cache is the most important memory configuration for TiKV:
block_cache_size = total_memory × 0.45 (rough guideline)
A larger block cache means more data can be served from memory, reducing disk I/O. For read-heavy workloads, maximize block cache within memory constraints.
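The block cache capacity is a TiKV setting you can read, and on recent versions adjust online, from SQL. A sketch, assuming the storage.block-cache.capacity key name; confirm against your TiKV version before changing it:

-- Current block cache capacity on each TiKV node
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';

-- Raise it on a 64 GB node (roughly the 45% guideline above)
SET CONFIG tikv `storage.block-cache.capacity` = '28GB';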
Storage Capacity Planning:
TiKV storage usage exceeds raw data size due to Raft replication (three copies by default), RocksDB space amplification (multiple SST levels and data awaiting compaction), and retained MVCC versions and Raft logs that have not yet been garbage collected.
Formula:
Required Storage = Raw Data Size × Replication Factor × 1.5 × 1.2
= Raw Data Size × Replication Factor × 1.8
For 100GB of data with 3 replicas:
100GB × 3 × 1.8 = 540GB total across cluster
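You can sanity-check this estimate against a live cluster by comparing the logical (single-copy) data size reported by information_schema.TABLES with the space actually consumed on the TiKV stores. A rough sketch; the app schema name is illustrative, and both figures are approximations:

-- Logical data size for one schema, in GB (statistics-based estimate)
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS logical_gb
FROM information_schema.tables
WHERE table_schema = 'app';

-- Physical space per TiKV store (used space is capacity minus available)
SELECT store_id, capacity, available
FROM information_schema.tikv_store_status;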
CAPACITY PLANNING EXAMPLE

Requirements:
- 500 GB raw data size
- 10,000 QPS (mixed read/write)
- 99th percentile latency < 50ms
- High availability (survive 1 node failure)

1. Storage (TiKV)
   Raw data: 500 GB
   With 3x replication: 1,500 GB
   With space amplification (×1.8): 2,700 GB total
   Per TiKV node: 2,700 GB ÷ 3 nodes = 900 GB
   With 20% headroom: ~1,100 GB per node
   Recommendation: 3 TiKV nodes with 1.5 TB SSD each

2. TiKV memory
   Per-node data: ~900 GB
   Block cache (~45% of RAM): ~29 GB for a good cache hit ratio
   Raft logs, write buffers, other caches: ~10 GB
   OS and overhead: ~5 GB, plus headroom for peaks
   Recommendation: 64 GB RAM per TiKV node

3. TiDB servers
   10,000 QPS with <50ms latency
   A single TiDB server: ~5,000-10,000 QPS (depends on query complexity)
   For redundancy and headroom: 3 TiDB servers
   Recommendation: 3 TiDB servers with 16 cores, 32 GB RAM each

4. PD
   Always deploy 3 or 5 nodes for Raft consensus
   Lightweight compared to TiKV
   Recommendation: 3 PD nodes with 4 cores, 16 GB RAM each

5. Final cluster
   3 × TiDB server: 16 cores, 32 GB RAM each
   3 × TiKV node:   16 cores, 64 GB RAM, 1.5 TB SSD each
   3 × PD node:     4 cores, 16 GB RAM, 100 GB SSD each
   Optional: TiFlash for analytics (separate sizing)

   Total: ~108 cores, ~336 GB RAM, ~4.8 TB SSD

TiDB's horizontal scalability means you don't need to over-provision upfront. Start with a minimal production cluster (3 TiKV, 2-3 TiDB, 3 PD) and add nodes as needed. Monitoring will show when you're approaching capacity limits.
We've explored how TiDB achieves horizontal scalability—the architecture, the mechanisms, and the operational practices. The key principles: data is split into regions that are independently replicated with Raft, PD continuously schedules regions and leaders to keep the cluster balanced, the stateless SQL layer scales separately from storage, and hot spots are avoided through key design and automatic splitting.
What's Next:
Horizontal scalability for OLTP workloads is valuable, but TiDB offers something even more unique: the ability to run HTAP (Hybrid Transactional/Analytical Processing) workloads on the same database. In the next page, we'll explore TiFlash and how TiDB enables real-time analytics on transactional data without the traditional ETL pipeline.
You now understand how TiDB achieves horizontal scalability through its distributed architecture—TiKV, regions, Raft consensus, and PD scheduling. Next, we'll explore TiDB's HTAP capabilities and how TiFlash enables real-time analytics on transactional data.