Ask any engineer who has managed a sharded database about their experience, and you'll hear war stories: resharding projects that took months, application changes needed for every shard topology change, data migration scripts that failed at 3 AM, and the constant mental overhead of remembering which data lives where.
Manual sharding, dividing a database into pieces that different servers manage, works, but it imposes a tremendous operational burden: every topology change requires application changes, every rebalancing requires a data migration, and every migration is a risk.
Google Spanner eliminates this operational overhead through automatic sharding. The database handles data partitioning, movement, and rebalancing transparently. Applications interact with a single logical database while Spanner manages the distributed reality underneath.
This isn't just convenience—it's a fundamental shift in database operations. Engineers focus on application logic while Spanner ensures data is optimally distributed across a possibly massive fleet of servers.
By the end of this page, you will understand how Spanner automatically partitions data, how it detects and resolves hotspots, how data moves between servers without downtime, and how this automation enables truly hands-off scalability.
To understand automatic sharding, we need to understand Spanner's data organization hierarchy.
Key Ranges and Splits:
All data in Spanner is stored in primary key order. A split is a contiguous range of the key space—for example, all rows with primary keys from "A" to "M". Splits are the fundamental unit of sharding.
Tablets:
A tablet is the physical manifestation of a split on a single spanserver. Each tablet stores its key range's rows (along with their timestamped versions) and a write-ahead log in files on Colossus, Google's distributed file system, and serves as one replica in the split's Paxos group.
Directories:
A directory is a set of contiguous rows that share a common key prefix: typically a root row and all of its interleaved descendants, such as one user together with that user's orders and order items.
Directories are the unit of data movement. When Spanner moves data for load balancing or failure recovery, it moves entire directories.
The Hierarchy:
(Diagram: Spanner data organization hierarchy. An example ecommerce database defines a users table, with orders interleaved in users and order_items interleaved in orders. The interleaved key space is sorted by primary key, so each user's row and its child rows form one directory (user_1, user_2, ...). Spanner groups ranges of directories into splits based on size and load, and each split forms its own Paxos group: one leader tablet plus follower tablets placed on different spanservers, for example Split 1 with its leader on Spanserver A and followers on B and D. Each split = 1 Paxos group = 1 leader + N followers; each tablet = 1 replica of a split on 1 spanserver.)

Why This Hierarchy Matters:
Directories preserve locality: Related data (a user and their orders) stays together, even as splits change.
Splits enable parallelism: Different splits can be served by different servers, allowing horizontal scaling.
Tablets enable distribution: By placing tablets on different spanservers, load is distributed across the fleet.
Paxos groups per split: Each split has its own consensus group, allowing independent commits.
Split Boundaries:
Spanner chooses split boundaries based on data size and observed load, respecting directory boundaries where possible so that a root row and its interleaved children stay together.
Because interleaved tables are stored together in key order, a user and all their related data typically fit in one split. This means operations on a single user's data are always local to one Paxos group—no distributed coordination needed.
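To make the locality argument concrete, here is a minimal Python sketch (not Spanner's storage code; the integer tuples stand in for the real encoded keys) showing that prefixing child keys with the parent key keeps a user's rows contiguous once the key space is sorted:

```python
# Child keys are prefixed with the parent key, so sorting the whole key space
# keeps each user's rows adjacent, forming that user's directory.

rows = [
    ("users", (2,)),                  # users.2
    ("order_items", (1, 1, 1)),       # items.1.1.1
    ("orders", (1, 1)),               # orders.1.1
    ("users", (1,)),                  # users.1
    ("orders", (2, 1)),               # orders.2.1
]

# Sort purely by primary key (the interleaved key space). All rows whose key
# starts with user_id = 1 end up next to each other.
for table, key in sorted(rows, key=lambda r: r[1]):
    print(key, table)

# Output order: users.1, orders.1.1, items.1.1.1, users.2, orders.2.1
```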
Spanner continuously monitors splits and automatically adjusts boundaries to maintain optimal performance.
Split Triggers:
Spanner splits a key range when it grows past a size threshold or when sustained load on the range (CPU, queries per second) exceeds what a single Paxos group should serve.
The Split Process:
Splitting is designed to be transparent and non-disruptive:
Boundary Selection: Spanner selects a split point in the middle of the key range (respecting directory boundaries when possible)
Metadata Update: Split metadata is updated—the old split becomes two new splits with new key ranges
Tablet Creation: New tablets are created on spanservers to manage the new splits
Paxos Reconfiguration: The old Paxos group becomes two new groups, each with its own leader
Traffic Redirection: Clients are seamlessly redirected to the appropriate new split
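The boundary-selection step above can be pictured as a simple walk over a split's directories. The following is an illustrative Python sketch with an assumed 4 GB size threshold and per-directory size accounting, not Spanner's actual control-plane logic:

```python
# Pick roughly the middle of the key range, then snap to a directory boundary
# so a directory never straddles two splits.

SPLIT_SIZE_GB = 4.0  # assumed threshold for illustration

def choose_split_point(directories):
    """directories: ordered list of (start_key, size_gb) for one split."""
    total = sum(size for _, size in directories)
    if total <= SPLIT_SIZE_GB:
        return None  # no split needed

    # Walk directories in key order until we pass ~half the data; the boundary
    # between directories becomes the new split point.
    running = 0.0
    for start_key, size in directories:
        if running >= total / 2:
            return start_key  # first key of the right-hand split
        running += size
    return None  # all the weight sits in the last directory; cannot split here

dirs = [("user_0001", 1.1), ("user_0900", 1.0), ("user_1500", 1.2), ("user_2100", 0.9)]
print(choose_split_point(dirs))  # -> "user_1500"
```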
Zero Downtime:
The split operation doesn't block reads or writes: existing replicas continue serving throughout, and a client whose cached routing metadata is stale simply retries once against the new split's location, adding a few milliseconds to that single request.
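That client-side behavior is essentially a cached-routing lookup with one retry when the cache turns out to be stale. A simplified sketch, using hypothetical names rather than the real client library API:

```python
# The client caches key-range -> server routing and refreshes it once when a
# server answers "MOVED" after a split.

class RoutingCache:
    def __init__(self):
        self.ranges = {("A", "Z"): "spanserver-A"}  # key range -> server

    def lookup(self, key):
        for (lo, hi), server in self.ranges.items():
            if lo <= key <= hi:
                return server
        raise KeyError(key)

    def refresh(self, new_ranges):
        self.ranges = new_ranges

def read(cache, key, send):
    """send(server, key) returns either ("OK", value) or ("MOVED", new_ranges)."""
    server = cache.lookup(key)
    status, payload = send(server, key)
    if status == "MOVED":
        cache.refresh(payload)      # update stale routing metadata
        server = cache.lookup(key)  # re-route to the new split's server
        status, payload = send(server, key)
    return payload

# Fake server: the split already happened, so spanserver-A redirects M-Z reads.
def fake_send(server, key):
    if server == "spanserver-A" and key >= "M":
        return "MOVED", {("A", "L"): "spanserver-A", ("M", "Z"): "spanserver-D"}
    return "OK", f"row-{key}@{server}"

cache = RoutingCache()
print(read(cache, "P", fake_send))  # first call after the split: one retry
print(read(cache, "P", fake_send))  # later calls: routed directly, no retry
```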
(Walkthrough: an automatic split. Split S1 covers keys A-Z and has grown to 4.2 GB, past the 4 GB threshold. Spanner picks key "M" as the split point, roughly the middle of the range and respecting directory boundaries, creates splits S1a (A-L) and S1b (M-Z), forms a Paxos group for each, replicates the new boundary, and updates routing metadata. The first client query that arrives with stale cached metadata gets a "MOVED" response with the new location and retries once, adding roughly 5-10 ms; subsequent queries route directly to the new split. No errors are visible to the application.)

Merge Triggers:
Spanner merges adjacent splits when their combined size and load drop well below the split thresholds, so the overhead of maintaining separate Paxos groups is no longer justified.
Merges are the inverse of splits—two Paxos groups become one, and tablets are consolidated.
Split/Merge Decisions:
Spanner's control plane continuously analyzes these metrics and makes split and merge decisions.
Splitting isn't free—it requires copying data to new replicas and updating metadata. Spanner's control plane balances split benefits against costs, avoiding unnecessary splits that would just be merged later.
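One way to picture that cost/benefit balance is hysteresis: merge thresholds sit well below split thresholds, so a freshly split range is not an immediate merge candidate. A toy Python sketch with assumed threshold values:

```python
SPLIT_SIZE_GB = 4.0    # split when a range grows beyond this
MERGE_SIZE_GB = 1.0    # merge only when neighbors together stay well below it
SPLIT_LOAD_QPS = 10_000
MERGE_LOAD_QPS = 1_000

def decide(split):
    """split: dict with 'size_gb' and 'qps' for one key range."""
    if split["size_gb"] > SPLIT_SIZE_GB or split["qps"] > SPLIT_LOAD_QPS:
        return "SPLIT"
    return "KEEP"

def can_merge(left, right):
    combined_size = left["size_gb"] + right["size_gb"]
    combined_qps = left["qps"] + right["qps"]
    # The merged range must sit comfortably under the split thresholds,
    # otherwise it would just be split again later.
    return combined_size < MERGE_SIZE_GB and combined_qps < MERGE_LOAD_QPS

print(decide({"size_gb": 4.2, "qps": 500}))                    # SPLIT (too big)
print(can_merge({"size_gb": 0.3, "qps": 200},
                {"size_gb": 0.4, "qps": 150}))                 # True
print(can_merge({"size_gb": 2.5, "qps": 200},
                {"size_gb": 2.4, "qps": 150}))                 # False (would re-split)
```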
Beyond splitting key ranges, Spanner actively manages which spanservers host which tablets to balance load across the fleet.
The Placement Problem:
Given a fleet of spanservers with finite CPU, memory, and I/O capacity, and a set of tablets whose load changes over time:
Find an assignment of tablets to spanservers that balances load across the fleet, satisfies replication and fault-tolerance constraints, and keeps leaders and replicas close to the clients that use them.
The Placement Manager:
Spanner's placement manager continuously optimizes tablet placement:
Metric Collection: Each spanserver reports CPU, memory, I/O, and per-tablet metrics
Imbalance Detection: Placement manager identifies servers that are over/under loaded
Move Calculation: For overloaded servers, placement manager calculates which tablets to move and where
Move Execution: Tablets are moved to rebalance load
Monitoring: Results are monitored and adjustments made if needed
| Factor | Goal | Constraint |
|---|---|---|
| CPU Utilization | Balance across servers | No server above 70% sustained |
| Memory Utilization | Balance across servers | No server above 80% memory |
| Disk I/O | Balance reads/writes | No server I/O saturated |
| Network Bandwidth | Minimize cross-zone traffic | Replicas in different zones |
| Leader Locality | Leaders near write sources | Leader in allowed regions |
| Replica Locality | Replicas near read sources | Geographic distribution for DR |
| Fault Tolerance | Survive zone/region failures | No majority in single failure domain |
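A greedy rebalancing pass over these metrics might look like the following toy sketch. It models only the CPU constraint from the table (70%); the real placement manager also weighs memory, I/O, leader locality, and fault domains:

```python
CPU_LIMIT = 0.70

def plan_moves(servers):
    """servers: {name: {tablet_name: cpu_fraction}}; returns a list of moves."""
    moves = []
    load = {s: sum(t.values()) for s, t in servers.items()}
    for hot in sorted(load, key=load.get, reverse=True):
        while load[hot] > CPU_LIMIT and servers[hot]:
            # Move the smallest tablet to the least-loaded server that can
            # absorb it without itself going over the limit.
            tablet = min(servers[hot], key=servers[hot].get)
            cost = servers[hot][tablet]
            target = min(load, key=load.get)
            if target == hot or load[target] + cost > CPU_LIMIT:
                break  # nowhere to put it without overloading someone else
            servers[hot].pop(tablet)
            servers[target][tablet] = cost
            load[hot] -= cost
            load[target] += cost
            moves.append((tablet, hot, target))
    return moves

fleet = {
    "spanserver-A": {"t1": 0.40, "t2": 0.35, "t3": 0.10},  # 85% CPU, overloaded
    "spanserver-B": {"t4": 0.30},
    "spanserver-C": {"t5": 0.20},
}
print(plan_moves(fleet))  # e.g. [('t3', 'spanserver-A', 'spanserver-C'), ...]
```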
Movement Types:
1. Follower Movement (Cheapest): A new follower replica is created on the target spanserver and catches up from the existing replicas; once synchronized, the old follower is dropped. The leader, and therefore write traffic, is untouched.
2. Leader Movement: Paxos leadership is handed to another replica closer to where writes originate. Little or no data is copied, but writes pause briefly during the handoff.
3. Split Redistribution: An entire split's replicas are relocated across the fleet, typically after a split or merge. This is the most expensive move because every replica's data must be copied.
Movement Constraints:
Not all movements are safe or desirable: a move must not leave a majority of a split's replicas in one failure domain, must not push the destination server past its limits, and data copying is throttled so it does not compete with live traffic.
Tablet movement doesn't block live traffic. The existing replica continues serving while the new replica catches up. Once the new replica is synchronized, traffic is redirected and the old replica is decommissioned.
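A rough, self-contained sketch of that catch-up-then-cut-over sequence follows; the classes are illustrative stand-ins, not Spanner APIs:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.applied_index = 0

class PaxosGroup:
    def __init__(self, members, commit_index):
        self.members = members          # list of Replica
        self.commit_index = commit_index

    def catch_up(self, replica):
        # The new replica applies the log until it reaches the commit point.
        # Existing replicas keep serving traffic the whole time.
        while replica.applied_index < self.commit_index:
            replica.applied_index += 1   # stand-in for applying one log entry

    def reconfigure(self, add, remove):
        # Membership changes only after the new replica is synchronized, so the
        # group never drops below its configured replica count.
        self.catch_up(add)
        self.members = [m for m in self.members if m is not remove] + [add]

group = PaxosGroup([Replica("A"), Replica("B"), Replica("C")], commit_index=1000)
old, new = group.members[2], Replica("D")
group.reconfigure(add=new, remove=old)
print([m.name for m in group.members])  # ['A', 'B', 'D']
```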
Hotspots—key ranges that receive disproportionate traffic—are the nemesis of distributed databases. Spanner's automatic sharding includes sophisticated hotspot detection and mitigation.
Types of Hotspots:
1. Write Hotspots: Monotonically increasing keys (timestamps, sequential IDs) funnel all new writes into the split at the end of the key range.
2. Read Hotspots: A single popular item, such as a viral post or a flash-sale product, draws a disproportionate share of reads to one split.
3. Lock Contention Hotspots: Many concurrent transactions update the same rows, so they queue on the same locks even when overall throughput is modest.
Hotspot Detection:
Spanner monitors multiple signals for each split: CPU utilization, queries per second, tail latency, and lock wait rates, each compared against that split's normal baseline.
When metrics exceed thresholds, the split is flagged as a hotspot.
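A simple way to express that kind of check is to compare each split's current metrics against its own baseline and require several signals to fire at once. The multiplier and the two-signal rule below are assumptions for illustration:

```python
HOT_FACTOR = 5.0  # "hot" = at least 5x the split's normal level (assumption)

def is_hotspot(baseline, current, factor=HOT_FACTOR):
    """baseline/current: dicts of the same metrics (qps, cpu, p99_ms, lock_waits)."""
    hot_signals = [
        name for name, normal in baseline.items()
        if normal > 0 and current.get(name, 0) / normal >= factor
    ]
    # Require more than one anomalous signal to avoid reacting to noise.
    return len(hot_signals) >= 2, hot_signals

baseline = {"qps": 500, "cpu": 0.30, "p99_ms": 25, "lock_waits": 10}
flash_sale = {"qps": 15_000, "cpu": 0.95, "p99_ms": 450, "lock_waits": 2_100}
print(is_hotspot(baseline, flash_sale))  # (True, ['qps', 'p99_ms', 'lock_waits'])
```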
(Walkthrough: a flash sale on one product turns its split into a write hotspot. Within about 30 seconds, monitoring sees CPU, QPS, tail latency, and lock waits far above the split's baseline while adjacent splits stay normal. Splitting the range does not help when a single key is hot, and read replicas only offload reads, so Spanner's automatic response is limited: alert the operations team, move the leader to a higher-capacity server, and pre-emptively split adjacent ranges. Lasting relief for a single hot key has to come from the application: counter sharding, write batching, stale reads for non-critical paths, or queue-based processing.)

Prevention Through Schema Design:
The best hotspot mitigation is prevention. Schema design choices that prevent hotspots:
1. UUID or Random Keys:
-- Instead of (monotonically increasing keys):
CREATE TABLE orders (order_id INT64 NOT NULL, ...) PRIMARY KEY (order_id);
-- Hotspot: all new orders land in the highest key range

-- Use a UUID key:
CREATE TABLE orders (order_id STRING(36) NOT NULL, ...) PRIMARY KEY (order_id);
-- Orders distribute across the entire key space
2. Key Prefixing for Distribution:
-- Instead of:
-- PRIMARY KEY (timestamp) -- all recent data in one range
-- Use:
-- PRIMARY KEY (shard_id, timestamp)
-- where shard_id = hash(some_attribute) % num_shards
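A small sketch of how an application might derive that shard_id; the attribute name and shard count are illustrative, and any stable, deterministic hash works:

```python
import zlib

NUM_SHARDS = 16  # illustrative

def shard_id(attribute: str, num_shards: int = NUM_SHARDS) -> int:
    # crc32 is deterministic across processes, unlike Python's salted hash().
    return zlib.crc32(attribute.encode("utf-8")) % num_shards

def make_key(device_id: str, timestamp_micros: int):
    # Composite primary key (shard_id, timestamp): recent writes spread across
    # up to NUM_SHARDS key ranges instead of one hot range.
    return (shard_id(device_id), timestamp_micros)

print(make_key("device-42", 1_700_000_000_000_000))
print(make_key("device-77", 1_700_000_000_000_001))
```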
3. Separate Hot and Cold Data:
-- Hot table: recent/active data, smaller, more splits
CREATE TABLE active_sessions (...);
-- Cold table: historical data, larger, fewer splits
CREATE TABLE archived_sessions (...);
While Spanner handles most sharding automatically, true single-key hotspots (like a global counter) hit fundamental limits. If all requests must coordinate through one key, no amount of sharding helps. Designing for distributed access patterns is still essential.
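One of the application-level patterns mentioned earlier, counter sharding, works around this limit by splitting one logical counter across several rows. A Python sketch; the in-memory list stands in for rows that would be keyed by (product_id, shard_index) in the database:

```python
import random

NUM_COUNTER_SHARDS = 5

class ShardedCounter:
    def __init__(self, num_shards=NUM_COUNTER_SHARDS):
        # Each shard is an independent row, so concurrent updates rarely
        # contend on the same key.
        self.shards = [0] * num_shards

    def add(self, delta):
        # Pick a random shard for the write; any shard works because the
        # total is only ever read as a sum.
        self.shards[random.randrange(len(self.shards))] += delta

    def total(self):
        # Reads aggregate all shards (in SQL: SELECT SUM(count) ... for one product).
        return sum(self.shards)

inventory = ShardedCounter()
for _ in range(1000):
    inventory.add(1)
print(inventory.total())  # 1000, spread across 5 rows instead of one hot row
```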
While Spanner automatically manages split boundaries and tablet placement, it also provides fine-grained control over replication at the directory level.
Placement Groups:
Spanner allows defining placement groups—sets of databases or tables that share replication configuration:
-- All user data replicated to EU for GDPR
ALTER TABLE users SET OPTIONS (
placement_group = 'eu_primary'
);
-- Analytics data replicated globally for low-latency reads
ALTER TABLE analytics SET OPTIONS (
placement_group = 'global_read_optimized'
);
Per-Row Replication:
For advanced use cases, Spanner can even vary replication at the row level:
-- European users' data stays in EU
-- American users' data stays in Americas
-- Based on user.region column
This enables:
| Use Case | Configuration | Benefit |
|---|---|---|
| GDPR Compliance | EU users → EU-only replicas | Data never leaves EU jurisdiction |
| Global Low-Latency | Global replicas + follower reads | Sub-50ms reads from any continent |
| Cost Optimization | Single-region for non-critical data | Lower replication costs |
| Analytics Offload | Read replicas in analytics region | Analytics doesn't impact production |
| Disaster Recovery | Multi-region with geo-diverse follower | Survives regional disasters |
Movedir and Directory Movement:
Spanner's internal movedir operation handles directory movement: it is the background task that moves directories between Paxos groups and is also used when adding or removing replicas. Rather than blocking in one large transaction, movedir copies the directory's data in the background while it continues to serve traffic, then uses a short transaction to move the small amount of data written during the copy and atomically update the metadata.
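A conceptual, self-contained sketch of that two-phase move; the classes and the prefix-based directory model are illustrative stand-ins, not Spanner internals:

```python
class Group:
    def __init__(self, name):
        self.name = name
        self.rows = {}        # key -> value for directories this group owns
        self.log = []         # (key, value) writes, in order

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))

def movedir(directory_prefix, source, target, routing):
    # Phase 1: background bulk copy; the source keeps accepting writes.
    copied_upto = len(source.log)
    for key, value in source.rows.items():
        if key.startswith(directory_prefix):
            target.rows[key] = value

    # Phase 2: one short, atomic step - copy the delta written since the bulk
    # copy started, drop the source's copy, and flip the routing metadata so
    # clients never see the directory in two places at once.
    for key, value in source.log[copied_upto:]:
        if key.startswith(directory_prefix):
            target.rows[key] = value
    for key in [k for k in source.rows if k.startswith(directory_prefix)]:
        del source.rows[key]
    routing[directory_prefix] = target.name

src, dst, routing = Group("group-1"), Group("group-2"), {"user_42/": "group-1"}
src.write("user_42/orders/1", "pending")
movedir("user_42/", src, dst, routing)
print(routing, dst.rows)  # {'user_42/': 'group-2'} {'user_42/orders/1': 'pending'}
```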
Automated Locality Optimization:
Spanner can automatically optimize directory placement based on access patterns: if a directory is consistently read and written from one region, it can be moved, and its Paxos leader shifted, closer to those clients.
This optimization is continuous—as access patterns change, data placement adapts.
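A toy sketch of what access-pattern-driven placement could look like: track where a directory's recent accesses come from and propose a move when another region dominates. The window size and dominance threshold are assumptions:

```python
from collections import Counter, deque

DOMINANCE = 0.80       # a region must account for 80% of recent accesses
WINDOW = 1000          # look at the last 1000 accesses

class DirectoryStats:
    def __init__(self, home_region):
        self.home_region = home_region
        self.recent = deque(maxlen=WINDOW)

    def record_access(self, client_region):
        self.recent.append(client_region)

    def proposed_move(self):
        if not self.recent:
            return None
        region, count = Counter(self.recent).most_common(1)[0]
        if region != self.home_region and count / len(self.recent) >= DOMINANCE:
            return region   # candidate new home for this directory
        return None

stats = DirectoryStats(home_region="us-east1")
for _ in range(900):
    stats.record_access("europe-west1")
for _ in range(100):
    stats.record_access("us-east1")
print(stats.proposed_move())  # europe-west1
```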
Cloud Spanner provides dashboards showing split distribution, hotspots, and placements. While the system manages this automatically, monitoring helps you identify schema design improvements and workload changes that might benefit from configuration adjustments.
Google's operational experience with Spanner's automatic sharding provides insights into real-world behavior at massive scale.
Scale Characteristics: In production, each spanserver typically manages on the order of hundreds to a thousand tablets, and a deployment can span thousands of machines, so splits, merges, and moves are always happening somewhere in the fleet.
Emergent Behaviors:
At scale, interesting patterns emerge:
1. Diurnal Patterns: Load follows daily user-activity cycles, so hot splits and leader placement shift around the globe as different regions wake up.
2. Event-Driven Scaling: Launches, sales, and breaking news trigger bursts of splitting on the affected key ranges, followed by merges once traffic subsides.
3. Long-Tail Key Ranges: A small fraction of splits carries most of the traffic while the long tail sees little load, which is exactly the skew the placement manager exists to absorb.
What Spanner Doesn't Do Automatically:
While Spanner automates much of sharding, some decisions remain with operators:
Schema Design: Spanner can't fix a poorly designed schema. Key selection and interleaving decisions have lasting impact.
Replication Configuration: Multi-region vs. single-region, leader placement preferences—these are business decisions.
Capacity Planning: Spanner scales within provisioned resources. You must provision sufficient nodes.
Application-Level Sharding: If you need to shard at the application level (multi-tenancy, separate databases), that's your design.
Hotspot Prevention via Design: The best hotspot handling is prevention through good schema design.
While automatic sharding reduces operational burden dramatically, understanding the system's behavior helps you make better design decisions. The best Spanner operators understand sharding even though they rarely manage it manually.
We've explored how Spanner transforms sharding from an operational burden into an automated system behavior. The key insights: data is organized into directories, splits, and tablets, with each split backed by its own Paxos group; splits and merges are driven by size and load and happen without downtime; the placement manager continuously rebalances tablets and leaders across the fleet; and hotspots are detected and mitigated automatically, though single-key hotspots and schema design remain the application's responsibility.
What's Next:
We've covered Spanner's architecture—global distribution, TrueTime, transactions, and sharding. The final page brings it all together: When to Use Spanner. We'll explore the use cases where Spanner shines, when alternatives might be better, and how to evaluate Spanner for your specific requirements.
You now understand Spanner's automatic sharding—splits, tablets, directories, hotspot handling, and load balancing. Next, we'll synthesize everything into practical guidance on when Spanner is the right choice.