Ask any engineer who has managed a sharded database about their experience, and you'll hear war stories: resharding projects that took months, application changes needed for every shard topology change, data migration scripts that failed at 3 AM, and the constant mental overhead of remembering which data lives where.
Manual sharding, dividing a database into pieces that different servers manage, works, but it imposes a tremendous operational burden: every topology change requires application changes, every rebalancing requires a data migration, and every migration is a risk.
Google Spanner eliminates this operational overhead through automatic sharding. The database handles data partitioning, movement, and rebalancing transparently. Applications interact with a single logical database while Spanner manages the distributed reality underneath.
This isn't just convenience—it's a fundamental shift in database operations. Engineers focus on application logic while Spanner ensures data is optimally distributed across a possibly massive fleet of servers.
By the end of this page, you will understand how Spanner automatically partitions data, how it detects and resolves hotspots, how data moves between servers without downtime, and how this automation enables truly hands-off scalability.
To understand automatic sharding, we need to understand Spanner's data organization hierarchy.
Key Ranges and Splits:
All data in Spanner is stored in primary key order. A split is a contiguous range of the key space—for example, all rows with primary keys from "A" to "M". Splits are the fundamental unit of sharding.
Tablets:
A tablet is the physical manifestation of a split on a single spanserver. Each tablet stores its key range's rows (along with their timestamped versions) and a write-ahead log in files on Colossus, Google's distributed file system, and serves as one replica in the split's Paxos group.
Directories:
A directory is a set of contiguous rows that share a common key prefix: typically a root row and all of its interleaved descendants, such as one user together with that user's orders and order items.
Directories are the unit of data movement. When Spanner moves data for load balancing or failure recovery, it moves entire directories.
The Hierarchy:
(Diagram: Spanner data organization hierarchy. An example ecommerce database defines a users table, with orders interleaved in users and order_items interleaved in orders. The interleaved key space is sorted by primary key, so each user's row and its child rows form one directory (user_1, user_2, ...). Spanner groups ranges of directories into splits based on size and load, and each split forms its own Paxos group: one leader tablet plus follower tablets placed on different spanservers, for example Split 1 with its leader on Spanserver A and followers on B and D. Each split = 1 Paxos group = 1 leader + N followers; each tablet = 1 replica of a split on 1 spanserver.)

Why This Hierarchy Matters:
Directories preserve locality: Related data (a user and their orders) stays together, even as splits change.
Splits enable parallelism: Different splits can be served by different servers, allowing horizontal scaling.
Tablets enable distribution: By placing tablets on different spanservers, load is distributed across the fleet.
Paxos groups per split: Each split has its own consensus group, allowing independent commits.
Split Boundaries:
Spanner chooses split boundaries based on data size and observed load, respecting directory boundaries where possible so that a root row and its interleaved children stay together.
Because interleaved tables are stored together in key order, a user and all their related data typically fit in one split. This means operations on a single user's data are always local to one Paxos group—no distributed coordination needed.
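To make the locality argument concrete, here is a minimal Python sketch (not Spanner's storage code; the integer tuples stand in for the real encoded keys) showing that prefixing child keys with the parent key keeps a user's rows contiguous once the key space is sorted:

```python
# Child keys are prefixed with the parent key, so sorting the whole key space
# keeps each user's rows adjacent, forming that user's directory.

rows = [
    ("users", (2,)),                  # users.2
    ("order_items", (1, 1, 1)),       # items.1.1.1
    ("orders", (1, 1)),               # orders.1.1
    ("users", (1,)),                  # users.1
    ("orders", (2, 1)),               # orders.2.1
]

# Sort purely by primary key (the interleaved key space). All rows whose key
# starts with user_id = 1 end up next to each other.
for table, key in sorted(rows, key=lambda r: r[1]):
    print(key, table)

# Output order: users.1, orders.1.1, items.1.1.1, users.2, orders.2.1
```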
Spanner continuously monitors splits and automatically adjusts boundaries to maintain optimal performance.
Split Triggers:
Spanner splits a key range when it grows past a size threshold or when sustained load on the range (CPU, queries per second) exceeds what a single Paxos group should serve.
The Split Process:
Splitting is designed to be transparent and non-disruptive:
Boundary Selection: Spanner selects a split point in the middle of the key range (respecting directory boundaries when possible)
Metadata Update: Split metadata is updated—the old split becomes two new splits with new key ranges
Tablet Creation: New tablets are created on spanservers to manage the new splits
Paxos Reconfiguration: The old Paxos group becomes two new groups, each with its own leader
Traffic Redirection: Clients are seamlessly redirected to the appropriate new split
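The boundary-selection step above can be pictured as a simple walk over a split's directories. The following is an illustrative Python sketch with an assumed 4 GB size threshold and per-directory size accounting, not Spanner's actual control-plane logic:

```python
# Pick roughly the middle of the key range, then snap to a directory boundary
# so a directory never straddles two splits.

SPLIT_SIZE_GB = 4.0  # assumed threshold for illustration

def choose_split_point(directories):
    """directories: ordered list of (start_key, size_gb) for one split."""
    total = sum(size for _, size in directories)
    if total <= SPLIT_SIZE_GB:
        return None  # no split needed

    # Walk directories in key order until we pass ~half the data; the boundary
    # between directories becomes the new split point.
    running = 0.0
    for start_key, size in directories:
        if running >= total / 2:
            return start_key  # first key of the right-hand split
        running += size
    return None  # all the weight sits in the last directory; cannot split here

dirs = [("user_0001", 1.1), ("user_0900", 1.0), ("user_1500", 1.2), ("user_2100", 0.9)]
print(choose_split_point(dirs))  # -> "user_1500"
```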
Zero Downtime:
The split operation doesn't block reads or writes: existing replicas continue serving throughout, and a client whose cached routing metadata is stale simply retries once against the new split's location, adding a few milliseconds to that single request.
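That client-side behavior is essentially a cached-routing lookup with one retry when the cache turns out to be stale. A simplified sketch, using hypothetical names rather than the real client library API:

```python
# The client caches key-range -> server routing and refreshes it once when a
# server answers "MOVED" after a split.

class RoutingCache:
    def __init__(self):
        self.ranges = {("A", "Z"): "spanserver-A"}  # key range -> server

    def lookup(self, key):
        for (lo, hi), server in self.ranges.items():
            if lo <= key <= hi:
                return server
        raise KeyError(key)

    def refresh(self, new_ranges):
        self.ranges = new_ranges

def read(cache, key, send):
    """send(server, key) returns either ("OK", value) or ("MOVED", new_ranges)."""
    server = cache.lookup(key)
    status, payload = send(server, key)
    if status == "MOVED":
        cache.refresh(payload)      # update stale routing metadata
        server = cache.lookup(key)  # re-route to the new split's server
        status, payload = send(server, key)
    return payload

# Fake server: the split already happened, so spanserver-A redirects M-Z reads.
def fake_send(server, key):
    if server == "spanserver-A" and key >= "M":
        return "MOVED", {("A", "L"): "spanserver-A", ("M", "Z"): "spanserver-D"}
    return "OK", f"row-{key}@{server}"

cache = RoutingCache()
print(read(cache, "P", fake_send))  # first call after the split: one retry
print(read(cache, "P", fake_send))  # later calls: routed directly, no retry
```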
(Walkthrough: an automatic split. Split S1 covers keys A-Z and has grown to 4.2 GB, past the 4 GB threshold. Spanner picks key "M" as the split point, roughly the middle of the range and respecting directory boundaries, creates splits S1a (A-L) and S1b (M-Z), forms a Paxos group for each, replicates the new boundary, and updates routing metadata. The first client query that arrives with stale cached metadata gets a "MOVED" response with the new location and retries once, adding roughly 5-10 ms; subsequent queries route directly to the new split. No errors are visible to the application.)

Merge Triggers:
Spanner merges adjacent splits when their combined size and load drop well below the split thresholds, so the overhead of maintaining separate Paxos groups is no longer justified.
Merges are the inverse of splits—two Paxos groups become one, and tablets are consolidated.
Split/Merge Decisions:
Spanner's control plane continuously analyzes these metrics and makes split and merge decisions.
Splitting isn't free—it requires copying data to new replicas and updating metadata. Spanner's control plane balances split benefits against costs, avoiding unnecessary splits that would just be merged later.
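One way to picture that cost/benefit balance is hysteresis: merge thresholds sit well below split thresholds, so a freshly split range is not an immediate merge candidate. A toy Python sketch with assumed threshold values:

```python
SPLIT_SIZE_GB = 4.0    # split when a range grows beyond this
MERGE_SIZE_GB = 1.0    # merge only when neighbors together stay well below it
SPLIT_LOAD_QPS = 10_000
MERGE_LOAD_QPS = 1_000

def decide(split):
    """split: dict with 'size_gb' and 'qps' for one key range."""
    if split["size_gb"] > SPLIT_SIZE_GB or split["qps"] > SPLIT_LOAD_QPS:
        return "SPLIT"
    return "KEEP"

def can_merge(left, right):
    combined_size = left["size_gb"] + right["size_gb"]
    combined_qps = left["qps"] + right["qps"]
    # The merged range must sit comfortably under the split thresholds,
    # otherwise it would just be split again later.
    return combined_size < MERGE_SIZE_GB and combined_qps < MERGE_LOAD_QPS

print(decide({"size_gb": 4.2, "qps": 500}))                    # SPLIT (too big)
print(can_merge({"size_gb": 0.3, "qps": 200},
                {"size_gb": 0.4, "qps": 150}))                 # True
print(can_merge({"size_gb": 2.5, "qps": 200},
                {"size_gb": 2.4, "qps": 150}))                 # False (would re-split)
```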
Beyond splitting key ranges, Spanner actively manages which spanservers host which tablets to balance load across the fleet.
The Placement Problem:
Given a fleet of spanservers with finite CPU, memory, and I/O capacity, and a set of tablets whose load changes over time:
Find an assignment of tablets to spanservers that balances load across the fleet, satisfies replication and fault-tolerance constraints, and keeps leaders and replicas close to the clients that use them.
The Placement Manager:
Spanner's placement manager continuously optimizes tablet placement:
Metric Collection: Each spanserver reports CPU, memory, I/O, and per-tablet metrics
Imbalance Detection: Placement manager identifies servers that are over/under loaded
Move Calculation: For overloaded servers, placement manager calculates which tablets to move and where
Move Execution: Tablets are moved to rebalance load
Monitoring: Results are monitored and adjustments made if needed
| Factor | Goal | Constraint |
|---|---|---|
| CPU Utilization | Balance across servers | No server above 70% sustained |
| Memory Utilization | Balance across servers | No server above 80% memory |
| Disk I/O | Balance reads/writes | No server I/O saturated |
| Network Bandwidth | Minimize cross-zone traffic | Replicas in different zones |
| Leader Locality | Leaders near write sources | Leader in allowed regions |
| Replica Locality | Replicas near read sources | Geographic distribution for DR |
| Fault Tolerance | Survive zone/region failures | No majority in single failure domain |
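A greedy rebalancing pass over these metrics might look like the following toy sketch. It models only the CPU constraint from the table (70%); the real placement manager also weighs memory, I/O, leader locality, and fault domains:

```python
CPU_LIMIT = 0.70

def plan_moves(servers):
    """servers: {name: {tablet_name: cpu_fraction}}; returns a list of moves."""
    moves = []
    load = {s: sum(t.values()) for s, t in servers.items()}
    for hot in sorted(load, key=load.get, reverse=True):
        while load[hot] > CPU_LIMIT and servers[hot]:
            # Move the smallest tablet to the least-loaded server that can
            # absorb it without itself going over the limit.
            tablet = min(servers[hot], key=servers[hot].get)
            cost = servers[hot][tablet]
            target = min(load, key=load.get)
            if target == hot or load[target] + cost > CPU_LIMIT:
                break  # nowhere to put it without overloading someone else
            servers[hot].pop(tablet)
            servers[target][tablet] = cost
            load[hot] -= cost
            load[target] += cost
            moves.append((tablet, hot, target))
    return moves

fleet = {
    "spanserver-A": {"t1": 0.40, "t2": 0.35, "t3": 0.10},  # 85% CPU, overloaded
    "spanserver-B": {"t4": 0.30},
    "spanserver-C": {"t5": 0.20},
}
print(plan_moves(fleet))  # e.g. [('t3', 'spanserver-A', 'spanserver-C'), ...]
```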
Movement Types:
1. Follower Movement (Cheapest): A new follower replica is created on the target spanserver and catches up from the existing replicas; once synchronized, the old follower is dropped. The leader, and therefore write traffic, is untouched.
2. Leader Movement: Paxos leadership is handed to another replica closer to where writes originate. Little or no data is copied, but writes pause briefly during the handoff.
3. Split Redistribution: An entire split's replicas are relocated across the fleet, typically after a split or merge. This is the most expensive move because every replica's data must be copied.
Movement Constraints:
Not all movements are safe or desirable: a move must not leave a majority of a split's replicas in one failure domain, must not push the destination server past its limits, and data copying is throttled so it does not compete with live traffic.
Tablet movement doesn't block live traffic. The existing replica continues serving while the new replica catches up. Once the new replica is synchronized, traffic is redirected and the old replica is decommissioned.
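A rough, self-contained sketch of that catch-up-then-cut-over sequence follows; the classes are illustrative stand-ins, not Spanner APIs:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.applied_index = 0

class PaxosGroup:
    def __init__(self, members, commit_index):
        self.members = members          # list of Replica
        self.commit_index = commit_index

    def catch_up(self, replica):
        # The new replica applies the log until it reaches the commit point.
        # Existing replicas keep serving traffic the whole time.
        while replica.applied_index < self.commit_index:
            replica.applied_index += 1   # stand-in for applying one log entry

    def reconfigure(self, add, remove):
        # Membership changes only after the new replica is synchronized, so the
        # group never drops below its configured replica count.
        self.catch_up(add)
        self.members = [m for m in self.members if m is not remove] + [add]

group = PaxosGroup([Replica("A"), Replica("B"), Replica("C")], commit_index=1000)
old, new = group.members[2], Replica("D")
group.reconfigure(add=new, remove=old)
print([m.name for m in group.members])  # ['A', 'B', 'D']
```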
Hotspots—key ranges that receive disproportionate traffic—are the nemesis of distributed databases. Spanner's automatic sharding includes sophisticated hotspot detection and mitigation.
Types of Hotspots:
1. Write Hotspots: Monotonically increasing keys (timestamps, sequential IDs) funnel all new writes into the split at the end of the key range.
2. Read Hotspots: A single popular item, such as a viral post or a flash-sale product, draws a disproportionate share of reads to one split.
3. Lock Contention Hotspots: Many concurrent transactions update the same rows, so they queue on the same locks even when overall throughput is modest.
Hotspot Detection:
Spanner monitors multiple signals for each split: CPU utilization, queries per second, tail latency, and lock wait rates, each compared against that split's normal baseline.
When metrics exceed thresholds, the split is flagged as a hotspot.
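A simple way to express that kind of check is to compare each split's current metrics against its own baseline and require several signals to fire at once. The multiplier and the two-signal rule below are assumptions for illustration:

```python
HOT_FACTOR = 5.0  # "hot" = at least 5x the split's normal level (assumption)

def is_hotspot(baseline, current, factor=HOT_FACTOR):
    """baseline/current: dicts of the same metrics (qps, cpu, p99_ms, lock_waits)."""
    hot_signals = [
        name for name, normal in baseline.items()
        if normal > 0 and current.get(name, 0) / normal >= factor
    ]
    # Require more than one anomalous signal to avoid reacting to noise.
    return len(hot_signals) >= 2, hot_signals

baseline = {"qps": 500, "cpu": 0.30, "p99_ms": 25, "lock_waits": 10}
flash_sale = {"qps": 15_000, "cpu": 0.95, "p99_ms": 450, "lock_waits": 2_100}
print(is_hotspot(baseline, flash_sale))  # (True, ['qps', 'p99_ms', 'lock_waits'])
```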
(Walkthrough: a flash sale on one product turns its split into a write hotspot. Within about 30 seconds, monitoring sees CPU, QPS, tail latency, and lock waits far above the split's baseline while adjacent splits stay normal. Splitting the range does not help when a single key is hot, and read replicas only offload reads, so Spanner's automatic response is limited: alert the operations team, move the leader to a higher-capacity server, and pre-emptively split adjacent ranges. Lasting relief for a single hot key has to come from the application: counter sharding, write batching, stale reads for non-critical paths, or queue-based processing.)

Prevention Through Schema Design:
The best hotspot mitigation is prevention. Schema design choices that prevent hotspots:
1. UUID or Random Keys:
-- Instead of (monotonically increasing keys):
CREATE TABLE orders (order_id INT64 NOT NULL, ...) PRIMARY KEY (order_id);
-- Hotspot: all new orders land in the highest key range

-- Use a UUID key:
CREATE TABLE orders (order_id STRING(36) NOT NULL, ...) PRIMARY KEY (order_id);
-- Orders distribute across the entire key space
2. Key Prefixing for Distribution:
-- Instead of:
-- PRIMARY KEY (timestamp) -- all recent data in one range
-- Use:
-- PRIMARY KEY (shard_id, timestamp)
-- where shard_id = hash(some_attribute) % num_shards
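A small sketch of how an application might derive that shard_id; the attribute name and shard count are illustrative, and any stable, deterministic hash works:

```python
import zlib

NUM_SHARDS = 16  # illustrative

def shard_id(attribute: str, num_shards: int = NUM_SHARDS) -> int:
    # crc32 is deterministic across processes, unlike Python's salted hash().
    return zlib.crc32(attribute.encode("utf-8")) % num_shards

def make_key(device_id: str, timestamp_micros: int):
    # Composite primary key (shard_id, timestamp): recent writes spread across
    # up to NUM_SHARDS key ranges instead of one hot range.
    return (shard_id(device_id), timestamp_micros)

print(make_key("device-42", 1_700_000_000_000_000))
print(make_key("device-77", 1_700_000_000_000_001))
```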
3. Separate Hot and Cold Data:
-- Hot table: recent/active data, smaller, more splits
CREATE TABLE active_sessions (...);
-- Cold table: historical data, larger, fewer splits
CREATE TABLE archived_sessions (...);
While Spanner handles most sharding automatically, true single-key hotspots (like a global counter) hit fundamental limits. If all requests must coordinate through one key, no amount of sharding helps. Designing for distributed access patterns is still essential.
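One of the application-level patterns mentioned earlier, counter sharding, works around this limit by splitting one logical counter across several rows. A Python sketch; the in-memory list stands in for rows that would be keyed by (product_id, shard_index) in the database:

```python
import random

NUM_COUNTER_SHARDS = 5

class ShardedCounter:
    def __init__(self, num_shards=NUM_COUNTER_SHARDS):
        # Each shard is an independent row, so concurrent updates rarely
        # contend on the same key.
        self.shards = [0] * num_shards

    def add(self, delta):
        # Pick a random shard for the write; any shard works because the
        # total is only ever read as a sum.
        self.shards[random.randrange(len(self.shards))] += delta

    def total(self):
        # Reads aggregate all shards (in SQL: SELECT SUM(count) ... for one product).
        return sum(self.shards)

inventory = ShardedCounter()
for _ in range(1000):
    inventory.add(1)
print(inventory.total())  # 1000, spread across 5 rows instead of one hot row
```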
While Spanner automatically manages split boundaries and tablet placement, it also provides fine-grained control over replication at the directory level.
Placement Groups:
Spanner allows defining placement groups—sets of databases or tables that share replication configuration:
-- All user data replicated to EU for GDPR
ALTER TABLE users SET OPTIONS (
placement_group = 'eu_primary'
);
-- Analytics data replicated globally for low-latency reads
ALTER TABLE analytics SET OPTIONS (
placement_group = 'global_read_optimized'
);
Per-Row Replication:
For advanced use cases, Spanner can even vary replication at the row level:
-- European users' data stays in EU
-- American users' data stays in Americas
-- Based on user.region column
This enables:
| Use Case | Configuration | Benefit |
|---|---|---|
| GDPR Compliance | EU users → EU-only replicas | Data never leaves EU jurisdiction |
| Global Low-Latency | Global replicas + follower reads | Sub-50ms reads from any continent |
| Cost Optimization | Single-region for non-critical data | Lower replication costs |
| Analytics Offload | Read replicas in analytics region | Analytics doesn't impact production |
| Disaster Recovery | Multi-region with geo-diverse follower | Survives regional disasters |
Movedir and Directory Movement:
Spanner's internal movedir operation handles directory movement: it is the background task that moves directories between Paxos groups and is also used when adding or removing replicas. Rather than blocking in one large transaction, movedir copies the directory's data in the background while it continues to serve traffic, then uses a short transaction to move the small amount of data written during the copy and atomically update the metadata.
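A conceptual, self-contained sketch of that two-phase move; the classes and the prefix-based directory model are illustrative stand-ins, not Spanner internals:

```python
class Group:
    def __init__(self, name):
        self.name = name
        self.rows = {}        # key -> value for directories this group owns
        self.log = []         # (key, value) writes, in order

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))

def movedir(directory_prefix, source, target, routing):
    # Phase 1: background bulk copy; the source keeps accepting writes.
    copied_upto = len(source.log)
    for key, value in source.rows.items():
        if key.startswith(directory_prefix):
            target.rows[key] = value

    # Phase 2: one short, atomic step - copy the delta written since the bulk
    # copy started, drop the source's copy, and flip the routing metadata so
    # clients never see the directory in two places at once.
    for key, value in source.log[copied_upto:]:
        if key.startswith(directory_prefix):
            target.rows[key] = value
    for key in [k for k in source.rows if k.startswith(directory_prefix)]:
        del source.rows[key]
    routing[directory_prefix] = target.name

src, dst, routing = Group("group-1"), Group("group-2"), {"user_42/": "group-1"}
src.write("user_42/orders/1", "pending")
movedir("user_42/", src, dst, routing)
print(routing, dst.rows)  # {'user_42/': 'group-2'} {'user_42/orders/1': 'pending'}
```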
Automated Locality Optimization:
Spanner can automatically optimize directory placement based on access patterns: if a directory is consistently read and written from one region, it can be moved, and its Paxos leader shifted, closer to those clients.
This optimization is continuous—as access patterns change, data placement adapts.
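A toy sketch of what access-pattern-driven placement could look like: track where a directory's recent accesses come from and propose a move when another region dominates. The window size and dominance threshold are assumptions:

```python
from collections import Counter, deque

DOMINANCE = 0.80       # a region must account for 80% of recent accesses
WINDOW = 1000          # look at the last 1000 accesses

class DirectoryStats:
    def __init__(self, home_region):
        self.home_region = home_region
        self.recent = deque(maxlen=WINDOW)

    def record_access(self, client_region):
        self.recent.append(client_region)

    def proposed_move(self):
        if not self.recent:
            return None
        region, count = Counter(self.recent).most_common(1)[0]
        if region != self.home_region and count / len(self.recent) >= DOMINANCE:
            return region   # candidate new home for this directory
        return None

stats = DirectoryStats(home_region="us-east1")
for _ in range(900):
    stats.record_access("europe-west1")
for _ in range(100):
    stats.record_access("us-east1")
print(stats.proposed_move())  # europe-west1
```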
Cloud Spanner provides dashboards showing split distribution, hotspots, and placements. While the system manages this automatically, monitoring helps you identify schema design improvements and workload changes that might benefit from configuration adjustments.
Google's operational experience with Spanner's automatic sharding provides insights into real-world behavior at massive scale.
Scale Characteristics: In production, each spanserver typically manages on the order of hundreds to a thousand tablets, and a deployment can span thousands of machines, so splits, merges, and moves are always happening somewhere in the fleet.
Emergent Behaviors:
At scale, interesting patterns emerge:
1. Diurnal Patterns: Load follows daily user-activity cycles, so hot splits and leader placement shift around the globe as different regions wake up.
2. Event-Driven Scaling: Launches, sales, and breaking news trigger bursts of splitting on the affected key ranges, followed by merges once traffic subsides.
3. Long-Tail Key Ranges: A small fraction of splits carries most of the traffic while the long tail sees little load, which is exactly the skew the placement manager exists to absorb.
What Spanner Doesn't Do Automatically:
While Spanner automates much of sharding, some decisions remain with operators:
Schema Design: Spanner can't fix a poorly designed schema. Key selection and interleaving decisions have lasting impact.
Replication Configuration: Multi-region vs. single-region, leader placement preferences—these are business decisions.
Capacity Planning: Spanner scales within provisioned resources. You must provision sufficient nodes.
Application-Level Sharding: If you need to shard at the application level (multi-tenancy, separate databases), that's your design.
Hotspot Prevention via Design: The best hotspot handling is prevention through good schema design.
While automatic sharding reduces operational burden dramatically, understanding the system's behavior helps you make better design decisions. The best Spanner operators understand sharding even though they rarely manage it manually.
We've explored how Spanner transforms sharding from an operational burden into an automated system behavior. The key insights: data is organized into directories, splits, and tablets, with each split backed by its own Paxos group; splits and merges are driven by size and load and happen without downtime; the placement manager continuously rebalances tablets and leaders across the fleet; and hotspots are detected and mitigated automatically, though single-key hotspots and schema design remain the application's responsibility.
What's Next:
We've covered Spanner's architecture—global distribution, TrueTime, transactions, and sharding. The final page brings it all together: When to Use Spanner. We'll explore the use cases where Spanner shines, when alternatives might be better, and how to evaluate Spanner for your specific requirements.
You now understand Spanner's automatic sharding—splits, tablets, directories, hotspot handling, and load balancing. Next, we'll synthesize everything into practical guidance on when Spanner is the right choice.