In 2012, Google unveiled something that distributed systems researchers had considered theoretically impossible: a globally distributed database that provided strong consistency, SQL semantics, and horizontal scalability—all simultaneously. That database was Google Spanner.
Traditional distributed database wisdom held that you had to choose between consistency and availability (the CAP theorem), between SQL and scale (the NoSQL movement), between geographic distribution and transactional guarantees. Spanner broke these tradeoffs not by disproving the theorems, but by engineering around them in ways that shifted the boundaries of what was achievable.
Today, Google Spanner serves as the backbone for some of the world's most demanding applications: Google Ads, Google Play, Google Cloud's internal infrastructure, and thousands of external deployments. It processes millions of transactions per second across continents, maintaining the illusion that the entire planet is a single, coherent database.
By the end of this page, you will understand how Spanner achieves planetary-scale distribution, the architecture that makes global presence possible, and why this architecture represents a paradigm shift in distributed database design. You'll learn the specific techniques Spanner uses to span continents while maintaining the consistency guarantees that applications depend on.
Before we can appreciate Spanner's solution, we must understand the problem in depth. Global distribution of databases isn't simply about placing servers in different locations—it's about reconciling fundamental physical constraints with application requirements.
The Speed of Light Constraint:
The most fundamental constraint in global distribution is physics itself. Light travels at approximately 299,792 kilometers per second in a vacuum, which sounds fast until you consider the distances involved (summarized in the table below).
These are theoretical minimums. In practice, network latencies are 2-4x higher: light in optical fiber travels at only about two-thirds of its vacuum speed, and real paths add routing, switching, and detours because cables don't follow straight lines across the Earth.
Why This Matters:
If your database must coordinate between nodes in San Francisco and Sydney before committing a transaction, you're looking at a minimum of 80ms round-trip latency (and realistically 150-300ms). For a database handling thousands of transactions per second, this coordination overhead becomes the dominant factor in system performance.
| Route | Distance (km) | Theoretical Min, One-Way in Vacuum (ms) | Typical Latency (ms) | Aggregate Wait at 1,000 TPS |
|---|---|---|---|---|
| US East ↔ US West | ~4,000 | ~13 | 40-60 | 40-60 s of waiting per wall-clock second |
| US ↔ Europe | ~6,500 | ~22 | 70-100 | 70-100 s of waiting per wall-clock second |
| US ↔ Asia Pacific | ~10,000 | ~33 | 120-180 | 120-180 s of waiting per wall-clock second |
| Europe ↔ Asia Pacific | ~9,000 | ~30 | 150-250 | 150-250 s of waiting per wall-clock second |
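To make the arithmetic concrete, here is a minimal sketch of how the latency floor falls out of distance alone. The distances, the fiber speed, and the 1.3x "indirect path" factor are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope latency floors for cross-region coordination.

C_VACUUM_KM_S = 299_792   # speed of light in a vacuum
C_FIBER_KM_S = 200_000    # roughly two-thirds of c inside optical fiber

ROUTES_KM = {
    "US East <-> US West": 4_000,
    "US <-> Europe": 6_500,
    "US <-> Asia Pacific": 10_000,
}

def one_way_ms(distance_km: float, speed_km_s: float) -> float:
    """Propagation delay in milliseconds, ignoring routing and switching."""
    return distance_km / speed_km_s * 1_000

for route, km in ROUTES_KM.items():
    vacuum = one_way_ms(km, C_VACUUM_KM_S)               # theoretical floor
    fiber_rtt = 2 * one_way_ms(km * 1.3, C_FIBER_KM_S)   # fiber speed + detours
    print(f"{route:22s} floor {vacuum:5.1f} ms one-way, "
          f"~{fiber_rtt:4.0f} ms fiber round trip with detours")
```

A commit that needs even one such round trip before acknowledging can never beat these numbers, no matter how fast the servers themselves are.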
The Coordination Dilemma:
Traditional databases solve consistency through coordination—nodes agree on transaction ordering before committing. In a global system, this coordination must happen across continental distances, introducing unavoidable latency. The naive approach of requiring all nodes to agree on every transaction simply doesn't scale.
Consider what happens when a user in Tokyo and a user in London both try to update the same bank account simultaneously. If the database coordinates naively, the two writes must be ordered by replicas on both sides of the planet before either can commit, so each transaction absorbs at least one Tokyo-London round trip, typically 200-300ms, as pure coordination overhead.
This is unacceptable for applications requiring sub-100ms response times.
The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since network partitions are unavoidable in global systems, you must choose between consistency and availability during partitions. Spanner chooses consistency, but engineers the system so that partitions are extraordinarily rare and short-lived.
Spanner's architecture is built on several interlocking concepts that, together, enable global distribution with strong consistency. Understanding each component and how they interact is essential for grasping why Spanner works.
Zones and Zone Sets:
At the physical level, Spanner is deployed across multiple zones. A zone is the unit of administrative deployment and roughly corresponds to a datacenter or a distinct failure domain within a larger datacenter complex. Zones are independent—a failure in one zone (power outage, network partition, software bug) shouldn't affect other zones.
Zones are grouped into zone sets based on geographic proximity. A zone set typically contains zones within a single metropolitan area or region. Data is replicated within and across zone sets according to configurable policies.
Spanservers and Directories:
Within each zone, spanservers hold the actual data. Each spanserver manages multiple tablets—ranges of rows from the database, similar to shards in a traditional distributed database. However, Spanner adds another layer: directories.
A directory is the unit of data movement and replication. Directories contain contiguous ranges of rows that share common access patterns and replication configurations. When Spanner moves data for load balancing or fault tolerance, it moves entire directories.
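One way to picture the hierarchy is as a key-range lookup: a directory owns a contiguous key range, lives in a tablet, which is managed by a spanserver, which sits in a zone. The sketch below uses hypothetical names and plain Python dataclasses purely to illustrate the containment relationships; it is not Spanner's internal representation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Directory:
    """Unit of placement: a contiguous key range with one replication config."""
    start_key: tuple
    end_key: tuple
    replication_config: str    # e.g. "3-way: US-E1 (leader), EU-W1, ASIA1"

@dataclass
class Tablet:
    directories: List[Directory]

@dataclass
class Spanserver:
    zone: str                  # e.g. "US-E1"
    tablets: List[Tablet]

def find_directory(spanservers: List[Spanserver], key: tuple) -> Directory:
    """Locate the directory whose key range contains `key`."""
    for server in spanservers:
        for tablet in server.tablets:
            for d in tablet.directories:
                if d.start_key <= key < d.end_key:
                    return d
    raise KeyError(key)

# All rows for users 100000-200000 live in one directory, so moving or
# re-replicating that user range means moving a single object.
accounts = Directory((100_000,), (200_000,), "3-way: US-E1 (leader), EU-W1, ASIA1")
servers = [Spanserver("US-E1", [Tablet([accounts])])]
print(find_directory(servers, (150_123,)).replication_config)
```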
The Replication Hierarchy:
Understanding Spanner's replication requires thinking in layers:
```
SPANNER ARCHITECTURE HIERARCHY
══════════════════════════════════════════════════════

GLOBAL SPANNER DEPLOYMENT
├── Zone Set: US-EAST
│   ├── Zone: US-E1
│   │   └── Spanservers: Tablet1, Tablet2, Tablet3
│   └── Zone: US-E2 (replica)
├── Zone Set: EUROPE
│   ├── Zone: EU-W1
│   │   └── Spanservers: Tablet1, Tablet2, Tablet3
│   └── Zone: EU-W2 (replica)
└── Zone Set: ASIA-PACIFIC
    ├── Zone: ASIA1
    │   └── Spanservers: Tablet1, Tablet2, Tablet3
    └── Zone: ASIA2 (replica)

DIRECTORY REPLICATION (example: 3-way replication)
────────────────────────────────────────────────────
Directory "accounts_100k_200k" (user accounts 100000-200000):
├── Leader replica:   Zone US-E1  (accepts writes)
├── Follower replica: Zone EU-W1  (can serve stale reads)
└── Follower replica: Zone ASIA1  (can serve stale reads)

PAXOS GROUP for this directory:
├── Leader:   US-E1/Spanserver-7/Tablet-12
├── Acceptor: EU-W1/Spanserver-3/Tablet-5
└── Acceptor: ASIA1/Spanserver-9/Tablet-8
```
Paxos Groups and Consensus:
Each directory in Spanner is managed by a Paxos group—a set of replicas that use the Paxos consensus protocol to agree on transaction ordering and ensure data is consistently replicated. The Paxos group has one leader (which accepts writes) and multiple followers (which participate in consensus and can serve reads).
Paxos ensures that once a transaction is committed, it's committed on a majority of replicas before returning success to the client. This provides durability even if individual replicas fail—as long as a majority survives, the data is safe.
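The practical consequence of majority commit is that write latency tracks the median replica, not the slowest one. The sketch below is a simplified model, not the Paxos protocol itself, and the round-trip times are invented for illustration:

```python
# Commit latency under majority quorum: the leader can acknowledge a write as
# soon as a majority of voters (itself included) have accepted it, so the
# slowest replicas stay off the critical path.
# RTTs below are illustrative, as seen from a leader in US-E1.

replica_rtt_ms = {
    "US-E1 (leader)": 0,    # local append
    "US-E2": 2,
    "EU-W1": 75,
    "ASIA1": 140,
    "ASIA2": 145,
}

def majority_commit_latency(rtts_ms):
    """Time until a majority of voters have acknowledged the write."""
    acks = sorted(rtts_ms.values())
    majority = len(acks) // 2 + 1
    return acks[majority - 1]

print(majority_commit_latency(replica_rtt_ms))  # 75 ms, not 145 ms
```

With five voters, any two replicas can fail or be arbitrarily slow without affecting either durability or commit latency.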
Why Paxos, Not Raft?
Spanner was designed before Raft became popular (Raft was published in 2014, Spanner in 2012). Both protocols provide similar guarantees, but Spanner uses Multi-Paxos with several optimizations, most notably long-lived leaders with time-based leader leases, which avoid constant re-election, and pipelined writes that hide wide-area round trips.
The location of the Paxos leader for a directory determines write latency for that data. Spanner allows configuring leader preferences—you can specify that user data from Europe should have European leaders, user data from Asia should have Asian leaders, etc. This keeps write latency low for the majority of users while still providing global availability.
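Here is a hedged sketch of what a leader-preference policy buys you: if each directory's leader sits in the region its users write from, the intercontinental hop disappears from the common case. The region names, RTT figures, and placement table are assumptions for this sketch, not a Spanner API.

```python
# Illustrative model: write latency = client-to-leader round trip
#                     + an (assumed) regional commit quorum.

RTT_MS = {  # symmetric round-trip times between regions (illustrative)
    ("europe", "europe"): 5,
    ("us", "us"): 5,
    ("asia", "asia"): 5,
    ("europe", "us"): 80,
    ("europe", "asia"): 180,
    ("us", "asia"): 120,
}

def rtt(a: str, b: str) -> int:
    return RTT_MS[(a, b)] if (a, b) in RTT_MS else RTT_MS[(b, a)]

LEADER_PREFERENCE = {          # directory -> preferred leader region
    "users_eu":   "europe",
    "users_us":   "us",
    "users_apac": "asia",
}

def write_latency_ms(client_region: str, directory: str, quorum_ms: int = 10) -> int:
    """Client-to-leader RTT plus the assumed in-region quorum time."""
    leader = LEADER_PREFERENCE[directory]
    return rtt(client_region, leader) + quorum_ms

print(write_latency_ms("europe", "users_eu"))    # 15: leader is local to the writer
print(write_latency_ms("europe", "users_apac"))  # 190: the cost of a mis-placed leader
```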
Spanner's data model is specifically designed to support global distribution while maintaining relational semantics. Two key innovations make this possible: interleaved tables and hierarchical primary keys.
Interleaved Tables: Co-locating Related Data
In a traditional relational database, related tables (like users and orders) are stored independently. Joining them requires accessing potentially different physical locations. In a globally distributed system, this could mean cross-continental round trips for every join.
Spanner solves this with interleaved tables. An interleaved table is physically co-located with its parent table—rows from the child table are stored next to the parent row they reference. This ensures that operations involving a user and their orders never require cross-datacenter communication.
How Interleaving Works:
Consider a schema with Users and Orders. In Spanner, you'd define Orders as interleaved in Users:
```sql
-- Parent table: Users
CREATE TABLE Users (
  user_id    INT64 NOT NULL,
  email      STRING(255),
  name       STRING(100),
  region     STRING(50),
  created_at TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
) PRIMARY KEY (user_id);

-- Child table: Orders (interleaved in Users)
CREATE TABLE Orders (
  user_id    INT64 NOT NULL,
  order_id   INT64 NOT NULL,
  total      FLOAT64,
  status     STRING(20),
  created_at TIMESTAMP NOT NULL,
) PRIMARY KEY (user_id, order_id),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

-- Grandchild table: OrderItems (interleaved in Orders)
CREATE TABLE OrderItems (
  user_id    INT64 NOT NULL,
  order_id   INT64 NOT NULL,
  item_id    INT64 NOT NULL,
  product_id INT64,
  quantity   INT64,
  unit_price FLOAT64,
) PRIMARY KEY (user_id, order_id, item_id),
  INTERLEAVE IN PARENT Orders ON DELETE CASCADE;

/*
 * PHYSICAL LAYOUT (conceptual):
 *
 * Directory for user 1001:
 * ┌────────────────────────────────────────────────────┐
 * │ User(1001) | alice@email.com | "Alice Smith" | US   │
 * ├────────────────────────────────────────────────────┤
 * │ Order(1001, 5001) | $150.00 | "shipped"             │
 * │   OrderItem(1001, 5001, 1) | Laptop | 1 | $999      │
 * │   OrderItem(1001, 5001, 2) | Mouse  | 2 | $25       │
 * ├────────────────────────────────────────────────────┤
 * │ Order(1001, 5002) | $45.00 | "pending"              │
 * │   OrderItem(1001, 5002, 1) | Cable | 3 | $15        │
 * └────────────────────────────────────────────────────┘
 *
 * ALL of user 1001's data is in ONE directory → ONE Paxos group
 * → all operations on this user are local to one datacenter.
 */
```
The Locality Benefit:
With interleaving, fetching a user and all their orders requires accessing only one directory—one Paxos group—in one datacenter (the one where the leader resides). Compare this to a non-interleaved design where user data might be in a US datacenter while their orders are split across Europe and Asia. The performance difference is dramatic.
Hierarchical Primary Keys:
Notice how the primary key of Orders includes user_id, and OrderItems includes both user_id and order_id. This hierarchical key structure is what makes interleaving work: every child row's key begins with its parent's key, so child rows sort immediately after their parent, land in the same directory, and can be located without consulting a separate index.
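A small sketch of why hierarchical keys produce physical co-location: plain lexicographic ordering of the key tuples places every user's orders and items directly after the user row. The tuples below mirror the schema above; the sort itself is ordinary Python.

```python
# Rows keyed by their hierarchical primary keys. The table name is only for
# display; the ordering comes purely from the key tuples.
rows = [
    ("Users",      (1001,)),
    ("Users",      (1002,)),
    ("Orders",     (1001, 5001)),
    ("Orders",     (1001, 5002)),
    ("OrderItems", (1001, 5001, 1)),
    ("OrderItems", (1001, 5001, 2)),
    ("Orders",     (1002, 7001)),
]

# Sorting by key clusters everything that starts with (1001,) together,
# which is exactly the contiguous range a single directory can own.
for table, key in sorted(rows, key=lambda r: r[1]):
    print(key, table)
```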
Spanner automatically manages directory boundaries based on data size and access patterns. A very active user with millions of orders might have their data split into multiple directories, but the split will be chosen to minimize cross-directory transactions. The system continuously rebalances directories to maintain optimal performance.
One of Spanner's most powerful features is its flexible replica placement. Unlike databases that treat replication as an afterthought, Spanner puts replica configuration at the center of its design.
Replication Configurations:
Spanner supports multiple replication configurations, each optimized for different use cases:
1. Regional Configuration: all voting replicas live in a single region, spread across multiple zones. Writes never leave the region, so latency is lowest, and the database survives zone failures but not the loss of the entire region.
2. Multi-Regional Configuration: voting replicas span three or more regions. The database survives the loss of a whole region, at the cost of higher write latency because every commit quorum crosses region boundaries.
3. Dual-Region Configuration: replicas are placed in exactly two regions, typically to satisfy data-residency or disaster-recovery requirements while keeping data inside a defined geographic boundary.
| Configuration | Typical Replica Count | Write Latency | Failure Tolerance | Use Case |
|---|---|---|---|---|
| Regional (1 region) | 3-5 | <10ms | Zone failures | Single-region apps, low latency priority |
| Dual-Region | 4+ | 10-50ms | Region failure (one) | DR requirements, moderate latency |
| Multi-Region (3+) | 5+ | 50-200ms | Region failures (multiple) | Global apps, availability priority |
| Continental | 5-7 | 20-80ms | Multiple zones, some regions | Continental scale with DR |
Witness Replicas:
For configurations that need an odd number of voters to form a Paxos majority, Spanner offers witness replicas. A witness participates in voting (consensus) but doesn't store a full copy of the data. This lets you reach the required voter count without the storage cost of another full replica.
For example, in a configuration with 2 full replicas, you might add a witness to achieve a 3-replica Paxos group. The witness can vote on commits but doesn't need the storage capacity of a full replica.
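A minimal sketch of how a witness changes the quorum math without changing storage costs (the replica counts are just the example above, worked through):

```python
def majority(voters: int) -> int:
    """Votes needed for a Paxos majority."""
    return voters // 2 + 1

def tolerable_failures(voters: int) -> int:
    return voters - majority(voters)

# Two full replicas only: both must acknowledge every commit.
print(majority(2), tolerable_failures(2))   # 2 votes needed, 0 failures tolerated

# Two full replicas + one witness: the witness votes but stores no row data,
# yet the group now tolerates the loss of any single voter.
print(majority(3), tolerable_failures(3))   # 2 votes needed, 1 failure tolerated
```

The extra fault tolerance costs far less than a third full replica, because the witness only needs enough state to vote.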
Read-Only Replicas:
Spanner also supports read-only replicas: replicas that receive data updates but don't participate in write consensus. These are ideal for serving low-latency reads to users far from any voting region, and for analytics or reporting workloads that shouldn't add load to the write quorum.
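Here is a conceptual sketch of why a non-voting replica can still serve consistent reads: a replica may answer a read at timestamp t only if it has already applied every write up to t. The class name, field, and timestamps below are invented for illustration; the rule is the idea, not Spanner's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class ReadOnlyReplica:
    """Receives the replicated log but never votes on commits."""
    applied_through: float   # timestamp below which all writes have been applied

    def can_serve_read(self, read_ts: float) -> bool:
        # Safe iff every write the read could observe is already applied locally;
        # otherwise the client must wait or ask a fresher replica.
        return read_ts <= self.applied_through

tokyo = ReadOnlyReplica(applied_through=1_000.000)

print(tokyo.can_serve_read(999.950))    # True: a slightly stale read, served locally
print(tokyo.can_serve_read(1_000.200))  # False: too fresh for this replica right now
```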
Leader Placement:
Critically, Spanner allows specifying leader placement constraints. You can configure a default leader region for a database, restrict which regions are eligible to host leaders, and steer leader placement at finer granularity so writes originate close to the users who issue them, as the worked configuration below shows.
```
EXAMPLE: GLOBAL E-COMMERCE PLATFORM CONFIGURATION
══════════════════════════════════════════════════════

Business requirements:
- Users in North America, Europe, and Asia
- Write-heavy for inventory updates
- Read-heavy for product catalog
- Regulatory: European user data must stay in the EU

SOLUTION: Multi-Region with Regional Leader Constraints

Database: product_catalog
├── Leaders eligible: US-WEST1, US-EAST4, EUROPE-WEST1
├── Read replicas:    ASIA-NORTHEAST1, AUSTRALIA-SOUTHEAST1
└── Replication:      5 voting replicas + 2 read-only

Database: user_data_americas
├── Leaders eligible: US-WEST1, US-EAST4 only
├── Full replicas:    US-WEST1 (2), US-EAST4 (1), US-CENTRAL1 (1)
└── Regional isolation for AMER user data

Database: user_data_emea
├── Leaders eligible: EUROPE-WEST1, EUROPE-WEST4 only
├── Full replicas:    EUROPE-WEST1 (2), EUROPE-WEST4 (2)
├── Witness:          EUROPE-NORTH1
└── EU-only for GDPR compliance

Database: user_data_apac
├── Leaders eligible: ASIA-EAST1, ASIA-NORTHEAST1
├── Full replicas:    ASIA-EAST1 (2), ASIA-NORTHEAST1 (2)
└── APAC isolation for local latency

Database: inventory (write-heavy, latency-critical)
├── Leaders: follow the user's region (US/EU/APAC)
├── Directory-level leader placement based on warehouse region
└── Cross-region replication async for analytics

Result:
- Users get <50ms writes to their regional leaders
- Global reads served from the nearest read replica
- EU data physically stays in the EU
- Inventory updates localized to regional warehouses
```
Spanner's configuration can be changed at runtime without downtime. You can add regions, move leaders, or adjust replica counts as your application's geographic footprint evolves. Spanner handles the data movement transparently.
Building a globally distributed database means accepting that failures are not exceptions—they are the norm. Network partitions, datacenter outages, and hardware failures happen constantly at Google's scale. Spanner's design assumes failure and builds recovery into every layer.
Google's Private Network Infrastructure:
One often-overlooked aspect of Spanner's success is Google's investment in private network infrastructure. Google operates one of the world's largest private networks, with its own global fiber backbone, substantial investments in subsea cables, and points of presence close to users around the world.
This infrastructure dramatically reduces the probability of network partitions compared to relying solely on the public internet. When partitions do occur, they're typically brief and localized.
Failure Detection and Recovery:
Spanner uses aggressive failure detection with multiple mechanisms:
Heartbeat Monitoring: Replicas continuously exchange heartbeats. Missed heartbeats trigger investigation.
Lease Expiration: Leaders hold time-bounded leases. If a leader becomes unreachable, its lease expires and a new leader is elected.
Chubby Lock Service: Spanner uses Google's Chubby (similar to ZooKeeper) for coarse-grained coordination and leader election.
Automatic Recovery: When failures are detected, Spanner automatically elects new Paxos leaders for the affected groups, re-replicates under-replicated data onto healthy machines, and redirects client requests, all without operator intervention.
Consistency During Failures:
The critical property Spanner maintains is linearizability—even during failures, the database behaves as if there's a single copy of the data with operations executing atomically in some total order. This is achieved through:
Paxos Majority Commits: A transaction isn't considered committed until a majority of replicas acknowledge. This means committed data survives any minority failure.
Leader Leases with TrueTime: Leaders hold leases with precisely bounded durations. A new leader can only begin accepting writes after the old leader's lease definitely expired (accounting for TrueTime uncertainty). This prevents split-brain scenarios.
Read Timestamps: Every read receives a timestamp. The database guarantees that reads at a timestamp see all transactions committed before that timestamp—even if leader changes occurred.
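To make the lease rule above concrete, here is a simplified model of the leadership handoff check, assuming a TrueTime-like interval clock that returns earliest/latest bounds on the current time. The class, values, and time units are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float   # the current time is guaranteed to be >= earliest
    latest: float     # ... and <= latest

def can_become_leader(now: TTInterval, old_lease_expiry: float) -> bool:
    """A new leader may start accepting writes only once the old leader's lease
    has *definitely* expired, even under worst-case clock uncertainty."""
    return now.earliest > old_lease_expiry

# The old leader's lease expires at t = 100.0 (arbitrary time units).
print(can_become_leader(TTInterval(99.4, 100.4), old_lease_expiry=100.0))   # False: may still be valid
print(can_become_leader(TTInterval(100.2, 101.2), old_lease_expiry=100.0))  # True: certainly expired
```

Because the check uses the earliest possible current time, two leaders can never both believe they hold the lease, which is what rules out split-brain writes.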
The Result:
Applications using Spanner don't need to implement retry logic for consistency issues, handle stale reads after failover, or worry about lost transactions. The database provides the same guarantees during failures as during normal operation. This dramatically simplifies application code.
Google reports that Spanner achieves 99.999% availability (about 5 minutes of downtime per year) for its global configurations. This is remarkable for a strongly consistent system—it demonstrates that the CAP theorem's consistency/availability tradeoff can be managed through engineering excellence.
We've covered substantial ground in understanding how Spanner achieves planetary-scale distribution. Let's consolidate the key architectural principles:
- Physics sets the floor: intercontinental coordination costs tens to hundreds of milliseconds, so the architecture is designed to need it as rarely as possible.
- Zones, spanservers, tablets, and directories form the physical hierarchy; the directory is the unit of placement, movement, and replication.
- Each directory is managed by a Paxos group; commits require only a majority of voters, and long-lived leaders keep the protocol efficient across wide-area links.
- Interleaved tables and hierarchical primary keys keep related rows in one directory, so common operations stay inside a single datacenter.
- Replication is configurable: regional, dual-region, and multi-region deployments, witness and read-only replicas, and leader placement constraints let you trade off latency, availability, and data residency.
- Failure is assumed everywhere: majority commits, lease-based leadership, and automatic recovery preserve consistency through zone and region outages.
What's Next:
Global distribution alone doesn't guarantee consistency—you still need a way to order transactions across continents. In the next page, we'll explore TrueTime—Spanner's revolutionary approach to global clock synchronization that makes external consistency possible. TrueTime is the innovation that truly sets Spanner apart from other distributed databases.
You now understand the architectural foundations of Spanner's global distribution—zones, Paxos groups, interleaved tables, and configurable replication. Next, we'll see how TrueTime provides the clock synchronization that makes strong consistency possible across planetary distances.