In 2012, Google unveiled something that distributed systems researchers had considered theoretically impossible: a globally distributed database that provided strong consistency, SQL semantics, and horizontal scalability—all simultaneously. That database was Google Spanner.
Traditional distributed database wisdom held that you had to choose between consistency and availability (the CAP theorem), between SQL and scale (the NoSQL movement), between geographic distribution and transactional guarantees. Spanner broke these tradeoffs not by disproving the theorems, but by engineering around them in ways that shifted the boundaries of what was achievable.
Today, Google Spanner serves as the backbone for some of the world's most demanding applications: Google Ads, Google Play, Google Cloud's internal infrastructure, and thousands of external deployments. It processes millions of transactions per second across continents, maintaining the illusion that the entire planet is a single, coherent database.
By the end of this page, you will understand how Spanner achieves planetary-scale distribution, the architecture that makes global presence possible, and why this architecture represents a paradigm shift in distributed database design. You'll learn the specific techniques Spanner uses to span continents while maintaining the consistency guarantees that applications depend on.
Before we can appreciate Spanner's solution, we must understand the problem in depth. Global distribution of databases isn't simply about placing servers in different locations—it's about reconciling fundamental physical constraints with application requirements.
The Speed of Light Constraint:
The most fundamental constraint in global distribution is physics itself. Light travels at approximately 299,792 kilometers per second in a vacuum, which sounds fast until you consider the distances involved (summarized in the table below).
These are theoretical minimums. In practice, network latencies are 2-4x higher: light in optical fiber travels at only about two-thirds of its vacuum speed, and real paths add routing, switching, and detours because cables don't follow straight lines across the Earth.
Why This Matters:
If your database must coordinate between nodes in San Francisco and Sydney before committing a transaction, you're looking at a minimum of 80ms round-trip latency (and realistically 150-300ms). For a database handling thousands of transactions per second, this coordination overhead becomes the dominant factor in system performance.
| Route | Distance (km) | Theoretical Min, One-Way in Vacuum (ms) | Typical Latency (ms) | Aggregate Wait at 1,000 TPS |
|---|---|---|---|---|
| US East ↔ US West | ~4,000 | ~13 | 40-60 | 40-60 s of waiting per wall-clock second |
| US ↔ Europe | ~6,500 | ~22 | 70-100 | 70-100 s of waiting per wall-clock second |
| US ↔ Asia Pacific | ~10,000 | ~33 | 120-180 | 120-180 s of waiting per wall-clock second |
| Europe ↔ Asia Pacific | ~9,000 | ~30 | 150-250 | 150-250 s of waiting per wall-clock second |
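To make the arithmetic concrete, here is a minimal sketch of how the latency floor falls out of distance alone. The distances, the fiber speed, and the 1.3x "indirect path" factor are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope latency floors for cross-region coordination.

C_VACUUM_KM_S = 299_792   # speed of light in a vacuum
C_FIBER_KM_S = 200_000    # roughly two-thirds of c inside optical fiber

ROUTES_KM = {
    "US East <-> US West": 4_000,
    "US <-> Europe": 6_500,
    "US <-> Asia Pacific": 10_000,
}

def one_way_ms(distance_km: float, speed_km_s: float) -> float:
    """Propagation delay in milliseconds, ignoring routing and switching."""
    return distance_km / speed_km_s * 1_000

for route, km in ROUTES_KM.items():
    vacuum = one_way_ms(km, C_VACUUM_KM_S)               # theoretical floor
    fiber_rtt = 2 * one_way_ms(km * 1.3, C_FIBER_KM_S)   # fiber speed + detours
    print(f"{route:22s} floor {vacuum:5.1f} ms one-way, "
          f"~{fiber_rtt:4.0f} ms fiber round trip with detours")
```

A commit that needs even one such round trip before acknowledging can never beat these numbers, no matter how fast the servers themselves are.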
The Coordination Dilemma:
Traditional databases solve consistency through coordination—nodes agree on transaction ordering before committing. In a global system, this coordination must happen across continental distances, introducing unavoidable latency. The naive approach of requiring all nodes to agree on every transaction simply doesn't scale.
Consider what happens when a user in Tokyo and a user in London both try to update the same bank account simultaneously. If the database coordinates naively, the two writes must be ordered by replicas on both sides of the planet before either can commit, so each transaction absorbs at least one Tokyo-London round trip, typically 200-300ms, as pure coordination overhead.
This is unacceptable for applications requiring sub-100ms response times.
The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since network partitions are unavoidable in global systems, you must choose between consistency and availability during partitions. Spanner chooses consistency, but engineers the system so that partitions are extraordinarily rare and short-lived.
Spanner's architecture is built on several interlocking concepts that, together, enable global distribution with strong consistency. Understanding each component and how they interact is essential for grasping why Spanner works.
Zones and Zone Sets:
At the physical level, Spanner is deployed across multiple zones. A zone is the unit of administrative deployment and roughly corresponds to a datacenter or a distinct failure domain within a larger datacenter complex. Zones are independent—a failure in one zone (power outage, network partition, software bug) shouldn't affect other zones.
Zones are grouped into zone sets based on geographic proximity. A zone set typically contains zones within a single metropolitan area or region. Data is replicated within and across zone sets according to configurable policies.
Spanservers and Directories:
Within each zone, spanservers hold the actual data. Each spanserver manages multiple tablets—ranges of rows from the database, similar to shards in a traditional distributed database. However, Spanner adds another layer: directories.
A directory is the unit of data movement and replication. Directories contain contiguous ranges of rows that share common access patterns and replication configurations. When Spanner moves data for load balancing or fault tolerance, it moves entire directories.
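One way to picture the hierarchy is as a key-range lookup: a directory owns a contiguous key range, lives in a tablet, which is managed by a spanserver, which sits in a zone. The sketch below uses hypothetical names and plain Python dataclasses purely to illustrate the containment relationships; it is not Spanner's internal representation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Directory:
    """Unit of placement: a contiguous key range with one replication config."""
    start_key: tuple
    end_key: tuple
    replication_config: str    # e.g. "3-way: US-E1 (leader), EU-W1, ASIA1"

@dataclass
class Tablet:
    directories: List[Directory]

@dataclass
class Spanserver:
    zone: str                  # e.g. "US-E1"
    tablets: List[Tablet]

def find_directory(spanservers: List[Spanserver], key: tuple) -> Directory:
    """Locate the directory whose key range contains `key`."""
    for server in spanservers:
        for tablet in server.tablets:
            for d in tablet.directories:
                if d.start_key <= key < d.end_key:
                    return d
    raise KeyError(key)

# All rows for users 100000-200000 live in one directory, so moving or
# re-replicating that user range means moving a single object.
accounts = Directory((100_000,), (200_000,), "3-way: US-E1 (leader), EU-W1, ASIA1")
servers = [Spanserver("US-E1", [Tablet([accounts])])]
print(find_directory(servers, (150_123,)).replication_config)
```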
The Replication Hierarchy:
Understanding Spanner's replication requires thinking in layers:
```
SPANNER ARCHITECTURE HIERARCHY
══════════════════════════════════════════════════════

GLOBAL SPANNER DEPLOYMENT
├── Zone Set: US-EAST
│   ├── Zone: US-E1
│   │   └── Spanservers: Tablet1, Tablet2, Tablet3
│   └── Zone: US-E2 (replica)
├── Zone Set: EUROPE
│   ├── Zone: EU-W1
│   │   └── Spanservers: Tablet1, Tablet2, Tablet3
│   └── Zone: EU-W2 (replica)
└── Zone Set: ASIA-PACIFIC
    ├── Zone: ASIA1
    │   └── Spanservers: Tablet1, Tablet2, Tablet3
    └── Zone: ASIA2 (replica)

DIRECTORY REPLICATION (example: 3-way replication)
────────────────────────────────────────────────────
Directory "accounts_100k_200k" (user accounts 100000-200000):
├── Leader replica:   Zone US-E1  (accepts writes)
├── Follower replica: Zone EU-W1  (can serve stale reads)
└── Follower replica: Zone ASIA1  (can serve stale reads)

PAXOS GROUP for this directory:
├── Leader:   US-E1/Spanserver-7/Tablet-12
├── Acceptor: EU-W1/Spanserver-3/Tablet-5
└── Acceptor: ASIA1/Spanserver-9/Tablet-8
```
Paxos Groups and Consensus:
Each directory in Spanner is managed by a Paxos group—a set of replicas that use the Paxos consensus protocol to agree on transaction ordering and ensure data is consistently replicated. The Paxos group has one leader (which accepts writes) and multiple followers (which participate in consensus and can serve reads).
Paxos ensures that once a transaction is committed, it's committed on a majority of replicas before returning success to the client. This provides durability even if individual replicas fail—as long as a majority survives, the data is safe.
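The practical consequence of majority commit is that write latency tracks the median replica, not the slowest one. The sketch below is a simplified model, not the Paxos protocol itself, and the round-trip times are invented for illustration:

```python
# Commit latency under majority quorum: the leader can acknowledge a write as
# soon as a majority of voters (itself included) have accepted it, so the
# slowest replicas stay off the critical path.
# RTTs below are illustrative, as seen from a leader in US-E1.

replica_rtt_ms = {
    "US-E1 (leader)": 0,    # local append
    "US-E2": 2,
    "EU-W1": 75,
    "ASIA1": 140,
    "ASIA2": 145,
}

def majority_commit_latency(rtts_ms):
    """Time until a majority of voters have acknowledged the write."""
    acks = sorted(rtts_ms.values())
    majority = len(acks) // 2 + 1
    return acks[majority - 1]

print(majority_commit_latency(replica_rtt_ms))  # 75 ms, not 145 ms
```

With five voters, any two replicas can fail or be arbitrarily slow without affecting either durability or commit latency.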
Why Paxos, Not Raft?
Spanner was designed before Raft became popular (Raft was published in 2014, Spanner in 2012). Both protocols provide similar guarantees, but Spanner uses Multi-Paxos with several optimizations, most notably long-lived leaders with time-based leader leases, which avoid constant re-election, and pipelined writes that hide wide-area round trips.
The location of the Paxos leader for a directory determines write latency for that data. Spanner allows configuring leader preferences—you can specify that user data from Europe should have European leaders, user data from Asia should have Asian leaders, etc. This keeps write latency low for the majority of users while still providing global availability.
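Here is a hedged sketch of what a leader-preference policy buys you: if each directory's leader sits in the region its users write from, the intercontinental hop disappears from the common case. The region names, RTT figures, and placement table are assumptions for this sketch, not a Spanner API.

```python
# Illustrative model: write latency = client-to-leader round trip
#                     + an (assumed) regional commit quorum.

RTT_MS = {  # symmetric round-trip times between regions (illustrative)
    ("europe", "europe"): 5,
    ("us", "us"): 5,
    ("asia", "asia"): 5,
    ("europe", "us"): 80,
    ("europe", "asia"): 180,
    ("us", "asia"): 120,
}

def rtt(a: str, b: str) -> int:
    return RTT_MS[(a, b)] if (a, b) in RTT_MS else RTT_MS[(b, a)]

LEADER_PREFERENCE = {          # directory -> preferred leader region
    "users_eu":   "europe",
    "users_us":   "us",
    "users_apac": "asia",
}

def write_latency_ms(client_region: str, directory: str, quorum_ms: int = 10) -> int:
    """Client-to-leader RTT plus the assumed in-region quorum time."""
    leader = LEADER_PREFERENCE[directory]
    return rtt(client_region, leader) + quorum_ms

print(write_latency_ms("europe", "users_eu"))    # 15: leader is local to the writer
print(write_latency_ms("europe", "users_apac"))  # 190: the cost of a mis-placed leader
```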
Spanner's data model is specifically designed to support global distribution while maintaining relational semantics. Two key innovations make this possible: interleaved tables and hierarchical primary keys.
Interleaved Tables: Co-locating Related Data
In a traditional relational database, related tables (like users and orders) are stored independently. Joining them requires accessing potentially different physical locations. In a globally distributed system, this could mean cross-continental round trips for every join.
Spanner solves this with interleaved tables. An interleaved table is physically co-located with its parent table—rows from the child table are stored next to the parent row they reference. This ensures that operations involving a user and their orders never require cross-datacenter communication.
How Interleaving Works:
Consider a schema with Users and Orders. In Spanner, you'd define Orders as interleaved in Users:
```sql
-- Parent table: Users
CREATE TABLE Users (
  user_id    INT64 NOT NULL,
  email      STRING(255),
  name       STRING(100),
  region     STRING(50),
  created_at TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
) PRIMARY KEY (user_id);

-- Child table: Orders (interleaved in Users)
CREATE TABLE Orders (
  user_id    INT64 NOT NULL,
  order_id   INT64 NOT NULL,
  total      FLOAT64,
  status     STRING(20),
  created_at TIMESTAMP NOT NULL,
) PRIMARY KEY (user_id, order_id),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

-- Grandchild table: OrderItems (interleaved in Orders)
CREATE TABLE OrderItems (
  user_id    INT64 NOT NULL,
  order_id   INT64 NOT NULL,
  item_id    INT64 NOT NULL,
  product_id INT64,
  quantity   INT64,
  unit_price FLOAT64,
) PRIMARY KEY (user_id, order_id, item_id),
  INTERLEAVE IN PARENT Orders ON DELETE CASCADE;

/*
 * PHYSICAL LAYOUT (conceptual):
 *
 * Directory for user 1001:
 * ┌────────────────────────────────────────────────────┐
 * │ User(1001) | alice@email.com | "Alice Smith" | US   │
 * ├────────────────────────────────────────────────────┤
 * │ Order(1001, 5001) | $150.00 | "shipped"             │
 * │   OrderItem(1001, 5001, 1) | Laptop | 1 | $999      │
 * │   OrderItem(1001, 5001, 2) | Mouse  | 2 | $25       │
 * ├────────────────────────────────────────────────────┤
 * │ Order(1001, 5002) | $45.00 | "pending"              │
 * │   OrderItem(1001, 5002, 1) | Cable | 3 | $15        │
 * └────────────────────────────────────────────────────┘
 *
 * ALL of user 1001's data is in ONE directory → ONE Paxos group
 * → all operations on this user are local to one datacenter.
 */
```
The Locality Benefit:
With interleaving, fetching a user and all their orders requires accessing only one directory—one Paxos group—in one datacenter (the one where the leader resides). Compare this to a non-interleaved design where user data might be in a US datacenter while their orders are split across Europe and Asia. The performance difference is dramatic.
Hierarchical Primary Keys:
Notice how the primary key of Orders includes user_id, and OrderItems includes both user_id and order_id. This hierarchical key structure is what makes interleaving work: every child row's key begins with its parent's key, so child rows sort immediately after their parent, land in the same directory, and can be located without consulting a separate index.
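A small sketch of why hierarchical keys produce physical co-location: plain lexicographic ordering of the key tuples places every user's orders and items directly after the user row. The tuples below mirror the schema above; the sort itself is ordinary Python.

```python
# Rows keyed by their hierarchical primary keys. The table name is only for
# display; the ordering comes purely from the key tuples.
rows = [
    ("Users",      (1001,)),
    ("Users",      (1002,)),
    ("Orders",     (1001, 5001)),
    ("Orders",     (1001, 5002)),
    ("OrderItems", (1001, 5001, 1)),
    ("OrderItems", (1001, 5001, 2)),
    ("Orders",     (1002, 7001)),
]

# Sorting by key clusters everything that starts with (1001,) together,
# which is exactly the contiguous range a single directory can own.
for table, key in sorted(rows, key=lambda r: r[1]):
    print(key, table)
```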
Spanner automatically manages directory boundaries based on data size and access patterns. A very active user with millions of orders might have their data split into multiple directories, but the split will be chosen to minimize cross-directory transactions. The system continuously rebalances directories to maintain optimal performance.
One of Spanner's most powerful features is its flexible replica placement. Unlike databases that treat replication as an afterthought, Spanner puts replica configuration at the center of its design.
Replication Configurations:
Spanner supports multiple replication configurations, each optimized for different use cases:
1. Regional Configuration: all voting replicas live in a single region, spread across multiple zones. Writes never leave the region, so latency is lowest, and the database survives zone failures but not the loss of the entire region.
2. Multi-Regional Configuration: voting replicas span three or more regions. The database survives the loss of a whole region, at the cost of higher write latency because every commit quorum crosses region boundaries.
3. Dual-Region Configuration: replicas are placed in exactly two regions, typically to satisfy data-residency or disaster-recovery requirements while keeping data inside a defined geographic boundary.
| Configuration | Typical Replica Count | Write Latency | Failure Tolerance | Use Case |
|---|---|---|---|---|
| Regional (1 region) | 3-5 | <10ms | Zone failures | Single-region apps, low latency priority |
| Dual-Region | 4+ | 10-50ms | Region failure (one) | DR requirements, moderate latency |
| Multi-Region (3+) | 5+ | 50-200ms | Region failures (multiple) | Global apps, availability priority |
| Continental | 5-7 | 20-80ms | Multiple zones, some regions | Continental scale with DR |
Witness Replicas:
For configurations that need an odd number of voters to form a Paxos majority, Spanner offers witness replicas. A witness participates in voting (consensus) but doesn't store a full copy of the data. This lets you reach the required voter count without the storage cost of another full replica.
For example, in a configuration with 2 full replicas, you might add a witness to achieve a 3-replica Paxos group. The witness can vote on commits but doesn't need the storage capacity of a full replica.
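A minimal sketch of how a witness changes the quorum math without changing storage costs (the replica counts are just the example above, worked through):

```python
def majority(voters: int) -> int:
    """Votes needed for a Paxos majority."""
    return voters // 2 + 1

def tolerable_failures(voters: int) -> int:
    return voters - majority(voters)

# Two full replicas only: both must acknowledge every commit.
print(majority(2), tolerable_failures(2))   # 2 votes needed, 0 failures tolerated

# Two full replicas + one witness: the witness votes but stores no row data,
# yet the group now tolerates the loss of any single voter.
print(majority(3), tolerable_failures(3))   # 2 votes needed, 1 failure tolerated
```

The extra fault tolerance costs far less than a third full replica, because the witness only needs enough state to vote.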
Read-Only Replicas:
Spanner also supports read-only replicas: replicas that receive data updates but don't participate in write consensus. These are ideal for serving low-latency reads to users far from any voting region, and for analytics or reporting workloads that shouldn't add load to the write quorum.
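Here is a conceptual sketch of why a non-voting replica can still serve consistent reads: a replica may answer a read at timestamp t only if it has already applied every write up to t. The class name, field, and timestamps below are invented for illustration; the rule is the idea, not Spanner's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class ReadOnlyReplica:
    """Receives the replicated log but never votes on commits."""
    applied_through: float   # timestamp below which all writes have been applied

    def can_serve_read(self, read_ts: float) -> bool:
        # Safe iff every write the read could observe is already applied locally;
        # otherwise the client must wait or ask a fresher replica.
        return read_ts <= self.applied_through

tokyo = ReadOnlyReplica(applied_through=1_000.000)

print(tokyo.can_serve_read(999.950))    # True: a slightly stale read, served locally
print(tokyo.can_serve_read(1_000.200))  # False: too fresh for this replica right now
```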
Leader Placement:
Critically, Spanner allows specifying leader placement constraints. You can configure a default leader region for a database, restrict which regions are eligible to host leaders, and steer leader placement at finer granularity so writes originate close to the users who issue them, as the worked configuration below shows.
```
EXAMPLE: GLOBAL E-COMMERCE PLATFORM CONFIGURATION
══════════════════════════════════════════════════════

Business requirements:
- Users in North America, Europe, and Asia
- Write-heavy for inventory updates
- Read-heavy for product catalog
- Regulatory: European user data must stay in the EU

SOLUTION: Multi-Region with Regional Leader Constraints

Database: product_catalog
├── Leaders eligible: US-WEST1, US-EAST4, EUROPE-WEST1
├── Read replicas:    ASIA-NORTHEAST1, AUSTRALIA-SOUTHEAST1
└── Replication:      5 voting replicas + 2 read-only

Database: user_data_americas
├── Leaders eligible: US-WEST1, US-EAST4 only
├── Full replicas:    US-WEST1 (2), US-EAST4 (1), US-CENTRAL1 (1)
└── Regional isolation for AMER user data

Database: user_data_emea
├── Leaders eligible: EUROPE-WEST1, EUROPE-WEST4 only
├── Full replicas:    EUROPE-WEST1 (2), EUROPE-WEST4 (2)
├── Witness:          EUROPE-NORTH1
└── EU-only for GDPR compliance

Database: user_data_apac
├── Leaders eligible: ASIA-EAST1, ASIA-NORTHEAST1
├── Full replicas:    ASIA-EAST1 (2), ASIA-NORTHEAST1 (2)
└── APAC isolation for local latency

Database: inventory (write-heavy, latency-critical)
├── Leaders: follow the user's region (US/EU/APAC)
├── Directory-level leader placement based on warehouse region
└── Cross-region replication async for analytics

Result:
- Users get <50ms writes to their regional leaders
- Global reads served from the nearest read replica
- EU data physically stays in the EU
- Inventory updates localized to regional warehouses
```
Spanner's configuration can be changed at runtime without downtime. You can add regions, move leaders, or adjust replica counts as your application's geographic footprint evolves. Spanner handles the data movement transparently.
Building a globally distributed database means accepting that failures are not exceptions—they are the norm. Network partitions, datacenter outages, and hardware failures happen constantly at Google's scale. Spanner's design assumes failure and builds recovery into every layer.
Google's Private Network Infrastructure:
One often-overlooked aspect of Spanner's success is Google's investment in private network infrastructure. Google operates one of the world's largest private networks, with its own global fiber backbone, substantial investments in subsea cables, and points of presence close to users around the world.
This infrastructure dramatically reduces the probability of network partitions compared to relying solely on the public internet. When partitions do occur, they're typically brief and localized.
Failure Detection and Recovery:
Spanner uses aggressive failure detection with multiple mechanisms:
Heartbeat Monitoring: Replicas continuously exchange heartbeats. Missed heartbeats trigger investigation.
Lease Expiration: Leaders hold time-bounded leases. If a leader becomes unreachable, its lease expires and a new leader is elected.
Chubby Lock Service: Spanner uses Google's Chubby (similar to ZooKeeper) for coarse-grained coordination and leader election.
Automatic Recovery: When failures are detected, Spanner automatically elects new Paxos leaders for the affected groups, re-replicates under-replicated data onto healthy machines, and redirects client requests, all without operator intervention.
Consistency During Failures:
The critical property Spanner maintains is linearizability—even during failures, the database behaves as if there's a single copy of the data with operations executing atomically in some total order. This is achieved through:
Paxos Majority Commits: A transaction isn't considered committed until a majority of replicas acknowledge. This means committed data survives any minority failure.
Leader Leases with TrueTime: Leaders hold leases with precisely bounded durations. A new leader can only begin accepting writes after the old leader's lease definitely expired (accounting for TrueTime uncertainty). This prevents split-brain scenarios.
Read Timestamps: Every read receives a timestamp. The database guarantees that reads at a timestamp see all transactions committed before that timestamp—even if leader changes occurred.
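To make the lease rule above concrete, here is a simplified model of the leadership handoff check, assuming a TrueTime-like interval clock that returns earliest/latest bounds on the current time. The class, values, and time units are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float   # the current time is guaranteed to be >= earliest
    latest: float     # ... and <= latest

def can_become_leader(now: TTInterval, old_lease_expiry: float) -> bool:
    """A new leader may start accepting writes only once the old leader's lease
    has *definitely* expired, even under worst-case clock uncertainty."""
    return now.earliest > old_lease_expiry

# The old leader's lease expires at t = 100.0 (arbitrary time units).
print(can_become_leader(TTInterval(99.4, 100.4), old_lease_expiry=100.0))   # False: may still be valid
print(can_become_leader(TTInterval(100.2, 101.2), old_lease_expiry=100.0))  # True: certainly expired
```

Because the check uses the earliest possible current time, two leaders can never both believe they hold the lease, which is what rules out split-brain writes.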
The Result:
Applications using Spanner don't need to implement retry logic for consistency issues, handle stale reads after failover, or worry about lost transactions. The database provides the same guarantees during failures as during normal operation. This dramatically simplifies application code.
Google reports that Spanner achieves 99.999% availability (about 5 minutes of downtime per year) for its global configurations. This is remarkable for a strongly consistent system—it demonstrates that the CAP theorem's consistency/availability tradeoff can be managed through engineering excellence.
We've covered substantial ground in understanding how Spanner achieves planetary-scale distribution. Let's consolidate the key architectural principles:
- Physics sets the floor: intercontinental coordination costs tens to hundreds of milliseconds, so the architecture is designed to need it as rarely as possible.
- Zones, spanservers, tablets, and directories form the physical hierarchy; the directory is the unit of placement, movement, and replication.
- Each directory is managed by a Paxos group; commits require only a majority of voters, and long-lived leaders keep the protocol efficient across wide-area links.
- Interleaved tables and hierarchical primary keys keep related rows in one directory, so common operations stay inside a single datacenter.
- Replication is configurable: regional, dual-region, and multi-region deployments, witness and read-only replicas, and leader placement constraints let you trade off latency, availability, and data residency.
- Failure is assumed everywhere: majority commits, lease-based leadership, and automatic recovery preserve consistency through zone and region outages.
What's Next:
Global distribution alone doesn't guarantee consistency—you still need a way to order transactions across continents. In the next page, we'll explore TrueTime—Spanner's revolutionary approach to global clock synchronization that makes external consistency possible. TrueTime is the innovation that truly sets Spanner apart from other distributed databases.
You now understand the architectural foundations of Spanner's global distribution—zones, Paxos groups, interleaved tables, and configurable replication. Next, we'll see how TrueTime provides the clock synchronization that makes strong consistency possible across planetary distances.