In 2012, Google published a paper that fundamentally altered what database engineers believed was possible. Google Spanner demonstrated that a single database could provide:

- Global distribution across datacenters and continents
- Externally consistent (strictly serializable) ACID transactions
- A schematized, semi-relational data model with a SQL-based query language
- High availability through synchronous, automatic replication
Prior to Spanner, conventional wisdom held that these properties couldn't coexist at global scale. The CAP theorem seemed to mandate choosing between consistency and availability. Spanner proved that with sufficient engineering—and atomic clocks—you could achieve both for virtually all practical purposes.
By the end of this page, you will understand Spanner's revolutionary architecture, including TrueTime (Google's globally-synchronized time system), its semi-relational data model, distributed transaction protocols, and the specific design decisions that enable planet-scale consistency. You'll see why Spanner is considered the foundation of the NewSQL movement.
Google didn't build Spanner for academic interest. It emerged from genuine operational pain with earlier systems.
Pre-Spanner Google Infrastructure
By the late 2000s, Google operated several specialized storage systems:
Bigtable (2006): Scalable, distributed storage for structured data. Supported atomic operations on single rows only, with eventually consistent cross-datacenter replication. Powered Search, Analytics, and many other services.
Megastore (2011): Built on Bigtable, added stronger consistency and cross-datacenter replication. But performance was limited (single-digit writes per second per entity group) and ACID semantics were restricted to entity groups.
The limitations became increasingly painful as Google's applications grew more complex:

- Bigtable offered no transactions across rows, forcing teams to build fragile application-level workarounds
- Megastore's entity-group model constrained schema design and capped write throughput
- Neither system offered SQL, so complex queries meant custom application code
- Teams kept re-implementing transactional machinery that a database should provide
The Spanner Vision
Spanner was designed to solve these problems with ambitious goals:

- Scale to millions of machines across hundreds of datacenters
- Provide externally consistent distributed transactions
- Offer schematized, semi-relational tables and a SQL-based query language
- Replicate synchronously across datacenters, with replication configurable per application
- Shard, rebalance, and fail over automatically
The key insight that made Spanner possible was that time could become a first-class part of the database design—specifically, globally synchronized time with bounded uncertainty.
Spanner now powers critical Google infrastructure including AdWords, Google Play, and the F1 database (used for Google's advertising backend). Cloud Spanner (the external service) is used by customers like Snap, Square, and major banks for mission-critical workloads requiring global consistency.
Spanner's most innovative component is TrueTime—a globally synchronized time API that provides bounded clock uncertainty. TrueTime enables Spanner's external consistency guarantees by allowing the database to reason about the ordering of events across the planet.
The Problem with Clocks in Distributed Systems
Traditional distributed databases struggle with time because:

- Every machine's quartz clock drifts at its own rate
- NTP keeps clocks roughly synchronized but offers no bound on how wrong any clock might be at a given moment, with errors commonly in the tens of milliseconds
- Without bounded timestamps, you cannot safely order events across machines by their local clock readings
TrueTime's Solution
TrueTime provides a different API than traditional time functions:
```
// Traditional clock API
now() → timestamp  // Returns current time (but how accurate?)

// TrueTime API
TT.now() → TTinterval {earliest, latest}
// TTinterval represents uncertainty bounds.
// Real time is guaranteed to be within [earliest, latest].

// Example return:
TT.now() → {
  earliest: 2024-01-15T14:30:00.000
  latest:   2024-01-15T14:30:00.007
  // Uncertainty: ~7 milliseconds
}

// Additional methods:
TT.after(t) → bool   // True if t has definitely passed
TT.before(t) → bool  // True if t has definitely not arrived yet
```

How TrueTime Works
Google achieves tight time bounds through hardware and software:
Hardware Infrastructure:

- GPS receivers with dedicated antennas, deployed per datacenter
- Atomic (rubidium) clocks as a GPS-independent backup, the so-called Armageddon masters
- Multiple time master machines per datacenter that cross-check one another
Time Propagation:

- Time masters compare readings against one another and evict masters whose clocks are outliers
- A timeslave daemon on every machine polls a variety of masters roughly every 30 seconds
- Between polls, each daemon widens its reported uncertainty at an assumed worst-case drift rate (around 200 μs/s), so ε follows a sawtooth between polls
Typical Uncertainty: ε ≈ 1-7 milliseconds (usually ~4ms average)
This is vastly better than NTP (which provides no meaningful bounds) and sufficient for Spanner's purposes.
NTP synchronizes clocks but provides no guarantee about its accuracy. A Spanner server could see NTP-synchronized time of 12:00:00.000 while actual time is 12:00:00.150. TrueTime would instead report 'time is between 11:59:59.997 and 12:00:00.003'—the actual time is guaranteed within bounds. This bounded uncertainty is what enables Spanner's consistency guarantees.
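To make the API shape concrete, here is a minimal Go sketch of a TrueTime-style interval. This is an illustrative model only, not Google's implementation: it assumes a fixed epsilon, whereas real TrueTime derives epsilon from time-master polls and measured drift.

```go
package main

import (
	"fmt"
	"time"
)

// TTInterval is a toy model of TrueTime's bounded-uncertainty interval.
type TTInterval struct {
	Earliest, Latest time.Time
}

// Now widens the local clock reading by an assumed uncertainty epsilon.
// Real TrueTime computes epsilon from master polls and drift, not a constant.
func Now(epsilon time.Duration) TTInterval {
	t := time.Now()
	return TTInterval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// After reports whether timestamp t has definitely passed.
func (i TTInterval) After(t time.Time) bool { return i.Earliest.After(t) }

// Before reports whether timestamp t has definitely not arrived yet.
func (i TTInterval) Before(t time.Time) bool { return i.Latest.Before(t) }

func main() {
	tt := Now(4 * time.Millisecond) // ~4 ms average epsilon, per the text above
	fmt.Printf("real time is within [%v, %v]\n", tt.Earliest, tt.Latest)
	fmt.Println(tt.After(time.Now().Add(-time.Second))) // true: one second ago has definitely passed
}
```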
TrueTime enables Spanner's external consistency guarantee through a technique called commit wait.
What is External Consistency?
External consistency is stronger than serializability. It guarantees:
If transaction T1 commits before transaction T2 starts (in real time), then T1's commit timestamp < T2's commit timestamp, and any observer will see T1's effects before T2's effects.
This is also called "linearizability with respect to real time." It matches our intuitive expectation: if I make a bank transfer (T1) and then check my balance (T2), I should see the transfer—even if T1 and T2 execute in different datacenters.
The Commit Wait Protocol
Spanner achieves external consistency with commit wait:
```
// Spanner write-transaction commit

function commit(transaction):
    // Step 1: Execute transaction, acquire locks
    executeAndAcquireLocks(transaction)

    // Step 2: Get commit timestamp
    s = TT.now().latest  // Assign timestamp at END of uncertainty

    // Step 3: COMMIT WAIT - crucial for external consistency.
    // Wait until we're certain commit time has passed.
    while TT.now().earliest < s:
        sleep(small_interval)

    // Now we're GUARANTEED that real time ≥ s.
    // Any future transaction will see s < their_start_time.

    // Step 4: Release locks, make writes visible
    commitAndRelease(transaction, s)

    return s  // Return commit timestamp to client
```

Why Commit Wait Works
Let's trace through why this achieves external consistency:
1. T1 picks its commit timestamp s = TT.now().latest, the latest possible current time.
2. T1 waits until TT.now().earliest > s, meaning real time has definitely passed s, before making its writes visible.
3. Any transaction T2 that starts after T1 commits (in real time) picks its timestamp from TT.now().latest, which is ≥ real time > s.

So T2's timestamp is strictly greater than T1's, and timestamp order matches real-time commit order. Concretely: if T1 commits at s = 12:00:00.004 with ε = 4 ms, it waits roughly 4 ms; any T2 beginning after that wait cannot be assigned a timestamp at or below s.

The Cost of Commit Wait
Commit wait adds latency equal to the TrueTime uncertainty ε to every read-write commit.
Google optimized aggressively to minimize ε because every millisecond of uncertainty adds directly to commit latency.
| Aspect | Impact | Mitigation |
|---|---|---|
| Write latency | +ε milliseconds per commit | Reduce clock uncertainty, batch writes |
| Hardware cost | GPS receivers + atomic clocks | Amortized across huge scale |
| Complexity | Specialized infrastructure | Managed service (Cloud Spanner) |
| Geographic latency | Multi-region commits are slower | Automatic replication, read replicas |
Read-only transactions in Spanner can execute at any snapshot time without commit wait. They still get external consistency for their snapshot—they see all transactions that committed before their snapshot time. This makes reads much faster than writes.
Spanner uses a semi-relational data model that combines SQL's relational concepts with a hierarchical structure optimized for distributed locality.
Tables and Interleaving
Spanner tables look like traditional SQL tables but support hierarchical relationships via interleaving:
```sql
-- Parent table: Users
CREATE TABLE Users (
  UserId    INT64 NOT NULL,
  UserName  STRING(100),
  Email     STRING(256),
  CreatedAt TIMESTAMP,
) PRIMARY KEY (UserId);

-- Child table: interleaved in parent
CREATE TABLE Accounts (
  UserId    INT64 NOT NULL,
  AccountId INT64 NOT NULL,
  Balance   NUMERIC,
  Currency  STRING(3),
) PRIMARY KEY (UserId, AccountId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

-- Grandchild table: interleaved in Accounts
CREATE TABLE Transactions (
  UserId        INT64 NOT NULL,
  AccountId     INT64 NOT NULL,
  TransactionId INT64 NOT NULL,
  Amount        NUMERIC,
  Timestamp     TIMESTAMP,
) PRIMARY KEY (UserId, AccountId, TransactionId),
  INTERLEAVE IN PARENT Accounts ON DELETE CASCADE;
```

Physical Storage with Interleaving
Interleaved tables are stored together on disk, sorted by their keys:
```
Users(UserId=1) → [UserName='Alice', Email='alice@example.com']
  Accounts(UserId=1, AccountId=100) → [Balance=5000, Currency='USD']
    Transactions(1, 100, 1) → [Amount=-50, ...]
    Transactions(1, 100, 2) → [Amount=+200, ...]
  Accounts(UserId=1, AccountId=101) → [Balance=15000, Currency='EUR']
    Transactions(1, 101, 1) → [Amount=+1000, ...]
Users(UserId=2) → [UserName='Bob', ...]
  Accounts(UserId=2, AccountId=200) → [...]
  ...
```
This layout provides:

- Locality: a user's accounts and transactions live on the same split as the user row
- Cheap hierarchical joins: parent-child joins become sequential scans over adjacent rows
- Cheaper transactions: updates confined to one entity tree usually avoid cross-machine coordination

The prefix-read sketch below shows this locality from the client side.
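Here is a small Go sketch against the example schema above: it reads every Accounts row for one user with a single prefix read. Because Accounts is interleaved in Users, these rows are physically contiguous.

```go
import (
	"context"

	"cloud.google.com/go/spanner"
)

// userAccounts reads all Accounts rows for UserId=1 with one prefix read.
// Interleaving makes these rows contiguous, so the read is typically
// served by a single split.
func userAccounts(ctx context.Context, client *spanner.Client) *spanner.RowIterator {
	return client.Single().Read(ctx, "Accounts",
		spanner.Key{int64(1)}.AsPrefix(), // all keys beginning with UserId=1
		[]string{"AccountId", "Currency"})
}
```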
Sequential primary keys (auto-increment values, timestamps) concentrate writes on one split leader, creating bottlenecks. Spanner's guidance: (1) bit-reverse sequential keys, (2) use UUID prefixes, or (3) build hash-based sharding into the key. Example: instead of (TimestampId), use (Hash(UserId), TimestampId), as in the sketch below.
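A minimal sketch of the hash-prefix idea, using a hypothetical shardFor helper (not a Spanner API): the computed shard becomes the leading primary-key column, so sequential inserts spread across splits instead of hammering one leader.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a user to one of numShards logical shards. The result is
// used as the leading primary-key column, e.g. (ShardId, UserId, Timestamp),
// so writes ordered by timestamp fan out across splits.
func shardFor(userID int64, numShards uint32) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%d", userID)
	return h.Sum32() % numShards
}

func main() {
	fmt.Println(shardFor(42, 16)) // deterministic per user: user 42 always maps to the same shard
}
```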
Spanner's architecture is designed for planet-scale deployment with automatic operations.
Architectural Components

A Spanner deployment (a "universe") is divided into zones, the units of administrative deployment and physical isolation:

- Universe master and placement driver: singleton services that monitor zones and orchestrate automated data movement
- Zonemasters: assign data to spanservers within each zone
- Spanservers: serve data to clients; each manages hundreds to roughly a thousand tablets
- Tablets: bags of timestamped key-value mappings, persisted on Colossus, Google's distributed file system
Replication with Paxos
Each Spanner tablet is replicated across multiple zones using the Paxos consensus protocol:

- Every tablet has its own Paxos state machine, with replicas spread across zones (and regions, in multi-region configurations)
- One replica is elected leader; it maintains the lock table and coordinates writes
- A write commits once a majority of replicas accept it, so losing a minority of replicas never blocks progress
- Reads can be served by any replica that is sufficiently up to date for the requested timestamp
- Leaders hold long-lived leases (about 10 seconds) to avoid repeated elections
Multi-Region Configurations
Cloud Spanner offers predefined configurations:
| Configuration | Replicas | Write Latency | Read Latency | Use Case |
|---|---|---|---|---|
| Regional (3 zones) | 3 in one region | ~5ms | ~2ms (local) | Low-latency, single region |
| Dual-region | 2+2 in two regions | ~10-20ms | ~2ms (local) | Regional disaster recovery |
| Multi-region | 3+ across continents | ~50-100ms | ~2ms (local) | Global availability |
Spanner supports multiple transaction types optimized for different access patterns.
Read-Write Transactions
Full ACID, serializable transactions that can read and write any data:
```go
// Go example: read-write transaction
func transferFunds(ctx context.Context, client *spanner.Client,
	fromAcct, toAcct string, amount int64) error {
	_, err := client.ReadWriteTransaction(ctx,
		func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
			// Read both accounts (acquires read locks)
			row1, err := txn.ReadRow(ctx, "Accounts", spanner.Key{fromAcct}, []string{"Balance"})
			if err != nil {
				return err
			}
			row2, err := txn.ReadRow(ctx, "Accounts", spanner.Key{toAcct}, []string{"Balance"})
			if err != nil {
				return err
			}

			var bal1, bal2 int64
			if err := row1.Columns(&bal1); err != nil {
				return err
			}
			if err := row2.Columns(&bal2); err != nil {
				return err
			}

			if bal1 < amount {
				return errors.New("insufficient funds")
			}

			// Buffer writes (applied atomically at commit)
			return txn.BufferWrite([]*spanner.Mutation{
				spanner.Update("Accounts", []string{"AccountId", "Balance"},
					[]interface{}{fromAcct, bal1 - amount}),
				spanner.Update("Accounts", []string{"AccountId", "Balance"},
					[]interface{}{toAcct, bal2 + amount}),
			})
		})
	return err
}
```

Transaction Types Comparison
| Type | Reads | Writes | Isolation | Performance |
|---|---|---|---|---|
| Read-Write | Any data | Any data | Serializable | Higher latency (2PC + commit wait) |
| Read-Only | Any data | None | Serializable snapshot | Low latency, lock-free |
| Partitioned DML | Auto (via WHERE) | Yes | First-committer-wins | High throughput bulk updates |
| Stale Reads | Bounded staleness | None | Snapshot | Lowest latency, any replica |
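As a sketch of the Partitioned DML row above (table and column names come from the earlier example schema), the Go client exposes PartitionedUpdate, which executes a statement over independent partitions for high-throughput bulk changes:

```go
import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
)

// backfillCurrency sets a default currency in bulk. Partitioned DML is
// not atomic across the whole table: each partition commits independently.
func backfillCurrency(ctx context.Context, client *spanner.Client) error {
	count, err := client.PartitionedUpdate(ctx, spanner.Statement{
		SQL: "UPDATE Accounts SET Currency = 'USD' WHERE Currency IS NULL",
	})
	if err != nil {
		return err
	}
	log.Printf("updated %d rows", count)
	return nil
}
```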
Read-Only Transactions
For workloads that only read, Spanner provides optimized read-only transactions:

- They acquire no locks and never trigger commit wait
- All reads execute at a single snapshot timestamp, so they are mutually consistent
- They can be served by any replica that has caught up to the snapshot timestamp, not just the Paxos leader

A minimal sketch follows.
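This Go sketch (schema names from the earlier examples) runs queries inside one read-only transaction; every query sees the same snapshot, with no locks and no commit wait.

```go
import (
	"context"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

// monthlyStatement runs reads at one consistent snapshot.
func monthlyStatement(ctx context.Context, client *spanner.Client, userID int64) error {
	txn := client.ReadOnlyTransaction() // strong (latest) snapshot by default
	defer txn.Close()

	iter := txn.Query(ctx, spanner.Statement{
		SQL:    "SELECT AccountId, Currency FROM Accounts WHERE UserId = @uid",
		Params: map[string]interface{}{"uid": userID},
	})
	defer iter.Stop()

	for {
		row, err := iter.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		var accountID int64
		var currency string
		if err := row.Columns(&accountID, &currency); err != nil {
			return err
		}
		// ... further queries on txn would see the same snapshot ...
	}
}
```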
Strong vs Stale Reads
A strong read observes all data committed up to the moment the read starts and may need to confirm freshness with the Paxos leader. A stale read executes at a timestamp slightly in the past, using an exact or bounded staleness chosen by the client. Stale reads are often 3-10x faster because they can be served by any replica without waiting.
Use Read-Only transactions for complex reporting queries. Use Stale Reads (5-10s staleness) for dashboards and analytics where slight delay is acceptable. Reserve Read-Write transactions for actual modifications. This pattern dramatically improves throughput and latency.
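A sketch of the stale-read pattern from the tip above: a single-use read-only transaction with up to 10 seconds of staleness, which lets Spanner serve the query from the nearest replica.

```go
import (
	"context"
	"time"

	"cloud.google.com/go/spanner"
)

// dashboardCounts tolerates up to 10s of staleness so the query can be
// served by the nearest replica without leader coordination.
func dashboardCounts(ctx context.Context, client *spanner.Client) *spanner.RowIterator {
	return client.Single().
		WithTimestampBound(spanner.MaxStaleness(10 * time.Second)).
		Query(ctx, spanner.Statement{SQL: "SELECT COUNT(*) FROM Transactions"})
}
```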
Initially, Spanner used a custom API. In 2017, Google added full SQL support (Spanner SQL), making it a complete relational database.
SQL Dialect
Spanner SQL is based on standard SQL with Google-specific extensions:
```sql
-- Complex join with aggregation
SELECT
  u.UserName,
  COUNT(t.TransactionId) AS TxnCount,
  SUM(t.Amount) AS TotalAmount,
  AVG(t.Amount) AS AvgAmount
FROM Users u
JOIN Accounts a
  ON u.UserId = a.UserId
JOIN Transactions t
  ON a.UserId = t.UserId AND a.AccountId = t.AccountId
WHERE t.Timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY u.UserName
HAVING SUM(t.Amount) > 10000
ORDER BY TotalAmount DESC
LIMIT 100;

-- Window function for running balance
SELECT
  TransactionId,
  Amount,
  Timestamp,
  SUM(Amount) OVER (
    PARTITION BY AccountId
    ORDER BY Timestamp
    ROWS UNBOUNDED PRECEDING
  ) AS RunningBalance
FROM Transactions
WHERE AccountId = @accountId;

-- Parameterized query (prevents SQL injection, enables plan caching)
SELECT * FROM Users WHERE Email = @email;

-- Array and struct types
SELECT
  UserId,
  ARRAY(
    SELECT AS STRUCT AccountId, Balance
    FROM Accounts
    WHERE Accounts.UserId = Users.UserId
  ) AS AccountsList
FROM Users;
```

Query Execution
Spanner's query optimizer generates distributed execution plans:

- The SQL is compiled into a tree of relational operators
- Distributed-union operators push subplans down to the splits holding the relevant key ranges
- Filters and partial aggregates are evaluated at each split, and the results are merged at the root
- Key-range analysis lets the optimizer skip splits that cannot contain matching rows
Spanner-Specific SQL Features

Beyond standard SQL, Spanner adds:

- ARRAY and STRUCT types (shown above)
- INTERLEAVE IN PARENT clauses for hierarchical locality
- Query hints such as FORCE_INDEX to steer index selection
- Commit-timestamp columns (allow_commit_timestamp) populated with PENDING_COMMIT_TIMESTAMP() at write time
Since 2017, Google Cloud Spanner has made Spanner's capabilities available to external customers as a fully managed service.
Service Characteristics

- Fully managed: no servers to patch, no manual resharding, no maintenance windows
- Availability SLA of up to 99.999% for multi-region configurations (99.99% for regional)
- Online horizontal scaling by adding or removing nodes or processing units
- Synchronous replication, automatic backups, and point-in-time recovery
Pricing Model
Cloud Spanner pricing is based on three components:

- Compute capacity: billed per node (or per processing unit) per hour
- Storage: billed per GB stored per month
- Network egress: billed per GB for cross-region and internet traffic
Exact rates vary by region and by configuration (regional vs multi-region); consult Google Cloud's current pricing page for specific figures.
When to Use Cloud Spanner

Cloud Spanner is a strong fit when you need:

- Relational semantics (SQL, ACID transactions) beyond the scale of a single-node database
- Multi-region deployments with strong consistency and very high availability
- Elastic growth without manual sharding or maintenance downtime

It is usually the wrong choice for small single-region applications (a managed PostgreSQL or MySQL instance is simpler and cheaper) and for purely analytical scans (BigQuery is built for those).
Cloud Spanner is only available on Google Cloud Platform. Organizations concerned about vendor lock-in may prefer open-source alternatives like CockroachDB (which we'll cover next) that offer similar capabilities while running anywhere.
Google Spanner represents a landmark achievement in database engineering. It demonstrated that the trade-offs forced by the CAP theorem could be managed—with sufficient engineering—to deliver global-scale consistency.
Let's consolidate the key insights:

- TrueTime turns clock uncertainty from a hidden hazard into an explicit, bounded quantity (ε)
- Commit wait converts that bound into external consistency: timestamp order matches real-time commit order
- Paxos-replicated splits deliver availability and automatic failover; interleaving preserves locality
- Full SQL support (added in 2017) makes Spanner a complete relational database, not just a consistent key-value store
- Cloud Spanner offers these capabilities as a managed service, at the price of vendor lock-in
What's Next
Spanner's design inspired a generation of NewSQL databases. Next, we'll examine CockroachDB—an open-source, PostgreSQL-compatible database that brings Spanner's concepts to organizations that need portability and want to avoid cloud lock-in. CockroachDB proves that planet-scale SQL doesn't require specialized hardware or a single vendor.
You now understand Google Spanner's architecture, the breakthrough of TrueTime, and how it achieves global-scale consistency with ACID transactions. This foundation will help you appreciate how other NewSQL systems (like CockroachDB) achieve similar goals with different approaches.