In 2012, Google published a paper that fundamentally altered what database engineers believed was possible. Google Spanner demonstrated that a single database could provide:

- Global distribution across datacenters and continents
- Externally consistent (strictly serializable) ACID transactions
- A schematized, semi-relational data model with a SQL-based query language
- High availability through synchronous, automatic replication
Prior to Spanner, conventional wisdom held that these properties couldn't coexist at global scale. The CAP theorem seemed to mandate choosing between consistency and availability. Spanner proved that with sufficient engineering—and atomic clocks—you could achieve both for virtually all practical purposes.
By the end of this page, you will understand Spanner's revolutionary architecture, including TrueTime (Google's globally-synchronized time system), its semi-relational data model, distributed transaction protocols, and the specific design decisions that enable planet-scale consistency. You'll see why Spanner is considered the foundation of the NewSQL movement.
Google didn't build Spanner for academic interest. It emerged from genuine operational pain with earlier systems.
Pre-Spanner Google Infrastructure
By the late 2000s, Google operated several specialized storage systems:
Bigtable (2006): Scalable, distributed storage for structured data. Supported atomic operations on single rows only, with eventually consistent cross-datacenter replication. Powered Search, Analytics, and many other services.
Megastore (2011): Built on Bigtable, added stronger consistency and cross-datacenter replication. But performance was limited (single-digit writes per second per entity group) and ACID semantics were restricted to entity groups.
The limitations became increasingly painful as Google's applications grew more complex:

- Bigtable offered no transactions across rows, forcing teams to build fragile application-level workarounds
- Megastore's entity-group model constrained schema design and capped write throughput
- Neither system offered SQL, so complex queries meant custom application code
- Teams kept re-implementing transactional machinery that a database should provide
The Spanner Vision
Spanner was designed to solve these problems with ambitious goals:

- Scale to millions of machines across hundreds of datacenters
- Provide externally consistent distributed transactions
- Offer schematized, semi-relational tables and a SQL-based query language
- Replicate synchronously across datacenters, with replication configurable per application
- Shard, rebalance, and fail over automatically
The key insight that made Spanner possible was that time could become a first-class part of the database design—specifically, globally synchronized time with bounded uncertainty.
Spanner now powers critical Google infrastructure including AdWords, Google Play, and the F1 database (used for Google's advertising backend). Cloud Spanner (the external service) is used by customers like Snap, Square, and major banks for mission-critical workloads requiring global consistency.
Spanner's most innovative component is TrueTime—a globally synchronized time API that provides bounded clock uncertainty. TrueTime enables Spanner's external consistency guarantees by allowing the database to reason about the ordering of events across the planet.
The Problem with Clocks in Distributed Systems
Traditional distributed databases struggle with time because:

- Every machine's quartz clock drifts at its own rate
- NTP keeps clocks roughly synchronized but offers no bound on how wrong any clock might be at a given moment, with errors commonly in the tens of milliseconds
- Without bounded timestamps, you cannot safely order events across machines by their local clock readings
TrueTime's Solution
TrueTime provides a different API than traditional time functions:
```
// Traditional clock API
now() → timestamp  // Returns current time (but how accurate?)

// TrueTime API
TT.now() → TTinterval {earliest, latest}
// TTinterval represents uncertainty bounds.
// Real time is guaranteed to be within [earliest, latest].

// Example return:
TT.now() → {
  earliest: 2024-01-15T14:30:00.000
  latest:   2024-01-15T14:30:00.007
  // Uncertainty: ~7 milliseconds
}

// Additional methods:
TT.after(t) → bool   // True if t has definitely passed
TT.before(t) → bool  // True if t has definitely not arrived yet
```

How TrueTime Works
Google achieves tight time bounds through hardware and software:
Hardware Infrastructure:

- GPS receivers with dedicated antennas, deployed per datacenter
- Atomic (rubidium) clocks as a GPS-independent backup, the so-called Armageddon masters
- Multiple time master machines per datacenter that cross-check one another
Time Propagation:

- Time masters compare readings against one another and evict masters whose clocks are outliers
- A timeslave daemon on every machine polls a variety of masters roughly every 30 seconds
- Between polls, each daemon widens its reported uncertainty at an assumed worst-case drift rate (around 200 μs/s), so ε follows a sawtooth between polls
Typical Uncertainty: ε ≈ 1-7 milliseconds (usually ~4ms average)
This is vastly better than NTP (which provides no meaningful bounds) and sufficient for Spanner's purposes.
NTP synchronizes clocks but provides no guarantee about its accuracy. A Spanner server could see NTP-synchronized time of 12:00:00.000 while actual time is 12:00:00.150. TrueTime would instead report 'time is between 11:59:59.997 and 12:00:00.003'—the actual time is guaranteed within bounds. This bounded uncertainty is what enables Spanner's consistency guarantees.
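To make the API shape concrete, here is a minimal Go sketch of a TrueTime-style interval. This is an illustrative model only, not Google's implementation: it assumes a fixed epsilon, whereas real TrueTime derives epsilon from time-master polls and measured drift.

```go
package main

import (
	"fmt"
	"time"
)

// TTInterval is a toy model of TrueTime's bounded-uncertainty interval.
type TTInterval struct {
	Earliest, Latest time.Time
}

// Now widens the local clock reading by an assumed uncertainty epsilon.
// Real TrueTime computes epsilon from master polls and drift, not a constant.
func Now(epsilon time.Duration) TTInterval {
	t := time.Now()
	return TTInterval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// After reports whether timestamp t has definitely passed.
func (i TTInterval) After(t time.Time) bool { return i.Earliest.After(t) }

// Before reports whether timestamp t has definitely not arrived yet.
func (i TTInterval) Before(t time.Time) bool { return i.Latest.Before(t) }

func main() {
	tt := Now(4 * time.Millisecond) // ~4 ms average epsilon, per the text above
	fmt.Printf("real time is within [%v, %v]\n", tt.Earliest, tt.Latest)
	fmt.Println(tt.After(time.Now().Add(-time.Second))) // true: one second ago has definitely passed
}
```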
TrueTime enables Spanner's external consistency guarantee through a technique called commit wait.
What is External Consistency?
External consistency is stronger than serializability. It guarantees:
If transaction T1 commits before transaction T2 starts (in real time), then T1's commit timestamp < T2's commit timestamp, and any observer will see T1's effects before T2's effects.
This is also called "linearizability with respect to real time." It matches our intuitive expectation: if I make a bank transfer (T1) and then check my balance (T2), I should see the transfer—even if T1 and T2 execute in different datacenters.
The Commit Wait Protocol
Spanner achieves external consistency with commit wait:
```
// Spanner write-transaction commit

function commit(transaction):
    // Step 1: Execute transaction, acquire locks
    executeAndAcquireLocks(transaction)

    // Step 2: Get commit timestamp
    s = TT.now().latest  // Assign timestamp at END of uncertainty

    // Step 3: COMMIT WAIT - crucial for external consistency.
    // Wait until we're certain commit time has passed.
    while TT.now().earliest < s:
        sleep(small_interval)

    // Now we're GUARANTEED that real time ≥ s.
    // Any future transaction will see s < their_start_time.

    // Step 4: Release locks, make writes visible
    commitAndRelease(transaction, s)

    return s  // Return commit timestamp to client
```

Why Commit Wait Works
Let's trace through why this achieves external consistency:
1. T1 picks its commit timestamp s = TT.now().latest, the latest possible current time.
2. T1 waits until TT.now().earliest > s, meaning real time has definitely passed s, before making its writes visible.
3. Any transaction T2 that starts after T1 commits (in real time) picks its timestamp from TT.now().latest, which is ≥ real time > s.

So T2's timestamp is strictly greater than T1's, and timestamp order matches real-time commit order. Concretely: if T1 commits at s = 12:00:00.004 with ε = 4 ms, it waits roughly 4 ms; any T2 beginning after that wait cannot be assigned a timestamp at or below s.

The Cost of Commit Wait
Commit wait adds latency equal to the TrueTime uncertainty ε to every read-write commit.
Google optimized aggressively to minimize ε because every millisecond of uncertainty adds directly to commit latency.
| Aspect | Impact | Mitigation |
|---|---|---|
| Write latency | +ε milliseconds per commit | Reduce clock uncertainty, batch writes |
| Hardware cost | GPS receivers + atomic clocks | Amortized across huge scale |
| Complexity | Specialized infrastructure | Managed service (Cloud Spanner) |
| Geographic latency | Multi-region commits are slower | Automatic replication, read replicas |
Read-only transactions in Spanner can execute at any snapshot time without commit wait. They still get external consistency for their snapshot—they see all transactions that committed before their snapshot time. This makes reads much faster than writes.
Spanner uses a semi-relational data model that combines SQL's relational concepts with a hierarchical structure optimized for distributed locality.
Tables and Interleaving
Spanner tables look like traditional SQL tables but support hierarchical relationships via interleaving:
```sql
-- Parent table: Users
CREATE TABLE Users (
  UserId    INT64 NOT NULL,
  UserName  STRING(100),
  Email     STRING(256),
  CreatedAt TIMESTAMP,
) PRIMARY KEY (UserId);

-- Child table: interleaved in parent
CREATE TABLE Accounts (
  UserId    INT64 NOT NULL,
  AccountId INT64 NOT NULL,
  Balance   NUMERIC,
  Currency  STRING(3),
) PRIMARY KEY (UserId, AccountId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

-- Grandchild table: interleaved in Accounts
CREATE TABLE Transactions (
  UserId        INT64 NOT NULL,
  AccountId     INT64 NOT NULL,
  TransactionId INT64 NOT NULL,
  Amount        NUMERIC,
  Timestamp     TIMESTAMP,
) PRIMARY KEY (UserId, AccountId, TransactionId),
  INTERLEAVE IN PARENT Accounts ON DELETE CASCADE;
```

Physical Storage with Interleaving
Interleaved tables are stored together on disk, sorted by their keys:
```
Users(UserId=1) → [UserName='Alice', Email='alice@example.com']
  Accounts(UserId=1, AccountId=100) → [Balance=5000, Currency='USD']
    Transactions(1, 100, 1) → [Amount=-50, ...]
    Transactions(1, 100, 2) → [Amount=+200, ...]
  Accounts(UserId=1, AccountId=101) → [Balance=15000, Currency='EUR']
    Transactions(1, 101, 1) → [Amount=+1000, ...]
Users(UserId=2) → [UserName='Bob', ...]
  Accounts(UserId=2, AccountId=200) → [...]
  ...
```
This layout provides:

- Locality: a user's accounts and transactions live on the same split as the user row
- Cheap hierarchical joins: parent-child joins become sequential scans over adjacent rows
- Cheaper transactions: updates confined to one entity tree usually avoid cross-machine coordination

The prefix-read sketch below shows this locality from the client side.
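Here is a small Go sketch against the example schema above: it reads every Accounts row for one user with a single prefix read. Because Accounts is interleaved in Users, these rows are physically contiguous.

```go
import (
	"context"

	"cloud.google.com/go/spanner"
)

// userAccounts reads all Accounts rows for UserId=1 with one prefix read.
// Interleaving makes these rows contiguous, so the read is typically
// served by a single split.
func userAccounts(ctx context.Context, client *spanner.Client) *spanner.RowIterator {
	return client.Single().Read(ctx, "Accounts",
		spanner.Key{int64(1)}.AsPrefix(), // all keys beginning with UserId=1
		[]string{"AccountId", "Currency"})
}
```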
Sequential primary keys (auto-increment values, timestamps) concentrate writes on one split leader, creating bottlenecks. Spanner's guidance: (1) bit-reverse sequential keys, (2) use UUID prefixes, or (3) build hash-based sharding into the key. Example: instead of (TimestampId), use (Hash(UserId), TimestampId), as in the sketch below.
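A minimal sketch of the hash-prefix idea, using a hypothetical shardFor helper (not a Spanner API): the computed shard becomes the leading primary-key column, so sequential inserts spread across splits instead of hammering one leader.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a user to one of numShards logical shards. The result is
// used as the leading primary-key column, e.g. (ShardId, UserId, Timestamp),
// so writes ordered by timestamp fan out across splits.
func shardFor(userID int64, numShards uint32) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%d", userID)
	return h.Sum32() % numShards
}

func main() {
	fmt.Println(shardFor(42, 16)) // deterministic per user: user 42 always maps to the same shard
}
```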
Spanner's architecture is designed for planet-scale deployment with automatic operations.
Architectural Components

A Spanner deployment (a "universe") is divided into zones, the units of administrative deployment and physical isolation:

- Universe master and placement driver: singleton services that monitor zones and orchestrate automated data movement
- Zonemasters: assign data to spanservers within each zone
- Spanservers: serve data to clients; each manages hundreds to roughly a thousand tablets
- Tablets: bags of timestamped key-value mappings, persisted on Colossus, Google's distributed file system
Replication with Paxos
Each Spanner tablet is replicated across multiple zones using the Paxos consensus protocol:

- Every tablet has its own Paxos state machine, with replicas spread across zones (and regions, in multi-region configurations)
- One replica is elected leader; it maintains the lock table and coordinates writes
- A write commits once a majority of replicas accept it, so losing a minority of replicas never blocks progress
- Reads can be served by any replica that is sufficiently up to date for the requested timestamp
- Leaders hold long-lived leases (about 10 seconds) to avoid repeated elections
Multi-Region Configurations
Cloud Spanner offers predefined configurations:
| Configuration | Replicas | Write Latency | Read Latency | Use Case |
|---|---|---|---|---|
| Regional (3 zones) | 3 in one region | ~5ms | ~2ms (local) | Low-latency, single region |
| Dual-region | 2+2 in two regions | ~10-20ms | ~2ms (local) | Regional disaster recovery |
| Multi-region | 3+ across continents | ~50-100ms | ~2ms (local) | Global availability |
Spanner supports multiple transaction types optimized for different access patterns.
Read-Write Transactions
Full ACID, serializable transactions that can read and write any data:
```go
// Go example: read-write transaction
func transferFunds(ctx context.Context, client *spanner.Client,
	fromAcct, toAcct string, amount int64) error {
	_, err := client.ReadWriteTransaction(ctx,
		func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
			// Read both accounts (acquires read locks)
			row1, err := txn.ReadRow(ctx, "Accounts", spanner.Key{fromAcct}, []string{"Balance"})
			if err != nil {
				return err
			}
			row2, err := txn.ReadRow(ctx, "Accounts", spanner.Key{toAcct}, []string{"Balance"})
			if err != nil {
				return err
			}

			var bal1, bal2 int64
			if err := row1.Columns(&bal1); err != nil {
				return err
			}
			if err := row2.Columns(&bal2); err != nil {
				return err
			}

			if bal1 < amount {
				return errors.New("insufficient funds")
			}

			// Buffer writes (applied atomically at commit)
			return txn.BufferWrite([]*spanner.Mutation{
				spanner.Update("Accounts", []string{"AccountId", "Balance"},
					[]interface{}{fromAcct, bal1 - amount}),
				spanner.Update("Accounts", []string{"AccountId", "Balance"},
					[]interface{}{toAcct, bal2 + amount}),
			})
		})
	return err
}
```

Transaction Types Comparison
| Type | Reads | Writes | Isolation | Performance |
|---|---|---|---|---|
| Read-Write | Any data | Any data | Serializable | Higher latency (2PC + commit wait) |
| Read-Only | Any data | None | Serializable snapshot | Low latency, lock-free |
| Partitioned DML | Auto (via WHERE) | Yes | First-committer-wins | High throughput bulk updates |
| Stale Reads | Bounded staleness | None | Snapshot | Lowest latency, any replica |
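As a sketch of the Partitioned DML row above (table and column names come from the earlier example schema), the Go client exposes PartitionedUpdate, which executes a statement over independent partitions for high-throughput bulk changes:

```go
import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
)

// backfillCurrency sets a default currency in bulk. Partitioned DML is
// not atomic across the whole table: each partition commits independently.
func backfillCurrency(ctx context.Context, client *spanner.Client) error {
	count, err := client.PartitionedUpdate(ctx, spanner.Statement{
		SQL: "UPDATE Accounts SET Currency = 'USD' WHERE Currency IS NULL",
	})
	if err != nil {
		return err
	}
	log.Printf("updated %d rows", count)
	return nil
}
```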
Read-Only Transactions
For workloads that only read, Spanner provides optimized read-only transactions:

- They acquire no locks and never trigger commit wait
- All reads execute at a single snapshot timestamp, so they are mutually consistent
- They can be served by any replica that has caught up to the snapshot timestamp, not just the Paxos leader

A minimal sketch follows.
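This Go sketch (schema names from the earlier examples) runs queries inside one read-only transaction; every query sees the same snapshot, with no locks and no commit wait.

```go
import (
	"context"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

// monthlyStatement runs reads at one consistent snapshot.
func monthlyStatement(ctx context.Context, client *spanner.Client, userID int64) error {
	txn := client.ReadOnlyTransaction() // strong (latest) snapshot by default
	defer txn.Close()

	iter := txn.Query(ctx, spanner.Statement{
		SQL:    "SELECT AccountId, Currency FROM Accounts WHERE UserId = @uid",
		Params: map[string]interface{}{"uid": userID},
	})
	defer iter.Stop()

	for {
		row, err := iter.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		var accountID int64
		var currency string
		if err := row.Columns(&accountID, &currency); err != nil {
			return err
		}
		// ... further queries on txn would see the same snapshot ...
	}
}
```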
Strong vs Stale Reads
A strong read observes all data committed up to the moment the read starts and may need to confirm freshness with the Paxos leader. A stale read executes at a timestamp slightly in the past, using an exact or bounded staleness chosen by the client. Stale reads are often 3-10x faster because they can be served by any replica without waiting.
Use Read-Only transactions for complex reporting queries. Use Stale Reads (5-10s staleness) for dashboards and analytics where slight delay is acceptable. Reserve Read-Write transactions for actual modifications. This pattern dramatically improves throughput and latency.
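A sketch of the stale-read pattern from the tip above: a single-use read-only transaction with up to 10 seconds of staleness, which lets Spanner serve the query from the nearest replica.

```go
import (
	"context"
	"time"

	"cloud.google.com/go/spanner"
)

// dashboardCounts tolerates up to 10s of staleness so the query can be
// served by the nearest replica without leader coordination.
func dashboardCounts(ctx context.Context, client *spanner.Client) *spanner.RowIterator {
	return client.Single().
		WithTimestampBound(spanner.MaxStaleness(10 * time.Second)).
		Query(ctx, spanner.Statement{SQL: "SELECT COUNT(*) FROM Transactions"})
}
```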
Initially, Spanner used a custom API. In 2017, Google added full SQL support (Spanner SQL), making it a complete relational database.
SQL Dialect
Spanner SQL is based on standard SQL with Google-specific extensions:
```sql
-- Complex join with aggregation
SELECT
  u.UserName,
  COUNT(t.TransactionId) AS TxnCount,
  SUM(t.Amount) AS TotalAmount,
  AVG(t.Amount) AS AvgAmount
FROM Users u
JOIN Accounts a
  ON u.UserId = a.UserId
JOIN Transactions t
  ON a.UserId = t.UserId AND a.AccountId = t.AccountId
WHERE t.Timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY u.UserName
HAVING SUM(t.Amount) > 10000
ORDER BY TotalAmount DESC
LIMIT 100;

-- Window function for running balance
SELECT
  TransactionId,
  Amount,
  Timestamp,
  SUM(Amount) OVER (
    PARTITION BY AccountId
    ORDER BY Timestamp
    ROWS UNBOUNDED PRECEDING
  ) AS RunningBalance
FROM Transactions
WHERE AccountId = @accountId;

-- Parameterized query (prevents SQL injection, enables plan caching)
SELECT * FROM Users WHERE Email = @email;

-- Array and struct types
SELECT
  UserId,
  ARRAY(
    SELECT AS STRUCT AccountId, Balance
    FROM Accounts
    WHERE Accounts.UserId = Users.UserId
  ) AS AccountsList
FROM Users;
```

Query Execution
Spanner's query optimizer generates distributed execution plans:

- The SQL is compiled into a tree of relational operators
- Distributed-union operators push subplans down to the splits holding the relevant key ranges
- Filters and partial aggregates are evaluated at each split, and the results are merged at the root
- Key-range analysis lets the optimizer skip splits that cannot contain matching rows
Spanner-Specific SQL Features

Beyond standard SQL, Spanner adds:

- ARRAY and STRUCT types (shown above)
- INTERLEAVE IN PARENT clauses for hierarchical locality
- Query hints such as FORCE_INDEX to steer index selection
- Commit-timestamp columns (allow_commit_timestamp) populated with PENDING_COMMIT_TIMESTAMP() at write time
Since 2017, Google Cloud Spanner has made Spanner's capabilities available to external customers as a fully managed service.
Service Characteristics

- Fully managed: no servers to patch, no manual resharding, no maintenance windows
- Availability SLA of up to 99.999% for multi-region configurations (99.99% for regional)
- Online horizontal scaling by adding or removing nodes or processing units
- Synchronous replication, automatic backups, and point-in-time recovery
Pricing Model
Cloud Spanner pricing is based on three components:

- Compute capacity: billed per node (or per processing unit) per hour
- Storage: billed per GB stored per month
- Network egress: billed per GB for cross-region and internet traffic
Exact rates vary by region and by configuration (regional vs multi-region); consult Google Cloud's current pricing page for specific figures.
When to Use Cloud Spanner

Cloud Spanner is a strong fit when you need:

- Relational semantics (SQL, ACID transactions) beyond the scale of a single-node database
- Multi-region deployments with strong consistency and very high availability
- Elastic growth without manual sharding or maintenance downtime

It is usually the wrong choice for small single-region applications (a managed PostgreSQL or MySQL instance is simpler and cheaper) and for purely analytical scans (BigQuery is built for those).
Cloud Spanner is only available on Google Cloud Platform. Organizations concerned about vendor lock-in may prefer open-source alternatives like CockroachDB (which we'll cover next) that offer similar capabilities while running anywhere.
Google Spanner represents a landmark achievement in database engineering. It demonstrated that the trade-offs forced by the CAP theorem could be managed—with sufficient engineering—to deliver global-scale consistency.
Let's consolidate the key insights:

- TrueTime turns clock uncertainty from a hidden hazard into an explicit, bounded quantity (ε)
- Commit wait converts that bound into external consistency: timestamp order matches real-time commit order
- Paxos-replicated splits deliver availability and automatic failover; interleaving preserves locality
- Full SQL support (added in 2017) makes Spanner a complete relational database, not just a consistent key-value store
- Cloud Spanner offers these capabilities as a managed service, at the price of vendor lock-in
What's Next
Spanner's design inspired a generation of NewSQL databases. Next, we'll examine CockroachDB—an open-source, PostgreSQL-compatible database that brings Spanner's concepts to organizations that need portability and want to avoid cloud lock-in. CockroachDB proves that planet-scale SQL doesn't require specialized hardware or a single vendor.
You now understand Google Spanner's architecture, the breakthrough of TrueTime, and how it achieves global-scale consistency with ACID transactions. This foundation will help you appreciate how other NewSQL systems (like CockroachDB) achieve similar goals with different approaches.