In 1978, Leslie Lamport published "Time, Clocks, and the Ordering of Events in a Distributed System"—a paper that would become one of the most cited in computer science. The core insight was profound: in a distributed system, there is no global notion of time. Different computers have different clocks that drift at different rates, and even the most precise synchronization protocols leave uncertainty windows of milliseconds to seconds.
This clock uncertainty creates a fundamental problem for databases. When two transactions occur on different continents, how do you determine which happened first? If the clocks on those continents disagree, you might produce inconsistent orderings that violate the database's guarantees.
For decades, distributed systems dealt with this by either pretending clocks were good enough and accepting occasional ordering anomalies, or giving up on physical time altogether—relying on logical clocks or funneling everything through a single coordinator.
Google's TrueTime took a radically different approach: instead of pretending clocks are perfect or giving up on physical time, TrueTime explicitly models clock uncertainty and uses that uncertainty to guarantee correct ordering.
By the end of this page, you will understand how TrueTime works, why it's revolutionary for distributed systems, and how Spanner uses it to achieve external consistency—the strongest consistency guarantee possible. You'll see how atomic clocks and GPS receivers combine with clever protocols to create a globally coherent notion of time.
To understand TrueTime's innovation, we must first deeply understand the problem it solves.
Clock Drift: Nothing Stays Synchronized
Every computer contains a quartz crystal oscillator that counts time. These crystals are remarkably precise—typically accurate to within 10-50 parts per million (ppm). But 50ppm means a clock that drifts by 50 microseconds every second, which accumulates to roughly 3 milliseconds per minute, 180 milliseconds per hour, and more than 4 seconds per day.
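If you want to check that arithmetic, here is a quick back-of-the-envelope sketch in Python (the function name is just illustrative):

```python
DRIFT_PPM = 50  # worst-case quartz drift cited above

def accumulated_drift_s(elapsed_s: float) -> float:
    """Worst-case accumulated drift, in seconds, after `elapsed_s` of wall-clock time."""
    return DRIFT_PPM * 1e-6 * elapsed_s

print(accumulated_drift_s(60))      # ~0.003  -> about 3 ms per minute
print(accumulated_drift_s(3600))    # ~0.18   -> about 180 ms per hour
print(accumulated_drift_s(86400))   # ~4.32   -> over 4 seconds per day
```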
Now imagine thousands of servers across data centers worldwide, each drifting independently. Even with periodic synchronization via protocols like NTP (Network Time Protocol), you can never achieve perfect alignment.
NTP's Limitations:
NTP, the standard internet time synchronization protocol, has fundamental limitations:
Network Latency Variation: NTP works by exchanging timestamps across the network. Network delays vary unpredictably, introducing uncertainty of 10-100ms over the internet, and typically 1-10ms even within a datacenter.
No Bounded Uncertainty: NTP estimates clock offset, but this estimate has an unbounded error. You know approximately what time it is, but you don't know how wrong you might be.
Vulnerable to Failures: If NTP servers become unreachable, clocks continue drifting without correction.
For a database trying to order transactions globally, these limitations are catastrophic.
| Method | Typical Accuracy | Bounded Error? | Failure Mode | Cost |
|---|---|---|---|---|
| Quartz (unsynchronized) | 50 ppm | No | Unbounded drift | Free |
| NTP (internet) | 10-100ms | No | Unbounded if unreachable | Minimal |
| NTP (datacenter) | 1-10ms | No | Unbounded if unreachable | Minimal |
| PTP (IEEE 1588) | <1ms | No | Depends on network | Moderate |
| GPS Receiver | <1μs to UTC | Yes (after acquisition) | Sky visibility required | Moderate |
| Atomic Clock | ~1ns | Yes | Very rare failures | High ($50K+) |
| TrueTime (GPS + Atomic) | ~1-7ms | Yes, always | Graceful degradation | High (at scale) |
Why Unbounded Uncertainty Breaks Consistency:
Consider two transactions that must be ordered: transaction A commits in New York, and transaction B commits in London with a recorded timestamp 5 milliseconds after A's.
Does A happen before B? If the New York clock was 10ms fast and the London clock was 10ms slow, then in "true" time A happened 25ms before B—yet the recorded timestamps suggest a gap of only 5ms.
Worse, if the clock errors were reversed, B would actually have happened 15ms before A even though the recorded timestamps still show A first—a complete ordering violation.
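A tiny numerical sketch of the scenario above, using the illustrative values from the text:

```python
# The database records A 5 ms before B (seconds, relative to an arbitrary origin).
recorded_a = 0.000   # New York
recorded_b = 0.005   # London

def true_time(recorded: float, clock_ran_fast_by: float) -> float:
    """Recover true time given how far ahead the recording clock was running."""
    return recorded - clock_ran_fast_by

# Case 1: New York 10 ms fast, London 10 ms slow -> A really led B by 25 ms.
gap = true_time(recorded_b, -0.010) - true_time(recorded_a, +0.010)
print(gap)   # ~0.025

# Case 2: errors reversed -> B actually happened 15 ms BEFORE A,
# even though the recorded timestamps still show A first.
gap = true_time(recorded_b, +0.010) - true_time(recorded_a, -0.010)
print(gap)   # ~-0.015
```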
The Implication for Databases:
If you can't determine true ordering, you can't guarantee externally consistent transactions, consistent snapshots, or audit logs that reflect what actually happened.
This is why most distributed databases either avoid global ordering entirely, or funnel all transactions through a single coordinator.
If you don't know how wrong your clock might be, you can't make correct ordering decisions. But if you know the bounds of your uncertainty, you can wait until the uncertainty resolves before committing—guaranteeing correct ordering.
TrueTime's brilliance lies not in eliminating clock uncertainty—which is physically impossible—but in bounding it and making it explicit.
The TrueTime API:
Unlike traditional clock APIs that return a single timestamp, TrueTime returns an interval:
TT.now() → [earliest, latest]
This interval represents the range of possible true times. If TrueTime returns [12:00:00.000, 12:00:00.007], it's guaranteeing that the actual current time is somewhere in that 7-millisecond window.
The width of this interval is called ε (epsilon)—the uncertainty. TrueTime provides two additional operations:
TT.after(t) → true if t has definitely passed
TT.before(t) → true if t has definitely not arrived
These are powerful primitives. TT.after(t) returns true only when the earliest possible time is past t—meaning t has definitely passed regardless of clock errors.
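To make the semantics concrete, here is a minimal Python sketch of an interval-based clock. Only the now()/after()/before() semantics come from TrueTime itself; the class names and the way ε is supplied are illustrative assumptions, not Google's implementation.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on true time (seconds since epoch)
    latest: float    # upper bound on true time

class TrueTimeSketch:
    """Illustrative TrueTime-style clock. `uncertainty_fn` stands in for the
    real machinery that bounds epsilon using GPS and atomic clock sources."""

    def __init__(self, uncertainty_fn):
        self._epsilon = uncertainty_fn  # returns the current interval width in seconds

    def now(self) -> TTInterval:
        t = time.time()
        eps = self._epsilon()
        return TTInterval(earliest=t - eps, latest=t + eps)

    def after(self, t: float) -> bool:
        # True only if t has definitely passed, regardless of clock error.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only if t has definitely not arrived yet.
        return self.now().latest < t

# Example: a clock whose uncertainty is a constant 4 ms.
tt = TrueTimeSketch(lambda: 0.004)
```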
The Hardware Foundation:
TrueTime's bounded uncertainty depends on specialized hardware:
GPS Receivers: GPS satellites carry atomic clocks and broadcast precise time signals. A GPS receiver can determine UTC to within ~100 nanoseconds when it has satellite visibility.
Atomic Clocks: Each Google datacenter has multiple atomic clocks (typically cesium or rubidium). These drift at rates measured in nanoseconds per day rather than microseconds.
Time Masters: Dedicated time servers in each datacenter that serve time to every other machine. Most masters are GPS-equipped; the rest—the "Armageddon masters"—carry atomic clocks as a GPS-independent backstop. Client daemons poll a mix of masters, cross-validate them, and reject outliers.
TrueTime Infrastructure Architecture:
The original figure here shows the per-datacenter time hierarchy and how ε behaves over time. Stratum 0 reference clocks (GPS receivers accurate to ~100ns, atomic clocks drifting ~1ns/day) feed the time masters, which cross-validate sources, detect and exclude faulty clocks, and distribute time with certified error bounds through redundant per-rack timeservers to the worker servers that actually answer TT.now(). The accompanying plot shows ε as a sawtooth: after each successful poll it resets to roughly the network and processing latency, then grows with local clock drift until the next poll—typically 1-7ms, averaging around 4ms.
How Uncertainty is Bounded:
TrueTime's uncertainty calculation accounts for all error sources:
Reference Time Accuracy: GPS/atomic clock accuracy (sub-microsecond, negligible)
Network Latency: Time to communicate between time servers and workers (~1ms within datacenter)
Quartz Drift Between Polls: Local clocks drift between time synchronization polls. At 200ppm drift and 30-second poll intervals, this adds ~6ms uncertainty.
Processing Delays: Time to process responses and compute intervals
TrueTime continuously tracks these factors and adjusts the returned interval accordingly. After a successful time poll, uncertainty drops to ~1-2ms. As time passes without a poll, uncertainty grows due to local clock drift. If polls fail, uncertainty grows faster until backup time sources are consulted.
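The sketch below models how ε evolves between polls, using the figures quoted above (roughly 1ms of post-poll latency, an assumed 200ppm drift bound, 30-second poll interval); the function name and parameter names are illustrative.

```python
def epsilon_at(seconds_since_poll: float,
               post_poll_floor_s: float = 0.001,   # ~1ms network + processing latency
               assumed_drift_ppm: float = 200.0) -> float:
    """Uncertainty = post-poll floor + worst-case local drift since the last poll."""
    return post_poll_floor_s + assumed_drift_ppm * 1e-6 * seconds_since_poll

print(epsilon_at(0.0))   # ~0.001 -> about 1ms right after a successful poll
print(epsilon_at(30.0))  # ~0.007 -> about 7ms just before the next 30-second poll
```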
The Result:
Typical ε values are 1-7ms, with an average around 4ms. This might seem large, but it's bounded and known—which is infinitely more useful than an unbounded uncertainty that might be milliseconds or minutes.
GPS receivers need sky visibility, which servers in underground datacenters don't have. Atomic clocks don't need sky visibility but are expensive and still drift (very slowly). By combining GPS receivers (at the edges of datacenters) with atomic clocks (providing holdover during GPS outages) and local distribution, TrueTime achieves both reliability and precision.
With TrueTime's bounded uncertainty, Spanner can provide external consistency—the strongest consistency guarantee a distributed database can offer.
What is External Consistency?
External consistency means: if transaction T1 commits before transaction T2 starts (according to an external observer with a perfect clock), then T1's commit timestamp is earlier than T2's commit timestamp.
This is stronger than serializability. Serializability only requires that transactions appear to execute atomically in some order—that order doesn't have to match real-time ordering. External consistency makes the database respect real-world time.
Why External Consistency Matters:
Consider an auditing scenario: transaction T1 revokes Alice's permissions, and moments later transaction T2 records an action Alice attempts to perform.
With external consistency, the audit log will always show T1 (revocation) with an earlier timestamp than T2 (attempted action). Auditors can trust the database's ordering reflects what actually happened.
Without external consistency, clock skew could cause T2 to receive an earlier timestamp than T1—making it appear that Alice acted before her permissions were revoked, when in reality she acted after.
Linearizability vs. External Consistency:
Linearizability (also called "atomic consistency") is a related but different concept:
Linearizability applies to individual operations on individual objects. Each operation appears to happen atomically at some point between its invocation and response.
External consistency applies to entire transactions that may touch multiple objects. The transaction's effective timestamp respects real-time ordering.
Spanner provides both. Transactions are externally consistent, and individual read/write operations within a Paxos group are linearizable.
| Model | Ordering Guarantee | Performance | Example Systems |
|---|---|---|---|
| Eventual Consistency | No ordering, values eventually converge | Highest | DynamoDB (default), Cassandra (ONE) |
| Causal Consistency | Causally related operations ordered | High | MongoDB (causal reads) |
| Session Consistency | Per-session ordering (read-your-writes) | High | Many cloud databases |
| Snapshot Isolation | Consistent point-in-time reads | Moderate | PostgreSQL, CockroachDB |
| Serializability | Transactions appear to execute one at a time | Lower | PostgreSQL SERIALIZABLE |
| External Consistency | Respects real-time ordering | Lowest | Spanner, TiDB (with TSO) |
External consistency isn't free. As we'll see in the next section, Spanner must wait for TrueTime uncertainty to resolve before committing writes. This adds latency equal to ε (typically 4-7ms). For many applications, this overhead is acceptable for the correctness guarantees provided.
TrueTime enables external consistency through a clever protocol called commit-wait. The key insight: if you wait long enough after choosing a commit timestamp, you can be certain that the timestamp has definitively passed everywhere.
How Commit-Wait Works:
Transaction Execution: The transaction executes, acquiring locks and staging writes.
Timestamp Assignment: The leader assigns a commit timestamp s ≥ TT.now().latest (the upper bound of current time).
Commit Wait: The leader waits until TT.after(s) returns true—meaning time s has definitely passed (see the code sketch after these steps).
Paxos Commit: The leader commits the transaction via Paxos.
Response to Client: The transaction is durably committed with timestamp s.
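A minimal sketch of those steps, assuming a TrueTime-style object like the one sketched earlier (tt, exposing now()/after()) and hypothetical stand-ins for locking and Paxos replication. This shows the shape of the protocol, not Spanner's actual code.

```python
import time

def commit_write_transaction(tt, txn, paxos_group):
    """Commit-wait protocol sketch. `txn` and `paxos_group` are hypothetical
    stand-ins for transaction state and the replication machinery."""
    txn.acquire_locks()                 # 1. execute: acquire locks, stage writes
    s = tt.now().latest                 # 2. choose commit timestamp s >= TT.now().latest

    while not tt.after(s):              # 3. commit-wait: block until s has
        time.sleep(0.001)               #    definitely passed everywhere

    paxos_group.replicate(txn, s)       # 4. commit via Paxos (majority replication)
    return s                            # 5. report to the client: committed at s
```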
Why This Works:
Let's trace through carefully: when commit-wait finishes, the true time is guaranteed to have passed s. Any timestamp assigned afterwards on any machine is at least that machine's TT.now().latest, which is at least the true time—and therefore greater than s. Only after this point does the client learn the transaction committed.
The Ordering Guarantee:
Now consider two transactions, T1 and T2, where T1 commits before T2 starts:
Commit-Wait Protocol, Step by Step:
The original timeline diagram walks through an example. T1 executes, then assigns commit timestamp s1 = 100 when TT.now() = [96, 100]. It then waits: at TT.now() = [97, 101] and [99, 103] the earliest bound is still ≤ 100, so it keeps waiting; at [101, 105] the earliest bound exceeds 100 and commit-wait ends. T1 commits via Paxos and replies "committed at timestamp 100." When T2 starts afterwards, TT.now() is already [102, 106], and its commit timestamp s2 = 106 > s1 = 100—ordering preserved.
Key invariant: when T2 starts, T1's commit-wait has already finished, so TT.now().earliest > s1. T2's timestamp s2 ≥ TT.now().latest > TT.now().earliest > s1, so s2 > s1, respecting real-time order.
Commit-wait duration: the wait lasts from timestamp assignment (when s1 ≈ TT.now().latest) until TT.now().earliest passes s1—roughly one uncertainty interval ε, typically 4-7ms. This is the "tax" for external consistency on write transactions.
Optimizations to Reduce Commit-Wait Impact:
Spanner uses several techniques to minimize commit-wait overhead:
1. Prepare-Wait (for Distributed Transactions):
For transactions spanning multiple Paxos groups, Spanner uses two-phase commit. The commit-wait happens during the prepare phase, overlapped with the two-phase commit protocol. By the time all participants have prepared, the commit-wait has often already completed.
2. Read-Only Transaction Optimization:
Read-only transactions don't need commit-wait because they don't modify data. They receive timestamps via TT.now().latest but can return immediately after reading—no waiting required.
3. Batching:
Multiple transactions can share a single commit-wait. If several transactions are assigned the same timestamp, they can all commit after a single wait period.
4. Reduced ε Through Infrastructure:
The smaller ε is, the shorter the commit-wait. Google continuously invests in reducing TrueTime uncertainty through better hardware, more frequent polling, and optimized networking.
Commit-wait adds latency proportional to TrueTime uncertainty (typically 4-7ms). This is the price of external consistency. For applications that don't need external consistency, Spanner offers read staleness options that can read data at slightly older timestamps, avoiding fresh commits.
TrueTime enables powerful read semantics beyond just transaction ordering. Spanner supports multiple read modes, each with different consistency-latency tradeoffs.
Strong Reads:
A strong read sees all transactions committed before the read began. Implementation: the read is assigned a timestamp of TT.now().latest, and the serving replica waits until its safe time reaches that timestamp before returning data.
Strong reads guarantee you see the latest data, but they incur a wait comparable to the commit-wait latency.
Bounded Staleness Reads:
If you can tolerate slightly stale data (say, up to 10 seconds old), you can read at a timestamp in the past: Spanner picks the newest timestamp within your staleness bound that the chosen replica can serve without blocking.
Bounded staleness reads are faster because they skip commit-wait, and can be served by any replica (not just leaders).
Exact Staleness Reads:
You can specify an exact timestamp to read at:
This is useful for reproducible queries—reading at the same timestamp always returns the same data.
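The snippets below sketch these read modes with the google-cloud-spanner Python client. The instance, database, and table names are placeholders, and the snapshot keyword arguments shown (exact_staleness, read_timestamp) are my recollection of that client's API—treat the whole block as illustrative rather than authoritative.

```python
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # placeholder names

# Strong read: sees everything committed before the read began.
with database.snapshot() as snap:
    rows = list(snap.execute_sql(
        "SELECT balance FROM accounts WHERE account_id = 'A123'"))

# Stale read: served by any replica whose safe time has passed the chosen timestamp.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=10)) as snap:
    rows = list(snap.execute_sql(
        "SELECT balance FROM accounts WHERE account_id = 'A123'"))

# Exact-timestamp read: reproducible point-in-time query.
ts = datetime.datetime(2024, 1, 8, 16, 0, 0, tzinfo=datetime.timezone.utc)
with database.snapshot(read_timestamp=ts) as snap:
    rows = list(snap.execute_sql(
        "SELECT balance FROM accounts WHERE account_id = 'A123'"))
```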
Read-Only Transactions:
Read-only transactions are particularly powerful. They take no locks, never block writes, execute every read at a single system-chosen timestamp, and can be served by any sufficiently up-to-date replica—so a multi-table read is internally consistent without any coordination with writers.
| Read Mode | Staleness | Latency | Replica Choice | Use Case |
|---|---|---|---|---|
| Strong Read | None (latest) | Higher (commit-wait) | Leader or recent replica | Critical reads, financial data |
| Bounded Staleness | Up to X seconds | Low | Any replica | Read-heavy workloads, dashboards |
| Exact Staleness | At timestamp T | Low if T is past | Any replica | Analytics, reproducible queries |
| Read-Only Txn (Strong) | None | Moderate | Any replica | Multi-table consistent reads |
| Read-Only Txn (Stale) | Configurable | Low | Any replica | Cross-table analytics |
Replica Selection and Stale Reads:
Spanner's stale read capability has profound implications for global performance. Consider a user in Tokyo querying data whose leader is in Virginia: a strong read must involve the Virginia leader, paying an intercontinental round trip of well over 100ms, while a stale read can be answered by a replica in or near Tokyo in a few milliseconds.
For read-heavy workloads where slight staleness is acceptable, this is a massive performance improvement.
Safe Time and Replica Catch-Up:
Every Spanner replica tracks its safe time—the latest timestamp at which it has received all updates. A replica can serve reads at any timestamp less than or equal to its safe time.
If a read request arrives for a timestamp beyond the replica's safe time, the replica either waits until its safe time catches up to the requested timestamp, or the request is routed to a replica that is already caught up (typically the leader).
Either way, a read at timestamp t never misses a write committed at or before t, and never returns data more stale than requested.
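A tiny sketch of the safe-time gate a replica might apply; the class and method names are hypothetical, not Spanner's internals.

```python
class ReplicaSketch:
    """Illustrative replica-side safe-time check."""

    def __init__(self):
        self.safe_time = 0.0  # latest timestamp through which all updates are applied

    def apply_replicated_write(self, commit_timestamp: float):
        # Updates arrive in timestamp order via Paxos; safe time advances with them.
        self.safe_time = max(self.safe_time, commit_timestamp)

    def can_serve(self, read_timestamp: float) -> bool:
        # Safe to answer a read at t only if every write with timestamp <= t has arrived.
        return read_timestamp <= self.safe_time

replica = ReplicaSketch()
replica.apply_replicated_write(100.0)
print(replica.can_serve(99.5))   # True: all data up to 99.5 is present
print(replica.can_serve(100.5))  # False: wait for safe time to advance (or redirect)
```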
Schema changes (DDL) are also timestamped. When you read at a timestamp, you see the schema as it existed at that timestamp. This enables consistent reads even during schema migrations—old queries continue working at old timestamps while new queries use new schemas.
TrueTime timestamps aren't just internal bookkeeping—they create a globally meaningful ordering that applications can reason about.
Transactions Return Timestamps:
When a Spanner transaction commits, the response includes its commit timestamp. Applications can:
Compare Timestamps: If tx1.timestamp < tx2.timestamp, tx1 happened before tx2 in the global ordering.
Read at Timestamps: Request to read data as-of a specific timestamp, useful for auditing, debugging, and reproducible analytics.
Coordinate Across Systems: Pass timestamps to other systems that also understand TrueTime, enabling cross-system consistency.
The Timestamp as a Consistency Token:
Consider a microservices architecture where multiple services use Spanner: one service commits a write and receives its commit timestamp, passes that timestamp along with its message or response, and the downstream service reads at that timestamp—guaranteed to observe the write.
This pattern enables causal consistency across services without complex distributed transaction protocols.
Using Timestamps for Cross-Service Consistency:
The original sequence diagram here shows an e-commerce flow with separate Order and Inventory services, each backed by Spanner. The Order Service inserts an order and receives commit timestamp t₁; it publishes a Pub/Sub message containing the order ID and t₁; the Inventory Service reads the order at timestamp t₁—guaranteed to see the row that was just created—and then decrements inventory, committing at t₂ > t₁, so ordering is preserved. Because a read at t₁ sees every write committed at or before t₁, this works even when the services run on different continents, with no distributed transactions, no two-phase commit across services, and no inconsistency window—as long as both sides use TrueTime-backed timestamps.
Time-Travel Queries:
Spanner retains old versions of data for a configurable period (by default, 1 hour). Combined with precise timestamps, this enables powerful "time-travel" queries:
-- See account balance at market close yesterday
SELECT balance
FROM accounts
WHERE account_id = 'A123'
AS OF TIMESTAMP '2024-01-08T16:00:00Z';
-- Compare balance changes over time
SELECT
(SELECT balance FROM accounts AS OF TIMESTAMP T1 WHERE account_id='A123') as before,
(SELECT balance FROM accounts AS OF TIMESTAMP T2 WHERE account_id='A123') as after;
Implications for Debugging:
When something goes wrong, you can reconstruct the exact state of the database at any point in time: read the data exactly as it stood just before, during, and just after the incident, and re-run the queries that misbehaved at those timestamps.
This temporal capability, powered by TrueTime's globally meaningful timestamps, transforms debugging from guesswork into deterministic investigation.
TrueTime transforms timestamps from internal database metadata into a first-class API for reasoning about distributed state. Applications using Spanner can think about time in ways that were previously impossible in distributed systems—coordinating across services, travelling through history, and trusting that ordering reflects reality.
We've explored how TrueTime transforms the fundamental challenge of distributed time into a tool for unprecedented consistency guarantees. Let's consolidate the key insights: clocks can never be perfectly synchronized, but GPS receivers and atomic clocks let TrueTime bound the uncertainty and expose it as an interval; commit-wait converts that bound into external consistency, the strongest ordering guarantee a distributed database can offer; and globally meaningful timestamps unlock stale reads from nearby replicas, cross-service coordination, and time-travel queries.
What's Next:
TrueTime enables correct transaction ordering, but how do transactions actually work when they span multiple continents and Paxos groups? In the next page, we'll explore Distributed Transactions in Spanner—how two-phase commit, Paxos, and TrueTime combine to provide globally atomic operations.
You now understand TrueTime's architecture, the commit-wait protocol, and how external consistency transforms what's possible in distributed databases. Next, we'll see how these foundations enable globally distributed transactions.