The most valuable property a database can offer isn't speed, scale, or feature richness—it's correctness. A faster database that occasionally loses data or returns wrong results is worse than a slower one that's always right. Yet achieving correctness in distributed systems is extraordinarily difficult, and most databases make compromises that push complexity onto application developers.
FoundationDB takes an uncompromising stance: strict serializability for all transactions, with no exceptions. This isn't marketing language—it's a technical claim backed by the most rigorous testing methodology in the database industry. Every read, every write, every transaction—no matter how many keys it touches, no matter how many concurrent transactions are running—behaves as if it executed in complete isolation, in some total order consistent with real time.
This guarantee simplifies application development dramatically. You don't need to reason about race conditions, lost updates, phantom reads, or write skew. If your logic is correct for a single-threaded, single-user scenario, it's correct for millions of concurrent users. The database handles concurrency; you handle business logic.
In this page, we'll explore what strict serializability means, how FoundationDB achieves it, and why this guarantee—rare among distributed databases—is worth its cost.
By the end of this page, you will understand: (1) The hierarchy of isolation levels and where strict serializability sits; (2) How FoundationDB's optimistic concurrency control works; (3) The read-write conflict model and how to design for minimal conflicts; (4) FoundationDB's revolutionary simulation testing methodology; and (5) The practical implications of strict serializability for application design.
Before appreciating FoundationDB's guarantees, we need to understand the landscape of transaction isolation—what guarantees are possible, what anomalies each level prevents, and where most databases fall in this hierarchy.
The Problem: Concurrent Transactions
When multiple transactions execute concurrently, they can interfere with each other in subtle ways. Consider two transactions running simultaneously:
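The interleaving can be sketched with a plain dict standing in for the database (a toy simulation, not FoundationDB code; T1 withdraws $50 and T2 deposits $100 to match the totals below):

```python
# Minimal sketch of a lost update. A plain dict stands in for the
# database; both "transactions" read before either writes.

balance = {"account": 1000}

# T1: withdraw $50 -- reads the balance first
t1_read = balance["account"]        # sees 1000

# T2: deposit $100 -- reads before T1 has written
t2_read = balance["account"]        # also sees 1000

# T1 writes its result
balance["account"] = t1_read - 50   # balance is now 950

# T2 writes its result, clobbering T1's update
balance["account"] = t2_read + 100  # balance is now 1100, not 1050

print(balance["account"])  # 1100 -- T1's withdrawal was lost
```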
If the initial balance is $1000, and both transactions read $1000 before either writes:
The final balance is $950 or $1100 (depending on who writes last), but it should be $1050. This is a lost update—one of many possible concurrency anomalies.
The Isolation Level Hierarchy:
Database isolation levels define which anomalies are prevented. Each level trades correctness for performance:
| Isolation Level | Dirty Reads | Lost Updates | Non-Repeatable Reads | Phantom Reads | Write Skew |
|---|---|---|---|---|---|
| Read Uncommitted | ❌ Allowed | ❌ Allowed | ❌ Allowed | ❌ Allowed | ❌ Allowed |
| Read Committed | ✅ Prevented | ❌ Allowed | ❌ Allowed | ❌ Allowed | ❌ Allowed |
| Repeatable Read | ✅ Prevented | ✅ Prevented | ✅ Prevented | ❌ Allowed | ❌ Allowed |
| Snapshot Isolation | ✅ Prevented | ✅ Prevented | ✅ Prevented | ✅ Prevented | ❌ Allowed |
| Serializable | ✅ Prevented | ✅ Prevented | ✅ Prevented | ✅ Prevented | ✅ Prevented |
| Strict Serializable | ✅ Prevented | ✅ Prevented | ✅ Prevented | ✅ Prevented | ✅ Prevented + Real-time |
Defining the Anomalies:

- Dirty read: a transaction reads data written by another transaction that has not yet committed (and may still abort).
- Lost update: two transactions read the same value, both modify it, and one write overwrites the other, as in the balance example above.
- Non-repeatable read: a transaction reads the same key twice and sees different values because another transaction committed in between.
- Phantom read: a repeated range query returns different rows because another transaction inserted or deleted rows in the range.
- Write skew: two transactions read overlapping data, make decisions based on what they read, and write to disjoint keys, jointly violating an invariant that each preserved alone (see the on-call example below).
Serializability vs. Strict Serializability:
Regular serializability guarantees that the outcome of concurrent transactions is equivalent to some serial execution. But that serial order might not respect real time—a transaction that started and finished before another might appear to execute after it in the serialization order.
Strict serializability (also called linearizability for single objects, or external consistency) adds the constraint that if transaction T1 commits before transaction T2 starts, then T1 must appear before T2 in the serialization order. The database's behavior respects real-time ordering.
This distinction matters for user expectations. If I transfer money and you refresh the page, you should see the transfer. Strict serializability guarantees this; plain serializability does not.
```python
# ============================================
# WRITE SKEW ANOMALY EXAMPLE
# ============================================

# Scenario: On-call scheduling
# Constraint: At least one doctor must be on call at all times
# Current state: Alice and Bob are both on call

# Transaction T1 (Alice wants to go off-call):
def alice_goes_off_call(tr):
    alice_on_call = tr.get("doctors/alice/on_call")  # Returns True
    bob_on_call = tr.get("doctors/bob/on_call")      # Returns True

    # Check constraint: Is someone else on call?
    if bob_on_call == True:
        tr.set("doctors/alice/on_call", False)  # OK, Bob is still on call
        return "Alice went off-call"
    else:
        return "Cannot go off-call - no one else available"

# Transaction T2 (Bob wants to go off-call) - running concurrently:
def bob_goes_off_call(tr):
    alice_on_call = tr.get("doctors/alice/on_call")  # Returns True
    bob_on_call = tr.get("doctors/bob/on_call")      # Returns True

    # Check constraint: Is someone else on call?
    if alice_on_call == True:
        tr.set("doctors/bob/on_call", False)  # OK, Alice is still on call
        return "Bob went off-call"
    else:
        return "Cannot go off-call - no one else available"

# ============================================
# THE ANOMALY (Under non-serializable isolation)
# ============================================

# Both transactions read the SAME snapshot:
#   Alice sees: Alice=True, Bob=True
#   Bob sees:   Alice=True, Bob=True

# Both transactions pass their constraint check
# Both transactions write:
#   T1 writes: Alice = False
#   T2 writes: Bob = False

# Final state: Alice=False, Bob=False
# CONSTRAINT VIOLATED! No one is on-call.
```
```python
# ============================================
# FOUNDATIONDB PREVENTS THIS
# ============================================

# Under strict serializability, one transaction will see
# the other's write and either:
#   a) Abort due to conflict (optimistic CC) - FoundationDB's approach
#   b) Block until the other commits (pessimistic CC)

# In FoundationDB: Both transactions read doctors/alice and doctors/bob
# Write set: T1 writes alice, T2 writes bob
# Read sets overlap with each other's write sets -> CONFLICT
# One transaction is aborted and must retry
# On retry, it sees the committed state and correctly refuses
```

Even databases that claim 'serializable' isolation often implement only Snapshot Isolation, which allows write skew. PostgreSQL's 'serializable' mode since version 9.1 is truly serializable (using SSI), but MySQL's 'serializable' has caveats. Always verify what your database actually provides—the marketing language often differs from the technical reality.
FoundationDB uses optimistic concurrency control (OCC) to achieve serializable transactions. Unlike pessimistic approaches (locks), optimistic control allows transactions to proceed without coordination, validating at commit time that no conflicts occurred.
How OCC Works in FoundationDB:
Transaction Start: Client begins a transaction, receiving a read version (a timestamp in the transaction log)
Read Phase: Client reads keys from the database. FoundationDB tracks all keys read by the transaction (the read set) and their versions.
Write Phase: Client accumulates writes locally (the write set). Writes are not sent to the database until commit.
Commit: Client sends the entire write set to a special node called the proxy. The proxy forwards to the resolver for conflict checking.
Conflict Check: The resolver checks if any key in the read set was modified by another transaction that committed after this transaction's read version but before now. If so, the transaction conflicts and must abort.
Commit or Abort: If no conflicts, the transaction is assigned a commit version and written to the transaction log. If conflicts, the client receives an error and must retry from scratch.
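The resolver's conflict check described above can be modeled in a few lines. This is a toy model with hypothetical names, not FoundationDB's implementation:

```python
# Toy resolver: tracks the last commit version of each key and decides
# whether a transaction (read_version, read_set, write_set) may commit.

class ToyResolver:
    def __init__(self):
        self.last_write_version = {}  # key -> version of last committed write
        self.version = 1000           # monotonically increasing commit version

    def try_commit(self, read_version, read_set, write_set):
        # Conflict check: was any key we READ written after our read version?
        for key in read_set:
            if self.last_write_version.get(key, 0) > read_version:
                return None  # CONFLICT: client must retry
        # No conflicts: assign a commit version and record the writes
        self.version += 1
        for key in write_set:
            self.last_write_version[key] = self.version
        return self.version

resolver = ToyResolver()

# T2 reads A at version 1000, writes A, and commits first
t2_commit = resolver.try_commit(1000, {"A"}, {"A"})       # commits at 1001
# T1 also read A at version 1000; A was written at 1001 > 1000 -> abort
t1_commit = resolver.try_commit(1000, {"A", "B"}, {"A"})  # None (conflict)
```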
```
FOUNDATIONDB TRANSACTION LIFECYCLE
═══════════════════════════════════════════════════════════════════

CLIENT
  1. Begin Transaction
     └── Request read version from Proxy
     └── Proxy returns read_version = 1000

  2. Perform Reads (read_set accumulated)
     └── read(key_A) = value_A  [version 950]
     └── read(key_B) = value_B  [version 980]
     └── read_set = {key_A, key_B}

  3. Perform Writes (write_set accumulated locally)
     └── write(key_A, new_value_A)
     └── write(key_C, new_value_C)
     └── write_set = {key_A: new_value_A, key_C: new_value_C}

  4. Commit
     └── Send to Proxy: {read_version, read_set, write_set}
              │
              ▼
PROXY
  Forward to Resolver for conflict detection
              │
              ▼
RESOLVER
  Conflict Check:
    "Were any keys in read_set modified between
     read_version (1000) and now?"

  Scenario A: NO CONFLICTS
    - key_A last modified at version 950 < 1000  ✓
    - key_B last modified at version 980 < 1000  ✓
    Result: Assign commit_version = 1005, write to log

  Scenario B: CONFLICT DETECTED
    - Another transaction modified key_A at version 1002
    - 1002 > read_version (1000)
    - Our read of key_A is stale!
    Result: Transaction ABORTED, client must retry

CONFLICT TYPES:
━━━━━━━━━━━━━━━
1. READ-WRITE CONFLICT: We read a key that someone else wrote
   → Our read might have been different if we'd read later

2. WRITE-WRITE (blind writes): Two transactions that only WRITE key K
   do not abort; both commit and the later version wins. Aborts are
   triggered by the read set, so a read-modify-write of K conflicts
   if K was written after our read version.

NOTE: Read-read is NOT a conflict. Multiple transactions can read
the same key at the same version without issues.
```

Why Optimistic over Pessimistic?
Optimistic concurrency control has several advantages for FoundationDB's design goals:
No Distributed Locking: Pessimistic locking in a distributed system requires coordinating lock acquisition across nodes—expensive and deadlock-prone.
Low Latency for Non-Conflicting Transactions: Most transactions in well-designed applications don't conflict. OCC allows these to proceed without any coordination overhead.
Simple Client Model: The client library is stateless. Reads and writes are just data; there's no lock state to manage or clean up on failure.
Natural Retry Semantics: When conflicts occur, the client simply retries the entire transaction. The @fdb.transactional decorator handles this automatically.
The Cost of Optimism:
The downside is wasted work on conflicts. If transactions frequently conflict, each abort means throwing away computation and retrying. For workloads with high contention (many transactions accessing the same keys), this can be problematic.
FoundationDB mitigates this through:
- Short transaction lifetimes: the 5-second limit bounds how long conflicts can accumulate
- Atomic operations: operations like add() don't require reading, reducing conflict potential

The key to FoundationDB performance is minimizing conflicts. Design transactions to touch disjoint key ranges when possible. Use atomic operations for counters and accumulators. Keep transactions short and focused. A transaction that reads or writes every key in the database will conflict with everything—avoid global operations.
Conflict management is central to effective FoundationDB use. Understanding exactly when conflicts occur—and when they don't—enables designing schemas and access patterns that maximize throughput.
Precise Conflict Rules:
Read-Write Conflict: Transaction T1 reads key K at version V. Transaction T2 writes K and commits with version V' > V. T1 tries to commit → CONFLICT.
Write-Write (blind writes): If T1 and T2 both write key K without reading it, neither aborts; both commit, and the later commit version wins. In FoundationDB only the read set triggers aborts, so write-write contention surfaces through the usual read-modify-write pattern: a transaction that read K before writing it conflicts if another write to K commits first.
No Read-Read Conflict: Multiple transactions can read the same key at the same version. This never causes conflicts.
No Write-Only Conflict with Unrelated Reads: T1 reads key A. T2 writes key B. No conflict (their key sets are disjoint).
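These rules can be captured in a one-line predicate for a transaction T1 that tries to commit after T2 has already committed. This is a toy model reflecting the fact that in FoundationDB only the read set can trigger an abort:

```python
def t1_conflicts(t1_reads, t1_writes, t2_writes):
    # T1 aborts iff something it READ was written by T2.
    # t1_writes is deliberately unused: writes alone never cause
    # the writer to abort in FoundationDB.
    return bool(set(t1_reads) & set(t2_writes))

# Read-write conflict: T1 read K, T2 wrote K -> T1 aborts
assert t1_conflicts({"K"}, set(), {"K"}) is True
# Read-read: T2 wrote nothing -> never a conflict
assert t1_conflicts({"K"}, set(), set()) is False
# Disjoint key sets: no conflict
assert t1_conflicts({"A"}, {"A"}, {"B"}) is False
# Blind writes to the same key: T1 read nothing, so it does not abort
assert t1_conflicts(set(), {"K"}, {"K"}) is False
```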
Range Conflicts:
FoundationDB also supports conflict ranges for range reads:
```python
import fdb
from fdb.tuple import pack  # `pack` is used unqualified below

# ============================================
# RANGE READS AND CONFLICTS
# ============================================

@fdb.transactional
def count_user_orders(tr, user_id):
    """
    Counts orders for a user using range read.
    Any INSERT into this range will conflict.
    """
    prefix = pack(("users", user_id, "orders"))
    end = prefix + b'\xff'

    # This implicitly adds a READ conflict range
    count = 0
    for key, value in tr.get_range(prefix, end):
        count += 1
    return count

# If another transaction INSERTS a new order for this user
# (i.e., writes a key in the range [prefix, end)), this
# transaction will CONFLICT on commit.

# This is correct behavior! Our count might be wrong if
# new orders were added while we were counting.

# ============================================
# EXPLICIT CONFLICT RANGES
# ============================================

@fdb.transactional
def transfer_funds(tr, from_account, to_account, amount):
    """
    Transfer money between accounts with explicit conflict control.
    """
    from_key = pack(("accounts", from_account, "balance"))
    to_key = pack(("accounts", to_account, "balance"))

    # Read current balances
    from_balance = int(tr[from_key] or 0)
    to_balance = int(tr[to_key] or 0)

    if from_balance < amount:
        raise ValueError("Insufficient funds")

    # Write new balances
    tr[from_key] = str(from_balance - amount).encode()
    tr[to_key] = str(to_balance + amount).encode()

# This naturally conflicts if:
# - Another transaction modifies from_account balance
# - Another transaction modifies to_account balance
# Exactly the right conflict behavior!

# ============================================
# ADDING CONFLICT RANGES MANUALLY
# ============================================

@fdb.transactional
def complex_business_logic(tr, entity_id):
    """
    Sometimes you need to add conflict ranges that aren't
    automatically tracked by reads.
    """
    # Add a read conflict range without actually reading
    # This ensures we conflict with any write to this range
    tr.add_read_conflict_range(
        pack(("locks", entity_id)),
        pack(("locks", entity_id)) + b'\xff'
    )

    # Add a write conflict range without actually writing
    # Other transactions reading this range will conflict with us
    tr.add_write_conflict_range(
        pack(("notified", entity_id)),
        pack(("notified", entity_id)) + b'\xff'
    )

# ============================================
# SNAPSHOT READS (NO CONFLICT)
# ============================================

@fdb.transactional
def dashboard_stats(tr):
    """
    For read-only analytics, snapshot reads avoid conflicts.
    We accept potentially stale data for non-conflicting reads.
    """
    # Normal read - adds to conflict set
    critical_value = tr[b'critical/key']

    # Snapshot read - does NOT add to conflict set
    # If this key changes, we won't conflict
    total_users = tr.snapshot[b'stats/total_users']
    total_orders = tr.snapshot[b'stats/total_orders']

    # Good for dashboards: We see a consistent snapshot,
    # but our transaction won't abort if stats change.
    return {
        'users': int(total_users or 0),
        'orders': int(total_orders or 0),
        'critical': critical_value
    }
```

Strategies for Reducing Conflicts:
1. Shard Hot Keys
A single counter that every transaction increments will conflict constantly. Shard it:
```python
import random
import struct

# Instead of one key that everyone fights over:
tr.add(b'page_views', struct.pack('<q', 1))  # High contention!

# Spread across shards. Note: atomic add takes a little-endian
# integer encoding (struct.pack), not a tuple-packed value.
shard = random.randint(0, 99)
tr.add(pack(('page_views', shard)), struct.pack('<q', 1))  # Low contention
```
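Reading the sharded counter means summing all shards. A sketch with a dict standing in for the key range; in real code this would be a range read over the `('page_views', *)` prefix:

```python
import struct

# Dict stand-in for the keys ('page_views', 0) .. ('page_views', 99).
# Values written by atomic add are little-endian 64-bit integers.
shards = {
    ("page_views", 0): struct.pack('<q', 17),
    ("page_views", 7): struct.pack('<q', 5),
    ("page_views", 42): struct.pack('<q', 3),
}

# Sum every shard to get the total count
total = sum(struct.unpack('<q', v)[0] for v in shards.values())
print(total)  # 25
```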
2. Use Atomic Operations
Atomic operations (add, min, max, and, or) don't require reading the current value. They add to the write set but not the read set, meaning:
- Two concurrent add operations on the same key don't conflict with each other
- A transaction that only performs add on a key reads nothing, so it never aborts because of that key (a concurrent transaction that reads the key, however, will conflict with the add)

```python
# DON'T (creates read-write conflict):
count = int(tr[b'counter'] or 0)
tr[b'counter'] = str(count + 1).encode()

# DO (atomic, no read conflict):
tr.add(b'counter', struct.pack('<q', 1))  # little-endian int encoding
```
3. Minimize Transaction Duration
Longer transactions have more time for conflicts to accumulate. FoundationDB enforces a 5-second maximum, but aim for milliseconds.
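One way to keep transactions short is to do slow work (parsing, rendering, computation) before the transaction starts, so the transactional part only moves bytes. A sketch with hypothetical `build_report`/`store_report` names and a dict standing in for the database:

```python
import json

def build_report(raw_rows):
    # Expensive computation: runs BEFORE any transaction starts
    return json.dumps({"total": sum(raw_rows)}).encode()

def store_report(tr, key, report_bytes):
    # Transactional part: a single write, milliseconds of work
    tr[key] = report_bytes

# Usage sketch (dict stands in for the database):
db = {}
report = build_report(range(1_000_000))     # slow part, no transaction held
store_report(db, b"reports/daily", report)  # fast transactional part
```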
4. Use Versionstamps for Unique Keys
Inserting with unique keys eliminates write-write conflicts:
```python
# Events with unique, time-ordered keys never conflict on insert.
# The incomplete versionstamp is filled in by the database at commit,
# so the key must be packed with pack_with_versionstamp:
tr.set_versionstamped_key(
    fdb.tuple.pack_with_versionstamp(('events', fdb.tuple.Versionstamp())),
    event_data
)
```
Conflicts aren't bugs; they're the mechanism that ensures correctness. A well-designed system may have some conflicts under high load. The key is ensuring conflicts are handled gracefully (automatic retry) and their frequency doesn't dominate transaction throughput. FoundationDB's client libraries handle retries transparently.
How do you know a database's correctness claims are true? This is a profound challenge. Distributed systems have astronomical numbers of possible states—testing every scenario is impossible. Bugs may lurk in rare race conditions that occur once in a million operations, in failures that happen only during recovery, or in interactions between seemingly unrelated components.
FoundationDB's answer is deterministic simulation testing—a revolutionary testing methodology that has found bugs that conventional testing would never catch. This approach is so effective that FoundationDB's developers have stated they "would rather run simulations than have more users" for finding bugs.
The Core Insight:
Traditional distributed systems testing is non-deterministic. You run the system, inject failures, and hope to trigger bugs. If a test fails, you often can't reproduce it—the exact timing of events is different each run.
FoundationDB takes a different approach:
Single-Threaded Simulation: The entire distributed system—multiple nodes, network, disk I/O—runs in a single thread, with a custom scheduler controlling event ordering.
Deterministic Pseudo-Random: All randomness (timings, failures, scheduling decisions) comes from a seeded PRNG. Given the same seed, the exact same execution replays.
Fault Injection: The simulator can inject any failure—network partitions, disk errors, machine crashes, Byzantine behavior—at any point in execution.
Invariant Checking: After every state change, the simulator verifies that all system invariants hold (no lost writes, no inconsistent reads, etc.).
```
FOUNDATIONDB SIMULATION TESTING ARCHITECTURE
═══════════════════════════════════════════════════════════════════

Traditional Testing:
────────────────────
  Real Time
  Node A ──────────────────────────────▶
  Node B ──────────────────────────────▶
  Node C ──────────────────────────────▶
  Network ─────────────────────────────▶

  Problem: Non-deterministic timing, can't reproduce failures

═══════════════════════════════════════════════════════════════════

FoundationDB Simulation:
────────────────────────
  DETERMINISTIC DISCRETE-EVENT SIMULATOR

  Simulated Time: t=0.0000000

  Event Queue (priority by simulated time):
    t=0.001: Node A processes client request
    t=0.002: Network delivers message A→B
    t=0.003: Node B disk write completes
    t=0.015: Node C timeout fires
    t=0.023: FAULT: Network partition A|BC
    t=0.045: FAULT: Node A crashes
    t=0.100: FAULT: Node A restarts
    ...

  Single Thread Execution:
    1. Pop event from queue
    2. Execute event handler
    3. Handler may enqueue new events
    4. CHECK ALL INVARIANTS
    5. Repeat

  PRNG Seed: 8675309
  (Same seed = exact same execution = reproducible bugs)

FAULT INJECTION CAPABILITIES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Network:
    • Partition (total or partial)
    • Message delay (configurable distribution)
    • Message loss (probabilistic)
    • Message corruption
    • Network bandwidth limits

  Storage:
    • Read/write errors
    • Corruption (bit flips, partial writes)
    • Slow disk (latency injection)
    • Disk full

  Process:
    • Crash at any point
    • Restart with various delays
    • Memory corruption
    • Clock skew

  Cascading failures, split-brain scenarios, recovery during failure...
  ALL systematically tested!

INVARIANTS VERIFIED CONTINUOUSLY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  • No transaction sees uncommitted data
  • Committed transactions are durable
  • All replicas converge to same state
  • Read versions form a proper partial order
  • Range queries return consistent snapshots
  • Atomic operations compose correctly
  • Recovery restores all committed state
```

What Makes This Work:
Flow: FoundationDB is written in Flow, a custom C++ extension that adds actor-model concurrency with explicit suspension points. All I/O is asynchronous, and the simulator intercepts all async operations.
No Hidden State: The simulator controls all sources of non-determinism: time, random numbers, network, disk. There's nothing the code can do that the simulator doesn't observe.
Full System Simulation: Unlike unit tests that mock components, simulation runs the actual production code—just with simulated I/O. If it works in simulation, it works in production.
Massive Parallelism: Simulations are fast (no real I/O waits) and independent. FoundationDB runs thousands of simulation hours daily across their test cluster.
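The scheme can be sketched in miniature: a seeded PRNG drives event timing and fault injection, so the same seed always replays the exact same event order. This is an illustrative toy, not FoundationDB's Flow-based simulator:

```python
import heapq
import random

def run_simulation(seed):
    rng = random.Random(seed)      # ALL randomness comes from this PRNG
    events = []                    # (time, sequence, description) heap
    seq = 0
    for node in ("A", "B", "C"):
        heapq.heappush(events, (rng.uniform(0, 1), seq, f"{node}: write"))
        seq += 1
    # Inject a fault at a random simulated time
    heapq.heappush(events, (rng.uniform(0, 1), seq, "FAULT: partition"))

    trace = []
    while events:
        t, _, what = heapq.heappop(events)
        trace.append((round(t, 6), what))
        # A real simulator would execute the handler here, possibly
        # enqueue new events, and then CHECK ALL INVARIANTS
    return trace

# Same seed -> identical execution, so a failing trace is a repro case
assert run_simulation(8675309) == run_simulation(8675309)
```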
A Concrete Example: Finding Rare Bugs
Consider a bug that only manifests when a particular sequence of rare events lines up, for example a network partition during recovery followed by a crash at one precise moment.

In real testing, this sequence might occur once in billions of runs—effectively never. In simulation, the fault injector triggers such sequences deliberately, simulated time is compressed because there are no real I/O waits, and any seed that exposes the bug replays the identical execution for debugging.
FoundationDB's simulation testing isn't a nice-to-have; it's fundamental to how they develop. New features must pass simulation tests. Bug fixes require a simulation test that would have caught the bug. This discipline is why FoundationDB can make strong correctness claims—they've tested more failure scenarios than any real-world deployment could encounter.
Strict serializability changes how you write applications. Many patterns and defensive techniques common in eventual-consistency systems become unnecessary—while some new considerations emerge.
What You No Longer Need:
1. Explicit Locking
With strict serializability, transactions are the locking mechanism. You don't need:
```python
# NOT NEEDED in FoundationDB
acquire_lock("resource_123")
try:
    do_work()
finally:
    release_lock("resource_123")

# Instead, just do your work in a transaction:
@fdb.transactional
def do_work(tr):
    # Automatically serialized with other transactions
    data = tr[b'resource_123']
    # ... process ...
    tr[b'resource_123'] = new_data
```
2. Read-Your-Writes Workarounds
In eventually consistent systems, you often need session tracking or sticky routing to ensure users see their own writes. Not needed in FoundationDB—every read after a committed write sees that write.
3. Conflict Retry Logic
The @fdb.transactional decorator handles retries automatically. You don't need retry loops, exponential backoff, or idempotency keys (mostly—see below).
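The decorator's behavior amounts to a retry loop: run the function against a fresh transaction, commit, and on conflict re-run the whole thing. A sketch of the idea with fake stand-ins (`ConflictError`, `FakeDB`, and `FakeTransaction` are illustrative, not the real fdb classes):

```python
class ConflictError(Exception):
    """Stand-in for fdb's retryable 'not committed' error."""

class FakeTransaction:
    """Toy transaction whose commit fails a set number of times."""
    def __init__(self, remaining):
        self.remaining = remaining
    def commit(self):
        if self.remaining["n"] > 0:
            self.remaining["n"] -= 1
            raise ConflictError()

class FakeDB:
    def __init__(self, conflicts):
        self._remaining = {"n": conflicts}
    def create_transaction(self):
        return FakeTransaction(self._remaining)

def run_transactional(db, func, max_retries=10):
    # Essence of @fdb.transactional: the whole function re-runs on
    # conflict, which is why it must be free of external side effects.
    for _ in range(max_retries):
        tr = db.create_transaction()
        try:
            result = func(tr)
            tr.commit()
            return result
        except ConflictError:
            continue  # retry from scratch with a fresh read version
    raise RuntimeError("too many conflicts")

db = FakeDB(conflicts=2)  # first two commit attempts conflict
attempts = []
result = run_transactional(db, lambda tr: attempts.append(1) or "done")
# The function body ran 3 times; only the final attempt committed.
```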
4. Data Versioning for Consistency
Patterns like optimistic locking with version numbers are unnecessary:
```python
# NOT NEEDED in FoundationDB
data, version = get_with_version("doc_123")
# ... modify data ...
update_if_version_matches("doc_123", new_data, version)

# Instead:
@fdb.transactional
def update_doc(tr):
    data = tr[b'doc_123']  # Read creates conflict range
    # ... modify data ...
    tr[b'doc_123'] = new_data  # Conflicts if someone else modified
```
What You Still Need:
1. Idempotency for Non-Database Side Effects
If your transaction has side effects outside FoundationDB (sending emails, calling APIs, charging credit cards), you still need idempotency. The transaction might commit, but your client might crash before receiving confirmation:
```python
@fdb.transactional
def process_order(tr, order_id):
    # Check if already processed (idempotency key)
    status = tr[pack(('orders', order_id, 'status'))]
    if status == b'completed':
        return  # Already done, skip

    # Mark as processing (prevents duplicate processing)
    tr[pack(('orders', order_id, 'status'))] = b'processing'

    # Database work is safe within the transaction
    deduct_inventory(tr, order_id)

    # External side effect - needs its own protection
    # Often done AFTER transaction commits, with status tracking
    tr[pack(('orders', order_id, 'status'))] = b'completed'

# For truly external operations:
@fdb.transactional
def request_external_processing(tr, order_id):
    # Record intent in database
    tr[pack(('pending_external', order_id))] = b'pending'
    # Separate worker processes pending_external entries
    # with its own idempotency and retry logic
```
2. Transaction Size Limits
FoundationDB transactions are limited: a transaction must finish within 5 seconds and may affect at most about 10 MB of data, with individual keys limited to 10 kB and values to 100 kB.
Large batch operations must be split into multiple transactions:
```python
def delete_old_events(db, before_timestamp):
    """Delete events older than timestamp in batches."""
    while True:
        # A decorated function takes the database as its first argument
        deleted_count = delete_batch(db, before_timestamp, batch_size=1000)
        if deleted_count == 0:
            break
        print(f"Deleted {deleted_count} events")

@fdb.transactional
def delete_batch(tr, before_timestamp, batch_size):
    prefix = pack(('events',))
    count = 0
    to_delete = []
    for key, value in tr.get_range(prefix, prefix + b'\xff', limit=batch_size):
        event_time = unpack(key)[1]  # Assuming keys pack to (events, timestamp, ...)
        if event_time < before_timestamp:
            to_delete.append(key)
            count += 1
    for key in to_delete:
        del tr[key]
    return count
```
- Use atomic operations where possible: add(), min(), max() avoid read conflicts.

Think of each transaction as if it were the only transaction running. Design your logic for single-threaded correctness. FoundationDB ensures that concurrent transactions behave as if they ran one at a time—you provide the 'what', it handles the 'how'.
We've explored FoundationDB's approach to transaction isolation—strict serializability achieved through optimistic concurrency control and validated through deterministic simulation testing. This guarantee is the bedrock that makes everything else possible.
Atomic operations such as add(), min(), and max() write without reading, enabling concurrent updates to shared keys.

What's Next:
The ordered key-value store with strict serializability is powerful, but for many use cases, raw key-value access isn't ergonomic. FoundationDB addresses this through its layer architecture—the ability to build higher-level abstractions (document databases, SQL databases, graph databases) on top of the core primitives. In the next page, we'll explore how layers work and why this architecture enables unprecedented flexibility.
You now understand FoundationDB's transactional guarantees—strict serializability through optimistic concurrency control, validated by rigorous simulation testing. This correctness foundation is what makes FoundationDB trustworthy for critical infrastructure. Next, we'll see how the layer architecture turns this foundation into versatile data systems.