Imagine you're a bank auditor running a query to count all accounts with balances exceeding $100,000. Your query returns 47 accounts. Satisfied with this count, you begin generating a detailed compliance report for these same accounts. But when you re-execute the query to fetch the details, you now see 49 accounts—two new accounts have materialized into your result set as if by magic.
You haven't changed your query. The accounts you originally found are still there. Yet somehow, your consistent view of the database has been violated. You're experiencing a phantom read—the most subtle and often misunderstood concurrency anomaly in database systems.
By the end of this page, you will master the formal definition of phantom reads, understand how they differ fundamentally from dirty reads and non-repeatable reads, and recognize why phantom reads represent a unique challenge that requires specialized mechanisms to prevent.
A phantom read occurs when a transaction executes a query that retrieves a set of rows satisfying a given search condition, and a concurrent transaction subsequently inserts, deletes, or updates rows that would satisfy that same condition, causing a re-execution of the original query within the same transaction to return a different set of rows.
Let's formalize this definition with precise notation:
Phantom Read Anomaly: Given transaction T₁ that executes a range predicate query Q returning result set R₁, and transaction T₂ that concurrently performs INSERT, DELETE, or UPDATE operations affecting rows that match Q's predicate, a phantom read occurs when T₁ re-executes Q and obtains result set R₂ where R₁ ≠ R₂, despite T₁ not having committed or modified any data.
Key distinguishing characteristics:

Unlike dirty reads or non-repeatable reads, phantom reads specifically involve:

- Set membership changing, rather than the value of an already-read row
- Rows that did not exist (or did not match the predicate) at the time of the first read
- Range or predicate queries, rather than single-row lookups
- Committed INSERT, DELETE, or UPDATE operations by a concurrent transaction
The phantom read is sometimes called the "set-membership anomaly" because it violates the expectation that the set of rows matching a predicate should remain stable throughout a transaction.
| Characteristic | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| What changes? | Value of a specific row | Value of a specific row | Set membership (rows appearing/disappearing) |
| Source of change | Uncommitted transaction | Committed transaction | Committed INSERT/DELETE/UPDATE by concurrent transaction |
| Row identity | Same row, different value | Same row, different value | Different rows entirely |
| Query type affected | Single-row lookups | Single-row lookups | Range/predicate queries |
| Lock needed to prevent | Read lock on row | Read lock held until commit | Predicate/range lock or table lock |
To understand why phantom reads are particularly insidious, we must recognize that database locks traditionally operate on existing objects. When you read a row, the database can lock that specific row. When you read a range of rows, the database can lock each row it encounters.
But here's the fundamental problem: you cannot lock a row that doesn't exist yet.
Consider this scenario:
```sql
-- Transaction T1: Auditing high-value accounts
-- Time t1: T1 starts
BEGIN TRANSACTION; -- T1

-- Time t2: T1 queries all accounts with balance > 100000
SELECT COUNT(*) FROM accounts WHERE balance > 100000;
-- Result: 47 accounts
-- The database locks rows for account_id: 101, 102, 105,
-- 112, 118, ... (47 rows total)

-- Meanwhile, at time t3...
-- Transaction T2 runs concurrently and inserts a new row
BEGIN TRANSACTION; -- T2
INSERT INTO accounts (account_id, balance, customer_name)
VALUES (999, 150000, 'New High-Value Customer');
COMMIT; -- T2 commits successfully

-- Time t4: T1 re-executes the same query
SELECT COUNT(*) FROM accounts WHERE balance > 100000;
-- Result: 48 accounts ← PHANTOM!
-- The row with account_id 999 has "appeared" in T1's view
-- even though T1 never committed anything

COMMIT; -- T1
```

Why couldn't traditional row-level locking prevent this?
When T1 executed its first query, row 999 didn't exist. There was nothing to lock. The database faithfully locked all 47 existing rows that matched the predicate, but this was insufficient because:
- The predicate `balance > 100000` conceptually matches infinitely many possible rows—you can't pre-lock them all

Phantom reads expose a fundamental limitation of tuple-level concurrency control: you can only lock what you can see. Rows that don't exist yet, or rows that will be modified to match your predicate, lie outside the scope of traditional locking mechanisms.
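This failure mode can be sketched in a few lines. The following Python simulation (our illustration; the lock bookkeeping is a deliberate simplification, not any engine's actual implementation) locks every existing row matching the predicate and still lets a phantom through:

```python
# Toy model: T1 "locks" every existing row that matches its predicate,
# yet a brand-new row slips past, because there was nothing to lock
# when T1 scanned.

predicate = lambda row: row["balance"] > 100_000

table = {
    101: {"balance": 250_000},
    102: {"balance": 175_000},
    103: {"balance": 40_000},
}

# T1's scan: record (lock) the ids of all currently matching rows.
row_locks = {rid for rid, row in table.items() if predicate(row)}

# T2 inserts a new matching row. Row id 999 is not in row_locks,
# so nothing blocks the insert.
table[999] = {"balance": 150_000}
assert 999 not in row_locks

# T1 re-evaluates: the result set grew despite T1 holding all its row locks.
result = {rid for rid, row in table.items() if predicate(row)}
print(sorted(result))  # [101, 102, 999] -- row 999 is the phantom
```

A predicate lock would instead test T2's new row against the predicate itself (`predicate(table[999])` is `True`) and block the insert.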
Phantom reads can be caused by three distinct types of concurrent operations, each with different characteristics and implications:
```sql
-- ==========================================
-- TYPE 1: INSERT Phantom (Apparition)
-- ==========================================
-- T1: SELECT * FROM employees WHERE dept = 'Engineering';
-- Result: {Alice, Bob, Charlie} -- 3 employees

-- T2: INSERT INTO employees VALUES ('Diana', 'Engineering');
-- T2: COMMIT;

-- T1: SELECT * FROM employees WHERE dept = 'Engineering';
-- Result: {Alice, Bob, Charlie, Diana} -- 4 employees (Diana appeared)

-- ==========================================
-- TYPE 2: DELETE Phantom (Disappearance)
-- ==========================================
-- T1: SELECT * FROM products WHERE price < 100;
-- Result: {Widget, Gadget, Gizmo} -- 3 products

-- T2: DELETE FROM products WHERE name = 'Gadget';
-- T2: COMMIT;

-- T1: SELECT * FROM products WHERE price < 100;
-- Result: {Widget, Gizmo} -- 2 products (Gadget disappeared)

-- ==========================================
-- TYPE 3: UPDATE Phantom (Transmutation)
-- ==========================================
-- T1: SELECT * FROM orders WHERE status = 'pending';
-- Result: {Order-101, Order-102, Order-103}

-- T2: UPDATE orders SET status = 'pending'
--     WHERE order_id = 'Order-104'; -- Was 'processing'
-- T2: COMMIT;

-- T1: SELECT * FROM orders WHERE status = 'pending';
-- Result: {Order-101, Order-102, Order-103, Order-104}
-- Order-104 "transmuted" into the result set
```

While INSERT phantoms are most commonly discussed, UPDATE and DELETE phantoms are equally problematic and often overlooked. A complete phantom prevention strategy must address all three operation types, not just insertions.
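All three operation types reduce to the same set-membership effect, which a small Python simulation (our toy model, not a database API) makes concrete:

```python
# Simulate T1's repeated evaluation of the predicate status == 'pending'
# while concurrent INSERT, UPDATE, and DELETE operations change membership.

def pending(orders):
    """Rows currently satisfying the predicate status == 'pending'."""
    return {oid for oid, status in orders.items() if status == "pending"}

orders = {"Order-101": "pending", "Order-102": "pending",
          "Order-103": "pending", "Order-104": "processing"}

r1 = pending(orders)                 # T1's first read

orders["Order-105"] = "pending"      # INSERT phantom: new matching row
orders["Order-104"] = "pending"      # UPDATE phantom: row transmutes in
del orders["Order-101"]              # DELETE phantom: matching row vanishes

r2 = pending(orders)                 # T1's re-read

print(sorted(r2 - r1))  # appeared:    ['Order-104', 'Order-105']
print(sorted(r1 - r2))  # disappeared: ['Order-101']
```

The two set differences printed at the end correspond exactly to the phantom insertions and phantom deletions formalized later on this page.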
To rigorously analyze phantom reads, we need a formal framework. Let's define the components:
Definitions:

- P: a search predicate (e.g., `dept = 'Sales' AND salary > 50000`)
- R(P, t): the set of rows satisfying predicate P at time t
- t₁: the time at which T₁ first executes its query Q over P
- t₂: the time at which a concurrent transaction T₂ commits an INSERT, DELETE, or UPDATE affecting rows matching P
- t₃: the time at which T₁ re-executes Q

Phantom Read Condition:

A phantom read occurs in transaction T₁ when t₁ < t₂ < t₃ and R(P, t₁) ≠ R(P, t₃), even though T₁ itself modified nothing between the two executions.
The Set-Difference Formalization:
We can precisely characterize the phantom phenomenon using set operations:
Phantom Insertions: R(P, t₃) \ R(P, t₁) ≠ ∅ (new rows appeared)
Phantom Deletions: R(P, t₁) \ R(P, t₃) ≠ ∅ (rows disappeared)
Where \ denotes set difference.
A phantom-free execution guarantees:
R(P, t₁) = R(P, t₃) for every re-execution of Q within T₁
This is a stronger guarantee than non-repeatable read prevention, which only ensures individual row values are stable. Phantom freedom ensures set membership is stable.
| Property | Mathematical Guarantee | What It Protects |
|---|---|---|
| Read Committed | Read only committed values | Protection from dirty reads |
| Repeatable Read | ∀r ∈ R(P, t₁): value(r, t₁) = value(r, t₃) | Individual row values stable |
| Serializable (Phantom-Free) | R(P, t₁) = R(P, t₃) | Entire result set membership stable |
This formalization reveals why preventing phantoms requires locking the predicate itself, not just the rows. The predicate defines a potentially infinite set of possible rows—past, present, and future. True serializable isolation must protect this conceptual set, not just its current materialization.
Phantom reads occupy a unique position in the concurrency anomaly hierarchy. Understanding these distinctions is crucial for database architects and developers.
The Dimensionality Problem:
Dirty reads and non-repeatable reads are essentially point problems—they involve specific, known data points (rows) whose values might be incorrect or unstable. Phantom reads are space problems—they involve an entire search space defined by a predicate, where new points might appear or existing points might vanish.
This dimensional difference has profound implications:
In the SQL standard, phantom reads are only fully prevented at the SERIALIZABLE isolation level—the highest and most restrictive level. Lower isolation levels (READ COMMITTED, REPEATABLE READ) explicitly permit phantom reads as a trade-off for better concurrency. This is not an oversight—it reflects the fundamental difficulty of phantom prevention.
The term "phantom" was coined to evoke the ghost-like behavior of these anomalous rows—they appear and disappear without apparent cause, haunting transactions with inconsistent views of reality.
Evolution of Understanding:
The phantom read problem was formally identified and analyzed during the development of transaction processing theory, beginning in the 1970s:

- 1976: Eswaran, Gray, Lorie, and Traiger's classic paper on consistency and predicate locks first described the phantom problem and proposed predicate locking as its remedy.
- 1992: The ANSI SQL-92 standard made the phantom one of the three read anomalies that distinguish its isolation levels.
- 1995: Berenson et al.'s "A Critique of ANSI SQL Isolation Levels" refined the definitions and exposed ambiguities in the standard's treatment of phantoms.
| Isolation Level | Phantom Reads Possible? | Dirty Reads | Non-Repeatable Reads |
|---|---|---|---|
| READ UNCOMMITTED | Yes | Yes | Yes |
| READ COMMITTED | Yes | No | Yes |
| REPEATABLE READ | Yes* | No | No |
| SERIALIZABLE | No | No | No |
The asterisk (*) for REPEATABLE READ is significant. MySQL's InnoDB storage engine actually prevents many phantom scenarios even at REPEATABLE READ through its use of MVCC and gap locking—going beyond the SQL standard's requirements. However, this is implementation-specific and not guaranteed by the standard.
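To make the gap-locking idea concrete, here is a simplified Python sketch (the function name and interval model are our illustration, not InnoDB's internals) of which open intervals between existing index keys must be covered so that no new key satisfying `balance > 100000` can be inserted:

```python
import math

def gaps_to_cover(index_keys, lower_bound):
    """Open intervals that must be gap-locked so that no new key k with
    k > lower_bound can be inserted (simplified illustration)."""
    matching = sorted(k for k in index_keys if k > lower_bound)
    # The matching keys themselves get ordinary record locks; the gaps
    # between them (and above the last one) must be locked too, or a
    # concurrent INSERT could land in one of them and become a phantom.
    edges = [lower_bound] + matching + [math.inf]
    return list(zip(edges[:-1], edges[1:]))

print(gaps_to_cover([90_000, 120_000, 150_000], 100_000))
# [(100000, 120000), (120000, 150000), (150000, inf)]
```

This is the essence of InnoDB's "next-key" approach: record locks on the matching keys plus gap locks on the intervals between them, which together block any insertion that would satisfy the range predicate.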
Not all queries are equally susceptible to phantom reads. Understanding which query patterns are vulnerable helps database architects design robust systems.
- Aggregate queries: `SELECT COUNT(*) WHERE condition`, `SELECT SUM(amount) WHERE condition`—these depend on set membership being stable
- Range queries: `SELECT * WHERE value BETWEEN x AND y`, `SELECT * WHERE date > '2024-01-01'`—open ranges invite insertions
- Pattern matching: `SELECT * WHERE name LIKE 'Smith%'`—new matching rows can appear
- Subquery membership: `SELECT * WHERE id IN (SELECT id FROM ...)`—the subquery result can change
- Negative existence checks: `SELECT * WHERE NOT EXISTS (SELECT ...)`—new satisfying rows could appear
- Joins with predicates: `SELECT * FROM a JOIN b ON ... WHERE ...`—either side can gain/lose rows
```sql
-- ==========================================
-- PHANTOM-VULNERABLE PATTERNS
-- ==========================================

-- 1. Aggregate with predicate (COUNT phantom)
SELECT COUNT(*) FROM orders
WHERE order_date = CURRENT_DATE; -- New orders can appear

-- 2. Range query (insertion phantom)
SELECT * FROM products
WHERE price BETWEEN 10 AND 50; -- New products in range

-- 3. Comparison predicate (update phantom)
SELECT * FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Employee salaries or average can change

-- 4. NOT EXISTS (existence phantom)
SELECT c.customer_id FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
); -- New orders can make customers "disappear" from result

-- 5. Multi-table join with predicate
SELECT o.order_id, p.product_name
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.status = 'pending'; -- Phantoms possible from any table

-- ==========================================
-- LOWER-RISK PATTERNS (but not immune)
-- ==========================================

-- Single-row lookup by primary key (no predicate range)
SELECT * FROM accounts WHERE account_id = 12345;
-- Less susceptible, but UPDATE phantoms still possible
-- if a row is UPDATEd to have account_id = 12345
```

Even single-row lookups by primary key can experience phantom-like effects if a concurrent transaction creates a new row with that key (assuming the lookup initially returned empty) or deletes and recreates a row. However, range predicates and aggregate queries face the highest phantom risk.
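The COUNT phantom is easy to reproduce with the standard library. The following script uses two `sqlite3` connections as T1 and T2; because T1's SELECTs run in autocommit mode here, each statement sees the latest committed state (roughly READ COMMITTED visibility), so T2's committed insert appears between T1's two reads. This is a simulation of the visibility timeline, not a claim about any particular engine's isolation defaults:

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file stand in for T1 and T2.
path = os.path.join(tempfile.mkdtemp(), "phantom_demo.db")
t1 = sqlite3.connect(path)
t2 = sqlite3.connect(path)

t1.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
t1.executemany("INSERT INTO orders VALUES (?, ?)",
               [(101, "pending"), (102, "pending"), (103, "shipped")])
t1.commit()

query = "SELECT COUNT(*) FROM orders WHERE status = 'pending'"
before = t1.execute(query).fetchone()[0]   # T1's first read: 2

t2.execute("INSERT INTO orders VALUES (104, 'pending')")  # T2's insert
t2.commit()                                               # T2 commits

after = t1.execute(query).fetchone()[0]    # T1's re-read: 3 -- a phantom
print(before, after)  # 2 3
```

Under snapshot-based REPEATABLE READ (e.g., InnoDB's MVCC), the second read would still return 2; only SERIALIZABLE (or predicate/gap locking) guarantees phantom-free re-reads across engines.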
We have established a rigorous foundation for understanding phantom reads. Let's consolidate the key insights:

- A phantom read is a set-membership anomaly: re-executing the same predicate query within one transaction returns a different set of rows.
- Phantoms differ from dirty and non-repeatable reads: they involve rows appearing or disappearing, not existing rows changing value.
- INSERT, DELETE, and UPDATE operations can each produce phantoms; a complete prevention strategy must address all three.
- Row-level locks cannot stop phantoms because you cannot lock a row that does not yet exist; prevention requires protecting the predicate itself.
- The SQL standard guarantees phantom freedom only at SERIALIZABLE, though some implementations (such as InnoDB at REPEATABLE READ) prevent many phantoms through MVCC and gap locking.
What's next:
Now that we understand what phantom reads are, we'll examine how new rows specifically appear in query results—the mechanics of INSERT phantoms. The next page explores the exact sequence of events that allows a concurrent insertion to materialize in another transaction's result set, and why this seemingly simple operation creates such complexity.
You now have a precise, formal understanding of phantom reads and their unique position among concurrency anomalies. This foundation prepares you for analyzing specific phantom scenarios and prevention mechanisms in subsequent pages.