The most common and intuitive form of phantom read occurs when a new row "appears" in a transaction's result set. You query a table, obtain results, and upon re-querying, discover additional rows that weren't there before. This section dissects exactly how and why this happens.
Understanding the mechanics of this row appearance is essential because it exposes the fundamental challenge facing database designers: how do you lock a row that does not yet exist?
By the end of this page, you will understand the precise sequence of events that leads to INSERT phantoms, analyze why concurrent insertions bypass traditional protections, and recognize the patterns that make applications vulnerable to this anomaly.
To understand INSERT phantoms, we must first understand what happens when a row is inserted into a database table. The insertion process involves multiple steps, each with concurrency implications:
Step 1: Space Allocation
The database allocates physical storage space for the new row. This typically involves finding a page with available space or allocating a new page. At this point, the row has a physical location but no logical presence in any index.
Step 2: Row Data Writing
The row data is written to the allocated space. The row now physically exists but may not be visible to queries—depending on the database's isolation implementation.
Step 3: Index Entry Creation
Entries are added to all relevant indexes. This is crucial because most predicate queries use indexes to locate rows. Once an index entry exists, the row becomes "findable" by queries matching that index.
Step 4: Transaction Commit
Upon commit, the row becomes permanently part of the database. Before commit, the row may or may not be visible to other transactions depending on isolation level and MVCC implementation.
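With those four steps in mind, here is the canonical interleaving that the rest of this page analyzes, shown as a minimal sketch. The employees table, its columns, and the values are hypothetical stand-ins, and both transactions are assumed to run at READ COMMITTED:

```sql
-- The canonical INSERT phantom interleaving (schema and values hypothetical).

-- Transaction T1:
BEGIN;
SELECT * FROM employees WHERE salary > 50000;  -- returns Alice and Bob

-- Transaction T2, in a separate connection:
BEGIN;
INSERT INTO employees (id, name, salary) VALUES (3, 'Charlie', 75000);
COMMIT;  -- Charlie is now committed

-- Transaction T1, still in the same transaction:
SELECT * FROM employees WHERE salary > 50000;  -- returns Alice, Bob, Charlie
COMMIT;
-- Charlie is a phantom: it matches the predicate on the second read
-- despite being absent from the first.
```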
The Critical Observation:
Between T1's first and second queries, T2 inserted a row that matches T1's predicate. T1 had no way to prevent this insertion because:
- salary > 50000 defines an infinite potential set—any row with salary above 50000 qualifies
- Row-level locks can only be placed on rows that already exist; there was no row for T1 to lock

To truly understand why INSERT phantoms are difficult to prevent, we must examine how indexes store and organize data. Most databases use B+Tree indexes, which maintain sorted order and create inherent "gaps" between values.
Understanding Index Gaps:
Consider a B+Tree index on an age column containing values: [22, 25, 31, 38, 45, 52]
This creates the following logical gaps: (-∞, 22), (22, 25), (25, 31), (31, 38), (38, 45), (45, 52), and (52, +∞).
Each gap represents a range where new values can be inserted. Standard row locks only protect existing values, not these gaps.
```sql
-- Current index structure on 'age' column:
-- B+Tree index entries: [22]    [25]    [31]    [38]    [45]    [52]
-- Gaps:         (-∞,22) (22,25) (25,31) (31,38) (38,45) (45,52) (52,+∞)

-- Transaction T1 queries for age > 30
SELECT * FROM employees WHERE age > 30;
-- Result: rows with ages 31, 38, 45, 52

-- T1 acquires row locks on:
-- - Row with age 31 ✓
-- - Row with age 38 ✓
-- - Row with age 45 ✓
-- - Row with age 52 ✓

-- PROBLEM: The gaps (31,38), (38,45), (45,52), (52,+∞) are NOT locked!

-- Transaction T2 can freely insert:
INSERT INTO employees (name, age) VALUES ('NewHire', 35);
-- Age 35 falls in the (31,38) gap - no lock conflict!

COMMIT; -- T2 commits successfully

-- Now T1 re-queries:
SELECT * FROM employees WHERE age > 30;
-- Result: rows with ages 31, 35, 38, 45, 52
-- The row with age 35 is a PHANTOM!
```

Row-level locking only prevents modifications to existing rows. It does nothing to prevent insertions into index gaps. Solving this requires gap locks—locks that protect not just existing values but the spaces between them. Gap locking is computationally expensive and can significantly reduce concurrency.
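To see what closing those gaps looks like in practice, here is a sketch assuming MySQL InnoDB at REPEATABLE READ and an index on age, where a locking read takes next-key locks covering both the index records and the gaps between them:

```sql
-- Sketch: InnoDB next-key locking at REPEATABLE READ (assumes an index on age).

-- Session T1:
BEGIN;
SELECT * FROM employees WHERE age > 30 FOR UPDATE;
-- InnoDB locks the index records 31, 38, 45, 52 AND the gaps around them,
-- including (52, +∞) via the supremum pseudo-record.

-- Session T2:
INSERT INTO employees (name, age) VALUES ('NewHire', 35);
-- Blocks: 35 falls inside the locked (31, 38) gap.
-- T2 waits until T1 commits or rolls back (or hits lock_wait_timeout).
```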
Why Gap Locking Is Complex:
- Range predicates like age > 30 include all values from 31 to infinity—you can't enumerate them
- Compound predicates like WHERE age > 30 AND dept = 'Sales' create multi-dimensional gaps

Let's trace through an INSERT phantom scenario step by step to understand exactly when and why the anomaly occurs.
| Time | Transaction T1 (Reader) | Transaction T2 (Inserter) | Database State |
|---|---|---|---|
| t=0 | BEGIN TRANSACTION | — | Table: {(1, 'Alice', 60K), (2, 'Bob', 55K)} |
| t=1 | SELECT * WHERE salary > 50K | — | Table unchanged; T1's query is parsed and optimized |
| t=2 | Result: Alice, Bob (2 rows) | — | T1 acquires S-locks on Alice, Bob |
| t=3 | Processing results... | BEGIN TRANSACTION | T2 starts |
| t=4 | Processing results... | INSERT (3, 'Charlie', 75K) | Row created, not yet committed |
| t=5 | Processing results... | COMMIT | Charlie permanently added to table |
| t=6 | SELECT * WHERE salary > 50K | — | Re-query within same transaction |
| t=7 | Result: Alice, Bob, Charlie (3 rows) | — | PHANTOM DETECTED: Charlie appeared |
| t=8 | COMMIT | — | T1 ends with inconsistent views |
Critical Moments Analysis:
t=2 (Lock Acquisition): T1 acquires shared locks on Alice and Bob—the only rows that exist matching the predicate. There is no row for Charlie to lock.
t=4 (Insertion): T2 inserts Charlie. This insertion does not conflict with T1's locks because:

- T1's shared locks cover only the rows that existed at t=2 (Alice and Bob)
- The new row did not exist when T1 acquired its locks, so no lock could have been placed on it
t=5 (Commit): T2 commits. Charlie now exists in the database and is visible to all transactions running at isolation levels that see committed data.
t=7 (Phantom Manifestation): T1's re-query now finds Charlie. From T1's perspective, a row has "appeared" despite T1 not releasing its locks or modifying data.
The phantom becomes visible at t=7 because T1 is running at an isolation level (such as READ COMMITTED, or a purely lock-based REPEATABLE READ) that sees newly committed data. At true SERIALIZABLE isolation with a proper implementation, the database would either block T2's insert until T1 completes, or abort one of the transactions.
The visibility of inserted rows to concurrent transactions depends heavily on the isolation level. Understanding these semantics is crucial for predicting phantom behavior.
```sql
-- ============================================
-- READ UNCOMMITTED: Immediate Visibility
-- ============================================
-- T1 sees T2's insertions BEFORE T2 commits
-- Phantoms can appear from uncommitted inserts
-- Most dangerous for phantom reads

-- T1: SELECT * FROM accounts WHERE balance > 1000;
-- T2: INSERT INTO accounts VALUES ('New', 5000); -- Not committed
-- T1: SELECT * FROM accounts WHERE balance > 1000;
-- Result: Sees 'New' even though T2 hasn't committed!

-- ============================================
-- READ COMMITTED: Post-Commit Visibility
-- ============================================
-- T1 sees T2's insertions AFTER T2 commits
-- Standard phantom behavior

-- T1: SELECT * FROM accounts WHERE balance > 1000;
-- T2: INSERT INTO accounts VALUES ('New', 5000);
-- T2: COMMIT; -- Now visible
-- T1: SELECT * FROM accounts WHERE balance > 1000;
-- Result: Sees 'New' because T2 committed

-- ============================================
-- REPEATABLE READ: Snapshot-Based (MVCC)
-- ============================================
-- Behavior depends on implementation:

-- PostgreSQL REPEATABLE READ (uses snapshot):
-- T1 starts with snapshot at transaction begin
-- T2: INSERT INTO accounts VALUES ('New', 5000);
-- T2: COMMIT;
-- T1: SELECT * FROM accounts WHERE balance > 1000;
-- Result: Does NOT see 'New' (snapshot isolation)

-- MySQL InnoDB REPEATABLE READ:
-- Uses MVCC + gap locking
-- Similar snapshot isolation prevents most phantoms
-- But may still occur in certain edge cases

-- ============================================
-- SERIALIZABLE: Full Prevention
-- ============================================
-- True serializable prevents phantoms entirely
-- Methods: predicate locking, SSI, or table locking

-- T1: SELECT * FROM accounts WHERE balance > 1000;
-- T2: INSERT INTO accounts VALUES ('New', 5000);
-- Possible outcomes:
-- 1. T2 blocks until T1 commits (2PL approach)
-- 2. T1 or T2 aborts on conflict (SSI approach)
-- 3. Table lock prevents T2's insert (simple approach)
```

| Isolation Level | When Insert Becomes Visible | Phantom Possible? | Mechanism |
|---|---|---|---|
| READ UNCOMMITTED | Immediately | Yes (from uncommitted) | No read barriers |
| READ COMMITTED | After inserter commits | Yes | Commit-time visibility |
| REPEATABLE READ (MVCC) | After reader commits* | Usually No* | Snapshot isolation |
| REPEATABLE READ (Locking) | After reader commits | Yes | Row locks only |
| SERIALIZABLE | After reader commits | No | Predicate/gap locks or SSI |
*Modern MVCC implementations at REPEATABLE READ often prevent phantoms because each transaction reads from a snapshot taken at its start. However, this is implementation-specific: the SQL standard permits phantoms at REPEATABLE READ, so portable applications should not rely on MVCC for phantom prevention.
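When portability matters, the safest course is to request SERIALIZABLE explicitly rather than hoping a vendor's REPEATABLE READ happens to use snapshots. A minimal sketch, shown in PostgreSQL's syntax (other engines use SET TRANSACTION before the first statement):

```sql
-- Request serializability explicitly; do not rely on MVCC defaults.
BEGIN ISOLATION LEVEL SERIALIZABLE;

SELECT COUNT(*) FROM accounts WHERE balance > 1000;
-- ... application work ...
SELECT COUNT(*) FROM accounts WHERE balance > 1000;  -- guaranteed to agree

COMMIT;
-- Caveat: a serializable engine may abort this transaction with a
-- serialization failure instead of blocking the writer, so callers
-- must be prepared to retry.
```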
INSERT phantoms manifest in numerous real-world scenarios. Recognizing these patterns helps developers anticipate and prevent anomalies.
Scenario: Monthly Account Balance Report
A financial system generates end-of-month reports by summing account balances. While the report runs, the operations team opens new accounts for customers.
```sql
-- Report Transaction (T1)
BEGIN TRANSACTION;

-- Step 1: Count and sum all accounts
SELECT COUNT(*), SUM(balance)
FROM accounts
WHERE status = 'active';
-- Result: 10,000 accounts, $50,000,000 total

-- Step 2: Get detailed breakdown by account type
SELECT account_type, COUNT(*), SUM(balance)
FROM accounts
WHERE status = 'active'
GROUP BY account_type;

-- Meanwhile, Operations (T2) opens new accounts:
-- INSERT INTO accounts VALUES (10001, 'active', 'checking', 100000);
-- INSERT INTO accounts VALUES (10002, 'active', 'savings', 50000);
-- COMMIT;

-- Step 3: Validate totals match (back in T1)
SELECT COUNT(*), SUM(balance)
FROM accounts
WHERE status = 'active';
-- Result: 10,002 accounts, $50,150,000 total

-- PHANTOM! The validation shows different numbers than Step 1
-- Report is internally inconsistent

COMMIT;
```

This phantom leaves the report internally inconsistent: the detailed breakdown will not match the summary totals. In financial contexts, this could trigger audit failures, regulatory concerns, or incorrect financial statements.
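One way to keep the summary and the breakdown mutually consistent without raising the isolation level is to compute both in a single statement, since MVCC engines typically give each statement one consistent snapshot even at READ COMMITTED. A sketch using standard SQL GROUPING SETS (available in PostgreSQL, SQL Server, and Oracle; MySQL offers the similar ROLLUP):

```sql
-- Summary and per-type breakdown from ONE statement, hence one snapshot.
SELECT account_type,
       COUNT(*)     AS account_count,
       SUM(balance) AS total_balance
FROM accounts
WHERE status = 'active'
GROUP BY GROUPING SETS ((account_type), ());
-- The empty grouping set () produces a grand-total row (account_type is
-- NULL there) that is guaranteed consistent with the per-type rows:
-- no concurrent commit can fall between them.
```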
INSERT phantoms become particularly dangerous when combined with aggregate functions. A single phantom row can cascade into significant data integrity issues.
Why Aggregates Amplify Phantom Problems:

- Aggregates collapse many rows into a single value, so one phantom row simultaneously shifts COUNT, AVG, and MAX
- A HAVING filter over those aggregates can make entire groups appear or disappear between queries

The demonstration below traces both effects.
```sql
-- Demonstration of Aggregate Phantom Amplification

-- Original query at time t1:
SELECT department,
       COUNT(*)    as employee_count,
       AVG(salary) as avg_salary,
       MAX(salary) as max_salary
FROM employees
WHERE hire_date > '2024-01-01'
GROUP BY department
HAVING COUNT(*) >= 5;

-- Results at t1:
-- | department  | employee_count | avg_salary | max_salary |
-- |-------------|----------------|------------|------------|
-- | Engineering | 12             | 95000      | 150000     |
-- | Sales       | 8              | 72000      | 110000     |
-- | Marketing   | 5              | 68000      | 85000      |

-- T2 inserts three new employees:
-- INSERT INTO employees VALUES ('Alice', 'Engineering', 200000, '2024-06-01');
-- INSERT INTO employees VALUES ('Bob', 'HR', 55000, '2024-03-01');
-- INSERT INTO employees VALUES ('Carol', 'Support', 45000, '2024-02-01');
-- COMMIT;

-- Same query at time t3 (after phantom):
SELECT department,
       COUNT(*)    as employee_count,
       AVG(salary) as avg_salary,
       MAX(salary) as max_salary
FROM employees
WHERE hire_date > '2024-01-01'
GROUP BY department
HAVING COUNT(*) >= 5;

-- Results at t3:
-- | department  | employee_count | avg_salary | max_salary |
-- |-------------|----------------|------------|------------|
-- | Engineering | 13             | 103077     | 200000     | ← Changed!
-- | Sales       | 8              | 72000      | 110000     |
-- | Marketing   | 5              | 68000      | 85000      |

-- Impacts:
-- 1. Engineering count: 12 → 13 (phantom insertion)
-- 2. Engineering avg: 95000 → 103077 (recalculated with new row)
-- 3. Engineering max: 150000 → 200000 (new maximum introduced)
-- 4. HR and Support: Not shown (< 5 employees, don't meet HAVING)
--    But if more inserts happened, new groups could APPEAR
```

A single phantom INSERT changed three aggregate values in the Engineering row. In reports or dashboards that cache intermediate results, this creates internal inconsistencies that are difficult to diagnose. The report isn't "wrong" at any single point in time—but it represents different points in time within the same logical report.
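For reports that must span several statements, a read-only snapshot transaction pins every statement to the same point in time. A sketch in PostgreSQL syntax, where REPEATABLE READ is implemented as snapshot isolation:

```sql
-- Multi-statement report pinned to a single snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ, READ ONLY;

SELECT department, COUNT(*) FROM employees
WHERE hire_date > '2024-01-01' GROUP BY department;  -- breakdown

SELECT COUNT(*) FROM employees
WHERE hire_date > '2024-01-01';                      -- summary, same snapshot

COMMIT;
-- Concurrent inserts commit normally but remain invisible to this
-- transaction, so the two results cannot drift apart.
```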
Detecting INSERT phantoms during development and testing is crucial. Here are patterns and techniques for identifying phantom-vulnerable code:
```sql
-- Pattern: Self-Validating Transaction
-- Detects phantoms by comparing initial and final state

BEGIN TRANSACTION;

-- Capture initial snapshot
CREATE TEMP TABLE initial_snapshot AS
SELECT id, updated_at
FROM target_table
WHERE predicate_condition;

-- Record count
SELECT COUNT(*) as initial_count FROM initial_snapshot;

-- ... application logic here ...
-- ... time passes, concurrent transactions may run ...

-- Capture final snapshot
CREATE TEMP TABLE final_snapshot AS
SELECT id, updated_at
FROM target_table
WHERE predicate_condition;

-- Detect phantoms
SELECT 'INSERT_PHANTOM' as phantom_type, f.id
FROM final_snapshot f
LEFT JOIN initial_snapshot i ON f.id = i.id
WHERE i.id IS NULL
UNION ALL
SELECT 'DELETE_PHANTOM' as phantom_type, i.id
FROM initial_snapshot i
LEFT JOIN final_snapshot f ON i.id = f.id
WHERE f.id IS NULL;

-- If any rows returned, phantoms occurred
-- Application can decide: abort, retry, or accept

DROP TABLE initial_snapshot;
DROP TABLE final_snapshot;

COMMIT;
```

The most effective way to test for phantom vulnerability is to deliberately run at READ COMMITTED isolation and introduce concurrent insertions. If your application produces inconsistent results, you've confirmed phantom sensitivity. Then test at SERIALIZABLE to verify prevention.
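As a concrete probe, the interleaving below (sketched for PostgreSQL; table and column names hypothetical) can be run in two separate sessions, first at READ COMMITTED and then at SERIALIZABLE:

```sql
-- Two-session phantom probe (PostgreSQL syntax; names hypothetical).
-- Run each session in its own connection, interleaving as numbered.

-- Session T1:
BEGIN ISOLATION LEVEL SERIALIZABLE;                         -- (1)
SELECT COUNT(*) FROM accounts WHERE balance > 1000;         -- (2) e.g. 42

-- Session T2:
BEGIN ISOLATION LEVEL SERIALIZABLE;                         -- (3)
INSERT INTO accounts (name, balance) VALUES ('New', 5000);  -- (4)
COMMIT;                                                     -- (5)

-- Session T1:
SELECT COUNT(*) FROM accounts WHERE balance > 1000;         -- (6) still 42
COMMIT;                                                     -- (7)

-- Rerun the same interleaving at READ COMMITTED: step (6) returns 43,
-- confirming phantom sensitivity. If T1 also wrote rows derived from its
-- read, PostgreSQL's SSI could abort one transaction with SQLSTATE 40001,
-- so test harnesses should be prepared to retry.
```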
We have thoroughly examined how new rows appear in query results through the INSERT phantom mechanism. Let's consolidate the key insights:

- A row becomes "findable" only once its index entries exist, so locks taken earlier cannot cover it
- Predicates like salary > 50000 define open-ended sets; matching rows can always be created later
- B+Tree indexes leave gaps between existing keys, and standard row locks do not protect those gaps
- Whether a committed insert is visible mid-transaction depends on isolation level: MVCC snapshots hide it, lock-based REPEATABLE READ does not
- Aggregates amplify phantoms, since one new row can shift COUNT, AVG, and MAX at once and desynchronize multi-query reports
What's next:
While INSERT phantoms involve individual rows appearing, the next page explores how range queries specifically amplify phantom vulnerabilities. Range predicates define open-ended search spaces that are particularly susceptible to phantom intrusion, and understanding their unique characteristics is essential for building robust database applications.
You now understand the mechanics of INSERT phantoms—how concurrent insertions bypass row-level locks to materialize new rows in query results. This knowledge prepares you for analyzing the specific challenges of range queries in the next section.