Not every transaction reaches its intended destination. Errors occur. Constraints are violated. Deadlocks form. Systems fail. When any of these conditions prevent a transaction from completing successfully, it enters the Failed state.
The Failed state is not a permanent condition—it's a transitional state indicating that something has gone wrong and the transaction must be rolled back. From Failed, every transaction must proceed to the Aborted state, where all its changes are reversed.
Understanding the Failed state is critical because every real-world system experiences failures: applications must detect them, handle them correctly, and decide whether to retry.
By the end of this page, you will understand the formal definition of the Failed state, recognize the various causes that trigger transition to Failed, comprehend how the DBMS handles failures, learn best practices for application-level error handling, and understand the relationship between Failed and Aborted states.
Let's establish a precise understanding of what it means for a transaction to be in the Failed state.
Formal Definition:
A transaction T is in the Failed state if and only if it was previously Active or Partially Committed, a failure has been detected, and its rollback has not yet completed.
Using formal notation:
T ∈ Failed ⟺
(previous_state(T) ∈ {Active, PartiallyCommitted}) ∧
(failure_detected(T) = true) ∧
(rollback_complete(T) = false)
Key Characteristics of the Failed State: it is entered only when a failure is detected, no new operations are accepted, the transaction's locks are still held, and the only exit is rollback into the Aborted state.
These states are often confused. 'Failed' indicates that failure has been detected but rollback hasn't completed. 'Aborted' indicates that rollback is complete and the transaction's effects have been fully removed. Failed is transitional; Aborted is terminal.
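The distinction can be made concrete with a minimal state-machine sketch (illustrative Python, not any real engine's implementation): Failed is reachable only from Active or Partially Committed, and the only legal exit from Failed is Aborted.

```python
from enum import Enum, auto

class TxnState(Enum):
    ACTIVE = auto()
    PARTIALLY_COMMITTED = auto()
    COMMITTED = auto()
    FAILED = auto()
    ABORTED = auto()

# Legal transitions in the classic five-state transaction diagram
TRANSITIONS = {
    TxnState.ACTIVE: {TxnState.PARTIALLY_COMMITTED, TxnState.FAILED},
    TxnState.PARTIALLY_COMMITTED: {TxnState.COMMITTED, TxnState.FAILED},
    TxnState.FAILED: {TxnState.ABORTED},   # Failed is transitional
    TxnState.COMMITTED: set(),             # terminal
    TxnState.ABORTED: set(),               # terminal
}

class Transaction:
    def __init__(self):
        self.state = TxnState.ACTIVE

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"Illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Attempting any transition out of Aborted (or from Failed to anything but Aborted) raises an error, which mirrors the rule stated above.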
Transactions can fail for numerous reasons. Understanding these causes helps you design systems that handle failures gracefully and prevent avoidable failures.
Category 1: Application-Level Errors
These are errors that originate from the application or user actions:
| Cause | Description | Example | Retriable? |
|---|---|---|---|
| Explicit ROLLBACK | Application intentionally aborts | User cancels order | N/A (intentional) |
| Constraint violation | Data violates integrity constraints | Duplicate primary key | Sometimes |
| Check constraint failure | Business rule violated | Negative balance | If data fixed |
| Foreign key violation | Reference integrity broken | Invalid customer_id | If data fixed |
| Type mismatch | Wrong data type for column | Text in numeric field | If data fixed |
| Null constraint violation | NULL in NOT NULL column | Missing required field | If data fixed |
Category 2: Concurrency-Related Failures
These failures arise from interaction with other concurrent transactions:
| Cause | Description | Detection Method | Retriable? |
|---|---|---|---|
| Deadlock victim | Chosen as deadlock victim | Wait-for graph analysis | Usually yes |
| Lock timeout | Lock wait exceeded timeout | Timer expiration | Usually yes |
| Serialization failure | Cannot find serializable order | MVCC conflict detection | Usually yes |
| Snapshot too old | MVCC versions no longer available | Version check | Yes, immediate retry |
| Write skew detected | Anomaly in snapshot isolation | Conflict detection | Usually yes |
Category 3: System-Level Failures
These are failures in the database engine or infrastructure:
| Cause | Description | Severity | Retriable? |
|---|---|---|---|
| Out of memory | Insufficient RAM for operation | High | After system recovery |
| Out of disk space | Transaction log full | High | After space freed |
| Network disconnect | Connection to client lost | Medium | Reconnect & retry |
| Statement timeout | Query exceeded time limit | Low | With modified query |
| Log I/O failure | Cannot write to transaction log | Critical | After disk recovery |
| Database shutdown | Ordered shutdown during txn | Medium | After restart |
```sql
-- Demonstrating various failure causes

-- Constraint Violation (Foreign Key)
BEGIN;
INSERT INTO orders (id, customer_id, amount)
VALUES (1, 99999, 100.00); -- customer 99999 doesn't exist
-- ERROR: insert or update on table "orders" violates foreign key constraint
-- Transaction is now in Failed state
ROLLBACK; -- Explicit rollback (or it happens automatically)

-- Unique Constraint Violation
BEGIN;
INSERT INTO users (email) VALUES ('alice@example.com');
INSERT INTO users (email) VALUES ('alice@example.com'); -- Duplicate!
-- ERROR: duplicate key value violates unique constraint "users_email_key"
ROLLBACK;

-- Check Constraint Violation
BEGIN;
UPDATE accounts SET balance = -100 WHERE id = 1;
-- ERROR: new row violates check constraint "positive_balance"
ROLLBACK;

-- Serialization Failure (requires two concurrent sessions, SERIALIZABLE)
-- Session 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
-- Don't commit yet...

-- Session 2:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
COMMIT; -- This succeeds

-- Back to Session 1:
COMMIT;
-- ERROR: could not serialize access due to read/write dependencies
-- Transaction failed due to serialization conflict
```

In autocommit mode, each statement is its own transaction. If a statement fails, it's automatically rolled back, but the session continues. This can lead to 'silent' partial failures in scripts where some statements succeed and others fail. Always check for errors after each statement in critical operations.
When a failure occurs, the database management system must detect it and initiate appropriate handling. The mechanisms vary by failure type.
Active Detection (DBMS-Initiated):
The DBMS actively monitors for certain failure conditions, such as deadlocks (via periodic wait-for graph analysis), lock-wait and statement timeouts (via timers), and resource exhaustion (memory, disk, and log space).
Passive Detection (Triggered by Operation):
Other failures are only detected when an operation is attempted: constraint violations, type mismatches, and serialization conflicts surface when the offending statement (or the COMMIT) executes.
```text
// Simplified deadlock detection algorithm
function detect_deadlocks() {
    // Build wait-for graph
    wait_graph = new DirectedGraph()
    for each transaction T:
        if T is waiting for lock L:
            holder = L.current_holder
            wait_graph.add_edge(T, holder)  // T waits for holder

    // Detect cycles using DFS
    cycles = wait_graph.find_cycles()
    if cycles is not empty:
        for each cycle in cycles:
            // Select a victim (various strategies possible)
            victim = select_victim(cycle)  // e.g., youngest, least work done

            // Transition victim to Failed state
            victim.state = FAILED
            victim.failure_reason = "DEADLOCK_VICTIM"

            // Wake up the victim to process its failure
            signal(victim)

            // The victim will perform rollback and enter Aborted state
            // This breaks the deadlock; other transactions can proceed
}

function select_victim(cycle) {
    // Strategy 1: Youngest transaction (easiest to retry)
    // Strategy 2: Transaction with least work done (minimize wasted work)
    // Strategy 3: Transaction with lowest priority (if priorities assigned)
    // Strategy 4: Transaction holding fewer locks (minimize impact)
    return cycle.transactions.min_by(txn => txn.start_timestamp)
}
```

After Failure Detection:
Once a failure is detected, the DBMS performs these steps: it marks the transaction as Failed, records the failure reason, reports an error to the client, rejects any further operations in the transaction, and initiates rollback into the Aborted state.
Different database systems provide varying levels of detail about failures. Capture error codes (SQLSTATE), error messages, and any additional context. This information is crucial for deciding whether to retry, how to log the error, and what to report to users.
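The triage decision can be sketched as a small helper keyed on the SQLSTATE code. The codes below are standard SQLSTATE values (documented by PostgreSQL, among others); the category names are this sketch's own convention, not a library API.

```python
# Illustrative SQLSTATE triage. Class is the first two characters of the code.
RETRIABLE = {'40001', '40P01'}        # serialization failure, deadlock victim
CONNECTION_CLASS = '08'               # class 08: connection exceptions
DATA_ERROR_CLASSES = {'22', '23'}     # data exception, integrity violation

def triage(sqlstate: str) -> str:
    """Decide how the application should respond to a failed transaction."""
    if sqlstate in RETRIABLE:
        return 'retry'                # transient concurrency conflict
    if sqlstate.startswith(CONNECTION_CLASS):
        return 'reconnect'            # get a fresh connection, then retry
    if sqlstate[:2] in DATA_ERROR_CLASSES:
        return 'fix-data'             # retrying identical data won't help
    return 'report'                   # log it and surface to the caller
```

A deadlock ('40P01') maps to 'retry', a dropped connection ('08006') to 'reconnect', and a duplicate key ('23505') to 'fix-data'.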
Robust applications must handle transaction failures gracefully. The key is to distinguish between different types of failures and respond appropriately.
Retry-Eligible Failures:
Some failures are transient and can be resolved by retrying: deadlock victims, lock timeouts, serialization failures, and dropped connections typically succeed on a later attempt because the conflicting condition has passed.
```python
import psycopg2
import time
import random

# SQLSTATE codes that indicate retriable errors (PostgreSQL)
RETRIABLE_ERRORS = {
    '40001',  # serialization_failure
    '40P01',  # deadlock_detected
    '55P03',  # lock_not_available (if using NOWAIT)
    '57014',  # query_cancelled (may be retriable)
    '08000',  # connection_exception
    '08003',  # connection_does_not_exist
    '08006',  # connection_failure
}

def execute_with_retry(connection_pool, operation, max_retries=3):
    """
    Execute a database operation with intelligent retry logic.

    :param connection_pool: Connection pool to get connections from
    :param operation: Callable that takes a connection and performs DB work
    :param max_retries: Maximum number of retry attempts
    """
    last_error = None

    for attempt in range(max_retries + 1):
        conn = None
        try:
            conn = connection_pool.getconn()

            # Execute the operation (which should manage its own transaction)
            result = operation(conn)

            # Success! Release connection and return
            connection_pool.putconn(conn)
            return result

        except psycopg2.Error as e:
            last_error = e
            sqlstate = e.pgcode or ''  # Get SQLSTATE code

            # Retriable error with attempts remaining: back off, then retry
            if sqlstate in RETRIABLE_ERRORS and attempt < max_retries:
                # Calculate backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Retriable error (SQLSTATE {sqlstate}), "
                      f"attempt {attempt + 1}/{max_retries + 1}, "
                      f"waiting {wait_time:.2f}s")

                if conn:
                    try:
                        conn.rollback()
                    except psycopg2.Error:
                        pass  # Connection may already be broken
                    # Return connection to pool, closing bad connections
                    connection_pool.putconn(conn, close=True)

                time.sleep(wait_time)
                continue  # Retry

            # Non-retriable error or max retries exceeded
            if conn:
                try:
                    conn.rollback()
                    connection_pool.putconn(conn)
                except psycopg2.Error:
                    pass
            raise  # Re-raise the exception

        except Exception:
            # Non-database errors (application logic errors)
            if conn:
                try:
                    conn.rollback()
                    connection_pool.putconn(conn)
                except psycopg2.Error:
                    pass
            raise

    raise last_error  # Defensive: should not be reached


# Example usage
def transfer_funds(conn, from_account, to_account, amount):
    """Business operation that should be retried on transient failures."""
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        # Lock source account first (consistent ordering prevents deadlock)
        cur.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            (from_account,)
        )
        row = cur.fetchone()
        if not row:
            raise ValueError(f"Account {from_account} not found")
        balance = row[0]
        if balance < amount:
            raise ValueError("Insufficient funds")

        # Perform transfer
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account)
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account)
        )
        conn.commit()
        return {"status": "success", "amount": amount}


# Call with retry logic
result = execute_with_retry(pool, lambda conn: transfer_funds(conn, 1, 2, 100.00))
```

Non-Retriable Failures:
Some failures indicate fundamental problems that retrying won't solve: constraint violations on unchanged data, type mismatches, missing rows, and application logic errors (such as insufficient funds) will fail identically every time until the data or the code is fixed.
Always implement: (1) Maximum retry limits to prevent infinite loops, (2) Exponential backoff to avoid thundering herd, (3) Jitter in wait times to spread retries, (4) Logging of retries for monitoring. Uncontrolled retries can make performance problems worse by adding load to an already stressed system.
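The four rules above can be sketched as a pair of small helpers (illustrative; the default parameter values are assumptions to tune for your workload):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Exponential backoff with full jitter.

    attempt 0 draws from [0, base], attempt 1 from [0, 2*base], and so on,
    capped at `cap` seconds. The jitter spreads concurrent retries out so
    clients don't stampede the database in lockstep (thundering herd).
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(attempt: int, max_retries: int = 5) -> bool:
    """Hard retry limit prevents infinite loops on persistent failures."""
    return attempt < max_retries
```

Logging each retry (attempt number, SQLSTATE, delay) on top of this gives you the monitoring signal the tip above calls for.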
Once a transaction enters the Failed state, it must be rolled back. Rollback reverses all changes made by the transaction, restoring the database to its state before the transaction began.
Rollback Mechanism:
Rollback uses the undo log records generated during the Active state:
```text
// Rollback algorithm for a failed transaction
function rollback_transaction(transaction T) {
    // PRE-CONDITION: T is in Failed state
    assert(T.state == FAILED)

    // Find all log records for this transaction
    log_records = get_log_records_for_transaction(T.id)

    // Process in reverse order (LIFO - undo from most recent to oldest)
    for each record in log_records.reverse():
        if record.type == INSERT:
            // Undo insert by deleting the row
            delete_row(record.table, record.row_id)
            write_compensation_log_record(T, "DELETE", record)
        elif record.type == DELETE:
            // Undo delete by re-inserting the row
            insert_row(record.table, record.old_values)
            write_compensation_log_record(T, "INSERT", record)
        elif record.type == UPDATE:
            // Undo update by restoring old values
            update_row(record.table, record.row_id, record.old_values)
            write_compensation_log_record(T, "UPDATE", record)

    // Write ABORT log record
    write_log_record(T, "ABORT")

    // Force log to ensure rollback is durable
    force_log_to_disk()

    // Release all locks held by this transaction
    release_all_locks(T)

    // Transition to Aborted state
    T.state = ABORTED

    // Clean up transaction resources
    deallocate_transaction_descriptor(T)
}
```

Compensation Log Records (CLRs):
While applying undo operations, the DBMS writes Compensation Log Records (CLRs). These are essential for recovery: if the system crashes mid-rollback, the CLRs tell the recovery process which undo operations have already been applied, so no undo is ever applied twice.
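The undo mechanism can be made concrete with a toy in-memory model (illustrative only; real engines, such as those following ARIES, track this through log sequence numbers, not Python lists):

```python
# Toy model: a "table" is a dict, the undo log remembers inverse operations,
# and `clr` records each completed undo, playing the role of CLRs.

def apply_with_undo(table: dict, op: str, key, value=None, undo_log=None):
    """Apply an operation to `table` and record how to undo it."""
    if op == 'INSERT':
        undo_log.append(('DELETE', key, None))
        table[key] = value
    elif op == 'UPDATE':
        undo_log.append(('UPDATE', key, table[key]))  # remember old value
        table[key] = value
    elif op == 'DELETE':
        undo_log.append(('INSERT', key, table[key]))
        del table[key]

def rollback(table: dict, undo_log: list, clr: list):
    """Undo in reverse (LIFO) order; each completed undo is appended to
    `clr` so a restart mid-rollback could skip work already done."""
    while undo_log:
        op, key, old = undo_log.pop()   # most recent change first
        if op == 'DELETE':
            del table[key]              # undo an INSERT
        elif op in ('UPDATE', 'INSERT'):
            table[key] = old            # restore old value / deleted row
        clr.append((op, key))           # compensation record
```

Running an UPDATE and an INSERT, then `rollback`, restores the table exactly to its starting contents, with one CLR entry per undone operation.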
Rollback Performance Considerations:
Rollback is often slower than commit because it must traverse the undo log in reverse, apply the inverse of every change, and write a compensation record for each one, all while continuing to hold the transaction's locks.
This asymmetry is another reason to design transactions to commit rather than abort.
| Resource | Commit | Rollback |
|---|---|---|
| Undo log processing | None | Full traversal in reverse |
| Data modifications | None (already done) | Inverse of all changes |
| Log records written | 1 (COMMIT) | N CLRs + 1 ABORT |
| Lock hold time | Released immediately | Held during rollback |
| Time complexity | O(1) for state change | O(n) where n = operations |
If you use savepoints within a transaction, you can ROLLBACK TO SAVEPOINT to undo only part of the work, keeping the rest. This partial rollback is faster than full transaction rollback and preserves completed work. Consider savepoints for complex transactions where partial failure is recoverable.
A transaction can enter the Failed state through either explicit action (ROLLBACK command) or implicit detection (error occurs). The outcome is the same, but the paths differ.
Explicit Rollback:
The application deliberately decides to abort the transaction:
```sql
-- Explicit rollback scenarios

-- Scenario 1: Application logic decides to abort
BEGIN;
UPDATE inventory SET quantity = quantity - 10 WHERE product_id = 100;
-- Check if any inventory becomes unreasonable
SELECT quantity FROM inventory WHERE quantity < 0;
-- If results returned, we don't want this transaction
ROLLBACK; -- Explicit decision to abort

-- Scenario 2: User cancellation
BEGIN;
INSERT INTO orders (...) VALUES (...);
-- User clicks "Cancel" in the UI
ROLLBACK; -- Application sends ROLLBACK in response

-- Scenario 3: Partial rollback with savepoint
BEGIN;
INSERT INTO batch_jobs (id, status) VALUES (1, 'PROCESSING');

SAVEPOINT before_items;
INSERT INTO batch_items (job_id, item) VALUES (1, 'item1');
INSERT INTO batch_items (job_id, item) VALUES (1, 'item2');
-- Oops, these items are wrong
ROLLBACK TO SAVEPOINT before_items;

-- Correct items
INSERT INTO batch_items (job_id, item) VALUES (1, 'correct_item1');
COMMIT; -- Job header and correct items committed
```

Implicit Rollback:
The DBMS automatically initiates rollback when an error occurs. Behavior varies by database and configuration:
| Database | On Statement Error | On Disconnect | Configuration Options |
|---|---|---|---|
| PostgreSQL | Transaction marked failed, must ROLLBACK | Automatic ROLLBACK | ON_ERROR_ROLLBACK in psql |
| MySQL | Statement rolled back, txn continues | Automatic ROLLBACK | autocommit behavior |
| SQL Server | Depends on XACT_ABORT setting | Automatic ROLLBACK | SET XACT_ABORT ON/OFF |
| Oracle | Statement rolled back, txn continues | Automatic ROLLBACK | Default behavior |
```sql
-- PostgreSQL: Entire transaction blocked on error

BEGIN;
INSERT INTO test (id) VALUES (1); -- Succeeds
INSERT INTO test (id) VALUES (1); -- Fails (duplicate key)
-- ERROR: duplicate key value violates unique constraint

-- Transaction is now in 'aborted' (Failed) state
-- Any further commands will fail:
SELECT * FROM test;
-- ERROR: current transaction is aborted, commands ignored until
--        end of transaction block

-- Must explicitly rollback to end the transaction
ROLLBACK;
```

Different databases handle errors differently within transactions. PostgreSQL is strict: one error fails the entire transaction. MySQL and Oracle by default only roll back the failed statement. SQL Server depends on XACT_ABORT. Always test your application's error handling with your specific database.
Monitoring transaction failures is essential for maintaining system health and identifying problems. Here's how to track failures across different database systems.
PostgreSQL: Failure Monitoring
```sql
-- View sessions with failed transactions
SELECT
    pid,
    usename,
    state,
    query_start,
    xact_start,
    backend_xid,
    query
FROM pg_stat_activity
WHERE state = 'idle in transaction (aborted)';
-- 'idle in transaction (aborted)' = transaction is in Failed state

-- Count rollbacks vs commits
SELECT
    datname,
    xact_commit,
    xact_rollback,
    ROUND(100.0 * xact_rollback /
          NULLIF(xact_commit + xact_rollback, 0), 2) AS rollback_percentage
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';

-- View conflict statistics (causes of failure)
SELECT
    datname,
    confl_tablespace,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts;

-- Enable logging of all errors
-- In postgresql.conf:
--   log_statement = 'all'
--   log_min_error_statement = 'error'
--   log_min_messages = 'warning'
```

Key Metrics to Monitor: the rollback percentage per database, the number of sessions stuck in 'idle in transaction (aborted)', and the conflict counters (especially deadlocks and snapshot conflicts) from pg_stat_database_conflicts.
Establish baselines for normal failure rates in your system, then alert when rates exceed thresholds. A sudden spike in rollbacks often indicates an application bug, configuration issue, or attack. Proactive monitoring catches problems before they become outages.
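A threshold check on those counters can be sketched like this (illustrative; assumes you periodically sample xact_commit / xact_rollback, e.g. from pg_stat_database, and the tolerance value is an assumption to tune):

```python
def rollback_percentage(commits: int, rollbacks: int) -> float:
    """Share of transactions that rolled back, as a percentage."""
    total = commits + rollbacks
    return 0.0 if total == 0 else 100.0 * rollbacks / total

def should_alert(commits: int, rollbacks: int,
                 baseline_pct: float, tolerance_pct: float = 5.0) -> bool:
    """Alert when the rollback rate exceeds the established baseline by
    more than `tolerance_pct` percentage points."""
    return rollback_percentage(commits, rollbacks) > baseline_pct + tolerance_pct
```

With a baseline of 8%, a sample of 90 commits and 10 rollbacks (10%) stays quiet, while 70 commits and 30 rollbacks (30%) trips the alert.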
We've thoroughly explored the Failed state, the transitional state for transactions that cannot complete successfully: what triggers it, how the DBMS detects failures and rolls back, and how applications should respond.
What's Next:
With the Failed state understood, we'll complete our exploration of the transaction state diagram with the Aborted state—the terminal state that represents a fully rolled-back transaction whose effects have been completely removed from the database.
You now have a comprehensive understanding of the Failed state—what causes transactions to fail, how the DBMS handles failures, and best practices for application-level error handling. This knowledge is essential for building robust applications that gracefully handle the inevitable failures in any real-world system.