Not every transaction reaches its intended destination. Errors occur. Constraints are violated. Deadlocks form. Systems fail. When any of these conditions prevent a transaction from completing successfully, it enters the Failed state.
The Failed state is not a permanent condition—it's a transitional state indicating that something has gone wrong and the transaction must be rolled back. From Failed, every transaction must proceed to the Aborted state, where all its changes are reversed.
Understanding the Failed state is critical because every real-world system experiences failures: applications must detect them, handle them correctly, and decide whether to retry.
By the end of this page, you will understand the formal definition of the Failed state, recognize the various causes that trigger transition to Failed, comprehend how the DBMS handles failures, learn best practices for application-level error handling, and understand the relationship between Failed and Aborted states.
Let's establish a precise understanding of what it means for a transaction to be in the Failed state.
Formal Definition:
A transaction T is in the Failed state if and only if it was previously Active or Partially Committed, a failure has been detected, and its rollback has not yet completed.
Using formal notation:
T ∈ Failed ⟺
(previous_state(T) ∈ {Active, PartiallyCommitted}) ∧
(failure_detected(T) = true) ∧
(rollback_complete(T) = false)
Key Characteristics of the Failed State: it is entered only when a failure is detected, no new operations are accepted, the transaction's locks are still held, and the only exit is rollback into the Aborted state.
These states are often confused. 'Failed' indicates that failure has been detected but rollback hasn't completed. 'Aborted' indicates that rollback is complete and the transaction's effects have been fully removed. Failed is transitional; Aborted is terminal.
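The distinction can be made concrete with a minimal state-machine sketch (illustrative Python, not any real engine's implementation): Failed is reachable only from Active or Partially Committed, and the only legal exit from Failed is Aborted.

```python
from enum import Enum, auto

class TxnState(Enum):
    ACTIVE = auto()
    PARTIALLY_COMMITTED = auto()
    COMMITTED = auto()
    FAILED = auto()
    ABORTED = auto()

# Legal transitions in the classic five-state transaction diagram
TRANSITIONS = {
    TxnState.ACTIVE: {TxnState.PARTIALLY_COMMITTED, TxnState.FAILED},
    TxnState.PARTIALLY_COMMITTED: {TxnState.COMMITTED, TxnState.FAILED},
    TxnState.FAILED: {TxnState.ABORTED},   # Failed is transitional
    TxnState.COMMITTED: set(),             # terminal
    TxnState.ABORTED: set(),               # terminal
}

class Transaction:
    def __init__(self):
        self.state = TxnState.ACTIVE

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"Illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Attempting any transition out of Aborted (or from Failed to anything but Aborted) raises an error, which mirrors the rule stated above.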
Transactions can fail for numerous reasons. Understanding these causes helps you design systems that handle failures gracefully and prevent avoidable failures.
Category 1: Application-Level Errors
These are errors that originate from the application or user actions:
| Cause | Description | Example | Retriable? |
|---|---|---|---|
| Explicit ROLLBACK | Application intentionally aborts | User cancels order | N/A (intentional) |
| Constraint violation | Data violates integrity constraints | Duplicate primary key | Sometimes |
| Check constraint failure | Business rule violated | Negative balance | If data fixed |
| Foreign key violation | Reference integrity broken | Invalid customer_id | If data fixed |
| Type mismatch | Wrong data type for column | Text in numeric field | If data fixed |
| Null constraint violation | NULL in NOT NULL column | Missing required field | If data fixed |
Category 2: Concurrency-Related Failures
These failures arise from interaction with other concurrent transactions:
| Cause | Description | Detection Method | Retriable? |
|---|---|---|---|
| Deadlock victim | Chosen as deadlock victim | Wait-for graph analysis | Usually yes |
| Lock timeout | Lock wait exceeded timeout | Timer expiration | Usually yes |
| Serialization failure | Cannot find serializable order | MVCC conflict detection | Usually yes |
| Snapshot too old | MVCC versions no longer available | Version check | Yes, immediate retry |
| Write skew detected | Anomaly in snapshot isolation | Conflict detection | Usually yes |
Category 3: System-Level Failures
These are failures in the database engine or infrastructure:
| Cause | Description | Severity | Retriable? |
|---|---|---|---|
| Out of memory | Insufficient RAM for operation | High | After system recovery |
| Out of disk space | Transaction log full | High | After space freed |
| Network disconnect | Connection to client lost | Medium | Reconnect & retry |
| Statement timeout | Query exceeded time limit | Low | With modified query |
| Log I/O failure | Cannot write to transaction log | Critical | After disk recovery |
| Database shutdown | Ordered shutdown during txn | Medium | After restart |
```sql
-- Demonstrating various failure causes

-- Constraint Violation (Foreign Key)
BEGIN;
INSERT INTO orders (id, customer_id, amount)
VALUES (1, 99999, 100.00); -- customer 99999 doesn't exist
-- ERROR: insert or update on table "orders" violates foreign key constraint
-- Transaction is now in Failed state
ROLLBACK; -- Explicit rollback (or it happens automatically)

-- Unique Constraint Violation
BEGIN;
INSERT INTO users (email) VALUES ('alice@example.com');
INSERT INTO users (email) VALUES ('alice@example.com'); -- Duplicate!
-- ERROR: duplicate key value violates unique constraint "users_email_key"
ROLLBACK;

-- Check Constraint Violation
BEGIN;
UPDATE accounts SET balance = -100 WHERE id = 1;
-- ERROR: new row violates check constraint "positive_balance"
ROLLBACK;

-- Serialization Failure (requires two concurrent sessions, SERIALIZABLE)
-- Session 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
-- Don't commit yet...

-- Session 2:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
COMMIT; -- This succeeds

-- Back to Session 1:
COMMIT;
-- ERROR: could not serialize access due to read/write dependencies
-- Transaction failed due to serialization conflict
```

In autocommit mode, each statement is its own transaction. If a statement fails, it's automatically rolled back, but the session continues. This can lead to 'silent' partial failures in scripts where some statements succeed and others fail. Always check for errors after each statement in critical operations.
When a failure occurs, the database management system must detect it and initiate appropriate handling. The mechanisms vary by failure type.
Active Detection (DBMS-Initiated):
The DBMS actively monitors for certain failure conditions, such as deadlocks (via periodic wait-for graph analysis), lock-wait and statement timeouts (via timers), and resource exhaustion (memory, disk, and log space).
Passive Detection (Triggered by Operation):
Other failures are only detected when an operation is attempted: constraint violations, type mismatches, and serialization conflicts surface when the offending statement (or the COMMIT) executes.
```text
// Simplified deadlock detection algorithm
function detect_deadlocks() {
    // Build wait-for graph
    wait_graph = new DirectedGraph()
    for each transaction T:
        if T is waiting for lock L:
            holder = L.current_holder
            wait_graph.add_edge(T, holder)  // T waits for holder

    // Detect cycles using DFS
    cycles = wait_graph.find_cycles()
    if cycles is not empty:
        for each cycle in cycles:
            // Select a victim (various strategies possible)
            victim = select_victim(cycle)  // e.g., youngest, least work done

            // Transition victim to Failed state
            victim.state = FAILED
            victim.failure_reason = "DEADLOCK_VICTIM"

            // Wake up the victim to process its failure
            signal(victim)

            // The victim will perform rollback and enter Aborted state
            // This breaks the deadlock; other transactions can proceed
}

function select_victim(cycle) {
    // Strategy 1: Youngest transaction (easiest to retry)
    // Strategy 2: Transaction with least work done (minimize wasted work)
    // Strategy 3: Transaction with lowest priority (if priorities assigned)
    // Strategy 4: Transaction holding fewer locks (minimize impact)
    return cycle.transactions.min_by(txn => txn.start_timestamp)
}
```

After Failure Detection:
Once a failure is detected, the DBMS performs these steps: it marks the transaction as Failed, records the failure reason, reports an error to the client, rejects any further operations in the transaction, and initiates rollback into the Aborted state.
Different database systems provide varying levels of detail about failures. Capture error codes (SQLSTATE), error messages, and any additional context. This information is crucial for deciding whether to retry, how to log the error, and what to report to users.
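The triage decision can be sketched as a small helper keyed on the SQLSTATE code. The codes below are standard SQLSTATE values (documented by PostgreSQL, among others); the category names are this sketch's own convention, not a library API.

```python
# Illustrative SQLSTATE triage. Class is the first two characters of the code.
RETRIABLE = {'40001', '40P01'}        # serialization failure, deadlock victim
CONNECTION_CLASS = '08'               # class 08: connection exceptions
DATA_ERROR_CLASSES = {'22', '23'}     # data exception, integrity violation

def triage(sqlstate: str) -> str:
    """Decide how the application should respond to a failed transaction."""
    if sqlstate in RETRIABLE:
        return 'retry'                # transient concurrency conflict
    if sqlstate.startswith(CONNECTION_CLASS):
        return 'reconnect'            # get a fresh connection, then retry
    if sqlstate[:2] in DATA_ERROR_CLASSES:
        return 'fix-data'             # retrying identical data won't help
    return 'report'                   # log it and surface to the caller
```

A deadlock ('40P01') maps to 'retry', a dropped connection ('08006') to 'reconnect', and a duplicate key ('23505') to 'fix-data'.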
Robust applications must handle transaction failures gracefully. The key is to distinguish between different types of failures and respond appropriately.
Retry-Eligible Failures:
Some failures are transient and can be resolved by retrying: deadlock victims, lock timeouts, serialization failures, and dropped connections typically succeed on a later attempt because the conflicting condition has passed.
```python
import psycopg2
import time
import random

# SQLSTATE codes that indicate retriable errors (PostgreSQL)
RETRIABLE_ERRORS = {
    '40001',  # serialization_failure
    '40P01',  # deadlock_detected
    '55P03',  # lock_not_available (if using NOWAIT)
    '57014',  # query_cancelled (may be retriable)
    '08000',  # connection_exception
    '08003',  # connection_does_not_exist
    '08006',  # connection_failure
}

def execute_with_retry(connection_pool, operation, max_retries=3):
    """
    Execute a database operation with intelligent retry logic.

    :param connection_pool: Connection pool to get connections from
    :param operation: Callable that takes a connection and performs DB work
    :param max_retries: Maximum number of retry attempts
    """
    last_error = None

    for attempt in range(max_retries + 1):
        conn = None
        try:
            conn = connection_pool.getconn()

            # Execute the operation (which should manage its own transaction)
            result = operation(conn)

            # Success! Release connection and return
            connection_pool.putconn(conn)
            return result

        except psycopg2.Error as e:
            last_error = e
            sqlstate = e.pgcode or ''  # Get SQLSTATE code

            # Retriable error with attempts remaining: back off, then retry
            if sqlstate in RETRIABLE_ERRORS and attempt < max_retries:
                # Calculate backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Retriable error (SQLSTATE {sqlstate}), "
                      f"attempt {attempt + 1}/{max_retries + 1}, "
                      f"waiting {wait_time:.2f}s")

                if conn:
                    try:
                        conn.rollback()
                    except psycopg2.Error:
                        pass  # Connection may already be broken
                    # Return connection to pool, closing bad connections
                    connection_pool.putconn(conn, close=True)

                time.sleep(wait_time)
                continue  # Retry

            # Non-retriable error or max retries exceeded
            if conn:
                try:
                    conn.rollback()
                    connection_pool.putconn(conn)
                except psycopg2.Error:
                    pass
            raise  # Re-raise the exception

        except Exception:
            # Non-database errors (application logic errors)
            if conn:
                try:
                    conn.rollback()
                    connection_pool.putconn(conn)
                except psycopg2.Error:
                    pass
            raise

    raise last_error  # Defensive: should not be reached


# Example usage
def transfer_funds(conn, from_account, to_account, amount):
    """Business operation that should be retried on transient failures."""
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        # Lock source account first (consistent ordering prevents deadlock)
        cur.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            (from_account,)
        )
        row = cur.fetchone()
        if not row:
            raise ValueError(f"Account {from_account} not found")
        balance = row[0]
        if balance < amount:
            raise ValueError("Insufficient funds")

        # Perform transfer
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account)
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account)
        )
        conn.commit()
        return {"status": "success", "amount": amount}


# Call with retry logic
result = execute_with_retry(pool, lambda conn: transfer_funds(conn, 1, 2, 100.00))
```

Non-Retriable Failures:
Some failures indicate fundamental problems that retrying won't solve: constraint violations on unchanged data, type mismatches, missing rows, and application logic errors (such as insufficient funds) will fail identically every time until the data or the code is fixed.
Always implement: (1) Maximum retry limits to prevent infinite loops, (2) Exponential backoff to avoid thundering herd, (3) Jitter in wait times to spread retries, (4) Logging of retries for monitoring. Uncontrolled retries can make performance problems worse by adding load to an already stressed system.
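The four rules above can be sketched as a pair of small helpers (illustrative; the default parameter values are assumptions to tune for your workload):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Exponential backoff with full jitter.

    attempt 0 draws from [0, base], attempt 1 from [0, 2*base], and so on,
    capped at `cap` seconds. The jitter spreads concurrent retries out so
    clients don't stampede the database in lockstep (thundering herd).
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(attempt: int, max_retries: int = 5) -> bool:
    """Hard retry limit prevents infinite loops on persistent failures."""
    return attempt < max_retries
```

Logging each retry (attempt number, SQLSTATE, delay) on top of this gives you the monitoring signal the tip above calls for.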
Once a transaction enters the Failed state, it must be rolled back. Rollback reverses all changes made by the transaction, restoring the database to its state before the transaction began.
Rollback Mechanism:
Rollback uses the undo log records generated during the Active state:
```text
// Rollback algorithm for a failed transaction
function rollback_transaction(transaction T) {
    // PRE-CONDITION: T is in Failed state
    assert(T.state == FAILED)

    // Find all log records for this transaction
    log_records = get_log_records_for_transaction(T.id)

    // Process in reverse order (LIFO - undo from most recent to oldest)
    for each record in log_records.reverse():
        if record.type == INSERT:
            // Undo insert by deleting the row
            delete_row(record.table, record.row_id)
            write_compensation_log_record(T, "DELETE", record)
        elif record.type == DELETE:
            // Undo delete by re-inserting the row
            insert_row(record.table, record.old_values)
            write_compensation_log_record(T, "INSERT", record)
        elif record.type == UPDATE:
            // Undo update by restoring old values
            update_row(record.table, record.row_id, record.old_values)
            write_compensation_log_record(T, "UPDATE", record)

    // Write ABORT log record
    write_log_record(T, "ABORT")

    // Force log to ensure rollback is durable
    force_log_to_disk()

    // Release all locks held by this transaction
    release_all_locks(T)

    // Transition to Aborted state
    T.state = ABORTED

    // Clean up transaction resources
    deallocate_transaction_descriptor(T)
}
```

Compensation Log Records (CLRs):
While applying undo operations, the DBMS writes Compensation Log Records (CLRs). These are essential for recovery: if the system crashes mid-rollback, the CLRs tell the recovery process which undo operations have already been applied, so no undo is ever applied twice.
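The undo mechanism can be made concrete with a toy in-memory model (illustrative only; real engines, such as those following ARIES, track this through log sequence numbers, not Python lists):

```python
# Toy model: a "table" is a dict, the undo log remembers inverse operations,
# and `clr` records each completed undo, playing the role of CLRs.

def apply_with_undo(table: dict, op: str, key, value=None, undo_log=None):
    """Apply an operation to `table` and record how to undo it."""
    if op == 'INSERT':
        undo_log.append(('DELETE', key, None))
        table[key] = value
    elif op == 'UPDATE':
        undo_log.append(('UPDATE', key, table[key]))  # remember old value
        table[key] = value
    elif op == 'DELETE':
        undo_log.append(('INSERT', key, table[key]))
        del table[key]

def rollback(table: dict, undo_log: list, clr: list):
    """Undo in reverse (LIFO) order; each completed undo is appended to
    `clr` so a restart mid-rollback could skip work already done."""
    while undo_log:
        op, key, old = undo_log.pop()   # most recent change first
        if op == 'DELETE':
            del table[key]              # undo an INSERT
        elif op in ('UPDATE', 'INSERT'):
            table[key] = old            # restore old value / deleted row
        clr.append((op, key))           # compensation record
```

Running an UPDATE and an INSERT, then `rollback`, restores the table exactly to its starting contents, with one CLR entry per undone operation.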
Rollback Performance Considerations:
Rollback is often slower than commit because it must traverse the undo log in reverse, apply the inverse of every change, and write a compensation record for each one, all while continuing to hold the transaction's locks.
This asymmetry is another reason to design transactions to commit rather than abort.
| Resource | Commit | Rollback |
|---|---|---|
| Undo log processing | None | Full traversal in reverse |
| Data modifications | None (already done) | Inverse of all changes |
| Log records written | 1 (COMMIT) | N CLRs + 1 ABORT |
| Lock hold time | Released immediately | Held during rollback |
| Time complexity | O(1) for state change | O(n) where n = operations |
If you use savepoints within a transaction, you can ROLLBACK TO SAVEPOINT to undo only part of the work, keeping the rest. This partial rollback is faster than full transaction rollback and preserves completed work. Consider savepoints for complex transactions where partial failure is recoverable.
A transaction can enter the Failed state through either explicit action (ROLLBACK command) or implicit detection (error occurs). The outcome is the same, but the paths differ.
Explicit Rollback:
The application deliberately decides to abort the transaction:
```sql
-- Explicit rollback scenarios

-- Scenario 1: Application logic decides to abort
BEGIN;
UPDATE inventory SET quantity = quantity - 10 WHERE product_id = 100;
-- Check if any inventory becomes unreasonable
SELECT quantity FROM inventory WHERE quantity < 0;
-- If results returned, we don't want this transaction
ROLLBACK; -- Explicit decision to abort

-- Scenario 2: User cancellation
BEGIN;
INSERT INTO orders (...) VALUES (...);
-- User clicks "Cancel" in the UI
ROLLBACK; -- Application sends ROLLBACK in response

-- Scenario 3: Partial rollback with savepoint
BEGIN;
INSERT INTO batch_jobs (id, status) VALUES (1, 'PROCESSING');

SAVEPOINT before_items;
INSERT INTO batch_items (job_id, item) VALUES (1, 'item1');
INSERT INTO batch_items (job_id, item) VALUES (1, 'item2');
-- Oops, these items are wrong
ROLLBACK TO SAVEPOINT before_items;

-- Correct items
INSERT INTO batch_items (job_id, item) VALUES (1, 'correct_item1');
COMMIT; -- Job header and correct items committed
```

Implicit Rollback:
The DBMS automatically initiates rollback when an error occurs. Behavior varies by database and configuration:
| Database | On Statement Error | On Disconnect | Configuration Options |
|---|---|---|---|
| PostgreSQL | Transaction marked failed, must ROLLBACK | Automatic ROLLBACK | ON_ERROR_ROLLBACK in psql |
| MySQL | Statement rolled back, txn continues | Automatic ROLLBACK | autocommit behavior |
| SQL Server | Depends on XACT_ABORT setting | Automatic ROLLBACK | SET XACT_ABORT ON/OFF |
| Oracle | Statement rolled back, txn continues | Automatic ROLLBACK | Default behavior |
```sql
-- PostgreSQL: Entire transaction blocked on error

BEGIN;
INSERT INTO test (id) VALUES (1); -- Succeeds
INSERT INTO test (id) VALUES (1); -- Fails (duplicate key)
-- ERROR: duplicate key value violates unique constraint

-- Transaction is now in 'aborted' (Failed) state
-- Any further commands will fail:
SELECT * FROM test;
-- ERROR: current transaction is aborted, commands ignored until
--        end of transaction block

-- Must explicitly rollback to end the transaction
ROLLBACK;
```

Different databases handle errors differently within transactions. PostgreSQL is strict: one error fails the entire transaction. MySQL and Oracle by default only roll back the failed statement. SQL Server depends on XACT_ABORT. Always test your application's error handling with your specific database.
Monitoring transaction failures is essential for maintaining system health and identifying problems. Here's how to track failures across different database systems.
PostgreSQL: Failure Monitoring
```sql
-- View sessions with failed transactions
SELECT
    pid,
    usename,
    state,
    query_start,
    xact_start,
    backend_xid,
    query
FROM pg_stat_activity
WHERE state = 'idle in transaction (aborted)';
-- 'idle in transaction (aborted)' = transaction is in Failed state

-- Count rollbacks vs commits
SELECT
    datname,
    xact_commit,
    xact_rollback,
    ROUND(100.0 * xact_rollback /
          NULLIF(xact_commit + xact_rollback, 0), 2) AS rollback_percentage
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';

-- View conflict statistics (causes of failure)
SELECT
    datname,
    confl_tablespace,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts;

-- Enable logging of all errors
-- In postgresql.conf:
--   log_statement = 'all'
--   log_min_error_statement = 'error'
--   log_min_messages = 'warning'
```

Key Metrics to Monitor: the rollback percentage per database, the number of sessions stuck in 'idle in transaction (aborted)', and the conflict counters (especially deadlocks and snapshot conflicts) from pg_stat_database_conflicts.
Establish baselines for normal failure rates in your system, then alert when rates exceed thresholds. A sudden spike in rollbacks often indicates an application bug, configuration issue, or attack. Proactive monitoring catches problems before they become outages.
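A threshold check on those counters can be sketched like this (illustrative; assumes you periodically sample xact_commit / xact_rollback, e.g. from pg_stat_database, and the tolerance value is an assumption to tune):

```python
def rollback_percentage(commits: int, rollbacks: int) -> float:
    """Share of transactions that rolled back, as a percentage."""
    total = commits + rollbacks
    return 0.0 if total == 0 else 100.0 * rollbacks / total

def should_alert(commits: int, rollbacks: int,
                 baseline_pct: float, tolerance_pct: float = 5.0) -> bool:
    """Alert when the rollback rate exceeds the established baseline by
    more than `tolerance_pct` percentage points."""
    return rollback_percentage(commits, rollbacks) > baseline_pct + tolerance_pct
```

With a baseline of 8%, a sample of 90 commits and 10 rollbacks (10%) stays quiet, while 70 commits and 30 rollbacks (30%) trips the alert.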
We've thoroughly explored the Failed state, the transitional state for transactions that cannot complete successfully: what triggers it, how the DBMS detects failures and rolls back, and how applications should respond.
What's Next:
With the Failed state understood, we'll complete our exploration of the transaction state diagram with the Aborted state—the terminal state that represents a fully rolled-back transaction whose effects have been completely removed from the database.
You now have a comprehensive understanding of the Failed state—what causes transactions to fail, how the DBMS handles failures, and best practices for application-level error handling. This knowledge is essential for building robust applications that gracefully handle the inevitable failures in any real-world system.