Understanding Database Failure Types

Level: Intermediate

Duration: 60 mins

Topic: Failure Types

Transaction Failure

When Transactions Go Wrong

Every database professional eventually faces a humbling truth: failures are not exceptions—they are inevitabilities. In a world of perfect hardware, flawless software, and infallible networks, databases wouldn't need recovery mechanisms. But we don't live in that world.

Transaction failures represent the most frequent and granular form of database failure. They occur when an individual transaction cannot complete its intended operations, requiring the database management system to intervene, undo partial work, and restore consistency. Understanding transaction failures is the first step toward mastering database reliability.

This page provides an exhaustive examination of transaction failures—why they happen, how they manifest, what damage they can cause, and how database systems are designed to handle them gracefully.

What You Will Learn

By the end of this page, you will understand the anatomy of transaction failures, their root causes across logical, application, and system boundaries, and how the DBMS detects and responds to them. You will gain insight into why transaction failures, while disruptive, are actually the 'safest' failure mode and how proper handling prevents data corruption.

Defining Transaction Failure

A transaction failure occurs when a transaction cannot complete its execution and reach the committed state. The transaction may have performed some operations, modified some data, acquired some locks—but ultimately, it cannot successfully finish. The database must intervene to undo any partial changes and restore the database to its state before the transaction began.

Formal Definition:

A transaction T is said to have failed if:

T has begun execution (entered the Active state)
T cannot transition to the Committed state
T must transition to the Aborted state
All effects of T must be rolled back (undone)

This definition captures the essential nature of transaction failure: it's not merely that something went wrong—it's that the transaction's partial work must be erased as if the transaction never happened.

The ACID Connection

Transaction failure is intimately connected to the Atomicity property of ACID. Atomicity demands that a transaction is 'all or nothing'—either all operations complete successfully, or none of them take effect. When a transaction fails, atomicity requires that we undo any partial work to maintain this guarantee.
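The all-or-nothing behavior can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not a reference implementation: the accounts table, its CHECK constraint, and the function names are all invented for the example. The credit succeeds, the debit violates the constraint, and the rollback erases the credit too.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Credit dst, then debit src; both updates persist or neither does."""
    try:
        # sqlite3 opens a transaction implicitly before the first UPDATE.
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # undo the credit that had already been applied
        raise


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    try:
        transfer(conn, src=2, dst=1, amount=80)  # account 2 only holds 50
    except sqlite3.IntegrityError:
        pass  # the transaction failed and was rolled back
    return conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()


if __name__ == "__main__":
    print(demo())
```

After the failed transfer both balances are unchanged, even though the first UPDATE had already executed when the second one was rejected.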

Transaction Failure in the State Transition Model:

Recall the transaction state diagram:

Active → Partially Committed → Committed
   ↓              ↓
 Failed ←─────────┘
   ↓
Aborted

A transaction failure represents a transition from either the Active or Partially Committed state to the Failed state, and subsequently to the Aborted state. The distinction between failing from Active versus Partially Committed is significant:

Failure from Active: The transaction was still executing operations when failure occurred. Some writes may have been performed but not yet finalized.
Failure from Partially Committed: The transaction completed all its operations and was in the process of committing when failure occurred. This is more complex because the transaction was almost successful.

Transaction Failure Transition Points
Source State	Trigger	Destination	Recovery Complexity
Active	Logical error, constraint violation, explicit rollback	Failed → Aborted	Lower (less work to undo)
Active	System resource exhaustion, deadlock victim selection	Failed → Aborted	Moderate (may have acquired resources)
Partially Committed	Commit processing failure, log write failure	Failed → Aborted	Higher (all operations complete)
Partially Committed	Final constraint check failure	Failed → Aborted	Highest (near-complete transaction)
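The state diagram and transition table above can be expressed as a small state machine. This is a teaching sketch only: the class and method names are hypothetical, and a real DBMS tracks far more per-transaction state.

```python
from enum import Enum


class TxState(Enum):
    ACTIVE = "active"
    PARTIALLY_COMMITTED = "partially committed"
    COMMITTED = "committed"
    FAILED = "failed"
    ABORTED = "aborted"


# Legal transitions, copied from the state diagram.
ALLOWED = {
    TxState.ACTIVE: {TxState.PARTIALLY_COMMITTED, TxState.FAILED},
    TxState.PARTIALLY_COMMITTED: {TxState.COMMITTED, TxState.FAILED},
    TxState.FAILED: {TxState.ABORTED},
    TxState.COMMITTED: set(),  # terminal
    TxState.ABORTED: set(),    # terminal
}


class Transaction:
    def __init__(self) -> None:
        self.state = TxState.ACTIVE
        self.history = [self.state]

    def move_to(self, new_state: TxState) -> None:
        """Apply one transition, rejecting any the diagram does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

    def fail(self) -> None:
        """A failure always passes through Failed on its way to Aborted."""
        self.move_to(TxState.FAILED)
        self.move_to(TxState.ABORTED)


if __name__ == "__main__":
    t = Transaction()
    t.move_to(TxState.PARTIALLY_COMMITTED)
    t.fail()
    print([s.value for s in t.history])
```

Note that a committed transaction has no outgoing transitions: once Committed is reached, the transaction can no longer fail in the sense defined here.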

Causes of Transaction Failure

Transaction failures arise from a diverse set of causes, spanning logical errors, application bugs, constraint violations, resource limitations, and deliberate interventions. Understanding these causes is essential for building robust applications and configuring database systems appropriately.

We categorize transaction failure causes into five major groups:

Major Categories of Transaction Failure Causes

•Logical Errors — Bugs in application code or incorrect SQL statements that make transaction completion impossible
•Constraint Violations — Attempts to violate database integrity constraints (primary key, foreign key, check, unique)
•User-Initiated Aborts — Explicit ROLLBACK commands issued by the application or user
•Concurrency Control Aborts — Deadlock resolution, timeout expiration, or serialization failures
•Resource Exhaustion — Insufficient memory, disk space, or other system resources

2.1 Logical Errors in Application Code

Logical errors are perhaps the most common cause of transaction failures. These occur when the application's logic is flawed, leading to impossible operations or invalid data states.

Examples of Logical Errors:

Arithmetic Errors:
- Division by zero when calculating averages or ratios
- Numeric overflow when summing large values
- Precision loss leading to invalid monetary calculations
Data Type Mismatches:
- Attempting to insert a string into a numeric column
- Date format parsing failures
- Character encoding issues
Missing Data Errors:
- Referencing a variable that was never initialized
- Attempting to use a NULL value in an expression requiring a value
- Fetching from an empty cursor
Logic Flow Errors:
- Infinite loops consuming all resources
- Incorrect conditional logic leading to impossible states
- Race conditions in application-level concurrency

logical_error_examples.sql
-- Example 1: Division by zero causes transaction failure
BEGIN TRANSACTION;
    UPDATE accounts SET interest_rate = total_interest / num_years
    WHERE account_id = 12345;
    -- If num_years = 0, this transaction will fail
COMMIT;
 
-- Example 2: Invalid data type causes failure
BEGIN TRANSACTION;
    INSERT INTO employees (emp_id, hire_date, salary)
    VALUES (1001, 'not-a-valid-date', 75000);
    -- The date parsing will fail, aborting the transaction
COMMIT;
 
-- Example 3: Overflow error
BEGIN TRANSACTION;
    -- If total_population is BIGINT, adding the maximum BIGINT value to any positive total overflows
    UPDATE countries SET total_population = total_population + 9223372036854775807;
    -- Arithmetic overflow will cause failure
COMMIT;

2.2 Constraint Violations

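Database constraints are designed to protect data integrity. When a statement would violate a constraint, the DBMS rejects that statement. What happens to the surrounding transaction depends on the system: PostgreSQL marks the whole transaction as aborted and ignores further commands until it is rolled back (or rolled back to a savepoint), while systems such as MySQL (InnoDB) and Oracle undo only the failing statement and leave it to the application to roll back or continue. Either way, a transaction that cannot proceed without the rejected operation has failed and must be rolled back.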

Types of Constraint Violations:

Database Constraint Violations
Constraint Type	Violation Example	DBMS Response
PRIMARY KEY	Inserting a duplicate primary key value	Reject insert, fail transaction
FOREIGN KEY	Inserting a reference to non-existent parent row	Reject insert, fail transaction
UNIQUE	Inserting duplicate value in unique column	Reject insert, fail transaction
CHECK	Inserting value that violates CHECK condition	Reject insert, fail transaction
NOT NULL	Inserting NULL into NOT NULL column	Reject insert, fail transaction
DOMAIN	Inserting value outside defined domain	Reject insert, fail transaction

constraint_violation_examples.sql
-- Example: Foreign Key Violation
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    order_date DATE
);
 
BEGIN TRANSACTION;
    -- This will fail if customer 99999 doesn't exist
    INSERT INTO orders (order_id, customer_id, order_date)
    VALUES (1001, 99999, CURRENT_DATE);
    -- ERROR: insert or update on table "orders" violates foreign key constraint
COMMIT; -- Not reached if the client stops at the error; in PostgreSQL, COMMIT on a failed transaction acts as ROLLBACK
 
-- Example: Check Constraint Violation
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    price DECIMAL(10,2) CHECK (price > 0),
    quantity INT CHECK (quantity >= 0)
);
 
BEGIN TRANSACTION;
    -- Negative price violates CHECK constraint
    INSERT INTO products (product_id, price, quantity)
    VALUES (1, -50.00, 100);
    -- Transaction fails
COMMIT;
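From the application's side, a constraint violation surfaces as an exception that must be followed by a rollback. The sketch below reproduces the two SQL examples with Python's built-in sqlite3 module. SQLite is used only because it ships with Python. The helper names are hypothetical, and the exact error text differs between database systems and versions.

```python
import sqlite3
from typing import Optional, Tuple


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> Optional[str]:
    """Run one INSERT in its own transaction.

    Returns None on success, or the error message after rolling back.
    """
    try:
        conn.execute(sql, params)
        conn.commit()
        return None
    except sqlite3.IntegrityError as exc:
        conn.rollback()
        return str(exc)


def demo() -> Tuple[Optional[str], Optional[str], int]:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(customer_id))"
    )
    conn.execute(
        "CREATE TABLE products ("
        "product_id INTEGER PRIMARY KEY, "
        "price REAL CHECK (price > 0))"
    )
    fk_error = try_insert(conn, "INSERT INTO orders VALUES (?, ?)", (1001, 99999))
    check_error = try_insert(conn, "INSERT INTO products VALUES (?, ?)", (1, -50.0))
    stored = (
        conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        + conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    )
    return fk_error, check_error, stored


if __name__ == "__main__":
    print(demo())
```

Both inserts are rejected and neither table gains a row, which is the outcome the constraints exist to guarantee.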

2.3 User-Initiated and Application-Initiated Aborts

Not all transaction failures are accidental. Applications often deliberately abort transactions:

Validation Failures: After performing reads, the application determines that proceeding is inappropriate
Business Rule Violations: The application detects a business logic violation that the database constraints don't enforce
User Cancellation: The user cancels an operation (e.g., clicks 'Cancel' during checkout)
Exception Handling: The application catches an exception and decides to rollback

These deliberate aborts are healthy—they show the system working correctly. The alternative (committing invalid state) would be far worse.

deliberate_abort_example.py
# Example: Application-Initiated Abort
# Assumes a DB-API 2.0 connection (for example psycopg2) with autocommit off,
# so the driver opens a transaction implicitly on the first statement.

class InsufficientFundsError(Exception):
    """The source account cannot cover the transfer."""

class DailyLimitExceededError(Exception):
    """The transfer would push the account past its daily limit."""

def transfer_funds(conn, from_account, to_account, amount):
    cursor = conn.cursor()
    try:
        # Step 1: Lock and read the source account balance
        cursor.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            (from_account,)
        )
        row = cursor.fetchone()
        if row is None:
            raise LookupError(f"Account {from_account} does not exist")
        balance = row[0]
        
        # Step 2: Business rule validation
        if balance < amount:
            # Insufficient funds - deliberately abort (rolled back below)
            raise InsufficientFundsError(
                f"Balance {balance} is less than transfer amount {amount}"
            )
        
        # Step 3: Additional business rules
        if amount > 10000:
            # Check for daily limit exceeded
            cursor.execute(
                "SELECT SUM(amount) FROM transfers "
                "WHERE from_account = %s AND transfer_date = CURRENT_DATE",
                (from_account,)
            )
            daily_total = cursor.fetchone()[0] or 0
            if daily_total + amount > 50000:
                raise DailyLimitExceededError("Daily transfer limit exceeded")
        
        # Step 4: Perform the transfer
        cursor.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account)
        )
        cursor.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account)
        )
        
        conn.commit()
        return True
        
    except Exception:
        conn.rollback()  # Single rollback point for every failure path
        raise
    finally:
        cursor.close()

2.4 Concurrency Control Aborts

Database concurrency control mechanisms may abort transactions to maintain serializability, resolve deadlocks, or enforce timeouts:

Deadlock Resolution: When two or more transactions form a cycle of waiting (each holding a resource the other needs), the DBMS must choose a victim to abort, breaking the deadlock.

Serialization Failures: In optimistic concurrency control or serializable snapshot isolation, a transaction may be aborted if committing it would violate serializability.

Lock Timeouts: If a transaction waits too long for a lock (exceeding the configured timeout), it may be aborted to prevent indefinite blocking.

Timestamp Ordering Violations: In timestamp-based protocols, transactions may be aborted if their operations would violate the timestamp order.
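The timestamp ordering rule can be sketched in a few lines. This is the textbook basic protocol without refinements such as the Thomas write rule. The Item class and exception name are hypothetical, and timestamps are plain integers assigned when each transaction starts.

```python
from dataclasses import dataclass


class TimestampOrderViolation(Exception):
    """The requesting transaction must be aborted and restarted."""


@dataclass
class Item:
    value: int = 0
    read_ts: int = 0   # largest timestamp of any transaction that read the item
    write_ts: int = 0  # largest timestamp of any transaction that wrote the item


def read(item: Item, ts: int) -> int:
    # A transaction may not read a value written by a younger transaction.
    if ts < item.write_ts:
        raise TimestampOrderViolation(f"read at ts={ts} arrives after write at ts={item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item: Item, ts: int, value: int) -> None:
    # A write is too late if a younger transaction already read or wrote the item.
    if ts < item.read_ts or ts < item.write_ts:
        raise TimestampOrderViolation(f"write at ts={ts} arrives too late")
    item.value = value
    item.write_ts = ts


if __name__ == "__main__":
    x = Item()
    write(x, ts=5, value=10)
    print(read(x, ts=7))
    try:
        write(x, ts=6, value=99)  # a transaction with ts=7 has already read x
    except TimestampOrderViolation as exc:
        print("abort:", exc)
```

The aborted transaction has done nothing wrong in isolation. Its operation simply arrived after a conflicting operation from a younger transaction, so it is restarted with a new timestamp.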

Deadlock Victims Are Not at Fault

When a transaction is chosen as a deadlock victim and aborted, it hasn't necessarily done anything wrong. It was simply unlucky in the timing of its lock requests. Well-designed applications should catch this abort, wait briefly, and retry the transaction. The abort is a necessary evil to keep the system functioning.

2.5 Resource Exhaustion

Transactions may fail when system resources are insufficient:

Memory Exhaustion: The transaction requires more memory than available (for sorting, hash joins, temporary results)
Disk Space Exhaustion: No space for temporary files, log records, or data pages
Connection Limits: Maximum database connections reached
Lock Table Overflow: Too many locks held, lock manager capacity exceeded
Undo Log Space: Undo tablespace full, cannot record more undo information

Resource exhaustion failures are particularly dangerous because they can cascade—one transaction's failure may trigger others as the system struggles to recover resources.

Transaction Failure Detection

The DBMS must detect transaction failures promptly to minimize damage and begin recovery. Detection mechanisms vary based on the failure type:

Synchronous Detection (Immediate): Most failures are detected synchronously when the failing operation is attempted:

Constraint violations are detected when the INSERT/UPDATE/DELETE is processed
Arithmetic errors are detected when the expression is evaluated
Data type errors are detected when parsing the value
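Deadlocks are detected by the lock manager's deadlock detector, either when the blocking lock request is made or, in many systems, after a short wait or on a periodic check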

Asynchronous Detection (Delayed): Some failures are detected asynchronously:

Client disconnection may not be detected until the next I/O operation
Resource exhaustion may be detected by periodic monitoring
Timeouts require timer expiration

Detection Mechanisms by Component

•Query Executor — Detects runtime errors (division by zero, overflow, type mismatches)
•Constraint Subsystem — Detects integrity constraint violations
•Lock Manager — Detects deadlocks via wait-for graph analysis; enforces timeouts
•Buffer Manager — Detects memory exhaustion
•Storage Manager — Detects disk space exhaustion, I/O errors
•Connection Manager — Detects client disconnections, session timeouts
•Transaction Manager — Coordinates failure handling from all components
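The wait-for graph analysis attributed to the lock manager above amounts to cycle detection in a directed graph. The sketch below assumes the graph is given as a dictionary mapping each transaction to the set of transactions it is waiting on. The victim policy (abort whichever transaction has done the least work) is one plausible choice for illustration, not the rule any particular DBMS uses.

```python
from typing import Dict, List, Optional, Set


def find_cycle(wait_for: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle in the wait-for graph as a list of transactions, or None."""
    path: List[str] = []    # transactions on the current depth-first path
    done: Set[str] = set()  # transactions already proven cycle-free

    def dfs(node: str) -> Optional[List[str]]:
        if node in path:
            return path[path.index(node):]  # walked back onto the path: a cycle
        if node in done:
            return None
        path.append(node)
        for blocker in sorted(wait_for.get(node, ())):
            cycle = dfs(blocker)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for start in sorted(wait_for):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


def choose_victim(cycle: List[str], work_done: Dict[str, int]) -> str:
    """Hypothetical policy: abort the transaction that loses the least work."""
    return min(cycle, key=lambda txn: (work_done.get(txn, 0), txn))


if __name__ == "__main__":
    graph = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}, "T4": {"T1"}}
    cycle = find_cycle(graph)
    print(cycle, choose_victim(cycle, {"T1": 50, "T2": 5, "T3": 20}))
```

T4 waits on a member of the cycle but is not part of it, so aborting one of T1, T2, or T3 is enough to let everything proceed.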

The Error Propagation Path:

When a failure is detected, the error information must propagate through the system:

Error Origin → Query Executor → Transaction Manager → Recovery Manager → Log Writer
                    ↓                    ↓
             Application             Lock Manager
             (error code)            (release locks)

Error Origin: The component detecting the failure raises an internal error
Query Executor: Halts query execution, prevents further operations
Transaction Manager: Marks transaction as failed, initiates abort sequence
Recovery Manager: Reads log records, executes undo operations
Lock Manager: Releases all locks held by the failed transaction
Application: Receives error code/message for appropriate handling

Error Codes Matter

Different database systems use different error coding schemes. PostgreSQL uses SQLSTATE codes (e.g., '23503' for foreign key violation). MySQL uses error numbers (e.g., 1452). Oracle uses ORA-codes. Knowing your database's error codes helps you write robust error handling in applications.
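As a small illustration of working with such codes: the first two characters of a five-character SQLSTATE identify its class, which is often enough to decide how an application should react. The class names below follow the PostgreSQL error code listing. The function name and the "unclassified" fallback are assumptions made for this sketch.

```python
# SQLSTATE class (first two characters) -> meaning, for a few classes
# that matter most when handling transaction failures.
SQLSTATE_CLASSES = {
    "22": "data exception",
    "23": "integrity constraint violation",
    "40": "transaction rollback",
    "53": "insufficient resources",
    "57": "operator intervention",
}


def describe_sqlstate(code: str) -> str:
    """Map a five-character SQLSTATE to a coarse failure category."""
    if len(code) != 5:
        raise ValueError(f"SQLSTATE must be 5 characters, got {code!r}")
    return SQLSTATE_CLASSES.get(code[:2], "unclassified")


if __name__ == "__main__":
    for code in ("23503", "40P01", "22012"):
        print(code, "->", describe_sqlstate(code))
```

Here 23503 is the foreign key violation mentioned above, 40P01 is a detected deadlock, and 22012 is division by zero.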

Transaction Rollback Mechanism

When a transaction fails, its partial effects must be undone through a process called rollback. Rollback is the fundamental mechanism that implements transaction atomicity—ensuring that failed transactions leave no trace.

The Rollback Process:

Rollback uses the database log to undo changes in reverse order:

Identify the Transaction: Find all log records belonging to the failed transaction
Read Backwards: Process log records from newest to oldest
Undo Each Operation: Apply the inverse of each logged operation
Release Resources: Release locks, memory, temporary structures
Write Abort Record: Log that the transaction has been aborted
Notify Application: Return error to the calling application

rollback_algorithm.pseudo
PROCEDURE Rollback(transaction_id):
    // Step 1: Locate the transaction's log records
    log_records = GetLogRecords(transaction_id)
    
    // Step 2: Process in reverse chronological order.
    // Each CLR stores the undo action plus the LSN of the next record
    // still to be undone (record.prev_LSN), so an interrupted rollback
    // can resume without repeating work.
    FOR each record IN REVERSE(log_records):
        IF record.type == UPDATE:
            // Write the before-image (old value) back to the page
            page = GetPage(record.page_id)
            page.Write(record.offset, record.before_image)
            MarkPageDirty(page)
            WriteCLR(transaction_id, record.prev_LSN, record.before_image)
            
        ELSE IF record.type == INSERT:
            // Delete the inserted record
            page = GetPage(record.page_id)
            page.Delete(record.record_id)
            MarkPageDirty(page)
            WriteCLR(transaction_id, record.prev_LSN, "DELETE")
            
        ELSE IF record.type == DELETE:
            // Re-insert the deleted record
            page = GetPage(record.page_id)
            page.Insert(record.deleted_data)
            MarkPageDirty(page)
            WriteCLR(transaction_id, record.prev_LSN, "INSERT")
    END FOR
    
    // Step 3: Release all locks
    ReleaseAllLocks(transaction_id)
    
    // Step 4: Write abort log record
    WriteLogRecord(ABORT, transaction_id)
    
    // Step 5: Update transaction status
    SetTransactionStatus(transaction_id, ABORTED)
    
    RETURN

Before-Images and After-Images:

The log contains the information needed to both undo and redo operations:

Before-Image (Undo Information): The old value of data before the operation. Used during rollback to restore the original state.
After-Image (Redo Information): The new value of data after the operation. Used during recovery to reapply committed changes.

For an UPDATE operation:

Log Record: <T1, PageID, Offset, OldValue, NewValue>
              ^     ^       ^       ^          ^
        Transaction Page  Where  Before-   After-
                   ID    changed  Image     Image

Compensation Log Records (CLRs):

During rollback, the system writes Compensation Log Records to log the undo actions themselves. This is crucial for recovery:

If the system crashes during rollback, the CLRs ensure we don't undo the same operation twice
CLRs point back to the next record that needs to be undone, allowing rollback to resume correctly

Why Log the Undo?

It might seem redundant to log undo operations. But consider: if the system crashes during rollback, and we restart recovery, how do we know which undo operations completed? Without CLRs, we might undo the same operation twice, potentially causing data corruption. CLRs make rollback idempotent.
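The role of CLRs becomes clearer in a toy simulation. The sketch below keeps data and log in memory and models only UPDATE records. The names (MiniDB, undo_next, crash_after) are hypothetical, and the "crash" merely interrupts the rollback loop: a real system would also lose its buffer pool and need a redo pass, which is out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogRecord:
    lsn: int
    txn: str
    kind: str                        # "UPDATE", "CLR", or "ABORT"
    prev_lsn: Optional[int] = None   # previous log record of the same transaction
    key: Optional[str] = None
    before: Optional[int] = None     # before-image (undo information)
    after: Optional[int] = None      # after-image (redo information)
    undo_next: Optional[int] = None  # CLR only: next record still to undo


class MiniDB:
    def __init__(self, data: Dict[str, int]) -> None:
        self.data = dict(data)
        self.log: List[LogRecord] = []
        self.last_lsn: Dict[str, int] = {}

    def _append(self, txn: str, kind: str, **fields) -> None:
        rec = LogRecord(len(self.log), txn, kind, self.last_lsn.get(txn), **fields)
        self.log.append(rec)
        self.last_lsn[txn] = rec.lsn

    def update(self, txn: str, key: str, value: int) -> None:
        self._append(txn, "UPDATE", key=key, before=self.data[key], after=value)
        self.data[key] = value

    def rollback(self, txn: str, crash_after: Optional[int] = None) -> bool:
        """Undo txn newest-first; return False if the simulated crash interrupts it."""
        cursor = self.last_lsn.get(txn)
        undone = 0
        while cursor is not None:
            rec = self.log[cursor]
            if rec.kind == "ABORT":
                return True             # rollback already finished earlier
            if rec.kind == "CLR":
                cursor = rec.undo_next  # skip work that was already compensated
                continue
            if crash_after is not None and undone == crash_after:
                return False
            self.data[rec.key] = rec.before
            self._append(txn, "CLR", key=rec.key, after=rec.before,
                         undo_next=rec.prev_lsn)
            undone += 1
            cursor = rec.prev_lsn
        self._append(txn, "ABORT")
        return True


if __name__ == "__main__":
    db = MiniDB({"A": 1, "B": 2})
    db.update("T1", "A", 10)
    db.update("T1", "B", 20)
    db.update("T1", "A", 30)
    db.rollback("T1", crash_after=1)  # interrupted after one undo
    db.rollback("T1")                 # resumes from the last CLR's undo_next
    print(db.data, [r.kind for r in db.log])
```

The second call to rollback starts from the most recent CLR and follows its undo_next pointer, so the update that was already undone is never touched again.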

Rollback Cost Factors
Factor	Impact on Rollback Time	Mitigation Strategy
Number of operations	Linear increase	Minimize transaction size
Size of data modified	Proportional increase	Update only necessary columns
Index updates	Significant overhead	Batch index maintenance
Trigger cascades	Potentially exponential	Design triggers carefully
Lock contention during rollback	Delays other transactions	Prioritize rollback resources

Impact of Transaction Failures

Transaction failures have both direct and indirect impacts on database systems and applications. Understanding these impacts helps engineers design more resilient systems.

Direct Impacts:

Immediate Effects of Transaction Failure

•Work Loss: All operations performed by the transaction are undone; computation time is wasted
•User Impact: Users may see error messages, failed operations, or need to retry actions
•Resource Release: Locks held by the transaction are released, potentially enabling waiting transactions
•Log Growth: Both the original operations and the undo operations consume log space
•Buffer Pool Churn: Pages dirtied by the transaction are modified a second time during undo and must still be written back, adding I/O for no net change

Indirect and Cascading Impacts:

The effects of transaction failure extend beyond the failed transaction itself:

1. Cascading Rollback (in some schedules): If the DBMS permits schedules that are not cascadeless, meaning transactions may read data written by transactions that have not yet committed, then every transaction that read data written by the failed transaction must also be rolled back. This cascade can be extensive:

T1: Write(A)  →  [FAILS]
              ↓
T2: Read(A) Write(B)  →  [Must Rollback - read uncommitted data]
                      ↓
T3: Read(B) Write(C)  →  [Must Rollback - cascade continues]

2. Lock Convoy Formation: While a long-running transaction holds locks, other transactions queue up waiting for them. When that transaction fails and its locks are released, all the waiters may try to acquire locks at once, potentially causing a contention spike.

3. Retry Storms: If many transactions fail for the same reason (e.g., resource exhaustion) and applications immediately retry, the retry attempts can overwhelm the system, making recovery harder.

The Retry Trap

Immediate, aggressive retries after transaction failures can cause 'thundering herd' problems. Use exponential backoff with jitter: wait a random time that increases with each retry. This spreads out retry attempts and gives the system time to recover.
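To make the cascading rollback from item 1 concrete, the set of transactions that must be aborted can be computed by following reads-from dependencies outward from the failed transaction. The sketch assumes the dependencies are already known as a dictionary. The function and parameter names are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def cascading_aborts(failed: str, reads_from: Dict[str, Set[str]]) -> Set[str]:
    """Return every transaction that must roll back when `failed` aborts.

    reads_from[t] is the set of transactions whose uncommitted writes t has read.
    """
    # Invert the mapping: writer -> transactions that read from it.
    readers: Dict[str, Set[str]] = {}
    for reader, writers in reads_from.items():
        for writer in writers:
            readers.setdefault(writer, set()).add(reader)

    doomed = {failed}
    queue = deque([failed])
    while queue:
        writer = queue.popleft()
        for reader in readers.get(writer, ()):
            if reader not in doomed:
                doomed.add(reader)
                queue.append(reader)
    return doomed


if __name__ == "__main__":
    deps = {"T2": {"T1"}, "T3": {"T2"}, "T4": set()}
    print(sorted(cascading_aborts("T1", deps)))
```

With the dependencies from the diagram, the failure of T1 takes T2 and T3 down with it, while T4, which read nothing uncommitted, is unaffected.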

Business and Operational Impacts:

Customer Experience: Failed transactions result in error messages, incomplete operations, and user frustration
Revenue Loss: In e-commerce, failed checkout transactions directly impact revenue
Support Costs: Transaction failures generate support tickets and investigation effort
Reputation: Frequent failures erode trust in the system
Operational Overhead: Engineers spend time investigating, diagnosing, and fixing failure causes

However, transaction failures also have a protective aspect: they prevent data corruption. A transaction that fails and rolls back leaves the database in a consistent state. The alternative—committing invalid or partial data—would be far more damaging.

Best Practices for Handling Transaction Failures

Effective handling of transaction failures requires attention at both the application level and the database configuration level. Here are best practices developed from decades of production experience:

Application-Level Best Practices

•Always Use Transactions: Never perform multi-statement operations without explicit transaction boundaries. Auto-commit mode is dangerous for complex operations.
•Handle All Error Codes: Catch and handle specific database error codes. Don't use generic exception handling that hides the failure cause.
•Implement Idempotency: Design operations to be safely retryable. Use idempotency keys to prevent duplicate effects.
•Use Retry with Backoff: Retry transient failures (deadlocks, serialization failures) with exponential backoff and jitter.
•Keep Transactions Short: Long transactions hold locks longer, increasing deadlock probability and blocking others.
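•Validate Before Transacting: Check preconditions before starting a transaction to fail fast on obvious violations, and re-check inside the transaction any condition that concurrent activity could have changed in the meantime.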
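The idempotency practice in the list above can be sketched as follows: record each request's key in the same transaction as its effect, so a retried request hits a primary key violation and changes nothing. The example uses Python's built-in sqlite3 module. The table names, the key format, and the function names are assumptions made for illustration.

```python
import sqlite3
from typing import Tuple


def apply_payment(conn: sqlite3.Connection, idempotency_key: str,
                  account_id: int, amount: int) -> bool:
    """Apply a credit at most once per key. Returns False for a repeated key."""
    try:
        # The key and the balance change commit together or not at all.
        conn.execute(
            "INSERT INTO processed_requests (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # duplicate key: this request was already applied
        return False


def demo() -> Tuple[bool, bool, int]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("CREATE TABLE processed_requests (idempotency_key TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.commit()
    first = apply_payment(conn, "req-42", 1, 25)
    second = apply_payment(conn, "req-42", 1, 25)  # client retry of the same request
    balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
    return first, second, balance


if __name__ == "__main__":
    print(demo())
```

The retry is safe because the duplicate is detected by the database itself rather than by application memory, so it still works when the retry arrives at a different application server.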

robust_transaction_handling.py
import random
import time
from typing import Callable, Type, TypeVar
 
T = TypeVar('T')
 
class RetryableError(Exception):
    """Errors that can be retried (deadlocks, serialization failures)"""
 
class NonRetryableError(Exception):
    """Errors that should not be retried (constraint violations)"""
 
def with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 10.0
) -> T:
    """
    Execute a database operation with automatic retry for transient failures.
    
    Uses exponential backoff with jitter to prevent thundering herd.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError as e:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last transient error
            # Calculate delay with exponential backoff
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add jitter (±25%)
            jitter = delay * 0.25 * (2 * random.random() - 1)
            actual_delay = delay + jitter
            
            print(f"Retryable error on attempt {attempt + 1}: {e}")
            print(f"Retrying in {actual_delay:.2f} seconds...")
            time.sleep(actual_delay)
        # NonRetryableError and any other exception propagate immediately
    
    raise AssertionError("unreachable: the loop always returns or raises")
 
def classify_db_error(error_code: str) -> Type[Exception]:
    """Classify database errors as retryable or not (PostgreSQL SQLSTATE codes)."""
    RETRYABLE_CODES = {
        '40001',  # Serialization failure
        '40P01',  # Deadlock detected
        '55P03',  # Lock not available
        '57014',  # Query cancelled; also raised for user-requested cancels,
                  # so treating it as retryable is a judgment call
    }
    
    if error_code in RETRYABLE_CODES:
        return RetryableError
    return NonRetryableError

Database Configuration Best Practices

•Set Appropriate Timeouts: Configure lock wait timeouts and statement timeouts to prevent indefinite waiting
•Monitor and Alert: Set up monitoring for transaction failure rates, rollback volume, deadlock frequency
•Size Resources Appropriately: Ensure adequate memory for sorting, disk space for logs, connections for load
•Use Connection Pooling: Prevent connection exhaustion with proper pooling configuration
•Enable Detailed Logging: Log enough detail to diagnose failures without impacting performance

Summary: Transaction Failure

Let's consolidate the key concepts covered in this page:

Key Takeaways

•Transaction failures occur when a transaction cannot complete — The DBMS must undo partial work to maintain atomicity
•Failures arise from diverse causes — Logical errors, constraint violations, concurrency control, resource exhaustion, and deliberate aborts
•Detection is carried out by multiple DBMS components — Each subsystem monitors for relevant error conditions
•Rollback uses the log to undo operations — Before-images are applied in reverse order; CLRs make rollback crash-safe
•Failures have cascading impacts — Beyond the failed transaction, they affect waiting transactions, system resources, and user experience
•Proper handling requires both application and database attention — Retry patterns, idempotency, validation, monitoring, and configuration all matter

What's Next:

Transaction failures, while common, are the most localized and manageable form of database failure. In the next page, we'll examine System Failures—when the entire DBMS instance crashes, affecting all active transactions simultaneously. System failures raise new challenges around volatile state, recovery timing, and the interplay between memory and persistent storage.

Page Complete

You now understand transaction failures in depth—their causes, detection, rollback mechanisms, and impacts. This foundation prepares you for understanding more severe failure types: system failures and media failures.

1 / 5

Loading learning content...

Database Management SystemsFailure Types

Understanding Database Failure Types

LevelIntermediate

Duration60 mins

TopicFailure Types

1 / 5

Transaction Failure

When Transactions Go Wrong

This page provides an exhaustive examination of transaction failures—why they happen, how they manifest, what damage they can cause, and how database systems are designed to handle them gracefully.

What You Will Learn

Defining Transaction Failure

Formal Definition:

A transaction T is said to have failed if:

T has begun execution (entered the Active state)
T cannot transition to the Committed state
T must transition to the Aborted state
All effects of T must be rolled back (undone)

The ACID Connection

Transaction Failure in the State Transition Model:

Recall the transaction state diagram:

Active → Partially Committed → Committed
   ↓              ↓
 Failed ←─────────┘
   ↓
Aborted

Failure from Active: The transaction was still executing operations when failure occurred. Some writes may have been performed but not yet finalized.
Failure from Partially Committed: The transaction completed all its operations and was in the process of committing when failure occurred. This is more complex because the transaction was almost successful.

Transaction Failure Transition Points
Source State	Trigger	Destination	Recovery Complexity
Active	Logical error, constraint violation, explicit rollback	Failed → Aborted	Lower (less work to undo)
Active	System resource exhaustion, deadlock victim selection	Failed → Aborted	Moderate (may have acquired resources)
Partially Committed	Commit processing failure, log write failure	Failed → Aborted	Higher (all operations complete)
Partially Committed	Final constraint check failure	Failed → Aborted	Highest (near-complete transaction)

Causes of Transaction Failure

We categorize transaction failure causes into five major groups:

Major Categories of Transaction Failure Causes

•Logical Errors — Bugs in application code or incorrect SQL statements that make transaction completion impossible
•Constraint Violations — Attempts to violate database integrity constraints (primary key, foreign key, check, unique)
•User-Initiated Aborts — Explicit ROLLBACK commands issued by the application or user
•Concurrency Control Aborts — Deadlock resolution, timeout expiration, or serialization failures
•Resource Exhaustion — Insufficient memory, disk space, or other system resources

2.1 Logical Errors in Application Code

Logical errors are perhaps the most common cause of transaction failures. These occur when the application's logic is flawed, leading to impossible operations or invalid data states.

Examples of Logical Errors:

Arithmetic Errors:
- Division by zero when calculating averages or ratios
- Numeric overflow when summing large values
- Precision loss leading to invalid monetary calculations
Data Type Mismatches:
- Attempting to insert a string into a numeric column
- Date format parsing failures
- Character encoding issues
Missing Data Errors:
- Referencing a variable that was never initialized
- Attempting to use a NULL value in an expression requiring a value
- Fetching from an empty cursor
Logic Flow Errors:
- Infinite loops consuming all resources
- Incorrect conditional logic leading to impossible states
- Race conditions in application-level concurrency

logical_error_examples.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Example 1: Division by zero causes transaction failure
BEGIN TRANSACTION;
    UPDATE accounts SET interest_rate = total_interest / num_years
    WHERE account_id = 12345;
    -- If num_years = 0, this transaction will fail
COMMIT;
 
-- Example 2: Invalid data type causes failure
BEGIN TRANSACTION;
    INSERT INTO employees (emp_id, hire_date, salary)
    VALUES (1001, 'not-a-valid-date', 75000);
    -- The date parsing will fail, aborting the transaction
COMMIT;
 
-- Example 3: Overflow error
BEGIN TRANSACTION;
    -- If population is BIGINT and sum exceeds maximum value
    UPDATE countries SET total_population = total_population + 9223372036854775807;
    -- Arithmetic overflow will cause failure
COMMIT;

2.2 Constraint Violations

Database constraints are designed to protect data integrity. When a transaction attempts to violate these constraints, the DBMS must reject the operation and fail the transaction.

Types of Constraint Violations:

Database Constraint Violations
Constraint Type	Violation Example	DBMS Response
PRIMARY KEY	Inserting a duplicate primary key value	Reject insert, fail transaction
FOREIGN KEY	Inserting a reference to non-existent parent row	Reject insert, fail transaction
UNIQUE	Inserting duplicate value in unique column	Reject insert, fail transaction
CHECK	Inserting value that violates CHECK condition	Reject insert, fail transaction
NOT NULL	Inserting NULL into NOT NULL column	Reject insert, fail transaction
DOMAIN	Inserting value outside defined domain	Reject insert, fail transaction

constraint_violation_examples.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
-- Example: Foreign Key Violation
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    order_date DATE
);
 
BEGIN TRANSACTION;
    -- This will fail if customer 99999 doesn't exist
    INSERT INTO orders (order_id, customer_id, order_date)
    VALUES (1001, 99999, CURRENT_DATE);
    -- ERROR: insert or update on table "orders" violates foreign key constraint
COMMIT; -- Never reached
 
-- Example: Check Constraint Violation
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    price DECIMAL(10,2) CHECK (price > 0),
    quantity INT CHECK (quantity >= 0)
);
 
BEGIN TRANSACTION;
    -- Negative price violates CHECK constraint
    INSERT INTO products (product_id, price, quantity)
    VALUES (1, -50.00, 100);
    -- Transaction fails
COMMIT;
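
The behavior in the table above can be reproduced with a small runnable sketch. It uses Python's built-in sqlite3 module and an in-memory database; the helper name try_insert_product is hypothetical, and the products table mirrors the CHECK example above. SQLite reports constraint violations as sqlite3.IntegrityError, and because systems differ on whether the transaction is aborted automatically, the sketch rolls back explicitly.

```python
import sqlite3


def try_insert_product(conn, product_id, price, quantity):
    """Insert one product; roll back and report if a constraint is violated."""
    try:
        conn.execute(
            "INSERT INTO products (product_id, price, quantity) VALUES (?, ?, ?)",
            (product_id, price, quantity),
        )
        conn.commit()
        return "committed"
    except sqlite3.IntegrityError as exc:
        conn.rollback()  # undo any partial work from this transaction
        return f"rolled back: {exc}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    " product_id INTEGER PRIMARY KEY,"
    " price REAL CHECK (price > 0),"
    " quantity INTEGER CHECK (quantity >= 0))"
)

ok = try_insert_product(conn, 1, 19.99, 100)
duplicate = try_insert_product(conn, 1, 5.00, 10)    # PRIMARY KEY violation
negative = try_insert_product(conn, 2, -50.00, 100)  # CHECK violation
row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Only the first insert survives; the two violating inserts leave no trace in the table.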

2.3 User-Initiated and Application-Initiated Aborts

Not all transaction failures are accidental. Applications often deliberately abort transactions:

Validation Failures: After performing reads, the application determines that proceeding is inappropriate
Business Rule Violations: The application detects a business logic violation that the database constraints don't enforce
User Cancellation: The user cancels an operation (e.g., clicks 'Cancel' during checkout)
Exception Handling: The application catches an exception and decides to rollback

These deliberate aborts are healthy—they show the system working correctly. The alternative (committing invalid state) would be far worse.

deliberate_abort_example.py
# Example: Application-Initiated Abort
# Assumes `conn` is a DB-API 2.0 connection (e.g. psycopg2) with autocommit
# off, so the driver opens the transaction implicitly on the first statement.

class InsufficientFundsError(Exception):
    """The source account cannot cover the transfer."""

class DailyLimitExceededError(Exception):
    """The transfer would exceed the daily limit."""

def transfer_funds(conn, from_account, to_account, amount):
    cursor = conn.cursor()
    try:
        # Step 1: Lock the source row and check its balance
        cursor.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            (from_account,)
        )
        row = cursor.fetchone()
        if row is None:
            raise LookupError(f"Account {from_account} does not exist")
        balance = row[0]
        
        # Step 2: Business rule validation
        if balance < amount:
            # Insufficient funds - deliberately abort by raising;
            # the handler below performs the rollback
            raise InsufficientFundsError(
                f"Balance {balance} is less than transfer amount {amount}"
            )
        
        # Step 3: Additional business rules
        if amount > 10000:
            # Check for daily limit exceeded
            cursor.execute(
                "SELECT SUM(amount) FROM transfers "
                "WHERE from_account = %s AND transfer_date = CURRENT_DATE",
                (from_account,)
            )
            daily_total = cursor.fetchone()[0] or 0
            if daily_total + amount > 50000:
                raise DailyLimitExceededError("Daily transfer limit exceeded")
        
        # Step 4: Perform the transfer
        cursor.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account)
        )
        cursor.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account)
        )
        
        conn.commit()
        return True
        
    except Exception:
        conn.rollback()  # Single rollback point for every failure path
        raise
    finally:
        cursor.close()

2.4 Concurrency Control Aborts

Database concurrency control mechanisms may abort transactions to maintain serializability, resolve deadlocks, or enforce timeouts:

Deadlock Resolution: When two or more transactions form a cycle of waiting (each holding a resource the other needs), the DBMS must choose a victim to abort, breaking the deadlock.

Serialization Failures: In optimistic concurrency control or serializable snapshot isolation, a transaction may be aborted if committing it would violate serializability.

Lock Timeouts: If a transaction waits too long for a lock (exceeding the configured timeout), it may be aborted to prevent indefinite blocking.

Timestamp Ordering Violations: In timestamp-based protocols, transactions may be aborted if their operations would violate the timestamp order.
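
To make the last of these concrete, here is a minimal sketch of the basic timestamp-ordering checks, assuming each data item remembers the largest read and write timestamps it has seen. The class names TimestampedItem and TransactionAborted are hypothetical, and the sketch deliberately omits refinements such as the Thomas write rule.

```python
class TransactionAborted(Exception):
    """Raised when an operation arrives too late in timestamp order."""


class TimestampedItem:
    """A data item tracking the largest read and write timestamps seen so far."""

    def __init__(self, value):
        self.value = value
        self.read_ts = 0
        self.write_ts = 0

    def read(self, ts):
        # A younger transaction already overwrote the value this reader needed.
        if ts < self.write_ts:
            raise TransactionAborted(f"read at ts={ts}, item written at ts={self.write_ts}")
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        # A younger transaction already read or wrote this item.
        if ts < self.read_ts or ts < self.write_ts:
            raise TransactionAborted(f"write at ts={ts} arrives too late")
        self.value = value
        self.write_ts = ts


item = TimestampedItem(100)
item.read(ts=5)
item.write(ts=7, value=120)

try:
    item.write(ts=6, value=999)  # older than the write at ts=7
    outcome = "applied"
except TransactionAborted:
    outcome = "aborted"
```

The transaction with timestamp 6 is aborted because a younger transaction (timestamp 7) has already written the item; it would typically restart with a fresh timestamp.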

Deadlock Victims Are Not Failures

2.5 Resource Exhaustion

Transactions may fail when system resources are insufficient:

Memory Exhaustion: The transaction requires more memory than available (for sorting, hash joins, temporary results)
Disk Space Exhaustion: No space for temporary files, log records, or data pages
Connection Limits: Maximum database connections reached
Lock Table Overflow: Too many locks held, lock manager capacity exceeded
Undo Log Space: Undo tablespace full, cannot record more undo information

Resource exhaustion failures are particularly dangerous because they can cascade—one transaction's failure may trigger others as the system struggles to recover resources.

Transaction Failure Detection

The DBMS must detect transaction failures promptly to minimize damage and begin recovery. Detection mechanisms vary based on the failure type:

Synchronous Detection (Immediate): Most failures are detected synchronously when the failing operation is attempted:

Constraint violations are detected when the INSERT/UPDATE/DELETE is processed
Arithmetic errors are detected when the expression is evaluated
Data type errors are detected when parsing the value
Deadlocks are detected by the lock manager's deadlock detector, either at the moment a lock request blocks or on a short periodic check, depending on the system

Asynchronous Detection (Delayed): Some failures are detected asynchronously:

Client disconnection may not be detected until the next I/O operation
Resource exhaustion may be detected by periodic monitoring
Timeouts require timer expiration

Detection Mechanisms by Component

•Query Executor — Detects runtime errors (division by zero, overflow, type mismatches)
•Constraint Subsystem — Detects integrity constraint violations
•Lock Manager — Detects deadlocks via wait-for graph analysis; enforces timeouts
•Buffer Manager — Detects memory exhaustion
•Storage Manager — Detects disk space exhaustion, I/O errors
•Connection Manager — Detects client disconnections, session timeouts
•Transaction Manager — Coordinates failure handling from all components
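
The wait-for graph analysis attributed to the Lock Manager above can be sketched as a depth-first search for a cycle. This is an illustration only: the function names are hypothetical, the graph is a plain dictionary, and aborting the youngest transaction is just one possible victim policy, assumed here for simplicity.

```python
def find_deadlock_cycle(wait_for):
    """Return a list of transactions forming a cycle, or None if there is none.

    wait_for maps each transaction to the set of transactions it is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(txn):
        color[txn] = GRAY
        path.append(txn)
        for blocker in sorted(wait_for.get(txn, ())):
            state = color.get(blocker, WHITE)
            if state == GRAY:
                # blocker is already on the current path: we closed a cycle
                return path[path.index(blocker):]
            if state == WHITE:
                cycle = visit(blocker)
                if cycle:
                    return cycle
        color[txn] = BLACK
        path.pop()
        return None

    for txn in sorted(wait_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None


def choose_victim(cycle, start_times):
    """Pick the youngest transaction (latest start time) as the victim."""
    return max(cycle, key=lambda txn: start_times[txn])


wait_for = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}, "T4": {"T1"}}
start_times = {"T1": 10, "T2": 20, "T3": 30, "T4": 40}

cycle = find_deadlock_cycle(wait_for)
victim = choose_victim(cycle, start_times)
```

T4 is blocked but not part of the cycle, so it is never considered as a victim; aborting T3 breaks the cycle and lets the others proceed.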

The Error Propagation Path:

When a failure is detected, the error information must propagate through the system:

Error Origin → Query Executor → Transaction Manager → Recovery Manager → Log Writer
                    ↓                    ↓
             Application             Lock Manager
             (error code)            (release locks)

Error Origin: The component detecting the failure raises an internal error
Query Executor: Halts query execution, prevents further operations
Transaction Manager: Marks transaction as failed, initiates abort sequence
Recovery Manager: Reads log records, executes undo operations
Lock Manager: Releases all locks held by the failed transaction
Application: Receives error code/message for appropriate handling

Error Codes Matter

Transaction Rollback Mechanism

The Rollback Process:

Rollback uses the database log to undo changes in reverse order:

Identify the Transaction: Find all log records belonging to the failed transaction
Read Backwards: Process log records from newest to oldest
Undo Each Operation: Apply the inverse of each logged operation
Release Resources: Release locks, memory, temporary structures
Write Abort Record: Log that the transaction has been aborted
Notify Application: Return error to the calling application

rollback_algorithm.pseudo
PROCEDURE Rollback(transaction_id):
    // Step 1: Locate the transaction's log records
    log_records = GetLogRecords(transaction_id)
    
    // Step 2: Process in reverse chronological order
    FOR each record IN REVERSE(log_records):
        IF record.type == UPDATE:
            // Write the before-image (old value) back to the page
            page = GetPage(record.page_id)
            page.Write(record.offset, record.before_image)
            MarkPageDirty(page)
            
            // Write a Compensation Log Record (CLR); record.prev_LSN tells
            // recovery which record to undo next if rollback is interrupted
            WriteCLR(transaction_id, record.LSN, record.prev_LSN, record.before_image)
            
        ELSE IF record.type == INSERT:
            // Delete the inserted record
            page = GetPage(record.page_id)
            page.Delete(record.record_id)
            MarkPageDirty(page)
            WriteCLR(transaction_id, record.LSN, record.prev_LSN, "DELETE")
            
        ELSE IF record.type == DELETE:
            // Re-insert the deleted record
            page = GetPage(record.page_id)
            page.Insert(record.deleted_data)
            MarkPageDirty(page)
            WriteCLR(transaction_id, record.LSN, record.prev_LSN, "INSERT")
    END FOR
    
    // Step 3: Release all locks
    ReleaseAllLocks(transaction_id)
    
    // Step 4: Write abort log record
    WriteLogRecord(ABORT, transaction_id)
    
    // Step 5: Update transaction status
    SetTransactionStatus(transaction_id, ABORTED)
    
    RETURN

Before-Images and After-Images:

The log contains the information needed to both undo and redo operations:

Before-Image (Undo Information): The old value of data before the operation. Used during rollback to restore the original state.
After-Image (Redo Information): The new value of data after the operation. Used during recovery to reapply committed changes.

For an UPDATE operation:

Log Record: <T1, PageID, Offset, OldValue, NewValue>
              ^     ^       ^       ^          ^
        Transaction Page  Where  Before-   After-
                   ID    changed  Image     Image
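
As a rough illustration of how one record serves both purposes, the sketch below models the update record as a small Python dataclass. The class name UpdateLogRecord and the in-memory pages dictionary are assumptions made for the example, and the before- and after-images are assumed to have equal length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateLogRecord:
    txn_id: str
    page_id: int
    offset: int
    before_image: bytes  # undo information
    after_image: bytes   # redo information

    def undo(self, pages):
        """Restore the old value (used during rollback)."""
        self._write(pages, self.before_image)

    def redo(self, pages):
        """Reapply the new value (used during crash recovery)."""
        self._write(pages, self.after_image)

    def _write(self, pages, image):
        page = pages[self.page_id]
        page[self.offset:self.offset + len(image)] = image


# One page holding a 4-byte balance field at offset 8.
pages = {7: bytearray(b"balance=0100")}
record = UpdateLogRecord("T1", 7, 8, before_image=b"0100", after_image=b"0250")

record.redo(pages)
after_redo = bytes(pages[7])

record.undo(pages)
after_undo = bytes(pages[7])
```

Redo writes the after-image and undo writes the before-image at the same page offset, so a single record is enough to move the page in either direction.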

Compensation Log Records (CLRs):

During rollback, the system writes Compensation Log Records to log the undo actions themselves. This is crucial for recovery:

If the system crashes during rollback, the CLRs ensure we don't undo the same operation twice
CLRs point back to the next record that needs to be undone, allowing rollback to resume correctly
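
A toy model can show how the CLR pointer lets an interrupted rollback resume without repeating work. Everything here is a simplifying assumption: the log is a Python list whose index doubles as the LSN, records are dictionaries, the undo_next_lsn field name is hypothetical, and the max_undos parameter merely simulates a crash part-way through.

```python
def rollback(log, data, txn, max_undos=None):
    """Undo txn's updates newest-first, appending one CLR per undone update.

    max_undos simulates a crash part-way through rollback. Returns the
    number of updates undone in this call.
    """
    # Start from the transaction's most recent log record.
    lsn = max((r["lsn"] for r in log if r["txn"] == txn), default=None)
    undone = 0
    while lsn is not None:
        record = log[lsn]
        if record["type"] == "CLR":
            # Work newer than undo_next_lsn was already undone: skip over it.
            lsn = record["undo_next_lsn"]
            continue
        if max_undos is not None and undone == max_undos:
            return undone  # simulated crash
        if record["type"] == "UPDATE":
            data[record["key"]] = record["before"]  # apply the before-image
            log.append({
                "lsn": len(log),
                "txn": txn,
                "type": "CLR",
                "undo_next_lsn": record["prev_lsn"],
            })
            undone += 1
        lsn = record["prev_lsn"]
    return undone


log = [
    {"lsn": 0, "txn": "T1", "type": "UPDATE", "key": "A", "before": 1, "prev_lsn": None},
    {"lsn": 1, "txn": "T1", "type": "UPDATE", "key": "B", "before": 10, "prev_lsn": 0},
    {"lsn": 2, "txn": "T1", "type": "UPDATE", "key": "A", "before": 2, "prev_lsn": 1},
]
data = {"A": 3, "B": 20}

first = rollback(log, data, "T1", max_undos=1)  # interrupted after one undo
second = rollback(log, data, "T1")              # resumes from the last CLR
clr_count = sum(1 for r in log if r["type"] == "CLR")
```

The second call starts at the CLR written by the first, jumps straight to LSN 1, and undoes only the two remaining updates, so each update is compensated exactly once.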

Why Log the Undo?

Rollback Cost Factors
Factor	Impact on Rollback Time	Mitigation Strategy
Number of operations	Linear increase	Minimize transaction size
Size of data modified	Proportional increase	Update only necessary columns
Index updates	Significant overhead	Batch index maintenance
Trigger cascades	Potentially exponential	Design triggers carefully
Lock contention during rollback	Delays other transactions	Prioritize rollback resources

Impact of Transaction Failures

Transaction failures have both direct and indirect impacts on database systems and applications. Understanding these impacts helps engineers design more resilient systems.

Direct Impacts:

Immediate Effects of Transaction Failure

•Work Loss: All operations performed by the transaction are undone; computation time is wasted
•User Impact: Users may see error messages, failed operations, or need to retry actions
•Resource Release: Locks held by the transaction are released, potentially enabling waiting transactions
•Log Growth: Both the original operations and the undo operations consume log space
•Buffer Pool Churn: Pages dirtied by the transaction may need to be discarded or re-read

Indirect and Cascading Impacts:

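The effects of transaction failure extend beyond the failed transaction itself. In a cascading rollback, for example, every transaction that read data written by the failed transaction must be rolled back as well, and so must any transaction that read from those: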

T1: Write(A)  →  [FAILS]
              ↓
T2: Read(A) Write(B)  →  [Must Rollback - read uncommitted data]
                      ↓
T3: Read(B) Write(C)  →  [Must Rollback - cascade continues]

Retry Storms: If many transactions fail for the same reason (e.g., resource exhaustion) and applications immediately retry, the retry attempts can overwhelm the system, making recovery harder.
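
Returning to the T1, T2, T3 diagram above: the set of transactions caught in a cascade can be computed by following "reads-from" dependencies outward from the failed transaction. The sketch below assumes a hypothetical reads_from mapping from each writer to the transactions that read its uncommitted data; a real DBMS would derive this from its own bookkeeping, or avoid the problem altogether by never exposing uncommitted writes.

```python
from collections import deque


def cascading_aborts(failed_txn, reads_from):
    """Return every transaction that must roll back when failed_txn aborts.

    reads_from maps a writer to the set of transactions that read its
    uncommitted data.
    """
    must_abort = []
    seen = {failed_txn}
    queue = deque([failed_txn])
    while queue:
        writer = queue.popleft()
        for reader in sorted(reads_from.get(writer, ())):
            if reader not in seen:
                seen.add(reader)
                must_abort.append(reader)
                queue.append(reader)
    return must_abort


# T2 read T1's write, T3 read T2's write; T5 read from an unrelated T4.
reads_from = {"T1": {"T2"}, "T2": {"T3"}, "T4": {"T5"}}
victims = cascading_aborts("T1", reads_from)
```

T5 is untouched because it never read anything derived from T1.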

The Retry Trap

Business and Operational Impacts:

Customer Experience: Failed transactions result in error messages, incomplete operations, and user frustration
Revenue Loss: In e-commerce, failed checkout transactions directly impact revenue
Support Costs: Transaction failures generate support tickets and investigation effort
Reputation: Frequent failures erode trust in the system
Operational Overhead: Engineers spend time investigating, diagnosing, and fixing failure causes

Best Practices for Handling Transaction Failures

Application-Level Best Practices

•Always Use Transactions: Never perform multi-statement operations without explicit transaction boundaries. Auto-commit mode is dangerous for complex operations.
•Handle All Error Codes: Catch and handle specific database error codes. Don't use generic exception handling that hides the failure cause.
•Implement Idempotency: Design operations to be safely retryable. Use idempotency keys to prevent duplicate effects.
•Use Retry with Backoff: Retry transient failures (deadlocks, serialization failures) with exponential backoff and jitter.
•Keep Transactions Short: Long transactions hold locks longer, increasing deadlock probability and blocking others.
•Validate Before Transacting: Check preconditions before starting transactions to fail fast on obvious violations.
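
The idempotency advice above can be illustrated with a short sketch using Python's built-in sqlite3 module. The table names, the apply_payment helper, and the "req-42" key are all hypothetical. The idea is to record the idempotency key in the same transaction as the change itself, so that a retried request is rejected by the key's PRIMARY KEY constraint instead of being applied twice.

```python
import sqlite3


def apply_payment(conn, idempotency_key, account_id, amount):
    """Credit an account once per idempotency key; repeats are no-ops."""
    try:
        # The PRIMARY KEY on processed_requests rejects a repeated key.
        conn.execute(
            "INSERT INTO processed_requests (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        conn.commit()
        return "applied"
    except sqlite3.IntegrityError:
        conn.rollback()  # nothing from the duplicate attempt is kept
        return "duplicate"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE processed_requests (idempotency_key TEXT PRIMARY KEY)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
conn.commit()

first = apply_payment(conn, "req-42", 1, 50)
retry = apply_payment(conn, "req-42", 1, 50)  # same key: no second credit
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

Because the key and the balance change commit or roll back together, a client that retries after an ambiguous failure cannot double-apply the payment.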

robust_transaction_handling.py
import random
import time
from typing import Callable, TypeVar
 
T = TypeVar('T')
 
class RetryableError(Exception):
    """Errors that can be retried (deadlocks, serialization failures)"""
    pass
 
class NonRetryableError(Exception):
    """Errors that should not be retried (constraint violations)"""
    pass
 
def with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 10.0
) -> T:
    """
    Execute a database operation with automatic retry for transient failures.
    
    Uses exponential backoff with jitter to prevent thundering herd.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError as e:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last transient error
                raise
            # Calculate delay with exponential backoff
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add jitter (±25%)
            jitter = delay * 0.25 * (2 * random.random() - 1)
            actual_delay = delay + jitter
            
            print(f"Retryable error on attempt {attempt + 1}: {e}")
            print(f"Retrying in {actual_delay:.2f} seconds...")
            time.sleep(actual_delay)
        except NonRetryableError:
            # Don't retry these - re-raise immediately
            raise
    
    raise AssertionError("unreachable")  # the loop always returns or raises
 
def classify_db_error(error_code: str) -> type:
    """Classify database errors as retryable or not."""
    RETRYABLE_CODES = {
        '40001',  # Serialization failure
        '40P01',  # Deadlock detected
        '55P03',  # Lock not available
        '57014',  # Query cancelled (timeout)
    }
    
    if error_code in RETRYABLE_CODES:
        return RetryableError
    return NonRetryableError

Database Configuration Best Practices

•Set Appropriate Timeouts: Configure lock wait timeouts and statement timeouts to prevent indefinite waiting
•Monitor and Alert: Set up monitoring for transaction failure rates, rollback volume, deadlock frequency
•Size Resources Appropriately: Ensure adequate memory for sorting, disk space for logs, connections for load
•Use Connection Pooling: Prevent connection exhaustion with proper pooling configuration
•Enable Detailed Logging: Log enough detail to diagnose failures without impacting performance

Summary: Transaction Failure

Let's consolidate the key concepts covered on this page:

Key Takeaways

•Transaction failures occur when a transaction cannot complete — The DBMS must undo partial work to maintain atomicity
•Failures arise from diverse causes — Logical errors, constraint violations, concurrency control, resource exhaustion, and deliberate aborts
•Detection is carried out by multiple DBMS components — Each subsystem monitors for relevant error conditions
•Rollback uses the log to undo operations — Before-images are applied in reverse order; CLRs make rollback crash-safe
•Failures have cascading impacts — Beyond the failed transaction, they affect waiting transactions, system resources, and user experience
•Proper handling requires both application and database attention — Retry patterns, idempotency, validation, monitoring, and configuration all matter
