Database Management SystemsStarvation

Starvation in Concurrency Control

LevelAdvanced

Duration60 mins

TopicStarvation

4 / 5

Timeout Handling in Lock Management

The Last Line of Defense

A surgeon doesn't operate indefinitely. After a maximum time, regardless of progress, the procedure must end—either successfully or with stabilization and reassessment. Unbounded duration isn't acceptable when patient wellbeing is at stake.

Database transactions face similar constraints. A mobile app user won't wait 30 seconds for a checkout to complete. An API client has a 5-second timeout. The stock trading system needs responses in milliseconds. When transactions wait forever for locks, these constraints are violated, leading to user abandonment, cascading failures, and SLA breaches.

Timeouts are the safety mechanism that bounds waiting time. When a transaction waits too long for a lock, the timeout triggers, aborting the transaction and releasing its resources. It's not elegant—work is lost—but it prevents the worse outcome of indefinite blocking.

What You Will Learn

This page explores timeout handling in depth: types of timeouts, configuration strategies, timeout-based deadlock resolution, the tradeoffs between timeout values, and implementation considerations. You will learn to design timeout policies that meet latency SLAs while minimizing unnecessary aborts.

Timeout Fundamentals

A timeout in database lock management is a maximum duration that a transaction will wait to acquire a lock. If the timeout expires before the lock is granted, the transaction receives an error and must handle the failure.

Formal Definition:

For a transaction T requesting lock L at time t₀:

If lock L is granted by time t₀ + τ (where τ is the timeout), proceed normally
If lock L is NOT granted by time t₀ + τ, raise a timeout error
On timeout error, transaction T may be aborted (automatic) or given control to handle the failure (manual)

Types of Timeouts:

Timeout Types in Database Systems
Timeout Type	Scope	Triggered By	Typical Values
Lock wait timeout	Individual lock acquisition	Waiting for specific lock	5-60 seconds
Statement timeout	Single SQL statement	Total statement execution time	30s-5min
Transaction timeout	Entire transaction	Total transaction duration	1-30 minutes
Connection timeout	Database connection	Idle or total connection time	5-30 minutes
Query timeout	Query execution	Query execution only (not lock waits)	Varies by query

Lock Wait Timeout vs. statement Timeout:

These are often confused but serve different purposes:

Lock Wait Timeout:

Limits time waiting for a specific lock
Does NOT include query execution time
Transaction may acquire many locks, each subject to the timeout
Example: SET lock_timeout = '10s' (PostgreSQL)

Statement Timeout:

Limits total time for a statement including execution and waiting
One statement may involve many lock acquisitions plus computation
If any component takes too long, statement is cancelled
Example: SET statement_timeout = '60s' (PostgreSQL)

Relationship:

Statement time = lock wait time + execution time

If lock_timeout = 10s and statement_timeout = 60s:
- Lock waits >10s will fail immediately
- Statement execution >60s will fail even if locks acquired instantly
- Statement with 8s lock wait + 55s execution = 63s → timeout

Always Set Both Timeouts

In production, set both lock_timeout (for lock acquisition) and statement_timeout (for total execution). Lock timeout prevents indefinite waiting; statement timeout prevents runaway queries. Neither alone is sufficient.

Timeout-Based Deadlock Resolution

Timeouts serve as a simple but effective deadlock resolution mechanism. While not as precise as wait-for graph analysis, timeout-based resolution is:

Simple to implement – No complex graph algorithms required
Low overhead – No continuous monitoring of wait dependencies
Always effective – Will eventually break any deadlock

How Timeout Resolves Deadlock:

Consider a classic deadlock:

Transaction A holds lock on resource X, waits for Y
Transaction B holds lock on resource Y, waits for X
Neither can proceed – deadlock

With timeout:

After τ seconds, whichever transaction reaches timeout first receives an error
That transaction aborts, releasing its locks
The other transaction can now proceed

Converting Mermaid diagram...

Timeout vs. Wait-For Graph Detection:

Aspect	Timeout-Based	Wait-For Graph
Detection accuracy	May abort non-deadlocked transactions	Only aborts actual deadlocks
Overhead	Very low (just timers)	Moderate (graph maintenance)
Resolution speed	Waits for timeout duration	Immediate on detection
False positives	Yes (long-waiting transactions)	No
Implementation	Trivial	Complex

The False Positive Problem:

Timeout cannot distinguish between:

Deadlock: Transaction will never get the lock
Long wait: Transaction will eventually get the lock, just slowly

A transaction waiting behind a long-running legitimate transaction will timeout, even though it would have eventually succeeded. This wastes the aborted transaction's work.

Mitigations:

Use longer timeouts – Reduce false positives but increase deadlock resolution time
Combine with detection – Use wait-for graph for faster true deadlock detection, timeout as fallback
Adaptive timeouts – Adjust timeout based on observed lock hold times

Timeout Alone Is Suboptimal

Most production databases use wait-for graph deadlock detection as the primary mechanism, with timeout as a backup for pathological cases or distributed scenarios where graph detection is impractical. Relying solely on timeout leads to either long deadlock wait times (high timeout) or frequent false positive aborts (low timeout).

Timeout Configuration Strategies

Choosing the right timeout values is critical. Too short causes unnecessary aborts; too long violates latency SLAs and delays deadlock resolution. Here are systematic approaches to timeout configuration.

Timeout Configuration Approaches

•SLA-Driven: Set lock timeout = SLA deadline - expected execution time. If API must respond in 5s and execution takes 1s, set lock timeout to 4s.
•Percentile-Based: Analyze lock wait time distribution. Set timeout at p99 or p99.9 of observed waits—most legitimate waits complete, only outliers timeout.
•Workload-Specific: Different transaction types get different timeouts. Real-time checkout: 2s. Batch reporting: 120s. Match timeout to transaction purpose.
•Adaptive: Start with conservative timeout, adjust based on observed behavior. Increase timeout if abort rate is high; decrease if deadlocks cause delays.
•Hierarchical: Multiple timeout tiers. Lock timeout < Statement timeout < Transaction timeout < Connection timeout. Each level catches different failure modes.

timeout_configuration_examples.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
-- PostgreSQL: Multi-level timeout configuration
-- Session level (applies to all statements in session)
SET lock_timeout = '10s';        -- Max wait for any single lock
SET statement_timeout = '60s';   -- Max total statement time
SET idle_in_transaction_session_timeout = '5min';  -- Idle transaction cleanup
 
-- Per-transaction override
BEGIN;
SET LOCAL lock_timeout = '2s';   -- Tighter for this transaction
SET LOCAL statement_timeout = '5s';
SELECT * FROM orders WHERE customer_id = 123 FOR UPDATE;
COMMIT;
 
-- MySQL/InnoDB timeout configuration
SET GLOBAL innodb_lock_wait_timeout = 50;  -- Global default (seconds)
SET SESSION innodb_lock_wait_timeout = 10; -- Session override
 
-- SQL Server timeout configuration
SET LOCK_TIMEOUT 5000;  -- Milliseconds (5 seconds)
-- Or per-query with NOWAIT/READPAST hints
 
-- Application-level timeout (pseudo-code)
try {
    connection.setQueryTimeout(30);  // 30 second statement limit
    
    PreparedStatement stmt = connection.prepareStatement(
        "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ?"
    );
    stmt.setQueryTimeout(10);  // Override for this statement
    stmt.execute();
    
} catch (SQLTimeoutException e) {
    // Handle timeout - retry or escalate
    handleLockTimeout(e);
}

Recommended Timeout Ranges by Use Case
Use Case	Lock Timeout	Statement Timeout	Rationale
Real-time API	1-5s	5-10s	User is waiting; fast failure better than delay
Interactive UI	5-15s	30-60s	User can see progress; moderate tolerance
Batch processing	30-120s	5-30min	No user waiting; throughput matters more than latency
ETL/Data pipeline	60-300s	1-4 hours	Large operations; failure is expensive
Schema migration	10-60s	Unlimited	Must complete; retry on timeout

Handling Timeout Errors

When a timeout occurs, the application must decide how to respond. The right response depends on the transaction type, business requirements, and system state.

Timeout Error Types:

Database Timeout Error Codes
Database	Error Code	Error Name	Meaning
PostgreSQL	55P03	lock_not_available	Could not acquire lock within lock_timeout
PostgreSQL	57014	query_canceled	Statement cancelled due to statement_timeout
MySQL	1205	Lock wait timeout exceeded	innodb_lock_wait_timeout exceeded
SQL Server	1222	Lock request time out	LOCK_TIMEOUT exceeded
Oracle	ORA-00054	resource busy	Could not acquire lock (NOWAIT/timeout)

Response Strategies:

Retry Strategy:

The most common response to timeout is retry. The assumption: the blocking condition was temporary and may have resolved.

Retry Implementation:

def execute_with_retry(operation, max_retries=3, base_delay=0.5):
    for attempt in range(max_retries):
        try:
            return operation()
        except LockTimeoutError as e:
            if attempt == max_retries - 1:
                raise  # Final attempt failed
            
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
            
            logger.warning(f"Lock timeout, retry {attempt + 1}/{max_retries}")
    
    raise MaxRetriesExceeded()

Best Practices:

Exponential backoff: Increase delay between retries to reduce contention
Jitter: Add randomness to prevent thundering herd
Retry limit: Don't retry forever; escalate after limit
Idempotency: Ensure operation is safe to retry (critical!)

Timeout and Starvation Interaction

Timeouts have a complex relationship with starvation. On one hand, timeouts prevent indefinite waiting—a transaction cannot starve if it times out. On the other hand, if a transaction repeatedly times out and retries, it may never complete successfully, which is effectively starvation.

Scenario: Timeout-Induced Starvation

Transaction T requests lock on high-contention resource R
T waits, but timeout fires before lock is granted
T retries, again joins the queue for R
While T was retrying, new transactions arrived and took its queue position
T again times out before reaching front of queue
Cycle repeats indefinitely

This is indirect starvation: the transaction is never blocked forever (it times out), but it also never succeeds.

Converting Mermaid diagram...

Preventing Timeout-Induced Starvation:

1. Preserve Queue Position on Retry

Some lock managers allow transactions to reclaim their original queue position when retrying after timeout. This prevents newer transactions from bypassing a transaction that has been waiting longer (across retries).

2. Increase Priority After Timeout

Each timeout increments the transaction's priority. After enough timeouts, the transaction becomes highest priority and is granted the lock.

3. Longer Timeout on Retry

Each retry uses a longer timeout (exponential backoff for timeouts, not just retry delays).

4. Router to Less-Contended Resources

If a resource is consistently causing timeouts, route transactions to replicas or cached alternatives.

5. Circuit Breaker Pattern

After repeated timeouts on a resource, temporarily reject all requests for that resource rather than queuing them.

Timeout + FIFO Queue = Fairness Problem

If the lock queue is FIFO but transactions timeout and restart from the back of the queue, earlier transactions are penalized for being slow. Consider whether timed-out transactions should go to the back of the queue (simple, but potentially unfair) or receive priority/position preservation (fair, but complex).

Adaptive Timeout Strategies

Fixed timeouts are a blunt instrument. A 10-second timeout is too long for a simple query and too short for a complex report. Adaptive timeouts adjust based on observed behavior, providing better balance between responsiveness and completion rate.

Adaptive Timeout Approaches

•Percentile-Tracking: Continuously compute p95/p99 of lock wait times. Set timeout at 2-3x the observed percentile. Adjusts automatically to workload changes.
•Contention-Aware: Monitor current contention on each resource. High contention → longer timeout. Low contention → shorter timeout. Dynamically calculated per-request.
•Transaction-Type Based: Learn expected lock wait by transaction type. Simple read: 1s timeout. Complex join: 30s timeout. Derived from historical data.
•Feedback Loop: Track timeout/success ratio. If timeouts > 5% of requests, increase timeout. If timeouts < 0.1%, cautiously decrease timeout.
•Deadline-Derived: Application provides deadline; database calculates appropriate timeout. Deadline - expected execution = lock timeout budget.

adaptive_timeout.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
class AdaptiveTimeoutManager:
    # Percentile tracking
    lock_wait_histogram = SlidingWindowHistogram(window=5_minutes)
    
    # Feedback loop parameters
    target_timeout_rate = 0.01  # 1% of requests should timeout
    adjustment_factor = 1.2     # 20% adjustment per cycle
    
    def get_timeout(self, transaction_type, resource):
        # Base: percentile-based
        p99 = lock_wait_histogram.get_percentile(
            resource, 
            percentile=99
        )
        base_timeout = p99 * 2.5  # 2.5x p99
        
        # Adjust for transaction type expectations
        type_multiplier = get_type_multiplier(transaction_type)
        
        # Adjust for current contention
        contention = get_current_contention(resource)
        contention_multiplier = 1 + (contention * 0.5)
        
        # Calculate final timeout
        timeout = base_timeout * type_multiplier * contention_multiplier
        
        # Clamp to reasonable bounds
        return clamp(timeout, min=100_ms, max=300_seconds)
    
    def feedback_adjust(self):
        """Called periodically to tune global parameters"""
        actual_timeout_rate = recent_timeout_count / recent_request_count
        
        if actual_timeout_rate > target_timeout_rate * 1.5:
            # Too many timeouts - increase base multiplier
            global_multiplier *= adjustment_factor
        elif actual_timeout_rate < target_timeout_rate * 0.5:
            # Too few timeouts - decrease base multiplier
            global_multiplier /= adjustment_factor

Start Conservative, Tune Down

When implementing adaptive timeouts, start with longer timeouts and tune down. A timeout that's too long causes occasional delays; a timeout that's too short causes repeated failures and retries, potentially creating more load than the problem it was solving.

Timeout in Distributed Systems

Distributed databases add complexity to timeout handling. A transaction may span multiple nodes, each with its own lock manager. Timeout must account for network latency, clock skew, and distributed coordination.

Challenges:

Distributed Timeout Challenges

•Clock Skew: Different nodes have different wall-clock times. A timeout of 'T seconds' means different things on different nodes.
•Network Partitions: A node may be unable to determine whether a remote lock holder is alive or the network is partitioned.
•Coordinator Failure: If the transaction coordinator times out, should participants release locks? What if the coordinator recovers?
•Cascading Timeouts: A timeout on one node may cascade, causing other nodes to timeout waiting for responses.
•Two-Phase Commit Blocking: During 2PC, participants may wait indefinitely if coordinator fails between prepare and commit.

Distributed Timeout Strategies:

1. Logical Timestamps

Use logical clocks (Lamport timestamps, vector clocks) instead of wall-clock time. Timeouts are measured in logical time units, immune to clock skew.

2. Heartbeat-Based Detection

Nodes send periodic heartbeats. If heartbeats stop, the node is presumed failed, and its locks can be released after a timeout period.

Heartbeat interval: H
Failure detection timeout: F = 3H + network_jitter_margin
Lock release timeout: L = F + recovery_margin

3. Lease-Based Locks

Locks are granted for a limited time (lease). If not renewed, they automatically expire. This bounds the wait for failed lock holders.

4. Heuristic Commit for 2PC

When a participant hasn't heard from coordinator for timeout period, it may heuristically commit or abort based on likelihood. Dangerous but sometimes necessary.

5. Three-Phase Commit

Adds a pre-commit phase that bounds the window where blocking can occur. Reduces (but doesn't eliminate) timeout-related blocking.

lease_based_lock.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
class LeaseBasedDistributedLock:
    LEASE_DURATION = 30_seconds
    RENEWAL_INTERVAL = 10_seconds  # Renew before 1/3 of lease expires
    
    def acquire_lock(resource):
        lease = request_lease_from_coordinator(
            resource, 
            duration=LEASE_DURATION
        )
        
        if not lease.acquired:
            raise LockAcquisitionFailed()
        
        # Start renewal thread
        renewal_thread = start_renewal_loop(lease)
        
        return Lock(
            resource=resource,
            lease=lease,
            renewal=renewal_thread
        )
    
    def renewal_loop(lease):
        while lock_held:
            sleep(RENEWAL_INTERVAL)
            
            try:
                renewed = renew_lease(lease, duration=LEASE_DURATION)
                if not renewed:
                    # Lease lost - abort transaction
                    abort_transaction("Lease not renewed")
            except NetworkError:
                # Network issue - may lose lease
                if time_since_last_renewal() > LEASE_DURATION - SAFETY_MARGIN:
                    abort_transaction("Risk of lease expiration")
    
    def release_lock(lock):
        stop_renewal_loop(lock.renewal)
        release_lease(lock.lease)  # Immediate release, don't wait for expiry

Timeout Best Practices

Based on years of production experience, these best practices help design effective timeout policies.

Timeout Implementation Best Practices

•Set timeouts at every level — Lock timeout, statement timeout, transaction timeout, connection timeout. Each catches different failure modes.
•Make timeouts configurable — Different environments (dev, staging, prod) and workloads need different values. Use configuration, not hardcoding.
•Log timeout events — Every timeout should log: transaction ID, resource waited for, wait duration, current contention level. This data drives optimization.
•Monitor timeout rates — Track timeouts per transaction type, per resource, over time. Alert on anomalous increases.
•Distinguish lock timeout from execution timeout — They have different causes and different solutions. Track them separately.
•Implement retry with backoff — Don't retry immediately after timeout; you'll just add to contention. Exponential backoff is essential.
•Consider circuit breakers — If a resource consistently causes timeouts, stop sending traffic there. Fail fast instead of queuing.
•Test timeout behavior — Deliberately trigger timeouts in testing to verify error handling paths work correctly.
•Document expected timeouts — Include timeout values in API contracts and SLA documents. Clients need to know what to expect.
•Review timeouts regularly — Workloads change. What was appropriate six months ago may be wrong today. Schedule periodic reviews.

The Timeout Domino Effect

Timeouts at one layer can cascade. If your database lock timeout is 30s but your HTTP client timeout is 29s, the client will timeout first, abandoning the database connection. The database transaction completes (or times out later), wasting resources. Ensure timeout hierarchy: innermost (database) < middle (application) < outermost (client).

Summary: Timeout Handling

We've explored timeout handling as a crucial mechanism for bounding lock wait times and resolving deadlocks. Let's consolidate the key concepts:

Key Takeaways

•Timeouts bound waiting time — Transactions will not wait forever for locks, meeting latency SLAs even under contention.
•Multiple timeout types exist — Lock timeout, statement timeout, transaction timeout, connection timeout—each serves a different purpose.
•Timeouts resolve deadlocks — Simple but effective: eventually one transaction times out and releases locks, breaking the cycle.
•Timeouts cause false positives — Cannot distinguish deadlock from slow lock holder. May abort transactions that would have succeeded.
•Timeout configuration is critical — Too short causes unnecessary aborts; too long violates SLAs. Use percentile-based or adaptive strategies.
•Handle timeout errors gracefully — Retry with backoff, escalate to alternative paths, or fail fast depending on transaction requirements.
•Timeouts can cause indirect starvation — Repeated timeout-and-retry cycles without queue position preservation can prevent completion.
•Distributed systems add complexity — Clock skew, network partitions, and coordinator failure require lease-based locks and heartbeats.
•Implement timeout hierarchy — Innermost timeouts < outer timeouts. Ensure database timeout completes before client timeout fires.

What's Next:

With timeouts understood, we conclude this module by examining prevention strategies comprehensively—integrating fairness, priority, and timeout into cohesive starvation prevention policies. We'll synthesize everything into practical guidelines for building starvation-free database systems.

Page Complete

You now understand timeout handling as a mechanism for bounding lock waits, resolving deadlocks, and (with proper design) preventing starvation. This knowledge enables you to configure timeouts that meet SLAs while minimizing unnecessary transaction aborts.

4 / 5

Loading learning content...

Database Management SystemsStarvation

Starvation in Concurrency Control

LevelAdvanced

Duration60 mins

TopicStarvation

4 / 5

Timeout Handling in Lock Management

The Last Line of Defense

What You Will Learn

Timeout Fundamentals

Formal Definition:

For a transaction T requesting lock L at time t₀:

If lock L is granted by time t₀ + τ (where τ is the timeout), proceed normally
If lock L is NOT granted by time t₀ + τ, raise a timeout error
On timeout error, transaction T may be aborted (automatic) or given control to handle the failure (manual)

Types of Timeouts:

Timeout Types in Database Systems
Timeout Type	Scope	Triggered By	Typical Values
Lock wait timeout	Individual lock acquisition	Waiting for specific lock	5-60 seconds
Statement timeout	Single SQL statement	Total statement execution time	30s-5min
Transaction timeout	Entire transaction	Total transaction duration	1-30 minutes
Connection timeout	Database connection	Idle or total connection time	5-30 minutes
Query timeout	Query execution	Query execution only (not lock waits)	Varies by query

Lock Wait Timeout vs. statement Timeout:

These are often confused but serve different purposes:

Lock Wait Timeout:

Limits time waiting for a specific lock
Does NOT include query execution time
Transaction may acquire many locks, each subject to the timeout
Example: SET lock_timeout = '10s' (PostgreSQL)

Statement Timeout:

Limits total time for a statement including execution and waiting
One statement may involve many lock acquisitions plus computation
If any component takes too long, statement is cancelled
Example: SET statement_timeout = '60s' (PostgreSQL)

Relationship:

Statement time = lock wait time + execution time

If lock_timeout = 10s and statement_timeout = 60s:
- Lock waits >10s will fail immediately
- Statement execution >60s will fail even if locks acquired instantly
- Statement with 8s lock wait + 55s execution = 63s → timeout

Always Set Both Timeouts

Timeout-Based Deadlock Resolution

Timeouts serve as a simple but effective deadlock resolution mechanism. While not as precise as wait-for graph analysis, timeout-based resolution is:

Simple to implement – No complex graph algorithms required
Low overhead – No continuous monitoring of wait dependencies
Always effective – Will eventually break any deadlock

How Timeout Resolves Deadlock:

Consider a classic deadlock:

Transaction A holds lock on resource X, waits for Y
Transaction B holds lock on resource Y, waits for X
Neither can proceed – deadlock

With timeout:

After τ seconds, whichever transaction reaches timeout first receives an error
That transaction aborts, releasing its locks
The other transaction can now proceed

Converting Mermaid diagram...

Timeout vs. Wait-For Graph Detection:

Aspect	Timeout-Based	Wait-For Graph
Detection accuracy	May abort non-deadlocked transactions	Only aborts actual deadlocks
Overhead	Very low (just timers)	Moderate (graph maintenance)
Resolution speed	Waits for timeout duration	Immediate on detection
False positives	Yes (long-waiting transactions)	No
Implementation	Trivial	Complex

The False Positive Problem:

Timeout cannot distinguish between:

Deadlock: Transaction will never get the lock
Long wait: Transaction will eventually get the lock, just slowly

A transaction waiting behind a long-running legitimate transaction will timeout, even though it would have eventually succeeded. This wastes the aborted transaction's work.

Mitigations:

Use longer timeouts – Reduce false positives but increase deadlock resolution time
Combine with detection – Use wait-for graph for faster true deadlock detection, timeout as fallback
Adaptive timeouts – Adjust timeout based on observed lock hold times

Timeout Alone Is Suboptimal

Timeout Configuration Strategies

Timeout Configuration Approaches

•SLA-Driven: Set lock timeout = SLA deadline - expected execution time. If API must respond in 5s and execution takes 1s, set lock timeout to 4s.
•Percentile-Based: Analyze lock wait time distribution. Set timeout at p99 or p99.9 of observed waits—most legitimate waits complete, only outliers timeout.
•Workload-Specific: Different transaction types get different timeouts. Real-time checkout: 2s. Batch reporting: 120s. Match timeout to transaction purpose.
•Adaptive: Start with conservative timeout, adjust based on observed behavior. Increase timeout if abort rate is high; decrease if deadlocks cause delays.
•Hierarchical: Multiple timeout tiers. Lock timeout < Statement timeout < Transaction timeout < Connection timeout. Each level catches different failure modes.

timeout_configuration_examples.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
-- PostgreSQL: Multi-level timeout configuration
-- Session level (applies to all statements in session)
SET lock_timeout = '10s';        -- Max wait for any single lock
SET statement_timeout = '60s';   -- Max total statement time
SET idle_in_transaction_session_timeout = '5min';  -- Idle transaction cleanup
 
-- Per-transaction override
BEGIN;
SET LOCAL lock_timeout = '2s';   -- Tighter for this transaction
SET LOCAL statement_timeout = '5s';
SELECT * FROM orders WHERE customer_id = 123 FOR UPDATE;
COMMIT;
 
-- MySQL/InnoDB timeout configuration
SET GLOBAL innodb_lock_wait_timeout = 50;  -- Global default (seconds)
SET SESSION innodb_lock_wait_timeout = 10; -- Session override
 
-- SQL Server timeout configuration
SET LOCK_TIMEOUT 5000;  -- Milliseconds (5 seconds)
-- Or per-query with NOWAIT/READPAST hints
 
-- Application-level timeout (pseudo-code)
try {
    connection.setQueryTimeout(30);  // 30 second statement limit
    
    PreparedStatement stmt = connection.prepareStatement(
        "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ?"
    );
    stmt.setQueryTimeout(10);  // Override for this statement
    stmt.execute();
    
} catch (SQLTimeoutException e) {
    // Handle timeout - retry or escalate
    handleLockTimeout(e);
}

Recommended Timeout Ranges by Use Case
Use Case	Lock Timeout	Statement Timeout	Rationale
Real-time API	1-5s	5-10s	User is waiting; fast failure better than delay
Interactive UI	5-15s	30-60s	User can see progress; moderate tolerance
Batch processing	30-120s	5-30min	No user waiting; throughput matters more than latency
ETL/Data pipeline	60-300s	1-4 hours	Large operations; failure is expensive
Schema migration	10-60s	Unlimited	Must complete; retry on timeout

Handling Timeout Errors

When a timeout occurs, the application must decide how to respond. The right response depends on the transaction type, business requirements, and system state.

Timeout Error Types:

Database Timeout Error Codes
Database	Error Code	Error Name	Meaning
PostgreSQL	55P03	lock_not_available	Could not acquire lock within lock_timeout
PostgreSQL	57014	query_canceled	Statement cancelled due to statement_timeout
MySQL	1205	Lock wait timeout exceeded	innodb_lock_wait_timeout exceeded
SQL Server	1222	Lock request time out	LOCK_TIMEOUT exceeded
Oracle	ORA-00054	resource busy	Could not acquire lock (NOWAIT/timeout)

Response Strategies:

Retry Strategy:

The most common response to timeout is retry. The assumption: the blocking condition was temporary and may have resolved.

Retry Implementation:

def execute_with_retry(operation, max_retries=3, base_delay=0.5):
    for attempt in range(max_retries):
        try:
            return operation()
        except LockTimeoutError as e:
            if attempt == max_retries - 1:
                raise  # Final attempt failed
            
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
            
            logger.warning(f"Lock timeout, retry {attempt + 1}/{max_retries}")
    
    raise MaxRetriesExceeded()

Best Practices:

Exponential backoff: Increase delay between retries to reduce contention
Jitter: Add randomness to prevent thundering herd
Retry limit: Don't retry forever; escalate after limit
Idempotency: Ensure operation is safe to retry (critical!)

Timeout and Starvation Interaction

Scenario: Timeout-Induced Starvation

Transaction T requests lock on high-contention resource R
T waits, but timeout fires before lock is granted
T retries, again joins the queue for R
While T was retrying, new transactions arrived and took its queue position
T again times out before reaching front of queue
Cycle repeats indefinitely

This is indirect starvation: the transaction is never blocked forever (it times out), but it also never succeeds.

Converting Mermaid diagram...

Preventing Timeout-Induced Starvation:

1. Preserve Queue Position on Retry

2. Increase Priority After Timeout

Each timeout increments the transaction's priority. After enough timeouts, the transaction becomes highest priority and is granted the lock.

3. Longer Timeout on Retry

Each retry uses a longer timeout (exponential backoff for timeouts, not just retry delays).

4. Router to Less-Contended Resources

If a resource is consistently causing timeouts, route transactions to replicas or cached alternatives.

5. Circuit Breaker Pattern

After repeated timeouts on a resource, temporarily reject all requests for that resource rather than queuing them.

Timeout + FIFO Queue = Fairness Problem

Adaptive Timeout Strategies

Adaptive Timeout Approaches

•Percentile-Tracking: Continuously compute p95/p99 of lock wait times. Set timeout at 2-3x the observed percentile. Adjusts automatically to workload changes.
•Contention-Aware: Monitor current contention on each resource. High contention → longer timeout. Low contention → shorter timeout. Dynamically calculated per-request.
•Transaction-Type Based: Learn expected lock wait by transaction type. Simple read: 1s timeout. Complex join: 30s timeout. Derived from historical data.
•Feedback Loop: Track timeout/success ratio. If timeouts > 5% of requests, increase timeout. If timeouts < 0.1%, cautiously decrease timeout.
•Deadline-Derived: Application provides deadline; database calculates appropriate timeout. Deadline - expected execution = lock timeout budget.

adaptive_timeout.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
class AdaptiveTimeoutManager:
    # Percentile tracking
    lock_wait_histogram = SlidingWindowHistogram(window=5_minutes)
    
    # Feedback loop parameters
    target_timeout_rate = 0.01  # 1% of requests should timeout
    adjustment_factor = 1.2     # 20% adjustment per cycle
    
    def get_timeout(self, transaction_type, resource):
        # Base: percentile-based
        p99 = lock_wait_histogram.get_percentile(
            resource, 
            percentile=99
        )
        base_timeout = p99 * 2.5  # 2.5x p99
        
        # Adjust for transaction type expectations
        type_multiplier = get_type_multiplier(transaction_type)
        
        # Adjust for current contention
        contention = get_current_contention(resource)
        contention_multiplier = 1 + (contention * 0.5)
        
        # Calculate final timeout
        timeout = base_timeout * type_multiplier * contention_multiplier
        
        # Clamp to reasonable bounds
        return clamp(timeout, min=100_ms, max=300_seconds)
    
    def feedback_adjust(self):
        """Called periodically to tune global parameters"""
        actual_timeout_rate = recent_timeout_count / recent_request_count
        
        if actual_timeout_rate > target_timeout_rate * 1.5:
            # Too many timeouts - increase base multiplier
            global_multiplier *= adjustment_factor
        elif actual_timeout_rate < target_timeout_rate * 0.5:
            # Too few timeouts - decrease base multiplier
            global_multiplier /= adjustment_factor

Start Conservative, Tune Down

Timeout in Distributed Systems

Challenges:

Distributed Timeout Challenges

•Clock Skew: Different nodes have different wall-clock times. A timeout of 'T seconds' means different things on different nodes.
•Network Partitions: A node may be unable to determine whether a remote lock holder is alive or the network is partitioned.
•Coordinator Failure: If the transaction coordinator times out, should participants release locks? What if the coordinator recovers?
•Cascading Timeouts: A timeout on one node may cascade, causing other nodes to timeout waiting for responses.
•Two-Phase Commit Blocking: During 2PC, participants may wait indefinitely if coordinator fails between prepare and commit.

Distributed Timeout Strategies:

1. Logical Timestamps

Use logical clocks (Lamport timestamps, vector clocks) instead of wall-clock time. Timeouts are measured in logical time units, immune to clock skew.

2. Heartbeat-Based Detection

Nodes send periodic heartbeats. If heartbeats stop, the node is presumed failed, and its locks can be released after a timeout period.

Heartbeat interval: H
Failure detection timeout: F = 3H + network_jitter_margin
Lock release timeout: L = F + recovery_margin

3. Lease-Based Locks

Locks are granted for a limited time (lease). If not renewed, they automatically expire. This bounds the wait for failed lock holders.

4. Heuristic Commit for 2PC

When a participant hasn't heard from coordinator for timeout period, it may heuristically commit or abort based on likelihood. Dangerous but sometimes necessary.

5. Three-Phase Commit

Adds a pre-commit phase that bounds the window where blocking can occur. Reduces (but doesn't eliminate) timeout-related blocking.

lease_based_lock.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
class LeaseBasedDistributedLock:
    LEASE_DURATION = 30_seconds
    RENEWAL_INTERVAL = 10_seconds  # Renew before 1/3 of lease expires
    
    def acquire_lock(resource):
        lease = request_lease_from_coordinator(
            resource, 
            duration=LEASE_DURATION
        )
        
        if not lease.acquired:
            raise LockAcquisitionFailed()
        
        # Start renewal thread
        renewal_thread = start_renewal_loop(lease)
        
        return Lock(
            resource=resource,
            lease=lease,
            renewal=renewal_thread
        )
    
    def renewal_loop(lease):
        while lock_held:
            sleep(RENEWAL_INTERVAL)
            
            try:
                renewed = renew_lease(lease, duration=LEASE_DURATION)
                if not renewed:
                    # Lease lost - abort transaction
                    abort_transaction("Lease not renewed")
            except NetworkError:
                # Network issue - may lose lease
                if time_since_last_renewal() > LEASE_DURATION - SAFETY_MARGIN:
                    abort_transaction("Risk of lease expiration")
    
    def release_lock(lock):
        stop_renewal_loop(lock.renewal)
        release_lease(lock.lease)  # Immediate release, don't wait for expiry

Timeout Best Practices

Based on years of production experience, these best practices help design effective timeout policies.

Timeout Implementation Best Practices

•Set timeouts at every level — Lock timeout, statement timeout, transaction timeout, connection timeout. Each catches different failure modes.
•Make timeouts configurable — Different environments (dev, staging, prod) and workloads need different values. Use configuration, not hardcoding.
•Log timeout events — Every timeout should log: transaction ID, resource waited for, wait duration, current contention level. This data drives optimization.
•Monitor timeout rates — Track timeouts per transaction type, per resource, over time. Alert on anomalous increases.
•Distinguish lock timeout from execution timeout — They have different causes and different solutions. Track them separately.
•Implement retry with backoff — Don't retry immediately after timeout; you'll just add to contention. Exponential backoff is essential.
•Consider circuit breakers — If a resource consistently causes timeouts, stop sending traffic there. Fail fast instead of queuing.
•Test timeout behavior — Deliberately trigger timeouts in testing to verify error handling paths work correctly.
•Document expected timeouts — Include timeout values in API contracts and SLA documents. Clients need to know what to expect.
•Review timeouts regularly — Workloads change. What was appropriate six months ago may be wrong today. Schedule periodic reviews.

The Timeout Domino Effect

Summary: Timeout Handling

We've explored timeout handling as a crucial mechanism for bounding lock wait times and resolving deadlocks. Let's consolidate the key concepts:

Key Takeaways

•Timeouts bound waiting time — Transactions will not wait forever for locks, meeting latency SLAs even under contention.
•Multiple timeout types exist — Lock timeout, statement timeout, transaction timeout, connection timeout—each serves a different purpose.
•Timeouts resolve deadlocks — Simple but effective: eventually one transaction times out and releases locks, breaking the cycle.
•Timeouts cause false positives — Cannot distinguish deadlock from slow lock holder. May abort transactions that would have succeeded.
•Timeout configuration is critical — Too short causes unnecessary aborts; too long violates SLAs. Use percentile-based or adaptive strategies.
•Handle timeout errors gracefully — Retry with backoff, escalate to alternative paths, or fail fast depending on transaction requirements.
•Timeouts can cause indirect starvation — Repeated timeout-and-retry cycles without queue position preservation can prevent completion.
•Distributed systems add complexity — Clock skew, network partitions, and coordinator failure require lease-based locks and heartbeats.
•Implement timeout hierarchy — Innermost timeouts < outer timeouts. Ensure database timeout completes before client timeout fires.

What's Next:

Page Complete

4 / 5