When a system encounters an anomaly—an unexpected state, corrupted data, a resource constraint, an inconsistency—it faces a fundamental choice: should it attempt to continue operating, or should it stop immediately?
This isn't a question with a universally correct answer. Both approaches have merit, and the right choice depends on context. A medical monitoring system should probably keep running with degraded data rather than shut down entirely. A financial trading system should probably halt rather than execute trades based on corrupted market data.
Understanding the tradeoffs between fail-safe (prioritizing continued operation) and fail-fast (prioritizing immediate halt on problems) is essential for making appropriate design decisions. The wrong choice can be catastrophic in either direction.
By the end of this page, you will deeply understand both fail-safe and fail-fast philosophies: their definitions, when each is appropriate, how to implement them, their implications for system behavior, and how to choose between them. You'll be able to analyze systems and make informed decisions about failure handling approaches.
These terms are sometimes used loosely, so let's establish precise definitions:
Fail-Safe: A fail-safe system is designed to revert to a safe state when a failure occurs, prioritizing continued operation (possibly degraded) over stopping. The system attempts to maintain availability and core functionality even when components fail or anomalies are detected.
Fail-Fast: A fail-fast system is designed to immediately stop or signal failure when an anomaly is detected, rather than attempting to continue with potentially corrupt state. The system prioritizes correctness and early problem detection over continued availability.
Important Distinction: Fail-safe is NOT about preventing failures (that's fault prevention). Fail-fast is NOT about being fragile. Both are strategies for what happens AFTER an anomaly is detected.
| Aspect | Fail-Safe | Fail-Fast |
|---|---|---|
| Primary Goal | Continued operation | Immediate problem detection |
| On Anomaly | Attempt recovery/degradation | Stop/crash/reject |
| Availability Priority | High | Lower |
| Correctness Priority | Lower | High |
| Error Detection Time | May be delayed | Immediate |
| Blast Radius | Potentially spreads corruption | Contained to single operation |
| Debugging | Harder (symptoms distant from cause) | Easier (fail at point of problem) |
| User Experience | More continuous | More interruptions |
Real systems typically use a mix of both strategies, applied to different components or conditions. Critical invariants might be fail-fast while non-critical operations are fail-safe. The art is in choosing appropriately for each case.
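As a concrete, entirely hypothetical sketch of that mix, here is one request path in Python where the payment step is fail-fast and the recommendations step is fail-safe; `charge_payment` and `related_products` are stand-ins for real services:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def charge_payment(order: dict) -> str:
    """Critical step: any anomaly here must abort the request."""
    if order["amount"] <= 0:
        raise ValueError(f"invalid charge amount: {order['amount']}")
    return f"charge-{order['id']}"


def related_products(order: dict) -> list[str]:
    """Non-critical enhancement: simulated here as currently unavailable."""
    raise TimeoutError("recommendation service unavailable")


def handle_checkout(order: dict) -> dict:
    # Fail-fast: a payment anomaly propagates and the request is rejected,
    # rather than risking an incorrect charge.
    charge_id = charge_payment(order)

    # Fail-safe: a recommendation failure is absorbed and logged;
    # the customer still gets an order confirmation.
    try:
        suggestions = related_products(order)
    except Exception:
        logger.exception("recommendations failed; continuing without them")
        suggestions = []

    return {"charge_id": charge_id, "suggestions": suggestions}


print(handle_checkout({"id": "o-1", "amount": 25.0}))
```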
Fail-fast systems are based on a simple principle: it's better to fail loudly and immediately than to continue with corrupt state that causes worse problems later.
The Fail-Fast Argument:
When an anomaly is detected (an assertion failure, an unexpected null, inconsistent state), you have a choice: try to continue in a possibly corrupt state, or stop immediately. Fail-fast advocates argue for stopping immediately because the error is caught at its point of origin, corruption cannot spread to other components, and debugging is far easier when the failure occurs close to its cause.
For example, a guard like `assert balance >= 0` fails fast on a negative balance instead of letting it flow into later calculations.

Erlang's 'Let It Crash' philosophy is the premier example of fail-fast thinking. Rather than adding defensive programming everywhere, processes crash on unexpected conditions. Supervisor processes detect crashes and restart workers, so the system is designed for components to fail cleanly and recover automatically. This approach powers some of the world's most reliable telecom systems.
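A minimal sketch of that guard (the numbers are made up; note that Python strips `assert` statements under `-O`, so production code would typically raise an explicit exception instead):

```python
def apply_withdrawal(balance: float, amount: float) -> float:
    """Return the new balance, failing fast if the invariant breaks."""
    new_balance = balance - amount
    # Fail fast here, at the point of the problem, rather than letting a
    # negative balance flow into later statements, transfers, and reports.
    assert new_balance >= 0, f"balance would go negative: {new_balance}"
    return new_balance


print(apply_withdrawal(100.0, 30.0))    # 70.0
print(apply_withdrawal(70.0, 500.0))    # AssertionError raised immediately
```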
Fail-fast excels where a wrong result is worse than no result: data consistency, financial correctness, security boundaries such as authentication, and any operation whose corruption is permanent or spreads.
Fail-safe systems are based on a different principle: availability is so critical that continued operation may be more valuable than perfect correctness.
The Fail-Safe Argument:
Not all failures are catastrophic; many anomalies are transient or minor. Crashing on every unexpected condition trades availability for correctness and turns minor glitches into user-visible interruptions. Fail-safe advocates argue that systems should absorb such anomalies where they can, degrade gracefully to reduced functionality, and keep serving users while surfacing the problem for later investigation.
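As an illustration of that stance, here is a hedged sketch of a fail-safe read path that serves last-known-good data when the upstream call fails (the weather service and cache are invented for the example):

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Last known-good response, used as the degraded fallback.
_cached_forecast: dict = {"city": "Oslo", "temp_c": 4, "stale": False}


def fetch_forecast() -> dict:
    """Stand-in for a flaky upstream service."""
    if random.random() < 0.5:
        raise ConnectionError("weather service unreachable")
    return {"city": "Oslo", "temp_c": 6, "stale": False}


def get_forecast() -> dict:
    # Fail-safe: absorb the transient failure, serve slightly stale data,
    # and log the absorbed error so it stays visible to operators.
    global _cached_forecast
    try:
        _cached_forecast = fetch_forecast()
    except ConnectionError:
        logger.warning("live forecast unavailable; serving cached value")
        return {**_cached_forecast, "stale": True}
    return _cached_forecast


print(get_forecast())
```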
Overly aggressive fail-safe design can mask serious problems. A system that absorbs all errors may accumulate corruption silently until it catastrophically fails. Signal-to-noise ratio degrades as teams ignore 'routine' errors that turn out to be symptoms of critical issues.
The choice between fail-safe and fail-fast depends on multiple factors. Rather than applying one philosophy universally, analyze each component and failure mode individually.
Key decision factors include the consequence of a wrong result, whether the damage is reversible, the cost of downtime to users, and whether an acceptable fallback exists. The table below maps common contexts to a sensible default:
| Context | Recommended | Rationale |
|---|---|---|
| Database writes | Fail-Fast | Corruption is permanent and spreading |
| Read caching | Fail-Safe | Stale data often acceptable; invalidation fixes |
| Financial transactions | Fail-Fast | Incorrect money movement is catastrophic |
| Content serving (CDN) | Fail-Safe | Stale content better than no content |
| Authentication | Fail-Fast | Wrong auth decision is security breach |
| Recommendations | Fail-Safe | No/default recommendations acceptable |
| Order processing | Fail-Fast | Partial orders create fulfillment nightmare |
| Search results | Fail-Safe | Partial/cached results better than nothing |
| Metrics collection | Fail-Safe | Missing data points better than crash |
| Audit logging | Fail-Fast | Missing audit records may be compliance violation |
Most production systems use a hybrid: fail-fast for critical invariants (data consistency, security boundaries, financial correctness) and fail-safe for non-critical functionality (personalization, recommendations, analytics). The key is explicit classification of what's critical.
Implementing either approach requires careful engineering. Both can be done poorly, leading to systems that are neither safe nor fast.
Fail-fast implementations need rich crash diagnostics, a supervision or restart mechanism, and durable state so that a crash does not lose data. Fail-safe implementations need well-defined fallbacks, guards that keep absorbed errors from corrupting state, and metrics that make every absorbed error visible.
Fail-fast done poorly: crashes without useful diagnostics, no restart mechanisms, data loss on crash. Fail-safe done poorly: silent data corruption, ignored errors that indicate serious problems, systems that 'work' but produce wrong results.
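To make the fail-safe pitfall concrete, here is a small contrast between a careless and a careful wrapper around the same parsing step (`absorbed_errors` stands in for a real metrics counter):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
absorbed_errors = 0  # stand-in for a real metrics counter


def load_prefs_poorly(raw: str) -> dict:
    # Fail-safe done poorly: the error vanishes, and nobody notices that
    # every request is silently falling back to default preferences.
    try:
        return json.loads(raw)
    except Exception:
        return {}


def load_prefs(raw: str) -> dict:
    # Fail-safe done carefully: same fallback, but the absorbed error is
    # logged and counted so it shows up on a dashboard.
    global absorbed_errors
    try:
        return json.loads(raw)
    except Exception:
        absorbed_errors += 1
        logger.exception("failed to parse preferences; using defaults")
        return {}


print(load_prefs_poorly("not json"))  # {} and no trace of the problem
print(load_prefs("not json"))         # {} plus a logged, counted error
```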
Examining real systems illuminates how these philosophies work in practice.
Case Study 1: Erlang/OTP — The Let It Crash Philosophy
Erlang is perhaps the most influential example of fail-fast design. Telecom systems built on Erlang have famously reported availability of 99.9999999% (nine nines). How? Processes are lightweight and isolated, they crash on unexpected conditions rather than limping along with corrupt state, and supervisor processes restart crashed workers in a known-good state.
The philosophy: 'If you don't know how to handle an error, crash. Somebody else will restart you in a known-good state.'
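Erlang's supervision trees are built into OTP; the loop below is only a rough Python analogy of the restart-and-escalate idea, not a faithful model of how OTP works:

```python
def worker(job: int) -> int:
    """Crashes on an unexpected condition instead of guessing."""
    if job < 0:
        raise ValueError(f"unexpected job id: {job}")
    return job * 2


def supervise(jobs: list[int], max_restarts: int = 3) -> list[int]:
    # Rough supervisor analogy: when a worker crashes, note it, "restart"
    # it in a clean state for the next job, and escalate if crashes pile up.
    results: list[int] = []
    restarts = 0
    for job in jobs:
        try:
            results.append(worker(job))
        except Exception as exc:
            restarts += 1
            print(f"worker crashed ({exc!r}); restart #{restarts}")
            if restarts > max_restarts:
                raise  # escalate to the next supervisor up the tree
    return results


print(supervise([1, 2, -1, 3]))  # crashes once, recovers, returns [2, 4, 6]
```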
Case Study 2: Amazon DynamoDB — Fail-Safe for Availability
DynamoDB prioritizes availability over immediate consistency. When a write can't reach all replicas, it is accepted by the replicas that are reachable and propagated to the rest later; reads may briefly return stale data until the replicas converge.
The philosophy: 'Availability for writes is more valuable than immediate consistency across replicas. Eventual consistency is good enough for many use cases.'
Case Study 3: Google Spanner — Fail-Fast for Consistency
Spanner takes the opposite approach, prioritizing consistency over availability: writes commit synchronously across replicas, and if enough replicas cannot be reached the write fails rather than risk divergent copies of the data.
The philosophy: 'For global financial systems, incorrect balances are worse than temporary unavailability. Users can retry failed operations.'
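A toy model of the choice each system makes when a replica is unreachable (this illustrates the tradeoff only; it is not the real protocol of either system):

```python
def replicate(value: str, replicas: list[dict], require_all: bool) -> None:
    """Write `value` to a set of replicas, each a dict with an 'up' flag."""
    reachable = [r for r in replicas if r["up"]]

    if not reachable:
        raise RuntimeError("no replicas reachable; write rejected")

    if require_all and len(reachable) < len(replicas):
        # Spanner-style choice: refuse the write rather than let replicas
        # diverge; the caller can retry once all replicas are reachable.
        raise RuntimeError("not all replicas reachable; write rejected")

    # DynamoDB-style choice: accept the write on what is reachable and
    # reconcile the lagging replicas later (eventual consistency).
    for r in reachable:
        r["data"] = value


replicas = [{"up": True, "data": None}, {"up": False, "data": None}]
replicate("v1", replicas, require_all=False)   # succeeds; one replica is stale
print(replicas)
# replicate("v2", replicas, require_all=True)  # would raise instead
```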
All three systems are highly successful, powering critical infrastructure worldwide. The difference in approach reflects different requirements: telecom needs high per-call availability, DynamoDB needs always-on writes, Spanner needs global consistency. Choose based on requirements, not philosophy.
The most robust systems combine both approaches, using each where appropriate. This requires explicit analysis of failure modes and their consequences.
A practical framework starts with two classifications.
Classify components by consequence of failure: components whose failure can corrupt data, move money incorrectly, or breach security are critical and should fail fast; components whose failure merely degrades the experience, such as personalization, recommendations, or analytics, can fail safe behind a fallback.
Classify errors by type: transient errors such as timeouts or momentary resource exhaustion are good candidates for absorption and retry, while violated invariants, corrupted data, or impossible states should stop the operation immediately.
For each service, document: (1) Which invariants trigger fail-fast behavior, (2) Which failures are absorbed (fail-safe), (3) What fallbacks exist, (4) What metrics indicate absorbed errors. This documentation guides incident response and prevents accidentally changing behavior.
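One way to keep that classification explicit and reviewable is to encode it next to the code it governs; the policy table and wrapper below are only a sketch, with made-up operation names:

```python
import logging
from enum import Enum

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class FailurePolicy(Enum):
    FAIL_FAST = "fail_fast"
    FAIL_SAFE = "fail_safe"


# Explicit classification: which operations stop on failure, which degrade.
POLICIES: dict[str, FailurePolicy] = {
    "charge_card":   FailurePolicy.FAIL_FAST,   # financial correctness
    "write_audit":   FailurePolicy.FAIL_FAST,   # compliance requirement
    "recommend":     FailurePolicy.FAIL_SAFE,   # fallback: empty list
    "record_metric": FailurePolicy.FAIL_SAFE,   # fallback: drop the point
}


def run(operation: str, func, *args, fallback=None):
    """Dispatch on the declared policy instead of ad-hoc try/except blocks."""
    try:
        return func(*args)
    except Exception:
        if POLICIES[operation] is FailurePolicy.FAIL_FAST:
            raise  # propagate: abort the request at the point of the problem
        logger.exception("absorbed failure in %s", operation)
        return fallback  # degrade: keep serving with the documented fallback


print(run("recommend", lambda: 1 / 0, fallback=[]))  # logs, returns []
```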
We've explored the fundamental choice between prioritizing continued operation (fail-safe) and immediate problem detection (fail-fast). The essential insights: neither philosophy is universally correct; fail-fast buys early detection, a contained blast radius, and easier debugging at the cost of availability; fail-safe buys availability and a smoother user experience at the risk of masking serious problems; and robust systems mix the two, failing fast on critical invariants and failing safe on non-critical functionality.
Module Conclusion:
With this page, we conclude Module 1: Failure Is Inevitable. You now understand why failure is unavoidable in production systems, how to reason about failure modes and their consequences, and how to choose between fail-safe and fail-fast handling when anomalies occur.
This foundation prepares you for the specific fault tolerance patterns we'll study in subsequent modules: circuit breakers, bulkheads, timeouts, retries, and fallbacks.
You now have a comprehensive understanding that failure is inevitable in production systems. More importantly, you have the frameworks and vocabulary to reason about failure modes and design systems that handle them appropriately. The subsequent modules will build on this foundation with specific implementation patterns.