Traditional engineering disciplines have long understood that failure is inevitable. Civil engineers don't ask 'Will this bridge ever experience stress?' but 'How much stress, and how will the bridge respond?' Aerospace engineers don't assume perfect flight conditions but design for turbulence, ice, bird strikes, and engine failures.
Software engineering has been slower to internalize this lesson. Too often, systems are designed for the happy path, with failure handling added as an afterthought—if at all. This approach creates brittle systems that work well until they catastrophically don't.
Designing for failure means treating failure scenarios as first-class design inputs, not edge cases. It means asking 'What happens when this fails?' for every component, every connection, every dependency. It means building resilience into the architecture itself, not bolting it on later.
By the end of this page, you will master the principles and practices of failure-aware design: defensive architecture, failure domain isolation, graceful degradation strategies, redundancy patterns, and the organizational culture that sustains resilient systems. You'll understand how to bake resilience into systems from conception.
Designing for failure requires a fundamental shift in mindset. Instead of asking 'How do I make this work?' you must also ask 'How will this fail, and what happens then?'
The Core Philosophical Shift:
| Traditional Design | Failure-Aware Design |
|---|---|
| Failure is exceptional | Failure is expected |
| Design for success | Design for recovery |
| Minimize failure probability | Minimize failure impact |
| Perfect systems | Resilient systems |
| Prevent all failures | Contain and recover from failures |
This shift doesn't mean abandoning quality or accepting poor reliability. It means recognizing that no matter how well-built a system is, failures will occur. The question is not whether they will happen, but when, and how your system responds when they do.
Key Principles:
At Amazon, services must define their behavior during every failure mode before launch. 'What happens when dependency X is unavailable?' isn't optional—it's a required design artifact. This mandate has prevented countless cascading failures.
Defensive architecture is the practice of building systems with protection against failures at every layer. It's the architectural equivalent of a castle's layered defenses: walls, moat, drawbridge, keep—each providing protection if outer layers are breached.
Core Defensive Patterns:
The Layered Defense Model:
Defensive architecture works in layers, each catching failures that slip past the outer layers.
Each layer assumes the others might fail and provides its own protections.
Every defensive layer adds complexity. Complexity itself can cause failures. Balance defense depth against operational simplicity. A system so complex that no one fully understands it is dangerous regardless of how many defensive layers it has.
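To make the layering concrete, here is a minimal Python sketch of one call path, under stated assumptions: `fetch` and `fallback` are hypothetical callables standing in for a real dependency client and a cached default, and `fetch` is assumed to raise `TimeoutError` when its 0.5-second budget is exceeded. The per-call timeout is the innermost layer, bounded retries with backoff the next, a circuit breaker the next, and a local fallback the last.

```python
import random
import time

class CircuitBreaker:
    """Opens after repeated failures so callers stop hammering a failing dependency."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe request through once the reset timeout has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker()

def get_profile(user_id, fetch, fallback):
    """Layered defense: circuit breaker -> bounded retries -> timeout -> local fallback."""
    if breaker.allow():
        for attempt in range(3):                        # bounded retries, never infinite
            try:
                result = fetch(user_id, timeout=0.5)    # per-call timeout (innermost layer)
                breaker.record_success()
                return result
            except TimeoutError:
                time.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.1))  # backoff + jitter
        breaker.record_failure()
    return fallback(user_id)                            # degrade instead of propagating failure
```

Each layer assumes the one inside it can fail: retries cover transient timeouts, the breaker stops retry storms against a dead dependency, and the fallback keeps the caller functional when everything else gives up.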
A failure domain is a scope within which failures are correlated. Components within the same failure domain are likely to fail together; components in different failure domains fail independently.
Examples of failure domains:
Designing Failure Domains:
| Domain Type | Typical Scope | Failure Cause Examples | Design Response |
|---|---|---|---|
| Host | 1 server | Hardware failure, kernel panic | Replicate across hosts |
| Rack | 20-40 servers | Top-of-rack (ToR) switch failure, power circuit fault | Replicate across racks |
| Availability Zone | One datacenter building | Cooling failure, network isolation | Replicate across AZs |
| Region | Multiple AZs in geography | Natural disaster, regional power | Replicate across regions |
| Shard | Subset of data/users | Shard database failure | Shard isolation, failover |
| Tenant | One customer's resources | Tenant overload, bad data | Tenant isolation |
| Deployment | All instances of new version | Bad deploy | Canary, blue-green |
Many failure domains are hidden: shared library versions, shared configuration sources, shared DNS resolvers, shared monitoring systems. Audit not just infrastructure placement but all shared dependencies. A config server is a failure domain for all services that read from it.
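One practical audit is to check that replicas of a service actually span the failure domains in the table above. The following Python sketch illustrates the idea; the placement records, domain names, and thresholds are all hypothetical.

```python
from collections import Counter

# Hypothetical placement records: where each replica of one service actually runs.
replicas = [
    {"host": "h-101", "rack": "r-1", "zone": "us-east-1a"},
    {"host": "h-214", "rack": "r-7", "zone": "us-east-1b"},
    {"host": "h-330", "rack": "r-9", "zone": "us-east-1c"},
]

def check_spread(replicas, domain, min_distinct):
    """Flag the replica set if too many copies share a single failure domain."""
    counts = Counter(r[domain] for r in replicas)
    ok = len(counts) >= min_distinct
    print(f"{domain}: {len(counts)} distinct {dict(counts)} -> {'OK' if ok else 'AT RISK'}")
    return ok

# Thresholds are illustrative: surviving a single rack or AZ failure requires
# replicas in at least two distinct racks and two distinct zones.
check_spread(replicas, "rack", min_distinct=2)
check_spread(replicas, "zone", min_distinct=2)
```

The same check extends to hidden domains such as shared config sources or DNS resolvers, provided you record them per replica.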
Graceful degradation is the practice of designing systems to provide reduced but still valuable functionality when components fail. Rather than complete failure, the system continues operating in a diminished capacity.
The Degradation Spectrum:
Systems don't have to be binary (fully working or completely broken); they can operate at multiple intermediate levels.
Each level should be explicitly designed, not accidental.
Implementing Graceful Degradation:
1. Identify Core Value — What's the minimum viable experience? An e-commerce site must show products. Add-to-cart can degrade. Recommendations can disappear.
2. Categorize Features — Critical (must work), important (should work), nice-to-have (can fail). Each category gets its own degradation threshold.
3. Design Fallbacks — For each component, define what happens when it fails. Document and test these fallbacks.
4. Implement Load Shedding — When overwhelmed, shed non-critical traffic to preserve critical paths.
5. Communicate Degradation — Users should know when they're getting a degraded experience. 'Recommendations temporarily unavailable' is better than nothing.
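A minimal Python sketch of the fallback and communication steps above, assuming hypothetical `catalog` and `recommender` clients: the critical feature is allowed to fail loudly, while the nice-to-have feature falls back and the degradation is surfaced to the user.

```python
def build_product_page(product_id, catalog, recommender):
    """Assemble a page where non-critical features degrade instead of failing the page."""
    page = {"degraded_features": []}

    # Critical: the page is worthless without product details, so let this raise.
    page["product"] = catalog.get_product(product_id)

    # Nice-to-have: recommendations may disappear, but tell the user why.
    try:
        page["recommendations"] = recommender.for_product(product_id, timeout=0.3)
    except Exception:
        page["recommendations"] = []
        page["degraded_features"].append("Recommendations temporarily unavailable")

    return page
```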
Netflix categorizes all features by criticality and has explicit fallbacks for each. If the recommendation service fails, users see static popular lists. If playback fails, users see a friendly error message with a retry option. They regularly test these fallbacks by intentionally breaking services in production (Chaos Engineering).
Redundancy is the duplication of critical components or functions to increase reliability. When one component fails, a redundant component takes over. Different redundancy strategies suit different requirements.
Redundancy Configurations:
| Type | Failover Time | Cost | Complexity | Best For |
|---|---|---|---|---|
| Active-Active | Instant | High | High | Traffic that can be distributed |
| Active-Passive (Hot) | Seconds | Medium | Medium | Stateful services |
| Active-Passive (Cold) | Minutes | Lower | Lower | Non-critical systems |
| N+1 | Depends on config | Medium | Low | Homogeneous compute |
| Geographic | Seconds to minutes | High | Very High | Disaster recovery |
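As one way to implement the hot active-passive row, here is a minimal Python sketch of lease-based leadership; the `store.compare_and_set` lease client is a hypothetical stand-in for a strongly consistent store. The lease is what keeps both instances from acting as primary at once, the split-brain pitfall described below.

```python
import time

class LeaseLock:
    """Hypothetical lease client backed by a strongly consistent store.
    Only the current holder of an unexpired lease may act as primary."""

    def __init__(self, store, key, ttl=10.0):
        self.store, self.key, self.ttl = store, key, ttl

    def try_acquire(self, owner):
        # Atomically claim (or renew) the lease if it is free, expired,
        # or already held by `owner`; returns True on success.
        return self.store.compare_and_set(self.key, owner, ttl=self.ttl)

def run_replica(name, store, serve_as_primary, serve_as_standby):
    lock = LeaseLock(store, key="orders-primary")
    while True:
        if lock.try_acquire(owner=name):
            serve_as_primary()       # each call does a bounded slice of work, then returns
        else:
            serve_as_standby()       # keep replicating state, ready to take over
        time.sleep(lock.ttl / 3)     # renew well before the lease expires
```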
Redundancy Pitfalls:
Correlated Failures — Redundant components that fail together aren't actually redundant. Same software bug, same config error, same failure mode.
Failover Bugs — The code that detects failure and triggers failover is often the least tested code in the system. Failover that doesn't work when needed is worthless.
Split-Brain — Both 'active' instances thinking they're the primary. Can cause data corruption. Requires careful coordination.
Cascade After Failover — Traffic moving to surviving instances overwhelms them. They fail. More traffic moves. Cascade ensues.
Delayed Detection — Failures aren't detected promptly, so failover takes too long. Users experience extended outage.
Redundancy you haven't tested is redundancy you can't trust. Regularly fail primary components and verify standby takeover works correctly. This is so important that major companies schedule 'Game Days' specifically to test failover systems.
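A Game Day can be partially automated. The sketch below (the `cluster` and `health` control-plane clients are hypothetical) deliberately terminates the primary and checks that a new primary is serving within the recovery-time objective.

```python
import time

def failover_drill(cluster, health, rto_seconds=30):
    """Deliberately fail the primary and verify the standby takes over within the RTO."""
    old_primary = cluster.current_primary()
    cluster.terminate(old_primary)                   # inject the failure on purpose
    deadline = time.monotonic() + rto_seconds
    while time.monotonic() < deadline:
        if health.is_serving() and cluster.current_primary() != old_primary:
            print("Failover completed within the recovery-time objective")
            return True
        time.sleep(1)
    print("Failover did NOT complete in time -- this standby cannot be trusted")
    return False
```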
Dependencies are the connections between components in a system. Each dependency is a path for failure propagation. Managing dependencies is essential for limiting blast radius and enabling graceful degradation.
Dependency Analysis:
For each service, understand:
The Dependency Inversion for Resilience:
Traditional thinking: 'My service works if all dependencies work.'
Resilient thinking: 'My service provides value even when dependencies fail.'
This inversion requires active design:
Create and maintain service dependency maps. Visualize the graph of dependencies. Identify chokepoints where many services depend on one. These chokepoints are high-priority resilience investments.
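A dependency map doesn't need heavyweight tooling to be useful. The following Python sketch, over a hypothetical edge list, counts fan-in to surface chokepoints.

```python
from collections import defaultdict

# Hypothetical dependency edges: (caller, dependency).
edges = [
    ("checkout", "payments"), ("checkout", "inventory"), ("checkout", "auth"),
    ("search", "auth"), ("profile", "auth"), ("orders", "payments"), ("orders", "auth"),
]

# Count fan-in per component; a component many services depend on is a chokepoint
# whose failure propagates to every caller.
dependents = defaultdict(set)
for caller, dep in edges:
    dependents[dep].add(caller)

for dep, callers in sorted(dependents.items(), key=lambda kv: -len(kv[1])):
    print(f"{dep}: {len(callers)} dependents -> {sorted(callers)}")
# In this toy graph `auth` has the most dependents, so it is the first place
# to invest in redundancy and to give callers explicit fallbacks.
```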
Technical patterns alone don't create resilient systems. The organization must embrace resilience as a value, with practices that reinforce it.
Cultural Elements:
The Learning Organization:
Resilience improves through learning. Every incident is a learning opportunity if the organization captures and applies lessons.
Incident → Post-Mortem → Action Items → Implementation → Verification
This cycle must complete. Action items that never get implemented are wasted learning. Organizations that break the cycle repeat failures.
Measuring Resilience Culture:
Netflix's resilience culture is legendary. They run Chaos Monkey in production constantly. Engineers expect random failures and design accordingly. Their Simian Army includes tools to kill instances, fail networks, and corrupt data. 'If our systems can't handle a Chaos Monkey, users will find that out eventually anyway—we'd rather find it first.'
We've explored the comprehensive approach to designing systems that embrace failure. Let's consolidate the essential practices:
What's next:
Having established the design philosophy for failure, we'll examine the specific decision between fail-safe and fail-fast approaches. Should a system prioritize continuing operation at all costs, or stopping immediately when problems are detected? This seemingly simple choice has profound implications for system behavior.
You now have a comprehensive framework for designing failure-resilient systems: defensive architecture, failure domains, graceful degradation, redundancy patterns, dependency management, and the cultural practices that sustain resilience. These principles will inform every fault tolerance pattern we study.