You've built a sophisticated application with redundant web servers, replicated databases, and multi-region deployment. But buried somewhere in your architecture is a single configuration service, a lone DNS resolver, or a unique authentication endpoint. When that single component fails, your carefully designed redundancy collapses. Your 99.99% availability calculation becomes meaningless because you overlooked a single point of failure.
Amdahl's Law for Availability: Just as Amdahl's Law limits parallelization gains to the sequential portion of code, your system's availability is limited by your least redundant component. A system is only as available as its weakest link.
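To see why, multiply availabilities along the request path: every component a request must traverse contributes a factor, so a single 99% dependency dominates a chain of 99.99% components. A quick back-of-the-envelope sketch (the figures below are illustrative assumptions, not from any real system):

```typescript
// Availability of components in series is (approximately) the product of
// their individual availabilities. Numbers are illustrative assumptions.
const requestPath = [
  { name: 'load balancer', availability: 0.9999 },
  { name: 'web tier', availability: 0.9999 },
  { name: 'database (replicated)', availability: 0.9999 },
  { name: 'config service (single instance)', availability: 0.99 }, // the weak link
];

const composite = requestPath.reduce((acc, c) => acc * c.availability, 1);
console.log(composite.toFixed(4)); // ~0.9897, i.e. roughly 99.0%, set by the weakest component
```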
Component redundancy addresses this by systematically identifying and eliminating single points of failure at every level of your architecture—from obvious elements like databases and load balancers to subtle dependencies like configuration sources, certificate authorities, and third-party APIs.
This page provides a comprehensive framework for identifying single points of failure, understanding component-level redundancy patterns, and building systems where no single component failure can bring down the service.
By the end of this page, you will understand how to identify single points of failure systematically, implement redundancy for different component types, handle stateful component redundancy challenges, and build architectures where component failures are isolated rather than cascading.
A Single Point of Failure (SPOF) is any component whose failure would cause the entire system (or a critical function) to become unavailable. SPOFs can be obvious or subtle, and finding them requires systematic analysis.
The Request Path Analysis
Trace a typical request through your entire system and ask: "If this component died, would service continue?"
Common Hidden SPOFs:
| Component | Redundancy Status | Failure Impact | Priority |
|---|---|---|---|
| Primary Database | Replicated | Data unavailable | Critical |
| Config Service | Single instance | New deploys fail | High |
| DNS Resolver | Single provider | All connections fail | Critical |
| Auth Service | Multi-instance | Login unavailable | High |
| Logging Pipeline | Single endpoint | Observability lost | Medium |
| Rate Limiter | Single Redis | Either block all or allow all | High |
Dependency Mapping
Create a comprehensive dependency graph:
The goal isn't eliminating every SPOF—that may be cost-prohibitive—but understanding and accepting them consciously.
Borrow from aerospace and medical device engineering: for each component, score Severity (impact of failure), Occurrence (probability of failure), and Detection (how likely the failure is to go unnoticed). Multiply the three for a Risk Priority Number that guides where redundancy investment pays off.
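A sketch of how that scoring could be automated; the 1-to-10 scales and the example scores are assumptions, not data from this page:

```typescript
// FMEA-style scoring on 1-10 scales (values are illustrative assumptions).
interface ComponentRisk {
  component: string;
  severity: number;    // 1 = negligible impact, 10 = total outage
  occurrence: number;  // 1 = very unlikely, 10 = frequent
  detection: number;   // 1 = detected immediately, 10 = likely to go unnoticed
}

const rpn = (r: ComponentRisk) => r.severity * r.occurrence * r.detection;

const risks: ComponentRisk[] = [
  { component: 'DNS resolver (single provider)', severity: 10, occurrence: 3, detection: 4 },
  { component: 'Config service (single instance)', severity: 7, occurrence: 4, detection: 6 },
  { component: 'Logging pipeline', severity: 4, occurrence: 5, detection: 3 },
];

// Highest RPN first: these are the components where redundancy pays off most.
risks
  .sort((a, b) => rpn(b) - rpn(a))
  .forEach((r) => console.log(`${r.component}: RPN=${rpn(r)}`));
```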
Stateless components are the easiest to make redundant. Since they hold no local state, any instance can handle any request, and failing instances can be replaced without data loss.
Web/Application Servers
The most common stateless component:
API Gateways
Entry points to your service mesh:
Workers/Consumers
Background job processors:
Proxies and Sidecars
Service mesh components:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3  # Maintains 3 instances (N+1 if N=2 needed)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Always keep at least 2 running
      maxSurge: 1
  template:
    spec:
      affinity:
        podAntiAffinity:  # Spread across nodes/zones
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: api
          image: api-server:v1.2.3
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          readinessProbe:  # Only receive traffic when ready
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:  # Restart if unresponsive
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2  # Never drop below 2 pods
  selector:
    matchLabels:
      app: api-server
```

Without anti-affinity rules, all your 'redundant' pods might land on the same node or in the same availability zone. When that node or zone fails, all pods fail together. Always configure anti-affinity to spread replicas across failure domains.
Stateful components present the core challenge of redundancy: you can't simply replace them because they hold irreplaceable data. Redundancy requires replication and careful state management.
Databases
The most critical stateful components:
Caches
Often treated as ephemeral, but performance may depend on them:
Message Queues
Reliable message delivery requires durability:
Search Indices
| Component Type | Pattern | Failover Time | Data Loss Risk |
|---|---|---|---|
| PostgreSQL | Streaming replication + Patroni | Seconds | Zero (sync) or minimal (async) |
| MySQL | Group Replication / InnoDB Cluster | Seconds | Zero with sync commit |
| Redis | Sentinel or Cluster mode | Seconds to minutes | Minimal (async replication) |
| Kafka | Partition replication (3x) | Immediate | Zero with acks=all |
| Elasticsearch | Shard replicas | Automatic | Zero with replicas |
| MongoDB | Replica Set (3+ members) | Seconds | Zero with majority write |
Key Design Principles for Stateful Redundancy:
1. Separate Compute from Storage
Where possible, decouple stateless compute from durable storage. Losing compute is cheap; losing data is expensive.
2. Externalize State
Move state out of application servers into purpose-built stateful services with their own redundancy (a short sketch follows these principles).
3. Accept CAP Tradeoffs
For distributed stateful systems, choose your tradeoff consciously: strong consistency with reduced availability, or high availability with eventual consistency.
4. Plan for State Recovery
Even with redundancy, have backup and restore procedures. Redundancy handles operational failures; backups handle corruption, disasters, and human error.
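To make principle 2 concrete, here is a minimal sketch of externalizing session state behind a stand-in `SessionStore` interface; all names are hypothetical, and in practice the store would be Redis, DynamoDB, or a database with its own replication:

```typescript
import { randomUUID } from 'node:crypto';

// Instead of holding sessions in process memory (lost when an instance dies),
// write them to an external store that has its own replication and failover.
interface Session {
  userId: string;
  issuedAt: number;
}

interface SessionStore {
  get(sessionId: string): Promise<Session | null>;
  put(sessionId: string, session: Session, ttlSeconds: number): Promise<void>;
}

class SessionService {
  constructor(private store: SessionStore) {}

  async createSession(userId: string): Promise<string> {
    const sessionId = randomUUID();
    // Any replica can now serve this session, so app servers stay disposable.
    await this.store.put(sessionId, { userId, issuedAt: Date.now() }, 3600);
    return sessionId;
  }

  async resolveSession(sessionId: string): Promise<Session | null> {
    return this.store.get(sessionId);
  }
}
```

With the session in an external store, any of the redundant app-server instances from the earlier Deployment can handle the user's next request.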
Some components appear stateless but have subtle state dependencies. Servers that cache configuration, hold circuit breaker state in memory, or accumulate metrics may misbehave when replaced. Identify and externalize all state, even the small bits.
Infrastructure components often become SPOFs because they're 'invisible'—shared services that don't appear in application architecture diagrams but are critical to operation.
Load Balancers
Often the front door to your entire system:
Cloud-managed (AWS ALB, Google Cloud Load Balancing, Azure Load Balancer):
Self-managed (HAProxy, NGINX):
DNS
Failure here breaks everything:
Secrets Management
Vault, AWS Secrets Manager, etc.:
```typescript
interface ConfigSource {
  name: string;
  priority: number;
  load(): Promise<Config | null>;
}

class ResilientConfigLoader {
  private cachedConfig: Config | null = null;
  private cacheTime: Date | null = null;
  private readonly cacheDurationMs = 5 * 60 * 1000; // 5 minutes

  constructor(private sources: ConfigSource[]) {
    // Sort by priority (highest first)
    this.sources.sort((a, b) => b.priority - a.priority);
  }

  async loadConfig(): Promise<Config> {
    // Try each source in priority order
    for (const source of this.sources) {
      try {
        const config = await source.load();
        if (config) {
          this.cachedConfig = config;
          this.cacheTime = new Date();
          console.log(`Loaded config from ${source.name}`);
          return config;
        }
      } catch (error) {
        console.warn(`Failed to load from ${source.name}: ${error}`);
        // Continue to next source
      }
    }

    // All sources failed - use cache if available and fresh enough
    if (this.cachedConfig && this.cacheTime) {
      const age = Date.now() - this.cacheTime.getTime();
      if (age < this.cacheDurationMs * 2) { // Extended cache during failure
        console.warn('All config sources failed, using stale cache');
        return this.cachedConfig;
      }
    }

    throw new Error('All configuration sources failed and no valid cache');
  }
}

// Usage: Multiple sources with fallback
const configLoader = new ResilientConfigLoader([
  { name: 'consul', priority: 100, load: () => fetchFromConsul() },
  { name: 's3', priority: 50, load: () => fetchFromS3() },
  { name: 'local', priority: 10, load: () => loadFromDisk() },
]);
```

For truly critical infrastructure (DNS, CDN), consider multi-vendor redundancy. Run authoritative DNS on both Route 53 and Cloudflare. Use multiple CDN providers with failover. This protects against vendor-specific outages—rare but impactful.
External dependencies—third-party APIs, SaaS services, partner integrations—introduce SPOFs outside your control. You can't make Stripe redundant by running two Stripes, but you can design your system to handle Stripe's unavailability.
Patterns for External Dependency Resilience:
1. Graceful Degradation
Design features to work (perhaps with reduced functionality) when dependencies fail, as in the recommendation example after this list:
2. Multi-Provider Strategies
For some services, multiple providers can serve the same function:
3. Caching and Local Fallbacks
Store data locally to survive dependency outages:
4. Async Decoupling
Don't make synchronous calls to dependencies when async would work:
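As an example of graceful degradation (pattern 1), a recommendation feature might fall back to a cached list of popular items when its personalization provider is down. The names below are hypothetical stand-ins:

```typescript
interface Product {
  id: string;
  name: string;
}

// Stand-ins for a third-party personalization API and a locally cached fallback.
declare const personalizationClient: { recommendFor(userId: string): Promise<Product[]> };
declare function getCachedPopularProducts(): Product[];

// If personalization fails, degrade to popular items rather than failing the page.
async function getRecommendations(userId: string): Promise<Product[]> {
  try {
    return await personalizationClient.recommendFor(userId);
  } catch (error) {
    console.warn(`Personalization unavailable, degrading gracefully: ${error}`);
    return getCachedPopularProducts();
  }
}
```

The multi-provider strategy (pattern 2) is what the payment-processor example below illustrates.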
```typescript
interface PaymentProvider {
  name: string;
  processPayment(amount: number, token: string): Promise<PaymentResult>;
  isHealthy(): Promise<boolean>;
}

class ResilientPaymentProcessor {
  private providerHealth: Map<string, { healthy: boolean; lastCheck: Date }> = new Map();

  constructor(private providers: PaymentProvider[]) {}

  async processPayment(amount: number, token: string): Promise<PaymentResult> {
    const orderedProviders = await this.getOrderedProviders();
    let lastError: Error | null = null;

    for (const provider of orderedProviders) {
      try {
        console.log(`Attempting payment via ${provider.name}`);
        const result = await provider.processPayment(amount, token);

        // Mark provider as healthy on success
        this.providerHealth.set(provider.name, { healthy: true, lastCheck: new Date() });
        return result;
      } catch (error) {
        console.error(`Payment failed via ${provider.name}: ${error}`);

        // Mark provider as unhealthy
        this.providerHealth.set(provider.name, { healthy: false, lastCheck: new Date() });
        lastError = error as Error;
        // Continue to next provider
      }
    }

    // All providers failed
    throw new Error(`All payment providers failed. Last error: ${lastError?.message}`);
  }

  private async getOrderedProviders(): Promise<PaymentProvider[]> {
    // Return healthy providers first, then unhealthy ones as last resort
    const healthy: PaymentProvider[] = [];
    const unhealthy: PaymentProvider[] = [];

    for (const provider of this.providers) {
      const status = this.providerHealth.get(provider.name);
      if (!status || status.healthy) {
        healthy.push(provider);
      } else {
        unhealthy.push(provider);
      }
    }

    return [...healthy, ...unhealthy];
  }
}
```

Fallback paths are rarely exercised in production. When exercised during actual failures, they often fail due to stale configurations, changed APIs, or untested edge cases. Regularly test fallback behavior by deliberately failing primary providers in staging and occasionally in production.
Component redundancy isn't just about running multiple copies—it's about ensuring that one component's failure doesn't cascade to others. Failure isolation contains the blast radius of component failures.
Bulkhead Pattern
Partition resources so that failure in one partition doesn't affect others:
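A minimal sketch of the idea: give each downstream dependency its own bounded concurrency budget, so a slow dependency can exhaust its own slots but not starve everything else. The limits are arbitrary:

```typescript
// A tiny semaphore-style bulkhead: each dependency gets its own instance,
// so one slow dependency can only exhaust its own slots.
class Bulkhead {
  private inFlight = 0;

  constructor(private readonly maxConcurrent: number, private readonly name: string) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxConcurrent) {
      // Fail fast instead of queueing, so other partitions keep their capacity.
      throw new Error(`Bulkhead '${this.name}' is full; rejecting to protect other work`);
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--;
    }
  }
}

// Separate budgets per dependency (numbers are illustrative).
const paymentsBulkhead = new Bulkhead(20, 'payments');
const reportingBulkhead = new Bulkhead(5, 'reporting');
```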
Circuit Breaker Pattern
Stop calling a failing service to prevent cascading load:
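A stripped-down sketch of the usual closed/open/half-open state machine; the thresholds and timings are placeholder values:

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,    // consecutive failures before opening
    private readonly resetTimeoutMs = 30_000, // how long to stay open before probing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast instead of calling a known-bad service');
      }
      this.state = 'half-open'; // allow a single probe request through
    }
    try {
      const result = await fn();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```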
Timeout and Deadline Patterns
Prevent slow components from blocking healthy ones:
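One simple way to impose a deadline on any async call; the budget in the usage comment is just an example:

```typescript
// Reject if the wrapped call does not settle within the deadline, so one slow
// dependency cannot tie up request handlers indefinitely.
async function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins; the loser is abandoned.
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // always release the timer
  }
}

// Usage (hypothetical call): give the inventory lookup a 2-second budget.
// const stock = await withTimeout(inventoryClient.getStock(sku), 2000, 'inventory lookup');
```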
Queue-Based Decoupling
Asynchronous processing isolates producer from consumer failures:
Apply multiple isolation patterns simultaneously. A circuit breaker alone helps, but a circuit breaker with timeouts, bulkheads, and queue decoupling provides much stronger isolation. Each layer catches failures that slip through the others.
Redundancy that hasn't been tested is unreliable redundancy. Component redundancy must be validated through systematic failure injection.
Unit-Level Failure Testing
Test individual component failover in isolation:
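For example, the ResilientPaymentProcessor shown earlier can be exercised with a provider that always fails, to confirm the fallback path actually engages. A sketch assuming a Jest-style test runner and the types from that example:

```typescript
// Fake providers: the primary always throws, the secondary succeeds.
const failingPrimary: PaymentProvider = {
  name: 'primary',
  processPayment: async () => { throw new Error('injected failure'); },
  isHealthy: async () => false,
};
const workingSecondary: PaymentProvider = {
  name: 'secondary',
  processPayment: async () => ({ success: true } as PaymentResult),
  isHealthy: async () => true,
};

test('falls back to the secondary provider when the primary fails', async () => {
  const processor = new ResilientPaymentProcessor([failingPrimary, workingSecondary]);
  const result = await processor.processPayment(100, 'tok_test');
  expect(result).toEqual({ success: true });
});
```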
Integration Failure Testing
Test failures in context of full system:
Chaos Engineering
Regular, automated failure injection in production:
GameDays
Orchestrated failure exercises with the team:
| Component | Failure Test | Expected Behavior | Recovery Test |
|---|---|---|---|
| App Server | Terminate instance | Load balancer routes around | New instance joins pool |
| Database Primary | Stop primary process | Standby promoted | Old primary rejoins as replica |
| Cache Cluster | Kill cache node | Requests hit origin | Node rejoins, cache rebuilds |
| Queue | Block queue access | Dead letter handling | Queue resumes, backlog processes |
| Config Service | Make config unavailable | Use cached config | Fetch fresh config on recovery |
Begin chaos testing in development, then staging, then production during low-traffic periods, then production during normal traffic. Build confidence incrementally. Never start chaos testing in production without extensive staging validation.
Component redundancy eliminates the weakest links in your architecture—the single points of failure that could bring down an otherwise well-designed system. It requires systematic identification, appropriate patterns for different component types, and regular testing to validate.
Module Complete:
You've now completed the Redundancy Patterns module. You understand:
These patterns work together to build systems that maintain availability despite hardware failures, software bugs, network partitions, and human error. The next module will explore Failover Strategies—the mechanisms for detecting failures and executing the transitions that redundancy enables.
You've mastered redundancy patterns—from active-passive through component-level redundancy. These patterns form the foundation of high availability architecture. Apply them systematically to eliminate single points of failure and build systems that survive the inevitable failures of distributed computing.