You've built a sophisticated application with redundant web servers, replicated databases, and multi-region deployment. But buried somewhere in your architecture is a single configuration service, a lone DNS resolver, or a unique authentication endpoint. When that single component fails, your carefully designed redundancy collapses. Your 99.99% availability calculation becomes meaningless because you overlooked a single point of failure.
Amdahl's Law for Availability: Just as Amdahl's Law limits parallelization gains to the sequential portion of code, your system's availability is limited by your least redundant component. A system is only as available as its weakest link.
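To see why, multiply availabilities along the request path: every component a request must traverse contributes a factor, so a single 99% dependency dominates a chain of 99.99% components. A quick back-of-the-envelope sketch (the figures below are illustrative assumptions, not from any real system):

```typescript
// Availability of components in series is (approximately) the product of
// their individual availabilities. Numbers are illustrative assumptions.
const requestPath = [
  { name: 'load balancer', availability: 0.9999 },
  { name: 'web tier', availability: 0.9999 },
  { name: 'database (replicated)', availability: 0.9999 },
  { name: 'config service (single instance)', availability: 0.99 }, // the weak link
];

const composite = requestPath.reduce((acc, c) => acc * c.availability, 1);
console.log(composite.toFixed(4)); // ~0.9897, i.e. roughly 99.0%, set by the weakest component
```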
Component redundancy addresses this by systematically identifying and eliminating single points of failure at every level of your architecture—from obvious elements like databases and load balancers to subtle dependencies like configuration sources, certificate authorities, and third-party APIs.
This page provides a comprehensive framework for identifying single points of failure, understanding component-level redundancy patterns, and building systems where no single component failure can bring down the service.
By the end of this page, you will understand how to identify single points of failure systematically, implement redundancy for different component types, handle stateful component redundancy challenges, and build architectures where component failures are isolated rather than cascading.
A Single Point of Failure (SPOF) is any component whose failure would cause the entire system (or a critical function) to become unavailable. SPOFs can be obvious or subtle, and finding them requires systematic analysis.
The Request Path Analysis
Trace a typical request through your entire system and ask: "If this component died, would service continue?"
Common Hidden SPOFs:
| Component | Redundancy Status | Failure Impact | Priority |
|---|---|---|---|
| Primary Database | Replicated | Data unavailable | Critical |
| Config Service | Single instance | New deploys fail | High |
| DNS Resolver | Single provider | All connections fail | Critical |
| Auth Service | Multi-instance | Login unavailable | High |
| Logging Pipeline | Single endpoint | Observability lost | Medium |
| Rate Limiter | Single Redis | Either block all or allow all | High |
Dependency Mapping
Create a comprehensive dependency graph:
The goal isn't eliminating every SPOF—that may be cost-prohibitive—but understanding and accepting them consciously.
Borrow from aerospace and medical device engineering: for each component, score Severity (impact of failure), Occurrence (probability of failure), and Detection (how likely the failure is to go unnoticed). Multiply the three for a Risk Priority Number that guides where redundancy investment pays off.
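A sketch of how that scoring could be automated; the 1-to-10 scales and the example scores are assumptions, not data from this page:

```typescript
// FMEA-style scoring on 1-10 scales (values are illustrative assumptions).
interface ComponentRisk {
  component: string;
  severity: number;    // 1 = negligible impact, 10 = total outage
  occurrence: number;  // 1 = very unlikely, 10 = frequent
  detection: number;   // 1 = detected immediately, 10 = likely to go unnoticed
}

const rpn = (r: ComponentRisk) => r.severity * r.occurrence * r.detection;

const risks: ComponentRisk[] = [
  { component: 'DNS resolver (single provider)', severity: 10, occurrence: 3, detection: 4 },
  { component: 'Config service (single instance)', severity: 7, occurrence: 4, detection: 6 },
  { component: 'Logging pipeline', severity: 4, occurrence: 5, detection: 3 },
];

// Highest RPN first: these are the components where redundancy pays off most.
risks
  .sort((a, b) => rpn(b) - rpn(a))
  .forEach((r) => console.log(`${r.component}: RPN=${rpn(r)}`));
```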
Stateless components are the easiest to make redundant. Since they hold no local state, any instance can handle any request, and failing instances can be replaced without data loss.
Web/Application Servers
The most common stateless component:
API Gateways
Entry points to your service mesh:
Workers/Consumers
Background job processors:
Proxies and Sidecars
Service mesh components:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3  # Maintains 3 instances (N+1 if N=2 needed)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Always keep at least 2 running
      maxSurge: 1
  template:
    spec:
      affinity:
        podAntiAffinity:  # Spread across nodes/zones
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: api
          image: api-server:v1.2.3
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          readinessProbe:  # Only receive traffic when ready
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:  # Restart if unresponsive
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2  # Never drop below 2 pods
  selector:
    matchLabels:
      app: api-server
```

Without anti-affinity rules, all your 'redundant' pods might land on the same node or in the same availability zone. When that node or zone fails, all pods fail together. Always configure anti-affinity to spread replicas across failure domains.
Stateful components present the core challenge of redundancy: you can't simply replace them because they hold irreplaceable data. Redundancy requires replication and careful state management.
Databases
The most critical stateful components:
Caches
Often treated as ephemeral, but performance may depend on them:
Message Queues
Reliable message delivery requires durability:
Search Indices
| Component Type | Pattern | Failover Time | Data Loss Risk |
|---|---|---|---|
| PostgreSQL | Streaming replication + Patroni | Seconds | Zero (sync) or minimal (async) |
| MySQL | Group Replication / InnoDB Cluster | Seconds | Zero with sync commit |
| Redis | Sentinel or Cluster mode | Seconds to minutes | Minimal (async replication) |
| Kafka | Partition replication (3x) | Immediate | Zero with acks=all |
| Elasticsearch | Shard replicas | Automatic | Zero with replicas |
| MongoDB | Replica Set (3+ members) | Seconds | Zero with majority write |
Key Design Principles for Stateful Redundancy:
1. Separate Compute from Storage
Where possible, decouple stateless compute from durable storage. Losing compute is cheap; losing data is expensive.
2. Externalize State
Move state out of application servers into purpose-built stateful services with their own redundancy (a short sketch follows these principles).
3. Accept CAP Tradeoffs
For distributed stateful systems, choose your tradeoff consciously: strong consistency with reduced availability, or high availability with eventual consistency.
4. Plan for State Recovery
Even with redundancy, have backup and restore procedures. Redundancy handles operational failures; backups handle corruption, disasters, and human error.
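To make principle 2 concrete, here is a minimal sketch of externalizing session state behind a stand-in `SessionStore` interface; all names are hypothetical, and in practice the store would be Redis, DynamoDB, or a database with its own replication:

```typescript
import { randomUUID } from 'node:crypto';

// Instead of holding sessions in process memory (lost when an instance dies),
// write them to an external store that has its own replication and failover.
interface Session {
  userId: string;
  issuedAt: number;
}

interface SessionStore {
  get(sessionId: string): Promise<Session | null>;
  put(sessionId: string, session: Session, ttlSeconds: number): Promise<void>;
}

class SessionService {
  constructor(private store: SessionStore) {}

  async createSession(userId: string): Promise<string> {
    const sessionId = randomUUID();
    // Any replica can now serve this session, so app servers stay disposable.
    await this.store.put(sessionId, { userId, issuedAt: Date.now() }, 3600);
    return sessionId;
  }

  async resolveSession(sessionId: string): Promise<Session | null> {
    return this.store.get(sessionId);
  }
}
```

With the session in an external store, any of the redundant app-server instances from the earlier Deployment can handle the user's next request.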
Some components appear stateless but have subtle state dependencies. Servers that cache configuration, hold circuit breaker state in memory, or accumulate metrics may misbehave when replaced. Identify and externalize all state, even the small bits.
Infrastructure components often become SPOFs because they're 'invisible'—shared services that don't appear in application architecture diagrams but are critical to operation.
Load Balancers
Often the front door to your entire system:
Cloud-managed (AWS ALB, Google Cloud Load Balancing, Azure Load Balancer):
Self-managed (HAProxy, NGINX):
DNS
Failure here breaks everything:
Secrets Management
Vault, AWS Secrets Manager, etc.:
```typescript
interface ConfigSource {
  name: string;
  priority: number;
  load(): Promise<Config | null>;
}

class ResilientConfigLoader {
  private cachedConfig: Config | null = null;
  private cacheTime: Date | null = null;
  private readonly cacheDurationMs = 5 * 60 * 1000; // 5 minutes

  constructor(private sources: ConfigSource[]) {
    // Sort by priority (highest first)
    this.sources.sort((a, b) => b.priority - a.priority);
  }

  async loadConfig(): Promise<Config> {
    // Try each source in priority order
    for (const source of this.sources) {
      try {
        const config = await source.load();
        if (config) {
          this.cachedConfig = config;
          this.cacheTime = new Date();
          console.log(`Loaded config from ${source.name}`);
          return config;
        }
      } catch (error) {
        console.warn(`Failed to load from ${source.name}: ${error}`);
        // Continue to next source
      }
    }

    // All sources failed - use cache if available and fresh enough
    if (this.cachedConfig && this.cacheTime) {
      const age = Date.now() - this.cacheTime.getTime();
      if (age < this.cacheDurationMs * 2) { // Extended cache during failure
        console.warn('All config sources failed, using stale cache');
        return this.cachedConfig;
      }
    }

    throw new Error('All configuration sources failed and no valid cache');
  }
}

// Usage: Multiple sources with fallback
const configLoader = new ResilientConfigLoader([
  { name: 'consul', priority: 100, load: () => fetchFromConsul() },
  { name: 's3', priority: 50, load: () => fetchFromS3() },
  { name: 'local', priority: 10, load: () => loadFromDisk() },
]);
```

For truly critical infrastructure (DNS, CDN), consider multi-vendor redundancy. Run authoritative DNS on both Route 53 and Cloudflare. Use multiple CDN providers with failover. This protects against vendor-specific outages—rare but impactful.
External dependencies—third-party APIs, SaaS services, partner integrations—introduce SPOFs outside your control. You can't make Stripe redundant by running two Stripes, but you can design your system to handle Stripe's unavailability.
Patterns for External Dependency Resilience:
1. Graceful Degradation
Design features to work (perhaps with reduced functionality) when dependencies fail, as in the recommendation example after this list:
2. Multi-Provider Strategies
For some services, multiple providers can serve the same function:
3. Caching and Local Fallbacks
Store data locally to survive dependency outages:
4. Async Decoupling
Don't make synchronous calls to dependencies when async would work:
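As an example of graceful degradation (pattern 1), a recommendation feature might fall back to a cached list of popular items when its personalization provider is down. The names below are hypothetical stand-ins:

```typescript
interface Product {
  id: string;
  name: string;
}

// Stand-ins for a third-party personalization API and a locally cached fallback.
declare const personalizationClient: { recommendFor(userId: string): Promise<Product[]> };
declare function getCachedPopularProducts(): Product[];

// If personalization fails, degrade to popular items rather than failing the page.
async function getRecommendations(userId: string): Promise<Product[]> {
  try {
    return await personalizationClient.recommendFor(userId);
  } catch (error) {
    console.warn(`Personalization unavailable, degrading gracefully: ${error}`);
    return getCachedPopularProducts();
  }
}
```

The multi-provider strategy (pattern 2) is what the payment-processor example below illustrates.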
```typescript
interface PaymentProvider {
  name: string;
  processPayment(amount: number, token: string): Promise<PaymentResult>;
  isHealthy(): Promise<boolean>;
}

class ResilientPaymentProcessor {
  private providerHealth: Map<string, { healthy: boolean; lastCheck: Date }> = new Map();

  constructor(private providers: PaymentProvider[]) {}

  async processPayment(amount: number, token: string): Promise<PaymentResult> {
    const orderedProviders = await this.getOrderedProviders();
    let lastError: Error | null = null;

    for (const provider of orderedProviders) {
      try {
        console.log(`Attempting payment via ${provider.name}`);
        const result = await provider.processPayment(amount, token);

        // Mark provider as healthy on success
        this.providerHealth.set(provider.name, { healthy: true, lastCheck: new Date() });
        return result;
      } catch (error) {
        console.error(`Payment failed via ${provider.name}: ${error}`);

        // Mark provider as unhealthy
        this.providerHealth.set(provider.name, { healthy: false, lastCheck: new Date() });
        lastError = error as Error;
        // Continue to next provider
      }
    }

    // All providers failed
    throw new Error(`All payment providers failed. Last error: ${lastError?.message}`);
  }

  private async getOrderedProviders(): Promise<PaymentProvider[]> {
    // Return healthy providers first, then unhealthy ones as last resort
    const healthy: PaymentProvider[] = [];
    const unhealthy: PaymentProvider[] = [];

    for (const provider of this.providers) {
      const status = this.providerHealth.get(provider.name);
      if (!status || status.healthy) {
        healthy.push(provider);
      } else {
        unhealthy.push(provider);
      }
    }

    return [...healthy, ...unhealthy];
  }
}
```

Fallback paths are rarely exercised in production. When exercised during actual failures, they often fail due to stale configurations, changed APIs, or untested edge cases. Regularly test fallback behavior by deliberately failing primary providers in staging and occasionally in production.
Component redundancy isn't just about running multiple copies—it's about ensuring that one component's failure doesn't cascade to others. Failure isolation contains the blast radius of component failures.
Bulkhead Pattern
Partition resources so that failure in one partition doesn't affect others:
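A minimal sketch of the idea: give each downstream dependency its own bounded concurrency budget, so a slow dependency can exhaust its own slots but not starve everything else. The limits are arbitrary:

```typescript
// A tiny semaphore-style bulkhead: each dependency gets its own instance,
// so one slow dependency can only exhaust its own slots.
class Bulkhead {
  private inFlight = 0;

  constructor(private readonly maxConcurrent: number, private readonly name: string) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxConcurrent) {
      // Fail fast instead of queueing, so other partitions keep their capacity.
      throw new Error(`Bulkhead '${this.name}' is full; rejecting to protect other work`);
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--;
    }
  }
}

// Separate budgets per dependency (numbers are illustrative).
const paymentsBulkhead = new Bulkhead(20, 'payments');
const reportingBulkhead = new Bulkhead(5, 'reporting');
```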
Circuit Breaker Pattern
Stop calling a failing service to prevent cascading load:
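A stripped-down sketch of the usual closed/open/half-open state machine; the thresholds and timings are placeholder values:

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,    // consecutive failures before opening
    private readonly resetTimeoutMs = 30_000, // how long to stay open before probing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast instead of calling a known-bad service');
      }
      this.state = 'half-open'; // allow a single probe request through
    }
    try {
      const result = await fn();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```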
Timeout and Deadline Patterns
Prevent slow components from blocking healthy ones:
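One simple way to impose a deadline on any async call; the budget in the usage comment is just an example:

```typescript
// Reject if the wrapped call does not settle within the deadline, so one slow
// dependency cannot tie up request handlers indefinitely.
async function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins; the loser is abandoned.
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // always release the timer
  }
}

// Usage (hypothetical call): give the inventory lookup a 2-second budget.
// const stock = await withTimeout(inventoryClient.getStock(sku), 2000, 'inventory lookup');
```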
Queue-Based Decoupling
Asynchronous processing isolates producer from consumer failures:
Apply multiple isolation patterns simultaneously. A circuit breaker alone helps, but a circuit breaker with timeouts, bulkheads, and queue decoupling provides much stronger isolation. Each layer catches failures that slip through the others.
Redundancy that hasn't been tested is unreliable redundancy. Component redundancy must be validated through systematic failure injection.
Unit-Level Failure Testing
Test individual component failover in isolation:
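For example, the ResilientPaymentProcessor shown earlier can be exercised with a provider that always fails, to confirm the fallback path actually engages. A sketch assuming a Jest-style test runner and the types from that example:

```typescript
// Fake providers: the primary always throws, the secondary succeeds.
const failingPrimary: PaymentProvider = {
  name: 'primary',
  processPayment: async () => { throw new Error('injected failure'); },
  isHealthy: async () => false,
};
const workingSecondary: PaymentProvider = {
  name: 'secondary',
  processPayment: async () => ({ success: true } as PaymentResult),
  isHealthy: async () => true,
};

test('falls back to the secondary provider when the primary fails', async () => {
  const processor = new ResilientPaymentProcessor([failingPrimary, workingSecondary]);
  const result = await processor.processPayment(100, 'tok_test');
  expect(result).toEqual({ success: true });
});
```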
Integration Failure Testing
Test failures in context of full system:
Chaos Engineering
Regular, automated failure injection in production:
GameDays
Orchestrated failure exercises with the team:
| Component | Failure Test | Expected Behavior | Recovery Test |
|---|---|---|---|
| App Server | Terminate instance | Load balancer routes around | New instance joins pool |
| Database Primary | Stop primary process | Standby promoted | Old primary rejoins as replica |
| Cache Cluster | Kill cache node | Requests hit origin | Node rejoins, cache rebuilds |
| Queue | Block queue access | Dead letter handling | Queue resumes, backlog processes |
| Config Service | Make config unavailable | Use cached config | Fetch fresh config on recovery |
Begin chaos testing in development, then staging, then production during low-traffic periods, then production during normal traffic. Build confidence incrementally. Never start chaos testing in production without extensive staging validation.
Component redundancy eliminates the weakest links in your architecture—the single points of failure that could bring down an otherwise well-designed system. It requires systematic identification, appropriate patterns for different component types, and regular testing to validate.
Module Complete:
You've now completed the Redundancy Patterns module. You understand:
These patterns work together to build systems that maintain availability despite hardware failures, software bugs, network partitions, and human error. The next module will explore Failover Strategies—the mechanisms for detecting failures and executing the transitions that redundancy enables.
You've mastered redundancy patterns—from active-passive through component-level redundancy. These patterns form the foundation of high availability architecture. Apply them systematically to eliminate single points of failure and build systems that survive the inevitable failures of distributed computing.