In system design discussions, availability and reliability are often used interchangeably—but they represent fundamentally different aspects of system dependability. Understanding this distinction isn't academic pedantry; it shapes how you design systems, measure success, and prioritize engineering investments.
Consider two hypothetical systems:
System A: Available 99.9% of the time, but when it fails, it sometimes corrupts user data, returns inconsistent results, or behaves unpredictably.
System B: Available 99.5% of the time, but when it's up, it always behaves correctly, consistently, and predictably.
Which is better? The answer depends on your use case, and choosing wisely requires understanding what each property actually means.
By the end of this page, you will understand the formal definitions of availability and reliability, how they differ across multiple dimensions, when each matters more, and how to design systems that excel at both. You'll also learn about the broader context of dependability, which encompasses both properties.
Let's establish precise definitions before exploring the implications:
Availability is the probability that a system is operational and capable of performing its function at any randomly selected moment in time. It answers the question: "Is the system working right now?"
Reliability is the probability that a system will perform its intended function correctly over a specified period of time under stated conditions. It answers the question: "Will the system continue working correctly for the duration I need it?"
```
AVAILABILITY (Point-in-Time)
============================
A(t) = P(system is operational at time t)

Steady-state availability:
A = MTBF / (MTBF + MTTR)

Where:
  MTBF = Mean Time Between Failures
  MTTR = Mean Time To Recovery

RELIABILITY (Duration-Based)
============================
R(t) = P(system operates correctly from time 0 to time t)

For exponential failure distribution:
R(t) = e^(-t/MTBF) = e^(-λt)

Where:
  λ = failure rate = 1/MTBF
  t = mission time

KEY DISTINCTION
===============
- Availability: a snapshot measurement at any instant
- Reliability: a duration measurement over continuous operation

A system can be highly available (comes back up quickly after failures)
but unreliable (fails frequently during operation).
```

The intuitive difference:
Imagine a lightbulb:
A lightbulb that turns on 99% of the time when you flip the switch (high availability) but flickers off randomly every few minutes (low reliability) is very different from one that takes a few tries to turn on (lower availability) but then runs continuously for months (high reliability).
Neither property is superior—both are essential aspects of system dependability. A system that's always available but often behaves incorrectly is useless, but so is a system that's perfectly reliable when running but frequently unavailable.
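The availability and reliability formulas can be turned into quick calculations. A minimal Python sketch (the function names are illustrative, not a standard API):

```python
import math

def availability(mtbf_hours, mttr_hours):
    # Steady-state availability: fraction of time the system is operational.
    return mtbf_hours / (mtbf_hours + mttr_hours)

def reliability(mission_hours, mtbf_hours):
    # R(t) for an exponential failure distribution: probability of
    # operating failure-free from time 0 through the mission time.
    failure_rate = 1.0 / mtbf_hours  # λ in the formula above
    return math.exp(-failure_rate * mission_hours)

print(round(availability(1000, 1), 6))   # 0.999001 (≈ "three nines")
print(round(reliability(100, 1000), 4))  # 0.9048
```

Note the asymmetry these numbers reveal: a system with 99.9% availability still has only a ~90% chance of running an entire 100-hour mission without a single failure.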
The formal definitions become clearer when we examine how these properties manifest in real systems:
| Aspect | Availability | Reliability |
|---|---|---|
| Core question | Can I use the system now? | Will the system work correctly while I use it? |
| Measurement type | Point-in-time probability | Duration-based probability |
| Failure impact | Cannot access service | Incorrect, inconsistent, or erratic behavior |
| Recovery focus | Minimize MTTR (get back up fast) | Maximize MTBF (fail less often) |
| Design focus | Redundancy, failover, quick restart | Robust implementation, testing, fault prevention |
| User experience | 'Site is down' | 'Something weird is happening' |
| Detection method | Health checks, synthetic probes | Error monitoring, data validation, consistency checks |
Real-world manifestations:

High availability, low reliability scenarios: a service with aggressive failover and near-instant restarts that nevertheless fails often, so users see sporadic errors, stale reads, or inconsistent results between requests.

High reliability, lower availability scenarios: a carefully engineered system that runs correctly for months at a stretch, but whose rare outages or maintenance windows take hours to resolve.
Neither availability nor reliability is universally more important—the right priority depends on the nature of the system and the consequences of failure.
| System Type | Availability Priority | Reliability Priority | Reasoning |
|---|---|---|---|
| Social media feed | Very High | Medium | Users tolerate occasional stale data; unavailability is immediately noticed |
| Banking transactions | High | Critical | Brief downtime is acceptable; incorrect balances are catastrophic |
| Live video streaming | Critical | Medium | Buffering/unavailability immediately visible; minor glitches tolerated |
| Medical records system | High | Critical | Records must be accessible; but incorrect records could be life-threatening |
| E-commerce cart | High | High | Users abandon unavailable sites; lost items in cart cause churn |
| Data warehouse/ETL | Medium | Critical | Batch processing can wait; incorrect data propagates to all downstream reports |
The goal is almost always to achieve both high availability and high reliability. The question of 'which matters more' is about prioritization when trade-offs are necessary—during incidents, when making architectural decisions under constraints, or when allocating limited engineering resources.
Availability and reliability are not independent—they interact in complex ways. Understanding these interactions is essential for making informed design decisions.
Reliability enables availability:
Higher reliability (fewer failures) directly contributes to higher availability (more uptime): every failure avoided is downtime that never has to be recovered from.
The MTBF-MTTR tradeoff:
Both reliability (MTBF) and recovery speed (MTTR) contribute to availability:
Availability = MTBF / (MTBF + MTTR)
You can achieve 99.9% availability through:

- rare failures with slow recovery (high MTBF, higher MTTR), or
- frequent failures with near-instant recovery (lower MTBF, very low MTTR).
Both achieve the same availability, but they represent very different systems with different user experiences.
```
SCENARIO COMPARISON: Achieving 99.9% Availability
=================================================

APPROACH A: High MTBF Focus (Fail Rarely)
-----------------------------------------
MTBF = 1000 hours (fail about once per 6 weeks)
MTTR = 1 hour (recovery takes a while)
Availability = 1000 / (1000 + 1) = 99.9%

User experience:
  - Rare outages (users may never experience one)
  - When outages occur, they're disruptive (1 hour)
  - Users forget the system ever fails

Engineering investment:
  - Extensive testing and QA
  - Conservative change management
  - High-quality components
  - Comprehensive monitoring to prevent failures

APPROACH B: Fast MTTR Focus (Recover Quickly)
---------------------------------------------
MTBF = 10 hours (fail about twice per day)
MTTR = 36 seconds (near-instant recovery)
Availability = 10 / (10 + 0.01) = 99.9%

User experience:
  - Frequent but brief interruptions
  - Users might not notice (request retry hides it)
  - System feels 'fragile' to sophisticated users

Engineering investment:
  - Extensive redundancy and failover
  - Rapid detection and automatic recovery
  - Graceful degradation and retry logic
  - May tolerate messier code/more failures

HYBRID APPROACH (Best Practice)
-------------------------------
MTBF = 100 hours (fail about weekly)
MTTR = 6 minutes (reasonably fast recovery)
Availability = 100 / (100 + 0.1) = 99.9%

User experience:
  - Occasional brief outages
  - Outages are infrequent enough to be acceptable
  - Recovery is fast enough to not be disruptive

Engineering reality:
  This is where most successful systems land—
  neither extreme reliability nor extreme recovery,
  but a balanced investment in both.
```

Modern system design increasingly favors the 'fast MTTR' approach over the 'high MTBF' approach. The reasoning: you can test recovery mechanisms (failover, restarts, replication), but you cannot exhaustively test all failure modes. Practicing recovery leads to better outcomes than trying to prevent all failures.
Understanding different failure modes helps clarify the distinction between availability and reliability issues:
The detection challenge:
Availability failures are usually obvious and immediate: health checks fail, error pages appear, alerts fire, and users complain within minutes.

Reliability failures are often subtle and delayed: a wrong balance, a silently dropped message, or slowly corrupting data may go unnoticed until an audit, a customer report, or a downstream system exposes it.
This detection asymmetry explains why many organizations focus more on availability—it's easier to measure and harder to ignore. But reliability failures can be more damaging precisely because they go undetected longer.
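The asymmetry shows up in the checks themselves. A toy sketch (the data shapes are hypothetical, not a real monitoring API) contrasting an availability probe with a reliability probe:

```python
def liveness_check(status_payload):
    # Availability probe: is the service answering "up" right now?
    return status_payload.get("status") == "up"

def consistency_check(ledger):
    # Reliability probe: does stored state obey its invariant?
    # Invariant here: each account balance equals the sum of its entries.
    return [acct for acct, rec in ledger.items()
            if sum(rec["entries"]) != rec["balance"]]

healthy = liveness_check({"status": "up"})           # True: probe passes
drifted = consistency_check({
    "alice": {"entries": [50, -20], "balance": 30},  # consistent
    "bob":   {"entries": [100, -10], "balance": 95}, # silently wrong
})
# drifted == ["bob"]: the service was "up" the whole time
```

The liveness probe is cheap and binary; the consistency probe requires knowing the system's invariants and scanning its data—which is why reliability failures slip past monitoring that only asks "is it up?"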
Reliability failures are often more expensive than availability failures. An availability outage is visible, bounded in time, and embarrassing—but often recoverable. A reliability failure that corrupts data, provides wrong answers, or silently loses transactions may require days of investigation, data recovery, and customer compensation.
Excellent systems achieve both high availability and high reliability. Here are design principles and practices that support each property:
| Practice | Availability Benefit | Reliability Benefit |
|---|---|---|
| Canary deployments | Limits blast radius of bad deploys | Catches bugs before wide exposure |
| Feature flags | Disable features without full rollback | Isolate experimental code from stable paths |
| Comprehensive monitoring | Detect outages quickly | Detect correctness issues and anomalies |
| Post-incident reviews | Improve recovery processes | Fix root causes of logic errors |
| Code review | Catch deployment blockers | Catch bugs before they're deployed |
| Immutable infrastructure | Consistent, predictable deploys | Eliminate configuration drift |
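As one example from the table, the core of a feature flag is tiny. A minimal in-memory sketch (a real deployment would back this with a dynamic config store so flags flip without a redeploy; the flag and function names are invented for illustration):

```python
class FeatureFlags:
    # Minimal in-memory flag store.
    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def disable(self, name):
        # Kill switch: turn off a risky path without a rollback.
        self._flags[name] = False

def rank_feed(flags):
    # Experimental code stays isolated behind the flag check.
    if flags.is_enabled("experimental_ranking"):
        return "experimental model"
    return "stable heuristic"

flags = FeatureFlags({"experimental_ranking": True})
print(rank_feed(flags))  # experimental model
flags.disable("experimental_ranking")
print(rank_feed(flags))  # stable heuristic
```

The availability benefit is the `disable` call (instant mitigation, no rollback); the reliability benefit is that the experimental path never touches the stable one.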
Availability and reliability are two components of a broader concept called dependability—the trustworthiness of a system such that reliance can be placed on the service it delivers. The classical taxonomy of dependability includes several related properties:
| Property | Definition | Focus |
|---|---|---|
| Availability | Probability system is operational at a point in time | Can I use the system now? |
| Reliability | Probability of correct operation over a time period | Will it keep working correctly? |
| Safety | Absence of catastrophic consequences for users/environment | Will failure cause harm? |
| Integrity | Absence of improper system alterations | Is the system/data untampered? |
| Maintainability | Ability to undergo modifications and repairs easily | Can we fix and evolve it? |
| Confidentiality | Absence of unauthorized disclosure of information | Is information protected? |
Security as a dependability component:
Note that security (confidentiality + integrity + availability) overlaps with dependability. A DDoS attack reduces availability. A data breach compromises integrity and confidentiality. When designing systems, security and reliability/availability concerns often require similar solutions: redundancy, monitoring, validation, and defense in depth.
Maintainability's underrated importance:
Maintainability directly affects long-term availability and reliability: systems that are easy to diagnose and modify recover faster from incidents (lower MTTR) and accumulate fewer defects as they evolve (higher MTBF).
The best engineers don't optimize for single properties in isolation. They design for dependability holistically—understanding that investments in one property often yield benefits across multiple properties, while neglecting any property can undermine the entire system.
We've thoroughly explored the distinctions between availability and reliability. Let's consolidate the key insights:

- Availability is a point-in-time probability ("Is the system working now?"); reliability is a duration-based probability ("Will it keep working correctly for as long as I need it?").
- Steady-state availability = MTBF / (MTBF + MTTR), so the same availability target can be hit by failing rarely or by recovering quickly—two very different systems.
- Availability failures are obvious and bounded; reliability failures are subtle, delayed, and often more expensive.
- Which property to prioritize depends on the system: feeds and streams lean toward availability, while transactions and data pipelines demand reliability.
- Both are components of dependability, alongside safety, integrity, maintainability, and confidentiality.
What's next:
Now that we understand what availability and reliability mean and how they relate, the next page explores the cost of downtime. We'll quantify the business impact of unavailability, examine both direct and indirect costs, and develop frameworks for justifying investments in high availability. Understanding the true cost of downtime is essential for making informed decisions about how much availability is 'enough.'
You now understand the formal and practical differences between availability and reliability, when each matters more, how they interact, the various failure modes, and how to design systems that excel at both. Next, we'll explore the business impact of downtime.