In December 2021, Amazon Web Services experienced a major outage affecting US-EAST-1, one of its most critical regions. For nearly seven hours, thousands of services—from smart doorbells to food delivery apps—became partially or completely unavailable. The financial impact was estimated in the hundreds of millions of dollars, but the reputational damage extended far beyond any balance sheet.
This wasn't a story about servers catching fire or hackers breaching defenses. It was a story about availability—specifically, the catastrophic consequences when systems fail to maintain it.
Every distributed system makes an implicit promise to its users: I will be here when you need me. Understanding what that promise actually means, how to quantify it, and why it matters forms the foundation of everything else we'll learn about building highly available systems.
By the end of this page, you will understand the formal and practical definitions of availability, how availability differs from related concepts like uptime and reachability, and why the precise definition you choose profoundly impacts how you design, operate, and measure your systems.
At its most fundamental level, availability is the probability that a system is operational and able to perform its intended function at any given point in time. This deceptively simple definition carries profound implications for how we design and operate distributed systems.
The formal mathematical definition is:
```
Availability = Uptime / (Uptime + Downtime) = MTBF / (MTBF + MTTR)

Where:
  MTBF = Mean Time Between Failures
  MTTR = Mean Time To Recovery

Expressed as a percentage:

  Availability % = (Total Time - Downtime) / Total Time × 100
```
This formula reveals two fundamental levers we can pull to improve availability:
Increase MTBF — Make failures less frequent by building more robust components, implementing redundancy, and eliminating single points of failure.
Decrease MTTR — When failures inevitably occur, detect them faster, diagnose them more quickly, and restore service as rapidly as possible.
Most organizations initially focus heavily on the first lever (preventing failures), but as systems mature, the second lever (fast recovery) often yields greater returns. A system that fails once a month but recovers in 30 seconds may be more available than one that fails once a year but takes 4 hours to restore.
Counterintuitively, systems that practice recovery regularly (through chaos engineering, failover drills, etc.) often achieve higher availability than systems that try to prevent all failures. Practice reduces MTTR, and the reduction in recovery time often outweighs any increase in failure frequency.
The time dimension matters:
Availability is always measured over a specific time period, such as a month, a quarter, or a year. This is critical because the window changes what a single incident means: a one-hour outage nearly exhausts a monthly 99.9% budget yet barely dents a yearly one, and an outage during peak traffic harms far more users than the same outage at 3 a.m.
A mature availability definition accounts for when downtime occurs, not just how much.
The formula above is clear, but it hides a crucial question: What counts as the system being 'up' or 'down'? This question has no universal answer—it depends entirely on context, and the answer you choose fundamentally shapes your system's design.
Consider a simple e-commerce website. Is the system 'available' when the homepage loads but checkout fails? When every page responds, but takes 30 seconds? When search returns results whose rankings are hours stale? When it works for 95% of users while the rest see errors?
These scenarios illustrate why availability is not a binary property in practice. Systems exist on a spectrum between fully operational and completely unavailable, with countless degraded states in between.
To address this complexity, sophisticated organizations define availability across multiple dimensions:
| Dimension | Definition | Example |
|---|---|---|
| Functional Availability | Which features/capabilities are operational | Search works, but recommendations are degraded |
| Performance Availability | Is the system meeting latency targets | P99 latency is 5s instead of target 500ms |
| Correctness Availability | Are results accurate and complete | Search returns results but rankings are stale |
| Geographic Availability | Which regions can access the service | US and EU available, APAC experiencing outage |
| User Segment Availability | Which users can access the service | Free tier available, premium tier experiencing issues |
| Capacity Availability | Can the system handle expected load | Available but at 90% capacity, rejecting new connections |
The most meaningful availability definitions are user-centric: 'The percentage of user requests that complete successfully within acceptable latency bounds.' This definition captures function, performance, and correctness in a single metric aligned with actual user experience.
Modern distributed systems are rarely completely up or completely down. Instead, they exist in partial availability states where some functionality works while other functionality is impaired. Understanding and designing for partial availability is one of the hallmarks of mature system design.
Graceful degradation is the practice of intentionally designing systems to provide reduced but still valuable functionality when components fail. The key insight is that partial service is almost always better than no service.
Designing for graceful degradation requires:
Feature prioritization — Not all features are equal. Identify which capabilities are critical (must always work), important (should work but can be degraded), and optional (nice to have, can be disabled).
Dependency isolation — Critical paths should not depend on non-critical services. Failures in analytics should never break checkout.
Fallback strategies — Every external dependency should have a defined fallback: cached data, default values, simplified functionality, or graceful error messages.
Circuit breakers — Prevent cascading failures by detecting and isolating failing components before they affect the broader system.
Load shedding — When overwhelmed, reject some requests cleanly rather than failing all requests poorly.
```typescript
// Example: Product recommendation service with graceful degradation
async function getRecommendations(userId: string, productId: string): Promise<Product[]> {
  try {
    // Primary: Personalized ML-based recommendations
    const recommendations = await recommendationService.getPersonalized(userId, productId);
    return recommendations;
  } catch (error) {
    metrics.increment('recommendations.primary_failure');

    try {
      // Fallback 1: Non-personalized, similar products from cache
      const cached = await cache.get(`similar:${productId}`);
      if (cached) {
        return cached;
      }
    } catch (cacheError) {
      metrics.increment('recommendations.cache_failure');
    }

    try {
      // Fallback 2: Category top sellers from database
      const category = await productService.getCategory(productId);
      return await productService.getTopSellers(category, 10);
    } catch (dbError) {
      metrics.increment('recommendations.db_failure');
    }

    // Final fallback: Static popular products
    return getStaticPopularProducts();
  }
}
```
Graceful degradation can hide problems if not properly monitored. A system serving cached recommendations for hours without anyone noticing indicates a monitoring gap. Always alert on degraded states, even when the user experience appears normal.
Different systems have vastly different availability requirements based on their purpose, users, and consequences of failure. Understanding where your system sits on this spectrum is crucial for making appropriate design decisions—overengineering availability wastes resources, while underengineering creates business risk.
Let's examine the spectrum from lowest to highest availability requirements:
| Category | Target | Examples | Downtime Impact |
|---|---|---|---|
| Internal Tools | 95-99% | Admin dashboards, internal reporting, dev environments | Mild inconvenience, work delayed |
| Standard Web Apps | 99-99.5% | Content sites, blogs, marketing pages | Lost traffic, minor revenue impact |
| Business Applications | 99.5-99.9% | SaaS products, e-commerce, business-critical apps | Significant revenue loss, customer churn |
| Financial Systems | 99.9-99.99% | Payment processing, trading platforms, banking | Major financial loss, regulatory issues |
| Critical Infrastructure | 99.99-99.999% | Cloud platforms, telecom, healthcare systems | Cascading failures, safety risks |
| Life-Critical Systems | 99.999%+ | Air traffic control, medical devices, nuclear systems | Potential loss of life, catastrophic failure |
The exponential cost curve:
Moving up the availability spectrum doesn't just increase costs linearly—the relationship is often exponential. Going from 99% to 99.9% might require:

- Redundant instances behind a load balancer
- Automated health checks and failover
- Basic monitoring, alerting, and an on-call rotation
Going from 99.9% to 99.99% might add:

- Multi-region or multi-availability-zone deployment
- Canary releases and automated rollback
- Regular chaos engineering and failover drills
- A 24/7 on-call rotation and dedicated reliability engineering
Each additional 'nine' typically costs 5-10x more than the previous one.
One of the most common architectural mistakes is over-specifying availability requirements. A 99.99% target for a system that genuinely needs only 99.5% wastes engineering resources, increases operational complexity, and often delays feature development—all without providing proportional value.
Context-dependent availability:
Availability requirements often vary by:

- Time of day and seasonality (peak shopping hours vs. overnight)
- User segment (free tier vs. premium, internal vs. external)
- Feature criticality (checkout vs. recommendations)
- Geography (core markets vs. regions with little traffic)
Sophisticated systems define and implement different availability targets for different contexts rather than applying a one-size-fits-all requirement.
Availability is often confused with several related but distinct concepts. Clarifying these distinctions is essential for precise communication and proper system design.
While these concepts are distinct, modern Site Reliability Engineering (SRE) practices integrate them into unified Service Level Objectives (SLOs) that capture availability, performance, and correctness together: 'X% of requests complete successfully within Y milliseconds.' We'll explore this in depth in the SLO chapter.
Every system needs a precise, shared definition of what 'available' means. Without this, teams talk past each other, measurements are meaningless, and design decisions lack anchor points. Here's a framework for crafting your availability definition:
```
AVAILABILITY DEFINITION: E-Commerce Platform
============================================

Definition:
  The platform is considered available when a user can complete the core
  purchase journey (browse → search → cart → checkout) with acceptable
  performance.

Measurement:
  - Success: 2XX response within 2000ms at the 99th percentile
  - Measured at: Application load balancer (ALB) access logs
  - Time window: Rolling 30-day period

Critical Journeys & Targets:
  1. Product browsing (homepage, category pages): 99.9%
  2. Product search: 99.9%
  3. Shopping cart operations: 99.95%
  4. Checkout flow: 99.99%
  5. Order confirmation: 99.99%

Weighted Composite Availability:
  - Browse/Search: 40% weight
  - Cart: 20% weight
  - Checkout/Confirm: 40% weight
  - Target: 99.95% (4.38 hours allowed downtime per year)

Exclusions:
  - Scheduled maintenance windows (max 4 hours/month, announced 72h ahead)
  - Third-party payment processor outages beyond our control
  - Client-side failures (network, browser, device)
  - DDoS attacks exceeding 10x normal traffic baseline

Time-Weighted Considerations:
  - Peak hours (6pm-10pm local): failures weighted 2x
  - Major sales events: failures weighted 5x
```
Many organizations define availability in ways that look good but don't reflect reality: measuring at the server instead of the user edge, excluding all 'external' factors (which are often controllable), or averaging away time-of-day patterns. Your availability definition should reflect what users actually experience.
While we've focused on technical definitions, availability is ultimately about human experience. The same numerical availability can create vastly different user perceptions based on context, communication, and recovery.
User perception factors:

- Frequency vs. duration: many brief blips can feel worse than one rare, well-handled outage
- Timing: failures during moments of need (checkout, a deadline) loom far larger than off-hours ones
- Error quality: clear, honest error messages feel very different from blank pages and spinners
- State preservation: losing a cart or a draft turns a brief outage into a lasting grievance
Building trust through transparency:
Highly available organizations don't just achieve good numbers—they build trust through:

- Public status pages that reflect reality, not marketing
- Proactive communication during incidents, before users have to ask
- Honest, blameless postmortems that explain what happened
- Visible follow-through on the fixes those postmortems promise
This transparency paradoxically increases perceived availability. Users tolerate occasional issues more when they trust the organization to be honest about them and to continuously improve.
A service with 99.9% availability but terrible communication, unclear error messages, and slow recovery may generate more support tickets and churn than a service with 99.5% availability that communicates proactively, degrades gracefully, and recovers user state. Availability is a number; trust is a relationship.
We've established a comprehensive foundation for understanding availability in distributed systems. Let's consolidate the key insights:

- Availability is the probability a system can perform its intended function, formalized as MTBF / (MTBF + MTTR).
- There are two levers: make failures rarer (increase MTBF) and recover faster (decrease MTTR); mature organizations invest heavily in the second.
- Availability is not binary; the most meaningful definitions are user-centric and span function, performance, and correctness.
- Graceful degradation, dependency isolation, fallbacks, circuit breakers, and load shedding let systems stay partially useful during failures.
- Requirements sit on a spectrum, and each additional 'nine' costs far more than the last; over-specifying is as real a mistake as under-specifying.
- A precise, shared, honestly measured definition of 'available' anchors design, operations, and communication alike.
- Ultimately, availability is about human experience: transparency and fast, well-communicated recovery build the trust the numbers are meant to protect.
What's next:
Now that we understand what availability means, the next page explores how we measure it. The famous 'nines' notation (99.9%, 99.99%, etc.) is more nuanced than it first appears. We'll examine what each level of availability actually means in practice, how to calculate yearly/monthly budgets, and why choosing the right target is a business decision as much as a technical one.
You now understand the formal and practical definitions of availability, how to design for partial availability, where your system might sit on the availability spectrum, and how availability relates to other quality attributes. Next, we'll dive into measuring availability with precision.