In December 2021, Amazon Web Services experienced a major outage affecting US-EAST-1, one of its most critical regions. For nearly seven hours, thousands of services—from smart doorbells to food delivery apps—became partially or completely unavailable. The financial impact was estimated in the hundreds of millions of dollars, but the reputational damage extended far beyond any balance sheet.
This wasn't a story about servers catching fire or hackers breaching defenses. It was a story about availability—specifically, the catastrophic consequences when systems fail to maintain it.
Every distributed system makes an implicit promise to its users: I will be here when you need me. Understanding what that promise actually means, how to quantify it, and why it matters forms the foundation of everything else we'll learn about building highly available systems.
By the end of this page, you will understand the formal and practical definitions of availability, how availability differs from related concepts like uptime and reachability, and why the precise definition you choose profoundly impacts how you design, operate, and measure your systems.
At its most fundamental level, availability is the probability that a system is operational and able to perform its intended function at any given point in time. This deceptively simple definition carries profound implications for how we design and operate distributed systems.
The formal mathematical definition is:
```
Availability = Uptime / (Uptime + Downtime) = MTBF / (MTBF + MTTR)

Where:
  MTBF = Mean Time Between Failures
  MTTR = Mean Time To Recovery

Expressed as a percentage:

  Availability % = (Total Time - Downtime) / Total Time × 100
```
This formula reveals two fundamental levers we can pull to improve availability:
Increase MTBF — Make failures less frequent by building more robust components, implementing redundancy, and eliminating single points of failure.
Decrease MTTR — When failures inevitably occur, detect them faster, diagnose them more quickly, and restore service as rapidly as possible.
Most organizations initially focus heavily on the first lever (preventing failures), but as systems mature, the second lever (fast recovery) often yields greater returns. A system that fails once a month but recovers in 30 seconds may be more available than one that fails once a year but takes 4 hours to restore.
Counterintuitively, systems that practice recovery regularly (through chaos engineering, failover drills, etc.) often achieve higher availability than systems that try to prevent all failures. Practice reduces MTTR, and the reduction in recovery time often outweighs any increase in failure frequency.
The time dimension matters:
Availability is always measured over a specific time period, such as a month, a quarter, or a year. This is critical because the window changes what a single incident means: a one-hour outage nearly exhausts a monthly 99.9% budget yet barely dents a yearly one, and an outage during peak traffic harms far more users than the same outage at 3 a.m.
A mature availability definition accounts for when downtime occurs, not just how much.
The formula above is clear, but it hides a crucial question: What counts as the system being 'up' or 'down'? This question has no universal answer—it depends entirely on context, and the answer you choose fundamentally shapes your system's design.
Consider a simple e-commerce website. Is the system 'available' when the homepage loads but checkout fails? When every page responds, but takes 30 seconds? When search returns results whose rankings are hours stale? When it works for 95% of users while the rest see errors?
These scenarios illustrate why availability is not a binary property in practice. Systems exist on a spectrum between fully operational and completely unavailable, with countless degraded states in between.
To address this complexity, sophisticated organizations define availability across multiple dimensions:
| Dimension | Definition | Example |
|---|---|---|
| Functional Availability | Which features/capabilities are operational | Search works, but recommendations are degraded |
| Performance Availability | Is the system meeting latency targets | P99 latency is 5s instead of target 500ms |
| Correctness Availability | Are results accurate and complete | Search returns results but rankings are stale |
| Geographic Availability | Which regions can access the service | US and EU available, APAC experiencing outage |
| User Segment Availability | Which users can access the service | Free tier available, premium tier experiencing issues |
| Capacity Availability | Can the system handle expected load | Available but at 90% capacity, rejecting new connections |
The most meaningful availability definitions are user-centric: 'The percentage of user requests that complete successfully within acceptable latency bounds.' This definition captures function, performance, and correctness in a single metric aligned with actual user experience.
Modern distributed systems are rarely completely up or completely down. Instead, they exist in partial availability states where some functionality works while other functionality is impaired. Understanding and designing for partial availability is one of the hallmarks of mature system design.
Graceful degradation is the practice of intentionally designing systems to provide reduced but still valuable functionality when components fail. The key insight is that partial service is almost always better than no service.
Designing for graceful degradation requires:
Feature prioritization — Not all features are equal. Identify which capabilities are critical (must always work), important (should work but can be degraded), and optional (nice to have, can be disabled).
Dependency isolation — Critical paths should not depend on non-critical services. Failures in analytics should never break checkout.
Fallback strategies — Every external dependency should have a defined fallback: cached data, default values, simplified functionality, or graceful error messages.
Circuit breakers — Prevent cascading failures by detecting and isolating failing components before they affect the broader system.
Load shedding — When overwhelmed, reject some requests cleanly rather than failing all requests poorly.
```typescript
// Example: Product recommendation service with graceful degradation
async function getRecommendations(userId: string, productId: string): Promise<Product[]> {
  try {
    // Primary: Personalized ML-based recommendations
    const recommendations = await recommendationService.getPersonalized(userId, productId);
    return recommendations;
  } catch (error) {
    metrics.increment('recommendations.primary_failure');

    try {
      // Fallback 1: Non-personalized, similar products from cache
      const cached = await cache.get(`similar:${productId}`);
      if (cached) {
        return cached;
      }
    } catch (cacheError) {
      metrics.increment('recommendations.cache_failure');
    }

    try {
      // Fallback 2: Category top sellers from database
      const category = await productService.getCategory(productId);
      return await productService.getTopSellers(category, 10);
    } catch (dbError) {
      metrics.increment('recommendations.db_failure');
    }

    // Final fallback: Static popular products
    return getStaticPopularProducts();
  }
}
```
Graceful degradation can hide problems if not properly monitored. A system serving cached recommendations for hours without anyone noticing indicates a monitoring gap. Always alert on degraded states, even when the user experience appears normal.
Different systems have vastly different availability requirements based on their purpose, users, and consequences of failure. Understanding where your system sits on this spectrum is crucial for making appropriate design decisions—overengineering availability wastes resources, while underengineering creates business risk.
Let's examine the spectrum from lowest to highest availability requirements:
| Category | Target | Examples | Downtime Impact |
|---|---|---|---|
| Internal Tools | 95-99% | Admin dashboards, internal reporting, dev environments | Mild inconvenience, work delayed |
| Standard Web Apps | 99-99.5% | Content sites, blogs, marketing pages | Lost traffic, minor revenue impact |
| Business Applications | 99.5-99.9% | SaaS products, e-commerce, business-critical apps | Significant revenue loss, customer churn |
| Financial Systems | 99.9-99.99% | Payment processing, trading platforms, banking | Major financial loss, regulatory issues |
| Critical Infrastructure | 99.99-99.999% | Cloud platforms, telecom, healthcare systems | Cascading failures, safety risks |
| Life-Critical Systems | 99.999%+ | Air traffic control, medical devices, nuclear systems | Potential loss of life, catastrophic failure |
The exponential cost curve:
Moving up the availability spectrum doesn't just increase costs linearly—the relationship is often exponential. Going from 99% to 99.9% might require:

- Redundant instances behind a load balancer
- Automated health checks and failover
- Basic monitoring, alerting, and an on-call rotation
Going from 99.9% to 99.99% might add:

- Multi-region or multi-availability-zone deployment
- Canary releases and automated rollback
- Regular chaos engineering and failover drills
- A 24/7 on-call rotation and dedicated reliability engineering
Each additional 'nine' typically costs 5-10x more than the previous one.
One of the most common architectural mistakes is over-specifying availability requirements. A 99.99% target for a system that genuinely needs only 99.5% wastes engineering resources, increases operational complexity, and often delays feature development—all without providing proportional value.
Context-dependent availability:
Availability requirements often vary by:

- Time of day and seasonality (peak shopping hours vs. overnight)
- User segment (free tier vs. premium, internal vs. external)
- Feature criticality (checkout vs. recommendations)
- Geography (core markets vs. regions with little traffic)
Sophisticated systems define and implement different availability targets for different contexts rather than applying a one-size-fits-all requirement.
Availability is often confused with several related but distinct concepts. Clarifying these distinctions is essential for precise communication and proper system design.
While these concepts are distinct, modern Site Reliability Engineering (SRE) practices integrate them into unified Service Level Objectives (SLOs) that capture availability, performance, and correctness together: 'X% of requests complete successfully within Y milliseconds.' We'll explore this in depth in the SLO chapter.
Every system needs a precise, shared definition of what 'available' means. Without this, teams talk past each other, measurements are meaningless, and design decisions lack anchor points. Here's a framework for crafting your availability definition:
```
AVAILABILITY DEFINITION: E-Commerce Platform
============================================

Definition:
  The platform is considered available when a user can complete the core
  purchase journey (browse → search → cart → checkout) with acceptable
  performance.

Measurement:
  - Success: 2XX response within 2000ms at the 99th percentile
  - Measured at: Application load balancer (ALB) access logs
  - Time window: Rolling 30-day period

Critical Journeys & Targets:
  1. Product browsing (homepage, category pages): 99.9%
  2. Product search: 99.9%
  3. Shopping cart operations: 99.95%
  4. Checkout flow: 99.99%
  5. Order confirmation: 99.99%

Weighted Composite Availability:
  - Browse/Search: 40% weight
  - Cart: 20% weight
  - Checkout/Confirm: 40% weight
  - Target: 99.95% (4.38 hours allowed downtime per year)

Exclusions:
  - Scheduled maintenance windows (max 4 hours/month, announced 72h ahead)
  - Third-party payment processor outages beyond our control
  - Client-side failures (network, browser, device)
  - DDoS attacks exceeding 10x normal traffic baseline

Time-Weighted Considerations:
  - Peak hours (6pm-10pm local): failures weighted 2x
  - Major sales events: failures weighted 5x
```
Many organizations define availability in ways that look good but don't reflect reality: measuring at the server instead of the user edge, excluding all 'external' factors (which are often controllable), or averaging away time-of-day patterns. Your availability definition should reflect what users actually experience.
While we've focused on technical definitions, availability is ultimately about human experience. The same numerical availability can create vastly different user perceptions based on context, communication, and recovery.
User perception factors:

- Frequency vs. duration: many brief blips can feel worse than one rare, well-handled outage
- Timing: failures during moments of need (checkout, a deadline) loom far larger than off-hours ones
- Error quality: clear, honest error messages feel very different from blank pages and spinners
- State preservation: losing a cart or a draft turns a brief outage into a lasting grievance
Building trust through transparency:
Highly available organizations don't just achieve good numbers—they build trust through:

- Public status pages that reflect reality, not marketing
- Proactive communication during incidents, before users have to ask
- Honest, blameless postmortems that explain what happened
- Visible follow-through on the fixes those postmortems promise
This transparency paradoxically increases perceived availability. Users tolerate occasional issues more when they trust the organization to be honest about them and to continuously improve.
A service with 99.9% availability but terrible communication, unclear error messages, and slow recovery may generate more support tickets and churn than a service with 99.5% availability that communicates proactively, degrades gracefully, and recovers user state. Availability is a number; trust is a relationship.
We've established a comprehensive foundation for understanding availability in distributed systems. Let's consolidate the key insights:

- Availability is the probability a system can perform its intended function, formalized as MTBF / (MTBF + MTTR).
- There are two levers: make failures rarer (increase MTBF) and recover faster (decrease MTTR); mature organizations invest heavily in the second.
- Availability is not binary; the most meaningful definitions are user-centric and span function, performance, and correctness.
- Graceful degradation, dependency isolation, fallbacks, circuit breakers, and load shedding let systems stay partially useful during failures.
- Requirements sit on a spectrum, and each additional 'nine' costs far more than the last; over-specifying is as real a mistake as under-specifying.
- A precise, shared, honestly measured definition of 'available' anchors design, operations, and communication alike.
- Ultimately, availability is about human experience: transparency and fast, well-communicated recovery build the trust the numbers are meant to protect.
What's next:
Now that we understand what availability means, the next page explores how we measure it. The famous 'nines' notation (99.9%, 99.99%, etc.) is more nuanced than it first appears. We'll examine what each level of availability actually means in practice, how to calculate yearly/monthly budgets, and why choosing the right target is a business decision as much as a technical one.
You now understand the formal and practical definitions of availability, how to design for partial availability, where your system might sit on the availability spectrum, and how availability relates to other quality attributes. Next, we'll dive into measuring availability with precision.