Of all the dimensions of service reliability, availability is the most fundamental. Before users can evaluate whether your service is fast, accurate, or feature-rich, they must first be able to access it. Availability is the prerequisite for every other quality—the foundation without which nothing else matters.
Yet availability, seemingly the simplest reliability concept, harbors surprising complexity when you attempt to measure it rigorously. What does it mean for a service to be "available"? If the homepage loads but checkout fails, is the service available? If requests complete but take 30 seconds, is that availability or a latency problem? If your service works for 99% of users but 1% experience DNS resolution failures, what is your true availability?
These aren't academic questions. The answers determine how you contractually commit to customers (SLAs), what targets you set for engineering teams (SLOs), and how you detect when you're falling short (alerting). Get availability measurement wrong, and you'll either over-promise to customers or under-invest in reliability—both costly mistakes.
By the end of this page, you will understand the multiple dimensions of availability, how to choose the right availability definition for your context, techniques for accurate measurement including handling of edge cases, and how to calculate and report availability metrics that genuinely reflect user experience. You'll be equipped to design availability SLIs that drive meaningful reliability improvement.
The seemingly simple question "Is the service available?" has no single correct answer. Different definitions serve different purposes, and choosing the wrong definition leads to metrics that don't reflect reality.
Time-Based Availability
The traditional definition of availability measures the proportion of time a service is operational:
Availability = (Total Time - Downtime) / Total Time × 100%
For example, if a service has 1 hour of downtime in a 30-day month (720 hours):
Availability = (720 - 1) / 720 × 100% = 99.86%
This approach is intuitive and widely used, particularly in infrastructure contexts (server uptime, network uptime). However, it has significant limitations:
Time-based availability treats all minutes equally, but user impact is not uniform across time. A minute of downtime during Black Friday peak shopping has vastly more impact than a minute at 3 AM on a Tuesday. User-centric availability must account for traffic-weighted impact.
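To see the difference concretely, here is a minimal sketch comparing time-based and request-based availability for the same one-hour outage; the hourly traffic numbers are invented purely for illustration.

```python
# The same one-hour outage, measured two ways. Hourly request volumes
# are invented for illustration: quiet overnight, busy during the day.
hourly_traffic = [200] * 3 + [5000] * 20 + [200]  # 24 hourly buckets

def time_based_availability(downtime_hours: float, total_hours: int = 24) -> float:
    return (total_hours - downtime_hours) / total_hours * 100

def request_based_availability(outage_hour: int) -> float:
    total = sum(hourly_traffic)
    failed = hourly_traffic[outage_hour]  # assume every request in the outage hour fails
    return (total - failed) / total * 100

print(f"Time-based (any hour):    {time_based_availability(1):.2f}%")    # ~95.83% regardless of when
print(f"Request-based, off-peak:  {request_based_availability(0):.2f}%")   # ~99.80%
print(f"Request-based, peak hour: {request_based_availability(12):.2f}%")  # ~95.04%
```

The time-based number is identical whenever the outage happens; the request-based number moves with how many users were actually affected.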
Request-Based Availability
A more user-centric approach measures the proportion of requests that succeed:
Availability = Successful Requests / Total Requests × 100%
This approach has two major advantages: it measures what users actually experience, and it automatically weights impact by traffic, since an outage during peak hours fails far more requests than the same outage overnight.
Request-based availability is the foundation of modern SLI practice. Each interaction either succeeds or fails, and the aggregate of these individual experiences defines availability.
User-Session Availability
Even request-based availability can obscure user experience. Consider a user who makes 10 requests to your service. If 9 succeed and 1 fails, is that 90% availability for that user? It might be, or that one failure might have been critical (the checkout request), making the entire session a failure.
User-session availability measures the proportion of user sessions that complete successfully:
Availability = Successful User Sessions / Total User Sessions × 100%
This is the most user-centric definition but also the most difficult to implement—you must define session boundaries, success criteria for sessions, and handle the complexity of concurrent and abandoned sessions.
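As a rough illustration of the idea, the sketch below counts a session as successful only if none of its critical requests (checkout, payment, and the like) failed; the record shape and the all-critical-steps rule are assumptions, not a prescribed implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class SessionRequest:
    session_id: str
    success: bool
    critical: bool  # e.g. checkout or payment steps

def session_availability(requests: List[SessionRequest]) -> float:
    """A session succeeds only if every critical request in it succeeded.
    Sessions with no critical requests count as successful."""
    sessions = defaultdict(list)
    for r in requests:
        sessions[r.session_id].append(r)
    successful = sum(
        1 for reqs in sessions.values()
        if all(r.success for r in reqs if r.critical)
    )
    return successful / len(sessions) * 100 if sessions else 0.0
```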
| Definition | Formula | Best For | Limitations |
|---|---|---|---|
| Time-Based | (Total Time - Downtime) / Total Time | Infrastructure components, simple services | Binary classification, ignores traffic patterns |
| Request-Based | Successful Requests / Total Requests | API services, microservices | Treats all requests equally regardless of criticality |
| User-Session | Successful Sessions / Total Sessions | End-to-end user experience | Complex to implement, requires session definition |
| Revenue-Weighted | Successful Revenue-Generating Requests / Total | E-commerce, transaction systems | Requires revenue attribution |
For request-based availability SLIs, you must precisely define what constitutes a "successful" request. This is less obvious than it seems, and different definitions significantly impact your availability numbers.
The Naive Definition: HTTP 2xx = Success
The simplest definition treats any HTTP 200-299 response as success and everything else as failure. This is better than nothing but has significant issues:
{"error": "processing failed"} is not a success.A Better Definition: Semantic Success
Semantic success means the request accomplished what the user intended—the business logic succeeded, not just the HTTP handshake. This requires understanding the purpose of each endpoint:
```typescript
// Semantic success classification for availability SLIs
interface RequestOutcome {
  httpStatus: number;
  responseBody: unknown;
  latencyMs: number;
  endpoint: string;
}

function classifyForAvailability(outcome: RequestOutcome): 'success' | 'failure' | 'excluded' {
  // Exclude synthetic health checks from user-facing availability
  if (outcome.endpoint === '/health' || outcome.endpoint === '/ready') {
    return 'excluded';
  }

  // Definite failures
  if (outcome.httpStatus >= 500 && outcome.httpStatus < 600) {
    return 'failure';
  }

  // Timeouts are failures (user didn't get a response)
  if (outcome.latencyMs > 30000) { // 30 second hard timeout
    return 'failure';
  }

  // Rate limiting: depends on context
  // - Abusive traffic being rate-limited: excluded
  // - Legitimate user hitting limits: failure
  if (outcome.httpStatus === 429) {
    // This would need additional context about the client
    return 'excluded'; // Conservative: don't count rate-limited as availability failure
  }

  // 4xx client errors: nuanced classification
  if (outcome.httpStatus >= 400 && outcome.httpStatus < 500) {
    // 404 on a valid-looking URL = broken link = our failure
    // 404 on obvious user typo = not our failure
    // 401/403 = authentication working as intended = not a failure
    if (outcome.httpStatus === 401 || outcome.httpStatus === 403) {
      return 'success'; // Auth system working correctly
    }
    if (outcome.httpStatus === 400) {
      // Ideally, distinguish server-caused bad requests from user errors
      return 'excluded'; // Conservative: don't count as failure or success
    }
    return 'excluded';
  }

  // 2xx with error in response body
  if (outcome.httpStatus >= 200 && outcome.httpStatus < 300) {
    const body = outcome.responseBody as { error?: string; success?: boolean };
    if (body?.error || body?.success === false) {
      return 'failure'; // HTTP success but business logic failure
    }
    return 'success';
  }

  return 'excluded';
}
```

Latency as a Dimension of Availability
A philosophically interesting question: if a request takes 60 seconds to complete, was the service "available"?
From the user's perspective, extremely slow responses are functionally equivalent to unavailability. Users abandon requests, retry (causing more load), and perceive the service as broken. This leads to the concept of latency-adjusted availability:
Latency-Adjusted Success = Request returned AND latency < threshold
You can incorporate latency into availability by treating requests that exceed a latency threshold as failures, using either a single hard timeout for all endpoints or per-endpoint thresholds tuned to each interaction type.
The threshold should reflect user tolerance. For synchronous UI interactions, 10-30 seconds might be appropriate. For background jobs visible to users, minutes might be acceptable.
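A minimal sketch of latency-adjusted success with per-endpoint-class thresholds; the class names and threshold values are illustrative assumptions, not recommendations.

```python
# Per-endpoint-class latency thresholds; values are illustrative, not recommendations.
LATENCY_THRESHOLDS_MS = {
    "interactive": 10_000,   # synchronous UI interactions
    "background": 120_000,   # background jobs that users eventually see
}

def latency_adjusted_success(returned_ok: bool, latency_ms: float, endpoint_class: str) -> bool:
    """A request counts toward availability only if it succeeded AND came back in time."""
    threshold = LATENCY_THRESHOLDS_MS.get(endpoint_class, 30_000)  # fall back to a hard timeout
    return returned_ok and latency_ms < threshold
```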
Whatever definition of "success" you choose, document it explicitly. Include the edge cases: how do you handle timeouts? Partial failures? Retries? An undocumented availability SLI is essentially meaningless—different people will interpret the same numbers differently.
Availability = Successful Requests / Total Requests. We've discussed the numerator (what's a success?), now let's tackle the denominator: what counts as a request?
This is the "denominator problem," and getting it wrong can dramatically skew your availability numbers.
Problem 1: Synthetic vs. Real Traffic
Your monitoring systems generate synthetic requests for health checks. Your internal tools make API calls. Partner services call your endpoints. Not all of this represents "user" traffic.
Should synthetic health checks count?
Solution: Separate SLIs for synthetic and real user traffic. Use synthetic monitoring to detect issues quickly, but calculate user-facing availability SLIs from user traffic only. Tag requests at ingestion to distinguish origins.
Problem 2: Internal vs. External Traffic
In a microservices architecture, a single user request might generate 50 internal service-to-service calls. If you count all requests equally, one failed user request can be drowned out by dozens of successful internal calls, and your availability number stops reflecting what users actually see.
Solution: Calculate availability at the boundary that users interact with. Internal service calls contribute to that boundary's availability but shouldn't be double-counted. Alternatively, maintain separate SLIs for internal and external availability.
Problem 3: Request Amplification and Reduction
The relationship between user intent and requests is rarely 1:1:
Amplification scenarios: a single page load fans out into dozens of HTTP requests for assets and API calls, and automatic client retries multiply failed attempts.
Reduction scenarios: client-side caching and request batching collapse many user actions into a single network request.
Solution: Define your SLI at the level of user intent, not network requests. For a page load, the SLI might be "page load success rate," counting each navigation as one attempt regardless of how many HTTP requests it generates. This requires client-side instrumentation but produces more meaningful measurements.
Problem 4: The Zero-Traffic Window
What's your availability when nobody is using the service? With zero requests in a measurement window, the success ratio is undefined: there is nothing in the denominator.
Solution: For windows with insufficient traffic, either exclude the window from the calculation, treat it as meeting the target (no users were affected), or fall back to synthetic probe results as a proxy signal.
With small sample sizes, availability percentages become unreliable. If you have 10 requests and 1 fails, is that 90% availability? It might be—or it might be an outlier and true availability is 99.9%. Report confidence intervals alongside point estimates when sample sizes are small.
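To make the small-sample problem concrete, here is a minimal Wilson score interval sketch (the same interval used by the calculator later on this page); the printed bounds are approximate.

```python
def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a success proportion, returned in percent."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * ((p * (1 - p) / total + z**2 / (4 * total**2)) ** 0.5)
    return (max(0.0, (center - margin) * 100), min(100.0, (center + margin) * 100))

# 9 successes out of 10: point estimate 90%, but the interval is roughly 60%-98%.
print(wilson_interval(9, 10))
# 990 out of 1000: the interval tightens to roughly 98.2%-99.5%.
print(wilson_interval(990, 1000))
```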
With definitions established, let's explore practical measurement approaches.
Measurement Point Selection
Where you measure availability affects what you capture:
At the load balancer/ingress: captures every request that reaches your infrastructure, including those that fail before touching application code, but misses DNS, CDN, and client network failures.
At the application layer: gives rich semantic detail (which handler, which error, which customer), but misses requests that never reach the application at all.
Client-side (RUM): captures what users actually experienced, including client and network failures, but is noisier, arrives later, and can be blocked or dropped by clients.
Hybrid approach (recommended): use load balancer logs as the primary availability source, application metrics for detailed error classification, and client-side telemetry to validate that server-side numbers match real user experience.
Aggregation Windows
Availability SLIs are typically aggregated over time windows. The choice of window size has significant implications:
Short windows (1-5 minutes): responsive enough for alerting, but noisy; a handful of failures in a quiet period can swing the percentage wildly.
Medium windows (1 hour): smooth out noise while still surfacing sustained problems; a reasonable default for on-call dashboards.
Long windows (1 day, 1 week, 1 month): stable and well suited to SLO compliance and reporting, but far too slow for detecting incidents.
Rolling vs. calendar windows:
For SLO compliance, rolling windows (often 30 days) are typical. For reporting to stakeholders, calendar months align with business cycles.
| Use Case | Recommended Window | Rationale |
|---|---|---|
| Real-time alerting | 1-5 minute rolling | Fast detection of emerging issues |
| On-call dashboards | 1 hour rolling | Balance of immediacy and stability |
| SLO burn rate | 1 hour to 1 day rolling | Error budget consumption visibility |
| SLO compliance | 30 day rolling | Standard period for objectives |
| Executive reporting | Calendar month | Alignment with business reporting cycles |
| SLA contractual | Calendar month | Customer billing cycle alignment |
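As one possible implementation shape, the sketch below maintains availability over a rolling window built from per-minute (good, total) buckets; the bucket granularity and the in-memory deque are simplifying assumptions, and a production system would typically compute this in its metrics backend.

```python
from collections import deque
from typing import Optional

class RollingAvailability:
    """Rolling-window availability from per-minute (good, total) request counts."""

    def __init__(self, window_minutes: int = 30 * 24 * 60):  # 30-day rolling window
        self.buckets = deque(maxlen=window_minutes)  # oldest minutes fall off automatically

    def record_minute(self, good: int, total: int) -> None:
        self.buckets.append((good, total))

    def availability_percent(self) -> Optional[float]:
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return good / total * 100 if total else None  # None for zero-traffic windows
```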
```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional
import statistics


@dataclass
class RequestRecord:
    timestamp: datetime
    success: bool
    latency_ms: float
    is_synthetic: bool = False


class AvailabilityCalculator:
    """
    Calculates request-based availability with proper handling
    of edge cases and multiple aggregation strategies.
    """

    def __init__(
        self,
        latency_threshold_ms: float = 30000,
        min_sample_size: int = 100
    ):
        self.latency_threshold_ms = latency_threshold_ms
        self.min_sample_size = min_sample_size

    def calculate_availability(
        self,
        requests: List[RequestRecord],
        include_latency: bool = True,
        exclude_synthetic: bool = True
    ) -> dict:
        """
        Calculate availability with proper statistical handling.

        Returns:
            dict with availability percentage, confidence interval,
            and metadata about the calculation.
        """
        # Filter synthetic if requested
        if exclude_synthetic:
            requests = [r for r in requests if not r.is_synthetic]

        if len(requests) == 0:
            return {
                "availability_percent": None,
                "error": "NO_DATA",
                "message": "No requests in window"
            }

        if len(requests) < self.min_sample_size:
            return {
                "availability_percent": None,
                "error": "INSUFFICIENT_DATA",
                "message": f"Only {len(requests)} requests; need {self.min_sample_size}",
                "sample_size": len(requests)
            }

        # Calculate success based on semantic success + latency
        successes = 0
        for r in requests:
            if r.success:
                if include_latency and r.latency_ms > self.latency_threshold_ms:
                    continue  # Latency exceeded; count as failure
                successes += 1

        total = len(requests)
        availability = (successes / total) * 100

        # Calculate confidence interval using Wilson score interval
        # for proportion estimation
        confidence_interval = self._wilson_confidence(successes, total)

        return {
            "availability_percent": round(availability, 4),
            "successes": successes,
            "failures": total - successes,
            "total_requests": total,
            "confidence_interval_95": confidence_interval,
            "latency_adjusted": include_latency,
            "latency_threshold_ms": self.latency_threshold_ms if include_latency else None
        }

    def _wilson_confidence(
        self,
        successes: int,
        total: int,
        z: float = 1.96  # 95% confidence
    ) -> tuple:
        """Wilson score confidence interval for proportions."""
        if total == 0:
            return (0, 0)

        p = successes / total
        denominator = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denominator
        margin = (z / denominator) * (
            (p * (1 - p) / total + z**2 / (4 * total**2)) ** 0.5
        )

        lower = max(0, (center - margin) * 100)
        upper = min(100, (center + margin) * 100)
        return (round(lower, 4), round(upper, 4))


# Usage example
calculator = AvailabilityCalculator(latency_threshold_ms=30000)
requests = [...]  # Load from your data source

result = calculator.calculate_availability(requests)
print(f"Availability: {result['availability_percent']:.4f}%")
print(f"95% CI: [{result['confidence_interval_95'][0]:.4f}%, "
      f"{result['confidence_interval_95'][1]:.4f}%]")
```

Availability is commonly expressed in "nines"—99% is "two nines," 99.9% is "three nines," and so on. This shorthand obscures the dramatic differences between levels. Let's make it concrete.
The Exponential Scale of Nines
Each additional nine represents a 10x reduction in allowed downtime. This exponential relationship has profound implications:
| Availability | "Nines" | Monthly Downtime | Annual Downtime | Characteristics |
|---|---|---|---|---|
| 99% | Two nines | 7.3 hours | 3.65 days | Basic level; significant user complaints expected |
| 99.5% | Two and a half nines | 3.65 hours | 1.83 days | Improved; still noticeable to active users |
| 99.9% | Three nines | 43.8 minutes | 8.76 hours | Standard target for many SaaS products |
| 99.95% | Three and a half nines | 21.9 minutes | 4.38 hours | High reliability; most users won't experience issues |
| 99.99% | Four nines | 4.38 minutes | 52.6 minutes | Very high reliability; expensive to achieve |
| 99.999% | Five nines | 26.3 seconds | 5.26 minutes | Extreme reliability; requires major architectural investment |
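A small sketch that converts an availability target into an allowed-downtime budget, using the same average-month length (365.25 days / 12, about 30.44 days) as the table above.

```python
def downtime_budget_minutes(target_percent: float, period_hours: float) -> float:
    """Allowed downtime, in minutes, for a given availability target over a period."""
    return (1 - target_percent / 100) * period_hours * 60

AVG_MONTH_HOURS = 730.5   # 365.25 days / 12, matching the table above
YEAR_HOURS = 8766.0       # 365.25 days

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}%: {downtime_budget_minutes(target, AVG_MONTH_HOURS):.1f} min/month, "
          f"{downtime_budget_minutes(target, YEAR_HOURS):.1f} min/year")
```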
The Cost of Each Nine
Making a service more available gets dramatically more expensive as you add nines. Each additional nine means a 10x reduction in allowed downtime, which typically demands qualitatively new investment: multi-zone or multi-region redundancy, automated failover, stricter change management, and deeper operational readiness.
The question isn't "how to achieve five nines" but "do we need five nines?" For most consumer services, three to four nines is appropriate. Five or more nines is reserved for critical infrastructure (telecommunications, financial clearing, emergency services).
Nines Are Not Additive
A common misconception: "If I have two components each at 99.9% availability, my total availability is also 99.9%." This is wrong.
If components are in series (a request must pass through both), availabilities multiply: two components at 99.9% each yield 0.999 × 0.999 ≈ 99.8%, not 99.9%.
With 10 components at 99.9% each in series, 0.999¹⁰ ≈ 99.0%. You've lost an entire nine just by having 10 dependencies.
This is why microservices architectures can have availability challenges—you multiply many slightly-unreliable components together.
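Here is a minimal sketch of that series calculation, assuming every component sits on the critical path.

```python
from functools import reduce
from typing import List

def series_availability(component_percents: List[float]) -> float:
    """Composite availability when a request must pass through every component."""
    return reduce(lambda acc, a: acc * (a / 100), component_percents, 1.0) * 100

# Ten dependencies, each at three nines, on the critical path:
print(f"{series_availability([99.9] * 10):.2f}%")  # ~99.00%: a whole nine lost
```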
Your service's maximum possible availability is bounded by your least reliable critical dependency. If your payment provider offers 99.9% availability, your checkout cannot exceed 99.9% no matter how reliable your own infrastructure is. Map your dependencies and understand their SLAs before committing to your own.
Choosing the Right Target
How do you select an appropriate availability target? Consider:
User expectations: What do your users expect based on competitive alternatives and historical experience?
Business impact of downtime: What's the cost per minute of unavailability? (Lost revenue, SLA penalties, reputation damage)
Achievability: What level can your team and infrastructure realistically maintain?
Cost-benefit analysis: At what point does the incremental cost of reliability exceed the benefit?
For most B2B SaaS products, 99.9% is a sensible starting target—achievable with good practices, meeting user expectations, and not requiring heroic effort. As you mature, you can tighten to 99.95% or 99.99% if business needs justify the investment.
Real-world availability measurement is full of edge cases that don't fit neatly into "success" or "failure." How you handle these determines whether your SLI reflects reality or fiction.
Planned Maintenance
Should scheduled maintenance count against availability? Arguments both ways:
Count it: Users experience unavailability regardless of whether it was planned. From the user perspective, the service was down.
Exclude it: Planned maintenance with advance notice is a different user experience than unexpected outages. Users can plan around announced maintenance.
Recommended approach: Count planned maintenance in your SLI but track it separately. Report "raw availability" (all downtime) and "unplanned availability" (excluding announced maintenance). Be strict about what qualifies as "announced"—maintenance that gets announced 5 minutes before isn't properly planned.
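One way to report both numbers, sketched here with a time-based formulation for simplicity; the outage record shape and the average-month period are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Outage:
    duration_minutes: float
    planned: bool  # announced well in advance

AVG_MONTH_MINUTES = 43_830  # 30.44-day average month

def availability_report(outages: List[Outage],
                        period_minutes: float = AVG_MONTH_MINUTES) -> Dict[str, float]:
    """Raw availability counts all downtime; unplanned availability excludes
    properly announced maintenance. Both are reported, neither is hidden."""
    total_down = sum(o.duration_minutes for o in outages)
    unplanned_down = sum(o.duration_minutes for o in outages if not o.planned)
    return {
        "raw_availability_percent": (1 - total_down / period_minutes) * 100,
        "unplanned_availability_percent": (1 - unplanned_down / period_minutes) * 100,
    }
```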
Dependency Failures
If your service is unavailable because AWS is down, should that count?
Count it: Users experience your unavailability. They don't care whose fault it is.
Exclude it: It's beyond your control; punishing you for AWS failures isn't fair.
Recommended approach: Count all failures in your SLI (users experienced them). However, track "owned" vs. "dependency" failures separately for internal analysis. Your SLO should account for expected dependency outages—if AWS promises 99.9%, your SLO can't assume 100% AWS uptime.
Partial Availability
Complex applications have multiple features with independent reliability. How do you handle partial outages?
Example: Your e-commerce site's search is down, but browse, cart, and checkout work fine. What's your availability?
Option 1 - Component SLIs: Maintain separate availability SLIs for search, browse, cart, checkout. Each has its own target. This is operationally clean but doesn't give an overall picture.
Option 2 - Weighted composite: Define weights based on usage frequency or business importance. Overall availability = Σ(component availability × weight). Search at 40% traffic weight and 0% availability, plus everything else at 60% weight and 100% availability, gives 60% overall (see the sketch after these options).
Option 3 - Critical path: Define availability by the critical user journey. If checkout is available, the service is "available." Degraded features are tracked separately. This works well for primarily transactional services.
Recommended approach: Use component SLIs for operations and reporting, critical path for high-level SLA contractual commitments, and be explicit about what "available" means in each context.
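To make Option 2 concrete, here is a minimal sketch of the weighted composite using the traffic weights from the example above; the component names and weights are illustrative.

```python
from typing import Dict, Tuple

def weighted_composite_availability(components: Dict[str, Tuple[float, float]]) -> float:
    """components maps name -> (availability_percent, traffic_weight); weights should sum to 1."""
    return sum(avail * weight for avail, weight in components.values())

# The scenario from Option 2: search completely down, everything else healthy.
print(weighted_composite_availability({
    "search":   (0.0, 0.40),
    "browse":   (100.0, 0.30),
    "cart":     (100.0, 0.15),
    "checkout": (100.0, 0.15),
}))  # -> 60.0
```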
Gray Failures
The hardest edge case: the service appears available but is subtly broken. Examples include responses built from stale or incomplete data, writes that are acknowledged but silently lost, and search results that quietly omit a fraction of matches.
These are the most dangerous because they evade simple success/failure SLIs.
Solution: Complement availability SLIs with correctness SLIs. Don't just measure "did it respond?" but "did it respond correctly?" This might mean sampling responses for correctness, monitoring data integrity, or tracking specific correctness indicators.
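One lightweight way to build such a correctness SLI is to sample a fraction of responses and check each against a correctness predicate; the sketch below assumes you can express correctness as a function of the response, which is itself often the hard part.

```python
import random
from typing import Callable, Iterable, Optional

def correctness_sli(
    responses: Iterable[dict],
    is_correct: Callable[[dict], bool],
    sample_rate: float = 0.01,
) -> Optional[float]:
    """Sample a fraction of responses and check each against a correctness
    predicate (e.g. totals reconcile, referenced records exist)."""
    sampled = [r for r in responses if random.random() < sample_rate]
    if not sampled:
        return None  # not enough data to report this window
    correct = sum(1 for r in sampled if is_correct(r))
    return correct / len(sampled) * 100
```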
Every exclusion you add makes your SLI less meaningful. Teams under pressure will find creative ways to exclude outages. Resist this. A high availability number that doesn't reflect user experience provides false confidence. It's better to have a lower, honest number that drives improvement than a high, fictional number that hides problems.
Let's examine how specific systems might define availability SLIs, demonstrating the principles in practice.
Example 1: RESTful API Service
```yaml
# Availability SLI Specification: Customer-Facing REST API
sli:
  name: "API Availability"
  description: "Proportion of valid API requests that succeed"

  numerator:
    description: "Successful requests"
    criteria:
      - http_status < 500          # No server errors
      - response_time_ms < 30000   # Must respond within 30s
      - "!timeout"                 # No client or server timeout

  denominator:
    description: "Total valid requests"
    includes:
      - "All requests to /api/* endpoints"
    excludes:
      - "Health check requests (/api/health, /api/ready)"
      - "Requests from internal services (by header X-Internal: true)"
      - "Requests during announced maintenance windows"
      - "Requests blocked by rate limiting (429 responses)"

  measurement:
    primary_source: "Load balancer access logs"
    secondary_source: "Application metrics (for detailed error classification)"
    aggregation_window: "5 minutes for alerting, 30 days rolling for SLO"
    latency_threshold: "Requests exceeding 30s counted as failures"

  target: 99.9%  # Three nines
  error_budget:
    monthly: 43.83  # minutes equivalent
    calculation: "(1 - 0.999) * total_requests"
```

Example 2: E-Commerce Web Application
```yaml
# Availability SLI Specification: E-Commerce Platform
sli:
  name: "Checkout Availability"
  description: "Proportion of checkout attempts that complete successfully"
  type: "User Journey"

  journey_definition:
    start_event: "checkout_initiated"
    end_event: "order_confirmed OR checkout_failed"
    timeout: 300 seconds  # Max 5 min checkout

  numerator:
    description: "Successful checkouts"
    criteria:
      - order_confirmation_received == true
      - payment_captured == true
      - confirmation_email_queued == true

  denominator:
    description: "Total checkout attempts"
    includes:
      - "All checkout_initiated events from real users"
    excludes:
      - "Test transactions (identified by test card numbers)"
      - "Bot traffic (identified by device fingerprint service)"
      - "User-abandoned checkouts (no activity for 5+ minutes)"
    abandonment_note: >
      User abandonment is excluded from denominator. Separately track
      abandonment rate to ensure we're not falsely classifying
      system failures as user abandonment.

  measurement:
    primary_source: "Client-side journey tracking events"
    secondary_source: "Server-side transaction logs"

  target: 99.95%

  sub_slis:
    - name: "Payment Processing Availability"
      target: 99.99%
      description: "Of checkouts reaching payment, proportion that process"
    - name: "Cart Availability"
      target: 99.9%
      description: "Proportion of cart load attempts that succeed"
```

Example 3: Data Processing Pipeline
```yaml
# Availability SLI Specification: Data Processing Pipeline
sli:
  name: "Pipeline Availability"
  description: "Proportion of data ingestion windows processed successfully"
  type: "Batch Processing"

  # For batch systems, availability is measured in processing windows
  # not individual requests

  numerator:
    description: "Successful processing windows"
    criteria:
      - processing_completed == true
      - output_data_valid == true
      - processing_time < 2 * expected_duration

  denominator:
    description: "Total scheduled processing windows"
    unit: "1-hour processing windows"
    includes:
      - "All scheduled hourly processing runs"
    excludes:
      - "Deliberately paused runs during maintenance"
      - "Runs delayed by upstream data unavailability"

  measurement:
    primary_source: "Pipeline orchestration logs"
    secondary_source: "Output data quality checks"
    aggregation_window: "24 hours (24 windows)"

  target: 99.5%  # 0.5% = approximately 1 failed window per week

  freshness_sli:
    name: "Data Freshness"
    description: "Proportion of time data is within freshness target"
    target: 99%
    freshness_target: "3 hours from event occurrence to query availability"
```

Availability SLIs, while conceptually simple, require careful design to accurately reflect user experience. Let's consolidate the key learnings:
- Choose the availability definition (time-based, request-based, or session-based) that matches how users experience your service.
- Define "success" semantically, not just by HTTP status, and document every exclusion and edge case explicitly.
- Be as deliberate about the denominator as the numerator: synthetic traffic, internal calls, and request amplification all distort the ratio.
- Nines scale exponentially in both downtime and cost, and dependencies in series multiply their unreliability.
- Count what users actually experienced, track causes (planned maintenance, dependency failures) separately, and resist exclusions that inflate the number.
What's Next
Availability answers "did it work?" but doesn't capture "how fast did it work?" In the next page, we'll explore Latency SLIs—measuring and setting targets for response time, understanding percentiles, and the nuanced relationship between latency and user experience.
You now have a comprehensive understanding of availability SLIs—from philosophical definitions through practical implementation. You can define what "available" means for your service, choose appropriate measurement approaches, set realistic targets, and handle the edge cases that make real-world availability complex.