Of all the dimensions of service reliability, availability is the most fundamental. Before users can evaluate whether your service is fast, accurate, or feature-rich, they must first be able to access it. Availability is the prerequisite for every other quality—the foundation without which nothing else matters.
Yet availability, seemingly the simplest reliability concept, harbors surprising complexity when you attempt to measure it rigorously. What does it mean for a service to be "available"? If the homepage loads but checkout fails, is the service available? If requests complete but take 30 seconds, is that availability or a latency problem? If your service works for 99% of users but 1% experience DNS resolution failures, what is your true availability?
These aren't academic questions. The answers determine how you contractually commit to customers (SLAs), what targets you set for engineering teams (SLOs), and how you detect when you're falling short (alerting). Get availability measurement wrong, and you'll either over-promise to customers or under-invest in reliability—both costly mistakes.
By the end of this page, you will understand the multiple dimensions of availability, how to choose the right availability definition for your context, techniques for accurate measurement including handling of edge cases, and how to calculate and report availability metrics that genuinely reflect user experience. You'll be equipped to design availability SLIs that drive meaningful reliability improvement.
The seemingly simple question "Is the service available?" has no single correct answer. Different definitions serve different purposes, and choosing the wrong definition leads to metrics that don't reflect reality.
Time-Based Availability
The traditional definition of availability measures the proportion of time a service is operational:
Availability = (Total Time - Downtime) / Total Time × 100%
For example, if a service has 1 hour of downtime in a 30-day month (720 hours):
Availability = (720 - 1) / 720 × 100% = 99.86%
This approach is intuitive and widely used, particularly in infrastructure contexts (server uptime, network uptime). However, it has significant limitations:
Time-based availability treats all minutes equally, but user impact is not uniform across time. A minute of downtime during Black Friday peak shopping has vastly more impact than a minute at 3 AM on a Tuesday. User-centric availability must account for traffic-weighted impact.
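To see the difference concretely, here is a minimal sketch comparing time-based and request-based availability for the same one-hour outage; the hourly traffic numbers are invented purely for illustration.

```python
# The same one-hour outage, measured two ways. Hourly request volumes
# are invented for illustration: quiet overnight, busy during the day.
hourly_traffic = [200] * 3 + [5000] * 20 + [200]  # 24 hourly buckets

def time_based_availability(downtime_hours: float, total_hours: int = 24) -> float:
    return (total_hours - downtime_hours) / total_hours * 100

def request_based_availability(outage_hour: int) -> float:
    total = sum(hourly_traffic)
    failed = hourly_traffic[outage_hour]  # assume every request in the outage hour fails
    return (total - failed) / total * 100

print(f"Time-based (any hour):    {time_based_availability(1):.2f}%")    # ~95.83% regardless of when
print(f"Request-based, off-peak:  {request_based_availability(0):.2f}%")   # ~99.80%
print(f"Request-based, peak hour: {request_based_availability(12):.2f}%")  # ~95.04%
```

The time-based number is identical whenever the outage happens; the request-based number moves with how many users were actually affected.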
Request-Based Availability
A more user-centric approach measures the proportion of requests that succeed:
Availability = Successful Requests / Total Requests × 100%
This approach has two major advantages: it measures what users actually experience, and it automatically weights impact by traffic, since an outage during peak hours fails far more requests than the same outage overnight.
Request-based availability is the foundation of modern SLI practice. Each interaction either succeeds or fails, and the aggregate of these individual experiences defines availability.
User-Session Availability
Even request-based availability can obscure user experience. Consider a user who makes 10 requests to your service. If 9 succeed and 1 fails, is that 90% availability for that user? It might be, or that one failure might have been critical (the checkout request), making the entire session a failure.
User-session availability measures the proportion of user sessions that complete successfully:
Availability = Successful User Sessions / Total User Sessions × 100%
This is the most user-centric definition but also the most difficult to implement—you must define session boundaries, success criteria for sessions, and handle the complexity of concurrent and abandoned sessions.
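As a rough illustration of the idea, the sketch below counts a session as successful only if none of its critical requests (checkout, payment, and the like) failed; the record shape and the all-critical-steps rule are assumptions, not a prescribed implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class SessionRequest:
    session_id: str
    success: bool
    critical: bool  # e.g. checkout or payment steps

def session_availability(requests: List[SessionRequest]) -> float:
    """A session succeeds only if every critical request in it succeeded.
    Sessions with no critical requests count as successful."""
    sessions = defaultdict(list)
    for r in requests:
        sessions[r.session_id].append(r)
    successful = sum(
        1 for reqs in sessions.values()
        if all(r.success for r in reqs if r.critical)
    )
    return successful / len(sessions) * 100 if sessions else 0.0
```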
| Definition | Formula | Best For | Limitations |
|---|---|---|---|
| Time-Based | (Total Time - Downtime) / Total Time | Infrastructure components, simple services | Binary classification, ignores traffic patterns |
| Request-Based | Successful Requests / Total Requests | API services, microservices | Treats all requests equally regardless of criticality |
| User-Session | Successful Sessions / Total Sessions | End-to-end user experience | Complex to implement, requires session definition |
| Revenue-Weighted | Successful Revenue-Generating Requests / Total | E-commerce, transaction systems | Requires revenue attribution |
For request-based availability SLIs, you must precisely define what constitutes a "successful" request. This is less obvious than it seems, and different definitions significantly impact your availability numbers.
The Naive Definition: HTTP 2xx = Success
The simplest definition treats any HTTP 200-299 response as success and everything else as failure. This is better than nothing but has significant issues:
{"error": "processing failed"} is not a success.A Better Definition: Semantic Success
Semantic success means the request accomplished what the user intended—the business logic succeeded, not just the HTTP handshake. This requires understanding the purpose of each endpoint:
```typescript
// Semantic success classification for availability SLIs
interface RequestOutcome {
  httpStatus: number;
  responseBody: unknown;
  latencyMs: number;
  endpoint: string;
}

function classifyForAvailability(outcome: RequestOutcome): 'success' | 'failure' | 'excluded' {
  // Exclude synthetic health checks from user-facing availability
  if (outcome.endpoint === '/health' || outcome.endpoint === '/ready') {
    return 'excluded';
  }

  // Definite failures
  if (outcome.httpStatus >= 500 && outcome.httpStatus < 600) {
    return 'failure';
  }

  // Timeouts are failures (user didn't get a response)
  if (outcome.latencyMs > 30000) { // 30 second hard timeout
    return 'failure';
  }

  // Rate limiting: depends on context
  // - Abusive traffic being rate-limited: excluded
  // - Legitimate user hitting limits: failure
  if (outcome.httpStatus === 429) {
    // This would need additional context about the client
    return 'excluded'; // Conservative: don't count rate-limited as availability failure
  }

  // 4xx client errors: nuanced classification
  if (outcome.httpStatus >= 400 && outcome.httpStatus < 500) {
    // 404 on a valid-looking URL = broken link = our failure
    // 404 on obvious user typo = not our failure
    // 401/403 = authentication working as intended = not a failure
    if (outcome.httpStatus === 401 || outcome.httpStatus === 403) {
      return 'success'; // Auth system working correctly
    }
    if (outcome.httpStatus === 400) {
      // Ideally, distinguish server-caused bad requests from user errors
      return 'excluded'; // Conservative: don't count as failure or success
    }
    return 'excluded';
  }

  // 2xx with error in response body
  if (outcome.httpStatus >= 200 && outcome.httpStatus < 300) {
    const body = outcome.responseBody as { error?: string; success?: boolean };
    if (body?.error || body?.success === false) {
      return 'failure'; // HTTP success but business logic failure
    }
    return 'success';
  }

  return 'excluded';
}
```

Latency as a Dimension of Availability
A philosophically interesting question: if a request takes 60 seconds to complete, was the service "available"?
From the user's perspective, extremely slow responses are functionally equivalent to unavailability. Users abandon requests, retry (causing more load), and perceive the service as broken. This leads to the concept of latency-adjusted availability:
Latency-Adjusted Success = Request returned AND latency < threshold
You can incorporate latency into availability by treating requests that exceed a latency threshold as failures, using either a single hard timeout for all endpoints or per-endpoint thresholds tuned to each interaction type.
The threshold should reflect user tolerance. For synchronous UI interactions, 10-30 seconds might be appropriate. For background jobs visible to users, minutes might be acceptable.
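A minimal sketch of latency-adjusted success with per-endpoint-class thresholds; the class names and threshold values are illustrative assumptions, not recommendations.

```python
# Per-endpoint-class latency thresholds; values are illustrative, not recommendations.
LATENCY_THRESHOLDS_MS = {
    "interactive": 10_000,   # synchronous UI interactions
    "background": 120_000,   # background jobs that users eventually see
}

def latency_adjusted_success(returned_ok: bool, latency_ms: float, endpoint_class: str) -> bool:
    """A request counts toward availability only if it succeeded AND came back in time."""
    threshold = LATENCY_THRESHOLDS_MS.get(endpoint_class, 30_000)  # fall back to a hard timeout
    return returned_ok and latency_ms < threshold
```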
Whatever definition of "success" you choose, document it explicitly. Include the edge cases: how do you handle timeouts? Partial failures? Retries? An undocumented availability SLI is essentially meaningless—different people will interpret the same numbers differently.
Availability = Successful Requests / Total Requests. We've discussed the numerator (what's a success?), now let's tackle the denominator: what counts as a request?
This is the "denominator problem," and getting it wrong can dramatically skew your availability numbers.
Problem 1: Synthetic vs. Real Traffic
Your monitoring systems generate synthetic requests for health checks. Your internal tools make API calls. Partner services call your endpoints. Not all of this represents "user" traffic.
Should synthetic health checks count?
Solution: Separate SLIs for synthetic and real user traffic. Use synthetic monitoring to detect issues quickly, but calculate user-facing availability SLIs from user traffic only. Tag requests at ingestion to distinguish origins.
Problem 2: Internal vs. External Traffic
In a microservices architecture, a single user request might generate 50 internal service-to-service calls. If you count all requests equally, one failed user request can be drowned out by dozens of successful internal calls, and your availability number stops reflecting what users actually see.
Solution: Calculate availability at the boundary that users interact with. Internal service calls contribute to that boundary's availability but shouldn't be double-counted. Alternatively, maintain separate SLIs for internal and external availability.
Problem 3: Request Amplification and Reduction
The relationship between user intent and requests is rarely 1:1:
Amplification scenarios: a single page load fans out into dozens of HTTP requests for assets and API calls, and automatic client retries multiply failed attempts.
Reduction scenarios: client-side caching and request batching collapse many user actions into a single network request.
Solution: Define your SLI at the level of user intent, not network requests. For a page load, the SLI might be "page load success rate," counting each navigation as one attempt regardless of how many HTTP requests it generates. This requires client-side instrumentation but produces more meaningful measurements.
Problem 4: The Zero-Traffic Window
What's your availability when nobody is using the service? With zero requests in a measurement window, the success ratio is undefined: there is nothing in the denominator.
Solution: For windows with insufficient traffic, either exclude the window from the calculation, treat it as meeting the target (no users were affected), or fall back to synthetic probe results as a proxy signal.
With small sample sizes, availability percentages become unreliable. If you have 10 requests and 1 fails, is that 90% availability? It might be—or it might be an outlier and true availability is 99.9%. Report confidence intervals alongside point estimates when sample sizes are small.
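To make the small-sample problem concrete, here is a minimal Wilson score interval sketch (the same interval used by the calculator later on this page); the printed bounds are approximate.

```python
def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a success proportion, returned in percent."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * ((p * (1 - p) / total + z**2 / (4 * total**2)) ** 0.5)
    return (max(0.0, (center - margin) * 100), min(100.0, (center + margin) * 100))

# 9 successes out of 10: point estimate 90%, but the interval is roughly 60%-98%.
print(wilson_interval(9, 10))
# 990 out of 1000: the interval tightens to roughly 98.2%-99.5%.
print(wilson_interval(990, 1000))
```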
With definitions established, let's explore practical measurement approaches.
Measurement Point Selection
Where you measure availability affects what you capture:
At the load balancer/ingress: captures every request that reaches your infrastructure, including those that fail before touching application code, but misses DNS, CDN, and client network failures.
At the application layer: gives rich semantic detail (which handler, which error, which customer), but misses requests that never reach the application at all.
Client-side (RUM): captures what users actually experienced, including client and network failures, but is noisier, arrives later, and can be blocked or dropped by clients.
Hybrid approach (recommended): use load balancer logs as the primary availability source, application metrics for detailed error classification, and client-side telemetry to validate that server-side numbers match real user experience.
Aggregation Windows
Availability SLIs are typically aggregated over time windows. The choice of window size has significant implications:
Short windows (1-5 minutes): responsive enough for alerting, but noisy; a handful of failures in a quiet period can swing the percentage wildly.
Medium windows (1 hour): smooth out noise while still surfacing sustained problems; a reasonable default for on-call dashboards.
Long windows (1 day, 1 week, 1 month): stable and well suited to SLO compliance and reporting, but far too slow for detecting incidents.
Rolling vs. calendar windows:
For SLO compliance, rolling windows (often 30 days) are typical. For reporting to stakeholders, calendar months align with business cycles.
| Use Case | Recommended Window | Rationale |
|---|---|---|
| Real-time alerting | 1-5 minute rolling | Fast detection of emerging issues |
| On-call dashboards | 1 hour rolling | Balance of immediacy and stability |
| SLO burn rate | 1 hour to 1 day rolling | Error budget consumption visibility |
| SLO compliance | 30 day rolling | Standard period for objectives |
| Executive reporting | Calendar month | Alignment with business reporting cycles |
| SLA contractual | Calendar month | Customer billing cycle alignment |
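As one possible implementation shape, the sketch below maintains availability over a rolling window built from per-minute (good, total) buckets; the bucket granularity and the in-memory deque are simplifying assumptions, and a production system would typically compute this in its metrics backend.

```python
from collections import deque
from typing import Optional

class RollingAvailability:
    """Rolling-window availability from per-minute (good, total) request counts."""

    def __init__(self, window_minutes: int = 30 * 24 * 60):  # 30-day rolling window
        self.buckets = deque(maxlen=window_minutes)  # oldest minutes fall off automatically

    def record_minute(self, good: int, total: int) -> None:
        self.buckets.append((good, total))

    def availability_percent(self) -> Optional[float]:
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return good / total * 100 if total else None  # None for zero-traffic windows
```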
```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional
import statistics


@dataclass
class RequestRecord:
    timestamp: datetime
    success: bool
    latency_ms: float
    is_synthetic: bool = False


class AvailabilityCalculator:
    """
    Calculates request-based availability with proper handling
    of edge cases and multiple aggregation strategies.
    """

    def __init__(
        self,
        latency_threshold_ms: float = 30000,
        min_sample_size: int = 100
    ):
        self.latency_threshold_ms = latency_threshold_ms
        self.min_sample_size = min_sample_size

    def calculate_availability(
        self,
        requests: List[RequestRecord],
        include_latency: bool = True,
        exclude_synthetic: bool = True
    ) -> dict:
        """
        Calculate availability with proper statistical handling.

        Returns:
            dict with availability percentage, confidence interval,
            and metadata about the calculation.
        """
        # Filter synthetic if requested
        if exclude_synthetic:
            requests = [r for r in requests if not r.is_synthetic]

        if len(requests) == 0:
            return {
                "availability_percent": None,
                "error": "NO_DATA",
                "message": "No requests in window"
            }

        if len(requests) < self.min_sample_size:
            return {
                "availability_percent": None,
                "error": "INSUFFICIENT_DATA",
                "message": f"Only {len(requests)} requests; need {self.min_sample_size}",
                "sample_size": len(requests)
            }

        # Calculate success based on semantic success + latency
        successes = 0
        for r in requests:
            if r.success:
                if include_latency and r.latency_ms > self.latency_threshold_ms:
                    continue  # Latency exceeded; count as failure
                successes += 1

        total = len(requests)
        availability = (successes / total) * 100

        # Calculate confidence interval using Wilson score interval
        # for proportion estimation
        confidence_interval = self._wilson_confidence(successes, total)

        return {
            "availability_percent": round(availability, 4),
            "successes": successes,
            "failures": total - successes,
            "total_requests": total,
            "confidence_interval_95": confidence_interval,
            "latency_adjusted": include_latency,
            "latency_threshold_ms": self.latency_threshold_ms if include_latency else None
        }

    def _wilson_confidence(
        self,
        successes: int,
        total: int,
        z: float = 1.96  # 95% confidence
    ) -> tuple:
        """Wilson score confidence interval for proportions."""
        if total == 0:
            return (0, 0)

        p = successes / total
        denominator = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denominator
        margin = (z / denominator) * (
            (p * (1 - p) / total + z**2 / (4 * total**2)) ** 0.5
        )

        lower = max(0, (center - margin) * 100)
        upper = min(100, (center + margin) * 100)
        return (round(lower, 4), round(upper, 4))


# Usage example
calculator = AvailabilityCalculator(latency_threshold_ms=30000)
requests = [...]  # Load from your data source

result = calculator.calculate_availability(requests)
print(f"Availability: {result['availability_percent']:.4f}%")
print(f"95% CI: [{result['confidence_interval_95'][0]:.4f}%, "
      f"{result['confidence_interval_95'][1]:.4f}%]")
```

Availability is commonly expressed in "nines"—99% is "two nines," 99.9% is "three nines," and so on. This shorthand obscures the dramatic differences between levels. Let's make it concrete.
The Exponential Scale of Nines
Each additional nine represents a 10x reduction in allowed downtime. This exponential relationship has profound implications:
| Availability | "Nines" | Monthly Downtime | Annual Downtime | Characteristics |
|---|---|---|---|---|
| 99% | Two nines | 7.3 hours | 3.65 days | Basic level; significant user complaints expected |
| 99.5% | Two and a half nines | 3.65 hours | 1.83 days | Improved; still noticeable to active users |
| 99.9% | Three nines | 43.8 minutes | 8.76 hours | Standard target for many SaaS products |
| 99.95% | Three and a half nines | 21.9 minutes | 4.38 hours | High reliability; most users won't experience issues |
| 99.99% | Four nines | 4.38 minutes | 52.6 minutes | Very high reliability; expensive to achieve |
| 99.999% | Five nines | 26.3 seconds | 5.26 minutes | Extreme reliability; requires major architectural investment |
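A small sketch that converts an availability target into an allowed-downtime budget, using the same average-month length (365.25 days / 12, about 30.44 days) as the table above.

```python
def downtime_budget_minutes(target_percent: float, period_hours: float) -> float:
    """Allowed downtime, in minutes, for a given availability target over a period."""
    return (1 - target_percent / 100) * period_hours * 60

AVG_MONTH_HOURS = 730.5   # 365.25 days / 12, matching the table above
YEAR_HOURS = 8766.0       # 365.25 days

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}%: {downtime_budget_minutes(target, AVG_MONTH_HOURS):.1f} min/month, "
          f"{downtime_budget_minutes(target, YEAR_HOURS):.1f} min/year")
```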
The Cost of Each Nine
Making a service more available gets dramatically more expensive as you add nines. Each additional nine means a 10x reduction in allowed downtime, which typically demands qualitatively new investment: multi-zone or multi-region redundancy, automated failover, stricter change management, and deeper operational readiness.
The question isn't "how to achieve five nines" but "do we need five nines?" For most consumer services, three to four nines is appropriate. Five or more nines is reserved for critical infrastructure (telecommunications, financial clearing, emergency services).
Nines Are Not Additive
A common misconception: "If I have two components each at 99.9% availability, my total availability is also 99.9%." This is wrong.
If components are in series (a request must pass through both), availabilities multiply: two components at 99.9% each yield 0.999 × 0.999 ≈ 99.8%, not 99.9%.
With 10 components at 99.9% each in series, 0.999¹⁰ ≈ 99.0%. You've lost an entire nine just by having 10 dependencies.
This is why microservices architectures can have availability challenges—you multiply many slightly-unreliable components together.
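Here is a minimal sketch of that series calculation, assuming every component sits on the critical path.

```python
from functools import reduce
from typing import List

def series_availability(component_percents: List[float]) -> float:
    """Composite availability when a request must pass through every component."""
    return reduce(lambda acc, a: acc * (a / 100), component_percents, 1.0) * 100

# Ten dependencies, each at three nines, on the critical path:
print(f"{series_availability([99.9] * 10):.2f}%")  # ~99.00%: a whole nine lost
```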
Your service's maximum possible availability is bounded by your least reliable critical dependency. If your payment provider offers 99.9% availability, your checkout cannot exceed 99.9% no matter how reliable your own infrastructure is. Map your dependencies and understand their SLAs before committing to your own.
Choosing the Right Target
How do you select an appropriate availability target? Consider:
User expectations: What do your users expect based on competitive alternatives and historical experience?
Business impact of downtime: What's the cost per minute of unavailability? (Lost revenue, SLA penalties, reputation damage)
Achievability: What level can your team and infrastructure realistically maintain?
Cost-benefit analysis: At what point does the incremental cost of reliability exceed the benefit?
For most B2B SaaS products, 99.9% is a sensible starting target—achievable with good practices, meeting user expectations, and not requiring heroic effort. As you mature, you can tighten to 99.95% or 99.99% if business needs justify the investment.
Real-world availability measurement is full of edge cases that don't fit neatly into "success" or "failure." How you handle these determines whether your SLI reflects reality or fiction.
Planned Maintenance
Should scheduled maintenance count against availability? Arguments both ways:
Count it: Users experience unavailability regardless of whether it was planned. From the user perspective, the service was down.
Exclude it: Planned maintenance with advance notice is a different user experience than unexpected outages. Users can plan around announced maintenance.
Recommended approach: Count planned maintenance in your SLI but track it separately. Report "raw availability" (all downtime) and "unplanned availability" (excluding announced maintenance). Be strict about what qualifies as "announced"—maintenance that gets announced 5 minutes before isn't properly planned.
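One way to report both numbers, sketched here with a time-based formulation for simplicity; the outage record shape and the average-month period are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Outage:
    duration_minutes: float
    planned: bool  # announced well in advance

AVG_MONTH_MINUTES = 43_830  # 30.44-day average month

def availability_report(outages: List[Outage],
                        period_minutes: float = AVG_MONTH_MINUTES) -> Dict[str, float]:
    """Raw availability counts all downtime; unplanned availability excludes
    properly announced maintenance. Both are reported, neither is hidden."""
    total_down = sum(o.duration_minutes for o in outages)
    unplanned_down = sum(o.duration_minutes for o in outages if not o.planned)
    return {
        "raw_availability_percent": (1 - total_down / period_minutes) * 100,
        "unplanned_availability_percent": (1 - unplanned_down / period_minutes) * 100,
    }
```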
Dependency Failures
If your service is unavailable because AWS is down, should that count?
Count it: Users experience your unavailability. They don't care whose fault it is.
Exclude it: It's beyond your control; punishing you for AWS failures isn't fair.
Recommended approach: Count all failures in your SLI (users experienced them). However, track "owned" vs. "dependency" failures separately for internal analysis. Your SLO should account for expected dependency outages—if AWS promises 99.9%, your SLO can't assume 100% AWS uptime.
Partial Availability
Complex applications have multiple features with independent reliability. How do you handle partial outages?
Example: Your e-commerce site's search is down, but browse, cart, and checkout work fine. What's your availability?
Option 1 - Component SLIs: Maintain separate availability SLIs for search, browse, cart, checkout. Each has its own target. This is operationally clean but doesn't give an overall picture.
Option 2 - Weighted composite: Define weights based on usage frequency or business importance. Overall availability = Σ(component availability × weight). Search at 40% traffic weight and 0% availability, plus everything else at 60% weight and 100% availability, gives 60% overall (see the sketch after these options).
Option 3 - Critical path: Define availability by the critical user journey. If checkout is available, the service is "available." Degraded features are tracked separately. This works well for primarily transactional services.
Recommended approach: Use component SLIs for operations and reporting, critical path for high-level SLA contractual commitments, and be explicit about what "available" means in each context.
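To make Option 2 concrete, here is a minimal sketch of the weighted composite using the traffic weights from the example above; the component names and weights are illustrative.

```python
from typing import Dict, Tuple

def weighted_composite_availability(components: Dict[str, Tuple[float, float]]) -> float:
    """components maps name -> (availability_percent, traffic_weight); weights should sum to 1."""
    return sum(avail * weight for avail, weight in components.values())

# The scenario from Option 2: search completely down, everything else healthy.
print(weighted_composite_availability({
    "search":   (0.0, 0.40),
    "browse":   (100.0, 0.30),
    "cart":     (100.0, 0.15),
    "checkout": (100.0, 0.15),
}))  # -> 60.0
```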
Gray Failures
The hardest edge case: the service appears available but is subtly broken. Examples include responses built from stale or incomplete data, writes that are acknowledged but silently lost, and search results that quietly omit a fraction of matches.
These are the most dangerous because they evade simple success/failure SLIs.
Solution: Complement availability SLIs with correctness SLIs. Don't just measure "did it respond?" but "did it respond correctly?" This might mean sampling responses for correctness, monitoring data integrity, or tracking specific correctness indicators.
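One lightweight way to build such a correctness SLI is to sample a fraction of responses and check each against a correctness predicate; the sketch below assumes you can express correctness as a function of the response, which is itself often the hard part.

```python
import random
from typing import Callable, Iterable, Optional

def correctness_sli(
    responses: Iterable[dict],
    is_correct: Callable[[dict], bool],
    sample_rate: float = 0.01,
) -> Optional[float]:
    """Sample a fraction of responses and check each against a correctness
    predicate (e.g. totals reconcile, referenced records exist)."""
    sampled = [r for r in responses if random.random() < sample_rate]
    if not sampled:
        return None  # not enough data to report this window
    correct = sum(1 for r in sampled if is_correct(r))
    return correct / len(sampled) * 100
```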
Every exclusion you add makes your SLI less meaningful. Teams under pressure will find creative ways to exclude outages. Resist this. A high availability number that doesn't reflect user experience provides false confidence. It's better to have a lower, honest number that drives improvement than a high, fictional number that hides problems.
Let's examine how specific systems might define availability SLIs, demonstrating the principles in practice.
Example 1: RESTful API Service
```yaml
# Availability SLI Specification: Customer-Facing REST API
sli:
  name: "API Availability"
  description: "Proportion of valid API requests that succeed"

  numerator:
    description: "Successful requests"
    criteria:
      - http_status < 500          # No server errors
      - response_time_ms < 30000   # Must respond within 30s
      - "!timeout"                 # No client or server timeout

  denominator:
    description: "Total valid requests"
    includes:
      - "All requests to /api/* endpoints"
    excludes:
      - "Health check requests (/api/health, /api/ready)"
      - "Requests from internal services (by header X-Internal: true)"
      - "Requests during announced maintenance windows"
      - "Requests blocked by rate limiting (429 responses)"

  measurement:
    primary_source: "Load balancer access logs"
    secondary_source: "Application metrics (for detailed error classification)"
    aggregation_window: "5 minutes for alerting, 30 days rolling for SLO"
    latency_threshold: "Requests exceeding 30s counted as failures"

  target: 99.9%  # Three nines
  error_budget:
    monthly: 43.83  # minutes equivalent
    calculation: "(1 - 0.999) * total_requests"
```

Example 2: E-Commerce Web Application
```yaml
# Availability SLI Specification: E-Commerce Platform
sli:
  name: "Checkout Availability"
  description: "Proportion of checkout attempts that complete successfully"
  type: "User Journey"

  journey_definition:
    start_event: "checkout_initiated"
    end_event: "order_confirmed OR checkout_failed"
    timeout: 300 seconds  # Max 5 min checkout

  numerator:
    description: "Successful checkouts"
    criteria:
      - order_confirmation_received == true
      - payment_captured == true
      - confirmation_email_queued == true

  denominator:
    description: "Total checkout attempts"
    includes:
      - "All checkout_initiated events from real users"
    excludes:
      - "Test transactions (identified by test card numbers)"
      - "Bot traffic (identified by device fingerprint service)"
      - "User-abandoned checkouts (no activity for 5+ minutes)"
    abandonment_note: >
      User abandonment is excluded from denominator. Separately track
      abandonment rate to ensure we're not falsely classifying
      system failures as user abandonment.

  measurement:
    primary_source: "Client-side journey tracking events"
    secondary_source: "Server-side transaction logs"

  target: 99.95%

  sub_slis:
    - name: "Payment Processing Availability"
      target: 99.99%
      description: "Of checkouts reaching payment, proportion that process"
    - name: "Cart Availability"
      target: 99.9%
      description: "Proportion of cart load attempts that succeed"
```

Example 3: Data Processing Pipeline
```yaml
# Availability SLI Specification: Data Processing Pipeline
sli:
  name: "Pipeline Availability"
  description: "Proportion of data ingestion windows processed successfully"
  type: "Batch Processing"

  # For batch systems, availability is measured in processing windows
  # not individual requests

  numerator:
    description: "Successful processing windows"
    criteria:
      - processing_completed == true
      - output_data_valid == true
      - processing_time < 2 * expected_duration

  denominator:
    description: "Total scheduled processing windows"
    unit: "1-hour processing windows"
    includes:
      - "All scheduled hourly processing runs"
    excludes:
      - "Deliberately paused runs during maintenance"
      - "Runs delayed by upstream data unavailability"

  measurement:
    primary_source: "Pipeline orchestration logs"
    secondary_source: "Output data quality checks"
    aggregation_window: "24 hours (24 windows)"

  target: 99.5%  # 0.5% = approximately 1 failed window per week

  freshness_sli:
    name: "Data Freshness"
    description: "Proportion of time data is within freshness target"
    target: 99%
    freshness_target: "3 hours from event occurrence to query availability"
```

Availability SLIs, while conceptually simple, require careful design to accurately reflect user experience. Let's consolidate the key learnings:
- Choose the availability definition (time-based, request-based, or session-based) that matches how users experience your service.
- Define "success" semantically, not just by HTTP status, and document every exclusion and edge case explicitly.
- Be as deliberate about the denominator as the numerator: synthetic traffic, internal calls, and request amplification all distort the ratio.
- Nines scale exponentially in both downtime and cost, and dependencies in series multiply their unreliability.
- Count what users actually experienced, track causes (planned maintenance, dependency failures) separately, and resist exclusions that inflate the number.
What's Next
Availability answers "did it work?" but doesn't capture "how fast did it work?" In the next page, we'll explore Latency SLIs—measuring and setting targets for response time, understanding percentiles, and the nuanced relationship between latency and user experience.
You now have a comprehensive understanding of availability SLIs—from philosophical definitions through practical implementation. You can define what "available" means for your service, choose appropriate measurement approaches, set realistic targets, and handle the edge cases that make real-world availability complex.