On July 8, 2015, United Airlines grounded its flights nationwide, delaying roughly 4,900 of them; the New York Stock Exchange halted trading for nearly four hours; and the Wall Street Journal's website went offline—all on the same day. Each outage had a different cause, but all shared a common consequence: massive business impact from unavailability.
For United, the outage cost an estimated $30 million in direct revenue plus immeasurable reputational damage. For NYSE, billions in potential trades went unexecuted. These weren't obscure technical failures—they were availability failures that made headlines.
Availability requirements define your system's promise to be accessible when users need it. They quantify the acceptable failure rate, specify recovery expectations, and drive architectural decisions that can cost millions to change later. Getting availability requirements right is not optional—it's existential.
By the end of this page, you will master the complete framework for defining availability requirements: understanding the mathematics of uptime, distinguishing between SLAs/SLOs/SLIs, calculating nines implications, specifying failure handling requirements, and translating business criticality into precise availability targets.
Availability seems straightforward—is the system working or not? But this simplicity conceals significant complexity that your requirements must address.
The Formal Definition:
Availability = (Total Time - Downtime) / Total Time × 100%
Or equivalently:
Availability = Uptime / (Uptime + Downtime) × 100%
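As a quick sanity check, here is a minimal Python sketch of the formula above, using illustrative numbers:

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage, per the formula above."""
    return uptime_hours / (uptime_hours + downtime_hours) * 100

# A 730-hour month with 43.8 minutes of downtime lands almost exactly on three nines:
print(f"{availability(730 - 43.8 / 60, 43.8 / 60):.3f}%")  # ~99.900%
```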
But What Constitutes 'Available'?
The definition of "available" varies dramatically by context:
| System Type | Available When... | Unavailable When... |
|---|---|---|
| E-commerce checkout | Users can complete purchases within 5 seconds | Any step of checkout fails or exceeds 5s |
| Search engine | 95% of queries return results in <500ms | Error rate >5% OR P95 latency >500ms |
| Real-time trading | Orders execute within 10ms, 99.9% success | Any degradation beyond these thresholds |
| Social media feed | Feed loads within 3s for 99% of users | Error pages shown OR infinite loading |
| IoT telemetry | Data ingestion succeeds for 99.99% of events | Event loss exceeds 0.01% |
One of the most common failures in availability requirements is not defining what 'available' means. A system could report 99.99% uptime while users experience degraded performance that effectively renders it unusable. Your requirements must specify: 'The system is considered available when [specific criteria including response time, error rate, and functionality scope].'
Time Windows for Measurement:
Availability metrics must specify the measurement window:
| Window | Typical Use | Consideration |
|---|---|---|
| Monthly | SLA reporting, business reviews | Most common for contracts |
| Weekly | Operational dashboards | Faster feedback, more variance |
| Rolling 30-day | Real-time monitoring | Smooths daily variance |
| Quarterly | Business planning | Strategic, less operational |
| Annual | Long-term trends | Only meaningful at very high nines |
Scheduled vs. Unscheduled Downtime:
Requirements must clarify whether scheduled maintenance counts against availability:
Example requirement: "The system shall maintain 99.9% availability measured monthly, excluding pre-announced maintenance windows of no more than 4 hours per month conducted between 02:00-06:00 UTC on Sundays."
Availability is typically expressed as "nines"—99% (two nines), 99.9% (three nines), 99.99% (four nines), and so on. These seemingly small differences in percentage have massive implications for allowed downtime and required engineering investment.
The Nines Table:
| Availability | Nines | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|---|
| 90% | One nine | 36.5 days | 3 days | 16.8 hours |
| 99% | Two nines | 3.65 days | 7.2 hours | 1.68 hours |
| 99.5% | Two and a half | 1.83 days | 3.6 hours | 50.4 min |
| 99.9% | Three nines | 8.76 hours | 43.8 min | 10.1 min |
| 99.95% | Three and a half | 4.38 hours | 21.9 min | 5.04 min |
| 99.99% | Four nines | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | Five nines | 5.26 min | 26.3 sec | 6.05 sec |
| 99.9999% | Six nines | 31.5 sec | 2.63 sec | 0.6 sec |
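To turn a nines target into a concrete downtime budget, a small sketch like the following can help; the targets in the loop are illustrative:

```python
from datetime import timedelta

def downtime_budget(availability_pct: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability target over a measurement window."""
    return window * (1 - availability_pct / 100)

year = timedelta(days=365)
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = downtime_budget(target, year).total_seconds() / 60
    print(f"{target}% -> {minutes:,.1f} min/year")  # matches the nines table above
```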
The Exponential Cost of Each Nine:
Each additional nine is not a linear increment—it typically requires exponential investment in infrastructure, process, and expertise:
| Transition | Engineering Implications |
|---|---|
| 99% → 99.9% | Basic redundancy, health checks, automated restarts |
| 99.9% → 99.99% | Multi-AZ deployment, sophisticated failover, 24/7 on-call |
| 99.99% → 99.999% | Multi-region active-active, chaos engineering, extensive automation |
| 99.999% → 99.9999% | Custom hardware, specialized staff, years of operational refinement |
Practical Implications:
Consider what each level means operationally:
Three nines (99.9%): You have 8.76 hours of downtime budget per year. A single significant deployment that causes 30 minutes of issues consumes 6% of your annual budget. Doable with good engineering practices.
Four nines (99.99%): You have 52.6 minutes per year. A single bad deployment that takes 10 minutes to roll back consumes 19% of your annual budget. This requires zero-downtime deployments, instant rollback, and multi-zone redundancy.
Five nines (99.999%): You have 5.26 minutes per year. There is essentially no room for human-speed incident response. Everything must be automated, and even automated failover must complete in seconds. This is the realm of pagers waking people at 3 AM for potential issues, not actual outages.
Don't specify five nines because it sounds impressive. Every nine has a cost. A social media feed at 99.99% is probably over-engineered; a payment system at 99.9% is probably under-engineered. Match availability requirements to business impact: 'Based on $50K revenue/hour of downtime, the system requires 99.95% availability (4.38 hours/year downtime = $219K maximum annual impact).'
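As a sketch of this kind of business-impact calculation, assuming the illustrative $50K-per-hour figure from the tip above:

```python
def max_annual_downtime_hours(availability_pct: float) -> float:
    """Annual downtime allowance implied by an availability target."""
    return 365 * 24 * (1 - availability_pct / 100)

revenue_per_hour = 50_000  # illustrative figure from the tip above
for target in (99.9, 99.95, 99.99):
    hours = max_annual_downtime_hours(target)
    print(f"{target}%: {hours:.2f} h/year -> ${hours * revenue_per_hour:,.0f} max annual impact")
# 99.95% -> 4.38 h/year -> $219,000, matching the example above
```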
Understanding the distinction between Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) is crucial for specifying availability requirements correctly.
Definitions:
- SLA (Service Level Agreement): The external, contractual commitment made to customers, typically with financial penalties or service credits if it is breached.
- SLO (Service Level Objective): The internal target the team aims for, set tighter than the SLA so there is a buffer.
- SLI (Service Level Indicator): The actual measurement used to evaluate whether the objective is being met, such as the percentage of successful requests.
The Relationship Between Them:
┌─────────────────────────────────────┐
│ SLA │
│ External contract with penalties │
│ "We guarantee 99.9% uptime or │
│ you get 10% service credit" │
└─────────────────────────────────────┘
▲
│ (Derived from, typically looser)
┌─────────────────────────────────────┐
│ SLO │
│ Internal target objective │
│ "We aim for 99.95% uptime │
│ (buffer above SLA)" │
└─────────────────────────────────────┘
▲
│ (Measured by)
┌─────────────────────────────────────┐
│ SLI │
│ The actual measurement │
│ "% of successful requests in │
│ rolling 30-day window" │
└─────────────────────────────────────┘
Why This Matters for Requirements:
Your availability requirements should specify all three levels:
# Availability Requirements: SLA/SLO/SLI Specification

## Service Level Indicators (SLIs)

### Primary Availability SLI
- **Metric:** Request success rate
- **Calculation:** (Successful requests / Total requests) × 100%
- **Success criteria:** HTTP 2xx or 3xx response within 5 seconds
- **Failure criteria:** HTTP 5xx, timeout, or connection refused
- **Exclusions:** Client errors (4xx) are not counted as failures
- **Measurement window:** Rolling 30 days
- **Measurement frequency:** Real-time, aggregated every minute

### Secondary Availability SLI
- **Metric:** Synthetic health check success rate
- **Calculation:** (Successful health checks / Total health checks) × 100%
- **Probe frequency:** Every 30 seconds from 5 geographic regions
- **Success criteria:** Health endpoint returns 200 within 2 seconds

## Service Level Objectives (SLOs)

### Availability SLO
- **Target:** 99.95% request success rate
- **Measurement:** Primary Availability SLI
- **Error budget:** 0.05% = ~21.9 minutes/month
- **Alert threshold:** SLO burn rate > 1.5x triggers page

### Internal Stretch Goal
- **Target:** 99.97% (operational excellence target)
- **Review:** Weekly during engineering review

## Service Level Agreement (SLA)

### External Commitment
- **Commitment:** 99.9% monthly availability
- **Measurement:** Primary Availability SLI
- **Remedies:**
  - 99.0% - 99.9%: 10% service credit
  - 95.0% - 99.0%: 25% service credit
  - < 95.0%: 50% service credit

### Exclusions
- Force majeure events
- Scheduled maintenance (max 4 hours/month)
- Customer-caused issues
- Beta/preview features

Always set your SLO higher than your SLA. If your SLA promises 99.9%, aim for 99.95% internally. This provides an error budget buffer—you can have incidents, deployments, and unexpected issues while still meeting your contractual obligations. Without this buffer, any incident immediately becomes an SLA breach.
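As an illustration of how the primary SLI above might be computed, here is a minimal Python sketch; the `Request` record and its fields are hypothetical stand-ins for whatever your telemetry pipeline actually emits:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int       # HTTP status code
    latency_s: float  # end-to-end latency in seconds

def is_success(r: Request) -> bool:
    # Mirrors the SLI above: a 2xx/3xx response within 5 seconds counts as success.
    return 200 <= r.status < 400 and r.latency_s <= 5.0

def availability_sli(requests: list[Request]) -> float:
    """Request success rate, excluding client errors (4xx) per the SLI definition."""
    counted = [r for r in requests if not (400 <= r.status < 500)]
    if not counted:
        return 100.0
    return sum(is_success(r) for r in counted) / len(counted) * 100
```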
Error budgets transform availability from a vague aspiration into a concrete, actionable resource. Pioneered by Google's SRE practices, error budgets fundamentally change how teams think about reliability.
The Core Concept:
If your SLO is 99.9%, your error budget is 0.1%. Over a 30-day month, that's:
- 43.2 minutes of allowable downtime per month (0.1% of 43,200 minutes)
- Roughly 10 minutes per week
- About 86 seconds per day
This isn't a budget to be spent frivolously—it's a budget that acknowledges reality: no system is perfect, and trying to achieve perfection trades off against innovation and velocity.
Error Budget Policy:
| Budget Remaining | Policy | Actions |
|---|---|---|
| 75-100% | Normal operations | Normal deployment velocity, feature development proceeds |
| 50-75% | Caution zone | Enhanced deployment scrutiny, prioritize stability work |
| 25-50% | Risk zone | Freeze non-critical changes, focus on reliability |
| 0-25% | Critical zone | Only emergency fixes, all hands on reliability |
| 0% (exhausted) | Freeze | No deployments until budget regenerates |
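A sketch of how the policy table above could be encoded; the thresholds match the table and the zone descriptions are condensed from it:

```python
def policy_for_budget(remaining_pct: float) -> str:
    """Map remaining error budget (as a percentage) to the policy zones above."""
    if remaining_pct >= 75:
        return "Normal operations: normal deployment velocity"
    if remaining_pct >= 50:
        return "Caution zone: enhanced deployment scrutiny"
    if remaining_pct >= 25:
        return "Risk zone: freeze non-critical changes"
    if remaining_pct > 0:
        return "Critical zone: emergency fixes only"
    return "Freeze: no deployments until the budget regenerates"
```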
Error Budget Burn Rate:
Beyond the absolute size of the budget, the rate at which it is consumed is critical. A burn rate of 1.0x consumes the budget exactly over the measurement window; anything above 1.0x exhausts it early.
Calculating Burn Rate:
Burn Rate = (Actual Error Rate / Allowed Error Rate)
Example:
- SLO: 99.9% (allowed error rate = 0.1%)
- Current error rate: 0.3%
- Burn Rate = 0.3% / 0.1% = 3.0x
At 3.0x burn rate:
- Monthly budget exhausted in 10 days
- Weekly budget exhausted in 2.3 days
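The burn-rate arithmetic above, expressed as a small sketch:

```python
def burn_rate(actual_error_rate: float, allowed_error_rate: float) -> float:
    """How fast the error budget is being consumed relative to the sustainable rate."""
    return actual_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """At a burn rate of N, the budget for a window lasts window/N days."""
    return window_days / rate

rate = burn_rate(0.003, 0.001)      # 0.3% actual vs. 0.1% allowed -> 3.0x
print(days_to_exhaustion(rate))     # 10.0 days for the monthly budget
print(days_to_exhaustion(rate, 7))  # ~2.3 days for the weekly budget
```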
Error budgets make the reliability-velocity trade-off explicit. With remaining budget, teams can take calculated risks—deploying faster, experimenting more. When budget is low, everyone understands why deployments pause. This removes subjective arguments about 'is it risky enough to delay?' and replaces them with objective measurement.
Error Budget Requirements Specification:
Your availability requirements should include error budget policies:
Error Budget Policy Requirements:
1. Error budget calculation shall be based on the primary availability SLI
measured over a rolling 30-day window.
2. Error budget dashboards shall be visible to all engineering teams with
real-time updates and 7-day projections.
3. When error budget falls below 50%, the following policies activate:
- All deployments require additional senior engineer approval
- On-call staffing increases from 1 to 2 engineers
- Daily reliability review meetings commence
4. When error budget is exhausted (0% remaining):
- Only security patches and critical bug fixes may deploy
- Post-incident reviews prioritized over feature work
- Executive stakeholders notified within 2 hours
5. Error budget exceptions may be granted by VP of Engineering for
business-critical deployments with documented risk acceptance.
Availability requirements must address not just uptime targets but how the system behaves during and recovers from failures. This section specifies failure handling requirements.
Key Recovery Metrics:
| Metric | Definition | Example Requirement |
|---|---|---|
| MTBF (Mean Time Between Failures) | Average time between system failures | MTBF ≥ 720 hours (30 days) |
| MTTR (Mean Time To Recovery) | Average time to restore service after failure | MTTR ≤ 15 minutes for P1 incidents |
| MTTD (Mean Time To Detection) | Average time to detect an issue | MTTD ≤ 2 minutes for service degradation |
| MTTF (Mean Time To Failure) | Expected time before first failure | MTTF ≥ 4,000 hours for new deployments |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | RTO = 30 minutes for disaster recovery |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | RPO = 5 minutes (max 5 min of data loss) |
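MTBF and MTTR also imply a steady-state availability via the classic relationship Availability = MTBF / (MTBF + MTTR); a quick sketch using the example figures from the table above:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability implied by MTBF and MTTR: fraction of time spent up between repairs."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# MTBF of 720 h (30 days) with a 15-minute MTTR, as in the example requirements:
print(f"{steady_state_availability(720, 0.25):.3f}%")  # ~99.965%
```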
Failure Mode Categories:
Your requirements should specify handling for different failure categories:
| Failure Category | Examples | Handling Requirement |
|---|---|---|
| Transient failures | Network blips, timeouts | Automatic retry with exponential backoff |
| Instance failures | Server crash, OOM | Auto-restart within 30 seconds, failover |
| Zone failures | AZ outage | Automatic traffic shift to healthy zones |
| Region failures | Regional disaster | Manual or automatic regional failover |
| Data corruption | Bit rot, software bugs | Detection, rollback to known-good state |
| Cascading failures | Overload-induced spread | Circuit breakers, load shedding |
| Security incidents | Breach, DDoS | Isolation, traffic filtering |
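For the transient-failure row in the table above, a minimal retry-with-exponential-backoff sketch; the exception types and delay parameters are illustrative choices, not requirements:

```python
import random
import time

def call_with_retries(operation, retry_on=(TimeoutError, ConnectionError),
                      max_attempts=4, base_delay_s=0.1, max_delay_s=2.0):
    """Retry a flaky zero-argument callable with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries
```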
Don't just specify '99.9% availability.' Specify what happens during the 0.1%: Does the system return errors? Queue requests? Degrade gracefully? Show cached data? The behavior during failure often matters more to user experience than the aggregate availability percentage.
Redundancy is the foundation of high availability. Your requirements must specify redundancy expectations at each layer of the system.
Redundancy Levels:
| Strategy | Description | Use Case | Cost Factor |
|---|---|---|---|
| No redundancy (N) | Single instance | Dev/test, non-critical batch | 1x baseline |
| N+1 | One spare beyond minimum | Cost-sensitive production | 1.5-2x |
| N+2 | Two spares beyond minimum | Critical services | 2-2.5x |
| 2N | Fully redundant (100% spare) | Financial, healthcare | 2x minimum |
| 2N+1 | Double + one additional | Ultra-high availability | 2.5x+ |
Geographic Redundancy:
| Level | Configuration | Failure Protection | Typical Availability |
|---|---|---|---|
| Single instance | One server | None | 99-99.9% |
| Single zone | Multiple instances, one AZ | Instance failure only | 99.5-99.9% |
| Multi-zone | Multiple AZs in one region | Zone failure | 99.9-99.99% |
| Multi-region | Multiple regions | Regional failure | 99.99-99.999% |
| Multi-cloud | Multiple providers | Provider failure | 99.999%+ |
Data Redundancy Requirements:
Data requires special redundancy consideration because it has state:
Data Redundancy Requirements:
1. Database Tier:
- Minimum 3 replicas across availability zones
- Synchronous replication to one replica
- Asynchronous replication to additional replicas
- RPO: 0 for synchronous replica, 5 min for async
- Automatic failover within 60 seconds
2. Object Storage:
- Minimum 99.999999999% (11 nines) durability
- Cross-region replication within 15 minutes
- Versioning enabled for accidental deletion recovery
3. Message Queues:
- Minimum 3-replica cluster
- Messages durable to disk before acknowledgment
- No message loss during broker failure
If a single component has 99.9% availability, and you deploy 3 independent replicas, overall availability becomes: 1 - (0.001)³ = 99.9999999%. But this only works if failures are truly independent—correlated failures (like a bug in code deployed to all replicas) defeat the math. Specify: 'Redundant components shall be isolated such that single-failure modes cannot affect multiple replicas simultaneously.'
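The redundancy math in the callout above, as a small sketch:

```python
def parallel_availability(component_pct: float, replicas: int) -> float:
    """Availability of N independent replicas (the system is up if any replica is up)."""
    failure_prob = 1 - component_pct / 100
    return (1 - failure_prob ** replicas) * 100

# Valid only if replica failures are truly independent:
print(f"{parallel_availability(99.9, 3):.7f}%")  # ~99.9999999%
```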
Your system's availability is constrained by its least reliable dependency. Understanding and managing this dependency chain is critical for realistic availability requirements.
The Dependency Chain:
For a system with serial dependencies, composite availability is:
Composite Availability = A₁ × A₂ × A₃ × ... × Aₙ
Example: your service itself runs at 99.99%, but it depends on a database at 99.95%, a cache at 99.9%, and an authentication service at 99.9%:
Composite = 0.9999 × 0.9995 × 0.999 × 0.999 = 99.74%
Your 99.99% service effectively becomes 99.74% due to dependencies.
| Dependencies | Each at 99.9% | Each at 99.99% |
|---|---|---|
| 1 | 99.9% | 99.99% |
| 2 | 99.8% | 99.98% |
| 3 | 99.7% | 99.97% |
| 5 | 99.5% | 99.95% |
| 10 | 99.0% | 99.9% |
| 20 | 98.0% | 99.8% |
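The serial-dependency formula as a sketch, using the same illustrative figures as the example above:

```python
import math

def composite_availability(dependency_pcts: list[float]) -> float:
    """Serial chain: every dependency must be up for the request to succeed."""
    return math.prod(a / 100 for a in dependency_pcts) * 100

# A 99.99% service behind a 99.95% database and two 99.9% dependencies:
print(f"{composite_availability([99.99, 99.95, 99.9, 99.9]):.2f}%")  # ~99.74%
```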
Strategies for Managing Dependencies:
- Classify dependencies as critical-path or non-critical, and degrade gracefully when non-critical dependencies fail
- Provide fallbacks for critical dependencies (cache-miss fallback to the database, a secondary payment gateway, direct-to-origin access if the CDN fails)
- Decouple through queues and retries so transient dependency failures do not become user-visible errors
- Monitor each dependency as its own SLI and alert when its availability drops
Dependency Requirements Specification:
Dependency Availability Requirements:
1. Critical Path Dependencies (must be operational for core function):
- Database: Minimum 99.95% SLA required
- Cache: Minimum 99.9% SLA, with cache-miss fallback to database
- Authentication: Minimum 99.9% SLA, with session cache for degradation
2. Non-Critical Dependencies (system degrades but functions without):
- Analytics: Failures logged, requests proceed
- Recommendations: Blank/default shown on failure
- Notification service: Events queued for retry
3. Fallback Requirements:
- Payment gateway: Secondary gateway activated if primary fails for >30s
- CDN: Origin server directly accessible if CDN fails
4. Dependency Monitoring:
- All dependencies health-checked every 30 seconds
- Dependency availability tracked as separate SLIs
- Alert if dependency availability drops below 99.5%
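As a sketch of the payment-gateway fallback requirement above; the gateway callables, the exception type, and the handling of the 30-second threshold are simplified assumptions:

```python
import time

class FallbackGateway:
    """Fail over to a secondary gateway once the primary has been failing for over 30 s."""

    def __init__(self, primary, secondary, failover_after_s=30.0):
        self.primary = primary          # callable(payment) that raises ConnectionError on failure
        self.secondary = secondary      # backup callable with the same interface
        self.failover_after_s = failover_after_s
        self.failing_since = None       # monotonic timestamp of the first recent primary failure

    def charge(self, payment):
        # If the primary has been failing longer than the threshold, skip it entirely.
        if (self.failing_since is not None
                and time.monotonic() - self.failing_since > self.failover_after_s):
            return self.secondary(payment)
        try:
            result = self.primary(payment)
            self.failing_since = None   # primary recovered; reset the failure window
            return result
        except ConnectionError:
            if self.failing_since is None:
                self.failing_since = time.monotonic()
            return self.secondary(payment)  # serve this request from the backup
```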
If your database provider guarantees 99.95% and your cloud provider guarantees 99.99%, your maximum realistic availability is approximately 99.94%. Never commit to an SLA above the composite availability of your critical dependencies—you're making promises you cannot keep.
We have covered the complete framework for availability requirements. Let's consolidate the essential takeaways:
- Define precisely what "available" means: success criteria, response-time thresholds, error rates, and the measurement window
- Treat each additional nine as an exponential increase in engineering cost, and match the target to business impact
- Specify SLIs, SLOs, and SLAs separately, keeping the internal SLO tighter than the external SLA
- Use error budgets and burn-rate policies to make the reliability-velocity trade-off explicit
- Specify failure handling (MTTR, RTO, RPO, degradation behavior) and redundancy at every layer
- Account for the dependency chain: never promise more availability than your least reliable critical dependency allows
What's Next:
With availability requirements mastered, we turn to Latency Requirements. While availability determines if your system responds at all, latency determines how quickly it responds. In today's competitive landscape, a system that's available but slow is often worse than one that's briefly unavailable—users expect instant responsiveness, and even small delays cascade into poor experiences.
You now have a comprehensive framework for defining availability requirements. These specifications determine the reliability engineering investments your system requires—from infrastructure redundancy to operational processes to incident response procedures. In the next page, we'll tackle latency requirements with equal depth.