On July 8, 2015, United Airlines grounded its flights nationwide, delaying roughly 4,900 of them; the New York Stock Exchange halted trading for nearly four hours; and the Wall Street Journal's website went offline—all on the same day. Each outage had a different cause, but all shared a common consequence: massive business impact from unavailability.
For United, the outage cost an estimated $30 million in direct revenue plus immeasurable reputational damage. For NYSE, billions in potential trades went unexecuted. These weren't obscure technical failures—they were availability failures that made headlines.
Availability requirements define your system's promise to be accessible when users need it. They quantify the acceptable failure rate, specify recovery expectations, and drive architectural decisions that can cost millions to change later. Getting availability requirements right is not optional—it's existential.
By the end of this page, you will master the complete framework for defining availability requirements: understanding the mathematics of uptime, distinguishing between SLAs/SLOs/SLIs, calculating nines implications, specifying failure handling requirements, and translating business criticality into precise availability targets.
Availability seems straightforward—is the system working or not? But this simplicity conceals significant complexity that your requirements must address.
The Formal Definition:
Availability = (Total Time - Downtime) / Total Time × 100%
Or equivalently:
Availability = Uptime / (Uptime + Downtime) × 100%
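As a quick sanity check, here is a minimal Python sketch of the formula above, using illustrative numbers:

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage, per the formula above."""
    return uptime_hours / (uptime_hours + downtime_hours) * 100

# A 730-hour month with 43.8 minutes of downtime lands almost exactly on three nines:
print(f"{availability(730 - 43.8 / 60, 43.8 / 60):.3f}%")  # ~99.900%
```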
But What Constitutes 'Available'?
The definition of "available" varies dramatically by context:
| System Type | Available When... | Unavailable When... |
|---|---|---|
| E-commerce checkout | Users can complete purchases within 5 seconds | Any step of checkout fails or exceeds 5s |
| Search engine | 95% of queries return results in <500ms | Error rate >5% OR P95 latency >500ms |
| Real-time trading | Orders execute within 10ms, 99.9% success | Any degradation beyond these thresholds |
| Social media feed | Feed loads within 3s for 99% of users | Error pages shown OR infinite loading |
| IoT telemetry | Data ingestion succeeds for 99.99% of events | Event loss exceeds 0.01% |
One of the most common failures in availability requirements is not defining what 'available' means. A system could report 99.99% uptime while users experience degraded performance that effectively renders it unusable. Your requirements must specify: 'The system is considered available when [specific criteria including response time, error rate, and functionality scope].'
Time Windows for Measurement:
Availability metrics must specify the measurement window:
| Window | Typical Use | Consideration |
|---|---|---|
| Monthly | SLA reporting, business reviews | Most common for contracts |
| Weekly | Operational dashboards | Faster feedback, more variance |
| Rolling 30-day | Real-time monitoring | Smooths daily variance |
| Quarterly | Business planning | Strategic, less operational |
| Annual | Long-term trends | Only meaningful at very high nines |
Scheduled vs. Unscheduled Downtime:
Requirements must clarify whether scheduled maintenance counts against availability:
Example requirement: "The system shall maintain 99.9% availability measured monthly, excluding pre-announced maintenance windows of no more than 4 hours per month conducted between 02:00-06:00 UTC on Sundays."
Availability is typically expressed as "nines"—99% (two nines), 99.9% (three nines), 99.99% (four nines), and so on. These seemingly small differences in percentage have massive implications for allowed downtime and required engineering investment.
The Nines Table:
| Availability | Nines | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|---|
| 90% | One nine | 36.5 days | 3 days | 16.8 hours |
| 99% | Two nines | 3.65 days | 7.2 hours | 1.68 hours |
| 99.5% | Two and a half | 1.83 days | 3.6 hours | 50.4 min |
| 99.9% | Three nines | 8.76 hours | 43.8 min | 10.1 min |
| 99.95% | Three and a half | 4.38 hours | 21.9 min | 5.04 min |
| 99.99% | Four nines | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | Five nines | 5.26 min | 26.3 sec | 6.05 sec |
| 99.9999% | Six nines | 31.5 sec | 2.63 sec | 0.6 sec |
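To turn a nines target into a concrete downtime budget, a small sketch like the following can help; the targets in the loop are illustrative:

```python
from datetime import timedelta

def downtime_budget(availability_pct: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability target over a measurement window."""
    return window * (1 - availability_pct / 100)

year = timedelta(days=365)
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = downtime_budget(target, year).total_seconds() / 60
    print(f"{target}% -> {minutes:,.1f} min/year")  # matches the nines table above
```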
The Exponential Cost of Each Nine:
Each additional nine is not a linear increment—it typically requires exponential investment in infrastructure, process, and expertise:
| Transition | Engineering Implications |
|---|---|
| 99% → 99.9% | Basic redundancy, health checks, automated restarts |
| 99.9% → 99.99% | Multi-AZ deployment, sophisticated failover, 24/7 on-call |
| 99.99% → 99.999% | Multi-region active-active, chaos engineering, extensive automation |
| 99.999% → 99.9999% | Custom hardware, specialized staff, years of operational refinement |
Practical Implications:
Consider what each level means operationally:
Three nines (99.9%): You have 8.76 hours of downtime budget per year. A single significant deployment that causes 30 minutes of issues consumes 6% of your annual budget. Doable with good engineering practices.
Four nines (99.99%): You have 52.6 minutes per year. A single bad deployment that takes 10 minutes to roll back consumes 19% of your annual budget. This requires zero-downtime deployments, instant rollback, and multi-zone redundancy.
Five nines (99.999%): You have 5.26 minutes per year. There is essentially no room for human-speed incident response. Everything must be automated, and even automated failover must complete in seconds. This is the realm of pagers waking people at 3 AM for potential issues, not actual outages.
Don't specify five nines because it sounds impressive. Every nine has a cost. A social media feed at 99.99% is probably over-engineered; a payment system at 99.9% is probably under-engineered. Match availability requirements to business impact: 'Based on $50K revenue/hour of downtime, the system requires 99.95% availability (4.38 hours/year downtime = $219K maximum annual impact).'
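As a sketch of this kind of business-impact calculation, assuming the illustrative $50K-per-hour figure from the tip above:

```python
def max_annual_downtime_hours(availability_pct: float) -> float:
    """Annual downtime allowance implied by an availability target."""
    return 365 * 24 * (1 - availability_pct / 100)

revenue_per_hour = 50_000  # illustrative figure from the tip above
for target in (99.9, 99.95, 99.99):
    hours = max_annual_downtime_hours(target)
    print(f"{target}%: {hours:.2f} h/year -> ${hours * revenue_per_hour:,.0f} max annual impact")
# 99.95% -> 4.38 h/year -> $219,000, matching the example above
```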
Understanding the distinction between Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) is crucial for specifying availability requirements correctly.
Definitions:
- SLA (Service Level Agreement): The external, contractual commitment made to customers, typically with financial penalties or service credits if it is breached.
- SLO (Service Level Objective): The internal target the team aims for, set tighter than the SLA so there is a buffer.
- SLI (Service Level Indicator): The actual measurement used to evaluate whether the objective is being met, such as the percentage of successful requests.
The Relationship Between Them:
┌─────────────────────────────────────┐
│ SLA │
│ External contract with penalties │
│ "We guarantee 99.9% uptime or │
│ you get 10% service credit" │
└─────────────────────────────────────┘
▲
│ (Derived from, typically looser)
┌─────────────────────────────────────┐
│ SLO │
│ Internal target objective │
│ "We aim for 99.95% uptime │
│ (buffer above SLA)" │
└─────────────────────────────────────┘
▲
│ (Measured by)
┌─────────────────────────────────────┐
│ SLI │
│ The actual measurement │
│ "% of successful requests in │
│ rolling 30-day window" │
└─────────────────────────────────────┘
Why This Matters for Requirements:
Your availability requirements should specify all three levels:
# Availability Requirements: SLA/SLO/SLI Specification

## Service Level Indicators (SLIs)

### Primary Availability SLI
- **Metric:** Request success rate
- **Calculation:** (Successful requests / Total requests) × 100%
- **Success criteria:** HTTP 2xx or 3xx response within 5 seconds
- **Failure criteria:** HTTP 5xx, timeout, or connection refused
- **Exclusions:** Client errors (4xx) are not counted as failures
- **Measurement window:** Rolling 30 days
- **Measurement frequency:** Real-time, aggregated every minute

### Secondary Availability SLI
- **Metric:** Synthetic health check success rate
- **Calculation:** (Successful health checks / Total health checks) × 100%
- **Probe frequency:** Every 30 seconds from 5 geographic regions
- **Success criteria:** Health endpoint returns 200 within 2 seconds

## Service Level Objectives (SLOs)

### Availability SLO
- **Target:** 99.95% request success rate
- **Measurement:** Primary Availability SLI
- **Error budget:** 0.05% = ~21.9 minutes/month
- **Alert threshold:** SLO burn rate > 1.5x triggers page

### Internal Stretch Goal
- **Target:** 99.97% (operational excellence target)
- **Review:** Weekly during engineering review

## Service Level Agreement (SLA)

### External Commitment
- **Commitment:** 99.9% monthly availability
- **Measurement:** Primary Availability SLI
- **Remedies:**
  - 99.0% - 99.9%: 10% service credit
  - 95.0% - 99.0%: 25% service credit
  - < 95.0%: 50% service credit

### Exclusions
- Force majeure events
- Scheduled maintenance (max 4 hours/month)
- Customer-caused issues
- Beta/preview features

Always set your SLO higher than your SLA. If your SLA promises 99.9%, aim for 99.95% internally. This provides an error budget buffer—you can have incidents, deployments, and unexpected issues while still meeting your contractual obligations. Without this buffer, any incident immediately becomes an SLA breach.
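As an illustration of how the primary SLI above might be computed, here is a minimal Python sketch; the `Request` record and its fields are hypothetical stand-ins for whatever your telemetry pipeline actually emits:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int       # HTTP status code
    latency_s: float  # end-to-end latency in seconds

def is_success(r: Request) -> bool:
    # Mirrors the SLI above: a 2xx/3xx response within 5 seconds counts as success.
    return 200 <= r.status < 400 and r.latency_s <= 5.0

def availability_sli(requests: list[Request]) -> float:
    """Request success rate, excluding client errors (4xx) per the SLI definition."""
    counted = [r for r in requests if not (400 <= r.status < 500)]
    if not counted:
        return 100.0
    return sum(is_success(r) for r in counted) / len(counted) * 100
```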
Error budgets transform availability from a vague aspiration into a concrete, actionable resource. Pioneered by Google's SRE practices, error budgets fundamentally change how teams think about reliability.
The Core Concept:
If your SLO is 99.9%, your error budget is 0.1%. Over a 30-day month, that's:
- 43.2 minutes of allowable downtime per month (0.1% of 43,200 minutes)
- Roughly 10 minutes per week
- About 86 seconds per day
This isn't a budget to be spent frivolously—it's a budget that acknowledges reality: no system is perfect, and trying to achieve perfection trades off against innovation and velocity.
Error Budget Policy:
| Budget Remaining | Policy | Actions |
|---|---|---|
| 75-100% | Normal operations | Normal deployment velocity, feature development proceeds |
| 50-75% | Caution zone | Enhanced deployment scrutiny, prioritize stability work |
| 25-50% | Risk zone | Freeze non-critical changes, focus on reliability |
| 0-25% | Critical zone | Only emergency fixes, all hands on reliability |
| 0% (exhausted) | Freeze | No deployments until budget regenerates |
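A sketch of how the policy table above could be encoded; the thresholds match the table and the zone descriptions are condensed from it:

```python
def policy_for_budget(remaining_pct: float) -> str:
    """Map remaining error budget (as a percentage) to the policy zones above."""
    if remaining_pct >= 75:
        return "Normal operations: normal deployment velocity"
    if remaining_pct >= 50:
        return "Caution zone: enhanced deployment scrutiny"
    if remaining_pct >= 25:
        return "Risk zone: freeze non-critical changes"
    if remaining_pct > 0:
        return "Critical zone: emergency fixes only"
    return "Freeze: no deployments until the budget regenerates"
```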
Error Budget Burn Rate:
Beyond the absolute size of the budget, the rate at which it is consumed is critical. A burn rate of 1.0x consumes the budget exactly over the measurement window; anything above 1.0x exhausts it early.
Calculating Burn Rate:
Burn Rate = (Actual Error Rate / Allowed Error Rate)
Example:
- SLO: 99.9% (allowed error rate = 0.1%)
- Current error rate: 0.3%
- Burn Rate = 0.3% / 0.1% = 3.0x
At 3.0x burn rate:
- Monthly budget exhausted in 10 days
- Weekly budget exhausted in 2.3 days
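The burn-rate arithmetic above, expressed as a small sketch:

```python
def burn_rate(actual_error_rate: float, allowed_error_rate: float) -> float:
    """How fast the error budget is being consumed relative to the sustainable rate."""
    return actual_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """At a burn rate of N, the budget for a window lasts window/N days."""
    return window_days / rate

rate = burn_rate(0.003, 0.001)      # 0.3% actual vs. 0.1% allowed -> 3.0x
print(days_to_exhaustion(rate))     # 10.0 days for the monthly budget
print(days_to_exhaustion(rate, 7))  # ~2.3 days for the weekly budget
```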
Error budgets make the reliability-velocity trade-off explicit. With remaining budget, teams can take calculated risks—deploying faster, experimenting more. When budget is low, everyone understands why deployments pause. This removes subjective arguments about 'is it risky enough to delay?' and replaces them with objective measurement.
Error Budget Requirements Specification:
Your availability requirements should include error budget policies:
Error Budget Policy Requirements:
1. Error budget calculation shall be based on the primary availability SLI
measured over a rolling 30-day window.
2. Error budget dashboards shall be visible to all engineering teams with
real-time updates and 7-day projections.
3. When error budget falls below 50%, the following policies activate:
- All deployments require additional senior engineer approval
- On-call staffing increases from 1 to 2 engineers
- Daily reliability review meetings commence
4. When error budget is exhausted (0% remaining):
- Only security patches and critical bug fixes may deploy
- Post-incident reviews prioritized over feature work
- Executive stakeholders notified within 2 hours
5. Error budget exceptions may be granted by VP of Engineering for
business-critical deployments with documented risk acceptance.
Availability requirements must address not just uptime targets but how the system behaves during and recovers from failures. This section specifies failure handling requirements.
Key Recovery Metrics:
| Metric | Definition | Example Requirement |
|---|---|---|
| MTBF (Mean Time Between Failures) | Average time between system failures | MTBF ≥ 720 hours (30 days) |
| MTTR (Mean Time To Recovery) | Average time to restore service after failure | MTTR ≤ 15 minutes for P1 incidents |
| MTTD (Mean Time To Detection) | Average time to detect an issue | MTTD ≤ 2 minutes for service degradation |
| MTTF (Mean Time To Failure) | Expected time before first failure | MTTF ≥ 4,000 hours for new deployments |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | RTO = 30 minutes for disaster recovery |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | RPO = 5 minutes (max 5 min of data loss) |
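MTBF and MTTR also imply a steady-state availability via the classic relationship Availability = MTBF / (MTBF + MTTR); a quick sketch using the example figures from the table above:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability implied by MTBF and MTTR: fraction of time spent up between repairs."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# MTBF of 720 h (30 days) with a 15-minute MTTR, as in the example requirements:
print(f"{steady_state_availability(720, 0.25):.3f}%")  # ~99.965%
```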
Failure Mode Categories:
Your requirements should specify handling for different failure categories:
| Failure Category | Examples | Handling Requirement |
|---|---|---|
| Transient failures | Network blips, timeouts | Automatic retry with exponential backoff |
| Instance failures | Server crash, OOM | Auto-restart within 30 seconds, failover |
| Zone failures | AZ outage | Automatic traffic shift to healthy zones |
| Region failures | Regional disaster | Manual or automatic regional failover |
| Data corruption | Bit rot, software bugs | Detection, rollback to known-good state |
| Cascading failures | Overload-induced spread | Circuit breakers, load shedding |
| Security incidents | Breach, DDoS | Isolation, traffic filtering |
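For the transient-failure row in the table above, a minimal retry-with-exponential-backoff sketch; the exception types and delay parameters are illustrative choices, not requirements:

```python
import random
import time

def call_with_retries(operation, retry_on=(TimeoutError, ConnectionError),
                      max_attempts=4, base_delay_s=0.1, max_delay_s=2.0):
    """Retry a flaky zero-argument callable with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries
```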
Don't just specify '99.9% availability.' Specify what happens during the 0.1%: Does the system return errors? Queue requests? Degrade gracefully? Show cached data? The behavior during failure often matters more to user experience than the aggregate availability percentage.
Redundancy is the foundation of high availability. Your requirements must specify redundancy expectations at each layer of the system.
Redundancy Levels:
| Strategy | Description | Use Case | Cost Factor |
|---|---|---|---|
| No redundancy (N) | Single instance | Dev/test, non-critical batch | 1x baseline |
| N+1 | One spare beyond minimum | Cost-sensitive production | 1.5-2x |
| N+2 | Two spares beyond minimum | Critical services | 2-2.5x |
| 2N | Fully redundant (100% spare) | Financial, healthcare | 2x minimum |
| 2N+1 | Double + one additional | Ultra-high availability | 2.5x+ |
Geographic Redundancy:
| Level | Configuration | Failure Protection | Typical Availability |
|---|---|---|---|
| Single instance | One server | None | 99-99.9% |
| Single zone | Multiple instances, one AZ | Instance failure only | 99.5-99.9% |
| Multi-zone | Multiple AZs in one region | Zone failure | 99.9-99.99% |
| Multi-region | Multiple regions | Regional failure | 99.99-99.999% |
| Multi-cloud | Multiple providers | Provider failure | 99.999%+ |
Data Redundancy Requirements:
Data requires special redundancy consideration because it has state:
Data Redundancy Requirements:
1. Database Tier:
- Minimum 3 replicas across availability zones
- Synchronous replication to one replica
- Asynchronous replication to additional replicas
- RPO: 0 for synchronous replica, 5 min for async
- Automatic failover within 60 seconds
2. Object Storage:
- Minimum 99.999999999% (11 nines) durability
- Cross-region replication within 15 minutes
- Versioning enabled for accidental deletion recovery
3. Message Queues:
- Minimum 3-replica cluster
- Messages durable to disk before acknowledgment
- No message loss during broker failure
If a single component has 99.9% availability, and you deploy 3 independent replicas, overall availability becomes: 1 - (0.001)³ = 99.9999999%. But this only works if failures are truly independent—correlated failures (like a bug in code deployed to all replicas) defeat the math. Specify: 'Redundant components shall be isolated such that single-failure modes cannot affect multiple replicas simultaneously.'
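The redundancy math in the callout above, as a small sketch:

```python
def parallel_availability(component_pct: float, replicas: int) -> float:
    """Availability of N independent replicas (the system is up if any replica is up)."""
    failure_prob = 1 - component_pct / 100
    return (1 - failure_prob ** replicas) * 100

# Valid only if replica failures are truly independent:
print(f"{parallel_availability(99.9, 3):.7f}%")  # ~99.9999999%
```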
Your system's availability is constrained by its least reliable dependency. Understanding and managing this dependency chain is critical for realistic availability requirements.
The Dependency Chain:
For a system with serial dependencies, composite availability is:
Composite Availability = A₁ × A₂ × A₃ × ... × Aₙ
Example: your service itself runs at 99.99%, but it depends on a database at 99.95%, a cache at 99.9%, and an authentication service at 99.9%:
Composite = 0.9999 × 0.9995 × 0.999 × 0.999 = 99.74%
Your 99.99% service effectively becomes 99.74% due to dependencies.
| Dependencies | Each at 99.9% | Each at 99.99% |
|---|---|---|
| 1 | 99.9% | 99.99% |
| 2 | 99.8% | 99.98% |
| 3 | 99.7% | 99.97% |
| 5 | 99.5% | 99.95% |
| 10 | 99.0% | 99.9% |
| 20 | 98.0% | 99.8% |
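The serial-dependency formula as a sketch, using the same illustrative figures as the example above:

```python
import math

def composite_availability(dependency_pcts: list[float]) -> float:
    """Serial chain: every dependency must be up for the request to succeed."""
    return math.prod(a / 100 for a in dependency_pcts) * 100

# A 99.99% service behind a 99.95% database and two 99.9% dependencies:
print(f"{composite_availability([99.99, 99.95, 99.9, 99.9]):.2f}%")  # ~99.74%
```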
Strategies for Managing Dependencies:
- Classify dependencies as critical-path or non-critical, and degrade gracefully when non-critical dependencies fail
- Provide fallbacks for critical dependencies (cache-miss fallback to the database, a secondary payment gateway, direct-to-origin access if the CDN fails)
- Decouple through queues and retries so transient dependency failures do not become user-visible errors
- Monitor each dependency as its own SLI and alert when its availability drops
Dependency Requirements Specification:
Dependency Availability Requirements:
1. Critical Path Dependencies (must be operational for core function):
- Database: Minimum 99.95% SLA required
- Cache: Minimum 99.9% SLA, with cache-miss fallback to database
- Authentication: Minimum 99.9% SLA, with session cache for degradation
2. Non-Critical Dependencies (system degrades but functions without):
- Analytics: Failures logged, requests proceed
- Recommendations: Blank/default shown on failure
- Notification service: Events queued for retry
3. Fallback Requirements:
- Payment gateway: Secondary gateway activated if primary fails for >30s
- CDN: Origin server directly accessible if CDN fails
4. Dependency Monitoring:
- All dependencies health-checked every 30 seconds
- Dependency availability tracked as separate SLIs
- Alert if dependency availability drops below 99.5%
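As a sketch of the payment-gateway fallback requirement above; the gateway callables, the exception type, and the handling of the 30-second threshold are simplified assumptions:

```python
import time

class FallbackGateway:
    """Fail over to a secondary gateway once the primary has been failing for over 30 s."""

    def __init__(self, primary, secondary, failover_after_s=30.0):
        self.primary = primary          # callable(payment) that raises ConnectionError on failure
        self.secondary = secondary      # backup callable with the same interface
        self.failover_after_s = failover_after_s
        self.failing_since = None       # monotonic timestamp of the first recent primary failure

    def charge(self, payment):
        # If the primary has been failing longer than the threshold, skip it entirely.
        if (self.failing_since is not None
                and time.monotonic() - self.failing_since > self.failover_after_s):
            return self.secondary(payment)
        try:
            result = self.primary(payment)
            self.failing_since = None   # primary recovered; reset the failure window
            return result
        except ConnectionError:
            if self.failing_since is None:
                self.failing_since = time.monotonic()
            return self.secondary(payment)  # serve this request from the backup
```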
If your database provider guarantees 99.95% and your cloud provider guarantees 99.99%, your maximum realistic availability is approximately 99.94%. Never commit to an SLA above the composite availability of your critical dependencies—you're making promises you cannot keep.
We have covered the complete framework for availability requirements. Let's consolidate the essential takeaways:
- Define precisely what "available" means: success criteria, response-time thresholds, error rates, and the measurement window
- Treat each additional nine as an exponential increase in engineering cost, and match the target to business impact
- Specify SLIs, SLOs, and SLAs separately, keeping the internal SLO tighter than the external SLA
- Use error budgets and burn-rate policies to make the reliability-velocity trade-off explicit
- Specify failure handling (MTTR, RTO, RPO, degradation behavior) and redundancy at every layer
- Account for the dependency chain: never promise more availability than your least reliable critical dependency allows
What's Next:
With availability requirements mastered, we turn to Latency Requirements. While availability determines if your system responds at all, latency determines how quickly it responds. In today's competitive landscape, a system that's available but slow is often worse than one that's briefly unavailable—users expect instant responsiveness, and even small delays cascade into poor experiences.
You now have a comprehensive framework for defining availability requirements. These specifications determine the reliability engineering investments your system requires—from infrastructure redundancy to operational processes to incident response procedures. In the next page, we'll tackle latency requirements with equal depth.