SLIs tell us what to measure. SLOs tell us what targets to aim for. But when reliability commitments leave the engineering realm and enter contracts with customers, the stakes escalate dramatically. A Service Level Agreement (SLA) is a formal contract that specifies what level of service a customer can expect, and what compensation they receive if that level isn't met.
SLAs transform reliability from an internal engineering concern into a legal obligation with financial consequences. When you violate an SLA, you're not just disappointing users—you're potentially liable for service credits, refunds, or contractual damages.
By the end of this page, you will understand what SLAs are, how they differ from SLOs, how to structure SLA commitments, the economics of SLA violations, and best practices for protecting your organization while still serving customers well.
A Service Level Agreement (SLA) is a formal, legally binding contract between a service provider and a customer that explicitly defines the level of service the customer can expect and the compensation owed when that level is not met.
Unlike SLOs, which are internal targets that guide engineering decisions, SLAs are external promises with contractual weight. Violating an SLO affects your error budget and prioritization. Violating an SLA affects your revenue, reputation, and potentially legal standing.
| Aspect | SLI | SLO | SLA |
|---|---|---|---|
| What it is | Measurement | Internal target | External contract |
| Audience | Engineers | Engineering & Product | Customers & Legal |
| Consequences | Data point | Prioritization impact | Financial liability |
| Who sets it | SRE/Engineering | Eng + Product + Business | Business + Legal + Sales |
| Changeability | As data sources evolve | Quarterly review | Contract renegotiation |
| Visibility | Dashboards | Team objectives | Public documentation |
Your SLA must always be less strict than your SLO.
If you promise customers 99.9% availability (SLA), your internal target should be 99.95% or higher (SLO). This buffer protects you from minor fluctuations triggering SLA violations.
The formula:
- SLA = What you promise externally (e.g., 99.9%)
- SLO = What you target internally (e.g., 99.95%)
- Actual SLI = What you achieve (hopefully ≥ SLO)
If SLI drops below SLO, you have time to fix it. If it drops below SLA, you owe customers money.
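The relationship between the three numbers can be sketched as a small classifier. This is a minimal illustration, not a prescribed implementation; the function name and the status labels are hypothetical, and all values are percentages.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against the internal SLO and the external SLA.

    All arguments are percentages, e.g. 99.95. The labels returned here are
    illustrative names, not an industry standard.
    """
    if sla >= slo:
        # The SLA must be less strict than the SLO, otherwise there is no buffer.
        raise ValueError("SLA must be less strict than the SLO")
    if sli >= slo:
        return "healthy"        # meeting the internal target
    if sli >= sla:
        return "slo_breached"   # inside the buffer: time to fix it
    return "sla_breached"       # below the contract: credits are owed


if __name__ == "__main__":
    for measured in (99.97, 99.92, 99.5):
        print(measured, reliability_status(measured, slo=99.95, sla=99.9))
```

The middle state is the one the buffer exists for: the SLO is missed, but no contractual obligation has been triggered yet.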
Enterprise SLAs are complex legal documents, but the technical components follow a predictable structure. Let's dissect each element:
What's covered:
An SLA must precisely define which services are included. Ambiguity leads to disputes.
Covered Services:
- Production API endpoints (api.example.com/*)
- Customer dashboard (dashboard.example.com)
- Webhook delivery system
Excluded Services:
- Sandbox/staging environments
- Beta features (marked with 'Beta' label)
- Third-party integrations
- Customer-managed infrastructure
Why precision matters:
If your SLA vaguely covers 'the platform,' a customer could claim SLA credits for issues in your internal admin tool or experimental features. Define boundaries explicitly.
'Uptime' is not as simple as it sounds:
'Available' means the Service responds to valid API requests with
a non-5xx status code within 30 seconds. Requests that fail due to:
(a) Invalid authentication
(b) Rate limiting (429 responses)
(c) Client-side errors (4xx responses for malformed requests)
are not considered unavailable.
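A definition like this one translates fairly directly into a per-request check. The sketch below is one possible reading of the sample clause; the function name is hypothetical, and treating any 4xx response (including 401, 403, and 429) as excluded regardless of latency is an assumption, since the clause does not say how a slow 4xx should be handled.

```python
from typing import Optional


def counts_as_unavailable(
    status_code: Optional[int],
    latency_seconds: float,
    timeout_seconds: float = 30.0,
) -> bool:
    """Return True if a request counts against availability.

    status_code is None when no response was received at all.
    """
    if status_code is None:
        return True  # no response: the service did not answer
    if 400 <= status_code <= 499:
        # Assumption: auth failures, rate limiting (429), and malformed
        # requests are client-side and never count against the provider.
        return False
    if 500 <= status_code <= 599:
        return True  # server-side error
    # Any other response must arrive within the timeout to count as available.
    return latency_seconds > timeout_seconds
```

For example, `counts_as_unavailable(429, 0.1)` is `False`, while a 200 response that takes 31 seconds counts as unavailable.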
Common availability formulas:
Monthly Uptime % = (Total Minutes - Downtime Minutes) / Total Minutes × 100%
OR
Monthly Uptime % = (Successful Requests / Total Requests) × 100%
The request-based formula is more precise but requires robust measurement infrastructure.
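Both formulas are simple to compute once the inputs exist. The following sketch shows them side by side; the function names are illustrative, and returning 100% when there were no requests is an assumption that a real contract would need to spell out.

```python
def time_based_uptime(total_minutes: float, downtime_minutes: float) -> float:
    """Monthly Uptime % = (Total Minutes - Downtime Minutes) / Total Minutes x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def request_based_uptime(successful_requests: int, total_requests: int) -> float:
    """Monthly Uptime % = Successful Requests / Total Requests x 100."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic means nothing failed
    return successful_requests / total_requests * 100


if __name__ == "__main__":
    minutes_in_30_days = 30 * 24 * 60  # 43,200
    print(round(time_based_uptime(minutes_in_30_days, 216), 3))
    print(round(request_based_uptime(999_000, 1_000_000), 3))
```

The two can disagree for the same month: a short outage during peak traffic hurts the request-based number more than the time-based one.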
The financial consequence:
SLA violations typically trigger 'service credits'—discounts on future bills rather than cash refunds.
Common tiered structure:
| Monthly Uptime | Service Credit |
|---|---|
| ≥ 99.9% | None |
| 99.0% to < 99.9% | 10% of monthly bill |
| 95.0% to < 99.0% | 25% of monthly bill |
| < 95.0% | 50% of monthly bill |
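A tier table like this maps to a short lookup. The sketch below hard-codes the sample tiers from the table; the function name is hypothetical, and treating each tier's lower bound as inclusive is an assumption about how boundary values are resolved.

```python
def service_credit_percent(monthly_uptime: float) -> int:
    """Return the service credit (as % of the monthly bill) for a given uptime %."""
    if monthly_uptime >= 99.9:
        return 0
    if monthly_uptime >= 99.0:
        return 10
    if monthly_uptime >= 95.0:
        return 25
    return 50


if __name__ == "__main__":
    for uptime in (99.95, 99.7, 98.0, 90.0):
        print(f"{uptime}% uptime -> {service_credit_percent(uptime)}% credit")
```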
Why credits, not refunds:
What doesn't count against the SLA:
Exclusions:
1. Scheduled maintenance (with 72-hour advance notice)
2. Force majeure (natural disasters, war, etc.)
3. Customer-caused issues (code, configuration, attacks)
4. Third-party failures (AWS, payment processors)
5. Reasonable rate limiting to prevent abuse
6. Features marked as 'Alpha' or 'Beta'
7. Outages lasting less than 1 minute
The 'customer-caused' clause:
This is crucial. If a customer's code hammers your API causing them to get rate-limited, that's not an SLA violation. But proving the root cause requires good logging and forensics.
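Applying exclusions is mostly a matter of filtering outage records before summing downtime. The sketch below covers a few of the sample exclusions; the `Outage` record and the cause labels are hypothetical, and a real system would need evidence (logs, forensics) behind each label.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    duration_minutes: float
    cause: str                # hypothetical labels, see EXCLUDED_CAUSES below
    notice_hours: float = 0.0  # advance notice given, for maintenance only


# Causes that the sample exclusion list removes from the SLA calculation.
EXCLUDED_CAUSES = {"customer", "third_party", "force_majeure"}


def counts_toward_sla(outage: Outage) -> bool:
    """Return True if this outage should count as SLA downtime."""
    if outage.duration_minutes < 1:
        return False  # outages lasting less than 1 minute are excluded
    if outage.cause in EXCLUDED_CAUSES:
        return False
    if outage.cause == "scheduled_maintenance" and outage.notice_hours >= 72:
        return False  # maintenance with at least 72 hours of notice
    return True


def countable_downtime(outages: Iterable[Outage]) -> float:
    """Sum the minutes of all outages that count toward the SLA."""
    return sum(o.duration_minutes for o in outages if counts_toward_sla(o))
```

Note that maintenance announced with only 24 hours of notice still counts under this reading, because it misses the 72-hour requirement.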
Most SLAs require customers to actively claim credits within a window (typically 30 days after the calendar month ends).
Standard claims language:
'To receive service credits, Customer must submit a claim to support@example.com within 30 days of the end of the billing month. Claim must include: (a) 'SLA Credit Request' in subject line, (b) dates and times of unavailability, (c) affected service endpoints, (d) description of impact. Provider will respond within 10 business days.'
This protects against retroactive claims months later and ensures you have incident details while they're still fresh.
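The claim window in the sample language can be checked mechanically. This sketch assumes the billing month is a calendar month and that the deadline day itself is still valid; both are assumptions a real contract would define, and the function names are illustrative.

```python
import calendar
from datetime import date, timedelta


def claim_deadline(year: int, month: int, window_days: int = 30) -> date:
    """Last day on which a credit claim for the given billing month is accepted."""
    last_day_of_month = date(year, month, calendar.monthrange(year, month)[1])
    return last_day_of_month + timedelta(days=window_days)


def claim_is_timely(submitted: date, year: int, month: int) -> bool:
    """Return True if the claim was submitted on or before the deadline."""
    return submitted <= claim_deadline(year, month)


if __name__ == "__main__":
    print(claim_deadline(2024, 1))
    print(claim_is_timely(date(2024, 3, 2), 2024, 1))
```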
SLAs are ultimately financial instruments. Understanding their economics is essential for both setting appropriate commitments and managing the business implications of violations.
Direct SLA costs:
Monthly revenue: $1,000,000
SLA commitment: 99.9% (43.2 minutes allowed downtime)
Actual uptime: 99.5% (216 minutes downtime)
Credit tier triggered: 10% (99.0-99.9% tier)
Credit liability: $100,000
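The downtime figures in this example follow from the uptime percentage over a 30-day month. A minimal sketch of that conversion, with an illustrative function name and a 30-day month assumed by default:

```python
def downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    print(round(downtime_minutes(99.9), 1))  # allowed by a 99.9% commitment
    print(round(downtime_minutes(99.5), 1))  # actually incurred at 99.5%
```

Months with 28 or 31 days shift the allowance slightly, so contracts that measure per calendar month do not give exactly the same budget every month.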
Indirect costs often exceed credits:
Rule of thumb: For every $1 in SLA credits paid, downtime causes $5-10 in indirect costs.
More aggressive SLAs win deals:
Enterprise customers often evaluate SLAs as part of vendor selection. Offering 99.99% when competitors offer 99.9% can win contracts.
But more aggressive SLAs increase exposure:
If you can't achieve 99.99%, you'll pay credits. If you consistently miss, you'll lose customers anyway.
The strategic calculation:
Expected Annual SLA Credit Cost = P(violation per month) × E(credit amount per violation) × 12 months/year
If:
P(missing 99.9%) = 15% per month
Average credit if missed = $50,000
Then: Expected annual cost = 0.15 × $50,000 × 12 = $90,000
Is $90,000/year acceptable to offer an SLA that wins $500,000/year in new business?
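The calculation above can be written out as a small function to compare scenarios. This is a sketch using the numbers from the example; the function names are illustrative, and it assumes the monthly violation probability is the same and independent every month.

```python
def expected_annual_credit_cost(
    p_violation_per_month: float,
    avg_credit_per_violation: float,
    months_per_year: int = 12,
) -> float:
    """Expected yearly credit payout under a constant monthly violation probability."""
    return p_violation_per_month * avg_credit_per_violation * months_per_year


def net_annual_value(new_business: float, expected_credit_cost: float) -> float:
    """New revenue won by the SLA minus the expected credits it costs."""
    return new_business - expected_credit_cost


if __name__ == "__main__":
    cost = expected_annual_credit_cost(0.15, 50_000)
    print(round(cost), round(net_annual_value(500_000, cost)))
```

This only captures the expected value. It says nothing about variance, which is where the tail risk from a single catastrophic month comes in.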
Average SLA credit exposure can be modeled, but catastrophic outages are tail risks:
- An extended multi-day outage could hit the lowest tier (<95%) for all customers
- Coordinated claims from enterprise customers could exceed reserves
- Depending on contract language, some customers may have negotiated additional remedies
Major cloud providers have faced multi-million-dollar SLA credit events. Budget for bad months, not just average months.
| Scenario | Uptime | Customer Base Impact | Credit Exposure |
|---|---|---|---|
| Minor degradation | 99.7% | All customers (10% credit) | $100K if $1M MRR |
| Significant outage | 98% | All customers (25% credit) | $250K if $1M MRR |
| Major incident | 90% | All customers (50% credit) | $500K if $1M MRR |
| Catastrophic failure | <90% | Plus negotiated enterprise penalties | Potentially >100% MRR |
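The first three rows of the table are the monthly recurring revenue multiplied by the credit percentage. The sketch below reproduces them under the simplifying assumption that every customer lands in the same credit tier; the names are illustrative.

```python
def credit_exposure(mrr: float, credit_percent: float) -> float:
    """Total credit owed if all customers receive the same credit percentage."""
    return mrr * credit_percent / 100


# Scenario name -> credit percentage, taken from the table above.
SCENARIOS = {
    "Minor degradation": 10,
    "Significant outage": 25,
    "Major incident": 50,
}

if __name__ == "__main__":
    mrr = 1_000_000
    for name, percent in SCENARIOS.items():
        print(f"{name}: ${credit_exposure(mrr, percent):,.0f}")
```

The catastrophic row cannot be computed this way, because negotiated enterprise penalties depend on individual contract terms.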
Setting SLA levels requires balancing competitive positioning with achievable reliability. Here's a systematic approach:
How much buffer is enough?
Scenario: SLO = 99.95%, SLA = 99.9%
Buffer = 99.95% - 99.9% = 0.05%
In 30 days, 0.05% = 21.6 minutes
This buffer absorbs:
- Minor incidents (< 22 minutes/month)
- Planned maintenance windows
- Measurement variance
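To see how quickly the buffer is consumed, it helps to convert it to minutes and subtract the incidents already logged in the month. A minimal sketch, with hypothetical function names and a 30-day month assumed:

```python
from typing import Iterable


def buffer_minutes(slo: float, sla: float, days: int = 30) -> float:
    """Minutes of downtime between missing the SLO and missing the SLA."""
    return (slo - sla) / 100 * days * 24 * 60


def remaining_buffer(
    slo: float, sla: float, incident_minutes: Iterable[float], days: int = 30
) -> float:
    """Buffer left after this month's incidents; negative means it is exhausted."""
    return buffer_minutes(slo, sla, days) - sum(incident_minutes)


if __name__ == "__main__":
    print(round(buffer_minutes(99.95, 99.9), 1))
    print(round(remaining_buffer(99.95, 99.9, [8, 5]), 1))
```

The buffer only starts being consumed after the SLO's own error budget is spent, so the total downtime before an SLA breach is the SLO budget plus this buffer.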
Rule of thumb:
Example:
| SLO | Conservative SLA | Moderate SLA | Aggressive SLA |
|---|---|---|---|
| 99.99% | 99.9% | 99.95% | 99.98% |
| 99.95% | 99.5% | 99.9% | 99.94% |
| 99.9% | 99% | 99.5% | 99.89% |
Many providers offer different SLAs based on pricing tier:
- Free tier: No SLA (best effort)
- Standard tier: 99.9% SLA
- Enterprise tier: 99.95% SLA with custom terms
- Premium tier: 99.99% SLA with dedicated support
This allows you to price reliability as a feature and reserve your most robust infrastructure for highest-paying customers.
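With tiered SLAs, the same measured uptime can be a breach for one customer and not for another. The sketch below encodes the example tiers as a lookup; the dictionary keys and function name are hypothetical.

```python
from typing import Dict, Optional

# SLA target per pricing tier, in percent. None means no SLA (best effort).
TIER_SLA: Dict[str, Optional[float]] = {
    "free": None,
    "standard": 99.9,
    "enterprise": 99.95,
    "premium": 99.99,
}


def sla_breached(tier: str, measured_uptime: float) -> bool:
    """Return True if the measured uptime is below the tier's SLA target."""
    target = TIER_SLA[tier]
    if target is None:
        return False  # best effort: nothing contractual to breach
    return measured_uptime < target


if __name__ == "__main__":
    for tier in TIER_SLA:
        print(tier, sla_breached(tier, 99.92))
```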
Large enterprise customers often negotiate custom SLA terms beyond your standard published SLA. This can be profitable (larger deals) but also risky (larger exposure). Here's how to navigate these negotiations:
Higher availability targets:
'We need 99.99% availability, not 99.9%.'
Financial remedies beyond credits:
'If you're below 99.5%, we want the right to terminate without penalty.'
'SLA credits should be actual refunds, not credits against future bills.'
Custom measurement:
'We want to measure availability from our monitoring system, not yours.'
'Availability should be measured only during business hours (6 AM - 10 PM EST).'
Specific incident response times:
'Critical issues must have first response in 15 minutes.'
'Resolution SLA: Critical issues resolved in 4 hours.'
Penalties for repeated failures:
'If you miss SLA 3 months in a row, we can terminate.'
'Escalating credits: 15% first miss, 25% second, 50% third.'
Custom SLA terms constitute additional risk for your business. Price accordingly.
Rule of thumb: For every 'nine' of additional availability commitment, price should increase 50-100%. A customer wanting 99.99% instead of 99.9% should pay premium pricing.
Some organizations have 'SLA uplift' pricing: standard SLA is included, enhanced SLA is a percentage premium on the contract value.
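The rule of thumb can be turned into a rough price range. This sketch assumes the 50-100% increase compounds for each additional nine, which the rule itself does not specify; the function name and parameters are illustrative, and this is not a pricing model.

```python
from typing import Tuple


def uplift_price_range(base_price: float, extra_nines: int) -> Tuple[float, float]:
    """Return (low, high) price after adding extra nines of availability.

    Assumption: each additional nine multiplies the price by 1.5x to 2.0x,
    compounding per nine.
    """
    low = base_price * 1.5 ** extra_nines
    high = base_price * 2.0 ** extra_nines
    return low, high


if __name__ == "__main__":
    # Moving from 99.9% to 99.99% is one additional nine.
    print(uplift_price_range(100_000, 1))
```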
Despite best efforts, SLA violations happen. How you handle them significantly impacts customer relationships and organizational learning.
Proactive approach (recommended):
Dear Customer,
We regret to inform you that on [date], our [Service] experienced an
outage lasting [duration]. This impacted your monthly uptime percentage,
which was [X%], below our SLA commitment of [Y%].
Per our Service Level Agreement, a credit of [amount] will be automatically
applied to your next invoice. No action is required on your part.
We take this seriously. A post-incident review is available at [link],
detailing root cause and preventive measures implemented.
Thank you for your patience and continued trust.
Why proactive is better:
Standard process:
Mature organizations automate SLA credit calculation and application:
- Real-time SLI tracking detects threshold breaches
- Automated calculation of credit per customer based on usage and tier
- Credits automatically stage in billing system
- Approval workflow triggers for human review
- Automated notification emails
This reduces manual error, speeds customer experience, and ensures consistent treatment.
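The calculation and staging steps might look roughly like the sketch below. It assumes the sample credit schedule from earlier on this page and uses hypothetical names throughout (`StagedCredit`, `stage_credits`, the `pending_approval` status); a real pipeline would write to a billing system instead of returning a list.

```python
from dataclasses import dataclass
from typing import Dict, List

# (minimum uptime %, credit % of monthly bill), checked from the top down.
# Assumed from the sample tier table; replace with the contract's schedule.
SCHEDULE = [(99.9, 0), (99.0, 10), (95.0, 25), (0.0, 50)]


@dataclass
class StagedCredit:
    customer: str
    amount: float
    status: str = "pending_approval"  # waits for human review before billing


def credit_percent(uptime: float) -> int:
    """Look up the credit percentage for a measured uptime."""
    for floor, percent in SCHEDULE:
        if uptime >= floor:
            return percent
    return SCHEDULE[-1][1]


def stage_credits(
    bills: Dict[str, float], uptimes: Dict[str, float]
) -> List[StagedCredit]:
    """Create one staged credit per customer whose uptime triggered a tier."""
    staged = []
    for customer, bill in bills.items():
        percent = credit_percent(uptimes[customer])
        if percent > 0:
            staged.append(StagedCredit(customer, bill * percent / 100))
    return staged
```

Keeping the staged credits in a pending state preserves the approval step from the list above while still removing the manual arithmetic.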
Beyond just paying credits:
An SLA violation should trigger deeper analysis:
The SLA violation retrospective:
Schedule a quarterly review of all SLA violations:
Let's examine how major cloud providers and SaaS companies structure their SLAs. These examples represent industry best practices and common patterns.
Commitment: 99.99% Monthly Uptime Percentage
Credit structure:
Key exclusions:
Claim process: Customer must submit within 30 days of billing cycle end.
Commitment: Varies by service (99.95% - 99.999%)
Interesting feature: Google offers tiered SLAs based on architecture:
This incentivizes customers to build resilient architectures.
Commitment: 99.99% uptime for core payment processing
Unique aspects:
Modern SaaS companies increasingly publish SLA performance publicly:
- Real-time uptime percentages on status pages
- Monthly uptime reports sent to all customers
- Public post-incident reviews
- Historical uptime trending
This transparency builds trust and holds the organization accountable. If your SLA dashboard is internal-only, consider whether public visibility would help or hurt.
Common elements:
Lessons for your SLA:
SLAs transform reliability from an engineering concern into a contractual commitment with financial consequences. Setting and managing SLAs requires collaboration between engineering, legal, finance, and sales.
What's next:
With SLIs, SLOs, and SLAs now established, we need to understand how they interrelate in practice—the feedback loops, escalation paths, and organizational dynamics that make reliability management work.
You now understand Service Level Agreements—the contractual commitments that bind your organization to reliability standards. SLAs transform the internal discipline of SLOs into external accountability, with real consequences for violations. Next, we'll explore how SLIs, SLOs, and SLAs work together as an integrated reliability framework.