SLIs, SLOs, and SLAs - Learning Module

SLA: Service Level Agreement

When Reliability Becomes a Contract

SLIs tell us what to measure. SLOs tell us what targets to aim for. But when reliability commitments leave the engineering realm and enter contracts with customers, the stakes escalate dramatically. A Service Level Agreement (SLA) is a formal contract that specifies what level of service a customer can expect, and what compensation they receive if that level isn't met.

SLAs transform reliability from an internal engineering concern into a legal obligation with financial consequences. When you violate an SLA, you're not just disappointing users—you're potentially liable for service credits, refunds, or contractual damages.

What You Will Learn

By the end of this page, you will understand what SLAs are, how they differ from SLOs, how to structure SLA commitments, the economics of SLA violations, and best practices for protecting your organization while still serving customers well.

What is a Service Level Agreement?

A Service Level Agreement (SLA) is a formal, legally binding contract between a service provider and a customer that explicitly defines:

Service description — What service is being provided
Performance guarantees — What level of reliability/availability is promised
Measurement methodology — How compliance will be determined
Remedies — What happens if guarantees aren't met (typically service credits)
Exclusions — What circumstances don't count against the SLA

Unlike SLOs, which are internal targets that guide engineering decisions, SLAs are external promises with contractual weight. Violating an SLO affects your error budget and prioritization. Violating an SLA affects your revenue, reputation, and potentially legal standing.

SLI vs SLO vs SLA: The Complete Picture
Aspect	SLI	SLO	SLA
What it is	Measurement	Internal target	External contract
Audience	Engineers	Engineering & Product	Customers & Legal
Consequences	Data point	Prioritization impact	Financial liability
Who sets it	SRE/Engineering	Eng + Product + Business	Business + Legal + Sales
Changeability	As data sources evolve	Quarterly review	Contract renegotiation
Visibility	Dashboards	Team objectives	Public documentation

The Critical Relationship

Your SLA must always be less strict than your SLO.

If you promise customers 99.9% availability (SLA), your internal target should be 99.95% or higher (SLO). This buffer protects you from minor fluctuations triggering SLA violations.

The formula:
• SLA = What you promise externally (e.g., 99.9%)
• SLO = What you target internally (e.g., 99.95%)
• Actual SLI = What you achieve (hopefully ≥ SLO)

If SLI drops below SLO, you have time to fix it. If it drops below SLA, you owe customers money.
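
The relationship between the three numbers can be sketched as a small classifier. This is an illustrative sketch only: the function name `reliability_status` and the status labels are assumptions made for the example, not part of any standard tooling.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against an internal SLO and an external SLA.

    All three values are percentages, e.g. 99.95.
    The name and the returned labels are hypothetical, chosen for illustration.
    """
    if sla >= slo:
        # The SLA must be less strict than the SLO, otherwise there is no buffer.
        raise ValueError("SLA must be lower than the SLO")
    if sli >= slo:
        return "healthy"
    if sli >= sla:
        return "slo_breach"  # inside the buffer: time to fix it, nothing owed yet
    return "sla_breach"      # below the contractual promise: credits are owed


print(reliability_status(99.97, slo=99.95, sla=99.9))  # healthy
print(reliability_status(99.92, slo=99.95, sla=99.9))  # slo_breach
print(reliability_status(99.80, slo=99.95, sla=99.9))  # sla_breach
```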

Anatomy of a Well-Structured SLA

Enterprise SLAs are complex legal documents, but the technical components follow a predictable structure. Let's dissect each element:

1. Service Description and Scope

What's covered:

An SLA must precisely define which services are included. Ambiguity leads to disputes.

Covered Services:
- Production API endpoints (api.example.com/*)
- Customer dashboard (dashboard.example.com)
- Webhook delivery system

Excluded Services:
- Sandbox/staging environments
- Beta features (marked with 'Beta' label)
- Third-party integrations
- Customer-managed infrastructure

Why precision matters:

If your SLA vaguely covers 'the platform,' a customer could claim SLA credits for issues in your internal admin tool or experimental features. Define boundaries explicitly.

2. Availability Definition

'Uptime' is not as simple as it sounds:

'Available' means the Service responds to valid API requests with
a non-5xx status code within 30 seconds. Requests that fail due to:
(a) Invalid authentication
(b) Rate limiting (429 responses)
(c) Client-side errors (4xx responses for malformed requests)
are not considered unavailable.

Common availability formulas:

Monthly Uptime % = (Total Minutes - Downtime Minutes) / Total Minutes × 100%

OR

Monthly Uptime % = (Successful Requests / Total Requests) × 100%

The request-based formula is more precise but requires robust measurement infrastructure.
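
Both formulas are simple enough to express directly. The sketch below assumes a 30-day month, and it treats a month with zero requests as fully available; that second choice is an assumption a real SLA would need to state explicitly.

```python
def time_based_uptime(total_minutes: float, downtime_minutes: float) -> float:
    """Monthly Uptime % = (Total Minutes - Downtime Minutes) / Total Minutes x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def request_based_uptime(successful: int, total: int) -> float:
    """Monthly Uptime % = Successful Requests / Total Requests x 100."""
    if total == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return successful / total * 100


minutes_in_30_days = 30 * 24 * 60  # 43,200 minutes
print(round(time_based_uptime(minutes_in_30_days, 43.2), 3))  # 99.9
print(round(request_based_uptime(999_500, 1_000_000), 3))     # 99.95
```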

3. Service Credit Structure

The financial consequence:

SLA violations typically trigger 'service credits'—discounts on future bills rather than cash refunds.

Common tiered structure:

Monthly Uptime	Service Credit
≥ 99.9%	None
99.0% to < 99.9%	10% of monthly bill
95.0% to < 99.0%	25% of monthly bill
< 95.0%	50% of monthly bill
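
The tiered structure above maps naturally onto a lookup function. In this minimal sketch, the lower bound of each tier is assumed to be inclusive (exactly 99.0% earns the 10% credit); a real contract should spell out boundary handling. The function name is hypothetical.

```python
def service_credit_percent(monthly_uptime: float) -> int:
    """Return the credit (as % of the monthly bill) for a given monthly uptime %.

    Tier boundaries follow the example table; inclusive lower bounds are an assumption.
    """
    if monthly_uptime >= 99.9:
        return 0
    if monthly_uptime >= 99.0:
        return 10
    if monthly_uptime >= 95.0:
        return 25
    return 50


for uptime in (99.95, 99.5, 97.0, 90.0):
    print(f"{uptime}% uptime -> {service_credit_percent(uptime)}% credit")
```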

Why credits, not refunds:

Keeps customers paying (credits apply to future bills)
Limits cash outflow during incidents
Incentivizes continued relationship
Often tax-advantaged compared to refunds

4. Exclusions and Limitations

What doesn't count against the SLA:

Exclusions:
1. Scheduled maintenance (with 72-hour advance notice)
2. Force majeure (natural disasters, war, etc.)
3. Customer-caused issues (code, configuration, attacks)
4. Third-party failures (AWS, payment processors)
5. Reasonable rate limiting to prevent abuse
6. Features marked as 'Alpha' or 'Beta'
7. Outages lasting less than 1 minute

The 'customer-caused' clause:

This is crucial. If a customer's code hammers your API causing them to get rate-limited, that's not an SLA violation. But proving the root cause requires good logging and forensics.
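
Applying exclusions is essentially a filter over the month's outage records before downtime is totalled. The sketch below assumes each outage has already been labeled with a root cause during incident review; the `Outage` type and the cause labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical cause labels mirroring the exclusion list above.
EXCLUDED_CAUSES = {
    "scheduled_maintenance",
    "force_majeure",
    "customer_caused",
    "third_party",
}
MIN_COUNTED_MINUTES = 1.0  # outages shorter than 1 minute are excluded


@dataclass
class Outage:
    minutes: float
    cause: str


def counted_downtime(outages: list[Outage]) -> float:
    """Sum only the downtime that counts against the SLA."""
    return sum(
        o.minutes
        for o in outages
        if o.cause not in EXCLUDED_CAUSES and o.minutes >= MIN_COUNTED_MINUTES
    )


outages = [
    Outage(30.0, "provider_fault"),
    Outage(120.0, "scheduled_maintenance"),  # excluded: announced maintenance
    Outage(0.5, "provider_fault"),           # excluded: shorter than 1 minute
    Outage(15.0, "customer_caused"),         # excluded: customer-caused
]
print(counted_downtime(outages))  # 30.0
```

The filter is only as trustworthy as the cause labels, which is why the logging and forensics mentioned above matter.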

The Claims Process

Most SLAs require customers to actively claim credits within a window (typically 30 days after the calendar month ends).

Standard claims language:

'To receive service credits, Customer must submit a claim to support@example.com within 30 days of the end of the billing month. Claim must include: (a) 'SLA Credit Request' in subject line, (b) dates and times of unavailability, (c) affected service endpoints, (d) description of impact. Provider will respond within 10 business days.'

This protects against retroactive claims months later and ensures you have incident details while they're still fresh.
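
A claim window like this is easy to check mechanically. The sketch below assumes the billing month is a calendar month and that the 30 days are counted from its last day, inclusive of the deadline date; both are assumptions that the contract language would need to settle.

```python
import calendar
from datetime import date, timedelta

CLAIM_WINDOW_DAYS = 30  # from the sample claims language above


def claim_deadline(billing_year: int, billing_month: int) -> date:
    """Last date on which a claim for the given billing month is accepted."""
    last_day = calendar.monthrange(billing_year, billing_month)[1]
    month_end = date(billing_year, billing_month, last_day)
    return month_end + timedelta(days=CLAIM_WINDOW_DAYS)


def claim_is_timely(submitted: date, billing_year: int, billing_month: int) -> bool:
    return submitted <= claim_deadline(billing_year, billing_month)


print(claim_deadline(2024, 3))                      # 2024-04-30
print(claim_is_timely(date(2024, 4, 30), 2024, 3))  # True
print(claim_is_timely(date(2024, 5, 1), 2024, 3))   # False
```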

The Economics of SLA Commitments

SLAs are ultimately financial instruments. Understanding their economics is essential for both setting appropriate commitments and managing the business implications of violations.

The Cost of Downtime

Direct SLA costs:

Monthly revenue: $1,000,000
SLA commitment: 99.9% (43.2 minutes allowed downtime)
Actual uptime: 99.5% (216 minutes downtime)

Credit tier triggered: 10% (99.0%-99.9% tier)
Credit liability: $100,000

Indirect costs often exceed credits:

Customer churn (lost future revenue)
Sales cycle damage (prospects see your status page)
Support costs (handling complaints)
Engineering opportunity cost (incident response vs. features)
Reputation damage (social media, press coverage)

Rule of thumb: For every $1 in SLA credits paid, downtime causes $5-10 in indirect costs.

The Risk-Reward Calculation

More aggressive SLAs win deals:

Enterprise customers often evaluate SLAs as part of vendor selection. Offering 99.99% when competitors offer 99.9% can win contracts.

But more aggressive SLAs increase exposure:

If you can't achieve 99.99%, you'll pay credits. If you consistently miss, you'll lose customers anyway.

The strategic calculation:

Expected Annual SLA Credit Cost = P(violation per month) × E(credit amount per violation) × 12 months

If:
  P(missing 99.9%) = 15% per month
  Average credit if missed = $50,000
  Then: Expected annual cost = 0.15 × $50,000 × 12 = $90,000

Is $90,000/year acceptable to offer an SLA that wins $500,000/year in new business?
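
The same calculation can be written as a small function so that different probabilities and credit sizes are easy to compare. This is a sketch of the expected-value arithmetic only; it assumes each month is an independent trial with the same miss probability, which real outage patterns may not satisfy.

```python
def expected_annual_credit_cost(
    p_miss_per_month: float,
    avg_credit_if_missed: float,
    months_per_year: int = 12,
) -> float:
    """Expected yearly SLA credit cost under a constant monthly miss probability."""
    return p_miss_per_month * avg_credit_if_missed * months_per_year


cost = expected_annual_credit_cost(0.15, 50_000)
new_business = 500_000
print(round(cost))                 # 90000
print(round(new_business - cost))  # 410000 expected net gain from the SLA-driven deals
```

Note that this is an average. It says nothing about the worst case, which is the subject of the next section.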

The Tail Risk Problem

Average SLA credit exposure can be modeled, but catastrophic outages are tail risks:

• An extended multi-day outage could hit the lowest tier (<95%) for all customers
• Coordinated claims from enterprise customers could exceed reserves
• Depending on contract language, some customers may have negotiated additional remedies

Major cloud providers have faced multi-million-dollar SLA credit events. Budget for bad months, not just average months.

SLA Liability Scenarios
Scenario	Uptime	Customer Base Impact	Credit Exposure
Minor degradation	99.7%	All customers (10% credit)	$100K if $1M MRR
Significant outage	98%	All customers (25% credit)	$250K if $1M MRR
Major incident	90%	All customers (50% credit)	$500K if $1M MRR
Catastrophic failure	<90%	Plus negotiated enterprise penalties	Potentially >100% MRR

Setting Appropriate SLA Commitments

Setting SLA levels requires balancing competitive positioning with achievable reliability. Here's a systematic approach:

The SLA Setting Framework

•Start with your SLO — Your SLA must be below your consistently-achieved SLO. If you're at 99.95%, don't promise more than 99.9%.
•Analyze historical performance — Look at 24+ months of data. What's your worst month? That's your floor.
•Build in planned maintenance — Your SLA budget must accommodate maintenance windows. 99.9% allows 43 minutes/month—is that enough?
•Consider dependency constraints — If AWS EC2 has a 99.99% SLA, you likely can't reliably exceed that.
•Benchmark competitors — What are similar services offering? Being significantly below market is a competitive disadvantage.
•Model credit liability — Calculate expected costs at various SLA levels. Can you absorb bad months?
•Get stakeholder alignment — Legal, Finance, Sales, and Engineering must agree on the commitment.
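
The dependency constraint in the framework above can be made concrete with a rough calculation. If a service needs every one of its dependencies to be up, and their failures are independent (both are simplifying assumptions), the best-case availability is the product of the individual availabilities. The function name is hypothetical.

```python
import math


def serial_availability(*availabilities_percent: float) -> float:
    """Upper bound on availability when every dependency is required.

    Assumes independent failures; correlated failures or redundancy change the result.
    """
    return math.prod(a / 100 for a in availabilities_percent) * 100


# Example: compute at 99.99%, a database at 99.95%, a payment processor at 99.9%.
combined = serial_availability(99.99, 99.95, 99.9)
print(round(combined, 2))  # 99.84
```

In this example the combined figure is lower than any single dependency, so promising 99.9% on top of that stack would leave no room for the service's own failures.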

The 'SLA - SLO' Buffer

How much buffer is enough?

Scenario: SLO = 99.95%, SLA = 99.9%

Buffer = 99.95% - 99.9% = 0.05%
In 30 days, 0.05% = 21.6 minutes

This buffer absorbs:
- Minor incidents (< 22 minutes/month)
- Planned maintenance windows
- Measurement variance
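
The buffer arithmetic can be reproduced for any SLO/SLA pair. A minimal sketch, assuming a 30-day month; the function names are illustrative.

```python
def downtime_budget_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed per period at a given availability target."""
    return (100 - target_percent) / 100 * days * 24 * 60


def buffer_minutes(slo: float, sla: float, days: int = 30) -> float:
    """Extra downtime the SLA tolerates beyond what the SLO allows."""
    return downtime_budget_minutes(sla, days) - downtime_budget_minutes(slo, days)


print(round(downtime_budget_minutes(99.9), 1))   # 43.2 (SLA budget)
print(round(downtime_budget_minutes(99.95), 1))  # 21.6 (SLO budget)
print(round(buffer_minutes(99.95, 99.9), 1))     # 21.6 (buffer)
```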

Rule of thumb:

Conservative: SLA allows about 10x the SLO's downtime budget
Moderate: SLA is the next standard tier below the SLO (roughly 2-5x the downtime budget)
Aggressive: SLA = SLO - 0.01 percentage points (minimal buffer, higher risk)

Example:

SLO	Conservative SLA	Moderate SLA	Aggressive SLA
99.99%	99.9%	99.95%	99.98%
99.95%	99.5%	99.9%	99.94%
99.9%	99%	99.5%	99.89%

Tiered SLAs for Different Customer Segments

Many providers offer different SLAs based on pricing tier:

• Free tier: No SLA (best effort)
• Standard tier: 99.9% SLA
• Enterprise tier: 99.95% SLA with custom terms
• Premium tier: 99.99% SLA with dedicated support

This allows you to price reliability as a feature and reserve your most robust infrastructure for highest-paying customers.

Enterprise SLA Negotiation

Large enterprise customers often negotiate custom SLA terms beyond your standard published SLA. This can be profitable (larger deals) but also risky (larger exposure). Here's how to navigate these negotiations:

What Enterprises Typically Request

Higher availability targets:

'We need 99.99% availability, not 99.9%.'

Financial remedies beyond credits:

'If you're below 99.5%, we want the right to terminate without penalty.'
'SLA credits should be actual refunds, not credits against future bills.'

Custom measurement:

'We want to measure availability from our monitoring system, not yours.'
'Availability should be measured only during business hours (6 AM - 10 PM EST).'

Specific incident response times:

'Critical issues must have first response in 15 minutes.'
'Resolution SLA: Critical issues resolved in 4 hours.'

Penalties for repeated failures:

'If you miss SLA 3 months in a row, we can terminate.'
'Escalating credits: 15% first miss, 25% second, 50% third.'

Dangerous Concessions

•Customer-defined measurement methodology
•Unlimited liability (no credit caps)
•Cash refunds instead of credits
•Commitments above your SLO
•SLAs on beta/experimental features
•Penalties for partial degradation

Reasonable Negotiations

•Higher credit percentages (15% → 25%)
•Faster response time SLAs
•Dedicated support channels
•Business-hours-only measurement
•Enhanced incident communication
•Review of root cause analyses

The 'Custom SLA Premium'

Custom SLA terms constitute additional risk for your business. Price accordingly.

Rule of thumb: For every 'nine' of additional availability commitment, price should increase 50-100%. A customer wanting 99.99% instead of 99.9% should pay premium pricing.

Some organizations have 'SLA uplift' pricing: standard SLA is included, enhanced SLA is a percentage premium on the contract value.

Managing SLA Violations

Despite best efforts, SLA violations happen. How you handle them significantly impacts customer relationships and organizational learning.

Proactive vs Reactive Communication

Proactive approach (recommended):

Dear Customer,

We regret to inform you that on [date], our [Service] experienced an
outage lasting [duration]. This impacted your monthly uptime percentage,
which was [X%], below our SLA commitment of [Y%].

Per our Service Level Agreement, a credit of [amount] will be automatically
applied to your next invoice. No action is required on your part.

We take this seriously. A post-incident review is available at [link],
detailing root cause and preventive measures implemented.

Thank you for your patience and continued trust.

Why proactive is better:

Demonstrates accountability and transparency
Reduces support inquiries ('Am I getting a credit?')
Builds trust (you're not hiding bad news)
Controls the narrative (you frame the message)

The Credit Processing Workflow

Standard process:

Detection: Monitoring automatically flags SLA threshold breach
Validation: SRE confirms the breach and applies any contractual exclusions
Impact assessment: Finance calculates affected customers and credit amounts
Approval: Operations/Finance leadership approves credit batch
Application: Credits applied to billing system
Communication: Affected customers notified
Documentation: Incident + credits logged for reporting

Automating SLA Credits

Mature organizations automate SLA credit calculation and application:

• Real-time SLI tracking detects threshold breaches
• Automated calculation of credit per customer based on usage and tier
• Credits automatically stage in billing system
• Approval workflow triggers for human review
• Automated notification emails

This reduces manual error, speeds customer experience, and ensures consistent treatment.
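
The calculation and staging steps might look like the sketch below. It is a simplified illustration under several assumptions: the credit tiers reuse the example structure from earlier on this page, amounts are held in integer cents, and `CustomerMonth`, `StagedCredit`, and `stage_credits` are hypothetical names rather than any real billing API. Credits are only staged with a pending status, leaving the approval step to a human.

```python
from dataclasses import dataclass

# (minimum uptime %, credit % of monthly fee); below the last threshold, FLOOR_CREDIT applies.
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25)]
FLOOR_CREDIT = 50


@dataclass
class CustomerMonth:
    customer_id: str
    monthly_fee_cents: int
    uptime_percent: float


@dataclass
class StagedCredit:
    customer_id: str
    credit_cents: int
    status: str = "pending_approval"  # a human reviews before it reaches an invoice


def credit_percent(uptime: float) -> int:
    for threshold, pct in CREDIT_TIERS:
        if uptime >= threshold:
            return pct
    return FLOOR_CREDIT


def stage_credits(months: list[CustomerMonth]) -> list[StagedCredit]:
    """Compute a credit for every customer whose uptime fell below the SLA."""
    staged = []
    for m in months:
        pct = credit_percent(m.uptime_percent)
        if pct > 0:
            staged.append(StagedCredit(m.customer_id, m.monthly_fee_cents * pct // 100))
    return staged


batch = stage_credits([
    CustomerMonth("acme", 1_000_000, 99.5),    # 10% tier
    CustomerMonth("globex", 250_000, 99.95),   # met the SLA, no credit
    CustomerMonth("initech", 400_000, 94.0),   # 50% tier
])
for credit in batch:
    print(credit.customer_id, credit.credit_cents, credit.status)
```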

Learning from SLA Violations

Beyond just paying credits:

An SLA violation should trigger deeper analysis:

Why did we miss the SLO? (The SLO should catch problems before SLA breach)
Was our SLA too aggressive? (Strategic mismatch with capability)
What systematic improvements are needed? (Not just incident fixes)
Should we adjust monitoring/alerting? (Earlier detection next time)
Are there customer-specific factors? (Geographic, usage patterns)

The SLA violation retrospective:

Schedule a quarterly review of all SLA violations:

Which customers were affected?
What was the total credit impact?
Are violations concentrated in specific services?
Is there a trend (improving or worsening)?
What reliability investments are justified by credit reduction?

Real-World SLA Examples from Industry Leaders

Let's examine how major cloud providers and SaaS companies structure their SLAs. These examples represent industry best practices and common patterns.

AWS EC2 SLA (Simplified)

Commitment: 99.99% Monthly Uptime Percentage

Credit structure:

99.0% - 99.99%: 10% credit
95.0% - 99.0%: 30% credit
< 95.0%: 100% credit (month is free)

Key exclusions:

Scheduled maintenance
Customer-caused issues
Force majeure
Individual instance failures (only region-wide counts)

Claim process: Customer must submit within 30 days of billing cycle end.

Google Cloud Platform SLA (Simplified)

Commitment: Varies by service and deployment architecture (roughly 99.5% to 99.999%)

Interesting feature: Google offers tiered SLAs based on architecture:

Single zone deployment: 99.5%
Multi-zone deployment: 99.99%

This incentivizes customers to build resilient architectures.

Stripe API SLA

Commitment: 99.99% uptime for core payment processing

Unique aspects:

Very specific about what endpoints are covered
Separate SLAs for different API categories
Enterprise plans include custom response time SLAs
Public status page tracking as part of SLA transparency

SLA Transparency Trend

Modern SaaS companies increasingly publish SLA performance publicly:

• Real-time uptime percentages on status pages
• Monthly uptime reports sent to all customers
• Public post-incident reviews
• Historical uptime trending

This transparency builds trust and holds the organization accountable. If your SLA dashboard is internal-only, consider whether public visibility would help or hurt.

Patterns Across Industry SLAs

Common elements:

Conservative targets: Many SLAs sit at 99.9% (three nines) or 99.99%, even from companies that achieve more in practice
Credits, not refunds: Almost universally use service credits
Capped liability: Credits rarely exceed 100% of monthly fee
Customer claim requirement: Customers must actively request credits
Exclusions list: All SLAs have extensive exclusions
Time-bounded claims: 30-60 day claim windows

Lessons for your SLA:

Industry practice provides a defensible baseline
Customers familiar with cloud SLAs have calibrated expectations
Unusual SLA terms (very high commitments, cash refunds) are red flags to experienced buyers

Summary: Mastering Service Level Agreements

SLAs transform reliability from an engineering concern into a contractual commitment with financial consequences. Setting and managing SLAs requires collaboration between engineering, legal, finance, and sales.

Key Takeaways

•SLAs are contracts, not aspirations — They carry legal and financial weight. Treat them with corresponding seriousness.
•SLAs must be below SLOs — Your internal target (SLO) should always exceed your external promise (SLA) to avoid violations.
•Structure matters — Clear definitions of availability, exclusions, measurement, and remedies prevent disputes.
•Credits are the standard remedy — Cash refunds and unlimited liability are rare and risky.
•Enterprise negotiations require caution — Custom terms add risk; price accordingly.
•Violations happen — How you handle them (proactive communication, automated credits) matters for relationships.
•Learn from every violation — SLA breaches should trigger analysis beyond just paying credits.

What's next:

With SLIs, SLOs, and SLAs now established, we need to understand how they interrelate in practice—the feedback loops, escalation paths, and organizational dynamics that make reliability management work.

You now understand Service Level Agreements—the contractual commitments that bind your organization to reliability standards. SLAs transform the internal discipline of SLOs into external accountability, with real consequences for violations. Next, we'll explore how SLIs, SLOs, and SLAs work together as an integrated reliability framework.