Every distributed system fails. The question isn't whether errors will occur, but how many, what kind, and what impact they have on users. Error rate SLIs quantify this fundamental reality, transforming the abstract concept of "system reliability" into concrete, measurable proportions.
While availability SLIs focus on whether the service is working at all, and latency SLIs measure how fast it works, error rate SLIs capture the quality of individual operations. A service might be "available" (responding to requests) yet returning errors for 10% of operations. A service might be "fast" (low latency) yet delivering incorrect data. Error rate SLIs fill this gap, measuring the proportion of operations that succeed in accomplishing their intended purpose.
But what constitutes an "error"? An HTTP 500 is obviously an error. What about an HTTP 400? A timeout? A successful response containing semantically wrong data? The answers to these questions determine whether your error rate SLI reflects true user impact or merely counts technical exceptions.
By the end of this page, you will understand how to classify errors meaningfully, distinguish error severity levels, measure error rates at appropriate granularity, set realistic error rate targets, and design error rate SLIs that genuinely reflect user experience and system health.
Not all errors are created equal. A rigorous error rate SLI requires a taxonomy—a systematic classification of error types and their characteristics.
Dimension 1: Error Origin
Where did the error originate?
Server errors (5xx): The server acknowledged the request but couldn't fulfill it. The fault lies within your system.
Client errors (4xx): The request itself was invalid. But "client's fault" isn't always clear-cut.
Infrastructure errors: Failures before the application sees the request.
| Error Type | Whose Fault? | Count in Error Rate SLI? | Rationale |
|---|---|---|---|
| 500 Internal Server Error | Ours | Yes | Clear system failure |
| 502 Bad Gateway | Ours (dependency) | Yes | User experiences failure regardless of root cause |
| 503 Service Unavailable | Ours | Yes | Capacity/operational issue |
| 504 Gateway Timeout | Ours (or dependency) | Yes | User waited and got nothing |
| 400 Bad Request | Depends | Sometimes | If caused by our client app: yes. If user typo: no |
| 401 Unauthorized | User (or expected) | No | Auth system working correctly |
| 403 Forbidden | User (or expected) | No | Authorization working correctly |
| 404 Not Found | Depends | Sometimes | Broken link = yes. User typo = no |
| 429 Rate Limited | User (or attack) | No (track separately) | Protective behavior, not error |
| Timeout (client-side) | Ours | Yes | User gave up waiting |
| Connection refused | Ours | Yes | Complete service failure |
Dimension 2: Error Impact
How severely does the error affect the user?
Critical: User's primary goal is blocked. Payment failed, data lost, cannot access account.
Major: Significant degradation of experience. Search returns no results for valid query, images don't load, feature broken.
Minor: Noticeable but not blocking. Recommendation widget fails but core product works. Avatar doesn't load.
Cosmetic: Barely perceptible. Analytics pixel fails to load. Non-essential metadata missing.
Dimension 3: Error Persistence
Transient: Temporary failure that may succeed on retry. Network glitch, momentary overload.
Persistent: Failure that will recur until fixed. Bug in code, corrupted data, misconfiguration.
Deterministic: Always fails for specific input. Bug triggered by edge case.
Stochastic: Fails randomly. Race conditions, resource exhaustion.
Understanding these dimensions allows you to design SLIs that weight errors appropriately. A transient minor error in an analytics call should not be weighted the same as a persistent critical error in payment processing.
There's no universally correct error classification. What matters is that your classification is explicit, documented, and reflects user impact. Make the taxonomy a first-class artifact that your team agrees upon and revises as you learn more about your failure modes.
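To make the taxonomy a concrete artifact, one option is to keep it in version control as a small typed table that both reviewers and the SLI pipeline read. A minimal TypeScript sketch, where the error classes, flags, and weights are hypothetical examples rather than a prescribed set:

```typescript
// Hypothetical taxonomy entries: origin, impact, and persistence per error class.
type Origin = 'server' | 'client' | 'infrastructure';
type Impact = 'critical' | 'major' | 'minor' | 'cosmetic';
type Persistence = 'transient' | 'persistent';

interface TaxonomyEntry {
  description: string;
  origin: Origin;
  impact: Impact;
  persistence: Persistence;
  countInSli: boolean; // Does this class burn error budget?
  weight: number;      // Relative weight when aggregating (illustrative values)
}

// Example entries; the real classes and weights are a team decision to revisit.
const ERROR_TAXONOMY: Record<string, TaxonomyEntry> = {
  payment_capture_failed: {
    description: 'Payment authorized but capture failed',
    origin: 'server', impact: 'critical', persistence: 'persistent',
    countInSli: true, weight: 10,
  },
  analytics_beacon_failed: {
    description: 'Analytics pixel did not load',
    origin: 'infrastructure', impact: 'cosmetic', persistence: 'transient',
    countInSli: false, weight: 0,
  },
};
```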
Error rate calculation seems simple—errors divided by total—but the details matter significantly.
The Basic Formula
Error Rate = Errors / Total Operations × 100%
Alternatively, express as Success Rate = (1 - Error Rate):
Success Rate = (Total Operations - Errors) / Total Operations × 100%
Most SLOs are expressed as success rates ("99.9% of requests succeed") rather than error rates ("0.1% of requests fail"). Both convey the same information but success rates are more intuitive for targets.
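A quick worked example of the arithmetic, using made-up traffic numbers:

```typescript
// Hypothetical day of traffic: 1,200 failures out of 1,000,000 operations.
const totalOperations = 1_000_000;
const errors = 1_200;

const errorRate = (errors / totalOperations) * 100; // 0.12%
const successRate = 100 - errorRate;                // 99.88%

console.log(`Error rate: ${errorRate.toFixed(2)}%, success rate: ${successRate.toFixed(2)}%`);
// A 99.9% success-rate target allows only 1,000 errors at this volume,
// so this day would have overspent its share of the budget.
```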
The Numerator: What Counts as an Error?
Based on your error taxonomy, define precisely which responses count:
Errors = HTTP_5xx + Timeouts + Connection_Failures +
Semantic_Errors + (Conditional_4xx per taxonomy)
Document every decision:
```typescript
interface RequestOutcome {
  httpStatus: number;
  latencyMs: number;
  wasTimeout: boolean;
  endpoint: string;
  errorCode?: string; // Application-level error code
  clientType: 'external' | 'internal' | 'synthetic';
}

interface ErrorClassification {
  isError: boolean;
  errorCategory: 'server' | 'client' | 'timeout' | 'semantic' | 'none';
  severity: 'critical' | 'major' | 'minor' | 'cosmetic' | 'none';
  includeInSli: boolean;
  reason: string;
}

function classifyError(outcome: RequestOutcome): ErrorClassification {
  // Timeouts are always errors from user perspective
  if (outcome.wasTimeout) {
    return {
      isError: true,
      errorCategory: 'timeout',
      severity: 'critical',
      includeInSli: true,
      reason: 'User waited and received no response',
    };
  }

  // Server errors (5xx) are always errors
  if (outcome.httpStatus >= 500) {
    return {
      isError: true,
      errorCategory: 'server',
      severity: determineSeverityByEndpoint(outcome.endpoint),
      includeInSli: true,
      reason: `Server error: HTTP ${outcome.httpStatus}`,
    };
  }

  // Rate limiting: track separately, don't count against error rate SLI
  if (outcome.httpStatus === 429) {
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Rate limiting is protective behavior',
    };
  }

  // Auth errors (401, 403): typically not errors unless unexpected
  if (outcome.httpStatus === 401 || outcome.httpStatus === 403) {
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Authentication/authorization working as expected',
    };
  }

  // 400 Bad Request: nuanced classification
  if (outcome.httpStatus === 400) {
    // If from our own client app, might be a bug on our side
    if (outcome.clientType === 'internal') {
      return {
        isError: true,
        errorCategory: 'client',
        severity: 'major',
        includeInSli: true,
        reason: 'Bad request from internal client suggests a bug',
      };
    }
    // External clients: don't count against our SLI
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Bad request from external client',
    };
  }

  // 404: depends on context
  if (outcome.httpStatus === 404) {
    // Check if this endpoint should exist
    if (isKnownEndpoint(outcome.endpoint)) {
      return {
        isError: true,
        errorCategory: 'server',
        severity: 'major',
        includeInSli: true,
        reason: 'Valid endpoint returned 404 - likely a bug or data issue',
      };
    }
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Unknown endpoint - likely user error or old link',
    };
  }

  // Semantic errors: HTTP 2xx but application-level failure
  if (outcome.httpStatus >= 200 && outcome.httpStatus < 300) {
    if (outcome.errorCode) {
      return {
        isError: true,
        errorCategory: 'semantic',
        severity: determineSeverityByErrorCode(outcome.errorCode),
        includeInSli: true,
        reason: `Semantic error: ${outcome.errorCode}`,
      };
    }
  }

  // Success
  return {
    isError: false,
    errorCategory: 'none',
    severity: 'none',
    includeInSli: false,
    reason: 'Request succeeded',
  };
}

function calculateErrorRate(outcomes: RequestOutcome[]): {
  totalRequests: number;
  errorCount: number;
  errorRate: number;
  successRate: number;
  byCategory: Record<string, number>;
} {
  let errorCount = 0;
  const byCategory: Record<string, number> = {};

  const eligibleOutcomes = outcomes.filter(
    o => o.clientType !== 'synthetic' // Exclude synthetic monitoring
  );

  for (const outcome of eligibleOutcomes) {
    const classification = classifyError(outcome);
    if (classification.includeInSli && classification.isError) {
      errorCount++;
      byCategory[classification.errorCategory] =
        (byCategory[classification.errorCategory] || 0) + 1;
    }
  }

  const total = eligibleOutcomes.length;
  const errorRate = total > 0 ? (errorCount / total) * 100 : 0;
  const successRate = 100 - errorRate;

  return {
    totalRequests: total,
    errorCount,
    errorRate,
    successRate,
    byCategory,
  };
}
```

The Denominator: What Counts as an Operation?
This mirrors the availability denominator discussion but has error-specific considerations:
Recommended approach: count each logical user operation once (internal retries should not inflate the denominator), and exclude synthetic monitoring traffic and internal health checks so the rate reflects real user operations.
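A minimal sketch of that filtering, assuming each request record carries a source tag and a retry flag (the field names here are hypothetical):

```typescript
interface RequestRecord {
  source: 'external_user' | 'mobile_app' | 'synthetic' | 'health_check';
  isInternalRetry: boolean; // true if this attempt was an automatic retry
  failed: boolean;
}

// Keep only operations that represent real user intent: drop synthetic probes,
// health checks, and automatic retries so one user action counts once.
function sliDenominator(records: RequestRecord[]): RequestRecord[] {
  return records.filter(
    r => r.source !== 'synthetic' &&
         r.source !== 'health_check' &&
         !r.isInternalRetry
  );
}

// Error rate is then computed over the filtered set only.
function errorRateOver(records: RequestRecord[]): number {
  const eligible = sliDenominator(records);
  if (eligible.length === 0) return 0;
  return (eligible.filter(r => r.failed).length / eligible.length) * 100;
}
```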
Should you have one error rate SLI for your entire service, or separate SLIs for each endpoint, feature, or error type? The answer depends on your goals and operational maturity.
The Case for Aggregated Error Rates
A single "overall error rate" for your service:
Advantages:
Disadvantages:
The Case for Disaggregated Error Rates
Separate error rate SLIs per endpoint, feature, or user journey:
Advantages:
Disadvantages:
Composite Error Rate SLIs
A middle ground is a weighted composite that maintains simplicity while accounting for criticality:
Weighted Error Rate = Σ(Error Rate × Weight × Traffic) / Σ(Weight × Traffic)
Where weights reflect business importance; for example, checkout endpoints might carry ten times the weight of a recommendations widget.
This single metric naturally surfaces critical endpoint failures more prominently while still tracking overall health.
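A sketch of computing that composite, assuming per-endpoint error rates, traffic counts, and business weights are already available; the endpoints and weight values below are purely illustrative:

```typescript
interface EndpointStats {
  endpoint: string;
  errorRate: number; // 0-1, e.g., 0.002 = 0.2%
  requests: number;  // traffic in the measurement window
  weight: number;    // business importance
}

// Each endpoint contributes its error rate in proportion to both its traffic
// and its business weight; the result stays between the min and max rates.
function weightedErrorRate(stats: EndpointStats[]): number {
  const numerator = stats.reduce((s, e) => s + e.errorRate * e.weight * e.requests, 0);
  const denominator = stats.reduce((s, e) => s + e.weight * e.requests, 0);
  return denominator > 0 ? numerator / denominator : 0;
}

// Illustrative values: a 1% error rate on low-traffic checkout pulls the
// composite to ~0.55%, versus ~0.18% for a plain traffic-weighted average.
const composite = weightedErrorRate([
  { endpoint: '/checkout', errorRate: 0.01, requests: 50_000, weight: 10 },
  { endpoint: '/recommendations', errorRate: 0.001, requests: 500_000, weight: 1 },
]);
```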
Error Budgets by Category
Another approach is to allocate your error budget across error categories, giving server errors, timeouts, and semantic errors each an explicit share of the total allowance.
If server errors consume the entire budget, you're not meeting SLO even if timeouts are zero.
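As an illustration, a 99.9% SLO (a 0.1% total allowance) might be split across categories like this; the specific split is a hypothetical team decision:

```typescript
// Total error budget for a 99.9% SLO is 0.1% of requests.
const TOTAL_BUDGET = 0.001;

// Hypothetical allocation across categories; the shares sum to the total budget.
const categoryBudgets: Record<string, number> = {
  server_errors: 0.0005,   // 0.05%
  timeouts: 0.0003,        // 0.03%
  semantic_errors: 0.0002, // 0.02%
};

// A category can breach its own allocation even while the overall rate looks
// acceptable, which is exactly the early warning this scheme is meant to give.
function categoriesOverBudget(observedRates: Record<string, number>): string[] {
  return Object.entries(categoryBudgets)
    .filter(([category, budget]) => (observedRates[category] ?? 0) > budget)
    .map(([category]) => category);
}
```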
Low-traffic endpoints have noisy error rates. If an endpoint gets 10 requests/day and 1 fails, that's a 10% error rate—but it's also just 1 failure. Set minimum sample thresholds before calculating meaningful error rates. Consider weekly or monthly windows for infrequent endpoints.
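One way to make that noise visible is to pair a minimum-sample guard with a Wilson score interval, which shows how wide the plausible range for the true error rate is at small counts. A sketch with illustrative thresholds:

```typescript
// Wilson score interval for a proportion: k errors out of n requests, ~95% confidence.
function wilsonInterval(k: number, n: number, z = 1.96): { low: number; high: number } {
  if (n === 0) return { low: 0, high: 1 };
  const p = k / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return { low: Math.max(0, center - margin), high: Math.min(1, center + margin) };
}

// 1 failure out of 10 requests: point estimate 10%, but the interval is roughly
// 2%-40%, far too wide to act on. Require a minimum sample or widen the window.
const MIN_SAMPLES = 100; // illustrative threshold
function meaningfulErrorRate(k: number, n: number): number | null {
  return n >= MIN_SAMPLES ? (k / n) * 100 : null; // null = not enough data yet
}
```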
Some of the most insidious errors return HTTP 200 but deliver incorrect, incomplete, or corrupted results. These semantic errors evade simple status-code-based detection and can cause significant user harm before being noticed.
Examples of Semantic Errors
Data correctness errors:
Completeness errors:
Consistency errors:
Logic errors:
Detecting Semantic Errors
Semantic errors require domain-aware detection—you must understand what "correct" looks like:
Approach 1: Response Validation. Validate response structure and content.
Approach 2: Synthetic Validation. Regularly execute known-answer queries and verify the responses match expectations.
Approach 3: Cross-Reference Verification. Compare data across sources and flag disagreements.
Approach 4: User Feedback Integration. Treat certain user behaviors (immediate retries, error reports, support tickets) as error signals.
```python
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import json


@dataclass
class SemanticValidationResult:
    is_valid: bool
    errors: List[str]
    warnings: List[str]


class SemanticValidator:
    """
    Validates responses for semantic correctness beyond HTTP status codes.
    """

    def validate_search_response(
        self, query: str, response: Dict[str, Any]
    ) -> SemanticValidationResult:
        """Validate search results are semantically correct."""
        errors = []
        warnings = []

        results = response.get('results', [])
        query_terms = query.lower().split()

        # Check: Results should be relevant to query
        if results and len(query_terms) > 0:
            relevance_scores = []
            for result in results[:10]:  # Check top 10
                title = result.get('title', '').lower()
                description = result.get('description', '').lower()
                content = f"{title} {description}"
                term_matches = sum(1 for term in query_terms if term in content)
                relevance_scores.append(term_matches / len(query_terms))

            avg_relevance = sum(relevance_scores) / len(relevance_scores)
            if avg_relevance < 0.2:  # Less than 20% term coverage
                errors.append(
                    f"Search results appear irrelevant: avg_relevance={avg_relevance:.2f}"
                )

        # Check: Result count should match total claimed
        claimed_total = response.get('total_results', 0)
        if len(results) > claimed_total:
            errors.append(
                f"Result count mismatch: returned {len(results)} but claimed total {claimed_total}"
            )

        # Check: Results should have required fields
        for i, result in enumerate(results):
            if not result.get('id'):
                errors.append(f"Result {i} missing 'id' field")
            if not result.get('title'):
                warnings.append(f"Result {i} missing 'title' field")

        return SemanticValidationResult(
            is_valid=len(errors) == 0,
            errors=errors,
            warnings=warnings,
        )

    def validate_order_response(
        self, order_request: Dict[str, Any], response: Dict[str, Any]
    ) -> SemanticValidationResult:
        """Validate order confirmation is semantically correct."""
        errors = []
        warnings = []

        # Check: Order ID should be returned
        if not response.get('order_id'):
            errors.append("Order confirmation missing order_id")

        # Check: Totals should be consistent
        items = response.get('items', [])
        calculated_subtotal = sum(
            item.get('price', 0) * item.get('quantity', 0) for item in items
        )
        claimed_subtotal = response.get('subtotal', 0)
        if abs(calculated_subtotal - claimed_subtotal) > 0.01:
            errors.append(
                f"Subtotal mismatch: calculated {calculated_subtotal}, claimed {claimed_subtotal}"
            )

        # Check: Total = Subtotal + Tax + Shipping - Discounts
        claimed_total = response.get('total', 0)
        expected_total = (
            claimed_subtotal
            + response.get('tax', 0)
            + response.get('shipping', 0)
            - response.get('discount', 0)
        )
        if abs(expected_total - claimed_total) > 0.01:
            errors.append(
                f"Total calculation error: expected {expected_total}, got {claimed_total}"
            )

        # Check: Items match request
        requested_items = {item['sku']: item['quantity']
                           for item in order_request.get('items', [])}
        confirmed_items = {item['sku']: item['quantity'] for item in items}
        if requested_items != confirmed_items:
            errors.append("Confirmed items don't match requested items")

        return SemanticValidationResult(
            is_valid=len(errors) == 0,
            errors=errors,
            warnings=warnings,
        )


# Usage in request pipeline
validator = SemanticValidator()


def track_semantic_error_rate(response, operation_type, request_data=None):
    """
    Validate response and track semantic error rate.
    """
    if operation_type == 'search':
        result = validator.validate_search_response(
            request_data.get('query', ''), response
        )
    elif operation_type == 'order':
        result = validator.validate_order_response(request_data, response)
    else:
        return  # No validation for this operation type

    if not result.is_valid:
        # Increment semantic error counter
        metrics.increment('semantic_errors', tags={'operation': operation_type})
        # Log for investigation
        logger.error(f"Semantic validation failed: {result.errors}",
                     extra={'response': response, 'errors': result.errors})
```

Full semantic validation of every response may be computationally prohibitive. Consider sampling strategies: validate 1% of responses thoroughly, validate all responses from specific critical endpoints, or validate based on risk signals (new code paths, unusual inputs).
What error rate should you target? This depends on the nature of the operation, user expectations, and business impact.
Framework for Target Selection
Consider user tolerance: How forgiving are users of this particular failure?
Consider failure impact:
| Operation Type | Suggested Target | Example | Rationale |
|---|---|---|---|
| Financial transactions | 99.99%+ success | Payment, withdrawal | Every failure is money-related; user trust critical |
| Data mutation (writes) | 99.95%+ success | Save, update, delete | Failed writes may cause data loss or inconsistency |
| Authentication | 99.9%+ success | Login, session | Failed auth blocks all functionality |
| Core read operations | 99.9%+ success | Product page, user profile | Primary user journeys must work |
| Search/discovery | 99.5%+ success | Search, recommendations | Degradation acceptable; user can browse manually |
| Background operations | 99%+ success | Analytics, sync | Failures retried; user doesn't see directly |
| Nice-to-have features | 95%+ success | Social badges, achievements | Enhances but not required for core experience |
Baseline Then Improve
If you're establishing error rate SLIs for the first time, don't guess at targets: measure your actual error rate over several weeks, set an initial target slightly tighter than that observed baseline, and ratchet it up as reliability work lands.
Avoid unrealistic targets: A target you can never meet demoralizes the team. A target you always meet doesn't drive improvement. The sweet spot: achievable most of the time, occasionally breached, driving continuous attention.
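A sketch of deriving an initial target from observed history rather than guessing; the data and the percentile choice are illustrative:

```typescript
// Daily success rates observed over a baseline period (hypothetical data).
const dailySuccessRates = [99.91, 99.87, 99.95, 99.72, 99.93, 99.89, 99.94];

// Take a low percentile of observed days so the target is met most of the time
// but still occasionally breached, keeping attention on reliability.
function baselineTarget(observed: number[], percentile = 0.1): number {
  const sorted = [...observed].sort((a, b) => a - b);
  const idx = Math.floor(percentile * (sorted.length - 1));
  return sorted[idx];
}

const initialTarget = baselineTarget(dailySuccessRates); // 99.72 for this data
// Publish something close to this (e.g., 99.7%), then ratchet it upward as the
// observed baseline improves.
```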
Multi-Tier Targets
Like latency SLIs, consider multi-tier error rate targets: for example, a warning tier at a 1% error rate and a critical tier at 5%.
This provides a graduated response: elevated attention when the warning tier is breached, an emergency response when the critical tier is.
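A small sketch of those tiers in code, mirroring the 1% / 5% example above; the exact levels are a per-service choice:

```typescript
type Tier = 'ok' | 'degraded' | 'critical';

// Illustrative tiers: below 1% error rate is normal, 1-5% needs attention,
// 5% and above is an emergency.
function classifyErrorRateTier(errorRatePercent: number): Tier {
  if (errorRatePercent >= 5) return 'critical';
  if (errorRatePercent >= 1) return 'degraded';
  return 'ok';
}

// Graduated response: 'degraded' might open a ticket or page during business
// hours, while 'critical' pages the on-call immediately.
```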
A useful heuristic: for each step up in target strictness (from 99% to 99.9% to 99.99%), effort increases by approximately 10x. If maintaining 99% is achievable with basic practices, 99.9% requires robust monitoring and on-call, and 99.99% requires sophisticated automation, redundancy, and chaos engineering.
Effective error rate monitoring balances sensitivity (catching problems quickly) with specificity (avoiding alert noise).
Alert Strategy for Error Rates
The problem with naive thresholds: A fixed threshold like "alert if error rate > 1%" has issues: it ignores traffic volume (one failure in fifty requests trips it, while a real outage at 0.9% does not), it treats a brief blip the same as a sustained failure, and it says nothing about how quickly you are consuming your error budget.
Better approaches:
1. Error Budget Burn Rate Alerting
Alert based on how quickly you're consuming your monthly error budget.
Alert when burn rate exceeds thresholds (e.g., 10x for 5 minutes, 2x for 1 hour).
```typescript
interface BurnRateAlert {
  shouldAlert: boolean;
  burnRate: number;
  currentErrorRate: number;
  budgetRemaining: number;
  hoursUntilBudgetExhausted: number;
  severity: 'critical' | 'warning' | 'none';
}

function evaluateErrorBudgetBurning(
  currentErrorRate: number,         // Current error rate (0-1, e.g., 0.02 = 2%)
  sloTarget: number,                // SLO target (0-1, e.g., 0.999 = 99.9%)
  windowHours: number,              // Current measurement window
  budgetConsumedSoFar: number,      // Portion of monthly budget already used (0-1)
  monthlyBudgetHours: number = 720, // Hours in SLO period (30 days)
): BurnRateAlert {
  // Error budget = allowed error rate = 1 - SLO
  const allowedErrorRate = 1 - sloTarget; // e.g., 0.001 for 99.9% SLO

  // Burn rate = (current error rate) / (allowed error rate)
  // Burn rate of 1 = using budget at exact expected pace
  // Burn rate of 10 = burning 10x faster than sustainable
  const burnRate = currentErrorRate / allowedErrorRate;

  // Budget remaining
  const budgetRemaining = 1 - budgetConsumedSoFar;

  // At current burn rate, how long until budget exhausted?
  // If burn rate is 10 and we have 30 days budget, exhausted in 3 days
  const hoursUntilExhausted = burnRate > 0
    ? (budgetRemaining * monthlyBudgetHours) / burnRate
    : Infinity;

  // Alert thresholds based on Google SRE multi-window approach
  let shouldAlert = false;
  let severity: 'critical' | 'warning' | 'none' = 'none';

  // Critical: Burning very fast, will exhaust budget soon
  if (burnRate >= 14.4 && windowHours >= 1 / 60) { // 14.4x for 1 minute
    shouldAlert = true;
    severity = 'critical';
  } else if (burnRate >= 6 && windowHours >= 0.25) { // 6x for 15 minutes
    shouldAlert = true;
    severity = 'critical';
  }
  // Warning: Elevated burn rate
  else if (burnRate >= 3 && windowHours >= 1) { // 3x for 1 hour
    shouldAlert = true;
    severity = 'warning';
  } else if (burnRate >= 1 && windowHours >= 6) { // 1x for 6 hours
    shouldAlert = true;
    severity = 'warning';
  }

  return {
    shouldAlert,
    burnRate,
    currentErrorRate,
    budgetRemaining,
    hoursUntilBudgetExhausted: hoursUntilExhausted,
    severity,
  };
}

// Example usage
const result = evaluateErrorBudgetBurning(
  0.02,  // 2% current error rate
  0.999, // 99.9% SLO target
  0.25,  // 15-minute window
  0.3,   // 30% of monthly budget already used
);

if (result.shouldAlert) {
  console.log(`ALERT (${result.severity}): Burn rate ${result.burnRate.toFixed(1)}x`);
  console.log(`Budget exhausted in ${result.hoursUntilBudgetExhausted.toFixed(1)} hours`);
}
```

2. Minimum Sample Size
Don't alert on low-sample windows:
```python
if request_count < 100:
    skip_alert()  # Insufficient data
else:
    evaluate_error_rate()
```
3. Error Rate Change Detection
Alert on significant changes from baseline, not just absolute thresholds.
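A rough sketch of baseline-relative detection, comparing the current window against a trailing baseline instead of a fixed threshold; the multiplier and minimum delta are illustrative:

```typescript
// Compare the current window's error rate with a trailing baseline of windows.
function isSignificantIncrease(
  currentRate: number,     // e.g., 0.8 (% errors) over the last 15 minutes
  baselineRates: number[], // same-size windows over the previous day or week
  multiplier = 3           // illustrative: alert when 3x above baseline
): boolean {
  if (baselineRates.length === 0) return false;
  const baseline = baselineRates.reduce((s, r) => s + r, 0) / baselineRates.length;
  // Guard against alerting on tiny absolute changes when the baseline is near zero.
  const minAbsoluteDelta = 0.1; // percentage points, illustrative
  return currentRate > baseline * multiplier &&
         currentRate - baseline > minAbsoluteDelta;
}
```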
4. Error Clustering Detection
Watch for error concentration that might not breach overall thresholds: all failures hitting a single customer, endpoint, region, or client version.
These might be small percentages overall but warrant immediate attention.
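A sketch of clustering detection: group failures by one dimension (endpoint, region, customer, client version) and flag any group whose local error rate is far above the overall rate. The field names and thresholds are hypothetical:

```typescript
interface TaggedOutcome {
  failed: boolean;
  dimensionValue: string; // e.g., endpoint, region, customer ID, client version
}

interface Cluster {
  dimensionValue: string;
  requests: number;
  errorRate: number; // percent
}

// Flag groups whose error rate is far above the overall rate, subject to a
// minimum sample size so tiny groups don't create noise.
function findErrorClusters(
  outcomes: TaggedOutcome[],
  minRequests = 50,   // illustrative
  rateMultiplier = 10 // illustrative: 10x the overall error rate
): Cluster[] {
  const overall =
    (outcomes.filter(o => o.failed).length / Math.max(outcomes.length, 1)) * 100;

  const groups = new Map<string, { total: number; failed: number }>();
  for (const o of outcomes) {
    const g = groups.get(o.dimensionValue) ?? { total: 0, failed: 0 };
    g.total++;
    if (o.failed) g.failed++;
    groups.set(o.dimensionValue, g);
  }

  return [...groups.entries()]
    .map(([dimensionValue, g]) => ({
      dimensionValue,
      requests: g.total,
      errorRate: (g.failed / g.total) * 100,
    }))
    .filter(c => c.requests >= minRequests && c.errorRate > overall * rateMultiplier);
}
```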
Dashboard Best Practices
Google's SRE practices recommend multi-window alerting: a short window with high burn rate threshold catches fast-burning incidents, while a longer window with lower threshold catches slow leaks. For example: alert if burn rate > 14x for 2 minutes OR burn rate > 6x for 15 minutes OR burn rate > 3x for 1 hour. This balances speed of detection with noise reduction.
Let's examine concrete error rate SLI specifications for real-world scenarios.
Example 1: API Service Error Rate
```yaml
# Error Rate SLI Specification: Customer-Facing API
sli:
  name: "API Success Rate"
  description: "Proportion of API requests that succeed"

  success_definition:
    http_status:
      success: [200, 201, 204, 304]    # Standard success codes
      error: [500, 501, 502, 503, 504] # Server errors
      excluded:
        - 429  # Rate limiting (tracked separately)
        - 401  # Authentication expected behavior
        - 403  # Authorization expected behavior
        - 404  # Only counted as error for known resource paths
    additional_error_conditions:
      - "Request timeout (client or server)"
      - "Connection reset before response"
      - "Response body contains 'error' with non-null value"

  scope:
    endpoints: "/api/v*/**"
    traffic_sources:
      include: ["external_users", "mobile_app", "web_app"]
      exclude: ["synthetic_monitoring", "internal_services"]

  target:
    success_rate: 99.9%
    error_rate: 0.1%
    budget_period: "30 days rolling"

  alerting:
    burn_rate_alerts:
      - window: "5 minutes"
        burn_rate_threshold: 14.4
        severity: "critical"
      - window: "1 hour"
        burn_rate_threshold: 6
        severity: "critical"
      - window: "6 hours"
        burn_rate_threshold: 3
        severity: "warning"
    minimum_request_count: 100

  reporting:
    dashboards: ["service-health", "slo-compliance"]
    weekly_review: true
    monthly_report: true
```

Example 2: Transaction Processing Error Rate
```yaml
# Error Rate SLI Specification: Payment Transactions
sli:
  name: "Transaction Success Rate"
  description: "Proportion of payment transactions that complete successfully"

  success_definition:
    description: >
      A transaction is successful if the payment is captured and the
      order is recorded in our system. Customer payment method issues
      (insufficient funds, card declined by bank) are not counted as
      system errors.
    success_criteria:
      - "Payment captured (payment_status = 'captured')"
      - "Order created in database"
      - "Confirmation event dispatched"
    error_criteria:
      - "HTTP 5xx from payment service"
      - "Timeout waiting for payment processor"
      - "Order creation failed after successful payment capture"
      - "Inconsistent state (payment captured but order missing)"
    excluded_from_denominator:
      - "Card declined by issuing bank (customer issue)"
      - "Invalid card number (customer input error)"
      - "Fraud detection block (intentional protection)"
      - "3DS authentication failed (customer issue)"

  scope:
    events: "payment_initiated"
    payment_types: ["credit_card", "debit_card", "bank_transfer"]
    excludes: ["test_transactions", "internal_orders"]

  target:
    success_rate: 99.99%  # Very high target for financial operations
    error_rate: 0.01%

  semantic_validation:
    enabled: true
    checks:
      - name: "Amount consistency"
        rule: "captured_amount == authorized_amount"
      - name: "Currency consistency"
        rule: "captured_currency == order_currency"
      - name: "Order linkage"
        rule: "order_id in orders_database"

  alerting:
    immediate_alerts:
      - condition: "Any transaction in inconsistent state"
        severity: "critical"
        notification: ["payments-oncall", "finance-team"]
    rate_based:
      - error_rate_threshold: 0.1%
        window: "5 minutes"
        severity: "critical"
```

Example 3: Batch Processing Error Rate
```yaml
# Error Rate SLI Specification: Nightly Data Pipeline
sli:
  name: "Data Pipeline Record Processing Success"
  description: "Proportion of records successfully processed through the pipeline"

  # For batch processing, we measure at record level, not job level
  # A job can succeed overall while having some record-level failures
  success_definition:
    at_record_level:
      success: "Record transformed and loaded to destination"
      error: "Record rejected or failed to load"
      excluded: "Record filtered by business rules (intentional skip)"
    at_job_level:
      # Job success doesn't affect SLI, but tracked separately
      job_success: "All records processed (success + error + excluded = input)"
      job_failure: "Job crashed or timed out before completing"

  scope:
    pipelines: ["customer-sync", "order-analytics", "inventory-update"]
    record_sources: "All input records from source systems"

  targets:
    record_success_rate: 99.5%
    rationale: >
      Batch processing allows for retry and manual remediation.
      0.5% error rate on millions of records is still thousands of
      failures to investigate, so this is reasonably tight.
    job_completion_rate: 99.9%  # Separate SLI for job-level reliability

  retry_policy:
    automatic_retries: 3
    retry_interval: "exponential backoff, max 5 minutes"
    error_after_retries: true  # Only count as error after all retries fail

  error_categorization:
    recoverable:
      - "Transient database connection error"
      - "Temporary API rate limiting"
    permanent:
      - "Schema validation failure"
      - "Missing required field"
      - "Referential integrity violation"

  reporting:
    per_job_metrics:
      - "records_processed"
      - "records_succeeded"
      - "records_failed"
      - "error_rate"
      - "error_by_category"
    daily_aggregate: true
    monthly_slo_report: true
```

Error rate SLIs quantify the proportion of operations that fail, providing a direct measure of system reliability.
What's Next
We've explored user-centric SLIs, availability SLIs, latency SLIs, and error rate SLIs. The final page in this module addresses the critical practical question: Measuring SLIs Accurately—how to ensure your measurement infrastructure captures true reality without gaps, biases, or artifacts.
You now have a comprehensive understanding of error rate SLIs—from classification through practical implementation. You can build error taxonomies, calculate and report error rates accurately, set appropriate targets, design effective alerting, and avoid the pitfalls that make error rate metrics misleading.