Every distributed system fails. The question isn't whether errors will occur, but how many, what kind, and what impact they have on users. Error rate SLIs quantify this fundamental reality, transforming the abstract concept of "system reliability" into concrete, measurable proportions.
While availability SLIs focus on whether the service is working at all, and latency SLIs measure how fast it works, error rate SLIs capture the quality of individual operations. A service might be "available" (responding to requests) yet returning errors for 10% of operations. A service might be "fast" (low latency) yet delivering incorrect data. Error rate SLIs fill this gap, measuring the proportion of operations that succeed in accomplishing their intended purpose.
But what constitutes an "error"? An HTTP 500 is obviously an error. What about an HTTP 400? A timeout? A successful response containing semantically wrong data? The answers to these questions determine whether your error rate SLI reflects true user impact or merely counts technical exceptions.
By the end of this page, you will understand how to classify errors meaningfully, distinguish error severity levels, measure error rates at appropriate granularity, set realistic error rate targets, and design error rate SLIs that genuinely reflect user experience and system health.
Not all errors are created equal. A rigorous error rate SLI requires a taxonomy—a systematic classification of error types and their characteristics.
Dimension 1: Error Origin
Where did the error originate?
Server errors (5xx): The server acknowledged the request but couldn't fulfill it. The fault lies within your system.
Client errors (4xx): The request itself was invalid. But "client's fault" isn't always clear-cut.
Infrastructure errors: Failures before the application sees the request.
| Error Type | Whose Fault? | Count in Error Rate SLI? | Rationale |
|---|---|---|---|
| 500 Internal Server Error | Ours | Yes | Clear system failure |
| 502 Bad Gateway | Ours (dependency) | Yes | User experiences failure regardless of root cause |
| 503 Service Unavailable | Ours | Yes | Capacity/operational issue |
| 504 Gateway Timeout | Ours (or dependency) | Yes | User waited and got nothing |
| 400 Bad Request | Depends | Sometimes | If caused by our client app: yes. If user typo: no |
| 401 Unauthorized | User (or expected) | No | Auth system working correctly |
| 403 Forbidden | User (or expected) | No | Authorization working correctly |
| 404 Not Found | Depends | Sometimes | Broken link = yes. User typo = no |
| 429 Rate Limited | User (or attack) | No (track separately) | Protective behavior, not error |
| Timeout (client-side) | Ours | Yes | User gave up waiting |
| Connection refused | Ours | Yes | Complete service failure |
Dimension 2: Error Impact
How severely does the error affect the user?
Critical: User's primary goal is blocked. Payment failed, data lost, cannot access account.
Major: Significant degradation of experience. Search returns no results for valid query, images don't load, feature broken.
Minor: Noticeable but not blocking. Recommendation widget fails but core product works. Avatar doesn't load.
Cosmetic: Barely perceptible. Analytics pixel fails to load. Non-essential metadata missing.
Dimension 3: Error Persistence
Transient: Temporary failure that may succeed on retry. Network glitch, momentary overload.
Persistent: Failure that will recur until fixed. Bug in code, corrupted data, misconfiguration.
Deterministic: Always fails for specific input. Bug triggered by edge case.
Stochastic: Fails randomly. Race conditions, resource exhaustion.
Understanding these dimensions allows you to design SLIs that weight errors appropriately. A transient minor error in an analytics call should not be weighted the same as a persistent critical error in payment processing.
There's no universally correct error classification. What matters is that your classification is explicit, documented, and reflects user impact. Make the taxonomy a first-class artifact that your team agrees upon and revises as you learn more about your failure modes.
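To make the taxonomy a concrete artifact, one option is to keep it in version control as a small typed table that both reviewers and the SLI pipeline read. A minimal TypeScript sketch, where the error classes, flags, and weights are hypothetical examples rather than a prescribed set:

```typescript
// Hypothetical taxonomy entries: origin, impact, and persistence per error class.
type Origin = 'server' | 'client' | 'infrastructure';
type Impact = 'critical' | 'major' | 'minor' | 'cosmetic';
type Persistence = 'transient' | 'persistent';

interface TaxonomyEntry {
  description: string;
  origin: Origin;
  impact: Impact;
  persistence: Persistence;
  countInSli: boolean; // Does this class burn error budget?
  weight: number;      // Relative weight when aggregating (illustrative values)
}

// Example entries; the real classes and weights are a team decision to revisit.
const ERROR_TAXONOMY: Record<string, TaxonomyEntry> = {
  payment_capture_failed: {
    description: 'Payment authorized but capture failed',
    origin: 'server', impact: 'critical', persistence: 'persistent',
    countInSli: true, weight: 10,
  },
  analytics_beacon_failed: {
    description: 'Analytics pixel did not load',
    origin: 'infrastructure', impact: 'cosmetic', persistence: 'transient',
    countInSli: false, weight: 0,
  },
};
```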
Error rate calculation seems simple—errors divided by total—but the details matter significantly.
The Basic Formula
Error Rate = Errors / Total Operations × 100%
Alternatively, express as Success Rate = (1 - Error Rate):
Success Rate = (Total Operations - Errors) / Total Operations × 100%
Most SLOs are expressed as success rates ("99.9% of requests succeed") rather than error rates ("0.1% of requests fail"). Both convey the same information but success rates are more intuitive for targets.
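A quick worked example of the arithmetic, using made-up traffic numbers:

```typescript
// Hypothetical day of traffic: 1,200 failures out of 1,000,000 operations.
const totalOperations = 1_000_000;
const errors = 1_200;

const errorRate = (errors / totalOperations) * 100; // 0.12%
const successRate = 100 - errorRate;                // 99.88%

console.log(`Error rate: ${errorRate.toFixed(2)}%, success rate: ${successRate.toFixed(2)}%`);
// A 99.9% success-rate target allows only 1,000 errors at this volume,
// so this day would have overspent its share of the budget.
```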
The Numerator: What Counts as an Error?
Based on your error taxonomy, define precisely which responses count:
Errors = HTTP_5xx + Timeouts + Connection_Failures +
Semantic_Errors + (Conditional_4xx per taxonomy)
Document every decision:
```typescript
interface RequestOutcome {
  httpStatus: number;
  latencyMs: number;
  wasTimeout: boolean;
  endpoint: string;
  errorCode?: string; // Application-level error code
  clientType: 'external' | 'internal' | 'synthetic';
}

interface ErrorClassification {
  isError: boolean;
  errorCategory: 'server' | 'client' | 'timeout' | 'semantic' | 'none';
  severity: 'critical' | 'major' | 'minor' | 'cosmetic' | 'none';
  includeInSli: boolean;
  reason: string;
}

function classifyError(outcome: RequestOutcome): ErrorClassification {
  // Timeouts are always errors from user perspective
  if (outcome.wasTimeout) {
    return {
      isError: true,
      errorCategory: 'timeout',
      severity: 'critical',
      includeInSli: true,
      reason: 'User waited and received no response',
    };
  }

  // Server errors (5xx) are always errors
  if (outcome.httpStatus >= 500) {
    return {
      isError: true,
      errorCategory: 'server',
      severity: determineSeverityByEndpoint(outcome.endpoint),
      includeInSli: true,
      reason: `Server error: HTTP ${outcome.httpStatus}`,
    };
  }

  // Rate limiting: track separately, don't count against error rate SLI
  if (outcome.httpStatus === 429) {
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Rate limiting is protective behavior',
    };
  }

  // Auth errors (401, 403): typically not errors unless unexpected
  if (outcome.httpStatus === 401 || outcome.httpStatus === 403) {
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Authentication/authorization working as expected',
    };
  }

  // 400 Bad Request: nuanced classification
  if (outcome.httpStatus === 400) {
    // If from our own client app, might be a bug on our side
    if (outcome.clientType === 'internal') {
      return {
        isError: true,
        errorCategory: 'client',
        severity: 'major',
        includeInSli: true,
        reason: 'Bad request from internal client suggests a bug',
      };
    }
    // External clients: don't count against our SLI
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Bad request from external client',
    };
  }

  // 404: depends on context
  if (outcome.httpStatus === 404) {
    // Check if this endpoint should exist
    if (isKnownEndpoint(outcome.endpoint)) {
      return {
        isError: true,
        errorCategory: 'server',
        severity: 'major',
        includeInSli: true,
        reason: 'Valid endpoint returned 404 - likely a bug or data issue',
      };
    }
    return {
      isError: false,
      errorCategory: 'client',
      severity: 'none',
      includeInSli: false,
      reason: 'Unknown endpoint - likely user error or old link',
    };
  }

  // Semantic errors: HTTP 2xx but application-level failure
  if (outcome.httpStatus >= 200 && outcome.httpStatus < 300) {
    if (outcome.errorCode) {
      return {
        isError: true,
        errorCategory: 'semantic',
        severity: determineSeverityByErrorCode(outcome.errorCode),
        includeInSli: true,
        reason: `Semantic error: ${outcome.errorCode}`,
      };
    }
  }

  // Success
  return {
    isError: false,
    errorCategory: 'none',
    severity: 'none',
    includeInSli: false,
    reason: 'Request succeeded',
  };
}

function calculateErrorRate(outcomes: RequestOutcome[]): {
  totalRequests: number;
  errorCount: number;
  errorRate: number;
  successRate: number;
  byCategory: Record<string, number>;
} {
  let errorCount = 0;
  const byCategory: Record<string, number> = {};

  const eligibleOutcomes = outcomes.filter(
    o => o.clientType !== 'synthetic' // Exclude synthetic monitoring
  );

  for (const outcome of eligibleOutcomes) {
    const classification = classifyError(outcome);
    if (classification.includeInSli && classification.isError) {
      errorCount++;
      byCategory[classification.errorCategory] =
        (byCategory[classification.errorCategory] || 0) + 1;
    }
  }

  const total = eligibleOutcomes.length;
  const errorRate = total > 0 ? (errorCount / total) * 100 : 0;
  const successRate = 100 - errorRate;

  return {
    totalRequests: total,
    errorCount,
    errorRate,
    successRate,
    byCategory,
  };
}
```

The Denominator: What Counts as an Operation?
This mirrors the availability denominator discussion but has error-specific considerations:
Recommended approach: count each logical user operation once (internal retries should not inflate the denominator), and exclude synthetic monitoring traffic and internal health checks so the rate reflects real user operations.
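A minimal sketch of that filtering, assuming each request record carries a source tag and a retry flag (the field names here are hypothetical):

```typescript
interface RequestRecord {
  source: 'external_user' | 'mobile_app' | 'synthetic' | 'health_check';
  isInternalRetry: boolean; // true if this attempt was an automatic retry
  failed: boolean;
}

// Keep only operations that represent real user intent: drop synthetic probes,
// health checks, and automatic retries so one user action counts once.
function sliDenominator(records: RequestRecord[]): RequestRecord[] {
  return records.filter(
    r => r.source !== 'synthetic' &&
         r.source !== 'health_check' &&
         !r.isInternalRetry
  );
}

// Error rate is then computed over the filtered set only.
function errorRateOver(records: RequestRecord[]): number {
  const eligible = sliDenominator(records);
  if (eligible.length === 0) return 0;
  return (eligible.filter(r => r.failed).length / eligible.length) * 100;
}
```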
Should you have one error rate SLI for your entire service, or separate SLIs for each endpoint, feature, or error type? The answer depends on your goals and operational maturity.
The Case for Aggregated Error Rates
A single "overall error rate" for your service:
Advantages:
Disadvantages:
The Case for Disaggregated Error Rates
Separate error rate SLIs per endpoint, feature, or user journey:
Advantages:
Disadvantages:
Composite Error Rate SLIs
A middle ground is a weighted composite that maintains simplicity while accounting for criticality:
Weighted Error Rate = Σ(Error Rate × Weight × Traffic) / Σ(Weight × Traffic)
Where weights reflect business importance; for example, checkout endpoints might carry ten times the weight of a recommendations widget.
This single metric naturally surfaces critical endpoint failures more prominently while still tracking overall health.
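A sketch of computing that composite, assuming per-endpoint error rates, traffic counts, and business weights are already available; the endpoints and weight values below are purely illustrative:

```typescript
interface EndpointStats {
  endpoint: string;
  errorRate: number; // 0-1, e.g., 0.002 = 0.2%
  requests: number;  // traffic in the measurement window
  weight: number;    // business importance
}

// Each endpoint contributes its error rate in proportion to both its traffic
// and its business weight; the result stays between the min and max rates.
function weightedErrorRate(stats: EndpointStats[]): number {
  const numerator = stats.reduce((s, e) => s + e.errorRate * e.weight * e.requests, 0);
  const denominator = stats.reduce((s, e) => s + e.weight * e.requests, 0);
  return denominator > 0 ? numerator / denominator : 0;
}

// Illustrative values: a 1% error rate on low-traffic checkout pulls the
// composite to ~0.55%, versus ~0.18% for a plain traffic-weighted average.
const composite = weightedErrorRate([
  { endpoint: '/checkout', errorRate: 0.01, requests: 50_000, weight: 10 },
  { endpoint: '/recommendations', errorRate: 0.001, requests: 500_000, weight: 1 },
]);
```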
Error Budgets by Category
Another approach is to allocate your error budget across error categories, giving server errors, timeouts, and semantic errors each an explicit share of the total allowance.
If server errors consume the entire budget, you're not meeting SLO even if timeouts are zero.
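As an illustration, a 99.9% SLO (a 0.1% total allowance) might be split across categories like this; the specific split is a hypothetical team decision:

```typescript
// Total error budget for a 99.9% SLO is 0.1% of requests.
const TOTAL_BUDGET = 0.001;

// Hypothetical allocation across categories; the shares sum to the total budget.
const categoryBudgets: Record<string, number> = {
  server_errors: 0.0005,   // 0.05%
  timeouts: 0.0003,        // 0.03%
  semantic_errors: 0.0002, // 0.02%
};

// A category can breach its own allocation even while the overall rate looks
// acceptable, which is exactly the early warning this scheme is meant to give.
function categoriesOverBudget(observedRates: Record<string, number>): string[] {
  return Object.entries(categoryBudgets)
    .filter(([category, budget]) => (observedRates[category] ?? 0) > budget)
    .map(([category]) => category);
}
```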
Low-traffic endpoints have noisy error rates. If an endpoint gets 10 requests/day and 1 fails, that's a 10% error rate—but it's also just 1 failure. Set minimum sample thresholds before calculating meaningful error rates. Consider weekly or monthly windows for infrequent endpoints.
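One way to make that noise visible is to pair a minimum-sample guard with a Wilson score interval, which shows how wide the plausible range for the true error rate is at small counts. A sketch with illustrative thresholds:

```typescript
// Wilson score interval for a proportion: k errors out of n requests, ~95% confidence.
function wilsonInterval(k: number, n: number, z = 1.96): { low: number; high: number } {
  if (n === 0) return { low: 0, high: 1 };
  const p = k / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return { low: Math.max(0, center - margin), high: Math.min(1, center + margin) };
}

// 1 failure out of 10 requests: point estimate 10%, but the interval is roughly
// 2%-40%, far too wide to act on. Require a minimum sample or widen the window.
const MIN_SAMPLES = 100; // illustrative threshold
function meaningfulErrorRate(k: number, n: number): number | null {
  return n >= MIN_SAMPLES ? (k / n) * 100 : null; // null = not enough data yet
}
```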
Some of the most insidious errors return HTTP 200 but deliver incorrect, incomplete, or corrupted results. These semantic errors evade simple status-code-based detection and can cause significant user harm before being noticed.
Examples of Semantic Errors
Data correctness errors:
Completeness errors:
Consistency errors:
Logic errors:
Detecting Semantic Errors
Semantic errors require domain-aware detection—you must understand what "correct" looks like:
Approach 1: Response Validation. Validate response structure and content.
Approach 2: Synthetic Validation. Regularly execute known-answer queries and verify the responses match expectations.
Approach 3: Cross-Reference Verification. Compare data across sources and flag disagreements.
Approach 4: User Feedback Integration. Treat certain user behaviors (immediate retries, error reports, support tickets) as error signals.
```python
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import json


@dataclass
class SemanticValidationResult:
    is_valid: bool
    errors: List[str]
    warnings: List[str]


class SemanticValidator:
    """
    Validates responses for semantic correctness beyond HTTP status codes.
    """

    def validate_search_response(
        self, query: str, response: Dict[str, Any]
    ) -> SemanticValidationResult:
        """Validate search results are semantically correct."""
        errors = []
        warnings = []

        results = response.get('results', [])
        query_terms = query.lower().split()

        # Check: Results should be relevant to query
        if results and len(query_terms) > 0:
            relevance_scores = []
            for result in results[:10]:  # Check top 10
                title = result.get('title', '').lower()
                description = result.get('description', '').lower()
                content = f"{title} {description}"
                term_matches = sum(1 for term in query_terms if term in content)
                relevance_scores.append(term_matches / len(query_terms))

            avg_relevance = sum(relevance_scores) / len(relevance_scores)
            if avg_relevance < 0.2:  # Less than 20% term coverage
                errors.append(
                    f"Search results appear irrelevant: avg_relevance={avg_relevance:.2f}"
                )

        # Check: Result count should match total claimed
        claimed_total = response.get('total_results', 0)
        if len(results) > claimed_total:
            errors.append(
                f"Result count mismatch: returned {len(results)} but claimed total {claimed_total}"
            )

        # Check: Results should have required fields
        for i, result in enumerate(results):
            if not result.get('id'):
                errors.append(f"Result {i} missing 'id' field")
            if not result.get('title'):
                warnings.append(f"Result {i} missing 'title' field")

        return SemanticValidationResult(
            is_valid=len(errors) == 0,
            errors=errors,
            warnings=warnings,
        )

    def validate_order_response(
        self, order_request: Dict[str, Any], response: Dict[str, Any]
    ) -> SemanticValidationResult:
        """Validate order confirmation is semantically correct."""
        errors = []
        warnings = []

        # Check: Order ID should be returned
        if not response.get('order_id'):
            errors.append("Order confirmation missing order_id")

        # Check: Totals should be consistent
        items = response.get('items', [])
        calculated_subtotal = sum(
            item.get('price', 0) * item.get('quantity', 0) for item in items
        )
        claimed_subtotal = response.get('subtotal', 0)
        if abs(calculated_subtotal - claimed_subtotal) > 0.01:
            errors.append(
                f"Subtotal mismatch: calculated {calculated_subtotal}, claimed {claimed_subtotal}"
            )

        # Check: Total = Subtotal + Tax + Shipping - Discounts
        claimed_total = response.get('total', 0)
        expected_total = (
            claimed_subtotal
            + response.get('tax', 0)
            + response.get('shipping', 0)
            - response.get('discount', 0)
        )
        if abs(expected_total - claimed_total) > 0.01:
            errors.append(
                f"Total calculation error: expected {expected_total}, got {claimed_total}"
            )

        # Check: Items match request
        requested_items = {item['sku']: item['quantity']
                           for item in order_request.get('items', [])}
        confirmed_items = {item['sku']: item['quantity'] for item in items}
        if requested_items != confirmed_items:
            errors.append("Confirmed items don't match requested items")

        return SemanticValidationResult(
            is_valid=len(errors) == 0,
            errors=errors,
            warnings=warnings,
        )


# Usage in request pipeline
validator = SemanticValidator()


def track_semantic_error_rate(response, operation_type, request_data=None):
    """
    Validate response and track semantic error rate.
    """
    if operation_type == 'search':
        result = validator.validate_search_response(
            request_data.get('query', ''), response
        )
    elif operation_type == 'order':
        result = validator.validate_order_response(request_data, response)
    else:
        return  # No validation for this operation type

    if not result.is_valid:
        # Increment semantic error counter
        metrics.increment('semantic_errors', tags={'operation': operation_type})
        # Log for investigation
        logger.error(f"Semantic validation failed: {result.errors}",
                     extra={'response': response, 'errors': result.errors})
```

Full semantic validation of every response may be computationally prohibitive. Consider sampling strategies: validate 1% of responses thoroughly, validate all responses from specific critical endpoints, or validate based on risk signals (new code paths, unusual inputs).
What error rate should you target? This depends on the nature of the operation, user expectations, and business impact.
Framework for Target Selection
Consider user tolerance: How forgiving are users of this particular failure?
Consider failure impact:
| Operation Type | Suggested Target | Example | Rationale |
|---|---|---|---|
| Financial transactions | 99.99%+ success | Payment, withdrawal | Every failure is money-related; user trust critical |
| Data mutation (writes) | 99.95%+ success | Save, update, delete | Failed writes may cause data loss or inconsistency |
| Authentication | 99.9%+ success | Login, session | Failed auth blocks all functionality |
| Core read operations | 99.9%+ success | Product page, user profile | Primary user journeys must work |
| Search/discovery | 99.5%+ success | Search, recommendations | Degradation acceptable; user can browse manually |
| Background operations | 99%+ success | Analytics, sync | Failures retried; user doesn't see directly |
| Nice-to-have features | 95%+ success | Social badges, achievements | Enhances but not required for core experience |
Baseline Then Improve
If you're establishing error rate SLIs for the first time, don't guess at targets: measure your actual error rate over several weeks, set an initial target slightly tighter than that observed baseline, and ratchet it up as reliability work lands.
Avoid unrealistic targets: A target you can never meet demoralizes the team. A target you always meet doesn't drive improvement. The sweet spot: achievable most of the time, occasionally breached, driving continuous attention.
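A sketch of deriving an initial target from observed history rather than guessing; the data and the percentile choice are illustrative:

```typescript
// Daily success rates observed over a baseline period (hypothetical data).
const dailySuccessRates = [99.91, 99.87, 99.95, 99.72, 99.93, 99.89, 99.94];

// Take a low percentile of observed days so the target is met most of the time
// but still occasionally breached, keeping attention on reliability.
function baselineTarget(observed: number[], percentile = 0.1): number {
  const sorted = [...observed].sort((a, b) => a - b);
  const idx = Math.floor(percentile * (sorted.length - 1));
  return sorted[idx];
}

const initialTarget = baselineTarget(dailySuccessRates); // 99.72 for this data
// Publish something close to this (e.g., 99.7%), then ratchet it upward as the
// observed baseline improves.
```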
Multi-Tier Targets
Like latency SLIs, consider multi-tier error rate targets: for example, a warning tier at a 1% error rate and a critical tier at 5%.
This provides a graduated response: elevated attention when the warning tier is breached, an emergency response when the critical tier is.
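A small sketch of those tiers in code, mirroring the 1% / 5% example above; the exact levels are a per-service choice:

```typescript
type Tier = 'ok' | 'degraded' | 'critical';

// Illustrative tiers: below 1% error rate is normal, 1-5% needs attention,
// 5% and above is an emergency.
function classifyErrorRateTier(errorRatePercent: number): Tier {
  if (errorRatePercent >= 5) return 'critical';
  if (errorRatePercent >= 1) return 'degraded';
  return 'ok';
}

// Graduated response: 'degraded' might open a ticket or page during business
// hours, while 'critical' pages the on-call immediately.
```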
A useful heuristic: for each step up in target strictness (from 99% to 99.9% to 99.99%), effort increases by approximately 10x. If maintaining 99% is achievable with basic practices, 99.9% requires robust monitoring and on-call, and 99.99% requires sophisticated automation, redundancy, and chaos engineering.
Effective error rate monitoring balances sensitivity (catching problems quickly) with specificity (avoiding alert noise).
Alert Strategy for Error Rates
The problem with naive thresholds: A fixed threshold like "alert if error rate > 1%" has issues: it ignores traffic volume (one failure in fifty requests trips it, while a real outage at 0.9% does not), it treats a brief blip the same as a sustained failure, and it says nothing about how quickly you are consuming your error budget.
Better approaches:
1. Error Budget Burn Rate Alerting
Alert based on how quickly you're consuming your monthly error budget.
Alert when burn rate exceeds thresholds (e.g., 10x for 5 minutes, 2x for 1 hour).
```typescript
interface BurnRateAlert {
  shouldAlert: boolean;
  burnRate: number;
  currentErrorRate: number;
  budgetRemaining: number;
  hoursUntilBudgetExhausted: number;
  severity: 'critical' | 'warning' | 'none';
}

function evaluateErrorBudgetBurning(
  currentErrorRate: number,         // Current error rate (0-1, e.g., 0.02 = 2%)
  sloTarget: number,                // SLO target (0-1, e.g., 0.999 = 99.9%)
  windowHours: number,              // Current measurement window
  budgetConsumedSoFar: number,      // Portion of monthly budget already used (0-1)
  monthlyBudgetHours: number = 720, // Hours in SLO period (30 days)
): BurnRateAlert {
  // Error budget = allowed error rate = 1 - SLO
  const allowedErrorRate = 1 - sloTarget; // e.g., 0.001 for 99.9% SLO

  // Burn rate = (current error rate) / (allowed error rate)
  // Burn rate of 1 = using budget at exact expected pace
  // Burn rate of 10 = burning 10x faster than sustainable
  const burnRate = currentErrorRate / allowedErrorRate;

  // Budget remaining
  const budgetRemaining = 1 - budgetConsumedSoFar;

  // At current burn rate, how long until budget exhausted?
  // If burn rate is 10 and we have 30 days budget, exhausted in 3 days
  const hoursUntilExhausted = burnRate > 0
    ? (budgetRemaining * monthlyBudgetHours) / burnRate
    : Infinity;

  // Alert thresholds based on Google SRE multi-window approach
  let shouldAlert = false;
  let severity: 'critical' | 'warning' | 'none' = 'none';

  // Critical: Burning very fast, will exhaust budget soon
  if (burnRate >= 14.4 && windowHours >= 1 / 60) { // 14.4x for 1 minute
    shouldAlert = true;
    severity = 'critical';
  } else if (burnRate >= 6 && windowHours >= 0.25) { // 6x for 15 minutes
    shouldAlert = true;
    severity = 'critical';
  }
  // Warning: Elevated burn rate
  else if (burnRate >= 3 && windowHours >= 1) { // 3x for 1 hour
    shouldAlert = true;
    severity = 'warning';
  } else if (burnRate >= 1 && windowHours >= 6) { // 1x for 6 hours
    shouldAlert = true;
    severity = 'warning';
  }

  return {
    shouldAlert,
    burnRate,
    currentErrorRate,
    budgetRemaining,
    hoursUntilBudgetExhausted: hoursUntilExhausted,
    severity,
  };
}

// Example usage
const result = evaluateErrorBudgetBurning(
  0.02,  // 2% current error rate
  0.999, // 99.9% SLO target
  0.25,  // 15-minute window
  0.3,   // 30% of monthly budget already used
);

if (result.shouldAlert) {
  console.log(`ALERT (${result.severity}): Burn rate ${result.burnRate.toFixed(1)}x`);
  console.log(`Budget exhausted in ${result.hoursUntilBudgetExhausted.toFixed(1)} hours`);
}
```

2. Minimum Sample Size
Don't alert on low-sample windows:
```python
if request_count < 100:
    skip_alert()  # Insufficient data
else:
    evaluate_error_rate()
```
3. Error Rate Change Detection
Alert on significant changes from baseline, not just absolute thresholds.
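A rough sketch of baseline-relative detection, comparing the current window against a trailing baseline instead of a fixed threshold; the multiplier and minimum delta are illustrative:

```typescript
// Compare the current window's error rate with a trailing baseline of windows.
function isSignificantIncrease(
  currentRate: number,     // e.g., 0.8 (% errors) over the last 15 minutes
  baselineRates: number[], // same-size windows over the previous day or week
  multiplier = 3           // illustrative: alert when 3x above baseline
): boolean {
  if (baselineRates.length === 0) return false;
  const baseline = baselineRates.reduce((s, r) => s + r, 0) / baselineRates.length;
  // Guard against alerting on tiny absolute changes when the baseline is near zero.
  const minAbsoluteDelta = 0.1; // percentage points, illustrative
  return currentRate > baseline * multiplier &&
         currentRate - baseline > minAbsoluteDelta;
}
```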
4. Error Clustering Detection
Watch for error concentration that might not breach overall thresholds: all failures hitting a single customer, endpoint, region, or client version.
These might be small percentages overall but warrant immediate attention.
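A sketch of clustering detection: group failures by one dimension (endpoint, region, customer, client version) and flag any group whose local error rate is far above the overall rate. The field names and thresholds are hypothetical:

```typescript
interface TaggedOutcome {
  failed: boolean;
  dimensionValue: string; // e.g., endpoint, region, customer ID, client version
}

interface Cluster {
  dimensionValue: string;
  requests: number;
  errorRate: number; // percent
}

// Flag groups whose error rate is far above the overall rate, subject to a
// minimum sample size so tiny groups don't create noise.
function findErrorClusters(
  outcomes: TaggedOutcome[],
  minRequests = 50,   // illustrative
  rateMultiplier = 10 // illustrative: 10x the overall error rate
): Cluster[] {
  const overall =
    (outcomes.filter(o => o.failed).length / Math.max(outcomes.length, 1)) * 100;

  const groups = new Map<string, { total: number; failed: number }>();
  for (const o of outcomes) {
    const g = groups.get(o.dimensionValue) ?? { total: 0, failed: 0 };
    g.total++;
    if (o.failed) g.failed++;
    groups.set(o.dimensionValue, g);
  }

  return [...groups.entries()]
    .map(([dimensionValue, g]) => ({
      dimensionValue,
      requests: g.total,
      errorRate: (g.failed / g.total) * 100,
    }))
    .filter(c => c.requests >= minRequests && c.errorRate > overall * rateMultiplier);
}
```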
Dashboard Best Practices
Google's SRE practices recommend multi-window alerting: a short window with high burn rate threshold catches fast-burning incidents, while a longer window with lower threshold catches slow leaks. For example: alert if burn rate > 14x for 2 minutes OR burn rate > 6x for 15 minutes OR burn rate > 3x for 1 hour. This balances speed of detection with noise reduction.
Let's examine concrete error rate SLI specifications for real-world scenarios.
Example 1: API Service Error Rate
```yaml
# Error Rate SLI Specification: Customer-Facing API
sli:
  name: "API Success Rate"
  description: "Proportion of API requests that succeed"

  success_definition:
    http_status:
      success: [200, 201, 204, 304]    # Standard success codes
      error: [500, 501, 502, 503, 504] # Server errors
      excluded:
        - 429  # Rate limiting (tracked separately)
        - 401  # Authentication expected behavior
        - 403  # Authorization expected behavior
        - 404  # Only counted as error for known resource paths
    additional_error_conditions:
      - "Request timeout (client or server)"
      - "Connection reset before response"
      - "Response body contains 'error' with non-null value"

  scope:
    endpoints: "/api/v*/**"
    traffic_sources:
      include: ["external_users", "mobile_app", "web_app"]
      exclude: ["synthetic_monitoring", "internal_services"]

  target:
    success_rate: 99.9%
    error_rate: 0.1%
    budget_period: "30 days rolling"

  alerting:
    burn_rate_alerts:
      - window: "5 minutes"
        burn_rate_threshold: 14.4
        severity: "critical"
      - window: "1 hour"
        burn_rate_threshold: 6
        severity: "critical"
      - window: "6 hours"
        burn_rate_threshold: 3
        severity: "warning"
    minimum_request_count: 100

  reporting:
    dashboards: ["service-health", "slo-compliance"]
    weekly_review: true
    monthly_report: true
```

Example 2: Transaction Processing Error Rate
```yaml
# Error Rate SLI Specification: Payment Transactions
sli:
  name: "Transaction Success Rate"
  description: "Proportion of payment transactions that complete successfully"

  success_definition:
    description: >
      A transaction is successful if the payment is captured and the
      order is recorded in our system. Customer payment method issues
      (insufficient funds, card declined by bank) are not counted as
      system errors.
    success_criteria:
      - "Payment captured (payment_status = 'captured')"
      - "Order created in database"
      - "Confirmation event dispatched"
    error_criteria:
      - "HTTP 5xx from payment service"
      - "Timeout waiting for payment processor"
      - "Order creation failed after successful payment capture"
      - "Inconsistent state (payment captured but order missing)"
    excluded_from_denominator:
      - "Card declined by issuing bank (customer issue)"
      - "Invalid card number (customer input error)"
      - "Fraud detection block (intentional protection)"
      - "3DS authentication failed (customer issue)"

  scope:
    events: "payment_initiated"
    payment_types: ["credit_card", "debit_card", "bank_transfer"]
    excludes: ["test_transactions", "internal_orders"]

  target:
    success_rate: 99.99%  # Very high target for financial operations
    error_rate: 0.01%

  semantic_validation:
    enabled: true
    checks:
      - name: "Amount consistency"
        rule: "captured_amount == authorized_amount"
      - name: "Currency consistency"
        rule: "captured_currency == order_currency"
      - name: "Order linkage"
        rule: "order_id in orders_database"

  alerting:
    immediate_alerts:
      - condition: "Any transaction in inconsistent state"
        severity: "critical"
        notification: ["payments-oncall", "finance-team"]
    rate_based:
      - error_rate_threshold: 0.1%
        window: "5 minutes"
        severity: "critical"
```

Example 3: Batch Processing Error Rate
```yaml
# Error Rate SLI Specification: Nightly Data Pipeline
sli:
  name: "Data Pipeline Record Processing Success"
  description: "Proportion of records successfully processed through the pipeline"

  # For batch processing, we measure at record level, not job level
  # A job can succeed overall while having some record-level failures
  success_definition:
    at_record_level:
      success: "Record transformed and loaded to destination"
      error: "Record rejected or failed to load"
      excluded: "Record filtered by business rules (intentional skip)"
    at_job_level:
      # Job success doesn't affect SLI, but tracked separately
      job_success: "All records processed (success + error + excluded = input)"
      job_failure: "Job crashed or timed out before completing"

  scope:
    pipelines: ["customer-sync", "order-analytics", "inventory-update"]
    record_sources: "All input records from source systems"

  targets:
    record_success_rate: 99.5%
    rationale: >
      Batch processing allows for retry and manual remediation.
      0.5% error rate on millions of records is still thousands of
      failures to investigate, so this is reasonably tight.
    job_completion_rate: 99.9%  # Separate SLI for job-level reliability

  retry_policy:
    automatic_retries: 3
    retry_interval: "exponential backoff, max 5 minutes"
    error_after_retries: true  # Only count as error after all retries fail

  error_categorization:
    recoverable:
      - "Transient database connection error"
      - "Temporary API rate limiting"
    permanent:
      - "Schema validation failure"
      - "Missing required field"
      - "Referential integrity violation"

  reporting:
    per_job_metrics:
      - "records_processed"
      - "records_succeeded"
      - "records_failed"
      - "error_rate"
      - "error_by_category"
    daily_aggregate: true
    monthly_slo_report: true
```

Error rate SLIs quantify the proportion of operations that fail, providing a direct measure of system reliability.
What's Next
We've explored user-centric SLIs, availability SLIs, latency SLIs, and error rate SLIs. The final page in this module addresses the critical practical question: Measuring SLIs Accurately—how to ensure your measurement infrastructure captures true reality without gaps, biases, or artifacts.
You now have a comprehensive understanding of error rate SLIs—from classification through practical implementation. You can build error taxonomies, calculate and report error rates accurately, set appropriate targets, design effective alerting, and avoid the pitfalls that make error rate metrics misleading.