In the realm of Site Reliability Engineering, a fundamental truth often gets obscured by the noise of infrastructure metrics, system dashboards, and technical alerts: reliability is ultimately defined by the user's experience, not by the health of your servers.
Consider a scenario that plays out daily across organizations worldwide: your monitoring dashboards glow green, CPU utilization hovers at a comfortable 40%, memory consumption is well within bounds, network latency between your internal services measures in single-digit milliseconds, and your infrastructure team reports zero incidents. Yet, your customer support channels are flooded with complaints. Users report the application as "slow," "broken," or "unusable." How can this be?
The disconnect stems from a fundamental misalignment between what we measure and what users actually experience. Traditional infrastructure metrics tell you about the health of your systems. User-centric SLIs tell you about the quality of the experience you deliver. These are not the same thing, and conflating them is one of the most common—and costly—mistakes in reliability engineering.
By the end of this page, you will understand why user-centric SLIs form the foundation of effective reliability practice. You'll learn frameworks for identifying what truly matters to users, techniques for translating subjective experience into objective measurements, and strategies for avoiding the common pitfalls that lead to meaningless metrics. Most importantly, you'll gain the mindset shift necessary to view your systems through your users' eyes.
Before diving into specific techniques, we must establish the philosophical foundation that underpins user-centric SLIs. This isn't merely a technical exercise—it represents a fundamental shift in how we conceptualize and measure reliability.
The Traditional Approach: System-Centric Metrics
Historically, operations teams focused on metrics that were easy to collect and directly observable from infrastructure: CPU utilization, memory consumption, disk I/O, network throughput, and host uptime.
These metrics have value—they're essential for capacity planning, debugging, and understanding system behavior. However, they suffer from a critical limitation: they measure means, not ends.
Your infrastructure exists to serve users. Low CPU usage is not an end goal—it's a means to ensuring requests get processed quickly. High network bandwidth is not valuable in itself—it matters only insofar as it enables fast data delivery to users. When we mistake means for ends, we optimize for the wrong outcomes.
The User-Centric Paradigm Shift
User-centric SLIs flip this model on its head. Instead of asking "How healthy are our systems?" we ask questions like: Can users accomplish what they came to do? How long do they have to wait? How often do their attempts fail?
This shift has profound implications. It means our SLIs must be derived from user journeys, not system topologies. It means we need to understand user expectations and translate them into measurable quantities. And it means accepting that a "healthy" system causing poor user experience is, by definition, not actually healthy.
The Observability Chain
Think of reliability observability as a chain with multiple links: physical resources support system components, system components support services, services support user-facing interactions, and those interactions produce the outcomes users actually care about.
Each link in this chain gets us closer to what ultimately matters. User-centric SLIs position us at the right end of this chain—measuring outcomes rather than intermediaries.
| Dimension | System-Centric Approach | User-Centric Approach |
|---|---|---|
| Primary Question | "Is our infrastructure healthy?" | "Are users having a good experience?" |
| Measurement Point | Internal system boundaries | User-facing interaction points |
| Success Definition | Systems operating within resource limits | Users accomplishing intended goals |
| Alert Trigger | Resource threshold breached | User experience degradation detected |
| Failure Response | "Fix the server" | "Restore user experience" |
| Stakeholder Alignment | Primarily engineering-focused | Business and engineering aligned |
Effective user-centric SLIs begin with a deep understanding of how users interact with your system. This requires mapping user journeys—the sequences of interactions users perform to accomplish their goals.
What Is a User Journey?
A user journey is the complete sequence of steps a user takes to achieve a specific outcome. It encompasses every touchpoint, from initial intent through final confirmation. Consider an e-commerce checkout: the user reviews their cart, enters shipping details, provides payment information, confirms the order, and waits for confirmation that the purchase went through.
Each step in this journey represents an opportunity for the system to succeed or fail from the user's perspective. A user-centric SLI might measure the end-to-end success rate of checkout completions—not just whether individual API endpoints responded.
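To make that concrete, here is a minimal sketch in TypeScript of treating the whole journey as the unit of success rather than individual API endpoints. The shapes and helper names are illustrative, not a standard.

```typescript
// A user journey modeled as data so success can be judged end to end.
interface JourneyStep {
  name: string;        // e.g. "enter_payment"
  succeeded: boolean;
  durationMs: number;
}

interface CheckoutJourney {
  userId: string;
  steps: JourneyStep[];
}

// The checkout counts as successful only if the user completed every step.
export function checkoutSucceeded(journey: CheckoutJourney): boolean {
  return journey.steps.length > 0 && journey.steps.every((s) => s.succeeded);
}

// End-to-end duration as the user experienced it, not per-endpoint timings.
export function checkoutDurationMs(journey: CheckoutJourney): number {
  return journey.steps.reduce((sum, s) => sum + s.durationMs, 0);
}
```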
User journey mapping shouldn't be done by engineering in isolation. Product managers, UX researchers, customer support teams, and actual users all provide essential perspectives. The goal is to understand what users are trying to accomplish and what they perceive as success or failure—not what your system logs indicate.
Identifying Critical User Journeys
Not all user journeys carry equal weight. To prioritize SLI development, categorize journeys by their criticality:
Tier 1 - Revenue-Critical Journeys. These directly impact revenue or core business operations. For an e-commerce platform: search, product viewing, checkout, payment processing. For a SaaS product: core workflow completion, data export, API access for paying customers.
Tier 2 - Experience-Critical Journeys. These significantly impact user satisfaction and retention but may not directly generate revenue. Examples: account settings modification, preference management, notification delivery, report generation.
Tier 3 - Auxiliary Journeys. These support the overall product but are non-essential to core value delivery. Examples: social features, gamification elements, optional integrations, cosmetic customizations.
Your SLI strategy should ensure Tier 1 journeys have the most rigorous measurement, with coverage decreasing through lower tiers.
From Journeys to Measurements
Once you've identified critical user journeys, the next step is translating them into measurable quantities. This requires defining where the journey starts and ends, what counts as success from the user's perspective, which failure modes matter, and where the measurement will be taken.
For the e-commerce checkout example: the journey starts when the user initiates checkout and ends when they see an order confirmation; success means the confirmation appears within a few seconds of final submission; failures include payment errors, timeouts, and confirmations that never render.
Understanding how users perceive reliability is essential for choosing SLIs that genuinely reflect experience quality. User perception doesn't perfectly correlate with objective measurements—humans have psychological biases, varying expectations, and contextual interpretations that must be accounted for.
The Psychology of Waiting
Research in human-computer interaction has established that users' perception of time is heavily influenced by context: uncertain waits feel longer than known waits, unoccupied waits feel longer than waits with visible progress, and unexplained delays feel longer than delays the interface acknowledges.
These psychological factors mean that a 3-second operation with good feedback may feel faster than a 2-second operation with no feedback. Your SLIs should account for this—measuring not just raw latency but the quality of the waiting experience.
| Duration | User Perception | SLI Implication |
|---|---|---|
| < 100ms | Instantaneous; feels like direct manipulation | Target for simple interactions (button clicks, toggles) |
| 100-300ms | Slight delay; still feels responsive | Acceptable for most UI interactions |
| 300ms-1s | Noticeable delay; user aware of processing | Typical target for page loads, search results |
| 1-5s | Significant delay; requires progress feedback | Acceptable only with loading indicators |
| 5-10s | Uncomfortable; users question if it's working | Should be exceptional; requires explanation |
| > 10s | Unacceptable for interactive tasks | Indicates architectural problem or should be async |
User Expectations Vary by Context
User tolerance for latency and failure depends heavily on context:
High Stakes = Low Tolerance. When users are performing critical actions—financial transactions, medical data access, legal document submission—their tolerance for errors and delays plummets. An error during payment processing is vastly more frustrating than an error loading a news article. Your SLIs should reflect this by having stricter targets for high-stakes journeys.
Frequency Affects Frustration. An interaction that's slow once is an annoyance. An interaction that's slow every time becomes infuriating. If users perform an action 50 times per day, even small latency issues compound into significant productivity loss. Frequent interactions often warrant tighter latency SLIs than one-time actions.
Alternatives Change Expectations. User expectations are shaped by competitive alternatives. If competing products complete the same action in 200ms, your 2-second latency—even if technically acceptable—becomes a competitive disadvantage. Benchmark SLIs against competitor performance when possible.
Prior Experience Sets Baselines. User satisfaction is relative to their expectations. If your service has historically responded in 500ms, users adapt to this baseline. Improvements to 200ms delight users; regressions to 800ms frustrate them. SLIs should protect against regression from established baselines, not just breach of absolute thresholds.
Psychological research shows that humans remember experiences based on their peak intensity and their ending, not their average. A single catastrophic failure in an otherwise good session will dominate user memory. Similarly, ending an interaction poorly (slow confirmation, unclear success state) disproportionately colors the entire experience. Shape your SLIs to protect critical moments and endings, not just average performance.
The bridge between subjective user experience and objective SLIs requires careful translation. This section provides frameworks for converting qualitative user expectations into quantitative measurements.
The User Story to SLI Translation Process
Start with user stories that express expectations in natural language, then progressively refine them into measurable indicators:
Stage 1: Qualitative User Statement
"As a customer, I want my search results to appear quickly so I can find products efficiently."
Stage 2: Quantifiable User Expectation
"Search results should feel instant—I shouldn't notice waiting."
Stage 3: Technical Translation
Based on perception research, "feel instant" maps to < 300ms for the first meaningful results.
Stage 4: Measurement Definition
Search latency SLI: Time from user query submission to first 10 results rendered in user's browser.
Stage 5: SLI Specification
95th percentile search latency < 300ms, measured client-side from query keypress to first result render.
Notice how each stage adds precision while maintaining connection to the original user need. The final SLI can be measured objectively, yet it directly traces back to user experience goals.
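As a rough illustration of how the Stage 5 specification might be instrumented in the browser, the sketch below timestamps the query submission and the first result render. The `reportMetric` helper and the `/rum/metrics` endpoint are assumptions standing in for whatever RUM transport you actually use.

```typescript
// Sketch: client-side search latency, measured from query submission to the
// first rendered results. Names and the beacon endpoint are placeholders.
let searchStartMs: number | null = null;

// Call when the user submits the query (Enter keypress or button click).
export function onSearchSubmitted(): void {
  searchStartMs = performance.now();
}

// Call once the first batch of results is visible in the DOM.
export function onFirstResultsRendered(resultCount: number): void {
  if (searchStartMs === null) return;
  const latencyMs = performance.now() - searchStartMs;
  searchStartMs = null;

  // Ship the raw sample; the p95 < 300ms target is evaluated in aggregation.
  reportMetric("search_latency_ms", latencyMs, { resultCount });
}

// Placeholder transport: swap in your RUM vendor's SDK if you have one.
function reportMetric(name: string, value: number, attributes: Record<string, unknown>): void {
  navigator.sendBeacon("/rum/metrics", JSON.stringify({ name, value, attributes }));
}
```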
The Four Golden Signals for User Experience
Google's Site Reliability Engineering book popularized the "Four Golden Signals" for monitoring. Let's reframe these from a user-centric perspective:
1. Latency (User-Centric Framing: Responsiveness). Not just "how long do requests take?" but "how long do users wait for meaningful responses?" This means:
- Measuring end to end, from user action to usable content on screen, not just server-side processing time
- Distinguishing time to first feedback from time to completion, since perceived responsiveness depends on both
2. Traffic (User-Centric Framing: User Activity). Not just "requests per second" but "how many users are successfully engaging?" This means:
- Counting active users and completed journeys rather than raw request volume
- Treating a drop in user activity as a signal, even when request counts still look normal
3. Errors (User-Centric Framing: Success Rate). Not just "what percentage of HTTP requests return 5xx?" but "what percentage of user attempts succeed?" This means:
- Counting client-side errors, timeouts, and empty or wrong results as failures, not only server 5xx responses
- Measuring success at the journey level, so a checkout that fails at its final step counts as a failed checkout (see the sketch after this list)
4. Saturation (User-Centric Framing: Capacity Headroom). Not just "how utilized are our resources?" but "how close are we to degrading user experience?" This means:
- Tracking how latency and error rates respond as load grows, rather than utilization alone
- Defining headroom as the distance to the point where user-facing SLIs begin to degrade
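Here is a small sketch of the user-centric "errors" framing, under the assumption that each user attempt at a journey has already been labeled as succeeded or failed (for example, from RUM events). The types and names are illustrative.

```typescript
// Success rate computed over user journey attempts, not over HTTP responses.
interface JourneyAttempt {
  journey: string;     // e.g. "checkout"
  succeeded: boolean;  // did the user reach their goal?
}

export function successRatePercent(attempts: JourneyAttempt[]): number {
  if (attempts.length === 0) return 100; // no attempts, nothing failed
  const ok = attempts.filter((a) => a.succeeded).length;
  return (ok / attempts.length) * 100;
}

// A timed-out checkout or a client-side crash counts as a failed attempt here,
// even if every backend request technically returned 200.
```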
The location where you measure dramatically affects what your SLI captures. The closer you measure to the user, the more accurately your SLI reflects true user experience—but client-side measurement also introduces complexity. Understanding this tradeoff is essential.
The Measurement Spectrum
Imagine a typical web request flowing through your infrastructure:
User Browser → CDN → Load Balancer → API Gateway → Application Server → Database
Each point along this path offers a potential measurement location, with different tradeoffs:
| Measurement Point | What It Captures | What It Misses | Best For |
|---|---|---|---|
| Database Layer | Query execution time, data availability | All network hops, rendering, client issues | Database-specific SLIs only |
| Application Server | Business logic execution, server-side latency | Network transit, CDN behavior, client rendering | Server-side error rates, processing latency |
| Load Balancer | Backend availability, basic latency | Last-mile network, client behavior | Infrastructure availability SLIs |
| CDN Edge | Edge-to-origin latency, cache behavior | Last-mile network, client rendering | Content delivery SLIs, cache hit rates |
| Client-Side (RUM) | Full user experience including render time | Inconsistent data, sampling challenges | True user experience SLIs |
| Synthetic Monitoring | Consistent baseline from known locations | Real user variability, scale issues | Baseline availability, comparison testing |
Real User Monitoring (RUM) provides the most accurate picture of user experience but requires careful implementation. The gold standard is to measure client-side for primary SLIs while using server-side metrics for diagnostics. When client-side measurement isn't feasible, measure as close to the edge as possible and explicitly acknowledge the gap between your measurement and true user experience.
Real User Monitoring (RUM) Deep Dive
Real User Monitoring instruments actual user browsers to capture performance data as users experience it. This provides unparalleled accuracy but introduces implementation challenges:
Advantages of RUM for SLIs:
- Captures what users actually experienced, including last-mile network conditions, device performance, and client-side rendering
- Reflects the real diversity of browsers, devices, and geographies in your user base
- Surfaces problems that never appear in server-side metrics, such as JavaScript errors and slow rendering
Challenges with RUM:
- Data is noisy and highly variable, requiring careful sampling and percentile-based analysis
- Collection depends on users generating traffic, so coverage thins out exactly when an outage keeps users away
- Browser instrumentation can be blocked by ad blockers or privacy settings and adds a small client-side cost
Key RUM Metrics for SLIs:
- Largest Contentful Paint (LCP): how quickly the main content becomes visible
- Interaction to Next Paint (INP): how quickly the page responds to user input
- Cumulative Layout Shift (CLS): how visually stable the page is while loading
- Time to First Byte (TTFB): how quickly the server begins responding
These Web Vitals metrics map directly to user experience goals and should form the basis of user-centric SLIs.
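A minimal RUM sketch using the open-source web-vitals package is shown below. The callback names match recent versions of the library, and the `/rum/vitals` endpoint is a placeholder for your own ingestion path, so verify both against your setup.

```typescript
// Report Core Web Vitals from real user sessions to a hypothetical endpoint.
import { onLCP, onINP, onCLS } from "web-vitals";

function sendToAnalytics(metric: { name: string; value: number; id: string }): void {
  const body = JSON.stringify({ name: metric.name, value: metric.value, id: metric.id });
  // sendBeacon survives page unloads, which matters for late-reported metrics.
  navigator.sendBeacon("/rum/vitals", body);
}

onLCP(sendToAnalytics); // Largest Contentful Paint: perceived load speed
onINP(sendToAnalytics); // Interaction to Next Paint: responsiveness
onCLS(sendToAnalytics); // Cumulative Layout Shift: visual stability
```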
Synthetic Monitoring as a Complement
While RUM captures real user experience, synthetic monitoring provides consistent baselines: scripted probes run critical journeys from known locations at fixed intervals, giving you comparable measurements around the clock, even when real traffic is low or absent.
Best practice: Use synthetic monitoring to detect availability issues and regressions quickly, and use RUM to understand true user experience and set SLO targets.
Even with the best intentions, teams often fall into patterns that undermine the user-centricity of their SLIs. Recognizing these anti-patterns helps you avoid them.
Anti-Pattern 1: Measuring Averages Instead of Percentiles
Average latency is deeply misleading. Consider two scenarios: in Scenario A, 99% of requests complete in 100ms while 1% take 10 seconds; in Scenario B, every request completes in 199ms.
Both have identical averages, but Scenario A has a catastrophically bad experience for 1% of users. If you're processing 1 million requests daily, that's 10,000 users per day having a terrible experience—invisible in your average.
The Fix: Always use percentiles. The 50th percentile (median) shows "typical" experience. The 95th and 99th percentiles reveal what your worst-affected users experience. For user-centric SLIs, the 99th percentile often matters most.
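A quick sketch of why percentiles matter, using the 1%-slow scenario above. The percentile function here is a simple index-based approximation; production monitoring systems typically use streaming histogram estimates instead.

```typescript
// Simple index-based percentile over raw latency samples.
export function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) return NaN;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Scenario A: 99% of requests at 100 ms, 1% at 10 s.
const samples = [
  ...Array.from({ length: 990 }, () => 100),
  ...Array.from({ length: 10 }, () => 10_000),
];

const avg = samples.reduce((sum, v) => sum + v, 0) / samples.length;
console.log(avg);                     // ~199 ms: looks healthy
console.log(percentile(samples, 50)); // 100 ms: the typical experience
console.log(percentile(samples, 99)); // 10000 ms: the users the average hides
```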
Anti-Pattern 2: The "Available Means Pingable" Trap
Many teams define availability as "the load balancer health check passes." This dramatically overstates availability from a user perspective. A service might be pingable while:
- Its database connection pool is exhausted, so every real request fails
- A critical downstream dependency is down, so key features return errors
- The health endpoint returns 200 but the application logic behind it is broken
- The frontend is served but its JavaScript bundle fails to load, leaving users with a blank page
The Fix: Define availability in terms of user capability. "Available" means "users can accomplish their primary goals." This often requires synthetic transactions that exercise full user workflows, not just ping health endpoints.
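Below is a sketch of a synthetic probe that defines "available" as "a user can complete a search," built with Playwright. The URL and selectors are placeholders for your own application.

```typescript
// Synthetic availability probe: exercises a real user workflow rather than
// pinging a health endpoint. URL and selectors are illustrative.
import { chromium } from "playwright";

export async function probeSearchJourney(): Promise<boolean> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto("https://shop.example.com", { timeout: 10_000 });

    // "Available" means the user can actually search and see results.
    await page.fill("#search-input", "running shoes");
    await page.click("#search-submit");
    await page.waitForSelector(".product-result", { timeout: 5_000 });
    return true;
  } catch {
    return false; // Pingable but unusable counts as unavailable.
  } finally {
    await browser.close();
  }
}
```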
Anti-Pattern 3: Overly Generous Exclusions
Teams often exclude various failures from SLI calculations: planned maintenance windows, failures attributed to third-party dependencies, client-side errors, "retryable" errors, or traffic from a region affected by an incident.
From the user's perspective, all of these were failures. User-centric SLIs should include everything users experience as failures.
The Fix: Include all failures in your SLI. If you have legitimate exclusions (planned maintenance with notice, user-caused issues), document them explicitly and require approval. The default should be inclusion, not exclusion.
Anti-Pattern 4: Over-Aggregation
A single "availability" SLI for your entire platform obscures problems in specific journeys. If product search is failing but checkout is working, an aggregated availability metric might still look acceptable—while users trying to find products are having a terrible experience.
The Fix: Create distinct SLIs for distinct user journeys. Roll these up into summary metrics for executive dashboards, but preserve the granular signals for operational use.
The most damaging anti-pattern is choosing SLIs based on what's easy to measure rather than what matters to users. If your monitoring infrastructure can't measure true user experience today, invest in upgrading it. Don't settle for metrics that give false confidence while users suffer.
With theoretical foundations established, let's walk through a practical framework for implementing user-centric SLIs in a real system.
Step 1: Enumerate User Journeys
Start by creating a comprehensive inventory of user journeys. For each journey, document: who performs it, what they are trying to accomplish, the steps involved, how frequently it occurs, and which criticality tier it belongs to.
Step 2: Define Success and Failure
For each critical journey, precisely define what constitutes success and failure from the user's perspective. Be explicit about edge cases: Does a partial result count as success? Is a correct result that arrives slowly a success or a soft failure? Do user-caused errors, such as invalid input or abandoned sessions, count against the SLI?
Example: E-Commerce Product Search
```yaml
# User-Centric SLI Definition Template
sli_definitions:
  product_search:
    name: "Product Search Success"
    description: "Users can successfully search for and view product results"
    journey_tier: 1
    success_criteria:
      - condition: "Search results rendered within 1 second"
        measurement: "client_side_time_to_results < 1000ms"
      - condition: "At least one product visible in results"
        measurement: "result_count > 0 OR query_has_no_matches"
      - condition: "Search filters functional"
        measurement: "filter_interactions_succeed"
    failure_modes:
      hard_failures:
        - "HTTP 5xx from search service"
        - "Client-side JavaScript error preventing render"
        - "Timeout after 5 seconds with no response"
      soft_failures:
        - "Results returned but > 3 seconds latency"
        - "Main results load but recommendations fail"
    measurement_points:
      primary: "RUM - time from search submit to first result render"
      fallback: "Server-side - time from request receipt to response sent"
      diagnostic: "Database query execution time"
    sli_formula: >
      (successful_searches / total_search_attempts) * 100
      where successful_searches = searches meeting all success criteria
      and total_search_attempts excludes only documented exclusions
```

Step 3: Establish Measurement Infrastructure
Implement the instrumentation to collect your defined SLIs:
Client-side instrumentation: Add Real User Monitoring to capture page load performance, interaction latency, and client-side errors. Use the Web Vitals API for standardized metrics.
Server-side observability: Ensure structured logging and distributed tracing are in place to correlate client observations with server behavior.
Synthetic probes: Deploy synthetic monitoring that exercises critical user journeys from multiple geographic locations.
Data pipeline: Build aggregation pipelines that compute SLI values from raw metrics at appropriate intervals (typically 1-5 minute windows).
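As a rough sketch of that aggregation step, assuming journey-level events (a hypothetical shape) have already been collected from RUM and server logs, the function below buckets them into fixed windows and computes a per-journey success-rate SLI.

```typescript
// Windowed success-rate SLI computation over journey-level events.
interface JourneyEvent {
  journey: string;      // e.g. "product_search"
  timestampMs: number;  // event time (epoch millis)
  succeeded: boolean;   // met all documented success criteria
}

// Returns journey -> window start (ms) -> SLI value in percent.
export function computeWindowedSli(
  events: JourneyEvent[],
  windowMs: number = 5 * 60_000, // 5-minute windows
): Map<string, Map<number, number>> {
  const counts = new Map<string, Map<number, { ok: number; total: number }>>();

  for (const e of events) {
    const windowStart = Math.floor(e.timestampMs / windowMs) * windowMs;
    const perJourney =
      counts.get(e.journey) ?? new Map<number, { ok: number; total: number }>();
    const bucket = perJourney.get(windowStart) ?? { ok: 0, total: 0 };
    bucket.total += 1;
    if (e.succeeded) bucket.ok += 1;
    perJourney.set(windowStart, bucket);
    counts.set(e.journey, perJourney);
  }

  const sli = new Map<string, Map<number, number>>();
  for (const [journey, buckets] of counts) {
    const perWindow = new Map<number, number>();
    for (const [windowStart, { ok, total }] of buckets) {
      perWindow.set(windowStart, (ok / total) * 100);
    }
    sli.set(journey, perWindow);
  }
  return sli;
}
```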
Step 4: Validate Against User Feedback
Before finalizing SLI targets, validate that your measurements correlate with actual user satisfaction: compare SLI trends against support ticket volume, satisfaction survey results, and user complaints during known incidents, and check that past periods of user-reported pain show up as SLI degradation.
If your SLI shows good performance during periods users report poor experience, your SLI isn't measuring what matters—refine it.
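One lightweight way to sanity-check this, sketched below, is to correlate periodic SLI values with a user-reported signal such as support ticket volume; a strongly negative correlation is what you would hope to see. The data here is purely illustrative.

```typescript
// Pearson correlation between an SLI series and a user-feedback series.
export function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  const mean = (a: number[]) => a.slice(0, n).reduce((s, v) => s + v, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    varX += (x[i] - mx) ** 2;
    varY += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}

// Illustrative data: weekly checkout SLI (%) vs. weekly "checkout broken" tickets.
const weeklySli = [99.9, 99.8, 98.5, 99.9, 97.2];
const weeklyTickets = [3, 4, 41, 2, 88];
console.log(pearson(weeklySli, weeklyTickets)); // strongly negative if the SLI tracks user pain
```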
Step 5: Document and Communicate
User-centric SLIs are most valuable when understood beyond the SRE team: publish SLI definitions where product, support, and business stakeholders can find them, explain each SLI in terms of the user journey it protects, and revisit the definitions as the product and user expectations evolve.
User-centric SLIs represent a fundamental shift from measuring what's convenient to measuring what matters. Let's consolidate the key principles:
- Reliability is defined by user experience, not infrastructure health; a "healthy" system delivering a poor experience is not healthy
- Derive SLIs from critical user journeys, prioritized by business impact, not from system topology
- Translate qualitative user expectations into precise, measurable specifications, stage by stage
- Measure as close to the user as possible, using server-side metrics and synthetic probes for diagnostics and baselines
- Use percentiles rather than averages, include everything users experience as failure, and keep SLIs granular per journey
- Validate SLIs against real user feedback and refine them whenever the two disagree
What's Next
With the philosophical foundation of user-centric measurement established, we're ready to explore specific categories of SLIs. In the next page, we'll dive deep into Availability SLIs—how to measure and reason about whether your service is truly available to users, including the subtle distinctions between uptime, reachability, and functionality.
You now understand the foundational principles of user-centric SLIs. You can identify critical user journeys, translate user expectations into measurable indicators, choose appropriate measurement points, and avoid common anti-patterns. This user-focused mindset is the bedrock upon which all effective SLI strategies are built.