In the realm of Site Reliability Engineering, a fundamental truth often gets obscured by the noise of infrastructure metrics, system dashboards, and technical alerts: reliability is ultimately defined by the user's experience, not by the health of your servers.
Consider a scenario that plays out daily across organizations worldwide: your monitoring dashboards glow green, CPU utilization hovers at a comfortable 40%, memory consumption is well within bounds, network latency between your internal services measures in single-digit milliseconds, and your infrastructure team reports zero incidents. Yet, your customer support channels are flooded with complaints. Users report the application as "slow," "broken," or "unusable." How can this be?
The disconnect stems from a fundamental misalignment between what we measure and what users actually experience. Traditional infrastructure metrics tell you about the health of your systems. User-centric SLIs tell you about the quality of the experience you deliver. These are not the same thing, and conflating them is one of the most common—and costly—mistakes in reliability engineering.
By the end of this page, you will understand why user-centric SLIs form the foundation of effective reliability practice. You'll learn frameworks for identifying what truly matters to users, techniques for translating subjective experience into objective measurements, and strategies for avoiding the common pitfalls that lead to meaningless metrics. Most importantly, you'll gain the mindset shift necessary to view your systems through your users' eyes.
Before diving into specific techniques, we must establish the philosophical foundation that underpins user-centric SLIs. This isn't merely a technical exercise—it represents a fundamental shift in how we conceptualize and measure reliability.
The Traditional Approach: System-Centric Metrics
Historically, operations teams focused on metrics that were easy to collect and directly observable from infrastructure: CPU utilization, memory consumption, disk I/O, network throughput, and host uptime.
These metrics have value—they're essential for capacity planning, debugging, and understanding system behavior. However, they suffer from a critical limitation: they measure means, not ends.
Your infrastructure exists to serve users. Low CPU usage is not an end goal—it's a means to ensuring requests get processed quickly. High network bandwidth is not valuable in itself—it matters only insofar as it enables fast data delivery to users. When we mistake means for ends, we optimize for the wrong outcomes.
The User-Centric Paradigm Shift
User-centric SLIs flip this model on its head. Instead of asking "How healthy are our systems?" we ask questions like: Can users accomplish what they came to do? How long do they have to wait? How often do their attempts fail?
This shift has profound implications. It means our SLIs must be derived from user journeys, not system topologies. It means we need to understand user expectations and translate them into measurable quantities. And it means accepting that a "healthy" system causing poor user experience is, by definition, not actually healthy.
The Observability Chain
Think of reliability observability as a chain with multiple links: physical resources support system components, system components support services, services support user-facing interactions, and those interactions produce the outcomes users actually care about.
Each link in this chain gets us closer to what ultimately matters. User-centric SLIs position us at the right end of this chain—measuring outcomes rather than intermediaries.
| Dimension | System-Centric Approach | User-Centric Approach |
|---|---|---|
| Primary Question | "Is our infrastructure healthy?" | "Are users having a good experience?" |
| Measurement Point | Internal system boundaries | User-facing interaction points |
| Success Definition | Systems operating within resource limits | Users accomplishing intended goals |
| Alert Trigger | Resource threshold breached | User experience degradation detected |
| Failure Response | "Fix the server" | "Restore user experience" |
| Stakeholder Alignment | Primarily engineering-focused | Business and engineering aligned |
Effective user-centric SLIs begin with a deep understanding of how users interact with your system. This requires mapping user journeys—the sequences of interactions users perform to accomplish their goals.
What Is a User Journey?
A user journey is the complete sequence of steps a user takes to achieve a specific outcome. It encompasses every touchpoint, from initial intent through final confirmation. Consider an e-commerce checkout: the user reviews their cart, enters shipping details, provides payment information, confirms the order, and waits for confirmation that the purchase went through.
Each step in this journey represents an opportunity for the system to succeed or fail from the user's perspective. A user-centric SLI might measure the end-to-end success rate of checkout completions—not just whether individual API endpoints responded.
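To make that concrete, here is a minimal sketch in TypeScript of treating the whole journey as the unit of success rather than individual API endpoints. The shapes and helper names are illustrative, not a standard.

```typescript
// A user journey modeled as data so success can be judged end to end.
interface JourneyStep {
  name: string;        // e.g. "enter_payment"
  succeeded: boolean;
  durationMs: number;
}

interface CheckoutJourney {
  userId: string;
  steps: JourneyStep[];
}

// The checkout counts as successful only if the user completed every step.
export function checkoutSucceeded(journey: CheckoutJourney): boolean {
  return journey.steps.length > 0 && journey.steps.every((s) => s.succeeded);
}

// End-to-end duration as the user experienced it, not per-endpoint timings.
export function checkoutDurationMs(journey: CheckoutJourney): number {
  return journey.steps.reduce((sum, s) => sum + s.durationMs, 0);
}
```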
User journey mapping shouldn't be done by engineering in isolation. Product managers, UX researchers, customer support teams, and actual users all provide essential perspectives. The goal is to understand what users are trying to accomplish and what they perceive as success or failure—not what your system logs indicate.
Identifying Critical User Journeys
Not all user journeys carry equal weight. To prioritize SLI development, categorize journeys by their criticality:
Tier 1 - Revenue-Critical Journeys. These directly impact revenue or core business operations. For an e-commerce platform: search, product viewing, checkout, payment processing. For a SaaS product: core workflow completion, data export, API access for paying customers.
Tier 2 - Experience-Critical Journeys. These significantly impact user satisfaction and retention but may not directly generate revenue. Examples: account settings modification, preference management, notification delivery, report generation.
Tier 3 - Auxiliary Journeys. These support the overall product but are non-essential to core value delivery. Examples: social features, gamification elements, optional integrations, cosmetic customizations.
Your SLI strategy should ensure Tier 1 journeys have the most rigorous measurement, with coverage decreasing through lower tiers.
From Journeys to Measurements
Once you've identified critical user journeys, the next step is translating them into measurable quantities. This requires defining where the journey starts and ends, what counts as success from the user's perspective, which failure modes matter, and where the measurement will be taken.
For the e-commerce checkout example: the journey starts when the user initiates checkout and ends when they see an order confirmation; success means the confirmation appears within a few seconds of final submission; failures include payment errors, timeouts, and confirmations that never render.
Understanding how users perceive reliability is essential for choosing SLIs that genuinely reflect experience quality. User perception doesn't perfectly correlate with objective measurements—humans have psychological biases, varying expectations, and contextual interpretations that must be accounted for.
The Psychology of Waiting
Research in human-computer interaction has established that users' perception of time is heavily influenced by context: uncertain waits feel longer than known waits, unoccupied waits feel longer than waits with visible progress, and unexplained delays feel longer than delays the interface acknowledges.
These psychological factors mean that a 3-second operation with good feedback may feel faster than a 2-second operation with no feedback. Your SLIs should account for this—measuring not just raw latency but the quality of the waiting experience.
| Duration | User Perception | SLI Implication |
|---|---|---|
| < 100ms | Instantaneous; feels like direct manipulation | Target for simple interactions (button clicks, toggles) |
| 100-300ms | Slight delay; still feels responsive | Acceptable for most UI interactions |
| 300ms-1s | Noticeable delay; user aware of processing | Typical target for page loads, search results |
| 1-5s | Significant delay; requires progress feedback | Acceptable only with loading indicators |
| 5-10s | Uncomfortable; users question if it's working | Should be exceptional; requires explanation |
| > 10s | Unacceptable for interactive tasks | Indicates architectural problem or should be async |
User Expectations Vary by Context
User tolerance for latency and failure depends heavily on context:
High Stakes = Low Tolerance. When users are performing critical actions—financial transactions, medical data access, legal document submission—their tolerance for errors and delays plummets. An error during payment processing is vastly more frustrating than an error loading a news article. Your SLIs should reflect this by having stricter targets for high-stakes journeys.
Frequency Affects Frustration. An interaction that's slow once is an annoyance. An interaction that's slow every time becomes infuriating. If users perform an action 50 times per day, even small latency issues compound into significant productivity loss. Frequent interactions often warrant tighter latency SLIs than one-time actions.
Alternatives Change Expectations. User expectations are shaped by competitive alternatives. If competing products complete the same action in 200ms, your 2-second latency—even if technically acceptable—becomes a competitive disadvantage. Benchmark SLIs against competitor performance when possible.
Prior Experience Sets Baselines. User satisfaction is relative to their expectations. If your service has historically responded in 500ms, users adapt to this baseline. Improvements to 200ms delight users; regressions to 800ms frustrate them. SLIs should protect against regression from established baselines, not just breach of absolute thresholds.
Psychological research shows that humans remember experiences based on their peak intensity and their ending, not their average. A single catastrophic failure in an otherwise good session will dominate user memory. Similarly, ending an interaction poorly (slow confirmation, unclear success state) disproportionately colors the entire experience. Shape your SLIs to protect critical moments and endings, not just average performance.
The bridge between subjective user experience and objective SLIs requires careful translation. This section provides frameworks for converting qualitative user expectations into quantitative measurements.
The User Story to SLI Translation Process
Start with user stories that express expectations in natural language, then progressively refine them into measurable indicators:
Stage 1: Qualitative User Statement
"As a customer, I want my search results to appear quickly so I can find products efficiently."
Stage 2: Quantifiable User Expectation
"Search results should feel instant—I shouldn't notice waiting."
Stage 3: Technical Translation
Based on perception research, "feel instant" maps to < 300ms for the first meaningful results.
Stage 4: Measurement Definition
Search latency SLI: Time from user query submission to first 10 results rendered in user's browser.
Stage 5: SLI Specification
95th percentile search latency < 300ms, measured client-side from query keypress to first result render.
Notice how each stage adds precision while maintaining connection to the original user need. The final SLI can be measured objectively, yet it directly traces back to user experience goals.
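As a rough illustration of how the Stage 5 specification might be instrumented in the browser, the sketch below timestamps the query submission and the first result render. The `reportMetric` helper and the `/rum/metrics` endpoint are assumptions standing in for whatever RUM transport you actually use.

```typescript
// Sketch: client-side search latency, measured from query submission to the
// first rendered results. Names and the beacon endpoint are placeholders.
let searchStartMs: number | null = null;

// Call when the user submits the query (Enter keypress or button click).
export function onSearchSubmitted(): void {
  searchStartMs = performance.now();
}

// Call once the first batch of results is visible in the DOM.
export function onFirstResultsRendered(resultCount: number): void {
  if (searchStartMs === null) return;
  const latencyMs = performance.now() - searchStartMs;
  searchStartMs = null;

  // Ship the raw sample; the p95 < 300ms target is evaluated in aggregation.
  reportMetric("search_latency_ms", latencyMs, { resultCount });
}

// Placeholder transport: swap in your RUM vendor's SDK if you have one.
function reportMetric(name: string, value: number, attributes: Record<string, unknown>): void {
  navigator.sendBeacon("/rum/metrics", JSON.stringify({ name, value, attributes }));
}
```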
The Four Golden Signals for User Experience
Google's Site Reliability Engineering book popularized the "Four Golden Signals" for monitoring. Let's reframe these from a user-centric perspective:
1. Latency (User-Centric Framing: Responsiveness). Not just "how long do requests take?" but "how long do users wait for meaningful responses?" This means:
- Measuring end to end, from user action to usable content on screen, not just server-side processing time
- Distinguishing time to first feedback from time to completion, since perceived responsiveness depends on both
2. Traffic (User-Centric Framing: User Activity). Not just "requests per second" but "how many users are successfully engaging?" This means:
- Counting active users and completed journeys rather than raw request volume
- Treating a drop in user activity as a signal, even when request counts still look normal
3. Errors (User-Centric Framing: Success Rate). Not just "what percentage of HTTP requests return 5xx?" but "what percentage of user attempts succeed?" This means:
- Counting client-side errors, timeouts, and empty or wrong results as failures, not only server 5xx responses
- Measuring success at the journey level, so a checkout that fails at its final step counts as a failed checkout (see the sketch after this list)
4. Saturation (User-Centric Framing: Capacity Headroom). Not just "how utilized are our resources?" but "how close are we to degrading user experience?" This means:
- Tracking how latency and error rates respond as load grows, rather than utilization alone
- Defining headroom as the distance to the point where user-facing SLIs begin to degrade
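Here is a small sketch of the user-centric "errors" framing, under the assumption that each user attempt at a journey has already been labeled as succeeded or failed (for example, from RUM events). The types and names are illustrative.

```typescript
// Success rate computed over user journey attempts, not over HTTP responses.
interface JourneyAttempt {
  journey: string;     // e.g. "checkout"
  succeeded: boolean;  // did the user reach their goal?
}

export function successRatePercent(attempts: JourneyAttempt[]): number {
  if (attempts.length === 0) return 100; // no attempts, nothing failed
  const ok = attempts.filter((a) => a.succeeded).length;
  return (ok / attempts.length) * 100;
}

// A timed-out checkout or a client-side crash counts as a failed attempt here,
// even if every backend request technically returned 200.
```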
The location where you measure dramatically affects what your SLI captures. The closer you measure to the user, the more accurately your SLI reflects true user experience—but client-side measurement also introduces complexity. Understanding this tradeoff is essential.
The Measurement Spectrum
Imagine a typical web request flowing through your infrastructure:
User Browser → CDN → Load Balancer → API Gateway → Application Server → Database
Each point along this path offers a potential measurement location, with different tradeoffs:
| Measurement Point | What It Captures | What It Misses | Best For |
|---|---|---|---|
| Database Layer | Query execution time, data availability | All network hops, rendering, client issues | Database-specific SLIs only |
| Application Server | Business logic execution, server-side latency | Network transit, CDN behavior, client rendering | Server-side error rates, processing latency |
| Load Balancer | Backend availability, basic latency | Last-mile network, client behavior | Infrastructure availability SLIs |
| CDN Edge | Edge-to-origin latency, cache behavior | Last-mile network, client rendering | Content delivery SLIs, cache hit rates |
| Client-Side (RUM) | Full user experience including render time | Inconsistent data, sampling challenges | True user experience SLIs |
| Synthetic Monitoring | Consistent baseline from known locations | Real user variability, scale issues | Baseline availability, comparison testing |
Real User Monitoring (RUM) provides the most accurate picture of user experience but requires careful implementation. The gold standard is to measure client-side for primary SLIs while using server-side metrics for diagnostics. When client-side measurement isn't feasible, measure as close to the edge as possible and explicitly acknowledge the gap between your measurement and true user experience.
Real User Monitoring (RUM) Deep Dive
Real User Monitoring instruments actual user browsers to capture performance data as users experience it. This provides unparalleled accuracy but introduces implementation challenges:
Advantages of RUM for SLIs:
- Captures what users actually experienced, including last-mile network conditions, device performance, and client-side rendering
- Reflects the real diversity of browsers, devices, and geographies in your user base
- Surfaces problems that never appear in server-side metrics, such as JavaScript errors and slow rendering
Challenges with RUM:
- Data is noisy and highly variable, requiring careful sampling and percentile-based analysis
- Collection depends on users generating traffic, so coverage thins out exactly when an outage keeps users away
- Browser instrumentation can be blocked by ad blockers or privacy settings and adds a small client-side cost
Key RUM Metrics for SLIs:
- Largest Contentful Paint (LCP): how quickly the main content becomes visible
- Interaction to Next Paint (INP): how quickly the page responds to user input
- Cumulative Layout Shift (CLS): how visually stable the page is while loading
- Time to First Byte (TTFB): how quickly the server begins responding
These Web Vitals metrics map directly to user experience goals and should form the basis of user-centric SLIs.
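A minimal RUM sketch using the open-source web-vitals package is shown below. The callback names match recent versions of the library, and the `/rum/vitals` endpoint is a placeholder for your own ingestion path, so verify both against your setup.

```typescript
// Report Core Web Vitals from real user sessions to a hypothetical endpoint.
import { onLCP, onINP, onCLS } from "web-vitals";

function sendToAnalytics(metric: { name: string; value: number; id: string }): void {
  const body = JSON.stringify({ name: metric.name, value: metric.value, id: metric.id });
  // sendBeacon survives page unloads, which matters for late-reported metrics.
  navigator.sendBeacon("/rum/vitals", body);
}

onLCP(sendToAnalytics); // Largest Contentful Paint: perceived load speed
onINP(sendToAnalytics); // Interaction to Next Paint: responsiveness
onCLS(sendToAnalytics); // Cumulative Layout Shift: visual stability
```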
Synthetic Monitoring as a Complement
While RUM captures real user experience, synthetic monitoring provides consistent baselines: scripted probes run critical journeys from known locations at fixed intervals, giving you comparable measurements around the clock, even when real traffic is low or absent.
Best practice: Use synthetic monitoring to detect availability issues and regressions quickly, and use RUM to understand true user experience and set SLO targets.
Even with the best intentions, teams often fall into patterns that undermine the user-centricity of their SLIs. Recognizing these anti-patterns helps you avoid them.
Anti-Pattern 1: Measuring Averages Instead of Percentiles
Average latency is deeply misleading. Consider two scenarios: in Scenario A, 99% of requests complete in 100ms while 1% take 10 seconds; in Scenario B, every request completes in 199ms.
Both have identical averages, but Scenario A has a catastrophically bad experience for 1% of users. If you're processing 1 million requests daily, that's 10,000 users per day having a terrible experience—invisible in your average.
The Fix: Always use percentiles. The 50th percentile (median) shows "typical" experience. The 95th and 99th percentiles reveal what your worst-affected users experience. For user-centric SLIs, the 99th percentile often matters most.
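A quick sketch of why percentiles matter, using the 1%-slow scenario above. The percentile function here is a simple index-based approximation; production monitoring systems typically use streaming histogram estimates instead.

```typescript
// Simple index-based percentile over raw latency samples.
export function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) return NaN;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Scenario A: 99% of requests at 100 ms, 1% at 10 s.
const samples = [
  ...Array.from({ length: 990 }, () => 100),
  ...Array.from({ length: 10 }, () => 10_000),
];

const avg = samples.reduce((sum, v) => sum + v, 0) / samples.length;
console.log(avg);                     // ~199 ms: looks healthy
console.log(percentile(samples, 50)); // 100 ms: the typical experience
console.log(percentile(samples, 99)); // 10000 ms: the users the average hides
```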
Anti-Pattern 2: The "Available Means Pingable" Trap
Many teams define availability as "the load balancer health check passes." This dramatically overstates availability from a user perspective. A service might be pingable while:
- Its database connection pool is exhausted, so every real request fails
- A critical downstream dependency is down, so key features return errors
- The health endpoint returns 200 but the application logic behind it is broken
- The frontend is served but its JavaScript bundle fails to load, leaving users with a blank page
The Fix: Define availability in terms of user capability. "Available" means "users can accomplish their primary goals." This often requires synthetic transactions that exercise full user workflows, not just ping health endpoints.
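Below is a sketch of a synthetic probe that defines "available" as "a user can complete a search," built with Playwright. The URL and selectors are placeholders for your own application.

```typescript
// Synthetic availability probe: exercises a real user workflow rather than
// pinging a health endpoint. URL and selectors are illustrative.
import { chromium } from "playwright";

export async function probeSearchJourney(): Promise<boolean> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto("https://shop.example.com", { timeout: 10_000 });

    // "Available" means the user can actually search and see results.
    await page.fill("#search-input", "running shoes");
    await page.click("#search-submit");
    await page.waitForSelector(".product-result", { timeout: 5_000 });
    return true;
  } catch {
    return false; // Pingable but unusable counts as unavailable.
  } finally {
    await browser.close();
  }
}
```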
Anti-Pattern 3: Overly Generous Exclusions
Teams often exclude various failures from SLI calculations: planned maintenance windows, failures attributed to third-party dependencies, client-side errors, "retryable" errors, or traffic from a region affected by an incident.
From the user's perspective, all of these were failures. User-centric SLIs should include everything users experience as failures.
The Fix: Include all failures in your SLI. If you have legitimate exclusions (planned maintenance with notice, user-caused issues), document them explicitly and require approval. The default should be inclusion, not exclusion.
Anti-Pattern 4: Over-Aggregation
A single "availability" SLI for your entire platform obscures problems in specific journeys. If product search is failing but checkout is working, an aggregated availability metric might still look acceptable—while users trying to find products are having a terrible experience.
The Fix: Create distinct SLIs for distinct user journeys. Roll these up into summary metrics for executive dashboards, but preserve the granular signals for operational use.
The most damaging anti-pattern is choosing SLIs based on what's easy to measure rather than what matters to users. If your monitoring infrastructure can't measure true user experience today, invest in upgrading it. Don't settle for metrics that give false confidence while users suffer.
With theoretical foundations established, let's walk through a practical framework for implementing user-centric SLIs in a real system.
Step 1: Enumerate User Journeys
Start by creating a comprehensive inventory of user journeys. For each journey, document: who performs it, what they are trying to accomplish, the steps involved, how frequently it occurs, and which criticality tier it belongs to.
Step 2: Define Success and Failure
For each critical journey, precisely define what constitutes success and failure from the user's perspective. Be explicit about edge cases: Does a partial result count as success? Is a correct result that arrives slowly a success or a soft failure? Do user-caused errors, such as invalid input or abandoned sessions, count against the SLI?
Example: E-Commerce Product Search
```yaml
# User-Centric SLI Definition Template
sli_definitions:
  product_search:
    name: "Product Search Success"
    description: "Users can successfully search for and view product results"
    journey_tier: 1
    success_criteria:
      - condition: "Search results rendered within 1 second"
        measurement: "client_side_time_to_results < 1000ms"
      - condition: "At least one product visible in results"
        measurement: "result_count > 0 OR query_has_no_matches"
      - condition: "Search filters functional"
        measurement: "filter_interactions_succeed"
    failure_modes:
      hard_failures:
        - "HTTP 5xx from search service"
        - "Client-side JavaScript error preventing render"
        - "Timeout after 5 seconds with no response"
      soft_failures:
        - "Results returned but > 3 seconds latency"
        - "Main results load but recommendations fail"
    measurement_points:
      primary: "RUM - time from search submit to first result render"
      fallback: "Server-side - time from request receipt to response sent"
      diagnostic: "Database query execution time"
    sli_formula: >
      (successful_searches / total_search_attempts) * 100
      where successful_searches = searches meeting all success criteria
      and total_search_attempts excludes only documented exclusions
```

Step 3: Establish Measurement Infrastructure
Implement the instrumentation to collect your defined SLIs:
Client-side instrumentation: Add Real User Monitoring to capture page load performance, interaction latency, and client-side errors. Use the Web Vitals API for standardized metrics.
Server-side observability: Ensure structured logging and distributed tracing are in place to correlate client observations with server behavior.
Synthetic probes: Deploy synthetic monitoring that exercises critical user journeys from multiple geographic locations.
Data pipeline: Build aggregation pipelines that compute SLI values from raw metrics at appropriate intervals (typically 1-5 minute windows).
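As a rough sketch of that aggregation step, assuming journey-level events (a hypothetical shape) have already been collected from RUM and server logs, the function below buckets them into fixed windows and computes a per-journey success-rate SLI.

```typescript
// Windowed success-rate SLI computation over journey-level events.
interface JourneyEvent {
  journey: string;      // e.g. "product_search"
  timestampMs: number;  // event time (epoch millis)
  succeeded: boolean;   // met all documented success criteria
}

// Returns journey -> window start (ms) -> SLI value in percent.
export function computeWindowedSli(
  events: JourneyEvent[],
  windowMs: number = 5 * 60_000, // 5-minute windows
): Map<string, Map<number, number>> {
  const counts = new Map<string, Map<number, { ok: number; total: number }>>();

  for (const e of events) {
    const windowStart = Math.floor(e.timestampMs / windowMs) * windowMs;
    const perJourney =
      counts.get(e.journey) ?? new Map<number, { ok: number; total: number }>();
    const bucket = perJourney.get(windowStart) ?? { ok: 0, total: 0 };
    bucket.total += 1;
    if (e.succeeded) bucket.ok += 1;
    perJourney.set(windowStart, bucket);
    counts.set(e.journey, perJourney);
  }

  const sli = new Map<string, Map<number, number>>();
  for (const [journey, buckets] of counts) {
    const perWindow = new Map<number, number>();
    for (const [windowStart, { ok, total }] of buckets) {
      perWindow.set(windowStart, (ok / total) * 100);
    }
    sli.set(journey, perWindow);
  }
  return sli;
}
```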
Step 4: Validate Against User Feedback
Before finalizing SLI targets, validate that your measurements correlate with actual user satisfaction: compare SLI trends against support ticket volume, satisfaction survey results, and user complaints during known incidents, and check that past periods of user-reported pain show up as SLI degradation.
If your SLI shows good performance during periods users report poor experience, your SLI isn't measuring what matters—refine it.
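One lightweight way to sanity-check this, sketched below, is to correlate periodic SLI values with a user-reported signal such as support ticket volume; a strongly negative correlation is what you would hope to see. The data here is purely illustrative.

```typescript
// Pearson correlation between an SLI series and a user-feedback series.
export function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  const mean = (a: number[]) => a.slice(0, n).reduce((s, v) => s + v, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    varX += (x[i] - mx) ** 2;
    varY += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}

// Illustrative data: weekly checkout SLI (%) vs. weekly "checkout broken" tickets.
const weeklySli = [99.9, 99.8, 98.5, 99.9, 97.2];
const weeklyTickets = [3, 4, 41, 2, 88];
console.log(pearson(weeklySli, weeklyTickets)); // strongly negative if the SLI tracks user pain
```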
Step 5: Document and Communicate
User-centric SLIs are most valuable when understood beyond the SRE team: publish SLI definitions where product, support, and business stakeholders can find them, explain each SLI in terms of the user journey it protects, and revisit the definitions as the product and user expectations evolve.
User-centric SLIs represent a fundamental shift from measuring what's convenient to measuring what matters. Let's consolidate the key principles:
- Reliability is defined by user experience, not infrastructure health; a "healthy" system delivering a poor experience is not healthy
- Derive SLIs from critical user journeys, prioritized by business impact, not from system topology
- Translate qualitative user expectations into precise, measurable specifications, stage by stage
- Measure as close to the user as possible, using server-side metrics and synthetic probes for diagnostics and baselines
- Use percentiles rather than averages, include everything users experience as failure, and keep SLIs granular per journey
- Validate SLIs against real user feedback and refine them whenever the two disagree
What's Next
With the philosophical foundation of user-centric measurement established, we're ready to explore specific categories of SLIs. In the next page, we'll dive deep into Availability SLIs—how to measure and reason about whether your service is truly available to users, including the subtle distinctions between uptime, reachability, and functionality.
You now understand the foundational principles of user-centric SLIs. You can identify critical user journeys, translate user expectations into measurable indicators, choose appropriate measurement points, and avoid common anti-patterns. This user-focused mindset is the bedrock upon which all effective SLI strategies are built.