Before you ever terminate a process, inject latency, or simulate a network partition, there is one fundamental question you must answer: What does "normal" look like for your system?
This might seem obvious, but it's the question most teams skip. They jump straight into breaking things—killing containers, saturating CPUs, dropping packets—without first establishing a clear, measurable definition of what their system should be doing when everything is working correctly. The result? Chaos experiments that generate noise but no actionable insight.
The first and most critical principle of chaos engineering is deceptively simple: Hypothesize about steady state. Yet within this principle lies the intellectual rigor that separates random destruction from disciplined experimentation. It's the difference between a child breaking toys and a scientist conducting experiments.
By the end of this page, you will understand how to define steady state for complex distributed systems, how to form testable hypotheses, why business metrics matter more than technical metrics, and how to establish the scientific foundation that makes chaos engineering actually useful rather than merely destructive.
In physics and control theory, steady state refers to a condition where key variables remain constant over time. The system may experience fluctuations internally, but its observable outputs remain within expected bounds. For a distributed system, steady state represents the normal operational condition where the system is fulfilling its intended purpose within acceptable parameters.
Why "steady state" instead of "healthy"?
The term is deliberately chosen over simpler words like "healthy" or "working" because it captures a crucial nuance: systems are never perfectly stable. Traffic fluctuates. Background jobs run. Caches warm and cool. Individual requests fail. A sophisticated understanding of system behavior requires accepting that micro-level chaos is constant—what matters is macro-level stability.
Consider an e-commerce platform: traffic rises and falls over the course of the day, background jobs reindex the catalog, cache hit rates drift as content changes, and a small fraction of requests fail and are retried.
All of this variation is normal. The system is in steady state not because nothing changes, but because the changes remain within expected ranges that allow the system to fulfill its purpose: enabling customers to browse and purchase products.
Steady state is not about perfection—it's about predictability within bounds. A system may drop 0.1% of requests during normal operation and still be in steady state, as long as that's the expected behavior. Understanding this distinction is fundamental to designing meaningful chaos experiments.
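To make this concrete, here is a minimal sketch of steady state expressed as bounds rather than perfection; both ranges are illustrative assumptions, not values from any particular system.

```python
# Minimal sketch: steady state as predictability within bounds, not perfection.
# Both ranges below are illustrative assumptions.
EXPECTED_ERROR_RATE = (0.0, 0.002)      # up to 0.2% failed requests is normal
EXPECTED_THROUGHPUT = (900.0, 1500.0)   # requests per minute for this time window

def in_steady_state(error_rate: float, throughput_rpm: float) -> bool:
    """The system is in steady state when its key signals stay inside expected ranges."""
    return (EXPECTED_ERROR_RATE[0] <= error_rate <= EXPECTED_ERROR_RATE[1]
            and EXPECTED_THROUGHPUT[0] <= throughput_rpm <= EXPECTED_THROUGHPUT[1])

# A 0.1% error rate still counts as steady state because it is expected behavior.
print(in_steady_state(error_rate=0.001, throughput_rpm=1200))   # True
print(in_steady_state(error_rate=0.015, throughput_rpm=1200))   # False: outside bounds
```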
Steady state is contextual:
Different systems have different steady state definitions. What's normal for a real-time trading platform (sub-millisecond latency requirements) is vastly different from what's normal for a batch data processing pipeline (where minutes or hours of latency are acceptable). Before hypothesizing, you must deeply understand the operational context your system runs in.
Not all metrics are equally useful for defining steady state. The most valuable steady state metrics are the ones most closely tied to whether customers can accomplish their goals.
The chaos engineering pioneers at Netflix explicitly recommend focusing on business metrics over technical metrics wherever possible. Why? Because a system can have perfect internal technical metrics (low CPU usage, zero errors in logs) while still failing to serve customers (perhaps due to logic bugs or upstream dependency failures).
| Metric Type | Example | Strength | Weakness |
|---|---|---|---|
| Business Metric | Orders completed per minute | Directly measures customer value | May have complex causality chains |
| User Experience Metric | Page load time at P95 | Customer-perceivable impact | Requires real user monitoring |
| Service Level Metric | API success rate | Standardized, easy to instrument | May miss end-to-end issues |
| Infrastructure Metric | CPU utilization percentage | Easy to collect, precise | Poor proxy for customer experience |
The Netflix Example:
Netflix famously uses "starts per second" (SPS)—the number of video streams that successfully begin per second—as their primary steady state indicator. This single metric encapsulates the entire system's health because a successful stream start requires a long chain of components (client-facing APIs, authentication, content metadata services, and delivery infrastructure) to work correctly.
If any of these components fail, SPS drops. The metric acts as a health indicator for the entire complex system without requiring explicit monitoring of every component.
Every system has its equivalent of SPS—a high-level metric that captures whether customers can successfully accomplish their goals. For a payment processor, it might be "successful transactions per second." For a messaging app, it might be "messages delivered within SLA per minute." For a search engine, it might be "searches returning results within latency target." Identify this metric for your system.
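As a sketch of what identifying such a metric can look like in practice, the snippet below derives an SPS-style primary indicator (here, successful order completions per minute) from a stream of raw events; the event shape and field names are assumptions for illustration.

```python
# Hypothetical sketch: computing a top-level "can customers accomplish their goal?"
# metric from raw events. The event structure is an assumption for illustration.
from datetime import datetime, timedelta
from typing import Dict, List

def completions_per_minute(events: List[Dict], now: datetime,
                           success_event: str = "order_completed") -> int:
    """Count successful goal completions in the trailing one-minute window."""
    window_start = now - timedelta(minutes=1)
    return sum(
        1 for event in events
        if event["type"] == success_event and window_start <= event["timestamp"] <= now
    )

now = datetime(2024, 5, 1, 14, 30)
events = [
    {"type": "order_completed", "timestamp": now - timedelta(seconds=20)},
    {"type": "order_completed", "timestamp": now - timedelta(seconds=45)},
    {"type": "order_failed", "timestamp": now - timedelta(seconds=10)},
    {"type": "order_completed", "timestamp": now - timedelta(minutes=5)},  # outside window
]
print(completions_per_minute(events, now))  # 2
```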
Once you've identified your steady state metrics, the next step is establishing baselines—the quantitative ranges that define "normal." This is more challenging than it might appear because production systems exhibit complex temporal patterns.
Time-based variations:
Most systems have inherent rhythms that must be accounted for:
A metric value that's alarming at 3 AM might be perfectly normal at 3 PM. Your baseline must account for these patterns, or your chaos experiments will generate false positives (detecting "anomalies" that are actually normal variation) or false negatives (missing real problems that are masked by high variance periods).
```python
import statistics
import datetime
from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum


class TimeWindow(Enum):
    HOUR = "hour"
    DAY_OF_WEEK = "day_of_week"
    TIME_OF_DAY = "time_of_day"


@dataclass
class BaselineMetric:
    """Represents a steady state baseline with time-aware bounds."""
    name: str
    mean: float
    std_dev: float
    p50: float
    p95: float
    p99: float
    min_acceptable: float
    max_acceptable: float
    time_window: TimeWindow
    window_value: str  # e.g., "Monday", "14:00", etc.

    def is_within_bounds(self, value: float) -> bool:
        """Check if a value falls within acceptable steady state range."""
        return self.min_acceptable <= value <= self.max_acceptable

    def deviation_severity(self, value: float) -> str:
        """Categorize how far a value deviates from expected."""
        if self.is_within_bounds(value):
            return "normal"

        # Calculate how many standard deviations away
        z_score = abs(value - self.mean) / self.std_dev if self.std_dev > 0 else float('inf')
        if z_score < 2:
            return "minor_deviation"
        elif z_score < 3:
            return "moderate_deviation"
        else:
            return "severe_deviation"


class SteadyStateBaseline:
    """
    Manages steady state baselines with time-aware segmentation.

    This enables comparing current metrics against historically
    appropriate baselines rather than global averages.
    """

    def __init__(self, metric_name: str):
        self.metric_name = metric_name
        self.baselines: Dict[str, BaselineMetric] = {}
        self.window_type: Optional[TimeWindow] = None  # set when baselines are built

    def build_from_history(
        self,
        historical_data: List[Dict],  # [{"timestamp": datetime, "value": float}]
        window_type: TimeWindow,
        num_std_devs: float = 3.0  # for setting acceptable bounds
    ) -> None:
        """Build time-segmented baselines from historical data."""
        self.window_type = window_type

        # Group data by time window
        grouped_data: Dict[str, List[float]] = {}
        for point in historical_data:
            window_key = self._get_window_key(point["timestamp"], window_type)
            if window_key not in grouped_data:
                grouped_data[window_key] = []
            grouped_data[window_key].append(point["value"])

        # Calculate baseline statistics for each time segment
        for window_key, values in grouped_data.items():
            if len(values) < 30:  # Need sufficient data points
                continue

            mean = statistics.mean(values)
            std_dev = statistics.stdev(values) if len(values) > 1 else 0
            sorted_values = sorted(values)

            baseline = BaselineMetric(
                name=self.metric_name,
                mean=mean,
                std_dev=std_dev,
                p50=self._percentile(sorted_values, 50),
                p95=self._percentile(sorted_values, 95),
                p99=self._percentile(sorted_values, 99),
                min_acceptable=mean - (num_std_devs * std_dev),
                max_acceptable=mean + (num_std_devs * std_dev),
                time_window=window_type,
                window_value=window_key
            )
            self.baselines[window_key] = baseline

    def evaluate(
        self,
        current_value: float,
        current_time: datetime.datetime
    ) -> Dict:
        """
        Evaluate a current metric value against the appropriate baseline.
        Returns detailed comparison results.
        """
        # Find the appropriate baseline for this time, using the same window
        # type the baselines were built with
        window_type = self.window_type or TimeWindow.TIME_OF_DAY
        window_key = self._get_window_key(current_time, window_type)

        if window_key not in self.baselines:
            return {
                "status": "no_baseline",
                "message": f"No baseline data for {window_key}"
            }

        baseline = self.baselines[window_key]
        return {
            "status": "evaluated",
            "in_steady_state": baseline.is_within_bounds(current_value),
            "severity": baseline.deviation_severity(current_value),
            "current_value": current_value,
            "expected_range": {
                "min": baseline.min_acceptable,
                "max": baseline.max_acceptable
            },
            "baseline_stats": {
                "mean": baseline.mean,
                "std_dev": baseline.std_dev,
                "p95": baseline.p95
            }
        }

    def _get_window_key(
        self,
        timestamp: datetime.datetime,
        window_type: TimeWindow
    ) -> str:
        """Generate a key for time-based bucketing."""
        if window_type == TimeWindow.HOUR:
            return str(timestamp.hour)
        elif window_type == TimeWindow.DAY_OF_WEEK:
            return timestamp.strftime("%A")
        elif window_type == TimeWindow.TIME_OF_DAY:
            # 2-hour windows
            hour_bucket = (timestamp.hour // 2) * 2
            return f"{hour_bucket:02d}:00-{hour_bucket+2:02d}:00"
        return "default"

    @staticmethod
    def _percentile(sorted_data: List[float], percentile: int) -> float:
        index = (percentile / 100) * (len(sorted_data) - 1)
        lower = int(index)
        upper = lower + 1
        if upper >= len(sorted_data):
            return sorted_data[-1]
        fraction = index - lower
        return sorted_data[lower] + fraction * (sorted_data[upper] - sorted_data[lower])
```

Baseline establishment process: collect several weeks of historical data for each steady state metric, segment it by the time windows that matter (hour of day, day of week), compute per-segment statistics, derive acceptable bounds (for example, mean ± 3 standard deviations), and rebuild the baselines periodically as the system evolves.
Avoid these mistakes:
1. Using global averages that ignore temporal patterns.
2. Setting bounds so tight that normal variation appears as anomalies.
3. Setting bounds so loose that genuine degradation is missed.
4. Never updating baselines as the system evolves.
5. Using infrastructure metrics like CPU as primary steady state indicators.
With steady state defined and baselines established, you can now formulate testable hypotheses. A chaos engineering hypothesis follows a specific structure:
Template: If [perturbation occurs], then [steady state metric] will remain within [acceptable bounds] because [resilience mechanism].
This structure forces you to:
- name the perturbation precisely rather than "break something and see what happens,"
- choose a measurable steady state metric,
- commit to quantitative acceptable bounds before the experiment runs, and
- articulate the resilience mechanism you believe will protect steady state.
The last element—explaining the resilience mechanism—is crucial. It transforms chaos from "let's see what happens" into "let's validate our understanding." If your hypothesis is confirmed, you've validated that a resilience mechanism works. If it's refuted, you've discovered a gap in your understanding or implementation.
The Power of Specific Predictions:
A strong hypothesis makes a specific, quantitative prediction. This specificity is essential. Vague hypotheses like "the system should be fine" provide no framework for evaluation. When your hypothesis states "P99 latency will remain below 200ms," you have a clear success criterion.
Writing specific hypotheses also forces you to confront what you don't know. If you find yourself unable to predict how the system will behave, that's valuable information—it reveals gaps in your mental model that should be addressed before the chaos experiment runs.
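A hedged sketch of how the template and its quantitative success criteria might be captured as structured data follows; the class and field names are illustrative assumptions, not part of any specific chaos engineering tool.

```python
# Illustrative sketch of the hypothesis template as structured data. The class and
# field names are assumptions, not from any specific chaos engineering tool.
from dataclasses import dataclass

@dataclass
class ChaosHypothesis:
    perturbation: str           # "If [perturbation occurs] ..."
    steady_state_metric: str    # "... then [steady state metric] ..."
    lower_bound: float          # "... will remain within [acceptable bounds] ..."
    upper_bound: float
    resilience_mechanism: str   # "... because [resilience mechanism]."

    def holds(self, observed_value: float) -> bool:
        """The hypothesis is confirmed if the metric stayed within its bounds."""
        return self.lower_bound <= observed_value <= self.upper_bound

hypothesis = ChaosHypothesis(
    perturbation="one of two database read replicas is terminated",
    steady_state_metric="checkout_p99_latency_ms",
    lower_bound=0.0,
    upper_bound=200.0,          # "P99 latency will remain below 200ms"
    resilience_mechanism="reads fail over to the surviving replica",
)
print(hypothesis.holds(observed_value=160.0))  # True: steady state was maintained
```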
Formulating hypotheses is not a one-time activity—it's part of an ongoing lifecycle that mirrors the scientific method:
Observe → Hypothesize → Experiment → Analyze → Iterate
This lifecycle transforms chaos engineering from a collection of ad-hoc tests into a systematic practice for building and validating confidence in system resilience.
Each cycle through this lifecycle produces artifacts: documented hypotheses, experiment results, and updated system understanding. Over months and years, these artifacts become invaluable institutional knowledge. New team members can review past experiments to understand why resilience mechanisms exist. Future experiments can build on validated hypotheses rather than starting from scratch.
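One lightweight way to capture those artifacts is to record each run in a structured form, as in the sketch below; the fields shown are assumptions about what is worth keeping, not a prescribed schema.

```python
# Sketch of an experiment record so hypotheses, results, and lessons accumulate as
# institutional knowledge. Field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ExperimentRecord:
    hypothesis: str                       # full "If ... then ... because ..." statement
    run_at: datetime
    steady_state_held: bool               # was the hypothesis confirmed?
    observations: List[str] = field(default_factory=list)       # what we saw and learned
    follow_up_actions: List[str] = field(default_factory=list)  # fixes or next experiments

record = ExperimentRecord(
    hypothesis="If a read replica fails, orders/minute stays within 20% of baseline "
               "because the connection pool redistributes load.",
    run_at=datetime(2024, 5, 1, 14, 0),
    steady_state_held=False,
    observations=["Connection pool took 90 seconds to evict dead connections"],
    follow_up_actions=["Tighten connection health-check interval", "Re-run the experiment"],
)
```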
Hypothesis evolution:
As you run experiments and gain confidence, hypotheses naturally become more sophisticated:
Early stage (building basic confidence): single-component failures with generous bounds, such as terminating one instance or losing one read replica.
Intermediate stage (exploring edge cases): degraded dependencies rather than clean failures, such as injected latency, partial error rates, or a cold cache.
Advanced stage (testing complex scenarios): combined and cascading conditions, such as a dependency failure during peak traffic or the loss of an entire availability zone.
This progression reflects growing system maturity and team confidence.
Even experienced teams make mistakes when defining steady state and formulating hypotheses; understanding the common antipatterns helps you avoid them.
The best hypotheses are those with genuine uncertainty about the outcome. If you're 100% confident the system will survive, the experiment teaches you nothing new. If you're 100% confident it will fail, you should fix the bug rather than run an experiment. Aim for hypotheses where you believe they'll pass but aren't certain.
Let's walk through a complete example of defining steady state and formulating hypotheses for a realistic e-commerce system.
AcmeShop is a mid-size e-commerce platform processing 500,000 orders per month. The system consists of: a React frontend, an API Gateway, a Product Service, an Inventory Service, an Order Service, a Payment Service (integrating with Stripe), a PostgreSQL primary with two read replicas, and Redis for caching.
Step 1: Identify the primary steady state metric
For AcmeShop, the business-critical outcome is successful order completion. We define:
Primary steady state metric: orders completed per minute
Baseline: 12 orders/minute on average, with hourly ranges from roughly 2 to 30
Acceptable bounds: within 20% of the hourly baseline
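A minimal sketch of how the "within 20% of the hourly baseline" check could be expressed follows; treating the bound as a symmetric band is an assumption for illustration.

```python
# Minimal sketch of AcmeShop's primary steady state check. Treating "within 20%"
# as a symmetric band around the hourly baseline is an assumption for illustration.
def orders_in_steady_state(observed_per_minute: float,
                           hourly_baseline: float,
                           tolerance: float = 0.20) -> bool:
    """Steady state holds if the observed rate stays within 20% of this hour's baseline."""
    lower = hourly_baseline * (1 - tolerance)
    upper = hourly_baseline * (1 + tolerance)
    return lower <= observed_per_minute <= upper

# In the 12:00-18:00 window the baseline might sit around 22 orders/minute:
print(orders_in_steady_state(observed_per_minute=19, hourly_baseline=22))  # True
print(orders_in_steady_state(observed_per_minute=15, hourly_baseline=22))  # False
```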
Step 2: Identify supporting metrics
To understand why orders might succeed or fail, we add supporting metrics:
```yaml
# AcmeShop Steady State Definition
steady_state:
  primary_metric:
    name: orders_completed_per_minute
    description: Number of orders successfully placed per minute
    baseline: 12
    variance_by_hour:
      "00:00-06:00": { min: 2, max: 8 }
      "06:00-12:00": { min: 8, max: 18 }
      "12:00-18:00": { min: 15, max: 30 }
      "18:00-24:00": { min: 10, max: 22 }
    alert_threshold:
      min_percentage_of_baseline: 80

  supporting_metrics:
    - name: api_gateway_p95_latency_ms
      baseline: 400
      max_acceptable: 1000
    - name: payment_success_rate
      baseline: 0.995
      min_acceptable: 0.98
    - name: database_read_p95_latency_ms
      baseline: 50
      max_acceptable: 200
    - name: cache_hit_rate
      baseline: 0.94
      min_acceptable: 0.80

  user_journeys:
    - name: browse_and_search
      metrics: [api_latency, cache_hit_rate]
      weight: 0.3  # 30% of user value
    - name: add_to_cart
      metrics: [api_latency, inventory_check_latency]
      weight: 0.2
    - name: checkout_and_pay
      metrics: [payment_success_rate, order_completion_rate]
      weight: 0.5  # 50% of user value - most critical
```

Step 3: Formulate hypotheses for key failure scenarios
With steady state defined, we can now form testable hypotheses:
| Scenario | Hypothesis | Mechanism | Success Criteria |
|---|---|---|---|
| 1 of 2 DB read replicas fails | Orders/minute stays within 20% of baseline | Connection pool redistributes load to surviving replica | DB read P95 < 200ms, orders > 80% baseline |
| Redis cache complete failure | API P95 latency increases but stays under 1s | Fallback to database with query optimization | API P95 < 1000ms, no increase in 5xx errors |
| Payment service 5s latency injection | 90% of checkouts complete; 10% time out with clear error | Timeout configuration and user-friendly error handling | Payment success rate > 90%, zero hung requests |
| Inventory service 50% error rate | Users see degraded experience but no crashes | Circuit breaker opens; cached inventory used | Browse journeys unaffected, checkout warns of potential stock issues |
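As an example of turning one table row into an automated check, the sketch below evaluates the success criteria for the "Redis cache complete failure" scenario; the metric names and observed values are hypothetical placeholders.

```python
# Hedged sketch: evaluating the success criteria for the "Redis cache complete
# failure" row. Metric names and observed values are hypothetical placeholders.
from typing import Dict

def cache_failure_hypothesis_holds(metrics: Dict[str, float]) -> bool:
    """Success criteria from the table: API P95 < 1000ms and no increase in 5xx errors."""
    api_p95_ok = metrics["api_gateway_p95_latency_ms"] < 1000
    no_5xx_increase = metrics["current_5xx_rate"] <= metrics["baseline_5xx_rate"]
    return api_p95_ok and no_5xx_increase

observed = {
    "api_gateway_p95_latency_ms": 870.0,  # degraded from the 400ms baseline, still in bounds
    "current_5xx_rate": 0.004,
    "baseline_5xx_rate": 0.005,
}
print(cache_failure_hypothesis_holds(observed))  # True: the hypothesis held
```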
The first principle of chaos engineering—hypothesizing about steady state—establishes the scientific foundation that makes chaos valuable rather than merely chaotic. Let's consolidate the key takeaways:
- Steady state means predictability within bounds, not perfection.
- Business and user-facing metrics are better steady state indicators than infrastructure metrics.
- Baselines must account for temporal patterns such as time of day and day of week, or experiments will produce false positives and false negatives.
- A good hypothesis names a specific perturbation, a measurable metric, quantitative bounds, and the resilience mechanism expected to protect steady state.
- Each experiment cycle (observe, hypothesize, experiment, analyze, iterate) adds to a growing body of validated institutional knowledge.
What's next:
With steady state defined and hypotheses formulated, we're ready to explore the second principle of chaos engineering: Vary Real-World Events. This principle guides what perturbations to introduce—moving from theory to the actual disruptions that test our hypotheses.
You now understand how to define steady state, establish time-aware baselines, and formulate testable hypotheses. This scientific foundation distinguishes chaos engineering from random destruction. Next, we'll explore what kinds of real-world events to simulate in our experiments.