Before you ever terminate a process, inject latency, or simulate a network partition, there is one fundamental question you must answer: What does "normal" look like for your system?
This might seem obvious, but it's the question most teams skip. They jump straight into breaking things—killing containers, saturating CPUs, dropping packets—without first establishing a clear, measurable definition of what their system should be doing when everything is working correctly. The result? Chaos experiments that generate noise but no actionable insight.
The first and most critical principle of chaos engineering is deceptively simple: Hypothesize about steady state. Yet within this principle lies the intellectual rigor that separates random destruction from disciplined experimentation. It's the difference between a child breaking toys and a scientist conducting experiments.
By the end of this page, you will understand how to define steady state for complex distributed systems, how to form testable hypotheses, why business metrics matter more than technical metrics, and how to establish the scientific foundation that makes chaos engineering actually useful rather than merely destructive.
In physics and control theory, steady state refers to a condition where key variables remain constant over time. The system may experience fluctuations internally, but its observable outputs remain within expected bounds. For a distributed system, steady state represents the normal operational condition where the system is fulfilling its intended purpose within acceptable parameters.
Why "steady state" instead of "healthy"?
The term is deliberately chosen over simpler words like "healthy" or "working" because it captures a crucial nuance: systems are never perfectly stable. Traffic fluctuates. Background jobs run. Caches warm and cool. Individual requests fail. A sophisticated understanding of system behavior requires accepting that micro-level chaos is constant—what matters is macro-level stability.
Consider an e-commerce platform: traffic rises and falls over the course of the day, background jobs reindex the catalog, cache hit rates drift as content changes, and a small fraction of requests fail and are retried.
All of this variation is normal. The system is in steady state not because nothing changes, but because the changes remain within expected ranges that allow the system to fulfill its purpose: enabling customers to browse and purchase products.
Steady state is not about perfection—it's about predictability within bounds. A system may drop 0.1% of requests during normal operation and still be in steady state, as long as that's the expected behavior. Understanding this distinction is fundamental to designing meaningful chaos experiments.
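To make this concrete, here is a minimal sketch of steady state expressed as bounds rather than perfection; both ranges are illustrative assumptions, not values from any particular system.

```python
# Minimal sketch: steady state as predictability within bounds, not perfection.
# Both ranges below are illustrative assumptions.
EXPECTED_ERROR_RATE = (0.0, 0.002)      # up to 0.2% failed requests is normal
EXPECTED_THROUGHPUT = (900.0, 1500.0)   # requests per minute for this time window

def in_steady_state(error_rate: float, throughput_rpm: float) -> bool:
    """The system is in steady state when its key signals stay inside expected ranges."""
    return (EXPECTED_ERROR_RATE[0] <= error_rate <= EXPECTED_ERROR_RATE[1]
            and EXPECTED_THROUGHPUT[0] <= throughput_rpm <= EXPECTED_THROUGHPUT[1])

# A 0.1% error rate still counts as steady state because it is expected behavior.
print(in_steady_state(error_rate=0.001, throughput_rpm=1200))   # True
print(in_steady_state(error_rate=0.015, throughput_rpm=1200))   # False: outside bounds
```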
Steady state is contextual:
Different systems have different steady state definitions. What's normal for a real-time trading platform (sub-millisecond latency requirements) is vastly different from what's normal for a batch data processing pipeline (where minutes or hours of latency are acceptable). Before hypothesizing, you must deeply understand the operational context your system runs in.
Not all metrics are equally useful for defining steady state. The most valuable steady state metrics are the ones most closely tied to whether customers can accomplish their goals.
The chaos engineering pioneers at Netflix explicitly recommend focusing on business metrics over technical metrics wherever possible. Why? Because a system can have perfect internal technical metrics (low CPU usage, zero errors in logs) while still failing to serve customers (perhaps due to logic bugs or upstream dependency failures).
| Metric Type | Example | Strength | Weakness |
|---|---|---|---|
| Business Metric | Orders completed per minute | Directly measures customer value | May have complex causality chains |
| User Experience Metric | Page load time at P95 | Customer-perceivable impact | Requires real user monitoring |
| Service Level Metric | API success rate | Standardized, easy to instrument | May miss end-to-end issues |
| Infrastructure Metric | CPU utilization percentage | Easy to collect, precise | Poor proxy for customer experience |
The Netflix Example:
Netflix famously uses "starts per second" (SPS)—the number of video streams that successfully begin per second—as their primary steady state indicator. This single metric encapsulates the entire system's health because a successful stream start requires a long chain of components (client-facing APIs, authentication, content metadata services, and delivery infrastructure) to work correctly.
If any of these components fail, SPS drops. The metric acts as a health indicator for the entire complex system without requiring explicit monitoring of every component.
Every system has its equivalent of SPS—a high-level metric that captures whether customers can successfully accomplish their goals. For a payment processor, it might be "successful transactions per second." For a messaging app, it might be "messages delivered within SLA per minute." For a search engine, it might be "searches returning results within latency target." Identify this metric for your system.
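As a sketch of what identifying such a metric can look like in practice, the snippet below derives an SPS-style primary indicator (here, successful order completions per minute) from a stream of raw events; the event shape and field names are assumptions for illustration.

```python
# Hypothetical sketch: computing a top-level "can customers accomplish their goal?"
# metric from raw events. The event structure is an assumption for illustration.
from datetime import datetime, timedelta
from typing import Dict, List

def completions_per_minute(events: List[Dict], now: datetime,
                           success_event: str = "order_completed") -> int:
    """Count successful goal completions in the trailing one-minute window."""
    window_start = now - timedelta(minutes=1)
    return sum(
        1 for event in events
        if event["type"] == success_event and window_start <= event["timestamp"] <= now
    )

now = datetime(2024, 5, 1, 14, 30)
events = [
    {"type": "order_completed", "timestamp": now - timedelta(seconds=20)},
    {"type": "order_completed", "timestamp": now - timedelta(seconds=45)},
    {"type": "order_failed", "timestamp": now - timedelta(seconds=10)},
    {"type": "order_completed", "timestamp": now - timedelta(minutes=5)},  # outside window
]
print(completions_per_minute(events, now))  # 2
```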
Once you've identified your steady state metrics, the next step is establishing baselines—the quantitative ranges that define "normal." This is more challenging than it might appear because production systems exhibit complex temporal patterns.
Time-based variations:
Most systems have inherent rhythms that must be accounted for:
A metric value that's alarming at 3 AM might be perfectly normal at 3 PM. Your baseline must account for these patterns, or your chaos experiments will generate false positives (detecting "anomalies" that are actually normal variation) or false negatives (missing real problems that are masked by high variance periods).
```python
import statistics
import datetime
from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum


class TimeWindow(Enum):
    HOUR = "hour"
    DAY_OF_WEEK = "day_of_week"
    TIME_OF_DAY = "time_of_day"


@dataclass
class BaselineMetric:
    """Represents a steady state baseline with time-aware bounds."""
    name: str
    mean: float
    std_dev: float
    p50: float
    p95: float
    p99: float
    min_acceptable: float
    max_acceptable: float
    time_window: TimeWindow
    window_value: str  # e.g., "Monday", "14:00", etc.

    def is_within_bounds(self, value: float) -> bool:
        """Check if a value falls within acceptable steady state range."""
        return self.min_acceptable <= value <= self.max_acceptable

    def deviation_severity(self, value: float) -> str:
        """Categorize how far a value deviates from expected."""
        if self.is_within_bounds(value):
            return "normal"

        # Calculate how many standard deviations away
        z_score = abs(value - self.mean) / self.std_dev if self.std_dev > 0 else float('inf')
        if z_score < 2:
            return "minor_deviation"
        elif z_score < 3:
            return "moderate_deviation"
        else:
            return "severe_deviation"


class SteadyStateBaseline:
    """
    Manages steady state baselines with time-aware segmentation.

    This enables comparing current metrics against historically
    appropriate baselines rather than global averages.
    """

    def __init__(self, metric_name: str):
        self.metric_name = metric_name
        self.baselines: Dict[str, BaselineMetric] = {}
        self.window_type: Optional[TimeWindow] = None  # set when baselines are built

    def build_from_history(
        self,
        historical_data: List[Dict],  # [{"timestamp": datetime, "value": float}]
        window_type: TimeWindow,
        num_std_devs: float = 3.0  # for setting acceptable bounds
    ) -> None:
        """Build time-segmented baselines from historical data."""
        self.window_type = window_type

        # Group data by time window
        grouped_data: Dict[str, List[float]] = {}
        for point in historical_data:
            window_key = self._get_window_key(point["timestamp"], window_type)
            if window_key not in grouped_data:
                grouped_data[window_key] = []
            grouped_data[window_key].append(point["value"])

        # Calculate baseline statistics for each time segment
        for window_key, values in grouped_data.items():
            if len(values) < 30:  # Need sufficient data points
                continue

            mean = statistics.mean(values)
            std_dev = statistics.stdev(values) if len(values) > 1 else 0
            sorted_values = sorted(values)

            baseline = BaselineMetric(
                name=self.metric_name,
                mean=mean,
                std_dev=std_dev,
                p50=self._percentile(sorted_values, 50),
                p95=self._percentile(sorted_values, 95),
                p99=self._percentile(sorted_values, 99),
                min_acceptable=mean - (num_std_devs * std_dev),
                max_acceptable=mean + (num_std_devs * std_dev),
                time_window=window_type,
                window_value=window_key
            )
            self.baselines[window_key] = baseline

    def evaluate(
        self,
        current_value: float,
        current_time: datetime.datetime
    ) -> Dict:
        """
        Evaluate a current metric value against the appropriate baseline.
        Returns detailed comparison results.
        """
        # Find the appropriate baseline for this time, using the same window
        # type the baselines were built with
        window_type = self.window_type or TimeWindow.TIME_OF_DAY
        window_key = self._get_window_key(current_time, window_type)

        if window_key not in self.baselines:
            return {
                "status": "no_baseline",
                "message": f"No baseline data for {window_key}"
            }

        baseline = self.baselines[window_key]
        return {
            "status": "evaluated",
            "in_steady_state": baseline.is_within_bounds(current_value),
            "severity": baseline.deviation_severity(current_value),
            "current_value": current_value,
            "expected_range": {
                "min": baseline.min_acceptable,
                "max": baseline.max_acceptable
            },
            "baseline_stats": {
                "mean": baseline.mean,
                "std_dev": baseline.std_dev,
                "p95": baseline.p95
            }
        }

    def _get_window_key(
        self,
        timestamp: datetime.datetime,
        window_type: TimeWindow
    ) -> str:
        """Generate a key for time-based bucketing."""
        if window_type == TimeWindow.HOUR:
            return str(timestamp.hour)
        elif window_type == TimeWindow.DAY_OF_WEEK:
            return timestamp.strftime("%A")
        elif window_type == TimeWindow.TIME_OF_DAY:
            # 2-hour windows
            hour_bucket = (timestamp.hour // 2) * 2
            return f"{hour_bucket:02d}:00-{hour_bucket+2:02d}:00"
        return "default"

    @staticmethod
    def _percentile(sorted_data: List[float], percentile: int) -> float:
        index = (percentile / 100) * (len(sorted_data) - 1)
        lower = int(index)
        upper = lower + 1
        if upper >= len(sorted_data):
            return sorted_data[-1]
        fraction = index - lower
        return sorted_data[lower] + fraction * (sorted_data[upper] - sorted_data[lower])
```

Baseline establishment process: collect several weeks of historical data for each steady state metric, segment it by the time windows that matter (hour of day, day of week), compute per-segment statistics, derive acceptable bounds (for example, mean ± 3 standard deviations), and rebuild the baselines periodically as the system evolves.
Avoid these mistakes:
1. Using global averages that ignore temporal patterns.
2. Setting bounds so tight that normal variation appears as anomalies.
3. Setting bounds so loose that genuine degradation is missed.
4. Never updating baselines as the system evolves.
5. Using infrastructure metrics like CPU as primary steady state indicators.
With steady state defined and baselines established, you can now formulate testable hypotheses. A chaos engineering hypothesis follows a specific structure:
Template: If [perturbation occurs], then [steady state metric] will remain within [acceptable bounds] because [resilience mechanism].
This structure forces you to:
- name the perturbation precisely rather than "break something and see what happens,"
- choose a measurable steady state metric,
- commit to quantitative acceptable bounds before the experiment runs, and
- articulate the resilience mechanism you believe will protect steady state.
The last element—explaining the resilience mechanism—is crucial. It transforms chaos from "let's see what happens" into "let's validate our understanding." If your hypothesis is confirmed, you've validated that a resilience mechanism works. If it's refuted, you've discovered a gap in your understanding or implementation.
The Power of Specific Predictions:
A strong hypothesis makes a specific, quantitative prediction. This specificity is essential. Vague hypotheses like "the system should be fine" provide no framework for evaluation. When your hypothesis states "P99 latency will remain below 200ms," you have a clear success criterion.
Writing specific hypotheses also forces you to confront what you don't know. If you find yourself unable to predict how the system will behave, that's valuable information—it reveals gaps in your mental model that should be addressed before the chaos experiment runs.
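A hedged sketch of how the template and its quantitative success criteria might be captured as structured data follows; the class and field names are illustrative assumptions, not part of any specific chaos engineering tool.

```python
# Illustrative sketch of the hypothesis template as structured data. The class and
# field names are assumptions, not from any specific chaos engineering tool.
from dataclasses import dataclass

@dataclass
class ChaosHypothesis:
    perturbation: str           # "If [perturbation occurs] ..."
    steady_state_metric: str    # "... then [steady state metric] ..."
    lower_bound: float          # "... will remain within [acceptable bounds] ..."
    upper_bound: float
    resilience_mechanism: str   # "... because [resilience mechanism]."

    def holds(self, observed_value: float) -> bool:
        """The hypothesis is confirmed if the metric stayed within its bounds."""
        return self.lower_bound <= observed_value <= self.upper_bound

hypothesis = ChaosHypothesis(
    perturbation="one of two database read replicas is terminated",
    steady_state_metric="checkout_p99_latency_ms",
    lower_bound=0.0,
    upper_bound=200.0,          # "P99 latency will remain below 200ms"
    resilience_mechanism="reads fail over to the surviving replica",
)
print(hypothesis.holds(observed_value=160.0))  # True: steady state was maintained
```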
Formulating hypotheses is not a one-time activity—it's part of an ongoing lifecycle that mirrors the scientific method:
Observe → Hypothesize → Experiment → Analyze → Iterate
This lifecycle transforms chaos engineering from a collection of ad-hoc tests into a systematic practice for building and validating confidence in system resilience.
Each cycle through this lifecycle produces artifacts: documented hypotheses, experiment results, and updated system understanding. Over months and years, these artifacts become invaluable institutional knowledge. New team members can review past experiments to understand why resilience mechanisms exist. Future experiments can build on validated hypotheses rather than starting from scratch.
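One lightweight way to capture those artifacts is to record each run in a structured form, as in the sketch below; the fields shown are assumptions about what is worth keeping, not a prescribed schema.

```python
# Sketch of an experiment record so hypotheses, results, and lessons accumulate as
# institutional knowledge. Field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ExperimentRecord:
    hypothesis: str                       # full "If ... then ... because ..." statement
    run_at: datetime
    steady_state_held: bool               # was the hypothesis confirmed?
    observations: List[str] = field(default_factory=list)       # what we saw and learned
    follow_up_actions: List[str] = field(default_factory=list)  # fixes or next experiments

record = ExperimentRecord(
    hypothesis="If a read replica fails, orders/minute stays within 20% of baseline "
               "because the connection pool redistributes load.",
    run_at=datetime(2024, 5, 1, 14, 0),
    steady_state_held=False,
    observations=["Connection pool took 90 seconds to evict dead connections"],
    follow_up_actions=["Tighten connection health-check interval", "Re-run the experiment"],
)
```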
Hypothesis evolution:
As you run experiments and gain confidence, hypotheses naturally become more sophisticated:
Early stage (building basic confidence): single-component failures with generous bounds, such as terminating one instance or losing one read replica.
Intermediate stage (exploring edge cases): degraded dependencies rather than clean failures, such as injected latency, partial error rates, or a cold cache.
Advanced stage (testing complex scenarios): combined and cascading conditions, such as a dependency failure during peak traffic or the loss of an entire availability zone.
This progression reflects growing system maturity and team confidence.
Even experienced teams make mistakes when defining steady state and formulating hypotheses; understanding the common antipatterns helps you avoid them.
The best hypotheses are those with genuine uncertainty about the outcome. If you're 100% confident the system will survive, the experiment teaches you nothing new. If you're 100% confident it will fail, you should fix the bug rather than run an experiment. Aim for hypotheses where you believe they'll pass but aren't certain.
Let's walk through a complete example of defining steady state and formulating hypotheses for a realistic e-commerce system.
AcmeShop is a mid-size e-commerce platform processing 500,000 orders per month. The system consists of: a React frontend, an API Gateway, a Product Service, an Inventory Service, an Order Service, a Payment Service (integrating with Stripe), a PostgreSQL primary with two read replicas, and Redis for caching.
Step 1: Identify the primary steady state metric
For AcmeShop, the business-critical outcome is successful order completion. We define:
Primary steady state metric: orders completed per minute
Baseline: 12 orders/minute on average, with hourly ranges from roughly 2 to 30
Acceptable bounds: within 20% of the hourly baseline
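A minimal sketch of how the "within 20% of the hourly baseline" check could be expressed follows; treating the bound as a symmetric band is an assumption for illustration.

```python
# Minimal sketch of AcmeShop's primary steady state check. Treating "within 20%"
# as a symmetric band around the hourly baseline is an assumption for illustration.
def orders_in_steady_state(observed_per_minute: float,
                           hourly_baseline: float,
                           tolerance: float = 0.20) -> bool:
    """Steady state holds if the observed rate stays within 20% of this hour's baseline."""
    lower = hourly_baseline * (1 - tolerance)
    upper = hourly_baseline * (1 + tolerance)
    return lower <= observed_per_minute <= upper

# In the 12:00-18:00 window the baseline might sit around 22 orders/minute:
print(orders_in_steady_state(observed_per_minute=19, hourly_baseline=22))  # True
print(orders_in_steady_state(observed_per_minute=15, hourly_baseline=22))  # False
```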
Step 2: Identify supporting metrics
To understand why orders might succeed or fail, we add supporting metrics:
```yaml
# AcmeShop Steady State Definition
steady_state:
  primary_metric:
    name: orders_completed_per_minute
    description: Number of orders successfully placed per minute
    baseline: 12
    variance_by_hour:
      "00:00-06:00": { min: 2, max: 8 }
      "06:00-12:00": { min: 8, max: 18 }
      "12:00-18:00": { min: 15, max: 30 }
      "18:00-24:00": { min: 10, max: 22 }
    alert_threshold:
      min_percentage_of_baseline: 80

  supporting_metrics:
    - name: api_gateway_p95_latency_ms
      baseline: 400
      max_acceptable: 1000
    - name: payment_success_rate
      baseline: 0.995
      min_acceptable: 0.98
    - name: database_read_p95_latency_ms
      baseline: 50
      max_acceptable: 200
    - name: cache_hit_rate
      baseline: 0.94
      min_acceptable: 0.80

  user_journeys:
    - name: browse_and_search
      metrics: [api_latency, cache_hit_rate]
      weight: 0.3  # 30% of user value
    - name: add_to_cart
      metrics: [api_latency, inventory_check_latency]
      weight: 0.2
    - name: checkout_and_pay
      metrics: [payment_success_rate, order_completion_rate]
      weight: 0.5  # 50% of user value - most critical
```

Step 3: Formulate hypotheses for key failure scenarios
With steady state defined, we can now form testable hypotheses:
| Scenario | Hypothesis | Mechanism | Success Criteria |
|---|---|---|---|
| 1 of 2 DB read replicas fails | Orders/minute stays within 20% of baseline | Connection pool redistributes load to surviving replica | DB read P95 < 200ms, orders > 80% baseline |
| Redis cache complete failure | API P95 latency increases but stays under 1s | Fallback to database with query optimization | API P95 < 1000ms, no increase in 5xx errors |
| Payment service 5s latency injection | 90% of checkouts complete; 10% time out with clear error | Timeout configuration and user-friendly error handling | Payment success rate > 90%, zero hung requests |
| Inventory service 50% error rate | Users see degraded experience but no crashes | Circuit breaker opens; cached inventory used | Browse journeys unaffected, checkout warns of potential stock issues |
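As an example of turning one table row into an automated check, the sketch below evaluates the success criteria for the "Redis cache complete failure" scenario; the metric names and observed values are hypothetical placeholders.

```python
# Hedged sketch: evaluating the success criteria for the "Redis cache complete
# failure" row. Metric names and observed values are hypothetical placeholders.
from typing import Dict

def cache_failure_hypothesis_holds(metrics: Dict[str, float]) -> bool:
    """Success criteria from the table: API P95 < 1000ms and no increase in 5xx errors."""
    api_p95_ok = metrics["api_gateway_p95_latency_ms"] < 1000
    no_5xx_increase = metrics["current_5xx_rate"] <= metrics["baseline_5xx_rate"]
    return api_p95_ok and no_5xx_increase

observed = {
    "api_gateway_p95_latency_ms": 870.0,  # degraded from the 400ms baseline, still in bounds
    "current_5xx_rate": 0.004,
    "baseline_5xx_rate": 0.005,
}
print(cache_failure_hypothesis_holds(observed))  # True: the hypothesis held
```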
The first principle of chaos engineering—hypothesizing about steady state—establishes the scientific foundation that makes chaos valuable rather than merely chaotic. Let's consolidate the key takeaways:
- Steady state means predictability within bounds, not perfection.
- Business and user-facing metrics are better steady state indicators than infrastructure metrics.
- Baselines must account for temporal patterns such as time of day and day of week, or experiments will produce false positives and false negatives.
- A good hypothesis names a specific perturbation, a measurable metric, quantitative bounds, and the resilience mechanism expected to protect steady state.
- Each experiment cycle (observe, hypothesize, experiment, analyze, iterate) adds to a growing body of validated institutional knowledge.
What's next:
With steady state defined and hypotheses formulated, we're ready to explore the second principle of chaos engineering: Vary Real-World Events. This principle guides what perturbations to introduce—moving from theory to the actual disruptions that test our hypotheses.
You now understand how to define steady state, establish time-aware baselines, and formulate testable hypotheses. This scientific foundation distinguishes chaos engineering from random destruction. Next, we'll explore what kinds of real-world events to simulate in our experiments.