Chaos engineering programs fail for many reasons—inadequate tooling, infrastructure complexity, lack of expertise—but the most common reason is starting too big. Organizations seduced by Netflix's tales of "Chaos Monkey randomly killing production servers" attempt to replicate that approach without recognizing that Netflix spent years building up to that level of chaos maturity. They had resilience baked in before chaos engineering became a formalized practice.
The result of premature ambition is predictable: a well-intentioned chaos experiment causes an uncontrolled outage, stakeholders lose trust, and the entire initiative is shelved indefinitely. The engineering team learns the wrong lesson—not that they started too aggressively, but that "chaos engineering doesn't work for us."
The paradox of chaos engineering adoption is this: The organizations that most need chaos engineering (those with unknown weaknesses and untested resilience) are least equipped to handle aggressive experiments. And the organizations that could handle aggressive experiments (those with mature resilience practices) often don't need them to discover basic weaknesses.
Starting small resolves this paradox. It allows you to build the muscle—tooling, processes, confidence, organizational trust—incrementally, matching your chaos ambitions to your actual chaos readiness.
By the end of this page, you will understand: (1) Why small-scale chaos experiments are strategically superior to ambitious ones during program inception; (2) How to identify ideal starting points for chaos experiments; (3) The specific techniques for scoping controlled experiments; (4) How to build organizational trust through incremental wins; and (5) The progression path from trivial experiments to meaningful resilience validation.
Before discussing tactical approaches to starting chaos engineering, we must understand the psychological landscape. Introducing chaos engineering isn't just a technical initiative—it's an organizational change effort that triggers deeply ingrained human responses.
The fear response
When you propose "intentionally breaking production systems," you trigger the amygdala response in stakeholders. Their brains interpret this as a threat to their job security, their team's reputation, and systems they've invested years building. This isn't irrational—it's evolutionary programming designed to protect against perceived danger.
Key stakeholder concerns include customer-visible impact, uncontrolled outages, damage to the team's hard-won reputation, and accountability when an experiment goes wrong.
These concerns are legitimate. A chaos experiment that causes customer impact creates real business harm. Your job isn't to dismiss these concerns but to demonstrate that you take them seriously through controlled, low-risk starting points.
In chaos engineering adoption, trust is your most valuable and scarcest resource. Every successful experiment deposits trust in your organizational bank account. Every failed experiment—especially one that causes customer impact—makes a massive withdrawal. Starting small ensures your early experiments are deposits, not withdrawals.
The change management parallel
Organizational change research (Kotter, Bridges, and others) consistently shows that successful transformation follows a predictable pattern: build urgency and a guiding coalition, generate visible short-term wins, and use those wins to consolidate momentum until the new practice is anchored in the culture.
Chaos engineering adoption follows this exact pattern. Starting small isn't just risk management—it's a deliberate change management strategy. The quick wins you generate become stories that spread through the organization, converting skeptics into allies and building momentum for expansion.
The ideal starting point for chaos engineering combines three characteristics: low risk, high learning potential, and visible success. Finding this intersection requires careful analysis of your systems and organization.
The Starting Point Matrix
Evaluate potential experiments across multiple dimensions:
| Dimension | Lower Risk (Good Start) | Higher Risk (Avoid Initially) |
|---|---|---|
| Environment | Non-production, staging, canary tier | Production at full traffic |
| Blast radius | Single instance, single service | Cross-cutting infrastructure, databases |
| Duration | Seconds to minutes | Hours to permanent |
| Reversibility | Auto-healing or instant rollback | Requires manual intervention or rebuild |
| Timing | Low-traffic periods, non-critical days | Peak traffic, launch days, holidays |
| Dependency | Leaf services with few dependents | Core services with many dependents |
| Data sensitivity | Read-only or ephemeral data | Writes, financial data, PII |
| Team readiness | Team understands resilience patterns | Team unfamiliar with failure modes |
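If you want to make this evaluation repeatable, the matrix can be encoded as a simple checklist. The following Python sketch is illustrative only; the field names and the idea of flagging higher-risk dimensions are assumptions for this page, not part of any particular chaos tool.

```python
from dataclasses import dataclass

@dataclass
class CandidateExperiment:
    """One candidate experiment evaluated against the starting-point matrix."""
    name: str
    non_production: bool         # Environment: staging/canary rather than full production
    single_instance_scope: bool  # Blast radius: one instance or single service
    short_duration: bool         # Duration: seconds to minutes
    auto_reversible: bool        # Reversibility: auto-healing or instant rollback
    off_peak_timing: bool        # Timing: low-traffic period, non-critical day
    leaf_service: bool           # Dependency: few downstream dependents
    no_sensitive_writes: bool    # Data sensitivity: read-only or ephemeral data
    team_ready: bool             # Team readiness: understands the failure modes

def risk_flags(candidate: CandidateExperiment) -> list[str]:
    """Return the dimensions where the candidate falls on the higher-risk side."""
    checks = {
        "environment": candidate.non_production,
        "blast radius": candidate.single_instance_scope,
        "duration": candidate.short_duration,
        "reversibility": candidate.auto_reversible,
        "timing": candidate.off_peak_timing,
        "dependency": candidate.leaf_service,
        "data sensitivity": candidate.no_sensitive_writes,
        "team readiness": candidate.team_ready,
    }
    return [dimension for dimension, lower_risk in checks.items() if not lower_risk]

candidate = CandidateExperiment(
    name="terminate one instance of a stateless service in staging",
    non_production=True, single_instance_scope=True, short_duration=True,
    auto_reversible=True, off_peak_timing=True, leaf_service=True,
    no_sensitive_writes=True, team_ready=True,
)
print(risk_flags(candidate) or "good first experiment: every dimension is on the lower-risk side")
```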
Identifying candidate systems
Not all systems are equally suitable for introductory chaos experiments. Look for systems that:
Have existing resilience mechanisms — Systems already built with retries, timeouts, circuit breakers, and graceful degradation are ideal. Chaos experiments will validate these mechanisms work, generating quick wins.
Are well-understood — Teams should know how the system behaves under normal conditions. Without this baseline, you can't distinguish between chaos-induced behavior and normal variations.
Have comprehensive observability — You need to detect when something goes wrong (and when it doesn't). Systems with extensive metrics, logging, and alerting make experiment analysis straightforward.
Have a low tolerance for customer impact — Counter-intuitively, systems whose teams already care deeply about resilience make better starting points. These teams are motivated to learn and already have the defensive infrastructure in place.
Have engaged, willing teams — Forcing chaos engineering on resistant teams destroys trust. Start with teams that volunteer or show enthusiasm.
The perfect first experiment validates an existing resilience mechanism that the team believes is working. Example: If a team has implemented retries for a downstream dependency, inject a failure to confirm retries actually happen. When the experiment succeeds, it validates both the team's work and the chaos engineering approach. Everyone wins.
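As a concrete illustration, a team that has wrapped a downstream call in retries could validate the mechanism locally with a test like the one below. The `FlakyDependency` stub and the `call_with_retries` helper are hypothetical stand-ins for the team's real dependency and retry wrapper, not an existing API.

```python
import time

class FlakyDependency:
    """Stub downstream service that fails a fixed number of times before recovering."""
    def __init__(self, failures_before_success: int):
        self.failures_remaining = failures_before_success
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.failures_remaining > 0:
            self.failures_remaining -= 1
            raise ConnectionError("injected failure")
        return "ok"

def call_with_retries(fn, attempts: int = 3, backoff_seconds: float = 0.1):
    """Hypothetical retry wrapper standing in for the team's real resilience code."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds)

def test_retries_absorb_transient_failures():
    # Hypothesis: two injected failures are absorbed by the retry policy
    # with no error surfaced to the caller.
    dependency = FlakyDependency(failures_before_success=2)
    assert call_with_retries(dependency.fetch) == "ok"
    assert dependency.calls == 3  # two failed attempts plus the successful one

test_retries_absorb_transient_failures()
print("retry mechanism behaved as hypothesized")
```

When the test passes, the chaos experiment has validated the team's own work, which is exactly the kind of quick win that builds trust.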
Environment progression
Your first experiments should not be in production. While the goal of chaos engineering is ultimately production validation (since only production represents the real system), you must build capability in lower-risk environments first:
Local Development → CI/CD Pipeline → Staging → Canary → Production (Limited) → Production (Full)
Local Development: Developers run chaos experiments against their local environment during development. This catches resilience gaps before code is even committed.
CI/CD Pipeline: Automated chaos tests run as part of deployment pipelines. Failed experiments prevent deployment, catching regressions early.
Staging: Full chaos experiments run against staging environments that mirror production. This validates resilience under realistic (though not identical) conditions.
Canary: Experiments run against a small percentage of production traffic. This introduces real-world conditions while limiting blast radius.
Production (Limited): Experiments target specific regions, availability zones, or customer segments. Real production conditions with controlled scope.
Production (Full): The ultimate goal—experiments can safely run against the entire production fleet. This indicates mature chaos practices.
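One way to keep this progression honest is to encode each stage together with the blast radius your program currently permits there. The sketch below is a minimal example of that idea; the stage limits and the `experiment_allowed` gate are placeholder policy you would tune for your own organization, not a standard.

```python
from enum import Enum

class Stage(str, Enum):
    LOCAL = "local development"
    PIPELINE = "ci/cd pipeline"
    STAGING = "staging"
    CANARY = "canary"
    PROD_LIMITED = "production (limited)"
    PROD_FULL = "production (full)"

# Placeholder policy: fraction of instances/traffic an experiment may touch at each stage.
MAX_BLAST_RADIUS = {
    Stage.LOCAL: 1.0,
    Stage.PIPELINE: 1.0,
    Stage.STAGING: 0.5,
    Stage.CANARY: 0.05,
    Stage.PROD_LIMITED: 0.01,
    Stage.PROD_FULL: 1.0,   # full-fleet experiments only once practices are mature
}

def experiment_allowed(stage: Stage, requested_blast_radius: float, program_maturity: Stage) -> bool:
    """Permit an experiment only if the program has reached this stage and the
    requested blast radius stays within the stage's limit."""
    stages = list(Stage)
    if stages.index(stage) > stages.index(program_maturity):
        return False  # the program has not yet earned this environment
    return requested_blast_radius <= MAX_BLAST_RADIUS[stage]

# A program still at the staging stage cannot yet run canary experiments.
print(experiment_allowed(Stage.CANARY, 0.05, program_maturity=Stage.STAGING))   # False
print(experiment_allowed(Stage.STAGING, 0.25, program_maturity=Stage.STAGING))  # True
```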
Your first chaos experiments should be boring. They should validate expected behavior, not explore unknown territories. Surprising results are interesting for learning but dangerous for program adoption—if your first experiment reveals a critical flaw, the narrative becomes "chaos engineering broke our system" rather than "chaos engineering helped us find and fix a problem."
The Validation Experiment Pattern
Instead of asking "What will break if we inject this failure?", ask "Will our existing resilience mechanisms respond as expected to this failure?" The hypothesis should predict success:
"We hypothesize that when we terminate one instance of Service A, traffic will automatically route to remaining instances with no customer-visible impact and no alerting pages beyond informational notifications."
If the hypothesis is confirmed, you've validated resilience. If the hypothesis is disproved, you've discovered a gap—but in a controlled context where you chose the timing and scope.
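A validation experiment can be written down as a small declarative record before anything runs. The sketch below shows one possible shape for that record; the field names are an assumed schema for illustration, not the format of any specific chaos tool.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationExperiment:
    """Declarative record of a hypothesis-driven chaos experiment (illustrative schema)."""
    name: str
    hypothesis: str                  # predicts that existing resilience mechanisms work
    method: str                      # how the failure will be injected
    blast_radius: str                # explicit limit on what can be affected
    max_duration_seconds: int        # experiment stops automatically after this
    abort_conditions: list[str] = field(default_factory=list)   # kill-switch triggers
    success_criteria: list[str] = field(default_factory=list)
    rollback_procedure: str = ""
    escalation_path: str = ""

experiment = ValidationExperiment(
    name="service-a-instance-termination",
    hypothesis=("Terminating one instance of Service A routes traffic to the remaining "
                "instances with no customer-visible impact and no paging alerts."),
    method="Terminate one randomly selected Service A instance in the staging environment.",
    blast_radius="Up to 1 instance of Service A; no other services targeted.",
    max_duration_seconds=300,
    abort_conditions=["error rate > 1% for 60s", "p99 latency > 2x baseline"],
    success_criteria=["zero failed customer requests", "instance replaced automatically"],
    rollback_procedure="Manually launch a replacement instance if auto-healing does not.",
    escalation_path="Page the on-call engineer for Service A.",
)
```

Filling in the hypothesis, blast radius, and abort conditions before execution also satisfies the documentation discipline discussed below.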
Experiment scope controls
Every experiment should have explicit controls that limit potential impact:
Kill switches: A mechanism to immediately abort the experiment if something goes wrong. This could be a button, a command, or an automatic trigger based on metrics.
Time limits: Experiments should have maximum durations. If the kill switch isn't activated, the experiment automatically stops after a defined period.
Blast radius limits: Define exactly what can be affected. "Up to 1 instance" or "up to 5% of traffic" are explicit limits that prevent cascading impact.
Traffic exclusions: Certain traffic should be excluded from experiments. VIP customers, critical transactions, or traffic from regions with regulatory requirements might be off-limits.
Rollback procedures: Document exactly how to restore normal operations if the kill switch fails or takes too long.
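These controls are most reliable when the experiment runner enforces them mechanically rather than depending on an operator to remember them. The following minimal harness sketch assumes hypothetical `inject_failure`, `stop_injection`, and `current_error_rate` hooks that you would wire to your own injection tooling and monitoring.

```python
import time

def run_with_controls(inject_failure, stop_injection, current_error_rate,
                      max_duration_seconds: float = 300.0,
                      error_rate_kill_threshold: float = 0.01,
                      poll_interval_seconds: float = 5.0) -> str:
    """Run a fault injection with a time limit and an automatic kill switch."""
    inject_failure()                      # begin the experiment within the agreed blast radius
    started = time.monotonic()
    try:
        while time.monotonic() - started < max_duration_seconds:
            if current_error_rate() > error_rate_kill_threshold:
                return "aborted: kill switch triggered by error-rate threshold"
            time.sleep(poll_interval_seconds)
        return "completed: time limit reached without breaching thresholds"
    finally:
        stop_injection()                  # stop/rollback path runs whether we complete or abort

# Example wiring with no-op hooks for a dry run.
result = run_with_controls(
    inject_failure=lambda: print("injecting: terminate 1 instance"),
    stop_injection=lambda: print("stopping injection"),
    current_error_rate=lambda: 0.0,
    max_duration_seconds=10, poll_interval_seconds=2,
)
print(result)
```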
Your first experiments should be completely documented before execution: hypothesis, method, scope limits, kill switch procedure, rollback procedure, success criteria, failure criteria, and escalation path. The discipline of documentation prevents scope creep during execution and provides artifacts for post-experiment review. Ad-hoc experiments during program inception are the fastest path to credibility destruction.
Each successful chaos experiment is a brick in the foundation of organizational trust. Your job during the "starting small" phase is to accumulate enough successful experiments that chaos engineering becomes an accepted—even welcomed—practice rather than a perceived threat.
The Trust Accumulation Model
Think of organizational trust as a currency with deposits and withdrawals:
| Action | Trust Impact |
|---|---|
| Successful experiment with no customer impact | +1 |
| Experiment that finds a fixable bug before customers | +3 |
| Experiment that prevents a potential outage | +5 |
| Experiment that causes minor customer impact | -3 |
| Experiment that causes significant customer impact | -10 |
| Uncontrolled experiment that requires incident response | -20 |
The math is clear: you need many successful experiments to build enough trust capital that a single problematic experiment doesn't bankrupt the program. Ten clean experiments plus two early bug finds add +16, enough to absorb a minor-impact mistake (-3), but a single uncontrolled experiment (-20) erases the entire balance. This is why starting small is essential—it maximizes the opportunity for low-risk deposits.
Every successful chaos experiment should be documented and shared. Create a "Chaos Wins" channel in Slack, publish a monthly chaos newsletter, or present findings at engineering all-hands. Visibility of successes builds confidence across the organization and attracts teams eager to participate. The story of "we found a bug before customers did" is compelling regardless of how small the bug was.
The progression of trust
As trust accumulates, your chaos program gains permissions:
Level 0: Skepticism — "Why would we intentionally break things?"
Level 1: Tolerance — "Okay, try it, but be careful."
Level 2: Interest — "That's interesting. Can you do it for my service?"
Level 3: Expectation — "Why hasn't chaos engineering validated this?"
Level 4: Requirement — "Services must pass chaos validation before launch."
Each level requires accumulating sufficient trust at the previous level. Attempting to skip levels—running aggressive production experiments before building tolerance—burns trust faster than you can earn it.
Building a chaos engineering practice from scratch requires a structured approach. The first 90 days set the trajectory for long-term success or failure. Here's a concrete playbook for starting small and building momentum.
Days 1-30: Foundation
The first month focuses on building capability, not running ambitious experiments.
Week 1-2: Education and Alignment
Brief leadership and candidate teams on what chaos engineering is and is not, align on goals, and secure a written charter committing to a trial period.
Week 3-4: Tooling and Environment
Select failure-injection tooling and get it running in a non-production environment, with the goal of injecting failures reliably by day 30.
Days 31-60: First Experiments
The second month focuses on generating initial wins with minimal risk.
Week 5-6: Validation Experiments
Run the first validation experiments in staging, confirming that existing resilience mechanisms behave as expected and documenting every finding.
Week 7-8: Expanding Scope
Graduate from staging to a first tightly scoped production experiment with explicit blast radius limits, aiming for zero customer impact. The table below summarizes milestones across the full 90 days.
| Day | Milestone | Success Indicator |
|---|---|---|
| 15 | Charter approved by leadership | Written commitment to trial period |
| 30 | Tooling operational in non-prod | Able to inject failures reliably |
| 45 | 3+ staging experiments completed | All with documented findings |
| 60 | 1+ production experiment completed | Zero customer impact |
| 75 | 2+ teams requesting experiments | Inbound interest from outside initial team |
| 90 | Ongoing experiment schedule established | Regular cadence, not ad-hoc |
Days 61-90: Establishing Rhythm
The third month focuses on transitioning from project to practice.
Week 9-10: Process Formalization
Establish a regular experiment cadence, an intake path for interested teams, and documentation standards so chaos work becomes routine rather than ad hoc.
Week 11-12: Metrics and Reporting
Begin tracking program metrics and sharing results with stakeholders; the measurement section below suggests targets for the first 90 days.
Anti-patterns to avoid in the first 90 days include running ad-hoc, undocumented experiments, jumping straight to production, letting experiment scope creep beyond documented limits, and forcing participation on teams that have not opted in.
Even with the best intentions, chaos engineering programs make predictable mistakes during the "starting small" phase. Understanding these anti-patterns helps you avoid them.
Every chaos finding should be reframed as a positive: "We discovered this in a controlled experiment before customers discovered it in production." This positions chaos engineering as a protection mechanism, not a criticism engine. The team that owns the affected service should feel grateful, not blamed.
The "starting too big" recovery playbook
If you've already made the mistake of starting too big and experienced a problematic experiment, recovery is possible but requires deliberate effort: acknowledge the impact openly, run a blameless review of what went wrong, rebuild credibility with a series of small, tightly scoped experiments, and communicate each success as you re-earn the trust you withdrew.
Recovery takes longer than doing it right the first time, but chaos engineering programs can survive early mistakes if leadership responds with humility and deliberate improvement.
Starting small is a strategy, not an end state. To justify progression to larger experiments, you need measurable outcomes from your initial experiments. These metrics serve dual purposes: demonstrating value to stakeholders and guiding your own program development.
| Metric | What It Measures | Target (First 90 Days) |
|---|---|---|
| Experiments Completed | Program activity level | 10-15 experiments |
| Findings Discovered | Value generation | 3-5 novel findings |
| Fixes Implemented | Actual reliability improvement | 2-3 fixes deployed |
| Experiments with Zero Impact | Safety and control | 90% zero-impact |
| Participating Teams | Organizational adoption | 2-4 teams engaged |
| Mean Experiment Duration | Efficiency and control | <10 minutes active injection |
| Blast Radius Consistency | Process discipline | 100% within defined limits |
| Documentation Completeness | Knowledge management | 100% experiments documented |
Qualitative outcomes
Not all valuable outcomes are quantifiable. Pay attention to how teams talk about failure: whether experiment findings are referenced in design discussions, whether skeptics start asking questions instead of objecting, and whether teams begin volunteering their own services for experiments.
These qualitative signals often precede quantitative improvements and indicate cultural shift occurring beneath the surface.
The strongest signal of successful "starting small" is when someone outside your initial group proposes a chaos experiment for their own service. This indicates that chaos engineering has crossed from being your project to being an organizational practice. When inbound requests exceed your capacity to support them, you've graduated from "starting small" to scaling.
Starting small isn't about lack of ambition—it's about strategic patience. The organizations with the most mature chaos engineering practices all started with trivial experiments, built trust through incremental wins, and expanded scope only when they'd earned the organizational permission to do so.
Let's consolidate the key principles:
Match ambition to readiness: Start with experiments your organization can absorb today, not the experiments a chaos-mature organization runs.
Pick the right starting point: Combine low risk, high learning potential, and visible success, and begin by validating resilience mechanisms the team believes already work.
Progress through environments deliberately: Local development, CI/CD, staging, and canary come before limited and then full production.
Control every experiment: Kill switches, time limits, blast radius limits, traffic exclusions, and rollback procedures, all documented before execution.
Treat trust as currency: Small successful experiments are deposits, visible wins compound them, and scope expands only as trust accumulates.
Measure outcomes: Track quantitative metrics and qualitative signals to justify progression from small experiments to meaningful resilience validation.
What's next:
With a solid foundation of small, successful experiments generating organizational trust, the next challenge is securing executive buy-in—transforming chaos engineering from an engineering experiment into a funded, staffed organizational priority. The next page explores how to build the business case, navigate executive conversations, and secure the resources needed for chaos engineering to scale.
You now understand why starting small is strategically superior to ambitious chaos engineering launches. You have a framework for identifying starting points, designing safe experiments, building trust incrementally, and avoiding common anti-patterns. Next, we'll explore how to translate this foundational success into executive sponsorship and organizational commitment.