Chaos engineering programs fail for many reasons—inadequate tooling, infrastructure complexity, lack of expertise—but the most common reason is starting too big. Organizations seduced by Netflix's tales of "Chaos Monkey randomly killing production servers" attempt to replicate that approach without recognizing that Netflix spent years building up to that level of chaos maturity. They had resilience baked in before chaos engineering became a formalized practice.
The result of premature ambition is predictable: a well-intentioned chaos experiment causes an uncontrolled outage, stakeholders lose trust, and the entire initiative is shelved indefinitely. The engineering team learns the wrong lesson—not that they started too aggressively, but that "chaos engineering doesn't work for us."
The paradox of chaos engineering adoption is this: The organizations that most need chaos engineering (those with unknown weaknesses and untested resilience) are least equipped to handle aggressive experiments. And the organizations that could handle aggressive experiments (those with mature resilience practices) often don't need them to discover basic weaknesses.
Starting small resolves this paradox. It allows you to build the muscle—tooling, processes, confidence, organizational trust—incrementally, matching your chaos ambitions to your actual chaos readiness.
By the end of this page, you will understand: (1) Why small-scale chaos experiments are strategically superior to ambitious ones during program inception; (2) How to identify ideal starting points for chaos experiments; (3) The specific techniques for scoping controlled experiments; (4) How to build organizational trust through incremental wins; and (5) The progression path from trivial experiments to meaningful resilience validation.
Before discussing tactical approaches to starting chaos engineering, we must understand the psychological landscape. Introducing chaos engineering isn't just a technical initiative—it's an organizational change effort that triggers deeply ingrained human responses.
The fear response
When you propose "intentionally breaking production systems," you trigger the amygdala response in stakeholders. Their brains interpret this as a threat to their job security, their team's reputation, and systems they've invested years building. This isn't irrational—it's evolutionary programming designed to protect against perceived danger.
Key stakeholder concerns include customer-visible impact, uncontrolled outages, damage to the team's hard-won reputation, and accountability when an experiment goes wrong.
These concerns are legitimate. A chaos experiment that causes customer impact creates real business harm. Your job isn't to dismiss these concerns but to demonstrate that you take them seriously through controlled, low-risk starting points.
In chaos engineering adoption, trust is your most valuable and scarcest resource. Every successful experiment deposits trust in your organizational bank account. Every failed experiment—especially one that causes customer impact—makes a massive withdrawal. Starting small ensures your early experiments are deposits, not withdrawals.
The change management parallel
Organizational change research (Kotter, Bridges, and others) consistently shows that successful transformation follows a predictable pattern: build urgency and a guiding coalition, generate visible short-term wins, and use those wins to consolidate momentum until the new practice is anchored in the culture.
Chaos engineering adoption follows this exact pattern. Starting small isn't just risk management—it's a deliberate change management strategy. The quick wins you generate become stories that spread through the organization, converting skeptics into allies and building momentum for expansion.
The ideal starting point for chaos engineering combines three characteristics: low risk, high learning potential, and visible success. Finding this intersection requires careful analysis of your systems and organization.
The Starting Point Matrix
Evaluate potential experiments across multiple dimensions:
| Dimension | Lower Risk (Good Start) | Higher Risk (Avoid Initially) |
|---|---|---|
| Environment | Non-production, staging, canary tier | Production at full traffic |
| Blast radius | Single instance, single service | Cross-cutting infrastructure, databases |
| Duration | Seconds to minutes | Hours to permanent |
| Reversibility | Auto-healing or instant rollback | Requires manual intervention or rebuild |
| Timing | Low-traffic periods, non-critical days | Peak traffic, launch days, holidays |
| Dependency | Leaf services with few dependents | Core services with many dependents |
| Data sensitivity | Read-only or ephemeral data | Writes, financial data, PII |
| Team readiness | Team understands resilience patterns | Team unfamiliar with failure modes |
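If you want to make this evaluation repeatable, the matrix can be encoded as a simple checklist. The following Python sketch is illustrative only; the field names and the idea of flagging higher-risk dimensions are assumptions for this page, not part of any particular chaos tool.

```python
from dataclasses import dataclass

@dataclass
class CandidateExperiment:
    """One candidate experiment evaluated against the starting-point matrix."""
    name: str
    non_production: bool         # Environment: staging/canary rather than full production
    single_instance_scope: bool  # Blast radius: one instance or single service
    short_duration: bool         # Duration: seconds to minutes
    auto_reversible: bool        # Reversibility: auto-healing or instant rollback
    off_peak_timing: bool        # Timing: low-traffic period, non-critical day
    leaf_service: bool           # Dependency: few downstream dependents
    no_sensitive_writes: bool    # Data sensitivity: read-only or ephemeral data
    team_ready: bool             # Team readiness: understands the failure modes

def risk_flags(candidate: CandidateExperiment) -> list[str]:
    """Return the dimensions where the candidate falls on the higher-risk side."""
    checks = {
        "environment": candidate.non_production,
        "blast radius": candidate.single_instance_scope,
        "duration": candidate.short_duration,
        "reversibility": candidate.auto_reversible,
        "timing": candidate.off_peak_timing,
        "dependency": candidate.leaf_service,
        "data sensitivity": candidate.no_sensitive_writes,
        "team readiness": candidate.team_ready,
    }
    return [dimension for dimension, lower_risk in checks.items() if not lower_risk]

candidate = CandidateExperiment(
    name="terminate one instance of a stateless service in staging",
    non_production=True, single_instance_scope=True, short_duration=True,
    auto_reversible=True, off_peak_timing=True, leaf_service=True,
    no_sensitive_writes=True, team_ready=True,
)
print(risk_flags(candidate) or "good first experiment: every dimension is on the lower-risk side")
```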
Identifying candidate systems
Not all systems are equally suitable for introductory chaos experiments. Look for systems that:
Have existing resilience mechanisms — Systems already built with retries, timeouts, circuit breakers, and graceful degradation are ideal. Chaos experiments will validate these mechanisms work, generating quick wins.
Are well-understood — Teams should know how the system behaves under normal conditions. Without this baseline, you can't distinguish between chaos-induced behavior and normal variations.
Have comprehensive observability — You need to detect when something goes wrong (and when it doesn't). Systems with extensive metrics, logging, and alerting make experiment analysis straightforward.
Have a low tolerance for customer impact — Counter-intuitively, systems whose teams already care deeply about resilience make better starting points. These teams are motivated to learn and already have the defensive infrastructure in place.
Have engaged, willing teams — Forcing chaos engineering on resistant teams destroys trust. Start with teams that volunteer or show enthusiasm.
The perfect first experiment validates an existing resilience mechanism that the team believes is working. Example: If a team has implemented retries for a downstream dependency, inject a failure to confirm retries actually happen. When the experiment succeeds, it validates both the team's work and the chaos engineering approach. Everyone wins.
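As a concrete illustration, a team that has wrapped a downstream call in retries could validate the mechanism locally with a test like the one below. The `FlakyDependency` stub and the `call_with_retries` helper are hypothetical stand-ins for the team's real dependency and retry wrapper, not an existing API.

```python
import time

class FlakyDependency:
    """Stub downstream service that fails a fixed number of times before recovering."""
    def __init__(self, failures_before_success: int):
        self.failures_remaining = failures_before_success
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.failures_remaining > 0:
            self.failures_remaining -= 1
            raise ConnectionError("injected failure")
        return "ok"

def call_with_retries(fn, attempts: int = 3, backoff_seconds: float = 0.1):
    """Hypothetical retry wrapper standing in for the team's real resilience code."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds)

def test_retries_absorb_transient_failures():
    # Hypothesis: two injected failures are absorbed by the retry policy
    # with no error surfaced to the caller.
    dependency = FlakyDependency(failures_before_success=2)
    assert call_with_retries(dependency.fetch) == "ok"
    assert dependency.calls == 3  # two failed attempts plus the successful one

test_retries_absorb_transient_failures()
print("retry mechanism behaved as hypothesized")
```

When the test passes, the chaos experiment has validated the team's own work, which is exactly the kind of quick win that builds trust.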
Environment progression
Your first experiments should not be in production. While the goal of chaos engineering is ultimately production validation (since only production represents the real system), you must build capability in lower-risk environments first:
Local Development → CI/CD Pipeline → Staging → Canary → Production (Limited) → Production (Full)
Local Development: Developers run chaos experiments against their local environment during development. This catches resilience gaps before code is even committed.
CI/CD Pipeline: Automated chaos tests run as part of deployment pipelines. Failed experiments prevent deployment, catching regressions early.
Staging: Full chaos experiments run against staging environments that mirror production. This validates resilience under realistic (though not identical) conditions.
Canary: Experiments run against a small percentage of production traffic. This introduces real-world conditions while limiting blast radius.
Production (Limited): Experiments target specific regions, availability zones, or customer segments. Real production conditions with controlled scope.
Production (Full): The ultimate goal—experiments can safely run against the entire production fleet. This indicates mature chaos practices.
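One way to keep this progression honest is to encode each stage together with the blast radius your program currently permits there. The sketch below is a minimal example of that idea; the stage limits and the `experiment_allowed` gate are placeholder policy you would tune for your own organization, not a standard.

```python
from enum import Enum

class Stage(str, Enum):
    LOCAL = "local development"
    PIPELINE = "ci/cd pipeline"
    STAGING = "staging"
    CANARY = "canary"
    PROD_LIMITED = "production (limited)"
    PROD_FULL = "production (full)"

# Placeholder policy: fraction of instances/traffic an experiment may touch at each stage.
MAX_BLAST_RADIUS = {
    Stage.LOCAL: 1.0,
    Stage.PIPELINE: 1.0,
    Stage.STAGING: 0.5,
    Stage.CANARY: 0.05,
    Stage.PROD_LIMITED: 0.01,
    Stage.PROD_FULL: 1.0,   # full-fleet experiments only once practices are mature
}

def experiment_allowed(stage: Stage, requested_blast_radius: float, program_maturity: Stage) -> bool:
    """Permit an experiment only if the program has reached this stage and the
    requested blast radius stays within the stage's limit."""
    stages = list(Stage)
    if stages.index(stage) > stages.index(program_maturity):
        return False  # the program has not yet earned this environment
    return requested_blast_radius <= MAX_BLAST_RADIUS[stage]

# A program still at the staging stage cannot yet run canary experiments.
print(experiment_allowed(Stage.CANARY, 0.05, program_maturity=Stage.STAGING))   # False
print(experiment_allowed(Stage.STAGING, 0.25, program_maturity=Stage.STAGING))  # True
```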
Your first chaos experiments should be boring. They should validate expected behavior, not explore unknown territories. Surprising results are interesting for learning but dangerous for program adoption—if your first experiment reveals a critical flaw, the narrative becomes "chaos engineering broke our system" rather than "chaos engineering helped us find and fix a problem."
The Validation Experiment Pattern
Instead of asking "What will break if we inject this failure?", ask "Will our existing resilience mechanisms respond as expected to this failure?" The hypothesis should predict success:
"We hypothesize that when we terminate one instance of Service A, traffic will automatically route to remaining instances with no customer-visible impact and no alerting pages beyond informational notifications."
If the hypothesis is confirmed, you've validated resilience. If the hypothesis is disproved, you've discovered a gap—but in a controlled context where you chose the timing and scope.
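A validation experiment can be written down as a small declarative record before anything runs. The sketch below shows one possible shape for that record; the field names are an assumed schema for illustration, not the format of any specific chaos tool.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationExperiment:
    """Declarative record of a hypothesis-driven chaos experiment (illustrative schema)."""
    name: str
    hypothesis: str                  # predicts that existing resilience mechanisms work
    method: str                      # how the failure will be injected
    blast_radius: str                # explicit limit on what can be affected
    max_duration_seconds: int        # experiment stops automatically after this
    abort_conditions: list[str] = field(default_factory=list)   # kill-switch triggers
    success_criteria: list[str] = field(default_factory=list)
    rollback_procedure: str = ""
    escalation_path: str = ""

experiment = ValidationExperiment(
    name="service-a-instance-termination",
    hypothesis=("Terminating one instance of Service A routes traffic to the remaining "
                "instances with no customer-visible impact and no paging alerts."),
    method="Terminate one randomly selected Service A instance in the staging environment.",
    blast_radius="Up to 1 instance of Service A; no other services targeted.",
    max_duration_seconds=300,
    abort_conditions=["error rate > 1% for 60s", "p99 latency > 2x baseline"],
    success_criteria=["zero failed customer requests", "instance replaced automatically"],
    rollback_procedure="Manually launch a replacement instance if auto-healing does not.",
    escalation_path="Page the on-call engineer for Service A.",
)
```

Filling in the hypothesis, blast radius, and abort conditions before execution also satisfies the documentation discipline discussed below.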
Experiment scope controls
Every experiment should have explicit controls that limit potential impact:
Kill switches: A mechanism to immediately abort the experiment if something goes wrong. This could be a button, a command, or an automatic trigger based on metrics.
Time limits: Experiments should have maximum durations. If the kill switch isn't activated, the experiment automatically stops after a defined period.
Blast radius limits: Define exactly what can be affected. "Up to 1 instance" or "up to 5% of traffic" are explicit limits that prevent cascading impact.
Traffic exclusions: Certain traffic should be excluded from experiments. VIP customers, critical transactions, or traffic from regions with regulatory requirements might be off-limits.
Rollback procedures: Document exactly how to restore normal operations if the kill switch fails or takes too long.
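These controls are most reliable when the experiment runner enforces them mechanically rather than depending on an operator to remember them. The following minimal harness sketch assumes hypothetical `inject_failure`, `stop_injection`, and `current_error_rate` hooks that you would wire to your own injection tooling and monitoring.

```python
import time

def run_with_controls(inject_failure, stop_injection, current_error_rate,
                      max_duration_seconds: float = 300.0,
                      error_rate_kill_threshold: float = 0.01,
                      poll_interval_seconds: float = 5.0) -> str:
    """Run a fault injection with a time limit and an automatic kill switch."""
    inject_failure()                      # begin the experiment within the agreed blast radius
    started = time.monotonic()
    try:
        while time.monotonic() - started < max_duration_seconds:
            if current_error_rate() > error_rate_kill_threshold:
                return "aborted: kill switch triggered by error-rate threshold"
            time.sleep(poll_interval_seconds)
        return "completed: time limit reached without breaching thresholds"
    finally:
        stop_injection()                  # stop/rollback path runs whether we complete or abort

# Example wiring with no-op hooks for a dry run.
result = run_with_controls(
    inject_failure=lambda: print("injecting: terminate 1 instance"),
    stop_injection=lambda: print("stopping injection"),
    current_error_rate=lambda: 0.0,
    max_duration_seconds=10, poll_interval_seconds=2,
)
print(result)
```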
Your first experiments should be completely documented before execution: hypothesis, method, scope limits, kill switch procedure, rollback procedure, success criteria, failure criteria, and escalation path. The discipline of documentation prevents scope creep during execution and provides artifacts for post-experiment review. Ad-hoc experiments during program inception are the fastest path to credibility destruction.
Each successful chaos experiment is a brick in the foundation of organizational trust. Your job during the "starting small" phase is to accumulate enough successful experiments that chaos engineering becomes an accepted—even welcomed—practice rather than a perceived threat.
The Trust Accumulation Model
Think of organizational trust as a currency with deposits and withdrawals:
| Action | Trust Impact |
|---|---|
| Successful experiment with no customer impact | +1 |
| Experiment that finds a fixable bug before customers | +3 |
| Experiment that prevents a potential outage | +5 |
| Experiment that causes minor customer impact | -3 |
| Experiment that causes significant customer impact | -10 |
| Uncontrolled experiment that requires incident response | -20 |
The math is clear: you need many successful experiments to build enough trust capital that a single problematic experiment doesn't bankrupt the program. Ten clean experiments plus two early bug finds add +16, enough to absorb a minor-impact mistake (-3), but a single uncontrolled experiment (-20) erases the entire balance. This is why starting small is essential—it maximizes the opportunity for low-risk deposits.
Every successful chaos experiment should be documented and shared. Create a "Chaos Wins" channel in Slack, publish a monthly chaos newsletter, or present findings at engineering all-hands. Visibility of successes builds confidence across the organization and attracts teams eager to participate. The story of "we found a bug before customers did" is compelling regardless of how small the bug was.
The progression of trust
As trust accumulates, your chaos program gains permissions:
Level 0: Skepticism — "Why would we intentionally break things?"
Level 1: Tolerance — "Okay, try it, but be careful."
Level 2: Interest — "That's interesting. Can you do it for my service?"
Level 3: Expectation — "Why hasn't chaos engineering validated this?"
Level 4: Requirement — "Services must pass chaos validation before launch."
Each level requires accumulating sufficient trust at the previous level. Attempting to skip levels—running aggressive production experiments before building tolerance—burns trust faster than you can earn it.
Building a chaos engineering practice from scratch requires a structured approach. The first 90 days set the trajectory for long-term success or failure. Here's a concrete playbook for starting small and building momentum.
Days 1-30: Foundation
The first month focuses on building capability, not running ambitious experiments.
Week 1-2: Education and Alignment
Brief leadership and candidate teams on what chaos engineering is and is not, align on goals, and secure a written charter committing to a trial period.
Week 3-4: Tooling and Environment
Select failure-injection tooling and get it running in a non-production environment, with the goal of injecting failures reliably by day 30.
Days 31-60: First Experiments
The second month focuses on generating initial wins with minimal risk.
Week 5-6: Validation Experiments
Run the first validation experiments in staging, confirming that existing resilience mechanisms behave as expected and documenting every finding.
Week 7-8: Expanding Scope
Graduate from staging to a first tightly scoped production experiment with explicit blast radius limits, aiming for zero customer impact. The table below summarizes milestones across the full 90 days.
| Day | Milestone | Success Indicator |
|---|---|---|
| 15 | Charter approved by leadership | Written commitment to trial period |
| 30 | Tooling operational in non-prod | Able to inject failures reliably |
| 45 | 3+ staging experiments completed | All with documented findings |
| 60 | 1+ production experiment completed | Zero customer impact |
| 75 | 2+ teams requesting experiments | Inbound interest from outside initial team |
| 90 | Ongoing experiment schedule established | Regular cadence, not ad-hoc |
Days 61-90: Establishing Rhythm
The third month focuses on transitioning from project to practice.
Week 9-10: Process Formalization
Establish a regular experiment cadence, an intake path for interested teams, and documentation standards so chaos work becomes routine rather than ad hoc.
Week 11-12: Metrics and Reporting
Begin tracking program metrics and sharing results with stakeholders; the measurement section below suggests targets for the first 90 days.
Anti-patterns to avoid in the first 90 days include running ad-hoc, undocumented experiments, jumping straight to production, letting experiment scope creep beyond documented limits, and forcing participation on teams that have not opted in.
Even with the best intentions, chaos engineering programs make predictable mistakes during the "starting small" phase. Understanding these anti-patterns helps you avoid them.
Every chaos finding should be reframed as a positive: "We discovered this in a controlled experiment before customers discovered it in production." This positions chaos engineering as a protection mechanism, not a criticism engine. The team that owns the affected service should feel grateful, not blamed.
The "starting too big" recovery playbook
If you've already made the mistake of starting too big and experienced a problematic experiment, recovery is possible but requires deliberate effort: acknowledge the impact openly, run a blameless review of what went wrong, rebuild credibility with a series of small, tightly scoped experiments, and communicate each success as you re-earn the trust you withdrew.
Recovery takes longer than doing it right the first time, but chaos engineering programs can survive early mistakes if leadership responds with humility and deliberate improvement.
Starting small is a strategy, not an end state. To justify progression to larger experiments, you need measurable outcomes from your initial experiments. These metrics serve dual purposes: demonstrating value to stakeholders and guiding your own program development.
| Metric | What It Measures | Target (First 90 Days) |
|---|---|---|
| Experiments Completed | Program activity level | 10-15 experiments |
| Findings Discovered | Value generation | 3-5 novel findings |
| Fixes Implemented | Actual reliability improvement | 2-3 fixes deployed |
| Experiments with Zero Impact | Safety and control | 90% zero-impact |
| Participating Teams | Organizational adoption | 2-4 teams engaged |
| Mean Experiment Duration | Efficiency and control | <10 minutes active injection |
| Blast Radius Consistency | Process discipline | 100% within defined limits |
| Documentation Completeness | Knowledge management | 100% experiments documented |
Qualitative outcomes
Not all valuable outcomes are quantifiable. Pay attention to how teams talk about failure: whether experiment findings are referenced in design discussions, whether skeptics start asking questions instead of objecting, and whether teams begin volunteering their own services for experiments.
These qualitative signals often precede quantitative improvements and indicate cultural shift occurring beneath the surface.
The strongest signal of successful "starting small" is when someone outside your initial group proposes a chaos experiment for their own service. This indicates that chaos engineering has crossed from being your project to being an organizational practice. When inbound requests exceed your capacity to support them, you've graduated from "starting small" to scaling.
Starting small isn't about lack of ambition—it's about strategic patience. The organizations with the most mature chaos engineering practices all started with trivial experiments, built trust through incremental wins, and expanded scope only when they'd earned the organizational permission to do so.
Let's consolidate the key principles:
Match ambition to readiness: Start with experiments your organization can absorb today, not the experiments a chaos-mature organization runs.
Pick the right starting point: Combine low risk, high learning potential, and visible success, and begin by validating resilience mechanisms the team believes already work.
Progress through environments deliberately: Local development, CI/CD, staging, and canary come before limited and then full production.
Control every experiment: Kill switches, time limits, blast radius limits, traffic exclusions, and rollback procedures, all documented before execution.
Treat trust as currency: Small successful experiments are deposits, visible wins compound them, and scope expands only as trust accumulates.
Measure outcomes: Track quantitative metrics and qualitative signals to justify progression from small experiments to meaningful resilience validation.
What's next:
With a solid foundation of small, successful experiments generating organizational trust, the next challenge is securing executive buy-in—transforming chaos engineering from an engineering experiment into a funded, staffed organizational priority. The next page explores how to build the business case, navigate executive conversations, and secure the resources needed for chaos engineering to scale.
You now understand why starting small is strategically superior to ambitious chaos engineering launches. You have a framework for identifying starting points, designing safe experiments, building trust incrementally, and avoiding common anti-patterns. Next, we'll explore how to translate this foundational success into executive sponsorship and organizational commitment.