The difference between a chaotic mess and a valuable GameDay lies almost entirely in the planning. A poorly planned GameDay wastes everyone's time, potentially damages production systems, and—worst of all—teaches the wrong lesson: that chaos engineering is dangerous and unproductive.
Effective GameDay planning is an art form that balances ambition with safety, realism with control, and learning objectives with operational constraints. The hours invested in planning multiply the value extracted from the exercise itself.
This page provides a comprehensive framework for planning GameDays that deliver maximum learning while maintaining organizational trust and system stability.
By the end of this page, you will understand how to define clear GameDay objectives, select appropriate failure scenarios, identify the right participants, establish safety controls, communicate with stakeholders, and prepare your environment. You'll have a planning checklist that ensures no critical element is overlooked.
Every effective GameDay begins with a clear answer to the question: What are we trying to learn? Without defined objectives, GameDays become unfocused chaos—entertainment rather than education.
Objectives should be specific, measurable, and tied to genuine organizational concerns. Vague goals like 'test our resilience' provide no guidance on what scenarios to run or how to evaluate success.
| Weak Objective | Why It's Weak | Strong Alternative |
|---|---|---|
| Test our failover | Which failover? What constitutes success? How will we measure it? | Validate that database failover from primary to replica completes in under 2 minutes with zero data loss |
| See how the team responds to incidents | No specific focus; difficult to extract actionable learning | Evaluate whether our on-call engineers can diagnose a cascading cache failure using only documented runbooks |
| Break things and see what happens | No hypothesis; pure exploration without direction | Test our hypothesis that the payment service remains functional when the recommendation service is unavailable |
| Practice incident response | Too generic; no specific skills or processes targeted | Practice the escalation workflow from P2 to P1 severity, including executive notification procedures |
| Validate our DR capability | DR is broad: which aspect? Which region? What RTO/RPO? | Validate that a complete regional failover to the DR site achieves our 4-hour RTO with less than 15 minutes of data loss |
The hypothesis-driven approach:
Borrowing from the scientific method, effective GameDays are structured around hypotheses that the exercise will test:
This structure ensures that every GameDay produces actionable knowledge, whether the hypothesis is confirmed or refuted.
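A hypothesis is easier to confirm or refute when it is recorded together with the metric and threshold that decide the outcome. The sketch below is one minimal way to do that, not a prescribed format: the `Hypothesis` class, its field names, and the 120-second threshold (taken from the database failover example in the table above) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One testable GameDay hypothesis. All field names are illustrative."""
    system: str        # system, process, or team under test
    expected: str      # behavior we predict
    failure: str       # failure scenario to introduce
    metric: str        # what is measured during the exercise
    threshold: float   # largest observed value that still confirms the hypothesis

    def statement(self) -> str:
        # Phrase the hypothesis as a single falsifiable sentence.
        return f"We believe that {self.system} will {self.expected} when {self.failure}."

    def evaluate(self, observed: float) -> str:
        # Either outcome is useful: both produce actionable knowledge.
        return "confirmed" if observed <= self.threshold else "refuted"


failover = Hypothesis(
    system="the database cluster",
    expected="fail over to the replica in under 2 minutes",
    failure="the primary becomes unavailable",
    metric="failover_seconds",
    threshold=120.0,
)

print(failover.statement())
print(failover.evaluate(observed=95.0))
```

Recording the threshold before the exercise avoids deciding after the fact what counts as success.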
When selecting objectives, ask the team: 'What failure scenario keeps you up at night?' Those fears often point to known unknowns—areas where the team suspects weakness but hasn't validated. These make excellent GameDay targets because they address real organizational anxiety.
Once objectives are clear, the next step is designing failure scenarios that will test those objectives effectively. Scenario selection requires balancing realism with controllability, and ambition with safety.
Effective scenarios share several characteristics:
Progressive scenario complexity:
Organizations new to GameDays should start with simple, single-failure scenarios and progressively increase complexity:
Level 1: Single Component
Level 2: Single System
Level 3: Multi-System
Level 4: Regional/Cross-cutting
Level 5: Business Process
GameDay scenarios should not be designed to trick or embarrass participants. The goal is learning, not proving that you can stump the on-call engineer. Excessively obscure scenarios produce anxiety rather than growth. Keep scenarios realistic enough that encountering them in production would be plausible.
The right participants make the difference between a GameDay that reveals organizational insights and one that merely demonstrates that experts can solve problems. Careful role assignment ensures productive dynamics during the exercise.
| Role | Responsibilities | Who Should Fill This Role |
|---|---|---|
| Game Master / Exercise Lead | Controls scenario execution, maintains the timeline, manages safety controls, can pause or abort the exercise | Senior engineer familiar with chaos tooling who is not involved in the response |
| Incident Commander (simulated) | Leads the response effort, coordinates responders, makes escalation decisions | Whoever would actually lead real incidents—test their capability |
| Technical Responders | Diagnose issues, execute remediation, follow runbooks | Engineers who would actually be on-call—rotate in less experienced engineers for training |
| Observers | Watch without intervening, take notes on process gaps, capture learning points | Senior engineers, SREs, or managers—anyone who can provide constructive debrief input |
| Safety Officer | Monitors for actual production impact, authorized to call abort, watches blast radius | Senior SRE or platform engineer with deep system knowledge and authority to halt |
| Scribe | Captures timeline, key decisions, observed behaviors for post-exercise analysis | Anyone organized and able to document rapidly without participating |
| Customer Representative (optional) | Provides customer perspective, evaluates communication effectiveness | Customer success or support team member, product manager |
Participant selection principles:
The separation of knowledge:
A critical principle in GameDay design is asymmetric information. The Game Master and Safety Officer know the full scope of planned failures. The responders should not.
This separation is essential because:
That said, responders should know that a GameDay is occurring—true 'surprise drills' require extremely mature organizational trust and are not appropriate for most organizations.
Having observers present changes responder behavior—usually for the better during early GameDays, as people try harder. This is acceptable. The goal isn't to trick people into revealing incompetence; it's to practice and improve. As GameDays become routine, the observer effect diminishes and behavior becomes more natural.
Safety is paramount in GameDay planning. The exercise must deliver its learning value without causing unacceptable damage to production systems, customer experience, or team morale. Comprehensive safety controls are non-negotiable.
The safety control checklist:
Before any GameDay execution, verify each of these items:
If any safety control cannot be verified, do not proceed with the GameDay. Postponing is always preferable to an uncontrolled incident caused by your own exercise. A GameDay that damages customer trust or destabilizes production systems is worse than no GameDay at all.
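The "do not proceed" rule can be enforced as a simple go/no-go gate. The following is a sketch under stated assumptions: the control names are borrowed from the Safety Controls section of the planning document template, and the function names and the dictionary shape are hypothetical.

```python
# Control names mirror the planning template's Safety Controls section;
# adapt the list to your own checklist.
SAFETY_CONTROLS = (
    "blast_radius",
    "abort_criteria",
    "rollback_procedure",
    "time_boundary",
    "emergency_contacts",
)


def unverified_controls(verified: dict[str, bool]) -> list[str]:
    """Return every control that is missing or not explicitly verified."""
    return [name for name in SAFETY_CONTROLS if not verified.get(name, False)]


def may_proceed(verified: dict[str, bool]) -> bool:
    """The GameDay may start only when no control is left unverified."""
    return not unverified_controls(verified)


status = {
    "blast_radius": True,
    "abort_criteria": True,
    "rollback_procedure": False,  # rollback was never rehearsed
    "time_boundary": True,
    # "emergency_contacts" is absent, so it counts as unverified
}

print(may_proceed(status))
print(unverified_controls(status))
```

Treating a missing entry as unverified is deliberate: a control nobody checked should block the exercise just like one that failed.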
GameDays don't happen in isolation. They affect—or could affect—various stakeholders across the organization. Proactive communication builds trust, prevents confusion, and ensures appropriate support is available.
| Stakeholder Group | What They Need to Know | When to Communicate | Communication Channel |
|---|---|---|---|
| Engineering Leadership | Purpose, scope, timeline, potential risks, expected outcomes | 1-2 weeks before (for approval), day-of reminder | Planning document, calendar invite |
| Participating Teams | Their role, time commitment, what to prepare, meeting details | 1 week before (detailed briefing) | Team meetings, detailed email with agenda |
| On-Call Engineers | That a GameDay is occurring (prevents confusion with real incidents) | Day of exercise, at start time | Slack/chat announcement, on-call handover notes |
| Customer Support | Potential customer impact, timeline, what to tell customers if asked | Day before (briefing), day-of (start/end confirmation) | Email to support leadership, status page update if appropriate |
| Product Teams | That systems they depend on may behave unusually during the window | 1 week before (awareness), day-of (confirmation) | Cross-functional sync, calendar blocking |
| Executive Sponsors | That the exercise is occurring, expected learning value, any notable findings | Pre-exercise summary, post-exercise brief report | Executive summary document, follow-up meeting if findings warrant |
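
The lead times in the table can be turned into concrete calendar dates once the GameDay date is fixed. This is a small sketch, assuming the earlier bound (14 days) for the "1-2 weeks" entry and leaving out Executive Sponsors because the table gives no specific lead time for them. The `LEAD_DAYS` mapping and function name are illustrative.

```python
from datetime import date, timedelta

# Days before the GameDay on which each group should first be notified.
# Values follow the table above; 14 is an assumed reading of "1-2 weeks".
LEAD_DAYS = {
    "Engineering Leadership": 14,
    "Participating Teams": 7,
    "Product Teams": 7,
    "Customer Support": 1,
    "On-Call Engineers": 0,
}


def notification_dates(gameday: date) -> dict[str, date]:
    """Map each stakeholder group to its first notification date."""
    return {group: gameday - timedelta(days=days) for group, days in LEAD_DAYS.items()}


for group, when in notification_dates(date(2025, 3, 20)).items():
    print(f"{when.isoformat()}  {group}")
```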
The 'real vs. exercise' communication challenge:
During GameDays, especially those that generate real alerts, distinguishing between exercise-caused events and genuine incidents is critical. Establish clear protocols:
The goal is to practice realistic response while never creating confusion that could delay response to actual customer-impacting incidents.
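One possible protocol, shown here only as an illustration, is to label an alert as exercise-related solely when it fires inside the exercise window and on a service declared in scope, and to treat everything else as a real incident by default. The `ExerciseWindow` type, the label strings, and the service names are assumptions, not part of any particular alerting tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ExerciseWindow:
    """Declared time window and scope of a GameDay (hypothetical structure)."""
    start: datetime
    end: datetime
    services_in_scope: frozenset[str]


def classify_alert(service: str, fired_at: datetime, window: ExerciseWindow) -> str:
    """Label an alert conservatively: anything ambiguous is treated as real."""
    in_window = window.start <= fired_at <= window.end
    if in_window and service in window.services_in_scope:
        # Still worth a quick check with the Game Master before dismissing.
        return "likely-exercise"
    return "treat-as-real"


window = ExerciseWindow(
    start=datetime(2025, 3, 20, 14, 0),
    end=datetime(2025, 3, 20, 16, 0),
    services_in_scope=frozenset({"recommendation-service"}),
)

print(classify_alert("recommendation-service", datetime(2025, 3, 20, 14, 30), window))
print(classify_alert("payment-service", datetime(2025, 3, 20, 14, 30), window))
```

Defaulting to "treat-as-real" reflects the priority stated above: a misclassified exercise alert costs a few minutes, while a dismissed real incident affects customers.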
Early GameDays are an opportunity to build organizational support. Communicate results widely, celebrate learnings (not just successes), and acknowledge participants. When people see GameDays as valuable learning opportunities rather than disruptive time-wasters, scheduling future exercises becomes easier.
The technical environment where the GameDay will run requires careful preparation. This includes ensuring chaos tooling is functional, monitoring is in place, and the target systems are in a known-good state before introducing failures.
Environment selection considerations:
Choosing the right environment for your GameDay depends on your maturity level and objectives:
Development/Local:
Staging/Pre-Production:
Production (with controls):
Canary/Shadow Production:
Schedule GameDays during periods without other significant changes. Don't run GameDays during deployment windows, database migrations, or major feature launches. If something goes wrong, you want to be confident it's related to the exercise, not an unrelated change.
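A quick overlap check against the change calendar can catch scheduling conflicts before the date is announced. The sketch below assumes planned changes are available as simple time windows; the `Window` type and the example entries are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class Window(NamedTuple):
    name: str
    start: datetime
    end: datetime


def conflicts(gameday: Window, planned_changes: list[Window]) -> list[str]:
    """Return the names of planned changes whose window overlaps the GameDay."""
    return [
        change.name
        for change in planned_changes
        # Two intervals overlap when each starts before the other ends.
        if gameday.start < change.end and change.start < gameday.end
    ]


gameday = Window("GameDay", datetime(2025, 3, 20, 14, 0), datetime(2025, 3, 20, 16, 0))
changes = [
    Window("Database migration", datetime(2025, 3, 20, 15, 0), datetime(2025, 3, 20, 18, 0)),
    Window("Feature launch", datetime(2025, 3, 21, 9, 0), datetime(2025, 3, 21, 12, 0)),
]

print(conflicts(gameday, changes))
```

An empty result means the window is clear of the listed changes; it says nothing about changes that were never entered in the calendar.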
Every well-planned GameDay produces a planning document that serves as the authoritative reference for the exercise. This document captures all the decisions made during planning and provides the structure for execution.
A comprehensive planning document should include:
    # GameDay Planning Document

    ## Exercise Overview
    - **Exercise Name:** [Descriptive name, e.g., "Q1 Database Failover GameDay"]
    - **Date & Time:** [Start and end times, including timezone]
    - **Environment:** [Production/Staging/etc. + specific systems]
    - **Authorization:** [Who approved this exercise]

    ## Objectives
    1. [Primary objective with measurable success criteria]
    2. [Secondary objective with measurable success criteria]
    3. [Learning questions to answer]

    ## Hypothesis
    We believe that [system/process/team] will [expected behavior] when [failure scenario is introduced].

    ## Failure Scenarios
    | Scenario | Implementation | Expected Behavior | Abort Trigger |
    |----------|----------------|-------------------|---------------|
    | [Name] | [How injected] | [What should happen] | [When to abort] |

    ## Participants
    | Role | Person | Contact | Backup |
    |------|--------|---------|--------|
    | Game Master | | | |
    | Safety Officer | | | |
    | Incident Commander | | | |
    | Responders | | | |
    | Observers | | | |

    ## Safety Controls
    - **Blast Radius:** [Specific limits]
    - **Abort Criteria:** [List of conditions]
    - **Rollback Procedure:** [How to undo each failure]
    - **Time Boundary:** [Maximum duration before forced abort]
    - **Emergency Contacts:** [For escalation beyond GameDay scope]

    ## Communication Plan
    - **Pre-Exercise Notification:** [Who, what, when]
    - **During-Exercise Channel:** [Slack channel, etc.]
    - **Post-Exercise Communication:** [Results summary distribution]

    ## Timeline
    | Time | Activity | Owner |
    |------|----------|-------|
    | -30 min | Final preparation, system health check | Game Master |
    | -15 min | Participant briefing | Game Master |
    | 0:00 | Inject Failure #1 | Game Master |
    | +X min | Inject Failure #2 (optional) | Game Master |
    | +Y min | Time boundary / nominal end | Game Master |

    ## Post-Exercise
    - **Debrief Time:** [Immediately after]
    - **Debrief Location:** [Room/video call]
    - **Action Item Tracking:** [Where will actions be logged]
    - **Report Distribution:** [Who gets the summary]

Circulating the planning document:
The planning document should be shared with:
Store the document where it can be referenced during and after the exercise. It becomes part of your organizational memory of resilience practices.
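Before circulating a draft, it can help to check mechanically that no section of the planning document has been left out. The sketch below assumes the document is written in Markdown with second-level headings named as in the template; the section list and function name are illustrative, and the check says nothing about the quality of what each section contains.

```python
import re

# Section headings expected in a complete planning document (from the template).
REQUIRED_SECTIONS = (
    "Exercise Overview",
    "Objectives",
    "Hypothesis",
    "Failure Scenarios",
    "Participants",
    "Safety Controls",
    "Communication Plan",
    "Timeline",
    "Post-Exercise",
)


def missing_sections(markdown_text: str) -> list[str]:
    """Return required sections that have no matching '## ' heading."""
    headings = re.findall(r"^##\s+(.+)$", markdown_text, flags=re.MULTILINE)
    found = {heading.strip() for heading in headings}
    return [section for section in REQUIRED_SECTIONS if section not in found]


draft = """# GameDay Planning Document

## Exercise Overview
- **Exercise Name:** Q1 Database Failover GameDay

## Objectives
1. Validate database failover from primary to replica

## Hypothesis
We believe that the database cluster will fail over in under 2 minutes.
"""

print(missing_sections(draft))
```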
Effective GameDay planning sets the stage for valuable learning. Let's consolidate the essential elements:
What's next:
With a solid plan in place, the next challenge is execution. The following page covers how to run a GameDay effectively—managing the flow, observing behaviors, handling unexpected developments, and maintaining safety throughout the exercise.
You now understand how to plan GameDays that deliver maximum learning value while maintaining safety. Thorough planning ensures your chaos engineering exercises are productive rather than chaotic. Next, we'll learn how to execute these carefully planned exercises effectively.