The planning is complete. Stakeholders are informed. Safety controls are verified. Now comes the moment of truth: actually running the GameDay.
Execution is where chaos engineering theory transforms into organizational capability. The best-laid plans are useless if the exercise devolves into confusion, if responders feel unsupported, or if the Game Master loses control of pacing. Conversely, skillful execution extracts maximum learning from even modest scenarios.
This page provides a comprehensive guide to running GameDays that are educational, safe, and—importantly—create appetite for future exercises rather than organizational trauma.
By the end of this page, you will understand how to conduct effective pre-exercise briefings, manage failure injection and timing, observe responder behaviors, handle unexpected situations, maintain safety throughout execution, and close the exercise cleanly. You'll learn the art of Game Mastering that makes GameDays valuable.
Every GameDay begins with a briefing that sets expectations, assigns roles, and ensures all participants are ready. This briefing typically occurs 15-30 minutes before failure injection begins.
The briefing agenda:
Setting the psychological tone:
The briefing sets the emotional tone for the entire exercise. Key psychological principles:
Create psychological safety:
Establish realistic urgency without panic:
Frame discovery as success:
Responders should feel appropriately focused but not terrified. If participants are visibly anxious, that's a signal that either the GameDay scope is too ambitious or organizational culture needs work before production exercises.
After the briefing, give participants two minutes to gather their thoughts, pull up dashboards, and settle in. This transition time helps responders mentally shift from briefing mode to response mode, mimicking the moment when an on-call engineer gets paged and needs to orient themselves.
The Game Master controls the introduction of failures. This role requires technical competence, situational awareness, and judgment about pacing. The goal is to create conditions that simulate real incidents while maintaining safety and learning opportunities.
Injection execution principles:
Multi-scenario GameDays:
For GameDays with multiple planned failure scenarios, pacing is critical:
The hardest skill for experienced engineers serving as Game Masters is watching responders struggle without intervening. Resist the urge to drop hints. If responders take an unusual diagnostic path, let them—that's exactly what would happen at 3 AM. The struggle is where learning happens. Observers should note the struggle for the debrief, not interrupt to 'save' responders.
A GameDay without effective observation produces stories but not actionable insights. Observers and the scribe capture the raw material that will be refined into improvements during the debrief.
What observers should watch for:
The Scribe's Timeline:
The scribe maintains a chronological record of the exercise. This timeline is invaluable for post-exercise analysis and for building organizational memory.
# GameDay Timeline - Database Failover Exercise
## Date: 2025-03-15, 14:00-16:00 UTC

| Time (UTC) | Event | Actor | Notes |
|------------|-------|-------|-------|
| 14:00 | Briefing begins | Game Master | All participants present |
| 14:15 | Exercise starts, failure injection | Game Master | Primary DB network isolated |
| 14:16 | First alert fires | PagerDuty | "Database connection pool exhausted" |
| 14:17 | Incident Commander acknowledges | Alice | Opens incident channel |
| 14:19 | Initial diagnosis begins | Bob | Checking database metrics |
| 14:23 | Root cause identified | Bob | "Primary DB unreachable" |
| 14:24 | Decision to initiate failover | Alice | |
| 14:25 | Failover command executed | Bob | Following runbook Step 5.2 |
| 14:26 | Runbook issue discovered | Bob | "Step 5.3 references old configuration path" |
| 14:28 | Workaround applied | Bob | Found correct path in sidebar notes |
| 14:31 | Failover complete | System | Replica promoted to primary |
| 14:33 | Customer traffic restored | System | Error rates returning to baseline |
| 14:35 | Verification checks pass | Carol | All endpoints responding |
| 14:40 | Exercise concludes | Game Master | Begin wrap-up |

## Key Observations
- Detection was fast (< 2 minutes)
- Runbook had outdated configuration reference (action item)
- Failover took longer than expected (16 minutes vs. target 5 minutes)
- Team communication was clear throughout

Observation tips for effectiveness:
With participant consent, recording the screen-share or video call during GameDays provides invaluable debrief material. Reviewing recordings often reveals dynamics that live observers missed. The recordings also serve as training material for engineers who couldn't participate.
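A scribe's timeline also supports simple timestamp arithmetic for the debrief. The following is a minimal sketch, not a prescribed tool: the event labels (`failure_injected`, `first_alert`, and so on) are hypothetical names chosen for illustration, and the times are taken from the sample database failover timeline.

```python
from datetime import datetime

# Hypothetical (time, event) pairs copied from a scribe's timeline.
# The event labels are illustrative, not a standard vocabulary.
TIMELINE = [
    ("14:15", "failure_injected"),
    ("14:16", "first_alert"),
    ("14:23", "root_cause_identified"),
    ("14:31", "failover_complete"),
    ("14:33", "traffic_restored"),
]


def minutes_between(timeline, start_event, end_event):
    """Return whole minutes elapsed between two named events on the same day."""
    times = {event: datetime.strptime(clock, "%H:%M") for clock, event in timeline}
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE, "failure_injected", "first_alert")
time_to_diagnose = minutes_between(TIMELINE, "failure_injected", "root_cause_identified")
time_to_failover = minutes_between(TIMELINE, "failure_injected", "failover_complete")

print(f"detect: {time_to_detect} min, diagnose: {time_to_diagnose} min, "
      f"failover: {time_to_failover} min")
```

With the sample data, the failover figure comes out to 16 minutes, which matches the scribe's note that failover missed its 5-minute target.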
Despite careful planning, GameDays often produce surprises. Part of the Game Master's skill is handling these unexpected developments while maintaining exercise value and safety.
| Situation | Risk Level | Recommended Response |
|---|---|---|
| Failure has bigger impact than expected | High | Immediately assess blast radius. If exceeding controls, abort and rollback. If within acceptable bounds but larger, continue with heightened monitoring. |
| Failure has no visible impact | Low | Verify injection succeeded. If yes, this might be a resilience success. If no, troubleshoot injection mechanism. |
| Responders are completely stuck | Medium | Wait 15-20 minutes. If no progress, Game Master may offer a general hint (not the answer). Consider if scenario is too difficult. |
| Real incident occurs during GameDay | High | Immediately pause GameDay. Rollback injected failures. Switch to real incident response. Resume GameDay later if feasible. |
| Responder becomes visibly distressed | Medium | Pause exercise briefly. Check in privately. Offer to swap in a different responder or simplify remaining scenarios. |
| Unexpected cascade failure occurs | High | Assess severity. If within controls, this is valuable discovery—observe carefully. If approaching abort criteria, execute rollback. |
| Responders solve too quickly | Low | Good problem to have! Either introduce planned secondary scenario or conclude early with lessons about effective preparation. |
| Key participant becomes unavailable | Medium | If backup is available, bring them in. If not, simplify remaining exercise or postpone complex scenarios. |
The 'Real vs. Exercise' dilemma:
One of the trickiest situations: an alert fires during a GameDay, and it's unclear whether it's related to the exercise or a genuine production issue.
Protocol for ambiguous alerts:
When abort criteria are approached, don't hesitate. The decision to abort should be swift and unequivocal. A GameDay that causes real customer impact undermines years of trust-building. The Safety Officer has unconditional authority to abort, and that authority should be exercised whenever there's genuine concern, not only when criteria are clearly exceeded.
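One way to make "approached" concrete is to agree on a warning margin below each limit before the exercise starts. The sketch below assumes such a margin (`warning_ratio`, a hypothetical parameter defaulting to 80% of the limit) and assumes metrics where higher values are worse. The metric names and numbers are illustrative only, and the result is an input to the Safety Officer's judgment, not a replacement for it.

```python
from dataclasses import dataclass


@dataclass
class AbortCriterion:
    """One abort criterion; names and numbers are illustrative only."""
    metric: str
    limit: float                 # value at which the criterion is exceeded
    warning_ratio: float = 0.8   # assumed fraction of the limit that counts as "approaching"


def evaluate(criterion: AbortCriterion, observed: float) -> str:
    """Classify an observed value as 'ok', 'approaching', or 'exceeded'."""
    if observed >= criterion.limit:
        return "exceeded"
    if observed >= criterion.limit * criterion.warning_ratio:
        return "approaching"
    return "ok"


def abort_recommended(criteria, observations) -> bool:
    """Recommend abort when any criterion is approaching or exceeded."""
    return any(evaluate(c, observations[c.metric]) != "ok" for c in criteria)


criteria = [
    AbortCriterion("error_rate_percent", limit=5.0),
    AbortCriterion("p99_latency_ms", limit=2000.0),
]
observed = {"error_rate_percent": 4.2, "p99_latency_ms": 900.0}
print(abort_recommended(criteria, observed))
```

Here the error rate is below its limit but inside the warning margin, so the check recommends aborting instead of waiting for the limit to be crossed.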
The most valuable GameDay learning often involves humans, not systems. For participants to be honest about their confusion, uncertainty, and mistakes, psychological safety must be maintained throughout the exercise.
Principles for psychological safety during execution:
When tensions rise:
Even with the best preparation, stress can escalate during GameDays. Signs that psychological safety is eroding:
De-escalation techniques:
The Game Master's responsibility extends beyond controlling the exercise—they're also protecting participants from negative experiences that would make future GameDays harder to staff. If observers are being unhelpful, if leadership is making responders nervous, or if the technology is creating undue frustration, the Game Master should intervene.
Every GameDay must end with systems returned to known-good states. The wrap-up phase ensures the exercise concludes cleanly and sets the stage for the debrief.
Time between exercise and debrief:
The ideal time for a debrief is immediately after the exercise, when details are fresh and emotional states are still elevated (in a good way). However, if the exercise was particularly long or intense, a short break is appropriate.
Immediately after (recommended):
After a break (15-30 minutes):
Next day (avoid if possible):
If more than an hour passes between exercise and debrief, have participants write down their top 3 observations immediately, preserving them for later discussion.
Before declaring the exercise complete, imagine an engineer who wasn't involved in the GameDay looking at your systems tomorrow. Would they notice anything unusual? Would there be any artifacts, configurations, or states that would confuse them? The goal is a 'clean room'—no evidence of the exercise beyond documentation.
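The "clean room" check can be partly mechanized by comparing a state snapshot taken before the exercise with one taken afterward. This is a minimal sketch under stated assumptions: the snapshot keys (`db.primary`, `firewall.rule_count`, and so on) are hypothetical, and how the snapshots are collected depends entirely on your environment.

```python
ABSENT = "<absent>"  # marker for a key that exists in only one snapshot


def diff_state(before: dict, after: dict) -> dict:
    """Return {key: (before_value, after_value)} for every key that differs."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, ABSENT)
        new = after.get(key, ABSENT)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots; the keys and values are illustrative only.
pre_exercise = {
    "db.primary": "db-1",
    "firewall.rule_count": 12,
    "feature_flag.fault_injection": False,
}
post_exercise = {
    "db.primary": "db-2",
    "firewall.rule_count": 13,
    "feature_flag.fault_injection": False,
}

leftovers = diff_state(pre_exercise, post_exercise)
for key, (old, new) in leftovers.items():
    print(f"{key}: {old!r} -> {new!r}")
```

Not every difference is a problem. A promoted replica may be the intended end state, but each difference should be either reverted or documented so that it does not confuse the engineer looking at the systems tomorrow.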
Game Mastering is part art, part science. The following framework helps Game Masters navigate common decision points during execution.
| Situation | Question to Ask | Decision Guidance |
|---|---|---|
| Responders are stuck | How long have they been stuck? Are they learning from the struggle? | <10 min: Wait. 10-20 min: Consider a general hint. >20 min: Offer targeted help or pivot scenario. |
| Want to inject next failure | Have responders stabilized from previous failure? Are they ready for more? | If actively working on current issue, wait. If recovered and stable, proceed. |
| Observer wants to help | Is this a safety issue or just discomfort watching struggle? | If safety, allow. If discomfort, remind observer of role and politely redirect. |
| Approaching time boundary | Is remaining scenario worth rushing, or better to end gracefully? | Quality over completion. If rushing would reduce learning, end early. |
| Scenario going differently than expected | Is this interesting deviation or unproductive tangent? | If learning is happening, let it continue. If truly unproductive, gently redirect. |
| Responder asks if this is the exercise | Is clarification appropriate at this moment? | Generally yes—clarify that this is the exercise. Don't let confusion about reality persist. |
| Metrics approaching abort criteria | Are we trending toward or just near limits? | If improving, monitor closely. If worsening, prepare for abort. If flat near limit, consider preemptive abort. |
| Responder does something unexpected | Is this creative problem-solving or concerning improvisation? | Generally, let responders lead. Intervene only if safety is at risk. |
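Two rows of the table above are specific enough to write down as code, which can help a new Game Master rehearse them. The sketch below encodes the "responders are stuck" time bands and the "metrics approaching abort criteria" trend guidance. The `classify_trend` helper and its `tolerance` parameter are assumptions added for illustration, and the helper assumes a metric where higher values are worse.

```python
def stuck_guidance(minutes_stuck: float) -> str:
    """Map how long responders have been stuck to the table's guidance."""
    if minutes_stuck < 10:
        return "wait"
    if minutes_stuck <= 20:
        return "consider a general hint"
    return "offer targeted help or pivot scenario"


def classify_trend(samples, tolerance=0.05):
    """Classify readings of a 'higher is worse' metric by net change.

    The tolerance is an assumed noise band, not a value from the source.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    change = samples[-1] - samples[0]
    if change > tolerance:
        return "worsening"
    if change < -tolerance:
        return "improving"
    return "flat"


def abort_trend_guidance(trend: str) -> str:
    """Map a metric trend near the abort limit to the table's guidance."""
    guidance = {
        "improving": "monitor closely",
        "worsening": "prepare for abort",
        "flat": "consider preemptive abort",
    }
    return guidance[trend]


# Example: error-rate readings (percent) taken near a limit.
readings = [4.0, 4.3, 4.6]
print(stuck_guidance(12), "|", abort_trend_guidance(classify_trend(readings)))
```

Comparing only the first and last readings is deliberately crude. In practice the Game Master and Safety Officer would weigh the whole picture, and safety concerns override any computed guidance.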
The 'Lean Back' principle:
New Game Masters tend to over-manage—intervening too quickly, introducing failures too rapidly, or providing hints too readily. The general rule is: when in doubt, lean back.
Most GameDay value comes from watching what naturally happens, even when that involves struggle, confusion, or suboptimal decisions. These observations reveal where your systems and processes actually stand, not where they stand with active expert guidance.
The exception is safety: when safety is in question, lean forward immediately. But for all other concerns, bias toward patience and observation.
Running a GameDay effectively requires balancing control with observation, support with challenge, and safety with realism. Let's consolidate the key principles:
What's next:
The exercise is complete, but the work isn't finished. The next page covers learning from GameDays—the debrief process that converts observations into improvements, and the follow-up that ensures findings translate into lasting organizational change.
You now understand how to execute GameDays effectively—from briefing through failure injection, observation, and recovery. Skilled execution transforms careful planning into organizational learning. Next, we'll explore how to maximize the value extracted from these exercises through effective debriefs and follow-up.