The checklist has seven items. Two were added after the first event went badly and we learned what we’d forgotten. The other five haven’t changed since the third event, when we stopped guessing and started trusting the load-test numbers. The checklist is the same as last year. That’s the point.
The checklist
Capacity plans get reviewed, not created. The prescaling targets were drafted months ago, measured against load tests weeks ago. The week-before pass is a sanity check. Did a new service come online? Did a dependency change? Is a campaign bigger than last year’s? If nothing changed, the capacity plan carries forward. The review takes thirty minutes per service. Most take fifteen.
Runbooks get dusted, not written. The runbook is a living document, updated after every event with the things nobody predicted. The week-before read is to catch drift – a deprecated endpoint, a renamed service, a contact who left. The runbook is only as current as its last read.
Dependencies get re-confirmed. Every external service that takes load during the sale – payment providers, notification gateways, CDNs, the third-party search index – gets a check-in. Not a ticket. A message. “Same window, same volume, anything changed on your side?” Most answer no. The confirmation is the artifact.
The week ends with the last pre-scale dry run. Capacity standing ready, warmed up, health checks passing. Nothing takes traffic yet. It just has to be there.
The freeze
Non-critical merges stop. No refactors, no dependency bumps, no cleanup PRs. The freeze is physical – engineers changing hats from builder to reader. The last merge before the freeze is always one line in a config file, something someone caught during the checklist read, and there is always a moment of wondering whether even that one line should wait.
Sometimes it does.
The freeze isn’t bureaucracy. There is nobody enforcing it, no approval gate, no formal policy. It’s a cultivated instinct. The system is what it is. Nothing new enters. Everything that will handle the spike is already deployed and running. You want to face the event with a system you know, not one you last touched an hour ago.
The calm
The first year through this, the week before was anxious. Everyone ran scenarios. Everyone triple-checked configs. Things we’d forgotten surfaced at the worst possible moment. Scrambling. Fixes rushed through review. The freeze meant nothing because nobody was ready to freeze.
By the third event, the anxiety was gone. We knew what the checklist held. We knew what would break and what wouldn’t. The calm was earned through repetition – not by getting better at preparing, but by having prepared enough times that the preparation became background noise.
The calm is the artifact. It says you’ve done this before.
The event opens in seven days. The capacity is standing by. The runbook is current. The freeze is holding. Nothing breaks this week because nothing is supposed to.
The war room will get ninety minutes of that calm before the plan runs out. Until then, nothing new enters.
