The second cloud migration lands on the roadmap in a staff meeting. On paper it looks like the first one. A target provider, a rough timeline, a named lead. I read the slide and expect the same work.
It is not the same work.
The first time
The first migration was easy to agree to. Our local datacenter had frequent outages and every one of them carried a physical tax – somebody drove over, pulled a cable, power-cycled a router. The DC had a support team, but the parts that mattered to us were ours to maintain. Cloud was the way forward and every one of us knew it. The decision did not take a meeting.
Execution took everything else. The migration was dev-orchestrated. No program manager, no external team, a shared document that kept growing mid-project. Checklists existed but we wrote them as we went. The cutover was a big-bang weekend – database primary moved, services pointed, DNS flipped, all inside a window we negotiated with ourselves.
It was messy. A queue consumer held a stale connection pool and backlogged for close to an hour before anyone noticed; once we saw it on the graph the fix was a fifteen-minute restart and an apology to the on-call. We recovered and wrote post-mortems that became better checklists for whatever came next.
That is not an argument that we did it well. We made mistakes a more careful plan would have caught. What I can say is that the work was ours. Every piece that moved, someone on the team was holding. The cutover had the feel of work we were inside, not a milestone handed to us.
The second time
The second platform is working.
That is the hard part.
It is not broken. It is not on fire. There are rough edges – a handful of unexplained regional outages, a console less polished than the last cloud, primitives that show their youth in small frustrating ways. At a smaller company with a less business-critical posture, you would live with all of it. Ours is not that company. Rough edges cost, and we cannot afford them the way we used to.
The new target is pitched as a fix for all of it. Cleaner ops, more mature tooling, a level of predictability the current platform does not reach. The decision itself is easy. What the decision buys is peace of mind, and that is something teams have to be convinced they need when nothing is actively on fire.
Convincing the tech leads
Getting tech leads to prioritize the cutover is the work the first migration did not have. The first migration was top-down – all of us, no exceptions, the reason obvious to everyone. This one is consensus-driven. Every service owner has a roadmap full of feature commitments, and adding a transitional system to it means something else gets pushed.
I go to them one at a time. The argument has to be specific to their service – the edges their calls have already hit on the current platform, the cost of one of those edges firing during a sale event, the shape of their recovery path if the promised stability never materializes. Abstract risk calculus does not move anyone. The same risk calculus in their own access patterns does.
Some buy in the first conversation. Some need a second one, with more data, with a specific incident they can trace. A few negotiate for sequencing – we will carry the transitional system, but not this quarter. That is a fair answer and we take it.
The middle
Before and after are the easy parts.
The decision itself is settled – a plan, a program manager, a budget line, an external team, a transitional system design with dual writes, shadow reads, and a crossover measured in weeks. All of that is in place before any service writes its first line of migration code.
After will be easy too. The on-call rotation will page the right people. Dashboards will settle into a new shape. The old platform will be gracefully drained.
In the middle is where the cost lives. Every service that signs up pays a context-switching tax. For weeks, half of each team’s attention is on dual-write correctness and lag on shadow reads and data parity in edge cases the plan did not catch. Their feature work slows. Their retrospectives show mostly migration tickets. Nobody is visibly excited about this work. The engineer writing the dual-write reconciliation code is doing the hardest work of the quarter.
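For readers who have not lived through one, the dual-write and shadow-read loop mentioned above can be sketched in a few lines. This is a minimal illustration, not our actual system: `KVStore`, `dual_write`, and `shadow_read` are hypothetical names, the stores are toy in-memory dictionaries, and a real crossover also needs retries, queued repair of failed secondary writes, and sampling so shadow reads do not double your load.

```python
# Sketch of a dual-write path with a shadow-read parity check.
# `KVStore` is a toy in-memory stand-in for a storage backend;
# the names here are illustrative, not a real migration toolkit.

class KVStore:
    """Toy key-value backend used to illustrate the pattern."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


mismatches = []  # parity failures surfaced for later reconciliation


def dual_write(old_store, new_store, key, value):
    # The old platform stays the source of truth during the crossover.
    old_store.put(key, value)
    try:
        new_store.put(key, value)  # best-effort secondary write
    except Exception:
        # A real system would queue this key for repair, not just log it.
        mismatches.append(("write-failed", key))


def shadow_read(old_store, new_store, key):
    primary = old_store.get(key)  # the value actually served to callers
    shadow = new_store.get(key)   # compared silently, never returned
    if shadow != primary:
        mismatches.append(("parity", key, primary, shadow))
    return primary
```

The point of the sketch is the asymmetry: callers never see the new platform until the parity log has been quiet for long enough, which is exactly why the crossover is measured in weeks rather than a weekend.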
The first migration had a platform on fire. The second has one that works. The second is the harder one.
Broken platforms are easier to leave than working ones. The middle weeks are where the migration actually costs. That is the shape of a migration when nothing is on fire.
