<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Devops on vnykmshr</title><link>https://blog.vnykmshr.com/writing/tags/devops/</link><description>Recent content in Devops on vnykmshr</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 15 Mar 2019 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.vnykmshr.com/writing/tags/devops/index.xml" rel="self" type="application/rss+xml"/><item><title>Prescaling for a known spike</title><link>https://blog.vnykmshr.com/writing/prescaling-for-a-known-spike/</link><pubDate>Fri, 15 Mar 2019 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/prescaling-for-a-known-spike/</guid><description>&lt;p&gt;Our biggest sale event of the year is on the calendar. The date is fixed, the hour is fixed, and when it starts, traffic hits a multiple of normal within minutes. The engineering challenge isn&amp;rsquo;t handling surprise. It&amp;rsquo;s handling certainty at a scale we&amp;rsquo;ve never seen before.&lt;/p&gt;
&lt;p&gt;We prepare for months. Six months out, teams start thinking about what their services need. Backend teams work with SRE and infra to define prescale configurations and autoscale rules. Terraform handles the provisioning. Each service team shares its capacity estimates with infra, and the configurations get codified.&lt;/p&gt;</description></item><item><title>Consul in practice</title><link>https://blog.vnykmshr.com/writing/consul-in-practice/</link><pubDate>Mon, 10 Sep 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/consul-in-practice/</guid><description>&lt;p&gt;The microservice count is growing fast. The monolith is mostly gone, and what replaced it is dozens of services across datacenters. We don&amp;rsquo;t have a uniform naming convention. Finding a service means knowing which team owns it, which cloud it&amp;rsquo;s on, and what they called it. That&amp;rsquo;s not scalable.&lt;/p&gt;
&lt;p&gt;Consul fixed the naming problem first.&lt;/p&gt;
&lt;h2 id="service-discovery"&gt;Service discovery&lt;/h2&gt;
&lt;p&gt;Every service registers with Consul. The DNS interface gives us a consistent way to find anything:&lt;/p&gt;</description></item><item><title>Hazard lights</title><link>https://blog.vnykmshr.com/writing/hazard-lights/</link><pubDate>Sat, 10 Jun 2017 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/hazard-lights/</guid><description>&lt;p&gt;There are about fifteen of us in the enclosure. Backend engineers, SRE, devops, infra &amp;ndash; handpicked from across the floor. The rest of the team, about a hundred people, sit outside. They call us the fishes in the aquarium.&lt;/p&gt;
&lt;p&gt;The aquarium has hazard lights. Physical ones &amp;ndash; wired to fire on any 5xx in the system. When something breaks in production, the room goes red.&lt;/p&gt;
&lt;p&gt;It sounds like a gimmick. It isn&amp;rsquo;t.&lt;/p&gt;</description></item><item><title>Zero-downtime deploys</title><link>https://blog.vnykmshr.com/writing/zero-downtime-deploys/</link><pubDate>Mon, 15 Jul 2013 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/zero-downtime-deploys/</guid><description>&lt;p&gt;I deploy everything from the terminal. No web interface, no CI service, no dashboard with green buttons. Just &lt;code&gt;deploy production&lt;/code&gt; from my laptop, and the code goes live.&lt;/p&gt;
&lt;p&gt;The setup is two pieces. A bash script that handles the remote work &amp;ndash; SSH in, pull the latest code, run hooks. And a Node.js process that watches a file on the server and reloads the app cluster when the file changes. Between them, they do zero-downtime deploys in under ten seconds.&lt;/p&gt;</description></item></channel></rss>