The board has been overhead for a year now. Green means a buyer can move from search to checkout without friction. We built it on top of the GraphQL gateway after the first event. When the next one comes, the whole floor reads it at a glance.

What it bought us was a head start. A stage would shade toward yellow and someone was already in the resolver, draining the slow backend, before the support queue noticed. We knew it was struggling before the customer told us. Being ahead of the customer felt like mastery.

What the board measures

This year a stage stayed green while it was wrong.

The traffic took a shape we had not built for. Demand concentrated on a narrow slice of catalog, and to keep the backends alive we did what the gateway is for – served cached responses where the live call was too expensive. Availability held. Latency held. Every box stayed green, because green is what those two numbers make.

The cache was stale. Buyers saw stock that was gone and prices that had moved. The flow worked – a buyer could search, add to cart, reach checkout, and the board followed them through every stage in green – and what they moved through was wrong. Nothing on the board measures whether the data is true. It measures whether the stage answered, and how fast.
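
A minimal sketch of that blind spot, not our actual board code. The thresholds, the StageSample fields, and the stage_color function are all hypothetical, chosen only to show that a color built from availability and latency cannot see staleness.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
AVAILABILITY_GREEN = 0.999
AVAILABILITY_YELLOW = 0.99
P95_GREEN_MS = 300.0
P95_YELLOW_MS = 800.0


@dataclass
class StageSample:
    availability: float    # fraction of requests that got a non-error answer
    p95_latency_ms: float  # 95th percentile response time
    stale_fraction: float  # fraction served from stale cache; never read below


def stage_color(sample: StageSample) -> str:
    """Color a stage from availability and latency alone."""
    if (sample.availability >= AVAILABILITY_GREEN
            and sample.p95_latency_ms <= P95_GREEN_MS):
        return "green"
    if (sample.availability >= AVAILABILITY_YELLOW
            and sample.p95_latency_ms <= P95_YELLOW_MS):
        return "yellow"
    return "red"


healthy = StageSample(availability=0.9995, p95_latency_ms=120.0, stale_fraction=0.0)
serving_stale = StageSample(availability=0.9998, p95_latency_ms=40.0, stale_fraction=0.85)

print(stage_color(healthy))        # green
print(stage_color(serving_stale))  # green
```

The second sample is the one that hurt us. Cached answers are fast and they never fail, so serving stale data pushes both inputs further into green.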

We found out from outside the board. A reconciliation number that did not match. A seller asking why an order came in at last week’s price. The board had been green the whole time, and it had not lied. It answered the only question we ever taught it to ask. We had stopped noticing that was the only question.

The move that saved us at the first event is the move that fooled us at this one. A green box means the stage is healthy, or it means we are serving cached data that is wrong, and the board cannot tell those apart. It reports availability and we read it as truth.

A model of the system

A dashboard is a model of the system. Ours was accurate the day we built it. Then the system moved – new traffic shape, new caching behavior under load – and the model stayed where it was. The board keeps rendering. It renders whether or not it still describes anything.

The moment a dashboard stops being true is earlier than the moment it looks wrong. Between those two moments the board is green and the system is not, and that gap is where the damage lives. The whole value of the board was the head start, and a stale model spends that head start against you. It buys time for the wrong thing.

The hazard light in the aquarium could never do this. One bulb, wired to 5xx. It says something is on fire or it says nothing. It cannot be precise about the wrong question, because it was never precise about anything. There is a floor of crudeness below which an instrument cannot mislead you in this particular way. The richer the board, the more specific the lie it can tell while every light is green.

What we removed

The fix was not another panel. We took some down – stage boxes that implied a precision we could not stand behind, that turned two infrastructure numbers into a claim about correctness they were never measuring.

What we added does not live on the journey board. Reconciliation that runs against what a real request returns, not against whether it returned. Sampled checks that compare a served price to the price of record. They are slower and less pretty, and they answer the question the board cannot.
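
A sampled check of this kind might look roughly like the sketch below. It is an illustration under assumptions, not our production job: sample_price_mismatches and its two lookup callables are hypothetical names, and a real version would read the served price through the gateway and the price of record from the pricing system.

```python
import random
from decimal import Decimal
from typing import Callable, Iterable, List, Optional, Tuple

Mismatch = Tuple[str, Decimal, Decimal]  # (sku, served price, price of record)


def sample_price_mismatches(
    skus: Iterable[str],
    served_price: Callable[[str], Decimal],     # hypothetical: what the buyer was shown
    price_of_record: Callable[[str], Decimal],  # hypothetical: the source of truth
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> List[Mismatch]:
    """Compare served prices to prices of record for a random sample of SKUs."""
    rng = rng or random.Random()
    pool = list(skus)
    chosen = rng.sample(pool, min(sample_size, len(pool)))
    mismatches: List[Mismatch] = []
    for sku in chosen:
        served = served_price(sku)
        record = price_of_record(sku)
        if served != record:
            mismatches.append((sku, served, record))
    return mismatches


# Made-up data: one SKU is still being served at an old cached price.
record = {"sku-1": Decimal("19.99"), "sku-2": Decimal("5.00"), "sku-3": Decimal("42.50")}
served = dict(record)
served["sku-2"] = Decimal("4.50")

mismatches = sample_price_mismatches(
    record, served.__getitem__, record.__getitem__, sample_size=3
)
print(mismatches)
```

The check passes or fails on what the response said, not on whether a response arrived. A stage can be fully available and still show up in this list.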

And a rule, for the next time the traffic changes shape: the board describes the system we had before the change, until something proves it still describes the one we have now. Green is provisional after a change. We earn it back, we do not assume it.
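
One way to make that rule mechanical, sketched under assumptions: BoardTrust and its methods are hypothetical, timestamps are plain numbers, and "verification" stands for whatever correctness check proves the board still describes the system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoardTrust:
    last_change_at: Optional[float] = None    # last known change in traffic or caching behavior
    last_verified_at: Optional[float] = None  # last time a correctness check passed

    def record_change(self, at: float) -> None:
        self.last_change_at = at

    def record_verification(self, at: float) -> None:
        self.last_verified_at = at

    def read(self, color: str) -> str:
        """Downgrade green to provisional until a check passes after the last change."""
        if color != "green":
            return color
        if self.last_change_at is None:
            return "green"  # assumption: no recorded change means the model still fits
        verified_since_change = (
            self.last_verified_at is not None
            and self.last_verified_at >= self.last_change_at
        )
        return "green" if verified_since_change else "green (provisional)"


trust = BoardTrust()
trust.record_change(at=100.0)
print(trust.read("green"))   # green (provisional)
trust.record_verification(at=150.0)
print(trust.read("green"))   # green
trust.record_change(at=200.0)
print(trust.read("green"))   # green (provisional)
```

Yellow and red pass through untouched. Only green has to be earned back.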

The board is still overhead. During the next event it will be green most of the time, and most of the time it will be right. The difference is that green is a question now. We read the color, and then we go and check.