Reviewing code that looks right

A 25-file PR has been open in another tab for two days. Three files hold the decision. The other twenty-two are reflections – the new check applied at every call site, the tests that exercise each site, the formatter cleaning up after.

The failure mode most reviewers carry is equal attention. Every file gets the same eye. Every diff gets the same minute. By line twelve hundred the budget is gone, and the three files that decide whether the system is actually defended got the same skim as the twenty-two files that mirror the decision.

The discipline is segregation. Two piles, before reading anything carefully. One pile is decision-bearing – the lines where the security model, the invariant, the contract actually lives. The other pile is trust-tooling – the mechanical work that should be true by construction if the decision-bearing pile is right. The CI mirrors the change. The tests mirror the spec. The formatter mirrors the style.
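A minimal sketch of the two-pile split, in Python. Everything named here is an assumption for illustration: the file paths, the `DECISION_GLOBS` patterns, and the `segregate` helper are hypothetical, and deciding which paths carry the decision stays a per-PR judgment the reviewer makes by hand.

```python
from fnmatch import fnmatch

# Hypothetical globs for one PR. The reviewer writes these after a first
# skim; they are placeholders, not a convention from any real tool.
DECISION_GLOBS = ["auth/*.py", "api/contract.py"]


def segregate(changed_files, decision_globs=DECISION_GLOBS):
    """Split changed paths into (decision_bearing, mirrors), keeping order."""
    decision, mirrors = [], []
    for path in changed_files:
        is_decision = any(fnmatch(path, glob) for glob in decision_globs)
        (decision if is_decision else mirrors).append(path)
    return decision, mirrors


# A made-up 25-file PR: 3 files that decide, 22 that mirror the decision.
changed = (
    ["auth/policy.py", "auth/check.py", "api/contract.py"]
    + [f"handlers/h{i}.py" for i in range(11)]
    + [f"tests/test_h{i}.py" for i in range(11)]
)

decision, mirrors = segregate(changed)
print("read hard:", decision)
print("trust the tooling on:", len(mirrors), "files")
```

The script is not the point. Writing the globs down before reading anything carefully is.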

Move the rigor budget to the pile that decides. Trust the tooling on the pile that mirrors. The phrase “trust the tooling” reads as dangerous the first time, like advice to skip the tests. It isn’t. It means trust the tests to validate the mechanical work, then read the logic harder, because there is now budget to read it harder.

Self-review and others’ review have different failure modes worth naming. The self-bias is “I just wrote this, the trade-offs are in my head.” The check got applied at each call site the way it did because the author held the constraint in working memory at the time. By the next morning the memory is gone and the confidence is left behind. The others-bias is “the author is trusted, the diff is large, the test suite is green, what could be wrong.” Segregation forces both readers to do the same thing: name the lines that decide, before deciding whether to trust them.

This used to be enough.

What changed is the cost of producing code that looks right. The plausibility floor lifted. A model writes each file fluently, names variables sensibly, applies the new check at every call site without missing one, generates the test that exercises the happy path, runs the formatter. Each file passes inspection on its own. The CI is green because every individual claim is locally true. The thing that doesn’t show up in any individual file is whether the system’s invariant survived all of it. That part the model tends not to see when it writes each file in isolation. Each site is fluent. The relationship across sites is not written in any one file, so there is nothing there to be fluent about.

This is where segregation moves from useful to load-bearing. When code that compiles, runs, passes tests, and reads naturally is cheap, the signal “this looks like code that would work” is degraded. Now it always looks like code that would work. Confidence has to come from somewhere else. The somewhere else is the decision-bearing pile, read harder, with the knowledge that the tooling-pass signal cannot anchor it.

A rule of elimination. Default: this line should not exist. Earn it. A reviewer reading with that posture finds the line that was added because the prompt drifted, the helper that got introduced because the model couldn’t see the existing one, the assertion that asserts the implementation instead of the contract. None of those lines look wrong. They look fine. They aren’t earning anything.
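One of those lines, the assertion that asserts the implementation instead of the contract, is easier to spot with an example in hand. The sketch below is illustrative only: `normalize_email` and both tests are hypothetical, written in plain Python so they run without a test framework.

```python
def normalize_email(raw: str) -> str:
    """Hypothetical helper: trim surrounding whitespace and lowercase."""
    return raw.strip().lower()


def test_mirrors_implementation():
    raw = "  Ada@Example.COM "
    # Restates the function body on the right-hand side. If the body is
    # wrong, this expectation is wrong in the same way, so it still passes.
    assert normalize_email(raw) == raw.strip().lower()


def test_states_contract():
    # Names what callers rely on, with literal expected values.
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"
    # Case variants of the same address compare equal after normalizing.
    assert normalize_email("ADA@EXAMPLE.COM") == normalize_email("ada@example.com")
    # Normalizing twice changes nothing.
    once = normalize_email(" Ada@Example.COM ")
    assert normalize_email(once) == once


test_mirrors_implementation()
test_states_contract()
```

Both tests are green. Only the second would notice if the behaviour callers depend on changed, which is why the first has not earned its place under the elimination posture.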

Eighteen passes through a recent PR made the point. The diff carried a new parser-and-validator that none of the prior reads had questioned – it looked like work the change required. The elimination pass asked “why does this need to exist?” and surfaced what no fluent read would: the codebase already had one. Removing it cut the diff and the tests that traveled with it.

Forced trade-off framing on judgment calls. “Current is fine” without a stated alternative is the failure mode that lets sub-optimal choices through. The reviewer’s job on a judgment line is not to approve or reject – it is to name the trade-off. If the author can’t articulate why this and not that, the line isn’t ready, regardless of whether it works.

Cross-site symmetry. Three places do related work. Compare them side by side, not file by file. The asymmetry across sites is where the bug lives when each site is locally fluent. The model writes site A correctly, site B correctly, site C correctly, and none of them agree about which one owns the invariant. The reviewer reading file by file never sees it. The reviewer reading site by site does.
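Reading site by site can be as simple as writing down what each site enforces and diffing the rows. A sketch under stated assumptions: the site names and check names are invented, the reviewer fills in the table by hand while reading, and `asymmetries` is a hypothetical helper, not part of any tool.

```python
def asymmetries(sites):
    """Map each site to the checks some other site performs but it does not.

    `sites` maps a site name to the set of checks the reviewer saw it enforce.
    Sites that perform every check seen anywhere are left out of the result.
    """
    all_checks = set().union(*sites.values())
    gaps = {}
    for name, checks in sites.items():
        missing = all_checks - checks
        if missing:
            gaps[name] = missing
    return gaps


# Filled in by hand from three hypothetical call sites in one PR.
sites = {
    "api_handler": {"authenticated", "owns_resource"},
    "batch_job": {"authenticated"},
    "admin_cli": {"owns_resource"},
}

for site, missing in asymmetries(sites).items():
    print(f"{site} does not check: {sorted(missing)}")
```

A gap in the table is not automatically a bug. It is the question to put to the author: which site owns the invariant, and do the others rely on that on purpose?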

Segregation gives both author and reviewer a method, not just hope. The author stages the work so the decision-bearing files come first, with the most context in the description. The reviewer spends their budget where the decision lives, and signals it – “considered alternative X, rejected because Y” – not “looks good to me.”

The honest question for the reviewer isn’t “did I read every line.” It’s “did I read the lines that decide, harder than I would have if every line looked broken.” That used to be a question of how much attention. It is now also a question of where the attention goes.

Read fewer lines, harder.