Alibaba open-sourced a code-review agent. The interesting part isn’t the review quality. It’s one function in the agent loop.
dispatchSubtasks walks for i := range a.diffs and fires one bounded sub-agent per changed file, semaphore-capped so a thousand-file PR doesn’t melt the API budget. Coverage – the property that every changed file got reviewed – is a for loop. Not a model being diligent. Not a prompt asking it to be thorough. A range over a slice. The loop is even honest about what it drops: files too large to fit the token budget get filtered out before the loop, explicitly, with a warning, rather than silently half-read.
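A minimal sketch of that shape, to make the claim concrete. This is not the agent's actual source: FileDiff, splitByBudget, dispatch, and the token numbers are hypothetical names and values I made up for illustration. The only things taken from the description above are the range over the diffs, the semaphore cap, and the explicit pre-loop filter.

```go
package main

import (
	"fmt"
	"sync"
)

// FileDiff is a hypothetical stand-in for one changed file in a PR.
type FileDiff struct {
	Path   string
	Tokens int // estimated token cost of reviewing this file
}

// splitByBudget separates files that fit the per-file token budget from
// those that do not, so oversized files are dropped explicitly, not silently.
func splitByBudget(diffs []FileDiff, maxTokens int) (fit, skipped []FileDiff) {
	for _, d := range diffs {
		if d.Tokens > maxTokens {
			skipped = append(skipped, d)
			continue
		}
		fit = append(fit, d)
	}
	return fit, skipped
}

// dispatch calls review once per file with at most limit calls in flight
// (limit must be at least 1). results[i] always belongs to diffs[i]:
// coverage comes from the loop, not from the reviewer.
func dispatch(diffs []FileDiff, limit int, review func(FileDiff) string) []string {
	results := make([]string, len(diffs))
	sem := make(chan struct{}, limit) // buffered channel as counting semaphore
	var wg sync.WaitGroup
	for i := range diffs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit reviews are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = review(diffs[i])
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	diffs := []FileDiff{{"a.go", 800}, {"huge.gen.go", 120000}, {"c.go", 1500}}
	fit, skipped := splitByBudget(diffs, 50000)
	for _, s := range skipped {
		fmt.Printf("warning: skipping %s (%d tokens exceeds budget)\n", s.Path, s.Tokens)
	}
	results := dispatch(fit, 2, func(d FileDiff) string {
		return "reviewed " + d.Path // stand-in for a real sub-agent call
	})
	for _, r := range results {
		fmt.Println(r)
	}
}
```

Nothing in dispatch depends on what the review function decides to pay attention to. The slice has N entries, so N calls happen.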
The same refusal shows up where the comments land. Positioning a review comment on the right line is a text-match against the diff first. Only when that match fails does the model get called back – and then just to regenerate the code snippet, after which the code re-resolves the position itself. The model proposes; the code decides where it goes. Five tools, not a generic toolkit. Three-zone memory compression so a long review still fits the window. The whole architecture is one idea repeated: do not trust the model where it is weak, and put an engineer’s scaffold there instead.
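The positioning rule can be sketched the same way. Again, this is my reconstruction of the idea under simple assumptions, not the project's code: locate, resolve, and the regenerate callback are hypothetical, and the match here is a plain substring search where the real tool may well do something smarter.

```go
package main

import (
	"fmt"
	"strings"
)

// locate returns the 1-based number of the first diff line that contains
// snippet (after trimming whitespace), or 0 if nothing matches.
func locate(diffLines []string, snippet string) int {
	snippet = strings.TrimSpace(snippet)
	if snippet == "" {
		return 0
	}
	for i, line := range diffLines {
		if strings.Contains(line, snippet) {
			return i + 1
		}
	}
	return 0
}

// resolve tries the text match first. Only when that fails does it call
// regenerate (a hypothetical model callback) for a fresh snippet, and then
// the code, not the model, resolves the position again.
func resolve(diffLines []string, snippet string, regenerate func(string) string) (int, bool) {
	if n := locate(diffLines, snippet); n > 0 {
		return n, true
	}
	if n := locate(diffLines, regenerate(snippet)); n > 0 {
		return n, true
	}
	return 0, false // no position: the caller decides what to do with the comment
}

func main() {
	diff := []string{"+ total := 0", "+ total += price * qty", "- return sum"}
	// The model paraphrased the code, so the first match fails.
	line, ok := resolve(diff, "total = total + price*qty", func(string) string {
		return "total += price * qty" // stand-in for the model's second attempt
	})
	fmt.Println(line, ok)
}
```

The model never returns a line number in this sketch. It only returns text, and the text either matches the diff or it does not.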
Reading that loop is what made me look at my own setup, which is two AI reviewers in sequence: a self-gate I run before I’m allowed to be happy with a change, then an independent peer pass whose job is to argue with the first. I’d assumed the two stages covered me – one catches the obvious, the other refutes what’s left. The for loop showed me what neither of mine does. Neither one guarantees it looked at every file. They can’t. There’s no range over a slice anywhere in either; there’s a model, asked nicely, deciding what’s worth its attention.
Three serious tools, the same model underneath all of them, and each made a different bet about where review quality comes from. The Alibaba one bets on process determinism: constrain the model with engineering, trust nothing you can put in a loop instead. Claude’s /code-review bets on adversarial verification: let the model find freely, then spawn a second pass whose only job is to refute each finding before it reaches you. The playbook’s /pb-review – mine – bets on encoded judgment and breadth: the findings worth having are trade-offs, not typos, so spend the sophistication on perspective and trust the model on coverage. The bets barely overlap. They’re not three takes on the same design – they’re three theories of what the model can’t be trusted to do.
And two of the three converged on the same two weaknesses without coordinating. The model is reliably bad at exactly two things: being exhaustive, and being precise. Alibaba’s loop and Claude’s refute-pass each looked at that and built a scaffold that sits outside the reviewing model’s discretion – the range over changed files for exhaustiveness, the independent second pass for precision. Mine didn’t. Mine trusts the model on coverage and spends its budget on judgment.
You cannot port a guarantee into a prompt. Write “enumerate every changed file and review each as a bounded unit” into a markdown command and you do not get for file := range diffs. You get a stronger suggestion. The model will usually comply, and “usually” is precisely the word a for loop deletes from the sentence. The mechanism and the enforcement are the same object; you cannot keep one and reword the other.
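Here is the difference as a toy, assuming a deliberately lazy reviewer. Model, reviewByPrompt, reviewByLoop, missing, and the skim function are all hypothetical; the point is only where the "every file" property lives. Note what the loop does and does not promise: each file becomes the subject of its own call, which says nothing about how insightful that call is.

```go
package main

import "fmt"

// Model is a hypothetical reviewer: given a set of files, it returns the
// ones it actually chose to look at. Nothing forces that to be all of them.
type Model func(files []string) []string

// reviewByPrompt hands the whole change over in one request and records
// whatever the model decided to cover.
func reviewByPrompt(m Model, files []string) map[string]bool {
	seen := map[string]bool{}
	for _, f := range m(files) {
		seen[f] = true
	}
	return seen
}

// reviewByLoop makes one bounded call per file. The code, not the model,
// records that the file was the subject of a call.
func reviewByLoop(m Model, files []string) map[string]bool {
	seen := map[string]bool{}
	for _, f := range files {
		m([]string{f}) // findings are ignored in this sketch
		seen[f] = true
	}
	return seen
}

// missing lists the changed files that were never covered, in input order.
func missing(files []string, seen map[string]bool) []string {
	var out []string
	for _, f := range files {
		if !seen[f] {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	files := []string{"a.go", "b.go", "c.go", "d.go"}
	// skim only ever looks at the first two files it is given.
	skim := Model(func(fs []string) []string {
		if len(fs) > 2 {
			return fs[:2]
		}
		return fs
	})
	fmt.Println("prompt missed:", missing(files, reviewByPrompt(skim, files)))
	fmt.Println("loop missed:  ", missing(files, reviewByLoop(skim, files)))
}
```

The same lazy model sits behind both functions. Only one of them lets its laziness decide which files get a turn.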
The tell is that one of these bets does port and one doesn’t. The adversarial second pass survives translation from code to prompt, because a command can actually spawn the refuter subagents – the enforcement is real on either side. The coverage loop does not survive it, because a paragraph that asks for completeness is not completeness. So the test for “should I copy this into my own tooling” was never “is it a good idea.” It’s “does it survive being reworded from code into a prompt.” If it doesn’t, copying the wording cargo-cults the mechanism and keeps none of it.
Adding a second AI reviewer is not like adding a second engineer. Two engineers each independently cover the diff; redundancy is the whole point. Two AI reviewers stack mechanisms, and the mechanisms don’t add up to coverage unless one of them enforces it. My pipeline captures adversarial verification and encoded judgment and is structurally blind to the remaining bet, process determinism, because neither stage holds a loop. Bolting on a third prompt won’t close it. Coverage isn’t a prompt-shaped thing. It’s a for-shaped thing.
The honest limit: I don’t know that this blind spot ever bites. My reviews run on small atomic commits, and on a five-file change the model’s attention and a for loop probably land in the same place. The gap is real in the architecture and may be theoretical in my actual use – it might be Alibaba’s scale problem, imported into a personal tool that doesn’t have Alibaba’s scale. That’s worth saying out loud rather than dramatizing a hole I haven’t fallen through.
But the law underneath holds at any scale. A guarantee you can reword into a prompt was never a guarantee.
