I wanted to know whether Fable 5 was worth roughly twice what Opus 4.8 costs, so I did the obvious thing – pointed both at the same work and watched. (The names will date this post fast. Hold that thought, it turns out to be the whole point.)

One of the tasks was auditing the docs in my engineering playbook – a public repo, a hundred-odd markdown commands plus some Python tooling. Is the guidance in here still current? The docs include a page about the models themselves, which one does what, when to reach for which. Good test, I figured – the models know that subject better than anyone.

That’s where it went sideways. Both flagged the same page as stale – the line explaining that /fast keeps you on Opus with quicker output, across its last three releases. Fable 5 trimmed that to a single release. Opus 4.8 ruled out a different release the first had kept. Two confident fixes pointing opposite directions, and the page they were both rewriting was correct. I checked. The docs had been updated after the models were trained, and each one was grading a current page against a remembered, older version of the same facts – the lineup as it stood a few releases back, confidently restored over the page sitting in front of it. They did the same thing with the effort tiers a few lines down – “knew” a default that wasn’t the default, asserted it as ground truth.

The page wasn’t stale. The model was. It graded the doc against its memory of its own recent past instead of against the doc, and where the two disagreed it trusted the memory and called the page a bug. The thing being audited had poisoned the audit. A doc updated after the cutoff doesn’t read as fresh to a model, it reads as wrong, because the model is the one piece of the whole setup that can’t be updated.

This is narrower than “models are bad at recent facts,” and the narrowing matters. Give a model a search tool and it’ll fetch today’s answer fine – recency on its own isn’t the trap. The trap is self-reference without retrieval. Ask a model about its own lineage, its own SDK, its own flags, and it doesn’t reach for ground truth, because it already knows, the answer is sitting right there in the weights, confident and a few months behind. The blind spot is sharpest exactly where the subject is the model itself. It’s most sure, and most wrong, about the one topic it can’t help being personally out of date on.

I’d have missed all of this if I’d run it the usual way. “Review this and tell me what’s wrong,” from one model, is ungradeable – no control, no ground truth, no way to separate a real catch from a confident hallucination. So I didn’t let either model grade itself. Same brief to both, then every finding, from either one, handed to a separate skeptical pass that re-read the source and ruled it confirmed, false, or unsure, blind to which model had raised it. For the claims about capabilities I gave that judge an authoritative block of current facts pulled straight from the live environment, not from any model’s memory. Forty of those blind verifiers, a couple of passes after to diff what survived. All that scaffolding bought a single property: nothing about the models got checked by a model’s recall of the models.

That’s what caught the false fixes, and they needed catching, because the reasoning behind them read like careful, sober corrections – the kind you nod along to. Without the ground truth sitting in the room, an unattended pass would have taken my correct page and helpfully broken it.

Two models, one repo, two kinds of task. That’s not a benchmark and I’m not handing you a winner – for what it’s worth the two came out complementary, not ranked, one with the longer reach and one that never cried wolf. The self-audit failure showed up in both. A blind spot that lands the same way in two independent frontier models isn’t one vendor’s bug, it’s a property of the shape: point a model at its own recent past and it inspects its memory, then files the gap between memory and reality as your mistake.

The other half of why the afternoon was worth it: I aimed the whole thing at real code instead of a toy benchmark, so the findings that taught me nothing about the models were still findings. Sixteen real bugs in the repo’s tooling, verified. One actually mattered – the link checker the playbook tells people to run as part of its “did you verify” ritual was quietly broken. scripts/check-links.py had a stray / in a list it matched substrings against, swallowing almost every link and reporting green. The ritual that exists to catch rot had rot in it. That fix alone paid for the afternoon, the model question was nearly a bonus.

The honest limit: one trial, two models, two tasks. I can’t tell you how often the self-audit thing bites in the wild or how it scales. I can tell you the shape, and the shape held across both models cleanly enough that I’d act on it. The lesson isn’t really about these two, who’ll be old news by the time the names stop meaning anything. It’s the procedure. When the thing you’re auditing is the model itself – its docs, its SDK, its own capabilities – the model is the worst available witness, and the only fix is to put ground truth it can’t argue with in the room before you let it grade a single line.

Ask a model to check a page about itself and it won’t check the page. It’ll check its memory, and where they differ it’ll trust the one thing in the room that can’t be brought up to date. The model is the stale one. It’s most confident exactly there.