A cheap, high-recall “verify, don’t claim” hook should make a capable model resolve a paradox it would otherwise walk past — and the lift should be largest where the base model is weakest.
The probe is the carwalk riddle: “I need to wash my car. The carwash is only 50 m away. Walk or drive?” Trivially, 50 m is a walk — but the car is the thing being washed, so it has to make the trip. The right answer is drive.
Four cells, {opus 4.7, opus 4.8} × {bare, +🔎悖论? hook}, were each run 20 times (n=20) in fresh sessions through freshclaude on the real harness; the hook is freshclaude’s own default system prompt (悖论 is Chinese for “paradox”). Every answer was hand-graded by reading it, with no keyword classifier, since a keyword classifier is what produced the retracted number. Grading was strict: a punt to “which kind of wash is it?” counts as not caught.
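The strict convention can be made concrete with a small tally over hand-assigned labels. This is a minimal sketch, not the study's tooling: the label names (`caught`, `missed`, `punt`), the `catch_rate` helper, and the lenient variant are all assumptions introduced for illustration, and the example grades are made up rather than taken from the runs.

```python
from collections import Counter

# Hypothetical label set a human grader might assign after reading each answer.
LABELS = {"caught", "missed", "punt"}


def catch_rate(labels, strict=True):
    """Fraction of answers counted as catching the paradox.

    strict=True:  only "caught" counts; a "punt" (asking which kind of
                  wash it is) is scored as not caught.
    strict=False: assumed lenient variant in which a punt also counts.
    """
    if not labels:
        raise ValueError("no graded answers")
    unknown = set(labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    hits = counts["caught"] + (0 if strict else counts["punt"])
    return hits / len(labels)


# Made-up grades for one cell of 20 sessions (illustration only, not study data).
example = ["caught"] * 10 + ["punt"] * 6 + ["missed"] * 4
strict_rate = catch_rate(example, strict=True)
lenient_rate = catch_rate(example, strict=False)
print(f"strict {strict_rate:.0%}, lenient {lenient_rate:.0%}")
```

Keeping the human label separate from the scoring rule means the same set of hand grades can be re-scored under a different convention without re-reading the answers.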
| base model | bare | + 悖论? hook | lift (pts) |
|---|---|---|---|
| opus 4.7 | 20% / 55% | 63% / 95% | +43 / +40 |
| opus 4.8 | 90% / 100% | 95% / 100% | +5 / +0 |
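The lift column is plain percentage-point subtraction, hook rate minus bare rate, taken separately for each of the two figures in a cell. A short sketch that re-derives it from the table's numbers (the `RATES` layout and the `lift` helper are illustrative names, not part of the harness):

```python
# Catch rates in percent, copied from the table. Each cell holds two figures,
# kept here in the same order as they appear in the table.
RATES = {
    "opus 4.7": {"bare": (20, 55), "hook": (63, 95)},
    "opus 4.8": {"bare": (90, 100), "hook": (95, 100)},
}


def lift(cell):
    """Percentage-point lift: hook rate minus bare rate, per figure."""
    return tuple(h - b for b, h in zip(cell["bare"], cell["hook"]))


for model, cell in RATES.items():
    first, second = lift(cell)
    print(f"{model}: {first:+d} / {second:+d}")
# prints:
# opus 4.7: +43 / +40
# opus 4.8: +5 / +0
```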
The pattern is robust to grading convention: bare models do miss the paradox (4.7 badly), so “the capability is in the weights” is false. The dumb hook gives a large lift on the weak base (4.7: +43 pts) and a near-zero lift on the strong one (4.8: +5 pts at most, already near ceiling). That is the “opus + dumb hook is much smarter” effect: it is strongest exactly where the model is weakest, with the hook standing in for calibration the stronger model has already internalized.
The static code-review bench that came before this floored at zero victory-errors across 64 trials — the failure mode does not occur in short, well-specified review, so that instrument could not test the claim at all. And the only premature-victory event in the whole study was the evaluator’s, not the models’.