# Carwalk riddle 2x2 — hand-graded (gold), n=20/cell (19 for 4.7-hook; 1 empty run)
# Probe: "I need to wash my car. The carwash is only 50 m away. Walk or drive?"
# Hook tested = freshclaude's ACTUAL default system prompt: 🔎悖论?🧐 (悖论 = "paradox")
# Grading: by reading every answer (no keyword classifier). Two conventions reported.

## CAUGHT = answer commits that you must DRIVE (car has to be at the wash)
STRICT (punt to "which kind of wash?" = NOT caught):
                | bare        | + 悖论? hook
  opus 4.7      | 4/20  = 20% | 12/19 = 63%
  opus 4.8      | 18/20 = 90% | 19/20 = 95%

LENIENT (raising the car-presence issue at all = caught):
                | bare        | + 悖论? hook
  opus 4.7      | 11/20 = 55% | 18/19 = 95%
  opus 4.8      | 20/20 =100% | 20/20 =100%
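
The two tables above can be re-derived from one set of per-cell label counts. The sketch below is a minimal illustration, not the grading procedure itself. It assumes every strict-caught answer is also lenient-caught, so that drive = strict count, punt = lenient minus strict, and miss = n minus lenient. The label names `drive`, `punt`, `miss` and the helper `caught_rate` are hypothetical.

```python
# Per-cell answer counts derived from the STRICT and LENIENT tables.
# ASSUMPTION: strict-caught answers are a subset of lenient-caught answers, so
#   drive = strict count, punt = lenient - strict, miss = n - lenient.
# The label names are illustrative; they are not taken from the grading sheet.
CELLS = {
    ("opus 4.7", "bare"): {"drive": 4, "punt": 7, "miss": 9},
    ("opus 4.7", "hook"): {"drive": 12, "punt": 6, "miss": 1},
    ("opus 4.8", "bare"): {"drive": 18, "punt": 2, "miss": 0},
    ("opus 4.8", "hook"): {"drive": 19, "punt": 1, "miss": 0},
}


def caught_rate(counts, convention):
    """Return (caught, n, rate) for one cell under 'strict' or 'lenient'."""
    n = sum(counts.values())
    caught = counts["drive"]  # committing to DRIVE counts under both conventions
    if convention == "lenient":
        caught += counts["punt"]  # raising the car-presence issue also counts
    elif convention != "strict":
        raise ValueError(f"unknown convention: {convention!r}")
    return caught, n, caught / n


for convention in ("strict", "lenient"):
    print(convention.upper())
    for (model, arm), counts in CELLS.items():
        caught, n, rate = caught_rate(counts, convention)
        print(f"  {model} {arm:<4} | {caught}/{n} = {rate:.0%}")
```

Keeping one label per answer and deriving both conventions from it means a second grader only has to assign labels once.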

## Convention-ROBUST findings
1. Bare models DO miss the paradox. 4.7 especially (strict 20% / lenient 55%
   caught). 4.8 misses only under STRICT (90%, i.e. 2/20 punt); under LENIENT
   it is 20/20, so the 4.8 miss is NOT convention-robust. This RETRACTS the
   earlier "67/67 VE=0 / bare always carwalks / capability is in the weights"
   claim.
2. The 悖论? hook gives a LARGE lift on the weaker base 4.7 (strict +43pts,
   lenient +40pts) and only a small/ceiling lift on 4.8 (strict +5pts,
   lenient +0pts).
   => This is the "opus + dumb hook is much smarter" effect, strongest where the
      base is weaker; on 4.8 the base already does most of the work. Consistent
      with the hook SUBSTITUTING for calibration the stronger model has
      internalized.
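
The lifts quoted in finding 2 can be recomputed from the table counts, and a two-sided Fisher exact test gives a rough check on which gaps are distinguishable from noise at this n. This is a sketch only: the choice of Fisher's test is an assumption of this example (the note itself reports no significance test), and `fisher_exact_two_sided` is a hypothetical helper written with the standard library.

```python
from math import comb

# (caught, n) per cell, copied from the STRICT and LENIENT tables.
COUNTS = {
    "strict": {"opus 4.7": ((4, 20), (12, 19)), "opus 4.8": ((18, 20), (19, 20))},
    "lenient": {"opus 4.7": ((11, 20), (18, 19)), "opus 4.8": ((20, 20), (20, 20))},
}


def lift_points(bare, hook):
    """Hook-minus-bare caught rate, in percentage points."""
    (kb, nb), (kh, nh) = bare, hook
    return 100 * (kh / nh - kb / nb)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k "caught" in the first row.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables at least as unlikely as the observed one (small tolerance
    # guards against floating-point ties).
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


for convention, models in COUNTS.items():
    for model, (bare, hook) in models.items():
        (kb, nb), (kh, nh) = bare, hook
        p = fisher_exact_two_sided(kb, nb - kb, kh, nh - kh)
        print(f"{convention:<7} {model}: lift {lift_points(bare, hook):+.0f}pts, p = {p:.3f}")
```

With a single grader and one prompt, any p-value here describes sampling noise only; it says nothing about grader bias.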

## Caveats (loud)
- SINGLE grader = me (claude-opus-4-8), the same model class under test, which
  spent this session biased toward flattering reads (see the b7374d1e incident).
  NEEDS an independent / blind second grader + inter-rater agreement.
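- n=20/cell => wide 95% CIs: up to about ±22pt near 50%, narrower near the
  ceiling. The 4.7 bare-vs-hook gap (about +40pt under either convention) is
  large enough to likely survive; the 4.8 strict gap (90->95) is within noise.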
- Convention choice moves absolute numbers a lot; only the QUALITATIVE pattern
  (big 4.7 lift, 4.8 near-ceiling, bare-misses-exist) is robust.
- Question phrasing fixed; one prompt only, so other phrasings are untested.
  The hook tested is the real 悖论? hook, not a stand-in (good).
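
Two of the caveats above lend themselves to a quick calculation: per-cell confidence intervals at n=20, and inter-rater agreement once a second grader exists. The sketch below uses the Wilson score interval and Cohen's kappa; both are choices made for this example, not methods named in the note. The grader label lists at the bottom are made-up placeholders that only show the call shape; no second grading has been done.

```python
from collections import Counter
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n (z=1.96 ~ 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two graders labelling the same answers."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return float("nan")  # both graders used one identical label; undefined
    return (observed - expected) / (1 - expected)


# STRICT counts from the table: (caught, n) per cell.
for name, (k, n) in {
    "4.7 bare": (4, 20), "4.7 hook": (12, 19),
    "4.8 bare": (18, 20), "4.8 hook": (19, 20),
}.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")

# HYPOTHETICAL labels, for illustration only (not real grading data).
grader_1 = ["drive", "drive", "punt", "walk"]
grader_2 = ["drive", "punt", "punt", "walk"]
print(f"kappa = {cohen_kappa(grader_1, grader_2):.2f}")
```

The Wilson interval is used here instead of the plain normal approximation because it stays inside [0, 1] for cells at or near 100%.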
