And this is the part I find most striking... the River Crossing problem (more precisely, variants such as the missionaries-and-cannibals problem with a 3-person boat) is known to have literally no solution(!!!) when N is 6 or greater.

๊ฐ™์€ ๋…ผ๋ฌธ ๋ฐœ์ทŒ:

3 The Impossible Puzzle Problem

The evaluation issues compound dramatically in the River Crossing experiments. Shojaee et al. test instances with N ≥ 6 actors/agents using boat capacity b = 3. However, it is a well-established result that the Missionaries-Cannibals puzzle (and its variants) has no solution for N > 5 with b = 3.

By automatically scoring these impossible instances as failures, the authors inadvertently demonstrate the hazards of purely programmatic evaluation. Models receive zero scores not for reasoning failures, but for correctly recognizing unsolvable problems, equivalent to penalizing a SAT solver for returning "unsatisfiable" on an unsatisfiable formula.
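
For anyone who wants to check the unsolvability claim themselves, here is a rough sketch of a brute-force search over the classic Missionaries-Cannibals state space. The function name `min_crossings` and the exact safety rules (missionaries must never be outnumbered on either bank or in the boat) are my own assumptions for illustration; this is not the evaluation code from either paper, and the actors/agents variant tested by Shojaee et al. uses different constraints.

```python
from collections import deque


def min_crossings(n, b):
    """Fewest boat trips to move n missionaries and n cannibals across
    with boat capacity b, or None if no sequence of moves works.
    Hypothetical helper; the rules follow the classic formulation."""

    def bank_ok(m, c):
        # A bank is safe if it has no missionaries or they are not outnumbered.
        return m == 0 or m >= c

    start = (n, n, 0)  # (missionaries on left, cannibals on left, boat side: 0 = left)
    goal = (0, 0, 1)
    dist = {start: 0}
    queue = deque([start])

    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        m, c, side = state
        # People standing on the bank where the boat currently is.
        avail_m, avail_c = (m, c) if side == 0 else (n - m, n - c)
        sign = -1 if side == 0 else 1
        for dm in range(min(avail_m, b) + 1):
            for dc in range(min(avail_c, b - dm) + 1):
                if dm + dc == 0:
                    continue  # the boat cannot cross empty
                if dm > 0 and dc > dm:
                    continue  # assumed rule: no outnumbering inside the boat
                nm, nc = m + sign * dm, c + sign * dc
                nxt = (nm, nc, 1 - side)
                if nxt in dist:
                    continue
                if bank_ok(nm, nc) and bank_ok(n - nm, n - nc):
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
    return None  # search space exhausted: the instance is unsolvable


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, min_crossings(n, 3))
```

Under these assumed rules the search finds an 11-trip solution for N = 5 with b = 3 and exhausts every reachable state without reaching the goal for N = 6, which matches the claim in the excerpt.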