On the limits of LLMs (Large Language Models) and LRMs (Large Reasoning Models). The TL;DR: "Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds." Meaning: accuracy collapse.

Interesting paper from Apple. ml-site.cdn-apple.com/papers/t

• We question the current evaluation paradigm of LRMs on established math benchmarks and
design a controlled experimental testbed by leveraging algorithmic puzzle environments that enable
controllable experimentation with respect to problem complexity.
• We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking)
still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing
to zero beyond certain complexities across different environments.
• We find that there exists a scaling limit in the LRMs’ reasoning effort with respect to problem
complexity, evidenced by the counterintuitive decreasing trend in the thinking tokens after a
complexity point.
• We question the current evaluation paradigm based on final accuracy and extend our evaluation
to intermediate solutions of thinking traces with the help of deterministic puzzle simulators. Our
analysis reveals that as problem complexity increases, correct solutions systematically emerge at
later positions in thinking compared to incorrect ones, providing quantitative insights into the
self-correction mechanisms within LRMs (a minimal simulator sketch follows this list).
• We uncover surprising limitations in LRMs’ ability to perform exact computation, including their
failure to benefit from explicit algorithms and their inconsistent reasoning across puzzle types.
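The deterministic simulators themselves aren't shown in this note, but the idea is concrete: each puzzle has a checker that can replay a proposed move sequence and flag exactly where it goes wrong, which is what lets the authors score intermediate solutions inside thinking traces. Below is a minimal sketch in Python, assuming Tower of Hanoi (one of the puzzle environments the paper uses) and illustrative names such as `simulate_hanoi`; it is not the paper's actual code.

```python
# Minimal sketch of a deterministic puzzle simulator: replay a model-proposed
# move sequence for Tower of Hanoi, report the first illegal move (if any),
# and check whether the final state is solved. Names/structure are illustrative.

def simulate_hanoi(n_disks, moves):
    """Replay `moves` (list of (disk, from_peg, to_peg) triples, pegs 0-2)
    on an n-disk Tower of Hanoi instance.

    Returns (solved, first_bad_move_index); the index is None if every
    move was legal.
    """
    # Peg 0 starts with all disks, largest (n_disks) at the bottom.
    pegs = [list(range(n_disks, 0, -1)), [], []]

    for i, (disk, src, dst) in enumerate(moves):
        # Legal only if `disk` is on top of `src` and smaller than the
        # current top of `dst` (or `dst` is empty).
        if not pegs[src] or pegs[src][-1] != disk:
            return False, i
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i
        pegs[dst].append(pegs[src].pop())

    solved = pegs[2] == list(range(n_disks, 0, -1))
    return solved, None


if __name__ == "__main__":
    # Optimal 3-disk solution (7 moves); here "complexity" is just n_disks.
    optimal = [(1, 0, 2), (2, 0, 1), (1, 2, 1), (3, 0, 2),
               (1, 1, 0), (2, 1, 2), (1, 0, 2)]
    print(simulate_hanoi(3, optimal))  # -> (True, None)
```

Because the checker is deterministic, the same replay can be run on every candidate solution a model writes down mid-reasoning, not just its final answer, which is how position-within-trace statistics like the ones above can be computed.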
