LLMs can't reason

New study by a team of Apple researchers on LLMs:

“We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer.”

This isn’t new or surprising (to some), but the fact that it comes from Apple is arguably significant right now. It’s nice to see they won’t just follow the trend.

If you don’t want to read the paper, here’s a more digestible breakdown with additional context (below). The GSM-NoOp test is incredibly easy to understand and illustrates the issue really well:

LLMs don’t do formal reasoning - and that is a HUGE problem
Important new study from Apple
"Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"

Interestingly, I put this question to ChatGPT 4 (whatever the free browser version is), and it came out with the correct answer. But my local install of Llama 3.1 utterly failed. Are the newer models (4o and 3.1) actually worse? That’s hilarious.

(The answer is 190, but the newer models take the smaller size into account and tell you 185.)
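If you want to check the arithmetic yourself, here’s a minimal sketch (plain Python, not from the paper) showing why the “five of them were a bit smaller” clause is a no-op:

```python
# The question only asks how many kiwis Oliver has; size is irrelevant.
friday = 44
saturday = 58
sunday = 2 * friday                 # "double the number he did on Friday"

correct = friday + saturday + sunday   # 44 + 58 + 88 = 190
distracted = correct - 5               # the wrong answer that treats the
                                        # irrelevant clause as a deduction

print(correct)     # 190
print(distracted)  # 185
```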
