A Closer Look at AI Reasoning
Unlocking the Power of AI: Transforming How Machines Think and Reason
The AI breakthroughs of the past five years have been genuinely impressive.
Whether you see generative AI as the world’s fanciest autocomplete tool or as a real step on the road to a truly transformative technology, you surely must agree.
I open with this because I want to put that to one side as we consider reasoning in AI, which has lately dominated tech news.
This is, of course, increasingly challenging: managing both our expectations and the hype around AI as we try to take a practical, closer look at these advances.
Strawberries and Qs
For more than a year, there has been extensive speculation, and even handwringing, about OpenAI and AI development, especially around a project known as Q* (later called Strawberry). It was reported to be part of the drama around CEO Sam Altman’s ouster and return, and as leaks and rumors ran wild, many speculated that it represented true AI reasoning and could usher in AGI, or even, eventually, AI superintelligence.
That project was released in September as o1, a model trained with reinforcement learning. In an article titled Learning to Reason with LLMs, OpenAI cites the accolades: it ranks in the 89th percentile on competitive programming questions, places among the top 500 students in a qualifier for the USA Math Olympiad, exceeds human PhD-level accuracy on benchmarks in a number of scientific areas, and so on.
It goes on to explain how o1 uses chain-of-thought reasoning, long a recommended technique for writing better prompts, allowing the model to “think” before answering.
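To make the idea concrete, here is a minimal sketch of chain-of-thought prompting using the OpenAI Python SDK. The model name, the sample question, and the exact instruction wording are my own illustrative assumptions; o1 differs in that this kind of stepwise thinking is trained into the model rather than requested in the prompt.

```python
# A minimal sketch of chain-of-thought prompting (illustrative only).
# Assumes the openai package is installed and OPENAI_API_KEY is set;
# the model name and question are placeholders, not OpenAI's own examples.
from openai import OpenAI

client = OpenAI()

question = ("A bakery sold 14 loaves on Monday and twice as many on Tuesday. "
            "How many loaves did it sell in total?")

# Plain prompt: ask for the answer directly.
plain = [{"role": "user", "content": question}]

# Chain-of-thought prompt: ask the model to lay out intermediate steps
# before committing to a final answer.
chain_of_thought = [{
    "role": "user",
    "content": question + " Think through the problem step by step, "
                          "then give the final answer on its own line.",
}]

for messages in (plain, chain_of_thought):
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(response.choices[0].message.content)
    print("---")
```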
At this time, only o1-preview and o1-mini have been released, and on the LMSYS Chatbot Arena leaderboards these models still trail their older sibling, GPT-4o. To be fair, though, they are all at the top.
Google has its own form of so-called reasoning AI, built to compete in math competitions like the International Mathematical Olympiad, and I want to consider these developments in light of a new study conducted by researchers at Apple.
The results were surprising, but maybe shouldn’t have been: even simple changes to the way questions are asked can cause significant swings in performance.
Name Changes, Red Herrings, and Performance
The new study by six Apple researchers, titled GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, is available on arXiv (and is well profiled by Ars Technica in Apple study exposes deep cracks in LLMs’ “reasoning” capabilities and by TechCrunch in Researchers question AI’s ‘reasoning’ ability as models stumble on math problems with trivial changes). It takes a deeper dive into these much-touted developments, testing 20-plus leading LLMs.
Starting from a broad dataset of math word problems, the researchers created their own variants, first changing names and numbers but leaving the questions otherwise intact. This would be the equivalent of changing a story problem from “Nick wrote 3 articles in September, and 2 in October” to “Pete wrote 12 articles in July, and 9 in August.”
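A rough sketch of how such a perturbation can be generated appears below. This is not the GSM-Symbolic code itself, just an illustration of the idea: the problem becomes a template, and the surface details (names, numbers, months) are resampled while the underlying arithmetic stays the same.

```python
import random

# Illustrative only: turn a story problem into a template and resample the
# surface details without touching the logic. The template, names, and
# value ranges here are invented for this example.
TEMPLATE = ("{name} wrote {a} articles in {month1}, and {b} in {month2}. "
            "How many articles did {name} write in total?")

NAMES = ["Nick", "Pete", "Maria", "Chen"]
MONTHS = ["July", "August", "September", "October"]

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    a, b = rng.randint(2, 15), rng.randint(2, 15)
    month1, month2 = rng.sample(MONTHS, 2)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b,
                               month1=month1, month2=month2)
    return question, a + b  # the ground-truth answer tracks the new numbers

for seed in range(3):
    question, answer = make_variant(seed)
    print(question, "->", answer)
```

A model that truly reasons should be indifferent to which names and numbers it is handed; that is precisely what the study puts to the test.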
You would not expect such modifications to have any impact, and yet they did: accuracy dropped by anywhere from 0.3% to 9.2% compared to the originals.
Surprising, but still, in the big picture, a minor blip.
More concerning is the high variability in results when the same models were tested multiple times with these kinds of trivial changes. In some cases, gaps of up to 15% in accuracy were recorded for the same model, and the variation was worse for numerical changes than for textual ones.
The researchers found this variation highly surprising, and it calls into question the true nature of the “reasoning” these models are doing. Whatever steps occur along the way, could such results really be indicative of logic, or do they point to a more elaborate form of prediction and autocomplete?
Things got far worse when the researchers went further, adding to their story problems the kinds of irrelevant details that often trip up grade-school children first learning math: mentioning, say, the colors of racing trains or the passengers inside them, or, in a fruit-counting problem, the relative size of some of the fruit.
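For illustration, a toy version of such a modification might look like the sketch below. The wording and numbers are my own, not an item from the paper, but they capture the pattern: a clause is appended that mentions attributes with no bearing on the answer.

```python
# Illustrative only: append an irrelevant detail to an otherwise simple problem.
facts = "Maria picks 44 apples on Friday and 58 apples on Saturday."
red_herring = "Five of the apples she picked on Saturday were a bit smaller than average."
question = "How many apples does she have?"

original = f"{facts} {question}"
perturbed = f"{facts} {red_herring} {question}"

print(original)
print(perturbed)
# The answer to both is 44 + 58 = 102; the size detail changes nothing.
# The Apple study reports that models frequently try to "use" such clauses,
# for example by subtracting the smaller fruit from the total.
```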
These so-called “red herring” modifications led to truly staggering performance drops, from 17.5% (o1-preview) all the way to 65.7% across the various models.
And while OpenAI’s reasoning model fared better than GPT-4o (which saw a 32% drop), it is still a stunning change when you consider these modifications are irrelevant to the actual problem.
The Implications
As mentioned above, this makes clear that AI logic still has some way to go. No matter the method used, whether incorporating chain-of-thought, checking responses against other models or additional resources, or having smaller models verify larger ones, these systems are still not really doing true “reasoning.”
Despite all the excitement and fear around Strawberry, what we have here does not appear to be on the scale necessary to truly transform AI capacities.
As Sam Altman himself posted, models like o1 are still flawed, and seem far more impressive at first than they prove to be with more extensive use.
For organizations, AI integration remains a fine line: being trained, ready, and proactive with AI’s demonstrated capabilities while guarding against hype, or even potential misrepresentation, as competitive forces drive companies into an arms race to keep up.
Human oversight, of course, is always cited here, and it is truly essential. While AI can work across dizzyingly large datasets, it makes mistakes, and it only appears to think or understand in human terms.
This means not only verifying AI-driven conclusions, but also creating our own benchmarks and continuing to measure new solutions against them as they emerge. Ethical oversight in AI adoption remains essential.
As always, I encourage starting with the things AI already does well, such as assisting with communications and working across vast datasets.
Challenges in AI Reasoning Development
Researchers are already hard at work exploring how to move beyond the current limitations as the field continues to pursue true AI reasoning.
Meta’s Chief AI Scientist Yann LeCun believes “human-level” AI may still be 10 years away, even as AI is already being harnessed in tools of all kinds. He has been vocal in suggesting that AI will need to reach the intelligence of cats or dogs before it approaches that of people, and that it is nowhere near even that level overall.
Self-driving cars, for example, demonstrate that despite all of AI’s capacity to automate tasks, it still struggles to pull together something an average 16-year-old learns to do in a matter of months.
LeCun stands apart from AI’s more optimistic voices who say we will soon have AGI. He recently said at the Hudson Forum that machines need memory, yes, but also an understanding of the world that they do not currently come close to. They need what we call common sense: a model of how the world actually behaves.
We use such world models in our own minds to evaluate sequences of actions, for example, to predict their outcomes or to understand what can be accomplished at all.
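As a loose illustration of the concept, the toy sketch below uses a hand-written “world model” to roll out candidate action sequences and keep the one with the best predicted outcome. The state, the actions, and the transition function are all invented for the example; this shows only how a predictive model of the world supports planning, not how Meta or anyone else would implement it.

```python
from itertools import product

# Toy "world model": predicts the next state (position, fuel) from the
# current state and an action. Entirely hand-written for illustration.
def predict_next(state, action):
    position, fuel = state
    if action == "move" and fuel > 0:
        return (position + 1, fuel - 1)
    if action == "refuel":
        return (position, fuel + 2)
    return (position, fuel)  # "wait", or a move with no fuel: nothing changes

def rollout(state, actions):
    """Simulate a sequence of actions in the model and return the end state."""
    for action in actions:
        state = predict_next(state, action)
    return state

start = (0, 1)  # at position 0 with 1 unit of fuel
candidates = product(["move", "refuel", "wait"], repeat=3)

# Planning with a world model: imagine each sequence, score the predicted
# outcome (here, distance travelled), and act on the best one.
best = max(candidates, key=lambda seq: rollout(start, seq)[0])
print(best, "->", rollout(start, best))  # ('move', 'refuel', 'move') -> (2, 1)
```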
This is very different from pattern matching and prediction, even with multimodality and the capacity to generate text and respond successfully to queries. Note that this view hasn’t diminished Meta’s own AI strategy; rather, it reflects a different outlook on where the technology might be a few years from now.
In addition to scaling up LLMs, next-generation AI reasoning will need the capacity to understand abstractions, such as symbols, as it works to grasp increasingly complex forms of meaning.
Conclusion
If you look at AI developments away from the hype, they are undeniably impressive. Consider the justly earned Nobel Prize in Chemistry recognizing what Google DeepMind’s AlphaFold has been able to accomplish, or the lab’s earlier successes mastering games like Go.
Conversational AI and image, audio, and video generation are truly staggering when you consider where we were just five years ago.
But that said, true AI reasoning is still some ways off.
As business leaders, we must navigate this extremely tricky landscape with care: avoiding the temptation to buy into hype while also staying on top of these incredible innovations as they emerge. By keeping a handle on both the promises and the shortcomings, we can continue to find the practical solutions that are already ushering in an era of highly fruitful human-machine collaboration.