University of Colorado Boulder researchers tested five AI models on 2,300 simple Sudoku puzzles and found significant gaps in both problem-solving ability and trustworthiness. Even an advanced model like ChatGPT’s o1 solved only 65% of six-by-six puzzles correctly, and the models’ explanations frequently contained fabricated facts or bizarre responses, including one AI that offered an unprompted weather forecast when asked about Sudoku.
What you should know: The research focused less on puzzle-solving ability and more on understanding how AI systems think and explain their reasoning.
- ChatGPT’s o1 model performed best at solving puzzles but was particularly poor at explaining its methodology, using wrong terminology and failing to justify its moves.
- Other AI models were deemed “not currently capable” of solving even simplified six-by-six Sudoku puzzles.
- When asked to explain their reasoning, AI models frequently hallucinated facts, claiming constraints that didn’t actually exist in the puzzles (the sketch after this list shows what a genuine constraint check involves).
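For readers unfamiliar with the rules, a correct Sudoku explanation reduces to a constraint check: a digit is blocked from a cell only if it already appears in the same row, column, or box. The Python sketch below is a hypothetical illustration of that check for a six-by-six grid with 2x3 boxes; it is not code from the study, and the function name and sample grid are invented for this example.

```python
# Hypothetical illustration (not from the CU Boulder study): a faithful
# Sudoku "explanation" is just a constraint check. A 6x6 puzzle uses
# digits 1-6 and boxes that are 2 rows tall and 3 columns wide.

def violated_constraints(grid, row, col, digit):
    """Return human-readable reasons why `digit` cannot go at (row, col).

    `grid` is a 6x6 list of lists, with 0 marking empty cells.
    An empty result means the placement is legal.
    """
    reasons = []
    if digit in grid[row]:
        reasons.append(f"{digit} already appears in row {row}")
    if digit in (grid[r][col] for r in range(6)):
        reasons.append(f"{digit} already appears in column {col}")
    # Locate the top-left corner of the 2x3 box containing (row, col).
    top, left = (row // 2) * 2, (col // 3) * 3
    box = [grid[r][c] for r in range(top, top + 2) for c in range(left, left + 3)]
    if digit in box:
        reasons.append(f"{digit} already appears in its 2x3 box")
    return reasons

grid = [
    [1, 0, 0, 4, 0, 0],
    [0, 0, 4, 0, 0, 2],
    [0, 3, 0, 0, 6, 0],
    [0, 6, 0, 0, 1, 0],
    [4, 0, 0, 2, 0, 0],
    [0, 0, 2, 0, 0, 5],
]
# Placing a 4 at row 0, column 1 conflicts with the row and the box:
print(violated_constraints(grid, 0, 1, 4))
```

The hallucinations the researchers describe are claims like "there is already a two in this row" when no such check would actually fire on the grid in question.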
Why this matters: The findings highlight critical trust issues that must be resolved before AI can become a reliable partner in human decision-making processes.
- Only 41% of people currently trust AI technology, according to global consulting firm KPMG, even though 78% of organizations use AI in at least one business function.
- The World Economic Forum identifies trust as a key factor that will shape outcomes in the AI-powered economy.
What they’re saying: Researchers emphasized the broader implications of AI’s reasoning failures.
- “Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, study co-author and associate professor of computer science at CU Boulder. “So it might say, ‘There cannot be a two here because there’s already a two in the same row,’ but that wasn’t the case.”
- “At that point, the AI had gone berserk and was completely confused,” explained study co-author Fabio Somenzi when describing the weather forecast incident.
- “If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote,” Somenzi added.
The big picture: The study underscores that while AI can perform complex tasks like coding websites and summarizing meetings, its reasoning processes remain opaque and unreliable.
- The hallucinations and glitches “underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making,” according to the researchers.
- Understanding how AI systems think could ultimately improve public trust and help produce more reliable outputs across applications, from computer code to financial services.