University of Arizona researchers have found that large language models using “chain of thought” reasoning fall fundamentally short at logical inference, functioning more like “sophisticated simulators of reasoning-like text” than true reasoners. The study shows that these AI systems, which the industry increasingly relies on for complex problem-solving, fail catastrophically when asked to generalize beyond their training data, producing what the researchers call “fluent nonsense”: output that carries a deceptively convincing appearance of logical thinking.
The big picture: The research challenges the AI industry’s growing confidence in reasoning models, demonstrating that apparent performance gains are “largely a brittle mirage” that breaks down under even moderate departures from familiar patterns.
How they tested it: Researchers created DataAlchemy, a controlled environment that trained small models on simple text transformations like ROT ciphers (which shift letters by a fixed number) and cyclical shifts, then tested their ability to generalize to novel combinations of those tasks (a minimal sketch of the setup follows this list).
- Models were evaluated on tasks that either matched training patterns or required “out of domain” reasoning not directly demonstrated in training data.
- Outputs were scored objectively for accuracy using BLEU scores and Levenshtein distance.
- Test inputs varied in length, format, and complexity relative to the training examples.
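To make the setup concrete, here is a minimal Python sketch of the kind of transformations and string-based scoring the bullets above describe. Everything here is an illustrative assumption rather than code from DataAlchemy: the function names, shift values, and example strings are invented, and Levenshtein distance stands in for the paper’s full metric suite (which also includes BLEU).

```python
# Illustrative sketch only: mimics the kinds of transformations and scoring
# described in the study; none of this is taken from the DataAlchemy code.
import string


def rot_cipher(text: str, shift: int = 13) -> str:
    """Shift each letter forward by a fixed amount, wrapping around the alphabet."""
    lower = string.ascii_lowercase
    table = str.maketrans(lower, lower[shift:] + lower[:shift])
    return text.lower().translate(table)


def cyclic_shift(text: str, offset: int = 3) -> str:
    """Rotate the characters of the string left by a fixed offset."""
    offset %= max(len(text), 1)
    return text[offset:] + text[:offset]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a model's output and the ground-truth string."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


# A composed task of the sort a model might be trained on: ROT-13, then a shift.
source = "reasoning"
target = cyclic_shift(rot_cipher(source))
model_output = "ernfbavtn"  # hypothetical model response
print(levenshtein(model_output, target))  # 0 would mean an exact match
```

Because the target transformation is fully specified, scoring can be purely mechanical, which is presumably why the authors rely on string metrics rather than human judgment.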
Key findings: The models consistently failed when pushed beyond their training distribution, revealing fundamental limitations in their reasoning capabilities.
- Models often produced “correct reasoning paths, yet incorrect answers” or stumbled onto right answers with “unfaithful reasoning paths.”
- When input strings were shorter or longer than the training examples, performance “deteriorates as the discrepancy increases.”
- Small format changes, such as introducing unfamiliar letters or symbols, caused performance to “degrade sharply” (see the sketch below).
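As a rough illustration of the distribution shifts these findings describe (again, not code from the paper), the sketch below generates probe inputs that depart from an assumed training regime along two of the axes mentioned: input length and symbol format. The training length and alphabet are invented parameters.

```python
# Hypothetical probe generator: the training length and alphabet below are
# assumptions for illustration, not parameters reported in the study.
import random
import string

TRAIN_LENGTH = 9                          # assumed length of training inputs
TRAIN_ALPHABET = string.ascii_lowercase   # assumed training vocabulary


def probe(length: int, alphabet: str = TRAIN_ALPHABET) -> str:
    """Random input string; changing length or alphabet moves it off-distribution."""
    return "".join(random.choice(alphabet) for _ in range(length))


in_domain = probe(TRAIN_LENGTH)                             # matches training patterns
longer_input = probe(TRAIN_LENGTH + 6)                      # length discrepancy
odd_symbols = probe(TRAIN_LENGTH, TRAIN_ALPHABET + "@#%")   # unfamiliar symbols
```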
What the researchers discovered: Chain-of-thought models operate through a “sophisticated form of structured pattern matching” rather than genuine logical inference.
- The ability to generate “fluent nonsense” creates “a false aura of dependability” that doesn’t withstand careful scrutiny.
- Supervised fine-tuning can improve out-of-domain performance but represents an “unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.”
Why this matters: The findings have serious implications for high-stakes applications where logical accuracy is crucial.
- Researchers warn against “equating chain-of-thought-style output with human thinking,” especially in “high-stakes domains like medicine, finance, or legal analysis.”
- Current AI benchmarks may be inadequate for detecting these reasoning failures because they don’t sufficiently test generalization beyond training data.
What they’re saying: The research team emphasizes that apparent reasoning capabilities are actually sophisticated pattern recognition masquerading as logical thought.
- “Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training,” the researchers write.
- Future models will need to move beyond “surface-level pattern recognition to exhibit deeper inferential competence.”