30 mathematicians met in secret to stump OpenAI. They (mostly) failed.

Thirty of the world’s leading mathematicians convened at a secret meeting in Berkeley to test OpenAI’s o4-mini reasoning model against professor-level mathematical problems. The AI stunned researchers by solving complex questions that would challenge even academic mathematicians, with some experts describing its capabilities as approaching “mathematical genius.”

What you should know: The clandestine mathematical gathering took place over two days in mid-May, with participants signing nondisclosure agreements and communicating only through Signal to prevent AI contamination of their test questions.

  • Mathematicians competed to devise problems they could solve themselves but that would trip up the AI, with a $7,500 reward for each question that stumped o4-mini.
  • Despite their efforts, the group managed to find only 10 such questions by the meeting’s end.
  • Participants had to maintain strict security protocols because traditional communication methods could potentially be scanned by large language models and inadvertently train them.

The big picture: The meeting marks a pivotal moment for AI’s mathematical capabilities, demonstrating how far reasoning models have evolved beyond traditional large language models, which could solve fewer than 2 percent of novel mathematical problems.

  • The test was part of FrontierMath, a benchmarking project from Epoch AI, a nonprofit that evaluates AI models; the project collected 300 unpublished mathematical questions spanning four tiers of difficulty.
  • By February 2025, o4-mini could solve around 20 percent of questions, including some at the challenging fourth tier designed for academic mathematicians.
  • Traditional large language models showed they lacked genuine reasoning ability when faced with problems dissimilar to their training data.

How it works: OpenAI’s o4-mini is a reasoning large language model, trained on specialized datasets with stronger reinforcement from human feedback, which makes it more nimble on hard problems than traditional large language models (a minimal sketch of querying it appears after this list).

  • The model can dive much deeper into complex mathematical problems by making highly intricate deductions.
  • Google’s equivalent, Gemini 2.5 Flash, has similar reasoning capabilities.
  • These models are lighter-weight but more specialized than earlier versions of ChatGPT.
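For readers who want to experiment, here is a minimal sketch of posing a number theory question to o4-mini through OpenAI’s Python SDK. The prompt and the `reasoning_effort` setting are our illustrative choices, and nothing here reproduces the meeting’s actual (still-confidential) test questions.

```python
# Minimal sketch: posing a math problem to o4-mini via OpenAI's Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY in the environment.
# The prompt and settings below are illustrative, not from the meeting.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",  # o-series knob: let the model "think" longer
    messages=[
        {
            "role": "user",
            "content": (
                "Prove or disprove: every prime p > 3 satisfies "
                "p^2 ≡ 1 (mod 24). Show your reasoning."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```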

What they’re saying: The AI’s performance left mathematicians both impressed and concerned about its implications for their field.

  • “I have colleagues who literally said these models are approaching mathematical genius,” says Ken Ono, a mathematician at the University of Virginia who helped lead the meeting.
  • “I was not prepared to be contending with an LLM like this. I’ve never seen that kind of reasoning before in models. That’s what a scientist does. That’s frightening,” Ono added after watching the AI solve a Ph.D.-level number theory problem in real time.
  • Yang-Hui He of the London Institute for Mathematical Sciences noted: “This is what a very, very good graduate student would be doing—in fact, more.”

Key details: The AI’s problem-solving approach mirrors human mathematical reasoning, complete with strategic thinking and even personality.

  • When Ono presented a challenging number theory problem, o4-mini spent its first two minutes absorbing the related literature, then worked through a simpler version of the problem before tackling the full one (the toy sketch after this list mirrors that strategy).
  • The bot completed the full solution in about 10 minutes, ending with a “cheeky” note: “No citation necessary because the mystery number was computed by me!”
  • The AI works significantly faster than human mathematicians, completing in minutes what would take experts weeks or months.
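The “simpler version first” behavior Ono describes is a standard mathematician’s workflow: test small cases, then argue the general pattern. The toy sketch below mirrors that workflow on an elementary and unrelated number theory fact; Ono’s actual problem remains unpublished.

```python
# Toy illustration of the "simpler version first" strategy described above.
# This is NOT Ono's actual (unpublished) problem; it just mirrors the
# workflow: check small cases of a claim before trusting the general pattern.
# Claim: every prime p > 3 satisfies p^2 ≡ 1 (mod 24).

def is_prime(n: int) -> bool:
    """Trial division; fine for the small cases we check here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Step 1 (the "simpler version"): verify the claim on small primes.
small_primes = [p for p in range(5, 200) if is_prime(p)]
assert all(p * p % 24 == 1 for p in small_primes)

# Step 2 (the general argument): any prime p > 3 is coprime to 24, so its
# residue mod 24 lies in {1, 5, 7, 11, 13, 17, 19, 23}, and every residue
# in that set squares to 1 mod 24.
assert all(r * r % 24 == 1 for r in (1, 5, 7, 11, 13, 17, 19, 23))
print("Small-case check and general residue check both pass.")
```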

Why this matters: The rapid advancement of AI mathematical capabilities is forcing mathematicians to reconsider their future role in the field.

  • Discussions at the meeting turned to a potential “tier five” of questions that even the best human mathematicians couldn’t solve, which AI might eventually tackle.
  • Mathematicians may need to shift toward posing questions and collaborating with AI reasoning bots, similar to how professors work with graduate students.
  • The development raises concerns about over-reliance on AI results, with experts warning about “proof by intimidation” where AI’s confident presentation might be trusted without proper verification.

Looking ahead: The meeting highlighted the need for educational reforms to maintain human relevance in mathematics.

  • Ono predicts that nurturing creativity in higher education will be crucial for keeping mathematics viable for future generations.
  • “I’ve been telling my colleagues that it’s a grave mistake to say that generalized artificial intelligence will never come, [that] it’s just a computer,” Ono warns, noting that large language models are “already outperforming most of our best graduate students in the world.”
Source: Inside the Secret Meeting Where Mathematicians Struggled to Outsmart AI
