Google DeepMind's AI Co-Mathematician Achieves 48% on FrontierMath Tier 4, Helps Solve 60-Year-Old Problem

Google DeepMind has released AI Co-Mathematician, a multi-agent workbench that collaborates with human researchers on open-ended mathematical problems. The system achieved 48% on FrontierMath Tier 4, the highest score among all AI systems evaluated on what Epoch AI describes as exceptionally difficult problems. The tool has already contributed to solving a 60-year-old open problem from the Kourovka Notebook in group theory.

Multi-Agent System Manages Mathematical Research Workstreams

Built on Gemini models, AI Co-Mathematician provides an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts. The system generates LaTeX write-ups complete with margin annotations and provenance notes.

A hierarchy of agents, coordinated by a top-level project coordinator, works in parallel across multiple research workstreams. The system supports the ideation, literature search, computational exploration, theorem proving, and theory building stages of mathematical work.
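To make the coordinator pattern concrete, here is a minimal Python sketch of one top-level coordinator fanning work out to parallel workstreams and collecting the results. It is an illustration only, not DeepMind's implementation: the stage names come from the description above, while `Workstream`, `run_agent`, `coordinate`, and `run_project` are hypothetical names, and the agent body is a placeholder that calls no model.

```python
import asyncio
from dataclasses import dataclass, field

# Stage names are taken from the article; everything else here is hypothetical.
STAGES = [
    "ideation",
    "literature search",
    "computational exploration",
    "theorem proving",
    "theory building",
]


@dataclass
class Workstream:
    """State kept for one research workstream (illustrative fields only)."""
    stage: str
    failed_hypotheses: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


async def run_agent(ws: Workstream) -> Workstream:
    """Placeholder sub-agent; a real agent would call a model or a tool here."""
    await asyncio.sleep(0)  # yield control so the workstreams can interleave
    ws.notes.append(f"{ws.stage}: draft result")
    return ws


async def coordinate(stages: list[str]) -> list[Workstream]:
    """Top-level coordinator: start one agent per workstream, then gather results."""
    streams = [Workstream(stage) for stage in stages]
    # asyncio.gather returns results in the same order as its arguments.
    return await asyncio.gather(*(run_agent(ws) for ws in streams))


def run_project() -> list[Workstream]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(coordinate(STAGES))
```

Calling `run_project()` returns one `Workstream` per stage, in the order listed in `STAGES`. The `failed_hypotheses` field is left empty here; it only marks where a stateful system of this kind might record dead ends.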

Oxford Mathematician Uses System to Resolve Kourovka Notebook Problem

Marc Lackenby at Oxford University used the tool to resolve Problem 21.10 from the Kourovka Notebook in group theory. A reviewer agent within the system spotted a flaw in the approach, which led Lackenby to realize how to fill the gap and complete the solution to the decades-old problem.

In early testing with a select group of mathematicians, the tool aided researchers in solving open problems, discovering new research directions, and surfacing previously overlooked academic references. The system is currently in limited internal testing.

New Benchmark Performance Sets Record on Difficult Mathematical Problems

The 48% score on FrontierMath Tier 4 is the highest recorded for any AI system evaluated on the benchmark, which Epoch AI describes as a set of exceptionally difficult problems.

The research was published on arXiv on May 7, 2026, in a paper titled "AI Co-Mathematician: Accelerating Mathematicians with Agentic AI" by Daniel Zheng, Ingrid von Glehn, and colleagues at Google DeepMind.

Key Takeaways

Google DeepMind's AI Co-Mathematician achieved 48% on FrontierMath Tier 4, the highest score among all evaluated AI systems on this exceptionally difficult benchmark
The multi-agent system helped Oxford mathematician Marc Lackenby solve Problem 21.10 from the Kourovka Notebook, a 60-year-old open problem in group theory
Built on Gemini models, the system manages multiple research workstreams including ideation, literature search, computational exploration, theorem proving, and theory building
The tool is currently in limited internal testing with a select group of mathematicians and outputs LaTeX write-ups with margin annotations and provenance notes
A hierarchy of agents coordinated by a project coordinator works in parallel, managing uncertainty and tracking failed hypotheses throughout the research process
