Goedel-Architect Achieves 100% on MiniF2F-test, 88.8% on PutnamBench with Open-Weight Model

Researchers from Princeton and CMU have released Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 that achieves state-of-the-art performance on mathematical benchmarks while using open-weight models. Using DeepSeek-V4-Flash (284B-A13B) as its backbone, the system achieved 99.2% pass@1 on MiniF2F-test and 75.6% pass@1 on PutnamBench without guidance. With optional natural-language proof seeding, it reached 100% on MiniF2F-test and 88.8% (597/672) on PutnamBench.

Blueprint-Driven Approach Contrasts With Recursive Decomposition Methods

Goedel-Architect centers on blueprint generation and refinement rather than recursive lemma decomposition. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. The system first generates a blueprint of formally stated definitions and lemmas with declared dependencies, optionally guided by a natural language proof. A tool-equipped Lean prover component then closes each open lemma node in parallel using relevant dependencies. Failed lemmas drive refinement of the global blueprint, avoiding inefficient loops on dead-end strategies that plague recursive approaches.

System Achieves Superhuman Performance on Competition Mathematics

Beyond benchmark performance, Goedel-Architect solved:

4/6 problems on IMO 2025
11/12 problems on Putnam 2025
3/6 problems on USAMO 2026

The paper, published on arXiv on June 4, 2026 (arXiv:2606.06468), represents a significant advance in automated theorem proving. Authors include Jui-Hui Chung, Ziyang Cai, Zihao Li, Qishuo Yin, Rohit Agarwal, Simon Park, Rodrigo Porto, Narutatsu Ri, Ziran Yang, Shange Tang, Xingyu Dang, Hongzhou Lin, Mengdi Wang, Danqi Chen, Chi Jin, Liam H Fowl, and Sanjeev Arora.

Cost-Effective Performance Compared to Existing Solutions

The research team emphasizes that Goedel-Architect delivers state-of-the-art performance at a price point up to 500x less than comparable open-source pipelines. This efficiency gain comes from both the blueprint-driven architecture and the use of the smaller DeepSeek-V4-Flash model. The approach builds on the Goedel-Prover family of related work, which has been making significant advances in automated theorem proving. Goedel-Prover-V2-32B previously achieved 88.1% pass@32 on MiniF2F, exceeding DeepSeek-Prover-V2-671B's 82.4% while using far fewer parameters.

Key Takeaways

Goedel-Architect achieved 100% on MiniF2F-test and 88.8% (597/672) on PutnamBench using the open-weight DeepSeek-V4-Flash model
The system uses blueprint generation and parallel lemma proving instead of recursive decomposition, avoiding dead-end strategy loops
Performance on competition mathematics includes 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026
The framework operates at up to 500x lower cost than comparable open-source theorem proving pipelines
Results represent state-of-the-art performance for LLM-based automated theorem proving systems

Blueprint-Driven Approach Contrasts With Recursive Decomposition Methods

System Achieves Superhuman Performance on Competition Mathematics

Beyond benchmark performance, Goedel-Architect solved:

4/6 problems on IMO 2025

11/12 problems on Putnam 2025

3/6 problems on USAMO 2026

Cost-Effective Performance Compared to Existing Solutions

Key Takeaways

Goedel-Architect achieved 100% on MiniF2F-test and 88.8% (597/672) on PutnamBench using the open-weight DeepSeek-V4-Flash model

The system uses blueprint generation and parallel lemma proving instead of recursive decomposition, avoiding dead-end strategy loops

Performance on competition mathematics includes 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026

The framework operates at up to 500x lower cost than comparable open-source theorem proving pipelines

Results represent state-of-the-art performance for LLM-based automated theorem proving systems