Google has published an interesting paper on LEAP, a system designed for automated theorem proving. It allows language models to construct formal, machine-verifiable proofs in the Lean language.
A proof in natural language is difficult to verify automatically—it almost always contains logical gaps. A formal proof is written in a machine language and checked by a compiler, which guarantees correctness, but writing it is much more difficult.
In this field, specialized models trained specifically for Lean typically take the lead. LEAP, operating as an agentic framework, uses general models by breaking the problem into parts and recursively correcting errors based on compiler feedback.
Success at the Putnam 2025 Competition
The project's main research victory is the Putnam 2025 competition, an annual mathematics competition for college students in the US.
LEAP formally solved all 12 problems, whereas neither Gemini 3.1 Pro on its own nor the open-source specialized Goedel-Prover-V2 solved a single one. The authors evaluated the closed-source Aristotle system, which achieved a gold medal-level result at the IMO 2025 Math Olympiad, at 9 out of 12 solved problems in their own tests.
IMO-LeanProofBench and Other Achievements
The paper also introduces IMO-LeanProofBench, a dataset of 60 formalized Olympiad-level problems in Lean that require highly unconventional deductions and structurally complex proofs. On this benchmark, LEAP achieves an average of around 70% across the basic and advanced sets, compared to approximately 48% for Aristotle.
An additional achievement: LEAP formally verified the auxiliary part of a combinatorial problem tracing back to mathematician Donald Knuth, generating over 5000 lines of code in Lean 4.
The Irony of Closed Source
It is worth noting a certain irony: in the article, the authors criticize closed competitor systems (Axiom3, Numina, Aristotle) because they are unavailable for scientific verification. However, the code for LEAP itself is also unpublished. Opening up the final proofs partially mitigates this (they can be rechecked by the Lean compiler), but full reproducibility of Google's project is not yet possible.