The intersection of multi-agent systems and automated theorem proving has emerged as one of the most transformative developments in artificial intelligence and formal mathematics during 2024-2025. This convergence has produced systems capable of solving International Mathematical Olympiad problems at gold-medal levels, revolutionizing both pure mathematics research and software verification practices. The integration of large language models with formal proof assistants, combined with sophisticated multi-agent coordination strategies, represents a paradigm shift in how machines approach mathematical reasoning and formal verification.
The past two years have witnessed remarkable advances in leveraging large language models for theorem proving. LeanDojo, released in 2023 and widely adopted in 2024, established foundational infrastructure: an open-source playground of toolkits, data, models, and benchmarks extracted from the Lean proof assistant. The accompanying ReProver system introduced retrieval-augmented theorem proving, pairing a language model with a premise-selection mechanism to navigate vast mathematical libraries. Evaluated on 98,734 theorems from Lean's mathematical library (mathlib), ReProver outperformed non-retrieval baselines and GPT-4 while requiring only one GPU-week of training, and it discovered 33 new proofs in miniF2F and 39 in ProofNet.
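To make the retrieval-augmented setup concrete, the sketch below shows the core premise-selection step under simplifying assumptions: `embed` stands in for a learned encoder and `PremiseIndex` for a dense index over library lemmas; neither name comes from LeanDojo's actual API, and in a real system the retrieved premises would be concatenated into the prompt of a tactic-generation model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a learned encoder (e.g., a fine-tuned transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

class PremiseIndex:
    """Dense index over library lemmas; retrieval by cosine similarity."""
    def __init__(self, premises: list[str]):
        self.premises = premises
        self.vectors = np.stack([embed(p) for p in premises])

    def retrieve(self, goal: str, k: int = 3) -> list[str]:
        scores = self.vectors @ embed(goal)   # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [self.premises[i] for i in top]

index = PremiseIndex([
    "Nat.add_comm : a + b = b + a",
    "Nat.mul_comm : a * b = b * a",
    "List.length_append : (l1 ++ l2).length = l1.length + l2.length",
])
goal = "⊢ x + y = y + x"
prompt = f"Goal: {goal}\nRelevant premises:\n" + "\n".join(index.retrieve(goal, k=2))
# `prompt` would be passed to the tactic-generation model for the next proof step.
print(prompt)
```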
Building on this foundation, Lean Copilot transformed theorem-proving workflows in 2024 by letting large language models act as assistants directly inside the Lean environment. When assisting human mathematicians, the system requires only 2.08 manually entered proof steps on average, compared with 3.86 for rule-based automation, and in fully automated mode it completes 74.2% of proof steps, roughly an 85% relative improvement over the rule-based baseline. This human-AI collaboration model exemplifies how multi-agent architectures can augment rather than replace mathematical expertise.
DeepSeek-Prover, published in May 2024, demonstrated that large-scale synthetic training data significantly enhances formal theorem-proving capabilities. The team fine-tuned DeepSeekMath 7B on a synthetic dataset of 8 million formal statements derived from high-school and undergraduate competition problems, reaching 46.3% accuracy with 64 samples on the Lean 4 miniF2F-test benchmark and 52% cumulative accuracy, substantially outperforming GPT-4's 23.0% baseline. Critically, the system proved 5 of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, where GPT-4 proved none.
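The sampling-based figures quoted here ("with 64 samples", plus a cumulative accuracy that unions theorems solved across all attempts) follow a simple recipe: draw k whole-proof candidates per statement and count a theorem as solved if any candidate verifies. A minimal sketch, with `generate_proof` and `lean_verifies` as hypothetical stand-ins for the prover model and the Lean checker:

```python
import random

def generate_proof(statement: str) -> str:
    """Placeholder for sampling one whole-proof candidate from a prover model."""
    return random.choice(["by simp", "by ring", "by omega", "sorry"])

def lean_verifies(statement: str, proof: str) -> bool:
    """Placeholder for checking the candidate with the Lean compiler."""
    return proof != "sorry" and random.random() < 0.1

def pass_at_k(statements: list[str], k: int = 64) -> float:
    """A theorem counts as solved if any of its k sampled proofs verifies."""
    solved = 0
    for s in statements:
        if any(lean_verifies(s, generate_proof(s)) for _ in range(k)):
            solved += 1
    return solved / len(statements)

benchmark = [f"theorem t{i} : ..." for i in range(244)]  # miniF2F-test has 244 statements
print(f"pass@64 = {pass_at_k(benchmark, k=64):.1%}")
```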
The subsequent DeepSeek-Prover-V2, released in 2025, introduced a cold-start training procedure that uses reinforcement learning for subgoal decomposition. By prompting DeepSeek-V3 to break complex problems into subgoals and synthesizing the resulting proofs into chain-of-thought reasoning that integrates informal and formal mathematics, the 671-billion-parameter system achieved an 88.9% pass rate on miniF2F-test and solved 49 of 658 PutnamBench problems. These results show how multi-agent approaches to proof search, with specialized agents handling subgoal generation, proof synthesis, and verification, can substantially narrow the gap between formal and informal mathematical reasoning.
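A rough sketch of the subgoal-decomposition idea follows; `propose_subgoals` and `prove_formally` are hypothetical stand-ins for the general reasoning model and the specialized formal prover, not DeepSeek's actual interfaces.

```python
def propose_subgoals(theorem: str) -> list[str]:
    """Placeholder for a general reasoning model decomposing a statement
    into intermediate lemmas."""
    return [f"lemma step{i} : ...  -- part {i} of {theorem}" for i in range(3)]

def prove_formally(statement: str) -> str | None:
    """Placeholder for the specialized prover; returns a Lean proof or None."""
    return f"by simp  -- proof of {statement!r}"

def prove_by_decomposition(theorem: str) -> str | None:
    """Prove each subgoal, then chain the lemmas into a proof of the original."""
    proved = []
    for lemma in propose_subgoals(theorem):
        proof = prove_formally(lemma)
        if proof is None:
            return None               # fall back to direct proof search
        proved.append((lemma, proof))
    # In the real pipeline, verified lemmas are stitched into a single
    # chain-of-thought proof of the original statement and rechecked by Lean.
    return "\n".join(f"have {l} := {p}" for l, p in proved) + f"\n-- conclude {theorem}"

print(prove_by_decomposition("theorem example : ..."))
```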
Monte Carlo Tree Search (MCTS) combined with reinforcement learning emerged as a dominant paradigm for multi-agent proof search in 2024-2025. DeepSeek-Prover-V1.5 integrated MCTS with proof-assistant feedback, enabling the model to learn problem-solving strategies autonomously from feedback in the formal environment. This approach treats theorem proving as a sequential decision-making problem in which multiple agents explore different proof paths, with reward signals based on verification success.
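A compact sketch of MCTS-driven proof search under these assumptions is shown below; `propose_tactics` stands in for the language-model policy and `apply_tactic` for the proof-assistant environment, with a reward of 1 only when the verifier reports the proof complete.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # proof state (remaining goals)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # accumulated verification reward

def propose_tactics(state):
    """Placeholder for a language-model policy suggesting candidate tactics."""
    return ["intro h", "simp", "ring", "apply le_trans"]

def apply_tactic(state, tactic):
    """Placeholder for the proof assistant: returns (new_state, proof_finished)."""
    return state + [tactic], random.random() < 0.05

def select(node):
    """UCT selection: balance exploitation (value) and exploration (visit counts)."""
    return max(node.children,
               key=lambda c: c.value / (c.visits + 1e-9)
                             + 1.4 * math.sqrt(math.log(node.visits + 1) / (c.visits + 1e-9)))

def search(root_state, iterations=200):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                        # 1. selection
            node = select(node)
        for tactic in propose_tactics(node.state):  # 2. expansion via LM proposals
            child_state, done = apply_tactic(node.state, tactic)
            child = Node(child_state, parent=node)
            node.children.append(child)
            reward = 1.0 if done else 0.0           # 3. reward from the verifier
            while child:                            # 4. backpropagation
                child.visits += 1
                child.value += reward
                child = child.parent
    return root

search(root_state=[])
```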
The FGeo-DRL system applied these techniques to geometric deductive reasoning, using MCTS to generate chains of search trajectories with rewards based on whether the final problem is solved. Recent work has also shown that process supervision via learned process reward models is more effective than outcome-only supervision: models such as ReST-MCTS* use MCTS to annotate step-level rewards for training process reward models that subsequently guide the search. This creates a virtuous cycle in which multiple specialized agents learn to coordinate their search strategies through repeated interaction with formal verification environments.
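The step-level annotation idea can be sketched as follows: the value of a partial proof is estimated from the fraction of continued search rollouts that end in a verified proof, and those estimates become training labels for a process reward model. `rollout_succeeds` is a hypothetical stand-in for the search-and-verify loop.

```python
import random

def rollout_succeeds(partial_proof: list[str]) -> bool:
    """Placeholder: continue the search from this partial proof and report
    whether the verifier eventually accepts a completed proof."""
    return random.random() < 0.3

def annotate_step_rewards(proof_steps: list[str], rollouts: int = 16) -> list[float]:
    """Label each prefix of the proof with the fraction of rollouts that reach
    a verified proof -- the training signal for a process reward model."""
    labels = []
    for i in range(1, len(proof_steps) + 1):
        prefix = proof_steps[:i]
        successes = sum(rollout_succeeds(prefix) for _ in range(rollouts))
        labels.append(successes / rollouts)
    return labels

steps = ["intro n", "induction n", "simp", "omega"]
print(annotate_step_rewards(steps))
# The (prefix, label) pairs train a reward model that later scores partial
# proofs during search, replacing sparse outcome-only rewards.
```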
The 2024-2025 period witnessed unprecedented AI achievements on mathematical competition benchmarks. At IMO 2024, Google DeepMind's AlphaProof and AlphaGeometry 2 achieved silver-medal standard, solving four of six problems for 28 points. AlphaProof couples a pre-trained Gemini language model with the AlphaZero reinforcement-learning algorithm and was trained on approximately one million problems of varying difficulty. The system employs a formalizer network to translate natural-language problems into formal statements, creating a reinforcement learning loop in which agents generate candidate solutions, prove or disprove them, and reinforce successful strategies.
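AlphaProof's implementation is not public, but the loop just described can be sketched schematically; `formalize` and `attempt` below are purely illustrative placeholders, not DeepMind's components.

```python
def formalize(nl_problem: str) -> str:
    """Placeholder for the formalizer network: natural language -> formal statement."""
    return f"theorem auto : {nl_problem} := by sorry"

def attempt(statement: str, negate: bool = False) -> str | None:
    """Placeholder for the prover agent trying the statement (or its negation)."""
    return None   # a real system returns a verified proof term or None

def training_loop(nl_problems: list[str]):
    solved = []
    for nl in nl_problems:
        formal = formalize(nl)
        proof = attempt(formal) or attempt(formal, negate=True)
        if proof is not None:
            solved.append((formal, proof))   # verified solutions become new training data
    # reinforce: fine-tune the prover on `solved`, then repeat on harder problems
    return solved

print(training_loop(["For positive reals a, b: a + b >= 2 * sqrt(a * b)"]))
```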
IMO 2025 marked a watershed moment, with three independent systems achieving gold-medal performance. OpenAI's experimental system scored 35 of 42 points, solving five problems, while Google DeepMind's advanced Gemini with Deep Think reached the same score operating end-to-end in natural language. Most significantly for formal verification, two systems produced gold-medal formal solutions: Aristotle and ByteDance's Seed-Prover each fully solved 5 of 6 problems with proofs verified in Lean 4. Aristotle integrates three main multi-agent components: a Lean proof-search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver; its models scale to over 200 billion parameters. Seed-Prover, by contrast, generates whole proofs that are iteratively refined until Lean compiler verification succeeds, rather than following Aristotle's step-by-step tree search guided by informal reasoning.
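The whole-proof refinement style can be sketched as below, assuming a hypothetical proof-drafting model and a wrapper around the Lean compiler that returns error messages; this is an illustration of the general loop, not Seed-Prover's actual code.

```python
def generate_whole_proof(statement: str, feedback: str = "") -> str:
    """Placeholder for a model that drafts a complete Lean proof, optionally
    conditioned on compiler errors from the previous attempt."""
    return f"theorem goal : {statement} := by\n  simp\n  -- revised using: {feedback}"

def lean_compile(proof: str) -> tuple[bool, str]:
    """Placeholder for running the Lean compiler; returns (success, error_log)."""
    return False, "error: simp made no progress"

def refine_until_verified(statement: str, max_rounds: int = 8) -> str | None:
    """Whole-proof generation with iterative repair, as opposed to stepwise tree search."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate_whole_proof(statement, feedback)
        ok, errors = lean_compile(candidate)
        if ok:
            return candidate     # only compiler-verified proofs are accepted
        feedback = errors        # feed errors back into the next draft
    return None

print(refine_until_verified("∀ n : Nat, n + 0 = n"))
```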
The COPRA system exemplifies multi-agent coordination across heterogeneous proof assistants, leveraging cross-language learning across Coq, Lean, and Isabelle to increase dataset diversity. By repeatedly querying GPT-4 to propose tactic applications within a stateful backtracking search, COPRA works for both Lean and Coq and outperforms ReProver on pass@1 on the miniF2F benchmark. This cross-pollination recognizes that different proof assistants have complementary strengths; induction proofs, for example, are more common in Coq software verification than in Lean mathematics.
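A sketch of this stateful backtracking loop follows, with `llm_propose` and `step` as hypothetical stand-ins for the GPT-4 query and the proof-assistant interface (not COPRA's actual code); failed tactics are remembered per state so they are not retried.

```python
def llm_propose(proof_state: str, avoid: set[str]) -> list[str]:
    """Placeholder for querying GPT-4 with the current goals plus previously
    failed tactics, so the model does not repeat them."""
    return [t for t in ["intros", "simp", "apply add_comm", "ring"] if t not in avoid]

def step(proof_state: str, tactic: str):
    """Placeholder for the proof assistant (Lean or Coq): returns the new
    state, or None if the tactic fails, plus a completion flag."""
    return None, False

def backtracking_search(state: str, depth: int = 10) -> list[str] | None:
    if depth == 0:
        return None
    failed: set[str] = set()
    for tactic in llm_propose(state, failed):
        new_state, done = step(state, tactic)
        if new_state is None:
            failed.add(tactic)            # remember the failure for this state
            continue
        if done:
            return [tactic]
        rest = backtracking_search(new_state, depth - 1)
        if rest is not None:
            return [tactic] + rest        # successful branch
    return None                           # exhausted: caller backtracks further

print(backtracking_search("⊢ a + b = b + a"))
```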
SciAgent, a unified multi-agent system tested on IMO 2025 and the International Mathematics Competition (IMC) 2025, achieved a perfect score of 100 on IMC 2025, matching the best human performance (Grand First Prize). Its multi-agent reasoning mechanism coordinates internal mathematical subagents adaptively, with specialized agents handling different proof strategies, lemma generation, and verification tasks. This modular architecture shows how task decomposition among specialized agents can reach top human-level performance on complex mathematical reasoning.
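SciAgent's internals have not been released in detail; the following is only a schematic illustration of how a coordinator might route a problem among specialized subagents and gate any draft solution through a verifier.

```python
from dataclasses import dataclass

@dataclass
class Result:
    solved: bool
    artifact: str = ""

class Subagent:
    """Base class for specialized agents (proof strategies, lemma generation, checking)."""
    def handle(self, task: str) -> Result:
        raise NotImplementedError

class AlgebraAgent(Subagent):
    def handle(self, task: str) -> Result:
        return Result(solved="inequality" in task, artifact="algebraic proof sketch")

class GeometryAgent(Subagent):
    def handle(self, task: str) -> Result:
        return Result(solved="triangle" in task, artifact="synthetic geometry proof")

class VerifierAgent(Subagent):
    def handle(self, task: str) -> Result:
        return Result(solved=True, artifact="formal check passed")

class Coordinator:
    """Routes a problem to candidate solvers, then sends any draft to the verifier."""
    def __init__(self):
        self.solvers = [AlgebraAgent(), GeometryAgent()]
        self.verifier = VerifierAgent()

    def solve(self, problem: str) -> Result:
        for agent in self.solvers:
            draft = agent.handle(problem)
            if draft.solved and self.verifier.handle(draft.artifact).solved:
                return draft
        return Result(solved=False)

print(Coordinator().solve("Prove the inequality holds for all positive reals."))
```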
Beyond pure mathematics, multi-agent theorem proving systems have transformed software verification practices. AWS reported using automated theorem proving to reduce critical security bugs by over 70% and improve cloud system performance by 20% in 2024. Modern proof assistants like Lean, Coq, and Isabelle/HOL have become integral to cryptographic protocol verification, with Amazon's LNSym using Lean to model Arm instruction semantics and reason about system security properties.
The miniCodeProps benchmark, introduced in October 2024, provides 201 program specifications in Lean, with the goal of automatically generating proofs about the supplied programs. Evaluation revealed, however, that current neural theorem provers struggle to verify even relatively simple programs automatically, highlighting opportunities for multi-agent approaches in which specialized agents handle program analysis, specification extraction, and proof generation as coordinated subtasks.
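To illustrate what such a specification looks like, here is a toy Lean 4 example in the same spirit: a small recursive list program together with properties a prover would be asked to discharge. It is illustrative only and not drawn from miniCodeProps itself.

```lean
-- Illustrative only: a small program with correctness properties in the
-- style of miniCodeProps entries (the benchmark supplies the program and
-- the statement; the prover must produce the proof).
def myReverse {α : Type} : List α → List α
  | []      => []
  | x :: xs => myReverse xs ++ [x]

theorem myReverse_append {α : Type} (xs ys : List α) :
    myReverse (xs ++ ys) = myReverse ys ++ myReverse xs := by
  induction xs with
  | nil => simp [myReverse]
  | cons x xs ih => simp [myReverse, ih, List.append_assoc]

theorem myReverse_involutive {α : Type} (xs : List α) :
    myReverse (myReverse xs) = xs := by
  induction xs with
  | nil => simp [myReverse]
  | cons x xs ih => simp [myReverse, myReverse_append, ih]
```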
Research on verifying multi-agent systems themselves has also emerged: the Soda language supports compilation to both Scala and Lean, enabling multi-agent system implementations that integrate into mainstream software ecosystems while remaining formally verifiable. This recursive application, using multi-agent provers to verify multi-agent systems, represents a meta-level advance in ensuring the correctness of complex distributed systems.
The rapid progress in multi-agent automated theorem proving has profound implications for mathematics, software engineering, and AI safety. AlphaProof's methodology was published in Nature on November 12, 2025, following the initial July 2024 announcement, marking formal recognition of these systems as scientific research tools. DARPA's Explainable Math Reasoning program is funding AI systems to assist in frontier mathematical discovery, viewing multi-agent theorem provers as collaborative co-authors rather than mere automation tools.
The convergence of neural and symbolic approaches through multi-agent architectures addresses fundamental limitations of pure language models, which suffer from hallucinations, and pure formal systems, which require extensive manual effort. By distributing responsibilities among specialized agents—premise selection, tactic generation, subgoal decomposition, informal reasoning, and formal verification—these systems achieve capabilities exceeding the sum of their parts. As benchmarks like miniF2F-v2 address discrepancies between formal and informal problem statements, enabling full pipeline accuracies approaching 70%, and as models scale to hundreds of billions of parameters, multi-agent theorem proving systems are positioned to become indispensable partners in mathematical research and mission-critical software verification.