A year ago, a research team at Google DeepMind introduced AlphaGeometry, a neuro-symbolic system combining a large language model with a symbolic engine to solve geometry problems drawn from the 2000 to 2024 International Mathematical Olympiad exams. The original AlphaGeometry paper reported that the system performed solidly at the level of an IMO silver medalist and was approaching gold medalist performance: it solved 25 of 30 problems in a benchmark set on which the average gold medalist's score was calculated to be 25.9.

In July, DeepMind shared that it paired up an improved version of AlphaGeometry, AlphaGeometry2, with AlphaProof—a reinforcement-learning-based system for formal math reasoning—to solve the problems in the 2024 International Mathematical Olympiad. The combined system solved four out of the six math problems from the exam, a score that would have earned a human competitor a silver medal.

A recent paper details more about AlphaGeometry2, including how the research team determined that the enhanced system surpasses the performance of the average IMO gold medalist in solving geometry problems. Both AlphaGeometry (AG1) and AlphaGeometry2 (AG2) were set against all the geometry problems from the 2000 to 2024 IMO exams: AG1 solved 54% of them, while AG2 solved 84%.

In the research paper, the team identifies several limitations of AG1. Broadly, both AG1 and AG2 work by having a language model translate each natural-language problem into a formal language, so that a symbolic engine can then perform the relevant deductions. Notably, AG1 could not solve some problems in the dataset because its formal language lacked the concepts needed to express certain geometry topics, including angles, ratios, and linear equations. By extending the formal language with predicates for these missing concepts, the research team expanded the share of problems the system could potentially solve from 66% (AG1) to 88% (AG2).
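To get a rough sense of what this translate-then-deduce pipeline looks like, consider the toy sketch below. It is not AlphaGeometry's actual domain language or deduction engine; the predicate names (midpoint, coll, cong) and the single forward-chaining rule are hypothetical stand-ins, meant only to illustrate the general shape of a symbolic engine that repeatedly applies deduction rules to formalized facts until it reaches the stated goal.

```python
# Toy illustration of the neuro-symbolic pipeline described above.
# The predicates and the rule are hypothetical stand-ins; AlphaGeometry's
# real domain language and deduction engine are far richer.

# Facts a language model would have translated from the natural-language
# problem statement, plus the goal the symbolic engine must reach.
facts = {
    ("midpoint", "M", "A", "B"),   # "M is the midpoint of segment AB"
    ("coll", "A", "M", "B"),       # "A, M, B are collinear"
}
goal = ("cong", "A", "M", "M", "B")  # "segment AM is congruent to segment MB"

def deduce(facts):
    """One forward-chaining pass: apply each deduction rule to the
    current fact set and return any newly derived facts."""
    new = set()
    for fact in facts:
        # Rule: a midpoint splits a segment into two congruent halves.
        if fact[0] == "midpoint":
            _, m, a, b = fact
            new.add(("cong", a, m, m, b))
    return new - facts

# Saturate: keep applying rules until the goal appears or nothing new is derived.
while goal not in facts:
    derived = deduce(facts)
    if not derived:
        print("No proof found with these toy rules.")
        break
    facts |= derived
else:
    print("Goal derived:", goal)
```

Expanding the formal language, as the AG2 team did, amounts to adding more kinds of facts like these (and rules that operate on them), so that more IMO problems can be stated to the engine in the first place.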

This expanded language, coupled with an enhanced language model, search algorithm, and symbolic engine, led to the strong performance reported in the research paper. Even so, the research team acknowledges that work remains: AG2 cannot yet solve problems involving a variable number of points, nonlinear equations, or inequalities. Nor is there yet a system able to solve an entire IMO exam. The team suggests that breaking complex problems into simpler subproblems and applying reinforcement learning strategies could be a promising way to bridge that gap.