The most advanced AI models have solved only 2% of complex mathematical problems developed by the world’s leading mathematicians.
The research institute Epoch AI has presented FrontierMath, a new benchmark that requires doctoral-level mathematical knowledge. Mathematics professors, including Fields Medal winners, took part in its development.
Whereas leading models answer roughly 90% of questions correctly on earlier benchmarks such as MMLU, FrontierMath proved nearly unsolvable for them.
"These tasks are extremely difficult. Currently, they can only be solved with the help of an expert in the field or a graduate student in a related field, combined with modern AI and other algebraic tools," said Terence Tao, the 2006 Fields Medal winner.
Six leading AI models were tested in the study. Google's Gemini 1.5 Pro (002) and Anthropic's Claude 3.5 Sonnet showed the best result, solving 2% of the problems. OpenAI's o1-preview, o1-mini, and GPT-4o each solved 1% of the tasks, while xAI's Grok-2 Beta failed to solve a single problem.
FrontierMath covers a range of mathematical fields, from number theory to algebraic geometry. Sample problems are available on the Epoch AI website. The developers created original problems that do not appear in the training data of AI models.
The researchers note that even when a model gave the correct answer, it did not always follow valid reasoning: sometimes the result could be obtained through simple simulation, without deep mathematical understanding.
Source: Livescience