Mathematicians devised challenging problems to test the reasoning of Gemini, Claude, and GPT-4o — they failed almost all tests

Published by Ihor Panchenko

The most advanced AI models have solved only 2% of complex mathematical problems developed by the world’s leading mathematicians.

The research institute Epoch AI has presented FrontierMath, a new benchmark that requires doctoral-level mathematical knowledge. It was developed with professors of mathematics, including Fields Medal winners (the Fields Medal is the most prestigious international award in mathematics, given every four years to mathematicians under the age of 40 for outstanding achievements, and is often called the «mathematical Nobel»). Solving these problems can take professional mathematicians anywhere from several hours to several days.

In the earlier MMLU benchmark (Measuring Massive Multitask Language Understanding), a standard test suite for evaluating AI models that covers more than 57 subject areas, including math, physics, history, law, and medicine, AI models successfully solved 98% of school- and university-level math problems. The situation with the new tasks is radically different.

«These tasks are extremely difficult. Currently, they can only be solved with the help of an expert in the field or a graduate student in a related field, combined with modern AI and other algebraic tools», said Terence Tao, winner of the 2006 Fields Medal.

Six leading AI models were tested in the study. Gemini 1.5 Pro (002) by Google and Claude 3.5 Sonnet by Anthropic showed the best results, each answering 2% of the problems correctly. OpenAI’s o1-preview, o1-mini, and GPT-4o models managed to solve 1% of the tasks, while xAI’s Grok-2 Beta failed to solve a single problem.

FrontierMath covers various mathematical fields, from number theory to algebraic geometry; the test tasks are available on the Epoch AI website. The developers created original problems that do not appear in the training data of AI models.

The researchers note that even when a model gave the correct answer, it did not always arrive at it through correct reasoning: sometimes the result could be obtained through simple simulation without deep mathematical understanding.

Source: Livescience