ChatGPT-4o, Six Other AI Models Fail China’s College Entrance Maths Exam

(Yicai) June 20 -- US artificial intelligence firm OpenAI’s closed-source ChatGPT-4o and six other large language models were asked to sit China’s notoriously difficult college entrance examinations in three subjects: English, Chinese and mathematics. Although they performed relatively well in the language papers, none of them passed maths.

ChatGPT-4o, as well as open-source models developed by e-commerce giant Alibaba Group Holding, 01.AI, Zhipu AI, Shanghai Artificial Intelligence Laboratory and France’s Mistral AI, were put to the test by OpenCompass, the Shanghai AI Lab’s evaluation system.

China’s tough college entrance exams are a good way of gauging LLMs’ intelligence, the Shanghai AI Lab said. The tests were all marked manually, and the teachers who marked them were not told that the answers had been written by machines. The exams contained both objective and subjective questions, it added.

Alibaba’s Qwen 2-72B was the top-scoring LLM, with 303 points out of a total of 420 across the three subjects, according to the results released by OpenCompass yesterday. It was followed by San Francisco-based OpenAI’s ChatGPT-4o with 296 points and the Shanghai AI Lab’s InternLM 2.0 with 295.5 points. Mistral AI’s LLM came last with 185 points.

But all of them failed the maths exam. InternLM 2.0 achieved the highest score of just 75 points out of 150. GPT-4o came second with 73 points.

In the maths paper, examiners found that the AI models’ answers to subjective questions were illogical and confused. Sometimes the reasoning was wrong but the answer was correct. The LLMs are able to memorize formulas well, but they have trouble explaining how they solve problems.

This shows that there is still a lot of room for improvement in terms of AI models' maths abilities, Lin Dahua, a scientist at the Shanghai AI Lab, told Yicai. Maths involves complex reasoning, which is a key skill needed for the use of LLMs in finance and other important sectors.

The AI models performed well on modern Chinese, but there was a big gap in their knowledge of classical Chinese.

Qwen was the highest scorer in Chinese with 124 out of 150 points, and GPT-4o excelled in English with 109 out of 120 points.

In English, most humans who take the test lose points for not writing enough, but the AI models tended to have points deducted for exceeding the word limit.
