New Technique Enhances AI's Problem-Solving Abilities with Python Programs

Recent advancements in large language models, such as those behind ChatGPT, have demonstrated exceptional performance in tasks ranging from drafting legal briefs to translating documents. Despite these successes, these models often struggle with numerical or symbolic reasoning tasks, which require more than just natural language processing.

For example, a language model might easily recall a list of recent U.S. presidents and their birthdays. However, it could falter when asked, “Which U.S. presidents elected after 1950 were born on a Wednesday?” (The correct answer is Jimmy Carter.)
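Expressed as code, though, the question becomes a short, mechanical computation. Here is an illustrative sketch of that calculation (with a deliberately abbreviated list of presidents):

```python
from datetime import date

# Birthdates of some U.S. presidents first elected after 1950
# (deliberately abbreviated; shown for illustration only).
presidents = {
    "John F. Kennedy": date(1917, 5, 29),
    "Jimmy Carter": date(1924, 10, 1),
    "Ronald Reagan": date(1911, 2, 6),
    "Barack Obama": date(1961, 8, 4),
}

# date.weekday() numbers Monday as 0, so Wednesday is 2.
born_on_wednesday = [name for name, born in presidents.items()
                     if born.weekday() == 2]
print(born_on_wednesday)  # ['Jimmy Carter']
```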

Addressing this limitation, researchers from MIT and other institutions have introduced a novel approach that enables large language models to tackle natural language, math, data analysis, and symbolic reasoning tasks more effectively by generating programs.

This new method, termed natural language embedded programs (NLEPs), prompts a language model to create and execute a Python program to answer a query, then translates the solution back into natural language.

The research team discovered that NLEPs significantly improved the accuracy of large language models across various reasoning tasks. Moreover, this approach is versatile, allowing a single NLEP prompt to be used for multiple tasks.

NLEPs also enhance transparency, as users can inspect the generated programs to understand how the model arrived at its conclusions and correct any mistakes directly.

“We aim for AI to perform complex reasoning in a transparent and trustworthy manner. While there is still much progress to be made, combining programming and natural language capabilities in large language models is a promising first step towards a future where AI is fully understandable and reliable,” says Hongyin Luo, PhD ’22, an MIT postdoc and co-lead author of a paper on NLEPs.

Luo collaborated on this paper with co-lead authors Tianhua Zhang from the Chinese University of Hong Kong and Jiaxin Ge from Peking University; Yoon Kim, an assistant professor at MIT’s Department of Electrical Engineering and Computer Science; and senior author James Glass, a senior research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Solving Problems with Programs

Most large language models function by predicting the next word based on natural language input. While models like GPT-4 can generate programs, they typically embed these programs within natural language, which can lead to reasoning errors.

In contrast, the MIT researchers’ NLEP approach prompts the model to generate step-by-step Python code, embedding the necessary natural language within the program.

An NLEP consists of four steps: importing necessary packages, incorporating natural language representations of the required knowledge, implementing a function to calculate the answer, and outputting the result in natural language with an optional data visualization.
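To make those four steps concrete, here is a minimal, hypothetical sketch in the shape of an NLEP for one of the symbolic tasks the team evaluated, a game of 24 (reach 24 from four numbers using +, -, *, and /). The question, the numbers, and the program structure are illustrative; a program the model actually generates will differ, and this version checks only left-to-right bracketings for brevity:

```python
# Step 1: import the necessary packages.
from itertools import permutations, product
from operator import add, sub, mul, truediv

# Step 2: natural language representation of the required knowledge.
# Question: "Using 2, 3, 4, and 6 once each, make 24."
numbers = [2, 3, 4, 6]
ops = {add: "+", sub: "-", mul: "*", truediv: "/"}

# Step 3: a function that computes the answer.
def solve_24(nums):
    # Brute-force search over orderings and operators; for brevity this
    # sketch tries only the left-to-right bracketing ((a . b) . c) . d.
    for a, b, c, d in permutations(nums):
        for f, g, h in product(ops, repeat=3):
            try:
                if abs(h(g(f(a, b), c), d) - 24) < 1e-9:
                    return f"(({a} {ops[f]} {b}) {ops[g]} {c}) {ops[h]} {d}"
            except ZeroDivisionError:
                continue  # skip divisions by zero on general inputs
    return None

# Step 4: output the result in natural language.
expression = solve_24(numbers)
print(f"One way to reach 24: {expression} = 24" if expression
      else "No solution was found.")
```

Note how the natural language lives inside the program, as comments and printed strings, while the computation itself is ordinary Python.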

“It’s like a digital calculator that ensures correct computation as long as the program is accurate,” says Luo.

Users can easily review and correct errors in the code directly, bypassing the need to rerun the entire model for troubleshooting.

This method also offers efficiency. If a user has multiple similar questions, they can generate one core program and adjust specific variables without rerunning the model.
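With the sketch above, for instance, answering a new variant of the puzzle amounts to changing the inputs rather than prompting the model again:

```python
# Answer a new instance by changing the variables, not the prompt.
print(solve_24([1, 2, 3, 4]))   # ((1 + 2) + 3) * 4
print(solve_24([5, 5, 5, 5]))   # None: no left-to-right bracketing reaches 24
```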

To generate an NLEP, researchers instruct the model to write a Python program, provide two NLEP examples (one involving math and one involving natural language), and pose one test question.
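In outline, assembling such a prompt might look like the sketch below. The instruction wording and the worked examples are hypothetical stand-ins, not the exact prompt from the paper:

```python
# Hypothetical sketch of assembling an NLEP-style few-shot prompt.
INSTRUCTION = (
    "Write a step-by-step Python program that answers the question. "
    "Import packages, state the needed knowledge, compute the answer, "
    "and print it in natural language."
)

# Two worked examples, one math and one natural language (bodies elided;
# each would be a complete program like the sketches above).
MATH_EXAMPLE = "Question: Using 2, 3, 4, and 6 once each, make 24.\n# Program: ..."
LANGUAGE_EXAMPLE = ("Question: Which U.S. presidents elected after 1950 "
                    "were born on a Wednesday?\n# Program: ...")

def build_nlep_prompt(question: str) -> str:
    """Combine the instruction, the two examples, and the new test question."""
    return "\n\n".join([
        INSTRUCTION,
        MATH_EXAMPLE,
        LANGUAGE_EXAMPLE,
        f"Question: {question}\n# Program:",
    ])

print(build_nlep_prompt("How many Fridays fall in October 2024?"))
```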

“Typically, few-shot prompting requires designing prompts for each task. We found that one prompt can serve multiple tasks because it teaches the model to solve various problems by writing a program,” explains Luo.

“Using code for reasoning opens up numerous opportunities for tool use, output validation, and structured understanding of the model’s capabilities,” adds Leonid Karlinsky, principal scientist at the MIT-IBM Watson AI Lab.

Achieving High Accuracy

NLEPs achieved over 90 percent accuracy when prompting GPT-4 to solve symbolic reasoning tasks, such as tracking shuffled objects or playing a game of 24, as well as instruction-following and text-classification tasks. The method outperformed task-specific prompting approaches by 30 percent and also showed improvements over open-source language models.

Beyond enhancing accuracy, NLEPs could improve data privacy by running programs locally, keeping sensitive user data secure. This approach also allows smaller language models to perform better without costly retraining.

“There’s no magic involved. We use program generation instead of natural language generation, significantly improving performance,” says Luo.

However, NLEPs depend on the model’s program generation capability, making the technique less effective for smaller models trained on limited datasets. Future research will explore ways to enhance smaller models' ability to generate effective NLEPs and investigate prompt variations to improve the robustness of the model’s reasoning processes.
