A New AI Software Engineer
In this issue:
1. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Watching: Adaptive-RAG (paper)
What problem does it solve? Retrieval-Augmented Language Models (RALMs) have shown great promise in enhancing the accuracy of Large Language Models (LLMs) on tasks like Question-Answering (QA) by incorporating external knowledge. However, existing approaches often struggle to efficiently handle queries of varying complexity. Simple queries are processed with unnecessary computational overhead, while complex multi-step queries are not adequately addressed. This leads to suboptimal performance and resource utilization, as real-world user requests span a range of complexity levels.
How does it solve the problem? The proposed adaptive QA framework dynamically selects the most suitable strategy for retrieval-augmented LLMs based on the complexity of the incoming query. It employs a smaller LM-based classifier trained to predict the complexity level of queries using automatically collected labels derived from the actual predicted outcomes of models and inherent inductive biases in datasets. By seamlessly adapting between iterative and single-step retrieval-augmented LLMs, as well as no-retrieval methods, the framework efficiently handles queries of varying complexity. This approach strikes a balance between computational efficiency and accuracy, ensuring that the most appropriate strategy is applied to each query.
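The routing idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the label names, the `route` function, and the three strategy functions are hypothetical, and a crude keyword heuristic stands in for the trained classifier.

```python
from typing import Callable, Dict, Tuple

# Hypothetical complexity labels for this sketch.
NONE, SINGLE, MULTI = "none", "single", "multi"


def classify_complexity(query: str) -> str:
    """Toy stand-in for the trained classifier (keyword heuristic, not a model)."""
    words = query.lower().rstrip("?").split()
    if len(words) <= 4:
        return NONE  # very short query: try the LLM alone
    if any(w in ("whose", "before", "after", "both") for w in words):
        return MULTI  # relational cue words hint at multi-step reasoning
    return SINGLE


# Placeholder strategies: each would call an LLM (and a retriever) in practice.
def answer_without_retrieval(query: str) -> str:
    return f"[LLM only] {query}"


def answer_single_step(query: str) -> str:
    return f"[retrieve once, then answer] {query}"


def answer_multi_step(query: str) -> str:
    return f"[retrieve and reason iteratively] {query}"


STRATEGIES: Dict[str, Callable[[str], str]] = {
    NONE: answer_without_retrieval,
    SINGLE: answer_single_step,
    MULTI: answer_multi_step,
}


def route(
    query: str, classifier: Callable[[str], str] = classify_complexity
) -> Tuple[str, str]:
    """Predict the complexity label, then dispatch to the matching strategy."""
    label = classifier(query)
    return label, STRATEGIES[label](query)


if __name__ == "__main__":
    for q in (
        "Capital of France?",
        "Who directed the film that won Best Picture in 1998?",
        "Which city is the birthplace of the author whose novel inspired Blade Runner?",
    ):
        print(route(q))
```

Passing the classifier in as a parameter keeps the dispatch logic unchanged if the heuristic is later replaced by a trained model.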
What's next? The adaptive QA framework demonstrates the potential for intelligent resource allocation in RALMs based on query complexity. Future research could explore more sophisticated methods for query complexity estimation, such as incorporating user feedback or leveraging unsupervised learning techniques. Additionally, the framework could be extended to other NLP tasks beyond QA, such as text summarization or dialogue systems, where adapting to varying input complexity could yield significant improvements in efficiency and performance. As RALMs continue to evolve, developing adaptive strategies that optimize resource utilization while maintaining high accuracy will be crucial for their practical deployment in real-world applications.
2. BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Watching: BLADE (paper)
What problem does it solve? While Large Language Models (LLMs) have demonstrated remarkable versatility and performance across a wide range of tasks, they often lack the specialized knowledge required for domain-specific applications, such as those in the legal or medical fields. Adapting these general-purpose LLMs to vertical domains has proven to be challenging, with existing approaches being either cost-prohibitive or unreliable in practical settings.
How does it solve the problem? BLADE (Black-box LArge language models with small Domain-spEcific models) addresses this issue by combining the strengths of a black-box LLM and a small domain-specific LM. The small LM is pre-trained on domain-specific data to capture specialized knowledge and insights, while the general LLM contributes robust language comprehension and reasoning capabilities. The integration of these two models is achieved through a three-step process: pre-training the small LM, fine-tuning it using knowledge instruction data, and jointly optimizing both models using Bayesian optimization.
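One way to picture how the two models might cooperate at inference time is the sketch below. It assumes the small LM writes question-specific background knowledge that is then placed in the prompt of the black-box LLM; the summary above does not spell out this flow, so treat it as an assumption. All function names are hypothetical, both models are stubs, and the three training steps are not shown.

```python
from typing import Callable

# Both models are treated as plain text-in, text-out functions.
TextModel = Callable[[str], str]


def build_prompt(question: str, knowledge: str) -> str:
    """Place the domain knowledge ahead of the question (assumed prompt layout)."""
    return (
        "Use the background knowledge to answer the question.\n"
        f"Background knowledge: {knowledge}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_with_domain_knowledge(
    question: str, domain_lm: TextModel, general_llm: TextModel
) -> str:
    """The small LM supplies knowledge; the black-box LLM does the reasoning."""
    knowledge = domain_lm(question)
    return general_llm(build_prompt(question, knowledge))


# Stubs standing in for real models, so the sketch runs offline.
def stub_domain_lm(question: str) -> str:
    return "A contract requires offer, acceptance, and consideration."


def stub_general_llm(prompt: str) -> str:
    return f"(answer based on {len(prompt.splitlines())} prompt lines)"


if __name__ == "__main__":
    print(
        answer_with_domain_knowledge(
            "What makes a contract valid?", stub_domain_lm, stub_general_llm
        )
    )
```

Because the general LLM is only ever called through its text interface, this arrangement needs no access to its weights, which is what makes the black-box setting workable.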
What's next? The promising results of BLADE on public legal and medical benchmarks suggest that this framework could be a cost-effective and efficient solution for adapting general LLMs to various vertical domains. As more specialized applications of LLMs emerge, it will be interesting to see how BLADE and similar approaches evolve to address the unique challenges and requirements of different industries. Furthermore, the integration of domain-specific knowledge into LLMs could lead to the development of more accurate and reliable AI systems for critical applications, such as legal advice or medical diagnosis.
3. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
Watching: MAGIS (paper)
What problem does it solve? Resolving GitHub issues is a complex task that requires understanding the context of the repository, the existing codebase, and the specific requirements of the issue. Large Language Models (LLMs) have shown impressive capabilities in code generation and understanding, but they often struggle with making appropriate code changes at the repository level. This is because resolving issues involves not only generating new code but also maintaining the existing functionalities and ensuring compatibility with the rest of the codebase.
How does it solve the problem? To address the challenges of resolving GitHub issues using LLMs, the authors propose MAGIS, a Multi-Agent framework that leverages the collaboration of four specialized agents: Manager, Repository Custodian, Developer, and Quality Assurance Engineer. The Manager agent breaks down the issue into subtasks and assigns them to the appropriate agents. The Repository Custodian agent maintains an understanding of the repository's structure and existing functionalities. The Developer agent generates code changes based on the subtasks, while the Quality Assurance Engineer agent verifies the correctness and compatibility of the generated code. By decomposing the issue resolution process and leveraging the strengths of each agent, MAGIS significantly improves the performance of LLMs in resolving GitHub issues.
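The division of labor among the four roles can be sketched as a simple pipeline. This is a toy illustration under stated assumptions, not the MAGIS implementation: each agent is a plain function rather than an LLM call, the order in which the roles run is assumed, and every name here (`Change`, `resolve_issue`, and so on) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Change:
    path: str
    patch: str
    approved: bool = False


def repository_custodian(issue: str, repo: Dict[str, str]) -> List[str]:
    """Pick files that mention any longer word from the issue (toy relevance check)."""
    terms = [w.lower() for w in issue.split() if len(w) > 3]
    return [
        path for path, text in repo.items()
        if any(t in text.lower() for t in terms)
    ]


def manager(issue: str, files: List[str]) -> List[Dict[str, str]]:
    """Break the issue into one subtask per candidate file."""
    return [{"path": p, "goal": f"Fix '{issue}' in {p}"} for p in files]


def developer(task: Dict[str, str]) -> Change:
    """Stand-in for LLM code generation: emit a placeholder patch."""
    return Change(path=task["path"], patch=f"# TODO: {task['goal']}")


def qa_engineer(change: Change) -> Change:
    """Toy review: approve any non-empty patch."""
    change.approved = bool(change.patch.strip())
    return change


def resolve_issue(issue: str, repo: Dict[str, str]) -> List[Change]:
    """Locate files, plan subtasks, generate changes, then review each one."""
    files = repository_custodian(issue, repo)
    tasks = manager(issue, files)
    return [qa_engineer(developer(task)) for task in tasks]


if __name__ == "__main__":
    repo = {
        "parser.py": "def parse(text): return text.split()",
        "README.md": "project docs",
    }
    for change in resolve_issue("parse crashes on empty input", repo):
        print(change)
```

Keeping each role behind its own function makes it easy to swap a stub for a real model call, or to add a retry loop between the developer and the reviewer.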
What's next? The success of MAGIS in resolving GitHub issues opens up new possibilities for applying LLMs in software evolution tasks. Future research could explore the integration of MAGIS with other software development tools and processes, such as continuous integration and deployment pipelines. Additionally, the multi-agent approach used in MAGIS could be adapted to other domains where LLMs struggle with complex, multi-step tasks that require collaboration and specialized knowledge.
Editor’s note: It’s not yet clear which solution is more promising, Devin or MAGIS. Unlike Devin, MAGIS achieves its impressive results with access only to the shell and at ~3x the speed. But Devin is obviously a more polished product with a wider range of features. Most importantly, the two systems were evaluated on different subsets of SWE-bench, so their results are not directly comparable.
Papers of the Week:
Software Quality Assurance Manager at QA Mentor - Software Testing Expert
3 months ago: Really fascinating developments in the realm of AI for software engineering! Do check out TestGrid's CoTester, the world's first AI for testing. It might be worth exploring alongside these advancements! https://testgrid.io/cotester
Exploring new AI Frontiers | ML | DL | NLP | CV | Data Science | Generative AI | MAS with LLMs | AGI Enthusiast | Astrophysics Enthusiast | NUST'25 | Let's innovate together
5 months ago: That's super amazing. Waiting for another multi-agent framework that will beat MAGIS :) Pascal Biese
Co-Founder LastBot | Repeat Founder & CEO | AI & ML since 1990 | Innovator with 25+ Patents
5 months ago: Thank you for sharing this, Pascal Biese. This is an awesome direction! What I would like to do, however, is 1) architect and modularize the software by hand, and 2) for selected modules (most of them), have the AI own the module: 100% write, test, fix, improve. That is where the 10x+ efficiency comes from. AI should do all the coding where it's the owner. I should only be concerned about AI-owned modules to the extent that they expose their functionality. Cannot wait for the next generations of AI developers to become available. It is likely there is a bunch of these cooking in the labs...
Data Scientist
5 months ago: I was hoping that the authors would release the code and prompts. There is nothing “amazing” in Devin or MAGIS, as one can simply add a browser tool, a shell tool, and memory in LangChain and choose one or many LLMs to work on existing repo issues. When the secret sauce is just the prompt and the ordering of tasks within agents, it won’t be long before we have equally capable open-source coding agents.
Director of AI Programs @ Caltech CTME
5 months ago: What happens to the issues that weren't resolved? Is there a massive cloud bill that got spun up, and damage that an engineer needs to go and fix? I am really curious about the technical cost here: 1 out of every 7 tickets successfully fixed, but the other 6 tickets blew up the pipeline and now it is all hands on deck? Practical AI had a great episode on this in this week's podcast.