Multi-repo, AI-assisted refactoring unlocks large-scale code transformation

In software development, artificial intelligence has significantly advanced coding speed and efficiency. One of the most effective applications is within IDEs, where autocomplete features suggest code in real time and chat functionalities generate and explain code snippets. It remains crucial for developers to scrutinize these suggestions, particularly to correct any inaccuracies introduced by the AI.

The success of autocomplete largely stems from the extensive context it gathers from the text around the cursor and other open documents within the IDE. Developers who understand how to manipulate these contexts can enhance the accuracy of AI-based autocompletion by ensuring that pertinent files are open in the IDE, thus providing richer contextual data.

However, this method is impractical for analyzing and refactoring extensive codebases across multiple repositories simultaneously. Each AI-driven edit relies on the immediate surrounding context and serves as a suggestion, necessitating thorough review by developers. For large-scale refactoring, both precision and reliability are essential.

To that end, Moderne has incorporated AI-assisted auto-refactoring to get the best of both worlds: accuracy and efficiency in transforming code across multiple repositories, with the flexibility to utilize AI LLMs to improve the process.

In this article, we explore how AI LLMs are integrated to facilitate auto-refactoring, and we will also discuss a particular customer use case that yielded excellent results.

Improving accuracy of LLMs for code with Lossless Semantic Trees

Before we get into Moderne’s specific integration of AI, let’s walk through some of the technical thinking and specifics about leveraging AI with rules-based auto-refactoring.

Most LLMs are trained on natural language and source code as plain text, without distinguishing between the two. Yet code has structure, a strict grammar, and a compiler that can resolve it deterministically, all of which could be leveraged.

There has been research into fine-tuning models on structured code representations such as the AST (abstract syntax tree), the CST (concrete syntax tree), data flow, control flow, and even program execution. However, tuning models on those representations will take time (and compute power!). Additionally, these trees lack semantic data about code, such as type attribution, which is crucial for accurate codebase searches.
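To make the distinction concrete, here is a minimal look at Python's built-in AST; the snippet and variable names are purely illustrative:

```python
import ast

# Parse a small snippet into Python's built-in abstract syntax tree (AST).
tree = ast.parse("total = price * quantity")
print(ast.dump(tree.body[0], indent=2))

# The tree captures structure -- an Assign node whose value is a BinOp --
# but carries no type attribution: nothing records whether `price` is an
# int, a float, or a Decimal. That semantic gap is what the article is
# pointing at.
```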

Today, business applications such as chatbots augment LLM training using a technique called retrieval-augmented generation (RAG). RAG improves the accuracy and reliability of generative AI models by using embeddings that fetch data from relevant external sources, guiding the model’s decision-making process. Embeddings allow us to supply context to the model, including data it wasn’t trained on plus more text than a real-time IDE context window could supply.
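As a toy illustration of the retrieval step (not Moderne's implementation: the snippets and the three-dimensional "embedding" vectors below are made up), ranking stored context against a query by cosine similarity looks like this:

```python
import math

# Hypothetical store of snippets with made-up embedding vectors. A real
# RAG system would produce these with a learned embedding model.
snippets = {
    "how to parse a date": [0.9, 0.1, 0.0],
    "thread pool sizing":  [0.1, 0.8, 0.3],
    "regex for emails":    [0.2, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Pretend embedding of the user's query; the best-matching snippet is
# the context handed to the LLM alongside the query.
query_vec = [0.85, 0.2, 0.1]
best = max(snippets, key=lambda s: cosine(snippets[s], query_vec))
print(best)
```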

When working with large codebases, we similarly need a way to retrieve relevant data from the user's codebase as context and feed it to an LLM alongside the query. Fortunately, through the OpenRewrite auto-refactoring open-source project, we have a new code representation called the lossless semantic tree (LST) that includes type attribution and other metadata. This full-fidelity representation of code is an essential foundation for searching and transforming code accurately—and can do the same job that embeddings do for natural language text. Moreover, the Moderne Platform serializes LSTs to disk where they are horizontally scalable, which takes refactoring work from single-repo mode to multi-repo mode.

The Moderne Platform, with aggregated LSTs, allows us to query large codebases and provide the results to the LLM for its query or generation. We can then validate the LLM suggestion with our 100% accurate refactoring system.

In summary, for working with code, Moderne can provide highly relevant context to LLMs using LSTs, which are extremely rich data representations of your code. We use open-source LLMs to securely deploy this solution on our platform, ensuring that no code leaves this environment. Additionally, with our platform, we can effectively test and select the best LLM for a given task.

Integrating AI LLMs in OpenRewrite rules-based recipes

The Moderne Platform runs OpenRewrite recipes (or programs) against pre-built and cached LSTs to search and transform code. This allows us to execute recipes across multiple repositories, providing near-real-time feedback for code analytics.

When creating recipes, many declarative formats are available for ease of use, but users can develop custom programmatic recipes for maximum flexibility. It is possible to implement many types of integrations from custom OpenRewrite recipes as well. One such example is LaunchDarkly feature flag removal that calls LaunchDarkly to identify unused feature flags, removes functionality behind a feature flag from the codebase, and then calls LaunchDarkly again to remove the unused feature flags. A recipe can also wrap and execute other tools, such as in our JavaScript codemods integration.

LLMs are another important tool we can integrate into recipes to support new and interesting use cases. We’ve focused on OSS-specialized LLMs that can run on CPUs for maximum operational security and efficiency within the Moderne Platform. This enables us to provision the models on the same CPU-based worker nodes where the recipes are manipulating the LST code representations. Through testing and measuring various models, which are downloadable from Hugging Face, we can identify the best ones and iterate on the selection as new models arrive every day. Model selection can make a difference between getting something done accurately and quickly or not at all.

On the Moderne Platform, models run as a Python-based sidecar using the Gradio Python library. Each Python sidecar hosts a different model, allowing for a variety of tasks to be performed. A recipe then has several tools in its toolbox, and different recipes can also use the same sidecar. For example, a recipe that computes the distribution of languages in codebase comments and a recipe that fixes misencoded comments in French can both use the sidecar Python process hosting a model that predicts the language of a text input.

When a recipe is running on a worker, it can search LSTs for the necessary data, pass it to the appropriate LLM, and receive a response. Only the parts of the LST that need to be evaluated or transformed are sent to the model. The recipe then inserts the LLM response back into the LST. The Moderne Platform produces diffs for developers to review and commit back to source code management (SCM), ensuring models are doing their job with precision. See the process in Figure 1.

Figure 1. AI LLMs as sidecars for recipes to leverage on CPU-based workers
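The recipe/sidecar division of labor above can be sketched as follows. This is a hedged illustration, not Moderne's actual API: `predict_language` stands in for the sidecar-hosted model, faked here with a trivial heuristic so the flow is runnable.

```python
def predict_language(text):
    # Stub for the model hosted in the Python sidecar. A real sidecar
    # would run an LLM; this heuristic merely keeps the sketch runnable.
    return "fr" if "é" in text or "è" in text else "en"

# The "recipe" walks only comment nodes and sends just their text to the
# model -- the rest of the code representation never reaches the LLM.
comments = ["// returns the client id", "// propriété du client"]
languages = [predict_language(c) for c in comments]
print(languages)
```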

We deploy our microservices with immutable infrastructure, so both the main microservice and the Python sidecar with LLMs can be part of the base image and start together. The Python process is efficient, starting only the first time the recipe is called on that specific worker.

LLM-based recipes can also perform the same functions as other recipes, such as emitting data tables and visualizations or integrating with SCM for pull requests and commits.

A case study with artificial intelligence: Finding and fixing misencoded French

One of our customers came to us with a problem ripe for an automated fix. Their older code had gone through multiple stages of character encoding transformation through the years leading to misencoded French characters being unrenderable. Furthermore, misencoded French characters in Javadoc comments were causing the Javadoc compiler itself to fail, which meant consumers of that code did not have ready access to documentation on the APIs they were using.

French characters can carry accents, such as é or è, or other diacritics. These special characters could be found in comments, Javadocs, and basically anywhere there was textual data in the codebase. ASCII is a 7-bit character encoding standard designed primarily for the English alphabet, supporting a total of 128 characters (0-127). This set includes letters, digits, punctuation marks, and control characters, but no accented characters like "é" or other non-English characters. When a character runs into an encoding issue, it is replaced by a placeholder such as ?.
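The limitation is easy to demonstrate with Python's standard codecs:

```python
# ASCII covers only code points 0-127, so accented characters cannot
# be represented in it.
print("cafe".encode("ascii"))  # plain ASCII text encodes fine

try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)

# A lossy fallback replaces the accent with '?', which is how
# unrenderable placeholders end up scattered through a codebase.
lossy = "café".encode("ascii", errors="replace")
print(lossy)  # b'caf?'
```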

Perhaps the problem started with a source file originally created in a character encoding like ISO-8859-1. Because a text file carries no marker of its character encoding, subsequent tools might guess wrong and assume it was encoded with Windows-1252. Maybe there was an attempt to standardize on UTF-8. Most Latin characters, like the standard alphabet, are represented by the same bytes in all of these encodings, so the file doesn't get totally mangled; but characters not common in English, and characters with diacritics, are not encoded the same way everywhere. Over the lifetime of the file, characters with diacritics were entered under a variety of different encodings until there was no longer any single encoding with which the file could be interpreted to display everything correctly.
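This kind of mojibake is easy to reproduce. A single, consistent mistake is mechanically reversible, but after several inconsistent rounds, or a lossy ? substitution, the original bytes are gone and only context can recover the intended word:

```python
original = "Propriété du client"  # correct French comment text

# Save as UTF-8, then (incorrectly) reopen as Windows-1252: mojibake.
# Every "é" becomes the two-character sequence "Ã©".
garbled = original.encode("utf-8").decode("windows-1252")
print(garbled)

# One consistent mistake can be undone by reversing the round trip...
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == original)  # True

# ...but once a character has been replaced by '?', no decoding trick
# can restore it -- which is where a language model becomes useful.
```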

With the Moderne Platform and a little help from AI, we were able to solve this problem quickly. We decided to use AI to figure out what the words are supposed to be and to fill in the appropriate modern UTF-8 characters.

Watch our video on this use case.

Combining strengths of rules-based refactoring and AI in the Moderne Platform

OpenRewrite’s recipes are precise, deterministic, and fast. Large language models are creative, versatile, and powerful. At Moderne, we combine the strengths of both.

In our case study on misencoded French text, fixing these issues is challenging for both purely rules-based systems and LLMs alone. Rules-based systems can't identify the natural language in comments, and LLMs struggle to identify comments or other code syntax/semantic data. By using recipes to guide and focus the LLM, we achieve more predictable and reliable results, which our customers can trust for their large, complex codebases.

Using an LLM alone would not solve the problem for our customer. In the examples shown in Figure 2, we see instances where ChatGPT, an LLM, fails to fix misencoded comments. In the first instance, it fails to understand that “this” is the Java keyword rather than the English demonstrative, and it also “fixes” “class” to “classe,” which could frustrate developers. In the second example, it doesn't recognize that the comment isn't a question, leading to incorrect fixes.

Figure 2. ChatGPT LLM alone fails to fix misencoded French

The Moderne Platform provides the framework for a recipe to walk your codebase (LSTs) in a deterministic way, calling the AI model only when needed. This not only safeguards and focuses the model to precise places in your code, but also makes models more efficient as they are only used when needed. The transformation possibilities for your code are truly endless.

To learn more about the flexibility of the Moderne Platform and how we’re leveraging AI to improve large codebases, contact us.


