Reducing Code Review Time at Google

This is the latest issue of my newsletter. Each week I share research and perspectives on developer productivity. Subscribe to get future issues.


This week I read Resolving Code Review Comments with Machine Learning from Google. Code reviews are a critical part of the software development process at Google, but they take a significant amount of time. Researchers looked for a way to speed up code reviews while maintaining quality. This paper documents their solution and results.

My summary of the paper

Developers at Google spend a lot of time "shepherding" code changes through reviews, both as authors and reviewers. Even when only a single review iteration is needed, there is a cost involved: it takes time to understand the reviewer's recommendation, look up relevant information, and type out the edit. Moreover, the active work time that the code author must devote to addressing reviewer comments grows almost linearly with the number of comments.

For these reasons, Google created a code review comment-resolution assistant. Their goal: to reduce the time spent resolving code review comments by making it easier for reviewers to provide actionable suggestions and authors to efficiently address those suggestions. Their assistant achieves this by proposing code changes based on a comment’s text.

Google's code review comment-resolution assistant uses machine learning to help developers address review comments more efficiently. Here's a simplified explanation of how it was created, how it works, and its impact:

Modeling and training

The comment-resolution assistant uses a text-to-text machine learning model. It processes code with inline reviewer comments and predicts the necessary code edits to address these comments. The model was trained on a vast dataset, including over 3 billion examples, to handle various software-engineering tasks like code edits and error fixing. During training, it was fine-tuned to prioritize high-precision predictions, ensuring that only the most reliable suggestions are presented to users.
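
To make the model's task concrete, here is a minimal sketch of how code with an inline reviewer comment might be serialized into a text-to-text training pair. The format, function, and example below are my own illustrative assumptions; the paper does not spell out Google's exact input encoding.

    # Illustrative only: a toy serialization of (code + reviewer comment) -> edited code.
    def build_example(code_before: str, comment: str, comment_line: int, code_after: str) -> dict:
        """Pair code annotated with an inline reviewer comment with the post-edit code."""
        lines = code_before.splitlines()
        # Inject the reviewer comment inline, just above the line it refers to.
        lines.insert(comment_line, f"# REVIEWER COMMENT: {comment}")
        # The training target is simply the code after the author's fix.
        return {"input": "\n".join(lines), "target": code_after}

    example = build_example(
        code_before="def add(a, b):\n    return a - b",
        comment="This should add the operands, not subtract them.",
        comment_line=1,
        code_after="def add(a, b):\n    return a + b",
    )
    print(example["input"])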

Prototyping an assistant based on the model

The team wanted the assistant to be easy and efficient for developers to use, so they tested different designs through user studies and an internal beta test. They ultimately developed an assistant that works as follows (illustrated in the image below; a simplified code sketch also follows the list):

  1. Incoming comments: It listens for new code-review comments from reviewers.
  2. Eligible for ML fixing: It ignores irrelevant comments, such as those from automated tools, non-specific comments, comments on unsupported file types, resolved comments, and comments with manual suggestions.
  3. Generated ML predictions: It queries the model to generate a suggested code edit.
  4. If the model is confident in the prediction (above 70% precision), it posts the suggestion to downstream systems (the code-review frontend and the integrated development environment).
  5. Discovered and applied: There, the suggested edits are exposed to the user. The system also logs user interactions, such as whether they preview the suggested edit and whether they accept it.
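
For readers who think in code, here is a simplified sketch of that flow. The data shapes, the stubbed predict_edit call, and the threshold handling are my own illustrative assumptions rather than Google's internal APIs; only the overall sequence of eligibility check, prediction, confidence gate, and surfacing mirrors the steps above.

    from dataclasses import dataclass

    CONFIDENCE_THRESHOLD = 0.70  # step 4: only surface high-confidence suggestions

    @dataclass
    class ReviewComment:
        text: str
        code_context: str
        is_automated: bool = False
        is_resolved: bool = False
        has_manual_suggestion: bool = False
        file_supported: bool = True

    def is_eligible(comment: ReviewComment) -> bool:
        # Step 2: skip automated, resolved, already-suggested, or unsupported comments.
        return (not comment.is_automated and not comment.is_resolved
                and not comment.has_manual_suggestion and comment.file_supported)

    def predict_edit(comment: ReviewComment) -> tuple[str, float]:
        # Step 3: stand-in for the model call; returns (suggested_edit, confidence).
        return comment.code_context.replace("a - b", "a + b"), 0.9

    def handle_comment(comment: ReviewComment) -> str | None:
        # Step 1: invoked for each new incoming reviewer comment.
        if not is_eligible(comment):
            return None
        suggestion, confidence = predict_edit(comment)
        if confidence < CONFIDENCE_THRESHOLD:  # step 4: confidence gate
            return None
        # Step 5: the real system surfaces the edit in the review UI and IDE
        # and logs whether the author previews and applies it.
        return suggestion

    print(handle_comment(ReviewComment("Add, don't subtract.", "return a - b")))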

Deploying and refining the system

Before rolling out the system, the research team conducted several rounds of refinement, testing the model on a held-out dataset to see how well it predicted correct edits.

Then, the beta tool was deployed to a small group of “friendly” users, where it was refined further through user feedback metrics. Specifically, researchers measured the number of comments produced in a day, the number of predictions the model made, the number of those predictions that were previewed, and how many of those were applied or received a thumbs up/thumbs down.

The tool was then deployed to 50% of Google’s developer population, refined further, and finally rolled out to the full developer population.
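
The paper does not detail the rollout mechanics, but a common way to implement this kind of staged rollout is to hash each developer's identity into a stable bucket and gate on the current rollout percentage, so the enrolled population only grows as the percentage increases. A generic sketch (my assumption, not Google's actual mechanism):

    import hashlib

    def in_rollout(user_id: str, rollout_percent: float) -> bool:
        """Deterministically assign user_id to a bucket in [0, 100) and gate on it."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < rollout_percent

    # The same developer stays enrolled as the rollout widens from 50% to 100%.
    print(in_rollout("dev@example.com", 50))
    print(in_rollout("dev@example.com", 100))  # always True at full rollout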

Throughout this process, the research team made several important refinements to the model and system that improved performance and usability. For example, a seemingly small change to the way the suggested edits were shown to developers (a wording tweak and visual change) improved the percentage of edits previewed by developers from 20% to 30%.

Evaluating the assistant’s impact

The ultimate goal for the tool was to increase productivity. Google used quantitative metrics and qualitative feedback to measure the system’s impact. On the quantitative side, the team chose to track the following (a rough sketch of how these might be computed from logged events follows the list):

  • Acceptance rate by author: The fraction of all code-review comments that are resolved by the assistant. This measures, out of all (non-automated) comments left by human reviewers, what fraction received an ML-suggested edit that the author accepted and applied directly to their changelist.
  • Prediction coverage: This measures the percentage of comments that receive a prediction.
  • Acceptance rate by reviewer: Similarly, the team measured the percentage of predictions accepted by reviewers.
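
As a rough illustration of how these rates could be derived from logged events, here is a small sketch; the field names are hypothetical, since the paper does not describe Google's logging schema.

    # Hypothetical logged events: one dict per reviewer comment (assumes non-empty inputs).
    def compute_metrics(comments: list[dict]) -> dict:
        human = [c for c in comments if not c["automated"]]
        predicted = [c for c in human if c["got_prediction"]]
        attached = [c for c in predicted if c["reviewer_attached"]]
        applied = [c for c in attached if c["author_applied"]]
        return {
            # Acceptance rate by author: applied ML edits over all human comments.
            "acceptance_rate_by_author": len(applied) / len(human),
            # Prediction coverage: share of comments that received a prediction.
            "prediction_coverage": len(predicted) / len(human),
            # Acceptance rate by reviewer: predictions the reviewer chose to attach.
            "acceptance_rate_by_reviewer": len(attached) / len(predicted),
        }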

After several months of deployment, the tool was addressing roughly 7.5% of comments produced by code reviewers in their day-to-day work. Considering tens of millions of code-review comments are left by Google developers every year, over 7% ML-assisted comment resolution is a considerable contribution to the company’s total engineering productivity.

Additionally, around half of all eligible comments received predictions. Of those predictions, over 63% were accepted by the reviewer and attached to the comment to be sent to the author. 34% of those suggested edits were previewed by the author. Of those previewed, 70% were accepted and applied to code.
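
Read as a sequential funnel (and treating the roughly 50% coverage as applying to reviewer comments overall), these numbers reconcile with the headline figure: 50% coverage × 63% reviewer acceptance × 34% previewed × 70% applied comes out to about 7.5% of comments resolved end to end. This is my back-of-the-envelope reading of the reported rates, not a calculation from the paper.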

Qualitatively, the research team received positive feedback from developers in internal message boards, who called the assistant's suggestions "sorcery," "magic," and "impressive." For example, reviewers often found that the assistant could suggest the right changes even before they finished typing their comments. This saved time and made the review process more efficient for both reviewers and authors.

Final thoughts

I recently shared Meta’s experiment to reduce code review times, which they achieved by targeting the slowest 25% of code reviews. This study provides another example of a company making targeted improvements to the code review process as a path for improving developer productivity.


Who’s hiring right now

Here is a roundup of recent Developer Experience job openings. Find more open roles here.


That’s it for this week. Thanks for reading.

-Abi

Comments

Hamid Davoodi, Software Engineer:

For some reason I'm not able to download the paper :/

Thanks for sharing this Abi! AI-powered code review is a promising innovation from Google with the potential to significantly boost developer productivity. Dev teams typically dedicate a significant portion of their time, often between 10% and 30%, to code reviews; utilizing AI to assist with this process can free up valuable developer resources. An automated AI code review tool could be particularly beneficial compared to auto code generators like Copilot. While Copilot offers productivity gains, some organizations are hesitant due to potential intellectual property (IP) concerns related to the training data used in some AI models. Widespread adoption and generalizability are key for maximizing the impact of this technology.

Henry Hund, Building AI SRE Agents to fix on call and incident response:

Very interesting! This is also a great example of leveraging AI to streamline development processes. Implementing AI-powered tools for code reviews not only saves time but also enhances code quality and consistency.
