The impact of AI tooling on engineering at ANZ Bank
This is the latest issue of my newsletter. Each week I share research and perspectives on developer productivity. Subscribe here to get future issues.
This week I read The Impact of AI Tooling on Engineering at ANZ Bank by members of ANZ’s Architecture & Engineering organizations. ANZ was interested in the potential productivity gains of using GitHub Copilot, so they conducted an experiment with a smaller group of engineers to help determine whether it should be rolled out to the broader organization. This paper describes the experiment’s setup and results.
My summary of the paper
To evaluate whether Copilot should be used org-wide, the authors of this paper conducted a six-week experiment comparing the tool’s impact on a test group versus a control group. They based their evaluation on measures of productivity, quality, and security.
Experiment design
The experiment included two weeks of preparation and four weeks of actual testing, comparing a test group of engineers using Copilot against a control group working without it.
The experiment demonstrated that Copilot significantly reduced the time engineers take to complete tasks and positively influenced their ability to perform specific functions. However, the research team found no statistically significant improvements in code quality or security as a result of using the tool.
Here’s a closer look at the results:
Impact on speed
Throughout the experiment, participants recorded the time they took to complete each challenge. This data allowed the research team to calculate and compare the average time spent on tasks by both the Copilot group and the control group.
The findings were notable: the group using Copilot completed their tasks 42.36% faster than the control group. Specifically, the control group took an average of 30.98 minutes per task, while the Copilot group averaged 17.86 minutes.
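As a sanity check, the reported group means reproduce the headline figure. This is a minimal sketch using only the two averages quoted above; the hundredth-of-a-point gap to the published 42.36% presumably comes from rounding in the reported means.

```python
# Reproduce the headline speed improvement from the reported group means.
control_mean = 30.98   # control group: average minutes per task (from the paper)
copilot_mean = 17.86   # Copilot group: average minutes per task (from the paper)

# Relative reduction in average task completion time.
speedup_pct = (control_mean - copilot_mean) / control_mean * 100
print(f"{speedup_pct:.2f}% faster")  # prints "42.35% faster"
```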
When we look closer at the impact for engineers with different levels of Python proficiency, we can see that Copilot was beneficial for participants of all skill levels, but it was most helpful for those who were ‘Expert’ Python programmers.
This is intriguing because it conflicts with a GitHub study which found that developers with less programming experience benefited the most from Copilot. The GitHub study also measured task completion time but did not restrict participants to a specific programming language, unlike the ANZ study. It’s possible that the ‘Expert’ Python programmers at ANZ were more effective at using Copilot; however, this is not certain.
Participants also reported the difficulty of each task. We can see that Copilot gave the largest improvement when completing ‘Hard’ tasks. This observation makes sense: harder tasks have more opportunities where AI-assisted tools can help.
As for measures of quality and security, the Copilot group had a 12.86% higher unit test success ratio; however, this result was not statistically significant. The experiment was also unable to generate meaningful data to measure code security, although the data suggests that Copilot did not introduce any major security issues into the code.
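To make “not statistically significant” concrete, here is a minimal sketch of the kind of check a team could run on a unit-test success ratio: a two-sided, two-proportion z-test in plain Python. The pass/fail counts below are invented for illustration (the paper reports only the 12.86% ratio difference), and the paper does not specify which statistical test ANZ actually used.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions, using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF, computed via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts (NOT from the paper): unit tests passed out of total attempted.
z, p = two_proportion_z_test(27, 40, 24, 40)  # Copilot group vs. control group
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05 here, so the gap would not be called significant
```

With a small sample like this, even a ~12% higher pass ratio yields p > 0.05, which illustrates how an observed improvement can still fail a significance test.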
Impact on developer experience
Across all areas, engineers responded positively regarding GitHub Copilot. They felt it helped them review and understand existing code, create documentation, and test their code. Additionally, they felt Copilot helped them spend less time debugging and reduced their overall development time. They also found the suggestions provided by Copilot to be somewhat helpful and generally in line with their project’s coding standards. It should be noted, however, that while sentiment was positive, it was moderate.
Ultimately, the experiment provided clear results about Copilot’s impact on the speed and ease of completing tasks in engineering. The authors recommended its wider adoption, and by the time the paper was published, over 1,000 users had already integrated Copilot into their workflows.
Final thoughts
While the findings from this study are interesting, I’m mostly inspired by how the organization approached its adoption of Copilot. In its simplest form, they established a baseline, ran an A/B test, and selected a range of metrics to assess the tool’s impact. It’s a great example for organizations looking to evaluate the effectiveness of a tool and determine whether it should be adopted on a larger scale.
Measure GenAI adoption and impact
We recently published a free guide on how to measure adoption and impact of AI tools like GitHub Copilot. You can get a copy of the guide here.
Who’s hiring right now
Here’s a roundup of new Developer Experience job openings:
Find more DevEx job postings here.
That’s it for this week. Thanks for reading.
-Abi
Tech Lead | Cloud | DevSecOps
5 months ago: Certainly some highlights in the experiment design to learn from. I wonder what the ANZ experiment team would recommend or do differently next time. Curious whether the weekly Python programming problems were drawn from ANZ’s codebases, and whether they were solvable independently, without consulting others for clarification. A 42% speed-up on well-specified, independent programming challenges cannot be generalised to a 42% gain on engineering tasks; engineering tasks rarely come with a blank canvas and precise requirements.
Director Of Applications Development at IDEA Public Schools
7 months ago: Our team has been conducting a similar evaluation of the impact of AI-assisted development over the past year, using simple metrics of lead time and cycle time as the measure. What we found is that as prompt quality improved over time, so did the gains in velocity. The overall improvement was about a 38% gain in velocity, which supports the paper’s findings. Additionally, we discovered that the gain was higher the more senior the developer, which may seem counterintuitive; however, it pointed to the need for a different prompting approach for junior developers, one that ensures they are still actively learning while benefiting from AI tools.
Senior Staff Engineer at Lendable
7 months ago: Overall I think this study is very misleading. It doesn’t show that engineers can solve all tasks 42% quicker; it shows the tool helps them solve tasks that cover a very small part of what they do and that the models are already heavily trained on. I’d almost say this provides little value, if any, in the real world. Now that they have run the test, presumably to justify a rollout across all teams, is the business now expecting 42% more output from the engineering teams?