SonarQube benchmark study - CodeScene 6x more accurate
SonarQube is the leading static code analysis tool, widely used by enterprises around the globe for its source code metrics. Despite this popularity, the tool has its flaws: Sonar is often criticized for producing a high ratio of false positives. Perhaps more concerning, though, is the tool's approach to code maintainability metrics, which is misleading. This is particularly damaging since maintainability is more important than ever now that Copilot and other AI assistants are accelerating the production of new code.
With these challenges in mind, we set out to compare SonarQube’s maintainability scores against CodeScene’s Code Health metric. We’ll cover the details soon, but let’s look at the benchmarking results:
Now, we weren’t surprised that Code Health outperformed Sonar’s maintainability rating – after all, Code Health was developed as a reaction to the shortcomings of Sonar.
What did surprise us was the margin: CodeScene is 6 times more accurate than SonarQube and performs at the level of human expert developers.
Benchmarking showdown: SonarQube faces CodeScene
Granted, software maintainability might not be the coolest kid in town: we all think it’s more fun to discuss new tech. Yet the basic fact remains: unless we sustain a maintainable codebase, no new tool, programming language, or infrastructure will help. There are two key reasons for this:
Consequently, reliable maintainability metrics are fundamental. So how do we know if one metric is better than the other?
Fortunately, there’s a public software maintainability dataset thanks to the work of Schnappinger et al. They had 70 human experts assess and rate 500 files from 9 Java projects for maintainability. The data includes a total of 1.4 million lines of manually reviewed code. It’s a massive effort, which makes it possible to establish a ground truth for code quality (e.g., readability and understandability).
The beauty of a public benchmark is that anyone can reproduce these numbers using the same tools, which reduces bias and enables a fair comparison. We’re grateful to the researchers for their hard work – it’s a valuable service to the whole industry.
In the study, we simply ran SonarQube and CodeScene on the benchmarking dataset to compare their code quality metrics:
For a balanced benchmarking measure, you want to consider both recall and precision. That is, the metric should reliably identify problematic code without missing real problems (false negatives) and without flagging “good” code as unmaintainable (false positives). (For the statisticians reading this, the benchmarking study calculated an F-score.)
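To make the scoring concrete, here is a minimal sketch in Python of how a tool’s verdicts could be scored against an expert ground truth. The file names and labels below are hypothetical placeholders for illustration – they are not the actual benchmark data, and the study’s exact evaluation pipeline may differ.

```python
# Hypothetical sketch: scoring a tool's maintainability verdicts against
# expert ground truth. The labels are invented for illustration only;
# 1 = unmaintainable, 0 = maintainable.
expert = {"OrderService.java": 1, "InvoiceMapper.java": 0,
          "ReportBuilder.java": 1, "UserValidator.java": 0}
tool   = {"OrderService.java": 1, "InvoiceMapper.java": 0,
          "ReportBuilder.java": 0,   # missed problem -> false negative
          "UserValidator.java": 0}

tp = sum(1 for f in expert if expert[f] == 1 and tool[f] == 1)  # true positives
fp = sum(1 for f in expert if expert[f] == 0 and tool[f] == 1)  # false positives
fn = sum(1 for f in expert if expert[f] == 1 and tool[f] == 0)  # false negatives

precision = tp / (tp + fp)  # of the flagged files, how many are truly problematic?
recall = tp / (tp + fn)     # of the truly problematic files, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With these toy labels the tool misses one real problem: precision stays perfect, but recall drops to 0.5 and the F-score lands at 0.67. That is exactly why the benchmark uses a combined measure rather than precision or recall alone.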
As we see in the preceding table, CodeScene came very close to the level of the human experts (88% for the experts vs. 83% for the tool). The obvious difference is that the tool does its analysis in seconds, whereas an expert needs weeks for this amount of code.
Anyway, those numbers are in stark contrast to SonarQube’s Maintainability Index, which achieved a mere 13.3% correctness level. Let’s consider the implications.
The danger of poor code level software metrics
What do these scores mean in practice? Well, as stated in the introduction, maintainable code is fundamental to any company building software. It’s a competitive advantage, impacting everything from roadmap execution to both developer productivity and happiness. As such, we always recommend that businesses make code quality a key performance indicator, reported and discussed at the leadership level. It’s that important. However, a maintainability metric with a mere 13.3% performance does more harm than good:
There are plenty of examples of both types of problems in the SonarQube reports from the benchmarking data. Perhaps even more damaging is the trust issue. Given SonarQube’s successful market penetration, many of the companies we meet think of the tool the moment they hear the term “code quality”. Over the years, this has implanted the false perception in our industry that code quality metrics are “vague”, “subjective”, “noisy”, and generally not of much use.
As such, a poor metric does more harm than good. Fortunately, we see in this benchmarking report that there are objectively better and more modern metrics. So let’s close with a quick discussion of why CodeScene’s Code Health metric outperforms SonarQube’s maintainability rating.
Why is Code Health the better software metric?
CodeScene originally didn’t plan to develop a new maintainability metric. Instead, the idea was to use a third-party tool – like SonarQube – for the code quality measures and focus 100% on the behavioral code analysis aspects. However, early experiments quickly indicated what we have now studied more formally:
SonarQube’s metrics simply aren’t good enough to predict maintenance problems.
Hence, there was no choice but to build a better alternative. The key to the strong performance of the Code Health measure is that it was built from first principles, based on research that takes a much broader perspective: instead of focusing on minor style issues, Code Health detects higher-level design problems – the kind of code constructs that genuinely impact our ability to understand and, consequently, maintain code.
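To make that distinction concrete, here is a small, hypothetical example (in Python for brevity, even though the benchmark code is Java). The first function only has a minor style issue of the kind a linter would flag; the second shows the sort of design-level construct – deep nesting and mixed responsibilities in one long function – that actually hurts our ability to understand and change the code. The example is ours, for illustration only, and is not taken from CodeScene’s rule set.

```python
# Illustrative only: a minor style nit vs. a design-level maintainability problem.

def add_tax(p):  # style nit: terse parameter name; trivial to fix, low impact
    return p * 1.25


def process_order(order, customer, inventory, pricing, notifier):
    """Design-level problem: one function mixes validation, pricing,
    stock handling, and notification behind deeply nested branching."""
    if order is not None:
        if customer.get("active"):
            for line in order["lines"]:
                if line["sku"] in inventory:
                    if inventory[line["sku"]] >= line["qty"]:
                        price = pricing[line["sku"]] * line["qty"]
                        if customer.get("vip"):
                            price *= 0.9
                        inventory[line["sku"]] -= line["qty"]
                        order["total"] = order.get("total", 0) + price
                    else:
                        notifier(f"backorder {line['sku']}")
                else:
                    notifier(f"unknown sku {line['sku']}")
        else:
            notifier("inactive customer")
    return order
```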
CodeScene has also published this research in peer-reviewed academic publications (see, for example, The Business Impact of Code Quality). This was all very intentional: as tool vendors, we have a massive responsibility to ensure that what we sell works as advertised, and there’s no stronger assurance than the scientific method. CodeScene was – and will continue to be – developed with one foot firmly planted in research land. SonarQube did succeed in bringing code analysis to the mainstream, and the tool might have been the obvious choice in 2007.
However, our calendar now says “2024”. It’s time for the next generation of code analysis – your code deserves better!
Interested in deciding on your own? Sign up for a free trial on CodeScene.
Comments

Building developer tool strategies · 8 months ago
At first, you got me thinking with the whole “human expert” premise. But I really appreciate the public benchmarking approach, it speaks volumes! I’d only argue that formatting this comparison as either `enterprise v enterprise` or `community v community` would help sway some more naysayers from the get-go. I saw the discussion in the thread below, but not everyone will, and clarity is everything.

SRE / DevOps - Manager · 9 months ago
It looks promising. Just a thought: why did you compare a community product with an enterprise product? Wouldn’t it be fairer if the comparison had been made with SonarCloud?

While obviously an advertising text, it is a nice benchmark comparison. Thanks, CodeScene! I wish other analysis tool vendors would also attempt to compete on a commonly accepted benchmark. That is long overdue.