SonarQube benchmark study - CodeScene 6x more accurate
SonarQube is the leading static code analysis tool, widely used by enterprises around the globe for its source code metrics. Despite this popularity, the tool has its flaws: Sonar is often criticized for producing a high ratio of false positives. Perhaps more concerning, though, is the tool's approach to code maintainability metrics, which is misleading. This is particularly damaging since maintainability is more important than ever now that Copilot and other AI assistants are accelerating the production of new code.
With these challenges in mind, we set out to compare SonarQube’s maintainability scores against CodeScene’s Code Health metric. We’ll cover the details soon, but let’s look at the benchmarking results:
Now, we weren’t surprised that Code Health outperformed Sonar’s maintainability rating – after all, Code Health was developed as a reaction to the shortcomings of Sonar.
What did surprise us was the margin: CodeScene is 6 times more accurate than SonarQube and performs at the level of human expert developers.
Benchmarking showdown: SonarQube faces CodeScene
Granted, software maintainability might not be the coolest kid in town: we all think it’s more fun to discuss new tech. Yet the basic fact remains: unless we sustain a maintainable codebase, no new tool, programming language, or infrastructure will help. There are two key reasons for this:
Consequently, reliable maintainability metrics are fundamental. So how do we know if one metric is better than the other?
Fortunately, there’s a public software maintainability dataset thanks to the work of Schnappinger et al. They had 70 human experts assess and rate 500 files from 9 Java projects for maintainability. The data includes a total of 1.4 million lines of manually reviewed code. It’s a massive effort, which makes it possible to establish a ground truth for code quality (e.g., readability and understandability).
The beauty of a public benchmark is that anyone can reproduce these numbers using the same tools, which reduces bias and enables a fair comparison. We’re grateful to the researchers for their hard work – it’s a valuable service to the whole industry.
In the study, we simply ran SonarQube and CodeScene on the benchmarking dataset to compare their code quality metrics:
For a balanced benchmarking measure, you want to consider both recall and precision. That is, the metric should reliably identify problematic code without missing real problems (false negatives) and without flagging “good” code as unmaintainable (false positives). (For the statisticians reading this, the benchmarking study calculated an F-score.)
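To make the scoring concrete, here is a minimal sketch in Python of how a tool’s verdicts could be scored against an expert ground truth. The file names and labels below are hypothetical placeholders for illustration – they are not the actual benchmark data, and the study’s exact evaluation pipeline may differ.

```python
# Hypothetical sketch: scoring a tool's maintainability verdicts against
# expert ground truth. The labels are invented for illustration only;
# 1 = unmaintainable, 0 = maintainable.
expert = {"OrderService.java": 1, "InvoiceMapper.java": 0,
          "ReportBuilder.java": 1, "UserValidator.java": 0}
tool   = {"OrderService.java": 1, "InvoiceMapper.java": 0,
          "ReportBuilder.java": 0,   # missed problem -> false negative
          "UserValidator.java": 0}

tp = sum(1 for f in expert if expert[f] == 1 and tool[f] == 1)  # true positives
fp = sum(1 for f in expert if expert[f] == 0 and tool[f] == 1)  # false positives
fn = sum(1 for f in expert if expert[f] == 1 and tool[f] == 0)  # false negatives

precision = tp / (tp + fp)  # of the flagged files, how many are truly problematic?
recall = tp / (tp + fn)     # of the truly problematic files, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With these toy labels the tool misses one real problem: precision stays perfect, but recall drops to 0.5 and the F-score lands at 0.67. That is exactly why the benchmark uses a combined measure rather than precision or recall alone.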
As we see in the preceding table, CodeScene came very close to the level of the human experts (88% for the experts vs. 83% for the tool). The obvious difference is that the tool does its analysis in seconds, whereas an expert needs weeks for this amount of code.
Anyway, those numbers are in stark contrast to SonarQube’s Maintainability Index, which achieved a mere 13.3% correctness level. Let’s consider the implications.
The danger of poor code level software metrics
What do these scores mean in practice? Well, as stated in the introduction, maintainable code is fundamental to any company building software. It’s a competitive advantage, impacting everything from roadmap execution to both developer productivity and happiness. As such, we always recommend that businesses make code quality a key performance indicator, reported and discussed at the leadership level. It’s that important. However, a maintainability metric with a mere 13.3% performance does more harm than good:
There are plenty of examples of both types of problems in the SonarQube reports from the benchmarking data. Perhaps even more damaging is the trust issue. Given SonarQube’s successful market penetration, many of the companies we meet think of the tool the moment they hear the term “code quality”. Over the years, this has implanted the false perception in our industry that code quality metrics are “vague”, “subjective”, “noisy”, and generally not of much use.
As such, a poor metric does more harm than good. Fortunately, we see in this benchmarking report that there are objectively better and more modern metrics. So let’s close with a quick discussion of why CodeScene’s Code Health metric outperforms SonarQube’s maintainability rating.
Why is Code Health the better software metric?
CodeScene originally didn’t plan to develop a new maintainability metric. Instead, the idea was to use a third-party tool – like SonarQube – for the code quality measures and focus 100% on the behavioral code analysis aspects. However, early experiments quickly indicated what we have now studied more formally:
SonarQube’s metrics simply aren’t good enough to predict maintenance problems.
Hence, there was no choice but to build a better alternative. The key to the strong performance of the Code Health measure is that it was built from first principles, based on research that takes a much broader perspective: instead of focusing on minor style issues, Code Health detects higher-level design problems – the kind of code constructs that genuinely impact our ability to understand and, consequently, maintain code.
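To make that distinction concrete, here is a small, hypothetical example (in Python for brevity, even though the benchmark code is Java). The first function only has a minor style issue of the kind a linter would flag; the second shows the sort of design-level construct – deep nesting and mixed responsibilities in one long function – that actually hurts our ability to understand and change the code. The example is ours, for illustration only, and is not taken from CodeScene’s rule set.

```python
# Illustrative only: a minor style nit vs. a design-level maintainability problem.

def add_tax(p):  # style nit: terse parameter name; trivial to fix, low impact
    return p * 1.25


def process_order(order, customer, inventory, pricing, notifier):
    """Design-level problem: one function mixes validation, pricing,
    stock handling, and notification behind deeply nested branching."""
    if order is not None:
        if customer.get("active"):
            for line in order["lines"]:
                if line["sku"] in inventory:
                    if inventory[line["sku"]] >= line["qty"]:
                        price = pricing[line["sku"]] * line["qty"]
                        if customer.get("vip"):
                            price *= 0.9
                        inventory[line["sku"]] -= line["qty"]
                        order["total"] = order.get("total", 0) + price
                    else:
                        notifier(f"backorder {line['sku']}")
                else:
                    notifier(f"unknown sku {line['sku']}")
        else:
            notifier("inactive customer")
    return order
```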
CodeScene has also published this research in peer-reviewed academic publications (see, for example, The Business Impact of Code Quality). This was all very intentional: as tool vendors, we have a massive responsibility to ensure that what we sell works as advertised, and there’s no stronger assurance than the scientific method. CodeScene was – and will continue to be – developed with one foot firmly planted in research land. SonarQube did succeed in bringing code analysis to the mainstream, and the tool might have been the obvious choice in 2007.
However, our calendar now says “2024”. It’s time for the next generation of code analysis – your code deserves better!
Interested in deciding on your own? Sign up for a free trial on CodeScene.
Comments

Building developer tool strategies · 8 months ago
At first, you got me thinking with the whole “human expert” premise. But I really appreciate the public benchmarking approach, it speaks volumes! I’d only argue that formatting this comparison as either `enterprise v enterprise` or `community v community` would help sway some more naysayers from the get-go. I saw the discussion in the thread below, but not everyone will, and clarity is everything.

SRE / DevOps - Manager · 9 months ago
It looks promising. Just a thought: why did you compare a community product with an enterprise product? Wouldn’t it be fairer if the comparison had been made with SonarCloud?

While obviously an advertising text, it is a nice benchmark comparison. Thanks, CodeScene! I wish other analysis tool vendors would also attempt to compete on a commonly accepted benchmark. That is long overdue.