Decoding the Future of Vulnerability Detection: Can LLMs Outperform Traditional Tools?
On the 5th of this month, I got the chance to speak as a keynote speaker at PyCon Estonia. Unlike other conferences I've attended, PyCon felt more developer-focused, and attendees were keener on attending talks and participating in workshops. Overall, a great experience!
The topic of my talk was "Detecting Vulnerabilities - Static Code Analyzers vs Large Language Models".
In simple terms, I wanted to build a case around the capabilities of both static code analyzers and large language models when it comes to identifying vulnerabilities in code.
The recording of the talk will soon be available on YouTube. However, I wanted to share its content here because while conducting research for this talk, I stumbled upon a rather surprising discovery that changed the final conclusion of the talk.
So here we go.
Why Hunt for Vulnerabilities?
Why do we need to hunt for vulnerabilities in the first place?
The simplest reason for hunting down vulnerabilities is that it makes hacking into your application much harder. A vulnerability left unnoticed is essentially an open door for cyberattacks.
However, the implications go deeper, especially for open-source projects. Open-source software is often used by thousands of other applications, meaning that a vulnerability in a single codebase can cascade into widespread security issues across multiple platforms.
Beyond the technical risk, a security breach can severely damage your business's brand reputation and the trust you've worked hard to build with customers.
How Are Vulnerabilities Detected?
Vulnerabilities are usually found in one of two ways:
While a deep dive into the nitty-gritty of vulnerability types is outside the scope of this blog, you can find comprehensive resources on vulnerability detection from Patchstack’s annual security whitepaper, which specifically covers the state of WordPress security—one of the most widely used open-source platforms in the world.
Static Code Analyzers: The Backbone of Traditional Vulnerability Detection
Static code analyzers (SCAs) are rule-based systems designed to analyze code for vulnerabilities, performance issues, errors, and even best-practice violations. They rely on specific patterns and rules to detect potential issues, but they come with limitations: while efficient at catching well-known vulnerability patterns, they struggle with new, unfamiliar threats.
Two core techniques used by SCAs include:
Though highly effective in their own right, SCAs depend heavily on predefined rules, meaning they are only as good as the patterns they are taught to recognize.
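To make this concrete, here's a toy example of the kind of rule-based check an SCA performs: walking a Python file's syntax tree and flagging calls to eval(), a pattern most analyzers treat as a code-injection risk. This is a deliberately simplified sketch, not how a production tool like Snyk works.

```python
# Toy rule-based check: walk a Python file's AST and flag calls to eval(),
# a pattern static analyzers commonly treat as a code-injection risk.
import ast

RULE = "Avoid eval(): executing dynamic strings can lead to code injection."

def find_eval_calls(source: str) -> list[int]:
    """Return the line numbers where eval() is called."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
        ):
            findings.append(node.lineno)
    return findings

code = "user_input = input()\nresult = eval(user_input)\n"
for lineno in find_eval_calls(code):
    print(f"line {lineno}: {RULE}")
```

The limitation is already visible here: the rule catches exactly what it was written to catch, and nothing else.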
Can Large Language Models Do It Better?
Enter Large Language Models (LLMs) like GPT-4. Recently, a study by David Noever compared the performance of GPT-4 against Snyk, a widely used SCA, on complex public repositories that included NASA software and other high-profile projects. In his research, Noever set out to test the limits of LLMs in vulnerability detection, and the results were surprising.
According to the findings, GPT-4 identified four times as many vulnerabilities as Snyk across these codebases. But how exactly do LLMs achieve this superior performance?
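To make the idea concrete, here's a minimal sketch of how you might ask an LLM to review a snippet for vulnerabilities. The prompt and setup are my own illustration, not the methodology used in Noever's study.

```python
# Hypothetical sketch: asking GPT-4 to review a snippet for vulnerabilities.
# Requires the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

snippet = """
import sqlite3

def get_user(conn, username):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE name = '" + username + "'")
    return cursor.fetchone()
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a security reviewer. List any vulnerabilities with a one-line explanation each."},
        {"role": "user", "content": snippet},
    ],
)
print(response.choices[0].message.content)  # should flag the SQL injection
```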
Why LLMs Are Superior at Detecting Vulnerabilities
LLMs like GPT-4 outperform SCAs because they possess the ability to generalize and learn from vast amounts of data. Unlike SCAs, which are bound by a fixed set of rules, LLMs are trained on millions of lines of code across various programming languages, making them adaptable to new patterns. Their key advantages include:
To understand this better, we need to look at a very early example, and perhaps the cornerstone of deep learning's success: AlexNet.
It's a famous paper by some equally famous scientists: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
AlexNet is a deep learning model that was designed to compete in the ImageNet competition in 2012. If you're not familiar with ImageNet, it's one of the largest labeled image datasets in the world, with over 14 million annotated images.
The dataset came with a yearly competition in which different algorithms and computer vision techniques competed to classify a wide range of images.
Before AlexNet, this competition was dominated by other ML algorithms such as SVMs, shallow neural networks, and hand-engineered computer vision programs.
In 2012, AlexNet performed exceptionally well. It significantly outperformed the runner-up model by reducing the top-5 error rate from 26.2% to 15.3%.
Traditional computer vision methods relied heavily on handcrafted features. They struggled to capture the wide variety of patterns present in images.
AlexNet learned features directly from the data through feature learning. Its multiple layers allowed it to learn a hierarchy of features, from simple edges in early layers to complex patterns and objects in deeper layers.
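To illustrate that hierarchy, here's a toy convolutional network in PyTorch: early layers pick up simple edges and colour blobs, deeper layers combine them into textures and shapes. It's a simplified sketch for intuition, not AlexNet itself.

```python
# Toy CNN illustrating a feature hierarchy; a simplified sketch, not AlexNet.
import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, colour blobs
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: textures, shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 2),                   # classifier head: cat vs dog
)

logits = tiny_cnn(torch.randn(1, 3, 224, 224))    # one fake 224x224 RGB image
print(logits.shape)                               # torch.Size([1, 2])
```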
If a deep neural network is trained on thousands of cat and dog images and is then shown an image of a cheetah, it will label it as a cat with higher confidence than as a dog, and the reverse happens when it's shown a wolf.
It learns to identify patterns like fur texture, ear shapes, and body structure that are common across these categories. When presented with an image of a cheetah, which shares many visual similarities with a cat, the network is likely to classify it as a cat. It has never seen a cheetah before, so it generalizes based on the features it learned from the cat images.
That's pretty much what happens when you train an LLM on a lot of data: it learns to generalize.
A traditional computer vision program built to recognize cats, on the other hand, might fail on a cheetah or a lion. Such programs often rely on handcrafted features tailored to a specific task.
A program built specifically to recognize cats would use features that are typical of domestic cats.
How well do open-source models perform?
Proprietary models, whether you go with GPT-4 or Claude, become expensive very quickly.
Secondly, you are bound to share your private data with these models, or rather with the companies that own them.
Finally, you have no visibility into what's happening behind the scenes and no control over optimization, since you don't own the code and can't do anything other than make API requests.
There are other challenges too such as rate limits, availability, latency, etc.
So what's the next best thing? Open-source LLMs.
My own little study
To work around this, I ran a little study of my own in which I replaced GPT-4 with open-source models. To keep things practical, I went for medium-sized models rather than the largest open-source models out there.
The reason is that I wanted to fine-tune the models rather than simply deploy them on a cloud instance and run inference, and fine-tuning anything beyond roughly 12B parameters would require serious compute.
I chose the Llama 3.1 8B and Mistral 7B open-source models.
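For models of this size, fine-tuning on a single cloud GPU typically relies on parameter-efficient methods such as LoRA. Here's a simplified sketch of what that looks like with the Hugging Face transformers and peft libraries; the model ID, target modules, and hyperparameters are illustrative, not my exact configuration.

```python
# Sketch: attach LoRA adapters to Llama 3.1 8B for parameter-efficient fine-tuning.
# The model ID and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # requires access approval on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# LoRA trains a small set of adapter weights instead of all 8B parameters,
# which is what keeps fine-tuning feasible on a single GPU instance.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training then proceeds with a standard supervised fine-tuning loop
# (e.g., the Hugging Face Trainer) over the labeled vulnerability examples.
```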
Some important metrics to consider
The winner should have a higher...
Recall tells us how many of the actually vulnerable (or non-vulnerable) instances the model manages to identify.
Precision tells us, out of all the instances the model predicts as vulnerable (or non-vulnerable), how many actually are.
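To make the two metrics concrete, here's a tiny worked example using scikit-learn; the numbers are made up, not taken from my experiments.

```python
# Toy illustration of precision and recall with hypothetical predictions.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = vulnerable, 0 = non-vulnerable
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # model output

print("Precision:", precision_score(y_true, y_pred))  # 4 correct out of 5 flagged -> 0.8
print("Recall:   ", recall_score(y_true, y_pred))     # 4 found out of 5 actual  -> 0.8
```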
Let's look at the results!
The fine-tuned open-source models, Llama 3.1 8B and Mistral 7B, were tested on a dataset of PHP vulnerabilities: they were fine-tuned on 2,400 rows of data and evaluated on 216 held-out rows. However, they did not perform as well as GPT-4, achieving only 50% accuracy compared to GPT-4's 64%.
The biggest surprise in the study was the performance of XGBoost, which is not a large language model; in fact, it's not even a deep learning model. It's a classical gradient-boosted ensemble machine learning algorithm.
In this study, XGBoost surpassed both the open-source models and GPT-4, the current state-of-the-art LLM, by 1% in recall and 2% in precision.
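For context, an XGBoost baseline for this task is just a classical classifier over engineered text features, roughly something like the sketch below. It's a simplified illustration, not the exact pipeline from my experiments.

```python
# Simplified sketch of a classical baseline: character n-gram TF-IDF features
# over PHP snippets fed into an XGBoost classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

snippets = [
    "<?php echo $_GET['name']; ?>",            # vulnerable: reflected XSS
    "<?php echo htmlspecialchars($name); ?>",  # safe: output is escaped
]
labels = [1, 0]  # 1 = vulnerable, 0 = non-vulnerable

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # character n-grams of code
    XGBClassifier(n_estimators=200, max_depth=6),
)
model.fit(snippets, labels)
print(model.predict(["<?php echo $_REQUEST['q']; ?>"]))
```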
So why couldn't the open-source models perform well?
To answer this, we first need to understand what fine-tuning really is.
Fine-tuning teaches a model to mimic the format of the expected output, but it doesn't substantially improve its underlying understanding of complex patterns in code.
In other words, fine-tuning adds very little to the capabilities of the base model: the model isn't being re-trained on new knowledge, it's being nudged to produce output in the format you want.
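To see what that means in practice, a supervised fine-tuning row is essentially just an instruction/response pair, something like the hypothetical example below (not the actual schema of my dataset).

```python
# Hypothetical example of one supervised fine-tuning row: the model learns to
# reproduce the *shape* of the answer (a verdict plus a short reason), not new
# security knowledge it didn't already have.
training_row = {
    "instruction": "Classify the following PHP snippet as VULNERABLE or SAFE and name the issue.",
    "input": "<?php echo $_GET['name']; ?>",
    "output": "VULNERABLE: reflected XSS, user input is echoed without escaping.",
}
print(training_row["output"])
```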
What’s the Verdict?
So, should you use SCAs, LLMs, or a combination of both?
LLMs clearly have the potential to reshape how we detect vulnerabilities, offering a more comprehensive and adaptable approach. However, they are not without their challenges.
The high cost and potential privacy issues associated with proprietary models like GPT-4, coupled with the still-developing performance of open-source models, mean that LLMs might not yet be a perfect standalone solution.
One potential path forward is to combine the strengths of both approaches. SCAs can be used to identify known vulnerabilities, while LLMs like GPT-4 can propose fixes or detect more complex, previously unknown threats. A hybrid approach could be the best way to create a robust security framework for your application.
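To close, here's a rough sketch of what such a hybrid pipeline might look like in code: a cheap rule-based pass first, with an LLM providing a deeper second opinion. Treat it as an illustration of the idea, not a reference implementation.

```python
# Hypothetical hybrid pipeline: fast rule-based screening first, LLM review second.
import ast
from openai import OpenAI

def rule_based_findings(source: str) -> list[str]:
    """Cheap first pass: flag eval() calls via the AST."""
    return [
        f"line {node.lineno}: eval() call"
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "eval"
    ]

def llm_review(source: str) -> str:
    """Slower second pass: ask an LLM about issues the rules don't cover."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a security reviewer."},
            {"role": "user", "content": f"List vulnerabilities in:\n{source}"},
        ],
    )
    return response.choices[0].message.content

def review(source: str) -> None:
    for finding in rule_based_findings(source):
        print("SCA:", finding)       # known patterns, caught instantly
    print("LLM:", llm_review(source))  # deeper, slower second opinion
```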