Decoding the Future of Vulnerability Detection: Can LLMs Outperform Traditional Tools?
On the 5th of this month, I got the chance to speak as a keynote speaker at PyCon Estonia. Unlike other conferences I've attended, PyCon felt more developer-focused, and attendees were keener on attending talks and participating in workshops. Overall, a great experience!
The topic of my talk was "Detecting Vulnerabilities - Static Code Analyzers vs Large Language Models".
In simple terms, I wanted to build a case around the capabilities of both static code analyzers and large language models when it comes to identifying vulnerabilities in code.
The recording of the talk will soon be available on YouTube. However, I wanted to share its content here because while conducting research for this talk, I stumbled upon a rather surprising discovery that changed the final conclusion of the talk.
So here we go.
Why Hunt for Vulnerabilities?
Why do we need to hunt for vulnerabilities in the first place?
The simplest reason for hunting down vulnerabilities is that it makes hacking into your application much harder. A vulnerability left unnoticed is essentially an open door for cyberattacks.
However, the implications go deeper, especially for open-source projects. Open-source software is often used by thousands of other applications, meaning that a vulnerability in a single codebase can cascade into widespread security issues across multiple platforms.
Beyond the technical risk, a security breach can severely damage your business's brand reputation and the trust you've worked hard to build with customers.
How Are Vulnerabilities Detected?
Vulnerabilities are usually found in one of two ways:
While a deep dive into the nitty-gritty of vulnerability types is outside the scope of this blog, you can find comprehensive resources on vulnerability detection from Patchstack’s annual security whitepaper, which specifically covers the state of WordPress security—one of the most widely used open-source platforms in the world.
Static Code Analyzers: The Backbone of Traditional Vulnerability Detection
Static code analyzers (SCAs) are rule-based systems designed to analyze code for vulnerabilities, performance issues, errors, and even best-practice violations. They rely on specific patterns and rules to detect potential issues, but they come with limitations: while efficient at catching well-known vulnerability patterns, they struggle with new, unfamiliar threats.
Two core techniques used by SCAs include:
Though highly effective in their own right, SCAs depend heavily on predefined rules, meaning they are only as good as the patterns they are taught to recognize.
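To make this concrete, here's a toy example of the kind of rule-based check an SCA performs: walking a Python file's syntax tree and flagging calls to eval(), a pattern most analyzers treat as a code-injection risk. This is a deliberately simplified sketch, not how a production tool like Snyk works.

```python
# Toy rule-based check: walk a Python file's AST and flag calls to eval(),
# a pattern static analyzers commonly treat as a code-injection risk.
import ast

RULE = "Avoid eval(): executing dynamic strings can lead to code injection."

def find_eval_calls(source: str) -> list[int]:
    """Return the line numbers where eval() is called."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
        ):
            findings.append(node.lineno)
    return findings

code = "user_input = input()\nresult = eval(user_input)\n"
for lineno in find_eval_calls(code):
    print(f"line {lineno}: {RULE}")
```

The limitation is already visible here: the rule catches exactly what it was written to catch, and nothing else.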
Can Large Language Models Do It Better?
Enter Large Language Models (LLMs) like GPT-4. Recently, a study by David Noever compared the performance of GPT-4 against Snyk, a widely used SCA, on complex public repositories that included NASA software and other high-profile projects. In his research, Noever set out to test the limits of LLMs in vulnerability detection, and the results were surprising.
According to the findings, GPT-4 identified four times as many vulnerabilities as Snyk across these codebases. But how exactly do LLMs achieve this superior performance?
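To make the idea concrete, here's a minimal sketch of how you might ask an LLM to review a snippet for vulnerabilities. The prompt and setup are my own illustration, not the methodology used in Noever's study.

```python
# Hypothetical sketch: asking GPT-4 to review a snippet for vulnerabilities.
# Requires the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

snippet = """
import sqlite3

def get_user(conn, username):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE name = '" + username + "'")
    return cursor.fetchone()
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a security reviewer. List any vulnerabilities with a one-line explanation each."},
        {"role": "user", "content": snippet},
    ],
)
print(response.choices[0].message.content)  # should flag the SQL injection
```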
Why LLMs Are Superior at Detecting Vulnerabilities
LLMs like GPT-4 outperform SCAs because they possess the ability to generalize and learn from vast amounts of data. Unlike SCAs, which are bound by a fixed set of rules, LLMs are trained on millions of lines of code across various programming languages, making them adaptable to new patterns. Their key advantages include:
To understand this better, we need to look at a very early example, and perhaps the cornerstone of deep learning's success: AlexNet.
It's a famous paper by some equally famous scientists: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
AlexNet is a deep learning model that was designed to compete in the ImageNet competition in 2012. If you're not familiar with ImageNet, it's one of the largest labeled image datasets in the world, with over 14 million annotated images.
The dataset came with a yearly competition in which different algorithms and computer vision techniques competed to classify a wide range of images.
Before AlexNet, this competition was dominated by other ML algorithms such as SVMs, shallow neural networks, and hand-engineered computer vision programs.
In 2012, AlexNet performed exceptionally well. It significantly outperformed the runner-up model by reducing the top-5 error rate from 26.2% to 15.3%.
Traditional computer vision methods relied heavily on handcrafted features. They struggled to capture the wide variety of patterns present in images.
AlexNet learned features directly from the data through feature learning. Its multiple layers allowed it to learn a hierarchy of features, from simple edges in early layers to complex patterns and objects in deeper layers.
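To illustrate that hierarchy, here's a toy convolutional network in PyTorch: early layers pick up simple edges and colour blobs, deeper layers combine them into textures and shapes. It's a simplified sketch for intuition, not AlexNet itself.

```python
# Toy CNN illustrating a feature hierarchy; a simplified sketch, not AlexNet.
import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, colour blobs
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: textures, shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 2),                   # classifier head: cat vs dog
)

logits = tiny_cnn(torch.randn(1, 3, 224, 224))    # one fake 224x224 RGB image
print(logits.shape)                               # torch.Size([1, 2])
```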
If a deep neural network is trained on thousands of cat and dog images and is then shown an image of a cheetah, it will label it as a cat with higher confidence than as a dog, and the reverse happens when it's shown a wolf.
It learns to identify patterns like fur texture, ear shapes, and body structure that are common across these categories. When presented with an image of a cheetah, which shares many visual similarities with a cat, the network is likely to classify it as a cat. It has never seen a cheetah before, so it generalizes based on the features it learned from the cat images.
That's pretty much what happens when you train an LLM on a lot of data: it learns to generalize.
A traditional computer vision program built to recognize cats, on the other hand, might fail on a cheetah or a lion. Such programs often rely on handcrafted features tailored to a specific task.
A program built specifically to recognize cats would use features that are typical of domestic cats.
How well do open-source models perform?
Proprietary models, whether you go with GPT-4 or Claude, become expensive very quickly.
Secondly, you are bound to share your private data with these models, or rather with the companies that own them.
Finally, you have no visibility into what's happening behind the scenes and no control over optimization, since you don't own the code and can't do anything other than make API requests.
There are other challenges too such as rate limits, availability, latency, etc.
So what's the next best thing? Open-source LLMs.
My own little study
To work around this, I ran a little study of my own in which I replaced GPT-4 with open-source models. To keep things practical, I went for medium-sized models rather than the largest open-source models out there.
The reason is that I wanted to fine-tune the models rather than simply deploy them on a cloud instance and run inference, and fine-tuning anything beyond roughly 12B parameters would require serious compute.
I chose the Llama 3.1 8B and Mistral 7B open-source models.
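For models of this size, fine-tuning on a single cloud GPU typically relies on parameter-efficient methods such as LoRA. Here's a simplified sketch of what that looks like with the Hugging Face transformers and peft libraries; the model ID, target modules, and hyperparameters are illustrative, not my exact configuration.

```python
# Sketch: attach LoRA adapters to Llama 3.1 8B for parameter-efficient fine-tuning.
# The model ID and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # requires access approval on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# LoRA trains a small set of adapter weights instead of all 8B parameters,
# which is what keeps fine-tuning feasible on a single GPU instance.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training then proceeds with a standard supervised fine-tuning loop
# (e.g., the Hugging Face Trainer) over the labeled vulnerability examples.
```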
Some important metrics to consider
The winner should have a higher...
Recall tells us how many of the actually vulnerable (or non-vulnerable) instances the model manages to identify.
Precision tells us, out of all the instances the model predicts as vulnerable (or non-vulnerable), how many actually are.
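To make the two metrics concrete, here's a tiny worked example using scikit-learn; the numbers are made up, not taken from my experiments.

```python
# Toy illustration of precision and recall with hypothetical predictions.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = vulnerable, 0 = non-vulnerable
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # model output

print("Precision:", precision_score(y_true, y_pred))  # 4 correct out of 5 flagged -> 0.8
print("Recall:   ", recall_score(y_true, y_pred))     # 4 found out of 5 actual  -> 0.8
```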
Let's look at the results!
The fine-tuned open-source models, Llama 3.1 8B and Mistral 7B, were tested on a dataset of PHP vulnerabilities: they were fine-tuned on 2,400 rows of data and evaluated on 216 held-out rows. However, they did not perform as well as GPT-4, achieving only 50% accuracy compared to GPT-4's 64%.
The biggest surprise in the study was the performance of XGBoost, which is not a large language model; in fact, it's not even a deep learning model. It's a classical gradient-boosted ensemble machine learning algorithm.
In this study, XGBoost surpassed both the open-source models and GPT-4, the current state-of-the-art LLM, by 1% in recall and 2% in precision.
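For context, an XGBoost baseline for this task is just a classical classifier over engineered text features, roughly something like the sketch below. It's a simplified illustration, not the exact pipeline from my experiments.

```python
# Simplified sketch of a classical baseline: character n-gram TF-IDF features
# over PHP snippets fed into an XGBoost classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

snippets = [
    "<?php echo $_GET['name']; ?>",            # vulnerable: reflected XSS
    "<?php echo htmlspecialchars($name); ?>",  # safe: output is escaped
]
labels = [1, 0]  # 1 = vulnerable, 0 = non-vulnerable

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # character n-grams of code
    XGBClassifier(n_estimators=200, max_depth=6),
)
model.fit(snippets, labels)
print(model.predict(["<?php echo $_REQUEST['q']; ?>"]))
```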
So why couldn't the open-source models perform well?
To answer this, we first need to understand what fine-tuning really is.
Fine-tuning teaches a model to mimic the format of the expected output, but it doesn't substantially improve its underlying understanding of complex patterns in code.
In other words, fine-tuning adds very little to the capabilities of the base model: the model isn't being re-trained on new knowledge, it's being nudged to produce output in the format you want.
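To see what that means in practice, a supervised fine-tuning row is essentially just an instruction/response pair, something like the hypothetical example below (not the actual schema of my dataset).

```python
# Hypothetical example of one supervised fine-tuning row: the model learns to
# reproduce the *shape* of the answer (a verdict plus a short reason), not new
# security knowledge it didn't already have.
training_row = {
    "instruction": "Classify the following PHP snippet as VULNERABLE or SAFE and name the issue.",
    "input": "<?php echo $_GET['name']; ?>",
    "output": "VULNERABLE: reflected XSS, user input is echoed without escaping.",
}
print(training_row["output"])
```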
What’s the Verdict?
So, should you use SCAs, LLMs, or a combination of both?
LLMs clearly have the potential to reshape how we detect vulnerabilities, offering a more comprehensive and adaptable approach. However, they are not without their challenges.
The high cost and potential privacy issues associated with proprietary models like GPT-4, coupled with the still-developing performance of open-source models, mean that LLMs might not yet be a perfect standalone solution.
One potential path forward is to combine the strengths of both approaches. SCAs can be used to identify known vulnerabilities, while LLMs like GPT-4 can propose fixes or detect more complex, previously unknown threats. A hybrid approach could be the best way to create a robust security framework for your application.
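To close, here's a rough sketch of what such a hybrid pipeline might look like in code: a cheap rule-based pass first, with an LLM providing a deeper second opinion. Treat it as an illustration of the idea, not a reference implementation.

```python
# Hypothetical hybrid pipeline: fast rule-based screening first, LLM review second.
import ast
from openai import OpenAI

def rule_based_findings(source: str) -> list[str]:
    """Cheap first pass: flag eval() calls via the AST."""
    return [
        f"line {node.lineno}: eval() call"
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "eval"
    ]

def llm_review(source: str) -> str:
    """Slower second pass: ask an LLM about issues the rules don't cover."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a security reviewer."},
            {"role": "user", "content": f"List vulnerabilities in:\n{source}"},
        ],
    )
    return response.choices[0].message.content

def review(source: str) -> None:
    for finding in rule_based_findings(source):
        print("SCA:", finding)       # known patterns, caught instantly
    print("LLM:", llm_review(source))  # deeper, slower second opinion
```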