Taming the Machine

In November 2022, OpenAI introduced ChatGPT to the world, and since then the hype and excitement have reached new levels. ‘How will ChatGPT change the way we think and work?’ asked one article, and many other articles and scholars have written about how this tech, and indeed artificial intelligence more broadly, will shape our lives.

We really are riding the peak of inflated expectations when it comes to AI at the moment. Not a day goes by without someone saying "oh, we are using AI for X", and I get it: it's new and exciting and makes you feel like you are living in the future.

But in most instances, I ask why. Why does seemingly every single project/product/$thing need to include AI?

Language models themselves are a relatively old technology: at heart, a language model is a statistical model of the words found in a language. Take a huge amount of text and see if you can predict what will come next, for example:

Daniel is ___ happy
Daniel is ___ sad
Daniel is ___ missing        
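To make that concrete, here is a deliberately naive sketch of the idea in Python: count which word most often follows another in a tiny corpus and use that count as the "prediction". This is a toy bigram counter for illustration, nothing like a real LLM, and the corpus string is made up.

from collections import Counter, defaultdict

# A toy "language model": count which word follows which in a tiny corpus
corpus = "daniel is very happy . daniel is very sad . daniel is still missing ."
words = corpus.split()

counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Return the word that most often followed `word` in the corpus
    if not counts[word]:
        return None
    return counts[word].most_common(1)[0][0]

print(predict_next("is"))    # 'very', because it followed 'is' twice vs 'still' once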

There's no doubt that in some cases AI (well, let's be honest, it's mostly GPT these days) is making some tasks much easier, but the hype is alarming, especially from vendors who see this as a last-chance hurrah to remain relevant.

The one area I've spent a good solid 18 months on is that of code security. Does using a model that has been trained on already bad code mean you are just perpetuating the lifecycle of poor code?

Well, yes, and anyone who felt this wasn't the case needs their head examining.

Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin and Jiaxin Yu set about answering this question with a pretty good paper (except for the two-column format; come on, academia, move on with the times):

https://arxiv.org/abs/2310.02059

Perceived Risks of AI Code Completion

Perception

Currently we are seeing an interesting side effect of such models in that they are confident in spreading bull***. Sometimes the response is so good that you struggle to comprehend how it did it, and other times the output is so bad that it makes you question the model. This is down to how the model was trained: these models aren’t aware that they are getting it wrong, so they come across as confidently wrong.

Fine Tuning Models

One doesn’t just tune a model and leave it be. What is needed is constant adjustment: you take examples of the output you want for the task at hand and then do extra training so the model becomes more specialised on that exact task. We have seen how GitHub has done this on code found across GitHub.com, so code completion suggestions are far more accurate than the initial model would produce.
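As a rough sketch of what that extra training step can look like (assuming the Hugging Face transformers and datasets libraries and a small base model such as gpt2; the single training example here is a placeholder, not GitHub's actual pipeline):

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical examples of the exact kind of output we want for our task
examples = {"text": ["# add two numbers\ndef add(a, b):\n    return a + b"]}

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return tokens

dataset = Dataset.from_dict(examples).map(tokenize, batched=True)

# A short extra round of training to specialise the base model on our task
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()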

Automation & The Role of Bias

The current crop of generative coding assistants is capable of a variety of tasks developers perform daily. They can complete code blocks, functions, and even whole small programs. A year of personal experience with Copilot has seen me successfully use it to debug issues as well as provide fixes for buggy code.

Automation bias is the condition where humans favor suggestions from automated systems, even when these suggestions run contrary to other observations. Whenever an automated system is involved, there’s a tendency to trust its outputs. This is something we, as an industry, have faced for decades. Humans trust computers, they trust the output, and that is why so many are easily socially engineered into performing actions they’d usually never take.

In their Black Hat 2022 talk, ‘In Need of 'Pair' Review: Vulnerable Code Contributions by GitHub Copilot’, Hammond Pearce and Benjamin Tan talked about how non-obvious factors, such as setting the Author field in the code or using tabs instead of spaces, had a serious effect on the code generated. Whilst the convenience of automated code suggestions is attractive from a business perspective, as well as a technical one, it introduces a risk of bias: the developer trusts the response from Copilot, and if they do not have the depth of experience to validate it, this can introduce issues.

Accuracy

A model is only as accurate as the data used to train it, and this is where any LLM struggles. If we look at using Copilot to write a function for a web application that accepts data from the user, it suggests the following code:
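(The original screenshot isn't reproduced here, so the snippet below is a reconstruction of the kind of code described in the following paragraphs, not the verbatim Copilot output; the database file and table names are made up.)

import cgi
import sqlite3

# accept input from the user
form = cgi.FieldStorage()

# get the name of the person to search for in the database
name = form.getvalue('name')

# connect to the database and perform the query
conn = sqlite3.connect('users.db')
cursor = conn.cursor()

# NOTE: the user-supplied value goes straight into the SQL string,
# which is the SQL injection issue discussed below
cursor.execute("SELECT * FROM users WHERE name = '" + name + "'")

# output the results directly to the browser as HTML
print("Content-Type: text/html\n")
for row in cursor.fetchall():
    print("<p>{}</p>".format(row))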

The code in itself isn’t entirely inaccurate: it correctly suggests the imports one would need and then, based upon my comments, builds up the data flow, creating a FieldStorage object to accept input from the user. It then gets the value of the field named 'name', which is the name of the person we want to search for in the database.

Once it has the name, it connects to the database and performs the query.

What is inaccurate about the above is the security of the code; indeed, there are several potential vulnerabilities in what was suggested:

  1. SQL injection attacks: The code takes user input and inserts it directly into the SQL query without validating or sanitizing it. An attacker could use this vulnerability to inject their own SQL code and potentially manipulate or damage the database (a parameterised fix is sketched after this list).
  2. Lack of input validation: The code does not validate the input received from the user. This can lead to unexpected behavior or errors if the user enters invalid or unexpected input.
  3. Exposing sensitive information: The code fetches the results from the database and directly outputs them to the browser via HTML. This can potentially expose sensitive information such as user credentials or personal information.
  4. Lack of authentication and authorization: The code does not perform any authentication or authorization checks, which can allow unauthorized access to the database.
  5. Lack of error handling: The code does not handle errors or exceptions that may occur during the execution of the script, which can lead to unexpected behavior or security vulnerabilities.
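For the SQL injection point in particular, the fix is to hand the user value to the database driver as a bound parameter rather than pasting it into the query string, and to escape anything written back into the page. A minimal sketch against the same hypothetical users table as the reconstruction above:

import cgi
import html
import sqlite3

form = cgi.FieldStorage()
name = form.getvalue('name', '')

conn = sqlite3.connect('users.db')
cursor = conn.cursor()

# the ? placeholder keeps the user input as data rather than SQL syntax
cursor.execute("SELECT * FROM users WHERE name = ?", (name,))

print("Content-Type: text/html\n")
for row in cursor.fetchall():
    # escape the output before writing it into the page
    print("<p>{}</p>".format(html.escape(str(row))))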

What emerges is the idea that using an LLM to generate code is not entirely dissimilar to using Google or Stack Overflow to find code snippets that do what you want. The developer still has to read through the code and apply their own experience and knowledge to what is presented. The generated code snippets are rarely better than boilerplate examples, as we shall see.

Secure Prompt Engineering

Prompt engineering is a concept in artificial intelligence, particularly natural language processing (NLP). In prompt engineering, the description of the task is embedded in the input, e.g., as a question instead of it being implicitly given.

The rise of ChatGPT has shown how prompt engineering could revolutionise some aspects of development. For example, let's say I want to create some secure cryptographic functions for my code using ChatGPT:

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

# Define the plaintext string
plaintext = "This is a secret message."

# Generate a 16-byte secret key
key = get_random_bytes(16)

# Define the initialization vector (IV)
iv = get_random_bytes(16)

# Create an AES cipher object with CBC mode and the secret key
cipher = AES.new(key, AES.MODE_CBC, iv)

# Convert the plaintext to bytes and pad it to a multiple of 16 bytes
plaintext_bytes = plaintext.encode('utf-8')
padded_plaintext_bytes = plaintext_bytes + b"\0" * (AES.block_size - len(plaintext_bytes) % AES.block_size)

# Encrypt the padded plaintext
ciphertext_bytes = cipher.encrypt(padded_plaintext_bytes)

# Print the ciphertext and the key and IV (for decryption)
print("Ciphertext: ", ciphertext_bytes.hex())
print("Key: ", key.hex())
print("IV: ", iv.hex())        

Not bad, but what about Copilot?

What's different in the two scenarios is context: ChatGPT gives the user far more of it.

And, more importantly, it stresses the need to do this in a secure way.

This is putting aside the need for Post-Quantum cryptography (PQC).
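One small aside on the ChatGPT snippet above: the manual zero padding works, but pycryptodome ships its own PKCS#7 helper, and a version using it (a sketch under the same assumptions as the snippet above, not something either tool produced) looks like this:

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad

plaintext = "This is a secret message."

key = get_random_bytes(16)
iv = get_random_bytes(16)

cipher = AES.new(key, AES.MODE_CBC, iv)

# pad() applies standard PKCS#7 padding, which unpad() can later remove unambiguously
ciphertext_bytes = cipher.encrypt(pad(plaintext.encode('utf-8'), AES.block_size))

print("Ciphertext:", ciphertext_bytes.hex())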

Back to the research paper: it was great to see fellow researchers digging deep, asking hard questions, and using CodeQL to analyse the code. I've said it before: CodeQL is truly one of the most exciting things to happen in the static analysis space in decades.

The team collected 435 code snippets generated by Copilot from GitHub projects. Those snippets covered six common programming languages; CodeQL plus another, unnamed language-specific tool were used to scan and analyse the snippets, and the results from the two tools were then combined.

The results are not surprising for anyone whose main job is finding bugs.

So what does that tell us?

Not much, really. Python is a very popular language now, as is JavaScript, so I expected to see more snippets and, as such, more bugs. Digging deeper, they saw:

  - CWE-330: Use of Insufficiently Random Values
  - CWE-703: Improper Check or Handling of Exceptional Conditions
  - CWE-400: Uncontrolled Resource Consumption
  - CWE-502: Deserialization of Untrusted Data

Some CWEs appeared less frequently, such as:

  - CWE-95: Eval Injection
  - CWE-22: Improper Limitation of a Pathname to a Restricted Directory

CWE-78: OS Command Injection is the most frequently occurring CWE, and this aligns with the bugs we often see. Input validation is SUPER HARD, just ask any security tool vendor.
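To see why CWE-78 keeps topping lists like this, consider the classic pattern of user input being glued into a shell command, and the safer alternative (a hypothetical ping example, not taken from the paper):

import os
import subprocess

host = input("Host to ping: ")

# Vulnerable (CWE-78): the input becomes part of a shell command, so a value
# like "example.com; rm -rf /" runs a second command
os.system("ping -c 1 " + host)

# Safer: pass arguments as a list with no shell, so the input stays a single
# argument to ping rather than being parsed as shell syntax
subprocess.run(["ping", "-c", "1", host], check=False)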

Finally, Table 9 gave us what we kinda wanted to know.

This is code quality in a nutshell, whether it is written by a hardcore 10x dev using Emacs or by a fancy robot. We all suffer from such bugs and will continue to do so for the foreseeable future.

How secure is the code generated by AI?

This is the question I've been asking since I dived deep into the whole AI code thing, and so did the research team. The results aren't overly shocking to me:

Among the 435 code snippets generated by Copilot, we found that 35.8% of these code snippets contain security weaknesses

So pretty much like most of the code we see today. But that begs the question: did the use of AI free up more of the developer's time to look at using tools like CodeQL, or some legacy, friction-inducing scanner, to at least look for bugs?

Because this is where I feel AI code tools actually offer value. They give us the time back to make it right, whereas right now we just don't have that luxury.

So when I hear efforts to "just convert this old code to new code using AI", I get worried that we aren't learning the lessons of the past and are trusting this newfangled robot to know what is good and what is bad. The questions I'd be asking are: how have they trained the model, on what data, who is validating the results, and what about bias?

If you aren't asking these questions, then you are destined to feel the wrath.

Overall a great paper and research project. Kudos to the team. Now if you'll excuse me, I'm gonna get Copilot to write some documentation and unit tests.

Comments

Nick Dunn

Security Specialist at IOActive


Thanks for sharing this Dan. Really interesting article, and it matches my own experience - on one hand, I've been amazed to see ChatGPT convert C to COBOL and vice versa, while correcting vulnerabilities and telling me about the mistakes it corrected, while on the other hand I've seen it produce incorrect and unusable code from relatively simple requests. (I'm not criticizing ChatGPT, but I am agreeing that it needs to be used with caution.)

Simon PG Edwards

25+ years testing cyber security. "Have you checked with SE Labs?"


Interesting point about AI getting things very right (wow, this is scary!) to very wrong (oh well, it’s just a computer…). If we focus on the ‘hits’ and quietly forget the ‘misses’, AI matches the hype. Same with cryptocurrency, NFTs, [insert theme that attracts attention seekers].

Stuart Coulson

Growing Digital Security Start-ups | Connector of People | Mentor and Coach | Evangelist | Consultant | Advisory Board


Love the image - I see that all too often. Good article though.

Petko D. Petkov

on a break from CISO duties, building chatbotkit.com


Traditional software is like our reliable old friend, doing the same thing every time we ask. But AI? It's like that adventurous buddy who might take an unexpected detour. While it's cool to have AI on our team, aiming for them to churn out 'flawless' code might not always be the way to go. Sometimes, it's the quirky, unexpected paths that lead to the coolest discoveries.
