Taming the Machine
In November 2022, OpenAI introduced ChatGPT to the world, and since then the hype and excitement have reached new levels. 'How will ChatGPT change the way we think and work?' asked countless articles and scholars, all writing about how this technology, and artificial intelligence more broadly, will shape our lives.
We really are riding the peak of inflated expectations when it comes to AI at the moment. Not a day goes by without someone saying "oh, we are using AI for X", and I get it: it's new and exciting and makes you feel like you are living in the future.
But in most instances, I ask why. Why does seemingly every single project/product/$thing need to include AI?
Language models themselves are a relatively old technology, and at heart they are statistical models of the words found in a language. Take a huge amount of text and see if you can predict what will come next, for example:
Daniel is ___ happy
Daniel is ___ sad
Daniel is ___ missing
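To make that intuition concrete, here is a toy sketch (the corpus and function names are mine, purely for illustration): count which word follows which in a pile of text, then "predict" by picking the most frequent follower. Real models are vastly more sophisticated, but the statistical intuition is the same.
from collections import Counter, defaultdict
# A tiny "training corpus"
corpus = "daniel is very happy . daniel is very sad . daniel is still missing".split()
# Count how often each word follows each other word
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1
def predict_next(word):
    # Return the word most often seen after `word` in the training text
    return followers[word].most_common(1)[0][0]
print(predict_next("is"))  # -> "very" in this tiny corpus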
There's no doubt that in some cases AI (well, let's be honest, it's mostly GPT these days) is making some tasks much easier, but the hype is alarming, especially from vendors who see this as a last hurrah to remain relevant.
The one area I've spent a good solid 18 months on is that of code security. Does using a model that has been trained on already bad code mean you are just perpetuating the lifecycle of poor code?
Well, yes, and anyone who felt this wasn't the case needs their head examining.
Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin and Jiaxin Yu set about answering this question with a pretty good paper (except for the two-column format; come on academia, move on with the times).
Perceived Risks of AI Code Completion
Perception
Currently we are seeing an interesting side effect of such models: they are confident in spreading bull***. Sometimes the response is so good that you struggle to comprehend how the model did it, and other times the output is so bad that it makes you question the model. This comes down to how the model was trained; these models aren't aware that they are getting it wrong, so they come across as confidently wrong.
Fine Tuning Models
One doesn't just tune a model and leave it be. What's needed is constant adjustment: you take examples of the output you want for the task at hand and do extra training so the model becomes more specialised at that exact task. We have seen how GitHub has done this with the code found across GitHub.com, so Copilot's code completion suggestions are far more accurate than the initial model's.
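For the curious, here is a minimal sketch of what task-specific fine-tuning looks like in practice, assuming the Hugging Face transformers and datasets libraries, a small causal model like gpt2 and a hypothetical my_examples.txt of the output you want; the hyperparameters are illustrative, not a recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
# Start from a small pre-trained causal language model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Examples of the output we want, one per line in a plain-text file
dataset = load_dataset("text", data_files={"train": "my_examples.txt"})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
# Extra training so the model specialises on that exact task
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()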
Automation & The Role of Bias
The current crop of generative coding assistants is capable of a variety of tasks developers perform daily. They can complete code blocks, functions, and even whole small programs. Personal experience of using Copilot for a year has seen me successfully use it to debug issues as well as provide fixes for buggy code.
Automation bias is the condition where humans favor suggestions from automated systems, even when these suggestions run contrary to other observations. Whenever an automated system is involved, there's a tendency to trust its outputs. This is something we, as an industry, have faced for decades. Humans trust computers, they trust the output, and it is why so many are easily socially engineered into performing actions they'd usually never do.
In their Black Hat 2022 talk titled 'In Need of "Pair" Review: Vulnerable Code Contributions by GitHub Copilot', Hammond Pearce and Benjamin Tan talked about how non-obvious factors, such as setting the Author field in the code or using tabs instead of spaces, had a serious effect on the code generated. Whilst the convenience of automated code suggestions is attractive from a business perspective as well as a technical one, a risk of bias is introduced whereby the developer trusts the response from Copilot; if they do not have the depth of experience to validate it, this could introduce issues.
Accuracy
A model is only as accurate as the data used to train it, and this is where any LLM struggles. If we look at using Copilot to write a function that sits behind a web form and accepts data from the user, it suggests code along the following lines:
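(The screenshot of the actual suggestion isn't reproduced here; the snippet below is my reconstruction of the pattern described next, assuming Python's cgi module and sqlite3, so the exact modules and identifiers may differ.)
import cgi
import sqlite3
# Accept input from the user via the submitted form
form = cgi.FieldStorage()
# Get the value of the field named 'name' - the person we want to search for
name = form.getvalue('name')
# Connect to the database and perform the query
conn = sqlite3.connect('people.db')
cursor = conn.cursor()
# Building the query by string interpolation is the security problem discussed below
cursor.execute("SELECT * FROM users WHERE name = '%s'" % name)
results = cursor.fetchall()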
The code in itself isn't entirely inaccurate: it correctly suggests the imports one would need and then, based upon my comments, builds up the data flow, creating a FieldStorage object. This object is used to accept input from the user. It then gets the value of the field named 'name', which is the name of the person that we want to search for in the database.
Once it has the name, it connects to the database and performs the query.
What is inaccurate about the above is the security of the code: there exist potential vulnerabilities in what was suggested, most obviously SQL injection via the unsanitised name.
What emerges is the idea that using an LLM to generate code is not entirely dissimilar to using Google or Stack Overflow to find code snippets that do what you want. The developer still has to read through the code and apply their own experience and knowledge to what is presented. The generated code snippets are rarely better than boilerplate examples, as we shall see.
Secure Prompt Engineering
Prompt engineering is a concept in artificial intelligence, particularly natural language processing (NLP). In prompt engineering, the description of the task is embedded in the input, e.g., as a question instead of it being implicitly given.
The rise of ChatGPT has shown how prompt engineering could revolutionise some aspects of development. For example, let's say I want to create some secure cryptographic functions for my code using ChatGPT.
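A representative prompt would be something along the lines of 'Write Python code that securely encrypts a string with AES, generating the key and IV safely' (the exact wording matters less than being explicit about the security requirement). ChatGPT produces the following: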
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes
# Define the plaintext string
plaintext = "This is a secret message."
# Generate a 16-byte secret key
key = get_random_bytes(16)
# Define the initialization vector (IV)
iv = get_random_bytes(16)
# Create an AES cipher object with CBC mode and the secret key
cipher = AES.new(key, AES.MODE_CBC, iv)
# Convert the plaintext to bytes and pad it to a multiple of 16 bytes
plaintext_bytes = plaintext.encode('utf-8')
padded_plaintext_bytes = plaintext_bytes + b"\0" * (AES.block_size - len(plaintext_bytes) % AES.block_size)
# Encrypt the padded plaintext
ciphertext_bytes = cipher.encrypt(padded_plaintext_bytes)
# Print the ciphertext and the key and IV (for decryption)
print("Ciphertext: ", ciphertext_bytes.hex())
print("Key: ", key.hex())
print("IV: ", iv.hex())
Not bad, but what about Copilot?
What's different in the two scenarios is context: ChatGPT gives the user far more of it.
And, more importantly, it stresses the need to do this in a secure way.
This is putting aside the need for Post-Quantum cryptography (PQC).
Back to the research paper, and it was great to see fellow researchers digging deep, asking hard questions and using CodeQL to analyse the code. I've said it before: CodeQL is truly one of the most exciting things to happen in the static analysis space in decades.
The team collected 435 code snippets generated by Copilot from GitHub projects. Those snippets covered six common programming languages, and CodeQL plus another language-specific tool (not named here) were used to scan and analyse the snippets, with the results from the two tools then combined.
The results are not surprising for anyone whose main job is finding bugs.
So what does that tell us?
Not much really. Python is a very popular language now, as is JavaScript, so I expected to see more snippets and, as such, more bugs. Digging deeper, they saw:
CWE-330: Use of Insufficiently Random Values
CWE-703: Improper Check or Handling of Exceptional Conditions
CWE-400: Uncontrolled Resource Consumption
CWE-502: Deserialization of Untrusted Data
Some CWEs appeared less frequently, such as:
CWE-95: Eval Injection
CWE-22: Improper Limitation of a Pathname to a Restricted Directory
CWE-78: OS Command Injection was the most frequently occurring CWE, and this aligns with the bugs we often see. Input validation is SUPER HARD; just ask any security tool vendor.
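To illustrate the pattern (this is my own minimal sketch, not from the paper or from Copilot), the classic CWE-78 shape in Python looks like this, along with the safer alternative:
import subprocess
# Pretend this arrives from a user, a form field, an API call, etc.
hostname = input("Host to ping: ")
# Vulnerable: a value like "example.com; rm -rf /" is handed straight to the shell
subprocess.run("ping -c 1 " + hostname, shell=True)
# Safer: the value is passed as a single argument and never interpreted by a shell
subprocess.run(["ping", "-c", "1", hostname])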
Finally, Table 9 gave us what we kinda wanted to know.
This is code quality in a nutshell, whether it is written by a hardcore 10x dev using Emacs or by a fancy robot. We all suffer from such bugs and will continue to do so for the foreseeable future.
How secure is the code generated by AI?
This is the question I've been asking since I dived deep into the whole AI code thing, and so did the research team; the results aren't overly shocking to me:
Among the 435 code snippets generated by Copilot, we found that 35.8% of these code snippets contain security weaknesses
So, pretty much like most of the code we see today. But that begs the question: did the use of AI free up more time for the developer to look at using tools like CodeQL, or some legacy friction-inducing scanner, to at least look for bugs?
Because this is where I feel AI code tools actually offer value. They give us the time back to make it right, whereas right now we just don't have that luxury.
So when I hear of efforts to "just convert this old code to new code using AI", I get worried that we aren't learning the lessons of the past and are trusting this newfangled robot to know what is good and what is bad. The questions I'd be asking are: how have they trained the model, on what data, who is validating the results, what about bias, and so on.
If you aren't asking these questions, then you are destined to feel the wrath.
Overall, a great paper and research project. Kudos to the team. Now if you'll excuse me, I'm gonna get Copilot to write some documentation and unit tests.