登录查看更多内容

Security in Large Language Models: A Beginner's Guide

Domenico Tanzarella

Technical Account Manager at Google Cloud

发布日期: 2025年1月3日

Large language models (LLMs) are rapidly evolving, with frequent new feature announcements from companies like Meta, OpenAI, Google, and Anthropic. Their potential applications touch nearly every aspect of our lives. As AI agents begin handling tasks like travel and reservation bookings, they gain access to sensitive personal data (calendars, emails, credit cards, etc.), making LLM security paramount.?

While LLM research initially prioritized accuracy and safety, security is now receiving significant attention.

This article provides an introductory overview of LLM security, highlighting the vulnerabilities present at each stage of the LLM development process. Each step in preparing an LLM presents a potential attack surface for malicious actors.

The Four Stages of LLMs lifecycle Where Security Matters

Think of creating an LLM like building a house - it happens in stages, and each stage needs its own security measures.?

Let's break down these stages and their security challenges:

Data Retrieval and preparation

LLMs are trained over a very large amount of data, hence the word “Large” at the beginning of their denomination.

But what is this data? Well, the first ChatGPT (and Google’s BERT) were trained on “BookCorpus”, a collection of self published books put together by the University of Toronto.?

As the LLMs evolved, scraping the internet became the norm and recent foundational models (ChatGPT-4, Gemini, Claude) have a whole lot more data in their training dataset, whose sources unfortunately are no longer disclosed.

Also this Data, in order to be useful for training the LLMs, need to have annotations: for example, modern LLMs are multimodal (i.e. are trained by and can generate images not just text) and these images have a description associated with it: such metadata is added by humans (the so-called “human annotators”). This is not a new idea, and has been around for more than a decade since Stanford Vision Lab’s ImageNet was completed with the help of thousands of human labelers hired on Amazon’s Mechanical Turk service.?

Of course technology is evolving and nowadays we use AI also to label and annotate data, however the best training data is still the one coming from human annotators.?

So, how do we know that the data used to train these models are not tampered? This is a question not easy to answer. From a security point of view, the lifecycle of retrieving, cleaning and annotating datasets is a difficult scenario to address:?

Who are these data annotators? Can they be trusted? Well, these are individuals whose profiles span from highly paid software engineers annotating code to poorly paid workers in the south of the world that label data for a few dollars/day. In both cases, bad actors could infiltrate and sabotage the annotation and labeling process.

The scenario just shared is an example of “Data Poisoning” attack, where malicious actors manipulate the training data to influence the model’s behavior, leading to a model generating biased or harmful output and undermining its reliability.

Improving the working conditions of annotators in the south of the world so to attract the best candidates for the job, implementing a stricter vetting of the annotators background and a more direct collaboration between data scientists and annotators would definitely reduce risk of tampering and sabotaging the data used for training LLMs.?

Another risk factor is related to the tools Data Scientists use to manipulate and prepare data for training the models. Are these tools routinely scanned for security vulnerabilities and always up to date? Is there someone in the organization reading the release notes?

A Data Scientist performing data extraction and manipulation is not an expert in cybersecurity, so the part related to hardening, securing and keeping their tools up to date might be neglected and therefore open up a big opportunity for hackers to infiltrate and tamper the training data.?

This is where the emerging field of LLMops can help, bringing the DevOps (and SecOps) best practices to the world of AI.

Model training

Once we have data, we need to train the model chosen and fine tune it. We can use already pre made models and fine tune them (for example the ones stored in Hugging Face) or make our own models: both cases present their own security risks. Below are few examples:

Researchers from JFrog recently found? that some 100 malicious PyTorch and Tensorflow Keras models were identified on Hugging Face.?

One of these, a PyTorch model, uploaded by a user called “baller423”, contained hidden code enabling the creation of a ssh connection to a specific host of IP 210.117.212.93.?

These models, if deployed in production, could compromise the businesses using them.

Even if you train your own model, you will probably use a framework for that (e.g. Tensorflow or PyTorch) and that does not make you safer from hacking: in fact, beware of security vulnerabilities in these software packages. A recent vulnerability found in Pytorch could have allowed attackers to execute arbitrary commands on the underlying OS and steal AI training data.

Hugging Face is implementing measures to limit hosting compromised models by:

Enabling malware and password scanning?
Promoting a new model format, Safetensors, to store model data securely, which only stores key??data and does not contain executable codes?

All this might however not be enough: security is a shared responsibility and security checks on downloaded models should be performed also by the users running these models, which bring us back to the need of implementing solid LLMops best practices.

All the? attacks listed so far are part of the same group called “Supply Chain” attacks: these can be very sophisticated attacks involving multiple steps and possibly financed by government sponsored hacking groups or the result of elaborated corporate espionage.

In order to limit the impact of these attacks security experts are lobbying for the creation of a ML/AI Bill of Material that would associate to each model available in sites like Hugging Face a list (Bill of Material) containing the model’s metadata, data sources and software dependencies:? in case any of these components is found having malware or posing a security threat, it can be tracked down quickly and remediation applied fast.

Model optimization (RAG)

If you ask Antrophic’s Claude model who will be the next president of the United States, it won’t be able to answer as its training data is not up to date: this is common for LLMs.

One of the ways to improve this behavior is to use RAG, which stands for “Retrieval Augmented Generation”: this method allows adding new and/or updated knowledge without retraining the LLM from scratch, so when users ask a question, RAG searches through a designated knowledge base (like a DB, news articles, internal documents) to retrieve the most relevant information related to the query.

The retrieved information is then used to supplement the LLM’s internal knowledge, allowing it to generate current, more accurate and contextually relevant answers, hence lowering the chances of mistakes.

RAG is now widely accepted and part of the LLMs workflow: unfortunately, it shares the same Supply Chain security risks as the previous stages of the LLM lifecycle.

领英推荐

Microsoft’s Kosmos-1 Claims to Solve IQ Tests and…

Lightning AI 2 年前

?? Build an LLM app

Product Hunt 1 年前

LLM FINE-TUNING STRATEGIES FOR DOMAIN-SPECIFIC…

Floatbot.AI 1 年前

It also comes with new and dangerous security concerns of its own, like Data Breach and Exposure.

Data accessed via RAG usually are stored in so-called “Vector Databases”, special databases that store information as mathematical representation (easy to access and process by the model). These databases may contain sensitive and personally identifiable information (PII), making them attractive targets for malicious actors.?

Also, the relative immaturity of vector database security increases the risk of unauthorized access and data breaches, potentially violating data privacy laws like GDPR and HIPAA.

Exploiting a Vector Database often happens at the time of model serving, that will be discussed next.

Model serving

After collecting and cleaning data, training the model and enhancing it by adding a RAG source, the last step of the life of LLM is model serving, that happens when a user asks a question via a web app or the question is asked programmatically using API calls.?

“Asking a question” to a LLM is called “Prompting”: the art of asking LLM great questions is called “Prompt Engineering” and the way to exploit the security of LLMs via prompting is called “Prompt Injection”: prompt injection is by far the most common and easy way to hack a LLM, and many hacking strategies have been developed involving prompts:

Model Hacking via special codes?

When ChatGPT was publicly released, it was easy to trick the chatbot to use their knowledge to provide unethical answers by starting a prompt with the word ‘DAN” (“Do Anything Now”).

Tech companies working on LLMs tried to avoid this pitfall over the last two years, but looks like it is a hard problem to fix as hackers even put together a GitHub repo with different prompts for different LLMs that will remove all safety restrictions and allow them to give uncensored answers.?

This is a serious issue:

A hacked and uncensored LLM is not safe for users and for businesses?
A hacked LLM via prompt injection could accidentally reveal PII used for training or 'internal only' information in case of RAG data breaches.

Prompt filtering is a powerful remediation to this kind of hacking, and it can be easily implemented with few lines of code by using APIs like OpenAI “Moderation” endpoint which is free to use when monitoring the inputs and outputs of OpenAI APIs.?

Even without the Moderation API, businesses can code their own prompt filtering modules with a combination or regular expression and keywords filtering, or customize a small open source LLM to recognize prompt injection and filter user prompts before it is submitted to the main LLM.

Model cloning?

A model can be fine tuned with a dataset of Questions and Answers. If an attacker creates a set of thousands of questions for a LLM, their answers could be used to train another LLM that would eventually perform? similarly to the one queried: as “sci-fi” as it sounds, this is a real threat to businesses that invest large amount of money in training a large model for months and with thousands of GPUs running 24 hours/day.

Tools like OpenAI “Moderation” APIs won’t help in this case: instead it is recommended to use API Rate limiting by Token or IP.

Models DOS (denial of service) and DOW (denial of wallet)

Denial of service exists also for LLMs, albeit is slightly different from the traditional DOS and DDOS attacks to services/websites we are familiar with.?

In the LLM case, the Denial of service is not caused by an overwhelming amount of requests, but by the complexity of them, which will keep the LLM busy researching for answers and unable to respond to queries from other users as they usually have a fixed amount of GPUs available for prediction.

Another consequence of these type of attacks is that the cost of the inference becomes much higher than the average due to the saturations of compute resources, this causing the LLM provider to incur in extra costs for GPU usage, hence the label “Denial of Wallet”.

A combination of prompt filtering and rate limiting might help avoiding these DOS/DOW attacks.

One last recommendation for protecting business from LLM security threats, the one that should be implemented in all phases of LLM lifecycle, is Monitoring: aside from compliance and regulatory requirements, monitoring your LLM through their entire lifecycle is imperative to guarantee that security attacks can be identified and properly addressed, and a pillar of LLMops best practices.

One last threat: hallucinations

We are all well familiar with LLMs hallucinations: they happen when a LLM returns a low probability answer with high confidence (e.g. LLMs once referenced fake legal cases that lawyers used for a real Trial). These are threats to the credibility of the Model and the Tech company owning it, but they might also become a security threat in quite an unexpected way.?

Package hallucinations

Large language models (LLMs) are increasingly used for coding by software engineers to accelerate product development. However, a recent study revealed a significant security risk of these coding “copilots”: LLMs sometimes hallucinate the existence of libraries when generating code. This creates an opportunity for "package confusion attacks".

An attacker could publish a malicious package to a public repository with the same name as a frequently hallucinated library. If a developer uses code generated by the LLM, their project could inadvertently pull in this malicious package.

This risk is amplified by the ease with which attackers can identify common hallucinations. For example, by querying LLMs with popular coding questions from platforms like Stack Overflow, attackers can identify frequently hallucinated packages and create them with malicious code inside.

While LLM researchers are working to improve accuracy and reduce hallucinations, developers must exercise caution. Blindly trusting generated code and unfamiliar packages is a significant security risk.?

Looking Ahead

As AI becomes more integrated into our daily lives, security can't be an afterthought. Organizations need to build security into every stage of AI development and deployment. Just as we wouldn't build a house without locks and alarms, we shouldn't deploy AI systems without proper security measures.

The field of AI security is evolving rapidly, and staying informed about new threats and protections is crucial for anyone creating, working with or using AI systems.

We just scratched the surface of Security in Large Language Models: if you want to deep dive into the topic, a good starting point is the “OWASP Top 10 security threats for LLMS” available here.

Hope you enjoyed this reading: let me know in the comments what are your thoughts on this topic.

Cheers, Domenico

Mark Nudelman

VMware & Cloud Operations Guru

2 个月

Well done ??

1 次回应

Luca Bigoni

2 个月

Great advice

Leonardo Lenoci, PhD

Scientist, hacker, musician | social.edu.nl/@the_dr_leonardo_lenoci

2 个月

Hi Domenico, thanks for this interesting overview. Just to complement that https://gnu.org/philosophy/words-to-avoid.html#ArtificialIntelligence

1 次回应

Peter E.

Helping SMEs automate and scale their operations with seamless tools, while sharing my journey in system automation and entrepreneurship

2 个月

It’s clear that as LLMs evolve, so too must our security frameworks. Ensuring data privacy and preventing exploitation is key to maintaining trust in AI systems. How can organizations balance innovation with robust security without stifling progress?

查看更多评论

要查看或添加评论，请登录

Security in Large Language Models: A Beginner's Guide

Domenico Tanzarella

Technical Account Manager at Google Cloud

The Four Stages of LLMs lifecycle Where Security Matters

Data Retrieval and preparation

Model training

Model optimization (RAG)

领英推荐

Model serving

Model Hacking via special codes?

Model cloning?

Models DOS (denial of service) and DOW (denial of wallet)

One last threat: hallucinations

Package hallucinations

Looking Ahead

社区洞察

其他会员也浏览了

OpenAI Gears Up for Mid-Year Launch of GPT-5: Report

Small Language Models: Empowering Tech SMBs in the AI Era

AI-Security Essentials for Decision-makers: The Rising Significance of Large Language Models (LLM) Security

Tractiv outperforms watermarking in protecting data integrity from LLMs.

H2OGPT Open-source Project; LLMs as Debugger; GPT-5 What can be Expected; New 1Bn LLM by Microsoft; In Growth Zone: Creative Teams; and More

?? Has OpenAI Lost Its Edge?

Successfully Mitigating LLM Bias: Introspection & Prompt Engineering with LLM-Genie!

Thinks and Links | August 9, 2024

Top LLM Papers of the Week (March Week-3 2024)

Large Language Model Settings: Temperature, Top P and Max Tokens