Some words on model repository security on the Hub
There have been some discussions about Hugging Face and model repository security (e.g., malicious LLMs) over the past weeks, so I'm sharing and formatting here a great summary written by Omar Sanseviero, our Chief Llama Officer, of the work we've done with the community over the last few years in this area: safe serialization methods, malware scanning, and more. You can find more details on our Hub Security page. Let's go!
1. Pickle scanning
By default, libraries such as PyTorch, TensorFlow, and sklearn use pickle to serialize files.
Pickle, unfortunately, allows arbitrary code execution (!), which means that loading a model could run whatever someone wants on your computer. (By the way, this is something anyone using Python should know; it's nothing specific to ML or HF. See the very clear warning in the official Python docs.)
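To make this concrete, here is a deliberately harmless toy example (the class name and echoed command are invented for illustration): pickle's `__reduce__` hook lets a file run code the moment it is deserialized, before any model weights are even touched.

```python
import os
import pickle

# Illustrative only: a class whose __reduce__ tells pickle to call an
# arbitrary function at load time. A real attack would hide something
# far nastier than an echo here.
class EvilPayload:
    def __reduce__(self):
        return (os.system, ("echo 'this just ran on YOUR machine'",))

blob = pickle.dumps(EvilPayload())

# The "victim" only has to deserialize the bytes -- the command runs
# during pickle.loads, not when the model is later used.
pickle.loads(blob)
```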
As a first step to mitigate this, the Hub added a security scanner that scans uploaded files for malware (using ClamAV, an open-source antivirus). Read more details here.
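You can do the same kind of check locally. This is just a sketch of scanning a downloaded file with ClamAV's command-line tool (the file name is a placeholder), not the Hub's actual scanning pipeline:

```python
import subprocess

# Scan a locally downloaded model file with ClamAV's clamscan CLI.
result = subprocess.run(
    ["clamscan", "downloaded_model.bin"],
    capture_output=True,
    text=True,
)
print(result.stdout)

# clamscan exits with 0 when the file is clean and 1 when it finds malware.
print("infected!" if result.returncode == 1 else "clean (or scan error)")
```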
As a second step, we added pickle import scanning. We scan all the imports referenced in a pickle file and raise warnings when it uses imports that are suspicious (i.e., that could lead to arbitrary code execution). Read more about pickle scanning on the Hugging Face Hub here.
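The core idea is that you can inspect a pickle's opcode stream without ever executing it. Here is a minimal sketch of that idea (not the Hub's actual scanner; the file name and the list of "suspicious" modules are just examples):

```python
import pickletools

# Modules whose presence in a pickle usually deserves a closer look.
SUSPICIOUS = {"os", "subprocess", "builtins", "posix", "nt"}

def list_pickle_imports(path):
    """Walk the pickle opcodes and report which globals the file would import."""
    imports = []
    with open(path, "rb") as f:
        for opcode, arg, _pos in pickletools.genops(f):
            # GLOBAL / INST opcodes carry "module name" as their argument.
            # Newer protocols use STACK_GLOBAL, whose module arrives via
            # earlier string opcodes, so a real scanner handles that too.
            if opcode.name in ("GLOBAL", "INST") and arg:
                module = str(arg).split()[0]
                imports.append((module, module in SUSPICIOUS))
    return imports

for module, flagged in list_pickle_imports("suspect_model.pkl"):
    print(("[!] " if flagged else "    ") + module)
```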
2. Safetensors
Scanning is not perfect. A better mitigation strategy is to not use pickle at all!
Pushing beyond PyTorch's default save format, Hugging Face kicked off the development of safetensors, an efficient and, most importantly, safe format.
To assess the safety of this new format, we collaborated with EleutherAI and Stability AI to conduct an external security audit through the recognized Trail of Bits organization. You can read more about this audit here.
Safetensors is now the default format across many libraries in our ecosystem and beyond, including the famous transformers library!
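If you haven't used it directly yet, saving and loading with the safetensors library looks roughly like this (tensor names and file name are just examples):

```python
import torch
from safetensors.torch import save_file, load_file

# safetensors stores plain tensors plus a small JSON header --
# there is no code to execute at load time, unlike pickle.
tensors = {
    "embedding.weight": torch.randn(1000, 128),
    "classifier.weight": torch.randn(10, 128),
}
save_file(tensors, "model.safetensors")

# Loading just reads the tensors back; nothing else runs.
restored = load_file("model.safetensors")
print(restored["embedding.weight"].shape)  # torch.Size([1000, 128])
```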
Other formats, such as GGUF, have also emerged in the ecosystem as safer alternatives to pickle!
3. Social validation features
Hugging Face has built-in social features such as likes, community tabs, file inspectors, reporting mechanisms, and spam detection tools.
On GitHub/npm/pip, you surely wouldn't download and run random code from just any source, for instance a GitHub repo with 0 stars and a suspicious author. The same is true on the Hugging Face Hub: pay attention to the social features associated with a model, and take a moment to look at the likes, community tab, leaderboards, and model files before downloading and running a model from a suspicious account with 0 likes and very few downloads.
If a model comes from an official trusted org such as Google, Salesforce, or MistralAI, the risk is much lower.
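Those social signals are also available programmatically. Here is a quick sanity check using the huggingface_hub client (the repo id and the download threshold are placeholders, pick whatever makes sense for you):

```python
from huggingface_hub import HfApi

# Look up a repo's social signals before downloading anything from it.
api = HfApi()
info = api.model_info("some-user/some-model")  # placeholder repo id

print(f"author:    {info.author}")
print(f"likes:     {info.likes}")
print(f"downloads: {info.downloads}")

if info.likes == 0 and (info.downloads or 0) < 100:
    print("Low social validation -- inspect the files and the author "
          "before loading this model.")
```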
Conclusion
Over the last few weeks, there have been a few press releases from companies that sell security features.
Apart from the obvious conflict of interest, these reports, in our opinion, don't show anything the community didn't already know (using pickle in Python is risky) or anything the ecosystem hasn't been working on for the last few years, as you saw above.
Combining safe file formats with trusted sources is a great way to keep yourself safe, so I hope you'll generally follow the same safe practices on the Hub that you surely follow elsewhere in your online life :)
Stay safe, stay Huggy!
Credit: Omar Sanseviero with small edits from yours truly. Original thread at https://twitter.com/osanseviero/status/1763331704146583806