Tips for open-sourcing research code

Thomas Wolf

Co-founder and Chief Science Officer at Hugging Face – Angel investor

Published: January 14, 2020

I often meet research scientists and NLP practitioners who want to open-source their code or research and ask for advice, so I've decided to share some of that advice more widely.

First: why should you open-source code and models along with your paper? Because science is a virtuous circle of knowledge sharing, not a zero-sum competition.

1. Consider sharing your code as a tool to build on, more than a snapshot of your work:

- others will build things that you can't imagine on top of your code => give them easy access to the core elements
- but be careful not to overdo it => no need for one-liner abstractions that won't fit others' needs – keep it clean & simple

2. Put yourself in the shoes of a master's student who has to start from scratch with your code:

- give them a ride up to the end with pre-trained models
- focus examples and code on open-access datasets (not everybody has access to CoNLL-2003)

3. Give detailed instructions on how to run the code, at least evaluation code, in such a way that, combined with pretrained models, it allows for fast testing and debugging. This will help your users make sure that they are using your code right.
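
As one possible shape for such evaluation code, the sketch below compares a reproduced score against the score published with the release. Everything in it is hypothetical: the `evaluate` and `check_reproduction` functions, the toy "pretrained model" (a plain dictionary lookup), the example data, and the `REPORTED_ACCURACY` value all stand in for whatever your repository actually ships.

```python
"""Minimal evaluation sketch (all names and numbers are illustrative assumptions)."""

# Hypothetical score the authors would publish alongside the checkpoint.
REPORTED_ACCURACY = 0.75


def load_pretrained():
    """Stand-in for loading released weights: here, a fixed lookup table."""
    return {"good": "pos", "great": "pos", "bad": "neg", "awful": "pos"}


def evaluate(model, examples):
    """Return the accuracy of `model` on a list of (text, label) pairs."""
    correct = sum(1 for text, label in examples if model.get(text) == label)
    return correct / len(examples)


def check_reproduction(score, reported, tolerance=1e-6):
    """Fail loudly if the user's setup does not reproduce the reported score."""
    if abs(score - reported) > tolerance:
        raise ValueError(
            f"Got {score:.4f} but the release reports {reported:.4f}; "
            "check data preprocessing and checkpoint version."
        )
    return True


if __name__ == "__main__":
    eval_set = [("good", "pos"), ("great", "pos"), ("bad", "neg"), ("awful", "neg")]
    score = evaluate(load_pretrained(), eval_set)
    check_reproduction(score, REPORTED_ACCURACY)
    print(f"accuracy = {score:.2f} (matches reported value)")
```

A check like this gives users a quick yes/no answer on whether their environment, data preprocessing, and checkpoint match yours before they start modifying anything.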

4. Use as few dependencies as possible: if you are using an internal framework to build the model => copy the relevant part inside your codebase instead of asking users to install another dependency.

5. Spend four to five days to do it well. Open-sourcing a good codebase takes some time, but you should consider it as important as publishing a paper.

6. Consider merging with a larger codebase. Are you working on language models? Transformers is probably happy to welcome your model at https://github.com/huggingface/transformers

7. Now, what if you want to build a larger-scale tool like Transformers? Here are some additional tips for you:

A. Focus on one essential feature that your community really needs and that no existing tool provides.

B. Keep putting yourself in the shoes of people using your tool for the first time and on other tasks.

C. Open-sourcing ML can be very different from other types of open-sourcing:

- ML bugs are very often silent => your users will need to know exactly what's happening inside the code; they can't just assume you are doing it right for their use case.

- Researchers will create things you have no idea about => they want to dive into your code and modify it.

=> keep everything clear and visible. No unnecessary user-facing abstractions or layers. Provide direct access to the core. Each user-facing abstraction is a mask that can hide some ML bug, a potential source of misunderstandings, and a steeper learning curve for users.
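
To make the "silent bug" point concrete, here is a small sketch in plain Python. It is not taken from any library: `mean_pool_naive` and `mean_pool_masked` are hypothetical helpers, and the toy numbers are made up. Both functions run without error on a padded sequence, but only one ignores the padding, which is the kind of difference a user can only catch if the core computation is visible.

```python
def mean_pool_naive(vectors):
    """Average all token vectors, padding included (runs fine, silently wrong)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def mean_pool_masked(vectors, mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [v for v, m in zip(vectors, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by two padding positions (toy values).
tokens = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

naive = mean_pool_naive(tokens)          # [1.5, 3.0]
masked = mean_pool_masked(tokens, mask)  # [3.0, 6.0]
print(naive, masked)
```

Neither call raises an exception or prints a warning, yet the two results differ by a factor of two. If the pooling step were hidden behind a convenience wrapper, a user with padded batches would have no easy way to notice which behavior they were getting.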

Best,

Thom

