Tips for open-sourcing research code

I often meet research scientists and NLP practitioners who want to open-source their code or research and ask me for advice, so I've decided to share some of that advice more widely.

First: why should you open-source code and models along with your paper? Because science is a virtuous circle of knowledge sharing, not a zero-sum competition.

1. Consider sharing your code as a tool to build on, more than a snapshot of your work:

  • others will build things that you can't imagine on top of your code => give them easy access to the core elements
  • but be careful not to overdo it => no need for one-liner abstractions that won't fit others' needs; keep it clean & simple

2. Put yourself in the shoes of a master's student who has to start from scratch with your code:

  • give them a ride up to the end with pre-trained models
  • focus examples and code on open-access datasets (not everybody can have access to CoNLL-2003)

3. Give detailed instructions on how to run the code, at least evaluation code, in such a way that, combined with pretrained models, it allows for fast testing and debugging. This will help your users make sure that they are using your code right.
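
To make this concrete, here is a minimal sketch of the kind of check an evaluation script can ship with. Everything in it is an assumption made up for illustration: the toy model, the four-example dataset, the reference score and the tolerance are placeholders, not values from any real release. The idea is only that users run one command and see right away whether they reproduce the number you reported.

```python
import math

# Hypothetical reference value: the score the released checkpoint is
# supposed to reach on the public evaluation set (made up for this sketch).
REFERENCE_ACCURACY = 0.75
TOLERANCE = 0.01


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot evaluate on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def check_reproduction(score, reference=REFERENCE_ACCURACY, tolerance=TOLERANCE):
    """Return True when the user's score is within tolerance of the reported one."""
    return math.isclose(score, reference, abs_tol=tolerance)


def toy_pretrained_model(text):
    """Stand-in for a real pretrained model: predicts 1 if the text contains 'good'."""
    return 1 if "good" in text else 0


def main():
    # Tiny stand-in for an open-access evaluation set.
    texts = ["good movie", "bad movie", "good plot", "not good"]
    labels = [1, 0, 1, 0]
    predictions = [toy_pretrained_model(t) for t in texts]
    score = accuracy(predictions, labels)
    status = "OK" if check_reproduction(score) else "MISMATCH"
    print(f"accuracy={score:.2f} reference={REFERENCE_ACCURACY:.2f} -> {status}")


if __name__ == "__main__":
    main()
```

A mismatch here tells users that something in their setup differs from yours before they spend days building on top of it.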

4. Use as few dependencies as possible: if you built the model with an internal framework => copy the relevant parts into your codebase instead of asking users to install another dependency.

5. Spend four to five days doing it well. Open-sourcing a good codebase takes time, but you should consider it as important as publishing a paper.

6. Consider merging with a larger codebase. Are you working on language models? 🤗 Transformers is probably happy to welcome your model at https://github.com/huggingface/transformers

7. Now, what if you want to build a larger-scale tool like 🤗 Transformers? Here are some additional tips for you:

A. Focus on one essential feature that your community really needs and that no tool already provides.

B. Keep putting yourself in the shoes of people using your tool for the first time and on other tasks.

C. Open-sourcing ML can be very different from other types of open-sourcing:

- ML bugs are very often silent => your users need to know exactly what's happening inside the code; they can't just assume you are doing it right for their use case.

- Researchers will create things you have no idea about => they want to dive into your code and modify it.

=> keep everything clear and visible. No unnecessary user-facing abstractions or layers. Provide direct access to the core. Each user-facing abstraction is a mask that can hide some ML bug, a potential source of misunderstandings, and a steeper learning curve for users.
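
Here is a small illustration of what a silent ML bug can look like. It is a toy example written for this post, not code from any library, and it assumes padded positions are filled with zeros. Both functions run without any error, but only one of them computes what the user actually wants.

```python
def naive_mean(values):
    """Averages every position, padding included: runs fine, gives a wrong score."""
    return sum(values) / len(values)


def masked_mean(values, mask):
    """Averages only the positions where mask is 1 (the real tokens)."""
    kept = [v for v, m in zip(values, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    return sum(kept) / len(kept)


# Two real token scores followed by two padded positions (assumed to be 0.0).
scores = [2.0, 4.0, 0.0, 0.0]
mask = [1, 1, 0, 0]

print(naive_mean(scores))         # 1.5, silently diluted by the padding
print(masked_mean(scores, mask))  # 3.0, the mean over real tokens only
```

If the pooling step were buried behind a user-facing abstraction, nothing would crash and the only symptom would be slightly worse numbers, which is why users need to be able to see it.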

Best,

Thom
