Tips for open-sourcing research code
I often meet research scientists and NLP practitioners who want to open-source their code or research and ask me for advice, so I've decided to share some of that advice more widely.
First: why should you open-source code and models along with your paper? Because science is a virtuous circle of knowledge sharing, not a zero-sum competition.
1. Consider sharing your code as a tool to build on, more than a snapshot of your work:
- others will build things that you can't imagine on top of your code => give them easy access to the core elements
- but be careful not to overdo it => no need for one-liner abstractions that won't fit others' needs – keep it clean & simple
2. Put yourself in the shoes of a master's student who has to start from scratch with your code:
- take them all the way to the final results by providing pre-trained models
- focus examples and code on open-access datasets (not everybody has access to CoNLL-2003)
3. Give detailed instructions on how to run the code, at least the evaluation code, so that, combined with the pre-trained models, it allows for fast testing and debugging. This will help your users make sure that they are using your code correctly.
4. Use as few dependencies as possible: if you are using an internal framework to build the model => copy the relevant part into your codebase instead of asking users to install another dependency.
5. Spend four to five days to do it well. Open-sourcing a good code base takes some time but you should consider it as important as publishing a paper.
6. Consider merging with a larger codebase. Are you working on language models? 🤗 Transformers is probably happy to welcome your model at https://github.com/huggingface/transformers
7. Now, what if you want to build a larger-scale tool like 🤗 Transformers? Here are some additional tips for you:
A. Focus on one essential feature that your community really needs and that no existing tool provides.
B. Keep putting yourself in the shoes of people using your tool for the first time and on other tasks.
C. Open-sourcing ML can be very different from other types of open-sourcing:
- ML bugs are very often silent => your users will need to know exactly what's happening inside the code; they can't just assume you are doing it right for their use case.
- Researchers will create things you have no idea about => they want to dive into your code and modify it.
=> keep everything clear and visible. No unnecessary user-facing abstractions or layers. Provide direct access to the core. Each user-facing abstraction is a mask that can hide some ML bug, a potential source of misunderstandings, and a steeper learning curve for users.
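To make the "silent bug" point concrete, here is a minimal sketch in plain Python. It is not taken from any real codebase: the function names, the per-token losses, and the reference value are all made up for illustration. It shows an average that silently includes padding positions, next to a masked version, and a tiny check against a known reference value of the kind that evaluation code plus pre-trained models makes possible.

```python
from typing import List


def naive_mean(values: List[float]) -> float:
    """Average over every position, padding included (the silent bug)."""
    return sum(values) / len(values)


def masked_mean(values: List[float], mask: List[int]) -> float:
    """Average over real positions only (mask: 1 = real token, 0 = padding)."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    kept = [v for v, m in zip(values, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no positions")
    return sum(kept) / len(kept)


def matches_reference(measured: float, reference: float, tol: float = 1e-6) -> bool:
    """Return True when a measured score agrees with a known reference value."""
    return abs(measured - reference) <= tol


# Hypothetical per-token losses for one sequence padded to length 5.
token_losses = [2.0, 4.0, 6.0, 0.0, 0.0]
padding_mask = [1, 1, 1, 0, 0]

# Hypothetical reference value, standing in for a number reported with the code.
REFERENCE_LOSS = 4.0

buggy = naive_mean(token_losses)                    # runs fine, no error raised
correct = masked_mean(token_losses, padding_mask)   # ignores the padding

print("naive :", buggy, "matches reference:", matches_reference(buggy, REFERENCE_LOSS))
print("masked:", correct, "matches reference:", matches_reference(correct, REFERENCE_LOSS))
```

Both versions run without any error or warning; only the comparison against a known value reveals that the naive one is wrong. If the averaging were buried behind several layers of abstraction, a user adapting the code to a new task would have a hard time spotting it.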
Best,
Thom