As I have dug deeper into Generative-AI strategy and implementation over the last few months, I have come to a few conclusions:
- Most, if not all, enterprise Gen-AI work is critically dependent on OpenAI in one form or another: ChatGPT, document embeddings, or summarization of RAG snippets.
- The prognosis of the famous (or infamous) leaked Google memo, “We have no moat…”, is nowhere near being fulfilled. With so many very smart minds working in this field, we may be on the cusp of a breakthrough, but we are simply not there yet.
- I would love to have some real competition for OpenAI, simply to save costs for my company and to gain real flexibility in domain adaptation of LLMs. For example, there is no transparency or flexibility in fine-tuning the largest LLM available on OpenAI, namely Davinci. I do not even know whether it internally uses LoRA or QLoRA or one of the older fine-tuning techniques, nor can I change the technique. (A sketch of the kind of control I mean follows this list.)
- Thus, I am even more troubled when I review the so-called open-source competition and have to call them ‘Free, Downloadable, Binary LLMs’.
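To make the fine-tuning point concrete, here is a minimal sketch using the Hugging Face transformers and peft libraries; the checkpoint name is a hypothetical placeholder, not any particular model. The point is that every adaptation choice is explicit and swappable, which is exactly what a hosted API hides.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint -- stands in for any downloadable base LLM.
base = AutoModelForCausalLM.from_pretrained("my-org/base-llm")

# With an open pipeline, the fine-tuning technique and all its knobs are
# visible and changeable: swap LoRA for QLoRA or full fine-tuning at will.
lora = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # which projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```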
Let us parse each descriptive word:
- Free. Yes, these LLMs are free, at least until someone demands payment for using them, perhaps backed by a legal threat that you are infringing someone’s copyright or IP.
- Downloadable. This seems to be the biggest thing these LLMs have going for them. Indeed, one can download the weights file for these LLMs and is free to do with those weights as one wishes.
- Binary. I use ‘binary’ as a metaphor for a piece of code that a developer cannot change or truly understand. I know this is a bit of a stretch: of course a weights file can be used to reconstruct the model architecture, and we can inspect the activation functions at each node (see the sketch after this list). All said and done, there is very limited capability to modify these weights once the number of model parameters reaches even a few billion. Most attempts at truly retraining an LLM of this scale on an enterprise corpus, which is orders of magnitude smaller than the initial web corpus, result in insignificant changes or catastrophic forgetting. In fact, this is the reason for the advent of the entire field of Parameter-Efficient Fine-Tuning.
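As an aside on the ‘binary’ metaphor, the sketch below (assuming a hypothetical locally downloaded checkpoint) shows the extent of the visibility one actually gets: you can enumerate every layer and weight shape, but hand-modifying billions of parameters in any meaningful way is not a realistic option.

```python
from transformers import AutoModelForCausalLM

# Hypothetical path to a downloaded weights directory.
model = AutoModelForCausalLM.from_pretrained("./downloaded-llm")

# Full visibility into the architecture ...
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# ... but no practical way to hand-edit this many weights.
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.2f}B parameters")
```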
Even if the model weights are under an Apache or MIT license, I believe there is a fundamental expectation of truly open-source software: a developer who is not part of the original development team should be able to fix a bug or enhance the software independently of that team. Towards this goal, I enunciate four requirements for a model to be called a true open-source LLM:
- Disclosure of and public access to the training corpus. A simple statement that the model was trained on ‘public web documents’ is not sufficient for other groups to use or validate the corpus for its copyright properties. Furthermore, public web content is highly dynamic and is likely to have been modified or deleted by the time other groups want to experiment with it. The training document corpus must be collected and made available in a shared location.
- Document-processing pipeline code and tokenized output. This is one of the most important yet least appreciated steps in the whole process. There should be no secret sauce or confidentiality in the tokenization process (see the first sketch after this list).
- Model architecture and training code. This is likely to be the least controversial of my proposals, as the model architecture is already derivable from the weights file and many teams have already published some code on GitHub.
- Training process. This includes the hyperparameters, evaluation metrics, batching techniques, random seeds, and so on. The goal of idempotent, fully reproducible training may be too far in the future, but all of these and more are needed to reproduce the training output with reasonable statistical confidence (see the second sketch below).
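For the second requirement, this is roughly what releasing the document-processing pipeline would look like: a minimal sketch with the Hugging Face tokenizers library, where the corpus shard path and vocabulary size are illustrative assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a BPE tokenizer from the published corpus; nothing here is secret.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus/shard-000.txt"], trainer=trainer)  # hypothetical shard

# The saved tokenizer (and the tokenized corpus it produces) should ship
# alongside the weights.
tokenizer.save("tokenizer.json")
```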
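And for the fourth requirement, a minimal sketch of what disclosing the training process means in practice: pin every source of randomness and publish the hyperparameters alongside the run. The names and values here are illustrative assumptions, not any particular model’s recipe.

```python
import json
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Pin all common sources of randomness for reproducibility.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Illustrative hyperparameters -- the point is that they are published.
hparams = {"seed": 42, "lr": 3e-4, "batch_size": 1024, "warmup_steps": 2000}
set_seed(hparams["seed"])

# Shipping this file with the weights lets others rerun the training and
# compare outputs with reasonable statistical confidence.
with open("run_config.json", "w") as f:
    json.dump(hparams, f, indent=2)
```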
I have a feeling that I have only scratched the surface in defining what a truly open-source LLM is. Until there is progress on these factors, the so-called open-source LLMs are best described as free, downloadable binaries of an LLM.