An attempt to get closer to OSAID version 1.0
I was very impressed by the exchanges over the last few days between Stefano Maffulli, Executive Director of the Open Source Initiative (OSI), and the Amazon teams, in particular Tom Callaway, Principal Open Source Technical Strategist at Amazon Web Services, and julia ferraioli, open source strategist, researcher, and practitioner.
I don't (yet) have the honor of knowing Spot, but I was able to meet Julia in Brussels in February at FOSDEM 2024. I was impressed by her presentation and her pure approach to Open Source. When you take a look at their respective LinkedIn profiles and achievements, you have to admit that they have real experience in Open Source, so I know I'm entering into a dialogue with personalities who are both extremely competent and widely recognized.
As it seems to me that we already have enough to do with the Yann LeCun and Elon Musk "affair", I'm trying to share my point of view here as a new LLM producer, with no other ambition than to try to take the debate on the finalization of OSAID one step further.
For my part, I'm Michel-Marie MAUDET, a Linux user since 1992 and Managing Director of LINAGORA since 2000. Before LINAGORA, I contributed to several Open Source projects, including #StarOffice, and above all succeeded in bringing Open Source into the French Ministry of Defense. At LINAGORA, I've been able to contribute to the #SPIP CMS, and I've mainly devoted myself to recruiting much better developers than myself ;-)
They work either to develop our own software, like Twake Workplace today, or to bring high-level expertise to our customers. In 2016, we also launched the development of an ASR engine, which has now become the #LinTO product: an Open Source alternative to giants like Google Assistant and Amazon Alexa... oops!
Our relationship with OSI has gone through various strong emotional states. In the past, with our LinShare software, we tried to promote a license that we felt would give even more rights to our customers while enabling us to better fund our R&D. Simon Phipps finally convinced us... and we ended up "falling in line", adopting the Open Source licensing mindset promoted by OSI.
A year ago, I launched the OpenLLM community with a single objective: to create digital commons in the field of Generative AI, 100% Open Source, which for us means three conditions:
In this sense, we're aiming to be one of the few models that could be rated green in the Ethical AI framework proposed by the Nextcloud community.
Pre-training of our LUCIE model is underway on the Jean Zay supercomputer, and we expect the pre-training phase for our 7B model to be completed by mid-September 2024. By choosing to use only academic computing infrastructure located in France, we have to move more slowly than if we used Amazon infrastructure, for example... but that's another debate!
So I was able to join the OSI private discussion lists in October 2023 alongside Jean-Pierre LORRE, who was the truly active member of our team there. I followed the discussions and exchanged with Stefano on several occasions during this long process, and more closely in recent weeks.
I have observed the evolution of the #OSAID drafts, which recently dropped the publication of datasets from the mandatory requirements of the definition. It's worth noting, however, that the requirements concerning training datasets do provide clarity and weed out models that claim to be open source while their training datasets remain 100% opaque.
Today's "Open Weights" models will have to converse solely with this appellation and will no longer be able to claim to be Open Source in the sense of the #OSAID. But is this enough? and that's where points of view diverge.
Many of us, especially the early free-software advocates, are more in line with the philosophy of Marcus Aurelius. His thought is rooted in Stoicism and extends the idea that our passions and emotions stem from the judgments we make about things. This leads us to imagine a just, ideal world in which everyone is assumed to be well-intentioned.
But my experience in IT in general has shown me that it's sometimes necessary to use tactics, cunning... and, dare I say it, politics and influence to achieve one's goals. Let's face it: life is just a succession of imperfect compromises, but that's the way the world goes.
But let's get back to the heart of the matter. For LUCIE training, we made a double bet:
It's a gamble, one fraught with meaning and responsibility: there's no guarantee that we'll have a successful LLM in September! But I do think we'll make an important contribution to the state of the art which, depending on our results, could complement the empirical approaches that led, for example, to the publication of scaling laws.
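As a reminder for readers less familiar with the topic, the published scaling laws (for example the "Chinchilla" fit of Hoffmann et al., 2022) typically model pre-training loss L as a function of parameter count N and number of training tokens D:

```latex
% Chinchilla-style scaling law (Hoffmann et al., 2022):
% E is the irreducible loss; A, B, \alpha, \beta are empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Training a 7B model under a fixed academic compute budget is exactly the kind of data point that can confirm or refine such empirical fits.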
So, at this stage, I don't agree with Julia and Tom's proposal to replace the most sensitive data, such as Personally Identifiable Information (PII), with synthetic data.
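To make the disagreement concrete, here is a minimal sketch of what such a PII-to-synthetic-data substitution could look like. This is purely illustrative: the regexes, helper names and replacement strategy below are my own assumptions, not Julia and Tom's actual proposal nor any production pipeline.

```python
import re
import random

# Hypothetical, simplified PII detectors (real pipelines use far more
# robust NER-based detection, not just regexes).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d .-]{7,}\d")

def synthetic_email(_match):
    # Replace a real address with a random synthetic one.
    return f"user{random.randint(1000, 9999)}@example.org"

def synthetic_phone(_match):
    # Replace a real number with a random French-looking synthetic one.
    return "+33 " + " ".join(str(random.randint(10, 99)) for _ in range(4))

def scrub(text: str) -> str:
    """Substitute detected emails and phone numbers with synthetic values."""
    text = EMAIL_RE.sub(synthetic_email, text)
    text = PHONE_RE.sub(synthetic_phone, text)
    return text

sample = "Contact Jean at jean.dupont@linagora.com or +33 6 12 34 56 78."
print(scrub(sample))
```

Whether such a substitution preserves enough fidelity of the corpus for the model to be genuinely studied and reproduced is precisely the point under debate.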
However, I agree 100% with the rest of the arguments put forward by the Amazon teams. The publication of training data is an essential prerequisite for studying and understanding how a model works (particularly with regard to its preferences), and of course for re-training it. There's no doubt about that on either side.
In my opinion, the publication of training data offers other very important advantages:
Stefano has put forward the argument of federated learning. As far as I'm concerned, given the current maturity of this approach and the current level of interest in it across the AI ecosystem (including academic research), I think it would be wise either to treat it as an exception at this stage or simply to drop this argument, which may be adding fuel to the fire for nothing.
The approach proposed by the Nextcloud community is an interesting one, but we run the risk of falling into endless debates over model rankings of the type: "but I'm almost green, just not quite...".
For me, as I explained to Stefano, the OSAID should make it possible to clearly distinguish two states: Open Weights (models without strict publication of 100% of their datasets) and Open Source (models whose training datasets are 100% published under an OSI-compliant license).
However, I understand Stefano's arguments: as he quite rightly points out, if the publication of training data became a strictly mandatory requirement, it could severely limit the number of models recognized as compliant with OSAID... and even more so those that could be deployed for unrestricted commercial use.
I finally agreed with Stefano and have personally decided to publicly support version 0.0.8 of the draft. It's imperfect, as we all know, but I feel that what we really need is a first level of definition and an unequivocal clarification between Open Weights and Open Source.
Finally, with the evolution of model architectures, the advent of PLM and Agentic AI and, ultimately, the reduction in the compute required to train the next generation of models, I think we'll be obliged to amend OSAID as early as 2025.
In fact, this would already be an innovative principle we could adopt to bring greater agility to the drafting of licenses, much as is happening with the emergence of new Internet standards.
To conclude, I was very pleasantly surprised by the points Julia made during her presentation at FOSDEM. And Tom's latest posts confirm this pleasant feeling.
However, to my knowledge, Amazon is not (yet) a producer of LLMs. So I wonder what the reason could be for this fierce defense of the publication of training data (which, in my heart of hearts, I also support)?
And generally speaking, if we want to enable the creation of large, ethically constituted, 100% Open Source training datasets, we need to collectively answer an essential question: what reasons or incentives could be put in place to motivate rights holders to make their richest and most recent data available for training models that are 100% Open Source, and therefore freely accessible and usable?
I'm thinking in particular of the media, which would give us, with LUCIE for example, access to post-2010 societal data: a major contribution to representing the evolution of our society, our values and our culture.
Yes, of course: after all this long and tedious work to clarify things, our goal with LUCIE is to be 100% compliant with OSAID and also to obtain the green flag in @Nextcloud's Ethical AI framework. We will be publishing 100% of the datasets used for training, so that they can be studied, reused and, of course, improved if necessary.
I therefore call on all those who want clarification (at least a first level) in our AI ecosystem to take note of the latest draft here (https://opensource.org/deepdive/drafts) and to show responsibility, so that a version 1.0 of OSAID can be finalized by the end of June, this long process completed, and this first version of OSAID in our hands by October.
AI Leader · CEO/CTO · MBA · Founder · Xoogler
3 months ago
Now that Meta has ignored the OSI board's impotent attempt to curtail its abuse of Open Source (an attempt that would have been more powerful had the original Open Source Definition been used, per its author Bruce Perens' recommendation), it's time to go back to the drawing board. This is particularly true with the election of Trump and his dangerous combination of deregulation, protectionism/tariffs, and strengthened IP rights. Opaque OSAID-compliant models stuffed to the gills with copyrighted content will render those of us outside the US (including here in France) clients of BigTech and the content oligarchs over there. I encourage you and yours to dump your endorsement of the #OSAID and reinforce the meaningful Open Source Definition that has stood the test of time over the past quarter century, as we risk losing both, with more and more software being written by and incorporating AI: https://opensourcedeclaration.org/index-fr-fr.html You are also welcome to join the uncensored community-run discussions (https://discuss.opensourcedefinition.org) on the future of Open Source now that the OSI has abdicated its responsibilities: https://opensourcedefinition.org/wip
3 months ago
Your privileged access to the "OSI private discussion lists in October 2023" with Jean-Pierre LORRE appears to have left an impression, but does it not seem strange to you that you were granted access to what should have been an open process? A former OSI board member said in 2008 that "a process that is not open cannot be trusted to produce a product that can be considered open", so what's changed except some new faces? Are they now saying that Simon Phipps was wrong and that closed processes for open standards are now acceptable? This certainly makes our job of undermining it easier: https://opensource.org/blog/simon-phipps-was-right
3 months ago
The cabal on the OSI board and their paid servants rammed through a definition that has been totally ignored by the likes of Meta, who continue to use the term with abandon, and which has been roundly rejected by the Open Source community (Debian, FSF, etc.). The Software Freedom Conservancy has even demanded it be repealed and threatened to run against the board with a ticket I may yet join. Given that Amanda Brock revealed the membership didn't want this definition, and didn't approve it, there's a good chance they/we will win, but the damage may already have been done by then, so it's not my primary strategy. "While OSI claims their OSAID is humble, I beg to differ. The humble act now is to admit that it was just too soon to publish a "definition" and rebrand the OSAID 1.0 as "current recommendations". That might not grab as many headlines or raise as much money as the OSAID did, but it's the moral and ethical way out of this bad situation." I would go a step further and say the honourable thing to do at this point is to resign and allow the community to heal from this own goal. https://sfconservancy.org/blog/2024/oct/31/open-source-ai-definition-osaid-erodes-foss/
3 months ago
You were impressed by "exchanges" that ultimately led to posts like this? "The whole effort to "define open source AI" has been heartbreaking for many of us who care so very deeply about open source. It is not just the lack of transparency, biased narratives, and fallacious arguments that we have seen from the OSI — it's how participants who present differing viewpoints have been treated. I have been called a liar, accused of being an enemy of open source, had my contributions misrepresented and shut down, and had my job not-so-subtly threatened for authentically contributing my expertise — expertise that lies precisely at the intersection of open source and artificial intelligence. I have seen others experience the same." I can relate, by the way, as I've experienced the same while fighting for the future of our industry and those dependent on it. https://www.dhirubhai.net/posts/juliaferraioli_opensource-opensourceai-osi-activity-7256687721182605313-_wpd