An attempt to get closer to OSAID version 1.0
I was very impressed by the exchanges over the last few days between Stefano Maffulli, Executive Director of the Open Source Initiative (OSI), and the Amazon teams, in particular Tom Callaway, Principal Open Source Technical Strategist at Amazon Web Services, and julia ferraioli, open source strategist, researcher, and practitioner.
I don't (yet) have the honor of knowing Spot, but I was able to meet Julia in Brussels in February at FOSDEM 2024. I was impressed by her presentation and her pure approach to Open Source. When you take a look at their respective LinkedIn profiles and achievements, you have to admit that they have real experience in Open Source, so I know I'm entering into a dialogue with personalities who are both extremely competent and widely recognized.
As it seems to me that we already have enough to do with the Yann LeCun and Elon Musk "affair", I'm trying to share my point of view here as a new LLM producer, with no other ambition than to try to take the debate on the finalization of OSAID one step further.
For my part, I'm Michel-Marie MAUDET, a Linux user since 1992 and Managing Director of LINAGORA since 2000. Before LINAGORA, I contributed to several Open Source projects, including #StarOffice, and above all succeeded in bringing Open Source into the French Ministry of Defense. At LINAGORA, I've been able to contribute to the #SPIP CMS, and I've mainly devoted myself to recruiting much better developers than myself ;-)
They work either to develop our own software, like Twake Workplace today, or to bring high-level expertise to our customers. In 2016, we also launched the development of an ASR engine, which has now become the #LinTO product: an Open Source alternative to giants like Google Assistant and Amazon Alexa... oops!
Our relationship with OSI has gone through various strong emotional states. In the past, with our LinShare software, we tried to promote a license that we felt would give even more rights to our customers while enabling us to better fund our R&D. Simon Phipps finally convinced us... and we ended up "falling in line", adopting the Open Source licensing mindset promoted by OSI.
A year ago, I launched the OpenLLM community with a single objective: to create digital commons in the field of Generative AI, 100% Open Source, which for us means three conditions:
In this sense, we're aiming to be one of the few models that could be rated green in the Ethical AI framework proposed by the Nextcloud community.
Pre-training of our LUCIE model is underway on the Jean Zay supercomputer, and we expect the pre-training phase for our 7B model to be completed by mid-September 2024. By choosing to use only academic computing infrastructure located in France, we have to move more slowly than if we used Amazon infrastructure, for example... but that's another debate!
So I was able to join the OSI private discussion lists in October 2023 alongside Jean-Pierre LORRE, who was the truly active member of our team there. I followed the discussions and exchanged with Stefano on several occasions during this long process, and more closely in recent weeks.
I have observed the evolution of the #OSAID drafts, which recently dropped the publication of datasets from the mandatory requirements of the definition. It's worth noting, however, that the requirements concerning training datasets do provide clarity and weed out models that claim to be open source while their training datasets remain 100% opaque.
Today's "Open Weights" models will have to converse solely with this appellation and will no longer be able to claim to be Open Source in the sense of the #OSAID. But is this enough? and that's where points of view diverge.
Many of us, especially the early free-software advocates, are more in line with the philosophy of Marcus Aurelius. His thought is rooted in Stoicism and extends the idea that our passions and emotions stem from the judgments we make about things. This leads us to imagine a just, ideal world in which everyone is assumed to be well-intentioned.
But my experience in IT in general has shown me that it's sometimes necessary to use tactics, cunning... and, dare I say it, politics and influence to achieve one's goals. Let's face it: life is just a succession of imperfect compromises, but that's the way the world goes.
But let's get back to the heart of the matter. For LUCIE training, we made a double bet:
It's a gamble, one fraught with meaning and responsibility: there's no guarantee that we'll have a successful LLM in September! But I do think we'll make an important contribution to the state of the art which, depending on our results, could complement the empirical approaches that led, for example, to the publication of scaling laws.
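As a reminder for readers less familiar with the topic, the published scaling laws (for example the "Chinchilla" fit of Hoffmann et al., 2022) typically model pre-training loss L as a function of parameter count N and number of training tokens D:

```latex
% Chinchilla-style scaling law (Hoffmann et al., 2022):
% E is the irreducible loss; A, B, \alpha, \beta are empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Training a 7B model under a fixed academic compute budget is exactly the kind of data point that can confirm or refine such empirical fits.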
So, at this stage, I don't agree with Julia and Tom's proposal to replace the most sensitive data, such as Personally Identifiable Information (PII), with synthetic data.
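To make the disagreement concrete, here is a minimal sketch of what such a PII-to-synthetic-data substitution could look like. This is purely illustrative: the regexes, helper names and replacement strategy below are my own assumptions, not Julia and Tom's actual proposal nor any production pipeline.

```python
import re
import random

# Hypothetical, simplified PII detectors (real pipelines use far more
# robust NER-based detection, not just regexes).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d .-]{7,}\d")

def synthetic_email(_match):
    # Replace a real address with a random synthetic one.
    return f"user{random.randint(1000, 9999)}@example.org"

def synthetic_phone(_match):
    # Replace a real number with a random French-looking synthetic one.
    return "+33 " + " ".join(str(random.randint(10, 99)) for _ in range(4))

def scrub(text: str) -> str:
    """Substitute detected emails and phone numbers with synthetic values."""
    text = EMAIL_RE.sub(synthetic_email, text)
    text = PHONE_RE.sub(synthetic_phone, text)
    return text

sample = "Contact Jean at jean.dupont@linagora.com or +33 6 12 34 56 78."
print(scrub(sample))
```

Whether such a substitution preserves enough fidelity of the corpus for the model to be genuinely studied and reproduced is precisely the point under debate.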
However, I agree 100% with the rest of the arguments put forward by the Amazon teams. The publication of training data is an essential prerequisite for studying and understanding how a model works (particularly with regard to its preferences), and of course for re-training it. There's no doubt about that on either side.
In my opinion, the publication of training data offers other very important advantages:
Stefano has put forward the argument of federated learning. As far as I'm concerned, given the current maturity of this approach and the current level of interest in it across the AI ecosystem (including academic research), I think it would be wise either to treat it as an exception at this stage or simply to drop this argument, which may be adding fuel to the fire for nothing.
The approach proposed by the Nextcloud community is an interesting one, but we run the risk of falling into endless debates over model rankings of the type: "but I'm almost green, just not quite...".
For me, as I explained to Stefano, the OSAID should make it possible to clearly distinguish two states: Open Weights (models without strict publication of 100% of their datasets) and Open Source (models whose training datasets are 100% published under an OSI-compliant license).
However, I understand Stefano's arguments: as he quite rightly points out, if the publication of training data became a strictly mandatory requirement, it could severely limit the number of models recognized as compliant with OSAID... and even more so those that could be deployed for unrestricted commercial use.
I finally agreed with Stefano and have personally decided to publicly support version 0.0.8 of the draft. It's imperfect, as we all know, but I feel that what we really need is a first level of definition and an unequivocal clarification between Open Weights and Open Source.
Finally, with the evolution of model architectures, the advent of PLM and Agentic AI and, ultimately, the reduction in the compute required to train the next generation of models, I think we'll be obliged to amend OSAID as early as 2025.
In fact, this would already be an innovative principle we could adopt to bring greater agility to the drafting of licenses, much as is happening with the emergence of new Internet standards.
To conclude, I was very pleasantly surprised by the points Julia made during her presentation at FOSDEM. And Tom's latest posts confirm this pleasant feeling.
However, to my knowledge, Amazon is not (yet) a producer of LLMs. So I wonder what the reason could be for this fierce defense of the publication of training data (which, in my heart of hearts, I also support)?
And generally speaking, if we want to enable the creation of large, ethically constituted, 100% Open Source training datasets, we need to collectively answer an essential question: what reasons or incentives could be put in place to motivate rights holders to make their richest and most recent data available for training models that are 100% Open Source, and therefore freely accessible and usable?
I'm thinking in particular of the media, which would give us, with LUCIE for example, access to post-2010 societal data: a major contribution to representing the evolution of our society, our values and our culture.
Yes, of course: after all this long and tedious work to clarify things, our goal with LUCIE is to be 100% compliant with OSAID and also to obtain the green flag in @Nextcloud's Ethical AI framework. We will be publishing 100% of the datasets used for training, so that they can be studied, reused and, of course, improved if necessary.
I therefore call on all those who want clarification (at least a first level) in our AI ecosystem to take note of the latest draft here (https://opensource.org/deepdive/drafts) and to show responsibility, so that a version 1.0 of OSAID can be finalized by the end of June, this long process completed, and this first version of OSAID in our hands by October.
AI Leader · CEO/CTO · MBA · Founder · Xoogler
3 months ago
Now that Meta has ignored the OSI board's impotent attempt to curtail its abuse of Open Source (an attempt that would have been more powerful had the original Open Source Definition been used, per its author Bruce Perens' recommendation), it's time to go back to the drawing board. This is particularly true with the election of Trump and his dangerous combination of deregulation, protectionism/tariffs, and strengthened IP rights. Opaque OSAID-compliant models stuffed to the gills with copyrighted content will render those of us outside the US (including here in France) clients of BigTech and the content oligarchs over there. I encourage you and yours to dump your endorsement of the #OSAID and reinforce the meaningful Open Source Definition that has stood the test of time over the past quarter century, as we risk losing both, with more and more software being written by and incorporating AI: https://opensourcedeclaration.org/index-fr-fr.html You are also welcome to join the uncensored community-run discussions (https://discuss.opensourcedefinition.org) on the future of Open Source now that the OSI has abdicated its responsibilities: https://opensourcedefinition.org/wip
3 months ago
Your privileged access to the "OSI private discussion lists in October 2023" with Jean-Pierre LORRE appears to have left an impression, but does it not seem strange to you that you were granted access to what should have been an open process? A former OSI board member said in 2008 that "a process that is not open cannot be trusted to produce a product that can be considered open", so what's changed except some new faces? Are they now saying that Simon Phipps was wrong and that closed processes for open standards are now acceptable? This certainly makes our job of undermining it easier: https://opensource.org/blog/simon-phipps-was-right
3 months ago
The cabal on the OSI board and their paid servants rammed through a definition that has been totally ignored by the likes of Meta, who continue to use the term with abandon, and which has been roundly rejected by the Open Source community (Debian, FSF, etc.). The Software Freedom Conservancy has even demanded it be repealed and threatened to run against the board with a ticket I may yet join. Given that Amanda Brock revealed the membership didn't want this definition, and didn't approve it, there's a good chance they/we will win, but the damage may already have been done by then, so it's not my primary strategy. "While OSI claims their OSAID is humble, I beg to differ. The humble act now is to admit that it was just too soon to publish a "definition" and rebrand the OSAID 1.0 as "current recommendations". That might not grab as many headlines or raise as much money as the OSAID did, but it's the moral and ethical way out of this bad situation." I would go a step further and say the honourable thing to do at this point is to resign and allow the community to heal from this own goal. https://sfconservancy.org/blog/2024/oct/31/open-source-ai-definition-osaid-erodes-foss/
3 months ago
You were impressed by "exchanges" that ultimately led to posts like this? "The whole effort to "define open source AI" has been heartbreaking for many of us who care so very deeply about open source. It is not just the lack of transparency, biased narratives, and fallacious arguments that we have seen from the OSI — it's how participants who present differing viewpoints have been treated. I have been called a liar, accused of being an enemy of open source, had my contributions misrepresented and shut down, and had my job not-so-subtly threatened for authentically contributing my expertise — expertise that lies precisely at the intersection of open source and artificial intelligence. I have seen others experience the same." I can relate, by the way, as I've experienced the same while fighting for the future of our industry and those dependent on it. https://www.dhirubhai.net/posts/juliaferraioli_opensource-opensourceai-osi-activity-7256687721182605313-_wpd