Has Generative AI gone MAD?
Introduction
Much has been written in recent months about Generative AI (GenAI) having gone MAD. Evidence of this phenomenon was provided by a research study conducted at Rice University, USA, led by Prof. Richard G. Baraniuk and published as a research paper in July 2023. MAD stands for “Model Autophagy Disorder”, so named because of the similarity of this dysfunction to the infamous Mad Cow Disease. Autophagy, literally “self-consumption”, refers to the degeneration triggered by self-consuming loops. In the case of cows, the disease was caused by feeding cattle with protein rendered from the remains of slaughtered cattle, some of it infected. In the case of GenAI models, the analogous disorder is caused by the excessive use of synthetic data, rather than fresh real-world data, in the training and fine-tuning of successive generations of models. The research established that this phenomenon adversely affects the performance of GenAI.
The Rice University study focused on degeneration in the performance of text-to-image models such as DALL-E, Stable Diffusion and Midjourney. These models increasingly use synthetic data in repeated iterations to fine-tune a previous-generation model into a newer-generation model, because the supply of fresh data faces headwinds due to privacy concerns. Synthetic data, on the other hand, is relatively freely and abundantly available. According to a Gartner study, “nearly 60 percent of all data used in the development of AI will be synthetic data by the end of 2024”. The excessive use of synthetic data results in rapid deterioration in the quality of the model’s output, its diversity, or both. When users opt for “cherry picking”, favouring output of higher quality over diversity, the generated output preserves its quality but displays a much steeper decline in diversity, as the toy sketch below illustrates.
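To make the self-consuming loop concrete, here is a minimal toy sketch (our own illustration in Python, not the Rice experiment itself). A one-dimensional Gaussian stands in for a generative model; each generation is refitted to cherry-picked samples of the previous generation’s output, and diversity, measured by the standard deviation, collapses within a few generations while “quality” (closeness to the mode) is preserved.

```python
# Toy demonstration of a self-consuming (autophagous) training loop.
# Assumption: a 1-D Gaussian stands in for a generative model, and
# "training" is just a maximum-likelihood fit of its mean and std.
import numpy as np

rng = np.random.default_rng(0)

def train(data):
    """Fit the toy 'model' (a Gaussian) to a dataset."""
    return data.mean(), data.std()

real_data = rng.normal(loc=0.0, scale=1.0, size=5_000)  # fresh real data
mu, sigma = train(real_data)

for generation in range(1, 6):
    samples = rng.normal(mu, sigma, size=5_000)   # synthetic training data
    # Cherry-picking: keep only the 50% of samples closest to the mode,
    # i.e. favour "quality" over diversity.
    quality = -np.abs(samples - mu)
    kept = samples[np.argsort(quality)[-2_500:]]
    mu, sigma = train(kept)                       # self-consuming step
    print(f"generation {generation}: std = {sigma:.3f}")

# The std collapses geometrically, shrinking by a constant factor each
# generation: quality is preserved, but diversity is lost.
```

Each refit sees only the central half of the previous model’s output, so the fitted spread contracts generation after generation; this is the quality-versus-diversity trade-off of cherry-picking in miniature.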
Consequences of MADness
If unchecked, the consequences of MADness can be devastating. The Internet may get increasingly inundated with synthetic data generated by GenAI models rather than real data. Synthetic data created by GenAI models is obviously not new or fresh; it is merely old data recycled again and again at high speed. Consequently, the data populating the Internet would steadily become incoherent, and the performance of future generations of GenAI models trained on it would keep worsening, rendering them virtually ineffective. Moreover, it will become increasingly difficult to detect AI-generated content and filter it out of the training sets of new models.
MADness in GenAI models can be countered by several mitigation strategies, which essentially emphasize reduced use of synthetic data and the continual injection of fresh real-world data into training and fine-tuning. By ensuring that AI models are trained on a diverse, high-quality dataset that includes human-generated content, it is possible to prevent them from falling into the MAD trap. It is also recommended that the training strategy be adjusted dynamically based on the quality and diversity of incoming data. However, these strategies can be challenging to execute, as fresh real data is hard to come by. A toy illustration of the fresh-data injection strategy follows.
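The sketch below extends the Gaussian example above. The 20% mixing fraction is an illustrative assumption, not a published recommendation; the point is only that a steady stream of real data anchors the model to the true distribution.

```python
# Toy mitigation: mix a fixed fraction of fresh real-world data into
# every generation's training set. FRESH_FRACTION is an assumed,
# illustrative value, not a recommended setting.
import numpy as np

rng = np.random.default_rng(1)
N, FRESH_FRACTION = 5_000, 0.2

mu, sigma = 0.0, 1.0  # generation-0 model, fitted on real data
for generation in range(1, 21):
    n_fresh = int(N * FRESH_FRACTION)
    synthetic = rng.normal(mu, sigma, size=N - n_fresh)  # model's own output
    fresh = rng.normal(0.0, 1.0, size=n_fresh)           # new real-world data
    training_set = np.concatenate([synthetic, fresh])
    mu, sigma = training_set.mean(), training_set.std()
    if generation % 5 == 0:
        print(f"generation {generation}: std = {sigma:.3f}")

# With continual injection of real data the std stays anchored near the
# true value of 1.0; with FRESH_FRACTION = 0.0 it slowly drifts away.
```

The design point is the fixed point of the mixture: as long as some fraction of every training set comes from the real distribution, estimation errors stop compounding across generations.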
Model Collapse and MADness
Though the Rice University research was limited to text-to-image models, its authors contend that the same conclusions apply to the output of Large Language Models (LLMs) such as ChatGPT and Gemini. A study published in Nature in July 2024 found similar results for LLMs. It observed that the indiscriminate use of model-generated content (that is, synthetic data) in training causes irreversible defects in the resulting models, a phenomenon it termed “model collapse”: “a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation”, so that “over time, the models forget the true underlying data distribution”. The defects are irreversible, with the tails of the original content distribution disappearing. Two types of model collapse have been identified, early and late. In early model collapse, the model begins to lose information about the tails of the distribution within the first few iterations. In late model collapse, the model converges over many iterations to a data distribution vastly different from the original.
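The early loss of tails is easy to reproduce on a toy model. In the sketch below (our own illustration, not the Nature experiment), the “language model” is just a table of token frequencies over a Zipf-distributed vocabulary; each generation re-estimates the table from a finite sample of the previous generation’s output, and rare tokens vanish first, irreversibly.

```python
# Toy demonstration of early model collapse: the tail of a Zipfian
# token distribution disappears under recursive training.
# Assumptions: the "model" is an empirical frequency table, and each
# generation trains only on the previous generation's synthetic corpus.
import numpy as np

rng = np.random.default_rng(3)
V, N = 1_000, 5_000                    # vocabulary size, tokens per corpus

probs = 1.0 / np.arange(1, V + 1)      # Zipfian "true" distribution
probs /= probs.sum()

for generation in range(1, 11):
    counts = rng.multinomial(N, probs) # synthetic corpus for this generation
    probs = counts / N                 # re-estimated model, no fresh data
    print(f"generation {generation}: surviving tokens = {(probs > 0).sum()}")

# Rare tokens that happen to draw zero counts get probability zero and
# can never return: the tail of the distribution disappears long before
# the frequent tokens are affected.
```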
Synthetic Data: A Necessary Evil?
Interestingly, several articles published in the past have lauded the availability of synthetic data. In a 2022 article, IBM defines synthetic data as “data that has been created artificially through computer simulation or data generated by algorithms to take the place of real-world data”. Despite being artificial, synthetic data replicates the characteristics and statistical properties of real-world data and has no limitation of size, time, or location. Developers thus have complete control over the data, including its volume, distribution, and specific characteristics. Synthetic data can be generated consistently and can be more cost-effective than collecting and cleaning real-world data. Moreover, it obviates the need for manual data labeling, a process that can be time-consuming, costly and difficult. Its advantages include scalability, the provision of training data for edge cases (cases for which real data is very unlikely to exist), ease of use, absence of privacy concerns and a possible reduction in bias. The claimed reduction in bias is contestable, however, since synthetic data inherits whatever bias is contained in the real data from which it is derived. Another article published in the same year observed that collecting quality data from the real world is complicated, expensive and time-consuming, whereas synthetic data is available on demand, in whatever volume is required, and tailored to precise specifications; synthetic data thereby offers the advantage of “democratizing access to data at scale”. Yet despite these advantages, the potential degeneration caused by the indiscriminate use of synthetic data in the training and fine-tuning of GenAI models, whether LLMs or text-to-image models, has since emerged as a serious concern. A toy sketch of how synthetic data can be generated from the statistics of a small real dataset follows.
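Here is a minimal sketch of the idea: fit the joint statistics of a small “real” dataset, then sample an arbitrarily large synthetic dataset from the fit. A multivariate Gaussian is the simplest possible generator, and a deliberate simplification here; production synthetic-data tools use far richer models, but the principle, including the inheritance of the source data’s biases, is the same.

```python
# Minimal synthetic-data generator: fit a multivariate Gaussian to a
# small "real" dataset, then sample unlimited synthetic rows from it.
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for scarce real-world data: 200 rows, 3 correlated features.
true_cov = np.array([[1.0, 0.6, 0.2],
                     [0.6, 1.0, 0.4],
                     [0.2, 0.4, 1.0]])
real = rng.multivariate_normal(mean=[10.0, 5.0, 0.0], cov=true_cov, size=200)

# "Train" the generator on the real data's mean and covariance...
mu_hat = real.mean(axis=0)
cov_hat = np.cov(real, rowvar=False)

# ...then produce as much synthetic data as needed, on demand.
synthetic = rng.multivariate_normal(mean=mu_hat, cov=cov_hat, size=100_000)

print("real means:     ", np.round(real.mean(axis=0), 2))
print("synthetic means:", np.round(synthetic.mean(axis=0), 2))
print("correlations match:",
      np.allclose(np.corrcoef(real, rowvar=False),
                  np.corrcoef(synthetic, rowvar=False), atol=0.05))

# The synthetic rows mirror the real data's statistical properties,
# including any bias the real sample carried in the first place.
```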
What did Gemini have to say about MADness?
To find out what Gemini, the well-known LLM created by Google, had to say on this subject, we prompted it with a few questions about MADness in general and about its own MADness in particular. We received some generic responses, but no concrete answer about the mitigation strategies being adopted by the Google team working on Gemini. Some of the responses received from Gemini are reproduced below:
a) While the concept of MAD is concerning, it doesn't mean that GenAI is inherently "mad" or uncontrollable.
b) New techniques and advancements in GenAI are being developed continuously, which can help mitigate the risks of MAD.
c) As we continue to improve data collection and curation, we can reduce the likelihood of models becoming overly reliant on AI-generated content.
d) Human oversight and intervention will always be essential in preventing GenAI from becoming MAD.
When asked whether GenAI has already become MAD and if not, to predict a timeline for it to become fully MAD, it provided the following response:
“It is unlikely that GenAI has reached or will ever reach a definitively final stage of MADness. Moreover, it is difficult to predict a specific year when GenAI might reach a critical point of MADness. The timeline will depend on various factors, including advancements in the technology, changes in data practices, and the effectiveness of human oversight. However, it is important to remain vigilant and proactive. By addressing the risks of MAD early on, we can help ensure that GenAI continues to be a valuable and beneficial tool”.
Fortunately, these responses do not give the impression that GenAI in general, or Gemini in particular, has gone MAD. More extensive research is perhaps required to examine the symptoms of MADness and the mitigation strategies that could reduce the potential for MAD dysfunction in LLMs like Gemini.
Conclusion
MADness, or model collapse, is a serious concern for the future of GenAI. The importance of fresh real-life data in the training and fine-tuning of GenAI models cannot be over-emphasized. Access to genuine human-generated content, over and above synthetic data, is crucial but challenging to secure. Without such measures, the effectiveness of GenAI could be severely dented.
References
1. Aitish Bhatia, “AI Going MAD? The Model Collapse Problem Gets More Attention Now”, Mind Matters, 28th August 2024, https://mindmatters.ai/2024/08/ai-going-mad-the-model-collapse-problem-gets-more-attention-now/
2. Deepu Varghese, “Why GenAI can become a threat to itself”, Times of India, 23rd August 2024
3. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot and Ross Anderson, “The Curse of Recursion: Training on Generated Data Makes Models Forget”, arXiv preprint, 14th April 2024, https://doi.org/10.48550/arXiv.2305.17493
4. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson and Yarin Gal, “AI models collapse when trained on recursively generated data”, Nature, 24th July 2024, https://doi.org/10.1038/s41586-024-07566-y
5. IBM, “What is synthetic data?”, https://www.ibm.com/topics/synthetic-data, date not mentioned
6. Paul Mah, “Training AI With Outputs of Generative AI Is Mad”, CDOTrends, 19th July 2023, https://www.cdotrends.com/story/18288/training-ai-outputs-generative-ai-mad
7. Rob Toews, “Synthetic Data Is About To Transform Artificial Intelligence”, Forbes, 12th June 2022, https://www.forbes.com/sites/robtoews/2022/06/12/synthetic-data-is-about-to-transform-artificial-intelligence/#:~:text=In%20a%20nutshell%2C%20synthetic%20data%20technology%20enables%20practitioners,will%20be%20synthetic%20rather%20than%20real%20by%202024.
8. Rodielon Putol, “Could generative AI go MAD and wreck internet data?”, Earth.com, 1st August 2024, https://www.earth.com/news/could-generative-ai-go-mad-and-wreck-internet-data/
9. Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi and Richard G. Baraniuk, “Self-Consuming Generative Models Go MAD”, arXiv preprint, 4th July 2023