Generative AI: How is the sausage made?
Ever since ChatGPT was released to the public (which happened just 12 months ago, even though it feels more like a decade), the tech industry has embarked on a never-ending frenzy of announcing and releasing new AI-powered tools and services. Pretty much all of these new AI tools fall under the category we call “Generative AI”, which Wikipedia describes as “artificial intelligence capable of generating text, images, or other media, using generative models. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics”. From OpenAI’s ChatGPT and DALL-E to Google’s Bard and Microsoft’s Copilot, and many more, we have all entered a new technological era. For the most part, we have been told that Generative AI will drastically transform the labor market, both in the short term and for years to come, transforming and/or eliminating vast numbers of white-collar jobs while also creating new kinds of jobs in the long run. As a consequence, I feel many of us are constantly navigating the tension between feeling genuinely excited and hopeful about the transformative opportunities Generative AI offers... while also feeling fearful of the risks it brings with it (from existential threats to humanity to millions of people losing their jobs, including ourselves). Such a great time to be alive!
However, I believe we have not paid as much attention to how the actual development and continuous improvement of Generative AI models and tools are already impacting society in general and the labor market in particular. That is to say, we have not been paying much attention to how the sausage is being made. Unfortunately, it would seem that most of the companies behind Generative AI products prefer that we know as little as possible about how they develop their models: from where they got the data to train these models, to the thousands and thousands of data annotators they employ (through little-known third-party companies) in countries with low wages and largely unregulated labor markets. Before we proceed, what is a data annotator? A data annotator is a worker who labels and classifies data to make it easier for machine learning algorithms to understand; this data can be in the form of images, text, audio, or video. Data annotators play an essential role in the development of AI systems by providing the high-quality labeled data that these systems need to train and learn. Data annotators have been and still are critical for the development and refinement of Large Language Models; they are also the protagonists of one of the two articles that have recently sparked my curiosity about this important topic:
a) ‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I. (The New York Times, article by Sheera Frenkel and Stuart Thompson published on July 15th, 2023). This article tackles an increasingly crucial debate: Developing and refining Large Language Models requires incredibly vast amounts of human-generated data, from texts to images to sounds; where do the companies developing Generative AI models obtain such data sets? Are they asking permission from their authors? Are they paying them? Should they? The development of LLMs has shown that data is the new gold: The more data you can train your model with, the better it will get. For instance, can OpenAI use the millions of articles written by The New York Times journalists over the past decades to train ChatGPT? Can it do so without the NYT’s permission? Can Google use the millions of pages of Wikipedia? What about the millions of fan-fiction novels available online? This article shows the degree to which authors are angry at AI companies, so much so that some of them have started suing them for using their work without their permission.
b) AI Is a Lot of Work (The Verge, article by Richard Parry published on June 20th, 2023). This detailed, insightful piece of investigative journalism should be mandatory reading for anyone interested in the generative AI industry and its impact on society at large. The author first exposes the network of data-annotation companies that big AI and tech companies such as Google, OpenAI, and Microsoft regularly employ as third-party providers of the labor force needed to train their Large Language Models. Data annotation is a big, lucrative business, and one that relies on rather exploitative labor practices; we could describe it as the poster child of what is generally called the “gig economy” model: a labor market that relies heavily on temporary and part-time positions filled by independent contractors and freelancers rather than full-time permanent employees. Moreover, these independent contractors are paid very low wages, have no job security or stability whatsoever, and see their tasks change constantly.
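To make the annotation work described above a bit more concrete, here is a minimal, purely hypothetical sketch (in Python, with made-up example sentences and labels) of what a labeled dataset looks like once annotators have done their job:

```python
# Toy illustration of what data annotation produces: raw inputs paired
# with human-assigned labels that a model can then learn from.
# (Hypothetical examples; real annotation tasks span text, images,
# audio, and video, and involve vastly larger volumes of data.)

raw_texts = [
    "This product arrived broken and late.",
    "Absolutely love it, works perfectly!",
    "It's okay, nothing special.",
]

# A human annotator reads each item and assigns a label:
annotations = [
    {"text": raw_texts[0], "label": "negative"},
    {"text": raw_texts[1], "label": "positive"},
    {"text": raw_texts[2], "label": "neutral"},
]

# The labeled pairs become supervised training data for a model:
training_pairs = [(a["text"], a["label"]) for a in annotations]
print(training_pairs[1])
```

Every one of those label assignments is a human judgment call, repeated thousands of times a day by the workers the article describes.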
In preparation for writing this post, I showed Richard Parry’s article to Bard and asked it, “How does this article make you feel?” Bard proceeded to write the following: “This article makes me feel sad because it describes a hidden workforce of people who are doing tedious and difficult work for low pay. It is also concerning that these workers are not given more information about what they are working on and that their work is being kept secret.”
If even a heartless Large Language Model feels bad about the working conditions of these data annotators, so should we!
What are your thoughts on this important issue? Do you think writers and artists have the right to be compensated for the use of their work to train LLMs? Do you think AI companies are treating their data annotation workforce fairly? I continue to be excited about the opportunities Generative AI tools offer us, but I strongly believe that excitement is fully compatible with being critical of how these tools are developed and of their potentially harmful effects on society. By the way, there is also an increasingly important debate over the environmental impact of AI development; we will cover it in future editions of MAYBE?
Please share your take and point of view! Also, if you have any articles or podcasts on this topic, please share them as well!