Data as a (Scarce) Resource in the Age of AI
Efecto Cafetería is a window to the world for me, where I share my ideas and experiences spontaneously and directly.
TL;DR: In this post, I discuss a scenario where data becomes scarce because everyone is using genAI, leading to a lack of content to feed LLMs. Want more details? Read the whole post ;p
Let's get to today's topic: is data a limiting resource in the development and rise of the AI era? Let's break it down:
GenAI Feeds on Data
Specifically (and at least so far), data - in terms of content - is mostly created by humans: blog posts, Wikipedia articles, artworks, songs, etc.
If we zoom in on genAI for programming, we can simplify the 'black box' inside LLMs: the main source of information they use to answer our questions is Stack Overflow, the main website developers go to for asking and resolving coding doubts.
In this sense, if, like me, you've been using tools like Poe or Copilot, you've probably noticed that their responses read as if someone had searched various sites on the internet, selected what best fits the problem, and presented it in an organized manner. It's like having someone who knows how to search Google or Bing much better than you and then explains what they found.
How Does GenAI Work?
From the outside, I write a prompt, and it returns (hence the generative part) an illustration, text, video, or code, responding to the request I made.
So, how does it work internally? To keep this short, we can say it works with what are known as large language models, or LLMs. These models are built on deep learning: rather than a single mathematical operation like logistic regression, they are a network of many layered operations that solves a problem (posed by our prompt, in this case) by breaking it down into thousands of small operations and iterating until reaching a near-optimal result. Easy, right? XD. With this diagram, you'll understand it much better (trust me ;)
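The "iterate many small operations until reaching an optimum" idea can be sketched with a toy example that has nothing to do with real LLM training scales (the function and all names below are my own invention): gradient descent nudging a single number toward the minimum of a simple curve.

```python
# Toy illustration of iterating small update steps toward an optimum:
# gradient descent minimizing f(w) = (w - 3)^2, whose minimum is at w = 3.

def gradient_descent(start: float, lr: float = 0.1, steps: int = 100) -> float:
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # one small update step
    return w

w_final = gradient_descent(start=0.0)
print(round(w_final, 4))  # converges very close to 3.0
```

Real LLM training does the same kind of thing, just with billions of parameters updated at once instead of one.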
Note: to delve deeper into this, I recommend following this gentleman from Barcelona, who led the machine learning team at Netflix when they started to take off as the giant they are today: Xavier (Xavi) Amatriain. I can imagine the scolding he would give me if he reads my explanation in this article; I'll send it to him to tempt fate...
Going back: this network of mathematical operations is performed on huge matrices where the content has been tokenized, meaning we convert Wikipedia articles, Stack Overflow comments, as well as artworks, novels, etc., into sequences of numbers (tokens). These tokens (after lemmatization and other preprocessing) are then mapped to numerical vectors called embeddings.
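A minimal sketch of the tokenization-and-embedding idea (the function names, the whitespace tokenizer, and the random vectors are all my own simplifications; real LLMs use learned subword tokenizers and trained embedding tables):

```python
import random

def build_vocab(texts):
    """Assign each distinct word an integer id (a toy 'tokenizer')."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free id
    return vocab

def tokenize(text, vocab):
    """Turn a sentence into a list of token ids."""
    return [vocab[w] for w in text.lower().split()]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
tokens = tokenize("the cat sat", vocab)
print(tokens)  # [0, 1, 2]

# Embedding table: one small random vector per token id.
random.seed(0)
embeddings = {i: [random.random() for _ in range(4)] for i in vocab.values()}
vectors = [embeddings[t] for t in tokens]  # the numbers the model actually sees
```

The point is just that everything we write ends up as lists of numbers before any model touches it.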
To avoid extending this too much, it's important to consider both the model used and the raw material it is trained on. It's an interesting debate on which of these two is more important.
Data: A Scarce Resource?
We've come from a time when we talked about the data deluge, the digital tsunami, and how we were drowning in enormous amounts of data. However, having described how genAI works, we see it feeds on the data it has available. So far, so good: lots of data, sophisticated models, great genAI systems, right?
In this sense, no matter how large the volume of data we create seems, it could be insufficient, even scarce if we consider: quality data, recent data, and data without usage restrictions.
If we focus on the use case of genAI for coding, an AI like Copilot (which started within GitHub in 2021, slightly before the ChatGPT boom) serves as an assistant for performing a specific task: "how can I program the Pong game?" or "can you write an example with Pandas to calculate the moving sum in a dataframe?"
When I ask these questions, our assistant will draw on the entries incorporated into its model that relate to the request. Note that these entries were written at some point, somewhere, by people; current AI doesn't go to an O'Reilly Python manual and work out how to perform a task after "thinking" about it - it doesn't work that way. By choosing one input or combining several (Stack Overflow posts, online tutorials, forums, YouTube transcripts, etc.), it gives us an elaborate explanation of how to write the script. It seems like magic, but it isn't.
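For instance, the moving-sum question above has a concrete answer. Here is a plain-Python sketch (in pandas the equivalent would be something like `df['x'].rolling(window=3).sum()`, but the version below has no dependencies):

```python
def moving_sum(values, window=3):
    """Moving (rolling) sum: the sum of the last `window` values at each
    position, or None while there aren't enough values yet."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # window not yet full
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out

print(moving_sum([1, 2, 3, 4, 5]))  # [None, None, 6, 9, 12]
```

This is exactly the kind of snippet an assistant like Copilot stitches together from the human-written examples it was trained on.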
So, as we use Copilot instead of writing on Stack Overflow and waiting for multiple people to respond - and then replying back, getting annoyed because some noob said we have no idea, and along the way finding a solution in a post from four years ago - we end up creating our content with genAI. What's already happening is that the rate at which 'primary' content is generated is decreasing.
Therefore, the raw material from which LLMs derive their 'wisdom' also shrinks. Their ability to respond accurately and effectively in the coming months and years could drop dramatically, because there are no longer people writing and solving each other's doubts; instead, we all ask AI whenever we hit a coding block (which is very common, at least in my case).
AI Feeding on itself?
If we move to a scenario where (and it seems to be happening already) there are more images, texts, and even music and animation created by genAI, this 'AI-made' content, which was initially created from a combination of works by people, could start to be introduced into LLM models as well. In fact, it will be difficult to discriminate in the enormous databases used to train AI what is genuinely human and what is not.
What can happen then? Does AI have 'creative' capacity feeding on itself, or are we heading towards a world where creativity gradually fades away? Maybe there will be disruptions we don't foresee, and we are witnessing the advent of new artistic/creative styles resulting from LLMs?
Similarity with the Aquaculture Dilemma: Fish Fed with Other Fish
Off-topic: the promise that aquaculture would solve the problem of overfishing the oceans came with a caveat I learned during my Marine Science studies at university: in production plants, the pellets used to feed fish raised in large tanks for consumption are made from fishmeal, which in turn has been caught in the sea.
So, the more fish cultivated in captivity, the greater the fishing pressure to feed those specimens.
Some research lines over the years have sought to replace fishmeal (or krill, those mini crustaceans also extracted from the sea) with alternative sources like insects, genetically modified plants to contain the essential amino acids fish need (found in other fish), or synthetic food.
More Questions Than Answers, Dear Sam
More questions come to mind than answers when returning to the scenario of data scarcity to feed LLMs:
- Is it feasible to go for mass adoption if we won’t have primary content for these genAIs to be effective?
- If AI ends up feeding on its own creations, will we stagnate?
- Does AI have enough creativity to evolve in this case?
- Perhaps people have been doing something similar all along - combining inputs plus context and randomness - and that's what we call creativity, believing we are little gods? Ultimately, no one is capable of creating something from nothing; there's always something behind it. Although, mind you, the 'model' from which our mind derives content is not an LLM - we operate differently from machines, and here we enter topics we don't fully understand (yet?)
- Intuition also comes into play here. What intuition can an AI develop? For now, I'd say it doesn't have any, but... who knows
- To what extent is it favorable or not to limit the use of content to train LLMs?
Where There Is Scarcity, There Is Opportunity and Innovation
Finally, wherever there is a scarce resource, business opportunities arise, and perhaps monetization models based on data will undergo a revolution. For example, how much are my data worth? Everything I've written, recorded, etc. What volume does it have and how is it being used? Could a monetization system based on the content each person generates be organized and thus partially solve the problem of future job scarcity due to automation?
Conclusion
In the current framework of LLMs, the mass adoption of genAI use could lead to an impoverishment of these same models and the content generated in the future, causing a scarcity in the raw material of genAI: data, our content.
This post was originally published on July 20th, 2024 in Efecto Cafetería. You are very welcome to visit, like, share, and subscribe on my Substack.