Data as a (Scarce) Resource in the Age of AI
Efecto Cafetería is a window to the world for me, where I share my ideas and experiences spontaneously and directly.
TL;DR: In this post, I discuss a scenario where data becomes scarce because everyone is using genAI, leading to a lack of content to feed LLMs. Want more details? Read the whole post ;p
Let's get to today's topic: is data a limiting resource in the development and rise of the AI era? Let's break it down:
GenAI Feeds on Data
Specifically (and at least so far), data - in terms of content - is mostly created by humans: blog posts, Wikipedia articles, artworks, songs, etc.
If we zoom in on genAI for programming, we can simplify the 'black box' inside LLMs: the main source of information they use to answer our questions is Stack Overflow, the main website developers go to for asking and resolving coding doubts.
In this sense, if, like me, you've been using tools like Poe or Copilot, you've probably noticed that their responses read as if someone had searched various sites on the internet, selected what best fits the problem, and presented it in an organized manner. It's like having someone who knows how to search Google or Bing much better than you and then explains what they found.
How Does GenAI Work?
From the outside, I write a prompt, and it returns (hence the generative part) an illustration, text, video, or code, responding to the request I made.
So, how does it work internally? To keep this short, we can say it works with what are known as large language models, or LLMs. These models are built on deep learning: rather than a single mathematical operation like logistic regression, they are a network of many layered operations that solves a problem (posed by our prompt, in this case) by breaking it down into thousands of small operations and iterating until reaching a near-optimal result. Easy, right? XD. With this diagram, you'll understand it much better (trust me ;)
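The "iterate many small operations until reaching an optimum" idea can be sketched with a toy example that has nothing to do with real LLM training scales (the function and all names below are my own invention): gradient descent nudging a single number toward the minimum of a simple curve.

```python
# Toy illustration of iterating small update steps toward an optimum:
# gradient descent minimizing f(w) = (w - 3)^2, whose minimum is at w = 3.

def gradient_descent(start: float, lr: float = 0.1, steps: int = 100) -> float:
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # one small update step
    return w

w_final = gradient_descent(start=0.0)
print(round(w_final, 4))  # converges very close to 3.0
```

Real LLM training does the same kind of thing, just with billions of parameters updated at once instead of one.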
Note: to delve deeper into this, I recommend following this gentleman from Barcelona, who led the machine learning team at Netflix when they started to take off as the giant they are today: Xavier (Xavi) Amatriain. I can imagine the scolding he would give me if he reads my explanation in this article; I'll send it to him to tempt fate...
Going back: this network of mathematical operations is performed on huge matrices where the content has been tokenized, meaning we convert Wikipedia articles, Stack Overflow comments, as well as artworks, novels, etc., into sequences of numbers (tokens). These tokens (after lemmatization and other preprocessing) are then mapped to numerical vectors called embeddings.
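A minimal sketch of the tokenization-and-embedding idea (the function names, the whitespace tokenizer, and the random vectors are all my own simplifications; real LLMs use learned subword tokenizers and trained embedding tables):

```python
import random

def build_vocab(texts):
    """Assign each distinct word an integer id (a toy 'tokenizer')."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free id
    return vocab

def tokenize(text, vocab):
    """Turn a sentence into a list of token ids."""
    return [vocab[w] for w in text.lower().split()]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
tokens = tokenize("the cat sat", vocab)
print(tokens)  # [0, 1, 2]

# Embedding table: one small random vector per token id.
random.seed(0)
embeddings = {i: [random.random() for _ in range(4)] for i in vocab.values()}
vectors = [embeddings[t] for t in tokens]  # the numbers the model actually sees
```

The point is just that everything we write ends up as lists of numbers before any model touches it.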
To avoid extending this too much, it's important to consider both the model used and the raw material it is trained on. It's an interesting debate on which of these two is more important.
Data: A Scarce Resource?
We've come from a time when we talked about the data deluge, the digital tsunami, and how we were drowning in enormous amounts of data. However, having described how genAI works, we see it feeds on the data it has available. So far, so good: lots of data, sophisticated models, great genAI systems, right?
In this sense, no matter how large the volume of data we create seems, it could be insufficient, even scarce if we consider: quality data, recent data, and data without usage restrictions.
If we focus on the use case of genAI for coding, an AI like Copilot (which started within GitHub in 2021, slightly before the ChatGPT boom) serves as an assistant for performing a specific task: "how can I program the Pong game?" or "can you write an example with Pandas to calculate the moving sum in a dataframe?"
When I ask these questions, our assistant will draw on the entries incorporated into its model that relate to the request. Note that these entries were written at some point, somewhere, by people; current AI doesn't go to an O'Reilly Python manual and work out how to perform a task after "thinking" about it - it doesn't work that way. By choosing one input or combining several (Stack Overflow posts, online tutorials, forums, YouTube transcripts, etc.), it gives us an elaborate explanation of how to write the script. It seems like magic, but it isn't.
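For instance, the moving-sum question above has a concrete answer. Here is a plain-Python sketch (in pandas the equivalent would be something like `df['x'].rolling(window=3).sum()`, but the version below has no dependencies):

```python
def moving_sum(values, window=3):
    """Moving (rolling) sum: the sum of the last `window` values at each
    position, or None while there aren't enough values yet."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # window not yet full
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out

print(moving_sum([1, 2, 3, 4, 5]))  # [None, None, 6, 9, 12]
```

This is exactly the kind of snippet an assistant like Copilot stitches together from the human-written examples it was trained on.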
So, as we use Copilot instead of writing on Stack Overflow and waiting for multiple people to respond - and then replying back, getting annoyed because some noob said we have no idea, and along the way finding a solution in a post from four years ago - we end up creating our content with genAI. What's already happening is that the rate at which 'primary' content is generated is decreasing.
Therefore, the raw material from which LLMs derive their 'wisdom' also shrinks. Their ability to respond accurately and effectively in the coming months and years could drop dramatically, because there are no longer people writing and solving each other's doubts; instead, we all ask AI whenever we hit a coding block (which is very common, at least in my case).
AI Feeding on itself?
If we move to a scenario where (and it seems to be happening already) there are more images, texts, and even music and animation created by genAI, this 'AI-made' content, which was initially created from a combination of works by people, could start to be introduced into LLM models as well. In fact, it will be difficult to discriminate in the enormous databases used to train AI what is genuinely human and what is not.
What can happen then? Does AI have 'creative' capacity feeding on itself, or are we heading towards a world where creativity gradually fades away? Maybe there will be disruptions we don't foresee, and we are witnessing the advent of new artistic/creative styles resulting from LLMs?
Similarity with the Aquaculture Dilemma: Fish Fed with Other Fish
Off-topic: the promise that aquaculture would solve the problem of overfishing the oceans came with a caveat I learned during my Marine Science studies at university: in production plants, the pellets used to feed fish raised in large tanks for consumption are made from fishmeal, which in turn has been caught in the sea.
So, the more fish cultivated in captivity, the greater the fishing pressure to feed those specimens.
Some research lines over the years have sought to replace fishmeal (or krill, those mini crustaceans also extracted from the sea) with alternative sources like insects, genetically modified plants to contain the essential amino acids fish need (found in other fish), or synthetic food.
More Questions Than Answers, Dear Sam
More questions come to mind than answers when returning to the scenario of data scarcity to feed LLMs:
- Is it feasible to go for mass adoption if we won’t have primary content for these genAIs to be effective?
- If AI ends up feeding on its own creations, will we stagnate?
- Does AI have enough creativity to evolve in this case?
- Perhaps people have been doing something similar all along - combining inputs plus context and randomness - and that's what we call creativity, believing we are little gods? Ultimately, no one is capable of creating something from nothing; there's always something behind it. Although, mind you, the 'model' from which our mind derives content is not an LLM - we operate differently from machines, and here we enter topics we don't fully understand (yet?)
- Intuition also comes into play here. What intuition can an AI develop? For now, I'd say it doesn't have any, but... who knows
- To what extent is it favorable or not to limit the use of content to train LLMs?
Where There Is Scarcity, There Is Opportunity and Innovation
Finally, wherever there is a scarce resource, business opportunities arise, and perhaps monetization models based on data will undergo a revolution. For example, how much are my data worth? Everything I've written, recorded, etc. What volume does it have and how is it being used? Could a monetization system based on the content each person generates be organized and thus partially solve the problem of future job scarcity due to automation?
Conclusion
In the current framework of LLMs, the mass adoption of genAI use could lead to an impoverishment of these same models and the content generated in the future, causing a scarcity in the raw material of genAI: data, our content.
This post was originally published on July 20th, 2024 in Efecto Cafetería. You are very welcome to visit, like, share, and subscribe on my Substack.