LLMs are running out of data: what is BigTech doing? Synthetic data, anyone?
It is no secret that large language models (like ChatGPT) consume huge amounts of data of various kinds.
To put this in perspective, the training set of the GPT-3 model (175B parameters) includes the whole of English Wikipedia (22 GB) as a tiny fraction, roughly 0.05%, of its total 45 TB of data.
At the Sohn conference in May 2023, Sam Altman (OpenAI) explicitly said that AI will run out of (publicly available) data.
Given this trend of data consumption, the natural solution was, and still is, to generate data, i.e. synthetic data.
Not to brag, but back in 2018 I already wrote about the rise of synthetic data and why it was something we would hear more about in the future.
That was exactly my case in 2018: I had clients who did not have data of the necessary quality, yet we needed to deliver a functioning AI model to them.
So we first developed an AI model to generate the data, then used that data to develop the anomaly detection model for the clients (a toy sketch of the pattern follows below).
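To make the pattern concrete, here is a minimal sketch, hypothetical and not the actual client system: fit a simple generative model to a small seed of imperfect real data, sample a larger synthetic training set, then train the anomaly detector on it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in for the small, imperfect real dataset available from the client.
seed_real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(50, 2))

# Step 1: the "generator" -- here just a Gaussian fitted to the seed data
# (in our real project this step was itself an AI model).
mean = seed_real.mean(axis=0)
cov = np.cov(seed_real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=5000)

# Step 2: train the anomaly detector on the synthetic data.
detector = IsolationForest(contamination=0.01, random_state=0).fit(synthetic)

# Score new observations: +1 = normal, -1 = anomaly.
print(detector.predict([[0.0, 5.0], [10.0, -10.0]]))
```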
Necessity really is the mother of invention.
But back to what BigTechs are doing today.
They first tried to loosen their privacy policies to allow the use of user-generated content in training their models.
For example, Google can now use the text inside users' Google Docs, or YouTube videos, to train its models.
Meta does not have this privilege and is reportedly trying to buy a large publisher to get access to long texts.
All these attempts are clearly temporary fixes.
If what we need to improve performance is a greater quantity of data, not higher quality, then we need to generate more data.
But how?
What BigTechs and labs are doing now is simply using LLMs to generate more text, videos or images.
In plain English, the two main techniques at play (sketched below) are:
- adversarial models: one model produces the text (or image) while an adversary judges/corrects it;
- models 'somehow' supervised by humans when the output is wrong (technically called Reinforcement Learning from Human Feedback, RLHF).
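Schematically, the adversarial setup looks something like this. A minimal sketch: generate_text() and judge_text() are hypothetical placeholders for calls to two different LLMs, not a real API.

```python
# Hypothetical generator/judge loop: one LLM writes, an adversary LLM scores
# and critiques, and the generator retries until the judge is satisfied.

def generate_text(prompt: str, feedback: str | None = None) -> str:
    """Generator LLM: produce a candidate, optionally revised using feedback."""
    raise NotImplementedError  # placeholder for a real model call

def judge_text(text: str) -> tuple[float, str]:
    """Adversary LLM: return a quality score in [0, 1] and a critique."""
    raise NotImplementedError  # placeholder for a real model call

def adversarial_generate(prompt: str, threshold: float = 0.8, max_rounds: int = 5) -> str:
    """Regenerate until the judge accepts the candidate (or rounds run out)."""
    feedback, candidate = None, ""
    for _ in range(max_rounds):
        candidate = generate_text(prompt, feedback)
        score, feedback = judge_text(candidate)
        if score >= threshold:
            break  # accepted: keep as synthetic training data
    return candidate
```

As for RLHF, its core is a reward model trained on human preference pairs; the usual pairwise (Bradley-Terry) loss, sketched here in numpy, simply pushes the reward of the human-preferred answer above that of the rejected one:

```python
import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # -log(sigmoid(r_chosen - r_rejected)), written stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -(r_chosen - r_rejected))))
```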
These techniques will definitely work… to an extent. But nobody knows exactly ‘how far’.
Can an AI judge an AI that generates data?
If yes, we are going ‘far’.
If not, we are going ‘less far’.
While I am an enthusiastic proponent of synthetic data in many, many AI applications (cybersecurity above all), I am not entirely convinced you can build generalist models by just ingesting huge amounts of unverified synthetic data.
As I said before, the 'quantity approach' to improving AI performance may reach its limits soon, if it has not already. The 'quality' of data or models may lead the next wave of performance gains.
#ai #artificialintelligence #business #technology #data #innovation
Comment (7 months ago), from a Leadership and Keynote Speaker and member of the Data Science Research Centre at University of Derby:
This approach completely boggles me. We know that in the field of ordinary AI, models ingesting the output of other models leads to model collapse fairly rapidly. Synthetic data depends on models to create that data. How do we know whether the distributions of the various parameters are of sufficient diversity and match the real world? Overfitting is very quickly achieved. Given that GenAI is prompt autocomplete on steroids, and is only a stochastic parrot, why do these Tech Bros believe that they need any more data? They are not capable of building knowledge models. There is nothing in the mathematics that can lead to a knowledge model, understanding, correctness or any other capability that can be trusted.
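For what it is worth, the collapse effect the commenter describes is easy to reproduce in a toy setting. A minimal sketch, assuming nothing about real training pipelines: fit a Gaussian to data, resample from the fit, refit, and repeat with no fresh real data.

```python
# Toy model collapse: each generation is fitted only to samples produced by
# the previous generation's model. With no fresh real data, the fitted
# standard deviation tends to drift towards zero, i.e. diversity is lost.
import numpy as np

rng = np.random.default_rng(0)
n = 20
data = rng.normal(0.0, 1.0, size=n)  # generation 0: "real" data

for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # "train" on the current data
    data = rng.normal(mu, sigma, size=n)  # next generation: synthetic only
    if gen % 40 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.4f}")
```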