LLMs are running out of data: what is BigTech doing? Synthetic data, anyone?
It is no secret that large language models (like ChatGPT) consume huge amounts of data of various kinds.
To put this in perspective, the training set of the GPT-3 model (175B parameters) includes the whole of English Wikipedia (22 GB) as a tiny fraction, roughly 0.05%, of its total 45 TB of data.
At the Sohn conference in May 2023, Sam Altman (OpenAI) explicitly said that AI will run out of (publicly available) data.
Given this trend of data consumption, the natural solution was, and still is, to generate data, i.e. synthetic data.
Not to brag, but back in 2018 I already wrote about the rise of synthetic data and why it was something we would hear more about in the future.
That was exactly my case in 2018: I had clients who did not have data of the necessary quality, yet we needed to deliver a functioning AI model to them.
So we first developed an AI model to generate the data, then used that data to develop the anomaly detection model for the clients (a toy sketch of the pattern follows below).
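To make the pattern concrete, here is a minimal sketch, hypothetical and not the actual client system: fit a simple generative model to a small seed of imperfect real data, sample a larger synthetic training set, then train the anomaly detector on it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in for the small, imperfect real dataset available from the client.
seed_real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(50, 2))

# Step 1: the "generator" -- here just a Gaussian fitted to the seed data
# (in our real project this step was itself an AI model).
mean = seed_real.mean(axis=0)
cov = np.cov(seed_real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=5000)

# Step 2: train the anomaly detector on the synthetic data.
detector = IsolationForest(contamination=0.01, random_state=0).fit(synthetic)

# Score new observations: +1 = normal, -1 = anomaly.
print(detector.predict([[0.0, 5.0], [10.0, -10.0]]))
```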
Necessity really is the mother of invention.
But back to what BigTechs are doing today.
They first tried to loosen their privacy policies to allow the use of user-generated content in training their models.
For example, Google can now use the text inside users' Google Docs, or YouTube videos, to train its models.
Meta does not have this privilege and is reportedly trying to buy a large publisher to get access to long texts.
All these attempts are clearly temporary fixes.
If what we need to improve performance is a greater quantity of data, not higher quality, then we need to generate more data.
But how?
What BigTechs and labs are doing now is simply using LLMs to generate more text, videos or images.
In plain English, the two main techniques at play (sketched below) are:
- adversarial models: one model produces the text (or image) while an adversary judges/corrects it;
- models 'somehow' supervised by humans when the output is wrong (technically called Reinforcement Learning from Human Feedback, RLHF).
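Schematically, the adversarial setup looks something like this. A minimal sketch: generate_text() and judge_text() are hypothetical placeholders for calls to two different LLMs, not a real API.

```python
# Hypothetical generator/judge loop: one LLM writes, an adversary LLM scores
# and critiques, and the generator retries until the judge is satisfied.

def generate_text(prompt: str, feedback: str | None = None) -> str:
    """Generator LLM: produce a candidate, optionally revised using feedback."""
    raise NotImplementedError  # placeholder for a real model call

def judge_text(text: str) -> tuple[float, str]:
    """Adversary LLM: return a quality score in [0, 1] and a critique."""
    raise NotImplementedError  # placeholder for a real model call

def adversarial_generate(prompt: str, threshold: float = 0.8, max_rounds: int = 5) -> str:
    """Regenerate until the judge accepts the candidate (or rounds run out)."""
    feedback, candidate = None, ""
    for _ in range(max_rounds):
        candidate = generate_text(prompt, feedback)
        score, feedback = judge_text(candidate)
        if score >= threshold:
            break  # accepted: keep as synthetic training data
    return candidate
```

As for RLHF, its core is a reward model trained on human preference pairs; the usual pairwise (Bradley-Terry) loss, sketched here in numpy, simply pushes the reward of the human-preferred answer above that of the rejected one:

```python
import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # -log(sigmoid(r_chosen - r_rejected)), written stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -(r_chosen - r_rejected))))
```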
These techniques will definitely work… to an extent. But nobody knows exactly ‘how far’.
Can an AI judge an AI that generates data?
If yes, we are going ‘far’.
If not, we are going ‘less far’.
While I am an enthusiastic proponent of synthetic data in many, many AI applications (cybersecurity above all), I am not entirely convinced you can build generalist models by just ingesting huge amounts of unverified synthetic data.
As I said before, the 'quantity approach' to improving AI performance may reach its limits soon, if it has not already. The 'quality' of data or models may lead the next wave of performance gains.
#ai #artificialintelligence #business #technology #data #innovation
Comment (7 months ago), from a Leadership and Keynote Speaker and member of the Data Science Research Centre at University of Derby:
This approach completely boggles me. We know that in the field of ordinary AI, models ingesting the output of other models leads to model collapse fairly rapidly. Synthetic data depends on models to create that data. How do we know whether the distributions of the various parameters are of sufficient diversity and match the real world? Overfitting is very quickly achieved. Given that GenAI is prompt autocomplete on steroids, and is only a stochastic parrot, why do these Tech Bros believe that they need any more data? They are not capable of building knowledge models. There is nothing in the mathematics that can lead to a knowledge model, understanding, correctness or any other capability that can be trusted.
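For what it is worth, the collapse effect the commenter describes is easy to reproduce in a toy setting. A minimal sketch, assuming nothing about real training pipelines: fit a Gaussian to data, resample from the fit, refit, and repeat with no fresh real data.

```python
# Toy model collapse: each generation is fitted only to samples produced by
# the previous generation's model. With no fresh real data, the fitted
# standard deviation tends to drift towards zero, i.e. diversity is lost.
import numpy as np

rng = np.random.default_rng(0)
n = 20
data = rng.normal(0.0, 1.0, size=n)  # generation 0: "real" data

for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # "train" on the current data
    data = rng.normal(mu, sigma, size=n)  # next generation: synthetic only
    if gen % 40 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.4f}")
```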