Social Impact of the GenAI Data Scraping Cycle

Turja N Chaudhuri (to the Cloud)

Global Lead, Platform Success, EY Fabric | Practice at EY | Views are my own

Published: March 18, 2024

GenAI is awesome, right? Be it question-answering chatbots like ChatGPT and Bard, or image generators like DALL-E and Midjourney.

And to top it off, the cute AI-generated one-minute puppy videos from OpenAI's latest Sora model are the best. Who can argue with that, right?

Indeed, GenAI toolkits provide a whole lot of value, but the backend process of getting there has a social impact that is not getting the attention or coverage it deserves.

All these awesome models, be it GPT, Gemini or Claude, can really perform wonders and make it seem like magic! It's also true that a lot of exceptionally talented engineers and researchers spent a large part of their lives enhancing these systems, on top of some very powerful Nvidia GPUs, to reach this point.

But none of this would have been possible without data to power these systems: high-quality, curated data. And where did they get all this data from? Guess!

Give me your data, pretty please.

Google recently announced that it has expanded its 'strategic partnership' with Reddit in the GenAI space. You can find more details at: https://blog.google/inside-google/company-announcements/expanded-reddit-partnership/

What that announcement does not directly mention is that the deal is reportedly worth 60M USD per year, for Google to train its models on proprietary Reddit data. [1]

The collaboration will give Google access to Reddit’s data API, which delivers real-time content from Reddit’s platform. This will provide “Google with an efficient and structured way to access the vast corpus of existing content on Reddit,” while also allowing the company to display content from Reddit in new ways across its products.

This is a significant departure from the earlier stance Reddit had taken towards Google, when it apparently wanted to block Google for fear of the search giant scraping all its proprietary content.

But the fact remains: these AI models need enormous amounts of data to train on, and the more curated, original, authentic and trustworthy that data is, the better the results.

I expect to see more such 'strategic partnerships' between GenAI companies with deep pockets and online collaborative discussion platforms like Reddit, Stack Overflow, etc. in the coming days.

Want My Data? Pay Up!

The emergence of GenAI models that rely on massive amounts of scraped data has led to an unintended side effect as well: data hoarding.

Companies are increasingly afraid that the big GenAI players will scrape their data from publicly accessible sources (at zero or low cost), build AI models on top of it, charge consumers for those models and make billions, without any attribution or compensation to the original data owners or data stewards.

This has led to different kinds of consequences. Some have chosen litigation and gone to court to sue these mammoth GenAI companies. Others have put up expensive API pricing plans to dissuade free scraping, and some even treat sharing their data with GenAI companies as a valid business model.

Vox reports that a New York Times copyright lawsuit could kill OpenAI. [2]

However, not every company or owner of a free app has 60M USD to pay Reddit every year, and many might go out of business because of the new safeguards being put up around platforms that were traditionally known for being open and collaborative.

Popular Reddit app Apollo may go out of business over Reddit’s new, unaffordable API pricing. [3]

The Joke Is On You!

The icing on the cake, though, is the recent news about GenAI image generation firm Midjourney banning its competitor Stability AI.

"Midjourney says it has banned Stability AI staffers from using its service, accusing employees at the rival generative AI company of causing a systems outage earlier this month during an attempt to scrape Midjourney’s data" [4]

Isn't it ironic that the same firms that trained their models on huge amounts of data scraped from public sites, blogs, etc., without any attempt at decent attribution to the original stewards of that data, are now banning each other to avoid the same fate?

"The irony of this situation also hasn’t been lost on online creatives, who have extensively criticized both companies (and generative AI systems in general) for training their models on masses of online data scraped from their works without consent."

The end is near, but just a bit farther

Essentially, the same technologies that were designed to democratize access and foster positive collaboration have, in some cases and perhaps even unintentionally, encouraged exactly the opposite behavior.

The barrier to getting curated, high-quality data or content is already high and will only rise over time. That essentially ensures that independent contributors, small companies, etc. will not be able to access the data they need, while hyperscalers and other IT giants gobble up all the data in the world and make even more money.

Over time, this will instill a new culture in which even individuals who today openly help others in chat forums or community areas might refrain from doing so: multinationals will be making a lot of $$$ from their well-researched content, while they themselves are ignored for lack of a proper attribution mechanism.

Let's hope AI governance leaders across the world find a way around this that promotes open collaboration and sharing, and discourages selfish practices that only favor a select few.

References

[1] https://www.theverge.com/2024/2/22/24080165/google-reddit-ai-training-data

[2] https://www.vox.com/technology/2024/1/18/24041598/openai-new-york-times-copyright-lawsuit-napster-google-sony

[3] https://techcrunch.com/2023/05/31/popular-reddit-app-apollo-may-go-out-of-business-over-reddits-new-unaffordable-api-pricing/

[4] https://www.theverge.com/2024/3/11/24097495/midjourney-bans-stability-ai-employees-data-theft-outage

Darryl Griffiths MBCS

SAP on Cloud Starchitect - doer of doers, hands-on & heads-on, chief washer-upperer

8 months ago

How about this as an unfortunate implication of GenAI: - I no longer write nice blogs. There will be many others. Once upon a time, when a nice blog page was written, the end consumer was there, at the blog page gates and recognising that the writer was a) truly unique and b) something or someone to admire and be like. I believe we call it "inspirational". When reading a page written by AI, the consumer already knows they will never attain the "skills" of an AI and so the hope and willing by the consumer to attain a higher level of skill is lost in the 2 seconds it takes the AI to summarise and anonymise the content for the consumer. With the loss of good quality written data, extracted from our heads, what will the AI train on now? As you say, forums and closed sources are one avenue. Just look at the number of people answering questions on LinkedIn in order to obtain a "Top Voice" badge. All that is doing, is creating a training data set that will one day be more knowledgeable than the human answering the questions. Look at the way that MS Teams now tightly integrates with LinkedIn (also Microsoft owned). LinkedIn is a funnel, and it is funnelling our creativity into an LLM. OK, slightly sensationalist, but it is happening.
