Social Impact of the GenAI Data Scraping Cycle
Turja N Chaudhuri
Global Lead, Platform Success, EY Fabric | ? Practice at EY | Views are my own
GenAI is awesome, right? Be it question-answering chatbots like ChatGPT and Bard, or image generators like DALL-E and Midjourney.
And to top it off, the cute AI-generated one-minute puppy videos from OpenAI's latest Sora model are the best. Who can argue with that, right?
Indeed, GenAI toolkits provide a whole lot of value, but the backend process of getting there has a social impact that is not getting the attention or coverage it deserves.
All these awesome models, be it GPT, Gemini or Claude, can really perform wonders and make it seem like magic! It's also true that a lot of exceptionally talented engineers and researchers spent a large part of their lives improving these systems, running on some very powerful Nvidia GPUs, to reach this point.
But none of this would have been possible without data to power these systems: high-quality, curated data. And where did they get all this data from? Guess!
Give me your data, pretty please.
Google recently announced that it has expanded its 'strategic partnership' with Reddit in the GenAI space. You can find more details at: https://blog.google/inside-google/company-announcements/expanded-reddit-partnership/
What that announcement does not directly mention is that there is reportedly a deal worth 60M USD / year for Google to train its models on proprietary Reddit data. [1]
The collaboration will give Google access to Reddit’s data API, which delivers real-time content from Reddit’s platform. This will provide “Google with an efficient and structured way to access the vast corpus of existing content on Reddit,” while also allowing the company to display content from Reddit in new ways across its products.
This is a significant departure from Reddit's earlier stance towards Google, when it apparently wanted to block Google for fear of the search giant scraping all its proprietary content.
But the fact remains: these AI models need enormous amounts of data to train on, and the more curated, original, authentic and trustworthy that data is, the better the results.
I expect to see more such 'strategic partnerships' between GenAI companies with deep pockets and online collaborative discussion platforms like Reddit, Stack Overflow, etc. in the coming days.
Want My Data? Pay Up!
The emergence of GenAI models that rely on massive sources of scraped data has led to an unintended side effect as well: data hoarding.
Companies are increasingly scared that these big GenAI players will scrape their data from publicly accessible sources (at zero or low cost), build AI models on top, charge consumers for them and make billions, without any attribution or compensation to the original data owners or data stewards.
This has led to different sorts of consequences. Some have chosen the path of litigation and gone to court to sue these mammoth GenAI companies. Others have put up expensive API pricing plans to dissuade free scraping, and are even treating the sale of that data to these GenAI companies as a valid business model.
Vox reports that a New York Times copyright lawsuit could kill OpenAI. [2]
However, not every company or owner of a free app has 60M USD to pay Reddit every year, and many might go out of business due to these new safeguards, which have been put around systems traditionally known for being open and collaborative platforms.
Popular Reddit app Apollo may go out of business over Reddit’s new, unaffordable API pricing. [3]
The Joke Is On You!
The icing on the cake, though, is the recent news about GenAI image generation firm Midjourney banning its competitor Stability AI.
"Midjourney says it has banned Stability AI staffers from using its service, accusing employees at the rival generative AI company of causing a systems outage earlier this month during an attempt to scrape Midjourney’s data." [4]
Isn't it ironic that the same firms that trained their models on huge amounts of data scraped from public sites, blogs, etc., without any attempt at decent attribution to the original stewards of that data, are now banning each other to avoid the same fate?
"The irony of this situation also hasn’t been lost on online creatives, who have extensively criticized both companies (and generative AI systems in general) for training their models on masses of online data scraped from their works without consent."
The end is near, but just a bit farther
Essentially, the same technologies that were designed to democratize access and foster positive collaboration have, in some cases, maybe even unintentionally, encouraged exactly the opposite behavior.
The barrier to getting curated, high-quality data or content is already high and will increase over time, essentially ensuring that independent contributors, small companies, etc. will not be able to access the data they need, while hyperscalers and other IT giants gobble up all the data in the world and make even more money.
Over time, this will instill a new culture in which even individuals who today openly help others in chat forums or community areas might refrain from doing so: multinationals will be making a lot of $$$ from their well-researched content, while they themselves go unrecognized due to the absence of a proper attribution mechanism.
Let's hope AI governance leaders across the world find a way forward that promotes open collaboration and sharing, and discourages selfish practices that only favor a select few.
SAP on Cloud Starchitect - doer of doers, hands-on & heads-on, chief washer-upperer
8 months ago
How about this as an unfortunate implication of GenAI: - I no longer write nice blogs. There will be many others. Once upon a time, when a nice blog page was written, the end consumer was there, at the blog page gates and recognising that the writer was a) truly unique and b) something or someone to admire and be like. I believe we call it "inspirational". When reading a page written by AI, the consumer already knows they will never attain the "skills" of an AI and so the hope and willing by the consumer to attain a higher level of skill is lost in the 2 seconds it takes the AI to summarise and anonymise the content for the consumer. With the loss of good quality written data, extracted from our heads, what will the AI train on now? As you say, forums and closed sources are one avenue. Just look at the number of people answering questions on LinkedIn in order to obtain a "Top Voice" badge. All that is doing, is creating a training data set that will one day be more knowledgeable than the human answering the questions. Look at the way that MS Teams now tightly integrates with LinkedIn (also Microsoft owned). LinkedIn is a funnel, and it is funnelling our creativity into an LLM. OK, slightly sensationalist, but it is happening.