The humans in the machine. And why we need them
I dove into the black-box belly of the machine in my last blog, laying out an admittedly weird and as-yet unexplained behaviour exhibited by AI (read it here).
This time, I’m looking at something far more explicable – the human element making AI possible, beyond Silicon Valley genius. Just as coffee implicates a supply chain from bean to cup, and batteries one from critical minerals to manufacturing, so does the all-knowing AI. The AI pipeline is littered with human labour – sometimes known, sometimes hidden. Spoiler alert: you’re part of the “hidden” human labour – in ways you may never have guessed.
First up – the known kind of human effort, which, if cut off, will lead to AI literally going “MAD”.
In 2019, deep neural nets averaged around 500 million parameters. Cut to GPT-3, with 175 billion – trained on some 45 terabytes of raw text data, equivalent to the word count of about 90 million novels: everything from Reddit forums and Wikipedia entries to journal articles and instruction manuals. Its successor, GPT-4, is estimated to weigh in at a mind-boggling 1.76 trillion parameters.
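(A quick back-of-the-envelope check on that novel count – using my own rough assumptions of ~90,000 words per novel and ~6 bytes per English word, neither figure from the sources above:)

```python
# Back-of-the-envelope check: does 45 TB of text really equal
# ~90 million novels? Assumptions are mine, not from the sources:
# an average novel runs ~90,000 words, and plain English text
# averages ~6 bytes per word (word plus space) in UTF-8.
corpus_bytes = 45e12            # 45 terabytes of raw text
bytes_per_word = 6              # rough average for English prose
words_per_novel = 90_000        # a typical novel's length

corpus_words = corpus_bytes / bytes_per_word
novels = corpus_words / words_per_novel
print(f"~{corpus_words:.1e} words, or about {novels / 1e6:.0f} million novels")
# -> ~7.5e+12 words, or about 83 million novels:
#    the same order of magnitude as the figure quoted above.
```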
In short – anything available to read online, whether open access or not, is “pretty likely” to have already been scraped by companies building LLMs. Our random 2 am tweets. Those long-lost blog entries on Tumblr. Instagram posts. Any and all existing content is being fed into deep-learning beasts.
But it’s all human-created data. Which matters – for the quality of the content that LLMs are able to reassemble and regurgitate.
The internet today teems with more and more AI-generated content – populating the digital landscape far quicker than the human-generated kind. In fact, Epoch AI has estimated that AI models will have read through the internet’s existing library of high-quality, human-generated public text – words that could fill 600 million encyclopedias – sometime between 2026 and 2032.
And so, already, models are beginning to train, knowingly or not, on “synthetic data”. That is, data created by other AI models. Cheap, accessible, instant.
But as generative AI eats itself, it breaks down. LLMs don’t just produce content that is less diverse, more similar, and unoriginal – they actually begin to degenerate, throwing up increasingly mangled data, or gibberish. Researchers from Rice and Stanford universities studying the phenomenon are calling it Model Autophagy Disorder, or – compellingly – “MAD”. (It’s also more generally known as “model collapse”.) And in their research, it only took 5 – just 5 – cycles of training on synthetic data for an AI model’s output to “blow up”, to go MAD.
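If you want to see the mechanism in miniature, here’s a toy sketch (my own illustration, not the Rice/Stanford experiment): a “model” that is nothing more than the word frequencies of its training corpus, retrained each generation on its own synthetic output. The rare words vanish first, and the vocabulary only ever shrinks – a simplified analogue of the diversity loss behind model collapse.

```python
import random
from collections import Counter

random.seed(42)

# A "real" vocabulary with a Zipf-like long tail of rare words.
vocab = [f"word{i}" for i in range(50)]
weights = [1 / (i + 1) for i in range(50)]

# Generation 0: a corpus of genuine, human-like data.
corpus = random.choices(vocab, weights=weights, k=300)

for gen in range(1, 6):  # five self-training cycles, echoing the paper
    counts = Counter(corpus)
    print(f"generation {gen}: {len(counts)} distinct words survive")
    # The next "model" is just these empirical frequencies. Crucially,
    # its synthetic corpus can only contain words it has already seen,
    # so diversity can fall but never recover.
    words, freqs = zip(*counts.items())
    corpus = random.choices(words, weights=freqs, k=300)
```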
***
If it is human effort that is keeping AI alive and kicking – the crucial factor preventing AI from eating its own tail – it is also human labour that is training AI models, making them more reliable, safer, and less biased. And these workers are mostly “hidden” – made invisible by corporate PR and public narratives.
I’m talking about “data labourers”, the underbelly of AI. Workers from Syria, Venezuela, India, Kenya, Bulgaria, the Philippines – largely the Global South – who are sorting, categorising, and labelling the data sets from which an AI model’s parameters are learned. What anthropologist Mary Gray and computational social scientist Siddharth Suri call “ghost work”. All this even though roughly 80% of the time spent building an AI system goes into annotating datasets.
Remember ImageNet, the landmark database launched by AI pioneer Fei-Fei Li back in 2006 to teach a computer how to “see” by feeding it millions of real-world images? At the peak of the project, nearly 50,000 workers across 167 countries were cleaning, sorting, and labelling nearly a billion images.
Today, there are between 154 and 435 million data workers globally, as per the World Bank. And the data-labelling market is worth upward of $2 billion, expected to grow almost 8-fold by 2030.
From labelling pictures of cats to detecting bone fractures in X-rays. Drawing figures around different objects, from traffic lights to stop signs to human faces. Reviewing hour after hour of footage of drivers at the wheel to spot microexpressions that indicate lapses in concentration, to be fed into an AI monitoring system. Recording speech in languages other than English. And even labelling violent, explicit – and traumatic – content to train AI to be less toxic. See this Time story on how OpenAI used outsourced Kenyan workers earning less than $2 per hour to label graphic text – and ultimately visuals – to feed into an AI-powered safety mechanism that would learn to detect violence, hate speech, and sexual abuse.
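To make that labour concrete, here’s what a single unit of it might look like – a simplified, hypothetical annotation record for a driver-monitoring-style dataset, loosely modelled on common object-detection formats like COCO (all field names are my own invention):

```python
import json

# A simplified, hypothetical annotation record for one image.
# Loosely modelled on common object-detection formats (e.g. COCO);
# every field name here is illustrative, not a real schema.
record = {
    "image_id": "frame_00421.jpg",
    "annotations": [
        # Bounding boxes drawn by hand: [x, y, width, height] in pixels.
        {"label": "traffic_light", "bbox": [312, 88, 24, 61]},
        {"label": "stop_sign",     "bbox": [540, 140, 48, 48]},
        {"label": "pedestrian",    "bbox": [102, 200, 60, 170]},
    ],
    "annotator_id": "worker_7731",  # the human behind the labels
    "seconds_spent": 94,            # paid attention, box by box
}
print(json.dumps(record, indent=2))
```

Multiply that record by the millions of images in a modern training set, and the 80% figure above starts to make sense.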
Emerging news stories and testimonies reveal 21st-century sweatshops: long hours, repetitive tasks, automated surveillance, constant monitoring, and little or no mental health support. All to earn anywhere between $1 and $5 per hour from companies valued in the millions and billions.
This new extractive economy based on human labour doesn’t end here. Beyond our papers and books, our art and music, a fascinating paper by Fabio Morreale et al. from the University of Auckland identifies how you and I are now among “unwitting labourers” doing work we don’t even realise we’re doing – all in the service of training AI models.
Like the act of creating and adding to our Spotify playlists – fodder for the platform to improve its recommendation algorithm. Or the act of filtering junk email in our inboxes – scraped to teach algorithms to identify spam. Or – most surprising of all – answering reCAPTCHA tests, which Google used between 2009 and 2011 to gather the enormous pool of training data required to digitise its entire Google Books collection (and some million NYT articles besides). And since 2014, Google has been using these prove-your-humanity tests for image-based applications. So the next time you’re asked to identify the parts of bicycles or traffic lights in 9 squares, or to indicate which image is the right way up? You now know where your answers are going.
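For a sense of how those unwitting clicks become a model, here’s a toy spam filter – a deliberately minimal naive Bayes sketch of my own, nothing like the scale or sophistication of a real system – where every “mark as spam” click is a free, human-made training label:

```python
import math
from collections import Counter

# Toy sketch: each "mark as spam" click creates a labeled example.
# A naive Bayes filter (heavily simplified; real systems are far
# more elaborate) then learns from those free human judgments.
inbox = [
    ("win a free prize now", "spam"),    # user clicked "mark as spam"
    ("claim your free money", "spam"),   # user clicked "mark as spam"
    ("meeting notes attached", "ham"),   # user left it alone
    ("lunch tomorrow?", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in inbox:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def spam_score(text):
    """Log-odds that a message is spam, with add-one smoothing."""
    score = math.log(class_counts["spam"] / class_counts["ham"])
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    for w in text.split():
        p_spam = (word_counts["spam"][w] + 1) / (sum(word_counts["spam"].values()) + len(vocab))
        p_ham = (word_counts["ham"][w] + 1) / (sum(word_counts["ham"].values()) + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

print(spam_score("free prize money"))   # positive -> looks like spam
print(spam_score("meeting tomorrow"))   # negative -> looks legitimate
```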
***
Ethicists and researchers are calling out the creation of a global underclass. Stories of unseen and underpaid data workers are coming to light, through initiatives like Data Workers’ Inquiry, led by Dr. Milagros Miceli. Ventures like Karya are seeking to make digital work dignified for economically disadvantaged populations. Tech insiders like Jaron Lanier and E. Glen Weyl have conceptualised a “coherent marketplace” – a true market economy for information – based on what they term “data dignity”. Even a first African union of content moderators – whose labour underpins the AI systems of platforms like Facebook and ChatGPT – was set up last year. And everyone from authors to comedians to music labels is suing AI developers.
As Silicon Valley rushes to scale up AI – especially now that AI has been shown to get not just better with more data, but dramatically better (Dario Amodei) – and to meet calls for adoption across industries, the demand for data – and for the humble human who supplies it – will only increase.
It’s time to talk about intellectual property and copyright. Digital labour and distribution. Rights and dividends. And how to script a new social contract between creators, hidden workers, and AI companies.
Ritika Passi | Executive Editor, Lucid Lines