Crap data everywhere
Gerry McGovern
Developer of the Top Tasks research method. Author of World Wide Waste: How digital is killing the planet and what to do about it.
We need to talk about data. Crap data. We’re destroying our environment to create and store trillions of blurred photos and cat videos, to binge-watch Netflix, and to ask ChatGPT inane questions and get instant wrong answers. We’re destroying our environment to store copies of copies of copies of stuff we have no intention of ever looking at again. We’re destroying our environment to take 1.4 trillion photos every year. That’s more photos taken in a single year in the 2020s than were taken in the entire 20th century. 10 trillion photos and growing, stored in the Cloud, the vast majority of which will never be viewed again. Exactly as Big Tech wants it.
I have spent almost thirty years working with some of the largest organizations in the world, trying to help them better manage their content and data. More than 90% of commercial and government data is crap, total absolute crap. Period. It should never have been created. It certainly should never have been stored. The rise of digital saw an explosion of crap data production. Content management systems were like giving staff diesel-fueled diggers, where before they only had shovels. I remember being in conversation with a Microsoft manager around 2010, who estimated that there were about 14 million pages on Microsoft.com at that stage, and that four million of them had never been visited. Four million, I thought. That’s a population of Ireland’s worth of pages that nobody had ever visited. Why were they created? All the time and effort and energy and waste that went into all these pages that nobody had ever read. We are destroying our environment to create crap.
Everywhere I went it was the same old story. Crap data everywhere. Distributed publishing that allowed basically anyone to publish anything they wanted on the intranet, and nobody maintaining anything. When Kyndryl, the world’s largest provider of IT infrastructure services, was spun off by its parent, IBM, they found they had data scattered across 100 disparate data warehouses. Multiple teams had multiple copies of the same data. After cleanup, they had deleted 90% of the data. There are 10 million stories like this.
Scottish Enterprise had 753 pages on its website; 47 of those pages (about 6%) got 80% of visits. A large organization I worked for had 100 million visits a year to its website, with 5% of pages getting 80% of visits, and 100,000 of its pages had not been reviewed in 10 years. “A huge percentage of the data that gets processed is less than 24 hours old,” computer engineer Jordan Tigani stated. “By the time data gets to be a week old, it is probably 20 times less likely to be queried than from the most recent day. After a month, data mostly just sits there.” Southampton University found that 0.2% of the pages on its public website got 90% of visits, and that only 4% of pages were ever visited at all. So, 96% of its roughly four million pages were never visited. One organization had 1,500 terabytes of data, with less than 2% of it ever having been accessed after it was first stored. There are 20 million more stories like these.
Most organizations have no clue what content they have. It gets worse: most organizations don’t even know where all their data is stored. Worse again: most organizations don’t even know how many computers they have. In a typical organization, at least 50% of the data is sitting on some server that nobody in management knows exists. The average organization has hundreds of unsanctioned third-party app subscriptions, paid for on some manager’s credit card, storing everything from project chats to draft reports to product prototypes.
The Cloud made crap data infinitely worse. The Cloud is what happens when the cost of storing data is less than the cost of figuring out what to do with the crap. One study found that data stored by the engineering and construction industry had risen from an average of 3 terabytes in 2018 to 26 terabytes in 2023, a compound annual growth rate of over 50%. That sort of crap data explosion happened, and is happening, everywhere. And this is what AI is being trained on. Crap data.
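A quick back-of-the-envelope check on that growth rate (my own arithmetic, not a figure from the study): going from 3 terabytes to 26 terabytes over the five years from 2018 to 2023 is a factor of 26/3 ≈ 8.7, which compounds out as

CAGR = (26/3)^(1/5) − 1 ≈ 0.54

or roughly 54% a year. At that rate, an organization’s stored data multiplies almost ninefold every five years.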
Books by Gerry
World Wide Waste
Digital is physical. Digital is not green. Digital costs the Earth. Every time I download an email I contribute to global warming.
Podcast: World Wide Waste
Interviews with prominent thinkers outlining what can be done to make digital as sustainable as possible.
Head of Representation and Policy
5 months ago: Spot on.
Design Leader | Speaker | Co-creator of Service Designers Connect | #DesignThinkingDad of 4
5 months ago: I’ve started to think of this as ‘Dirty Data’, Gerry McGovern. An invisible trail of environmental impact that we’re all accountable for creating.
Always learning
5 months ago: More data is not better data. Statistical sampling theory has been used and studied since the late 18th century. Yes, sampling error can and has occurred, but simply enlarging data sets does not eliminate either bias or sampling errors. Proper data sampling can produce good conclusions at 90%, 95%, or even higher confidence levels. Two fundamental fallacies of the entire AI industry are that 1) more data is automatically better data, and 2) its source, the Internet, is a complete data population. Neither is true. Not everything has been, or can be, digitized, and there are new discoveries daily. Some of those new discoveries invalidate "truths" we’ve long held dear. AI can perpetuate falsehoods.
Digital User Experience Specialist at Trinity College Dublin
5 months ago: One of the first decisions I make for websites I work on is what content I’m going to ignore. Although site owners are usually shocked to hear it, it’s utterly essential. Content volumes are so gigantic and business demands so overwhelming that it’s simply not possible to manage everything. If you try, the entire website will fail. That’s a promise. I’ve found Content Groups in GA (Google Analytics) are a great way to make informed, defensible decisions like this. They give exactly the data needed to know where (and where not) to focus your effort. https://www.diffily.com/articles/how-to-prioritise-website-content.htm
Experienced Senior Manager | Project Management Professional | DotCom Veteran | Open Source Advocate | AI Strategist | WebDev and Data Science Trained | Polymath | Logician
5 months ago: Agree 100%. I think anyone who has spent time in their career working in some aspect of information and data management can recognise what a crazy and incredibly wasteful situation we have created. Google started the rot with Gmail and the idea that you ‘never had to delete anything ever again’. I’m sure that was a great incentive for folks to jump ship if they’d been used to having their email data levels capped, but it just made us think that crap data had no consequences. Your article clearly points out that it does!