Crap data everywhere

We need to talk about data. Crap data. We’re destroying our environment to create and store trillions of blurred photos and cat videos, to binge-watch Netflix, to ask ChatGPT inane questions and get instant wrong answers. We’re destroying our environment to store copies of copies of copies of stuff we have no intention of ever looking at again. We’re destroying our environment to take 1.4 trillion photos every year. That’s more photos taken in a single year in the 2020s than were taken in the entire 20th century. 10 trillion photos and growing are stored in the Cloud, the vast majority of which will never be viewed again. Exactly as Big Tech wants it.

I have spent almost thirty years working with some of the largest organizations in the world, trying to help them better manage their content and data. More than 90% of commercial and government data is crap, total absolute crap. Period. It should never have been created. It certainly should never have been stored. The rise of digital saw an explosion in data crap production. Content management systems were like giving staff diesel-fueled diggers, where before they only had data shovels. I remember, around 2010, a conversation with a Microsoft manager who estimated that there were about 14 million pages on Microsoft.com at that stage, and that four million of them had never been visited. Four million, I thought. That’s roughly the population of Ireland, in pages that nobody had ever visited. Why were they created? All the time and effort and energy and waste that went into all these pages that nobody had ever read. We are destroying our environment to create crap.

Everywhere I went it was the same old story. Data crap everywhere. Distributed publishing that allowed basically anyone to publish anything they wanted on the intranet, and nobody maintaining anything. When Kyndryl, the world’s largest provider of IT infrastructure services, was spun off by its parent, IBM, it found it had data scattered across 100 disparate data warehouses. Multiple teams had multiple copies of the same data. After cleanup, it had deleted 90% of that data. There are 10 million stories like this.

Scottish Enterprise had 753 pages on its website; 47 of those pages got 80% of visits. A large organization I worked for had 100 million visits a year to its website, with 5% of pages getting 80% of visits. 100,000 of its pages had not been reviewed in 10 years. “A huge percentage of the data that gets processed is less than 24 hours old,” computer engineer Jordan Tigani has stated. “By the time data gets to be a week old, it is probably 20 times less likely to be queried than from the most recent day. After a month, data mostly just sits there.” Southampton University’s public website found that 0.2% of pages got 90% of visits. Only 4% of pages were ever visited. So 96% of its roughly four million pages were never visited. One organization had 1,500 terabytes of data, less than 2% of which had ever been accessed after it was first stored. There are 20 million more stories like these.

Most organizations have no clue what content they have. It’s worse. Most organizations don’t even know where all their data is stored. It’s even worse. Most organizations don’t even know how many computers they have. At least 50% of the data in a typical organization sits on some server that nobody in management even knows exists. The average organization has hundreds of unsanctioned third-party app subscriptions being paid for on some manager’s credit card, storing everything from project chats to draft reports to product prototypes.

The Cloud made crap data infinitely worse. The Cloud is what happens when the cost of storing data is less than the cost of figuring out what to do with the crap. One study found that data stored by the engineering and construction industry had risen from an average of 3 terabytes in 2018 to 26 terabytes in 2023, a compound annual growth rate of 50%! That sort of crap data explosion happened—and is happening—everywhere. And this is what AI is being trained on. Crap data.
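For the curious, that growth rate can be sanity-checked in a couple of lines of Python. The 3 TB and 26 TB figures are the study’s averages quoted above; the exact compound rate works out a little above the rounded 50% in the text:

```python
# Compound annual growth rate (CAGR) for the engineering/construction
# figures quoted above: an average of 3 TB in 2018 growing to 26 TB in 2023.
start_tb, end_tb = 3, 26
years = 2023 - 2018

cagr = (end_tb / start_tb) ** (1 / years) - 1
print(f"CAGR: {cagr:.1%}")  # prints "CAGR: 54.0%" — the article's 50% is a round-down
```

At that rate, the stored volume roughly halves-again every eighteen months, which is why a “keep everything forever” policy compounds so viciously.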


Gerry McGovern

World Wide Waste


Natalie Lafferty

Head of Representation and Policy

5 months

Spot on.

Paul Bailey

Design Leader | Speaker | Co-creator of Service Designers Connect | #DesignThinkingDad of 4

5 months

I’ve started to think of this as ‘Dirty Data’ Gerry McGovern. An invisible trail of environmental impact that we’re all accountable for creating.

Cherie Kurland

Always learning

5 months

More data is not better data. Statistical sampling theory has been used and studied since the late 18th century. Yes, sampling error can and does occur, but simply enlarging data sets does not eliminate either bias or sampling error. Proper data sampling can produce good conclusions at 90%, 95%, or even higher confidence levels. Two fundamental fallacies of the entire AI industry are that 1) more data is automatically better data, and 2) its source, the Internet, is a complete data population. Neither is true. Not everything has been, or can be, digitized, and there are new discoveries daily. Some of those new discoveries invalidate "truths" we've long held dear. AI can perpetuate falsehoods.

Shane Diffily

Digital User Experience Specialist at Trinity College Dublin

5 months

One of the first decisions I make for websites I work on is what content I'm going to ignore. Although site owners are usually shocked to hear it, it's utterly essential. Content volumes are so gigantic and business demands so overwhelming, it's simply not possible to manage everything. If you try, the entire website will fail. That's a promise. I've found Content Groups in GA are a great way to make informed, defensible decisions like this. They give exactly the data needed to know where (and where not) to focus your effort. https://www.diffily.com/articles/how-to-prioritise-website-content.htm

James Hoskins

Experienced Senior Manager | Project Management Professional | DotCom Veteran | Open Source Advocate | AI Strategist | WebDev and Data Science Trained | Polymath | Logician

5 months

Agree 100% - I think anyone who has spent time in their career working in some aspect of information and data management can recognise what a crazy and incredibly wasteful situation we have created. Google started the rot with Gmail and the idea that you 'never had to delete anything ever again'. I'm sure that was a great incentive for folks to jump ship if they'd been used to having their email data levels capped, but it just made us think that crap data had no consequences. Your article clearly points out that it does!!!
