Crap data everywhere
Gerry McGovern
Developer of the Top Tasks research method. Author of World Wide Waste: How digital is killing the planet and what to do about it.
We need to talk about data. Crap data. We’re destroying our environment to create and store trillions of blurred photos and cat videos, to binge-watch Netflix, and to ask ChatGPT inane questions and get instant wrong answers. We’re destroying our environment to store copies of copies of copies of stuff we have no intention of ever looking at again. We’re destroying our environment to take 1.4 trillion photos every year. That’s more photos taken in a single year in the 2020s than were taken in the entire 20th century. 10 trillion photos and growing, stored in the Cloud, the vast majority of which will never be viewed again. Exactly as Big Tech wants it.
I have spent almost thirty years working with some of the largest organizations in the world, trying to help them better manage their content and data. More than 90% of commercial and government data is crap, total absolute crap. Period. It should never have been created. It certainly should never have been stored. The rise of digital saw an explosion of crap data production. Content management systems were like giving staff diesel-fueled diggers, where before they only had shovels. I remember being in conversation with a Microsoft manager around 2010, who estimated that there were about 14 million pages on Microsoft.com at that stage, and that four million of them had never been visited. Four million, I thought. That’s a population of Ireland’s worth of pages that nobody had ever visited. Why were they created? All the time and effort and energy and waste that went into all these pages that nobody had ever read. We are destroying our environment to create crap.
Everywhere I went it was the same old story. Crap data everywhere. Distributed publishing that allowed basically anyone to publish anything they wanted on the intranet, and nobody maintaining anything. When Kyndryl, the world’s largest provider of IT infrastructure services, was spun off by its parent, IBM, they found they had data scattered across 100 disparate data warehouses. Multiple teams had multiple copies of the same data. After cleanup, they had deleted 90% of the data. There are 10 million stories like this.
Scottish Enterprise had 753 pages on its website; 47 of those pages (about 6%) got 80% of visits. A large organization I worked for had 100 million visits a year to its website, with 5% of pages getting 80% of visits, and 100,000 of its pages had not been reviewed in 10 years. “A huge percentage of the data that gets processed is less than 24 hours old,” computer engineer Jordan Tigani stated. “By the time data gets to be a week old, it is probably 20 times less likely to be queried than from the most recent day. After a month, data mostly just sits there.” Southampton University found that 0.2% of the pages on its public website got 90% of visits, and that only 4% of pages were ever visited at all. So, 96% of its roughly four million pages were never visited. One organization had 1,500 terabytes of data, with less than 2% of it ever having been accessed after it was first stored. There are 20 million more stories like these.
Most organizations have no clue what content they have. It gets worse: most organizations don’t even know where all their data is stored. Worse again: most organizations don’t even know how many computers they have. In a typical organization, at least 50% of the data is sitting on some server that nobody in management knows exists. The average organization has hundreds of unsanctioned third-party app subscriptions, paid for on some manager’s credit card, storing everything from project chats to draft reports to product prototypes.
The Cloud made crap data infinitely worse. The Cloud is what happens when the cost of storing data is less than the cost of figuring out what to do with the crap. One study found that data stored by the engineering and construction industry had risen from an average of 3 terabytes in 2018 to 26 terabytes in 2023, a compound annual growth rate of over 50%. That sort of crap data explosion happened, and is happening, everywhere. And this is what AI is being trained on. Crap data.
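A quick back-of-the-envelope check on that growth rate (my own arithmetic, not a figure from the study): going from 3 terabytes to 26 terabytes over the five years from 2018 to 2023 is a factor of 26/3 ≈ 8.7, which compounds out as

CAGR = (26/3)^(1/5) − 1 ≈ 0.54

or roughly 54% a year. At that rate, an organization’s stored data multiplies almost ninefold every five years.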
Books by Gerry
World Wide Waste
Digital is physical. Digital is not green. Digital costs the Earth. Every time I download an email I contribute to global warming.
Podcast: World Wide Waste
Interviews with prominent thinkers outlining what can be done to make digital as sustainable as possible.
Head of Representation and Policy
5 months ago: Spot on.
Design Leader | Speaker | Co-creator of Service Designers Connect | #DesignThinkingDad of 4
5 months ago: I’ve started to think of this as ‘Dirty Data’, Gerry McGovern. An invisible trail of environmental impact that we’re all accountable for creating.
Always learning
5 months ago: More data is not better data. Statistical sampling theory has been used and studied since the late 18th century. Yes, sampling error can and has occurred, but simply enlarging data sets does not eliminate either bias or sampling errors. Proper data sampling can produce good conclusions at 90%, 95%, or even higher confidence levels. Two fundamental fallacies of the entire AI industry are that 1) more data is automatically better data, and 2) its source, the Internet, is a complete data population. Neither is true. Not everything has been, or can be, digitized, and there are new discoveries daily. Some of those new discoveries invalidate "truths" we’ve long held dear. AI can perpetuate falsehoods.
Digital User Experience Specialist at Trinity College Dublin
5 months ago: One of the first decisions I make for websites I work on is what content I’m going to ignore. Although site owners are usually shocked to hear it, it’s utterly essential. Content volumes are so gigantic and business demands so overwhelming that it’s simply not possible to manage everything. If you try, the entire website will fail. That’s a promise. I’ve found Content Groups in GA (Google Analytics) are a great way to make informed, defensible decisions like this. They give exactly the data needed to know where (and where not) to focus your effort. https://www.diffily.com/articles/how-to-prioritise-website-content.htm
Experienced Senior Manager | Project Management Professional | DotCom Veteran | Open Source Advocate | AI Strategist | WebDev and Data Science Trained | Polymath | Logician
5 months ago: Agree 100%. I think anyone who has spent time in their career working in some aspect of information and data management can recognise what a crazy and incredibly wasteful situation we have created. Google started the rot with Gmail and the idea that you ‘never had to delete anything ever again’. I’m sure that was a great incentive for folks to jump ship if they’d been used to having their email data levels capped, but it just made us think that crap data had no consequences. Your article clearly points out that it does!