Sustainable Data Anti-Pattern #4 - Data Pollution
Note: This article is part of a series on data anti-patterns.
Problem
Dark Data
According to the latest 2023 figures, approximately 330 million terabytes (TB) of data is generated every day. That adds up to over 100 zettabytes a year.
Some estimates say 90% of the data we create will never be used after 90 days. There is a term for this unused data: Dark Data, meaning data that is unknown, undiscovered, unquantified, underutilized or completely untapped (see Dark data).
A recent report sponsored by Splunk states that about 55% of an organization’s data is considered “dark”. (Some see this dark data as “data gold”; more on this below.)
Both in our personal life (email, photos, videos, text messages) and in our work life (documents, spreadsheets, email, reports, logs, etc.) we create vast amounts of data, most of which has only a very short lifespan, and much of which is never used at all.
It is worth pointing out, by the way, that only a small share of newly created data is kept in long-term storage. Much data is created for immediate consumption and never saved (e.g. a streamed Netflix film), or is overwritten (e.g. rolling caches). This makes the huge growth in data creation and replication slightly less worrying. But storage capacity is still growing quickly (doubling from 2021 to 2025), and all of this stored data needs to be managed and monitored, and will require power and water.
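As a small aside, the rolling-cache idea is easy to picture in code. The sketch below is a minimal illustration (the buffer size and event names are made up): a fixed-size buffer keeps only the most recent entries, overwriting older data instead of accumulating it.

```python
from collections import deque

# A rolling cache: a fixed-size buffer that silently drops the oldest
# entry once full, so the stored data never grows unbounded.
recent_events = deque(maxlen=3)

for event in ["login", "search", "checkout", "logout"]:
    recent_events.append(event)

# Only the three most recent events survive; "login" was overwritten.
print(list(recent_events))  # ['search', 'checkout', 'logout']
```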
Storage is Cheap?
There is a widespread perception that “Storage is Cheap”, especially as we migrate to the cloud, with the multiple storage options that the cloud vendors provide.
Some say it is compute that is expensive, not storage. That is a fallacy. As storage grows, the compute needed to search, analyse, process, sort, replicate and back up the data also grows. Large data sets require more computational power; if a database is smaller, the compute needed to scan, sort and back up its data will be less.
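A toy benchmark makes that coupling visible. The Python sketch below (the sizes and the filter predicate are arbitrary choices) times a full scan at two data sizes; the cost grows roughly in line with the data, before any sorting, replication or backup work is even counted.

```python
import time

def scan(data):
    # A full scan: the kind of work a query over unindexed data performs.
    return sum(1 for x in data if x % 97 == 0)

for size in (1_000_000, 10_000_000):
    data = list(range(size))
    start = time.perf_counter()
    scan(data)
    elapsed = time.perf_counter() - start
    print(f"{size:>12,} rows scanned in {elapsed:.3f}s")

# Ten times the data costs roughly ten times the compute.
```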
At an individual level, data explosion and storage concerns may seem trivial. A few extra dollars per month to store my data may be negligible. A half-second response time when querying a large data set is not noticeably worse than a quarter of a second. At small scale, a penalty of 10, 20 or even 50% may not be an issue. But at enterprise scale, and at national and global scale, it becomes hugely significant.
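A back-of-envelope calculation shows how the same per-gigabyte price plays out at different scales. All figures in the sketch below are illustrative assumptions, not vendor quotes.

```python
# Illustrative assumptions only: an assumed object-storage price and
# an assumed replication factor for durability and backups.
PRICE_PER_GB_MONTH = 0.02  # USD
REPLICATION_FACTOR = 3

estates = {"personal": 200, "enterprise": 5_000_000}  # sizes in GB

for label, gb in estates.items():
    monthly = gb * REPLICATION_FACTOR * PRICE_PER_GB_MONTH
    print(f"{label:>10}: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")

# personal:   $12/month, barely noticeable.
# enterprise: $300,000/month ($3.6M/year), a board-level line item.
```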
Data is the new Gold
“There’s money in them there databases!” If you believe this, then your default action will be “if in doubt, store the data”. It may be worth money in the future: we will be able to train AI systems on it and learn all sorts of things, which we can then monetise. This idea promotes data hoarding, just as you would hoard gold if you had it.
We store all this data because “Storage is Cheap” and “Data is the new gold”. We store everything we can because it may be useful in future. Yet this data not only clogs up our IT systems, DevOps tools and databases, it is also a significant and growing contributor to carbon emissions.
Further, there is a signal-to-noise problem with storing vast quantities of data. Massive volumes of data require more storage, more computational power and more human effort to make sense of them, and it becomes ever harder to extract meaningful data from the flood of useless data.
Finally, there is FODD (the Fear of Deleting Data), which we have discussed in previous blog posts. There is perceived risk and danger in removing data, so why not just store it?
Solution
Storing unused or unnecessary data is data pollution. Unused data consumes power and water and causes carbon pollution.
Can we be smarter about how we store data and what we store?
Here are some specific steps that organizations can take to address the problem of dark data:

- Audit and classify your data estate so you know what you hold, who owns it, and when it was last accessed (see the sketch after this list for one simple way to find stale files).
- Define and enforce data retention policies, with explicit expiry dates and automated archival or deletion.
- Tier your storage: move rarely accessed data to cheaper, lower-power cold storage instead of keeping everything hot.
- Delete data that has no business, legal or analytical value, and make deletion a routine, low-fear operation to counter FODD.
- Monitor and report on storage growth, so that data volume becomes a visible cost and sustainability metric.
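As a starting point for the audit step, here is a minimal sketch that walks a directory tree and flags files that have not been accessed recently. It assumes a filesystem that records access times (atime); the 180-day threshold and root path are arbitrary examples.

```python
import time
from pathlib import Path

THRESHOLD_DAYS = 180   # illustrative threshold for "stale"
ROOT = Path(".")       # illustrative root; point at the estate to audit

cutoff = time.time() - THRESHOLD_DAYS * 24 * 3600
dark_bytes = 0

for path in ROOT.rglob("*"):
    if path.is_file():
        stat = path.stat()
        if stat.st_atime < cutoff:      # not read since the cutoff
            dark_bytes += stat.st_size
            print(f"candidate: {path} ({stat.st_size:,} bytes)")

print(f"total dark-data candidates: {dark_bytes / 1e9:.2f} GB")
```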
By taking these steps, organizations can reduce their “data pollution” while also improving the efficiency of their IT systems, DevOps tools and data management processes.