Sustainable Data Anti-Pattern #4 - Data Pollution
Note: This article is part of a series on data anti-patterns.
Problem
Dark Data
According to the latest 2023 figures, approximately 330 million terabytes (TB) of data is generated every day. That adds up to over 100 zettabytes a year.
Some estimates say 90% of the data we create will never be used after 90 days. There is a term for this unused data: Dark Data, meaning data that is unknown, undiscovered, unquantified, underutilized or completely untapped (see Dark data).
A recent report sponsored by Splunk states that about 55% of an organization’s data is considered “dark”. (Some see this dark data as “data gold”; more on this below.)
Both in our personal life (email, photos, videos, text messages) and in our work life (documents, spreadsheets, email, reports, logs, etc.) we create vast amounts of data, most of which has only a very short lifespan, and much of which is never used at all.
It is worth pointing out, by the way, that only a small share of newly created data is kept in long-term storage. Much data is created for immediate consumption and never saved (e.g. a streamed Netflix film), or is overwritten (e.g. rolling caches). This makes the huge growth in data creation and replication slightly less worrying. But storage capacity is still growing quickly (doubling from 2021 to 2025), and all of this stored data needs to be managed and monitored, and will require power and water.
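As a small aside, the rolling-cache idea is easy to picture in code. The sketch below is a minimal illustration (the buffer size and event names are made up): a fixed-size buffer keeps only the most recent entries, overwriting older data instead of accumulating it.

```python
from collections import deque

# A rolling cache: a fixed-size buffer that silently drops the oldest
# entry once full, so the stored data never grows unbounded.
recent_events = deque(maxlen=3)

for event in ["login", "search", "checkout", "logout"]:
    recent_events.append(event)

# Only the three most recent events survive; "login" was overwritten.
print(list(recent_events))  # ['search', 'checkout', 'logout']
```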
Storage is Cheap?
There is a widespread perception that “Storage is Cheap”, especially as we migrate to the cloud, with the multiple storage options that the cloud vendors provide.
Some say it is compute that is expensive, not storage. That is a fallacy. As storage grows, the compute needed to search, analyse, process, sort, replicate and back up the data also grows. Large data sets require more computational power; if a database is smaller, the compute needed to scan, sort and back up its data will be less.
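A toy benchmark makes that coupling visible. The Python sketch below (the sizes and the filter predicate are arbitrary choices) times a full scan at two data sizes; the cost grows roughly in line with the data, before any sorting, replication or backup work is even counted.

```python
import time

def scan(data):
    # A full scan: the kind of work a query over unindexed data performs.
    return sum(1 for x in data if x % 97 == 0)

for size in (1_000_000, 10_000_000):
    data = list(range(size))
    start = time.perf_counter()
    scan(data)
    elapsed = time.perf_counter() - start
    print(f"{size:>12,} rows scanned in {elapsed:.3f}s")

# Ten times the data costs roughly ten times the compute.
```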
At an individual level, data explosion and storage concerns may seem trivial. A few extra dollars per month to store my data may be negligible. A half-second response time when querying a large data set is not noticeably worse than a quarter of a second. At small scale, a penalty of 10, 20 or even 50% may not be an issue. But at enterprise scale, and at national and global scale, it becomes hugely significant.
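A back-of-envelope calculation shows how the same per-gigabyte price plays out at different scales. All figures in the sketch below are illustrative assumptions, not vendor quotes.

```python
# Illustrative assumptions only: an assumed object-storage price and
# an assumed replication factor for durability and backups.
PRICE_PER_GB_MONTH = 0.02  # USD
REPLICATION_FACTOR = 3

estates = {"personal": 200, "enterprise": 5_000_000}  # sizes in GB

for label, gb in estates.items():
    monthly = gb * REPLICATION_FACTOR * PRICE_PER_GB_MONTH
    print(f"{label:>10}: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")

# personal:   $12/month, barely noticeable.
# enterprise: $300,000/month ($3.6M/year), a board-level line item.
```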
Data is the new Gold
“There’s money in them there databases!” If you believe this, then your default action will be “if in doubt, store the data”. It may be worth money in the future: we will be able to train AI systems on it and learn all sorts of things, which we can then monetise. This idea promotes data hoarding, just as you would hoard gold if you had it.
We store all this data because “Storage is Cheap” and “Data is the new gold”. We store everything we can because it may be useful in future. Yet this data not only clogs up our IT systems, DevOps tools and databases, it is also a significant and growing contributor to carbon emissions.
Further, there is a signal-to-noise problem with storing vast quantities of data. Massive volumes of data require more storage, more computational power and more human effort to make sense of them, and it becomes ever harder to extract meaningful data from the flood of useless data.
Finally, there is FODD (the Fear of Deleting Data), which we have discussed in previous blog posts. There is perceived risk and danger in removing data, so why not just store it?
Solution
Storing unused or unnecessary data is data pollution. Unused data consumes power and water and causes carbon pollution.
Can we be smarter about how we store data and what we store?
Here are some specific steps that organizations can take to address the problem of dark data:

- Audit and classify your data estate so you know what you hold, who owns it, and when it was last accessed (see the sketch after this list for one simple way to find stale files).
- Define and enforce data retention policies, with explicit expiry dates and automated archival or deletion.
- Tier your storage: move rarely accessed data to cheaper, lower-power cold storage instead of keeping everything hot.
- Delete data that has no business, legal or analytical value, and make deletion a routine, low-fear operation to counter FODD.
- Monitor and report on storage growth, so that data volume becomes a visible cost and sustainability metric.
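As a starting point for the audit step, here is a minimal sketch that walks a directory tree and flags files that have not been accessed recently. It assumes a filesystem that records access times (atime); the 180-day threshold and root path are arbitrary examples.

```python
import time
from pathlib import Path

THRESHOLD_DAYS = 180   # illustrative threshold for "stale"
ROOT = Path(".")       # illustrative root; point at the estate to audit

cutoff = time.time() - THRESHOLD_DAYS * 24 * 3600
dark_bytes = 0

for path in ROOT.rglob("*"):
    if path.is_file():
        stat = path.stat()
        if stat.st_atime < cutoff:      # not read since the cutoff
            dark_bytes += stat.st_size
            print(f"candidate: {path} ({stat.st_size:,} bytes)")

print(f"total dark-data candidates: {dark_bytes / 1e9:.2f} GB")
```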
By taking these steps, organizations can reduce their “data pollution” while also improving the efficiency of their IT systems, DevOps tools and data management processes.