Is Data Redundancy Impeding Your Progress?

Following up on our last article: if you want to manage the quality of your data, it's important to notice and act on any significant change in your data's fingerprint, a concept we introduced last week.

In response to that article, we received a lot of feedback from data professionals (architects, analysts, IT staff) who said they were increasingly challenged by data quality as the complexity and volume of their data grew, and who were looking for new solutions that could actually keep pace.

So we wanted to share a customer story that illustrates another pervasive data quality problem, one that can be solved elegantly with new automated machine learning technology.

The customer is an international bank; let's call it Giant International Bank (GIB). The enterprise data team at GIB, like so many other data teams, receives data from multiple internal and external sources and funnels it into a central data repository. This group is responsible for the quality, integrity, and risk management of the bank's data assets. They are the keepers of information and provide it to downstream business units for decision-making, so quality issues at the source have a ripple effect across the entire organization.

GIB suffers from data redundancy: the same information appears in multiple systems and in multiple formats across the company, and simply does not tally from system to system. This undermines reporting initiatives, impedes efforts to make sound strategic decisions, and poses a serious risk to the business. The enterprise data team at GIB also creates data products for downstream use in analytical tools and reporting by mixing and matching data across sources, which is fertile ground for duplicate data to grow. The process of identifying redundant data has been manual and error-prone.

Emcien's data discovery process uses machine learning to automatically identify and flag redundant data, which helps the data team isolate or eliminate redundant records and create a leaner data repository that provides consistent, high-quality data to the enterprise.

Redundant data often exists within a single source or arrives from multiple sources, and the same content may carry different headers due to nomenclature inconsistencies. Emcien's data discovery identifies redundant data instantly, even when the column headers differ.
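To make the idea concrete, here is a minimal sketch of content-based duplicate detection: each column is fingerprinted by hashing its values, so columns with identical content group together regardless of what their headers are called. This is an illustration of the general technique, not Emcien's actual algorithm; the table layout and column names are hypothetical.

```python
import hashlib

def column_fingerprint(values):
    """Hash a column's values (in row order) so identical content maps to one key."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode()).hexdigest()

def find_duplicate_columns(table):
    """table: dict mapping header -> list of values.
    Returns groups of headers whose columns hold identical content."""
    groups = {}
    for header, values in table.items():
        groups.setdefault(column_fingerprint(values), []).append(header)
    return [headers for headers in groups.values() if len(headers) > 1]

# Two columns carry the same content under different headers.
table = {
    "cust_id": [101, 102, 103],
    "customer_number": [101, 102, 103],
    "balance": [10.0, 20.0, 30.0],
}
print(find_duplicate_columns(table))  # [['cust_id', 'customer_number']]
```

Hashing keeps memory bounded even for very wide tables, since only one digest per column is retained rather than the full column contents.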

Near-redundant data is also a challenging problem, and is much harder to identify. It occurs frequently when the same data is collected or stored in different units. For example, two columns that store the same value in different currencies, such as dollars and pounds, are actually the same data, even though the content looks very different. Copies of data are also created during data transformations, which happen continually. For example, customer annual income is divided by 12 to compute monthly income. The two columns are replicas, since knowing one lets you compute the other.
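The annual-income example above can be sketched in code: if one column is a constant multiple of another (a fixed exchange rate, or a divide-by-12 transformation), the ratio between paired values is constant. This is a simplified illustration of near-redundancy detection, assuming non-zero numeric values; it is not Emcien's actual method.

```python
def is_scaled_copy(col_a, col_b, tol=1e-9):
    """True if col_b == k * col_a for some constant k (within tolerance)."""
    if len(col_a) != len(col_b) or not col_a or col_a[0] == 0:
        return False
    k = col_b[0] / col_a[0]  # candidate scale factor from the first row
    return all(abs(b - k * a) <= tol * max(abs(b), 1.0)
               for a, b in zip(col_a, col_b))

annual = [60000.0, 84000.0, 45000.0]
monthly = [v / 12 for v in annual]  # derived column created by a transformation
print(is_scaled_copy(annual, monthly))  # True: the columns are replicas
```

The same check flags currency pairs recorded at a fixed rate; columns related by a shift as well as a scale (for example, Celsius and Fahrenheit) would need a slightly more general linear fit.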

In most companies, data redundancy ranges from 10% on the low side to 30% on the high side. The impact shows up across the entire organization, adding up to as much as a 30% cost surcharge on all data-related activities, and a much higher cost on the risk side of the business.

For GIB, redundant and inconsistent data is not just inconvenient; it is a major impediment to business agility and competitiveness. The cost of managing the data runs into millions of dollars annually, without any visibility into the quality and completeness of the data.

At GIB, the data team recognizes that identifying redundancy and maintaining the repository is not a one-time job. It's a continuous process, hence the need for automation.

Identifying redundancy automatically, at the source, reduced cost by easing the burden on data storage, computational processes, and team time. It freed the data team to work on other, more interesting projects. And it trimmed the "fat" so that only lean, useful data passed through to downstream analytical and reporting systems.

What else? We look forward to hearing from you.

I like the article and actually think redundancy is 2x higher

Nandakumar Katta

Chairman at Smart India Hackathon-Cleanwater domain | Entrepreneur | Strategy | Solutions | Products | Platforms | IoT

8y

Excellent illustration. Truly, this is a problem one can't take for granted once you start looking at data, and more importantly, USEFUL data that can be MEANINGFUL.
