Data cleansing: Gold In, Gold Out
Vicente Castillo
Chief of Innovation and Technology at Zeus by Llyc | MSc Artificial Intelligence | B.Eng. Telecommunications | Lecturer at Universidad Europea de Valencia | Speaker and Trainer in AI and BI
The Cost of Poor Quality (COPQ) metric measures the costs that would disappear if all failures were removed from a product, service or process.
These costs include waste and variation, the overhead of fixing issues, rework, and lost opportunities such as churned customers or reputation damage.
COPQ is expressed either as a percentage of sales or of total costs, and it is estimated that around 20 percent of an organization's direct costs are the result of poor quality data, primarily due to overuse, misuse and waste. You can learn more about COPQ in What is COPQ?
Poor quality data can seriously harm your business. It can lead to inaccurate analysis, poor customer relations and poor business decisions.
According to IBM, the yearly cost of poor quality data is $3.1 trillion in the US alone. Gartner estimates that every year, poor data quality (DQ) costs organizations an average of $12.9 million.
To reduce or eliminate your cost of poor data quality, you need to improve the data quality itself.
When dealing with poor quality data, GIGO is a common acronym for the expression "Garbage In, Garbage Out". The date of its first use is periodically revised (it is currently placed at 1957), but the underlying principle is always the same: the quality of the information coming out of a system will be bad if the quality of the information put in was already bad.
But what counts as good, bad or poor quality information?
Let's say you have a list of 10,000 products, each with the product name, provider name, product family, cost and characteristics (shape, color, type of product, ...), and another list with the details of 500 providers: product family, name, tax number, city, address, GPS location of their offices and telephone.
If you need to know which provider has the lowest cost for a specific product family, for example "My PFamily", you could look in the list of providers for those serving "My PFamily" and then look up the cost of that family's products in the first list. But you could find that the family name was not entered consistently in every record, so some products or providers are silently missed and the result is wrong.
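As an illustration, here is a minimal, hypothetical sketch in Python with pandas (the product, provider and family names are invented) of how inconsistent spellings of the same product family can silently break the search for the cheapest provider:

```python
import pandas as pd

# Hypothetical data: the same family appears with three different
# spellings, so a join on the raw text misses most of the rows.
products = pd.DataFrame({
    "product_name": ["Widget A", "Widget B", "Widget C"],
    "product_family": ["My PFamily", "my pfamily", "My PFamily "],
    "cost": [12.5, 9.9, 11.0],
})
providers = pd.DataFrame({
    "provider_name": ["Acme", "Globex"],
    "product_family": ["My PFamily", "MY PFAMILY"],
})

# Joining on the free-text family name only matches the exact spelling,
# so the truly cheapest product (9.9) is never even considered.
merged = products.merge(providers, on="product_family")
print(merged)                                  # only 'Widget A' survives the join
print(merged.loc[merged["cost"].idxmin()])     # reports 12.5 as the "lowest" cost
```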
Had you created a separate list with the names of the product families, identified each one with a code (an ID), and entered that ID in the product family field of every provider (instead of the family name itself), you would have avoided these bad results. It is easier to maintain a list of ten product families and make sure it is correct, letting the database management system take care of ID uniqueness and correspondence, than to review hundreds or thousands of records across several lists to check whether the name of the product family was entered properly.
Even if you had perfect lists (tables) with the product family properly written in absolutely every record, changing the name of one of your product families, say from "My PFamily" to "My Stylish Family", would force you to change the name in every record of every list (it could be thousands), making sure not a single record is left unchanged. Again, having a separate list of product families with IDs, and using those IDs in the product family field of all your records, would let you change the name only in the product family list; all records in all the other lists would keep working through the IDs, which remain unchanged.
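A brief sketch of that lookup-table approach, again with invented names: the family name lives in exactly one place, every other record carries only the ID, and a rename is a single-row change.

```python
import pandas as pd

# One small, easy-to-curate list of product families with stable IDs.
families = pd.DataFrame({
    "family_id": [1, 2],
    "family_name": ["My PFamily", "Other Family"],
})

# Products and providers reference the family only through its ID.
products = pd.DataFrame({
    "product_name": ["Widget A", "Widget B"],
    "family_id": [1, 1],
    "cost": [12.5, 9.9],
})
providers = pd.DataFrame({
    "provider_name": ["Acme", "Globex"],
    "family_id": [1, 2],
})

# Renaming the family touches a single row; no other record changes.
families.loc[families["family_id"] == 1, "family_name"] = "My Stylish Family"

# Reports still resolve the current name through the ID.
report = products.merge(families, on="family_id").merge(providers, on="family_id")
print(report[["product_name", "family_name", "provider_name", "cost"]])
```

In a relational database this is exactly what the DBMS enforces for you through primary and foreign keys, the "uniqueness and correspondence" mentioned above.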
And these are only examples where a proper application of relational (SQL) design rules may do the trick. In other cases you may need NoSQL databases, have to migrate data from old mainframes, or even have your data on paper or in Excel sheets, with problems such as manual data entry errors, OCR errors, incomplete information, ambiguous data, duplicate data, data transformation errors, and so on.
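Where the data arrives from spreadsheets, OCR or manual entry, a first cleansing pass often looks like the following generic sketch (not any particular pipeline): trim stray whitespace, unify case, drop exact duplicates and flag incomplete records rather than guessing at them.

```python
import pandas as pd

def basic_cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """A generic first-pass cleanse for text-heavy tabular data."""
    cleaned = df.copy()
    # Trim stray whitespace and unify case in every text column.
    for col in cleaned.select_dtypes(include="object").columns:
        cleaned[col] = cleaned[col].str.strip().str.title()
    # Remove exact duplicate rows (e.g. the same record keyed in twice).
    cleaned = cleaned.drop_duplicates()
    # Flag records with missing fields instead of silently dropping them.
    cleaned["incomplete"] = cleaned.isna().any(axis=1)
    return cleaned
```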
A chaotic data system is impossible to maintain and curate.
There are several practices one may apply to cleanse data and prevent poor quality:
One of the most effective ways to clean data is to ask your data for a piece of information, a key performance indicator, and make whatever changes are needed to obtain the correct result.
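A hedged illustration of that KPI-driven practice (the KPI and the column names are invented for the example): compute the indicator, and let the records that cannot contribute to it point you at what needs fixing first.

```python
import pandas as pd

def cheapest_provider_per_family(products, providers, families):
    """KPI: lowest product cost per product family, with its provider."""
    joined = (products.merge(families, on="family_id", how="left")
                      .merge(providers, on="family_id", how="left"))
    # Records that break the KPI: missing cost, or a family / provider
    # that could not be resolved. These are the rows to clean first.
    broken = joined[joined[["cost", "family_name", "provider_name"]]
                    .isna().any(axis=1)]
    kpi = joined.dropna(subset=["cost"]).groupby("family_name")["cost"].min()
    return kpi, broken
```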
GIGO is commonly used to describe failures in human decision-making due to faulty, incomplete, or imprecise data.
Organizations often lack the time and/or resources to manually dig through and manipulate vast quantities of data.
When they do embrace a data cleansing process, it can come at a high cost, only for them to discover later that some KPIs still cannot be extracted from the cleaned data.
Zeus Smart Visual Data finds and solves these issues in 100% of the business intelligence projects we develop for our clients. And we solve them because we need a reliable database from which to extract the gold-quality data for the comparisons, aggregates and predictions that become gold for the decision makers.
Up to 90% of the work in a business intelligence project consists of data cleansing and the extract, transform and load (ETL) of the dashboard database, which at the end of the BI project can often be reused as the client's data warehouse for other purposes.
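At its simplest, that ETL step can be sketched as follows (a toy example; the file and table names are assumptions): extract the raw lists, apply the cleansing rules, and load the result into a database that can later serve as the data warehouse.

```python
import sqlite3
import pandas as pd

# Extract: read the raw exports (file names are placeholders).
products = pd.read_csv("raw_products.csv")
providers = pd.read_csv("raw_providers.csv")

# Transform: apply the cleansing rules before anything reaches the dashboard.
for df in (products, providers):
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
products = products.drop_duplicates()
providers = providers.drop_duplicates()

# Load: persist the clean tables; this database can later be reused
# as the client's data warehouse beyond the dashboard itself.
with sqlite3.connect("warehouse.db") as conn:
    products.to_sql("products", conn, if_exists="replace", index=False)
    providers.to_sql("providers", conn, if_exists="replace", index=False)
```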
This is why I propose converting the GIGO status of your data set from "Garbage In, Garbage Out" into "Gold In, Gold Out" through the data cleansing that comes along with any business intelligence project, giving your company gold-quality data and, through a business intelligence dashboard, a gold understanding of your information for your best decision-making.