Data cleansing: Gold In, Gold Out
Photo by Anna Nekrashevich on Pexels

The Cost of Poor Quality (COPQ) metric measures the costs that would disappear if all failures were removed from a product, service or process.

These costs include waste and variation, overheads to fix the issues, rework costs, and lost opportunities such as churned customers or reputation damage.

COPQ is measured either as a percentage of sales or of total costs, and it is estimated that around 20 percent of an organisation's direct costs are the result of poor quality data, primarily due to overuse, misuse and waste. You can learn more about COPQ in "What is COPQ?".
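
As a back-of-the-envelope sketch (all figures are hypothetical, invented for this example), COPQ is simply the sum of those failure costs expressed against sales or against total costs:

```python
# Hypothetical yearly figures, in thousands of euros.
waste, rework, lost_opportunity = 120, 80, 200
sales, total_costs = 5_000, 3_500

# COPQ is the sum of the costs that would disappear without failures.
copq = waste + rework + lost_opportunity
print(f"COPQ: {copq / sales:.1%} of sales")        # COPQ: 8.0% of sales
print(f"COPQ: {copq / total_costs:.1%} of costs")  # COPQ: 11.4% of costs
```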

Poor quality data can seriously harm your business. It can lead to inaccurate analysis, poor customer relations and poor business decisions.

According to IBM, the yearly cost of poor quality data is $3.1 trillion in the US alone. Gartner estimates that poor data quality (DQ) costs organizations an average of $12.9 million every year.

To reduce or eliminate your Cost of Poor DQ by improving the data quality, you need to:

  • clean your data (if it needs it; no data set is perfect), and
  • avoid getting your data dirty in the first place

When dealing with poor quality data, GIGO is a common acronym for the expression "Garbage In, Garbage Out". Its first use is periodically revisited and currently dated to 1957, but the underlying principle is always the same: the quality of information coming out of a system will be bad if the quality of the information that was put in was already bad.

But what is good, bad or poor quality information?

Let's say you have a list of 10,000 products, each with the product name, provider name, product family, cost and characteristics (shape, color, type of product, ...), and another list with the details of 500 providers: product family, name, tax number, city, address, GPS location of their offices and telephone number.

If you need to know which provider has the lowest cost for a specific product family, for example "My PFamily", you could look in the list of providers for those of "My PFamily", then look up the cost of the products of "My PFamily" in the first list. But you could find this (the sketch after the list illustrates it):

  • the product family of some providers of "My PFamily" had actually been entered in the list as "My_PFamily", "MiPFamili" or "My ProductF", so everything not coinciding with "My PFamily" is discarded by the system
  • the product family of some providers of "My PFamily" was blank (null, void, no value, ...), again leaving them out of the results
  • the same provider appeared twice in the list, with "My PFamily" in one appearance and "MyPFamily" (no space) in the other, being counted as two different occurrences when it was just the same provider
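
Here is a minimal pandas sketch of those three failure modes; the provider names and values are hypothetical, invented for this example:

```python
import pandas as pd

# Hypothetical provider list reproducing the three problems above:
# misspelled family names, a blank family, and a duplicated provider.
providers = pd.DataFrame({
    "provider": ["Acme", "Best Co", "Ceta", "Delta", "Delta"],
    "family":   ["My PFamily", "My_PFamily", None, "My PFamily", "MyPFamily"],
})

# An exact-match filter keeps only rows spelled exactly "My PFamily",
# silently dropping "Best Co" (misspelled) and "Ceta" (blank).
exact = providers[providers["family"] == "My PFamily"]
print(exact["provider"].tolist())        # ['Acme', 'Delta']

# And "Delta" is stored under two spellings, so three different
# "families" are counted where there may be only one.
print(providers["family"].value_counts())
```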

Had you created a separate list with the names of the product families, identified each with a code (an ID), and entered that ID (instead of the name of the product family) in the product family field of every provider, you would have avoided these bad results. It is easier to maintain a list of 10 product families, make sure it is correct, and let the database management system take care of the uniqueness and correspondence of the IDs, than to review hundreds or thousands of records in several lists to check whether the name of the product family was properly entered.

Even if you had perfect lists (tables), with the product family properly written in absolutely all records, if you changed the name of one of your product families, say from "My PFamily" to "My Stylish Family", you would have to change the name of the family in every record of all lists (it could be thousands), making sure not a single record is left unchanged. Again, having a separate list of product families with IDs, and using those IDs in the product family field of all your records, would allow you to change the name just once, in the product family list; all records in all the other lists would still work through the IDs, which remain unchanged, as the sketch below shows.
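
A minimal sketch of this lookup-table design, using Python's built-in sqlite3 module (the table and column names are invented for the example): the family name lives in exactly one row, and renaming it is a single UPDATE that no provider record needs to know about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The family name is stored once, in its own table; providers
# reference it by ID, so the name cannot be misspelled per record.
conn.execute("""CREATE TABLE product_families (
                    id   INTEGER PRIMARY KEY,
                    name TEXT NOT NULL UNIQUE)""")
conn.execute("""CREATE TABLE providers (
                    id        INTEGER PRIMARY KEY,
                    name      TEXT NOT NULL,
                    family_id INTEGER NOT NULL
                              REFERENCES product_families(id))""")

conn.execute("INSERT INTO product_families (id, name) VALUES (1, 'My PFamily')")
conn.execute("INSERT INTO providers (name, family_id) VALUES ('Acme', 1), ('Delta', 1)")

# Renaming the family touches one row; every provider follows automatically.
conn.execute("UPDATE product_families SET name = 'My Stylish Family' WHERE id = 1")

for row in conn.execute("""SELECT p.name, f.name
                           FROM providers p
                           JOIN product_families f ON f.id = p.family_id"""):
    print(row)   # ('Acme', 'My Stylish Family'), ('Delta', 'My Stylish Family')
```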

And these are only examples where a proper application of SQL rules may do the trick; in other cases you may need NoSQL databases, or to migrate data from old mainframes, or your data may even live on paper or in Excel sheets, with problems like manual data entry errors, OCR errors, incomplete information, ambiguous data, duplicate data, data transformation errors, ...

A chaotic data system is impossible to maintain and curate.

There are several practices one may apply for cleansing and preventing poor quality data:

  • Data definition: the first thing to do is to have a clear-cut idea of the data needed for the system to work and of the volume of data you are expecting to get, which will also be a guide when planning your hardware (server and backup tools).
  • Knowledge of data sources: if the input of your system is a quantity of cars, you should not allow an input of π cars; by validating at the source, you don't need to program an algorithm to deal with input data that you already know is mistaken.
  • Data collection: the data needed may not be readily available, or may be difficult to extract. When the company comes from a software environment, data collection is facilitated by the software itself (ERP, CRM, PMS, ... through APIs, web services, ...), but it needs to be done in an effective and secure way. And it is not rare that part of the data collected is in hard copy, soft copy or even both.
  • Data analysis: tag, classify and identify your data (with, for example, material number, business partner number and employee ID), and group your data (for example by item group or customer group). Adopt a naming convention, and ensure that all mandatory (primary) fields are filled, without nulls (blanks).
  • Data clean-up: filter; separate the wheat from the chaff. In most cases, the data you get, irrespective of your definition, contains a lot of junk. It is important that these data be filtered, to identify what you need and what you don't.
  • Data standardization: bringing data into a uniform format, for example putting different variables on the same scale in order to compare scores between different types of variables. Entering units uniformly is key.
  • Data normalization: organizing data so it appears consistent across all records and fields. It increases the cohesion of entry types, leading to easier cleansing, lead generation, segmentation and higher quality data.
  • Data quality check: accuracy, relevancy (data should meet the requirements for the intended use), completeness (no missing values or missing records) and timeliness (data must be up to date). The sketch after this list illustrates a few of these practices.
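
As a minimal sketch of a few of these practices, here is a hypothetical Python pass combining source validation, unit standardization and a completeness/duplicate check (the field names, units and sample records are invented for the example):

```python
def validate_car_count(value):
    """Source validation: a quantity of cars must be a non-negative
    integer, so an input like 3.14159 never enters the system."""
    if not isinstance(value, int) or value < 0:
        raise ValueError(f"invalid car count: {value!r}")
    return value

def standardize_weight_kg(value, unit):
    """Standardization: store every weight in kilograms, whatever
    unit it was entered in (only kg and lb handled in this sketch)."""
    factors = {"kg": 1.0, "lb": 0.453592}
    return value * factors[unit]

def quality_check(records, mandatory=("provider", "family", "cost")):
    """Completeness and duplicate check: report records that miss a
    mandatory field or repeat an already-seen provider name."""
    problems, seen = [], set()
    for i, rec in enumerate(records):
        for field in mandatory:
            if rec.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        if rec["provider"] in seen:
            problems.append((i, "duplicate provider"))
        seen.add(rec["provider"])
    return problems

records = [
    {"provider": "Acme", "family": "My PFamily", "cost": 12.5},
    {"provider": "Ceta", "family": None,         "cost": 9.0},
    {"provider": "Acme", "family": "My PFamily", "cost": 12.5},
]
print(validate_car_count(4))            # 4
print(standardize_weight_kg(10, "lb"))  # 4.53592
print(quality_check(records))  # [(1, 'missing family'), (2, 'duplicate provider')]
```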

One of the most effective ways to clean data is to ask your data for a piece of information, a key performance indicator, and make the changes needed to obtain a correct result.
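
For example, continuing the hypothetical product lists from above, a simple "lowest cost per family" KPI query immediately exposes the dirty labels that need fixing:

```python
import pandas as pd

products = pd.DataFrame({
    "family": ["My PFamily", "My_PFamily", "My PFamily", None],
    "cost":   [12.5, 9.0, 11.0, 7.5],
})

# The KPI itself surfaces the problem: one real family shows up as
# several groups, plus a NaN group that should not exist at all.
print(products.groupby("family", dropna=False)["cost"].min())
# My PFamily    11.0
# My_PFamily     9.0
# NaN            7.5
```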

GIGO is commonly used to describe failures in human decision-making due to faulty, incomplete, or imprecise data.

Organizations often lack the time and/or resources needed to manually dig through and manipulate vast quantities of data.

When they do embrace a data cleansing process, it can come at a high cost, only to discover later that some KPIs still cannot be extracted from the cleaned data.

Zeus Smart Visual Data finds and solves these issues in 100% of the business intelligence projects we develop for our clients. And we solve them because we need a reliable database from which to extract the gold quality data for the comparisons, aggregates and predictions that become gold for the decision makers.

Up to 90% of the work of a business intelligence project consists of the data cleansing and the extraction, transformation and loading (ETL) of the dashboard database, which at the end of the BI project can often be reused as the client's data warehouse for other purposes.

This is why I propose that you convert the GIGO status of your data set from "Garbage In, Garbage Out" into "Gold In, Gold Out", via the data cleansing that comes along with any business intelligence project, providing you with gold quality data in your company and, as a result, a gold understanding of your information through a business intelligence dashboard for your best decision-making.
