Data Cleaning, Outlier Detection and Data Integration in Data Quality
Azeez Olanrewaju Shoderu
Educator @A.O.S Abroad | AI Consultant | Publisher | Amazon Best-Selling Author | Simplifying AI, Empowering Professionals to Succeed in Life, Build Profitable Careers or Businesses Worldwide.
Introduction
Working with data is a rewarding task when data professionals know what to look out for and which processes to follow to reach the desired outcome. This article focuses on some of the key steps to take when collecting, manipulating and storing data.
a. Data Quality
Shoderu (2022) describes data quality as the appropriateness of data for the purpose for which it was collected, such as analyzing organizational services, predicting earnings or offering client management services. Put differently, the purpose for which data was retrieved determines how relevant and valuable it is to data scientists. To illustrate, if a dataset is gathered to assess students' attitudes towards online courses, records about workers' preferences for the physical workplace would be of low quality for that purpose.
b. Data Cleaning
Data cleaning is a major procedure in data science that deals with detecting errors in datasets and correcting or removing them. To ensure that collected data is of high quality, the time-consuming and tedious task of data cleaning must be carried out (Deshmukh & Wangikar, 2011). Data collection is rarely perfect: irregularities can creep in when data collectors retrieve information or when users enter their data into a system. It therefore becomes one of the paramount responsibilities of data scientists to devote time and resources to removing mistakes such as typographical errors, redundant entries and other data-related problems.
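As a minimal sketch of what such cleaning can look like in practice, the Python snippet below tidies a list of names by collapsing stray spaces, unifying letter case and dropping blank or repeated entries. The function name `clean_records` and the sample names are hypothetical and chosen only for illustration. Genuine typographical errors (misspellings) would need further techniques that are not shown here.

```python
def clean_records(records):
    """Trim whitespace, unify case, and drop blank and repeated names."""
    seen = set()
    cleaned = []
    for raw in records:
        # Collapse runs of spaces and put every name in the same letter case.
        name = " ".join(raw.split()).title()
        if not name:
            continue  # skip empty entries
        if name in seen:
            continue  # skip entries already kept once
        seen.add(name)
        cleaned.append(name)
    return cleaned


# Hypothetical raw input containing spacing, casing and duplication problems.
raw_names = ["  ada lovelace", "Ada  Lovelace", "GRACE HOPPER", "", "grace hopper "]
cleaned_names = clean_records(raw_names)
print(cleaned_names)
```

Keeping the first occurrence of each name and preserving input order is a design choice made for this sketch; a real project would decide which duplicate to keep based on its own rules.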
c. Outlier Detection
Outlier detection is the data cleaning concept that aims to identify values that deviate from the norm of the data (Chu, 2019). An outlier is therefore an odd value that does not follow the typical pattern of the rest of a dataset; it is the black sheep of the flock. For example, in a field that records a person's gender, any value other than female, male, man, woman or F/M would be regarded as an outlier, and the technique for finding such values is outlier detection.
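The gender example can be sketched in Python as a simple membership check. The accepted values come from the example in this section; the function name `find_outliers` and the sample responses are hypothetical. This is only one simple way to flag unexpected categorical values, not a general outlier detection method.

```python
# Accepted values taken from the example in the text, compared in lower case.
ACCEPTED_GENDER_VALUES = {"female", "male", "man", "woman", "f", "m"}


def find_outliers(values, accepted=ACCEPTED_GENDER_VALUES):
    """Return (index, value) pairs for entries outside the accepted set."""
    outliers = []
    for index, value in enumerate(values):
        # Ignore surrounding spaces and letter case before comparing.
        if value.strip().lower() not in accepted:
            outliers.append((index, value))
    return outliers


# Hypothetical form responses; "42" and "N/A" do not belong to the accepted set.
responses = ["Female", "M", "man", "42", "woman", "N/A", " f "]
gender_outliers = find_outliers(responses)
print(gender_outliers)
```

Returning the position alongside the value makes it easier to trace each flagged entry back to its original record.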
d. Data Integration
According to Eltabakh (2012), data integration can be defined as the method of combining data from various sources, whether offline or online data repositories. Many of these sources have one thing in common: their heterogeneity. In other words, the same data can be represented in many different ways across databases and data warehouses. For instance, the word ‘Doctor’ can appear as ‘Dr’, ‘PhD’, ‘doctor’, etc., which makes the data difficult for data professionals to work with. Data cleansing is useful at this point, as it can provide an automated pattern that identifies these equivalent terms and uses just one to represent them all.
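A minimal sketch of this idea in Python follows. It combines records from two sources and maps the title variants mentioned above to the single form ‘Doctor’. The source names (`hr_system`, `web_form`), the record layout, the sample names and the extra variant ‘Dr.’ are assumptions made for illustration and do not come from the cited lecture.

```python
# Variants that should all be stored as one canonical title.
# "dr." is an assumed extra variant; the rest come from the example in the text.
TITLE_VARIANTS = {"dr": "Doctor", "dr.": "Doctor", "phd": "Doctor", "doctor": "Doctor"}


def normalize_title(title):
    """Map a known variant to its canonical form; leave other titles as they are."""
    key = title.strip().lower()
    return TITLE_VARIANTS.get(key, title.strip())


def integrate(*sources):
    """Combine records from several sources into one list with uniform titles."""
    combined = []
    for source in sources:
        for record in source:
            merged = dict(record)  # copy so the original source is left untouched
            merged["title"] = normalize_title(record["title"])
            combined.append(merged)
    return combined


# Two hypothetical sources that spell the same title differently.
hr_system = [{"name": "A. Bello", "title": "Dr"}]
web_form = [{"name": "C. Okoye", "title": "PhD"}, {"name": "D. Musa", "title": "Mr"}]
integrated = integrate(hr_system, web_form)
print(integrated)
```

A lookup table like this only handles variants that are known in advance; unfamiliar spellings pass through unchanged and would still need review.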
Conclusion
Data is part of everything we do. Individuals, companies and even governments can apply it to create more opportunities once they understand people's past complaints and future needs. Such studies, however, benefit from data of high quality. Where low-quality data is retrieved, data professionals should cleanse it and detect the outliers within the dataset. Finally, when data needs to be combined to better understand the trends behind events and to predict what comes next, data from multiple libraries and repositories can be integrated for maximum benefit.
Reference List
Chu, X. (2019). Data Cleaning. In Ilyas, I. F., & Chu, X. Data Cleaning. Georgia: ACM Books.
Deshmukh, R., & Wangikar, V. (2011). Data Cleaning: Current Approaches and Issues. IEEE International Conference on Knowledge Engineering. At: Department of CS & IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad.
Eltabakh, M. (2012). Data integration. CS561-spring 2012. Worcester Polytechnic Institute. https://web.cs.wpi.edu/~cs561/s12/Lectures/IntegrationOLAP/DataIntegration.pdf
Shoderu, A. O. (2022, March 22). Importance of Data Quality and Meta Data in Data Science. LinkedIn. https://www.dhirubhai.net/pulse/importance-data-quality-meta-science-azeez-olanrewaju-shoderu/