Data Cleaning, Outlier Detection and Data Integration in Data Quality
https://www.dataversity.net/the-fundamentals-of-data-integration/

Introduction

Working with data is most rewarding when data professionals know what to look out for and which processes to follow to get the desired outcome. This article focuses on some of the key steps to take when collecting, manipulating and storing data.

a. Data Quality

Shoderu (2022) describes data quality as the appropriateness of data for the purpose for which it was collected, such as analyzing organizational services, predicting earnings or offering client management services. Put differently, the purpose for which data was retrieved dictates how relevant and important that data is to data scientists. To illustrate, if a dataset is collected to assess students' behavior towards online courses, records about workers' preference for the physical workplace would be of low quality for that purpose.

b. Data Cleaning

Data cleaning is a major procedure in data science that deals with spotting and removing errors in datasets. To make sure that the data collected is of high quality, the time-consuming and tedious task of data cleaning must be carried out (Deshmukh & Wangikar, 2011). The process of data collection is not always perfect: irregularities can creep in when data collectors retrieve information or when users enter their data into a system. Hence, it becomes one of the paramount responsibilities of data scientists to devote time and resources to removing these mistakes, such as typographical errors, redundant entries and other data-related problems.
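As a rough illustration of the kind of cleanup described above, the sketch below tidies a small list of names using only the Python standard library. The function name clean_records and the sample names are invented for this example, and the rules it applies (collapsing stray whitespace, normalizing capitalization, dropping empty and repeated entries) are only one plausible reading of "typographical errors" and redundant entries; real datasets usually need rules tailored to their fields.

```python
def clean_records(records):
    """Collapse whitespace, normalize capitalization, drop empty and repeated entries.

    Hypothetical helper for illustration only; not taken from the cited sources.
    """
    seen = set()
    cleaned = []
    for raw in records:
        # Collapse runs of whitespace and normalize capitalization.
        value = " ".join(raw.split()).title()
        if not value:
            continue  # skip entries that were empty or whitespace only
        if value in seen:
            continue  # skip entries already kept once
        seen.add(value)
        cleaned.append(value)
    return cleaned


# Invented sample data with stray spaces, mixed case, a blank and a repeat.
raw_names = ["  Jane Doe", "jane  doe", "", "John Smith ", "John Smith"]
cleaned_names = clean_records(raw_names)
print(cleaned_names)  # ['Jane Doe', 'John Smith']
```

The order of the first occurrence of each name is preserved, which makes it easier to compare the cleaned list against the raw one.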

c. Outlier Detection

Outlier detection can be described as the data cleaning technique that aims to fish out values that deviate from the norm of the data (Chu, 2019). An outlier is an odd value that does not follow the average or normal pattern of the other values in a dataset. It is the black sheep of the flock. For example, where a field records a person's gender, any value that is not one of female, male, man, woman or F/M would be regarded as an outlier, and the technique of finding it is outlier detection.
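The gender example can be sketched in a few lines of Python. This is a minimal sketch that assumes outliers in a categorical field are simply the values outside an agreed list; the names ALLOWED_GENDER_VALUES and find_outliers, and the sample responses, are made up for illustration. Numeric fields would normally call for a statistical rule instead.

```python
# Accepted values from the example in the text, stored in lowercase.
ALLOWED_GENDER_VALUES = {"female", "male", "man", "woman", "f", "m"}


def find_outliers(values, allowed=ALLOWED_GENDER_VALUES):
    """Return (index, value) pairs whose trimmed, lowercased form is not allowed."""
    outliers = []
    for index, value in enumerate(values):
        if value.strip().lower() not in allowed:
            outliers.append((index, value))
    return outliers


# Invented sample responses containing two values outside the accepted list.
responses = ["Female", "M", "man", "42", "woman", "unknwon", "F"]
gender_outliers = find_outliers(responses)
print(gender_outliers)  # [(3, '42'), (5, 'unknwon')]
```

Returning the position alongside the value lets a reviewer trace each flagged entry back to its original record before deciding whether to correct or remove it.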

d. Data Integration

According to Eltabakh (2012), data integration can be defined as the method of combining data from various sources, whether offline or online repositories. Much of this data has one thing in common: heterogeneity. In other words, the same data can be represented in many ways across databases and data warehouses. For instance, the word ‘Doctor’ can be indicated by ‘Dr’, ‘PhD’, ‘doctor’, etc., making it difficult for data professionals to work with the data. Data cleansing comes in handy at this point, as it can provide an automated pattern that identifies these similar words and uses just one to represent them all.
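One simple way to picture this is a lookup table that maps each variant to a single canonical label while records from several sources are merged. The sketch below assumes such a hand-written mapping; TITLE_SYNONYMS, normalize_title, integrate and the two sample sources are all hypothetical names and data introduced for illustration, not something taken from Eltabakh (2012).

```python
# Hypothetical mapping from known variants (lowercased) to one canonical label.
TITLE_SYNONYMS = {
    "dr": "Doctor",
    "dr.": "Doctor",
    "phd": "Doctor",
    "doctor": "Doctor",
}


def normalize_title(title):
    """Map a known variant to its canonical label; leave unknown titles trimmed."""
    trimmed = title.strip()
    return TITLE_SYNONYMS.get(trimmed.lower(), trimmed)


def integrate(*sources):
    """Merge records from several sources, normalizing the 'title' field."""
    merged = []
    for source in sources:
        for record in source:
            merged.append({**record, "title": normalize_title(record["title"])})
    return merged


# Invented records standing in for two heterogeneous repositories.
source_a = [{"name": "Person A", "title": "Dr"}]
source_b = [
    {"name": "Person B", "title": "PhD"},
    {"name": "Person C", "title": " doctor "},
    {"name": "Person D", "title": "Ms"},
]

integrated = integrate(source_a, source_b)
for record in integrated:
    print(record["name"], "-", record["title"])
```

Titles that are not in the mapping, such as "Ms" here, pass through unchanged, so the mapping can be extended gradually as new variants are discovered.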

Conclusion

No doubt, data is in everything we do. It can be used by individuals, companies and even governments to create more opportunities by understanding people's past complaints and future needs. High-quality data makes such studies more reliable. Where low-quality data is retrieved, data professionals should cleanse it and detect the outliers within the dataset. Finally, when data needs to be combined to better understand the trends behind some occurrences and to predict what comes next, data from multiple libraries and repositories can be integrated for maximum benefit.

Reference List

Chu, X. (2019). Data Cleaning. In Ilyas, I. F. and Chu, X. Data Cleaning. Georgia: ACM Books.

Deshmukh, R., & Wangikar, V. (2011). Data Cleaning: Current Approaches and Issues. IEEE International Conference on Knowledge Engineering. At: Department of CS & IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad.

Eltabakh, M. (2012). Data integration. CS561-spring 2012. Worcester Polytechnic Institute. https://web.cs.wpi.edu/~cs561/s12/Lectures/IntegrationOLAP/DataIntegration.pdf

Shoderu, A. O. (2022, March 22). Importance of Data Quality and Meta Data in Data Science. LinkedIn. https://www.dhirubhai.net/pulse/importance-data-quality-meta-science-azeez-olanrewaju-shoderu/

