Ultimate Guide to Data Cleaning Using Python, MS Excel, OpenRefine and RapidMiner
Hiranmayee Panchangam
Information Technology Geek | UNT IS Grad '24 | KLU CS Grad '20 | Tech Enthusiast
Here I review the most relevant, up-to-date citations on data cleaning and examine the credible, trustworthy techniques one by one. We all know the steps our muscle memory falls back on while pre-processing a dataset: removing null values, changing data types, converting categorical variables to numeric ones, combining correlated variables with Principal Component Analysis, and populating a few rows using mean, median, or mode values.
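For concreteness, here is a minimal pandas/scikit-learn sketch of those routine steps; the file name and the column names ("age", "city", "target") are hypothetical placeholders, not taken from any of the cited articles.

```python
# Minimal sketch of the routine pre-processing steps listed above.
# File and column names are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")

# Remove rows and columns that are entirely null
df = df.dropna(how="all").dropna(axis=1, how="all")

# Fix data types: coerce a numeric field stored as text
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Populate remaining gaps with median / mode values
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Convert categorical variables to numeric dummy variables
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# Combine correlated feature columns with PCA (leaving the target out)
features = df.drop(columns=["target"])
components = PCA(n_components=2).fit_transform(features)
```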
But the question is: is there more to it? Let's find out what the experts say.
According to the above article (Dilmegani, 2022), the five steps to cleaner data are –
I think the third step, measuring data accuracy, is the most important one for any dataset. Without accurate data, no model, no matter how good the algorithm, will produce results that serve the project. Hence, we must know to what level we can trust the data. For example, insurance data used to fetch phone numbers for real-estate cold calling may not be fruitful, because not all of the numbers in the data may be valid.
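As a rough illustration of measuring accuracy for a single field, the sketch below computes the share of phone numbers that are even syntactically plausible; the column name and the 10-digit pattern are assumptions for this example.

```python
# Hypothetical accuracy check: what fraction of phone numbers are even
# syntactically valid (10 digits after stripping punctuation)?
import pandas as pd

df = pd.DataFrame({"phone": ["9405551234", "555-0000", "N/A", "9725550101"]})

digits_only = df["phone"].astype(str).str.replace(r"\D", "", regex=True)
is_valid = digits_only.str.match(r"^\d{10}$")

print(f"Share of syntactically valid phone numbers: {is_valid.mean():.0%}")
```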
2. Citation: Dilmegani, C. (2023). Guide to data cleaning in '23: Steps to Clean Data & Best Tools. AIMultiple. https://research.aimultiple.com/data-cleaning/
According to the above article, the author (Dilmegani, 2023) recommends the following six best practices –
I did not expect to see the first one, which is to think of the data in a holistic way. As data engineers we often treat the work as simply a duty to derive some conclusions at the end. We rarely consider how those conclusions might be used further and how they will serve mankind. Einstein, for instance, derived a few formulae that were later used as the basis for a lethal weapon that erased generations of a people. We should always keep in mind what an analysis will be used for, and how, from the beginning through to the end user and into the future.
3. Citation: Petrova-Antonova, D., & Tancheva, R. (2020). Data Cleaning: A case study with OpenRefine and Trifacta Wrangler. Communications in Computer and Information Science, 32–40. https://doi.org/10.1007/978-3-030-58793-2_3
According to the authors (Petrova-Antonova & Tancheva, 2020), "Producing high quality datasets require data problems to be identified and cleaned using different data cleaning techniques."
Yes, I agree. Unless we thoroughly comprehend the metadata of the dataset, the fields it contains, how they can be utilized, what they convey, how they can contribute, and why they are important, having many fields is useless. And before we can make use of those fields, the problems in them must be identified, for instance with a quick profiling pass like the one sketched below.
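As a hedged illustration, the small profiling pass below surfaces exactly those questions for each field: its type, how much of it is missing, how many distinct values it has, and a sample value. The input file name is a placeholder.

```python
# Quick profile of every column: dtype, missingness, cardinality, sample value.
# "data.csv" stands in for whatever dataset is being cleaned.
import pandas as pd

df = pd.read_csv("data.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(1),
    "unique": df.nunique(),
    "example": df.apply(lambda c: c.dropna().iloc[0] if c.notna().any() else None),
})
print(profile.sort_values("null_pct", ascending=False))
```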
4. Citation: McFarland, A. (2022, April 28). 10 best data cleaning tools. Unite.AI. https://www.unite.ai/10-best-data-cleaning-tools/
According to the author (McFarland, 2022), "There can be many errors in data coming from things like bad data entry, the source of data, mismatch of source and destination, and invalid calculation."
There are other sources of error not identified by the author, such as time-related issues and time-frame errors, where data is synced incorrectly across time zones: 10 AM in Hyderabad, India is late in the evening of the previous day in Denton, Texas. Merging co-existing data without keeping this in mind can introduce errors, apparent outliers, and so on; a sketch of normalising timestamps before merging follows below.
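Here is a hedged sketch of guarding against that pitfall: localise each source to its own time zone and convert everything to UTC before merging. The frame and column names are illustrative only.

```python
# Normalise timestamps from two time zones to UTC before combining them.
import pandas as pd

hyd = pd.DataFrame({"event_time": ["2023-05-01 10:00"], "value": [1]})
den = pd.DataFrame({"event_time": ["2023-04-30 23:30"], "value": [2]})

hyd["event_time"] = (pd.to_datetime(hyd["event_time"])
                     .dt.tz_localize("Asia/Kolkata")      # Hyderabad
                     .dt.tz_convert("UTC"))
den["event_time"] = (pd.to_datetime(den["event_time"])
                     .dt.tz_localize("America/Chicago")   # Denton
                     .dt.tz_convert("UTC"))

# Only after this normalisation is sorting or merging on event_time meaningful
combined = pd.concat([hyd, den]).sort_values("event_time")
print(combined)
```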
5. Citation: Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean research data. Bulletin - Association of Canadian Map Libraries and Archives (ACMLA), (170). https://doi.org/10.15353/acmla.n170.4873
The authors (Miller & Vielfaure, 2022) think the log is a good feature for the following two reasons –
In my opinion, this log-capturing feature can also aid in troubleshooting: debugging and spotting the error at the exact step where it occurred. A rough imitation of the idea in Python is sketched below.
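OpenRefine keeps its own operation history; the sketch below is only an assumption about how one might replicate that idea in plain Python, not the tool's actual mechanism. It logs each cleaning step so the exact failing step is easy to spot.

```python
# Log every cleaning step (name, row counts, failures) to ease troubleshooting.
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("cleaning")

def apply_step(df, name, func):
    """Run one cleaning step, log its effect, and re-raise with context on failure."""
    before = len(df)
    try:
        df = func(df)
    except Exception:
        log.exception("Step failed: %s", name)
        raise
    log.info("Step %-25s rows %d -> %d", name, before, len(df))
    return df

df = pd.DataFrame({"age": ["25", "n/a", "40"]})
df = apply_step(df, "coerce age to numeric",
                lambda d: d.assign(age=pd.to_numeric(d["age"], errors="coerce")))
df = apply_step(df, "drop rows with null age",
                lambda d: d.dropna(subset=["age"]))
```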
6. Citation: Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean research data. Bulletin - Association of Canadian Map Libraries and Archives (ACMLA), (170). https://doi.org/10.15353/acmla.n170.4873
According to the authors (Miller & Vielfaure, 2022), the most popular transformations for which researchers request support are –
The join transformation surprises me. Performing it by merging different columns only after pre-conditioning them, checking for matching field types or column sizes, is a genuinely good idea, so that the data does not get truncated; a small sketch of that kind of check appears below.
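As a hedged sketch of that pre-conditioning, the example below checks that the join keys share a data type before merging, so values are not silently mismatched or dropped. The frames and the key column name are hypothetical.

```python
# Align join-key dtypes before merging two frames on a hypothetical customer_id.
import pandas as pd

left = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"customer_id": ["1", "2", "4"], "balance": [10, 20, 30]})

# Cast the right key to the left key's dtype instead of relying on coercion
if left["customer_id"].dtype != right["customer_id"].dtype:
    right["customer_id"] = right["customer_id"].astype(left["customer_id"].dtype)

merged = left.merge(right, on="customer_id", how="inner", validate="one_to_one")
print(merged)
```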
Quick Fact: According to the authors (Miller & Vielfaure, 2022), OpenRefine was previously called Google Refine. Google stopped supporting the project in 2012, and OpenRefine is now maintained by a dedicated team of passionate volunteers from around the globe.