Ultimate Guide to Data Cleaning using Python, MS Excel, Open Refine and Rapid Miner

Here I review the most relevant, up-to-date citations on data cleaning and examine the credible, trustworthy techniques one by one. We all know the steps our muscle memory runs through while pre-processing a dataset: removing null values, changing data types, converting categorical variables to numeric ones, reducing correlated variables with Principal Component Analysis, and imputing missing rows with mean, median, or mode values.
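A minimal sketch of those muscle-memory steps with pandas and scikit-learn; the toy DataFrame, its column names, and the choice of two principal components are illustrative assumptions, not a prescription:

import pandas as pd
from sklearn.decomposition import PCA

# Toy data with the usual problems: missing values, mixed types.
df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "income": [50_000, 62_000, 58_000, None],
    "city": ["Hyderabad", "Denton", None, "Denton"],
})

# Drop rows where every field is missing, then impute the rest
# with median, mean, and mode respectively.
df = df.dropna(how="all")
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fix data types and encode the categorical variable numerically.
df["age"] = df["age"].astype(int)
df = pd.get_dummies(df, columns=["city"], dtype=int)

# Reduce the numeric features to two principal components.
components = PCA(n_components=2).fit_transform(df.select_dtypes("number"))
print(components.shape)  # (4, 2)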


But the question is: is there more to it? Let's find out what the experts say.



1. Citation: Dilmegani, C. (2023). Guide to data cleaning in ’23: Steps to Clean Data & Best Tools. AIMultiple. https://research.aimultiple.com/data-cleaning/

According to the above article (Dilmegani, 2023), the five steps to cleaner data are:

  • Develop a data quality plan
  • Correct data at the source
  • Measure data accuracy
  • Manage data and duplicates
  • Append data

I think the third step, measuring data accuracy, is the most important one for any dataset. Without accurate data, no matter how good the algorithm is, the results of any model won't serve the project. Hence, we must know to what level we can trust the data. For example, insurance data used to fetch phone numbers for real estate cold calling may not be fruitful, as many of the numbers in the data may not be valid.
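The phone-number example can be turned into a measurable check. Below is a hedged sketch in pandas; the column name "phone" and the ten-digit pattern are assumptions for illustration, since what counts as a valid number depends on the dataset:

import re
import pandas as pd

df = pd.DataFrame({"phone": ["9405550123", "555-0199", "not available", None]})

pattern = re.compile(r"^\d{10}$")  # exactly ten digits (assumed format)

def is_valid_phone(value) -> bool:
    """Return True only for strings of exactly ten digits."""
    return isinstance(value, str) and bool(pattern.match(value))

df["phone_valid"] = df["phone"].apply(is_valid_phone)
accuracy = df["phone_valid"].mean()  # share of rows passing the check
print(f"{accuracy:.0%} of phone numbers look dialable")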


2. Citation: Dilmegani, C. (2023). Guide to data cleaning in ’23: Steps to Clean Data & Best Tools. AIMultiple. https://research.aimultiple.com/data-cleaning/

According to the same article, the author (Dilmegani, 2023) recommends the six best practices below:

  • Consider your data in the most holistic way possible
  • Increase control over database inputs
  • Highlight and even potentially resolve faulty data before it becomes problematic
  • Limit your sample size
  • Spot check throughout (a small sketch of this follows the list)
  • Leverage free online courses
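Spot checking is easy to automate halfway: pull a small random sample for manual review at each stage of cleaning. A minimal pandas sketch, assuming a hypothetical input file data.csv:

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file
# A fixed seed keeps the sample reproducible between runs.
sample = df.sample(n=min(20, len(df)), random_state=42)
print(sample.to_string())     # manual review of up to 20 random rows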

I did not expect to see the first one, which is to think of the data in a holistic way. As data engineers we often treat our work as merely a duty to derive some conclusions at the end, never knowing how those conclusions will be used further or how they will serve mankind. Einstein, for instance, derived formulae that were later used in lethal weapons that erased generations. We should always keep in mind what an analysis will be used for and how, from the beginning to the end user, and into the future as well.


3. Citation: Petrova-Antonova, D., & Tancheva, R. (2020). Data Cleaning: A case study with OpenRefine and Trifacta Wrangler. Communications in Computer and Information Science, 32–40. https://doi.org/10.1007/978-3-030-58793-2_3

According to the authors (Petrova-Antonova & Tancheva, 2020), “Producing high quality datasets require data problems to be identified and cleaned using different data cleaning techniques.”

Yes, I agree. Unless we thoroughly comprehend the metadata, the fields in the dataset, how they can be utilized, what they convey, how they can contribute, and why they are important, then even having multiple fields will be useless. And to make use of those fields, the problems in them must first be identified.
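Identification can start with a simple profile of every field. A minimal sketch in pandas; the toy DataFrame is illustrative:

import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column so problems can be identified first."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    })

df = pd.DataFrame({"a": [1, None, 1], "b": ["x", "x", None]})
print(profile(df))
print("duplicate rows:", df.duplicated().sum())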



4. Citation: McFarland, A. (2022, April 28). 10 best data cleaning tools. Unite.AI. https://www.unite.ai/10-best-data-cleaning-tools/

According to the author (McFarland, 2022), “There can be many errors in data coming from things like bad data entry, the source of data, mismatch of source and destination, and invalid calculation.”

There are other sources of error the author does not identify, such as time-frame errors: data synced incorrectly across time zones. For instance, 10 AM in Hyderabad, India is roughly 10:30 PM the previous evening in Denton, Texas, so merging co-existing feeds without keeping this in mind can introduce errors and outliers.
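The usual remedy is to normalize every timestamp to UTC before merging. A minimal pandas sketch; the column names, sample timestamps, and zone identifiers are illustrative assumptions:

import pandas as pd

hyd = pd.DataFrame({"ts": ["2023-05-01 10:00"], "reading": [1]})
tex = pd.DataFrame({"ts": ["2023-04-30 23:30"], "reading": [2]})

# Localize each feed to its own zone, then convert both to UTC.
hyd["ts"] = pd.to_datetime(hyd["ts"]).dt.tz_localize("Asia/Kolkata").dt.tz_convert("UTC")
tex["ts"] = pd.to_datetime(tex["ts"]).dt.tz_localize("America/Chicago").dt.tz_convert("UTC")

merged = hyd.merge(tex, on="ts", suffixes=("_hyd", "_tex"))
print(merged)  # the two readings now join on the same UTC instant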



5. Citation: Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean research data. Bulletin - Association of Canadian Map Libraries and Archives (ACMLA), (170). https://doi.org/10.15353/acmla.n170.4873

The authors (Miller & Vielfaure, 2022) think the log is a good feature for the two reasons below:

  • It is a requirement of some journals and granting agencies, supporting a move towards open data science.
  • It can also be used to repeat the same actions on multiple files.

In my opinion, this log-capture feature can also aid in troubleshooting: debugging and spotting an error at the exact step where it occurred.
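OpenRefine can export its operation history as JSON and re-apply it to another project. As a hedged analogue in Python (not OpenRefine's own format), the same idea can be reproduced by recording each cleaning step as data:

import json
import pandas as pd

log = []

def logged(df: pd.DataFrame, step: str, func) -> pd.DataFrame:
    """Apply a cleaning function and record what it did."""
    before = len(df)
    df = func(df)
    log.append({"step": step, "rows_before": before, "rows_after": len(df)})
    return df

df = pd.DataFrame({"x": [1, 1, None]})
df = logged(df, "drop duplicates", lambda d: d.drop_duplicates())
df = logged(df, "drop missing", lambda d: d.dropna())
print(json.dumps(log, indent=2))  # the audit trail, one entry per step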


6. Citation: Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean research data. Bulletin - Association of Canadian Map Libraries and Archives (ACMLA), (170). https://doi.org/10.15353/acmla.n170.4873

According to the authors (Miller & Vielfaure, 2022), the most popular transformations for which researchers request support are:

  • Clusters
  • Join
  • Splits

The join transformation surprises me. Performing it by merging different columns under pre-conditions, such as matching field types or column sizes so that the data won't get truncated, is truly a brilliant idea.
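All three transformations have rough pandas equivalents. In the sketch below, the fingerprint function imitates the spirit of OpenRefine's key-collision clustering but is an illustrative assumption, not its exact algorithm:

import pandas as pd

# Split: one column into several.
df = pd.DataFrame({"name": ["Doe, Jane", "Roe, Richard"]})
df[["last", "first"]] = df["name"].str.split(", ", expand=True)

# Join: merge two tables on a shared key with matching dtypes.
orders = pd.DataFrame({"id": [1, 2], "total": [10.0, 20.0]})
users = pd.DataFrame({"id": [1, 2], "city": ["Hyderabad", "Denton"]})
joined = orders.merge(users, on="id", how="left")

# Cluster: group near-duplicate strings by a normalized fingerprint.
def fingerprint(s: str) -> str:
    """Lowercase, split, sort, and dedupe tokens to form a key."""
    return " ".join(sorted(set(s.lower().split())))

cities = pd.Series(["New York", "york new", "NEW  YORK", "Denton"])
print(cities.groupby(cities.map(fingerprint)).apply(list))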


Quick fact: according to the authors (Miller & Vielfaure, 2022), OpenRefine was previously named Google Refine. Google stopped supporting it in 2012, and OpenRefine is now maintained by a dedicated team of passionate volunteers across the globe.

