Ultimate Guide to Data Cleaning using Python, MS Excel, Open Refine and Rapid Miner

Here I review the most relevant, up-to-date citations on data cleaning and examine the credible, trustworthy techniques one by one. We all know the steps our muscle memory runs through while pre-processing a dataset: removing null values, changing data types, converting categorical variables to numeric ones, reducing correlated variables with Principal Component Analysis, and imputing missing rows with mean, median, or mode values.
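A minimal sketch of those muscle-memory steps with pandas and scikit-learn; the toy DataFrame, its column names, and the choice of two principal components are illustrative assumptions, not a prescription:

import pandas as pd
from sklearn.decomposition import PCA

# Toy data with the usual problems: missing values, mixed types.
df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "income": [50_000, 62_000, 58_000, None],
    "city": ["Hyderabad", "Denton", None, "Denton"],
})

# Drop rows where every field is missing, then impute the rest
# with median, mean, and mode respectively.
df = df.dropna(how="all")
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fix data types and encode the categorical variable numerically.
df["age"] = df["age"].astype(int)
df = pd.get_dummies(df, columns=["city"], dtype=int)

# Reduce the numeric features to two principal components.
components = PCA(n_components=2).fit_transform(df.select_dtypes("number"))
print(components.shape)  # (4, 2)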


But the question is: is there more to it? Let's find out what the experts say.



1. Citation: Dilmegani, C. (2023). Guide to data cleaning in ’23: Steps to Clean Data & Best Tools. AIMultiple. https://research.aimultiple.com/data-cleaning/

According to the above article (Dilmegani, 2023), the five steps to cleaner data are:

  • Develop a data quality plan
  • Correct data at the source
  • Measure data accuracy
  • Manage data and duplicates
  • Append data

I think the third step, measuring data accuracy, is the most important one for any dataset. Without accurate data, no matter how good the algorithm is, the results of any model won't serve the project. Hence, we must know to what level we can trust the data. For example, insurance data used to fetch phone numbers for real estate cold calling may not be fruitful, as many of the numbers in the data may not be valid.
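The phone-number example can be turned into a measurable check. Below is a hedged sketch in pandas; the column name "phone" and the ten-digit pattern are assumptions for illustration, since what counts as a valid number depends on the dataset:

import re
import pandas as pd

df = pd.DataFrame({"phone": ["9405550123", "555-0199", "not available", None]})

pattern = re.compile(r"^\d{10}$")  # exactly ten digits (assumed format)

def is_valid_phone(value) -> bool:
    """Return True only for strings of exactly ten digits."""
    return isinstance(value, str) and bool(pattern.match(value))

df["phone_valid"] = df["phone"].apply(is_valid_phone)
accuracy = df["phone_valid"].mean()  # share of rows passing the check
print(f"{accuracy:.0%} of phone numbers look dialable")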


2. Citation: Dilmegani, C. (2023). Guide to data cleaning in ’23: Steps to Clean Data & Best Tools. AIMultiple. https://research.aimultiple.com/data-cleaning/

According to the same article, the author (Dilmegani, 2023) recommends the six best practices below:

  • Consider your data in the most holistic way possible
  • Increase control over database inputs
  • Highlight and even potentially resolve faulty data before it becomes problematic
  • Limit your sample size
  • Spot check throughout (a small sketch of this follows the list)
  • Leverage free online courses
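Spot checking is easy to automate halfway: pull a small random sample for manual review at each stage of cleaning. A minimal pandas sketch, assuming a hypothetical input file data.csv:

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file
# A fixed seed keeps the sample reproducible between runs.
sample = df.sample(n=min(20, len(df)), random_state=42)
print(sample.to_string())     # manual review of up to 20 random rows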

I did not expect to see the first one, which is to think of the data in a holistic way. As data engineers we often treat our work as merely a duty to derive some conclusions at the end, never knowing how those conclusions will be used further or how they will serve mankind. Einstein, for instance, derived formulae that were later used in lethal weapons that erased generations. We should always keep in mind what an analysis will be used for and how, from the beginning to the end user, and into the future as well.


3. Citation: Petrova-Antonova, D., & Tancheva, R. (2020). Data Cleaning: A case study with OpenRefine and Trifacta Wrangler. Communications in Computer and Information Science, 32–40. https://doi.org/10.1007/978-3-030-58793-2_3

According to the authors (Petrova-Antonova & Tancheva, 2020), “Producing high quality datasets require data problems to be identified and cleaned using different data cleaning techniques.”

Yes, I agree. Unless we thoroughly comprehend the metadata, the fields in the dataset, how they can be utilized, what they convey, how they can contribute, and why they are important, then even having multiple fields will be useless. And to make use of those fields, the problems in them must first be identified.
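Identification can start with a simple profile of every field. A minimal sketch in pandas; the toy DataFrame is illustrative:

import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column so problems can be identified first."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    })

df = pd.DataFrame({"a": [1, None, 1], "b": ["x", "x", None]})
print(profile(df))
print("duplicate rows:", df.duplicated().sum())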



4. Citation: McFarland, A. (2022, April 28). 10 best data cleaning tools. Unite.AI. https://www.unite.ai/10-best-data-cleaning-tools/

According to the author (McFarland, 2022), “There can be many errors in data coming from things like bad data entry, the source of data, mismatch of source and destination, and invalid calculation.”

There are other sources of error the author does not identify, such as time-frame errors: data synced incorrectly across time zones. For instance, 10 AM in Hyderabad, India is roughly 10:30 PM the previous evening in Denton, Texas, so merging co-existing feeds without keeping this in mind can introduce errors and outliers.
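The usual remedy is to normalize every timestamp to UTC before merging. A minimal pandas sketch; the column names, sample timestamps, and zone identifiers are illustrative assumptions:

import pandas as pd

hyd = pd.DataFrame({"ts": ["2023-05-01 10:00"], "reading": [1]})
tex = pd.DataFrame({"ts": ["2023-04-30 23:30"], "reading": [2]})

# Localize each feed to its own zone, then convert both to UTC.
hyd["ts"] = pd.to_datetime(hyd["ts"]).dt.tz_localize("Asia/Kolkata").dt.tz_convert("UTC")
tex["ts"] = pd.to_datetime(tex["ts"]).dt.tz_localize("America/Chicago").dt.tz_convert("UTC")

merged = hyd.merge(tex, on="ts", suffixes=("_hyd", "_tex"))
print(merged)  # the two readings now join on the same UTC instant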



5. Citation: Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean research data. Bulletin - Association of Canadian Map Libraries and Archives (ACMLA), (170). https://doi.org/10.15353/acmla.n170.4873

The authors (Miller & Vielfaure, 2022) think the log is a good feature for the two reasons below:

  • It is a requirement of some journals and granting agencies, supporting a move towards open data science.
  • It can also be used to repeat the same actions on multiple files.

In my opinion, this log-capture feature can also aid in troubleshooting: debugging and spotting an error at the exact step where it occurred.
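OpenRefine can export its operation history as JSON and re-apply it to another project. As a hedged analogue in Python (not OpenRefine's own format), the same idea can be reproduced by recording each cleaning step as data:

import json
import pandas as pd

log = []

def logged(df: pd.DataFrame, step: str, func) -> pd.DataFrame:
    """Apply a cleaning function and record what it did."""
    before = len(df)
    df = func(df)
    log.append({"step": step, "rows_before": before, "rows_after": len(df)})
    return df

df = pd.DataFrame({"x": [1, 1, None]})
df = logged(df, "drop duplicates", lambda d: d.drop_duplicates())
df = logged(df, "drop missing", lambda d: d.dropna())
print(json.dumps(log, indent=2))  # the audit trail, one entry per step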


6. Citation: Miller, M., & Vielfaure, N. (2022). OpenRefine: An approachable open tool to clean research data. Bulletin - Association of Canadian Map Libraries and Archives (ACMLA), (170). https://doi.org/10.15353/acmla.n170.4873

According to the authors (Miller & Vielfaure, 2022), the most popular transformations for which researchers request support are:

  • Clusters
  • Join
  • Splits

The join transformation surprises me. Performing it by merging different columns under pre-conditions, such as matching field types or column sizes so that the data won't get truncated, is truly a brilliant idea.
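All three transformations have rough pandas equivalents. In the sketch below, the fingerprint function imitates the spirit of OpenRefine's key-collision clustering but is an illustrative assumption, not its exact algorithm:

import pandas as pd

# Split: one column into several.
df = pd.DataFrame({"name": ["Doe, Jane", "Roe, Richard"]})
df[["last", "first"]] = df["name"].str.split(", ", expand=True)

# Join: merge two tables on a shared key with matching dtypes.
orders = pd.DataFrame({"id": [1, 2], "total": [10.0, 20.0]})
users = pd.DataFrame({"id": [1, 2], "city": ["Hyderabad", "Denton"]})
joined = orders.merge(users, on="id", how="left")

# Cluster: group near-duplicate strings by a normalized fingerprint.
def fingerprint(s: str) -> str:
    """Lowercase, split, sort, and dedupe tokens to form a key."""
    return " ".join(sorted(set(s.lower().split())))

cities = pd.Series(["New York", "york new", "NEW  YORK", "Denton"])
print(cities.groupby(cities.map(fingerprint)).apply(list))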


Quick fact: according to the authors (Miller & Vielfaure, 2022), OpenRefine was previously named Google Refine. Google stopped supporting it in 2012, and OpenRefine is now maintained by a dedicated team of passionate volunteers across the globe.

