Data Pre-Processing & Its Role in Harnessing Intelligence
Image Source Credits: AVIXA Xchange


In today's era of exponential growth & innovation, with rapid advancements in Generative AI tools & analytics, we're witnessing an unprecedented boom in every sphere of life, from learning mandala art to becoming an efficient developer: you just type a prompt & there you go!

Hence, in order to keep pace with these rapidly growing advancements, one must be well versed in the domain knowledge so as to carve out a distinct advantage for oneself.

When we talk about AI or analytics, a solution should have 2 crucial elements:

1. It should be capable of delivering essential intelligence.

2. It should have the least possible time to market, & this depends upon the efficacy of the data management & data enrichment process: precisely the building block, the very heart of it all, termed Data Pre-processing.

So let’s try to understand it in brief.

Assume you're assisting your mom with the laundry today. What will you need to ensure here?

A.) Segregate the dirty clothes from the clean ones


B.) Scrub the dirt



C.) Rinse & Repeat


A somewhat similar process applies while dealing with data, & here is how:

(I.) Observe your data: Pay attention to detailed behaviour & look for the gaps: is there anything missing (empty data rows/files) or impossible?

Say a categorical variable is somehow holding a quantitative result. Is the data consistent throughout, or does it display inconsistency in terms of, say:

  • field header name changes or missing headers in some files, or
  • uppercase/lowercase formatting mismatches within files, etc.

Point to Note: If these issues are identified correctly, half of the battle is won.
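
To make this concrete, here is a minimal pandas sketch of such an inspection. The file name sales_extract.csv & the "region" column are hypothetical, made up purely for illustration:

```python
import pandas as pd

# Hypothetical sales extract; the file name & the "region" column
# are assumptions made up for this illustration.
df = pd.read_csv("sales_extract.csv")

# 1. Missing data: count empty cells per column & fully empty rows.
print(df.isna().sum())
print("Fully empty rows:", df.isna().all(axis=1).sum())

# 2. Impossible values: a categorical field should not hold numbers.
numeric_leaks = pd.to_numeric(df["region"], errors="coerce").notna()
print("Numeric values in 'region':", numeric_leaks.sum())

# 3. Casing inconsistency: the same label in mixed cases hints at
#    formatting drift across source files.
print(df["region"].nunique(), "raw labels vs",
      df["region"].str.lower().nunique(), "labels after lowercasing")
```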


(II.) Data Management: Depending upon the type of problem you have in hand, you'll need to make the best use of the tools in your armoury.

As the saying goes, a sewing pin & a sword can never take each other's place; here your domain expertise & problem-solving skills come into play. So let's see how this can be done:

a.) Usage of "NA" or "-" in some fields: Rule of thumb: is it adding value by any chance? Is it representing something material to the data in hand? Is some other data field dependent upon these sample values? If the answer is no, it is best to filter these values out.

Else, if yes, see how these values can best be replaced by other encoding characters that make more sense for the data you're dealing with.

Point to Note:

  • You can choose to keep the field as it is, if it adds value the current way, depending upon the dataset's inherent nature.
  • The other alternative you can think of is imputation, i.e. see if you can best replace the missing data with statistical parameters, say the mode (the most commonly occurring sample value), the median or the mean.

Note tip: This is a great step when dealing with time series data, as missing data here can lead to distorted conclusions at times.
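
Here is a minimal pandas sketch of both routes. The columns, placeholder strings & imputation choices (mode for a categorical field, median for a numeric one, interpolation for a time series) are illustrative assumptions, not a one-size-fits-all recipe:

```python
import pandas as pd
import numpy as np

# Toy frame; the "category" & "price" columns are made up.
df = pd.DataFrame({
    "category": ["A", "NA", "B", "-", "A"],
    "price": [10.0, np.nan, 12.0, np.nan, 11.0],
})

# Treat placeholder strings as true missing values first.
df["category"] = df["category"].replace({"NA": np.nan, "-": np.nan})

# Option 1: filter out rows where the placeholders add no value.
dropped = df.dropna(subset=["category"])

# Option 2: impute with a statistical parameter.
df["category"] = df["category"].fillna(df["category"].mode()[0])  # mode
df["price"] = df["price"].fillna(df["price"].median())            # median

# For time series, interpolation often beats a global mean/median,
# since it respects the ordering of the observations.
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(ts.interpolate())  # fills the gaps as 2.0 & 4.0
```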

b.) Outliers: extremely high or extremely low data points/values that fall outside the general occurrence criteria.

Say, a car racing at a maximum speed of 240 km/hr, or a reading of 100°C at an Antarctic sensor.

These can be handled via 3 approaches:

(i.) See if such observations are adding value: for e.g., an ecommerce store could have customers in a premium segment who bought 3 times the average order value. If yes, include them; else proceed with:

(ii.) Removing them from the analysis by trimming the upper & lower x percentiles, or using statistical means such as a weighted average, so as to ensure the major chunk of the data faces no negative impact or consequence from the outliers.

(iii.) Segmenting the data into different groups so that essential elements & observations don't go unnoticed.
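
A short pandas sketch of all three approaches. The order values are synthetic, & the 5th/95th-percentile cut-offs & the 100-unit premium threshold are illustrative assumptions:

```python
import pandas as pd
import numpy as np

# Synthetic order values with one premium-looking extreme.
orders = pd.Series([40, 45, 50, 55, 60, 300])

# (i) Inspect before deleting: the 300 may be a genuine premium buyer.
print(orders.describe())

# (ii) Trimming: drop the upper & lower x percentiles...
lo, hi = orders.quantile([0.05, 0.95])
trimmed = orders[(orders >= lo) & (orders <= hi)]

# ...or clip (winsorize) extremes to those bounds instead, which keeps
# the sample size intact while capping the outliers' influence.
clipped = orders.clip(lower=lo, upper=hi)

# (iii) Segment instead of discard, so premium behaviour stays visible.
segments = pd.cut(orders, bins=[0, 100, np.inf],
                  labels=["regular", "premium"])
print(pd.DataFrame({"value": orders, "segment": segments}))
```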

c.) Inconsistent, Duplicated & Invalid Data: say inconsistent data headers, separators or names, & duplicated data records.


How to fix this:

1. An efficient data collection & linking process: checking the data ETL pipeline & re-engineering pipelines for high efficacy

2. Robust quality checks for better error elimination

3. Standardizing the data: correct encoding, type casting & correct casing

4. Deduplicating, filtering & enriching the data into the correct desired format
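
As a rough illustration of steps 3 & 4, here is a small pandas sketch; the column names & values are made up:

```python
import pandas as pd

# Made-up raw frame with the usual suspects: a padded header, mixed
# casing, stray whitespace, duplicated rows & a numeric column read
# as text.
df = pd.DataFrame({
    "City ": [" Delhi", "delhi", "Mumbai", "Mumbai"],
    "amount": ["100", "100", "250", "250"],
})

# Step 3: standardize - trim headers & values, normalise casing,
# cast types.
df.columns = df.columns.str.strip().str.lower()
df["city"] = df["city"].str.strip().str.title()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Step 4: deduplicate, keeping the first occurrence of each record.
df = df.drop_duplicates()
print(df)  # two clean rows remain: Delhi/100 & Mumbai/250
```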

Once all this is done, simply repeat the process, so that, just like your washing machine, your pipeline remains committed to proactive identification of dirt/garbage & performs "garbage out" at the right time: the clothes come out clean in all respects, with no stains & no undue load on the washing machine at a holistic level.


Time to Market basically means reducing the time from problem identification to problem solving, with an edge over competitors.

This can be achieved by "Automation", & here's how you can aim for it:

(1.) Problem discovery automation: use visualization tools to observe data behaviour, patterns & distributions, so as to handle missing or inconsistent data in a timely manner (see the sketch just below).
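
For instance, a quick pandas/matplotlib sketch like this one (on synthetic data with gaps injected into column "b") surfaces missingness & distribution shape at a glance:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic frame with roughly 10% of column "b" blanked out.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.loc[df.sample(frac=0.1, random_state=1).index, "b"] = np.nan

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Missingness per column: a bar chart exposes gaps immediately.
df.isna().sum().plot.bar(ax=axes[0], title="Missing values per column")

# Distribution check: skew or odd spikes often flag bad encodings or
# unit mismatches before any modelling starts.
df["a"].plot.hist(ax=axes[1], bins=20, title="Distribution of 'a'")

plt.tight_layout()
plt.show()
```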

(2.) Efficient transformation: via macros or Python pre-processing scripts, e.g. to handle incorrect encoding, remove white spaces & leading/trailing irrelevant lines, & correct the data formatting.
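
A minimal Python sketch along these lines; the file name export.txt & the rule "first two lines are a banner, last line is a footer" are assumptions for illustration:

```python
from pathlib import Path

# Read the raw bytes of a hypothetical text export.
raw = Path("export.txt").read_bytes()

# Fix encoding: decode permissively, replacing undecodable bytes.
text = raw.decode("utf-8", errors="replace")

lines = text.splitlines()
lines = lines[2:-1]                    # drop banner & footer lines
lines = [ln.strip() for ln in lines]   # trim leading/trailing spaces
lines = [ln for ln in lines if ln]     # drop now-empty lines

# Write the cleaned, consistently encoded result back out.
Path("export_clean.txt").write_text("\n".join(lines), encoding="utf-8")
```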


So what are you waiting for? Are you ready to deep dive into the world of data & create magic?

The trick is simple: just as your actions depend upon the choices you make, so it is with data: it'll create wonders if you're willing to go the extra mile, step by step & by sticking to the fundamentals!

