Data Pre-Processing & Its Role in Harnessing Intelligence
Image Source Credits: AVIXA Xchange


In today's era of exponential growth & innovation, with rapid advancements in Generative AI tools & analytics, we're witnessing an unprecedented boom in every sphere of life, from learning mandala art to becoming an efficient developer: you just type a prompt & there you go!

Hence, in order to keep pace with these rapidly growing advancements, one must be well versed in the domain knowledge so as to carve out a distinct advantage for oneself.

When we talk about AI or analytics, a solution should have 2 crucial elements:

1. It should be capable of delivering essential intelligence.

2. It should have the least possible time to market, & this depends upon the efficacy of the data management & data enrichment process: precisely the building block, the very heart of it all, termed Data Pre-processing.

So let’s try to understand it in brief.

Assume you're assisting your mom with the laundry today. What will you need to ensure here?

A.) Segregate the dirty clothes from the clean ones


B.) Scrub the dirt



C.) Rinse & Repeat


A somewhat similar process applies while dealing with data, & here is how:

(I.) Observe your data: Pay attention to detailed behaviour & look for the gaps: is there anything missing (empty data rows/files) or impossible?

Say a categorical variable is somehow holding a quantitative result. Is the data consistent throughout, or does it display inconsistency in terms of, say:

  • field header name changes or missing headers in some files, or
  • uppercase/lowercase formatting mismatches within files, etc.

Point to Note: If these issues are identified correctly, half of the battle is won.
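
To make this concrete, here is a minimal pandas sketch of such an inspection. The file name sales_extract.csv & the "region" column are hypothetical, made up purely for illustration:

```python
import pandas as pd

# Hypothetical sales extract; the file name & the "region" column
# are assumptions made up for this illustration.
df = pd.read_csv("sales_extract.csv")

# 1. Missing data: count empty cells per column & fully empty rows.
print(df.isna().sum())
print("Fully empty rows:", df.isna().all(axis=1).sum())

# 2. Impossible values: a categorical field should not hold numbers.
numeric_leaks = pd.to_numeric(df["region"], errors="coerce").notna()
print("Numeric values in 'region':", numeric_leaks.sum())

# 3. Casing inconsistency: the same label in mixed cases hints at
#    formatting drift across source files.
print(df["region"].nunique(), "raw labels vs",
      df["region"].str.lower().nunique(), "labels after lowercasing")
```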


(II.) Data Management: Depending upon the type of problem you have in hand, you'll need to make the best use of the tools in your armoury.

As the saying goes, a sewing pin & a sword can never take each other's place; here your domain expertise & problem-solving skills come into play. So let's see how this can be done:

a.) Usage of "NA" or "-" in some fields: Rule of thumb: is it adding value by any chance? Is it representing something material to the data in hand? Is some other data field dependent upon these sample values? If the answer is no, it is best to filter these values out.

Else, if yes, see how these values can best be replaced by other encoding characters that make more sense for the data you're dealing with.

Point to Note:

  • You can choose to keep the field as it is, if it adds value the current way, depending upon the dataset's inherent nature.
  • The other alternative you can think of is imputation, i.e. see if you can best replace the missing data with statistical parameters, say the mode (the most commonly occurring sample value), the median or the mean.

Note tip: This is a great step when dealing with time series data, as missing data here can lead to distorted conclusions at times.
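
Here is a minimal pandas sketch of both routes. The columns, placeholder strings & imputation choices (mode for a categorical field, median for a numeric one, interpolation for a time series) are illustrative assumptions, not a one-size-fits-all recipe:

```python
import pandas as pd
import numpy as np

# Toy frame; the "category" & "price" columns are made up.
df = pd.DataFrame({
    "category": ["A", "NA", "B", "-", "A"],
    "price": [10.0, np.nan, 12.0, np.nan, 11.0],
})

# Treat placeholder strings as true missing values first.
df["category"] = df["category"].replace({"NA": np.nan, "-": np.nan})

# Option 1: filter out rows where the placeholders add no value.
dropped = df.dropna(subset=["category"])

# Option 2: impute with a statistical parameter.
df["category"] = df["category"].fillna(df["category"].mode()[0])  # mode
df["price"] = df["price"].fillna(df["price"].median())            # median

# For time series, interpolation often beats a global mean/median,
# since it respects the ordering of the observations.
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(ts.interpolate())  # fills the gaps as 2.0 & 4.0
```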

b.) Outliers: extremely high or extremely low data points/values that fall outside the general occurrence criteria.

Say, a car racing at a maximum speed of 240 km/hr, or a reading of 100°C at an Antarctic sensor.

These can be handled via 3 approaches:

(i.) See if such observations are adding value: for e.g., an ecommerce store could have customers in a premium segment who bought 3 times the average order value. If yes, include them; else proceed with:

(ii.) Removing them from the analysis by trimming the upper & lower x percentiles, or using statistical means such as a weighted average, so as to ensure the major chunk of the data faces no negative impact or consequence from the outliers.

(iii.) Segmenting the data into different groups so that essential elements & observations don't go unnoticed.
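
A short pandas sketch of all three approaches. The order values are synthetic, & the 5th/95th-percentile cut-offs & the 100-unit premium threshold are illustrative assumptions:

```python
import pandas as pd
import numpy as np

# Synthetic order values with one premium-looking extreme.
orders = pd.Series([40, 45, 50, 55, 60, 300])

# (i) Inspect before deleting: the 300 may be a genuine premium buyer.
print(orders.describe())

# (ii) Trimming: drop the upper & lower x percentiles...
lo, hi = orders.quantile([0.05, 0.95])
trimmed = orders[(orders >= lo) & (orders <= hi)]

# ...or clip (winsorize) extremes to those bounds instead, which keeps
# the sample size intact while capping the outliers' influence.
clipped = orders.clip(lower=lo, upper=hi)

# (iii) Segment instead of discard, so premium behaviour stays visible.
segments = pd.cut(orders, bins=[0, 100, np.inf],
                  labels=["regular", "premium"])
print(pd.DataFrame({"value": orders, "segment": segments}))
```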

c.) Inconsistent, Duplicated & Invalid Data: say inconsistent data headers, separators or names, & duplicated data records.


How to fix this:

1. An efficient data collection & linking process: checking the data ETL pipeline & re-engineering pipelines for high efficacy

2. Robust quality checks for better error elimination

3. Standardizing the data: correct encoding, type casting & correct casing

4. Deduplicating, filtering & enriching the data into the correct desired format
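
As a rough illustration of steps 3 & 4, here is a small pandas sketch; the column names & values are made up:

```python
import pandas as pd

# Made-up raw frame with the usual suspects: a padded header, mixed
# casing, stray whitespace, duplicated rows & a numeric column read
# as text.
df = pd.DataFrame({
    "City ": [" Delhi", "delhi", "Mumbai", "Mumbai"],
    "amount": ["100", "100", "250", "250"],
})

# Step 3: standardize - trim headers & values, normalise casing,
# cast types.
df.columns = df.columns.str.strip().str.lower()
df["city"] = df["city"].str.strip().str.title()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Step 4: deduplicate, keeping the first occurrence of each record.
df = df.drop_duplicates()
print(df)  # two clean rows remain: Delhi/100 & Mumbai/250
```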

Once all this is done, simply repeat the process, so that, just like your washing machine, your pipeline remains committed to proactive identification of dirt/garbage & performs "garbage out" at the right time: the clothes come out clean in all respects, with no stains & no undue load on the washing machine at a holistic level.


Time to Market basically means reducing the time from problem identification to problem solving, with an edge over competitors.

This can be achieved by "Automation", & here's how you can aim for it:

(1.) Problem discovery automation: use visualization tools to observe data behaviour, patterns & distributions, so as to handle missing or inconsistent data in a timely manner (see the sketch just below).
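
For instance, a quick pandas/matplotlib sketch like this one (on synthetic data with gaps injected into column "b") surfaces missingness & distribution shape at a glance:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic frame with roughly 10% of column "b" blanked out.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.loc[df.sample(frac=0.1, random_state=1).index, "b"] = np.nan

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Missingness per column: a bar chart exposes gaps immediately.
df.isna().sum().plot.bar(ax=axes[0], title="Missing values per column")

# Distribution check: skew or odd spikes often flag bad encodings or
# unit mismatches before any modelling starts.
df["a"].plot.hist(ax=axes[1], bins=20, title="Distribution of 'a'")

plt.tight_layout()
plt.show()
```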

(2.) Efficient transformation: via macros or Python pre-processing scripts, e.g. to handle incorrect encoding, remove white spaces & leading/trailing irrelevant lines, & correct the data formatting.
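
A minimal Python sketch along these lines; the file name export.txt & the rule "first two lines are a banner, last line is a footer" are assumptions for illustration:

```python
from pathlib import Path

# Read the raw bytes of a hypothetical text export.
raw = Path("export.txt").read_bytes()

# Fix encoding: decode permissively, replacing undecodable bytes.
text = raw.decode("utf-8", errors="replace")

lines = text.splitlines()
lines = lines[2:-1]                    # drop banner & footer lines
lines = [ln.strip() for ln in lines]   # trim leading/trailing spaces
lines = [ln for ln in lines if ln]     # drop now-empty lines

# Write the cleaned, consistently encoded result back out.
Path("export_clean.txt").write_text("\n".join(lines), encoding="utf-8")
```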


So what are you waiting for? Are you ready to deep dive into the world of data & create magic?

The trick is simple: just as your actions depend upon the choices you make, so it is with data: it'll create wonders if you're willing to go the extra mile, step by step & by sticking to the fundamentals!

