Exploratory Data Analysis (EDA)
Penmetsa Shiva Padmaja
SDE @Bank of America | Ex-SDE Intern @Amazon | Grad @VIT Bhimavaram'23
What is EDA?
EDA is one of the crucial steps in data science. It lets us draw insights and compute statistical measures that are essential for business continuity, stakeholders, and data scientists. Typical questions to answer are: Does the data have integrity, and do the values make sense? Have people reported data on different scales? Are there missing values? Do some columns have outliers? Does the data have multiple modes? What is the distribution of values, and how do features correlate with one another?
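A few of these first-pass checks can be sketched in plain Python. This is a minimal illustration rather than the analysis from the article: the records and the column names `lead_time` and `adr` are hypothetical, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"lead_time": 10, "adr": 80.0},
    {"lead_time": 25, "adr": 95.5},
    {"lead_time": None, "adr": 102.0},
    {"lead_time": 40, "adr": None},
    {"lead_time": 300, "adr": 110.0},
]

def missing_counts(records):
    """Count missing (None) values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def summarize(records, column):
    """Basic descriptive statistics for one numeric column, ignoring missing values."""
    values = [r[column] for r in records if r[column] is not None]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

print(missing_counts(rows))
print(summarize(rows, "lead_time"))
```

A mean far above the median, as with `lead_time` here, hints at a skewed distribution or an outlier worth a closer look.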
Why is EDA so important?
EDA helps define and refine the selection of the important feature variables that will be used in the model. Exploratory data analysis is a process in which one learns about the data, forms insights, and identifies important columns (features) that can be used to tell a story or to later formulate an ML problem.
Procedure for performing EDA:
EDA involves four steps:
- Data Collection
- Data Cleaning
- Data Preprocessing
- Data Visualization
1. Data Collection
Data collection is the process of gathering information in an established, systematic way that enables one to test hypotheses and evaluate outcomes easily.
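In practice, collected data often arrives as a CSV file. The sketch below shows one way to load such data with Python's standard `csv` module. The CSV text, its column names, and the `load_rows` helper are hypothetical stand-ins for a real file, and treating empty fields as missing is an assumption.

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded file.
SAMPLE_CSV = """hotel,lead_time,is_canceled
City Hotel,12,0
Resort Hotel,,1
City Hotel,45,0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, converting empty fields to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]

rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

All values are still strings at this point; converting them to numbers belongs to the later cleaning and preprocessing steps.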
2. Data Cleaning
Data cleaning is the process of ensuring that your data is correct and usable by identifying errors or missing values and then correcting or deleting them.
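As a rough sketch of what cleaning can look like, the function below drops exact duplicate rows and fills missing numeric values with the column median. Both choices are assumptions made for illustration (deleting the affected rows is an equally valid option), and the records and the `clean` helper are hypothetical.

```python
import statistics

def clean(records, numeric_column):
    """Drop exact duplicate rows, then fill missing numeric values with the median."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))  # copy so the input is not modified
    observed = [r[numeric_column] for r in unique if r[numeric_column] is not None]
    fill = statistics.median(observed)
    for r in unique:
        if r[numeric_column] is None:
            r[numeric_column] = fill
    return unique

# Hypothetical records with one duplicate row and one missing value.
raw = [
    {"hotel": "City Hotel", "children": 0},
    {"hotel": "City Hotel", "children": 0},
    {"hotel": "Resort Hotel", "children": None},
    {"hotel": "Resort Hotel", "children": 2},
    {"hotel": "City Hotel", "children": 1},
]

cleaned = clean(raw, "children")
print(cleaned)
```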
3. Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. It includes normalization and standardization, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training dataset.
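Normalization and standardization can be illustrated with two short functions. This sketch assumes that "normalization" means min-max scaling to the range [0, 1] and that "standardization" means z-scores computed with the population standard deviation. The sample values are hypothetical.

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: all z-scores are zero
    return [(v - mean) / std for v in values]

# Hypothetical numeric column.
prices = [10.0, 20.0, 30.0, 40.0]

normalized = min_max_normalize(prices)
standardized = standardize(prices)
print(normalized)
print(standardized)
```

Min-max scaling is sensitive to extreme values because a single outlier sets the range, so standardization is often the safer choice when outliers are present.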
4. Data Visualization
Data visualization is the graphical representation of information and data. It uses statistical graphics, plots, information graphics and other tools to communicate information clearly and efficiently.
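Real projects usually rely on a plotting library for this step, but the idea behind a histogram can be shown with a text-only sketch. The `text_histogram` helper and the `lead_times` values are hypothetical, and the function assumes non-negative integer values.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return the lines of a text histogram with fixed-width bins.

    Empty bins are skipped, so gaps in the data do not produce blank rows.
    """
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(bins):
        label = f"{start:>4}-{start + bin_width - 1:<4}"
        lines.append(f"{label} {'#' * bins[start]}")
    return lines

# Hypothetical values for one numeric column.
lead_times = [3, 7, 12, 15, 18, 22, 25, 41, 44, 95]

for line in text_histogram(lead_times, 10):
    print(line)
```

Each `#` stands for one observation, so the longest row marks the most common range of values.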
For this article I took a "Hotel Bookings" dataset and performed EDA on it. The code is available on GitHub: pshivapadmaja1/EDA (github.com).