Data Analysis Process

According to Wikipedia, data analysis is the process of inspecting, cleansing, transforming, and modeling data to discover useful information, inform conclusions, and support decision-making.

The data analysis process consists of five parts-

1. Asking Questions

2. Data Wrangling/Munging/Data preprocessing

3. EDA

4. Drawing a conclusion

5. Communicating results

1. Asking Questions >>> A domain expert knows the whole scenario and is often the one who can ask the better questions. The more experience you have, the more informative your questions will be. Some sample questions are-

   a. What features will contribute to my analysis?

   b. What features are not important for my analysis?

   c. Which of the features have a strong correlation?

   d. Do I need data preprocessing?

   e. What kind of feature manipulation/engineering is required?

2. Data Wrangling/Munging/Data Preprocessing >>> Transforming raw data into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes, such as analysis. Remember, about 60-70% of the effort is invested in this stage.

   a. Gathering data: There are different sources from which to gather data for an analysis. A data engineer or a third party may provide the data, and sometimes a data scientist also has to put in the effort to gather the necessary data. The sources are-

      (i) APIs

      (ii) Web scraping

      (iii) Databases

      (iv) CSV, Excel, JSON, PDF files (both import & export)
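
In pandas, each of these sources has a corresponding reader. A minimal sketch, where the CSV content and the file names in the comments are hypothetical:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as "sales.csv"
csv_text = "order_id,amount\n1,250\n2,125\n"
df = pd.read_csv(io.StringIO(csv_text))  # pd.read_csv("sales.csv") for a file on disk

# The same reader family covers the other sources:
# pd.read_excel("sales.xlsx"), pd.read_json("sales.json"),
# pd.read_sql(query, connection), and for an API response,
# pd.json_normalize(requests.get(url).json())
print(df.shape)
```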

   b. Assessing data: After getting the data, the first few steps are to find out its basic characteristics and take the necessary actions accordingly. A few basic steps are-

      (i) Finding the number of rows/columns (shape)

      (ii) Data types of the columns (info)

      (iii) Checking for missing values (info)

      (iv) Checking for duplicate values (is unique)

      (v) Memory occupied by the dataset

      (vi) High-level mathematical overview (describe)
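
With pandas, these six checks map almost one-to-one to built-in calls. A sketch on a toy DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy dataset standing in for freshly loaded raw data
df = pd.DataFrame({
    "age":  [25, 32, None, 25],
    "city": ["Dhaka", "Delhi", "Dhaka", "Dhaka"],
})

print(df.shape)                          # (i) number of rows/columns
print(df.dtypes)                         # (ii) data types of the columns
print(df.isnull().sum())                 # (iii) missing values per column
print(df.duplicated().sum())             # (iv) duplicate rows
print(df.memory_usage(deep=True).sum())  # (v) memory occupied, in bytes
print(df.describe())                     # (vi) high-level mathematical overview
```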

   c. Cleaning data: Almost all of the time we need to clean up the data to fit our requirements, as the data we get is in a raw format and inappropriate to analyze.

Some treatments are as below-

      (i) Handling missing data and replacing it with a proper treatment

      (ii) Removing/dropping duplicates

      (iii) Fixing incorrect data types
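
A minimal pandas sketch of the three treatments, on invented data (median imputation is just one possible treatment for missing values):

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["10", "20", None, "20"],   # stored as text, with a gap
    "item":  ["pen", "book", "pen", "book"],
})

df["price"] = pd.to_numeric(df["price"])                # (iii) fix the incorrect data type
df["price"] = df["price"].fillna(df["price"].median())  # (i) replace missing data (median imputation)
df = df.drop_duplicates()                               # (ii) remove exact duplicate rows
```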

3. EDA (Exploratory Data Analysis) >>> EDA is the next part, where we look at various patterns, graphs, info, etc. before drawing a conclusion and making any decisions. It has two parts-

   a. Exploring data: Below are some examples-

      (i) Finding correlation and covariance

      (ii) Performing univariate, bivariate, and multivariate analysis

      (iii) Plotting graphs (data visualization)
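
For instance, correlation and covariance come straight from the DataFrame; the toy dataset below is invented and deliberately made perfectly linear:

```python
import pandas as pd

df = pd.DataFrame({"hours": [1, 2, 3, 4], "score": [52, 61, 70, 79]})

print(df.corr())   # correlation matrix; hours vs. score is exactly 1.0 here
print(df.cov())    # covariance matrix
# A quick visual check: df.plot.scatter(x="hours", y="score")
```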

   b. Augmenting data: This step is also called feature engineering; here we generate new features if required.

      (i) Removing outliers using box plots

      (ii) Merging data frames

      (iii) Adding new columns
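
A short pandas sketch of (i) and (iii); the 1.5 x IQR rule below is the same fence a box plot draws, and the salary figures are invented:

```python
import pandas as pd

df = pd.DataFrame({"salary": [30, 32, 35, 31, 300]})  # 300 looks like an outlier

# (i) Remove outliers using the 1.5 * IQR fences of a box plot
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# (iii) Add a new, derived column
clean = clean.assign(salary_usd=clean["salary"] * 1000)

# (ii) Merging data frames would look like: pd.merge(left, right, on="employee_id")
```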

4. Concluding >>> After doing EDA, the essential part is to draw meaningful conclusions that help in making important business decisions. Different domain experts may conclude by using different methods. The most used methods are-

      a. Machine Learning: An ML engineer may use ML models to predict and forecast.

      b. Inferential Statistics: Statistical experts may use this method.

      c. Descriptive Statistics: A sampling technique may be used with this method to summarize a large dataset.

5. Communicating results/Storytelling >>> Here, strong communication skills are required to interpret and communicate the results to different departments. It can be done in different formats, as below-

      a. In person

      b. Through reports

      c. Blog posts

      d. PPT/slides

Note: This whole process is nonlinear; we may switch between any of the steps whenever needed.

>>> Now, we will discuss in detail the steps of assessing and cleaning data <<<

1. >>> Writing a summary: The first step of assessing & cleaning is to write a summary of the dataset. This helps us understand the meaningful insights and details in the data.

Generally, a domain expert knows the given data best. So, a data analyst/scientist has to sit with the expert and learn the details about the data. Remember, the underlying concepts and insights of the data will help us analyze, draw better conclusions, and make better decisions.

2. >>> Writing a description for every column: This shows how well you understood the data and also helps with feature engineering if needed.

3. >>> Adding any new information: Here, you may reshape the data by adding, changing, and assigning new values if needed. Each piece of relevant information has to be included carefully.

Types of assessment: Every dataset should go through two types of assessment-

   a. Manual: Here, you need to look deeply at your dataset by opening it in Google Sheets/Excel, try to find the issues, and take detailed notes.

   b. Programmatic: After the manual assessment, you may perform a programmatic assessment through info(), describe(), sample(), etc.
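
For example, sample() pulls random rows for a quick spot check, something a head-of-file view in a spreadsheet can miss (toy data again):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

print(df.sample(5, random_state=0))  # five random rows for spot checking
df.info()                            # dtypes, non-null counts, memory usage
print(df.describe())                 # summary statistics
```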

Steps in assessment: There are two steps-

   a. Discover: You have to focus on both the manual and programmatic processes to discover the issues.

   b. Document: Discovering alone will not help you proceed further; you must document the issues for further assessment.

Now, data cleaning comes into play. There are two types of unclean data-

A. Dirty Data: Dirty data, also known as low-quality data, has content issues. Examples are-

> Duplicated data

> Missing Data

> Corrupt Data

> Inaccurate Data

B. Messy Data: Messy data, also known as untidy data, has structural issues. Tidy data, by contrast, has the following properties-

> Each variable forms a column

> Each observation forms a row

> Each observational unit forms a table
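
A classic messy-to-tidy fix is un-pivoting year columns with melt(); the table below is invented for illustration:

```python
import pandas as pd

# Messy: one variable (year) is spread across the column headers
wide = pd.DataFrame({"country": ["BD", "IN"], "2020": [5, 6], "2021": [7, 8]})

# Tidy: each variable forms a column, each observation forms a row
tidy = wide.melt(id_vars="country", var_name="year", value_name="gdp")
print(tidy)
```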

Exploratory Data Analysis and Drawing Conclusions

The first question we may have regarding EDA is why we need to perform it at all. Here are some of the many reasons-

   >> Model building

   >> Analysis and reporting

   >> Validating assumptions

   >> Handling missing values

   >> Feature engineering

   >> Detecting outliers

EDA is an iterative process, and it can be done repetitively.

Now, we can assess the data by using three types of analysis-

A. Univariate Analysis

B. Bivariate Analysis

C. Multivariate Analysis

A. Univariate Analysis: It focuses on analyzing each feature in the dataset independently. Through this analysis, we are trying to find out-

   >> Distribution analysis: The distribution of each feature is examined to identify its shape, central tendency, and dispersion.

   >> Identifying potential issues: Univariate analysis helps in identifying potential problems with the data, such as outliers, skewness, and missing values.

The shape of a data distribution refers to its overall pattern or form as it is represented on a graph. Some common shapes of data distributions include-

   >> Normal Distribution: A symmetrical and bell-shaped distribution where the mean, median, and mode are equal, and most of the data falls in the middle of the distribution with gradually decreasing frequencies towards the tails.

   >> Skewed Distribution: A distribution that is not symmetrical, with one tail being longer than the other. It can be either positively skewed (right-skewed) or negatively skewed (left-skewed).

   >> Bimodal Distribution: A distribution with two peaks or modes.

   >> Uniform Distribution: A distribution where all values have an equal chance of occurring.

The shape of the data distribution is important in identifying the presence of outliers, skewness, and the type of statistical tests and models that can be used for further analysis.

Dispersion is a statistical term used to describe the spread or variability of a set of data. It measures how far the values in a data set are spread out from the central tendency (mean, median, or mode) of the data.

There are several measures of dispersion, including:

   >> Range: The difference between the largest and smallest values in a data set.

   >> Variance: The average of the squared deviations of each value from the mean of the data set.

   >> Standard Deviation: The square root of the variance. It provides a measure of the spread of the data that is in the same units as the original data.

   >> Interquartile range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data.

Dispersion helps to describe the spread of the data, which can help to identify the presence of outliers and skewness in the data.
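
All four measures are one-liners in pandas; the numbers below are arbitrary:

```python
import pandas as pd

s = pd.Series([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])

data_range = s.max() - s.min()                    # Range
variance   = s.var()                              # sample variance (ddof=1)
std_dev    = s.std()                              # Standard Deviation
iqr        = s.quantile(0.75) - s.quantile(0.25)  # Interquartile range
```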

Steps for doing Univariate Analysis on numerical columns:

   >> Descriptive Statistics: Compute basic summary statistics for the column, such as mean, median, mode, standard deviation, range, and quartiles. These statistics give a general understanding of the distribution of the data and can help identify skewness or outliers.

   >> Visualizations: Create visualizations to explore the distribution of the data. Some common visualizations for numerical data include histograms, box plots, and density plots. These visualizations provide a visual representation of the distribution of the data and can help identify skewness and outliers.

   >> Identifying Outliers: Identify and examine any outliers in the data. Outliers can be identified using visualizations. It is important to determine whether the outliers are due to measurement errors, data entry errors, or legitimate differences in the data, and to decide whether to include or exclude them from the analysis.

   >> Skewness: Check for skewness in the data and consider transforming the data or using robust statistical methods that are less sensitive to skewness, if necessary.
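
These steps can be sketched for a single numerical column; the series below is invented, with one deliberate outlier:

```python
import pandas as pd

s = pd.Series([20, 22, 21, 23, 25, 24, 90])  # 90 is a deliberate outlier

print(s.describe())   # mean, std, quartiles, and range in one call
print(s.skew())       # positive here: the outlier drags the tail to the right
# Visual checks for the remaining steps:
#   s.plot.hist()  -> distribution shape
#   s.plot.box()   -> outliers at a glance
```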

Steps for doing Univariate Analysis on categorical columns:

   >> Descriptive Statistics: Compute the frequency distribution of the categories in the column. This will give a general understanding of the distribution of the categories and their relative frequencies.

   >> Visualizations: Create visualizations to explore the distribution of the categories. Some common visualizations for categorical data include count plots and pie charts. These visualizations provide a visual representation of the distribution of the categories and can help identify any patterns or anomalies in the data.

   >> Missing Values: Check for missing values in the data and decide how to handle them. Missing values can be imputed or excluded from the analysis, depending on the research question and the data set.
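
For a categorical column, these steps can look like the following (the color values are invented):

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", None, "green", "red"])

print(s.value_counts())                # frequency of each category
print(s.value_counts(normalize=True)) # relative frequencies
print(s.isnull().sum())               # missing values to handle
# s.value_counts().plot.bar() would give the count plot
```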

Conclusion: Summarize the findings of the EDA and make decisions about how to proceed with further analysis.


Steps for doing Bivariate Analysis:

   >> Select two columns.

   >> Understand the type of relationship:

      1. Numerical - Numerical

         a. You can plot graphs like scatter plots (regression plots), 2D histograms, and 2D KDE plots.

         b. Check the correlation coefficient to check for a linear relationship.

      2. Numerical - Categorical: Create visualizations that compare the distribution of the numerical data across the different categories of the categorical data.

         a. You can plot graphs like bar plots, box plots, violin plots, and even scatter plots.

      3. Categorical - Categorical

         a. You can create cross-tabulations or contingency tables that show the distribution of values in one categorical column, grouped by the values in the other categorical column.

         b. You can plot heat maps, stacked bar plots, and tree maps.
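
Each of the three relationship types can also be checked programmatically. A pandas sketch on an invented study dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "hours":  [1, 2, 3, 4, 5, 6],
    "score":  [40, 45, 55, 65, 75, 85],
    "passed": ["no", "no", "no", "yes", "yes", "yes"],
    "group":  ["A", "B", "A", "B", "A", "B"],
})

# 1. Numerical - Numerical: correlation coefficient
print(df["hours"].corr(df["score"]))

# 2. Numerical - Categorical: compare the numerical distribution per category
print(df.groupby("passed")["score"].describe())

# 3. Categorical - Categorical: contingency table (feeds a heat map or stacked bars)
print(pd.crosstab(df["passed"], df["group"]))
```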
