A Beginner's Guide: How to Check if Data is Normal Before Training a Machine Learning Model in Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in any data science project, especially when it comes to preparing data for machine learning (ML) models. One fundamental aspect of EDA is assessing the distribution of data. Knowing whether your data follows a normal distribution or not can significantly impact the choice of ML algorithms and the performance of your model. In this article, we will explore methods to check if data is normally distributed before training an ML model.

Understanding Normal Distribution:

Before diving into the techniques for checking normality, let's briefly review what a normal distribution is. A normal distribution, also known as a Gaussian distribution, is a bell-shaped curve characterized by its mean and standard deviation. In a normal distribution:

  • The mean, median, and mode are equal.
  • The data is symmetric around the mean.
  • Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
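These coverage percentages follow directly from the standard normal cumulative distribution function, and you can verify them numerically. Below is a minimal sketch using SciPy (the library choice is simply a common option, not something required by the rule itself):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean for a
# normal distribution: P(|Z| <= k) = CDF(k) - CDF(-k)
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {coverage:.1%}")
# Prints approximately 68.3%, 95.4%, and 99.7%
```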

Techniques to Assess Normality:

To assess data normality, there are two main categories of techniques: graphical methods and analytical methods. Graphical methods involve visually inspecting the data distribution through tools such as histograms and quantile-quantile (Q-Q) plots. These methods are useful for observing the shape of the data distribution and identifying deviations from normality, especially with larger sample sizes.

Analytical methods, on the other hand, rely on statistical tests to quantitatively assess normality. Common tests include the Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, and others. These tests provide objective measures of normality but can be sensitive to sample size, with some tests being more suitable for specific scenarios. For instance, the Shapiro-Wilk test is often recommended as a reliable choice for testing data normality.

Histogram:

To determine whether data follows a normal distribution, one common method is to use a histogram. A histogram provides a visual representation of the data distribution, allowing for an initial assessment of normality based on the shape of the histogram. However, histograms might not always be the most reliable method for assessing normality, especially with small sample sizes. When using histograms, it is essential to consider the shape and spread of the distribution, looking for a bell-shaped curve that is symmetric around the mean.

To check data normality using a histogram, several key aspects need to be considered (a short plotting sketch follows this list):

  • Shape of the Distribution: When examining a histogram, it is essential to observe the shape of the distribution. A normal distribution typically appears as a bell-shaped curve that is symmetric around the mean. Deviations from this shape can indicate departures from normality.
  • Bin Heights: The heights of the bars in the histogram represent the frequency or density of data points within each bin. Normalizing the histogram (plotting densities rather than raw counts) makes the total area under the bars equal 1, so the histogram can be compared directly against a theoretical probability density curve.
  • Normalization Types: Different types of normalization can be applied to histograms, such as frequency distribution, discrete probability distribution, discrete percentage probability distribution, frequency density distribution, and probability density distribution. Each type serves a specific purpose in analyzing the data distribution.
  • Comparison with Theoretical Distribution: A better way to assess normality is by using a quantile-quantile plot (Q-Q plot) in addition to or instead of a histogram. The Q-Q plot compares the theoretical quantiles expected from a normal distribution with the quantiles of the actual data. If the data points closely align with the theoretical line, it suggests normality.
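As an illustration of the points above, the following sketch plots a density-normalized histogram and overlays the normal curve implied by the sample's mean and standard deviation. The data is simulated purely for demonstration; in practice you would substitute your own feature column:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Simulated data purely for illustration; replace with your own feature values
data = np.random.normal(loc=50, scale=10, size=1000)

# Density-normalized histogram: the total area under the bars equals 1
plt.hist(data, bins=30, density=True, alpha=0.6, edgecolor="black")

# Overlay the normal curve implied by the sample mean and standard deviation
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, norm.pdf(x, loc=data.mean(), scale=data.std()), linewidth=2)

plt.title("Histogram with fitted normal curve")
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
```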

Normal Probability Plot (Q-Q Plot):

A Normal Probability Plot, also known as a Q-Q (quantile-quantile) plot, is a graphical tool used to assess whether a dataset follows a normal distribution. In a Q-Q plot, the observed data quantiles are plotted against the quantiles of a theoretical normal distribution. If the data points fall approximately along a straight line, it indicates that the data are normally distributed. Deviations from this line suggest departures from normality, with curves or systematic patterns indicating non-normality.

To check if data is normal using a Q-Q plot, follow these steps (a code sketch follows the list):

  1. Plot the Data: Start by plotting the observed data quantiles against the quantiles of a normal distribution. The closer the data points align with a straight line, the more likely the data follows a normal distribution.
  2. Interpret the Plot: When interpreting a Q-Q plot, focus on whether the data points follow the 45-degree line (y = x). If the points generally follow this line with some random variability above and below it, the data are likely normally distributed. Use the "fat pencil test" by visually checking if an imaginary fat pencil covers the data points along the line.
  3. Identify Deviations: Look for systematic departures from the straight line in the plot. Curves, "S" shapes, or patterns that deviate consistently from the line indicate non-normality. These deviations can provide insights into the specific characteristics of the data distribution, such as skewness or the presence of outliers.
  4. Compare with Theoretical Distribution: By comparing the data points with the expected line representing a normal distribution, you can visually assess the degree of normality in the dataset. The closer the points align with the line, the more closely the data resemble a normal distribution.
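Here is a minimal Q-Q plot sketch using scipy.stats.probplot, which computes the theoretical normal quantiles, plots the ordered data against them, and draws a reference line. The simulated sample is only a placeholder for your own data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated data purely for illustration; replace with your own feature values
data = np.random.normal(loc=0, scale=1, size=500)

# Ordered data plotted against theoretical normal quantiles, with a fitted line
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```

If the points hug the reference line, the data is approximately normal; curvature at the ends typically points to heavy tails or skew.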

Shapiro-Wilk Test:

The Shapiro-Wilk test is a statistical test used to assess whether a dataset follows a normal distribution. It is particularly suitable for small to moderate sample sizes. When conducting the Shapiro-Wilk test, the null hypothesis assumes that the data are normally distributed. The test calculates a test statistic based on the differences between the observed data and the expected values under a normal distribution. The p-value associated with the test statistic indicates the probability of obtaining the observed results if the data were actually drawn from a normal distribution. To determine whether the data is normal using the Shapiro-Wilk test, consider the following key aspects (a code sketch follows the list):

  1. Test Statistic: The Shapiro-Wilk test calculates a test statistic, W, which is used to assess the normality of the data. A value of W close to 1 indicates that the data closely follow a normal distribution.
  2. P-Value: The p-value associated with the Shapiro-Wilk test indicates the significance level of the test. If the p-value is greater than the chosen significance level (commonly 0.05), the null hypothesis of normality is not rejected, suggesting that the data are normally distributed. Conversely, a p-value less than the significance level leads to rejecting the null hypothesis, indicating non-normality.
  3. Interpretation: When interpreting the results of the Shapiro-Wilk test, a high p-value suggests that the data are likely normally distributed, while a low p-value indicates departures from normality. Researchers use the p-value to make decisions about the normality assumption required for many statistical analyses.
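A minimal sketch of the test using scipy.stats.shapiro; the simulated sample and the 0.05 significance level are illustrative choices, not requirements:

```python
import numpy as np
from scipy.stats import shapiro

# Simulated data purely for illustration; replace with your own feature values
data = np.random.normal(loc=0, scale=1, size=200)

statistic, p_value = shapiro(data)
print(f"W statistic: {statistic:.4f}, p-value: {p_value:.4f}")

# Conventional decision rule at the 0.05 significance level
if p_value > 0.05:
    print("Fail to reject H0: data is consistent with a normal distribution")
else:
    print("Reject H0: data does not appear normally distributed")
```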

After assessing normality with tools like the Shapiro-Wilk test or a histogram, if the data is not normally distributed you can consider the following techniques to transform or rescale it. Note that transformations such as log or Box-Cox can reduce skewness and bring data closer to normality, whereas standardization and min-max scaling only rescale the values and do not change the shape of the distribution:

  • Transformations:
      • Log Transformation: Useful for positively skewed data.
      • Square Root Transformation: Suitable for moderately skewed data.
      • Box-Cox Transformation: Adaptable to various types of data distributions (a code sketch follows this list).
  • Standardization:
      • Z-Score Standardization: Transforming data to have a mean of 0 and a standard deviation of 1.
      • Min-Max Scaling: Rescaling data to a specific range, often between 0 and 1.
  • Normalization Methods:
      • Normalization: Scaling data to have a range between 0 and 1.
      • Robust Scaling: Scaling data based on percentiles to reduce the impact of outliers.
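The log transformation, standardization, min-max scaling, and robust scaling are covered in the sections below. Box-Cox is not, so here is a minimal sketch using scipy.stats.boxcox; note that Box-Cox requires strictly positive values, and the skewed sample below is a placeholder for your own data:

```python
import numpy as np
from scipy import stats

# Positively skewed, strictly positive sample purely for illustration
data = np.random.exponential(scale=2.0, size=500)

# boxcox returns the transformed data and the lambda that maximizes the
# log-likelihood; lambda = 0 corresponds to a plain log transformation
transformed, fitted_lambda = stats.boxcox(data)
print(f"Fitted lambda: {fitted_lambda:.3f}")

# Re-check normality after the transformation
w_stat, p_value = stats.shapiro(transformed)
print(f"Shapiro-Wilk p-value after Box-Cox: {p_value:.4f}")
```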

Standardization (Z-score normalization):

Standardization rescales the data so that it has a mean of 0 and a standard deviation of 1. It is achieved by subtracting the mean from each feature and then dividing by the standard deviation.
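A minimal sketch of z-score standardization, shown both by hand with NumPy and with scikit-learn's StandardScaler (the small array is just a placeholder for a real feature column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature column; replace with your own data
x = np.array([[10.0], [12.0], [15.0], [20.0], [30.0]])

# Manual z-score: subtract the mean, then divide by the standard deviation
z_manual = (x - x.mean()) / x.std()

# Equivalent with scikit-learn (expects a 2-D array of shape (n_samples, n_features))
z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())
print(z_sklearn.ravel())  # same values
```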

Min-Max Scaling:

Min-Max Scaling (Normalization) rescales the data to a fixed range, usually [0, 1]. It is achieved by subtracting the minimum value of the feature and then dividing by the range (maximum value minus minimum value).
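A minimal min-max scaling sketch, again both by hand and with scikit-learn's MinMaxScaler, which defaults to the [0, 1] range (the array is a placeholder):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder feature column; replace with your own data
x = np.array([[10.0], [12.0], [15.0], [20.0], [30.0]])

# Manual min-max scaling: (x - min) / (max - min)
scaled_manual = (x - x.min()) / (x.max() - x.min())

# Equivalent with scikit-learn, default feature_range=(0, 1)
scaled_sklearn = MinMaxScaler().fit_transform(x)

print(scaled_manual.ravel())
print(scaled_sklearn.ravel())  # same values
```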

Robust Scaling:

Robust Scaling is similar to Min-Max Scaling, but it subtracts the median and divides by the interquartile range (IQR) instead of the full range, which makes it much less affected by outliers in the data.
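A minimal robust scaling sketch using scikit-learn's RobustScaler, which by default centers on the median and scales by the IQR; the array, including its single outlier, is a placeholder:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Placeholder feature column with one outlier; replace with your own data
x = np.array([[10.0], [12.0], [15.0], [20.0], [200.0]])

# Subtracts the median and divides by the interquartile range (IQR), so the
# single large outlier barely influences how the remaining values are scaled
scaled = RobustScaler().fit_transform(x)
print(scaled.ravel())
```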

Log Transformation:

If the data is positively skewed, applying a log transformation may help in normalizing the distribution. Log transformation compresses the range of large values and expands the range of small values.
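A minimal log-transformation sketch; np.log1p (log(1 + x)) is used here so that zero values are handled, which is a common practical choice rather than part of the method itself:

```python
import numpy as np
from scipy.stats import shapiro

# Positively skewed, non-negative sample purely for illustration
data = np.random.exponential(scale=3.0, size=500)

# log1p computes log(1 + x), compressing large values while handling zeros
log_data = np.log1p(data)

# Compare Shapiro-Wilk p-values before and after the transformation
_, p_before = shapiro(data)
_, p_after = shapiro(log_data)
print(f"p-value before: {p_before:.4f}, after: {p_after:.4f}")
```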
