A Beginner's Guide: How to Check if Data is Normal Before Training a Machine Learning Model in Exploratory Data Analysis (EDA)
Bushra Akram
Machine Learning Engineer | AI Engineer | AI App Developer | AI Agents & RAG Systems (LangChain, LangGraph) | Python
Exploratory Data Analysis (EDA) is a crucial step in any data science project, especially when it comes to preparing data for machine learning (ML) models. One fundamental aspect of EDA is assessing the distribution of data. Knowing whether your data follows a normal distribution or not can significantly impact the choice of ML algorithms and the performance of your model. In this article, we will explore methods to check if data is normally distributed before training an ML model.
Understanding Normal Distribution:
Before diving into the techniques for checking normality, let's briefly review what a normal distribution is. A normal distribution, also known as a Gaussian distribution, is a bell-shaped curve characterized by its mean and standard deviation. In a normal distribution:
The curve is symmetric around the mean, and the mean, median, and mode coincide.
About 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three (the empirical rule; see the quick check below).
The tails extend indefinitely in both directions, approaching but never touching the horizontal axis.
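As a quick check of the empirical rule, the shares within one, two, and three standard deviations can be computed from the standard normal CDF. A minimal sketch, assuming scipy is available:

```python
from scipy import stats

# Probability of falling within k standard deviations of the mean
# for a standard normal distribution (the 68-95-99.7 rule).
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {p:.4f}")  # ~0.6827, 0.9545, 0.9973
```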
Techniques to Assess Normality:
To assess data normality, there are two main categories of techniques: graphical methods and analytical methods. Graphical methods involve visually inspecting the data distribution through tools like histograms and quantile-quantile (Q-Q) plots. These methods are useful for observing the shape of the data distribution and identifying deviations from normality, especially in larger sample sizes.
Analytical methods, on the other hand, rely on statistical tests to quantitatively assess normality. Common tests include the Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, and others. These tests provide objective measures of normality but can be sensitive to sample size, with some tests being more suitable for specific scenarios. For instance, the Shapiro-Wilk test is often recommended as a reliable choice for testing data normality.
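To make the analytical options concrete, here is a minimal sketch of the Kolmogorov-Smirnov and Anderson-Darling tests, assuming scipy and a synthetic sample standing in for real data (the Shapiro-Wilk test is demonstrated in its own section below):

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for a real feature column.
rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=300)

# Kolmogorov-Smirnov test against a normal distribution fitted to the
# sample (estimating the parameters from the data makes the test
# conservative; Lilliefors' correction addresses this).
ks_stat, ks_p = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))
print(f"KS: statistic = {ks_stat:.4f}, p-value = {ks_p:.4f}")

# Anderson-Darling reports critical values instead of a single p-value:
# reject normality at a given level if the statistic exceeds its critical value.
result = stats.anderson(data, dist="norm")
print(f"AD: statistic = {result.statistic:.4f}")
print("critical values:", result.critical_values)
print("significance levels (%):", result.significance_level)
```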
Histogram:
To determine whether data follows a normal distribution, one common method is to use a histogram. A histogram provides a visual representation of the data distribution, allowing for an initial assessment of normality based on the shape of the histogram. However, histograms might not always be the most reliable method for assessing normality, especially with small sample sizes. When using histograms, it is essential to consider the shape and spread of the distribution, looking for a bell-shaped curve that is symmetric around the mean.
To check data normality using a histogram, several key aspects need to be considered (see the sketch after this list):
Shape: look for a single peak forming a bell-shaped curve.
Symmetry: the two halves should mirror each other around the mean, with no pronounced left or right skew.
Tails: frequencies should taper off smoothly on both sides, without heavy tails or isolated outliers.
Bin count: too few or too many bins can hide or exaggerate the true shape, so try a few different bin widths.
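Here is a minimal sketch of such a histogram check, assuming matplotlib and numpy, with a synthetic normal sample standing in for a real feature:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample standing in for a real feature column.
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=1000)

plt.hist(data, bins=30, edgecolor="black")
plt.axvline(data.mean(), color="red", linestyle="--",
            label=f"mean = {data.mean():.1f}")
plt.title("Histogram: look for a symmetric, bell-shaped curve")
plt.xlabel("value")
plt.ylabel("frequency")
plt.legend()
plt.show()
```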
Normal Probability Plot (Q-Q Plot):
A Normal Probability Plot, also known as a Q-Q (quantile-quantile) plot, is a graphical tool used to assess whether a dataset follows a normal distribution. In a Q-Q plot, the observed data quantiles are plotted against the quantiles of a theoretical normal distribution. If the data points fall approximately along a straight line, it indicates that the data are normally distributed. Deviations from this line suggest departures from normality, with curves or systematic patterns indicating non-normality.
To check if data is normal using a Q-Q plot, follow these steps (a code sketch follows):
1. Sort the observed data in ascending order.
2. Compute the corresponding quantiles of a theoretical normal distribution.
3. Plot the observed quantiles against the theoretical quantiles.
4. Check whether the points fall approximately along a straight line; systematic curvature suggests skewness, and points peeling away at the ends suggest heavy tails.
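scipy's probplot carries out these steps for you: it sorts the data, computes the theoretical normal quantiles, and adds a reference line. A minimal sketch, again with a synthetic sample:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic sample standing in for a real feature column.
rng = np.random.default_rng(7)
data = rng.normal(size=500)

# probplot sorts the data, computes theoretical normal quantiles,
# plots one against the other, and adds a least-squares reference line.
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q-Q plot: points near the line suggest normality")
plt.show()
```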
Shapiro-Wilk Test:
The Shapiro-Wilk test is a statistical test used to assess whether a dataset follows a normal distribution. It is particularly suitable for small to moderate sample sizes. When conducting the Shapiro-Wilk test, the null hypothesis assumes that the data are normally distributed. The test calculates a test statistic based on the differences between the observed data and the expected values under a normal distribution. The p-value associated with the test statistic indicates the probability of obtaining the observed results if the data were actually drawn from a normal distribution. To determine whether the data are normal using the Shapiro-Wilk test, the following key aspects are considered (see the sketch after this list):
Null hypothesis (H0): the data are drawn from a normal distribution.
Significance level: conventionally alpha = 0.05.
Decision rule: if the p-value is greater than alpha, fail to reject H0 and treat the data as consistent with normality; if the p-value is less than or equal to alpha, reject H0 and conclude the data deviate significantly from normality.
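A minimal sketch of the test with scipy.stats.shapiro, using a synthetic sample and the conventional alpha of 0.05:

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for a real feature column.
rng = np.random.default_rng(1)
data = rng.normal(size=200)

stat, p_value = stats.shapiro(data)
print(f"W statistic = {stat:.4f}, p-value = {p_value:.4f}")

alpha = 0.05  # conventional significance level
if p_value > alpha:
    print("Fail to reject H0: data are consistent with a normal distribution.")
else:
    print("Reject H0: data deviate significantly from normality.")
```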
After assessing the data for normality with tools like a histogram, a Q-Q plot, or the Shapiro-Wilk test, if the data are not normally distributed and a transformation or rescaling is required, you can consider the following techniques. Note the distinction: transformations (log, square root, Box-Cox) change the shape of the distribution and can reduce skewness, while scaling methods (Z-score, min-max, robust) change only the location and spread, not the shape:
Log Transformation: Useful for positively skewed data.
Square Root Transformation: Suitable for moderately skewed data.
Box-Cox Transformation: Adaptable to various types of data distributions.
Z-Score Standardization: Transforming data to have a mean of 0 and a standard deviation of 1.
Min-Max Scaling (Normalization): Rescaling data to a fixed range, often [0, 1].
Robust Scaling: Scaling data based on percentiles to reduce the impact of outliers.
Standardization (Z-score normalization):
Standardization rescales the data so that it has a mean of 0 and a standard deviation of 1. It is achieved by subtracting the mean from each feature and then dividing by the standard deviation.
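In formula form, z = (x - mean) / standard deviation, applied per feature. A minimal sketch with scikit-learn's StandardScaler, using a toy matrix in place of real features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (rows = samples, columns = features).
X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # (x - mean) / std, per column
print(X_std.ravel())             # mean ~0, std ~1
```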
Min-Max Scaling:
Min-Max Scaling (Normalization) rescales the data to a fixed range, usually [0, 1]. It is achieved by subtracting the minimum value of the feature and then dividing by the range (maximum value minus minimum value).
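A minimal sketch with scikit-learn's MinMaxScaler, whose default feature_range is (0, 1):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (rows = samples, columns = features).
X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = MinMaxScaler()              # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)   # (x - min) / (max - min), per column
print(X_scaled.ravel())              # [0.0, 0.3333, 0.6667, 1.0]
```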
Robust Scaling:
Robust Scaling is similar to standardization, but it centers the data on the median and divides by the interquartile range (IQR) rather than using the mean and standard deviation. Because the median and IQR are insensitive to extreme values, it is much less affected by outliers in the data.
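A minimal sketch with scikit-learn's RobustScaler, using a toy sample with one extreme outlier to show the effect:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy sample with one extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()              # (x - median) / IQR, per column
X_scaled = scaler.fit_transform(X)
# The inliers keep a sensible spread; min-max scaling would have
# squashed them into a tiny interval near 0 because of the outlier.
print(X_scaled.ravel())
```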
Log Transformation:
If the data is positively skewed, applying a log transformation may help in normalizing the distribution. Log transformation compresses the range of large values and expands the range of small values.
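A minimal sketch with numpy; log1p computes log(1 + x), which also handles zeros safely (the values are made up for illustration):

```python
import numpy as np

# Made-up, positively skewed values: a few large observations dominate.
data = np.array([1.0, 2.0, 3.0, 5.0, 10.0, 100.0, 1000.0])

# log1p computes log(1 + x); it requires values greater than -1,
# so it suits non-negative data like the sample above.
log_data = np.log1p(data)
print(log_data.round(2))
```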