A Beginner's Guide: How to Check if Data is Normal Before Training a Machine Learning Model in Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in any data science project, especially when it comes to preparing data for machine learning (ML) models. One fundamental aspect of EDA is assessing the distribution of data. Knowing whether your data follows a normal distribution or not can significantly impact the choice of ML algorithms and the performance of your model. In this article, we will explore methods to check if data is normally distributed before training an ML model.

Understanding Normal Distribution:

Before diving into the techniques for checking normality, let's briefly review what a normal distribution is. A normal distribution, also known as a Gaussian distribution, is a bell-shaped curve characterized by its mean and standard deviation. In a normal distribution:

  • The mean, median, and mode are equal.
  • The data is symmetric around the mean.
  • Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
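These coverage percentages follow directly from the standard normal cumulative distribution function, and you can verify them numerically. Below is a minimal sketch using SciPy (the library choice is simply a common option, not something required by the rule itself):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean for a
# normal distribution: P(|Z| <= k) = CDF(k) - CDF(-k)
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {coverage:.1%}")
# Prints approximately 68.3%, 95.4%, and 99.7%
```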

Techniques to Assess Normality:

To assess data normality, there are two main categories of techniques: graphical methods and analytical methods. Graphical methods involve visually inspecting the data distribution through tools such as histograms and quantile-quantile (Q-Q) plots. These methods are useful for observing the shape of the data distribution and identifying deviations from normality, especially with larger sample sizes.

Analytical methods, on the other hand, rely on statistical tests to quantitatively assess normality. Common tests include the Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, and others. These tests provide objective measures of normality but can be sensitive to sample size, with some tests being more suitable for specific scenarios. For instance, the Shapiro-Wilk test is often recommended as a reliable choice for testing data normality.

Histogram:

To determine whether data follows a normal distribution, one common method is to use a histogram. A histogram provides a visual representation of the data distribution, allowing for an initial assessment of normality based on the shape of the histogram. However, histograms might not always be the most reliable method for assessing normality, especially with small sample sizes. When using histograms, it is essential to consider the shape and spread of the distribution, looking for a bell-shaped curve that is symmetric around the mean.

To check data normality using a histogram, several key aspects need to be considered (a short plotting sketch follows this list):

  • Shape of the Distribution: When examining a histogram, it is essential to observe the shape of the distribution. A normal distribution typically appears as a bell-shaped curve that is symmetric around the mean. Deviations from this shape can indicate departures from normality.
  • Bin Heights: The heights of the bars in the histogram represent the frequency or density of data points within each bin. Normalizing the histogram (plotting densities rather than raw counts) makes the total area under the bars equal 1, so the histogram can be compared directly against a theoretical probability density curve.
  • Normalization Types: Different types of normalization can be applied to histograms, such as frequency distribution, discrete probability distribution, discrete percentage probability distribution, frequency density distribution, and probability density distribution. Each type serves a specific purpose in analyzing the data distribution.
  • Comparison with Theoretical Distribution: A better way to assess normality is by using a quantile-quantile plot (Q-Q plot) in addition to or instead of a histogram. The Q-Q plot compares the theoretical quantiles expected from a normal distribution with the quantiles of the actual data. If the data points closely align with the theoretical line, it suggests normality.
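As an illustration of the points above, the following sketch plots a density-normalized histogram and overlays the normal curve implied by the sample's mean and standard deviation. The data is simulated purely for demonstration; in practice you would substitute your own feature column:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Simulated data purely for illustration; replace with your own feature values
data = np.random.normal(loc=50, scale=10, size=1000)

# Density-normalized histogram: the total area under the bars equals 1
plt.hist(data, bins=30, density=True, alpha=0.6, edgecolor="black")

# Overlay the normal curve implied by the sample mean and standard deviation
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, norm.pdf(x, loc=data.mean(), scale=data.std()), linewidth=2)

plt.title("Histogram with fitted normal curve")
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
```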

Normal Probability Plot (Q-Q Plot):

A Normal Probability Plot, also known as a Q-Q (quantile-quantile) plot, is a graphical tool used to assess whether a dataset follows a normal distribution. In a Q-Q plot, the observed data quantiles are plotted against the quantiles of a theoretical normal distribution. If the data points fall approximately along a straight line, it indicates that the data are normally distributed. Deviations from this line suggest departures from normality, with curves or systematic patterns indicating non-normality.

To check if data is normal using a Q-Q plot, follow these steps (a code sketch follows the list):

  1. Plot the Data: Start by plotting the observed data quantiles against the quantiles of a normal distribution. The closer the data points align with a straight line, the more likely the data follows a normal distribution.
  2. Interpret the Plot: When interpreting a Q-Q plot, focus on whether the data points follow the 45-degree line (y = x). If the points generally follow this line with some random variability above and below it, the data are likely normally distributed. Use the "fat pencil test" by visually checking if an imaginary fat pencil covers the data points along the line.
  3. Identify Deviations: Look for systematic departures from the straight line in the plot. Curves, "S" shapes, or patterns that deviate consistently from the line indicate non-normality. These deviations can provide insights into the specific characteristics of the data distribution, such as skewness or the presence of outliers.
  4. Compare with Theoretical Distribution: By comparing the data points with the expected line representing a normal distribution, you can visually assess the degree of normality in the dataset. The closer the points align with the line, the more closely the data resemble a normal distribution.
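Here is a minimal Q-Q plot sketch using scipy.stats.probplot, which computes the theoretical normal quantiles, plots the ordered data against them, and draws a reference line. The simulated sample is only a placeholder for your own data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated data purely for illustration; replace with your own feature values
data = np.random.normal(loc=0, scale=1, size=500)

# Ordered data plotted against theoretical normal quantiles, with a fitted line
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```

If the points hug the reference line, the data is approximately normal; curvature at the ends typically points to heavy tails or skew.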

Shapiro-Wilk Test:

The Shapiro-Wilk test is a statistical test used to assess whether a dataset follows a normal distribution. It is particularly suitable for small to moderate sample sizes. When conducting the Shapiro-Wilk test, the null hypothesis assumes that the data are normally distributed. The test calculates a test statistic based on the differences between the observed data and the expected values under a normal distribution. The p-value associated with the test statistic indicates the probability of obtaining the observed results if the data were actually drawn from a normal distribution. To determine whether the data is normal using the Shapiro-Wilk test, consider the following key aspects (a code sketch follows the list):

  1. Test Statistic: The Shapiro-Wilk test calculates a test statistic, W, which is used to assess the normality of the data. A value of W close to 1 indicates that the data closely follow a normal distribution.
  2. P-Value: The p-value associated with the Shapiro-Wilk test indicates the significance level of the test. If the p-value is greater than the chosen significance level (commonly 0.05), the null hypothesis of normality is not rejected, suggesting that the data are normally distributed. Conversely, a p-value less than the significance level leads to rejecting the null hypothesis, indicating non-normality.
  3. Interpretation: When interpreting the results of the Shapiro-Wilk test, a high p-value suggests that the data are likely normally distributed, while a low p-value indicates departures from normality. Researchers use the p-value to make decisions about the normality assumption required for many statistical analyses.
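A minimal sketch of the test using scipy.stats.shapiro; the simulated sample and the 0.05 significance level are illustrative choices, not requirements:

```python
import numpy as np
from scipy.stats import shapiro

# Simulated data purely for illustration; replace with your own feature values
data = np.random.normal(loc=0, scale=1, size=200)

statistic, p_value = shapiro(data)
print(f"W statistic: {statistic:.4f}, p-value: {p_value:.4f}")

# Conventional decision rule at the 0.05 significance level
if p_value > 0.05:
    print("Fail to reject H0: data is consistent with a normal distribution")
else:
    print("Reject H0: data does not appear normally distributed")
```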

After assessing normality with tools like the Shapiro-Wilk test or a histogram, if the data is not normally distributed you can consider the following techniques to transform or rescale it. Note that transformations such as log or Box-Cox can reduce skewness and bring data closer to normality, whereas standardization and min-max scaling only rescale the values and do not change the shape of the distribution:

  • Transformations:
      • Log Transformation: Useful for positively skewed data.
      • Square Root Transformation: Suitable for moderately skewed data.
      • Box-Cox Transformation: Adaptable to various types of data distributions (a code sketch follows this list).
  • Standardization:
      • Z-Score Standardization: Transforming data to have a mean of 0 and a standard deviation of 1.
      • Min-Max Scaling: Rescaling data to a specific range, often between 0 and 1.
  • Normalization Methods:
      • Normalization: Scaling data to have a range between 0 and 1.
      • Robust Scaling: Scaling data based on percentiles to reduce the impact of outliers.
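The log transformation, standardization, min-max scaling, and robust scaling are covered in the sections below. Box-Cox is not, so here is a minimal sketch using scipy.stats.boxcox; note that Box-Cox requires strictly positive values, and the skewed sample below is a placeholder for your own data:

```python
import numpy as np
from scipy import stats

# Positively skewed, strictly positive sample purely for illustration
data = np.random.exponential(scale=2.0, size=500)

# boxcox returns the transformed data and the lambda that maximizes the
# log-likelihood; lambda = 0 corresponds to a plain log transformation
transformed, fitted_lambda = stats.boxcox(data)
print(f"Fitted lambda: {fitted_lambda:.3f}")

# Re-check normality after the transformation
w_stat, p_value = stats.shapiro(transformed)
print(f"Shapiro-Wilk p-value after Box-Cox: {p_value:.4f}")
```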

Standardization (Z-score normalization):

Standardization rescales the data so that it has a mean of 0 and a standard deviation of 1. It is achieved by subtracting the mean from each feature and then dividing by the standard deviation.
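A minimal sketch of z-score standardization, shown both by hand with NumPy and with scikit-learn's StandardScaler (the small array is just a placeholder for a real feature column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature column; replace with your own data
x = np.array([[10.0], [12.0], [15.0], [20.0], [30.0]])

# Manual z-score: subtract the mean, then divide by the standard deviation
z_manual = (x - x.mean()) / x.std()

# Equivalent with scikit-learn (expects a 2-D array of shape (n_samples, n_features))
z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())
print(z_sklearn.ravel())  # same values
```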

Min-Max Scaling:

Min-Max Scaling (Normalization) rescales the data to a fixed range, usually [0, 1]. It is achieved by subtracting the minimum value of the feature and then dividing by the range (maximum value minus minimum value).
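A minimal min-max scaling sketch, again both by hand and with scikit-learn's MinMaxScaler, which defaults to the [0, 1] range (the array is a placeholder):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder feature column; replace with your own data
x = np.array([[10.0], [12.0], [15.0], [20.0], [30.0]])

# Manual min-max scaling: (x - min) / (max - min)
scaled_manual = (x - x.min()) / (x.max() - x.min())

# Equivalent with scikit-learn, default feature_range=(0, 1)
scaled_sklearn = MinMaxScaler().fit_transform(x)

print(scaled_manual.ravel())
print(scaled_sklearn.ravel())  # same values
```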

Robust Scaling:

Robust Scaling is similar to Min-Max Scaling, but it subtracts the median and divides by the interquartile range (IQR) instead of the full range, which makes it much less affected by outliers in the data.
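A minimal robust scaling sketch using scikit-learn's RobustScaler, which by default centers on the median and scales by the IQR; the array, including its single outlier, is a placeholder:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Placeholder feature column with one outlier; replace with your own data
x = np.array([[10.0], [12.0], [15.0], [20.0], [200.0]])

# Subtracts the median and divides by the interquartile range (IQR), so the
# single large outlier barely influences how the remaining values are scaled
scaled = RobustScaler().fit_transform(x)
print(scaled.ravel())
```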

Log Transformation:

If the data is positively skewed, applying a log transformation may help in normalizing the distribution. Log transformation compresses the range of large values and expands the range of small values.
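A minimal log-transformation sketch; np.log1p (log(1 + x)) is used here so that zero values are handled, which is a common practical choice rather than part of the method itself:

```python
import numpy as np
from scipy.stats import shapiro

# Positively skewed, non-negative sample purely for illustration
data = np.random.exponential(scale=3.0, size=500)

# log1p computes log(1 + x), compressing large values while handling zeros
log_data = np.log1p(data)

# Compare Shapiro-Wilk p-values before and after the transformation
_, p_before = shapiro(data)
_, p_after = shapiro(log_data)
print(f"p-value before: {p_before:.4f}, after: {p_after:.4f}")
```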
