Normal Distribution

What is Normal Distribution?

A Normal Distribution is a type of continuous probability distribution for a real-valued random variable. The graph of a normal distribution is bell-shaped and symmetrical, centered around the mean. In a normal distribution:

  • Mean: Determines the center of the distribution.
  • Standard Deviation: Determines the spread or width of the distribution.
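
The roles of the two parameters are easiest to see in the density formula f(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation. The snippet below is a minimal sketch of that formula, checked against scipy.stats.norm.pdf. The helper name normal_pdf and the values 50 and 10 are assumptions chosen only for illustration.

```python
import math
from scipy import stats


def normal_pdf(x, mean, std_dev):
    """Density of a normal distribution at x (illustrative helper)."""
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * math.exp(exponent)


mean, std_dev = 50, 10  # example values, assumed for illustration

peak = normal_pdf(mean, mean, std_dev)       # height of the curve at the mean
left = normal_pdf(mean - 5, mean, std_dev)   # 5 units below the mean
right = normal_pdf(mean + 5, mean, std_dev)  # 5 units above the mean
reference = stats.norm.pdf(mean, loc=mean, scale=std_dev)

print(peak, left, right, reference)
```

The curve is highest at the mean, and points at equal distances on either side of the mean have the same height, which is the symmetry described above.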

Why is it Used?

The normal distribution is widely used in statistics because many natural phenomena and measurement outcomes are approximately normally distributed, and because averages of many independent observations tend toward a normal distribution (the central limit theorem). This makes it central to drawing inferences about populations from sample data, including hypothesis testing, confidence intervals, and many other statistical analyses.

Real-Life Examples of Normal Distribution

  1. Height of People: Heights in a population tend to follow a normal distribution where most people fall near the average height, and fewer people are either much shorter or much taller.
  2. IQ Scores: IQ scores are designed to be normally distributed with a mean of 100 and a standard deviation of 15.
  3. Measurement Errors: Measurement errors in experiments often follow a normal distribution.
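
To make the IQ example concrete, here is a small sketch that treats IQ as exactly normal with mean 100 and standard deviation 15, which is an idealization. It uses scipy.stats.norm to ask what share of scores lies above 130 and what share lies between 85 and 115.

```python
from scipy import stats

# Idealized IQ model: mean 100, standard deviation 15
iq = stats.norm(loc=100, scale=15)

share_above_130 = iq.sf(130)                 # sf is the survival function, 1 - cdf
share_85_to_115 = iq.cdf(115) - iq.cdf(85)   # within one standard deviation

print(f"Share above 130: {share_above_130:.4f}")
print(f"Share between 85 and 115: {share_85_to_115:.4f}")
```

Under this model the two shares come out at about 0.0228 and 0.6827.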

How Mean and Standard Deviation Are Used in Normal Distribution

  • Mean: The average value around which the data is centered.
  • Standard Deviation: A measure of how spread out the data is. It quantifies the dispersion of the dataset.
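
One way to see what the two parameters do is to start from standard normal values (mean 0, standard deviation 1), then stretch them by the standard deviation and shift them by the mean. This is only a sketch: the seed, the sample size, and the values 50 and 10 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only
z = rng.standard_normal(100_000)      # mean 0, standard deviation 1

mean, std_dev = 50, 10                # example values, assumed for illustration
x = mean + std_dev * z                # stretch by std_dev, then shift by mean

print(f"Sample mean: {x.mean():.2f}")
print(f"Sample standard deviation: {x.std():.2f}")
```

The sample mean and standard deviation land close to 50 and 10, but not exactly on them, because the data is random.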

What is Meant by 1, 2, 3 Standard Deviations from the Mean?

  • 1 Standard Deviation: Roughly 68% of the data falls within one standard deviation of the mean.
  • 2 Standard Deviations: About 95% of the data falls within two standard deviations.
  • 3 Standard Deviations: Approximately 99.7% of the data falls within three standard deviations.

This is often referred to as the 68-95-99.7 Rule.
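
These percentages can be checked from the cumulative distribution function of the standard normal distribution. The sketch below uses scipy.stats.norm.cdf; the dictionary name coverage is just an illustrative choice.

```python
from scipy import stats

coverage = {}
for k in (1, 2, 3):
    # Probability of falling between -k and +k standard deviations
    coverage[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {coverage[k]:.4f}")
```

The three values are about 0.6827, 0.9545, and 0.9973, so the rule is a rounded summary. In particular, 95% corresponds more precisely to about 1.96 standard deviations.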

Python Example 1: Calculating and Visualizing Normal Distribution

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate some data that follows a normal distribution
mean = 50
std_dev = 10
data = np.random.normal(mean, std_dev, 1000)

# Plot the histogram of the data
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')

# Plot the normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mean, std_dev)
plt.plot(x, p, 'k', linewidth=2)
plt.title('Normal Distribution (Mean: 50, Std Dev: 10)')
plt.show()
        

In this example:

  • The data is generated using a normal distribution with a mean of 50 and standard deviation of 10.
  • A histogram is plotted to show the distribution of data.
  • The normal distribution curve is overlaid.


Code Explanation:

  • numpy (np): A fundamental package for scientific computing in Python. It provides functions to work with arrays and generate random data.
  • matplotlib.pyplot (plt): A plotting library in Python used for creating static, interactive, and animated visualizations. Here, it’s used to create a histogram and plot a curve.
  • scipy.stats (stats): A module within scipy that contains functions for statistical operations, including working with probability distributions.
  • mean = 50: The mean (or expected value) of the normal distribution is set to 50.
  • std_dev = 10: The standard deviation of the normal distribution is set to 10.
  • np.random.normal(mean, std_dev, 1000): This function generates 1000 random data points that follow a normal distribution with the specified mean and std_dev. The result is stored in data.
  • data: The dataset that we generated from the normal distribution.
  • bins=30: The number of bins (or intervals) used to group the data in the histogram.
  • density=True: Normalizes the histogram so that the area under the histogram is equal to 1, making it comparable to the probability density function (PDF).
  • alpha=0.6: Adjusts the transparency of the bars, making them semi-transparent.
  • color='g': Sets the color of the histogram bars to green (g).
  • xmin, xmax = plt.xlim(): Retrieves the current x-axis limits of the plot, so the curve will fit within the histogram’s range.
  • x = np.linspace(xmin, xmax, 100): Generates 100 evenly spaced points between xmin and xmax to plot the curve smoothly.
  • stats.norm.pdf(x, mean, std_dev): Computes the probability density function (PDF) of the normal distribution for the generated x values, given the mean and std_dev. The result, p, is the height of the curve at each point in x.
  • plt.plot(x, p, 'k', linewidth=2): Plots the normal distribution curve on the same graph as the histogram. 'k' sets the line color to black, and linewidth=2 sets the line thickness to 2.
  • plt.title('Normal Distribution (Mean: 50, Std Dev: 10)'): Adds a title to the plot, indicating the parameters of the normal distribution (mean and standard deviation).
  • plt.show(): Displays the plot with both the histogram and the normal distribution curve.
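
As a possible follow-up to Example 1, the sketch below goes in the other direction: it starts from simulated data and estimates the mean and standard deviation with scipy.stats.norm.fit, then counts how much of the sample lies within one estimated standard deviation. The seeded generator from np.random.default_rng is a substitution made here for reproducibility and is not part of the original example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)            # arbitrary seed
data = rng.normal(loc=50, scale=10, size=1000)  # same parameters as Example 1

# Maximum-likelihood estimates of the mean (loc) and standard deviation (scale)
fitted_mean, fitted_std = stats.norm.fit(data)

# Fraction of the sample within one estimated standard deviation of the mean
within_one = np.mean(np.abs(data - fitted_mean) <= fitted_std)

print(f"Estimated mean: {fitted_mean:.2f}")
print(f"Estimated standard deviation: {fitted_std:.2f}")
print(f"Fraction within one standard deviation: {within_one:.3f}")
```

The estimates should sit near 50 and 10, and the fraction should be close to the 68% expected for normal data, with some variation from sample to sample.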


Python Example 2: Z-Score Calculation for a Normal Distribution

A Z-score tells how many standard deviations an element is from the mean.

import numpy as np

# Data: Heights of people in cm
heights = [160, 165, 170, 155, 175, 180, 185, 160, 162, 167]

# Calculate mean and standard deviation
mean_height = np.mean(heights)
std_dev_height = np.std(heights)

# Calculate Z-scores (how many standard deviations each height is from the mean)
z_scores = [(x - mean_height) / std_dev_height for x in heights]

# Print the mean, standard deviation, and Z-scores
print(f"Mean Height: {mean_height}")
print(f"Standard Deviation: {std_dev_height}")
print(f"Z-scores: {z_scores}")
        
Output:
Mean Height: 167.9
Standard Deviation: 9.104394543296111
Z-scores: [-0.8677128349866006, -0.3185274963874867, 0.23065784221162722, -1.4168981735857145, 0.7798431808107411, 1.329028519409855, 1.878213858008969, -0.8677128349866006, -0.648038699546955, -0.09885336094784113]        

In this example:

  • The mean and standard deviation of the heights are calculated.
  • Z-scores are computed to determine how far each height is from the mean in terms of standard deviations.
  • Each value in the Z-scores list corresponds to the number of standard deviations that the corresponding height is away from the mean height.

Code Explanation:

  • The variable heights is a list containing heights of 10 people (in centimeters). This is the raw data for which we will calculate the Z-scores.

  • np.mean(heights): This function from the numpy library calculates the mean (average) of the list heights. The mean is a measure of central tendency, calculated by summing all the data points and dividing by the number of points.
  • np.std(heights): This function calculates the standard deviation of the list heights. The standard deviation measures the amount of variation or dispersion in a set of data points. A small standard deviation means that the data points are close to the mean, while a large standard deviation indicates that the data points are spread out over a wider range of values. By default, np.std computes the population standard deviation (dividing by N); passing ddof=1 gives the sample standard deviation (dividing by N - 1).
  • z_scores = [(x - mean_height) / std_dev_height for x in heights]: This list comprehension calculates the Z-score for each height in the list. A Z-score is a statistical measure that tells you how far (in standard deviations) a data point is from the mean. It is calculated using the following formula:
  • Z = (x − μ) / σ, where:
  • x is an individual data point (in this case, the height),
  • μ is the mean of the data points (here, mean_height),
  • σ is the standard deviation of the data points (here, std_dev_height).
  • (x - mean_height): Subtracts the mean from each height to find the distance of that height from the mean.
  • / std_dev_height: Divides this distance by the standard deviation to express it in terms of standard deviations. The resulting z_scores list contains the Z-scores of all the heights in the original heights list.
  • The print statements display the calculated mean, standard deviation, and Z-scores in a readable format.
  • f"Mean Height: {mean_height}": The f-string formatting is used to insert the calculated values into the output text.
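
The same Z-scores can also be computed without a list comprehension. The sketch below compares a vectorized NumPy version with scipy.stats.zscore, which also uses the population standard deviation (ddof=0) by default. The variable names manual and library are illustrative only.

```python
import numpy as np
from scipy import stats

heights = np.array([160, 165, 170, 155, 175, 180, 185, 160, 162, 167])

# Vectorized version of the list comprehension (population std, ddof=0)
manual = (heights - heights.mean()) / heights.std()

# SciPy's helper; its default ddof=0 matches np.std
library = stats.zscore(heights)

# Sample standard deviation (divides by N - 1), shown for comparison
sample_std = heights.std(ddof=1)

print(np.allclose(manual, library))
print(f"Population std: {heights.std():.3f}, sample std: {sample_std:.3f}")
```

With only ten values, the choice between ddof=0 and ddof=1 changes the standard deviation noticeably, and therefore the Z-scores as well.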


What the Z-Scores Mean

  • A Z-score of 0 means the height is exactly the same as the mean height.
  • A positive Z-score means the height is above the mean. For example, a Z-score of 1 means the height is 1 standard deviation above the mean.
  • A negative Z-score means the height is below the mean. For example, a Z-score of -1 means the height is 1 standard deviation below the mean.
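
If the underlying data really is normally distributed, a Z-score can be translated into a percentile with the standard normal cumulative distribution function. The sketch below does this with scipy.stats.norm.cdf for a few example Z-scores. Normality is an assumption here; ten heights are not enough to establish it.

```python
from scipy import stats

example_z_scores = (-1.0, 0.0, 1.0, 1.88)  # illustrative values

percentiles = {}
for z in example_z_scores:
    # Share of a normal population that lies below this Z-score, in percent
    percentiles[z] = stats.norm.cdf(z) * 100
    print(f"Z = {z:+.2f} -> about the {percentiles[z]:.1f}th percentile")
```

A Z-score of 0 maps to the 50th percentile, and Z-scores of -1 and +1 map to percentiles that are equally far below and above 50.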


For a detailed explanation of the article, you can watch the accompanying YouTube video here: YouTube Video.
