Histograms, PDFs, and CDFs: A Comprehensive Guide
Vaibhav Gupta
When we're faced with a dataset, one of the first steps in exploring and understanding our data is to visualize it. This is where histograms, Probability Density Functions (PDFs), and Cumulative Distribution Functions (CDFs) come into play. These tools are not just graphs; they are storytellers that reveal the underlying patterns and probabilities within our data. In this article, we'll break down these concepts into their simplest forms, provide a detailed example using the Iris dataset, and share some of my handwritten notes to enrich your understanding.
Understanding Histograms
A histogram is a visual representation of the distribution of numerical data. It's constructed by dividing the range of the data into bins and counting the number of observations that fall into each bin.
Why Histograms Matter:
- Shape of Data: They show the shape of the data's distribution, which can indicate the presence of modes (peaks), skewness, or outliers.
- Frequency of Values: Histograms display how often values occur within each bin, which helps us understand where data points are concentrated.
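To make the binning step concrete, here is a minimal sketch that counts observations per bin with NumPy. The sample values and the bin range are hypothetical and chosen only for illustration; they are not taken from any dataset used in this article.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1.2, 1.4, 1.5, 1.7, 3.9, 4.2, 4.5, 4.7, 5.1, 5.8])

# Split the range [1, 6] into 5 equal-width bins and count the observations in each
counts, bin_edges = np.histogram(values, bins=5, range=(1.0, 6.0))

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} observation(s)")
```

The empty bin between 2 and 3 and the two clusters on either side are exactly the kind of shape information a histogram plot makes visible at a glance.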
Diving into PDFs
A Probability Density Function (PDF) is a curve that describes the relative likelihood of a continuous random variable taking values near a given point. The height of the curve at a single point is a density, not a probability. The area under the curve within a given range is the probability of the variable falling within that range, and the total area under the curve is 1.
The Role of PDFs:
- Probability Estimation: PDFs help estimate the probability of a variable falling within a specific interval.
- Comparison of Distributions: They allow us to compare different distributions and understand their characteristics, such as variance and mean.
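The link between area and probability can be checked numerically. The sketch below assumes a normal distribution with hypothetical parameters (mean 4.0, standard deviation 1.5) and compares a trapezoid-rule area under the PDF with the exact probability from the CDF. It also shows that a density value can exceed 1, which is why the height of the curve should not be read as a probability.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for illustration (not fitted to any dataset)
mu, sigma = 4.0, 1.5
a, b = 3.0, 5.0

# Approximate the area under the PDF between a and b with the trapezoid rule
x = np.linspace(a, b, 10_001)
pdf_values = norm.pdf(x, loc=mu, scale=sigma)
dx = x[1] - x[0]
area = np.sum(pdf_values[:-1] + pdf_values[1:]) * dx / 2

# The same probability computed exactly as a difference of CDF values
exact = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
print(f"area under PDF: {area:.6f}, CDF difference: {exact:.6f}")

# A narrow distribution has a peak density well above 1
peak_density = norm.pdf(0.0, loc=0.0, scale=0.1)
print(f"peak density for sigma=0.1: {peak_density:.3f}")
```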
Exploring CDFs
A Cumulative Distribution Function (CDF) shows the probability that a random variable is less than or equal to a certain value. It's a running total of probabilities.
Importance of CDFs:
- Cumulative Probability: CDFs provide a cumulative probability for the variable and help identify percentile ranks.
- Data Thresholds: They are useful for finding thresholds such as the median, the value below which 50% of the data lies.
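As a rough illustration of the "running total" idea, the following sketch builds an empirical CDF directly from sorted data. The sample values and the helper name `cdf_at` are hypothetical, introduced only for this example.

```python
import numpy as np

# Hypothetical measurements, chosen only for illustration
values = np.array([1.4, 1.5, 1.3, 4.7, 4.5, 5.1, 5.9, 1.6, 4.0, 5.6])

# Empirical CDF: at the i-th smallest value, i out of n observations are <= that value
x_sorted = np.sort(values)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

def cdf_at(threshold):
    """Fraction of observations less than or equal to threshold (hypothetical helper)."""
    return np.mean(values <= threshold)

# The median is the threshold at which the cumulative probability reaches 50%
median = np.median(values)
print(f"median = {median}, CDF at median = {cdf_at(median)}")
```

Reading the CDF in the other direction (from a probability to a value) gives percentiles, which is what `np.percentile` computes.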
Practical Example: The Iris Dataset
The Iris dataset is a classic in the field of data science, containing measurements of iris flowers' features. We'll focus on one feature: petal length. Let's create a histogram, PDF, and CDF to visualize the distribution of petal lengths.
Step-by-Step Code Example
Using Python and its libraries, we can easily create these visualizations:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
# Load the Iris dataset
iris = sns.load_dataset('iris')
# Select petal length from the Iris dataset
petal_lengths = iris['petal_length'].values
# Histogram
plt.figure(figsize=(10, 5))
plt.hist(petal_lengths, bins=20, alpha=0.7, color='gray', edgecolor='black')
plt.title('Histogram of Iris Petal Lengths')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
# PDF
plt.figure(figsize=(10, 5))
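mu, std = norm.fit(petal_lengths) # Fit a normal distribution to the data (a rough approximation)
# Evaluate the PDF over the data range; plt.xlim() on a new, empty figure would return (0, 1)
x = np.linspace(petal_lengths.min(), petal_lengths.max(), 100)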
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2, label='PDF')
plt.title('PDF of Iris Petal Lengths')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Probability Density')
plt.legend()
# CDF
plt.figure(figsize=(10, 5))
hist_data, bin_edges = np.histogram(petal_lengths, bins=20, density=True)
cdf = np.cumsum(hist_data * np.diff(bin_edges))
plt.plot(bin_edges[1:], cdf, 'r', linewidth=2, label='CDF')
plt.title('CDF of Iris Petal Lengths')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Cumulative Probability')
plt.legend()
# Show the plots
plt.show()
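The CDF step above works because `density=True` scales the histogram so that bar height times bin width is the share of data in each bin. The sketch below checks this on a hypothetical random sample that stands in for the petal lengths, so that it runs without downloading the dataset; the seed and distribution parameters are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-in sample (not the Iris data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=4.0, scale=1.5, size=150)

# With density=True, each bar's height times its width is the fraction of data in that bin
density, edges = np.histogram(sample, bins=20, density=True)
bin_mass = density * np.diff(edges)

# The running total of these fractions approximates the CDF at each right bin edge
cdf = np.cumsum(bin_mass)
print(f"total probability mass: {bin_mass.sum():.6f}")
print(f"final CDF value: {cdf[-1]:.6f}")
```

Because the values are grouped into bins, this CDF is only known at the bin edges; using more bins or sorting the raw data gives a finer curve.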
Handwritten Notes for Additional Insights
To complement this article, I've included my handwritten notes. These notes delve deeper into the nuances of histograms, PDFs, and CDFs, providing personal insights and practical tips for analysis.