Performing Univariate Analysis in Python
Mahsa Salimi
Business Intelligence Developer | Data Analyst | SQL, Power BI, Python Expert | Let's Turn Numbers into Success!
This article covers techniques for analyzing individual variables in a dataset, one variable at a time. It walks through histograms, boxplots, violin plots, summary tables, bar charts, and pie charts. These techniques help in understanding the distribution of a variable, its outliers, and any patterns within it.
We'll need the datasets: Amsterdam House Prices Data and Palmer Archipelago Penguins data. These datasets can be downloaded from Kaggle or retrieved from the provided GitHub repository.
Import the pandas, matplotlib, and seaborn libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the .csv into a dataframe using read_csv. Subset the dataframe to include only the relevant columns:
penguins_data = pd.read_csv("data/penguins_size.csv")
penguins_data = penguins_data[['species','culmen_length_mm']]
Check the first five rows using the head method. Check the number of columns and rows as well as the data types:
penguins_data.head()
penguins_data.shape
penguins_data.dtypes
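As a rough sketch of what these checks return, the snippet below runs them on a tiny stand-in dataframe. The name toy_penguins and its values are made up for illustration and are not taken from the real dataset. It also counts missing values per column, an extra check that is an assumption on my part rather than a step from the original recipe.

```python
import pandas as pd

# Toy stand-in for penguins_data; these values are made up for illustration
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "culmen_length_mm": [39.1, None, 46.5, 47.3, 50.0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy_penguins.shape

# dtypes holds one data type per column
column_types = toy_penguins.dtypes

# Number of missing values in each column (extra check, not in the original steps)
missing_per_column = toy_penguins.isna().sum()

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```

Knowing the data types and the number of missing values up front helps when reading the plots that follow, since missing values are simply left out of them.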
Performing univariate analysis using a histogram
A histogram is a bar-graph-like representation of the frequency distribution of a variable, usually a continuous one. It groups the values into bins and draws one bar per bin, so we can quickly spot outliers, the spread of the data, skewness, and more.
# Create a basic histogram of the culmen length
sns.histplot(data=penguins_data, x="culmen_length_mm")
# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.histplot(data=penguins_data, x="culmen_length_mm")
ax.set_xlabel('Culmen Length in mm', fontsize=15)
ax.set_ylabel('Count of records', fontsize=15)
ax.set_title('Univariate analysis of Culmen Length', fontsize=20)
Performing univariate analysis using a boxplot
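Seaborn's boxplot method creates a boxplot for a single numeric variable, summarizing its descriptive statistics. The components of a boxplot are:
The box: spans the first quartile (Q1) to the third quartile (Q3), which is the interquartile range (IQR).
The line inside the box: the median.
The whiskers: by default, extend to the most extreme values within 1.5 times the IQR of the box.
The points beyond the whiskers: potential outliers.
Here's how you can create a boxplot using Seaborn. This assumes the Amsterdam House Prices data has already been loaded into a dataframe named houseprices_data, in the same way as the penguins data: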
# Create a basic boxplot of the house prices
sns.boxplot(data=houseprices_data, x="Price")
# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.boxplot(data=houseprices_data, x="Price")
ax.set_xlabel('House Prices in millions', fontsize=15)
ax.set_title('Univariate analysis of House Prices', fontsize=20)
# Show plain numbers on the x-axis instead of scientific notation
plt.ticklabel_format(style='plain', axis='x')
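The numbers behind a boxplot can be reproduced by hand. The sketch below computes the quartiles, the IQR, and the 1.5 x IQR fences that Seaborn uses by default for its whiskers, then lists the values that would be drawn as outliers. The prices series is made-up illustrative data, not the real Amsterdam dataset.

```python
import pandas as pd

# Made-up house prices for illustration only
prices = pd.Series(
    [250000, 300000, 320000, 350000, 400000, 450000, 500000, 2000000],
    name="Price",
)

# Quartiles and interquartile range (the box)
q1 = prices.quantile(0.25)
median = prices.quantile(0.50)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences: values outside these limits are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

In this toy series only the 2,000,000 value lies beyond the upper fence, so it is the only point a boxplot would mark as an outlier.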
Performing univariate analysis using a violin plot
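A violin plot is a boxplot with a curve that shows the density and shape of the data. The components of a violin plot are:
The outer curve: a kernel density estimate of the variable, mirrored on both sides, which is wider where values are more frequent.
The inner box: by default, a miniature boxplot showing the median, the interquartile range, and the whiskers.
As before, the code assumes the house prices data is available in houseprices_data: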
# Create a basic violin plot of the house prices
sns.violinplot(data=houseprices_data, x="Price")
# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.violinplot(data=houseprices_data, x="Price")
ax.set_xlabel('House Prices in millions', fontsize=15)
ax.set_title('Univariate analysis of House Prices', fontsize=20)
# Show plain numbers on the x-axis instead of scientific notation
plt.ticklabel_format(style='plain', axis='x')
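One way to see what the violin plot adds is to draw it directly under a boxplot of the same variable. The sketch below does this on a shared x-axis. The toy_prices dataframe holds made-up values for illustration, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up house prices for illustration only
toy_prices = pd.DataFrame({
    "Price": [250000, 300000, 320000, 350000, 400000,
              450000, 500000, 650000, 900000, 2000000],
})

# Two stacked charts that share the same x-axis
fig, (ax_box, ax_violin) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)

# Top: boxplot (quartiles, whiskers, outliers)
sns.boxplot(data=toy_prices, x="Price", ax=ax_box)
ax_box.set_title('Boxplot of toy prices')

# Bottom: violin plot (same summary plus the density curve)
sns.violinplot(data=toy_prices, x="Price", ax=ax_violin)
ax_violin.set_title('Violin plot of toy prices')

# Show plain numbers on the x-axis instead of scientific notation
ax_violin.ticklabel_format(style='plain', axis='x')
fig.tight_layout()
```

The boxplot shows where the quartiles and outliers sit, while the violin plot also shows where the values are concentrated.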
Performing univariate analysis using a summary table
Create a summary table using the describe method:
houseprices_data.describe()
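For numeric columns, describe returns the count, mean, standard deviation, minimum, quartiles, and maximum. The sketch below shows how to read individual values out of that table and how to request other percentiles. The toy_houses dataframe and its Area column are assumptions made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only; the column names are assumptions
toy_houses = pd.DataFrame({
    "Price": [250000, 300000, 400000, 450000, 600000],
    "Area": [50, 64, 80, 95, 120],
})

# One column of statistics per numeric column
summary = toy_houses.describe()
print(summary)

# Individual statistics can be read with .loc[row, column]
mean_price = summary.loc["mean", "Price"]
median_price = summary.loc["50%", "Price"]

# describe on a single column, with custom percentiles
price_summary = toy_houses["Price"].describe(percentiles=[0.1, 0.5, 0.9])
print(price_summary)
```

Comparing the mean with the median (the 50% row) is a quick numeric check for skewness: a mean well above the median suggests a long right tail.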
Performing univariate analysis using a bar chart
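A bar chart does for categorical data what a histogram does for numerical data: it shows how many records fall into each category. Seaborn's countplot and barplot methods both create bar charts; countplot counts the records in each category for us.
Create a bar plot using the countplot method: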
# Create a basic bar chart of the number of records per species
sns.countplot(data=penguins_data, x="species")
# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=penguins_data, x="species")
ax.set_xlabel('Penguin Species', fontsize=15)
ax.set_ylabel('Count of records', fontsize=15)
ax.set_title('Univariate analysis of Penguin Species', fontsize=20)
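The bar heights in a count plot are simply the number of records per category, which pandas can compute with value_counts. The sketch below, on a made-up species series, gets those counts and passes their order to countplot so the bars are sorted from most to least frequent. Sorting the bars is an optional choice here, not part of the original recipe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up species labels for illustration only
species = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    name="species",
)

# Number of records per category, sorted from most to least frequent
counts = species.value_counts()
print(counts)

# Draw the bars in the same order as the sorted counts
ax = sns.countplot(x=species, order=counts.index)
ax.set_xlabel('Penguin Species')
ax.set_ylabel('Count of records')
```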
Performing univariate analysis using a pie chart
Matplotlib's pie method creates pie charts for categorical data. A pie chart shows each category as a slice of the whole, which makes it easy to compare a category's share with the others.
# Group the data using the groupby method in pandas
penguins_group = penguins_data.groupby('species').count()
# Reset the index using the reset_index method to ensure the index isn't the species column
penguins_group = penguins_group.reset_index()
# Create a pie chart using the pie method
plt.pie(penguins_group["culmen_length_mm"], labels=penguins_group['species'])
plt.show()
# Provide some additional details for the chart
cols = ['g', 'b', 'r']
plt.pie(penguins_group["culmen_length_mm"], labels=penguins_group['species'], colors=cols)
plt.title('Univariate Analysis of Species', fontsize=15)
plt.show()
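Slice sizes are easier to read when each slice is labeled with its percentage. The sketch below, on a made-up species series, computes the shares with value_counts and uses the autopct argument of plt.pie to print them on the chart. Note that value_counts counts rows per species, whereas the groupby and count approach above counts non-missing culmen_length_mm values per species, so the two can differ when values are missing.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up species labels for illustration only
species = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    name="species",
)

# Absolute counts and relative shares per category
counts = species.value_counts()
shares = species.value_counts(normalize=True)
print(shares)

# autopct formats each slice's percentage with one decimal place
wedges, texts, autotexts = plt.pie(
    counts, labels=counts.index, autopct="%1.1f%%"
)
plt.title('Share of records per species (toy data)', fontsize=15)
```

Pie charts work best with a handful of categories; with many categories, a bar chart is usually easier to read.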
Conclusion
In this article, we explored various techniques for performing univariate analysis in Python, focusing on individual variables within a dataset. We utilized popular libraries such as pandas, seaborn, and matplotlib to visualize and analyze the data. Techniques such as histograms, boxplots, violin plots, summary tables, bar charts, and pie charts were employed to gain insights into the distribution, outliers, and patterns present in the data. By leveraging these techniques, analysts can efficiently explore and understand the characteristics of their datasets, enabling informed decision-making and further analysis.
The content presented in this article is derived from the book "Exploratory Data Analysis with Python Cookbook" by Ayodele Oluleye. This book offers over 50 recipes that guide readers in analyzing, visualizing, and extracting insights from both structured and unstructured data. If you are interested in exploring further or accessing the code used in this article, you can find all the code examples in the book on the GitHub repository: Exploratory Data Analysis with Python Cookbook. The book serves as a valuable resource for those looking to enhance their skills in data analysis using Python.