Performing Univariate Analysis in Python

Mahsa Salimi

Business Intelligence Developer | Data Analyst | SQL, Power BI, Python Expert | Let's Turn Numbers into Success!

Published: March 9, 2024

This article covers techniques for univariate analysis in Python, that is, analyzing one variable of a dataset at a time. It walks through histograms, boxplots, violin plots, summary tables, bar charts, and pie charts. These techniques help in understanding the distribution, outliers, and patterns within the data.

We'll need two datasets: the Amsterdam House Prices data and the Palmer Archipelago Penguins data. Both can be downloaded from Kaggle or retrieved from the provided GitHub repository.

Import the pandas, matplotlib, and seaborn libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the .csv into a dataframe using read_csv. Subset the dataframe to include only the relevant columns:

penguins_data = pd.read_csv("data/penguins_size.csv")
penguins_data = penguins_data[['species','culmen_length_mm']]

Check the first five rows using the head method. Check the number of columns and rows as well as the data types:

penguins_data.head()
penguins_data.shape
penguins_data.dtypes

Performing univariate analysis using a histogram

A histogram is a bar graph-like representation that provides insights into our dataset’s underlying frequency distribution, usually a continuous dataset. With the histogram, we can quickly identify outliers, data spread, skewness, and more.
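To make the idea of a frequency distribution concrete, here is a minimal sketch of the binning a histogram performs, written with plain Python so the arithmetic is visible. The values and the helper name bin_counts are made up for illustration; seaborn chooses its own bin edges, so its output on real data will differ.

```python
# Illustrative values only; these are not taken from the penguins dataset.
values = [30.0, 33.5, 36.0, 38.2, 39.5, 41.0, 42.3, 44.8, 45.1, 47.6, 52.0, 60.0]


def bin_counts(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the data contains at least two distinct values.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in data:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


edges, counts = bin_counts(values, n_bins=3)

# Print a text histogram: one row per bin, one '#' per record.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Each bar's height is simply the number of records in that interval, which is why the shape of the bars reveals spread and skewness.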

sns.histplot(data=penguins_data, x="culmen_length_mm")

#Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.histplot(data=penguins_data, x="culmen_length_mm")
ax.set_xlabel('Culmen Length in mm', fontsize=15)
ax.set_ylabel('Count of records', fontsize=15)
ax.set_title('Univariate analysis of Culmen Length', fontsize=20)

Performing univariate analysis using a boxplot

Seaborn's boxplot method draws a boxplot, which summarizes the descriptive statistics of a single numeric variable. The components of a boxplot are:

The box: This represents the interquartile range (IQR), from the first quartile (Q1) to the third quartile (Q3). The line inside the box marks the median.
The whisker limits: The whiskers extend to the most extreme values that lie within Q1 - 1.5(IQR) and Q3 + 1.5(IQR) for the lower and upper whisker respectively.
The circles: These show the outliers, values that fall below the lower limit or above the upper limit.
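The whisker and outlier rules above can be worked through by hand. The sketch below uses only the standard library and a made-up list of prices; it assumes quartiles are computed by linear interpolation between data points, which may not match every plotting library's convention exactly.

```python
import statistics

# Illustrative prices only; these are not taken from the Amsterdam dataset.
prices = [250, 300, 320, 350, 400, 420, 450, 500, 550, 1500]

# method="inclusive" interpolates linearly between the observed data points.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a value is treated as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the limits.
inside = [p for p in prices if lower_limit <= p <= upper_limit]
lower_whisker, upper_whisker = min(inside), max(inside)

# Everything outside the limits would be drawn as an individual point.
outliers = [p for p in prices if p < lower_limit or p > upper_limit]

print("Q1, median, Q3:", q1, median, q3)
print("Limits:", lower_limit, upper_limit)
print("Whiskers:", lower_whisker, upper_whisker)
print("Outliers:", outliers)
```

Note that the whiskers end at actual observations (250 and 550 here), not at the computed limits themselves.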

Here's how you can create a boxplot using Seaborn, assuming the Amsterdam house prices data has already been loaded into a dataframe named houseprices_data that has a Price column:

sns.boxplot(data=houseprices_data, x="Price")

#Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.boxplot(data=houseprices_data, x="Price")
ax.set_xlabel('House Prices in millions', fontsize=15)
ax.set_title('Univariate analysis of House Prices', fontsize=20)
#Show full numbers on the x-axis instead of scientific notation
plt.ticklabel_format(style='plain', axis='x')

Performing univariate analysis using a violin plot

A violin plot combines a boxplot with a kernel density curve that shows the shape of the distribution. The components of a violin plot are:

The thick line: This represents the interquartile range.
The white dot: This represents the median.
The thin line: This is the same as the upper and lower whisker limits of the boxplot that represent the range of values in our dataset that are not outliers. The lower and upper limits are calculated as Q1 - 1.5(IQR) and Q3 + 1.5(IQR).
The kernel density plot: This displays the shape of the distribution.
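For readers curious about what the kernel density curve is, the following is a rough sketch of a Gaussian kernel density estimate in plain Python. The sample values, the helper name make_kde, and the hand-picked bandwidth are all assumptions for illustration; seaborn selects a bandwidth automatically, so its curve will not match this one exactly.

```python
import math

# Illustrative values only; these are not taken from the Amsterdam dataset.
sample = [2.0, 2.5, 3.0, 3.2, 3.8, 4.1, 6.5]


def make_kde(data, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump centred on each observation, summed and normalised.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# The bandwidth is chosen by hand here purely for illustration.
density = make_kde(sample, bandwidth=0.5)

# Evaluate the curve on a grid from 0.0 to 9.0 in steps of 0.1.
step = 0.1
grid = [i * step for i in range(91)]
curve = [density(x) for x in grid]

# A density should integrate to roughly 1 (simple Riemann sum).
area = sum(curve) * step
print(f"Approximate area under the curve: {area:.3f}")
```

The width of the violin at any value is proportional to this density, so wide sections mark where the data is concentrated.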

sns.violinplot(data=houseprices_data, x="Price")

#Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.violinplot(data=houseprices_data, x="Price")
ax.set_xlabel('House Prices in millions', fontsize=15)
ax.set_title('Univariate analysis of House Prices', fontsize=20)
plt.ticklabel_format(style='plain', axis='x')

Performing univariate analysis using a summary table

Create a summary table using the describe method:

houseprices_data.describe()
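For a numeric column, describe reports the count, mean, standard deviation, minimum, quartiles, and maximum. As a rough sketch of where those numbers come from, the snippet below computes the same statistics with the standard library on a made-up list of prices. It assumes the sample standard deviation and linearly interpolated quartiles, which is my understanding of the pandas defaults.

```python
import statistics

# Illustrative prices only; these are not taken from the Amsterdam dataset.
prices = [250, 300, 320, 350, 400, 420, 450, 500, 550, 1500]

# Quartiles by linear interpolation between the observed data points.
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")

summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "std": statistics.stdev(prices),  # sample standard deviation (n - 1)
    "min": min(prices),
    "25%": q1,
    "50%": q2,
    "75%": q3,
    "max": max(prices),
}

for name, value in summary.items():
    print(f"{name:>5}  {value:10.2f}")
```

Comparing the mean with the 50% row (the median) is a quick skewness check: here the single large value pulls the mean well above the median.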

Performing univariate analysis using a bar chart

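A bar chart does for categorical data what a histogram does for numerical data: it shows how many records fall into each category. Seaborn provides two methods for bar charts. countplot counts the records in each category for you, while barplot plots a numeric value per category (by default, the mean).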
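Under the hood, a count-based bar chart is just a tally per category. Here is a small sketch of that tally using the standard library; the species labels are made up and their counts do not reflect the real penguins dataset.

```python
from collections import Counter

# Illustrative labels only; the counts do not reflect the real penguins dataset.
species = [
    "Adelie", "Gentoo", "Adelie", "Chinstrap",
    "Gentoo", "Adelie", "Gentoo", "Adelie",
]

# Tally how many records belong to each category.
counts = Counter(species)

# Print a text bar chart, most frequent category first.
for name, count in counts.most_common():
    print(f"{name:<10} {'#' * count} ({count})")
```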

Create a bar plot using the countplot method:

sns.countplot(data=penguins_data, x="species")

#Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=penguins_data, x="species")
ax.set_xlabel('Penguin Species', fontsize=15)
ax.set_ylabel('Count of records', fontsize=15)
ax.set_title('Univariate analysis of Penguin Species', fontsize=20)

Performing univariate analysis using a pie chart

Matplotlib's pie method creates pie charts for categorical data. Each slice shows one category's share of the total, which makes it easy to compare a category against the others.
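The size of each slice follows directly from the category counts. The sketch below shows the arithmetic with the standard library; the species labels are made up for illustration and do not reflect the real penguins dataset.

```python
from collections import Counter
import math

# Illustrative labels only; the counts do not reflect the real penguins dataset.
species = [
    "Adelie", "Gentoo", "Adelie", "Chinstrap",
    "Gentoo", "Adelie", "Gentoo", "Adelie",
]

counts = Counter(species)
total = sum(counts.values())

# Each category's share of the whole, and the slice angle that share occupies.
shares = {name: count / total for name, count in counts.items()}
angles = {name: 360 * share for name, share in shares.items()}

for name in counts:
    print(f"{name:<10} {shares[name]:6.1%}  {angles[name]:6.1f} degrees")
```

Because the shares always sum to 1, a pie chart only makes sense when the categories are mutually exclusive and together cover every record.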

#Group the data using the groupby method in pandas
penguins_group = penguins_data.groupby('species').count()

#Reset the index using the reset_index method to ensure the index isn’t the species column
penguins_group= penguins_group.reset_index()

#Create a pie chart using the pie method
plt.pie(penguins_group["culmen_length_mm"], labels=penguins_group['species'])
plt.show()

#Provide some additional details for the chart
cols = ['g', 'b', 'r']
plt.pie(penguins_group["culmen_length_mm"], labels=penguins_group['species'], colors=cols)
plt.title('Univariate Analysis of Species', fontsize=15)
plt.show()

Conclusion

In this article, we explored various techniques for performing univariate analysis in Python, focusing on individual variables within a dataset. We utilized popular libraries such as pandas, seaborn, and matplotlib to visualize and analyze the data. Techniques such as histograms, boxplots, violin plots, summary tables, bar charts, and pie charts were employed to gain insights into the distribution, outliers, and patterns present in the data. By leveraging these techniques, analysts can efficiently explore and understand the characteristics of their datasets, enabling informed decision-making and further analysis.

The content presented in this article is derived from the book "Exploratory Data Analysis with Python Cookbook" by Ayodele Oluleye. This book offers over 50 recipes that guide readers in analyzing, visualizing, and extracting insights from both structured and unstructured data. If you are interested in exploring further or accessing the code used in this article, you can find all the code examples in the book on the GitHub repository: Exploratory Data Analysis with Python Cookbook. The book serves as a valuable resource for those looking to enhance their skills in data analysis using Python.
