Performing Univariate Analysis in Python

This article covers techniques for analyzing individual variables in a dataset: histograms, boxplots, violin plots, summary tables, bar charts, and pie charts. These techniques help in understanding the distribution, outliers, and patterns within the data.

We'll need two datasets: the Amsterdam House Prices data and the Palmer Archipelago Penguins data. Both can be downloaded from Kaggle or retrieved from the provided GitHub repository.

Import the pandas, matplotlib, and seaborn libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns        

Load the .csv into a dataframe using read_csv. Subset the dataframe to include only the relevant columns:

penguins_data = pd.read_csv("data/penguins_size.csv")
penguins_data = penguins_data[['species','culmen_length_mm']]        

Check the first five rows using the head method. Check the number of columns and rows as well as the data types:

penguins_data.head()
penguins_data.shape
penguins_data.dtypes        
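
As a small sketch of what these inspection calls return, the snippet below uses a toy dataframe in place of the real file. The values in toy are made up for illustration and are not taken from the penguins dataset; the isna().sum() check is an extra step not used in the article.

```python
import pandas as pd

# Toy stand-in for the penguins data; these values are made up for illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "culmen_length_mm": [39.1, None, 46.1, 49.5],
})

n_rows, n_cols = toy.shape      # shape is a (rows, columns) tuple
col_types = toy.dtypes          # one dtype per column
missing = toy.isna().sum()      # number of missing values per column

print(toy.head())
print(n_rows, n_cols)
print(col_types)
print(missing)
```

Checking for missing values early is useful because several of the plots and counts later on silently skip them.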

Performing univariate analysis using a histogram

A histogram is a bar-graph-like representation of the frequency distribution of a variable, usually a continuous one. It splits the range of values into bins and shows how many records fall into each. With the histogram, we can quickly identify outliers, data spread, skewness, and more.

# Basic histogram of culmen length
sns.histplot(data=penguins_data, x="culmen_length_mm")

# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.histplot(data=penguins_data, x="culmen_length_mm")
ax.set_xlabel('Culmen Length in mm', fontsize=15)
ax.set_ylabel('Count of records', fontsize=15)
ax.set_title('Univariate analysis of Culmen Length', fontsize=20)
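
The numbers behind a histogram can also be inspected directly. The following is a minimal sketch that uses NumPy's histogram function and pandas' skew method on a small made-up series; the values and the choice of three bins are assumptions for illustration, and seaborn picks its own bins by default.

```python
import numpy as np
import pandas as pd

# Illustrative values only; one large value creates a long right tail.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 10], dtype="float64")

# Bin the data into three equal-width bins and count the records in each.
counts, edges = np.histogram(values.dropna(), bins=3)

# Sample skewness: positive values indicate a longer right tail.
skewness = values.skew()

print(counts)    # records per bin
print(edges)     # bin boundaries
print(skewness)
```

A bar far away from the rest, like the last bin here, is the kind of pattern that hints at outliers in the plotted histogram.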

Performing univariate analysis using a boxplot

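A boxplot summarizes a single numeric variable through its descriptive statistics, and seaborn's boxplot function draws one for us. The components of a boxplot are:

  • The box: This represents the interquartile range (IQR), from the first quartile (Q1) to the third quartile (Q3). The line inside the box marks the median.
  • The whisker limits: The lower and upper limits are calculated as Q1 - 1.5(IQR) and Q3 + 1.5(IQR) respectively. Each whisker extends to the most extreme data point that still lies within these limits.
  • The circles: These show the outliers, the data points that fall below the lower limit or above the upper limit.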

Here's how you can create a boxplot using Seaborn:

# Assumes houseprices_data has already been loaded with pd.read_csv,
# in the same way as the penguins data above
sns.boxplot(data=houseprices_data, x="Price")

# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.boxplot(data=houseprices_data, x="Price")
ax.set_xlabel('House Prices in millions', fontsize=15)
ax.set_title('Univariate analysis of House Prices', fontsize=20)
# Show plain numbers on the x axis instead of scientific notation
plt.ticklabel_format(style='plain', axis='x')
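
The whisker limits described above can also be computed by hand. This is a minimal sketch using pandas quantiles; the helper name whisker_limits and the sample prices are hypothetical and only serve to illustrate the Q1 - 1.5(IQR) and Q3 + 1.5(IQR) rule.

```python
import pandas as pd

def whisker_limits(values: pd.Series):
    """Return (lower whisker, upper whisker, outliers) using the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    # Everything outside the fences is drawn as an individual outlier marker.
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return inside.min(), inside.max(), outliers

# Hypothetical prices, not taken from the Amsterdam dataset.
prices = pd.Series([200, 250, 300, 350, 400, 450, 500, 2000], dtype="float64")
low, high, outliers = whisker_limits(prices)
print(low, high, outliers.tolist())
```

With the real data, the same function could be called as whisker_limits(houseprices_data["Price"]) to list the records that the boxplot shows as outliers.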


Performing univariate analysis using a violin plot

A violin plot combines a boxplot with a kernel density curve that shows the density and shape of the data. The components of a violin plot are:

  • The thick line: This represents the interquartile range.
  • The white dot: This represents the median.
  • The thin line: This is the same as the upper and lower whisker limits of the boxplot that represent the range of values in our dataset that are not outliers. The lower and upper limits are calculated as Q1 - 1.5(IQR) and Q3 + 1.5(IQR).
  • The kernel density plot: This displays the shape of the distribution.

sns.violinplot(data=houseprices_data, x="Price")

# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.violinplot(data=houseprices_data, x="Price")
ax.set_xlabel('House Prices in millions', fontsize=15)
ax.set_title('Univariate analysis of House Prices', fontsize=20)
plt.ticklabel_format(style='plain', axis='x')
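
To make the kernel density part of the violin plot more concrete, here is a rough sketch of a one-dimensional Gaussian kernel density estimate written with NumPy. The function name gaussian_kde_1d, the sample values, and the use of Scott's rule of thumb for the bandwidth are assumptions for illustration; seaborn's own smoothing may differ in its details.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of samples on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb: sample standard deviation times n^(-1/5).
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the curve integrates to about 1.
    return kernels.sum(axis=1) / (n * bandwidth)

# Illustrative values only.
samples = np.array([1.0, 2.0, 2.5, 3.0, 7.0])
grid = np.linspace(-10, 20, 2001)
density = gaussian_kde_1d(samples, grid)

# Riemann-sum check that the estimated density integrates to roughly 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Each data point contributes one small bell curve, and the violin outline is the sum of those curves mirrored around the axis.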

Performing univariate analysis using a summary table

Create a summary table using the describe method:

houseprices_data.describe()        
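
As a small illustration of what describe returns, the sketch below runs it on two made-up series, one numeric and one categorical. The values are hypothetical and chosen so the statistics are easy to verify by hand.

```python
import pandas as pd

# Hypothetical numeric values.
prices = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0], name="Price")
summary = prices.describe()

# The quartiles in the summary give the interquartile range directly.
iqr = summary["75%"] - summary["25%"]

# For a text column, describe reports count, unique, top, and freq instead.
species = pd.Series(["Adelie", "Adelie", "Gentoo"], name="species")
cat_summary = species.describe()

print(summary)
print(iqr)
print(cat_summary)
```

On a dataframe, describe summarizes only the numeric columns by default; passing include='all' adds the non-numeric columns as well.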

Performing univariate analysis using a bar chart

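A bar chart is to categorical data what a histogram is to numerical data: it shows how many records fall into each category. Seaborn provides the countplot and barplot functions for bar charts; countplot counts the records in each category for us.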

Create a bar plot using the countplot method:

sns.countplot(data=penguins_data, x="species")

# Provide some additional details for the chart
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=penguins_data, x="species")
ax.set_xlabel('Penguin Species', fontsize=15)
ax.set_ylabel('Count of records', fontsize=15)
ax.set_title('Univariate analysis of Penguin Species', fontsize=20)
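
The heights of the bars in a count plot correspond to the per-category counts, which pandas can report directly with value_counts. A minimal sketch on made-up labels:

```python
import pandas as pd

# Illustrative labels only, not the real penguins data.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"])

# Number of records per category, sorted from most to least frequent.
counts = species.value_counts()

# The same counts expressed as proportions of the total.
shares = species.value_counts(normalize=True)

print(counts)
print(shares)
```

With the real data, penguins_data['species'].value_counts() would give the numbers behind the chart above.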

Performing univariate analysis using a pie chart

A pie chart shows each category's share of the whole, which makes it easy to compare categories against one another. Matplotlib's pie function draws a pie chart from the per-category counts.

# Group the data using the groupby method in pandas
# (count() tallies the non-missing culmen_length_mm values per species)
penguins_group = penguins_data.groupby('species').count()

# Reset the index using the reset_index method to ensure the index isn't the species column
penguins_group = penguins_group.reset_index()

# Create a pie chart using the pie method
plt.pie(penguins_group["culmen_length_mm"], labels=penguins_group['species'])
plt.show()

# Provide some additional details for the chart
cols = ['g', 'b', 'r']
plt.pie(penguins_group["culmen_length_mm"], labels=penguins_group['species'], colors=cols)
plt.title('Univariate Analysis of Species', fontsize=15)
plt.show()
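
One detail worth knowing about the grouping step: count() counts only non-missing values, so a species with missing culmen lengths gets a slightly smaller slice than its number of rows would suggest. The sketch below contrasts count() with size() on a toy dataframe; the values are made up for illustration.

```python
import pandas as pd

# Toy data with one missing measurement; values are illustrative only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "culmen_length_mm": [39.1, None, 46.1, 50.0, 48.7],
})

# count() ignores missing values in the counted column.
non_null = toy.groupby("species")["culmen_length_mm"].count()

# size() counts every row in the group, missing values included.
all_rows = toy.groupby("species").size()

# Each species' share of all rows, as a percentage.
percent = (all_rows / all_rows.sum() * 100).round(1)

print(non_null)
print(all_rows)
print(percent)
```

To print such percentages on the slices themselves, plt.pie accepts an autopct format string such as '%1.1f%%'.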


Conclusion

In this article, we explored various techniques for performing univariate analysis in Python, focusing on individual variables within a dataset. We utilized popular libraries such as pandas, seaborn, and matplotlib to visualize and analyze the data. Techniques such as histograms, boxplots, violin plots, summary tables, bar charts, and pie charts were employed to gain insights into the distribution, outliers, and patterns present in the data. By leveraging these techniques, analysts can efficiently explore and understand the characteristics of their datasets, enabling informed decision-making and further analysis.


The content presented in this article is derived from the book "Exploratory Data Analysis with Python Cookbook" by Ayodele Oluleye. This book offers over 50 recipes that guide readers in analyzing, visualizing, and extracting insights from both structured and unstructured data. If you are interested in exploring further or accessing the code used in this article, you can find all the code examples in the book on the GitHub repository: Exploratory Data Analysis with Python Cookbook. The book serves as a valuable resource for those looking to enhance their skills in data analysis using Python.

