A Reference Notebook for (+30) Statistical Charts in Seaborn

Leonardo A.

Data Analyst

Published: April 2, 2021

This tutorial shows how to build graphs that support the data science process. Visualizations are useful during exploratory analysis, before or after processing data: we can construct statistical graphs to analyze datasets, identify relationships between variables, or check how the data is distributed.

We can do all this with Matplotlib; however, Seaborn, a library built on top of Matplotlib, makes statistical graphs much easier to produce. Either way, knowing how to build a visualization, whatever the tool, is of fundamental importance.

Visit the Jupyter Notebook to see all the concepts that we will cover about Data Visualization with Seaborn. Note: important functions, outputs, and terms are in bold to facilitate understanding (at least mine).

Seaborn - Statistical Data Visualization

We will create statistical graphs in Seaborn, adjust their formatting, and make the changes to the data that are needed to plot it correctly.

The Seaborn site has a gallery of example statistical charts with their code: https://seaborn.pydata.org/

Install Seaborn

The command below runs on the operating system; it is the same as opening the terminal (or the command prompt on Windows) and typing pip install seaborn:

!pip install seaborn

The leading exclamation mark tells Jupyter to pass the line to the system shell, so we can install the package without leaving the notebook.

Loading Packages

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

sns.__version__
'0.11.1'

Check Seaborn datasets

Seaborn ships with some sample datasets so we can try the tool. Calling get_dataset_names() returns a list of them:

# Imported datasets with Seaborn
sns.get_dataset_names()

['anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'iris',
 'mpg',
 'planets',
 'tips',
 'titanic']

Load dataset

Load one of the datasets from the list. We are going to work with the famous iris:

# Loading dataset
iris = sns.load_dataset("iris")

Check type

type(iris)

pandas.core.frame.DataFrame

Check first lines

iris.head()

Statistical summary

iris.describe()

Dataset columns

iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'], dtype='object')

Seaborn Statistical Charts

Distplot

This plots a univariate distribution, that is, the distribution of a single variable. We call the distplot function, pass the sepal_length variable from the iris dataset, enable the rug (one tick per observation), and use the fit parameter to overlay a distribution fitted to the data:

sns.distplot(iris.sepal_length, rug = True, fit = stats.gausshyper);

Jointplot

It is a plot for a bivariate distribution, built around the familiar scatterplot. We have one variable on each axis, the positively related data points in the center, and a histogram of each variable's distribution in the margins.

We get a lot of information in a single chart by running just this short command:

# Scatterplot - Bivariate Distribution
sns.jointplot(x = "sepal_length", y = "petal_length", data = iris);

Jointplot Hex

We can also change this joint plot; consult the seaborn documentation and adjust the parameters as desired:

# Useful graph when working with large datasets
with sns.axes_style("white"):
     sns.jointplot(x = "sepal_length",
                   y = "petal_length",
                   data = iris,
                   kind = "hex",
                   color = "k");

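The hex chart represents the same thing as the common scatterplot, only in a different form: the plane is divided into hexagonal bins, and each bin is shaded according to the number of points that fall in it. In this way we can change the layout of the graphics and make them more personalized.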

Density Jointplot

Instead of drawing the individual data points, we can set the kind parameter to "kde". Note that the histograms on the margins have also changed to density curves:

# Bivariate Distribution
sns.jointplot(x = "sepal_length",
              y = "petal_length",
              data = iris,
              kind = "kde");
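
The density curves in the margins are kernel density estimates. As a rough illustration of the idea, and not seaborn's own implementation, the sketch below builds a one-dimensional Gaussian KDE by hand. The sample values and the bandwidth are made up for the example.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)

    return density

sample = [4.9, 5.0, 5.1, 5.8, 6.3, 6.4, 7.0]   # made-up values
density = gaussian_kde(sample, bandwidth=0.4)   # bandwidth chosen by hand

# Riemann sum over a wide grid: the estimated density should integrate to about 1
step = 0.01
grid = [2.0 + i * step for i in range(800)]     # from 2.0 up to just under 10.0
area = sum(density(x) for x in grid) * step
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve.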

Customizing Density Jointplot

To illustrate what seaborn is capable of, here we customize the previous density chart:

# Bivariate Distribution
g = sns.jointplot(x = "sepal_length",
                  y = "petal_length",
                  data = iris,
                  kind = "kde",
                  color = "m")

g.plot_joint(plt.scatter, c = "w", s = 30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0);

Pairplot - Handy graph for few variables

This graph shows the relationships between all numeric variables in the dataset. Note that the only argument passed to the pairplot function is the iris dataset.

The function builds every pairwise combination of the variables by itself, drawing scatterplots off the diagonal and a histogram of each variable on the diagonal:

# Bivariate Distribution
sns.pairplot(iris);

Relationship Visualizations

Load dataset

There are other kinds of charts. To explore them, we will load another dataset, tips:

# Loading tips dataset
tips = sns.load_dataset("tips")

Check type

type(tips)
pandas.core.frame.DataFrame

Check first lines

tips.head()

Statistical summary

tips.describe()

Jointplot: Linear Regression

We pass the x and y variables and set the kind parameter to "reg", for linear regression.

This plot draws the scatterplot and fits a regression line to it (in practice, a simple Machine Learning model), with a shaded band showing the confidence interval around the line. The margins show a histogram of each variable, overlaid with its density curve.

# Scatterplot with regression line - Bivariate Distribution
sns.jointplot(x = "total_bill",
              y = "tip",
              data = tips,
              kind = "reg");

Lmplot

Instead of the joint plot with its histograms, we draw only the scatterplot with the regression line:

# Linear regression (uses a 95% confidence interval by default)
sns.lmplot(x = "total_bill",
           y = "tip",
           data = tips);
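
To make the regression line less of a black box, here is a minimal sketch of ordinary least squares for a single predictor. It is an illustration of the technique, not seaborn's code, and the bills and tips are made-up numbers rather than rows from the tips dataset.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up bills and tips
bills = [10.0, 20.0, 30.0, 40.0]
tip_amounts = [2.0, 3.0, 5.0, 6.0]

slope, intercept = least_squares_line(bills, tip_amounts)
predicted_tip = slope * 25.0 + intercept   # tip the line predicts for a bill of 25
```

For these numbers the slope is 0.14 and the intercept is 0.5, so the line predicts a tip of 4.0 for a bill of 25.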

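We can also tailor the lmplot. When x takes only a few discrete values, as size does, the points pile up in vertical stripes; x_jitter adds a little random noise to the x positions so that they are easier to tell apart: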

sns.lmplot(x = "size",
           y = "tip",
           data = tips,
           x_jitter = .05);
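
The effect of x_jitter can be imitated in a few lines. The helper below is hypothetical (the name jitter and the fixed seed are assumptions for the example); it simply adds uniform noise of at most the given width to each value.

```python
import random

def jitter(values, width, seed=0):
    """Return a copy of `values` with uniform noise in [-width, width] added."""
    rng = random.Random(seed)   # fixed seed so the example is repeatable
    return [v + rng.uniform(-width, width) for v in values]

party_sizes = [2, 2, 2, 3, 3, 4]            # made-up discrete x values
jittered = jitter(party_sizes, width=0.05)  # same width as x_jitter above
```

Each jittered value stays within 0.05 of the original, so the points separate visually without changing which group they belong to.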

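Limits

We can collapse the observations at each value of x into a single point, the estimate given by x_estimator (here the mean), drawn with a bar showing its upper and lower confidence limits: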

sns.lmplot(x = "size",
           y = "tip",
           data = tips,
           x_estimator = np.mean);
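
The central points in that chart are just per-group estimates. A small sketch of the grouping step, with a hypothetical helper name and made-up data, and without the confidence bars:

```python
from collections import defaultdict
from statistics import mean

def estimate_per_x(xs, ys, estimator=mean):
    """Apply `estimator` to the y values that share each distinct x value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: estimator(group) for x, group in sorted(groups.items())}

sizes = [2, 2, 3, 3, 3, 4]                    # made-up party sizes
tip_amounts = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0]  # made-up tips

mean_tip_by_size = estimate_per_x(sizes, tip_amounts)
```

Passing a different function, such as statistics.median, plays the role of changing x_estimator.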

Load Dataset

Let's load another seaborn dataset:

# Loading anscombe dataset
anscombe = sns.load_dataset("anscombe")

Query: filter + regression

We can run a kind of query, that is, filter the rows of the dataset before plotting them:

# Non-linear relationship
sns.lmplot(x = "x",
           y = "y",
           data = anscombe.query("dataset == 'II'"),
           ci = None,
           scatter_kws = {"s": 80});

Parameter adjustment

To make the regression curve follow the points, we change the parameters. Setting order = 2 fits a second-degree polynomial instead of a straight line:

# We can adjust the parameters to fit the curve
sns.lmplot(x = "x",
           y = "y",
           data = anscombe.query("dataset == 'II'"),
           order = 2,
           ci = None,
           scatter_kws = {"s": 80});
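
To see why the order matters, the sketch below compares a degree-1 and a degree-2 fit with numpy.polyfit on data that is exactly quadratic. The data is synthetic (an assumption; it is not the Anscombe values), and the helper name is made up for the example.

```python
import numpy as np

x = np.arange(4.0, 15.0)            # 4, 5, ..., 14
y = -0.1 * (x - 11.0) ** 2 + 9.0    # synthetic, exactly quadratic

def max_abs_residual(order):
    """Largest gap between the data and a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, deg=order)
    fitted = np.polyval(coeffs, x)
    return float(np.max(np.abs(y - fitted)))

linear_error = max_abs_residual(1)      # a straight line cannot follow the curve
quadratic_error = max_abs_residual(2)   # a parabola matches it almost exactly
```

The quadratic fit leaves essentially no residual, while the straight line misses the ends and the middle of the curve.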

View outliers

We can see the points that deviate from the regular pattern of the data:

# Visualizing outliers
sns.lmplot(x = "x",
           y = "y",
           data = anscombe.query("dataset == 'III'"),
           ci = None,
           scatter_kws = {"s": 80});

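Nonlinear relationship

We can also represent a nonlinear relationship, that is, one in which a change in one variable is not matched by a proportional change in the other. With lowess = True, lmplot draws a locally weighted smoothing curve instead of a straight line (this option requires the statsmodels package):

# Using a lowess smoother for variables with a non-linear relationship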
sns.lmplot(x = "total_bill",
           y = "tip",
           data = tips,
           lowess = True);

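Different pieces of information

Another lmplot, this time showing several pieces of information at once. The hue parameter colors the points by smoker and fits a separate regression line for each group: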

# Using more than 2 variables
sns.lmplot(x = "total_bill",
           y = "tip",
           hue = "smoker",
           data = tips);

Customize chart

We can change the configuration and customize the parameters to make the difference between the groups more evident:

# Changing the chart setting
sns.lmplot(x = "total_bill",
           y = "tip",
           hue = "smoker",
           data = tips, markers = ["o", "x"],
           palette = "Set1");

Split area

We can also divide the drawing area, which is called the plot area. Above, we had a single chart in the plot area; below, we have two.

Both charts use the same variables, tip on the y-axis and total_bill on the x-axis. The col parameter splits the data by time, so one chart shows the bills at lunch and the other the bills at dinner.

# Dividing the drawing area
sns.lmplot(x = "total_bill",
           y = "tip",
           hue = "smoker",
           col = "time",
           data = tips);

We could then use these charts as models to make predictions, that is, to estimate what tip to expect for a given bill at lunch or at dinner.

Divide areas with more variables

Here we split by one more variable, so now we have four blocks of data. In addition to lunch and dinner, we separate the bills of men and women.

# Dividing the drawing area
sns.lmplot(x = "total_bill",
           y = "tip",
           hue = "smoker",
           col = "time",
           row = "sex",
           data = tips);

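Split area

We can do one more split. Now we set col to day, so there is one chart per day of the week. The col_wrap parameter wraps the charts onto a new row after every two columns:

# Dividing the drawing area
sns.lmplot(x = "total_bill",
           y = "tip",
           col = "day",
           data = tips,
           col_wrap = 2,
           height = 3);  # height replaces the size parameter of older seaborn releases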

Split horizontal area

See that the days of the week are now all side by side. In the previous chart, col_wrap placed them in a grid of quadrants; here we drop it and change the aspect parameter to get a horizontal layout:

# Dividing the drawing area
sns.lmplot(x = "total_bill",
           y = "tip",
           col = "day",
           data = tips,
           aspect = .5);

Working with Categorical variables

Now we'll look at some charts for visualizing categorical variables, such as strings. Until now, we've seen graphs for numeric variables.

Stripplot

Now we want the total bill per day of the week. The day of the week is a categorical variable, so we have to represent it differently.

# stripplot
sns.stripplot(x = "day",
              y = "total_bill",
              data = tips);

Customize Stripplot

We can make slight modifications. The jitter parameter controls how far the points are spread horizontally within each category:

# stripplot
sns.stripplot(x = "day",
              y = "total_bill",
              data = tips,
              jitter = True);

Swarmplot

It is like the previous chart, but the points are arranged so that they do not overlap.

# swarmplot - avoids overlapping points within each category
sns.swarmplot(x = "day",
              y = "total_bill",
              data = tips);

Boxplot

One of the most famous charts in statistics, often used with categorical variables. The individual points represent the outliers, that is, values that fall far from the rest of the data:

# boxplot
sns.boxplot(x = "day",
            y = "total_bill",
            hue = "time",
            data = tips);
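
What counts as an outlier point follows a simple rule: by default, anything more than 1.5 times the interquartile range beyond the box. The sketch below reproduces that rule with the standard library; the helper name and the bills are made up, and it is an illustration rather than the code seaborn runs.

```python
from statistics import quantiles

def box_stats(values, whisker=1.5):
    """Quartiles of `values` and the points beyond the whisker fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # height of the box
    low_fence = q1 - whisker * iqr     # points below this are drawn as outliers
    high_fence = q3 + whisker * iqr    # points above this are drawn as outliers
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

bills = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up values with one extreme bill
summary = box_stats(bills)
```

Here the box spans 3.25 to 7.75 with the median at 5.5, and only the bill of 100 falls outside the fences.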

Modify Boxplot

We can make a change in the orientation of the boxplot. We will see it horizontally:

# boxplot
sns.boxplot(data = iris,
            orient = "h");

Violin plot

# violinplot
sns.violinplot(x = "total_bill",
               y = "day",
               hue = "time",
               data = tips);

Customizing Violin plot

We can do some tailoring to narrow the violins:

# violinplot
sns.violinplot(x = "total_bill",
               y = "day",
               hue = "time",
               data = tips,
               bw = .1,
               scale = "count",
               scale_hue = False);

Violin plot

Vertical orientation, with split = True drawing one half of each violin per sex:

# violinplot
sns.violinplot(x = "day",
               y = "total_bill",
               hue = "sex",
               data = tips,
               split = True);

Barplot

Another commonly used chart for categorical variables:

# barplot
sns.barplot(x = "day",
            y = "total_bill",
            hue = "sex",
            data = tips);

Countplot

Graph for counting elements for each day of the week:

# countplot
sns.countplot(x = "day",
              data = tips,
              palette = "Greens_d");

Countplot

We can customize the countplot by making it horizontal and showing the count per sex for each day of the week:

# countplot
sns.countplot(y = "day",
              hue = "sex",
              data = tips,
              palette = "Greens_d");
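
The bar lengths in a countplot are plain tallies. A minimal sketch with collections.Counter, on made-up rows standing in for the day and sex columns of tips:

```python
from collections import Counter

# Made-up (day, sex) rows; not taken from the tips dataset
rows = [("Sat", "Male"), ("Sat", "Female"), ("Sun", "Male"),
        ("Sat", "Male"), ("Thur", "Female"), ("Sun", "Male")]

per_day = Counter(day for day, _ in rows)   # tallies behind a countplot of day
per_day_and_sex = Counter(rows)             # tallies once hue = "sex" is added
```

Adding hue only changes the key being counted, from the day alone to the (day, sex) pair.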

Continuous Countplot

Here we have an example of counting with bars in a single color, one bar per day, drawn on a figure whose size we set ourselves:

# countplot
f, ax = plt.subplots(figsize = (7, 3))
sns.countplot(y = "day",
              data = tips,
              color = "c");

Point plot

Another graph that we can use for categorical variables, here showing the average total bill by sex, with one line per smoker group:

# pointplot
sns.pointplot(x = "sex",
              y = "total_bill",
              hue = "smoker",
              data = tips);

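Factorplot

It is a modified version of the point plot above, drawn on a grid of facets. Recent seaborn releases name this function catplot, and kind = "point" reproduces the old factorplot default:

# catplot (named factorplot in older seaborn releases)
sns.catplot(x = "day",
            y = "total_bill",
            hue = "smoker",
            kind = "point",
            data = tips);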

Viewing Pandas DataFrames in Seaborn

Finally, we'll look at Pandas DataFrames to generate statistical charts with Seaborn.

Import libraries

import random
import pandas as pd

Create empty DataFrame

df = pd.DataFrame()

Creating ranges

We'll draw random samples of values and place them in two columns of the DataFrame:

df['x'] = random.sample(range(1, 100), 25)
df['y'] = random.sample(range(1, 100), 25)

df.head()

Scatterplot

We will create a scatterplot from the dataset created in Pandas, passing x and y, indicating the dataset as df, and setting fit_reg to False because we want only the data points, without the regression line:

# Scatterplot
sns.lmplot(x = "x",
           y = "y",
           data = df,
           fit_reg = False)

There is no relationship. There is no visible tendency.
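
"No visible tendency" can also be checked with a number. The sketch below computes the Pearson correlation coefficient by hand for two independent random samples drawn the same way as df['x'] and df['y'] (the seed and the helper name are assumptions for the example), and for a perfectly linear pair as a contrast.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)                    # seeded so the example is repeatable
xs = rng.sample(range(1, 100), 25)
ys = rng.sample(range(1, 100), 25)

r_random = pearson_r(xs, ys)                          # independent samples
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])     # exact linear relationship
```

For independent samples the coefficient tends to sit near 0, although with only 25 points it can stray noticeably; an exact linear relationship gives 1.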

Density Plot - kdeplot

A graph that shows the estimated density curve of the variable y:

# Density Plot
sns.kdeplot(df.y)

Distplot

We will now see the density of x with the distplot:

# Distplot
sns.distplot(df.x)

Rugplot

We can also apply a histogram with the rugs. Rugs are the points that appear at the base of the chart:

# Histogram
plt.hist(df.x, alpha = .3)
sns.rugplot(df.x);

Boxplot

The boxplot is always valuable for quickly visualizing the median, the quartiles, and any outliers:

# Boxplot
sns.boxplot(data = df[["y", "x"]])

Heatmap

We can create a heat map, in which the intensity of the color of each cell reflects the value it holds:

# Heatmap
sns.heatmap([df.y, df.x], annot = True, fmt = "d")

Cluster map

It is a heat map whose rows and columns are reordered by hierarchical clustering, a technique widely used in unsupervised learning:

# Clustermap
sns.clustermap(df)

With this, we have built a practical reference for applying statistical graphs in exploratory analysis.

And there we have it. I hope you have found this useful. Thank you for reading.

Anello – Medium
