A Reference Notebook for (+30) Statistical Charts in Seaborn
This tutorial shows how to build charts that support the data science process. Visualizations help during exploratory analysis, before or after processing data: we can construct statistical graphs to analyze datasets, identify relationships between variables, or verify how data is distributed.
We can do all this with Matplotlib; however, Seaborn is built on top of it and makes statistical graphs much easier to produce. Whatever the tool, knowing how to create a visualization is of fundamental importance.
Visit the Jupyter Notebook to see all the concepts that we will cover about Data Visualization with Seaborn. Note: Important functions, outputs, and terms are in bold to facilitate understanding, at least mine.
Seaborn - Statistical Data Visualization
We will create statistical graphs in Seaborn, adjust their formatting, and make the necessary adjustments to the data so that it plots correctly.
Gallery with a series of examples of statistical charts and their respective code: https://seaborn.pydata.org/
Install Seaborn
The command runs on the operating system; it is the same as opening the terminal, or the command prompt on Windows, and typing pip install seaborn:
!pip install seaborn
The leading exclamation mark tells Jupyter to pass the command to the operating system shell, so we can install the package without leaving the notebook.
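As a quick check after installing, the sketch below asks Python whether a package can be found in the environment the notebook is running in. The helper name is_installed is only an illustration; it is not part of Jupyter or pip. In recent IPython versions, the %pip install seaborn magic is an alternative to !pip that installs into the environment of the running kernel.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the package can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# The interpreter behind the running kernel; pip must install into this one
print(sys.executable)
print("seaborn installed:", is_installed("seaborn"))
```

If the last line prints False right after a successful install, the pip command probably ran against a different Python interpreter than the one shown by sys.executable.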
Loading Packages
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.__version__
'0.11.1'
Check Seaborn datasets
Seaborn ships with some sample datasets so we can try the tool. Calling get_dataset_names() returns a list of them:
# Datasets available through Seaborn
sns.get_dataset_names()
['anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'iris', 'mpg', 'planets', 'tips', 'titanic']
Load dataset
Load one of the datasets from the list. We are going to work with the famous iris:
# Loading dataset
iris = sns.load_dataset("iris")
Check type
type(iris)
pandas.core.frame.DataFrame
Check first lines
iris.head()
Statistical summary
iris.describe()
Dataset columns
iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
Seaborn Statistical Charts
Distplot
It plots a univariate distribution, that is, the distribution of a single variable. We call the distplot function, pass the variable sepal_length from the iris dataset, enable the rug (a small tick on the axis for each observation), and use the fit parameter to overlay a distribution fitted to the data:
sns.distplot(iris.sepal_length, rug = True, fit = stats.gausshyper);
Jointplot
It plots a bivariate distribution: the familiar scatterplot in the center, with one variable on each axis, plus a histogram on each margin showing the frequency distribution of that variable. Here the points show a positive relationship between the two variables.
We get a lot of information on a single chart by running only this short command:
# Scatterplot - Bivariate Distribution
sns.jointplot(x = "sepal_length", y = "petal_length", data = iris);
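A positive relationship like the one seen in the scatterplot can also be put into a number with the Pearson correlation coefficient. The sketch below uses synthetic data (the column names x and y and the values are assumptions for illustration) and the pandas Series.corr method.

```python
import numpy as np
import pandas as pd

# Synthetic data in which y grows with x, plus some noise
rng = np.random.default_rng(1)
x = rng.uniform(4.0, 8.0, size=100)
noise = rng.normal(0.0, 0.5, size=100)
df_demo = pd.DataFrame({"x": x, "y": 2.0 * x + noise})

# Pearson correlation: close to +1 for a strong positive linear relationship
r = df_demo["x"].corr(df_demo["y"])
print(round(r, 3))
```

On the iris data, the same call would be iris["sepal_length"].corr(iris["petal_length"]).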
Jointplot Hex
We can also change this joint plot; consult the seaborn documentation and adjust the parameters as desired:
# Useful graph when working with large datasets
with sns.axes_style("white"):
    sns.jointplot(x = "sepal_length", y = "petal_length", data = iris, kind = "hex", color = "k");
The hex chart represents the same thing as the common scatterplot, only in a different form: the points are grouped into hexagonal bins, and darker bins contain more observations. So we can change the layout of the graphics and personalize them.
Density Jointplot
Instead of drawing the data points, we can set the kind parameter to "kde". Note that the histograms on the axes have also been replaced by density curves:
# Bivariate Distribution
sns.jointplot(x = "sepal_length", y = "petal_length", data = iris, kind = "kde");
Customizing Density Jointplot
To show what seaborn is capable of, here we customize the parameters of the previous density chart:
# Bivariate Distribution
g = sns.jointplot(x = "sepal_length", y = "petal_length", data = iris, kind = "kde", color = "m")
g.plot_joint(plt.scatter, c = "w", s = 30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0);
Pairplot - Handy graph for few variables
This graph shows the relationships between all numeric variables in the dataset. Note that the only argument passed to the pairplot function is the iris dataset.
The function builds every pairwise combination of the variables by itself, placing scatterplots off the diagonal and a histogram of each variable on the diagonal:
# Bivariate Distribution
sns.pairplot(iris);
Relationship Visualizations
Load dataset
We have other options for graphics. For this, we will load another dataset, tips:
# Loading tips dataset
tips = sns.load_dataset("tips")
Check type
type(tips)
pandas.core.frame.DataFrame
Check first lines
tips.head()
Statistical summary
tips.describe()
Jointplot - Linear Regression
We pass the variables x and y and set the kind parameter to "reg", for linear regression.
This plot draws the scatterplot with the fitted regression line and a shaded band for its confidence interval. On the margins, it draws the histogram of each variable with a density curve on top, so we can check how the variables are distributed.
# Scatterplot with regression line - Bivariate Distribution
sns.jointplot(x = "total_bill", y = "tip", data = tips, kind = "reg");
Lmplot
Instead of the joint plot with the histograms, we draw only the scatterplot with the regression line:
# Linear Regression (uses 95% confidence interval by default)
sns.lmplot(x = "total_bill", y = "tip", data = tips);
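The regression line in this chart is a least-squares fit, and roughly the same slope and intercept can be computed directly with NumPy. The sketch below uses synthetic bills and tips (the "true" slope of 0.15 and intercept of 1.0 are assumptions chosen for illustration, not values from the tips dataset).

```python
import numpy as np

# Synthetic data: tip is about 1.0 + 15% of the bill, plus noise
rng = np.random.default_rng(2)
total_bill = rng.uniform(5.0, 50.0, size=200)
tip = 1.0 + 0.15 * total_bill + rng.normal(0.0, 1.0, size=200)

# Degree-1 polynomial fit; coefficients come back highest degree first
slope, intercept = np.polyfit(total_bill, tip, deg=1)
print(f"tip = {intercept:.2f} + {slope:.3f} * total_bill")
```

With the real data, np.polyfit(tips["total_bill"], tips["tip"], deg=1) should give the line that the chart displays.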
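We can also tailor the lmplot. When x takes only a few discrete values, as size does, x_jitter adds a little random noise to the x positions so that overlapping points become visible:
sns.lmplot(x = "size", y = "tip", data = tips, x_jitter = .05);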
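Limits
We can collapse the observations at each discrete x value into a single point, here their mean, drawn with a bar marking the upper and lower limits of its confidence interval:
sns.lmplot(x = "size", y = "tip", data = tips, x_estimator = np.mean);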
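The central points that x_estimator = np.mean draws are simply the mean of y within each value of x. A minimal sketch with a small hand-made DataFrame (the values are made up for illustration) shows the same grouping in pandas:

```python
import pandas as pd

# Tiny made-up table: party size and tip
demo = pd.DataFrame({
    "size": [1, 1, 2, 2, 2, 3],
    "tip": [1.0, 2.0, 2.0, 3.0, 4.0, 5.0],
})

# Mean tip for each party size, one value per point in the chart
mean_tip = demo.groupby("size")["tip"].mean()
print(mean_tip)
```

On the real data, tips.groupby("size")["tip"].mean() gives the heights of the points in the chart.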
Load Dataset
Let's load another seaborn dataset:
# Loading anscombe dataset
anscombe = sns.load_dataset("anscombe")
Query - filter + regression
We can run a kind of query, that is, filter the rows of the dataset before plotting them. Here we keep only the rows of dataset II and turn off the confidence band with ci = None:
# Non-linear relationship
sns.lmplot(x = "x", y = "y", data = anscombe.query("dataset == 'II'"), ci = None, scatter_kws = {"s": 80});
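The query method takes a filter expression as a string and returns only the matching rows. The sketch below uses a small made-up table (not the real anscombe values) and shows that it gives the same result as the usual boolean mask.

```python
import pandas as pd

# Made-up rows that mimic the layout of the anscombe dataset
demo = pd.DataFrame({
    "dataset": ["I", "I", "II", "II", "II"],
    "x": [10, 8, 10, 8, 13],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Filter with a query string ...
subset = demo.query("dataset == 'II'")
# ... or with the equivalent boolean mask
same_subset = demo[demo["dataset"] == "II"]
print(subset)
```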
Parameter adjustment
A straight line does not follow these points. To make the fitted curve follow them, we set order = 2, which fits a second-degree polynomial instead of a line:
# We can adjust the parameters to fit the curve
sns.lmplot(x = "x", y = "y", data = anscombe.query("dataset == 'II'"), order = 2, ci = None, scatter_kws = {"s": 80});
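To see why a second-degree fit suits curved data better, we can compare how far each fit stays from the points. The sketch below uses synthetic points that lie exactly on a parabola (an assumption for illustration; these are not the anscombe values), and the helper residual_sum_of_squares is a hypothetical name.

```python
import numpy as np

# Synthetic curved data lying exactly on a parabola
x = np.linspace(4.0, 14.0, 11)
y = -0.1 * (x - 11.0) ** 2 + 9.0


def residual_sum_of_squares(degree):
    """Fit a polynomial of the given degree and return its squared error."""
    coefficients = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coefficients, x)
    return float(np.sum((y - fitted) ** 2))


rss_line = residual_sum_of_squares(1)      # straight line, like the default fit
rss_parabola = residual_sum_of_squares(2)  # second-degree fit, like order = 2
print(rss_line, rss_parabola)
```

The straight line leaves a clearly larger error, while the second-degree fit reproduces these synthetic points almost exactly.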
View outliers
We can see the points that fall far from the regular pattern of the data and pull the regression line toward them:
# Visualizing outliers
sns.lmplot(x = "x", y = "y", data = anscombe.query("dataset == 'III'"), ci = None, scatter_kws = {"s": 80});
Nonlinear relationship
We can represent a nonlinear relationship, that is, one in which y does not change at a constant rate as x changes. The lowess option replaces the straight line with a locally weighted smoother (it requires the statsmodels package):
# Using lowess smoother for variables with non-linear relationships
sns.lmplot(x = "total_bill", y = "tip", data = tips, lowess = True);
Different pieces of information
Another lmplot, this time representing a third variable: hue colors the points and fits a separate regression line for smokers and non-smokers:
# Using more than 2 variables
sns.lmplot(x = "total_bill", y = "tip", hue = "smoker", data = tips);
Customize chart
We can change the configuration and customize the markers and the color palette to make the difference between the groups more evident:
# Changing the chart setting
sns.lmplot(x = "total_bill", y = "tip", hue = "smoker", data = tips, markers = ["o", "x"], palette = "Set1");
Split area
We can also divide the drawing area, which is called the plot area. Above, we had a single chart in the plot area; below, we have two charts side by side.
Both charts keep tip on the y-axis and total_bill on the x-axis. The col parameter splits the data by the time variable, so one chart shows the lunch bills and the other the dinner bills.
# Dividing the drawing area
sns.lmplot(x = "total_bill", y = "tip", hue = "smoker", col = "time", data = tips);
We could then use these charts as simple models to make predictions, that is, to estimate the tip we would expect to receive for a given bill at lunch or at dinner.
Divide areas with more variables
Here we customize with another variable. Now we have four blocks of data: in addition to lunch and dinner in the columns, the row parameter separates the bills paid by men and by women.
# Dividing the drawing area
sns.lmplot(x = "total_bill", y = "tip", hue = "smoker", col = "time", row = "sex", data = tips);
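Split area
We can do one more split. Now we set col to day, so each chart shows one day of the week. col_wrap = 2 starts a new row after every two charts, and height sets the height of each chart in inches (this parameter was called size before seaborn 0.9):
# Dividing the drawing area
sns.lmplot(x = "total_bill", y = "tip", col = "day", data = tips, col_wrap = 2, height = 3);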
Split horizontal area
See that the days of the week are now all side by side. In the previous chart, col_wrap spread them over a grid; here we drop it and reduce aspect (the width-to-height ratio of each chart) to get a single horizontal row:
# Dividing the drawing area
sns.lmplot(x = "total_bill", y = "tip", col = "day", data = tips, aspect = .5);
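The object that lmplot returns is a grid with one set of axes per chart, which makes it easy to check how a split was laid out. The sketch below is an illustration on synthetic data (the column values are assumptions, so that it runs without downloading the tips dataset); it uses height, the current name of the parameter that older seaborn releases called size.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips dataset: 20 bills for each of four days
rng = np.random.default_rng(3)
days = ["Thur", "Fri", "Sat", "Sun"]
demo = pd.DataFrame({
    "day": np.repeat(days, 20),
    "total_bill": rng.uniform(5.0, 50.0, size=80),
})
demo["tip"] = 1.0 + 0.15 * demo["total_bill"] + rng.normal(0.0, 1.0, size=80)

# One chart per day, wrapped into rows of two charts
grid = sns.lmplot(data=demo, x="total_bill", y="tip", col="day", col_wrap=2, height=3)
print(grid.axes.shape)
```

The grid holds one axes object per distinct value of the col variable, here four.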
Working with Categorical variables
Now we'll look at some charts to visualize categorical variables, such as strings. Until now, we've seen graphs for numeric variables.
Stripplot
Now we look at the total bill per day of the week. The day of the week is a categorical variable, so we have to represent it differently: each day gets its own strip of points.
# stripplot
sns.stripplot(x = "day", y = "total_bill", data = tips);
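Customize Stripplot
We can make slight modifications. The jitter parameter spreads the points sideways within each category so that they overlap less. It is already enabled by default in recent seaborn versions, so passing it explicitly mainly documents the intent:
# stripplot
sns.stripplot(x = "day", y = "total_bill", data = tips, jitter = True);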
Swarmplot
The swarmplot is like the previous drawing, but it positions the points so that they do not overlap.
# swarmplot - Avoiding overlap of categorical points
sns.swarmplot(x = "day", y = "total_bill", data = tips);
Boxplot
One of the most famous charts in statistics, often used with categorical variables. The box spans the first to the third quartile with a line at the median, and the points beyond the whiskers are the outliers, that is, values that fall far from the rest of the data:
# boxplot
sns.boxplot(x = "day", y = "total_bill", hue = "time", data = tips);
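The points that a boxplot marks as outliers can be found by hand. The sketch below assumes the common 1.5 × IQR rule, which matches the default whisker setting (whis = 1.5), and uses a small made-up list of bills with one value that is clearly too large.

```python
import numpy as np

# Made-up bills with one obviously large value
bills = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Quartiles and interquartile range (the height of the box)
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box are drawn as individual points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]
print(q1, median, q3, outliers)
```

Only the value 30 falls outside the fences, so it is the only point that would be drawn separately.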
Modify Boxplot
We can change the orientation of the boxplot. Here we draw it horizontally:
# boxplot
sns.boxplot(data = iris, orient = "h");
Violin plot
# violinplot
sns.violinplot(x = "total_bill", y = "day", hue = "time", data = tips);
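Customizing Violin plot
We can do some tailoring to narrow the violins: bw = .1 uses a narrower smoothing bandwidth, and scale = "count" sizes each violin according to its number of observations:
# violinplot
sns.violinplot(x = "total_bill", y = "day", hue = "time", data = tips, bw = .1, scale = "count", scale_hue = False);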
Violin plot
Vertical orientation, with split = True drawing the two sex categories as the two halves of each violin:
# violinplot
sns.violinplot(x = "day", y = "total_bill", hue = "sex", data = tips, split = True);
Barplot
Another commonly used chart for categorical variables. By default, each bar shows the mean of total_bill, with an error bar for its confidence interval:
# barplot
sns.barplot(x = "day", y = "total_bill", hue = "sex", data = tips);
Countplot
Graph that counts the records for each day of the week:
# countplot
sns.countplot(x = "day", data = tips, palette = "Greens_d");
Countplot
We can customize the countplot by making it horizontal and splitting the count of records for each day of the week by sex:
# countplot
sns.countplot(y = "day", hue = "sex", data = tips, palette = "Greens_d");
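The bar lengths in a countplot are plain row counts, which pandas can produce as a table. The sketch below uses a few made-up rows (not the real tips data): value_counts gives the count per day, and crosstab gives the count per day and sex.

```python
import pandas as pd

# Made-up rows with the two categorical columns used in the chart
demo = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun", "Thur"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

# Rows per day: the bars of the simple countplot
per_day = demo["day"].value_counts()
# Rows per day and sex: the bars of the countplot with hue
per_day_and_sex = pd.crosstab(demo["day"], demo["sex"])
print(per_day_and_sex)
```

On the real data, pd.crosstab(tips["day"], tips["sex"]) gives the numbers behind the chart.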
Continuous Countplot
Here we have an example of the count drawn with bars in a single color, on a figure whose size we set with Matplotlib:
# countplot
f, ax = plt.subplots(figsize = (7, 3))
sns.countplot(y = "day", data = tips, color = "c");
Point plot
Another graph that we can use for categorical variables, here relating sex to the mean total bill, with one line for smokers and one for non-smokers:
# pointplot
sns.pointplot(x = "sex", y = "total_bill", hue = "smoker", data = tips);
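Factorplot
It is a modified version of the point plot above, drawn on a grid that can be split into several charts. The factorplot function was renamed to catplot in seaborn 0.9; kind = "point" keeps the point-plot style that factorplot used by default:
# catplot (formerly factorplot)
sns.catplot(x = "day", y = "total_bill", hue = "smoker", data = tips, kind = "point");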
Viewing Pandas DataFrames in Seaborn
Finally, we'll build our own Pandas DataFrame and use it to generate statistical charts with Seaborn.
Import libraries
import random
import pandas as pd
Create empty DataFrame
df = pd.DataFrame()
Creating ranges
We'll fill two columns of the DataFrame with 25 random integers each, drawn from 1 to 99:
df['x'] = random.sample(range(1, 100), 25)
df['y'] = random.sample(range(1, 100), 25)
df.head()
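Because the values are random, every run of the notebook produces different charts. If reproducible output is wanted, one option is to seed the generator first. The sketch below is a suggestion, not part of the original notebook: the seed 42 is arbitrary and the name df_seeded is only an illustration. Note that random.sample draws without replacement, so each column holds 25 distinct values.

```python
import random

import pandas as pd

# Fix the seed so the same "random" values come back on every run
random.seed(42)  # arbitrary seed, chosen for illustration

df_seeded = pd.DataFrame({
    "x": random.sample(range(1, 100), 25),
    "y": random.sample(range(1, 100), 25),
})
print(df_seeded.head())
```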
Scatterplot
We will create a scatterplot from the DataFrame created in Pandas, passing x and y, indicating the dataset as df, and setting the fit_reg parameter to False because we want only the data points, without the regression line:
# Scatterplot
sns.lmplot(x = 'x', y = 'y', data = df, fit_reg = False)
There is no relationship and no visible trend, as expected for two independent random samples.
Density Plot - kdeplot
A graph that shows the estimated density curve for the variable y:
# Density Plot
sns.kdeplot(df.y)
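A density curve of this kind is built by placing a small bell-shaped bump on each observation and averaging the bumps. The sketch below does this by hand with NumPy as an illustration of the idea, not as seaborn's exact implementation; it assumes Gaussian bumps and Scott's rule of thumb for the bandwidth, and uses synthetic samples.

```python
import numpy as np

# Synthetic sample of 25 values between 1 and 99
rng = np.random.default_rng(4)
samples = rng.uniform(1.0, 99.0, size=25)

# Scott's rule of thumb for the bandwidth (an assumption of this sketch)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

# Points at which the curve is evaluated, extended a little past the data
grid = np.linspace(samples.min() - 4 * bandwidth, samples.max() + 4 * bandwidth, 500)

# One Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

# The area under a density curve should be close to 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve.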
Distplot
We will now see the density of x with the distplot:
# Distplot
sns.distplot(df.x)
Rugplot
We can also combine a histogram with a rug. The rug is the set of small ticks at the base of the chart, one per observation:
# Histogram
plt.hist(df.x, alpha = .3)
sns.rugplot(df.x);
Boxplot
The boxplot is always valuable for quickly visualizing the median, quartiles, and any outliers. Passing the two columns through the data parameter draws one box per column:
# Boxplot
sns.boxplot(data = df[["y", "x"]])
Heatmap
We can create a heat map, in which each cell is colored according to its value; annot = True also writes the value inside each cell:
# Heatmap
sns.heatmap([df.y, df.x], annot = True, fmt = "d")
Cluster map
It draws a heatmap whose rows and columns are reordered by hierarchical clustering, with dendrograms on the margins showing the groups. Clustering is widely used in unsupervised learning:
# Clustermap
sns.clustermap(df)
Therefore, we have built here an instrumental reference material to apply statistical graphs and exploratory analysis.
And there we have it. I hope you have found this useful. Thank you for reading.