Seaborn

Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.

Seaborn's design lets you explore and understand your data quickly. You pass it entire dataframes or arrays containing all your data, and it performs the semantic mapping and statistical aggregation needed to turn that data into informative plots.

It abstracts complexity while allowing you to design your plots to your requirements.

Installing Seaborn 

Installing seaborn takes a single command with your favorite Python package manager. The installation also pulls in seaborn's required dependencies, numpy, pandas, and matplotlib; depending on the seaborn version, scipy is either installed as well or treated as an optional extra.

Let’s install seaborn together with the notebook package, which gives us a Jupyter notebook as our data playground.

pipenv install seaborn notebook


Additionally, we are going to import a few modules before we get started.

import seaborn as sns

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

Building your first plots 

Before we can start plotting anything, we need data. The beauty of seaborn is that it works directly with pandas dataframes, which makes it super convenient. Even better, the library comes with some built-in datasets that you can load from code, with no need to download files manually.

Let’s see how that works by loading a dataset that contains information about flights.

flights_data = sns.load_dataset("flights")

flights_data.head()

	year	month	passengers
0	1949	Jan	112
1	1949	Feb	118
2	1949	Mar	132
3	1949	Apr	129
4	1949	May	121

All the magic happens when calling the function load_dataset, which expects the name of the dataset to load and returns a dataframe. The datasets are hosted in a GitHub repository, so the first call needs an internet connection to download the data.

Scatter Plot 

A scatter plot is a diagram that displays points based on two dimensions of the dataset. Creating a scatter plot with seaborn is simple and takes just one line of code.

sns.scatterplot(data=flights_data, x="year", y="passengers")


Sample scatter plot

Very easy, right? The function scatterplot expects the dataset we want to plot and the columns representing the x and y axis.
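
As a small extension of the call above, here is a sketch that also colors the points by a third column using the hue parameter. The mini_flights dataframe is a hand-made stand-in for the flights data (an assumption, so the example runs without a download), and the title text is made up too.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd
import seaborn as sns

# Hand-made miniature of the flights data, for illustration only.
mini_flights = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950, 1951, 1951],
    "month": ["Jan", "Jul", "Jan", "Jul", "Jan", "Jul"],
    "passengers": [112, 148, 115, 170, 145, 199],
})

# hue assigns one color per month and adds a legend to the chart.
ax = sns.scatterplot(data=mini_flights, x="year", y="passengers", hue="month")
ax.set_title("Passengers per year, colored by month")
```

The function returns a matplotlib Axes object, which is why we can call set_title on the result.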

Line Plot 

This plot draws a line that represents the evolution of continuous or categorical data. It is a popular and well-known type of chart, and it’s super easy to produce. As before, we use the function lineplot with the dataset and the columns representing the x and y axes. Seaborn will do the rest.

sns.lineplot(data=flights_data, x="year", y="passengers")


Sample line plot

Bar Plot 

The bar chart is probably the best-known type of chart, and as you may have predicted, we can draw it with seaborn the same way we did for line and scatter plots, this time using the function barplot.

sns.barplot(data=flights_data, x="year", y="passengers")


Sample bar plot

It’s very colorful, I know. We will learn how to customize it later on in the guide.

Extending with matplotlib 

Seaborn builds on top of matplotlib, extending its functionality and abstracting complexity. With that said, it does not limit its capabilities. Any seaborn chart can be customized using functions from the matplotlib library. It can come in handy for specific operations and allows seaborn to leverage the power of matplotlib without having to rewrite all its functions.

Let’s say that you, for example, want to plot multiple graphs simultaneously using seaborn; then you could use the subplot function from matplotlib.

diamonds_data = sns.load_dataset('diamonds')

plt.subplot(1, 2, 1)

sns.countplot(x='carat', data=diamonds_data)

plt.subplot(1, 2, 2)

sns.countplot(x='depth', data=diamonds_data)


Sample plot with sub-plots

Using the subplot function, we can draw more than one chart on a single plot. The function takes three parameters, the first is the number of rows, the second is the number of columns, and the last one is the plot number.

We are rendering a seaborn chart in each subplot, mixing matplotlib with seaborn functions.
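
An alternative that many people find easier to read is plt.subplots, which creates the figure and all the axes at once; each seaborn call then receives its target through the ax parameter. This is only a sketch: the gems dataframe is a made-up stand-in for the diamonds data, using two of its categorical columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that imitate two columns of the diamonds dataset.
gems = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good", "Premium", "Ideal"],
    "color": ["E", "E", "G", "E", "G", "G"],
})

# One row, two columns; each seaborn call draws on the Axes it is given.
fig, (ax_cut, ax_color) = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(data=gems, x="cut", ax=ax_cut)
sns.countplot(data=gems, x="color", ax=ax_color)
fig.tight_layout()  # keep labels of neighboring charts from overlapping
```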

Seaborn loves Pandas 

We already talked about this, but seaborn loves pandas to such an extent that all its functions build on top of the pandas dataframe. So far, we have seen examples of using seaborn with pre-loaded data, but what if we want to draw a plot from data we have already loaded using pandas?

drinks_df = pd.read_csv("data/drinks.csv")

sns.barplot(x="country", y="beer_servings", data=drinks_df)


Sample plot with pandas
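
Since data/drinks.csv is not bundled with seaborn, here is a self-contained sketch of the same idea. It reads a few invented rows from an in-memory CSV string (the country names and numbers are placeholders, not real data), filters them with pandas, and hands the result to seaborn.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd
import seaborn as sns

# Placeholder CSV content standing in for data/drinks.csv.
csv_text = """country,beer_servings
Country A,100
Country B,250
Country C,50
Country D,180
"""
drinks_sample = pd.read_csv(io.StringIO(csv_text))

# Pre-process with pandas first: keep only the three highest values.
top3 = drinks_sample.sort_values("beer_servings", ascending=False).head(3)

ax = sns.barplot(data=top3, x="country", y="beer_servings")
```

Filtering before plotting is handy when a file has many categories, because a bar per category quickly makes the x axis unreadable.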

Making beautiful plots with styles 

Seaborn gives you the ability to change the look of your graphs, and it provides five different styles out of the box: darkgrid, whitegrid, dark, white, and ticks.

sns.set_style("darkgrid")

sns.lineplot(data=flights_data, x="year", y="passengers")


Sample plot with darkgrid style

Here is another example:

sns.set_style("whitegrid")

sns.lineplot(data=flights_data, x="year", y="passengers")


Sample plot with whitegrid style

Cool use cases 

We know the basics of seaborn, so now let’s put them into practice by building multiple charts over the same dataset. In our case, we will use the “tips” dataset, which you can download directly using seaborn.

First, load the dataset.

tips_df = sns.load_dataset('tips')

tips_df.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

I like to print the first few rows of the data set to get a feeling for the columns and the data itself. Usually, I use some pandas functions to fix data issues such as null values and to add information to the data set that may be helpful. You can read more about this in the guide to working with pandas.

Let’s add a column to the data set with the tip amount as a share of the total bill. Note that the value is stored as a fraction, so 0.15 means 15%.

tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"]

tips_df.head()


Now our data frame looks like the following:

	total_bill	tip	sex	smoker	day	time	size	tip_percentage
0	16.99	1.01	Female	No	Sun	Dinner	2	0.059447
1	10.34	1.66	Male	No	Sun	Dinner	3	0.160542
2	21.01	3.50	Male	No	Sun	Dinner	3	0.166587
3	23.68	3.31	Male	No	Sun	Dinner	2	0.139780
4	24.59	3.61	Female	No	Sun	Dinner	4	0.146808
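
Because the new column holds fractions rather than percentages, it can help to add a formatted label when showing the numbers to other people. A minimal pandas-only sketch, reusing the first three rows of the table above; the tip_pct_label column name is made up for this example.

```python
import pandas as pd

# First three rows of the tips data, copied from the table above.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
})

# Same computation as in the article: tip as a fraction of the bill.
sample["tip_percentage"] = sample["tip"] / sample["total_bill"]

# Display-only label; the numeric column stays a fraction for plotting.
sample["tip_pct_label"] = (sample["tip_percentage"] * 100).round(1).astype(str) + "%"
print(sample)
```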

Next, we can start plotting some charts.

Understanding tip percentages 

Let’s first try to understand the distribution of the tip percentage. For that, we can use histplot, which generates a histogram.

sns.histplot(tips_df["tip_percentage"], binwidth=0.05)


Understanding tip percentages plot

That’s good. We had to customize the binwidth property to make the chart more readable, but now we can quickly get a sense of the data. Most customers tip between 15% and 20%, and we have some edge cases where the tip is over 70%. Those values are anomalies, and they are always worth exploring to determine whether they are errors or not.
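
One simple way to explore such anomalies is to filter the rows with pandas and look at them directly. The sketch below runs on a few invented bills; the 0.70 threshold mirrors the "over 70%" edge cases mentioned above and is otherwise arbitrary.

```python
import pandas as pd

# Invented bills; the second one has an unusually large tip.
sample = pd.DataFrame({
    "total_bill": [20.0, 7.0, 30.0, 10.0],
    "tip": [3.0, 5.0, 5.0, 1.5],
})
sample["tip_percentage"] = sample["tip"] / sample["total_bill"]

# Keep only the rows where the tip exceeds 70% of the bill.
outliers = sample[sample["tip_percentage"] > 0.70]
print(outliers)
```

With the real data you would run the same filter on tips_df and then decide, row by row, whether the value looks like a typo or a genuinely generous customer.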

It would also be interesting to know whether the tip percentage changes depending on the time of day:

sns.histplot(data=tips_df, x="tip_percentage", binwidth=0.05, hue="time")


Understanding tip percentages by time plot

This time we loaded the chart with the full dataset instead of just one column, and then we set the property hue to the column time. This will force the chart to use different colors for each value of time and add a legend to it.

Total of tips per day of the week 

Another interesting metric is how much money in tips the staff can expect depending on the day of the week.

sns.barplot(data=tips_df, x="day", y="tip", estimator=np.sum)


Total of tips per day plot

It looks like Friday is a good day to stay home.
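
The estimator parameter is what turns the default average into a total: barplot groups the rows by the x column and applies the function to each group. To convince ourselves, the sketch below compares the bar heights with a plain pandas groupby on a few invented tips.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented tips for three days.
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "tip": [2.0, 3.0, 1.5, 1.0, 4.0, 2.5, 3.5],
})

# What the bars should show: the total of tips per day.
totals = sample.groupby("day")["tip"].sum()

fig, ax = plt.subplots()
sns.barplot(data=sample, x="day", y="tip", estimator=np.sum, ax=ax)

# Read the heights of the drawn bars back from the chart.
bar_heights = sorted(p.get_height() for p in ax.patches if p.get_height() > 0)
print(totals)
```

Keep in mind that a total depends on how many rows each day has, so a low bar may simply mean fewer recorded tables rather than stingier customers.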

Impact of table size and day on the tip 

Sometimes we want to understand how two variables play together to determine an outcome. For example, how do the day of the week and the table size impact the tip percentage?

To draw the next chart, we will use the pivot_table function of pandas to pre-process the information and then draw a heatmap.

pivot = tips_df.pivot_table(
    index=["day"],
    columns=["size"],
    values="tip_percentage",
    aggfunc=np.average)

sns.heatmap(pivot)
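
To see what the pivot table actually contains, here is a sketch on a handful of invented rows. Each cell of the result is the average tip fraction for one combination of day and table size, and the annot option of heatmap writes that number into the cell. The color map name and the number format are just choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows: two days, two table sizes.
sample = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "size": [2, 2, 4, 2, 4, 4],
    "tip_percentage": [0.10, 0.20, 0.15, 0.18, 0.12, 0.16],
})

# Rows become days, columns become table sizes, cells hold the mean.
pivot = sample.pivot_table(index="day", columns="size",
                           values="tip_percentage", aggfunc="mean")
print(pivot)

fig, ax = plt.subplots()
# annot=True prints each value in its cell; fmt=".0%" shows it as a percent.
sns.heatmap(pivot, annot=True, fmt=".0%", cmap="viridis", ax=ax)
```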

