Data Comprehension in Python
Madhumita Mazumdar
Market Research Analyst at Honeycomb Pvt. Ltd. || MBA/PGDM || Business Analytics
Data visualisation is a powerful and essential tool in data analysis and communication. It presents complex information in graphical form so that it is easier to access, comprehend, and act upon. Python, a flexible and popular programming language, provides a variety of modules and tools that make data visualisation simple and effective.
The four main data visualization libraries for Python are Pandas, Matplotlib, Seaborn, and Plotly, each with its own specialties. Here is a quick synopsis of each.
Matplotlib: It offers a huge selection of configurable graphs, charts, and plots. You can make basic line charts, scatter plots, bar charts, histograms, and more with its great level of flexibility. Although it gives users a lot of customization options, additional code may be needed for complicated visualisations.
Pandas: Although Pandas is not primarily a visualisation library, its plot function can be used to quickly create basic plots. It acts as a thin wrapper around Matplotlib’s features and offers a quick way to make simple charts like line plots, bar plots, and histograms.
Seaborn: Seaborn is a higher-level interface built on top of Matplotlib for producing attractive and informative statistical graphics. It lets developers write less code while still producing polished visualisations. Statistical plots such as distribution plots, pair plots, and regression plots are Seaborn’s area of expertise.
Plotly: It is a powerful library for building interactive, dynamic visualisations, such as heatmaps, scatter plots, 2D and 3D charts, and more. Plotly’s visualisations can be embedded in websites or notebooks, letting users explore the data on their own.
Importing Datasets
We will use two publicly accessible datasets in this post: the Iris dataset and the Wine Reviews dataset. Both can be loaded into memory with pandas’ read_csv method.
Dataset links:
import pandas as pd
iris = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
print(iris.head())
wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()
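To see what the names and index_col arguments do without downloading the files, here is a minimal sketch that reads two tiny in-memory CSV snippets. The handful of rows and the variable names iris_sample and wine_sample are illustrative stand-ins, and the sketch assumes the Iris file has no header row, as the names argument in the call above suggests.

```python
import io
import pandas as pd

# Stand-in for iris.csv: no header row, so column names come from `names`
iris_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
iris_sample = pd.read_csv(
    iris_csv,
    names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'],
)

# Stand-in for the wine file: the first, unnamed column holds row numbers,
# so index_col=0 turns it into the index instead of a data column
wine_csv = io.StringIO(
    ",country,points,price\n"
    "0,Italy,87,\n"
    "1,Portugal,87,15.0\n"
)
wine_sample = pd.read_csv(wine_csv, index_col=0)

print(iris_sample.shape)
print(list(wine_sample.columns))
```

Without index_col=0 the row numbers would show up as an extra column, usually labelled "Unnamed: 0".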
Matplotlib
Matplotlib is the most widely used Python plotting library. It is a low-level library with a MATLAB-like interface that offers a lot of freedom at the cost of having to write more code. To install Matplotlib, use either pip or conda:
pip install matplotlib
or
conda install matplotlib
Matplotlib is particularly well suited to making simple graphs like line charts, bar charts, and histograms. You can import it with:
import matplotlib.pyplot as plt
Scatter Plot
The scatter method in Matplotlib can be used to produce a scatter plot. In order to give our plot a title and labels, we will also make a figure and an axis using plt.subplots.
# create a figure and axis
fig, ax = plt.subplots()
# scatter the sepal_length against the sepal_width
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
We can give the graph more meaning by colouring each data point according to its class. This can be done by building a dictionary that maps each class to a colour, and then scattering each point on its own in a for-loop, passing the matching colour.
# create color dictionary
colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i], color=colors[iris['class'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
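Scattering every point in its own call works, but it creates one plot artist per row. A common alternative, sketched here on a small hand-made frame, is to scatter one class at a time, which also makes it easy to add a legend. The frame iris_sample and its six rows are illustrative stand-ins for the full dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small illustrative stand-in for the iris dataframe
iris_sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-versicolor', 'Iris-versicolor',
              'Iris-virginica', 'Iris-virginica'],
})
colors = {'Iris-setosa': 'r', 'Iris-versicolor': 'g', 'Iris-virginica': 'b'}

fig, ax = plt.subplots()
# one scatter call per class instead of one per point
for name, group in iris_sample.groupby('class'):
    ax.scatter(group['sepal_length'], group['sepal_width'],
               color=colors[name], label=name)

# set a title, labels and a legend built from the class names
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.legend()
```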
Line Chart
By using the plot method in Matplotlib, we can make a line graph. Additionally, we can plot many columns on a single axis by iteratively looping through the columns we need.
# get columns to plot
columns = iris.columns.drop(['class'])
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    # label each line so that ax.legend() has entries to show
    ax.plot(x_data, iris[column], label=column)
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()
Histogram
The hist method in Matplotlib can be used to generate a histogram. If we pass it a numeric column, such as the points column from the wine-review dataset, it groups the values into bins and counts the frequency of each bin automatically.
# create figure and axis
fig, ax = plt.subplots()
# plot histogram
ax.hist(wine_reviews['points'])
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
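Unless told otherwise, hist splits the value range into 10 equal-width bins and counts the values in each. The sketch below uses NumPy's histogram function on a handful of made-up scores to show the counting that happens behind the plot. The scores array is illustrative and not taken from the wine dataset.

```python
import numpy as np

# Made-up review scores, for illustration only
scores = np.array([80, 82, 85, 85, 87, 88, 88, 88, 90, 91, 93, 95, 100])

# 10 equal-width bins spanning the minimum to the maximum score
counts, edges = np.histogram(scores, bins=10)

# print each bin's range next to the number of scores that fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:5.1f} to {right:5.1f}: {count}')
```

Passing a different bins value to ax.hist changes how coarse or fine the histogram looks in the same way.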
Bar Chart
The bar method can be used to make a bar chart. Unlike hist, bar does not count frequencies automatically, so we use the pandas value_counts method to do it. A bar chart works best for categorical data with fewer than about 30 categories; beyond that it tends to look cluttered.
# create a figure and axis
fig, ax = plt.subplots()
# count the occurrence of each class
data = wine_reviews['points'].value_counts()
# get x and y data
points = data.index
frequency = data.values
# create bar chart
ax.bar(points, frequency)
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
Pandas
Pandas is an open-source, high-performance, easy-to-use library that provides data structures such as data frames, as well as data analysis tools such as the visualisation tools we will use in this article.
Pandas Visualisation makes it easy to create plots from a pandas data frame or series. To install Pandas, use either pip or conda:
pip install pandas
or
conda install pandas
Scatter Plot
To create a scatter plot in Pandas, we can call <dataset>.plot.scatter() and pass it two arguments: the name of the x-column and the name of the y-column. Optionally, we can also pass it a title.
iris.plot.scatter(x='sepal_length', y='sepal_width', title='Iris Dataset')
Note that the x and y labels are automatically set to the column names.
Line Chart
To create a line chart in Pandas, we can call <dataframe>.plot.line(). Unlike Matplotlib, where we had to loop through each column we wished to plot, Pandas automatically plots all available numeric columns (at least if we don’t specify particular columns).
iris.drop(['class'], axis=1).plot.line(title='Iris Dataset')
Histogram
Using Pandas’ plot.hist method, we can make a histogram. We can optionally pass some arguments, such as the number of bins, but none are required.
wine_reviews['points'].plot.hist()
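As a small sketch of those optional arguments, the call below sets the number of bins and a title on a made-up series. The points values here are illustrative, and the choice of 5 bins is arbitrary.

```python
import pandas as pd

# Made-up review scores, for illustration only
points = pd.Series([80, 83, 85, 86, 88, 88, 90, 91, 94, 100], name='points')

# bins controls how many bars the value range is split into
ax = points.plot.hist(bins=5, title='Wine Review Scores')
ax.set_xlabel('Points')
```

The method returns a Matplotlib axis, so labels and other settings can be adjusted afterwards just as in the Matplotlib section.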
Bar Chart
We can use the plot.bar() method to create a bar chart, but first we must prepare our data. The value_counts() method counts the occurrences of each score, and the sort_index() method then orders the result by score from lowest to highest.
wine_reviews['points'].value_counts().sort_index().plot.bar()
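To make the effect of sort_index() concrete, here is a minimal sketch on a made-up series of scores. Without it, value_counts() returns the scores ordered by how often they occur, so the bars would not follow the score axis.

```python
import pandas as pd

# Made-up review scores, for illustration only
points = pd.Series([88, 90, 88, 87, 90, 88], name='points')

by_frequency = points.value_counts()           # most frequent score first
by_score = points.value_counts().sort_index()  # lowest score first

print(by_frequency.index.tolist())
print(by_score.index.tolist())
```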
Seaborn
Seaborn is a Python data visualisation library based on Matplotlib. It provides a high-level interface for creating attractive graphs.
Seaborn has a lot to offer. For instance, a graph that would take several tens of lines in Matplotlib can often be created in just one line. Its default styles look good, and it has a convenient interface for working with Pandas dataframes.
You can import it by keying in:
import seaborn as sns
Scatter plot
The scatterplot method may be used to create a scatterplot, and just like in Pandas, we must supply it the column names of the x and y data. However, since we aren’t running the function directly on the data like we did in Pandas, we also need to pass the data as an additional argument.
sns.scatterplot(x='sepal_length', y='sepal_width', data=iris)
We can also highlight the points by class using the hue argument, which is a lot easier than in Matplotlib.
sns.scatterplot(x='sepal_length', y='sepal_width', hue='class', data=iris)
Line chart
A line chart can be produced with the sns.lineplot method. The only required argument is the data, which in our case are the four numeric columns from the Iris dataset. Alternatively, sns.kdeplot draws a smoothed kernel density estimate of each column's distribution instead of the raw values, which can look cleaner when the dataset has a lot of outliers.
sns.lineplot(data=iris.drop(['class'], axis=1))
Histogram
We use the sns.histplot method in Seaborn to produce a histogram (the older sns.distplot has been deprecated since Seaborn 0.11). We pass it the column we wish to plot, and it counts the occurrences on its own. We can also pass it the number of bins and, through the kde argument, choose whether to draw a Gaussian kernel density estimate on top of the histogram.
sns.histplot(wine_reviews['points'], bins=10, kde=False)
sns.histplot(wine_reviews['points'], bins=10, kde=True)
Bar chart
In Seaborn, a bar chart can be produced by passing the column name and the data to the sns.countplot method, which counts the occurrences of each value for us.
sns.countplot(x='points', data=wine_reviews)
Other Graph Types
Now that you have a basic understanding of the Matplotlib, Pandas Visualisation, and Seaborn syntax, here are a few other graph types that are helpful for extracting insights.
Due to its high-level interface, Seaborn is the go-to library for the majority of them, enabling the development of stunning graphs in just a few lines of code.
Box plots
A graphical way to show the five-number summary is with a box plot. By utilising the sns.boxplot method on Seaborn and providing the x and y column names along with the data, we can produce box plots.
df = wine_reviews[(wine_reviews['points']>=95) & (wine_reviews['price']<1000)]
sns.boxplot(x='points', y='price', data=df)
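The five-number summary that a box plot draws can also be computed directly, which helps when reading the chart. The sketch below does this with pandas on a few made-up prices; the values are illustrative and not taken from the wine dataset.

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([20, 35, 40, 55, 60, 80, 150])

# minimum, first quartile, median, third quartile, maximum
summary = prices.quantile([0, 0.25, 0.5, 0.75, 1.0])
summary.index = ['min', 'Q1', 'median', 'Q3', 'max']
print(summary)

# interquartile range: the height of the box
iqr = summary['Q3'] - summary['Q1']
print(iqr)
```

The box spans Q1 to Q3 with a line at the median. With Seaborn's default settings the whiskers stop at the last data point within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual outlier point.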
Heatmap
The individual values found in a matrix are represented as colours in a heatmap, which is a graphical representation of data. Heatmaps are ideal for investigating the relationship between the features in a dataset.
To obtain the correlation of the features contained within a dataset, we can call <dataset>.corr(), a dataframe method in the Pandas library. It returns the correlation matrix.
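As a minimal sketch of what the correlation matrix looks like, the example below computes it for a small hand-made frame. The frame sample and its values are illustrative. The text column is dropped first because correlation is only defined for numeric data, and recent pandas versions raise an error if a text column is left in.

```python
import pandas as pd

# Small illustrative frame with two numeric columns and one text column
sample = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'class': ['a', 'a', 'b', 'b', 'c', 'c'],
})

# drop the text column, then compute pairwise Pearson correlations
corr = sample.drop(['class'], axis=1).corr()
print(corr.round(2))
```

The result is a square, symmetric dataframe with ones on the diagonal, since every column correlates perfectly with itself.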
The heatmap can now be produced using either Matplotlib or Seaborn.
import numpy as np
# get correlation matrix of the numeric columns (the text 'class' column is dropped)
corr = iris.drop(['class'], axis=1).corr()
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)
# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
Two for loops must be added in order to annotate the heatmap:
import numpy as np
# get correlation matrix of the numeric columns (the text 'class' column is dropped)
corr = iris.drop(['class'], axis=1).corr()
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)
# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(corr.columns)):
    for j in range(len(corr.columns)):
        text = ax.text(j, i, np.around(corr.iloc[i, j], decimals=2),
                       ha="center", va="center", color="black")
Seaborn makes it way easier to create a heatmap and add annotations:
sns.heatmap(iris.drop(['class'], axis=1).corr(), annot=True)
Conclusion
Data visualisation is the practice of placing data in a visual context in order to reveal patterns, trends, and connections that might not otherwise be visible.
Python has a number of excellent graphing packages that are jam-packed with unique capabilities. In this article, we examined Seaborn, Pandas visualisation, and Matplotlib.