The Broad-Brush Approach: How to Explore Data Relationships Using Pair Plots in Python + R
Data from https://www.fueleconomy.gov/feg/download.shtml

We continue our series on linear regression by visualising the relationships between the predictors and the target in our dataset. So far we have plotted a correlation funnel, and last week we demonstrated how important it is to plot the model fit against your data. Today's data science job is to take a step back and plot every variable against every other. The goal is similar to that of the correlation funnel plot, namely finding relationships worth including in our model. But as we saw last week, plotting the data can reveal structure and information that simple linear regression statistics do not capture.

The data we are looking at today was downloaded from Fuel Economy Data (fueleconomy.gov), and we have narrowed it down to just a few variables so that the visualisations are a little more manageable. Let's have a look at how we do this in R, shall we?

The GGally package helps you easily generate pair plots in R

GGally provides bonus information that we don't get in the Python pair plot examples, without any additional work. There is an additional column and row for the transmission category, showing a histogram and a box plot for each variable, and the panels to the right of the diagonal give the correlation coefficient for each category in place of a scatter plot.

Creating a pairplot in GGally and R

The output from this R code is shown in the cover image above.

You have two simple pair plotting options in Python

To complete the same task in Python we will take a look at two different methods. The first approach is to use the pandas scatter_matrix function.

data preparation and scatter_matrix code in Python
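A minimal sketch of what this step could look like, not the exact code behind the plots in this post. It assumes a synthetic stand-in for the fuel economy data, and the column names (displ, cylinders, comb_mpg, co2) are hypothetical placeholders rather than the real field names in the download.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for the fuel economy data (hypothetical column names).
rng = np.random.default_rng(42)
n = 200
displ = rng.uniform(1.0, 6.0, n)  # engine displacement in litres
df = pd.DataFrame({
    "displ": displ,
    "cylinders": np.clip(np.round(displ * 1.5 + rng.normal(0, 0.7, n)), 3, 12),
    "comb_mpg": 45 - 5 * displ + rng.normal(0, 2, n),
    "co2": 150 + 60 * displ + rng.normal(0, 20, n),
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist", alpha=0.5)
```

scatter_matrix returns a NumPy array of matplotlib axes, one per panel, so individual panels can be restyled afterwards if needed.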

The scatter_matrix output for four different variables is shown below:

pair plot using pandas plotting scatter_matrix

The seaborn library also has a pairplot function that can easily handle this task.

code to create seaborn pair plot

This seaborn pair plot duplicates the bottom 3 rows and right 3 columns of the GGally pair plot shown on the cover. Note that it does not include the per-category distributions, box plots and correlation coefficients that GGally gives you by default. The GGally plot replaces the scatter plots to the right of the diagonal with correlation coefficients without losing any information, because those plots are the same as the ones to the left of the diagonal with the axes flipped.
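If you want the correlation coefficients that GGally prints, one option in Python is to compute them separately with pandas. The sketch below is only an illustration: it uses synthetic data with hypothetical column names and a made-up trans category, and it prints tables rather than annotating the plot.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the fuel economy data (hypothetical column names).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "displ": rng.uniform(1.0, 6.0, n),
    "trans": np.tile(["Automatic", "Manual"], n // 2),
})
df["comb_mpg"] = 45 - 5 * df["displ"] + rng.normal(0, 2, n)
df["co2"] = 150 + 60 * df["displ"] + rng.normal(0, 20, n)

numeric_cols = ["displ", "comb_mpg", "co2"]

# Pearson correlations across the whole dataset.
overall = df[numeric_cols].corr()

# The same correlations computed within each transmission category.
by_trans = df.groupby("trans")[numeric_cols].corr()

print(overall.round(2))
print(by_trans.round(2))
```

The grouped result has one block of rows per category, which plays a similar role to the per-category coefficients in the GGally panels.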

pair plot using seaborn and Python

Pair plots can be useful for exploring and analyzing datasets, but they have some limitations when it comes to analyzing large datasets. Some of the main limitations of pair plots when analyzing large datasets include:

  • Overplotting: When you have a large number of data points, the points can overlap and make it difficult to see the underlying patterns in the data. Lowering the marker opacity with the alpha parameter helps reduce this effect.
  • Computational complexity: Pair plots can be computationally intensive and slow to render, especially when you have a large number of variables in the dataset, because k variables produce a k × k grid of panels.
  • Lack of flexibility: Pair plots only show the relationship between pairs of variables, and do not allow you to easily explore higher-order relationships among the variables. This can make it difficult to uncover complex relationships in the data.
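The first two limitations can often be softened by plotting a random subsample, restricting the plot to a few columns, and lowering alpha. Here is a minimal sketch of that idea on synthetic data; the column names and the sample size of 2,000 are arbitrary choices for illustration, not recommendations.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# A large synthetic dataset with hypothetical columns x, y and z.
rng = np.random.default_rng(7)
n = 100_000
x = rng.normal(size=n)
big = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(size=n),
    "z": rng.normal(size=n),
})

# Plot a reproducible random subsample instead of every row.
sample = big.sample(n=2_000, random_state=0)

# Low alpha lets dense regions show up as darker areas.
axes = scatter_matrix(sample[["x", "y", "z"]], alpha=0.2, figsize=(6, 6), diagonal="hist")
```

Subsampling trades completeness for speed, so rare outliers may not appear in the plot.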

Summary

Pair plots can be very useful for visualising the relationships between variables in your dataset but do have some limitations. They can be combined with other techniques, such as dimensionality reduction or clustering, to effectively explore your data.

If you enjoyed this article, please share it. Follow me and subscribe to the newsletter for more posts like this.
