Principal Component Analysis in Python
When working with real-world data, you rarely have a small number of features (or variables) in your dataset. And as the number of features grows, the cost of computation grows as well. This happens as the number of observations increases too, of course, but it is especially pronounced as the number of features increases.
One way around this is to use fewer features. For example, suppose you're trying to fit a linear regression model on a dataset that has hundreds of features. To see which features have the most predictive power for your dependent variable, you could fit the model to all the features and then, based on the model's output, see which ones are most significant and which of those you'd like to keep.
But wouldn't it be nice if you could identify which features of your data are going to be most useful before building a model? You could then use only those features while preserving most of the variation in your data, and end up with a less complex model that would quite possibly work better than one using the complete data. Win-win!
This is where Principal Component Analysis comes in. It's an unsupervised machine learning technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables. It's also sometimes called general factor analysis.
You know how a regression analysis aims to determine a line (or a hyperplane) of best fit to a dataset? Likewise, PCA determines several orthogonal lines (called components) of best fit to the dataset that capture (potentially) almost all the variation in your data. By orthogonal, I mean that these lines are perpendicular to each other in n-dimensional space (there are as many dimensions in this space as there are variables in your dataset).
These components are in fact linear transformations that choose a new variable system for the dataset such that the greatest amount of variance comes to lie on the first axis, the second greatest on the second axis, and so on. The image below will make this clearer: 78% of the variation is captured by the first component, and 20% by the second component (which, in this 2-D space, is orthogonal to the first), for a total of 98%.
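You can see these per-component variance fractions directly in code. Here is a minimal sketch using scikit-learn's `PCA` on some made-up correlated 2-D data (the data and numbers are illustrative, not the ones from the image above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 2-D dataset with two strongly correlated features (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

pca = PCA(n_components=2)
pca.fit(data)

# Fraction of total variance captured by each component, in decreasing order
print(pca.explained_variance_ratio_)
```

Because the two features are nearly collinear, almost all of the variance lands on the first component, just as in the 78%/20% picture described above.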
This particular image may not make it perfectly clear, but it does show that finding the components that account for most of the variation in your data can help you reduce the number of variables used in an analysis. For models such as regression, this is especially useful, since high correlations among the predictors can distort the model's results. PCA helps remove that correlation: in the sample space the principal components are orthogonal, and hence there is no correlation among them.
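The decorrelation claim is easy to check numerically. A short sketch (again with made-up data): the original features are highly correlated, but the component scores that PCA produces are not.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two highly correlated features (illustrative data, not from the article)
rng = np.random.default_rng(1)
x = rng.normal(size=500)
data = np.column_stack([x, x + rng.normal(scale=0.3, size=500)])

scores = PCA(n_components=2).fit_transform(data)

# Correlation between the original features is close to 1...
print(np.corrcoef(data.T)[0, 1])
# ...while the principal-component scores are uncorrelated (zero up to rounding)
print(np.corrcoef(scores.T)[0, 1])
```

This is why feeding the leading components into a regression sidesteps the multicollinearity problem that the raw predictors would cause.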
Long story short, if we use PCA on a dataset with a large number of variables, we can compress most of the variation explained into just a few components. To see how to actually apply PCA to your data in Python, you can go through my GitHub page Principal Component Analysis, where I'll take you step by step through the process of finding principal components using linear algebra, and also use Python's canned PCA procedure to find them.
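The full linear-algebra walkthrough lives on the GitHub page, but the core of that route amounts to roughly the following sketch (variable names and data are my own, for illustration): centre the data, eigendecompose its covariance matrix, sort the eigenvectors by eigenvalue, and project.

```python
import numpy as np

# Illustrative correlated 2-D data
rng = np.random.default_rng(2)
x = rng.normal(size=300)
data = np.column_stack([x, 3 * x + rng.normal(scale=0.5, size=300)])

# 1. Centre the data
centred = data - data.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by descending eigenvalue (i.e. variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the principal components
scores = centred @ eigvecs

# Variance ratios, analogous to sklearn's explained_variance_ratio_
print(eigvals / eigvals.sum())
```

The canned procedure (e.g. `sklearn.decomposition.PCA`) does essentially this via an SVD, but working it out by hand once makes the "orthogonal axes of greatest variance" idea concrete.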
If you are interested in reading more about PCA, be sure to check out Setosa.io's article on the topic. They've got some rad visualizations there!
Thanks for reading, and let me know your thoughts in the comments section!