Principal Component Analysis in Python
When working with real-world data, you rarely have a small number of features (or variables) in your dataset. And as the number of features grows, the cost of computation grows as well. This happens as the number of observations increases too, of course, but it is especially pronounced as the number of features increases.
One way around this is to use fewer features. For example, suppose you're trying to fit a linear regression model on a dataset that has hundreds of features. To see which features have the most predictive power for your dependent variable, you could fit the model to all the features and then, based on the model's output, see which ones are most significant and which of those you'd like to keep.
But wouldn't it be nice if you could identify which features of your data are going to be most useful before building a model? You could then use only those features while preserving most of the variation in your data, and end up with a less complex model that would quite possibly work better than one using the complete data. Win-win!
This is where Principal Component Analysis comes in. It's an unsupervised machine learning technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables. It's also sometimes called general factor analysis.
You know how a regression analysis aims to determine a line (or a hyperplane) of best fit to a dataset? Likewise, PCA determines several orthogonal lines (called components) of best fit to the dataset that capture (potentially) almost all the variation in your data. By orthogonal, I mean that these lines are perpendicular to each other in n-dimensional space (there are as many dimensions in this space as there are variables in your dataset).
These components are in fact linear transformations that choose a new variable system for the dataset such that the greatest amount of variance comes to lie on the first axis, the second greatest on the second axis, and so on. The image below will make this clearer: 78% of the variation is captured by the first component, and 20% by the second component (which, in this 2-D space, is orthogonal to the first), for a total of 98%.
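You can see these per-component variance fractions directly in code. Here is a minimal sketch using scikit-learn's `PCA` on some made-up correlated 2-D data (the data and numbers are illustrative, not the ones from the image above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 2-D dataset with two strongly correlated features (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

pca = PCA(n_components=2)
pca.fit(data)

# Fraction of total variance captured by each component, in decreasing order
print(pca.explained_variance_ratio_)
```

Because the two features are nearly collinear, almost all of the variance lands on the first component, just as in the 78%/20% picture described above.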
This particular image may not make it perfectly clear, but it does show that finding the components that account for most of the variation in your data can help you reduce the number of variables used in an analysis. For models such as regression, this is especially useful, since high correlations among the predictors can distort the model's results. PCA helps remove that correlation: in the sample space the principal components are orthogonal, and hence there is no correlation among them.
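The decorrelation claim is easy to check numerically. A short sketch (again with made-up data): the original features are highly correlated, but the component scores that PCA produces are not.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two highly correlated features (illustrative data, not from the article)
rng = np.random.default_rng(1)
x = rng.normal(size=500)
data = np.column_stack([x, x + rng.normal(scale=0.3, size=500)])

scores = PCA(n_components=2).fit_transform(data)

# Correlation between the original features is close to 1...
print(np.corrcoef(data.T)[0, 1])
# ...while the principal-component scores are uncorrelated (zero up to rounding)
print(np.corrcoef(scores.T)[0, 1])
```

This is why feeding the leading components into a regression sidesteps the multicollinearity problem that the raw predictors would cause.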
Long story short, if we use PCA on a dataset with a large number of variables, we can compress most of the variation explained into just a few components. To see how to actually apply PCA to your data in Python, you can go through my GitHub page Principal Component Analysis, where I'll take you step by step through the process of finding principal components using linear algebra, and also use Python's canned PCA procedure to find them.
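The full linear-algebra walkthrough lives on the GitHub page, but the core of that route amounts to roughly the following sketch (variable names and data are my own, for illustration): centre the data, eigendecompose its covariance matrix, sort the eigenvectors by eigenvalue, and project.

```python
import numpy as np

# Illustrative correlated 2-D data
rng = np.random.default_rng(2)
x = rng.normal(size=300)
data = np.column_stack([x, 3 * x + rng.normal(scale=0.5, size=300)])

# 1. Centre the data
centred = data - data.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by descending eigenvalue (i.e. variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the principal components
scores = centred @ eigvecs

# Variance ratios, analogous to sklearn's explained_variance_ratio_
print(eigvals / eigvals.sum())
```

The canned procedure (e.g. `sklearn.decomposition.PCA`) does essentially this via an SVD, but working it out by hand once makes the "orthogonal axes of greatest variance" idea concrete.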
If you are interested in reading more about PCA, be sure to check out Setosa.io's article on the topic. They've got some rad visualizations there!
Thanks for reading, and let me know your thoughts in the comments section!