Principal Component Analysis
Utkarsh Sharma
SME & Manager | SAP Certified Application Associate | Certified Data Scientist | Intel certified Machine Learning Instructor| Mentor
What is PCA?
Principal Component Analysis, or PCA, is a dimensionality-reduction method often used on large data sets. It transforms a large set of variables into a smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them more easily and quickly when there are no extraneous variables to process.
Why PCA?
Large datasets are increasingly widespread in many disciplines. In order to interpret such datasets, methods are required to drastically reduce their dimensionality in an interpretable way, such that most of the information in the data is preserved. Many techniques have been developed for this purpose, but principal component analysis (PCA) is one of the oldest and most widely used. Its idea is simple: reduce the dimensionality of a dataset while preserving as much ‘variability’ (i.e. statistical information) as possible.
How do we do PCA?
The basic idea behind PCA is simple: we transform all the attributes of the data into a new coordinate system, where every transformed attribute is a linear combination of the original attributes. For example, if we have 10 attributes and apply PCA to them, we get 10 transformed variables. If we want to keep only 5 of the transformed variables, we can, because they still carry part of the information from the original 10 attributes. We can reduce the number of principal components we keep, but every component we drop means losing some of the variance of the original data.
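As a rough illustration of the 10-attribute example above, here is a minimal sketch using scikit-learn. The data is random and purely hypothetical (it is not from any real dataset), and the names X, Z, weights and Z5 are made up for this example. It shows that each transformed variable is a weighted combination of the original attributes, and that keeping 5 components just means keeping the first 5 columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 10 attributes (random, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA()               # no n_components given, so all 10 components are kept
Z = pca.fit_transform(X)  # 10 transformed variables

# Each row of components_ holds the weights that combine the 10 original
# attributes into one transformed variable.
weights = pca.components_          # shape (10, 10)
X_centered = X - pca.mean_         # PCA centers the data before projecting
Z_manual = X_centered @ weights.T  # matches Z up to floating-point error

# Keeping only the first 5 transformed variables
Z5 = Z[:, :5]
print(Z.shape, Z5.shape)
```

The components come out ordered by the variance they capture, which is why taking the first 5 columns keeps the most informative ones.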
That’s why we keep only the most important principal components: the ones that together capture most of the variance of our original dataset.
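One common way to decide how many components are "important" is to look at the cumulative share of variance they explain. The sketch below assumes hypothetical synthetic data (10 attributes generated from 3 hidden factors plus a little noise) and an arbitrary 95% threshold; neither comes from the article.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 10 attributes driven by 3 hidden factors plus small noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
# (the 95% threshold is an arbitrary choice for this sketch)
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k, cumulative[:k])
```

scikit-learn can also make this choice for you: passing a fraction such as PCA(n_components=0.95) asks it to keep enough components to explain that share of the variance.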
Below is Python code showing how PCA works on the Iris dataset:
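import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: x is the Iris feature matrix (150 samples, 4 attributes)
x = load_iris().data
pca = PCA(n_components=2)  # keep the first two principal components
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])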
Here we specified the number of principal components as 2, so the transformed data has 2 attributes that together represent most of the variance of the 4 variables in the original dataset.
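To check how much of the original variance the 2 components actually keep, you can inspect the fitted model's explained_variance_ratio_ attribute. This is a minimal sketch that assumes the Iris features are loaded with scikit-learn's load_iris and used without scaling.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data       # 150 samples, 4 attributes
pca = PCA(n_components=2)
pca.fit(x)

ratio = pca.explained_variance_ratio_
print(ratio)        # share of variance captured by each component
print(ratio.sum())  # total share kept by the two components
```

If the features were standardized first (for example with StandardScaler), the ratios would come out differently, because PCA is sensitive to the scale of each attribute.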