Role of PCA in current data science


Why read this?

PCA reduces the size of the feature vector and eliminates redundant features, which helps improve the robustness of an ML model. If you would like to know its role in current data science, this document will help.

Technical explanation

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in building predictive machine learning models.

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables that are correlated with each other, either heavily or lightly, while retaining as much of the variation present in the data set as possible.



This is done by transforming the variables into a new set of variables, known as the principal components (or simply, the PCs). The PCs are orthogonal and ordered such that the variation retained from the original variables decreases as we move down the order; in this way, the 1st principal component retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, which is symmetric, and hence they are orthogonal.
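
As a concrete sketch of this idea, the snippet below fits scikit-learn's PCA on two synthetic correlated variables and shows that the resulting components are ordered by decreasing variance and mutually uncorrelated. The data, seed, and sizes are illustrative assumptions, not values from this article.

    # Minimal sketch: fit PCA on correlated 2-D data and inspect the components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])  # two correlated variables

    pca = PCA(n_components=2)
    scores = pca.fit_transform(data)          # the data expressed in the new (uncorrelated) coordinates

    print(pca.components_)                    # eigenvectors: directions of the new feature space
    print(pca.explained_variance_)            # eigenvalues, in decreasing order
    print(np.round(np.cov(scores.T), 6))      # covariance of the PCs: (near) diagonal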


Illustration with 2-D

Before PCA



After PCA

Note the line that represents the 2-D points in 1-D. This line is derived from the leading eigenvector of the data's covariance matrix, and that eigenvector is used as the principal component.

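A minimal numerical sketch of this 2-D to 1-D step (with made-up data) could look like the following: compute the covariance matrix of the points, take its leading eigenvector, and use it as the line onto which the points are projected.

    # Sketch: project 2-D points onto the leading eigenvector of their
    # covariance matrix, i.e. represent them in 1-D (data is illustrative).
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    points = np.column_stack([x, 0.6 * x + 0.2 * rng.normal(size=200)])

    centered = points - points.mean(axis=0)
    cov = np.cov(centered.T)                       # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: for symmetric matrices
    pc1 = eigvecs[:, np.argmax(eigvals)]           # leading eigenvector = 1st principal component

    coords_1d = centered @ pc1                     # 1-D coordinates along the line
    reconstructed = np.outer(coords_1d, pc1)       # the points as they lie on that line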


PCA Impact on variance

PCA replaces the original variables with new variables, called principal components, which are orthogonal (i.e. they have zero covariances) and have variances (called eigenvalues) in decreasing order.

As an example, below is the covariance matrix of 3 variables. Their variances are on the diagonal, and the sum of the 3 diagonal values (3.448) is the overall variability.


   1.343730519   -.160152268    .186470243 
   -.160152268    .619205620   -.126684273 
    .186470243   -.126684273   1.485549631

Now, after PCA, the covariance matrix between the principal components extracted from the above data is shown below. Note that the off-diagonal covariances are now zero.

   1.651354285    .000000000    .000000000 
    .000000000   1.220288343    .000000000 
    .000000000    .000000000    .576843142

Note that the diagonal sum is still 3.448, which says that all 3 components account for all the multivariate variability. The 1st principal component accounts for or "explains" 1.651/3.448 = 47.9% of the overall variability; the 2nd one explains 1.220/3.448 = 35.4% of it; the 3rd one explains .577/3.448 = 16.7% of it.
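
These numbers can be reproduced directly: eigen-decomposing the covariance matrix above gives the eigenvalues (the variances of the principal components) and their shares of the total variability. A short NumPy sketch:

    # Sketch reproducing the numbers above: eigen-decompose the 3x3 covariance
    # matrix and express each eigenvalue as a share of the total variance.
    import numpy as np

    cov = np.array([[ 1.343730519, -0.160152268,  0.186470243],
                    [-0.160152268,  0.619205620, -0.126684273],
                    [ 0.186470243, -0.126684273,  1.485549631]])

    eigvals = np.linalg.eigvalsh(cov)[::-1]        # eigenvalues in decreasing order
    total = eigvals.sum()                          # same as trace(cov) = 3.448...
    print(eigvals)                                 # ~ [1.651, 1.220, 0.577]
    print(eigvals / total)                         # ~ [0.479, 0.354, 0.167]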


Role of eigenvectors and eigenvalues

The eigenvectors and eigenvalues of a covariance (or correlation) matrix form the "core" of a PCA. The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitudes (variances). In PCA they are ordered by decreasing eigenvalue.

What if the eigenvalues are not real numbers but complex?

The eigenvalues of a covariance matrix are always real numbers, so this case will not arise. A covariance matrix is symmetric and positive semi-definite (non-negative definite), and therefore its eigenvalues are always real and non-negative.
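
A quick numerical check of this claim, on arbitrary illustrative data:

    # Check: a covariance matrix is symmetric positive semi-definite,
    # so its eigenvalues are real and non-negative (data is illustrative).
    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.normal(size=(100, 5))
    cov = np.cov(data.T)

    eigvals = np.linalg.eigvals(cov)                  # general routine: may return complex dtype
    print(np.allclose(eigvals.imag, 0.0))             # True: imaginary parts are (numerically) zero
    print(np.all(np.linalg.eigvalsh(cov) >= -1e-12))  # True, up to floating-point error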


Use in machine learning
  • The ability to generalize correctly becomes exponentially harder as the dimensionality of the training data grows (the curse of dimensionality). PCA helps here by reducing the dimensionality.
  • PCA can also be used as a filtering approach for noisy data. The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise, so if you reconstruct the data using just the largest principal components, you preferentially keep the signal and throw out the noise (see the sketch after this list).
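
Below is a minimal denoising sketch along these lines, using scikit-learn's PCA on synthetic data; the dataset, noise level, and number of retained components are illustrative assumptions, not values from the article.

    # Sketch: PCA-based noise filtering. Keep only the high-variance components
    # and reconstruct the data from them (all numbers here are illustrative).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 20))   # data that truly lives in 2-D
    noisy = signal + 0.1 * rng.normal(size=signal.shape)            # add isotropic noise

    pca = PCA(n_components=2).fit(noisy)                            # keep the 2 largest components
    denoised = pca.inverse_transform(pca.transform(noisy))          # project, then reconstruct

    # Mean absolute error against the clean signal, before vs. after filtering
    print(np.abs(noisy - signal).mean(), np.abs(denoised - signal).mean())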

Refer to the Colab example linked in the references below (sourced from the Python Data Science Handbook).

Impact of PCA on the dataset

PCA affects one statistical property of the dataset in particular: its variance. PCA transforms the original variables into a new set of variables, and during this transformation the variance is redistributed across the new variables (the total variance stays the same, as in the example above).

When reducing the dimensions of data, it is important not to lose more information than is necessary. The variation in a data set can be seen as representing the information that we would like to keep. Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data while keeping as much variation as possible. Selecting the new dimensionality (a hyperparameter) is therefore important: the more components you keep, the more variation, and hence information, is preserved (see the sketch below).
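
One common, concrete way to choose this hyperparameter is to inspect the cumulative explained variance. The sketch below uses illustrative random data and assumes a 95% retention target:

    # Sketch: inspect cumulative explained variance to choose how many
    # components to keep (data and threshold are illustrative).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(4)
    data = rng.normal(size=(300, 10))
    pca = PCA().fit(data)                                  # fit all components

    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_keep = int(np.searchsorted(cumulative, 0.95) + 1)    # smallest k retaining >= 95% variance
    print(cumulative, n_keep)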

Reasoning for excluding low-valued components

Low-valued components have low variances (their eigenvalues are small). Since the scatter along these directions is low, their values can be safely approximated by the mean, so the loss of information is insignificant. Hence, dropping the low-valued components is fine. This is the reason behind the power of this algorithm: dimensionality reduction with good performance.

Relevance with neural networks

We use PCA to reduce the dimensions of the dataset so that, when the resulting dataset is fed to a machine learning algorithm such as a neural network, the computational time needed to train it decreases.
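
As a sketch of this workflow, the snippet below puts PCA in front of a small scikit-learn neural network (MLPClassifier); the dataset, the 20 retained components, and the layer size are assumptions chosen only for the example:

    # Sketch: PCA as a preprocessing step before a small neural network,
    # so the network trains on fewer input features (all choices illustrative).
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)             # 64 input features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = make_pipeline(PCA(n_components=20),      # 64 -> 20 dimensions
                          MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))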

Selecting hyperparameters

Although the algorithm takes the number of components as its parameter, that is usually not what we actually care about. In practice, it is better to decide on the percentage of variance that needs to be preserved and derive the number of components from that (see the sketch below).
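
With scikit-learn this can be expressed directly: passing a fraction between 0 and 1 as n_components asks PCA to keep the smallest number of components that preserves at least that share of the variance. A minimal sketch (the 95% target and the data are illustrative):

    # Sketch: choose the number of components by the fraction of variance to preserve.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(5)
    data = rng.normal(size=(500, 30))
    pca = PCA(n_components=0.95).fit(data)          # preserve 95% of the variance
    print(pca.n_components_)                        # number of components actually kept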


References
Thanks to these helping hands
https://youtu.be/JlmJ5PEmIOo

https://en.wikipedia.org/wiki/Principal_component_analysis

https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial

https://www.geeksforgeeks.org/ml-principal-component-analysispca/

https://www.researchgate.net/post/How-can-PCA-reduce-the-size-of-the-feature-vector-and-eliminate-the-redundant-features

https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579?newreg=61a6aab73cbe4720b00fbf52bd8b9afd

https://www.youtube.com/watch?v=6Pv2txQVhxA

https://www.qlucore.com/news/the-benefits-of-principal-component-analysis-pca

https://medium.com/analytics-vidhya/merging-principal-component-analysis-pca-with-artificial-neural-networks-1ea6dad2c095

https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb#scrollTo=Ao1YFn0OcBD8

https://stats.stackexchange.com/questions/247260/principal-component-analysis-eliminate-noise-in-the-data

https://medium.com/@dareyadewumi650/understanding-the-role-of-eigenvectors-and-eigenvalues-in-pca-dimensionality-reduction-10186dad0c5c

https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

https://youtu.be/YSqFB7Srx-4

https://youtu.be/YSqFB7Srx-4?t=1082

https://math.stackexchange.com/questions/2026480/covariance-matrix-with-complex-eigenvalues

https://youtu.be/YSqFB7Srx-4?t=2205
