Role of PCA in current data science
Deepak Kumar
Propelling AI To Reinvent The Future || Mentor|| Leader || Innovator || Machine learning Specialist || Distributed architecture | IoT | Cloud Computing
Why read this?
PCA reduces the size of the feature vector and eliminates redundant features, which helps improve the robustness of an ML model. If you would like to know its role in current data science, this document will help.
Technical explanation
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in building predictive machine learning models.
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables that are correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible.
This is done by transforming the variables into a new set of variables, known as the principal components (or simply, the PCs). The PCs are orthogonal and ordered so that the variation retained from the original variables decreases as we move down the order; the 1st principal component therefore retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.
Illustration with 2-D
[Figure: scatter of correlated 2-D points before PCA]
[Figure: the same points projected onto a single line after PCA]
Note the line that represents the 2-D points in 1-D. This line is derived using the eigenvectors of the data's covariance matrix; the leading eigenvector is used as the principal component.
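A minimal sketch of this 2-D illustration, assuming scikit-learn's PCA and synthetic correlated data (the variable names and data are illustrative, not taken from the original figures):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 2-D points: the second coordinate roughly follows the first
x = rng.normal(size=200)
points_2d = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

pca = PCA(n_components=1)
points_1d = pca.fit_transform(points_2d)   # coordinates of each point along the line
line_direction = pca.components_[0]        # the leading eigenvector (principal axis)

print("Principal axis (eigenvector):", line_direction)
print("Variance explained by the 1-D projection:", pca.explained_variance_ratio_[0])
```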
Impact of PCA on variance
PCA replaces the original variables with new variables, called principal components, which are orthogonal (i.e. they have zero covariance with each other) and whose variances (called eigenvalues) appear in decreasing order.
As an example, below is the covariance matrix of three variables. Their variances are on the diagonal, and the sum of the three diagonal values (3.448) is the overall variability.
 1.343730519  -0.160152268   0.186470243
-0.160152268   0.619205620  -0.126684273
 0.186470243  -0.126684273   1.485549631
Now, after PCA, the covariance matrix between the principal components extracted from the above data is shown below. Note that the off-diagonal covariances are now zero.
 1.651354285   0.000000000   0.000000000
 0.000000000   1.220288343   0.000000000
 0.000000000   0.000000000   0.576843142
Note that the diagonal sum is still 3.448, which says that all 3 components account for all the multivariate variability. The 1st principal component accounts for or "explains" 1.651/3.448 = 47.9% of the overall variability; the 2nd one explains 1.220/3.448 = 35.4% of it; the 3rd one explains .577/3.448 = 16.7% of it.
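The same accounting can be checked numerically. The sketch below uses synthetic data drawn with roughly the covariance structure shown above (the data itself is an assumption, not the article's original dataset): the covariance matrix of the component scores comes out diagonal, and the total variance is unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[ 1.34, -0.16,  0.19],
         [-0.16,  0.62, -0.13],
         [ 0.19, -0.13,  1.49]],   # roughly the covariance matrix shown above
    size=5000,
)

pca = PCA(n_components=3)
scores = pca.fit_transform(X)

print(np.round(np.cov(scores, rowvar=False), 3))  # ~diagonal, eigenvalues on the diagonal
print(np.trace(np.cov(X, rowvar=False)),          # total variance before PCA ...
      pca.explained_variance_.sum())              # ... equals the sum of the eigenvalues
print(pca.explained_variance_ratio_)              # share of variance each component explains
```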
Role of eigenvectors and eigenvalues
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA. The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In the PCA output, they are ordered by decreasing eigenvalue.
What if the eigenvalues and eigenvectors are not real numbers, but complex?
Eigenvalues of a covariance matrix are always real numbers, so this case will not arise. Note that a covariance matrix is symmetric and positive semi-definite (non-negative definite), and therefore its eigenvalues are always real and non-negative.
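A small numpy sketch of this, reusing the example covariance matrix from above (np.linalg.eigh is the eigendecomposition routine for symmetric matrices and returns real values):

```python
import numpy as np

cov = np.array([[ 1.3437, -0.1602,  0.1865],
                [-0.1602,  0.6192, -0.1267],
                [ 0.1865, -0.1267,  1.4855]])   # the example covariance matrix above

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # real eigenvalues for a symmetric matrix
order = np.argsort(eigenvalues)[::-1]            # sort in decreasing order, PCA-style

print(eigenvalues[order])      # roughly [1.651, 1.220, 0.577]: real and non-negative
print(eigenvectors[:, order])  # columns are the principal directions
```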
Use in machine learning
- The ability to generalize correctly becomes exponentially harder as the dimensionality of the training dataset grows. PCA helps here by reducing the dimensionality.
- PCA can also be used as a filtering approach for noisy data. The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise. So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise; a sketch of this idea follows the list.
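A hedged sketch of the noise-filtering idea, assuming scikit-learn's digits dataset as in the Colab example referenced below (the noise level and variance threshold are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
rng = np.random.default_rng(42)
noisy = digits.data + rng.normal(scale=4.0, size=digits.data.shape)  # add Gaussian noise

# Keep only the high-variance components (enough to explain 50% of the variance),
# then reconstruct: the low-variance, noise-dominated directions are discarded.
pca = PCA(n_components=0.50, svd_solver="full")
denoised = pca.inverse_transform(pca.fit_transform(noisy))

print("components kept:", pca.n_components_)
print("reconstruction error vs. clean data:", np.mean((denoised - digits.data) ** 2))
```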
Refer to the Colab example from the Python Data Science Handbook (linked in the references below).
Impact of PCA on the dataset
PCA affects a key statistical property of the dataset, namely its variance. PCA transforms the original variables into a new set of variables, and during this transformation the variance carried by each variable changes.
When reducing the dimensions of data, it is important not to lose more information than necessary. The variation in a dataset can be seen as representing the information that we would like to keep. Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data while keeping as much variation as possible. Selecting the new dimensionality is an important hyperparameter: the more components you retain, the more variation (and hence information) is preserved.
Reasoning for excluding low-variance components
Low-valued components have low variances (their eigenvalues are small). Along these directions the data barely deviates from its mean, so the values can be safely approximated by the mean (since the scatter is low) and the loss of information is not significant. Hence dropping the low-valued components is fine. This is the reason behind the power of this algorithm: dimensionality reduction along with good performance.
Relevance with neural networks
We use PCA to reduce the dimensionality of the dataset so that, when the resulting dataset is fed to a machine learning algorithm such as a neural network, the computational time needed to train the algorithm decreases.
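A sketch of this pattern, assuming scikit-learn's digits dataset and an MLPClassifier as the neural network (the component count and network size are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),      # PCA is sensitive to feature scale
    PCA(n_components=20),  # 64 pixel features -> 20 components before training
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```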
Selecting hyperparameters
Although the algorithm takes the number of components as its input, the actual need is usually expressed differently. In such cases, it is better to decide on the percentage of variance that needs to be preserved and derive the number of components from that.
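A minimal sketch of this, using the fact that scikit-learn's PCA accepts a float between 0 and 1 for n_components and then keeps the smallest number of components that preserves that share of the variance (the 95% target and the dataset are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=0.95, svd_solver="full")  # preserve at least 95% of the variance
X_reduced = pca.fit_transform(X)

print("components chosen:", pca.n_components_)
print("variance actually preserved:", pca.explained_variance_ratio_.sum())
```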
References
Thanks to these helping hands
https://youtu.be/JlmJ5PEmIOo
https://en.wikipedia.org/wiki/Principal_component_analysis
https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial
https://www.geeksforgeeks.org/ml-principal-component-analysispca/
https://www.researchgate.net/post/How-can-PCA-reduce-the-size-of-the-feature-vector-and-eliminate-the-redundant-features
https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579?newreg=61a6aab73cbe4720b00fbf52bd8b9afd
https://www.youtube.com/watch?v=6Pv2txQVhxA
https://www.qlucore.com/news/the-benefits-of-principal-component-analysis-pca
https://medium.com/analytics-vidhya/merging-principal-component-analysis-pca-with-artificial-neural-networks-1ea6dad2c095
https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb#scrollTo=Ao1YFn0OcBD8
https://stats.stackexchange.com/questions/247260/principal-component-analysis-eliminate-noise-in-the-data
https://medium.com/@dareyadewumi650/understanding-the-role-of-eigenvectors-and-eigenvalues-in-pca-dimensionality-reduction-10186dad0c5c
https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
https://youtu.be/YSqFB7Srx-4
https://youtu.be/YSqFB7Srx-4?t=1082
https://math.stackexchange.com/questions/2026480/covariance-matrix-with-complex-eigenvalues
https://youtu.be/YSqFB7Srx-4?t=2205