Role of PCA in current data science
Deepak Kumar
Propelling AI To Reinvent The Future || Mentor|| Leader || Innovator || Machine learning Specialist || Distributed architecture | IoT | Cloud Computing
Why read this?
PCA reduces the size of the feature vector and eliminates redundant features, which helps improve the robustness of an ML model. If you would like to know its role in current data science, this document will help.
Technical explanation
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in building predictive machine learning models.
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables that are correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible.
This is done by transforming the variables into a new set of variables, known as the principal components (or simply, the PCs). The PCs are orthogonal and ordered so that the variation retained from the original variables decreases as we move down the order; the 1st principal component therefore retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.
Illustration with 2-D
[Figure: scatter of correlated 2-D points before PCA]
[Figure: the same points projected onto a single line after PCA]
Note the line that represents the 2-D points in 1-D. This line is derived using the eigenvectors of the data's covariance matrix; the leading eigenvector is used as the principal component.
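A minimal sketch of this 2-D illustration, assuming scikit-learn's PCA and synthetic correlated data (the variable names and data are illustrative, not taken from the original figures):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 2-D points: the second coordinate roughly follows the first
x = rng.normal(size=200)
points_2d = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

pca = PCA(n_components=1)
points_1d = pca.fit_transform(points_2d)   # coordinates of each point along the line
line_direction = pca.components_[0]        # the leading eigenvector (principal axis)

print("Principal axis (eigenvector):", line_direction)
print("Variance explained by the 1-D projection:", pca.explained_variance_ratio_[0])
```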
Impact of PCA on variance
PCA replaces the original variables with new variables, called principal components, which are orthogonal (i.e. they have zero covariance with each other) and whose variances (called eigenvalues) appear in decreasing order.
As an example, below is the covariance matrix of three variables. Their variances are on the diagonal, and the sum of the three diagonal values (3.448) is the overall variability.
 1.343730519  -0.160152268   0.186470243
-0.160152268   0.619205620  -0.126684273
 0.186470243  -0.126684273   1.485549631
Now, after PCA, the covariance matrix between the principal components extracted from the above data is shown below. Note that the off-diagonal covariances are now zero.
 1.651354285   0.000000000   0.000000000
 0.000000000   1.220288343   0.000000000
 0.000000000   0.000000000   0.576843142
Note that the diagonal sum is still 3.448, which says that all 3 components account for all the multivariate variability. The 1st principal component accounts for or "explains" 1.651/3.448 = 47.9% of the overall variability; the 2nd one explains 1.220/3.448 = 35.4% of it; the 3rd one explains .577/3.448 = 16.7% of it.
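The same accounting can be checked numerically. The sketch below uses synthetic data drawn with roughly the covariance structure shown above (the data itself is an assumption, not the article's original dataset): the covariance matrix of the component scores comes out diagonal, and the total variance is unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[ 1.34, -0.16,  0.19],
         [-0.16,  0.62, -0.13],
         [ 0.19, -0.13,  1.49]],   # roughly the covariance matrix shown above
    size=5000,
)

pca = PCA(n_components=3)
scores = pca.fit_transform(X)

print(np.round(np.cov(scores, rowvar=False), 3))  # ~diagonal, eigenvalues on the diagonal
print(np.trace(np.cov(X, rowvar=False)),          # total variance before PCA ...
      pca.explained_variance_.sum())              # ... equals the sum of the eigenvalues
print(pca.explained_variance_ratio_)              # share of variance each component explains
```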
Role of eigenvectors and eigenvalues
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA. The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In the PCA output, they are ordered by decreasing eigenvalue.
What if the eigenvalues and eigenvectors are not real numbers, but complex?
Eigenvalues of a covariance matrix are always real numbers, so this case will not arise. Note that a covariance matrix is symmetric and positive semi-definite (non-negative definite), and therefore its eigenvalues are always real and non-negative.
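A small numpy sketch of this, reusing the example covariance matrix from above (np.linalg.eigh is the eigendecomposition routine for symmetric matrices and returns real values):

```python
import numpy as np

cov = np.array([[ 1.3437, -0.1602,  0.1865],
                [-0.1602,  0.6192, -0.1267],
                [ 0.1865, -0.1267,  1.4855]])   # the example covariance matrix above

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # real eigenvalues for a symmetric matrix
order = np.argsort(eigenvalues)[::-1]            # sort in decreasing order, PCA-style

print(eigenvalues[order])      # roughly [1.651, 1.220, 0.577]: real and non-negative
print(eigenvectors[:, order])  # columns are the principal directions
```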
Use in machine learning
- The ability to generalize correctly becomes exponentially harder as the dimensionality of the training dataset grows. PCA helps here by reducing the dimensionality.
- PCA can also be used as a filtering approach for noisy data. The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise. So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise; a sketch of this idea follows the list.
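A hedged sketch of the noise-filtering idea, assuming scikit-learn's digits dataset as in the Colab example referenced below (the noise level and variance threshold are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
rng = np.random.default_rng(42)
noisy = digits.data + rng.normal(scale=4.0, size=digits.data.shape)  # add Gaussian noise

# Keep only the high-variance components (enough to explain 50% of the variance),
# then reconstruct: the low-variance, noise-dominated directions are discarded.
pca = PCA(n_components=0.50, svd_solver="full")
denoised = pca.inverse_transform(pca.fit_transform(noisy))

print("components kept:", pca.n_components_)
print("reconstruction error vs. clean data:", np.mean((denoised - digits.data) ** 2))
```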
Refer to the Colab example from the Python Data Science Handbook (linked in the references below).
Impact of PCA on the dataset
PCA affects a key statistical property of the dataset, namely its variance. PCA transforms the original variables into a new set of variables, and during this transformation the variance carried by each variable changes.
When reducing the dimensions of data, it is important not to lose more information than necessary. The variation in a dataset can be seen as representing the information that we would like to keep. Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data while keeping as much variation as possible. Selecting the new dimensionality is an important hyperparameter: the more components you retain, the more variation (and hence information) is preserved.
Reasoning for excluding low-variance components
Low-valued components have low variances (their eigenvalues are small). Along these directions the data barely deviates from its mean, so the values can be safely approximated by the mean (since the scatter is low) and the loss of information is not significant. Hence dropping the low-valued components is fine. This is the reason behind the power of this algorithm: dimensionality reduction along with good performance.
Relevance with neural networks
We use PCA to reduce the dimensionality of the dataset so that, when the resulting dataset is fed to a machine learning algorithm such as a neural network, the computational time needed to train the algorithm decreases.
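A sketch of this pattern, assuming scikit-learn's digits dataset and an MLPClassifier as the neural network (the component count and network size are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),      # PCA is sensitive to feature scale
    PCA(n_components=20),  # 64 pixel features -> 20 components before training
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```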
Selecting hyperparameters
Although the algorithm takes the number of components as its input, the actual need is usually expressed differently. In such cases, it is better to decide on the percentage of variance that needs to be preserved and derive the number of components from that.
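A minimal sketch of this, using the fact that scikit-learn's PCA accepts a float between 0 and 1 for n_components and then keeps the smallest number of components that preserves that share of the variance (the 95% target and the dataset are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=0.95, svd_solver="full")  # preserve at least 95% of the variance
X_reduced = pca.fit_transform(X)

print("components chosen:", pca.n_components_)
print("variance actually preserved:", pca.explained_variance_ratio_.sum())
```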
References
Thanks to these helping hands
https://youtu.be/JlmJ5PEmIOo
https://en.wikipedia.org/wiki/Principal_component_analysis
https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial
https://www.geeksforgeeks.org/ml-principal-component-analysispca/
https://www.researchgate.net/post/How-can-PCA-reduce-the-size-of-the-feature-vector-and-eliminate-the-redundant-features
https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579?newreg=61a6aab73cbe4720b00fbf52bd8b9afd
https://www.youtube.com/watch?v=6Pv2txQVhxA
https://www.qlucore.com/news/the-benefits-of-principal-component-analysis-pca
https://medium.com/analytics-vidhya/merging-principal-component-analysis-pca-with-artificial-neural-networks-1ea6dad2c095
https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb#scrollTo=Ao1YFn0OcBD8
https://stats.stackexchange.com/questions/247260/principal-component-analysis-eliminate-noise-in-the-data
https://medium.com/@dareyadewumi650/understanding-the-role-of-eigenvectors-and-eigenvalues-in-pca-dimensionality-reduction-10186dad0c5c
https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
https://youtu.be/YSqFB7Srx-4
https://youtu.be/YSqFB7Srx-4?t=1082
https://math.stackexchange.com/questions/2026480/covariance-matrix-with-complex-eigenvalues
https://youtu.be/YSqFB7Srx-4?t=2205