Dimension Reduction - Principal Component Analysis (aka PCA)


In an era when data flows in from everywhere, we often end up gathering far too many features for a dataset. Some of these features are less important than others for explaining the target variable, and some of the accumulated features may also be correlated with each other. In such cases we end up overfitting the model by using too many features. A dataset with a large number of features also takes much more space to store, and it becomes harder to visualize and maintain.

To overcome these problems, the concept of dimension reduction comes into play. Dimension reduction compresses a large set of features into a new, lower-dimensional feature set without losing the important information. Although it is not always possible to retain all of the information after reducing the dimension of the data, the technique helps us carry over as much information as possible.

Different dimension reduction techniques are used depending on the type of machine learning, supervised or unsupervised. In this article we will look at the widely used technique called Principal Component Analysis (PCA).

Let us first understand some of the associated terminologies:

  • Dimensionality: It is the number of random variables in a dataset or simply the number of features, or rather more simply, the number of columns present in your dataset.
  • Correlation: It shows how strongly two variables are related to each other. Its value ranges from -1 to +1. A positive value indicates that when one variable increases, the other increases as well, while a negative value indicates that the other decreases as the former increases. The absolute value indicates the strength of the relationship.
  • Orthogonal: Uncorrelated with each other, i.e., the correlation between any pair of variables is 0.
  • Eigenvectors: Eigenvectors and eigenvalues are a big topic in themselves, so let us restrict ourselves to the knowledge we need here. Consider a non-zero vector v. It is an eigenvector of a square matrix A if Av is a scalar multiple of v. Or simply:

Av = λv

Here, v is the eigenvector and λ is the eigenvalue associated with it.

I will cover eigenvectors and eigenvalues in more detail in the next article; a short R sketch after this list of terms shows the relation in practice.

  • Covariance Matrix: This matrix consists of the covariances between the pairs of variables. The (i,j)th element is the covariance between i-th and j-th variable.
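
As promised above, here is a minimal sketch in base R that ties the last two terms together: it builds a small covariance matrix from made-up toy data and checks the defining relation Av = λv with eigen(). Everything in it (the data, the variable names) is invented purely for illustration.

    # Toy data: 100 observations of 3 features
    set.seed(42)
    X <- matrix(rnorm(100 * 3), ncol = 3)
    A <- cov(X)                                  # 3 x 3 covariance matrix

    e <- eigen(A)                                # eigen decomposition
    v      <- e$vectors[, 1]                     # first eigenvector
    lambda <- e$values[1]                        # its eigenvalue

    # Verify the defining relation A v = lambda v
    all.equal(as.vector(A %*% v), lambda * v)    # TRUE (up to floating point)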


Although PCA can be applied in both supervised and unsupervised learning, it is mainly used in the latter. It is used to reduce the dimension of images, textual content, speech data and so on, and can help de-noise the data and detect patterns in it. Reducing the number of variables of a dataset naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller datasets are easier to explore and visualize, and machine learning algorithms can analyse them much faster without extraneous variables to process.

Step by step explanation of PCA:

Step1: Standardization

The scaling or standardization of the data is the first and foremost requirement for PCA. Consider a scenario where a dataset has two features that define the output/predicted feature. One of these features ranges from 0 to 100 whereas the other ranges from 0 to 1. If we use these values as they are, the feature with the bigger range (0 to 100) will dominate the feature with the smaller range (0 to 1), resulting in a biased outcome.

To overcome this problem, all the features need to be brought onto the same scale.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.

z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of that variable.


Once the standardization is done, all the variables will be transformed to the same scale.
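
As a hedged sketch of this step in R, base R's scale() performs exactly this transformation column by column. The data frame and column names below are made up purely for illustration:

    # Two hypothetical features on very different scales
    df <- data.frame(
      area  = c(1200, 850, 2300, 1500),    # values in the hundreds/thousands
      ratio = c(0.2, 0.9, 0.4, 0.7)        # values between 0 and 1
    )

    df_std <- scale(df, center = TRUE, scale = TRUE)   # (x - mean) / sd per column

    colMeans(df_std)            # approximately 0 for every column
    apply(df_std, 2, sd)        # exactly 1 for every column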

Step2: Covariance Matrix

The covariance matrix plays a very important role in PCA. The relationship between two features of the dataset is determined using this matrix. Sometimes features of the dataset are correlated with each other so closely that they provide redundant information about the target feature.

The covariance matrix is an n x n symmetric matrix (where n is the number of features in the dataset) whose entries are the covariances of every pair of features. For example, a dataset with three features (a, b, c) has a 3 x 3 covariance matrix.

The covariance of a feature with itself is its variance (Cov(a,a) = Var(a)); these variances sit on the main diagonal of the covariance matrix, and they all equal 1 once the data has been standardized.

Also, since covariance is commutative (Cov(a,b) = Cov(b,a)), the values are symmetric across the main diagonal.

Cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

Covariance can be calculated using the above formula, where xi and yi are the i-th values of x and y, and x̄ and ȳ are their respective means.

When the covariance is normalized by the standard deviations of the two features, we get the correlation, whose value varies from -1 to 1. The relationship is stronger when the value is closer to -1 or 1 and weaker when the value is closer to 0.

A positive correlation indicates that one feature increases or decreases as the other increases or decreases, whereas a negative correlation indicates that one feature decreases as the other increases, and vice versa.
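
Continuing the hypothetical df_std from the standardization sketch above, the covariance and correlation matrices can be obtained in R as follows; note that once the data is standardized the two matrices coincide:

    S <- cov(df_std)    # n x n symmetric matrix; the diagonal holds the variances (1 after scaling)
    R <- cor(df_std)    # correlation matrix

    isSymmetric(S)      # TRUE: cov(a, b) equals cov(b, a)
    all.equal(S, R)     # TRUE, because the data has been standardized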


Step3: Eigenvectors & Eigenvalues

Let us first understand principal components before we get into eigenvectors and eigenvalues. Principal components are a new set of features constructed from the initial features in such a way that they carry the same information while being uncorrelated with each other.

So, the idea is that a 5-dimensional dataset gives us 5 principal components, but PCA tries to compress as much of the information contained in the initial five features as possible into the first principal component. The maximum of the remaining information that could not fit into the first principal component is then squeezed into the second principal component, and so on.

This way, the information contained in one principal component is always higher than in the next one. Organizing the information like this lets us drop the components at the end of the list and still carry as much information as possible.

Let’s go back to the eigenvectors and eigenvalues, which are the driving factors behind PCA. They come in pairs: every eigenvector has an eigenvalue, and the number of eigenvectors equals the number of features in the dataset.

For example, a dataset with three features has a 3 x 3 covariance matrix, which gives three eigenvectors and three eigenvalues associated with them.

The eigenvectors of the covariance matrix are the directions of the axes along which the data has the most variance (the most information), and these are what we call principal components. The eigenvalues are simply the coefficients attached to the eigenvectors and give the amount of variance carried by each principal component.

We get the principal components in order of significance by ranking the eigenvectors by their eigenvalues. The higher the eigenvalue, the more variance, and hence the more information, the corresponding principal component carries.
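
A minimal sketch of this step, again on the hypothetical standardized data from the earlier sketches: eigen() already returns the eigenvalues in decreasing order, so the ranking comes for free, and dividing each eigenvalue by their sum gives the proportion of variance carried by each component.

    e <- eigen(cov(df_std))     # eigenvalues are returned in decreasing order

    e$values                    # variance carried by each principal component
    e$vectors                   # the corresponding directions (principal components)

    prop_var <- e$values / sum(e$values)   # proportion of total variance per component
    cumsum(prop_var)                       # cumulative variance: keep enough components
                                           # to retain the information you need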


Step4: Feature Vector

We now know that the principal components are created in such a way that the amount of information they carry decreases as we move from the first component to the last. It is up to the user to decide whether to keep all the principal components or to discard the less significant ones, depending on what he or she is looking for. If the goal is only a new set of uncorrelated features, and reducing the dimension of the dataset is not a priority, then all the principal components can be retained. If, however, the objective is to reduce the dimension of the dataset, then the components carrying less information can be dropped. The matrix formed by the eigenvectors of the components we decide to keep, with one eigenvector per column, is called the feature vector; projecting the standardized data onto it gives the final reduced dataset.
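
Here is a hedged sketch of this last step, reusing the toy objects (df, df_std, e) from the previous sketches: the feature vector is simply the matrix of retained eigenvectors, and multiplying the standardized data by it gives the reduced dataset. R's prcomp() wraps all four steps into a single call and produces the same scores, up to a possible sign flip per component.

    k <- 1                                              # number of components to keep
    feature_vector <- e$vectors[, 1:k, drop = FALSE]    # columns = retained eigenvectors

    reduced <- df_std %*% feature_vector                # project the standardized data
    dim(reduced)                                        # rows unchanged, k columns

    # The same result with prcomp(), which standardizes, builds the covariance
    # matrix, decomposes it and projects in one call
    pca <- prcomp(df, center = TRUE, scale. = TRUE)
    head(pca$x[, 1:k, drop = FALSE])                    # matches 'reduced' up to sign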


Example using R:

Let’s see how the code for PCA works. I have used the house price prediction dataset available on Kaggle for this example, and R is the language used.

[Screenshot: R code that builds the prediction model using the three original features]

Here we can see that the accuracy of the model built using the initial three features came out as 83.97%.
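
Since the original code is only shown as a screenshot, here is a hedged sketch of what such a baseline might look like; the file name and column names are assumptions standing in for the actual Kaggle data, not the article's exact code:

    house <- read.csv("house_prices.csv")        # assumed local copy of the Kaggle data

    # Baseline model on three original features (hypothetical column names)
    fit_raw <- lm(SalePrice ~ GrLivArea + OverallQual + YearBuilt, data = house)
    summary(fit_raw)$r.squared                   # goodness of fit on the raw features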

Now, we will use PCA to reduce the dimension of the dataset and create another model to predict the values.

[Screenshot: R code that applies PCA and builds the model on the resulting principal components]

Here we see that PCA has reduced the dimension of the dataset and we are using only two independent features. The accuracy of the model came out as 83.33%, which is almost the same as that of the prior model where dimension reduction was not used.
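
Again as a hedged sketch (using the same assumed column names as above, not the article's exact code), the PCA-based variant might look like this:

    # Run PCA on the three predictors; prcomp standardizes them for us
    pca_house <- prcomp(house[, c("GrLivArea", "OverallQual", "YearBuilt")],
                        center = TRUE, scale. = TRUE)
    summary(pca_house)                           # variance explained per component

    # Keep the first two principal components and refit the model
    pcs <- as.data.frame(pca_house$x[, 1:2])
    pcs$SalePrice <- house$SalePrice
    fit_pca <- lm(SalePrice ~ PC1 + PC2, data = pcs)
    summary(fit_pca)$r.squared                   # compare with the baseline fit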

Thus, we can say that PCA did reduce the number of features in the dataset, yet lost very little of the information those features carried for predicting the resultant feature.

Limitations of PCA:

  • Outliers: PCA is highly affected by outliers in the dataset, so proper scaling and normalization of the features is essential. PCA cannot work as expected if the data is not normalized or scaled properly.

  • Performance: PCA may lead to a decrease in model accuracy because some of the information carried by the features is lost during dimension reduction. Since the information carried by the principal components is not exactly the same as that of the initial features, there is a chance that the model underfits the available dataset.

  • Interpretability: The principal components are built from the information carried by the initial features and are uncorrelated once PCA is applied. In the process we lose the ability to interpret the relationships between the original features, and the importance of the individual features is lost once PCA combines them into principal components.

