Comparison of Multivariate Data Using Principal Component Analysis


Recently, I was working on a project where we needed to compare a population with a sample drawn from it, to verify whether the sample was representative of the population in a high-dimensional space. Comparing the two directly proved tricky because of the large number of dimensions involved.

In statistics, we have multiple univariate and bivariate techniques to analyse data. Univariate techniques focus on summarizing and understanding a single variable, using methods such as histograms, box plots, and summary statistics like mean, median, and standard deviation. Bivariate techniques, on the other hand, examine the relationship between two variables. Common bivariate methods include scatter plots, correlation coefficients, and simple linear regression.
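As a quick illustration, here is a minimal sketch of these univariate and bivariate techniques on the iris data (assuming pandas, seaborn, and matplotlib are installed; the column names come from seaborn's built-in iris dataset):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# Univariate: summary statistics and a histogram of a single variable.
print(iris["sepal_length"].describe())
iris["sepal_length"].hist(bins=20)
plt.xlabel("sepal_length")
plt.show()

# Bivariate: a scatter plot and the correlation between two variables.
iris.plot.scatter(x="sepal_length", y="petal_length")
plt.show()
print(iris["sepal_length"].corr(iris["petal_length"]))
```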

However, as data becomes more complex and multidimensional, these techniques fall short. They do not adequately capture the interactions and relationships among multiple variables.

Suppose that we wish to visualize n observations with measurements on a set of p features as part of an EDA. We could do this by examining two-dimensional scatterplots of the data, each of which contains the n observations' measurements on a pair of features. However, there are p(p − 1)/2 such scatterplots; if p is large, it will certainly not be possible to look at all of them, and most likely none of them will be informative, since each contains just a small fraction of the total information present in the data set. I have used the iris dataset with 4 features, where even the resulting 6 pairwise plots start to feel clumsy, and this only gets worse as the feature space grows.
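The pairwise view described above can be reproduced with a short sketch, assuming seaborn is available:

```python
# Pairwise scatterplots of all 4 iris features: even here, 6 panels
# (p(p - 1)/2 for p = 4) are needed, and the count grows quadratically.
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
plt.show()
```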



Clearly, a better method is required to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information as possible. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.

This is where Principal Component Analysis (PCA) comes into play.

PCA provides a way to do just this. It finds a low-dimensional representation of a data set that retains as much of the variation as possible. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally important. PCA seeks a small number of dimensions that are important, where importance is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the p features. We now explain the manner in which these dimensions, or principal components, are found.

While I will cover the inner workings of PCA in a separate article, it is important to build a basic understanding of how principal components are determined in order to support our hypothesis. The first principal component of a set of features is the normalized linear combination of the features that has the largest variance.

Given a high-dimensional data set X, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero. We then look for the linear combination of the sample feature values of the form Z1 = φ11·X1 + φ21·X2 + … + φp1·Xp that has the largest sample variance, subject to the constraint that the squared loadings φj1 sum to one (this is what makes the combination normalized).
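To make this concrete, here is a hedged sketch of computing the first principal component directly from the centered data matrix via the singular value decomposition, and checking it against scikit-learn (assuming numpy and scikit-learn are installed):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # n x p data matrix
Xc = X - X.mean(axis=0)                   # center each variable to mean zero

# The first right singular vector of the centered matrix is the loading
# vector phi_1 of the first principal component.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]
z1 = Xc @ phi1                            # scores of the first component

# Normalized: the squared loadings sum to one.
print(np.isclose(np.sum(phi1 ** 2), 1.0))
# Matches scikit-learn's first component (up to sign).
print(np.allclose(np.abs(phi1),
                  np.abs(PCA(n_components=1).fit(X).components_[0])))
```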

After the first principal component Z1 of the features has been determined, we can find the second principal component Z2. The second principal component is the linear combination of all features that has maximal variance out of all linear combinations that are uncorrelated with Z1. It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction of Z2 to be orthogonal (perpendicular) to the direction of Z1.
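This equivalence between uncorrelated scores and orthogonal directions is easy to verify numerically; a small sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]      # loading vectors of the first two components
z1, z2 = Xc @ phi1, Xc @ phi2  # the component scores Z1 and Z2

# Orthogonal directions ...
print(np.isclose(phi1 @ phi2, 0.0))
# ... imply uncorrelated scores.
print(np.isclose(np.corrcoef(z1, z2)[0, 1], 0.0))
```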

Once we have computed the principal components, we can plot them against each other to produce low-dimensional views of the data. For instance, I have plotted the iris population and a sample drawn from it on their corresponding principal components to check whether the sample is representative of the population, and to my surprise the result is quite intuitive and promising.
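For illustration, here is a minimal sketch of that comparison, treating the full iris data as the "population" and a random subset as the "sample" (the 30% sample size and the fixed seed are my assumptions, not details from the original analysis): fit PCA on the population, project both sets with the same loadings, and overlay them.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

X = load_iris().data                           # "population": all 150 rows
idx = rng.choice(len(X), size=45, replace=False)
X_sample = X[idx]                              # "sample": a random 30% subset

# Fit PCA on the population, then project BOTH sets with the same loadings,
# so the two clouds live in the same low-dimensional coordinate system.
pca = PCA(n_components=2).fit(X)
Z_pop = pca.transform(X)
Z_smp = pca.transform(X_sample)

plt.scatter(Z_pop[:, 0], Z_pop[:, 1], alpha=0.3, label="population")
plt.scatter(Z_smp[:, 0], Z_smp[:, 1], marker="x", label="sample")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
# If the sample cloud overlaps the population cloud, the sample is
# plausibly representative along the directions of greatest variance.
```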



Rosemary Tate

Independent Consultant, Biostatistics and Data science. Data quality detective.

7 months ago

PCA is so useful!


Nice one Jitender! How do we make sure we don't lose important components that might hold significant variance or important signals while we are reducing the dimensions, and how do we handle non-Gaussian data?

Bhupendra Siyag

Information Security & Networking Analyst.

9 months ago

Quite interesting Malik saab! So in essence a component here is a function of a set of features with the highest variance? How many features do you typically pick for a given component? Also interested to know whether you found PCA more effective than some popular clustering techniques when analyzing well-structured data.

Aravind Vijayan

Associate Vice President - Change Management @ NatWest Markets

9 months ago

Can we perform this analysis on non-normally distributed data?

Parbat Singh Rajpurohit

Product Head | Deliver Results | Transformative ERP Solutions for SMEs

9 months ago

Well said!
