Comparison of Multivariate Data Using Principal Component Analysis


Recently, I was working on a project where we needed to compare a population with a sample drawn from it, to verify whether the sample was representative of the population in a high-dimensional space. Comparing the two directly proved tricky because of the large number of dimensions involved.

In statistics, we have multiple univariate and bivariate techniques to analyse data. Univariate techniques focus on summarizing and understanding a single variable, using methods such as histograms, box plots, and summary statistics like mean, median, and standard deviation. Bivariate techniques, on the other hand, examine the relationship between two variables. Common bivariate methods include scatter plots, correlation coefficients, and simple linear regression.
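As a quick illustration, here is a minimal sketch of these univariate and bivariate techniques on the iris data (assuming pandas, seaborn, and matplotlib are installed; the column names come from seaborn's built-in iris dataset):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# Univariate: summary statistics and a histogram of a single variable.
print(iris["sepal_length"].describe())
iris["sepal_length"].hist(bins=20)
plt.xlabel("sepal_length")
plt.show()

# Bivariate: a scatter plot and the correlation between two variables.
iris.plot.scatter(x="sepal_length", y="petal_length")
plt.show()
print(iris["sepal_length"].corr(iris["petal_length"]))
```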

However, as data becomes more complex and multidimensional, these techniques fall short. They do not adequately capture the interactions and relationships among multiple variables.

Suppose that we wish to visualize n observations with measurements on a set of p features as part of an EDA. We could do this by examining two-dimensional scatterplots of the data, each of which contains the n observations' measurements on a pair of features. However, there are p(p − 1)/2 such scatterplots; if p is large, it will certainly not be possible to look at all of them, and most likely none of them will be informative, since each contains just a small fraction of the total information present in the data set. I have used the iris dataset with 4 features, where even the resulting 6 pairwise plots start to feel clumsy, and this only gets worse as the feature space grows.
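The pairwise view described above can be reproduced with a short sketch, assuming seaborn is available:

```python
# Pairwise scatterplots of all 4 iris features: even here, 6 panels
# (p(p - 1)/2 for p = 4) are needed, and the count grows quadratically.
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
plt.show()
```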



Clearly, a better method is required to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information as possible. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.

This is where Principal Component Analysis (PCA) comes into play.

PCA provides a way to do just this. It finds a low-dimensional representation of a data set that retains as much of the variation as possible. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally important. PCA seeks a small number of dimensions that are important, where importance is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the p features. We now explain the manner in which these dimensions, or principal components, are found.

While I will cover the inner workings of PCA in a separate article, it is important to build a basic understanding of how principal components are determined in order to support our hypothesis. The first principal component of a set of features is the normalized linear combination of the features that has the largest variance.

Given a high-dimensional data set X, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero. We then look for the linear combination of the sample feature values of the form Z1 = φ11·X1 + φ21·X2 + … + φp1·Xp that has the largest sample variance, subject to the constraint that the squared loadings φj1 sum to one (this is what makes the combination normalized).
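To make this concrete, here is a hedged sketch of computing the first principal component directly from the centered data matrix via the singular value decomposition, and checking it against scikit-learn (assuming numpy and scikit-learn are installed):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # n x p data matrix
Xc = X - X.mean(axis=0)                   # center each variable to mean zero

# The first right singular vector of the centered matrix is the loading
# vector phi_1 of the first principal component.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]
z1 = Xc @ phi1                            # scores of the first component

# Normalized: the squared loadings sum to one.
print(np.isclose(np.sum(phi1 ** 2), 1.0))
# Matches scikit-learn's first component (up to sign).
print(np.allclose(np.abs(phi1),
                  np.abs(PCA(n_components=1).fit(X).components_[0])))
```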

After the first principal component Z1 of the features has been determined, we can find the second principal component Z2. The second principal component is the linear combination of all features that has maximal variance out of all linear combinations that are uncorrelated with Z1. It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction of Z2 to be orthogonal (perpendicular) to the direction of Z1.
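This equivalence between uncorrelated scores and orthogonal directions is easy to verify numerically; a small sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]      # loading vectors of the first two components
z1, z2 = Xc @ phi1, Xc @ phi2  # the component scores Z1 and Z2

# Orthogonal directions ...
print(np.isclose(phi1 @ phi2, 0.0))
# ... imply uncorrelated scores.
print(np.isclose(np.corrcoef(z1, z2)[0, 1], 0.0))
```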

Once we have computed the principal components, we can plot them against each other to produce low-dimensional views of the data. For instance, I have plotted the iris population and a sample drawn from it on their corresponding principal components to check whether the sample is representative of the population, and to my surprise the result is quite intuitive and promising.
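For illustration, here is a minimal sketch of that comparison, treating the full iris data as the "population" and a random subset as the "sample" (the 30% sample size and the fixed seed are my assumptions, not details from the original analysis): fit PCA on the population, project both sets with the same loadings, and overlay them.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

X = load_iris().data                           # "population": all 150 rows
idx = rng.choice(len(X), size=45, replace=False)
X_sample = X[idx]                              # "sample": a random 30% subset

# Fit PCA on the population, then project BOTH sets with the same loadings,
# so the two clouds live in the same low-dimensional coordinate system.
pca = PCA(n_components=2).fit(X)
Z_pop = pca.transform(X)
Z_smp = pca.transform(X_sample)

plt.scatter(Z_pop[:, 0], Z_pop[:, 1], alpha=0.3, label="population")
plt.scatter(Z_smp[:, 0], Z_smp[:, 1], marker="x", label="sample")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
# If the sample cloud overlaps the population cloud, the sample is
# plausibly representative along the directions of greatest variance.
```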



Rosemary Tate

Independent Consultant, Biostatistics and Data science. Data quality detective.

7 months ago

PCA is so useful!


Nice one Jitender! How do we make sure we don't lose important components that might hold significant variance or important signals while we are reducing the dimensions, and how do we handle non-Gaussian data?

Bhupendra Siyag

Information Security & Networking Analyst.

9 months ago

Quite interesting Malik saab! So in essence a component here is a function of a set of features with the highest variance? How many features do you typically pick for a given component? Also interested to know whether you found PCA more effective than some popular clustering techniques when analyzing well-structured data.

Aravind Vijayan

Associate Vice President - Change Management @ NatWest Markets

9 months ago

Can we perform this analysis on non-normally distributed data?

Parbat Singh Rajpurohit

Product Head | Deliver Results | Transformative ERP Solutions for SMEs

9 months ago

Well said!
