Dimensionality Reduction by PCA using Orange

Utkarsh Sharma

SME & Manager | SAP Certified Application Associate | Certified Data Scientist | Intel certified Machine Learning Instructor | Mentor

Published: April 21, 2022

The curse of dimensionality haunts every data scientist who works with a dataset containing a large number of attributes. When we try to find patterns in such a dataset or analyze it, the high dimensionality can make the task difficult, because every new attribute added to capture variability among the entities calls for many more tuples, often hundreds, to cover the enlarged space. And as we all know, training data is always scarce. To deal with this problem, we apply dimensionality reduction techniques, which reduce the number of attributes while preserving most of the information, or variance, that the attributes provide.

One of the most widely used techniques for dimensionality reduction is Principal Component Analysis (PCA). This technique is based on eigenvectors: the principal components are the eigenvectors of the data's covariance matrix. PCA projects your original attributes onto the same number of principal components, but out of those you can keep only the ones that together represent the variance of the dataset up to the desired level. PCA is supported by almost every data analysis programming language and tool. It works almost the same way on every platform, but the presentation of the results may differ.
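The eigenvector idea can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not what Orange runs internally; the function name `pca_eig` and the made-up five-attribute dataset are assumptions for the example.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the covariance matrix (illustrative helper)."""
    Xc = X - X.mean(axis=0)                  # center every attribute
    cov = np.cov(Xc, rowvar=False)           # attribute-by-attribute covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # data expressed in principal components
    return eigvals, eigvecs, scores

# Synthetic data (assumption): 200 rows, 5 attributes, two of them nearly redundant.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    2 * base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    -base[:, 1] + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

eigvals, eigvecs, scores = pca_eig(X)
print(scores.shape)                # same number of components as attributes
print(eigvals / eigvals.sum())     # share of variance carried by each component
```

Because two pairs of columns are almost copies of each other, most of the variance ends up in the first few components, which is what makes dropping the rest reasonable.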

As an example, we will look at how PCA works on a dataset using Orange. Orange is an open-source data visualization, machine learning, and data mining toolkit. It features a visual programming front-end for explorative rapid qualitative data analysis and interactive data visualization. Let's see how we can perform PCA in Orange.

On the welcome screen of Orange, click New to create a new project. A blank project opens, and the first step is to load the dataset on which you want to perform PCA.

To load a dataset, click the File icon; an icon named File is created on the blank canvas. Double-click that icon to open its window, then browse to your desired file. If the file loads successfully, its contents are displayed. Close the window after that; there is no OK button, so don't worry. The file I loaded for this demonstration contains 19719 records with 50 attributes.

Now click on the dashed line next to the File icon and drag; it draws an arrow. When you release it, a menu pops up with various options to select.

Select the appropriate option from that menu; in our case, we select PCA.

That's it, done. You have successfully applied PCA to the dataset. Double-click the PCA icon and it will show the variance explained by the number of principal components you select. In this case, for example, selecting 22 principal components preserves 75% of the original variance. If you need to retain more of the original variance, increase the number of principal components.
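Choosing the number of components for a target level of variance comes down to a cumulative sum. The sketch below assumes you already have the variance of each component (the eigenvalues); the helper name `components_for_variance` and the four example values are hypothetical and are not taken from the 50-attribute dataset above.

```python
import numpy as np

def components_for_variance(eigvals, target):
    """Smallest number of components whose cumulative variance share reaches target."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(vals / vals.sum())                # running share of variance
    # First position where the running share is at least the target
    # (small tolerance guards against floating-point round-off).
    k = int(np.searchsorted(cumulative, target - 1e-12)) + 1
    return min(k, len(vals)), cumulative

# Hypothetical component variances, for illustration only.
example_eigvals = [4.0, 3.0, 2.0, 1.0]
k, cumulative = components_for_variance(example_eigvals, 0.75)
print(k)   # 3: the first two components cover 70%, the first three about 90%
```

Raising the target (say from 0.75 to 0.95) can only keep `k` the same or increase it, which matches the behaviour you see when adjusting the selection in the PCA window.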

You can also check how the original variables were transformed into the principal components. Click the dashed line of the PCA icon, drag the arrowhead out, and select the Data Table option. Then right-click on the link that was created, select Reset Signals, and create a link between Components and Data.

The final table shows how much each original variable contributes to each principal component.
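A table of this kind can be reproduced by hand. The sketch below assumes the table lists the weight of each original variable in each component; since each component is a unit-length vector, its squared weights sum to 1 and can be read as percentage contributions. The variable names and data are made up, and Orange's own table may present the raw weights rather than percentages.

```python
import numpy as np

# Made-up dataset (assumption): 100 rows, 3 variables, "b" closely tracks "a".
rng = np.random.default_rng(1)
names = ["a", "b", "c"]
a = rng.normal(size=100)
X = np.column_stack([a, a + 0.2 * rng.normal(size=100), rng.normal(size=100)])

Xc = X - X.mean(axis=0)
# Rows of Vt are the principal directions (unit-length weight vectors).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

share = 100 * Vt**2   # squared weights as percentages; each row sums to 100
for i, row in enumerate(share, start=1):
    cells = ", ".join(f"{n}={p:.1f}%" for n, p in zip(names, row))
    print(f"PC{i}: {cells}")
```

Reading one row at a time tells you which original variables dominate that component.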

#orangesoftware #datamining #PCA #dimensionalityreduction
