Dimensionality Reduction by PCA using Orange

The curse of dimensionality haunts every data scientist who works with datasets containing a large number of attributes. When you mine such data for patterns or analyze it, high dimensionality can make the task difficult: every new attribute added to capture more variability among the entities demands hundreds of additional tuples, and training data is almost always in short supply. To deal with this problem, we apply dimensionality reduction techniques that cut down the number of attributes while still preserving most of the information, or variance, those attributes provide.

One of the most widely used techniques for dimensionality reduction is Principal Component Analysis (PCA). The technique works on the principle of eigenvectors: PCA projects your original attributes onto the same number of principal components, but you keep only as many components as are needed to represent the variance of the dataset up to the desired level. PCA is supported by almost every data analysis language and tool; the computation is essentially the same on every platform, though the presentation of the results may differ.
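
To make the idea concrete before switching to the Orange canvas, here is a minimal, illustrative sketch of the same computation in scikit-learn. It is not the Orange widget itself, and X below is a random stand-in for a purely numeric data matrix:

    # Minimal PCA sketch in scikit-learn; X is a stand-in numeric (n_samples, n_features) array.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))               # replace with your own data

    pca = PCA().fit(X)                           # keep all components for now

    # Fraction of the total variance captured by each principal component, largest first.
    print(pca.explained_variance_ratio_)
    # The cumulative sum tells you how many components reach a desired variance level.
    print(np.cumsum(pca.explained_variance_ratio_))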

As an example, we will look at how PCA works on a dataset using Orange. Orange is an open-source data visualization, machine learning, and data mining toolkit. It features a visual programming front end for exploratory, rapid qualitative data analysis and interactive data visualization. Let's see how we can perform PCA in Orange.

This is the welcome screen of Orange; just click on New to create a new project. A blank project will open in front of you, and the first step is to load the dataset on which you want to perform PCA.


To load a dataset, click the File icon; a widget named File will appear on the blank canvas. Double-click that widget and a file browser window will open; browse to your desired file, and if it loads successfully its contents will be displayed in the window. Then simply close the window; there is no OK button, so don't worry. The file I loaded for this demonstration contains 19,719 records with 50 attributes.
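
If you prefer scripting to the canvas, the same loading step can also be done through Orange's Python API. A minimal sketch, assuming a CSV file (the file name below is a placeholder, not the demo dataset):

    # Sketch using Orange's scripting API; "my_dataset.csv" is a hypothetical path.
    import Orange

    data = Orange.data.Table("my_dataset.csv")   # Orange picks the reader from the file extension
    print(len(data), "rows,", len(data.domain.attributes), "attributes")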


Now click on the dashed line next to the File icon and drag; it will draw an arrow. When you release it, a menu will pop up showing the various widgets you can connect.

Select the appropriate option from that menu; in our case, we select PCA.


That's it. You have successfully applied PCA to the dataset. Just double-click the PCA icon and it will show you the variance explained by the number of principal components you select. For example, in this case, selecting 22 principal components preserves 75% of the original variance. If you need to retain more of the original variance, increase the number of principal components.
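
The same "keep enough components for a target variance" idea can be expressed in code. A hedged sketch in scikit-learn (not the Orange widget), where passing a fraction as n_components makes it select the smallest number of components that reaches that variance level; the data matrix here is a random stand-in, so the resulting count will differ from the 22 seen above:

    # Sketch: pick the fewest components that preserve at least 75% of the variance.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))                        # stand-in for a 50-attribute dataset

    pca = PCA(n_components=0.75, svd_solver="full")       # float in (0, 1) = target variance fraction
    X_reduced = pca.fit_transform(X)

    print(pca.n_components_, "components preserve",
          round(pca.explained_variance_ratio_.sum(), 3), "of the variance")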


You can also check how the original variables were transformed into the principal components. Click the dashed line on the PCA icon, drag the arrowhead out, and select Data Table. Then right-click the link that was created, choose Reset Signals, and connect Components to Data.


The final table shows how much of each original variable is represented in each principal component, i.e. the component loadings.
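
If you want a similar loadings table programmatically, one way (a sketch in scikit-learn and pandas, with hypothetical attribute names rather than those of the demo dataset) is to wrap pca.components_ in a DataFrame:

    # Sketch: loadings table, rows = principal components, columns = original attributes.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                          # stand-in data
    cols = [f"attr_{i}" for i in range(X.shape[1])]        # hypothetical attribute names

    pca = PCA(n_components=3).fit(X)
    loadings = pd.DataFrame(
        pca.components_,
        columns=cols,
        index=[f"PC{i + 1}" for i in range(pca.n_components_)],
    )
    print(loadings.round(3))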




#orangesoftware #datamining #PCA #dimensionalityreduction
