Dimensionality Reduction by PCA using Orange
Utkarsh Sharma
SME & Manager | SAP Certified Application Associate | Certified Data Scientist | Intel Certified Machine Learning Instructor | Mentor
The curse of dimensionality haunts every data scientist who deals with a dataset containing a large number of attributes. When you explore or analyze such a dataset to find patterns in it, the high dimensionality can make the task difficult. Every time a new attribute is added, we need hundreds of additional tuples to capture its variability across all entities, and as we all know, training data is always scarce. To deal with this problem, we apply dimensionality reduction techniques that reduce the number of attributes while preserving as much as possible of the information, or variance, those attributes carry.
One of the most widely used techniques for dimensionality reduction is Principal Component Analysis (PCA). The technique is built on eigenvectors: the principal components are the eigenvectors of the data's covariance matrix. PCA projects your original attributes onto the same number of principal components, but out of those you keep only the ones that together represent the variance of the dataset up to the desired level. PCA is supported by almost every data analysis language and tool. It works almost the same way on every platform, although the presentation of the results may differ.
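To make the eigenvector idea concrete, here is a minimal sketch in Python with NumPy. It is not what Orange runs internally; the function name `pca_eig` and the random data are made up for illustration. It centers the data, computes the covariance matrix, and sorts the eigenvectors by the variance they explain.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigen-decomposition of the covariance matrix (illustrative sketch).

    X has shape (n_samples, n_attributes).
    Returns (components, explained_variance_ratio); each row of
    `components` is one principal component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each attribute
    cov = np.cov(Xc, rowvar=False)            # attribute-by-attribute covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder: largest variance first
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    ratio = eigvals / eigvals.sum()           # share of total variance per component
    return eigvecs.T, ratio

# Hypothetical data: 200 records, 5 attributes, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]

components, ratio = pca_eig(X)
print(components.shape)   # (5, 5): as many components as attributes
print(ratio.round(3))     # variance share of each component, in descending order
```

Note that you get as many components as attributes; the reduction only happens when you decide how many of them to keep.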
As an example, we will look at how PCA works on a dataset using Orange. Orange is an open-source data visualization, machine learning, and data mining toolkit. It features a visual programming front end for rapid exploratory data analysis and interactive data visualization. Let’s see how we can perform PCA in Orange.
On the Orange welcome screen, click New to create a new project. A blank project will open, and the first step is to load the dataset on which you want to perform PCA.
To load a dataset, click the File icon. The moment you click it, an icon named File is created on the blank canvas. Double-click that icon to open a file explorer window, then browse to your desired file. If the file loads successfully, its contents are displayed on the screen. Close the window after that; there is no OK button, so don’t worry. The file I loaded for this demonstration contains 19719 records with 50 attributes.
Now click the dashed line next to the File icon and drag; it will draw an arrow. When you release it, a menu pops up with various options to select from.
Select the appropriate option from that menu. In our case, we select PCA.
That’s it, done. You have successfully applied PCA to the dataset. Double-click the PCA icon and it will show you the variance explained by the number of principal components you select. In this case, for example, selecting 22 principal components preserves 75% of the original variance. If you need to retain more of the variance of the original data, increase the number of principal components.
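The trade-off between the number of components and the retained variance can be sketched in a few lines of Python. The helper `components_needed` and the variance ratios below are hypothetical, not values from the demonstration file; the idea is simply to accumulate the per-component variance shares until the target is reached.

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative
    variance share reaches `target` (a fraction such as 0.75)."""
    cumulative = accumulate(explained_variance_ratio)
    for k, total in enumerate(cumulative, start=1):
        if total >= target - 1e-12:   # small tolerance for float rounding
            return k
    return len(explained_variance_ratio)

# Hypothetical variance shares, already sorted in descending order.
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

print(components_needed(ratios, 0.75))  # 3
print(components_needed(ratios, 0.95))  # 5
```

Raising the target from 75% to 95% costs two extra components here, which mirrors what happens in the PCA window when you ask for more variance.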
You can also check how the original variables were transformed into the principal components. Click the dashed line next to the PCA icon, drag the arrowhead, drop it, and select the Data Table option. Then right-click the link that was created, select Reset Signals, and create a link between Components and Data.
The final table shows how much each original variable contributes to each principal component.
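One way to read such a table as percentages is sketched below in Python with NumPy. This is an interpretation, not the exact layout Orange displays: I am assuming each row holds the weights (loadings) of one component, and because each component is a unit-length eigenvector, the squared weights in a row sum to 1. The data and attribute names are made up.

```python
import numpy as np

# Hypothetical data: 100 records with 3 attributes named a, b, c.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
names = ["a", "b", "c"]

Xc = X - X.mean(axis=0)                                  # center each attribute
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))    # ascending eigenvalues
loadings = eigvecs[:, ::-1].T    # rows = components, largest variance first
share = loadings ** 2            # squared weights; each row sums to 1

for i, row in enumerate(share, start=1):
    cells = ", ".join(f"{n}: {p:.1%}" for n, p in zip(names, row))
    print(f"PC{i} -> {cells}")
```

A variable with a large share in a component is one that component mostly represents.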