Principal Component Analysis - PCA
Shekhar Pandey
Tech Lead | Digital Transformation, Robotics Process Automation, Machine Learning, Artificial Intelligence, AIOps, DevOps, Cloud Computing
Dimensionality reduction for visualization:
We often deal with high-dimensional datasets and need to convert them into a lower-dimensional space so that we can visualize them, on the condition that we retain as much information as possible.
Principal Component Analysis (PCA):
The main idea of PCA is to reduce the dimensionality of a dataset consisting of many variables (i.e. dimensions) while retaining as much as possible of the variation (i.e. the information) present in the original dataset. This is done by transforming the variables into a new set of variables, known as the principal components. The principal components retain the variation present in the original variables in decreasing order: the first principal component retains the most variation, the second retains the next most, and so on.
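To make the transformation concrete, here is a minimal sketch of PCA done by hand with NumPy, using the eigenvectors of the covariance matrix. The helper name `pca_from_scratch` and the synthetic data are assumptions made for illustration; this is one common way to compute principal components, not necessarily how any particular library does it internally.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Illustrative helper: project X onto its top principal components."""
    # Center each column so variation is measured around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the variables (columns)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the direction with the largest variance comes first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Share of the total variance carried by each principal component
    explained_ratio = eigvals / eigvals.sum()
    # Project the centered data onto the leading eigenvectors
    projected = X_centered @ eigvecs[:, :n_components]
    return projected, explained_ratio

# Synthetic data (assumed for the example): 200 samples, 5 correlated variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

projected, ratio = pca_from_scratch(X, n_components=2)
print(projected.shape)  # 200 rows, 2 columns
print(ratio)            # variance share per component, largest first
```

Because the five synthetic variables are built from only two underlying signals plus a little noise, the first two entries of `ratio` should account for nearly all of the variance, which is the "ordered" behaviour described above.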
So if we can convert a high-dimensional dataset into 2 or 3 dimensions while retaining around 80% to 90% of the original variation, that really helps.
Implementation: Let us apply PCA to the breast cancer dataset, which has 30 columns, transform it into two principal components, and visualize the newly created dataset.
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# notebook magic; omit this line when running as a plain Python script
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer

# load dataset
breast_cancer = load_breast_cancer()
type(breast_cancer)
# sklearn.utils.Bunch

# to see detailed description of dataset
print(breast_cancer.DESCR)
breast_cancer.data.shape
# (569, 30): 569 data points (rows) and 30 columns

breast_cancer.target.shape
# (569,): 569 labels, each 0 or 1
raw_data = breast_cancer.data

# standardize each feature to zero mean and unit variance
normalized_data = StandardScaler().fit_transform(raw_data)

# initialize PCA with 2 components
pca = PCA(n_components=2)

# fit PCA and project the data onto the 2 components
pca_data = pca.fit_transform(normalized_data)
# Variance explained by each principal component
print(pca.explained_variance_ratio_)
# [0.44272026 0.18971182]

# Total variance explained by the principal components
total_var = 100 * np.sum(pca.explained_variance_ratio_)
print(f'{total_var:.3}% of total variance is explained by 2 principal components')
# 63.2% of total variance is explained by 2 principal components
So, with PCA we reduced a 30-dimensional dataset to 2 dimensions while retaining about 63% of the variation in the original dataset. Now we can easily plot this newly created dataset on a 2-dimensional graph.
# Create dataframe
pca_df = pd.DataFrame(np.vstack((pca_data.T, breast_cancer.target)).T,
                      columns=['1st_Prin', '2nd_Prin', 'label'])

# Map 0 to Malignant and 1 to Benign
# (assign the result back; in-place replace on a selected column is unreliable in recent pandas)
pca_df['label'] = pca_df['label'].map({0.0: 'Malignant', 1.0: 'Benign'})

# Check the count of each label
pca_df.label.value_counts()
# Benign       357
# Malignant    212
# These counts match the labels in the dataset description

# Create plot
# Set palette of colors for the different labels
pal = dict(Malignant="red", Benign="green")
ax = sns.FacetGrid(pca_df, hue='label', height=6, palette=pal,
                   hue_order=["Malignant", "Benign"]).\
    map(plt.scatter, '1st_Prin', '2nd_Prin').\
    add_legend()
plt.show()
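Two components retain about 63% of the variation, which is below the 80% to 90% range mentioned earlier. As a rough sketch, one way to check how many components would be needed to reach such a target is to fit PCA with all components and look at the cumulative explained variance. The 0.90 threshold and the names `pca_full`, `cumulative`, `target` and `n_needed` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 30 features, as in the walkthrough above
X = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA keeping all components so the full variance spectrum is available
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Assumed threshold, taken from the 80% to 90% range discussed in the text
target = 0.90
# Index of the first component at which the cumulative ratio reaches the target,
# converted from a 0-based index to a component count
n_needed = int(np.argmax(cumulative >= target)) + 1

print(f'{n_needed} components retain {100 * cumulative[n_needed - 1]:.1f}% of the variance')
```

More than two components cannot be drawn on a single 2-dimensional scatter plot, so the plot above trades some retained variation for the ability to visualize the data at all.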