Principal Component Analysis - PCA

Dimensionality reduction for visualization:

Often we deal with a high-dimensional dataset and need to convert it into a lower-dimensional space so that we can visualize it, with the condition that we retain as much of the information as possible.

Principal Component Analysis (PCA):

The main idea of PCA is to reduce the dimensionality of a dataset consisting of many variables (i.e. dimensions) while retaining, to the maximum extent possible, the variation (i.e. the information) present in the original dataset. This is done by transforming the variables into a new set of variables, known as the principal components. These principal components retain the variation present in the original variables in an ordered manner: the first principal component retains the most information, then the second, and so on.

So if we can convert a high-dimensional dataset into 2 or 3 dimensions while retaining around 80% to 90% of the original variation, that is very useful for visualization.
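To get an intuition for what PCA computes, here is a minimal NumPy sketch (the toy data and variable names below are purely illustrative): the principal components are the eigenvectors of the covariance matrix of the standardized data, ordered by eigenvalue, and each eigenvalue measures how much of the variance its component retains.

# Minimal sketch of PCA via eigendecomposition of the covariance matrix
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy data: 100 rows, 5 features
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each feature

cov = np.cov(X, rowvar=False)                 # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh is for symmetric matrices
order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_2d = X @ eigvecs[:, :2]                     # project onto the top 2 components
print(eigvals[:2] / eigvals.sum())            # fraction of variance retained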

Implementation: Let us apply PCA to the breast cancer dataset, which has 30 columns, to transform it into two principal components and visualize the newly created dataset.

# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer

# load dataset
breast_cancer = load_breast_cancer()

type(breast_cancer)
# sklearn.utils.Bunch

# to see detailed description of dataset
print(breast_cancer.DESCR)
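Besides data and target, the Bunch object also exposes the feature names and the class labels, which we will use later when mapping 0/1 to Malignant/Benign:

print(breast_cancer.feature_names[:5])
# ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness']

print(breast_cancer.target_names)
# ['malignant' 'benign']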



breast_cancer.data.shape
# (569, 30): 569 data points (rows) and 30 columns (features)

breast_cancer.target.shape
# (569,): 569 labels, each 0 or 1

raw_data = breast_cancer.data

# standardize data (zero mean, unit variance for each feature)
normalized_data = StandardScaler().fit_transform(raw_data)
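Standardization matters here because PCA is driven by variance: without it, features measured on large scales would dominate the principal components. A quick, purely optional sanity check confirms that each scaled column has roughly zero mean and unit variance:

print(np.allclose(normalized_data.mean(axis=0), 0))   # True
print(np.allclose(normalized_data.std(axis=0), 1))    # True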

# initialize pca with 2 components
pca = PCA(n_components=2)

# fit PCA and project the data onto the two principal components
pca_data = pca.fit_transform(normalized_data)


# Variance explained by principal components
print(pca.explained_variance_ratio_)
# [0.44272026 0.18971182]

# Total Variance explained by principal components
total_var = 100 * np.sum(pca.explained_variance_ratio_)
print(f'{total_var:.3}% of total variance is explained by 2 principal components')
# 63.2% of total variance is explained by 2 principal components
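
As a quick optional check of how many components it would take to reach the 80% to 90% range mentioned earlier (the names pca_full, cum_var and n_90 below are just illustrative):

pca_full = PCA().fit(normalized_data)                   # keep all 30 components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_90 = np.argmax(cum_var >= 0.90) + 1                   # first count reaching 90%
print(f'{n_90} components are needed to retain 90% of the variance')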

So, with PCA we converted a 30-dimensional dataset into 2 dimensions while retaining about 63% of the information in the original dataset. Now we can easily plot this newly created dataset on a two-dimensional graph.

# Create dataframe with the two principal components and the class label
pca_df = pd.DataFrame(np.vstack((pca_data.T, breast_cancer.target)).T,
                      columns = ['1st_Prin', '2nd_Prin', 'label'])


# Replace 0 with Malignant and 1 with Benign
pca_df['label'] = pca_df['label'].replace({0.0: 'Malignant', 1.0: 'Benign'})

# Check the count of label
pca_df.label.value_counts()

# Benign       357
# Malignant    212
# These counts match the class distribution given in the dataset description

# Create Plot
# Set palette of colors for different labels
pal = dict(Malignant="red", Benign="green")

g = sns.FacetGrid(pca_df, hue='label', height=6, palette=pal,
                  hue_order=["Malignant", "Benign"])
g.map(plt.scatter, '1st_Prin', '2nd_Prin').add_legend()

plt.show()

[Figure: scatter plot of the breast cancer data projected onto the first two principal components, with Malignant points in red and Benign points in green]
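
For reference, roughly the same figure can also be produced with sns.scatterplot, which avoids the FacetGrid boilerplate:

sns.scatterplot(data=pca_df, x='1st_Prin', y='2nd_Prin',
                hue='label', hue_order=['Malignant', 'Benign'], palette=pal)
plt.show()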

