Principal Component Analysis - PCA

Shekhar Pandey

Tech Lead | Digital Transformation, Robotics Process Automation, Machine Learning, Artificial Intelligence, AIOps, DevOps, Cloud Computing

Published: May 16, 2020

Dimensionality reduction for visualization:

We often work with high-dimensional datasets and need to project them into a lower-dimensional space so that they can be visualized, while retaining as much of the original information as possible.

Principal Component Analysis (PCA):

The main idea of PCA is to reduce the dimensionality of a dataset consisting of many variables (i.e. dimensions) while retaining as much as possible of the variation (i.e. the information) present in the original dataset. This is done by transforming the variables into a new set of variables, known as the principal components. The principal components are ordered by the amount of variation they retain: the first principal component retains the most, the second retains the most of what remains, and so on.
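To make the ordering of the components concrete, here is a minimal sketch of PCA written directly with NumPy. It assumes the textbook formulation (eigen-decomposition of the covariance matrix of the centered data); the function name `pca_from_scratch` and the synthetic data are hypothetical and used only for illustration, and scikit-learn's own implementation differs in its internals.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Project X (n_samples, n_features) onto its leading principal components."""
    # Center each feature so variance is measured about the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the direction with the largest variance comes first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Fraction of the total variance captured by each component
    explained_ratio = eigvals / eigvals.sum()
    # Project the centered data onto the leading components
    projected = X_centered @ eigvecs[:, :n_components]
    return projected, explained_ratio

# Hypothetical synthetic data: 3 features driven mostly by one shared factor
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
X = np.hstack([factor, 2 * factor, -factor]) + 0.1 * rng.normal(size=(200, 3))

projected, explained_ratio = pca_from_scratch(X, n_components=2)
print(projected.shape)
print(explained_ratio)
```

Because the three features in this toy example are almost perfectly correlated, nearly all of the variance should land on the first component, which is the "ordered manner" described above.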

So if we can convert a high-dimensional dataset into 2 or 3 dimensions while retaining around 80% to 90% of the original variation, we can plot it with little loss of information.

Implementation: Let us apply PCA to the breast cancer dataset, which has 30 columns, to transform it into two principal components and visualize the newly created dataset.

# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove the next line when running as a plain script
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer

# load dataset
breast_cancer = load_breast_cancer()

type(breast_cancer)
# sklearn.utils.Bunch

# to see detailed description of dataset
print(breast_cancer.DESCR)

breast_cancer.data.shape
# (569, 30): 569 data points (rows) and 30 features (columns)

breast_cancer.target.shape
# (569,): 569 labels, each 0 or 1

raw_data = breast_cancer.data

# standardize each feature to zero mean and unit variance
normalized_data = StandardScaler().fit_transform(raw_data)

# initialize pca with 2 components
pca = PCA(n_components=2)

# fit pca and project the data onto the 2 principal components
pca_data = pca.fit_transform(normalized_data)

# Variance explained by principal components
print(pca.explained_variance_ratio_)
# [0.44272026 0.18971182]

# Total Variance explained by principal components
total_var = 100 * np.sum(pca.explained_variance_ratio_)
print(f'{total_var:.3}% of total variance is explained by 2 principal components')
# 63.2% of total variance is explained by 2 principal components

So, with PCA we converted a 30-dimensional dataset into 2 dimensions while retaining about 63% of the variance of the original dataset. Now we can easily plot the newly created dataset on a 2-dimensional graph.
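Two components fall short of the 80% to 90% range mentioned earlier, so it can be useful to check how many components that range would require. The following is a rough sketch, not part of the original walkthrough: it refits PCA with all components kept and reads the answer off the cumulative explained variance. The variable names (`scaled`, `pca_full`, `cumulative`, `n_needed`) are my own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 30 features, as in the walkthrough
scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components kept so the full variance spectrum is available
pca_full = PCA().fit(scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches each threshold
for threshold in (0.80, 0.90):
    n_needed = int(np.searchsorted(cumulative, threshold) + 1)
    print(f'{n_needed} components retain at least {threshold:.0%} of the variance')
```

scikit-learn can also make this choice itself: passing a float between 0 and 1, for example `PCA(n_components=0.9)`, keeps just enough components to explain that fraction of the variance. More than two components cannot be shown on a single 2D scatter plot, so for visualization the 2-component projection remains the practical choice.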

# Create dataframe with the two principal components and the label
pca_df = pd.DataFrame(np.vstack((pca_data.T, breast_cancer.target)).T,
                      columns = ['1st_Prin', '2nd_Prin', 'label'])

# Replace 0 with Malignant and 1 with Benign
# (assign the mapped column back; replace(..., inplace=True) on a selected
# column is unreliable in recent pandas versions)
pca_df['label'] = pca_df['label'].map({0.0: 'Malignant', 1.0: 'Benign'})

# Check the count of label
pca_df.label.value_counts()

# Benign       357
# Malignant    212
# This count matches with labels as per dataset description

# Create Plot
# Set palette of colors for different labels
pal = dict(Malignant="red", Benign="green")

grid = (sns.FacetGrid(pca_df, hue='label', height=6, palette=pal,
                      hue_order=["Malignant", "Benign"])
        .map(plt.scatter, '1st_Prin', '2nd_Prin')
        .add_legend())

plt.show()
