PCA - Principal Component Analysis

Dimension reduction

● More efficient storage and computation

● Remove less-informative "noise" features

● ... which cause problems for prediction tasks, e.g. classification, regression

Principal Component Analysis

● PCA = "Principal Component Analysis"

● Fundamental dimension reduction technique

● First step "decorrelation" (considered here)

● Second step reduces dimension (considered later)

PCA follows the fit/transform pattern

● PCA is a scikit-learn component, like KMeans or StandardScaler

● fit() learns the transformation from given data

● transform() applies the learned transformation

● transform() can also be applied to new data
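A minimal sketch of this pattern, assuming a small toy array samples of 2-feature measurements (the values below are made up purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 2-feature data (assumed example values)
samples = np.array([[3.2, 5.6], [2.9, 5.1], [3.5, 6.0], [3.1, 5.4]])

model = PCA()
model.fit(samples)                        # learn the transformation from the given data
pca_features = model.transform(samples)   # apply it; the same transform also works on new data
print(pca_features)
```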

Correlated data in nature

We have an array grains giving the width and length of samples of grain. We suspect that width and length will be correlated. To confirm this, we can make a scatter plot of width vs length and measure their Pearson correlation.

  • Import matplotlib.pyplot as plt.
  • Import pearsonr from scipy.stats.
  • Assign column 0 of grains to width and column 1 of grains to length.
  • Make a scatter plot with width on the x-axis and length on the y-axis.
  • Use the pearsonr() function to calculate the Pearson correlation of width and length.
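A sketch following these steps, assuming grains is already loaded as a 2D NumPy array as described above:

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Column 0 holds the widths, column 1 the lengths
width = grains[:, 0]
length = grains[:, 1]

# Scatter plot: width on the x-axis, length on the y-axis
plt.scatter(width, length)
plt.axis('equal')
plt.show()

# Pearson correlation of width and length
correlation, pvalue = pearsonr(width, length)
print(correlation)
```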

As you would expect, the width and length of the grain samples are highly correlated.

Decorrelating the grain measurements with PCA

We have observed previously that the width and length measurements of the grain are correlated. Now, we'll use PCA to decorrelate these measurements, then plot the decorrelated points and measure their Pearson correlation.

  • Import PCA from sklearn.decomposition.
  • Create an instance of PCA called model.
  • Use the .fit_transform() method of model to apply the PCA transformation to grains. Assign the result to pca_features.
  • The subsequent code to extract, plot, and compute the Pearson correlation of the first two columns of pca_features has been written for you; run it to see the result.
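A sketch following these steps (grains, plt, and pearsonr are assumed to be available from the previous exercise):

```python
from sklearn.decomposition import PCA

# Create a PCA instance and apply fit_transform to grains
model = PCA()
pca_features = model.fit_transform(grains)

# Extract the first two PCA features and plot them
xs = pca_features[:, 0]
ys = pca_features[:, 1]
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Their Pearson correlation should be (almost) zero
correlation, pvalue = pearsonr(xs, ys)
print(correlation)
```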

We've successfully decorrelated the grain measurements with PCA! The principal components now align with the axes of the point cloud.

Intrinsic dimension

● Intrinsic dimension = number of features needed to approximate the dataset

● Essential idea behind dimension reduction

● What is the most compact representation of the samples?

● Can be detected with PCA

Versicolor dataset

● "versicolor", one of the iris species

● Only 3 features: sepal length, sepal width, and petal width

● Samples are points in 3D space

PCA identifies intrinsic dimension

● Scatter plots work only if samples have 2 or 3 features

● PCA identifies intrinsic dimension when samples have any number of features

● Intrinsic dimension = number of PCA features with significant variance

The first principal component

The first principal component of the data is the direction in which the data varies the most. We can use PCA to find the first principal component of the length and width measurements of the grain samples, and represent it as an arrow on the scatter plot.

The array grains gives the length and width of the grain samples. PyPlot (plt) and PCA have already been imported.

  • Make a scatter plot of the grain measurements.
  • Create a PCA instance called model.
  • Fit the model to the grains data.
  • Extract the coordinates of the mean of the data using the .mean_ attribute of model.
  • Get the first principal component of model using the .components_[0,:] attribute.
  • Plot the first principal component as an arrow on the scatter plot, using the plt.arrow() function. You have to specify the first two arguments - mean[0] and mean[1].
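A sketch following these steps (grains, plt, and PCA are assumed to be available as described above):

```python
# Scatter plot of the untransformed grain measurements
plt.scatter(grains[:, 0], grains[:, 1])

# Fit a PCA instance to the grains data
model = PCA()
model.fit(grains)

# Mean of the data and the first principal component
mean = model.mean_
first_pc = model.components_[0, :]

# Draw the first principal component as an arrow starting at the mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
plt.axis('equal')
plt.show()
```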


This is the direction in which the grain data varies the most.



Variance of the PCA features

The fish dataset is 6-dimensional. But what is its intrinsic dimension? Make a plot of the variances of the PCA features to find out. As before, samples is a 2D array, where each row represents a fish. We'll need to standardize the features first.

  • Create an instance of StandardScaler called scaler.
  • Create a PCA instance called pca.
  • Use the make_pipeline() function to create a pipeline chaining scaler and pca.
  • Use the .fit() method of pipeline to fit it to the fish samples samples.
  • Extract the number of components used using the .n_components_ attribute of pca. Place this inside a range() function and store the result as features.
  • Use the plt.bar() function to plot the explained variances, with features on the x-axis and pca.explained_variance_ on the y-axis.
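A sketch following these steps, assuming samples holds the 6-dimensional fish measurements:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Chain a scaler and PCA into a single pipeline
scaler = StandardScaler()
pca = PCA()
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to the fish samples
pipeline.fit(samples)

# Bar plot of the explained variance of each PCA feature
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
```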


It looks like PCA features 0 and 1 have significant variance, so the intrinsic dimension of this dataset appears to be 2.

Dimension reduction

● Represents the same data using fewer features

● Important part of machine-learning pipelines

● Can be performed using PCA

Dimension reduction with PCA

● Specify how many features to keep

● E.g. PCA(n_components=2)

● Keeps the first 2 PCA features

● Intrinsic dimension is a good choice

Dimension reduction with PCA

● Discards low-variance PCA features

● Assumes the high-variance features are informative

● Assumption typically holds in practice (e.g. for iris)

Dimension reduction of the fish measurements

Previously we have seen that 2 was a reasonable choice for the "intrinsic dimension" of the fish measurements. Now we can use PCA for dimensionality reduction of the fish measurements, retaining only the 2 most important components.

The fish measurements have already been scaled, and are available as scaled_samples.

  • Import PCA from sklearn.decomposition.
  • Create a PCA instance called pca with n_components=2.
  • Use the .fit() method of pca to fit it to the scaled fish measurements scaled_samples.
  • Use the .transform() method of pca to transform the scaled_samples. Assign the result to pca_features.
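A sketch following these steps, assuming scaled_samples holds the standardized fish measurements:

```python
from sklearn.decomposition import PCA

# Keep only the first 2 PCA features
pca = PCA(n_components=2)

# Fit to the scaled measurements, then transform them
pca.fit(scaled_samples)
pca_features = pca.transform(scaled_samples)

# Each fish is now represented by 2 features instead of 6
print(pca_features.shape)
```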


We've successfully reduced the dimensionality from 6 to 2.




A tf-idf word-frequency array

Here we'll create a tf-idf word frequency array for a toy collection of documents. For this, we can use the TfidfVectorizer from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has fit() and transform() methods like other sklearn objects.

  • We are given a list documents of toy documents about pets.
  • Import TfidfVectorizer from sklearn.feature_extraction.text.
  • Create a TfidfVectorizer instance called tfidf.
  • Apply .fit_transform() method of tfidf to documents and assign the result to csr_mat. This is a word-frequency array in csr_matrix format.
  • Inspect csr_mat by calling its .toarray() method and printing the result. This has been done for you.
  • The columns of the array correspond to words. Get the list of words by calling the .get_feature_names() method of tfidf, and assign the result to words.
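A sketch following these steps, with a small assumed documents list standing in for the exercise data. (Note: recent scikit-learn versions use .get_feature_names_out() in place of .get_feature_names().)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents about pets (assumed stand-in for the exercise data)
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']

# Create a TfidfVectorizer and transform the documents into a csr_matrix
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)

# Inspect the word-frequency array and the words its columns correspond to
print(csr_mat.toarray())
words = tfidf.get_feature_names_out()   # .get_feature_names() in older scikit-learn
print(words)
```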

We'll now move to clustering Wikipedia articles!


Clustering Wikipedia part I

The TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. We can combine our knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. Here we will build the pipeline. Later, we'll apply it to the word-frequency array of some Wikipedia articles.

Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix, so there's no need for a TfidfVectorizer).

The Wikipedia dataset you will be working with was obtained from https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/.

  • Import TruncatedSVD from sklearn.decomposition, KMeans from sklearn.cluster, and make_pipeline from sklearn.pipeline.
  • Create a TruncatedSVD instance called svd with n_components=50.
  • Create a KMeans instance called kmeans with n_clusters=6.
  • Create a pipeline called pipeline consisting of svd and kmeans.
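A sketch of the pipeline described above:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# TruncatedSVD reduces the tf-idf array to 50 components; KMeans then finds 6 clusters
svd = TruncatedSVD(n_components=50)
kmeans = KMeans(n_clusters=6)
pipeline = make_pipeline(svd, kmeans)
```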

Clustering Wikipedia part II

It is now time to put our pipeline from the previous exercise to work! We are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles of their titles. Use the pipeline to cluster the Wikipedia articles.

A solution to the previous exercise has been pre-loaded, so a Pipeline pipeline chaining TruncatedSVD with KMeans is available.

  • Import pandas as pd.
  • Fit the pipeline to the word-frequency array articles.
  • Predict the cluster labels.
  • Align the cluster labels with the list titles of article titles by creating a DataFrame df with labels and titles as columns. This has been done for you.
  • Use the .sort_values() method of df to sort the DataFrame by the 'label' column, and print the result.
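A sketch following these steps (pipeline, articles, and titles are assumed to be available as described above):

```python
import pandas as pd

# Fit the pipeline to the tf-idf word-frequency array and predict cluster labels
pipeline.fit(articles)
labels = pipeline.predict(articles)

# Align the cluster labels with the article titles and sort by label
df = pd.DataFrame({'label': labels, 'article': titles})
print(df.sort_values('label'))
```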

Take a look at the cluster labels and see if you can identify any patterns!













