PCA - Principal Component Analysis

Dimension reduction

● More efficient storage and computation

● Remove less-informative "noise" features

● ... which cause problems for prediction tasks, e.g. classification, regression

Principal Component Analysis

● PCA = "Principal Component Analysis"

● Fundamental dimension reduction technique

● First step "decorrelation" (considered here)

● Second step reduces dimension (considered later)

PCA follows the fit/transform pattern

● PCA is a scikit-learn component, like KMeans or StandardScaler

● fit() learns the transformation from given data

● transform() applies the learned transformation

● transform() can also be applied to new data
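A minimal sketch of this pattern, assuming a small toy array samples of 2-feature measurements (the values below are made up purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 2-feature data (assumed example values)
samples = np.array([[3.2, 5.6], [2.9, 5.1], [3.5, 6.0], [3.1, 5.4]])

model = PCA()
model.fit(samples)                        # learn the transformation from the given data
pca_features = model.transform(samples)   # apply it; the same transform also works on new data
print(pca_features)
```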

Correlated data in nature

We have an array grains giving the width and length of samples of grain. We suspect that width and length will be correlated. To confirm this, we can make a scatter plot of width vs length and measure their Pearson correlation.

  • Import matplotlib.pyplot as plt.
  • Import pearsonr from scipy.stats.
  • Assign column 0 of grains to width and column 1 of grains to length.
  • Make a scatter plot with width on the x-axis and length on the y-axis.
  • Use the pearsonr() function to calculate the Pearson correlation of width and length.
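A sketch following these steps, assuming grains is already loaded as a 2D NumPy array as described above:

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Column 0 holds the widths, column 1 the lengths
width = grains[:, 0]
length = grains[:, 1]

# Scatter plot: width on the x-axis, length on the y-axis
plt.scatter(width, length)
plt.axis('equal')
plt.show()

# Pearson correlation of width and length
correlation, pvalue = pearsonr(width, length)
print(correlation)
```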

As you would expect, the width and length of the grain samples are highly correlated.

Decorrelating the grain measurements with PCA

We have observed previously that the width and length measurements of the grain are correlated. Now, we'll use PCA to decorrelate these measurements, then plot the decorrelated points and measure their Pearson correlation.

  • Import PCA from sklearn.decomposition.
  • Create an instance of PCA called model.
  • Use the .fit_transform() method of model to apply the PCA transformation to grains. Assign the result to pca_features.
  • The subsequent code to extract, plot, and compute the Pearson correlation of the first two columns of pca_features has been written for you; run it to see the result.
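A sketch following these steps (grains, plt, and pearsonr are assumed to be available from the previous exercise):

```python
from sklearn.decomposition import PCA

# Create a PCA instance and apply fit_transform to grains
model = PCA()
pca_features = model.fit_transform(grains)

# Extract the first two PCA features and plot them
xs = pca_features[:, 0]
ys = pca_features[:, 1]
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Their Pearson correlation should be (almost) zero
correlation, pvalue = pearsonr(xs, ys)
print(correlation)
```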

We've successfully decorrelated the grain measurements with PCA! The principal components now align with the axes of the point cloud.

Intrinsic dimension

● Intrinsic dimension = number of features needed to approximate the dataset

● Essential idea behind dimension reduction

● What is the most compact representation of the samples?

● Can be detected with PCA

Versicolor dataset

● "versicolor", one of the iris species

● Only 3 features: sepal length, sepal width, and petal width

● Samples are points in 3D space

PCA identifies intrinsic dimension

● Scatter plots work only if samples have 2 or 3 features

● PCA identifies intrinsic dimension when samples have any number of features

● Intrinsic dimension = number of PCA features with significant variance

The first principal component

The first principal component of the data is the direction in which the data varies the most. We can use PCA to find the first principal component of the length and width measurements of the grain samples, and represent it as an arrow on the scatter plot.

The array grains gives the length and width of the grain samples. PyPlot (plt) and PCA have already been imported.

  • Make a scatter plot of the grain measurements.
  • Create a PCA instance called model.
  • Fit the model to the grains data.
  • Extract the coordinates of the mean of the data using the .mean_ attribute of model.
  • Get the first principal component of model using the .components_[0,:] attribute.
  • Plot the first principal component as an arrow on the scatter plot, using the plt.arrow() function. You have to specify the first two arguments - mean[0] and mean[1].
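A sketch following these steps (grains, plt, and PCA are assumed to be available as described above):

```python
# Scatter plot of the untransformed grain measurements
plt.scatter(grains[:, 0], grains[:, 1])

# Fit a PCA instance to the grains data
model = PCA()
model.fit(grains)

# Mean of the data and the first principal component
mean = model.mean_
first_pc = model.components_[0, :]

# Draw the first principal component as an arrow starting at the mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
plt.axis('equal')
plt.show()
```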


This is the direction in which the grain data varies the most.



Variance of the PCA features

The fish dataset is 6-dimensional. But what is its intrinsic dimension? Make a plot of the variances of the PCA features to find out. As before, samples is a 2D array, where each row represents a fish. We'll need to standardize the features first.

  • Create an instance of StandardScaler called scaler.
  • Create a PCA instance called pca.
  • Use the make_pipeline() function to create a pipeline chaining scaler and pca.
  • Use the .fit() method of pipeline to fit it to the fish samples samples.
  • Extract the number of components used using the .n_components_ attribute of pca. Place this inside a range() function and store the result as features.
  • Use the plt.bar() function to plot the explained variances, with features on the x-axis and pca.explained_variance_ on the y-axis.
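A sketch following these steps, assuming samples holds the 6-dimensional fish measurements:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Chain a scaler and PCA into a single pipeline
scaler = StandardScaler()
pca = PCA()
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to the fish samples
pipeline.fit(samples)

# Bar plot of the explained variance of each PCA feature
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
```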


It looks like PCA features 0 and 1 have significant variance, so the intrinsic dimension of this dataset appears to be 2.

Dimension reduction

● Represents the same data using fewer features

● Important part of machine-learning pipelines

● Can be performed using PCA

Dimension reduction with PCA

● Specify how many features to keep

● E.g. PCA(n_components=2)

● Keeps the first 2 PCA features

● Intrinsic dimension is a good choice

Dimension reduction with PCA

● Discards low-variance PCA features

● Assumes the high-variance features are informative

● Assumption typically holds in practice (e.g. for iris)

Dimension reduction of the fish measurements

Previously we have seen that 2 was a reasonable choice for the "intrinsic dimension" of the fish measurements. Now we can use PCA for dimensionality reduction of the fish measurements, retaining only the 2 most important components.

The fish measurements have already been scaled, and are available as scaled_samples.

  • Import PCA from sklearn.decomposition.
  • Create a PCA instance called pca with n_components=2.
  • Use the .fit() method of pca to fit it to the scaled fish measurements scaled_samples.
  • Use the .transform() method of pca to transform the scaled_samples. Assign the result to pca_features.
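A sketch following these steps, assuming scaled_samples holds the standardized fish measurements:

```python
from sklearn.decomposition import PCA

# Keep only the first 2 PCA features
pca = PCA(n_components=2)

# Fit to the scaled measurements, then transform them
pca.fit(scaled_samples)
pca_features = pca.transform(scaled_samples)

# Each fish is now represented by 2 features instead of 6
print(pca_features.shape)
```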


We've successfully reduced the dimensionality from 6 to 2.




A tf-idf word-frequency array

Here we'll create a tf-idf word frequency array for a toy collection of documents. For this, we can use the TfidfVectorizer from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has fit() and transform() methods like other sklearn objects.

  • We are given a list documents of toy documents about pets.
  • Import TfidfVectorizer from sklearn.feature_extraction.text.
  • Create a TfidfVectorizer instance called tfidf.
  • Apply .fit_transform() method of tfidf to documents and assign the result to csr_mat. This is a word-frequency array in csr_matrix format.
  • Inspect csr_mat by calling its .toarray() method and printing the result. This has been done for you.
  • The columns of the array correspond to words. Get the list of words by calling the .get_feature_names() method of tfidf, and assign the result to words.
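A sketch following these steps, with a small assumed documents list standing in for the exercise data. (Note: recent scikit-learn versions use .get_feature_names_out() in place of .get_feature_names().)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents about pets (assumed stand-in for the exercise data)
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']

# Create a TfidfVectorizer and transform the documents into a csr_matrix
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)

# Inspect the word-frequency array and the words its columns correspond to
print(csr_mat.toarray())
words = tfidf.get_feature_names_out()   # .get_feature_names() in older scikit-learn
print(words)
```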

We'll now move to clustering Wikipedia articles!


Clustering Wikipedia part I

The TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. We can combine our knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. Here we will build the pipeline. Later, we'll apply it to the word-frequency array of some Wikipedia articles.

Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix, so there's no need for a TfidfVectorizer).

The Wikipedia dataset you will be working with was obtained from https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/.

  • Import TruncatedSVD from sklearn.decomposition, KMeans from sklearn.cluster, and make_pipeline from sklearn.pipeline.
  • Create a TruncatedSVD instance called svd with n_components=50.
  • Create a KMeans instance called kmeans with n_clusters=6.
  • Create a pipeline called pipeline consisting of svd and kmeans.
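A sketch of the pipeline described above:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# TruncatedSVD reduces the tf-idf array to 50 components; KMeans then finds 6 clusters
svd = TruncatedSVD(n_components=50)
kmeans = KMeans(n_clusters=6)
pipeline = make_pipeline(svd, kmeans)
```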

Clustering Wikipedia part II

It is now time to put our pipeline from the previous exercise to work! We are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles of their titles. Use the pipeline to cluster the Wikipedia articles.

A solution to the previous exercise has been pre-loaded, so a Pipeline pipeline chaining TruncatedSVD with KMeans is available.

  • Import pandas as pd.
  • Fit the pipeline to the word-frequency array articles.
  • Predict the cluster labels.
  • Align the cluster labels with the list titles of article titles by creating a DataFrame df with labels and titles as columns. This has been done for you.
  • Use the .sort_values() method of df to sort the DataFrame by the 'label' column, and print the result.
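A sketch following these steps (pipeline, articles, and titles are assumed to be available as described above):

```python
import pandas as pd

# Fit the pipeline to the tf-idf word-frequency array and predict cluster labels
pipeline.fit(articles)
labels = pipeline.predict(articles)

# Align the cluster labels with the article titles and sort by label
df = pd.DataFrame({'label': labels, 'article': titles})
print(df.sort_values('label'))
```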

Take a look at the cluster labels and see if you can identify any patterns!













