Dimensionality Reduction — Can PCA improve the performance of a classification model?
What is PCA?
Principal Component Analysis (PCA) is a common feature extraction technique in data science that uses matrix factorization to project data onto a lower-dimensional space. Real-world datasets often contain a large number of features; the more features there are, the harder the data is to visualize and work with. Often many of the features are correlated, and hence redundant, which is where feature extraction comes into play.
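As a quick illustration (a minimal sketch using only NumPy and scikit-learn, with made-up data rather than the dataset used later), PCA applied to two strongly correlated features packs almost all of the variance into the first principal component:
# sketch: PCA on two correlated features (illustrative data only)
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)  # x2 is largely redundant with x1
X = np.column_stack([x1, x2])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)  # the first component explains almost all of the variance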
You can use the make_classification() function to create a synthetic binary classification problem with 2,000 examples and 30 input features, 20 of which are informative.
1- The scikit-learn library provides the PCA class.
from sklearn.decomposition import PCA
# the data to transform
data = ...
# define transform
pca = PCA()
# prepare transform on dataset
pca.fit(data)
# apply transform to dataset
transformed = pca.transform(data)
The outputs of the PCA can be used as input to train a model.
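For example (a minimal sketch, assuming a feature matrix X and labels y are already defined), the reduced features can be passed straight to a classifier without a Pipeline:
# sketch: train a classifier on the PCA-reduced features
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
pca = PCA(n_components=10)        # 10 components is an illustrative choice
X_reduced = pca.fit_transform(X)  # fit the transform and reduce X in one call
model = LogisticRegression()
model.fit(X_reduced, y)           # train on the reduced features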
2- scikit-learn's Pipeline chains a sequence of transforms with a final estimator.
class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# define the pipeline
steps = [('pca', PCA()), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
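A sketch of how the pipeline is used (again assuming X and y are already defined): calling fit applies PCA and then trains the logistic regression, and predict pushes new samples through the same fitted transform before classifying them.
# sketch: fit the pipeline and make predictions
model.fit(X, y)              # PCA is fit on X, then LogisticRegression on the transformed data
yhat = model.predict(X[:5])  # new samples are transformed by the fitted PCA automatically
print(yhat)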
You may want to normalize the data prior to performing the PCA transform if the input variables have differing units or scales; for example:
from sklearn.preprocessing import MinMaxScaler
# scale the data directly
scaler = MinMaxScaler()
X = scaler.fit_transform(data)
# or include normalization as the first step of the pipeline
steps = [('norm', MinMaxScaler()), ('pca', PCA()), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
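As a quick sanity check (a sketch only, assuming the X and y arrays from the dataset defined in the next step and the same repeated stratified cross-validation used later in this article), the pipelines with and without normalization can be compared directly; whether scaling helps here is something to verify, since make_classification produces features on broadly similar scales:
# sketch: compare the PCA pipeline with and without prior normalization
from numpy import mean
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
pipelines = {
    'pca only': Pipeline([('pca', PCA()), ('m', LogisticRegression())]),
    'norm + pca': Pipeline([('norm', MinMaxScaler()), ('pca', PCA()), ('m', LogisticRegression())]),
}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=cv, n_jobs=-1)  # X, y defined in the next step
    print('%s: %.3f' % (name, mean(scores)))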
3- Dataset for classification
# dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
4- Logistic regression model with PCA reducing the 30 input features to 10 components.
# logistic regression with a PCA transform
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.726 (0.031)
How do we know that reducing the 30 input dimensions down to 10 is good, or the best we can do? Is ten a good choice?
5- Feature extraction using PCA
A sensible approach is to evaluate the model over a range of component counts and select the number that gives the best accuracy.
# compare pca number of components with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
    return X, y

# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(1, 30):
        steps = [('pca', PCA(n_components=i)), ('m', LogisticRegression())]
        models[str(i)] = Pipeline(steps=steps)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.figure(figsize=(25, 10))
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.xticks(rotation=45)
pyplot.show()
>1 0.625 (0.036)
>2 0.635 (0.030)
>3 0.674 (0.030)
>4 0.680 (0.028)
>5 0.705 (0.028)
>6 0.709 (0.028)
>7 0.719 (0.032)
>8 0.720 (0.032)
>9 0.723 (0.031)
>10 0.726 (0.031)
>11 0.765 (0.027)
>12 0.769 (0.027)
>13 0.775 (0.031)
>14 0.779 (0.030)
>15 0.779 (0.028)
>16 0.791 (0.028)
>17 0.796 (0.029)
>18 0.857 (0.027)
>19 0.864 (0.026)
>20 0.866 (0.023)
>21 0.866 (0.023)
>22 0.866 (0.023)
>23 0.866 (0.023)
>24 0.866 (0.023)
>25 0.866 (0.023)
>26 0.866 (0.023)
>27 0.866 (0.023)
>28 0.866 (0.023)
>29 0.866 (0.023)
We see a general trend of increased performance as the number of dimensions is increased. On this dataset, the results suggest a trade-off between the number of dimensions and the classification accuracy of the model.
Interestingly, we don't see any improvement beyond 20 components. This matches our definition of the problem, where only the first 20 components contain information about the class and the remaining ten are redundant. A box-and-whisker plot is a good way to visualize the distribution of accuracy scores for each configured number of dimensions.
Conclusion:
As observed above, we changed the number of components kept by PCA and plotted its effect on the accuracy of the model. We conclude that twenty components is the best choice for this model: accuracy stops improving once the twenty informative dimensions are retained, so the remaining components can be discarded without hurting performance.
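Putting it together (a short sketch reusing the imports and dataset from the code above), the final pipeline can be defined with twenty components and evaluated one more time; based on the results above it should score around 0.866:
# sketch: the final pipeline with the chosen number of components
steps = [('pca', PCA(n_components=20)), ('m', LogisticRegression())]
final_model = Pipeline(steps=steps)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(final_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))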