Dimensionality Reduction — Can PCA improve the performance of a classification model?
What is PCA?
Principal Component Analysis (PCA) is a common feature extraction technique in data science that uses matrix factorization to project data onto a lower-dimensional space. Real-world datasets often contain a large number of features; the more features there are, the harder the data is to visualize and work with. Often many of the features are correlated, and hence redundant, which is where feature extraction comes into play.
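As a quick illustration (a minimal sketch using only NumPy and scikit-learn, with made-up data rather than the dataset used later), PCA applied to two strongly correlated features packs almost all of the variance into the first principal component:
# sketch: PCA on two correlated features (illustrative data only)
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)  # x2 is largely redundant with x1
X = np.column_stack([x1, x2])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)  # the first component explains almost all of the variance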
You can use the make_classification() function to create a synthetic binary classification problem with 2,000 examples and 30 input features, 20 of which are informative.
1- The scikit-learn library provides the PCA class.
from sklearn.decomposition import PCA
# the data to transform
data = ...
# define transform
pca = PCA()
# prepare transform on dataset
pca.fit(data)
# apply transform to dataset
transformed = pca.transform(data)
The outputs of the PCA can be used as input to train a model.
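For example (a minimal sketch, assuming a feature matrix X and labels y are already defined), the reduced features can be passed straight to a classifier without a Pipeline:
# sketch: train a classifier on the PCA-reduced features
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
pca = PCA(n_components=10)        # 10 components is an illustrative choice
X_reduced = pca.fit_transform(X)  # fit the transform and reduce X in one call
model = LogisticRegression()
model.fit(X_reduced, y)           # train on the reduced features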
2- scikit-learn's Pipeline chains a sequence of transforms with a final estimator.
class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# define the pipeline
steps = [('pca', PCA()), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
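A sketch of how the pipeline is used (again assuming X and y are already defined): calling fit applies PCA and then trains the logistic regression, and predict pushes new samples through the same fitted transform before classifying them.
# sketch: fit the pipeline and make predictions
model.fit(X, y)              # PCA is fit on X, then LogisticRegression on the transformed data
yhat = model.predict(X[:5])  # new samples are transformed by the fitted PCA automatically
print(yhat)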
You may want to normalize the data prior to performing the PCA transform if the input variables have differing units or scales; for example:
from sklearn.preprocessing import MinMaxScaler
# scale the data directly
scaler = MinMaxScaler()
X = scaler.fit_transform(data)
# or include normalization as the first step of the pipeline
steps = [('norm', MinMaxScaler()), ('pca', PCA()), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
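As a quick sanity check (a sketch only, assuming the X and y arrays from the dataset defined in the next step and the same repeated stratified cross-validation used later in this article), the pipelines with and without normalization can be compared directly; whether scaling helps here is something to verify, since make_classification produces features on broadly similar scales:
# sketch: compare the PCA pipeline with and without prior normalization
from numpy import mean
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
pipelines = {
    'pca only': Pipeline([('pca', PCA()), ('m', LogisticRegression())]),
    'norm + pca': Pipeline([('norm', MinMaxScaler()), ('pca', PCA()), ('m', LogisticRegression())]),
}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=cv, n_jobs=-1)  # X, y defined in the next step
    print('%s: %.3f' % (name, mean(scores)))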
3- Dataset for classification
# dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
4- Logistic regression model with PCA reducing the 30 input features to 10 components.
# logistic regression with a PCA transform
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.726 (0.031)
How do we know that reducing the 30 input dimensions down to 10 is good, or the best we can do? Is ten a good choice?
5- Feature extraction using PCA
A sensible approach is to evaluate the model over a range of component counts and select the number that gives the best accuracy.
# compare pca number of components with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
    return X, y

# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(1, 30):
        steps = [('pca', PCA(n_components=i)), ('m', LogisticRegression())]
        models[str(i)] = Pipeline(steps=steps)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.figure(figsize=(25, 10))
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.xticks(rotation=45)
pyplot.show()
>1 0.625 (0.036)
>2 0.635 (0.030)
>3 0.674 (0.030)
>4 0.680 (0.028)
>5 0.705 (0.028)
>6 0.709 (0.028)
>7 0.719 (0.032)
>8 0.720 (0.032)
>9 0.723 (0.031)
>10 0.726 (0.031)
>11 0.765 (0.027)
>12 0.769 (0.027)
>13 0.775 (0.031)
>14 0.779 (0.030)
>15 0.779 (0.028)
>16 0.791 (0.028)
>17 0.796 (0.029)
>18 0.857 (0.027)
>19 0.864 (0.026)
>20 0.866 (0.023)
>21 0.866 (0.023)
>22 0.866 (0.023)
>23 0.866 (0.023)
>24 0.866 (0.023)
>25 0.866 (0.023)
>26 0.866 (0.023)
>27 0.866 (0.023)
>28 0.866 (0.023)
>29 0.866 (0.023)
We see a general trend of increased performance as the number of dimensions is increased. On this dataset, the results suggest a trade-off between the number of dimensions and the classification accuracy of the model.
Interestingly, we don't see any improvement beyond 20 components. This matches our definition of the problem, where only the first 20 components contain information about the class and the remaining ten are redundant. A box-and-whisker plot is a good way to visualize the distribution of accuracy scores for each configured number of dimensions.
Conclusion:
As observed above, we changed the number of components kept by PCA and plotted its effect on the accuracy of the model. We conclude that twenty components is the best choice for this model: accuracy stops improving once the twenty informative dimensions are retained, so the remaining components can be discarded without hurting performance.
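Putting it together (a short sketch reusing the imports and dataset from the code above), the final pipeline can be defined with twenty components and evaluated one more time; based on the results above it should score around 0.866:
# sketch: the final pipeline with the chosen number of components
steps = [('pca', PCA(n_components=20)), ('m', LogisticRegression())]
final_model = Pipeline(steps=steps)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(final_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))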