Using Autoencoders for Dimensionality Reduction: A Practical Guide with MNIST
Rany ElHousieny, PhD
Dimensionality reduction is a crucial technique in data preprocessing, particularly for high-dimensional datasets. It simplifies the data, reduces storage requirements, and often improves the performance of machine learning models. One powerful method for dimensionality reduction is the autoencoder. In this article, we’ll explore how to use autoencoders for this purpose on the MNIST dataset and then compare the resulting classification accuracy with PCA.
What is an Autoencoder?
An autoencoder is a type of neural network designed to learn efficient codings of input data. It consists of two main parts:
Encoder: compresses the input into a lower-dimensional latent representation.
Decoder: reconstructs the original input from that latent representation.
For dimensionality reduction, we train the full autoencoder on a reconstruction objective and then keep only the encoder, as we'll do step by step below.
Steps to Use Autoencoders for Dimensionality Reduction
1. Load and preprocess the dataset.
2. Define and train the autoencoder.
3. Extract the encoded features from the trained encoder.
4. Use the encoded features for further analysis.
Let’s walk through these steps using the MNIST dataset.
1. Load and Preprocess the Dataset
First, we need to load the MNIST dataset, which contains images of handwritten digits.
from tensorflow.keras.datasets import mnist
import numpy as np
# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Normalize the data to the range [0, 1]
X_train = X_train.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.
# Flatten the images to vectors of size 784 (28*28)
X_train = X_train.reshape((X_train.shape[0], -1))
X_test = X_test.reshape((X_test.shape[0], -1))
import matplotlib.pyplot as plt
# Plot some of the images
num_images = 10
plt.figure(figsize=(10, 1))
for i in range(num_images):
    # Reshape the flattened image back to 28x28
    image = X_train[i].reshape(28, 28)
    # Plot the image
    plt.subplot(1, num_images, i + 1)
    plt.imshow(image, cmap='gray')
    plt.axis('off')
plt.show()
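A quick sanity check on the flattened arrays confirms the preprocessing (MNIST ships with 60,000 training and 10,000 test images):
# Verify the flattened shapes
print(X_train.shape)  # (60000, 784)
print(X_test.shape)   # (10000, 784)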
2. Define and Train the Autoencoder
We define an autoencoder with a simple, symmetric architecture: the encoder compresses each 784-dimensional input through layers of 128 and 64 units down to a 32-dimensional representation, and the decoder mirrors this path back to 784 dimensions.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# Define the input layer
input_shape = (784,) # 28*28 pixels
input_layer = Input(shape=input_shape)
# Define the encoder
hidden_layer_1 = Dense(128, activation='relu')(input_layer)
hidden_layer_2 = Dense(64, activation='relu')(hidden_layer_1)
encoded_representation = Dense(32, activation='relu')(hidden_layer_2)
# Define the decoder
decoded = Dense(64, activation='relu')(encoded_representation)
decoded = Dense(128, activation='relu')(decoded)
output_layer = Dense(784, activation='sigmoid')(decoded)
# Create the autoencoder model
autoencoder = Model(inputs=input_layer, outputs=output_layer)
# Compile the model
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# Train the autoencoder
autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, shuffle=True, validation_data=(X_test, X_test))
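Before moving on, it is worth eyeballing the reconstructions to confirm the autoencoder has learned something useful. Here is a minimal sketch that reuses the variables defined above (num_images, X_test, and the trained autoencoder):
# Reconstruct the test images with the trained autoencoder
reconstructed = autoencoder.predict(X_test)
plt.figure(figsize=(10, 2))
for i in range(num_images):
    # Original image on the top row
    plt.subplot(2, num_images, i + 1)
    plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
    # Reconstruction on the bottom row
    plt.subplot(2, num_images, num_images + i + 1)
    plt.imshow(reconstructed[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.show()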
3. Extract Encoded Features
After training, we extract the encoder part of the model to obtain the encoded features.
# Extract the encoder model
encoder = Model(inputs=input_layer, outputs=encoded_representation)
# Obtain the encoded features
encoded_train_features = encoder.predict(X_train)
encoded_test_features = encoder.predict(X_test)
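A shape check confirms the compression: each 784-pixel image is now represented by a 32-dimensional vector.
# Each image is now a 32-dimensional vector
print(encoded_train_features.shape)  # (60000, 32)
print(encoded_test_features.shape)   # (10000, 32)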
4. Use Encoded Features for Further Analysis
We can now use these encoded features for further analysis. For instance, we can train a classifier on the encoded features and evaluate its performance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train a classifier on the encoded features
clf = RandomForestClassifier()
clf.fit(encoded_train_features, y_train)
# Evaluate the classifier
predictions = clf.predict(encoded_test_features)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
Accuracy: 0.9453
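For context, it helps to know what the same classifier achieves without any dimensionality reduction. The following sketch trains a Random Forest on the raw 784-dimensional pixels (expect it to be slower, since it works with roughly 25 times more features):
# Baseline: train the same classifier on the raw pixels
clf_raw = RandomForestClassifier()
clf_raw.fit(X_train, y_train)
raw_predictions = clf_raw.predict(X_test)
print(f'Accuracy on raw pixels: {accuracy_score(y_test, raw_predictions)}')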
Autoencoder vs PCA
Let's compare autoencoders with Principal Component Analysis (PCA) using the same MNIST dataset.
1. Dimensionality Reduction with PCA
Let's perform dimensionality reduction using PCA:
from sklearn.decomposition import PCA
# Initialize PCA with the same number of components as the autoencoder
pca = PCA(n_components=32)
# Fit PCA on the training data and transform both training and test data
pca_train_features = pca.fit_transform(X_train)
pca_test_features = pca.transform(X_test)
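One practical advantage of PCA is that it reports how much of the data's total variance the chosen components retain, via scikit-learn's explained_variance_ratio_ attribute:
# Fraction of the total variance captured by the 32 components
print(f'Explained variance retained: {pca.explained_variance_ratio_.sum():.4f}')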
2. Train and Evaluate a Classifier using PCA Features
We'll train and evaluate a Random Forest classifier using the features obtained from the PCA:
# Train a classifier on the PCA encoded features
clf_pca = RandomForestClassifier()
clf_pca.fit(pca_train_features, y_train)
pca_predictions = clf_pca.predict(pca_test_features)
pca_accuracy = accuracy_score(y_test, pca_predictions)
print(f'Accuracy using PCA: {pca_accuracy}')
Comparison and Results
Accuracy using PCA: 0.9535
Recall the accuracy obtained earlier with the autoencoder features:
Accuracy: 0.9453
This example demonstrates how to perform dimensionality reduction with both autoencoders and PCA, and how to evaluate a classifier trained on the reduced features. In this run, PCA came out slightly ahead (0.9535 vs. 0.9453), though the outcome can shift with the autoencoder's architecture, training time, and random initialization.
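Classification accuracy is only one lens; another is reconstruction error, i.e., how faithfully each 32-dimensional representation can be decoded back into the original image. A minimal sketch, assuming the trained autoencoder and fitted pca objects from above are still in scope:
# Reconstruct the test images from each 32-dimensional representation
ae_reconstructed = autoencoder.predict(X_test)
pca_reconstructed = pca.inverse_transform(pca_test_features)
# Mean squared reconstruction error over the test set
ae_mse = np.mean((X_test - ae_reconstructed) ** 2)
pca_mse = np.mean((X_test - pca_reconstructed) ** 2)
print(f'Autoencoder reconstruction MSE: {ae_mse:.5f}')
print(f'PCA reconstruction MSE: {pca_mse:.5f}')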
Conclusion
Both autoencoders and PCA are powerful tools for dimensionality reduction. Autoencoders can learn more complex and non-linear transformations, potentially capturing more intricate structures in the data. PCA, on the other hand, is a linear method that is often simpler and faster to apply. The choice between them depends on the specific characteristics of your data and the requirements of your task. In practice, it's valuable to experiment with both methods to determine which one works best for your particular application.