PCA for image reconstruction, from scratch
Waifu reconstruction using PCA

Today I want to show you the power of Principal Component Analysis (PCA). It is a technique for reducing the dimensionality of data and increasing interpretability while minimizing information loss.

That being said, let us see how this magic happens! I will walk through Python code that implements PCA from scratch. This will help you understand the concept in greater detail and also give you hands-on practice. To see the power of PCA visually, I have chosen image data. Even though the data consists of various kinds of anime waifus, there are still statistical patterns in the dataset that PCA can learn, allowing it to represent the images using fewer features.

So let's begin! I am presenting code I developed as a Teaching Assistant during my Master's degree, with the help and guidance of Prof. Viswanath Gopalakrishnan, to help students understand and implement PCA in the subject Mathematics for Machine Learning. I am using a subset of a dataset available on Kaggle. The dataset can be found here: Dataset. I am assuming you are using Google Colab; however, this can also be done offline in a Jupyter Notebook on a sufficiently powerful system, using plotly offline if required. I would also recommend using dark mode for the best experience.

The first step would be to mount your Drive to Google Colab. It can be easily done using the following code:

#code to mount drive
from google.colab import drive
drive.mount('/content/drive')

Just follow the instructions to grant permission and you will be ready for the next step, which is to upload the dataset. Kindly note that I have made a folder named 'PCA' in MyDrive and uploaded the dataset into that folder.

Now we will import the required libraries. Just use the snippet given below. Here we are also setting the default renderer value to Google Colab so that our plotly plots are visible in the notebook.

#importing libraries
import os
import sys
import cv2 as cv
import numpy as np
import plotly.io as pio
import plotly.graph_objs as go

from PIL import Image
from skimage import color
from plotly import subplots
from sklearn.model_selection import train_test_split

#setting the renderer as colab
pio.renderers.default = "colab"

Now we will load the dataset. But before I proceed, a word of caution: I have set the number of images to 500 and the image dimensions to 64x64 so that you can run the code smoothly within the resource limits of Google Colab. However, if you are running it locally on a powerful system with 32 GB of RAM or more, you can change these values to get better results. For example, the dataset provides 1000 images, and you can also increase the image dimensions to 128x128.

All that being said, now here is the code for loading the images into the system. I will explain the code in detail step by step.

#Loading data and reducing size to 64 x 64 pixels
IMG_DIR = '/content/drive/MyDrive/PCA/Dataset'
X = []
X_flat = []
count = 1
size = 64
total = 500
print("Loading...")
for img in os.listdir(IMG_DIR):
    if count == total + 1:
        break
    sys.stdout.write("\r" + str(count) + " / " + str(total))
    sys.stdout.flush()
    img_array = cv.imread(os.path.join(IMG_DIR, img), cv.IMREAD_GRAYSCALE)
    img_pil = Image.fromarray(img_array)
    #Image.LANCZOS is the anti-aliasing filter (called Image.ANTIALIAS in older Pillow versions)
    img_64x64 = np.array(img_pil.resize((size, size), Image.LANCZOS))
    X.append(img_64x64)
    img_array = img_64x64.flatten()
    X_flat.append(img_array)
    count += 1
print()
print("Done!")

  • The location of the dataset is stored in the variable 'IMG_DIR'.
  • Two lists, 'X' and 'X_flat', are created for storing the images.
  • A counter 'count' is initialized to 1 to keep track of which image is being loaded.
  • 'size' is initialized to 64 so that each image is resized to 64x64.
  • 'total' is initialized to 500 so that only 500 images are loaded from the dataset.
  • Each image in the dataset is then accessed using the for loop.
  • An if condition keeps track of how many images have already been loaded. Once 'count' exceeds 'total', the loop terminates.
  • Constant feedback on which image is being loaded is printed using the two sys statements.
  • Each image is read into the variable 'img_array' using OpenCV's imread() command. Also note that each image is converted to grayscale.
  • The image (currently a numpy array) is converted to a PIL image using the Image.fromarray() command. This is done to make resizing easier.
  • The image is then resized to 64x64 using the anti-aliasing (Lanczos) filter and converted back to a numpy array.
  • It is then appended to the list 'X'.
  • A flattened version of each image is also stored in the list 'X_flat', where a 64x64 image is reshaped to a (4096,) numpy array. We will be using 'X_flat' for further processing; 'X' is just for us to see how the images look for now. (A quick shape check follows these points.)
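
To sanity-check what the loop produced, you can run something like the snippet below. This is a small check I am adding here (it is not part of the original notebook); it only assumes that 'X' and 'X_flat' were filled as described above.

#quick sanity check (my own addition): confirm what was just loaded
print(len(X), X[0].shape)            #expect: 500 (64, 64)
print(len(X_flat), X_flat[0].shape)  #expect: 500 (4096,)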

I hope those points made the code clear. Now let's check whether the images have loaded correctly and get an idea of what they look like. Below is the snippet to show the first 25 images that were loaded into 'X'.

#visualizing some images
size = 5
count = 0
fig = subplots.make_subplots(rows = size, cols = size, vertical_spacing = 0.06, horizontal_spacing = 0.02)
for row in range(size):
  for col in range(size):
    fig.add_trace(go.Image(z = color.gray2rgb(X[count])), row = row + 1, col = col + 1)
    count += 1
fig["layout"].update(title = "Anime Images", template = "plotly_dark", height = 900)
fig.show()

Once you run the code, you will see 25 images of some really pretty waifus! Kindly note that each of them is a grayscale image, and plotly does not plot single-channel grayscale images directly, so I used the gray2rgb() utility from skimage to give each grayscale image 3 channels like a standard RGB image; the default color model in go.Image() is RGB. If this plotting of traces in plotly is unfamiliar, I would highly recommend my other article on plotly. It will take you all the way from beginner to advanced plotting in no time! The article can be accessed here: A complete introduction to plotly, from beginner to advanced. Here is the expected output.

[Image: the first 25 loaded anime faces in a 5x5 grid]
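
In case you want to verify what gray2rgb() is doing, here is a quick shape check (again my own addition, not in the original notebook):

#quick check (my addition): gray2rgb stacks the single grayscale channel three times
print(X[0].shape)                    #(64, 64)
print(color.gray2rgb(X[0]).shape)    #(64, 64, 3)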

OK! Let's not distract ourselves with these pretty faces, and quickly move on to today's main agenda. In the snippet below I convert 'X_flat' into a numpy array. You will see the output (500, 4096). Remember, one flattened image was (4096,), and I have stacked all 500 images that we loaded. The dimensions of the dataset might fool you into believing there are 4096 features, but note that in this setup each image acts as a feature; therefore there are 500 features (which will be reduced to 250 by the test-train split further down the line). [Question 1: Can you tell what will happen if we take 4096 as features, once you have gone through the entire concept?]

#converting X_flat to numpy array
X_flat = np.asarray(X_flat)
X_flat.shape

Output: (500, 4096)

Now let's do the test-train split. Do note that the split has to be 50-50; you can see that the 'test_size' attribute is set to 0.5. Also note that I have taken the transpose, to make the data match the natural convention of columns being features and rows being data points. Use the 'random_state' attribute to get the same split as mine. [Question 2: Can you answer why we need to split things into equal halves, once you have gone through the entire concept?]

#Test-Train split
X_train, X_test = train_test_split(X_flat, test_size = 0.5, random_state = 69)
X_train = X_train.T
X_test = X_test.T
print(X_train.shape)
print(X_test.shape)

Output: (4096, 250) and (4096, 250)

So now we have our data ready to train PCA. Let's start implementing it! But first, let me give you a brief overview of how PCA works. PCA can be broken down into 5 steps, and I will go through them one by one. It can be a lot to take in if you are new to this topic, so please be patient and read again if required. I have tried to use precise language to put the ideas across simply, but unfortunately there is a lot of reading ahead!

  1. Standardization: We standardize the data so that each feature contributes equally to the analysis. Imagine a regression line being fit to data where the x-axis values differ by small amounts, say 0.3 to 0.9 (small variation), while the y-axis values differ by amounts in the thousands (large variation). Such a line will be heavily skewed towards the y axis and will not follow the variance in the data along the x axis. Thus we need to bring all the features to a comparable scale so that none of them dominates. In standardization, we make each feature have mean 0 and standard deviation 1.
  2. Covariance: We build a covariance matrix of the data to find whether there are relationships between features. Features may be highly correlated, which suggests redundant information, and we want to remove such redundancy. The covariance matrix gives us this insight into strongly correlated features.
  3. Eigenvalues and Eigenvectors: This step helps us determine the principal components of the data. Principal components are new features constructed as linear combinations (mixtures) of the initial features, built in such a way that the new features are uncorrelated and most of the information is packed into the first component, then the remainder into the next, and so on. This way we can drop the less important principal components to reduce the dimensionality of the data. Geometrically, eigenvectors (characteristic vectors) are vectors whose direction is unchanged by a linear transformation, and eigenvalues represent the amount by which they are scaled. So once we find the eigenvectors of the covariance matrix, we get the directions along which the data varies the most, ranked by the eigenvalues (a large eigenvalue means more variance along that eigenvector's direction). And since the covariance matrix is symmetric, the eigenvectors are perpendicular. How convenient! We can now use them as axes to describe the dataset. Thus we sort the eigenvectors by their eigenvalues in descending order to get the most important principal components first.
  4. Feature vectors: These are the vectors that make the cut, that is, the selected principal components that we think are sufficient to represent the data without much loss of information.
  5. Projection to principal components: So far we have only found the directions that best explain the data. Now we reorient the data from the original axes to the ones described by the principal components. This gives us the data with reduced dimensions. (A small toy sketch right after this list ties the five steps together.)
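
Before we get to the actual functions, here is a minimal end-to-end sketch of these five steps on a tiny random matrix, just to fix the ideas. This is purely an illustrative toy example of mine (variable names like data_std and B are not from the code below), under the usual convention of rows being data points and columns being features; the article's implementation follows the same recipe, just on the transposed image matrix.

#toy illustration (my own addition): the five PCA steps on random data
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size = (100, 5))                 #100 samples, 5 features

#Step 1: standardize each feature to mean 0 and standard deviation 1
data_std = (data - data.mean(axis = 0)) / data.std(axis = 0)

#Step 2: covariance matrix of the standardized features (5 x 5)
cov = np.cov(data_std, rowvar = False)

#Step 3: eigenvalues and eigenvectors, sorted by eigenvalue in descending order
eig_val, eig_vec = np.linalg.eigh(cov)
order = np.argsort(-eig_val)
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

#Step 4: keep the top-k eigenvectors as the feature vector (here k = 2)
k = 2
B = eig_vec[:, :k]

#Step 5: project the data onto the chosen principal components
projected = data_std @ B                           #(100, 2) reduced representation
reconstructed = projected @ B.T                    #back in the original 5-D space
print(projected.shape, reconstructed.shape)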

Phew! That seems like a lot of work! But if you understand the math properly and use vectorization in Python, coding all of this is a breeze. If these concepts are unfamiliar, I recommend in-depth reading from resources by Gilbert Strang. So let's not wait anymore and get our hands dirty!

The following snippets were originally fill-in-the-blank exercises for students, automatically checked using Jupyter Notebook's nbgrader, but here I present the complete code.

First is a function to do standardization. (Step 1)

#function for standardizing images
def Standardize(X):
    #calculate the mean of each column mu
    mu = np.mean(X, axis = 0)

    #subtract the mean from X
    X = X - mu

    #calculate the standard deviation of each column std
    std = np.std(X, axis = 0)

    #handling zero standard deviation
    std_filled = std.copy()
    std_filled[std == 0] = 1.0

    #calculate standardized X called Xbar (the mean was already subtracted above)
    Xbar = X / std_filled

    return Xbar, mu, std

Then a function to calculate eigenvalues and eigenvectors and sort them based on eigenvalues in descending order. (Step 3)

#function for calculating eigenvalues and eigenvectors
def eig(S):
    #calculate the eigenvalues and eigenvectors
    eig_val, eig_vec = np.linalg.eigh(S)

    #sorting them in decreasing order of eigenvalue
    sorted_eig = np.argsort(-eig_val)
    eig_val = eig_val[sorted_eig]
    eig_vec = eig_vec[:, sorted_eig]

    return (eig_val, eig_vec)

Then comes the function to project the data (Step 5). I know I am jumping steps, but as I said, these are the pieces I expected students to write, so they are grouped here as functions. Steps 2 and 4 are handled, in the correct order, inside the main PCA function in the snippet after this one.

#function for projection matrix
def projection_matrix(B):
    #calculate the projection matrix P
    P = B @ B.T

    return P

Now the main function to perform PCA.

#implementing PCA
def PCA(X, num_components):
    #calculate the data covariance matrix S
    S = np.cov(X)

    #now find eigenvalues and corresponding eigenvectors of S using eig()
    eig_vals, eig_vecs = eig(S)

    #select the first num_components eigenvectors
    U = eig_vecs[:, range(num_components)]

    #reconstruct the images from the lower-dimensional representation
    #to do this, we first need the projection matrix (implemented earlier)
    #which projects our input data onto the vector space spanned by the eigenvectors
    P = projection_matrix(U) #projection matrix

    return P

Note how all the code falls into place, following each step in order. Now let's use all the functions we have made. First, you guessed it, we standardize the data.

#standardizing the train and test data
Xbar_train, mu_train, std_train = Standardize(X_train)
Xbar_test, mu_test, std_test = Standardize(X_test)

Now I will write a function to calculate the Mean Squared Error (MSE) of the reconstructions.

#function for mean squared error
def mse(predict, actual):
    return np.square(predict - actual).sum(axis = 1).mean()

Now, using the train data, I build a projection matrix that projects the given data onto the eigenvector feature space; this is done by the line (projection = PCA(Xbar_train.T, num_component)). Then I use this projection matrix, built from the training data, to project the test data into the same eigenvector space and see what happens, using the line (reconst = Xbar_test @ projection). Can unseen data (an unknown anime waifu) also be reconstructed using this projection matrix? Let's find out!

#calculating loss and reconstructing images
loss = []
reconstructions = []
max_components = len(X_train.T)
print("Processing...")
animation = np.arange(1, max_components + 1, 1)
for num_component in range(1, max_components + 1):
    sys.stdout.write("\r" + str(animation[num_component - 1]) + " / " + str(max_components))
    sys.stdout.flush()
    projection = PCA(Xbar_train.T, num_component)
    reconst = Xbar_test @ projection
    error = mse(reconst, Xbar_test)
    reconstructions.append(reconst)
    loss.append((num_component, error))
print()
print("Done!")

You can see I am gradually increasing the number of principal components from 1 all the way to the maximum (250 in this case, since the data had 250 features to start with). I am simultaneously recording the reconstruction loss so that we can plot a graph. Let us look at the graph and understand what happened.

#visualizing mse vs number of principal components
loss = np.asarray(loss)    #convert the list of (num_component, error) tuples to an array before slicing
trace = go.Scatter(x = loss[:, 0], y = loss[:, 1])
data = [trace]
fig = go.Figure(data)
fig.update_layout(title = "MSE vs number of principal components",
                  xaxis_title = "Number of principal components",
                  yaxis_title = "MSE", template = "plotly_dark")
fig.show()
[Figure: plot of MSE vs number of principal components]

You can clearly see the MSE decrease as I increase the number of principal components used to build the projection matrix. That means the quality of the reconstructed images gets better and better as we include more and more principal components, with the MSE dropping almost to zero when all 250 principal components are used.
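
If you prefer a number over eyeballing the curve, a common diagnostic is the cumulative explained-variance ratio of the eigenvalues. The snippet below is my own addition (it is not part of the original assignment code); it reuses the eig() function defined earlier and assumes Xbar_train from the standardization step is still in memory.

#extra diagnostic (my addition): components needed for 95% of the training variance
S_train = np.cov(Xbar_train.T)             #same covariance matrix as used inside PCA()
eig_val, _ = eig(S_train)                  #eigenvalues sorted in descending order
explained = np.cumsum(eig_val) / np.sum(eig_val)
k_95 = np.argmax(explained >= 0.95) + 1
print("Components needed for 95% of the variance:", k_95)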

Let's find out how our waifus are looking! I can't wait!

First, remember that the data must be unstandardized, otherwise the images won't have the correct pixel values. We can do it using the following simple code.

#unstandardizing the reconstructed images
reconstructions = np.asarray(reconstructions)
reconstructions = reconstructions * std_test + mu_test
loss = np.asarray(loss)
print("Done")

Once done, we can have a look at our waifus using the following code.

#plotting the reconstructed images
col = 5
row = 2
size = 10
skip = 20
images = 3

component = 0
titles = []
for loop in range(size):
  titles.append(str(component) + " components")
  component += skip

figs = [0]*images
for num in range(images):
  figs[num] = subplots.make_subplots(rows = row, cols = col, subplot_titles = titles)
  component = 0
  for r in range(row):
    for c in range(col):
      figs[num].add_trace(go.Image(z = color.gray2rgb(reconstructions[component, :, num].reshape(64, 64))), row = r + 1, col = c + 1)
      component += skip
  figs[num]["layout"].update(title = "Reconstructed Images", template = "plotly_dark")
for num in range(images):
  figs[num].show()

The output must look something like this:

[Figures: three grids of reconstructed faces, each showing a face rebuilt with 0 to 180 principal components]

Three waifus are ready! You can see I am able to get some adorable faces using just 180 principal components. That means around 70 of the 250 features were mostly redundant (250 - 180). Imagine finding out that about 30% of the features you are providing to your model are redundant! This information can help the model as a whole attain better performance.

We all love anime, and I hope you now associate PCA with it every time too! I hope I made this topic easy and interesting to understand. PCA is a difficult yet very important algorithm that helps more often than not. I would recommend further reading in Linear Algebra if you did not fully follow what was happening. Use resources by Gilbert Strang to get a concrete idea of linear transformations, eigenvalues and eigenvectors and their significance, projections, and so on.

That is all from me for this time. Kindly excuse me now. Do answer the two questions in the comments.

I hope you gained something from this article and I wish you best of luck.

Bye!

Special thanks to Prof. Viswanath Gopalakrishnan for helping me fix major issues in the code.
