Cat-Dog Classifier: using image augmentation to combat overfitting
Picture from https://www.vertical-leap.uk/blog/dogs-versus-cats-which-is-best/


Intro

Recently, I took some courses on deep learning with TensorFlow, and I decided to spend some time over Thanksgiving working on a small neural network classifier of my own. Since this was my first unguided image classification project, I went with a simple problem: building a classifier to tell the difference between cats and dogs. After quite a bit of experimentation with different models and parameters, my final model achieved 87% test accuracy (meaning 87% of images were correctly classified as either dog or cat), a reasonable score; for comparison, some of the top models on Kaggle achieve results in the 90%+ range. To give the project a little more relatability, I took four pictures of cats and dogs myself to see how well the network would generalize to 'real-world' pictures not found in the dataset.

Data

The dataset can be downloaded from Kaggle (link provided below). It consists of 10,000 color images split evenly between cats and dogs, with 4,000 training and 1,000 test images of each species.

Task at hand

Goal

For the purpose of this small project, I was simply interested in correctly classifying as many images as possible. In other words, I was completely indifferent between misclassifying a dog as a cat or vice versa, so binary accuracy suffices as the performance metric.
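Concretely, binary accuracy is just the fraction of predictions that match the labels after thresholding the model's output at 0.5; a minimal sketch with made-up numbers:

import numpy as np

labels=np.array([0,0,1,1])          #hypothetical true labels (0 = cat, 1 = dog)
preds=np.array([0.1,0.7,0.8,0.6])   #hypothetical sigmoid outputs of a model
accuracy=np.mean((preds>0.5)==labels)
print(accuracy)                     #0.75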

Model

Since we are dealing with image classification, we will use a Convolutional Neural Network (CNN), a type of model that is particularly powerful for image data. Through its convolutional and pooling layers, the network slides filters over patches of an image and, through an activation function, determines whether specific features are present. In our cats vs. dogs example, it will look for things like the shape of the ears, the shape of the eyes, or the nose. Below is an illustration of what a CNN does; a really useful article for understanding CNNs better can be found here.

[Figure: illustration of what a CNN does]
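To make the convolution-plus-pooling idea concrete, here is a tiny, self-contained Keras sketch (the filter count and input shape are illustrative; this is not the model built later in the article):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

#A toy block: 16 filters of size 3x3 scan the image, ReLU decides whether each
#feature is present, and max-pooling keeps the strongest responses in each region
block=Sequential([
    Conv2D(16,(3,3),activation='relu',input_shape=(80,120,1)),
    MaxPooling2D((2,2))])
block.summary()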

Data preprocessing

This article is not intended as a coding tutorial; I have included code snippets for some of the important operations, but for the complete script you can navigate to my LinkedIn page, where you will find the Jupyter Notebook under Featured. The comments in the file assume at least some prior knowledge of neural networks.

Loading the data

Our data consist of a bunch of images grouped into separate folders for training and testing, and for cats and dogs. Although it is arguably preferable to use a data generator object to load the images, at this point I decided to use a loop, so that the resulting object (a numpy array) is easier to work with than a TensorFlow object that requires specialized functions and syntax. The images are conveniently labeled cat.1.jpg, cat.2.jpg, etc., so we can easily create the train and test file lists using a for loop like so:

#Initialize the file-name lists
catnames_train,dognames_train=[],[]
catnames_test,dognames_test=[],[]

#Images 1-4000 of each species are for training, 4001-5000 for testing
for i in range(1,4001):
    catnames_train.append('cat.'+str(i)+'.jpg')
    dognames_train.append('dog.'+str(i)+'.jpg')

for i in range(4001,5001):
    catnames_test.append('cat.'+str(i)+'.jpg')
    dognames_test.append('dog.'+str(i)+'.jpg')

Similarly, we can now load the images as separate entries in a dictionary, using their file names as keys. In the same step I also resized the images so that they are all the same size and a lot smaller (so the models won't take half a day to train).

from PIL import Image
import numpy as np
cattrain={}   #dictionary holding the training images of cats
for i in catnames_train:
    img=Image.open(catpath_train+i)
    cattrain[i]=np.asarray(img.resize((120, 80)))

You can repeat this for all four datasets. We can check the shape of an image after loading it into the dictionary; it should be a three-dimensional array of shape (80, 120, 3). If all is good, we can proceed with appending the images in the dictionaries to two numpy arrays: one for training and one for test data.
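The shape check mentioned above can be as simple as this (using the first cat image in the dictionary as an example):

print(cattrain[catnames_train[0]].shape)   #expected: (80, 120, 3)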

train_data_cats=np.asarray(list(cattrain.values()))
train_data_dogs=np.asarray(list(dogtrain.values()))

train_data=np.append(train_data_cats,train_data_dogs,axis=0)

Preprocessing

All the features are now stored in two numpy arrays: one with training and one with test features. Next, I made two changes to the data: I removed the color and I rescaled the values in the arrays so that they lie between 0 and 1. The latter is very common in machine learning; many models, including neural networks, train much more reliably on normalized inputs. I removed the color because this reduces the size of the images (and thus training time), while color carries limited information for discriminating between cats and dogs (although it could help distinguish between the animal and other objects in the fore-/background). In a later iteration I may try to improve accuracy by leaving the color in. The first line below takes the average of the RGB channels of each image as an intensity value, the second inserts a dummy channel (as some layers require a specific number of dimensions), and the last line rescales the values so that they lie between zero and one.

#Average RGB channels
train_data=np.dot(train_data[...,:3], [(1/3), (1/3), (1/3)])

#Add dummy channel
train_data=np.expand_dims(train_data,-1)

#Rescale to values between zero and one
train_data=train_data/255

Next, let's print out some random images to see what they look like after these preprocessing steps:

import random
import matplotlib.pyplot as plt

#Randomly select 6 images to plot (valid indices run from 0 to len(train_data)-1)
sample_of_6=random.sample(range(len(train_data)),6)

#Generate the plotting grid
plt.figure(figsize=(14,20))

#Plot the images side by side (drop the dummy channel and use a gray colormap)
j=0
for i in sample_of_6:
    plt.subplot(1,6,j+1)
    plt.title('Image '+str(i))
    plt.imshow(train_data[i,:,:,0],cmap='gray')
    j+=1
plt.show()

Result:

[Figure: six randomly selected preprocessed training images]

Generating labels

Training a classifier like this is a form of supervised learning, meaning that each image requires a 'label' or 'target' value to train on. Our data currently consist of a collection of images without labels, so they must be created. You may have noticed that the dataset was created using the append function, so that the first 4,000 images are cats and the next 4,000 are dogs. This was done on purpose, as we can now very easily generate the labels for the images (we will deal with shuffling later): we just create a vector with 4,000 zeroes and 4,000 ones.

train_labels=np.append((len(train_data)//2)*[0],(len(train_data)//2)*[1])

It's as simple as that. Notice the // used for the division: this is integer division, while regular division results in a float value. The result has to be an integer so that Python can use it as a list repetition count.
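To illustrate the difference:

print(8000/2)    #4000.0, a float
print(8000//2)   #4000, an integer, which is what the list repetition needs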

Let's do a sanity check to make sure all the shapes make sense, as it's very easy for mistakes to slip in. In the Notebook, you can see many more sanity checks throughout, but for the sake of simplicity I only included this final one here:

print(train_data.shape)
print(train_labels.shape)
print(test_data.shape)
print(test_labels.shape)

#---Output---
(8000, 80, 120, 1)
(8000,)
(2000, 80, 120, 1)
(2000,)

All looks well. Specifically, we have:

  1. Training features containing 8,000 images, of dimensions height×width of 80×120 with one dummy channel
  2. Training labels, a vector of length 8,000
  3. Test features containing 2,000 images, of dimensions height×width of 80×120 with one dummy channel
  4. Test labels, a vector of length 2,000

Modelling

Over the course of several days I experimented with different models, all with different types of layers, regularizers, and hyperparameters. Unfortunately, training a model could last anywhere between 20 minutes and several hours depending on complexity, so I could not try too many different ones. If you have a really powerful computer, you may be able to try out a lot more models. I will show three models that did reasonably well below.

Callbacks

I implemented a few callbacks that proved very useful during training. First, EarlyStopping: this callback stops training when a certain condition is met. I set it to stop training if the validation accuracy does not improve by at least 0.001 (0.1 percentage points) for 15 epochs. I originally implemented this so that a model would not keep training for dozens of epochs without getting any better at predicting, preserving time for training models with more potential. Although it was very useful for the initial few models, for the final model I realized I should not have used it, but more on that later.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
earlystopping=EarlyStopping(patience=15,monitor='val_accuracy',min_delta=0.001)

Next, I also included ModelCheckpoint callbacks. These simply save the model's weights when a specific criterion improves, so that the best model can be reused without retraining. I used two of these: one that saves the weights of the model with the lowest validation loss, and one for the highest validation accuracy (more on loss vs. accuracy later):

checkpoint1_acc=ModelCheckpoint(monitor='val_accuracy',
  filepath='Model1_acc.h5',
  save_weights_only=True,
  verbose=1,
  save_best_only=True)
  
checkpoint1_loss=ModelCheckpoint(monitor='val_loss',
  filepath='Model1_loss.h5',
  save_weights_only=True,
  verbose=1,
  save_best_only=True)
  

Regularization

I wrote this article with a target audience of fellow data science students/professionals in mind, so I won't go into detail on the topic of overfitting vs. underfitting (you can read more about it here). Neural networks are very powerful and thus especially prone to overfitting: it is very easy to make a network so powerful that it extracts patterns that are in fact meaningless. Indeed, my initial, unregularized models were severely overfitting the data: they approached 90%+ training accuracy within a few epochs, while the validation accuracy remained in the 60%s. Regularization is a way of combating this, and it can be done in several ways in neural networks. Here, I used a combination of Dropout layers and L2 (weight decay) regularizers. Admittedly, I do not have a specific reason for choosing these two; I just experimented a bit, and it could be that L1 regularization would lead to better results in the final model. Dropout layers are very straightforward: during training they randomly drop neurons out of the network to reduce overfitting.
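As a rough, conceptual sketch of what the L2 penalty does (the weight matrix below is hypothetical and only illustrates the calculation; Keras handles this internally):

import numpy as np

W=np.random.randn(64,64)                #hypothetical weight matrix of a Dense(64) layer
weight_decay=0.003
l2_penalty=weight_decay*np.sum(W**2)    #this term gets added to the binary cross-entropy loss
print(l2_penalty)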

Model 1

[Figure: Model 1 architecture summary]

After some experimentation, I arrived at a first model that looked promising (Model 1), composed as shown in the summary above. It has two Conv2D layers followed by a MaxPool2D layer at the start, another Conv2D and MaxPool2D layer later on, and a few Dropout layers against overfitting. It uses the standard Adam optimizer with binary_crossentropy loss and accuracy as the metric. Below is a code snippet for building, compiling, and fitting the model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras import regularizers

#Setting weight decay parameter
weight_decay1=0.003

#Building the model
model1=Sequential()
model1.add(Conv2D(32,(3,3),activation='relu',input_shape=(80,120,1),
                  padding='SAME'))
model1.add(Conv2D(32,(2,2),activation='relu',padding='SAME'))
model1.add(MaxPooling2D((3,3)))
model1.add(Dropout(0.2))
model1.add(Conv2D(64,(2,2),activation='relu',padding='SAME'))
model1.add(MaxPooling2D((2,2)))
model1.add(Flatten())
model1.add(Dense(64,activation='relu',
                 kernel_regularizer=regularizers.l2(weight_decay1)))
model1.add(Dropout(0.2))
model1.add(Dense(64,activation='tanh',
                 kernel_regularizer=regularizers.l2(weight_decay1)))
model1.add(Dropout(0.2))
model1.add(Dense(1,activation='sigmoid'))

Now, let's compile the model:

#Compiling the model
model1.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

#Training the model
history1=model1.fit(train_data,
                    train_labels,epochs=100,
                    callbacks=[earlystopping,checkpoint1_acc,checkpoint1_loss],
                    validation_split=0.2)

Let's check out the performance:

[Figure: Model 1 training and validation loss and accuracy curves]

We can clearly see that the model starts overfitting from around epoch 8 onwards: while the training loss continues to decrease, the validation loss starts rising. Similarly, the validation accuracy plateaus while the training accuracy goes into the 90% range. EarlyStopping kicked in and stopped training well before the 100 epochs were reached. The lowest loss model results in about 75% accuracy, which is not terrible, but there is clearly room for improvement. I tried resolving the overfitting by adding more Dropout layers and/or regularizers, but that resulted in an underfitting model: both training and testing accuracy stayed in the lower 60%s or even just slightly above 50%. These models were now too weak to extract important patterns from the data. The next step was thus to increase the number of layers and also increase the regularization.

Model 2

[Figure: Model 2 architecture summary]

Model 2 is slightly more complex: it now includes three pairs of convolutional layers, each pair followed by a pooling layer. I reduced the number of Dropout layers while putting weight decay regularizers on nearly all the layers. Instead of Adam, I compiled this model with RMSprop and a learning rate of 0.0001 (also after experimentation). I did not include the exact code to build the model here (it can be found in the Notebook), but a rough sketch of a similarly shaped model follows.
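As a sketch only, reusing the imports from the Model 1 snippet (the filter counts, kernel sizes, and dropout rate below are illustrative assumptions, not the exact configuration from the Notebook), a model of this shape compiled with RMSprop could look like this:

from tensorflow.keras.optimizers import RMSprop

weight_decay2=0.003   #illustrative value

model2=Sequential()
#Three pairs of convolutional layers, each pair followed by a pooling layer
model2.add(Conv2D(32,(3,3),activation='relu',padding='SAME',input_shape=(80,120,1),
                  kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(Conv2D(32,(3,3),activation='relu',padding='SAME',
                  kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(MaxPooling2D((2,2)))
model2.add(Conv2D(64,(3,3),activation='relu',padding='SAME',
                  kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(Conv2D(64,(3,3),activation='relu',padding='SAME',
                  kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(MaxPooling2D((2,2)))
model2.add(Conv2D(128,(3,3),activation='relu',padding='SAME',
                  kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(Conv2D(128,(3,3),activation='relu',padding='SAME',
                  kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(MaxPooling2D((2,2)))
model2.add(Flatten())
model2.add(Dropout(0.2))
model2.add(Dense(64,activation='relu',
                 kernel_regularizer=regularizers.l2(weight_decay2)))
model2.add(Dense(1,activation='sigmoid'))

model2.compile(optimizer=RMSprop(learning_rate=0.0001),
               loss='binary_crossentropy',metrics=['accuracy'])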

[Figure: Model 2 training and validation loss and accuracy curves]

This model's performance looks a lot better, with both validation loss and accuracy closely trailing the training metrics. The lowest-loss model achieved 80% validation accuracy, a significant improvement over Model 1. It looked like I was onto something; however, I saw that on Kaggle there were people with accuracies of more than 90%. Although I did not expect to do as well as some of the top Kagglers, I wanted to reach at least 85% test accuracy. I needed to find a way to increase model complexity without overfitting too much. Besides regularization, another way of countering overfitting is to increase the amount of training data. Manually looking for pictures of dogs and cats was not really an option, but luckily there is another trick: image augmentation.

Model 3

[Figure: Model 3 architecture summary]

This model is essentially the same as model 2, but with most of the regularizers removed. This will inevitably lead to more overfitting, but we can counter that using image augmentation. Now that most regularizers are gone, the model should be able to uncover more complex patterns.

[Figure: Model 3 training and validation loss and accuracy curves]

Although at first sight the performance looks a lot worse than Model 2's, if you look at the axes you will see that the loss is lower and the accuracy is higher than for Model 2. Notice how the training accuracy approaches 100% after 40 epochs, whereas Model 2 took 80 epochs to get to 85%. At this point I started using image augmentation to combat the overfitting that is clearly going on in this model.

Image augmentation

As mentioned earlier, one way of reducing overfitting is to increase the amount of training data. Think about it this way: overfitting is a lack of generalizability, and more training data (as long as it is fairly representative) makes the training set look more like the general population, which reduces overfitting. Rather than trying to collect more images, however, we can take the images we have and make random changes to them, essentially creating an extremely large training dataset with ease and without storage problems. In fact, each epoch effectively sees a different version of the training set. This way the model is less likely to extract patterns that are specific to the data at hand, and more likely to learn those that generalize. TensorFlow comes with a built-in class to do this.

The ImageDataGenerator

TensorFlow comes with a data generator class, which eliminates the need to load all training data into memory before training starts. Especially when dealing with image data, storage requirements tend to become large and the training data often won't fit in memory. The generator loads the data from disk batch by batch and trains the model on each batch, so we can still train the model when the data exceeds memory. I used the ImageDataGenerator, as it comes with image augmentation options. The generator can be initialized as follows:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

generator=ImageDataGenerator(
    rescale=1/255,
    rotation_range=30,
    brightness_range=[0.7,1.3],
    horizontal_flip=True,
    validation_split=0.2)

The images are rescaled so that their values lie between zero and one, they are randomly rotated by up to 30 degrees to either the left or the right, their brightness is randomly adjusted, and they are randomly flipped along the horizontal axis. In essence, each epoch gets unique images of its own. I again used a validation split of 20% of the training data.

Loading & preprocessing the data

Loading the data using a generator works differently from the for loop described before. As it turns out, the data are structured in a way that they can be directly used with the generator class. First, we need to specify the train and test directories and the different classes that exist in our data:

#Specifying directories:
traindirectory='G:/Datasets/Catdog/training_set/training_set'
testdirectory='G:/Datasets/Catdog/test_set/test_set'

#Specifying classes:
classes=['cats','dogs']

The next step is to use flow_from_directory to specify which directory to take the data from. Here we can set the batch_size, the image dimensions, and the color settings. We need separate generators for the training, validation, and test data. If the validation data don't have their own directory, we can set the subset argument to 'validation' and the generator will use the validation portion defined by the validation_split set in the generator object above. Notice that the test data use a separate instance of ImageDataGenerator, since the test images should not be augmented. I set the color_mode to 'grayscale' and the class_mode to 'binary', so that the labels are a single 0/1 value per image rather than the one-hot-encoded vectors used for multiclass classification.

traingenerator=generator.flow_from_directory(traindirectory,
    subset='training',
    batch_size=16,
    classes=classes,
    target_size=(80,120),
    seed=12,
    color_mode='grayscale',
    class_mode='binary')

valgenerator=generator.flow_from_directory(traindirectory,
    subset='validation',
    batch_size=16,
    classes=classes,
    target_size=(80,120),
    seed=12,
    color_mode='grayscale',
    class_mode='binary')

testgenerator=ImageDataGenerator(rescale=1/255).flow_from_directory(testdirectory,
    batch_size=2000,
    classes=classes,
    target_size=(80,120),
    seed=12,
    color_mode='grayscale',
    class_mode='binary')

As it turns out, the data folders contain some duplicate images, and this is reflected in the number of images the generator reports for each class. Since a Windows folder cannot contain more than one file with the same name, the duplicates have (1) appended to their file names; the for loop method we used earlier therefore excluded the duplicates automatically, as they do not match the expected names. A small number of duplicate images is not really an issue for training a model, so I won't bother manually deleting them from the directory (and if I did, the folders would no longer correspond to the original Kaggle data). Let's print out some of the augmented images with their corresponding labels:
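The plotting code is in the Notebook; a minimal sketch of one way to do it (assuming the traingenerator defined above) could look like this:

import matplotlib.pyplot as plt

#Pull one augmented batch from the training generator and plot its first six images
images,labels=next(traingenerator)

plt.figure(figsize=(14,4))
for j in range(6):
    plt.subplot(1,6,j+1)
    plt.title('Label: '+str(int(labels[j])))   #0 = cat, 1 = dog
    plt.imshow(images[j,:,:,0],cmap='gray')
plt.show()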

[Figure: a sample of augmented training images with their labels]

Training

Here I took a little shortcut: to avoid having to retrain from scratch, I loaded the weights of the lowest-loss epoch from the prior training of Model 3 as the initial weights and started training from there. We might achieve better results by training from scratch, but I don't think the difference would be worth the extra training time.

model4.load_weights('Model3_loss.h5')

history4=model4.fit(
  traingenerator,
  validation_data=valgenerator,
  epochs=100,
  callbacks=[earlystopping,checkpoint4_acc,checkpoint4_loss]
)

One thing you'll notice is that instead of being consistently higher than the training loss, the validation loss now moves around the training loss; some epochs it's higher, some epochs it's lower. This makes perfect sense and is a clear demonstration that the model is much more generalizable (it now sometimes predicts images outside the training data better than those inside it). You'll also notice a huge spike at epoch 29. Although I was initially concerned about this, after doing some research online I found that it is likely caused by either the Dropout layers or the batch size. Remember, the Dropout layers randomly drop neurons from the network; if during a certain epoch the deactivated neurons happen to be very important for predicting the animal type, prediction performance will suffer greatly, which shows up here as a spike in validation loss. Alternatively, because of the duplicate images, our batch size of 16 no longer divides the previously expected 6,400 training images evenly; the images in the last (partial) mini-batch may have outlier effects that skew the weights, as smaller samples tend to have higher variance. I am not too concerned, since these spikes are one-time deviations and performance does not seem to be much affected if you look at the validation accuracy.

[Figure: validation accuracy of the augmented model (note the narrow Y-axis range)]
[Figure: training and validation accuracy of Model 3 before and after image augmentation]

Although the spikes in validation accuracy seem large, if you look at the Y axis you can see that the range is very small. This becomes clearer in the second plot, which also shows the initial training of Model 3. The purple line marks where the initial training stopped and image augmentation kicked in: the effects that are immediately clear are that the training accuracy drops to the same level as the validation accuracy (versus the original, which tends towards 1.00) and that both plateau around 83%. The image augmentation has effectively eliminated the overfitting. So now let's evaluate all of our models.

Evaluation

Test accuracy

Now that the models are trained and validated on the training data, it is time to see how well they do on unseen test data. We load the weights of the lowest-loss model from each training run and evaluate using evaluate, or evaluate_generator for the augmented model:

model1.load_weights('Model1_loss.h5')
model1.evaluate(test_data,test_labels)

model4.load_weights('Model4_loss.h5')
model4.evaluate_generator(testgenerator)

  • Model 1 test accuracy: 65%
  • Model 2 test accuracy: 69%
  • Model 3 test accuracy: 78%
  • Augmented Model 3 test accuracy: 87%

The results show exactly why we use separate validation and test data: even though Models 1 and 2 had validation accuracies in the 70-80% range during training, they still do not generalize well to unseen images, as evidenced by their much lower test scores. This issue is much smaller for Model 3 and completely eliminated in the augmented model, whose test accuracy is even higher than its training accuracy.

Classifying my own pictures

It is always a lot more interesting when you can actually use the application you've built. In this case, I loaded four pictures I took of dogs and cats over the past weeks to see if the model would generalize even beyond the images included with the dataset. Here are the results (for the code, refer to the Notebook):

[Figure: the four personal photos with their predicted labels]

All four animals were classified correctly, most with high confidence (recall that 0 corresponds to cat and 1 to dog). I do admit that three of the pictures were relatively clear; the classifier did struggle with the noisy picture of the cat on the desk surrounded by other objects and was barely able to classify it correctly. This picture would benefit from a more powerful neural network.
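For reference, here is a minimal sketch of how one of these photos could be preprocessed and classified, mirroring the preprocessing used for the training data (the file name is hypothetical; the actual code is in the Notebook):

from PIL import Image
import numpy as np

img=Image.open('my_cat_photo.jpg')          #hypothetical file name
img=np.asarray(img.resize((120, 80)))       #same size as the training images
img=np.dot(img[...,:3],[1/3,1/3,1/3])       #average the RGB channels to grayscale
img=np.expand_dims(img,-1)                  #add the dummy channel
img=np.expand_dims(img,0)/255               #add a batch dimension and rescale
print(model4.predict(img))                  #close to 0 means cat, close to 1 means dog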

Takeaways

Augmenting the images led to a huge reduction in overfitting. The training, validation, and test performance are now very similar, which suggests that, if anything, we are underfitting. We can now build far more complex models to recognize dogs and cats in noisier pictures and achieve higher accuracy. However, as the models get more powerful, the amount of overfitting could increase again. In particular, the networks could become complex enough that augmenting the images no longer prevents overfitting (because the content of an image remains the same, regardless of transformations). I may make a second iteration of the classifier in the near future if I have time.
