3 practical examples of tricking Neural Networks using GA and FGSM. How can object classification be easily fooled?
Profil Software
Python Development Company. JS Software House. Data Science, Data Engineering, Artificial Intelligence, Machine Learning
Hi! I’m Przemysław from an AI Software Development Company located in Northern Poland, where I work as a Python developer. My interest in AI grew while I was studying reinforcement learning and computer vision. I have a strong inner need to see how things work under the hood, so I wanted to check whether I could mess with some well-known object classification models such as CNNs (Convolutional Neural Networks). They are just a bunch of numbers and mathematical operations, so let’s see if we can play with that!
Image classification
Image classification refers to a process in computer vision that classifies an image according to its visual content. It should not be confused with similar operations such as localization, object detection or segmentation. The image below shows the difference to make sure that everything is clear:
Experiments’ description
For the purpose of this article I’ve chosen two algorithms to go through. The first one is a genetic algorithm used for One Pixel Attack which, as its name suggests, changes only a single pixel value to fool the classification model. The second one is FGSM (Fast Gradient Sign Method), which modifies an image with a little noise that is practically invisible to humans but can manipulate the model’s prediction.
One Pixel Attack
When I was searching the net to find ways to fool DNN (Deep Neural Network) models, I ran across the very interesting concept of One Pixel Attack, and I knew I needed to check it out. My intuition was telling me that changing only one pixel in the original image wouldn’t be enough to break all those concepts of filters and convolutional layers used in neural networks that do a great job when it comes to object classification.
The only information used to manipulate the input image was the classification probability (percentage values for each label). To achieve that without resorting to brute force, I used a GA (Genetic Algorithm). The idea was easy:
For the experiments I used a model based on the VGG16 architecture for the cifar10 dataset with pretrained weights (https://github.com/geifmany/cifar-vgg). Pretrained weights were used to rule out the impact of a potentially badly trained model. The sample code below gives a kick-start for training your own models on that dataset:
# cifar10 dataset preparation
from keras.datasets import cifar10
from keras.utils import to_categorical
cifar_10_categories = {
    0: 'airplane',
    1: 'automobile',
    2: 'bird',
    3: 'cat',
    4: 'deer',
    5: 'dog',
    6: 'frog',
    7: 'horse',
    8: 'ship',
    9: 'truck',
}
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# training and evaluation goes here
...
The attack worked surprisingly well: for almost 20% of the images, changing only one pixel was enough to cause misclassification.
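To make the idea more concrete, here is a minimal sketch of how a genetic search for a single adversarial pixel could look. It is not the code used in the experiment: the function `one_pixel_attack`, its parameters and the toy classifier `toy_predict` are assumptions made for illustration. The only thing the search reads from the model is the class probabilities.

```python
import numpy as np


def one_pixel_attack(image, true_class, predict_proba, pop_size=50,
                     generations=30, mutation_scale=0.1, seed=0):
    """Search for a one-pixel change that lowers the true-class probability.

    image: uint8 array (H, W, 3); predict_proba: image -> class probabilities.
    Each candidate is a vector of five genes: (x, y, r, g, b).
    """
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    upper = np.array([w, h, 256, 256, 256])  # exclusive upper bound per gene

    def apply(candidate):
        x, y, r, g, b = candidate
        perturbed = image.copy()
        perturbed[y, x] = (r, g, b)
        return perturbed

    def fitness(candidate):
        # lower true-class probability means a better candidate
        return predict_proba(apply(candidate))[true_class]

    population = rng.integers(0, upper, size=(pop_size, 5))
    scores = np.array([fitness(c) for c in population])
    for _ in range(generations):
        # selection: the better half survives unchanged (elitism)
        order = np.argsort(scores)[: pop_size // 2]
        parents, parent_scores = population[order], scores[order]
        # crossover: every gene of a child comes from one of two random parents
        pairs = rng.integers(0, len(parents), size=(pop_size - len(parents), 2))
        mask = rng.random((len(pairs), 5)) < 0.5
        children = np.where(mask, parents[pairs[:, 0]], parents[pairs[:, 1]])
        # mutation: gaussian noise scaled to each gene's range, kept in bounds
        noise = rng.normal(0, mutation_scale, children.shape) * upper
        children = np.clip(children + noise.astype(int), 0, upper - 1)
        child_scores = np.array([fitness(c) for c in children])
        population = np.vstack([parents, children])
        scores = np.concatenate([parent_scores, child_scores])
    best = population[np.argmin(scores)]
    return apply(best), scores.min()


# Hypothetical toy classifier: "which colour channel dominates the image?"
def toy_predict(img):
    logits = 20.0 * img.reshape(-1, 3).mean(axis=0) / 255.0
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


image = np.full((4, 4, 3), (130, 120, 120), dtype=np.uint8)
adv, prob = one_pixel_attack(image, true_class=0, predict_proba=toy_predict)
print(toy_predict(image)[0], "->", prob)
```

With a real model, `predict_proba` would wrap something like `model.predict` on a single preprocessed image, and the population size and number of generations would need tuning.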
FGSM
Another method I found is FGSM (Fast Gradient Sign Method), which is extremely simple in concept but very effective. Without getting too deep into the technical details, this method calculates the gradient of the loss for the true label with respect to the input image, which shows how each pixel should change to increase that loss.
For an untargeted approach the next step is just to add the sign of the gradient (-1, 0 or 1 for each pixel component) to the image, which pushes the model away from the correct prediction. Some studies also use a parameter called epsilon, which is a multiplier for the sign value, but in this experiment we considered images that are represented by integer RGB values. This step can be repeated a few times to get satisfying results.
Another approach is a targeted attack, which differs in the way the gradient is calculated. For this type of attack the loss is taken between the model’s output for the input image and the target label (not the true label). The sign of the gradient is then subtracted from the image to move the classification closer to the target. Easy, isn’t it? I’ve pasted some sample code below to make it easier to understand.
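# sample code that calculates the gradients and updates an image
# (TF1-style Keras backend; `model`, `image`, `epsilon`, `base_class`,
# `target_class` and `num_classes` are assumed to come from the elided part)
import keras.backend as K
from keras import losses

sess = K.get_session()
...
# one-hot label: the target class for a targeted attack, else the true class
target = K.one_hot(
    target_class if target_class is not None else base_class, num_classes)


def get_image_update_function(target_class):
    def targeted(img, delta):
        # move towards the target label: descend the loss
        return img - epsilon * delta

    def untargeted(img, delta):
        # move away from the true label: ascend the loss
        return img + epsilon * delta

    return targeted if target_class is not None else untargeted


update_fun = get_image_update_function(target_class)

# calculate delta - the sign of the loss gradient w.r.t. the input image
loss = losses.categorical_crossentropy(target, model.output)
grads = K.gradients(loss, model.input)
delta = K.sign(grads[0])
delta = sess.run(delta, feed_dict={model.input: image})

# update image
image = update_fun(image, delta)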
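The sample above needs a Keras model and a session to run. For readers who want to play with the update rule on its own, here is a minimal NumPy sketch of the same two steps (untargeted and targeted). The softmax-regression model and the names `predict`, `loss_gradient` and `fgsm_step` are stand-ins I am assuming for illustration, not the resnet18 setup from the experiment.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def predict(W, b, image):
    """Hypothetical stand-in model: softmax regression on pixels scaled to [0, 1]."""
    return softmax(W @ (image.ravel() / 255.0) + b)


def loss_gradient(W, b, image, label):
    """Gradient of the cross-entropy loss for `label` w.r.t. the input pixels."""
    p = predict(W, b, image)
    p[label] -= 1.0                      # d(loss)/d(logits) = p - one_hot(label)
    return (W.T @ p / 255.0).reshape(image.shape)


def fgsm_step(W, b, image, true_class, target_class=None, epsilon=1):
    """One FGSM step on an integer RGB image, clipped to the valid 0..255 range."""
    if target_class is None:
        # untargeted: ascend the loss of the true label
        step = epsilon * np.sign(loss_gradient(W, b, image, true_class))
    else:
        # targeted: descend the loss of the target label
        step = -epsilon * np.sign(loss_gradient(W, b, image, target_class))
    return np.clip(image + step, 0, 255).astype(image.dtype)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3)).astype(np.uint8)
W, b = 0.1 * rng.normal(size=(10, image.size)), np.zeros(10)
true_class = int(np.argmax(predict(W, b, image)))

adv = image
for _ in range(3):                       # the step can be repeated a few times
    adv = fgsm_step(W, b, adv, true_class)
print("true label:", predict(W, b, image)[true_class], "->",
      predict(W, b, adv)[true_class])

target_class = (true_class + 1) % 10
adv_t = fgsm_step(W, b, image, true_class, target_class=target_class)
print("target label:", predict(W, b, image)[target_class], "->",
      predict(W, b, adv_t)[target_class])
```

Because every step changes each colour component by at most `epsilon`, three steps with `epsilon=1` keep every pixel within 3 intensity levels of the original.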
The model used in this experiment is resnet18 with imagenet weights. The sample code that loads it (using image-classifiers==0.2.2) is pasted below:
# loading resnet pretrained models (224x224px, 1000 classes)
from classification_models import Classifiers
ResNet18, preprocess_input = Classifiers.get('resnet18')
resnet_dim = (224, 224)
model = ResNet18(input_shape=(*resnet_dim, 3), weights='imagenet', classes=1000)
The image below presents the original image, the adversarial example generated using FGSM, and the generated noise after 2 steps of the algorithm:
Black-box FGSM
The previous method was an easy case where we have full information about the attacked model, but what if it is not available? Here is a study that estimates the gradient by using a large number of queries to the target model. I tried to fool the target model using my own model that had a different architecture but performed a similar task. The modified images were prepared based on my model (it took 7 steps to bring the true label prediction below 1%) and checked by the target model (the vgg16 cifar10 model used in the previous steps). Results from this experiment are shown below:
These results look promising but we have to take into account that these are relatively simple tasks (classifying 32x32 pixel images), and the difficulty of fooling other models will probably grow with the complexity of the structures that are used.
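The procedure itself can be sketched in a few lines. The example below is only an illustration under strong assumptions: both models are tiny softmax regressions, and the fact that they "do similar tasks" is imitated by giving them a shared weight component plus independent noise. The names `predict`, `fgsm_step` and `transfer_attack` are hypothetical.

```python
import numpy as np


def predict(W, image):
    """Hypothetical stand-in model: softmax regression on pixels scaled to [0, 1]."""
    z = W @ (image.ravel() / 255.0)
    e = np.exp(z - z.max())
    return e / e.sum()


def fgsm_step(W, image, true_class, epsilon=1):
    """Untargeted FGSM step that uses only the weights of the model passed in."""
    q = predict(W, image)
    q[true_class] -= 1.0                   # d(loss)/d(logits)
    grad = (W.T @ q).reshape(image.shape)  # positive 1/255 factor dropped: sign unchanged
    return np.clip(image + epsilon * np.sign(grad), 0, 255).astype(image.dtype)


def transfer_attack(surrogate_W, image, true_class, steps=7):
    """Craft the adversarial image on the surrogate only; the target is never queried."""
    adv = image.copy()
    for _ in range(steps):
        adv = fgsm_step(surrogate_W, adv, true_class)
    return adv


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3)).astype(np.uint8)
# Assumption: models trained for the same task end up with related weights,
# imitated here by a shared component plus independent noise.
shared = 0.1 * rng.normal(size=(10, image.size))
surrogate_W = shared + 0.03 * rng.normal(size=shared.shape)
target_W = shared + 0.03 * rng.normal(size=shared.shape)

true_class = int(np.argmax(predict(target_W, image)))
adv = transfer_attack(surrogate_W, image, true_class)
for name, W in (("surrogate", surrogate_W), ("target", target_W)):
    print(name, predict(W, image)[true_class], "->", predict(W, adv)[true_class])
```

How well the perturbation transfers in this toy depends entirely on how much the two weight matrices have in common, so it says nothing about the numbers one would get with real networks.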
Conclusion
The approaches presented here show that we can perturb images in a way that manipulates classification results. This is easy when we have full information about the model structure. Otherwise, it is hard to prepare perturbed samples with limited access to the target model.
The knowledge that comes from these experiments can help defend against such attacks by extending the training set with slightly modified images.
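As a rough sketch of that last idea, the helper below extends a training set with one FGSM-perturbed copy of every image. The function `augment_with_adversarial` and the `loss_gradient` callable are hypothetical names; the gradient used in the example is a dummy, since a real one would come from the model being trained.

```python
import numpy as np


def augment_with_adversarial(X, y, loss_gradient, epsilon=1):
    """Return the training set extended with one FGSM-perturbed copy per image.

    X: array (N, H, W, 3) with values in 0..255; y: labels with N rows;
    loss_gradient(X, y): hypothetical callable returning d(loss)/d(X), shaped like X.
    """
    X_adv = np.clip(X + epsilon * np.sign(loss_gradient(X, y)), 0, 255)
    # perturbed copies keep their true labels
    return (np.concatenate([X, X_adv.astype(X.dtype)]),
            np.concatenate([y, y]))


# Stand-in gradient for illustration only; in practice it would come from the
# model, e.g. via K.gradients as in the FGSM sample above.
def dummy_gradient(X, y):
    return np.ones_like(X)


X = np.full((4, 32, 32, 3), 128, dtype=np.float32)
y = np.eye(10, dtype=np.float32)[[3, 5, 3, 1]]
X_aug, y_aug = augment_with_adversarial(X, y, dummy_gradient)
print(X_aug.shape, y_aug.shape)  # (8, 32, 32, 3) (8, 10)
```

The perturbed copies are only adversarial for the model whose gradient produced them, so in practice they would probably have to be regenerated as training progresses.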
Resources
published at
ResearchGate: Tricking Neural Networks
Dev.to: Tricking Neural Networks
Thanks to Peter Plesa.
#ImageRecognition #GeneticAlgorithm #NeuralNetworks #MachineLearning