Clouded Vision: Adversarial Attack and the Fooling of Artificial Intelligence

by Dr. Andrew Kyngdon

Much has been made of the rise of Artificial Intelligence. Social and mainstream media are ablaze with it. Talking heads advocate government pensions for all because “…AI will be automating [insert favourite percentage] of jobs in [insert favourite number] of years’ time”. Leaders in AI like Andrew Ng patiently argue that AI is the new electricity, with education and training being the keys to harnessing its power.

No one knows exactly where AI will take humanity, much as those who deployed the first steam engines could scarcely have conceived of a smart phone. But one thing history teaches us is that people rarely ignore a technological advance.

All the hype notwithstanding, the AI and computer vision communities have been concerned about a more immediate and concrete problem. Szegedy, et al, (2014) found that by making clever changes to pixel intensities, an AI algorithm (Deep Neural Network) can be deceived into confusing what appears in an image for something else. These changes can be trivial or even undetectable to the human eye. This is known as “adversarial attack”.

Clean Image

At Neuranext, we love cars. To illustrate adversarial attack I selected a masterpiece by the Alfa Romeo hot-rodder turned motoring icon Enzo Ferrari. Below is a Ferrari 250 GTO, which connoisseurs consider the best instantiation of the marque. Despite only 36 being made, it is one of the world’s most recognised sports cars.

Let’s see what AI makes of this image. I’m going to use the AI library called “Caffe”, developed by Berkeley Artificial Intelligence Research, as well as the CaffeNet model and weights trained on the ImageNet database. CaffeNet is BAIR’s variant of AlexNet, the model that famously won the 2012 ImageNet Large Scale Visual Recognition Challenge.

To have CaffeNet classify the GTO, I created a Jupyter Notebook similar to the one prepared by BAIR here. The model predicts the 250 GTO belongs to the ImageNet category of “sports car”, with a probability of 0.5718. Its next best prediction is “convertible”, with a probability of 0.2334.
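Reading the top predictions off CaffeNet’s softmax output is only a few lines of NumPy. The label and probability values below are the ones reported above; the variable names are illustrative, and in the actual notebook the probability vector comes from the network’s forward pass (`net.forward()['prob'][0]` in the BAIR classification example).

```python
import numpy as np

# Illustrative softmax output for a handful of ImageNet classes;
# in the real notebook this vector comes from net.forward()['prob'][0].
labels = np.array(["sports car", "convertible", "station wagon", "racer"])
probs = np.array([0.5718, 0.2334, 0.0420, 0.0155])

# Rank classes by descending probability and report the top two.
top = np.argsort(probs)[::-1]
for idx in top[:2]:
    print(f"{labels[idx]}: {probs[idx]:.4f}")
```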

Now let’s attack CaffeNet. To do that, I’m going to use DeepFool Caffe so I can access the Caffe Model Zoo and create adversarial models of the VGGNets, GoogLeNet, the various ResNets and SqueezeNet. Weights for these models trained on ImageNet are also available.
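DeepFool works by searching for the smallest perturbation that pushes an input across the model’s decision boundary. For an affine binary classifier f(x) = w·x + b that perturbation has a closed form, and for deep networks DeepFool applies the same step repeatedly to a local linearisation of the model. A minimal NumPy sketch of the affine case (the numbers are toy values, not from the GTO experiment):

```python
import numpy as np

def deepfool_linear(x, w, b, overshoot=0.02):
    """Minimal perturbation pushing x across the boundary of the
    affine classifier f(x) = w.x + b (the one-step DeepFool case).
    For deep networks, DeepFool applies this step repeatedly to a
    local linearisation of the model around the current point."""
    f = np.dot(w, x) + b
    r = -f * w / np.dot(w, w)           # closed-form minimal step
    return x + (1 + overshoot) * r      # small overshoot to cross the boundary

# Toy example: a point classified as positive by f(x) = w.x + b.
w = np.array([1.0, 2.0])
b = -1.0
x = np.array([2.0, 1.0])                # f(x) = 3.0 > 0
x_adv = deepfool_linear(x, w, b)        # f(x_adv) is now slightly negative
```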

Adding DeepFool’s perturbations results in our USD $50 million GTO looking like it got a little dirty after a stint in the Mille Miglia. Being slightly degraded may raise suspicions, but it is still clearly a Ferrari.

CaffeNet’s confidence, however, is shaken. It predicts that there is only a 39.65% chance of it being a sports car. Furthermore, the chance of it being a convertible has increased to 36.12%.

This is an example of a “white box” adversarial attack, specifically, a “confidence reduction attack” (Papernot, et al, 2016), which lessens the sureness an AI model has in its predictions. In a white box attack, the architecture and weights of the AI model have been compromised and are in the hands of an attacker who is also skilled in AI.

Image with DeepFool CaffeNet Distortions
Image with Gaussian Noise Added

A less sophisticated threat is the “black box” attack, where the attacker knows nothing of the model used and needs little knowledge of AI. Adding Gaussian noise to an image is a simple example, which can be done in Python using the scikit-image module.
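As a sketch, adding Gaussian noise takes only a few lines. scikit-image’s `skimage.util.random_noise(image, mode='gaussian')` does the same in one call; the plain NumPy version below makes the operation explicit (the function name and parameter values are illustrative):

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.1, seed=None):
    """Add zero-mean Gaussian noise to an image with pixel values in
    [0, 1]. scikit-image offers the same via
    skimage.util.random_noise(image, mode='gaussian')."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)    # keep the result a valid image

# A flat grey "image" perturbed with mild noise.
clean = np.full((4, 4, 3), 0.5)
noisy = add_gaussian_noise(clean, sigma=0.05, seed=0)
```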

Whilst a bit grainy, the car is still clearly visible. But CaffeNet is now even more confused. The GTO has a 29.83% chance of being a sports car and a similar chance (28.84%) of being a convertible.

Nonetheless, CaffeNet stubbornly endorses its Top 1 prediction of the clean image. Flicking through the ImageNet thumbnails for sports cars, I realised the choice of image may have been unintentionally biased. The majority of cars depicted are 30–50 years younger than the GTO, with only a smattering of vintage sports cars. Furthermore, many images are of competition-only vehicles (even a Formula 1 car), not roadworthy sports cars. This means the GTO image may not come from the same distribution that CaffeNet was trained on, so some resilience is not implausible.

But what if we were to mount a “two stage” attack (Zantedeschi, et al, 2017) that combines both white and black box approaches? Using DeepFool on the image with Gaussian noise, CaffeNet finally relents with an incorrect Top 1 prediction of “convertible” (26.43% chance). The chance of it being a sports car is now 23.12%. We now have an “adversarial image”, and whilst the image is degraded, the Ferrari is obviously no convertible to human eyes.

More powerful AI architectures are not immune. The VGG16 model predicts the GTO in the clean image has a 70% chance of being a sports car. But attack it with DeepFool and the prediction plummets to 48.3%. GoogLeNet is similarly affected but less so.

Our limited example shows that an AI vision model can be fooled. Models other than the Deep Neural Networks we’ve explored here, such as shallow regression and kernel/support vector machines, are also vulnerable (Goodfellow, et al, 2015). Indeed, because these models cannot benefit from the universal approximation theorem (which states that a sufficiently large neural network can approximate any continuous function), they are unlikely to ever mount an effective defence (Goodfellow, et al, 2015).

Adversarial attacks and defences against them are currently the focus of a vigorous research effort. This raises the question of what can be done to mitigate the risk of an adversarial attack. Whilst there is no silver bullet at present, below are some strategies worth exploring.

Image with Gaussian Noise + DeepFool CaffeNet Distortions

Adversarial Training

This involves creating adversarial examples of the images an AI model was trained on, adding them to the data, then “fine-tuning” the model on the new dataset (Szegedy, et al, 2014). DeepFool, FoolBox and CleverHans can all create adversarial images for this purpose.
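As an illustration of how such examples are generated, the fast gradient sign method of Goodfellow, et al, (2015) nudges each pixel by a small step in the direction that increases the model’s loss. The sketch below assumes the input gradient has already been obtained from a backward pass through the network; the libraries above wrap that whole process up for you.

```python
import numpy as np

def fgsm_example(image, grad, epsilon=0.007):
    """Fast gradient sign method (Goodfellow et al., 2015): shift each
    pixel by epsilon in the direction that increases the loss. `grad`
    is the gradient of the loss w.r.t. the input image, which in
    practice comes from a backward pass through the network."""
    adv = image + epsilon * np.sign(grad)
    return np.clip(adv, 0.0, 1.0)      # stay inside the valid pixel range

# Adversarial training then fine-tunes the model on the union of the
# clean training images and their adversarial counterparts.
```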

However, two stage attacks can break AI models hardened by adversarial training (Zantedeschi, et al, 2017).

Feature Squeezing

Colour images intended for human viewing are typically stored with 24-bit pixel values (8 bits × 3 channels). But when used as input to an AI vision model, such images offer a vast space for a skilled attacker to exploit (Xu, et al, 2017a). Images can be “squeezed” by reducing the colour depth of each pixel or by applying spatial smoothing (e.g., “Gaussian blur”). If there is a big difference in the model’s output between the squeezed image and the original, the original is likely adversarial.
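A sketch of both halves of the idea, assuming pixel values in [0, 1] (the detection threshold is illustrative; Xu, et al, tune it per model):

```python
import numpy as np

def squeeze_bit_depth(image, bits=4):
    """Reduce colour depth from 8 bits per channel to `bits` bits,
    as in the bit-depth squeezer of Xu et al. (2017a). Pixel values
    are assumed to lie in [0, 1]."""
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

def looks_adversarial(probs_original, probs_squeezed, threshold=0.5):
    """Flag the input if squeezing moves the model's output by more
    than `threshold` in L1 distance. The threshold here is
    illustrative; in practice it is tuned per model."""
    return np.abs(probs_original - probs_squeezed).sum() > threshold
```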

Feature squeezing works by trying to detect adversarial images, rather than by hardening AI models against misclassifying them. This may be an effective defence (Xu, et al, 2017b) and squeezing images is easily done using Python. Xu, et al, (2017a) even provided the code.

Defensive Distillation

With this technique, the image classification probabilities produced by an AI model are obtained. These “soft” or “smoothed” labels replace the “hard” labels originally used, and the model is retrained on them.

According to Papernot, et al, (2016), soft label vectors contain more information about the relative differences between the image classes. With our GTO example, CaffeNet’s probability of 0.5718 for sports car and 0.2334 for convertible suggests the model considers the former to be more similar to the latter than to a “station wagon” (at 0.042). Training on soft labels exploits this information and mitigates overfitting.
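Distillation obtains these soft labels by raising the softmax “temperature” T during training. A small NumPy sketch (the logits are made-up class scores) shows how a higher temperature smooths the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax at temperature T, as used in defensive distillation
    (Papernot et al., 2016). Higher T flattens the distribution, so
    the resulting soft labels retain more inter-class information."""
    z = (logits - logits.max()) / T    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, 0.5])     # illustrative class scores
hard = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=10.0)
```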

Defensive distillation can reduce misclassification error rates to around 5% with adversarial examples (Papernot, et al, 2016). However, Carlini & Wagner (2016) and Zantedeschi, et al, (2017) broke the defence.

Bounded ReLUs and Gaussian Augmentation

The accumulation of minor image perturbations across the layers of an AI model may cause adversarial attacks (Zantedeschi, et al, 2017). Bounding the Rectified Linear Unit (ReLU), a type of neuron popular in computer vision, prevents minor image distortions being “amplified” throughout the network (Zantedeschi, et al, 2017).
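A bounded ReLU is just an ordinary ReLU clipped at a threshold t, so activations can never grow without limit. A one-line NumPy sketch:

```python
import numpy as np

def bounded_relu(x, t=1.0):
    """Bounded ReLU of Zantedeschi et al. (2017): clip activations to
    [0, t] so that small input perturbations cannot be amplified
    without limit as they propagate through the layers."""
    return np.minimum(np.maximum(x, 0.0), t)
```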

Zantedeschi, et al, (2017) found that bounded ReLUs, plus augmenting the training data with images containing Gaussian noise, may be an effective defence against adversarial attack.

If you are thinking of creating an AI vision system, or you currently have one in place, it would be prudent to consider hardening your models against adversarial attack. Remember also to follow your cybersecurity protocols and seek expert assistance if needed.

As Artificial Intelligence reportedly continues its Terminator-like march into our jobs, cars and lounge rooms, adversarial attack reminds us that AI is perhaps currently more steam engine than smart phone. But it’s advancing quickly.

Share your thoughts.