Introduction

Over the past decade, machine learning models have evolved from simple image classifiers into sophisticated systems capable of solving complex tasks, from visual recognition to mathematical reasoning and even code generation. The advent of models like AlexNet in 2012 was a turning point, marking a new era of AI-driven perception. Since then, these models have matured and now exhibit remarkable accuracy in identifying and classifying objects.

Yet, it is both surprising and concerning that these powerful models can be deceived by carefully crafted inputs known as adversarial examples. Even minute changes to the pixels of an image can cause a model that confidently recognized a panda to classify the same image as a gibbon. Contrary to popular belief, these adversarial examples are not mere random anomalies. Instead, they emerge from a deep mathematical structure and can be systematically produced.

In this blog, we’ll focus on one of the earliest and most foundational methods in the generation of adversarial examples: the Fast Gradient Sign Method (FGSM) proposed by Goodfellow et al. in their seminal paper Explaining and Harnessing Adversarial Examples. By understanding FGSM, we gain insight into why models are so vulnerable to small, carefully chosen perturbations—and how this vulnerability relates to the intricate geometry of high-dimensional spaces.

Adversarial Attacks: More Than Just Random Noise

The existence of adversarial examples might initially seem surprising. After all, if models are so accurate, how can something seemingly insignificant—like a subtle pixel-level tweak—cause such a drastic misclassification? But these adversarial examples are not flukes. There is purposeful, mathematical reasoning behind them. Once you see the logic, it becomes evident how straightforward these attacks can be to construct.

The Fast Gradient Sign Method (FGSM)

One of the earliest and simplest adversarial attack methods is FGSM. Goodfellow and colleagues introduced this technique to demonstrate just how easily a model’s perception could be misled.

The core idea of FGSM is to modify the original input by adding a tiny perturbation in the direction that maximally increases the model’s loss. This method relies on gradient-based optimization, much like the training process itself, but flips the script to push the model away from the correct classification rather than towards it.

Mathematically, FGSM can be expressed as:

$$ x'= x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x,y)) $$

Where:

- $x$ is the original input (e.g., an image),
- $x'$ is the resulting adversarial example,
- $\epsilon$ is a small scalar that controls the magnitude of the perturbation,
- $J(\theta, x, y)$ is the model’s loss function, with parameters $\theta$, input $x$, and true label $y$,
- $\nabla_x J(\theta, x, y)$ is the gradient of the loss with respect to the input, and
- $\text{sign}(\cdot)$ takes the element-wise sign of that gradient.
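To make the formula concrete, here is a minimal sketch of the FGSM update for a binary logistic-regression model in NumPy. The model, its weights, and the `fgsm_perturb` helper are illustrative assumptions, not code from the original paper; the same one-line update applies to any differentiable model once you can compute the gradient of the loss with respect to the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, epsilon):
    """One FGSM step for binary logistic regression.

    The loss is binary cross-entropy J = -[y log p + (1 - y) log(1 - p)],
    with p = sigmoid(w . x + b). Its gradient with respect to the
    input works out to dJ/dx = (p - y) * w.
    """
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w                    # gradient of the loss w.r.t. x
    return x + epsilon * np.sign(grad_x)    # x' = x + eps * sign(grad)

# Toy example: a correctly classified point...
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, 1.0]), 1
x_adv = fgsm_perturb(x, y, w, b, epsilon=0.6)
# ...whose FGSM-perturbed copy falls on the wrong side of the boundary.
```

Note that only the sign of each gradient component is used, so every input dimension is nudged by exactly $\pm\epsilon$, which keeps the perturbation small in the $\ell_\infty$ sense.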

Intuition Behind the Equation

During normal training, gradient descent is used to reduce the loss. The network’s parameters are updated by moving in the opposite direction of the gradient, refining its understanding and making correct classification more likely.

FGSM, however, uses a sort of “reverse” logic. Instead of guiding the model toward the correct classification, it identifies the direction in which the input should move to increase the loss—sometimes called gradient ascent. By making a slight shift in each pixel’s value in the direction that harms the model’s performance, we effectively push the input across the decision boundary into a misclassified region.
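The descent-versus-ascent contrast can be checked numerically. The snippet below (a self-contained toy with assumed weights, not the paper's setup) evaluates the loss at the clean input, after a step *with* the sign of the gradient (the FGSM direction), and after a step *against* it: the first raises the loss, the second lowers it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy loss of a logistic model at input x."""
    p = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy linear model and a correctly classified input (illustrative values).
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, 1.0]), 1

# Gradient of the loss w.r.t. the input: dJ/dx = (p - y) * w.
grad_x = (sigmoid(np.dot(w, x) + b) - y) * w

eps = 0.25
loss_clean   = bce_loss(x, y, w, b)
loss_ascent  = bce_loss(x + eps * np.sign(grad_x), y, w, b)  # FGSM direction
loss_descent = bce_loss(x - eps * np.sign(grad_x), y, w, b)  # opposite direction

# Stepping with the gradient's sign increases the loss;
# stepping against it decreases the loss.
assert loss_ascent > loss_clean > loss_descent
```

With a large enough $\epsilon$, the ascent step does not just raise the loss, it carries the input across the decision boundary, which is exactly the misclassification FGSM aims for.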