Towards Deep Learning Models Resistant to Adversarial Attacks
Authors: Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu
ICLR 2018
No ISBN/ISSN explicitly provided in the document.
This paper addresses the issue that deep learning models are vulnerable to adversarial attacks,
and proposes Adversarial Training as a method to enhance model robustness.
Specifically, the authors employ Projected Gradient Descent (PGD) attacks to train robust networks
and conduct experiments on MNIST and CIFAR-10 datasets to verify its effectiveness.
In this review, I summarize the key findings of the paper and extend its scope by conducting additional experiments on FashionMNIST and SVHN
to examine whether the proposed method is effective across different datasets.
Deep learning models are highly susceptible to adversarial attacks
Adversarial examples: Inputs altered by subtle, imperceptible modifications that cause the model to misclassify them
Example:
The original image is correctly classified as a panda
After adding a small amount of adversarial noise, the model misclassifies it as a gibbon
Research Questions:
How can deep learning models be made more robust against adversarial attacks?
What are the limitations of conventional defense methods?
✅ Proposes Adversarial Training as a robust optimization approach
✅ Utilizes PGD (Projected Gradient Descent) attacks for model training
✅ Conducts experiments on MNIST and CIFAR-10 to validate the approach
FGSM (Fast Gradient Sign Method): Generates adversarial examples with a single gradient-sign step
PGD (Projected Gradient Descent):
Iteratively applies small gradient-sign steps to craft stronger adversarial examples
After each step, projects back onto the allowed perturbation set to find the most damaging attack direction (see the sketch below)
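As a concrete reference, here is a minimal PyTorch sketch of the two attacks, assuming a classifier trained with cross-entropy on inputs scaled to [0, 1]; model, epsilon, alpha, and num_steps are illustrative placeholders rather than the paper's exact settings.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    # Single-step attack: move each input by epsilon in the sign of the loss gradient
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

def pgd_attack(model, x, y, epsilon, alpha, num_steps):
    # Multi-step attack: repeated gradient-sign steps of size alpha, each projected
    # back into the epsilon L-infinity ball around the original input (pixel range [0, 1])
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1).detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv.detach()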
Standard Model:
Easily misclassifies inputs under adversarial attacks
Adversarially Trained Model:
Learns to be more robust by training with both clean and adversarial examples (training-loop sketch after this block)
Result:
Shows significantly higher robustness against adversarial attacks
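To illustrate how the robust model is obtained, the following sketch runs one epoch of PGD adversarial training, reusing the pgd_attack helper above; model, optimizer, and train_loader are assumed to be a standard PyTorch setup, and combining the clean and adversarial losses (as described above) is one of several reasonable variants.

import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, epsilon, alpha, num_steps, device="cpu"):
    # One epoch of PGD adversarial training: craft adversarial examples on the fly
    # with the current model parameters, then update on clean + adversarial batches
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, epsilon, alpha, num_steps)  # helper from the attack sketch
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)  # clean + adversarial loss
        loss.backward()
        optimizer.step()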
MNIST:
Adversarial Training effectively maintains robustness (above 98%)
CIFAR-10:
Less effective than on MNIST, with accuracy degrading as the adversarial perturbation budget increases
Since the paper only tested MNIST and CIFAR-10,
→ I extended the experiments to FashionMNIST and SVHN
FashionMNIST Results:
Maintained 80.34% Adversarial Accuracy
Showed a relatively stable robustness
SVHN Results:
Maintained 52.52% Adversarial Accuracy (relatively lower)
Due to complex backgrounds and digit variations, Adversarial Training was less effective (evaluation sketch below)
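The adversarial accuracies reported above are the fraction of PGD-perturbed test inputs the trained model still classifies correctly. Below is a minimal evaluation sketch under that assumption, reusing pgd_attack from the attack sketch; the dataset paths, batch size, and attack hyperparameters are placeholders, and this is an illustration rather than the exact code behind the 80.34% and 52.52% figures.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def adversarial_accuracy(model, loader, epsilon, alpha, num_steps, device="cpu"):
    # Fraction of test inputs still classified correctly after a PGD attack
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, epsilon, alpha, num_steps)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    return correct / total

# Example test loaders (torchvision downloads the datasets on first use)
to_tensor = transforms.ToTensor()
fashion_loader = DataLoader(datasets.FashionMNIST("data", train=False, download=True, transform=to_tensor), batch_size=128)
svhn_loader = DataLoader(datasets.SVHN("data", split="test", download=True, transform=to_tensor), batch_size=128)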
📌 Conclusion:
✅ Adversarial Training is effective for certain datasets, but
✅ Performance varies across different datasets
1️⃣ Focuses only on PGD attacks (limited testing against stronger attacks such as CW and AutoAttack)
2️⃣ Less effective on CIFAR-10 than on MNIST
1️⃣ Significant drop in clean (non-adversarial) accuracy on CIFAR-10
2️⃣ FashionMNIST remains vulnerable to CW and AutoAttack (see the evaluation sketch after this list)
3️⃣ SVHN shows weak robustness despite Adversarial Training
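For context, the CW and AutoAttack findings above can be checked with an off-the-shelf attack suite. The sketch below shows one way to measure robust accuracy with the third-party auto-attack package by Croce & Hein, following its documented interface; the function name and epsilon value are illustrative, and a CW attack would need a separate implementation (e.g., from a library such as torchattacks).

import torch
from autoattack import AutoAttack  # third-party package by Croce & Hein

def autoattack_accuracy(model, x_test, y_test, epsilon):
    # Runs the standard AutoAttack suite (APGD-CE, APGD-T, FAB-T, Square) on the
    # full test tensors (values in [0, 1]) and returns the resulting robust accuracy
    adversary = AutoAttack(model, norm="Linf", eps=epsilon, version="standard")
    x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == y_test).float().mean().item()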
✅ Develop models that are robust against attacks beyond PGD (e.g., CW, AutoAttack)
✅ Explore ways to mitigate accuracy drop in CIFAR-10
✅ Conduct additional experiments on diverse datasets
📌 Key Takeaways from this Study:
1️⃣ Adversarial Training is highly effective on MNIST but has limitations on CIFAR-10
2️⃣ When applied to FashionMNIST and SVHN, performance varies across datasets
3️⃣ CW and AutoAttack reveal vulnerabilities that PGD-based training cannot fully mitigate
📌 Final Thoughts:
✅ Adversarial Training is a powerful defense mechanism, but not a universal solution
✅ More research is needed to improve robustness across various attack types and datasets