Understanding Catastrophic Overfitting in Adversarial Training

Abstract

Recently, FGSM adversarial training has been found to train a robust model comparable to one trained by PGD, but an order of magnitude faster. However, it has a failure mode called catastrophic overfitting (CO), in which the classifier suddenly loses its robustness during training and hardly recovers on its own. In this paper, we find that CO is not limited to FGSM but also occurs in DF∞-1 adversarial training. We then analyze the geometric properties of both FGSM and DF∞-1 and find that they have totally different decision boundaries after CO. For FGSM, a new decision boundary is generated along the direction of the perturbation, which makes a small perturbation more effective than a large one. For DF∞-1, no new decision boundary is generated along the direction of the perturbation; instead, the perturbation generated by DF∞-1 becomes smaller after CO and thus loses its effectiveness. We also experimentally analyze three hypotheses about potential factors causing CO. Based on this empirical analysis, we modify RS-FGSM by not projecting the perturbation back onto the l∞ ball. With this small modification, we achieve 47.56 ± 0.37% PGD-50-10 accuracy on CIFAR10 with ε=8/255, compared with 43.57 ± 0.30% for RS-FGSM, and further extend the working range of ε from 8/255 to 11/255 on CIFAR10 without CO occurring.
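To make the proposed modification concrete, the following is a minimal sketch of one RS-FGSM perturbation step with the projection made optional. It is an illustration under stated assumptions, not the authors' implementation: the function name `rs_fgsm_delta`, the `grad_fn` callback, the `project` flag, and the toy linear gradient are all hypothetical, and the step size `alpha = 1.25 * eps` is a commonly used RS-FGSM setting that the abstract does not confirm. Clipping the perturbed input to the valid pixel range is omitted for brevity.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def rs_fgsm_delta(grad_fn, x, eps, alpha, project=True, rng=random):
    """Compute one RS-FGSM perturbation for input x (a list of floats).

    grad_fn(x_adv) is assumed to return the loss gradient w.r.t. the input.
    project=True is standard RS-FGSM; project=False skips the final clip,
    which is the modification described in the abstract.
    """
    # Random start inside the l-infinity ball of radius eps.
    delta = [rng.uniform(-eps, eps) for _ in x]
    # Single signed-gradient step, with the gradient taken at the shifted point.
    grad = grad_fn([xi + di for xi, di in zip(x, delta)])
    delta = [di + alpha * sign(gi) for di, gi in zip(delta, grad)]
    if project:
        # Clip each coordinate back into [-eps, eps].
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


# Toy usage: a linear score w.x with label +1, so the gradient of the
# loss -w.x with respect to x is the constant vector -w (an assumption
# made purely for illustration).
w = [1.0, -2.0, 0.5]
toy_grad = lambda x_adv: [-wi for wi in w]
eps = 8 / 255
alpha = 1.25 * eps  # assumed step size, not taken from the abstract

d_projected = rs_fgsm_delta(toy_grad, [0.0] * 3, eps, alpha, True, random.Random(0))
d_unprojected = rs_fgsm_delta(toy_grad, [0.0] * 3, eps, alpha, False, random.Random(0))
```

With projection, every coordinate of the perturbation is bounded by `eps`; without it, a coordinate can reach up to `eps + alpha` in magnitude, so the training perturbation is allowed to leave the l∞ ball.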
