Reward-Preserving Attacks For Robust Reinforcement Learning

Abstract

Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an α fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes η are selected dynamically, using a learned critic Q((s,a),η) that estimates the expected return of α-reward-preserving rollouts. For intermediate values of α, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.
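As a rough illustration of the magnitude-selection idea, the sketch below picks the largest perturbation magnitude η whose estimated return still retains an α fraction of the nominal-to-worst-case gap. This is only one possible reading of the abstract, not the authors' implementation: the threshold rule, the candidate grid, the treatment of the smallest η as the nominal case, and the names select_magnitude and toy_critic are all assumptions made for illustration. The learned critic Q((s,a),η) is replaced by a hypothetical closed-form function of η for a fixed state-action pair.

```python
from typing import Callable, Sequence


def select_magnitude(
    q_of_eta: Callable[[float], float],
    etas: Sequence[float],
    alpha: float,
) -> float:
    """Pick the largest candidate eta that keeps an alpha fraction of the gap.

    q_of_eta stands in for the critic Q((s, a), eta) at a fixed (s, a).
    Assumption: the smallest candidate eta represents the nominal
    (unperturbed) case, and the worst case is the minimum over the grid.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not etas:
        raise ValueError("etas must contain at least one candidate")

    candidates = sorted(etas)
    values = [q_of_eta(eta) for eta in candidates]

    q_nominal = values[0]   # estimated return at the weakest attack
    q_worst = min(values)   # estimated return at the most damaging attack

    # Assumed reward-preserving constraint: the return under attack must stay
    # at or above the worst case plus alpha times the nominal-to-worst gap.
    threshold = q_worst + alpha * (q_nominal - q_worst)

    chosen = candidates[0]
    for eta, value in zip(candidates, values):
        if value >= threshold:
            chosen = eta    # candidates are ascending, so this keeps the largest
    return chosen


def toy_critic(eta: float) -> float:
    """Hypothetical critic: return falls linearly from 10 to 2 as eta goes 0 to 1."""
    return 10.0 - 8.0 * eta


if __name__ == "__main__":
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, select_magnitude(toy_critic, grid, alpha))
```

Under these assumptions, α = 1 forces the nominal case (no effective attack), α = 0 permits the most damaging magnitude on the grid, and intermediate values of α select something in between, which mirrors the trade-off the abstract describes.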
