Provably Safe Reinforcement Learning with Step-wise Violation Constraints
Abstract
In this paper, we investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we consider stricter step-wise violation constraints and do not assume the existence of safe actions. This makes our formulation more suitable for safety-critical applications that must ensure safety at every decision step and may not always possess safe actions, e.g., robot control and autonomous driving. We propose a novel algorithm, SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3SAT})$ regret. Lower bounds are provided to validate the optimality of both the violation and regret performance with respect to $S$ and $T$. We further study a novel safe reward-free exploration problem with step-wise violation constraints. For this problem, we design an $(\varepsilon,\delta)$-PAC algorithm, SRF-UCRL, which achieves nearly state-of-the-art sample complexity $\widetilde{O}\left(\left(\frac{S^2AH^2}{\varepsilon}+\frac{H^4SA}{\varepsilon^2}\right)\left(\log\left(\frac{1}{\delta}\right)+S\right)\right)$ and guarantees $\widetilde{O}(\sqrt{ST})$ violation during exploration. The experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results.