Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning
Abstract
Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $O(|S||A|\|h\|_{\mathrm{sp}}^3/\varepsilon^3)$, where $\|h\|_{\mathrm{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\|h\|_{\mathrm{sp}}^2/\varepsilon^2$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $O(|S||A|\|h\|_{\mathrm{sp}}^3/(M\varepsilon^3))$, with only $O(\|h\|_{\mathrm{sp}})$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.
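To make the scaling in the stated rates concrete, the following minimal sketch evaluates only the leading terms, with all constants and logarithmic factors dropped. It assumes the rates are read as $|S||A|\|h\|_{\mathrm{sp}}^3/\varepsilon^3$ for a single agent and the same quantity divided by $M$ per agent in the federated case, where `eps` stands for a target accuracy. The function and parameter names are hypothetical and chosen for illustration; this is not the paper's algorithm, only arithmetic on the bounds.

```python
def single_agent_samples(num_states: int, num_actions: int,
                         span: float, eps: float) -> float:
    """Leading term |S||A| * span^3 / eps^3 (constants and logs omitted).

    `span` plays the role of the span norm of the bias function and
    `eps` is an assumed name for the target accuracy.
    """
    return num_states * num_actions * span**3 / eps**3


def per_agent_samples(num_states: int, num_actions: int,
                      span: float, eps: float, num_agents: int) -> float:
    """Federated per-agent leading term: the single-agent term divided by M."""
    return single_agent_samples(num_states, num_actions, span, eps) / num_agents


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from the paper.
    single = single_agent_samples(10, 4, span=2.0, eps=0.5)
    shared = per_agent_samples(10, 4, span=2.0, eps=0.5, num_agents=8)
    print(f"single-agent leading term: {single}")
    print(f"per-agent leading term with 8 agents: {shared}")
```

Under this reading, doubling the number of agents halves the per-agent leading term, while halving `eps` multiplies it by eight.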