Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation
Abstract
We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret, where $\gamma$ is the discount factor. We then complement these upper bounds with several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
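To make the MNL transition model concrete, here is a minimal sketch, assuming the standard multinomial logistic form in which the probability of each reachable next state is a softmax over the inner products of a feature vector with a parameter vector. The function name `mnl_transition_probs`, the feature values, and the parameter values are hypothetical and chosen only for illustration; the abstract itself does not specify them.

```python
import math

def mnl_transition_probs(features, theta):
    """Return MNL transition probabilities over candidate next states.

    features: list of feature vectors, one per reachable next state s'
              (a hypothetical phi(s, a, s') for one fixed state-action pair).
    theta:    parameter vector of the same dimension d.
    """
    # Linear score phi(s, a, s')^T theta for each candidate next state.
    logits = [sum(f_i * t_i for f_i, t_i in zip(f, theta)) for f in features]
    # Subtract the maximum score before exponentiating, for numerical stability.
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    # Normalise so the probabilities sum to one.
    return [w / total for w in weights]

# Toy example with d = 2 and three reachable next states (made-up numbers).
phi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
theta = [0.5, -0.5]
probs = mnl_transition_probs(phi, theta)
```

Unlike a linear transition model, this form yields a valid probability distribution for any parameter vector, which is one common motivation for MNL function approximation.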