Online Resource Allocation in Episodic Markov Decision Processes
Abstract
This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime, while the two settings differ in how the observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by O(-1H3/2SAT) where ∈(0,1) is the budget parameter, H is the length of the horizon, S and A are the numbers of states and actions, and T is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by O(-1H3/2SAT) with high probability. We test the numerical efficiency of our method for a variant of the resource-constrained inventory management problem.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.