Thompson Sampling Algorithm for Stochastic Games

Abstract

We study a stochastic differential game with N competitive players in a linear-quadratic framework with ergodic cost, where d-dimensional diffusion processes govern the state dynamics with an unknown common drift (matrix). Assuming a Gaussian prior on the drift, we use filtering techniques to update its posterior estimates. Based on these estimates, we propose a Thompson-sampling-based algorithm with dynamic episode lengths to approximate strategies. We show that the Bayesian regret for each player has an error bound of order O(T(T)), where T is the time-horizon, independent of the number of players. This implies that average regret per unit time goes to zero. Finally, we prove that the algorithm results in a Nash equilibrium.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…