A Fully Data-Driven Value Iteration for Stochastic LQR: Convergence, Robustness and Stability

Grégoire G. Macqueron

doi:10.1109/TNNLS.2026.3675892

A Fully Data-Driven Value Iteration for Stochastic LQR: Convergence, Robustness and Stability

Abstract

Unlike traditional model-based reinforcement learning approaches that estimate system parameters from data, non-model-based data-driven control learns the optimal policy directly from input-state data without any intermediate model identification. Although this direct reinforcement learning approach offers increased adaptability and resilience to model misspecification, its reliance on raw data leaves it vulnerable to system noise and disturbances that may undermine convergence, robustness, and stability. In this article, we establish the convergence, robustness, and stability of value iteration (VI) for data-driven control of stochastic linear quadratic (LQ) systems in discrete-time with entirely unknown dynamics and cost. Our contributions are three-fold. First, we prove that VI is globally exponentially stable for any positive semidefinite initial value matrix in noise-free settings, thereby significantly relaxing restrictive assumptions on initial value functions in existing literature. Second, we extend our analysis to settings with external disturbances, proving that VI maintains small-disturbance input-to-state stability (ISS) and converges within a small neighborhood of the optimal solution when disturbances are sufficiently small. Third, we propose a new non-model-based robust adaptive dynamic programming (ADP) algorithm for adaptive optimal controller design, which, unlike existing procedures, requires no prior knowledge of an initial admissible control policy. Numerical experiments on a ``data center cooling'' problem demonstrate the convergence and stability of the algorithm compared to established methods, highlighting its robustness and adaptability for data-driven control in noisy environments. Finally, we apply the method to dynamic portfolio allocation, demonstrating its practical relevance outside traditional control tasks.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Or compile a full topic from this idea

Discussion (0)

Sign in to join the discussion.

Loading comments…