On the Limits of Tabular Hardness Metrics for Deep RL: A Study with the Pharos Benchmark

Abstract

Principled evaluation is critical for progress in deep reinforcement learning (RL), yet it lags behind the theory-driven benchmarks of tabular RL. While tabular settings benefit from well-understood hardness measures like MDP diameter and suboptimality gaps, deep RL benchmarks are often chosen based on intuition and popularity. This raises a critical question: can tabular hardness metrics be adapted to guide non-tabular benchmarking? We investigate this question and reveal a fundamental gap. Our primary contribution is demonstrating that the difficulty of non-tabular environments is dominated by a factor that tabular metrics ignore: representation hardness. The same underlying MDP can pose vastly different challenges depending on whether the agent receives state vectors or pixel-based observations. To enable this analysis, we introduce pharos, a new open-source library for principled RL benchmarking that allows for systematic control over both environment structure and agent representations. Our extensive case study using pharos shows that while tabular metrics offer some insight, they are poor predictors of deep RL agent performance on their own. This work highlights the urgent need for new, representation-aware hardness measures and positions pharos as a key tool for developing them.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…