Policy Testing in Markov Decision Processes
arXiv:2505.15342v2 Announce Type: replace Abstract: We study the policy testing problem in discounted Markov decision processes (MDPs) in the fixed-confidence setting under a generative model with static sampling. The goal is to decide whether the value of a given policy exceeds a specified threshold while minimizing the number of samples. We first derive an instance-dependent lower bound that any reasonable algorithm must satisfy, characterized as the solution to an optimization problem with non-convex constraints. Guided by this formulation, we propose a new algorithm. While this design paradigm is common in pure exploration problems such as best-arm identification, the non-convex constraints that arise in MDPs introduce substantial difficulties. To address them, we reformulate the lower-bound problem by swapping the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. This reformulation admits an interpretation as a policy optimization task in a newly constructed reversed MDP. We further show that the global KL constraint can be decomposed exactly into a family of product-box subproblems, which are solved by projected policy gradient and combined through an outer budget search. Beyond policy testing, our reformulation and reversed MDP view suggest extensions to other pure exploration tasks in MDPs, including policy evaluation and best policy identification.
