Online Academic Talk | Assistant Professor Zhengling Qi: A Policy Gradient Method for Confounded Partially Observable Markov Decision Processes

Academic | Education · 2024-11-01 07:02 · Guangdong

Abstract

In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result that allows us to non-parametrically estimate the policy gradient of any parameterized history-dependent policy under POMDPs from offline data. The identification boils down to solving a sequence of conditional moment restrictions, and we adopt a min-max learning procedure with general function approximation to estimate the policy gradient.
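
The abstract does not spell out the min-max procedure, but the standard adversarial formulation of a conditional moment restriction gives a feel for it. The toy sketch below (the linear function classes, variable names, and data-generating process are all illustrative assumptions, not the paper's setup) solves E[Y − h(X) | W] = 0 by playing a model class against a class of test functions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditional moment restriction E[Y - h(X) | W] = 0 with a linear
# model h(x) = theta'x and linear test functions g(w) = phi'w. Everything
# here is illustrative; the paper uses general function approximation.
n, d = 2000, 3
W = rng.normal(size=(n, d))                # conditioning variables
X = W + 0.5 * rng.normal(size=(n, d))      # covariates correlated with W
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + rng.normal(size=n)    # outcomes

# Min-max (adversarial) objective:
#   min_theta max_phi  mean_i[(y_i - theta'x_i) * phi'w_i] - lam * ||phi||^2,
# solved by alternating ascent on the critic phi and descent on theta.
lam, lr = 1.0, 0.2
theta, phi = np.zeros(d), np.zeros(d)
for _ in range(2000):
    resid = Y - X @ theta                                   # moment residual
    phi += lr * ((W * resid[:, None]).mean(axis=0) - 2 * lam * phi)
    theta += lr * (X * (W @ phi)[:, None]).mean(axis=0)     # descent on theta

print("estimated theta:", theta.round(2))  # close to theta_true
```

With both classes linear, the inner maximization has a closed form and the procedure reduces to a GMM-style fit; richer function classes (e.g., neural networks or RKHS balls) keep the same alternating structure.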

We then provide a finite-sample, non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class, in terms of the sample size, the length of the horizon, and the concentrability coefficient arising from solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimator within a gradient ascent algorithm, we show last-iterate global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work to study policy gradient methods for POMDPs in the offline setting.
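
The final step is plain plug-in gradient ascent: the estimated gradient replaces the unavailable true gradient of the policy value at every iterate, and the last iterate is reported. A minimal sketch, under assumed names: `estimated_policy_gradient` is a hypothetical stand-in for the paper's min-max estimator, and the quadratic surrogate is a made-up objective for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimated_policy_gradient(theta, theta_star):
    # Hypothetical stand-in for the non-parametric min-max gradient
    # estimator: gradient of a concave surrogate value plus estimation noise.
    return -2.0 * (theta - theta_star) + 0.01 * rng.normal(size=theta.shape)

def gradient_ascent(theta0, theta_star, step=0.1, iters=200):
    # Plug-in policy gradient ascent on the policy parameters; only the
    # estimated gradient is used, since the true gradient is unavailable
    # in the offline setting.
    theta = theta0.copy()
    for _ in range(iters):
        theta = theta + step * estimated_policy_gradient(theta, theta_star)
    return theta  # report the last iterate, matching a last-iterate guarantee

theta_hat = gradient_ascent(np.zeros(2), theta_star=np.array([0.3, -0.7]))
print("last iterate:", theta_hat.round(2))
```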

About the Speaker

Zhengling Qi is an assistant professor at the School of Business, the George Washington University. He received his PhD from the Department of Statistics and Operations Research at the University of North Carolina at Chapel Hill. His research focuses on statistical machine learning and related non-convex optimization. He is currently working mainly on reinforcement learning and causal inference problems.


The 狗熊会 (CluBear) online academic seminar series is open to scholars and practitioners in data science and related fields, and we warmly welcome our followers to sign up or recommend speakers. For inquiries, please contact: Ying Chang, ying.chang@clubear.org
