Skip Navigation
Search

ECE Departmental Seminar

Model-free RL for constrained MDP: Theory and Practice

Arnob Ghosh
New Jersey Institute of Technology

Thursday, 11/9/23, 11:00am
Light Engineering 250 

Abstract: Many constrained sequential decision-making processes such as safe AV navigation, wireless network control, caching, cloud computing, etc., can be cast as Constrained Markov Decision Processes (CMDP). Reinforcement Learning (RL) algorithms have been used to learn optimal policies for unknown unconstrained MDP. Extending these RL algorithms to unknown CMDP, brings the additional challenge of not only maximizing the reward but also satisfying the constraints. In contrast to existing model-based approaches or model-free methods accompanied by a ‘simulator’, we aim to develop the first model-free, simulator-free RL algorithm that achieves a sublinear regret and a sublinear constraint violation with respect to the number of episodes. To this end, we consider the episodic linear CMDP where the transition probabilities and rewards are linear in the feature space. We show that our approach can achieve 𝑂̃(√𝐾) regret and 𝑂̃(√𝐾) constraint violation where K is the no. of episodes. Our result does not depend on the dimension of the state-space, and hence is applicable to the large state space. This is the first result which shows that 𝑂̃(√𝐾) regret and 𝑂̃(√𝐾) constraint violation is achievable using model-free RL even for small state space (a.k.a tabular case). We also show that we can in fact achieve zero constraint violation for large K. We propose a primal-dual adaptation of the LSVI-UCB algorithm and demonstrate how we overcome analytical challenges for infinite state space.

In the second part of the talk, I will demonstrate how the theoretical understanding of the constrained MDP can help us to develop algorithms for practical applications. As a first application, we show how to learn to obtain optimal beam directions under time-varying interference-constrained channels. Optimal beam selection in mmWave is challenging because of its time-varying nature. In contrast to existing approaches, we model the received signal strength for a given beam direction using functions in Reproducing Kernel Hilbert Space (RKHS), which represents a rich class of functions, to model the correlation across various beam directions. We propose a primal-dual Gaussian process bandit with adaptive reinitialization in order to handle non-stationarity and interference constraints. We theoretically and empirically show the dynamic regret and constraint violation bounds achieved by our proposed algorithm are sub-linear. Further, we also demonstrate that our proposed algorithm indeed learns to select optimal beam directions for a practical dataset collected at Northeastern University. As a second application, we develop a constrained deep RL approach for Adaptive Bit-Rate (ABR) selection for video streaming to handle the high throughput variability of 5G. Studies have shown that in the 5G dataset, the unconstrained RL-based approach such as Pensieve has a high stall time whereas control-based approaches have a low bit rate. Empirical results using real data traces show that our proposed algorithm achieves a better bit rate compared to control-based techniques and a lower stall time compared to traditional RL-based approaches such as Pensieve.

Bio: Arnob Ghosh has been an Assistant Professor at the Department of Electrical and Computer Engineering at New Jersey Institute of Technology since September 2023. Before that Arnob Ghosh was a Research Scientist at the Dept. of Electrical and Computer Engg. at the Ohio State University with Ness Shroff.  Arnob Ghosh obtained his Ph.D. degree from the University of Pennsylvania, USA in Electrical and Systems Engg.. Prior to joining the OSU, he was an Assistant Professor at the IIT-Delhi. Arnob Ghosh was also a post-doc research Associate at Purdue University. Arnob Ghosh has worked in diverse areas with the theme of efficient decision-making in interconnected systems. His current research interest includes Reinforcement Learning, game theory, Online Learning, and decision theory, and applying those tools in various engineering applications such as Cyber physical system, Intelligent transportation, wireless communication, and computer network. His paper runner-up for the best student paper award at IEEE WiOpt.