Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation
Shi-Yong Chen (Nanjing University); Yang Yu (Nanjing University); Qing Da (Alibaba Group); Jun Tan (Alibaba Group); Hai-Kuan Huang (Alibaba Group); Hai-Hong Tang (Alibaba Group)
Deep reinforcement learning has shown great potential in improving system performance autonomously by learning from interactions with the environment. However, traditional reinforcement learning approaches are designed to work in static environments. In many real-world problems the environment is dynamic, and the performance of reinforcement learning approaches can degrade drastically. A direct cause of this degradation is the high-variance and biased estimation of the reward, which results from distribution shift in dynamic environments. In this paper, we propose two techniques to alleviate the unstable reward estimation problem in dynamic environments: the stratified sampling replay strategy and the approximate regretted reward, which address the problem from the sample aspect and the reward aspect, respectively. Integrating the two techniques with Double DQN, we propose the Robust DQN method. We apply Robust DQN to the tip recommendation system on the Taobao online retail trading platform. We first reveal the highly dynamic nature of this recommendation application. We then carry out online A/B tests to evaluate Robust DQN. The results show that Robust DQN can effectively stabilize the value estimation and, therefore, improve performance in this real-world dynamic environment.
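To make the sample-side idea concrete, the following is a minimal sketch of stratified sampling from a replay buffer in general, not the authors' implementation. It assumes that each transition is tagged with a stratum key and that the share of each stratum in a minibatch is fixed in advance. The abstract does not specify either choice, so the class name `StratifiedReplayBuffer`, the stratum labels, and the `proportions` argument are all hypothetical.

```python
import random
from collections import defaultdict


class StratifiedReplayBuffer:
    """Hypothetical replay buffer: each minibatch draws a fixed share from every stratum."""

    def __init__(self, seed=0):
        # Maps a stratum key (an assumed label, e.g. a customer group) to its transitions.
        self._strata = defaultdict(list)
        self._rng = random.Random(seed)

    def add(self, stratum, transition):
        """Store one transition under the given stratum key."""
        self._strata[stratum].append(transition)

    def sample(self, batch_size, proportions):
        """Draw round(batch_size * share) transitions per stratum, with replacement.

        Strata that hold no transitions yet are skipped, so the returned batch
        can be smaller than batch_size.
        """
        batch = []
        for stratum, share in proportions.items():
            pool = self._strata.get(stratum)
            if not pool:
                continue
            k = round(batch_size * share)
            batch.extend((stratum, self._rng.choice(pool)) for _ in range(k))
        return batch


# Usage: stratum "a" dominates the stored data, yet each minibatch stays balanced.
buffer = StratifiedReplayBuffer(seed=1)
for i in range(90):
    buffer.add("a", i)
for i in range(10):
    buffer.add("b", i)
minibatch = buffer.sample(10, {"a": 0.5, "b": 0.5})
```

Because the minibatch composition is pinned to fixed proportions, a shift in which stratum dominates the incoming traffic does not change the mix of samples used for each update, which is the kind of variance reduction that stratified sampling is generally used for.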