Contextual multi-armed bandit problems have gained increasing popularity and attention in recent years due to their capability of leveraging contextual information to deliver online personalized recommendation services (e.g., online advertising and news article selection). To predict the reward of each arm given a particular con-text, existing relevant research studies for contextual multi-armed bandit problems often assume the existence of a fixed yet unknown reward mapping function. However, this assumption rarely hold-s in practice, since real-world problems often involve underlying processes that are dynamically evolving over time.

In this paper, we study the time varying contextual multi-armed problem where the reward mapping function changes over time. In particular, we propose a dynamical context drift model based on particle learning. In the proposed model, the drift on the reward mapping function is explicitly modeled as a set of random walk particles, where good fitted particles are selected to learn the map-ping dynamically. Taking advantage of the fully adaptive inference strategy of particle learning, our model is able to effectively capture the context change and learn the latent parameters. In addition, those learnt parameters can be naturally integrated into existing multi-arm selection strategies such as LinUCB and Thompson sampling. Empirical studies on two real-world applications, including online personalized advertising and news recommendation, demonstrate the effectiveness of our proposed approach. The experimental results also show that our algorithm can dynamically track the changing reward over time and consequently improve the click-through rate.

Filed under: Recommender Systems