A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit
Data Science Seminar
Vladimir Kobzar (Columbia University)
You may attend the talk either in person in Walter 402 or register via Zoom. Registration is required to access the Zoom webinar.
The multi-armed bandit is a classic sequential prediction problem. At each round, the predictor (player) selects a probability distribution from a finite collection of distributions (arms) with the goal of minimizing the difference (regret) between the player’s rewards sampled from the selected arms and the rewards of the arm with the highest expected reward. The player’s choice of the arm and the reward sampled from that arm are revealed to the player, and this prediction process is repeated until the final round. Our work addresses a version of the two-armed bandit problem where the arms are distributed independently according to Bernoulli distributions and the sum of the means of the arms is one (the symmetric two-armed Bernoulli bandit). In a regime where the gap between these means goes to zero and the number of prediction periods approaches infinity, we obtain the leading order terms of the expected regret for this problem by associating it with a solution of a linear parabolic partial differential equation. Our results improve upon the previously known results; specifically we explicitly compute the leading order term of the optimal regret in three different scaling regimes for the gap. Additionally, we obtain new non-asymptotic bounds for any given time horizon. This is joint work with Robert Kohn available at https://arxiv.org/abs/2202.05767.