Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards

Stochastic Systems 9 (4), 319-337 (2019)

30 Pages. Posted: 15 May 2014. Last revised: 2 Jan 2020.

Omar Besbes

Columbia University - Columbia Business School, Decision Risk and Operations

Yonatan Gur

Stanford Graduate School of Business; Netflix

Assaf Zeevi

Columbia University - Columbia Business School, Decision Risk and Operations

Date Written: December 1, 2019

Abstract

In a multi-armed bandit (MAB) problem, a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize cumulative expected earnings over some planning horizon of length T. To do this, the gambler needs to acquire information about the arms (exploration) while simultaneously optimizing immediate rewards (exploitation). The gambler's policy is measured relative to a (static) oracle that knows the identity of the best arm a priori. The gap in performance between the former and the latter is often referred to as the regret, and the main question is how small the regret can be as a function of problem primitives (hardness of the problem, typically measured by the distinguishability of the best arm) and the game horizon (T). This problem has been studied extensively when the reward distributions do not change over time. The uncertainty in this setup is purely spatial and essentially amounts to identifying the optimal arm. We complement this literature by developing a flexible nonparametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative mean change of the rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a (dynamic) oracle that plays the pointwise optimal action at each instant in time. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed V (viewed as a temporal uncertainty budget), we characterize the complexity of this MAB game via the minimax regret, which depends on V (the hardness of the problem), the horizon length T, and the number of arms K. When the uncertainty budget V is not known a priori, we develop a family of "master-slave" policies that adapt to the realized variation in rewards and provide numerical evidence suggesting that these policies are nearly minimax optimal.
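To make the temporal-variation metric and the dynamic-oracle benchmark concrete, below is a minimal Python sketch (not taken from the paper). It computes the cumulative mean change of a mean-reward sequence, runs Exp3 with periodic restarts as an illustrative non-stationary policy, and compares the collected reward to the pointwise-optimal oracle. The epoch length (T/V)^(2/3), the Bernoulli reward model, and all function names are illustrative assumptions rather than the paper's prescription.

import numpy as np

rng = np.random.default_rng(0)

def temporal_variation(mu):
    # mu has shape (T, K); mu[t, k] is the mean reward of arm k at round t.
    # Cumulative mean change over the horizon: sum_t max_k |mu[t+1, k] - mu[t, k]|.
    return np.abs(np.diff(mu, axis=0)).max(axis=1).sum()

def restarting_exp3(mu, epoch_len, gamma=0.1):
    # Exp3 restarted every epoch_len rounds; rewards are drawn as Bernoulli(mu[t, arm]).
    T, K = mu.shape
    total, w = 0.0, np.ones(K)
    for t in range(T):
        if t % epoch_len == 0:
            w = np.ones(K)                       # restart: discard stale information
        p = (1 - gamma) * w / w.sum() + gamma / K
        arm = rng.choice(K, p=p)
        reward = rng.binomial(1, mu[t, arm])
        total += reward
        w[arm] *= np.exp(gamma * (reward / p[arm]) / K)  # importance-weighted update
    return total

# Example: two arms whose mean rewards drift linearly and cross once over the horizon.
T, K = 10_000, 2
mu = np.stack([np.linspace(0.3, 0.7, T), np.linspace(0.7, 0.3, T)], axis=1)
V = temporal_variation(mu)                        # V = 0.4 for this sequence
dynamic_oracle = mu.max(axis=1).sum()             # pointwise-optimal benchmark
collected = restarting_exp3(mu, epoch_len=max(1, int((T / V) ** (2 / 3))))
print(f"variation V = {V:.2f}, regret vs. dynamic oracle = {dynamic_oracle - collected:.1f}")

The restart interval scales with (T/V)^(2/3) so that, under a larger variation budget, the policy forgets faster; this mirrors the intuition in the abstract that the attainable regret degrades gracefully as V grows, though the exact constants and the adaptive (unknown-V) policies are developed in the paper itself.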

Keywords: Multi-armed bandit, exploration/exploitation, non-stationary, minimax regret

Suggested Citation

Besbes, Omar and Gur, Yonatan and Zeevi, Assaf, Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards (December 1, 2019). Stochastic Systems 9 (4), 319-337 (2019), Available at SSRN: https://ssrn.com/abstract=2436629 or http://dx.doi.org/10.2139/ssrn.2436629

Omar Besbes

Columbia University - Columbia Business School, Decision Risk and Operations ( email )

New York, NY
United States

Yonatan Gur (Contact Author)

Stanford Graduate School of Business ( email )

655 Knight Way
Stanford, CA 94305-5015
United States

Netflix ( email )

Los Gatos, CA
United States

Assaf Zeevi

Columbia University - Columbia Business School, Decision Risk and Operations ( email )

New York, NY
United States
212-854-9678 (Phone)
212-316-9180 (Fax)

HOME PAGE: http://www.gsb.columbia.edu/faculty/azeevi/

Paper statistics

Downloads: 446
Abstract Views: 2,816
Rank: 119,133