[Solved]: k-armed bandit – index policies vs. Q-learning

Problem Detail: I am interested in the stateless k-armed bandit problem, where an agent repeatedly chooses one of k independent arms, each with a different reward distribution, and tries to maximize its total reward. I have read many papers on this subject, but I am still confused, because there are two types of solutions:

A. It has been proven that the optimal solution to this problem is an index policy: for each arm, calculate a quality index based on the rewards received from that arm so far. The index roughly estimates an upper confidence bound on the expected reward from that arm. Then choose the arm with the highest index.

B. On the other hand, the problem can also be cast as a Markov decision process, which can be solved by Q-learning. The general Q-learning update is:

$Q_{t+1}(s,a) := (1-z_t)\, Q_t(s,a) + z_t \left(R_t + g \cdot \max_{a'} Q_t(s',a')\right)$

where t is the time, s is the state we were in, a is the action we took, s' is the state we reached, $z_t$ is the learning rate at time t, $R_t$ is the reward received at time t, and g is a constant discount rate.

In this case there is only a single non-terminal state. Each arm-pull is an episode: it starts in the initial state and always transitions to the terminal state. Therefore the Q value of the terminal state is always 0, and we only need to calculate Q values for the actions of the single initial state:

$Q_{t+1}(a) := (1-z_t)\, Q_t(a) + z_t R_t$

The process is guaranteed to converge to an optimal policy if each action continues to be selected and the learning rate $z_t$ decreases at an appropriate rate (specifically, $\sum_{t=1}^{\infty} z_t^2 < \infty$ and $\sum_{t=1}^{\infty} z_t = \infty$).

Now, my question is: what is the relation between these two methods for solving the k-armed bandit problem?

  1. Is it possible to prove, that the Q values converge to one of the indices given in the literature?
  2. Why use an index method, if it is possible to use the much simpler Q-learning method?
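To make the two solution types concrete, here is a minimal sketch. It uses UCB1 (Auer et al. 2002) as a representative index from the literature — the question does not name a specific index, so that choice is an assumption — alongside the simplified single-state Q-update. With the schedule $z_t = 1/n$, which satisfies both convergence conditions, the Q-value reduces to the running sample mean of the arm's rewards, i.e. exactly the first term of the UCB1 index without the confidence bonus:

```python
import math
import random

def ucb1_index(mean, n_pulls, t):
    """A representative index policy (UCB1): the empirical mean plus an
    upper-confidence bonus that shrinks as the arm is pulled more often."""
    return mean + math.sqrt(2.0 * math.log(t) / n_pulls)

def q_update(q, r, z):
    """The simplified single-state Q-update from the question."""
    return (1 - z) * q + z * r

# With z_t = 1/n, the Q-value is the running sample mean of the rewards.
rng = random.Random(0)
rewards = [rng.gauss(0.7, 1.0) for _ in range(5000)]
q, n = 0.0, 0
for r in rewards:
    n += 1
    q = q_update(q, r, 1.0 / n)

# q now equals the plain sample mean (up to floating-point rounding),
# while ucb1_index(q, n, t) would add an exploration bonus on top of it.
assert abs(q - sum(rewards) / len(rewards)) < 1e-9
```

This makes question 1 sharper: the Q-value converges to the *mean-estimate part* of an index such as UCB1, but it contains no analogue of the bonus term that drives exploration.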
Asked By : Erel Segal-Halevi

Answered By : Jeremy List

The difference between your formula and the index methods is that your formula gives more weight to recent data. If there is a lever with large but infrequent rewards, your formula will not converge. On the other hand, your formula will perform better than index methods if the reward distribution of some levers changes over time. Also, the next state is not actually the terminal state unless there are no more trials, so simplifying out $g \cdot \max_{a'} Q_t(s',a')$ is only valid when $g = 0$; in this setting, however, $g = 0$ should work quite well.
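The recency-weighting point can be illustrated with a small sketch (the shift pattern and values are hypothetical, not from the thread). With a *constant* learning rate z, the Q-update is an exponential recency-weighted average, so it tracks a lever whose reward distribution drifts, whereas the plain sample mean — what the decreasing schedule $z_t = 1/n$ converges to — averages over all history:

```python
def constant_step_estimate(rewards, z):
    """Q-update q := (1-z)q + z*r with a constant learning rate z.
    This is an exponential recency-weighted average: old rewards are
    forgotten at rate (1-z) per step."""
    q = 0.0
    for r in rewards:
        q = (1 - z) * q + z * r
    return q

# A lever whose reward distribution shifts halfway through the run:
rewards = [0.0] * 500 + [1.0] * 500

tracking = constant_step_estimate(rewards, 0.1)  # follows the shift
sample_mean = sum(rewards) / len(rewards)        # averages all history

# The constant-step estimate has all but forgotten the first 500 rewards
# (it ends near 1.0), while the sample mean sits at 0.5.
```

The same forgetting that helps on a drifting lever is what prevents convergence on a stationary one: the estimate keeps fluctuating with the most recent rewards instead of settling.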

Question Source : http://cs.stackexchange.com/questions/7602