[Solved]: How to implement the regret matching algorithm?

Problem Detail: My question is the following: How to calculate the regret in practice? I am trying to implement the regret matching algorithm but I do not understand how to do it.

First, I have $n$ players with the joint action space $mathcal{A}={a_0, a_1,cdots,a_m}^n.$
Then, I fix some period $T$. The action set $A^tinmathcal{A}$ is the action set chosen by players at time $t$. After the period $T$ (every player has chosen an action). So I get $u_i(A^t)$.
Now the regret of player $i$ of not playing action $a_i$ in the past is: (here $A^toplus a_i$ denotes the strategy set obtained if player $i$ changed its strategy from $a’_i$ to $a_i$) $$maxlimits_{a_iin A_i}left{dfrac{1}{T}sum_{tleqslant T}left(u_i(A^toplus a_i )-u_i(A^t)right)right}.$$ I do not understand how to calculate this summation. Why there is a max over the action $a_iin A_i$? Should I calculate the regret of all actions in $A_i$ and calculate the maximum? Also, In Hart’s paper, the maximum is $max{R, 0}$. Why is there such a difference? I mean if the regret was: $dfrac{1}{T}sum_{tleqslant T}left(u_i(A^toplus a_i )-u_i(A^t)right),$ the calculation would be easy for me.

The regret is defined in the following two papers [1] (see page 4, equation (2.1c)) and [2] (see page 3, section I, subsection B).

A simple adaptive procedure leading to correlated equilibrium by S. Hart et al (2000)
Distributed algorithms for approximating wireless network capacity by Michael Dinitz (2010)

I would like to get some helps from you. Any suggestions step by step how to implement such an algorithm please?

Asked By : Azzo

Answered By : Rahul Savani

The index set of the max operation is $A_i$, the actions of player $i$. The formula says: take each such action $a_i in A_i$ and compute its regret (with the sub-formula you say you can implement easily), and then take the maximum of those regrets. The reason for the $max(R,0)$ is that actions with negative regrets are performing worse than the action currently chosen. To implement this in code, just set a temporary variable $t$ to be 0. Now loop through the actions one by one, and for each action $a$, compute its regret $r$, and set $t$ as $max(r,t)$. Note that this approach includes the $max(R,0)$ operation; to do this without that, set $t$ initially to $-infty$.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/27915

Asked By : Azzo

Answered By : Rahul Savani

Best Answer from StackOverflow

Related