SARSA.
Q-learning is an off-policy learning algorithm: while following some exploration policy π, it estimates the Q-values of the optimal policy π∗. A related on-policy algorithm that learns the Q-value function of the policy the agent is actually executing is SARSA [Rummery and Niranjan(1994), Rummery(1995), Sutton(1996)], which stands for State–Action–Reward–State–Action. It uses the following update rule:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right] \qquad (19)
where the action a_{t+1} is the action executed by the current policy in state s_{t+1}. Note that the max-operator of Q-learning is replaced by the estimated value of the next action actually selected by the policy. This learning algorithm still converges in the limit to the optimal value function (and policy) under the condition that all states and actions are tried infinitely often and the policy converges in the limit to the greedy policy, i.e. such that exploration eventually ceases. SARSA is especially useful in non-stationary environments, where an optimal policy is never reached. It is also useful when function approximation is used, because off-policy methods can diverge in that setting. However, off-policy methods are needed in many situations, such as learning with hierarchically structured policies.
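To make the update rule in Eq. (19) concrete, the following is a minimal tabular SARSA sketch in Python. The environment interface (reset/step returning integer states, rewards, and a done flag), the ε-greedy behaviour policy, and the hyperparameter values are illustrative assumptions, not part of the text above.

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular SARSA: on-policy TD control (hypothetical env interface)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)  # action from the behaviour policy
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon, rng)
            # On-policy target: uses the action a_next actually selected,
            # instead of Q-learning's max over actions (cf. Eq. (19)).
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

Because the target uses Q_t(s_{t+1}, a_{t+1}) for the action the policy actually takes, the learned Q-values reflect the ε-greedy behaviour policy itself; if ε is annealed toward zero, the policy approaches the greedy policy required for convergence to the optimal values.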