where R_{t+1} is the reward observed after performing a_t in s_t, and where α is the learning rate (which may be the same for all pairs).
Q : S × A → ℝ
Before learning begins, Q returns an (arbitrary) fixed value chosen by the designer. Then, each time the agent selects an action and observes a reward and a new state (which may depend on both the previous state and the selected action), Q is updated. The core of the algorithm is a simple value iteration update: it takes a weighted average of the old value and a correction based on the new information.
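The update just described can be sketched as a table-lookup rule. This is a minimal sketch: the integer state/action encoding, the constants, and the function name `q_update` are illustrative assumptions, not from the source.

```python
from collections import defaultdict

# Illustrative constants (assumed for this sketch, not from the source).
ALPHA = 0.5   # learning rate α
GAMMA = 0.9   # discount factor γ
ACTIONS = [0, 1]

# Table-lookup Q: before learning, every entry returns the designer-chosen
# fixed value -- here 0.0, supplied by defaultdict.
Q = defaultdict(float)

def q_update(s, a, r, s_next):
    """Weighted average of the old value and the new information."""
    new_info = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * new_info

# One observed transition: action 1 taken in state 0, reward 1.0, new state 2.
q_update(s=0, a=1, r=1.0, s_next=2)
# With an all-zero table, this moves Q[(0, 1)] from 0.0 to ALPHA * 1.0 = 0.5.
```

Note how the arbitrary initial value only matters transiently: each update blends it away with weight (1 − α).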
Q-learning
• Assume no knowledge of R or T.
• Maintain a table-lookup data structure Q (estimates of Q*) for all state-action pairs
• When a transition (s, a, r, s′) occurs, do
Q(s, a) ← α(r + γ max_{a′} Q(s′, a′)) + (1 − α) Q(s, a)
• Essentially implements a kind of asynchronous Monte Carlo value iteration, using sample backups
• Guaranteed to eventually converge to Q* as long as every state-action pair is sampled infinitely often (and the learning rates are decayed appropriately)
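The loop in the bullets above can be run end to end on a toy problem. A hedged sketch follows: the two-state deterministic MDP, the constants, and all names are made up for illustration. In this MDP, action a leads to state a and reward 1 is paid for reaching state 1, so Q*(s, 1) = 1/(1 − γ) = 10 and Q*(s, 0) = γ · 10 = 9; sampling every state-action pair repeatedly drives the table toward these values.

```python
import random

GAMMA = 0.9
ALPHA = 0.5
STATES = [0, 1]
ACTIONS = [0, 1]

def step(s, a):
    """Illustrative deterministic dynamics: action a leads to state a;
    reward 1.0 is observed when the next state is 1."""
    s_next = a
    r = 1.0 if s_next == 1 else 0.0
    return r, s_next

# Table-lookup Q, initialized to an arbitrary fixed value (0.0).
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

rng = random.Random(0)
for _ in range(5000):
    # Sample every state-action pair infinitely often (here: uniformly).
    s = rng.choice(STATES)
    a = rng.choice(ACTIONS)
    r, s_next = step(s, a)
    # Sample backup: one observed transition, not an expectation over T.
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (
        r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS))

# Q should now be close to Q*: Q*(s, 1) = 10, Q*(s, 0) = 9.
print({k: round(v, 3) for k, v in Q.items()})
```

Note that the agent never consults R or T directly; it only sees sampled transitions, which is exactly the "sample backup" flavor of value iteration the slide describes.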
• This approach is even cleverer than it looks: the Q-values are not biased by any particular exploration policy (Q-learning is off-policy), and bootstrapping on the successor state's value handles the temporal credit assignment problem.
• The convergence proof extends to any variant in which every Q(s,a) is updated infinitely often, whether on-line or not.
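The off-policy claim can be checked empirically. Below is a hedged sketch on an illustrative two-state toy MDP (action a leads to state a; reward 1.0 for reaching state 1; all names and constants are assumptions): even when the behavior policy is heavily biased toward action 0, every pair is still visited infinitely often, so the learned Q still approaches the same Q*.

```python
import random

GAMMA = 0.9
ALPHA = 0.5
ACTIONS = [0, 1]

def step(s, a):
    # Illustrative deterministic dynamics: action a leads to state a;
    # reward 1.0 is observed when the next state is 1.
    return (1.0 if a == 1 else 0.0), a

Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
rng = random.Random(1)

s = 0
for _ in range(50000):
    # Heavily biased behavior policy: action 1 only 10% of the time.
    a = 1 if rng.random() < 0.1 else 0
    r, s_next = step(s, a)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (
        r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS))
    s = s_next

# Despite the skewed exploration, Q approaches the optimal values
# Q*(s, 1) = 1/(1 - GAMMA) = 10 and Q*(s, 0) = GAMMA * 10 = 9,
# the same fixed point an unbiased sampler would reach.
```

The pair (1, 1) is visited on only about 1% of the steps, yet its estimate still converges; this is the "updated infinitely often" condition at work, independent of how the behavior policy weights its actions.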