Q-learning
• Assume no knowledge of R or T.
• Maintain a table-lookup data structure Q (estimates of Q*) for all state-action pairs
• When a transition (s, a, r, s′) occurs, do
Q(s,a) ← α(r + γ max_a′ Q(s′,a′)) + (1 − α) Q(s,a)
• Essentially implements a kind of asynchronous value iteration, using sample (Monte Carlo) backups in place of full expected backups
• Guaranteed to eventually converge to Q* as long as every state-action pair is sampled infinitely often (and the learning rates α are suitably decayed)
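The update rule above can be sketched as a small table-lookup routine. This is a minimal illustration, not from the slide: the names `q_update`, `alpha`, and `gamma` are assumed, and the table is a dictionary defaulting to 0 for unseen pairs.

```python
import random
from collections import defaultdict

def make_q():
    # Table-lookup data structure: Q[(s, a)] defaults to 0 for unseen pairs
    return defaultdict(float)

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One sample backup:
    Q(s,a) <- alpha*(r + gamma*max_a' Q(s',a')) + (1-alpha)*Q(s,a)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = alpha * target + (1 - alpha) * Q[(s, a)]
```

Note that no model of R or T appears anywhere: the routine consumes only observed transitions (s, a, r, s′).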
© 2004, Ronald J. Williams Reinforcement Learning: Slide 85
(Watkins, 1989)
Q-learning
• This approach is even cleverer than it looks: the learned Q values are not biased by any particular exploration policy — Q-learning is off-policy — and the bootstrapped targets handle the temporal credit assignment problem.
• The convergence proof extends to any variant in which every Q(s,a) is updated infinitely often, whether on-line or not.
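The off-policy property can be seen on a toy example. Below is a sketch (the two-state deterministic MDP, episode count, and constant α are all assumptions for illustration): the agent behaves uniformly at random — nothing like the greedy policy — yet the Q table still approaches Q*.

```python
import random
from collections import defaultdict

# Toy deterministic MDP (assumed, not from the slide):
# s0 --any action, r=0--> s1;  s1 --action 0, r=1 / action 1, r=0--> terminal.
ACTIONS = [0, 1]

def step(s, a):
    """Return (reward, next_state, done) for the toy MDP."""
    if s == 0:
        return 0.0, 1, False
    return (1.0 if a == 0 else 0.0), None, True

GAMMA, ALPHA = 0.9, 0.1
Q = defaultdict(float)
random.seed(0)

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)   # uniform exploration: off-policy behavior
        r, s2, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] = ALPHA * (r + GAMMA * best_next) + (1 - ALPHA) * Q[(s, a)]
        s = s2

# For this MDP: Q*(s1,0)=1, Q*(s1,1)=0, and Q*(s0,a)=0.9 for both actions,
# since max_a' Q*(s1,a') = 1 and gamma = 0.9.
```

Because every state-action pair is visited infinitely often under the random behavior policy, the estimates converge to Q* regardless of how actions were actually chosen.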