Q-learning maintains a function

Q : S × A → ℝ

that assigns a value estimate to every state-action pair. Before learning has started, Q returns an (arbitrary) fixed value chosen by the designer. Then, each time the agent selects an action and observes a reward and a new state (both of which may depend on the previous state and the selected action), Q is updated. The core of the algorithm is a simple value-iteration update: it takes the old value and makes a correction based on the new information. Note that for all final states s_f, Q(s_f, a) is never updated and thus retains its initial value; in most cases, this initial value can be taken to be zero.
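As a concrete illustration of this initialization, here is a minimal Python sketch (the state and action labels are made up): the Q function is held as a plain lookup table keyed by (state, action), every entry starts at the designer-chosen value, and entries for final states are never touched by the update, so they keep that value.

from collections import defaultdict

# Q : S × A → ℝ, stored as a table keyed by (state, action).
# Every entry starts at an arbitrary designer-chosen value (0.0 here).
INITIAL_VALUE = 0.0
Q = defaultdict(lambda: INITIAL_VALUE)

# Hypothetical labels, purely to show the lookup structure.
print(Q[("start", "go_right")])    # 0.0: value before any learning
Q[("start", "go_right")] = 0.5     # changed once a transition has been observed
print(Q[("terminal", "stay")])     # 0.0: final states are never updated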
Q-learning
• Assume no knowledge of the reward function R or the transition model T.
• Maintain a table-lookup data structure Q (estimates of Q*) for all state-action pairs
• When a transition from s to s′ under action a with reward r occurs, do the following update (a code sketch follows this list)

Q(s, a) ← α (r + γ max_{a′} Q(s′, a′)) + (1 − α) Q(s, a)
• Essentially implements a kind of asynchronous Monte Carlo value iteration, using sample backups
• Guaranteed to eventually converge to Q* as long as every state-action pair is sampled infinitely often (and the learning rate decays appropriately)
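The bullets above translate almost line for line into a table-lookup implementation. The following is only a minimal sketch: the toy corridor environment, its reset/step interface, and the hyperparameter values are assumptions made for illustration; the update inside the loop is the one stated above.

import random
from collections import defaultdict

# Hypothetical toy environment (a 1-D corridor with states 0..4), assumed
# purely for illustration; reaching state 4 gives reward +1 and ends the episode.
class Corridor:
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):                        # a in {0: left, 1: right}
        self.s = max(0, self.s - 1) if a == 0 else min(4, self.s + 1)
        done = (self.s == 4)
        reward = 1.0 if done else 0.0
        return self.s, reward, done

ACTIONS = [0, 1]
alpha, gamma, epsilon = 0.1, 0.9, 0.2         # assumed hyperparameters

Q = defaultdict(float)                        # table-lookup Q, initialized to 0

def choose_action(s):
    # epsilon-greedy exploration; any policy that keeps sampling every
    # state-action pair would do for convergence.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

env = Corridor()
for episode in range(500):
    s, done = env.reset(), False
    while not done:
        a = choose_action(s)
        s_next, r, done = env.step(a)
        # Q(s,a) <- alpha*(r + gamma*max_a' Q(s',a')) + (1-alpha)*Q(s,a)
        # Q at the terminal state is never written and stays 0, matching the note above.
        target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] = alpha * target + (1 - alpha) * Q[(s, a)]
        s = s_next

# Greedy value of each state under the learned Q: higher the closer to the goal.
print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(5)})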
Q-learning
• This approach is even cleverer than it looks: the Q values are not biased by any particular exploration policy. It avoids the credit assignment problem.
• The convergence proof extends to any variant in which every Q(s,a) is updated infinitely often, whether on-line or not.
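To make the last point concrete, here is a minimal off-line sketch: the same update is applied repeatedly to a fixed batch of stored transitions (the batch and the hyperparameters are invented for illustration), and the estimates still settle because every stored (s, a) pair keeps being updated.

from collections import defaultdict

alpha, gamma = 0.1, 0.9                       # assumed hyperparameters
ACTIONS = ["left", "right"]

# A fixed batch of (s, a, r, s_next) transitions collected earlier by some
# arbitrary exploration policy; the data are made up for this example.
batch = [
    ("A", "right", 0.0, "B"),
    ("B", "right", 1.0, "terminal"),
    ("B", "left",  0.0, "A"),
    ("A", "left",  0.0, "A"),
]

Q = defaultdict(float)                        # entries for "terminal" stay at 0

# Off-line learning: sweep the stored experience many times, applying exactly
# the same update as in the on-line case to every stored (s, a) pair.
for sweep in range(1000):
    for s, a, r, s_next in batch:
        target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] = alpha * target + (1 - alpha) * Q[(s, a)]

print(dict(Q))    # e.g. Q[("B", "right")] approaches 1.0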