The algorithm therefore has a function that calculates the quality of a state-action combination:
$Q : S \times A \to \mathbb{R}$
Before learning has started, Q returns an (arbitrary) fixed value, chosen by the designer. Then, each time the agent selects an action and observes a reward and a new state that may depend on both the previous state and the selected action, Q is updated. The core of the algorithm is a simple value iteration update: it combines the old value with a correction based on the new information.
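As a concrete illustration (not from the original notes), a tabular Q function for a small problem can be stored as an array indexed by state and action, with every entry set to the same designer-chosen constant before learning begins; the problem sizes and the initial value below are assumptions:

```python
import numpy as np

# Hypothetical problem dimensions; not taken from the source.
n_states, n_actions = 10, 4
initial_value = 0.0  # the arbitrary fixed value chosen by the designer

# Q[s, a] plays the role of Q(s, a): a table lookup from S x A to the reals.
Q = np.full((n_states, n_actions), initial_value)
```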
Q-learning
• Assume no knowledge of the reward function R or the transition model T.
• Maintain a table-lookup data structure Q (estimates of Q*) for all state-action pairs
• When a transition $s \xrightarrow{a,\, r} s'$ occurs, do (a code sketch of this update follows the list)
$Q(s,a) \leftarrow \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) + (1 - \alpha)\, Q(s,a)$
• Essentially implements a kind of asynchronous Monte Carlo value iteration, using sample backups
• Guaranteed to eventually converge to Q* as long as every state-action pair is sampled infinitely often
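A minimal sketch of the update in code, assuming the tabular array Q from the earlier sketch and made-up values for the learning rate α, the discount γ, and the exploration rate ε (none of these constants appear in the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One sample backup for the observed transition (s, a, r, s_next).

    Implements Q(s,a) <- alpha*(r + gamma*max_a' Q(s',a')) + (1-alpha)*Q(s,a).
    alpha and gamma are assumed hyperparameters, not values from the source.
    """
    target = r + gamma * np.max(Q[s_next])        # best next-state value from a sampled transition
    Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]

def epsilon_greedy(Q, s, epsilon=0.1):
    """One possible exploration policy: random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# Typical use inside an interaction loop (the environment step is assumed):
#   a = epsilon_greedy(Q, s)
#   r, s_next = env_step(s, a)   # hypothetical environment call
#   q_update(Q, s, a, r, s_next)
```

The transitions can be generated by any exploration policy that keeps trying every state-action pair; the backup itself does not depend on how the action was chosen.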
Q-learning
• This approach is even cleverer than it looks: the Q values are not biased by any particular exploration policy. It avoids the credit assignment problem.
• The convergence proof extends to any variant in which every Q(s,a) is updated infinitely often, whether on-line or not.
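To illustrate the point about on-line versus off-line updating, here is a sketch that reuses Q and q_update from the blocks above; the stored transitions are hypothetical data, not from the source:

```python
# The same backup applied to transitions replayed from a stored log rather
# than generated on-line; each tuple is (s, a, r, s_next) and is made up here.
stored_transitions = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 3, 5.0, 0)]

for s, a, r, s_next in stored_transitions:
    q_update(Q, s, a, r, s_next)  # identical update, applied off-line
```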