TD Learning
After making a transition from s to s′ and receiving reward r, we nudge V(s) to be closer to the estimated return based on the observed successor, as follows:
V(s) ← α (r + γ V(s′)) + (1 − α) V(s)
α is called a “learning rate” parameter.
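As a minimal sketch of this update (not from the original notes; the function name and the dict representation of V are illustrative assumptions):

```python
# Hypothetical helper: one TD(0) update, assuming V is a dict
# mapping states to value estimates.
def td_update(V, s, r, s_next, alpha, gamma):
    # Nudge V(s) toward the one-sample target r + gamma * V(s').
    V[s] = alpha * (r + gamma * V[s_next]) + (1 - alpha) * V[s]
```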
For α < 1 this represents a partial backup.
Furthermore, if the rewards and/or transitions are stochastic, as in a general MDP, this is a sample backup.
The sampled reward and next-state value are only noisy estimates of the corresponding expectations, which is what offline DP would use in a full backup.
Nevertheless, this converges to the true value function (the expected return) for a fixed policy, under the right technical assumptions, including a decreasing learning rate.
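To make the whole procedure concrete, here is a self-contained sketch of tabular TD(0) policy evaluation on a toy three-state chain MDP; the environment, the 1/N(s) step-size schedule, and all names are illustrative assumptions, not part of the original notes:

```python
import random
from collections import defaultdict

# Sketch of tabular TD(0) policy evaluation on a toy three-state chain MDP
# (hypothetical example). The learning rate decays as 1 / N(s), one common
# schedule satisfying the usual stochastic-approximation conditions.

GAMMA = 0.9
TERMINAL = 2  # states are 0, 1, 2; state 2 is terminal

def step(s):
    """Fixed policy: from s, move right with prob. 0.8, stay with prob. 0.2.
    Reward is +1 on entering the terminal state, 0 otherwise."""
    s_next = s + 1 if random.random() < 0.8 else s
    r = 1.0 if s_next == TERMINAL else 0.0
    return r, s_next

V = defaultdict(float)   # value estimates, initialized to 0
N = defaultdict(int)     # per-state visit counts for the decaying step size

for episode in range(5000):
    s = 0
    while s != TERMINAL:
        r, s_next = step(s)
        N[s] += 1
        alpha = 1.0 / N[s]          # decreasing learning rate
        # One-sample (partial) backup toward the TD target r + γ V(s').
        V[s] = alpha * (r + GAMMA * V[s_next]) + (1 - alpha) * V[s]
        s = s_next

print({s: round(V[s], 3) for s in range(TERMINAL)})
```

For this toy chain the true values are roughly V(0) ≈ 0.86 and V(1) ≈ 0.98, which the estimates approach as the per-state visit counts grow.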