SARSA.
Q-learning is an off-policy learning algorithm: while following some exploration policy π, it estimates the Q-values of the optimal policy π∗. A related on-policy algorithm that learns the Q-value function of the policy the agent is actually executing is SARSA [Rummery and Niranjan(1994), Rummery(1995), Sutton(1996)], which stands for State–Action–Reward–State–Action. It uses the following update rule:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right] \qquad (19)
where the action a_{t+1} is the action executed by the current policy in state s_{t+1}. Note that the max-operator of Q-learning is replaced by the estimated value of the next action actually selected by the policy. This learning algorithm still converges in the limit to the optimal value function (and policy) under the condition that all states and actions are tried infinitely often and the policy converges in the limit to the greedy policy, i.e. such that exploration eventually ceases. SARSA is especially useful in non-stationary environments, where an optimal policy is never reached. It is also useful when function approximation is used, because off-policy methods can diverge in that setting. However, off-policy methods are needed in many situations, such as learning with hierarchically structured policies.
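To make the update rule in Eq. (19) concrete, the following is a minimal tabular SARSA sketch in Python. The environment interface (reset/step returning integer states, rewards, and a done flag), the ε-greedy behaviour policy, and the hyperparameter values are illustrative assumptions, not part of the text above.

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular SARSA: on-policy TD control (hypothetical env interface)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)  # action from the behaviour policy
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon, rng)
            # On-policy target: uses the action a_next actually selected,
            # instead of Q-learning's max over actions (cf. Eq. (19)).
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

Because the target uses Q_t(s_{t+1}, a_{t+1}) for the action the policy actually takes, the learned Q-values reflect the ε-greedy behaviour policy itself; if ε is annealed toward zero, the policy approaches the greedy policy required for convergence to the optimal values.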