Learning rate
The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information. In fully deterministic environments, a learning rate of α_t(s,a) = 1 is optimal. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, a constant learning rate is often used, such as α_t(s,a) = 0.1 for all t.[1]
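As an illustrative sketch (not part of the cited sources), the tabular update below shows how the learning rate interpolates between the old estimate and the new bootstrapped target; the state/action counts, the discount factor and the reward signal are assumptions made for the example.

    import numpy as np

    # Hypothetical problem size and hyperparameters (assumed for illustration).
    N_STATES, N_ACTIONS = 10, 4
    gamma = 0.9    # discount factor (see next section)
    Q = np.zeros((N_STATES, N_ACTIONS))

    def q_update(s, a, r, s_next, alpha=0.1):
        """One tabular Q-learning step with a constant learning rate alpha."""
        target = r + gamma * np.max(Q[s_next])      # bootstrapped new estimate
        # alpha = 0 keeps the old value unchanged; alpha = 1 discards it
        # entirely in favour of the new target.
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target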
Discount factor
The discount factor γ determines the importance of future rewards. A factor of 0 makes the agent "myopic" (or short-sighted) by considering only current rewards, while a factor approaching 1 makes it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For γ = 1, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.[2]
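As a rough illustration of why values stay bounded only for γ < 1, the sketch below sums a constant reward stream under different discount factors; the reward of 1 per step and the 1000-step truncation are arbitrary choices for the example.

    def discounted_return(rewards, gamma):
        """Sum of gamma**t * r_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    rewards = [1.0] * 1000                    # a constant reward of 1 per step
    print(discounted_return(rewards, 0.0))    # 1.0    -- myopic: only the immediate reward counts
    print(discounted_return(rewards, 0.9))    # ~10.0  -- bounded by 1/(1 - gamma)
    print(discounted_return(rewards, 1.0))    # 1000.0 -- keeps growing as the horizon grows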
Initial conditions (Q_0)
Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as "optimistic initial conditions",[3] can encourage exploration: no matter which action is selected, the update rule will cause it to have lower values than the other alternatives, thus increasing their choice probability. Recently, it was suggested that the first reward r could be used to reset the initial conditions[citation needed]. According to this idea, the first time an action is taken, the reward is used to set the value of Q. This allows immediate learning in the case of fixed deterministic rewards. Surprisingly, this resetting-of-initial-conditions (RIC) approach seems to be consistent with human behaviour in repeated binary choice experiments.[4]
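The sketch below contrasts optimistic initialization with one possible reading of the resetting-of-initial-conditions (RIC) idea described above; the optimistic constant, the `seen` bookkeeping table and the exact reset rule (overwriting the Q-value with the first observed reward) are assumptions made for illustration, not a definitive implementation of the cited method.

    import numpy as np

    N_STATES, N_ACTIONS = 10, 4
    OPTIMISTIC_VALUE = 10.0    # assumed to exceed any reachable return

    # Optimistic initial conditions: every untried action looks attractive,
    # so the greedy policy is pushed to explore before the values settle.
    Q = np.full((N_STATES, N_ACTIONS), OPTIMISTIC_VALUE)

    # RIC-style variant: remember which (state, action) pairs have been tried.
    seen = np.zeros((N_STATES, N_ACTIONS), dtype=bool)

    def ric_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
        """First visit: the observed reward replaces the initial value.
        Later visits: the usual Q-learning update."""
        if not seen[s, a]:
            Q[s, a] = r
            seen[s, a] = True
        else:
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])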