Learning rate
The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information. In fully deterministic environments, a learning rate of α_t(s,a) = 1 is optimal. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, a constant learning rate is often used, such as α_t(s,a) = 0.1 for all t.[1]
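For concreteness, the following sketch (in Python, with a hypothetical tabular Q stored in a dictionary and made-up state, action, and reward variables) shows how the learning rate blends the old estimate with the new target in the standard Q-learning update:

```python
from collections import defaultdict

alpha = 0.1   # constant learning rate, as commonly used in practice
gamma = 0.9   # illustrative discount factor (see the next section)

# Tabular action-value estimates; unseen (state, action) pairs default to 0.
Q = defaultdict(float)
actions = [0, 1]              # hypothetical action set

def q_update(state, action, reward, next_state):
    """Blend the old estimate with the new target according to alpha."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # alpha = 0 keeps the old value unchanged; alpha = 1 replaces it with the target.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```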
Discount factor
The discount factor γ determines the importance of future rewards. A factor of 0 makes the agent "myopic" (or short-sighted) by considering only current rewards, while a factor approaching 1 makes it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. With γ = 1 and no terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.[2]
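The effect of γ can be seen by computing the discounted return of a hypothetical reward stream (the constant rewards below are purely illustrative):

```python
# Sketch of how the discount factor weighs future rewards;
# the constant reward stream is made up purely for illustration.
def discounted_return(rewards, gamma):
    """Return the sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0] * 50                    # hypothetical stream of 50 unit rewards
print(discounted_return(rewards, 0.0))  # 1.0   -- "myopic": only the immediate reward counts
print(discounted_return(rewards, 0.9))  # ~9.95 -- future rewards matter, but the sum stays bounded
print(discounted_return(rewards, 1.0))  # 50.0  -- grows without bound as the horizon grows
```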
Initial conditions (Q₀)
Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as "optimistic initial conditions",[3] can encourage exploration: regardless of which action is selected, the update rule will cause it to have lower values than the untried alternatives, thus increasing their choice probability. Recently, it was suggested that the first reward r could be used to reset the initial conditions[citation needed]. According to this idea, the first time an action is taken, the reward is used to set the value of Q. This allows immediate learning in the case of fixed deterministic rewards. Surprisingly, this resetting-of-initial-conditions (RIC) approach appears to be consistent with human behaviour in repeated binary choice experiments.[4]
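The two ideas can be contrasted with a small bandit-style sketch (Python; the initial value of 10.0, the number of actions, and the helper name ric_update are assumptions for illustration, not a definitive implementation of the RIC procedure):

```python
# Sketch contrasting optimistic initial values with the reset-of-initial-
# conditions (RIC) idea described above; all numbers and names are illustrative.
n_actions = 3

# Optimistic initialization: start every estimate well above any realistic reward,
# so every untried action looks attractive and gets explored at least once.
Q_optimistic = {a: 10.0 for a in range(n_actions)}

# RIC: the first observed reward for an action overwrites its initial value.
Q_ric = {}

def ric_update(action, reward, alpha=0.1):
    if action not in Q_ric:
        Q_ric[action] = reward                   # reset the initial condition to the first reward
    else:
        Q_ric[action] += alpha * (reward - Q_ric[action])
```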