A study of Q-learning considering negative rewards - Darwin的小小AI天地


Post author: darren1231
Post published: January 26, 2022
Post category: Literature reading

Main idea
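- Most of the time, Q-learning does not make good use of negative rewards.
- If walls are given a reward of -100 and a maze map is used as the experimental environment, the resulting data looks like the figure below.
- In the learned Q table, only the Q-values next to the walls become negative, and they do not propagate any further. This is the shortcoming of negative rewards that the authors describe, and they give a conceptual diagram to explain it.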
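
The stalled propagation described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the two states, the four-action layout, and the helper name `q_update` are hypothetical, the wall reward of -100 comes from the example above, and α = 0.1 and γ = 0.9 are the values reported in the experiment section.

```python
ALPHA = 0.1  # learning rate (value reported in the experiment section)
GAMMA = 0.9  # discount factor (value reported in the experiment section)


def q_update(q_sa, reward, next_q_values, alpha=ALPHA, gamma=GAMMA):
    """Standard Q-learning update: back up the largest next-state action value."""
    target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (target - q_sa)


# Hypothetical state next to a wall with four actions; index 0 walks into the wall.
near_wall = [0.0, 0.0, 0.0, 0.0]

# The agent hits the wall once: reward -100, and it stays in the same state.
near_wall[0] = q_update(near_wall[0], -100.0, near_wall)  # becomes -10.0

# Hypothetical state one step further away whose action leads to `near_wall`
# with reward 0. max(near_wall) is still 0.0, so nothing negative is backed up.
far_q = q_update(0.0, 0.0, near_wall)  # stays 0.0
```

As long as the state next to the wall has at least one action whose value is not negative, the `max` hides the penalty from every state further away, which matches the observation that only the cells beside the wall turn negative.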

Proposed Method
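- The authors propose using the absolute value when updating Q-values, so that negative rewards keep propagating. Figure 3 is the authors' explanation. When updating a Q-value, the original Q-learning algorithm uses the largest action value of the next state. With a negative reward, however, Q-learning usually updates only the one time the agent hits the wall, so the negative reward never spreads. The authors propose that the update should instead use the action whose value has the largest absolute value (an argmax over absolute values). As Figure 3 shows, the original Q-learning would choose $a_1$, but the authors propose using $a_3$ for the update (this is the p in Equation 3).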
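
To make the difference concrete, here is a minimal sketch of the two updates side by side. It is an illustration based on this summary, not the paper's implementation: the function names and the three Q-values assigned to $a_1$, $a_2$, $a_3$ are hypothetical, and it assumes that the signed value of the selected action (not its absolute value) is what gets backed up.

```python
def abs_max_value(next_q_values):
    """Return the next-state action value with the largest magnitude, sign kept."""
    return max(next_q_values, key=abs)


def q_update_standard(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Original Q-learning: back up the largest next-state action value."""
    target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (target - q_sa)


def q_update_abs(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Sketch of the proposed variant: back up the value of argmax |Q|."""
    target = reward + gamma * abs_max_value(next_q_values)
    return q_sa + alpha * (target - q_sa)


# Hypothetical next-state values for a_1, a_2, a_3.
next_q = [2.0, 0.5, -10.0]

standard = q_update_standard(0.0, 0.0, next_q)  # uses a_1, about 0.18
proposed = q_update_abs(0.0, 0.0, next_q)       # uses a_3, about -0.9
```

With these numbers the standard update ignores the strongly negative $a_3$, while the absolute-value variant passes it back to the previous state, which is the propagation effect the authors are after.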

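Experiment
- Experimental environment
  - exp-1. 1 positive area and 1 negative area are placed in the bait world.
  - exp-2. 1 positive area and 2 negative areas are placed in the bait world.
  - The authors set up two experimental environments, as shown in Figure 4 above: one with a single negative-reward area and one with two.
- Experimental results
  - Parameters: the learning rate α = 0.1, and the discount factor γ = 0.9.
  - Experiment 1:
    - “exp1-p” denotes the method proposed in the paper and “c” denotes the original method; “pos” is the sum of positive rewards and “neg” is the sum of negative rewards. The figure shows that the two methods are about equal on positive rewards, but on negative rewards the proposed method is clearly better than the original.
  - Experiment 2:
    - The difference is even more pronounced in the second experiment. Its environment has two negative-reward areas, so it brings out the benefit of the proposed method more clearly.

Future
- The authors consider the performance on positive rewards to be not very good, and see it as something to improve in future work.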
