Unbounded increase in Q-value as a consequence of recurrent reward after repeating the same action in Q-learning

鱼传尺愫 2021-01-12 16:26

I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.

Let\'s consider t

3 Answers
  •  旧时难觅i
    2021-01-12 16:58

    From this, we know:

    The convergence of Q-learning holds under any exploration policy, and only requires that each state-action pair (s, a) is executed infinitely often.

    The epsilon-greedy policy is a balance between exploration and exploitation, which guarantees convergence and often gives good performance. But in practical problems, we often need some heuristics to adjust the learning rate alpha in order to obtain better results; otherwise, the "infinitely often" requirement is hard to meet.

    I list an example below. This is a classical problem, in which you have a grid and possibly a different reward amount in each cell. For instance, a 4x4 grid is shown below, in which every cell contains a reward of 1 except the top-left cell, which holds a bigger reward of 10. A robot moves around the grid. The legal actions are moving LEFT, RIGHT, UP and DOWN, but the robot cannot move out of the grid.

    So our state space contains 16 distinct states, corresponding to the 16 cells. Each state has a different number of legal actions because of the border constraint. Our goal is to compute the optimal policy (given any state s, output an optimal action a).

    +++++++++++++++++++++
    + 10 +  1 +  1 + 1  +
    +++++++++++++++++++++
    + 1  +  1 +  1 + 1  +
    +++++++++++++++++++++
    + 1  +  1 +  1 + 1  +
    +++++++++++++++++++++
    + 1  +  1 +  1 + 1  +
    +++++++++++++++++++++
    

    Suppose we use an epsilon-greedy policy with epsilon=0.1 and a constant learning rate alpha=0.1. We start at a random position on the grid. Whenever we reach the top-left corner, we restart from a random position.
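
    A minimal sketch of this setup is given below (this is separate from the ~100-line script mentioned at the end; the discount factor gamma is an assumed value, since it is not stated here, and the reward is assumed to be collected on entering a cell):

        import random

        N = 4                        # grid size
        GOAL = (0, 0)                # top-left cell with the big reward
        ACTIONS = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}

        def legal_actions(state):
            """Actions that keep the robot inside the grid (border constraint)."""
            r, c = state
            return [a for a, (dr, dc) in ACTIONS.items()
                    if 0 <= r + dr < N and 0 <= c + dc < N]

        def step(state, action):
            """Apply an action; reward is 10 for entering the top-left cell, 1 otherwise."""
            dr, dc = ACTIONS[action]
            nxt = (state[0] + dr, state[1] + dc)
            return nxt, (10 if nxt == GOAL else 1)

        def run(moves=200_000, epsilon=0.1, alpha=0.1, gamma=0.9):   # gamma assumed
            Q = {((r, c), a): 0.0
                 for r in range(N) for c in range(N)
                 for a in legal_actions((r, c))}
            state = (random.randrange(N), random.randrange(N))
            for _ in range(moves):
                acts = legal_actions(state)
                # epsilon-greedy: explore with probability epsilon, otherwise exploit
                if random.random() < epsilon:
                    action = random.choice(acts)
                else:
                    action = max(acts, key=lambda a: Q[(state, a)])
                nxt, reward = step(state, action)
                # standard Q-learning update with a constant learning rate
                best_next = max(Q[(nxt, a)] for a in legal_actions(nxt))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                # restart from a random cell whenever the top-left corner is reached
                state = (random.randrange(N), random.randrange(N)) if nxt == GOAL else nxt
            return Q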

    Below is the result of a simulation of 200,000 moves. The left block shows the current greedy policy at each cell, using the following symbols:

    • --> Moving right
    • <-- Moving left
    • ^ Moving up
    • v Moving down

    As you can see, this is far from an optimal policy. Obviously, in an optimal policy every cell should point left or up, because there is a significantly bigger reward at position (0,0).

     v   v   v   v   |      2      9      5      4   
     v   v   v   v   |     14     98     75     14   
    -->  v   v  <--  |    258   3430   3312    245  
    --> --> <-- <--  |   3270  93143  92978   3191  
    

    The right block shows how many times we have visited each cell so far. You can see that we spend most of the visits at the bottom, but we visit the top row very rarely. This is why we have not reached the optimal policy yet.

    If we change the learning rate to alpha = 1/(number of times (s,a) has been visited so far), we are able to reach the optimal policy (shown below) within 20,000 steps. The number of times we visit each cell is also distributed more uniformly, though not perfectly (a short sketch of this change is given after the table).

     --> <-- <-- <--  |     34   7997   7697    294 
      ^   ^   ^  <--  |    731    898    524    132 
      ^   ^   ^   ^   |    709    176     88     94 
      ^   ^   ^   ^   |    245    256     96     77  
    

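    The only change relative to the earlier sketch is the learning-rate schedule: keep a per-(s,a) visit counter and use alpha = 1/N(s,a) in the update. A minimal sketch, reusing legal_actions and gamma (assumed) from the snippet above:

        from collections import defaultdict

        visits = defaultdict(int)                   # N(s, a): visit count per state-action pair

        def update(Q, state, action, reward, next_state, gamma=0.9):   # gamma assumed
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]   # decaying, count-based learning rate
            best_next = max(Q[(next_state, a)] for a in legal_actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
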
    For a larger problem with more states, e.g. a 10x10 grid, I find it better to use a larger epsilon. For example, below is the result of a simulation after 80,000 moves on a 10x10 grid with epsilon=0.5. It is almost optimal, except for the bottom-right corner. There is also the idea of using simulated annealing to help improve the convergence rate of Q-learning (a sketch of such a schedule is given after the table below).

     v  <-- <-- <-- <-- <-- <-- <-- <-- <--  |     19   2500   1464    716    386    274    216    159    121     71 
     ^  <-- <-- <-- <--  v  <-- <-- <-- <--  |   9617  11914   3665   1071    580    410    319    225    207    131 
     ^   ^   ^  <-- <-- <-- <--  v  <-- <--  |   5355   5716   2662   1675   1465    611    302    183    162    101 
     ^   ^   ^   ^   ^  <-- <-- <-- <-- <--  |   1604   1887   1192    621   1056    882    693    403    206    100 
     ^   ^   ^   ^   ^   ^   ^  <-- <-- <--  |    639    735    731    333    412    399    480    294    172    114 
     ^   ^   ^  <--  ^   ^   ^  <-- <--  ^   |    373    496    640    454    272    266    415    219    107     98 
     ^   ^   ^   ^   ^   ^   ^   ^  <--  ^   |    251    311    402    428    214    161    343    176    114     99 
     ^   ^   ^   ^  <-- -->  ^  <-- <-- <--  |    186    185    271    420    365    209    359    200    113     70 
     ^   ^   ^   ^   ^   ^   ^   ^   v   v   |    129    204    324    426    434    282    235    131     99     74 
     ^   ^   ^   ^   ^  <--  ^  <-- <-- <--  |    100    356   1020   1233    703    396    301    216    152     78 
    

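    The simulated-annealing idea amounts to decaying the exploration rate over time instead of keeping it fixed: start with a large epsilon and shrink it as learning progresses. One possible schedule (the exact form and constants are assumptions for illustration, not something prescribed above):

        import math
        import random

        def annealed_epsilon(t, eps_start=0.5, eps_end=0.05, decay=20_000):
            """Exponentially decay epsilon from eps_start towards eps_end over roughly `decay` steps."""
            return eps_end + (eps_start - eps_end) * math.exp(-t / decay)

        def pick_action(Q, state, t):
            """Epsilon-greedy selection using the annealed epsilon; legal_actions as in the first sketch."""
            acts = legal_actions(state)
            if random.random() < annealed_epsilon(t):
                return random.choice(acts)
            return max(acts, key=lambda a: Q[(state, a)])
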
    BTW, my Python code (~100 lines) for the toy problem is here.
