I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.
Let's consider the update rule:

Q(S, A) <-- Q(S, A) + a(R + maxQ(S', A') - Q(S, A))

Note that Q(S, A) does not just keep growing infinitely, due to the minus Q(S, A) term. This is more apparent if you rewrite the update rule as:

Q(S, A) <-- Q(S, A)(1 - a) + a(R + maxQ(S', A'))
This shows that Q(S, A) slowly moves towards its "actual" value of R + maxQ(S', A'). Q(S, A) only grows to approach that value; not infinitely. When it stops growing (i.e., it has approximated its actual value), the Q(S, A) for the other As can catch up.
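As a minimal sketch of that behavior (assuming, for illustration, a single state-action pair with a fixed reward R and the next-state value maxQ(S', A') held at 0), the update converges to a finite target rather than growing without bound:

```python
# Hypothetical toy setup: one Q-value, fixed reward, next-state value = 0.
alpha = 0.1   # learning rate a
R = 10.0      # fixed reward (illustrative value)
q = 0.0       # Q(S, A), initialized to zero

for _ in range(200):
    target = R + 0.0                       # R + maxQ(S', A'), fixed at R here
    q = q * (1 - alpha) + alpha * target   # the rewritten update rule

print(q)  # approaches 10.0, not infinity
```

Each iteration pulls q a fraction `alpha` of the way towards the target, so the value plateaus at R + maxQ(S', A') instead of diverging.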
Anyway, the whole point of epsilon is to control whether you want the learning process to be more greedy (exploit the learned values) or more explorative (act randomly), so increase it if the learning process is too narrow.
Also note that one of the formal conditions for Q-Learning convergence is that each (S, A) pair is visited an infinite number of times (paraphrased)! So yes, by the end of the training process, you want each pair to have been visited a decent number of times.
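If you want to sanity-check that coverage in practice, one simple option (a sketch I'm suggesting, not something from your code) is to count visits per (state, action) pair during training and inspect the under-visited ones afterwards:

```python
from collections import Counter

# Count how often each (state, action) pair is visited during training.
visits = Counter()

def record_visit(state, action):
    visits[(state, action)] += 1

# e.g. inside the training loop (toy trajectory for illustration):
for state, action in [("s0", 0), ("s0", 1), ("s0", 0), ("s1", 0)]:
    record_visit(state, action)

# Pairs visited fewer than some threshold may need more exploration.
under_visited = [sa for sa, n in visits.items() if n < 2]
print(under_visited)  # [('s0', 1), ('s1', 0)]
```

If many pairs show up as under-visited, that is a hint to raise epsilon or train longer.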
Good luck!