Unbounded increase in Q-Value, consequence of recurrent reward after repeating the same action in Q-Learning

Open · 3 answers · 1930 views
鱼传尺愫 2021-01-12 16:26

I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.

Let's consider t

3 Answers
  •  慢半拍i (original poster)
     2021-01-12 16:41

    Q(K, A) does not just keep growing infinitely, because of the minus Q(S, A) term: each update partially replaces the old estimate rather than adding to it. This is more apparent if you rewrite the update rule (with learning rate α and discount factor γ < 1) as:

    Q(S, A) <- (1 - α)·Q(S, A) + α·(R + γ·max_{A'} Q(S', A'))

    This shows that Q(K, A) slowly moves towards its "actual" target value of R + γ·max_{A'} Q(S', A'). Q(K, A) only grows to approach that target, not infinitely. When it stops growing (has approximated its actual value), the Q(K, A) estimates for the other actions can catch up.
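
    To see this concretely, here is a minimal numeric sketch in Python of repeatedly updating a single state-action pair whose transition loops back to the same state (R, alpha, and gamma are assumed example values, not taken from the question). The estimate saturates at the fixed point R / (1 - γ) instead of growing without bound:

        # Minimal sketch: repeating the same action in a self-looping state.
        # R, alpha, gamma are assumed example values, not from the question.
        R, alpha, gamma = 1.0, 0.1, 0.9
        q = 0.0  # current estimate of Q(S, A) for the single pair
        for _ in range(2000):
            # S' == S here, so max over A' of Q(S', A') is just q itself
            q = (1 - alpha) * q + alpha * (R + gamma * q)
        print(q)                 # ~10.0
        print(R / (1 - gamma))   # fixed point: 10.0

    Note that with no discount (γ = 1) the same loop really would grow without bound, which is why γ < 1 matters for the argument above.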

    Anyway, the whole point of epsilon is to control the balance between exploitation (greedily following the current Q-values) and exploration (taking random actions), so increase it if the learning process is too narrowly focused on the same state-action pairs.
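
    For illustration, a hypothetical epsilon-greedy action selection could look like the sketch below (the Q-values and the epsilon value are made-up examples, not from the question); raising epsilon makes the random branch fire more often:

        import random

        def epsilon_greedy(q_values, epsilon):
            # With probability epsilon, explore: pick a uniformly random action.
            if random.random() < epsilon:
                return random.randrange(len(q_values))
            # Otherwise exploit: pick the action with the highest Q-value.
            return max(range(len(q_values)), key=lambda a: q_values[a])

        # Example: epsilon = 0.3 means roughly 30% random (explorative) picks.
        action = epsilon_greedy([0.2, 1.5, -0.3], epsilon=0.3)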

    Also note that one of the formal conditions for Q-Learning convergence is that each (S, A) pair is visited an infinite number of times (paraphrased)! So yes, by the end of the training process you want each pair to have been visited a decent number of times.

    Good luck!
