I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me.
Let's consider the update rule:

Q(S, A) <-- Q(S, A) + a(R + maxQ(S', A') - Q(S, A))

Note that Q(S, A) does not just keep growing infinitely, due to the minus Q(S, A) term. This is more apparent if you rewrite the update rule as:

Q(S, A) <-- Q(S, A)(1 - a) + a(R + maxQ(S', A'))
This shows that Q(S, A) slowly moves towards its "actual" value of R + maxQ(S', A'). Q(S, A) only grows to approach that value; not infinitely. When it stops growing (i.e., it has approximated its actual value), the Q(S, A) for the other As can catch up.
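As a minimal sketch of that behavior (assuming, for illustration, a single state-action pair with a fixed reward R and the next-state value maxQ(S', A') held at 0), the update converges to a finite target rather than growing without bound:

```python
# Hypothetical toy setup: one Q-value, fixed reward, next-state value = 0.
alpha = 0.1   # learning rate a
R = 10.0      # fixed reward (illustrative value)
q = 0.0       # Q(S, A), initialized to zero

for _ in range(200):
    target = R + 0.0                       # R + maxQ(S', A'), fixed at R here
    q = q * (1 - alpha) + alpha * target   # the rewritten update rule

print(q)  # approaches 10.0, not infinity
```

Each iteration pulls q a fraction `alpha` of the way towards the target, so the value plateaus at R + maxQ(S', A') instead of diverging.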
Anyway, the whole point of epsilon is to control whether you want the learning process to be more greedy (exploit the learned values) or more explorative (act randomly), so increase it if the learning process is too narrow.
Also note that one of the formal conditions for Q-Learning convergence is that each (S, A) pair is visited an infinite number of times (paraphrased)! So yes, by the end of the training process, you want each pair to have been visited a decent number of times.
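If you want to sanity-check that coverage in practice, one simple option (a sketch I'm suggesting, not something from your code) is to count visits per (state, action) pair during training and inspect the under-visited ones afterwards:

```python
from collections import Counter

# Count how often each (state, action) pair is visited during training.
visits = Counter()

def record_visit(state, action):
    visits[(state, action)] += 1

# e.g. inside the training loop (toy trajectory for illustration):
for state, action in [("s0", 0), ("s0", 1), ("s0", 0), ("s1", 0)]:
    record_visit(state, action)

# Pairs visited fewer than some threshold may need more exploration.
under_visited = [sa for sa, n in visits.items() if n < 2]
print(under_visited)  # [('s0', 1), ('s1', 0)]
```

If many pairs show up as under-visited, that is a hint to raise epsilon or train longer.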
Good luck!