I'm developing a simple Q-Learning implementation for a trivial application, but something keeps puzzling me.
Let's consider t
As mentioned in one of the comments, the discount factor gamma being strictly less than one is what guarantees that the sum of discounted rewards will not drift off to positive infinity (given that the rewards themselves are bounded).
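To make that concrete: if $|r_t| \le R_{\max}$ for all $t$ and $0 \le \gamma < 1$, the discounted return is bounded by a geometric series:

$$\left|\sum_{t=0}^{\infty} \gamma^t r_t\right| \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}.$$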
But the agent could indeed get stuck exploiting a bad choice for a while. There are a few things that can be done:
Optimistic initialization: If you initialize all the Q-values optimistically (above anything the environment can actually pay out), then each time you try a new action you get "disillusioned", so the next time you will want to try something else. This keeps going until you have a realistic estimate of the value of each action (see the first sketch after this list).
Working with advantage functions: In the case where every action is good but some are better than others, it is a good idea to use the advantage function (that is, how much better this action is than the expected return of this state) to update your parameters. This is especially useful for policy gradients (second sketch below).
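Here is a minimal sketch of optimistic initialization in tabular Q-learning. The state/action counts, `r_max`, and the hyperparameters are placeholder assumptions, not from the question:

```python
import numpy as np

n_states, n_actions = 10, 4
r_max = 1.0           # assumed upper bound on rewards
gamma, alpha = 0.9, 0.1

# Initialize every Q-value above the best achievable return,
# r_max / (1 - gamma), so untried actions always look attractive
# and even purely greedy selection keeps exploring.
Q = np.full((n_states, n_actions), r_max / (1.0 - gamma) + 1.0)

def update(s, a, r, s_next):
    # Standard Q-learning update; each real reward pulls the
    # optimistic estimate down toward its true value.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def act(s):
    # Greedy selection still explores here, because untried
    # actions carry their optimistic initial values.
    return int(Q[s].argmax())
```

The "disillusion" is exactly the update step: the first real reward for an action is lower than its optimistic estimate, so its Q-value drops and a different untried action becomes the greedy pick.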
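And a minimal sketch of computing advantages as returns minus a value baseline, as one would feed into a policy-gradient update. All names and numbers below are illustrative:

```python
import numpy as np

gamma = 0.99

def advantages(rewards, values):
    """Return A_t = G_t - V(s_t), where G_t is the discounted
    return-to-go and values[t] is a baseline estimate of V(s_t)."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return np.array(returns) - np.array(values)

# Even if every action yields positive reward, actions that do
# worse than the baseline get a negative advantage, so the policy
# gradient pushes probability away from them instead of
# reinforcing everything.
adv = advantages(rewards=[1.0, 1.0, 2.0], values=[3.5, 2.6, 1.8])
print(adv)  # -> roughly [0.45, 0.38, 0.2]
```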