I'm developing a simple Q-Learning implementation for a trivial application, but something keeps puzzling me.
Let's consider t
As mentioned in one of the comments, the discount factor gamma being strictly less than one is what guarantees that the sum of discounted rewards will not drift off to positive infinity (given that the rewards themselves are bounded).
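To make that concrete: if $|r_t| \le R_{\max}$ for all $t$ and $0 \le \gamma < 1$, the discounted return is bounded by a geometric series:

$$\left|\sum_{t=0}^{\infty} \gamma^t r_t\right| \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}.$$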
But the agent could indeed get stuck exploiting a bad choice for a while. There are a few things that can be done:
Optimistic initialization: If you initialize all the Q-values optimistically (above anything the environment can actually pay out), then each time you try a new action you get "disillusioned", so the next time you will want to try something else. This keeps going until you have a realistic estimate of the value of each action (see the first sketch after this list).
Working with advantage functions: In the case where every action is good but some are better than others, it is a good idea to use the advantage function (that is, how much better this action is than the expected return of this state) to update your parameters. This is especially useful for policy gradients (second sketch below).
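Here is a minimal sketch of optimistic initialization in tabular Q-learning. The state/action counts, `r_max`, and the hyperparameters are placeholder assumptions, not from the question:

```python
import numpy as np

n_states, n_actions = 10, 4
r_max = 1.0           # assumed upper bound on rewards
gamma, alpha = 0.9, 0.1

# Initialize every Q-value above the best achievable return,
# r_max / (1 - gamma), so untried actions always look attractive
# and even purely greedy selection keeps exploring.
Q = np.full((n_states, n_actions), r_max / (1.0 - gamma) + 1.0)

def update(s, a, r, s_next):
    # Standard Q-learning update; each real reward pulls the
    # optimistic estimate down toward its true value.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def act(s):
    # Greedy selection still explores here, because untried
    # actions carry their optimistic initial values.
    return int(Q[s].argmax())
```

The "disillusion" is exactly the update step: the first real reward for an action is lower than its optimistic estimate, so its Q-value drops and a different untried action becomes the greedy pick.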
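And a minimal sketch of computing advantages as returns minus a value baseline, as one would feed into a policy-gradient update. All names and numbers below are illustrative:

```python
import numpy as np

gamma = 0.99

def advantages(rewards, values):
    """Return A_t = G_t - V(s_t), where G_t is the discounted
    return-to-go and values[t] is a baseline estimate of V(s_t)."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return np.array(returns) - np.array(values)

# Even if every action yields positive reward, actions that do
# worse than the baseline get a negative advantage, so the policy
# gradient pushes probability away from them instead of
# reinforcing everything.
adv = advantages(rewards=[1.0, 1.0, 2.0], values=[3.5, 2.6, 1.8])
print(adv)  # -> roughly [0.45, 0.38, 0.2]
```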