Q-Learning values get too high


If I've understood correctly, your Q-learning update rule uses both the current reward and the previous reward. However, the standard Q-learning rule uses only a single reward (here x denotes states and u denotes actions):

Q(x, u) ← Q(x, u) + α [ r + γ · max_{u'} Q(x', u') − Q(x, u) ]

On the other hand, you seem to be treating the current reward as if it were the max-Q value, which is not correct: the reward r and the bootstrap term γ · max_{u'} Q(x', u') are separate quantities in the update. So you are probably misunderstanding the Q-learning algorithm.
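For reference, here is a minimal sketch of the standard tabular update (the state/action sizes, alpha, and gamma are illustrative values, not taken from your code):

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed discrete spaces
Q = np.zeros((n_states, n_actions))  # Q-table
alpha, gamma = 0.1, 0.9              # learning rate, discount

def q_update(x, u, r, x_next):
    """One Q-learning step: only the single reward r for taking u in x
    appears; the next state contributes through max over Q(x', u'),
    which is a bootstrap estimate, not another reward."""
    td_target = r + gamma * np.max(Q[x_next])
    Q[x, u] += alpha * (td_target - Q[x, u])
```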

The reward function is likely the real problem. Reinforcement learning methods try to maximize the expected total reward; if the agent receives a positive reward on every time step of the game, the optimal policy is simply to play as long as possible! The Q-values, which define the value function (the expected total reward of taking an action in a state and then behaving optimally), keep growing because the true expectation is unbounded: with a reward of +1 per step, the return approaches 1/(1−γ) as episodes get longer, and diverges outright when γ = 1. To incentivize winning, you should give a negative reward on every time step (in effect telling the agent to hurry up and win).
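A toy illustration of that arithmetic (an assumed setup, not your game): with a constant per-step reward r and discount γ, the long-horizon return converges to r / (1 − γ), so the sign of r decides whether the agent wants the episode to last or to end.

```python
# Sum of a constant reward r discounted by gamma over `steps` time steps.
def geometric_return(r, gamma, steps):
    return sum(r * gamma**t for t in range(steps))

for r in (+1.0, -1.0):
    print(r, geometric_return(r, gamma=0.99, steps=10_000))
# ≈ +100.0 and −100.0, matching r / (1 − gamma).
# With r = +1 the Q-values climb toward +100 and the agent stalls;
# with r = −1 the optimal policy ends (wins) the game as fast as possible.
```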

See Section 3.2, "Goals and Rewards", in Sutton and Barto's Reinforcement Learning: An Introduction for more insight into the purpose and definition of reward signals. The problem you are facing is actually Exercise 3.5 in that book.
