I\'ve recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I\'m new to Reinforcement Learning and AI in general, so the error may very wel
The reward function is likely the problem. Reinforcement learning methods try to maximize the expected total reward; it gets a positive reward for every time step in the game, so the optimal policy is to play as long as possible! The q-values, which define the value function (expected total reward of taking an action in a state then behaving optimally) are growing because the correct expectation is unbounded. To incentivize winning, you should have a negative reward every time step (kind of like telling the agent to hurry up and win).
See 3.2 Goals and Rewards in Reinforcement Learning: An Introduction for more insight into the purpose and definition of reward signals. The problem you are facing is actually exercise 3.5 in the book.