q-learning

Reward function for learning to play Curve Fever game with DQN

会有一股神秘感。 Submitted on 2021-02-11 10:40:41
Question: I've made a simple version of Curve Fever, also known as "Achtung, die Kurve!". I want the machine to figure out how to play the game optimally. I copied and slightly modified an existing DQN from some Atari game examples that is built with Google's TensorFlow. I'm trying to figure out an appropriate reward function. Currently, I use this reward setup: 0.1 for every frame it does not crash, and -500 for every crash. Is this the right approach? Do I need to tweak the values? Or do I need a completely
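A minimal, self-contained Python sketch of the reward scheme described above. The 0.1 survival bonus and the -500 crash penalty are taken from the question; the function and variable names are my own, for illustration only.

# Reward scheme from the question: +0.1 per survived frame, -500 on a crash.
SURVIVAL_REWARD = 0.1
CRASH_PENALTY = -500.0

def compute_reward(crashed: bool) -> float:
    # Per-frame reward for the current transition.
    return CRASH_PENALTY if crashed else SURVIVAL_REWARD

# Example: a 200-frame episode that ends in a crash.
episode = [False] * 199 + [True]
print(sum(compute_reward(c) for c in episode))  # 199 * 0.1 - 500 = -480.1

The arithmetic makes the scale mismatch visible: even after surviving 200 frames, the episode return is dominated by the crash penalty, which is worth keeping in mind when tweaking the two values.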

Low GPU utilisation when running Tensorflow

寵の児 Submitted on 2021-01-27 07:10:20
Question: I've been doing deep reinforcement learning using TensorFlow and OpenAI Gym. My problem is low GPU utilisation. Googling this issue, I understood that it's wrong to expect much GPU utilisation when training small networks (e.g. for training MNIST). But my neural network is not that small, I think. The architecture is similar to the one given in the original DeepMind paper (more or less). The architecture of my network is summarized below: Convolution layer 1 (filters=32, kernel_size=8x8, strides=4)
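For reference, a sketch of a DeepMind-style Q-network in tf.keras. Only the first convolution layer is spelled out in the excerpt above; the remaining layers are assumptions following the 2015 Nature DQN architecture, and the input shape and action count are placeholders.

import tensorflow as tf

def build_q_network(num_actions: int, input_shape=(84, 84, 4)) -> tf.keras.Model:
    # First conv layer as described in the question; the rest is assumed.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, kernel_size=8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, kernel_size=4, strides=2, activation="relu"),  # assumed
        tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),  # assumed
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation="linear"),  # one Q-value per action
    ])

model = build_q_network(num_actions=6)  # placeholder action count
model.summary()

A network of this size is still small by GPU standards, and in DQN training the GPU often waits while the CPU-bound environment generates experience, which is consistent with the low utilisation described.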

How does DQN work in an environment where reward is always -1

删除回忆录丶 Submitted on 2021-01-05 07:14:05
Question: Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when the goal is achieved), I don't understand how DQN with experience replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, the agent quickly (within 300-500 episodes) learns how to solve the MountainCar problem. Below is an example from my trained agent. It is my understanding that ultimately there needs to be a "sparse reward"
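One way to see why a constant -1 reward still provides a learning signal: with discounting, an episode that reaches the goal sooner has a less negative return, so Q-values along trajectories that lead to the flag are higher and the greedy policy is pulled toward them. A small illustrative Python computation; the discount factor here is an assumed value, not taken from the question.

# With a constant per-step reward of -1 and discount gamma, an episode that
# terminates after T steps has return G(T) = -(1 - gamma**T) / (1 - gamma),
# which is less negative for smaller T.
GAMMA = 0.99  # assumed discount factor

def discounted_return(num_steps: int, gamma: float = GAMMA) -> float:
    return -(1 - gamma ** num_steps) / (1 - gamma)

print(discounted_return(110))  # about -66.9: episode that reaches the goal quickly
print(discounted_return(200))  # about -86.6: episode truncated at MountainCar-v0's 200-step limit

So even though every individual reward is -1, the bootstrapped targets differ from state to state, which is enough for DQN with experience replay to converge without a sparse positive reward at the goal.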

QLearning network in a custom environment is choosing the same action every time, despite the heavy negative reward

牧云@^-^@ Submitted on 2020-12-15 04:35:09
Question: So I plugged QLearningDiscreteDense into a dots and boxes game I made. I created a custom MDP environment for it. The problem is that it chooses action 0 every time; the first time this works, but after that it is no longer an available action, so it becomes an illegal move. I give illegal moves a reward of Integer.MIN_VALUE, but it doesn't affect anything. Here's the MDP class: public class testEnv implements MDP<testState, Integer, DiscreteSpace> { final private int maxStep; DiscreteSpace actionSpace =
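The code above is RL4J (Java); the sketch below is a library-agnostic Python illustration of a common remedy, since a reward of Integer.MIN_VALUE is so extreme that it tends to swamp the Q-value targets rather than teach anything. Instead of (or in addition to) penalizing illegal moves, mask them out when choosing the greedy action so they can never be selected. All names here are hypothetical.

import numpy as np

def masked_greedy_action(q_values: np.ndarray, legal_mask: np.ndarray) -> int:
    # q_values: shape (num_actions,); legal_mask: boolean array, True where the action is legal.
    masked = np.where(legal_mask, q_values, -np.inf)
    return int(np.argmax(masked))

# Example: action 0 has the highest raw Q-value but is no longer available.
q = np.array([2.0, 0.5, 1.3, -0.2])
legal = np.array([False, True, True, True])
print(masked_greedy_action(q, legal))  # prints 2

If masking is not practical, a moderate negative penalty on the same scale as the game's normal rewards is generally preferable to Integer.MIN_VALUE, whose magnitude overwhelms every other term in the loss.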
