reinforcement-learning

When to use a certain Reinforcement Learning algorithm?

末鹿安然 submitted on 2019-12-02 15:54:30
I'm studying Reinforcement Learning and reading Sutton's book for a university course. Besides the classic DP, MC, TD and Q-Learning algorithms, I'm reading about policy gradient methods and genetic algorithms for solving decision problems. I have no prior experience with this topic, and I'm having trouble understanding when one technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain, or point me to a source that describes, the typical situations in which a given method should be used? As far as I understand:
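
A rough rule of thumb behind the distinctions the book draws: dynamic programming needs a full model of the environment's transitions, Monte Carlo needs complete episodes, and temporal-difference methods such as Q-learning learn from individual sampled transitions, which makes them the usual choice for online, model-free problems. The sketch below contrasts the two model-free update styles; the dictionary layout and names are illustrative assumptions, not code from the question.

def td_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Temporal-difference (Q-learning) update: bootstrap from the current estimate
    # of the next state's value, so learning happens one transition at a time.
    # Q is assumed to be a dict of dicts: Q[state][action] -> value.
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def mc_update(Q, episode, alpha=0.1, gamma=0.9):
    # Monte Carlo update: wait until the episode ends, then back up the actual
    # discounted return G for every visited (state, action, reward) triple.
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        Q[s][a] += alpha * (G - Q[s][a])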

What is the difference between value iteration and policy iteration?

a 夏天 submitted on 2019-12-02 13:54:22
In reinforcement learning, what is the difference between policy iteration and value iteration? As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy. My doubt is: if you are selecting a random policy π in PI, how is it guaranteed to converge to the optimal policy, even if we are choosing several random policies? zyxue: Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book:
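
On the doubt itself: the random starting policy in PI is only an initialization; each iteration makes the policy greedy with respect to its own value function, policy improvement guarantees this never gets worse, and the loop stops only when no action changes, which is exactly the optimality condition. A minimal sketch of the two loops, assuming a known finite MDP stored as P[s][a] = list of (prob, next_state, reward); the names and data layout are illustrative, not from the book.

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: take the max over actions directly.
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(P, gamma=0.9, theta=1e-6):
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # 1) Policy evaluation: compute V^pi for the *current* policy.
        while True:
            delta = 0.0
            for s in P:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2) Policy improvement: act greedily with respect to V^pi.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V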

Training a Neural Network with Reinforcement learning

穿精又带淫゛_ submitted on 2019-12-02 13:53:41
I know the basics of feedforward neural networks and how to train them using the backpropagation algorithm, but I'm looking for an algorithm that I can use to train an ANN online with reinforcement learning. For example, the cart pole swing-up problem is one I'd like to solve with an ANN. In that case, I don't know what should be done to control the pendulum; I only know how close I am to the ideal position. I need the ANN to learn based on reward and punishment, so supervised learning isn't an option. Another situation is something like the snake game, where feedback is delayed,
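
One standard way to train a network from reward and punishment alone is a policy-gradient (REINFORCE) update: the network outputs action probabilities, and the log-probability of each chosen action is reinforced in proportion to the return that followed it. Below is a minimal sketch with a linear-softmax "network", assuming a classic gym-style env.reset()/env.step() interface; all names are illustrative assumptions, not a definitive implementation.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run_episode(env, W, gamma=0.99):
    # Collect one episode with the current linear policy W (n_actions x n_features).
    grads, rewards = [], []
    obs = env.reset()
    done = False
    while not done:
        probs = softmax(W @ obs)
        a = np.random.choice(len(probs), p=probs)
        # Gradient of log pi(a|s) for a linear-softmax policy: (e_a - probs) outer obs.
        dlog = -np.outer(probs, obs)
        dlog[a] += obs
        grads.append(dlog)
        obs, r, done, _ = env.step(a)
        rewards.append(r)
    return grads, rewards

def reinforce_update(W, grads, rewards, lr=0.01, gamma=0.99):
    # Weight each step's log-prob gradient by the discounted return that followed it:
    # actions that preceded high reward become more likely, others less likely.
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        W += lr * G * grads[t]
    return W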

tensorflow: how come gather_nd is differentiable?

巧了我就是萌 submitted on 2019-12-01 15:57:24
I'm looking at a TensorFlow network implementing reinforcement learning for the CartPole OpenAI env. The network implements the likelihood ratio approach for a policy gradient agent. The thing is that the policy loss is defined using the gather_nd op! Here, look:

....
self.y = tf.nn.softmax(tf.matmul(self.W3, self.h2) + self.b3, dim=0)
self.curr_reward = tf.placeholder(shape=[None], dtype=tf.float32)
self.actions_array = tf.placeholder(shape=[None, 2], dtype=tf.int32)
self.pai_array = tf.gather_nd(self.y, self.actions_array)
self.L = -tf.reduce_mean(tf.log(self.pai_array) * self.curr_reward)
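
The short answer to the title question: tf.gather_nd is differentiable with respect to its params input (here self.y); its gradient simply scatters the incoming gradient back to the gathered positions, while the integer indices receive no gradient at all. A small check, written against the TF 2.x eager API for brevity rather than the question's graph-mode code:

import tensorflow as tf

y = tf.Variable([[0.1, 0.9],
                 [0.7, 0.3]])
indices = tf.constant([[0, 1], [1, 0]])    # pick y[0, 1] and y[1, 0]

with tf.GradientTape() as tape:
    picked = tf.gather_nd(y, indices)      # [0.9, 0.7]
    loss = -tf.reduce_mean(tf.math.log(picked))

grad = tape.gradient(loss, y)
print(grad.numpy())   # non-zero only at positions (0, 1) and (1, 0)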

How do neural networks use genetic algorithms and backpropagation to play games?

天涯浪子 submitted on 2019-12-01 15:48:15
I came across this interesting video on YouTube on genetic algorithms. As you can see in the video, the bots learn to fight. Now, I have been studying neural networks for a while and I wanted to start learning genetic algorithms. This somehow combines both. How do you combine genetic algorithms and neural networks to do this? And how does one know the error in this case, which you would use to back-propagate and update your weights and train the net? And how do you think the program in the video calculated its fitness function? I guess mutation is definitely happening in the program in
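
For context, the usual way such game bots are built is neuroevolution: the network's weights are the genome, there is no backpropagated error at all, and the fitness function is simply how well the decoded network plays (score, damage dealt, time survived). A minimal sketch, where evaluate_fitness() and the population settings are illustrative assumptions:

import numpy as np

def make_genome(n_weights):
    return np.random.randn(n_weights)

def mutate(genome, sigma=0.1):
    # Gaussian mutation: small random perturbation of every weight.
    return genome + sigma * np.random.randn(genome.size)

def evolve(evaluate_fitness, n_weights, pop_size=50, n_generations=100, elite=10):
    population = [make_genome(n_weights) for _ in range(pop_size)]
    for _ in range(n_generations):
        # Fitness = how well the network decoded from the genome plays the game.
        scored = sorted(population, key=evaluate_fitness, reverse=True)
        parents = scored[:elite]
        # Next generation: keep the elite, fill the rest with mutated copies.
        population = parents + [mutate(parents[np.random.randint(elite)])
                                for _ in range(pop_size - elite)]
    return max(population, key=evaluate_fitness)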

Unbounded increase in Q-Value, consequence of recurrent reward after repeating the same action in Q-Learning

元气小坏坏 submitted on 2019-12-01 02:56:54
I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me. Let's consider the standard formulation of Q-Learning:

Q(S, A) = Q(S, A) + alpha * [R + max Q(S', A') - Q(S, A)]

Let's assume there's a state K that has two possible actions, A and A', awarding our agent rewards R and R' respectively. If we follow an almost-totally-greedy approach (let's say we assume a 0.1 epsilon), I'll at first randomly choose one of
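
One detail worth noting: the update as written has no discount factor, i.e. it is the usual rule with gamma = 1, and that is exactly the regime in which a reward that can be harvested repeatedly makes Q grow without bound. A tiny numeric sketch (illustrative, not the poster's code) of repeatedly updating a state that transitions back to itself with reward R:

def repeated_update(R=1.0, alpha=0.5, gamma=0.9, steps=200):
    # With gamma < 1 this converges to R / (1 - gamma); with gamma = 1 it grows
    # without bound, which is the behaviour described in the question.
    q = 0.0
    for _ in range(steps):
        q += alpha * (R + gamma * q - q)   # s' == s, and this action is the max
    return q

print(repeated_update(gamma=0.9))   # approaches 1 / (1 - 0.9) = 10
print(repeated_update(gamma=1.0))   # keeps growing as steps increases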

Q-Learning values get too high

♀尐吖头ヾ submitted on 2019-12-01 00:18:48
I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine. Here's how I implemented the solution to an m,n,k-game environment: at each given time t, the agent holds the last state-action pair (s, a) and the acquired reward for it; the agent selects a move a' based on an epsilon-greedy policy and calculates the reward r, then proceeds to update the value of Q(s, a) for time t-1:

func (agent *RLAgent) learn(reward float64) {
    var mState = marshallState(agent.prevState, agent.id)
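
For reference, here is what those two steps look like in a minimal sketch (written in Python rather than the question's Go, with Q assumed to be a defaultdict(float) keyed by (state, action) pairs; all names are illustrative):

import random

def choose_action(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

def learn(Q, prev_state, prev_action, reward, new_state, actions, alpha=0.5, gamma=0.9):
    # The reward belongs to the transition just completed, and the bootstrap term is
    # discounted; leaving gamma out (or at 1.0) is a common cause of runaway values.
    best_next = max(Q[(new_state, a)] for a in actions)
    Q[(prev_state, prev_action)] += alpha * (reward + gamma * best_next
                                             - Q[(prev_state, prev_action)])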

How can I register a custom environment in OpenAI's gym?

烈酒焚心 submitted on 2019-11-30 23:12:30
I have created a custom environment, as per the OpenAI Gym framework, containing step, reset, action, and reward functions. I aim to run OpenAI Baselines on this custom environment, but prior to this the environment has to be registered with OpenAI Gym. How can the custom environment be registered with OpenAI Gym? Also, should I be modifying the OpenAI Baselines code to incorporate this? You do not need to modify the baselines repo. Here is a minimal example. Say you have myenv.py, with all the needed functions (step, reset, ...). The name of the environment class is
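
Completing the idea with a minimal sketch (the class name MyEnv and the id 'MyEnv-v0' are assumptions, since the excerpt is cut off): register() maps an environment id to a 'module:ClassName' entry point, after which gym.make() can build it.

import gym
from gym.envs.registration import register

register(
    id='MyEnv-v0',              # ids conventionally follow the '<Name>-v<N>' pattern
    entry_point='myenv:MyEnv',  # 'module_path:ClassName' pointing at your myenv.py
)

env = gym.make('MyEnv-v0')

Once registered, the baselines scripts can be pointed at the id (e.g. as the environment name argument), provided the module that calls register() is imported first.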

How can I apply reinforcement learning to continuous action spaces?

。_饼干妹妹 submitted on 2019-11-29 20:28:31
I'm trying to get an agent to learn the mouse movements necessary to best perform some task in a reinforcement learning setting (i.e. the reward signal is the only feedback for learning). I'm hoping to use the Q-learning technique, but while I've found a way to extend this method to continuous state spaces, I can't seem to figure out how to accommodate a problem with a continuous action space. I could just force all mouse movements to be of a certain magnitude and in only a certain number of different directions, but any reasonable way of making the actions discrete would yield a huge action
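
Besides discretization, the standard route is a policy-gradient / actor-critic method whose actor outputs the parameters of a continuous distribution over actions. A minimal sketch of a Gaussian policy head over a 2-D mouse displacement; the linear parameterization and all names are illustrative assumptions:

import numpy as np

def gaussian_policy(state, W_mu, log_std):
    # Actor head: the mean is a (linear) function of the state, the std is a
    # learned constant; the action is sampled, so it stays continuous.
    mu = W_mu @ state
    std = np.exp(log_std)
    action = mu + std * np.random.randn(*mu.shape)
    return action, mu, std

def log_prob(action, mu, std):
    # Log-density of the sampled action; its gradient with respect to the policy
    # parameters is what gets scaled by the return in a policy-gradient update.
    return np.sum(-0.5 * ((action - mu) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi))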