reinforcement-learning

What is the difference between Q-learning and SARSA?

Submitted by 只谈情不闲聊 on 2019-12-29 02:26:23
Question: Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it is hard (for me) to see any difference between the two algorithms. According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (for state s and action a at timestep t), i.e. Q(s_t, a_t), can be updated as follows: Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
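To make the difference concrete, here is a minimal tabular sketch (my own illustration, not taken from the book): SARSA bootstraps from the action the behaviour policy actually takes next, while Q-learning bootstraps from the greedy action regardless of what is executed.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: uses a_next, the action the agent will actually execute
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: uses the greedy action in s_next, whatever is executed next
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])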

Reinforcement learning algorithm using turtle graphics not functioning

Submitted by 梦想与她 on 2019-12-25 00:58:06
Question: I am currently trying to implement a Q-table algorithm in an environment created using turtle graphics. When I try running the algorithm, which uses Q-learning, I get an error stating:
File "<ipython-input-1-cf5669494f75>", line 304, in <module> rl()
File "<ipython-input-1-cf5669494f75>", line 282, in rl A = choose_action(S, q_table)
File "<ipython-input-1-cf5669494f75>", line 162, in choose_action state_actions = q_table.iloc[state, :]
File "/Users/himansuodedra/anaconda3/lib/python3.6/site
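The traceback above is cut off, but it stops at q_table.iloc[state, :]; pandas' iloc accepts only integer positions, so a common remedy is to make sure the state passed in is a plain integer row index. A minimal sketch of such a Q-table (the names and grid size are illustrative, not taken from the question's code):

import numpy as np
import pandas as pd

ACTIONS = ['up', 'down', 'left', 'right']
N_STATES = 25                                    # e.g. a 5x5 turtle-graphics grid

def build_q_table(n_states, actions):
    # one row per integer state id, one column per action
    return pd.DataFrame(np.zeros((n_states, len(actions))), columns=actions)

def choose_action(state, q_table, epsilon=0.1):
    state_actions = q_table.iloc[state, :]       # 'state' must be a plain int here
    if np.random.rand() < epsilon or (state_actions == 0).all():
        return np.random.choice(ACTIONS)
    return state_actions.idxmax()

q_table = build_q_table(N_STATES, ACTIONS)
action = choose_action(0, q_table)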

When using functional approximation in reinforcement learning how does one select actions?

Submitted by ε祈祈猫儿з on 2019-12-24 12:57:29
Question: This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions. I'm confused about how to write the feature functions. Given an observation, I can understand how to extract features from the observation. But given an observation, one doesn't know what the result of taking an action will be on the features. So how does one write a function that maps an observation and an action to a numerical value? In the Pacman example shown a few slides later, one knows
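With linear function approximation, Q(s, a) = Σ_i w_i · f_i(s, a): the features are functions of the state-action pair, so action selection simply evaluates Q(s, a) for every legal action and takes the argmax. A minimal sketch, where the two domain features are placeholder stubs standing in for Pacman-style features (distance to food after taking the action, whether a ghost would be one step away):

import numpy as np

def distance_to_food(state, action):
    # placeholder: a real agent would simulate 'action' from 'state' and
    # measure the distance to the nearest food pellet in the successor state
    return 0.0

def ghost_one_step_away(state, action):
    # placeholder: 1.0 if a ghost would be adjacent after taking 'action', else 0.0
    return 0.0

def features(state, action):
    return np.array([1.0,                              # bias feature
                     distance_to_food(state, action),
                     ghost_one_step_away(state, action)])

def q_value(weights, state, action):
    return float(np.dot(weights, features(state, action)))

def select_action(weights, state, legal_actions):
    return max(legal_actions, key=lambda a: q_value(weights, state, a))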

TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist

Submitted by 一世执手 on 2019-12-24 08:47:43
Question: From this file: https://github.com/llSourcell/pong_neural_network_live/blob/master/RL.py I've updated the lines:
# first convolutional layer. bias vector
# creates an empty tensor with all elements set to zero with a shape
W_conv1 = tf.Variable(tf.zeros([8, 8, 4, 32]), name='W_conv1')
b_conv1 = tf.Variable(tf.zeros([32]), name='b_conv1')
W_conv2 = tf.Variable(tf.zeros([4, 4, 32, 64]), name='W_conv2')
b_conv2 = tf.Variable(tf.zeros([64]), name='b_conv2')
W_conv3 = tf.Variable(tf.zeros([3, 3, 64,
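One common cause of this particular error (I cannot confirm it is the cause in this repository) is building a tf.train.Saver on a default graph that still contains tensors from an earlier run, e.g. after re-executing a notebook cell. A minimal TensorFlow 1.x sketch of the usual remedy: reset the graph first and create the saver only after all variables exist.

import tensorflow as tf   # TensorFlow 1.x API assumed

tf.reset_default_graph()  # clear any stale tensors left over from a previous run

W_conv1 = tf.Variable(tf.zeros([8, 8, 4, 32]), name='W_conv1')
b_conv1 = tf.Variable(tf.zeros([32]), name='b_conv1')
# ... remaining layers go here ...

saver = tf.train.Saver()  # create the saver after every variable has been defined

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, './checkpoints/pong')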

How to understand the RLstep in Keepaway (Compare with Sarsa)

Submitted by 半腔热情 on 2019-12-24 08:37:07
问题 In "Stone, Peter, Richard S. Sutton, and Gregory Kuhlmann. "Reinforcement learning for robocup soccer keepaway." Adaptive Behavior 13.3 (2005): 165-188.", the RLstep pseudocode seems quite a bit different from Sarsa(λ), which the authors say RLStep implements. Here is the RLstep pseudocode and here is the Sarsa(lambda) pseudocode. The areas of confusion are: Line 10 in the Sarsa(λ) pseudocode updates the Q value for each state-action pair after adding 1 to the e(s,a) . But in the RLstep

Why should continuous actions be clamped?

Submitted by 二次信任 on 2019-12-24 08:05:10
Question: In deep reinforcement learning with continuous action spaces, why does it seem to be common practice to clamp the action right before the agent executes it? Examples: OpenAI Gym Mountain Car https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py#L57 and Unity 3DBall https://github.com/Unity-Technologies/ml-agents/blob/master/unity-environment/Assets/ML-Agents/Examples/3DBall/Scripts/Ball3DAgent.cs#L29. Isn't information lost by doing so? For example, if the model
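For reference, the pattern the question describes usually looks like the sketch below (a Gym-style continuous environment and the older 4-tuple step API are assumed): the policy's raw output is clipped into the action space's bounds right before env.step.

import gym
import numpy as np

env = gym.make("MountainCarContinuous-v0")    # any Box action space works
obs = env.reset()

raw_action = np.array([1.7])                  # hypothetical unbounded policy output
low, high = env.action_space.low, env.action_space.high
action = np.clip(raw_action, low, high)       # clamp into the valid range before execution

obs, reward, done, info = env.step(action)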

Neural Network Reinforcement Learning Requiring Next-State Propagation For Backpropagation

Submitted by 南笙酒味 on 2019-12-23 11:53:28
Question: I am attempting to construct a neural network incorporating convolution and LSTM (using the Torch library) to be trained by Q-learning or advantage learning, both of which require propagating state T+1 through the network before updating the weights for state T. Having to do an extra propagation would cut performance, and that's bad, but not too bad; the real problem is that there are all kinds of state bound up in this. First of all, the Torch implementation of backpropagation has some
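The question concerns the original Lua Torch, but the underlying pattern is the same in modern PyTorch: the extra forward pass for state T+1 is done without tracking gradients, so it leaves no autograd state behind before the update for state T. A minimal Q-learning sketch of that pattern (names are illustrative):

import torch
import torch.nn.functional as F

def q_learning_step(q_net, optimizer, s_t, a_t, r_t, s_tp1, gamma=0.99):
    # propagate state T+1 without building a graph or disturbing training state
    with torch.no_grad():
        target = r_t + gamma * q_net(s_tp1).max(dim=1).values
    q_sa = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()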

Learning rate of a Q learning agent

Submitted by 喜欢而已 on 2019-12-22 06:27:53
Question: The question is how the learning rate influences the convergence rate and convergence itself. If the learning rate is constant, will the Q function converge to the optimal one, or does the learning rate necessarily need to decay to guarantee convergence? Answer 1: The learning rate sets the magnitude of the step taken towards the solution. It should not be too big a number, as it may continuously oscillate around the minimum, and it should not be too small a number, else it will take a lot of time and iterations to
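For tabular Q-learning in a stochastic environment, the classical convergence results require a decaying step size satisfying the Robbins-Monro conditions (Σ α = ∞, Σ α² < ∞); with a constant learning rate the estimates keep fluctuating around the optimum rather than settling on it. A minimal sketch of one standard per-visit schedule, α = 1/n(s, a):

import numpy as np

def tabular_q_update(Q, visits, s, a, r, s_next, gamma=0.99):
    # per-(state, action) decaying learning rate: alpha_n = 1 / n
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q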

Training only one output of a network in Keras

Submitted by 扶醉桌前 on 2019-12-21 02:30:17
Question: I have a network in Keras with many outputs; however, my training data only provides information for a single output at a time. At the moment my method for training has been to run a prediction on the input in question, change the value of the particular output that I am training, and then do a single batch update. If I'm right, this is the same as setting the loss for all outputs to zero except the one that I'm trying to train. Is there a better way? I've tried class weights, where I set a
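One alternative (sketched below under the assumption of a standard multi-output Keras model; the layer names and sizes are made up) is to keep all heads but silence the losses of the outputs you have no label for by passing per-output sample weights of zero, which amounts to the zero-loss trick described above without the extra prediction pass.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical two-output model; the architecture is illustrative only.
inp = keras.Input(shape=(8,))
h = layers.Dense(32, activation="relu")(inp)
out_a = layers.Dense(1, name="out_a")(h)
out_b = layers.Dense(1, name="out_b")(h)
model = keras.Model(inp, [out_a, out_b])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(16, 8)
y_a = np.random.rand(16, 1)     # this batch only has labels for out_a
y_b = np.zeros((16, 1))         # dummy targets for the head we are not training

# Zero sample weights remove out_b's contribution to the loss and its gradients.
model.train_on_batch(x, [y_a, y_b],
                     sample_weight=[np.ones(16), np.zeros(16)])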

Pytorch ValueError: optimizer got an empty parameter list

Submitted by 梦想与她 on 2019-12-20 03:14:05
Question: When trying to create a neural network and optimize it using PyTorch, I am getting ValueError: optimizer got an empty parameter list. Here is the code:
import torch.nn as nn
import torch.nn.functional as F
from os.path import dirname
from os import getcwd
from os.path import realpath
from sys import argv

class NetActor(nn.Module):
    def __init__(self, args, state_vector_size, action_vector_size, hidden_layer_size_list):
        super(NetActor, self).__init__()
        self.args = args
        self.state_vector_size =
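The code is cut off above, but a common cause of this error is storing layers in a plain Python list (or creating them in a loop without registering them), so model.parameters() finds nothing. A minimal sketch of the standard fix, using nn.ModuleList so the parameters are registered (constructor arguments are simplified relative to the question's class):

import torch
import torch.nn as nn

class NetActor(nn.Module):
    def __init__(self, state_vector_size, action_vector_size, hidden_layer_size_list):
        super().__init__()
        sizes = [state_vector_size] + list(hidden_layer_size_list) + [action_vector_size]
        # nn.ModuleList registers each layer, so .parameters() is no longer empty
        self.layers = nn.ModuleList(
            nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])
        )

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        return self.layers[-1](x)

net = NetActor(state_vector_size=4, action_vector_size=2, hidden_layer_size_list=[64, 64])
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)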