reinforcement-learning

What is the way to understand Proximal Policy Optimization Algorithm in RL?

扶醉桌前 submitted on 2019-11-29 20:08:19
I know the basics of reinforcement learning, but which terms do I need to understand to be able to read the arXiv PPO paper? What is the roadmap for learning and using PPO? To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update". First, to ground these points in the original PPO paper: We have introduced [PPO], a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy
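
A minimal, illustrative sketch of the Clipped Surrogate Objective (assuming precomputed advantage estimates and stored old-policy log-probabilities; the function and argument names are made up for illustration):

    import torch

    def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # r_t(theta): probability ratio between the current policy and the policy
        # that collected the data.
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Objective to maximize (negate it to use as a loss); taking the minimum means
        # policy changes that push the ratio outside [1-eps, 1+eps] get no extra credit.
        return torch.min(unclipped, clipped).mean()

In practice the same batch of collected trajectories is then reused for several epochs of minibatch gradient steps on this objective, which is the second contribution mentioned above.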

What is the difference between Q-learning and SARSA?

拈花ヽ惹草 submitted on 2019-11-29 18:57:05
Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms. According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s and action a, at timestep t), i.e. Q(s_t, a_t), can be updated as follows:

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*Q(s_t+1, a_t+1) - Q(s_t, a_t))

On the other hand, the update step for the Q-learning algorithm is the following:

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*max_a Q(s_t+1, a) - Q(s_t, a_t))
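
To make the difference concrete, here is a minimal tabular sketch (illustrative, not taken from the question) of the two update steps; the only difference is which next-state value is bootstrapped from: the action actually taken next (SARSA, on-policy) versus the greedy action (Q-learning, off-policy):

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy: bootstrap from the action the behaviour policy actually takes next.
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (td_target - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy: bootstrap from the greedy action in the next state,
        # regardless of which action is actually taken there.
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])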

Display OpenAI gym in Jupyter notebook only

戏子无情 submitted on 2019-11-29 07:05:06
I want to play with the OpenAI gyms in a notebook, with the gym being rendered inline. Here's a basic example:

    import matplotlib.pyplot as plt
    import gym
    from IPython import display
    %matplotlib inline

    env = gym.make('CartPole-v0')
    env.reset()
    for i in range(25):
        plt.imshow(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)
        env.step(env.action_space.sample())  # take a random action
    env.close()

This works, and I get to see the gym in the notebook. But it also opens an interactive window that shows precisely the same thing. I don't want this window to be open:
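
One commonly suggested workaround (an assumption on my part, not from the question itself) is to run the notebook under a virtual display, so the window that env.render() creates never appears on screen and only the inline matplotlib image is shown. The sketch below assumes pyvirtualdisplay and Xvfb are installed:

    # pip install pyvirtualdisplay   (also requires Xvfb, e.g. apt-get install xvfb)
    from pyvirtualdisplay import Display

    virtual_display = Display(visible=0, size=(1400, 900))
    virtual_display.start()

    # ...then run the rendering loop from the question unchanged; the render window
    # is drawn to the hidden virtual display instead of the desktop.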

How to use Tensorflow Optimizer without recomputing activations in reinforcement learning program that returns control after each iteration?

不羁岁月 submitted on 2019-11-29 01:41:15
EDIT (1/3/16): corresponding GitHub issue. I'm using Tensorflow (Python interface) to implement a q-learning agent with function approximation, trained using stochastic gradient descent. At each iteration of the experiment, a step function in the agent is called that updates the parameters of the approximator based on the new reward and activation, and then chooses a new action to perform. Here is the problem (in reinforcement learning jargon): the agent computes its state-action value predictions to choose an action, then gives control back to another program which simulates a step in the
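
For context, here is a minimal TF1-style sketch of the kind of agent described (illustrative only, not the asker's code; the linear approximator and all names are assumptions). The forward pass used for action selection and the training op are run in separate sess.run calls, so the activations of the previous state are recomputed when the update is applied, which is the redundancy the question is about:

    import numpy as np
    import tensorflow as tf

    class QAgent:
        def __init__(self, n_features, n_actions, lr=0.01, gamma=0.99, epsilon=0.1):
            self.gamma, self.epsilon, self.n_actions = gamma, epsilon, n_actions
            self.state_ph = tf.placeholder(tf.float32, [1, n_features])
            self.action_ph = tf.placeholder(tf.int32, [])
            self.target_ph = tf.placeholder(tf.float32, [])   # r + gamma * max_a' Q(s', a')
            W = tf.Variable(tf.zeros([n_features, n_actions]))
            self.q_values = tf.matmul(self.state_ph, W)        # the forward pass / activations
            q_sa = tf.gather(self.q_values[0], self.action_ph)
            loss = tf.square(self.target_ph - q_sa)
            self.train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
            self.sess = tf.Session()
            self.sess.run(tf.global_variables_initializer())

        def step(self, prev_state, prev_action, reward, state):
            # Forward pass on the new state, used for both the TD target and action selection.
            q_next = self.sess.run(self.q_values, {self.state_ph: state})
            target = reward + self.gamma * np.max(q_next)
            # Separate run for the update: prev_state is fed again, so its activations
            # are recomputed even though they were already computed on the previous call.
            self.sess.run(self.train_op, {self.state_ph: prev_state,
                                          self.action_ph: prev_action,
                                          self.target_ph: target})
            if np.random.rand() < self.epsilon:
                return np.random.randint(self.n_actions)
            return int(np.argmax(q_next))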

Tensorflow and Multiprocessing: Passing Sessions

ⅰ亾dé卋堺 submitted on 2019-11-28 18:01:45
I have recently been working on a project that uses a neural network for virtual robot control. I used tensorflow to code it up and it runs smoothly. So far, I have used sequential simulations to evaluate how good the neural network is; however, I want to run several simulations in parallel to reduce the amount of time it takes to get data. To do this I am importing Python's multiprocessing package. Initially I was passing the sess variable (sess = tf.Session()) to a function that would run the simulation. However, once I get to any statement that uses this sess variable, the process quits without
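
One common workaround (an assumption, not taken from the question) is to avoid sending the session across the process boundary at all: a tf.Session cannot be pickled or safely shared after a fork, so each worker builds its own graph and session, and only plain data (e.g. numpy weight arrays) is passed between processes. A rough sketch:

    import multiprocessing as mp

    def run_simulation(weights):
        # Import and build everything inside the worker so no Session crosses processes.
        import tensorflow as tf
        graph = tf.Graph()
        with graph.as_default():
            # ...rebuild the network here and assign `weights` to its variables...
            with tf.Session(graph=graph) as sess:
                # ...run one simulation with this process-local session...
                pass
        return 0.0  # e.g. the episode return

    if __name__ == '__main__':
        trained_weights = None  # plain numpy arrays are safe to pass to workers
        with mp.Pool(processes=4) as pool:
            returns = pool.map(run_simulation, [trained_weights] * 4)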

How can I apply reinforcement learning to continuous action spaces?

为君一笑 submitted on 2019-11-28 16:00:31
I'm trying to get an agent to learn the mouse movements necessary to best perform some task in a reinforcement learning setting (i.e. the reward signal is the only feedback for learning). I'm hoping to use the Q-learning technique, but while I've found a way to extend this method to continuous state spaces, I can't seem to figure out how to accommodate a problem with a continuous action space. I could just force all mouse movement to be of a certain magnitude and in only a certain number of
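
One simple way to act on the asker's own idea (a fixed magnitude and a fixed number of directions) is to discretize the 2-D mouse movement into a finite action set that plain Q-learning can handle. An illustrative sketch, with arbitrary parameter values:

    import math

    def build_action_set(n_directions=8, magnitudes=(5, 20, 50)):
        # Every combination of an evenly spaced direction and a step size (in pixels),
        # plus a "don't move" action: here 1 + 8 * 3 = 25 discrete actions.
        actions = [(0.0, 0.0)]
        for k in range(n_directions):
            angle = 2 * math.pi * k / n_directions
            for m in magnitudes:
                actions.append((m * math.cos(angle), m * math.sin(angle)))
        return actions

    actions = build_action_set()
    print(len(actions))  # 25 candidate (dx, dy) movements instead of a continuous space

For genuinely continuous control, actor-critic or policy-gradient methods with a parameterized (e.g. Gaussian) policy are the usual alternative to discretization.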

Pytorch: How to create an update rule that doesn't come from derivatives?

元气小坏坏 submitted on 2019-11-28 04:08:16
I want to implement the following algorithm, taken from this book, section 13.6: I don't understand how to implement the update rule in pytorch (the rule for w is quite similar to that for theta). As far as I know, torch requires a loss for loss.backward(). This form does not seem to apply to the quoted algorithm. I'm still certain there is a correct way of implementing such update rules in pytorch. I would greatly appreciate a code snippet of how the w weights should be updated, given that V
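
One way to implement such a rule in pytorch (a sketch under the assumption that the update has the form w ← w + α·δ·∇_w v̂(S,w), as in the one-step actor-critic of that section; all names are illustrative) is to obtain the gradient with torch.autograd.grad and apply the scaling manually, instead of building a loss for loss.backward():

    import torch

    def manual_update(output, params, delta, alpha):
        # Apply p <- p + alpha * delta * grad_p(output), treating delta as a constant
        # that is not differentiated through, so no conventional "loss" is needed.
        grads = torch.autograd.grad(output, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p += alpha * delta * g

    # Illustrative usage with a linear value function v(s) = w . s
    w = torch.zeros(4, requires_grad=True)
    s = torch.randn(4)
    v = torch.dot(w, s)
    delta = 0.7          # the TD error, computed elsewhere; treated as a plain number
    manual_update(v, [w], delta, alpha=0.1)

The θ update in the quoted algorithm has the same shape, with log π(A|S,θ) in place of v̂(S,w) and the policy parameters in place of w.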
