reinforcement-learning

Neural Network and Temporal Difference Learning

Submitted by 限于喜欢 on 2020-01-24 05:25:07
Question: I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions. - Where does the prediction value V_t come from, and subsequently, how do we get V_(t+1)? - What exactly is getting backpropagated when TD is used with a neural net? That is, where does the error that gets backpropagated come from when using TD? Answer 1:
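
The following is a minimal sketch, not taken from the question or its answer, assuming a linear approximator V(s) = w · φ(s) (the function and variable names are hypothetical). It shows where V_t and V_(t+1) come from (both are outputs of the same value function, evaluated on the current and next state) and what plays the role of the backpropagated error (the TD error δ):

    import numpy as np

    def td0_update(w, phi_s, phi_s_next, reward, gamma=0.99, alpha=0.01, done=False):
        # V_t: the approximator's prediction for the current state
        v_t = float(np.dot(w, phi_s))
        # V_(t+1): the same approximator's prediction for the next state (0 if terminal)
        v_next = 0.0 if done else float(np.dot(w, phi_s_next))
        # TD error: the quantity used as the output error signal
        delta = reward + gamma * v_next - v_t
        # Gradient step; with a neural net, delta would be backpropagated through
        # the network instead of being multiplied by the feature vector directly
        w = w + alpha * delta * phi_s
        return w, delta

With a deep network, the same δ is treated as the error at the output unit and ordinary backpropagation distributes it to the hidden-layer weights; TD(λ) additionally accumulates eligibility traces over past gradients.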

Epsilon and learning rate decay in epsilon greedy q learning

Submitted by 浪子不回头ぞ on 2020-01-23 00:24:03
Question: I understand that epsilon marks the trade-off between exploration and exploitation. At the beginning, you want epsilon to be high so that you take big leaps and learn things. As you learn about future rewards, epsilon should decay so that you can exploit the higher Q-values you've found. However, does our learning rate also decay with time in a stochastic environment? The posts on SO that I've seen only discuss epsilon decay. How do we set our epsilon and alpha such that the values converge? Answer 1:
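
A hedged sketch of one common way to schedule both quantities (the constants and decay shapes below are illustrative assumptions, not taken from the question). For convergence of tabular Q-learning the learning rate should satisfy the Robbins-Monro conditions, sum over t of alpha_t = infinity and sum over t of alpha_t^2 < infinity, so a roughly 1/t decay is typical, while epsilon can decay more slowly toward a small floor so the agent keeps exploring:

    def epsilon_at(step, eps_start=1.0, eps_min=0.05, decay=1e-3):
        # Slow hyperbolic decay toward a small exploration floor
        return max(eps_min, eps_start / (1.0 + decay * step))

    def alpha_at(visit_count, alpha_min=0.01):
        # Roughly 1/n per state-action visit, in line with the Robbins-Monro conditions
        return max(alpha_min, 1.0 / (1.0 + visit_count))

Note that the alpha_min floor technically violates the second condition but is common in practice; in a non-stationary environment a small constant alpha is often kept instead of decaying to zero.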

AttributeError: module '_Box2D' has no attribute 'RAND_LIMIT_swigconstant'

Submitted by 此生再无相见时 on 2020-01-13 17:01:07
Question: I am trying to run a lunar_lander example with reinforcement learning, but when I run it I get an error. Also, my computer runs macOS. Here is the code of the lunar lander:

    import numpy as np
    import gym
    import csv
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Flatten
    from keras.optimizers import Adam
    from rl.agents.dqn import DQNAgent
    from rl.policy import BoltzmannQPolicy, EpsGreedyQPolicy
    from rl.memory import SequentialMemory
    import io
    import sys
    import csv
    #

In Reinforcement learning using feature approximation, does one have a single set of weights or a set of weights for each action?

Submitted by 懵懂的女人 on 2020-01-06 03:35:05
Question: This question is an attempt to reframe this question to make it clearer. This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions. These discussions (The Basic Update Rule and Linear Value Function Approximation) show a set of weights for each action. The reason they are different is that the first slide assumes you can anticipate the result of performing an action and then find features for the resulting states. (Note that the feature functions are
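
A small sketch of the two formulations the question contrasts (the feature helpers and dimensions are hypothetical, not taken from the linked slides):

    import numpy as np

    # (a) A single weight vector with state-action features: Q(s, a) = w . f(s, a)
    def q_single(w, f_sa):
        return float(np.dot(w, f_sa))

    # (b) One weight vector per action with state-only features: Q(s, a) = w_a . f(s)
    def q_per_action(W, f_s, action):
        return float(np.dot(W[action], f_s))

The two are interchangeable: stacking the per-action vectors into one long vector and defining f(s, a) as f(s) placed in the block belonging to action a (zeros elsewhere) makes (b) a special case of (a).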

Markov Model descision process in Java

Submitted by 雨燕双飞 on 2020-01-01 09:39:13
Question: I'm writing an assisted learning algorithm in Java. I've run into a mathematical problem that I can probably solve, but because the processing will be heavy I need an optimal solution. That being said, if anyone knows an optimized library that would be totally awesome, but the language is Java, so that will need to be taken into consideration. The idea is fairly simple: objects will store combinations of variables such as ABDC, ACDE, DE, AE. The max number of combinations will be based on how many

What is a policy in reinforcement learning? [closed]

Submitted by 六眼飞鱼酱① on 2019-12-31 08:42:05
Question (closed as off-topic, no longer accepting answers): I've seen statements such as: "A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states." But I still don't fully understand. What exactly is a policy in reinforcement learning? Answer 1:
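
A toy sketch (the states, actions, and probabilities below are made up purely for illustration) of the two usual forms a policy takes:

    import random

    # Deterministic policy: each state maps to exactly one action
    deterministic_policy = {"s0": "left", "s1": "right", "s2": "left"}

    # Stochastic policy: each state maps to a probability distribution over actions
    stochastic_policy = {
        "s0": {"left": 0.9, "right": 0.1},
        "s1": {"left": 0.2, "right": 0.8},
    }

    def act(state):
        # Following the stochastic policy means sampling an action from pi(a | s)
        actions, probs = zip(*stochastic_policy[state].items())
        return random.choices(actions, weights=probs)[0]

Reinforcement learning algorithms differ mainly in how they improve this mapping over time, e.g. acting greedily with respect to learned Q-values or adjusting the action probabilities directly (policy gradient).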

Normalizing Rewards to Generate Returns in reinforcement learning

Submitted by 喜你入骨 on 2019-12-30 12:49:50
Question: The question is about vanilla, non-batched reinforcement learning, basically what is defined here in Sutton's book. My model trains (woohoo!), though there is an element that confuses me. Background: In an environment where duration is rewarded (like pole-balancing), we have rewards of (say) 1 per step. After an episode, before sending this array of 1's to the train step, we do the standard discounting and normalization to get returns:

    returns = self.discount_rewards(rewards)
    returns =
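
A minimal sketch of the discounting and normalization step described above (the name discount_rewards matches the snippet, but its body and the gamma value are assumptions):

    import numpy as np

    def discount_rewards(rewards, gamma=0.99):
        # G_t = r_t + gamma * G_(t+1), computed by scanning the episode backwards
        returns = np.zeros(len(rewards), dtype=float)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    rewards = np.ones(5)                      # e.g. +1 per step while the pole stays up
    returns = discount_rewards(rewards)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # zero mean, unit variance

The normalization does not change the ordering of returns within an episode; it mainly stabilizes gradient magnitudes across episodes of different lengths.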
