q-learning

Something wrong with Keras code Q-learning OpenAI gym FrozenLake

Submitted by 空扰寡人 on 2020-08-02 07:49:11
Question: Maybe my question will seem stupid. I'm studying the Q-learning algorithm. To understand it better, I'm trying to rewrite the TensorFlow code of this FrozenLake example in Keras. My code:

    import gym
    import numpy as np
    import random
    from keras.layers import Dense
    from keras.models import Sequential
    from keras import backend as K
    import matplotlib.pyplot as plt
    %matplotlib inline

    env = gym.make('FrozenLake-v0')
    model = Sequential()
    model.add(Dense(16, activation='relu',
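For orientation, here is a minimal sketch of how such a Keras Q-network for FrozenLake is commonly set up, assuming the older gym API (FrozenLake-v0, env.step returning four values), a one-hot state encoding, and an MSE loss; it is not the asker's full code, and the hyperparameters are illustrative only.

    import numpy as np
    import gym
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam

    env = gym.make('FrozenLake-v0')        # 16 discrete states, 4 discrete actions
    n_states = env.observation_space.n
    n_actions = env.action_space.n

    # Q-network: one-hot encoded state in, one Q-value per action out
    model = Sequential()
    model.add(Dense(16, activation='relu', input_shape=(n_states,)))
    model.add(Dense(n_actions, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(0.001))

    def one_hot(s):
        v = np.zeros((1, n_states))
        v[0, s] = 1.0
        return v

    # A single Q-learning update (the epsilon-greedy loop over episodes is omitted)
    gamma = 0.99
    s = env.reset()
    q = model.predict(one_hot(s))
    a = int(np.argmax(q))
    s2, r, done, _ = env.step(a)
    target = q.copy()
    target[0, a] = r if done else r + gamma * np.max(model.predict(one_hot(s2)))
    model.fit(one_hot(s), target, epochs=1, verbose=0)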

Are Q-learning and SARSA with greedy selection equivalent?

Submitted by 痴心易碎 on 2020-05-25 07:26:30
Question: The difference between Q-learning and SARSA is that Q-learning compares the current state with the best possible next state, whereas SARSA compares the current state with the actual next state. If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical? Answer 1: Well, not actually. A key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm (it follows the policy that is
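As a toy illustration (not part of the truncated answer above): under a purely greedy behaviour policy, the action actually taken in the next state is the argmax action, so the SARSA target and the Q-learning target are computed from the same value; whether that makes the two algorithms identical is the point the answer goes on to address. The Q-values and reward below are made up for illustration.

    import numpy as np

    # Hypothetical Q-values for the next state s' (one row, four actions)
    Q_next = np.array([0.1, 0.5, 0.3, 0.2])
    gamma, r = 0.99, 1.0

    a_next = int(np.argmax(Q_next))              # greedy action that will actually be taken
    sarsa_target = r + gamma * Q_next[a_next]    # SARSA: bootstrap on the action taken
    q_target = r + gamma * np.max(Q_next)        # Q-learning: bootstrap on the max
    print(sarsa_target == q_target)              # True for this snapshot of Q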

[Hung-yi Lee Deep Reinforcement Learning Notes] 6. Actor-Critic, A2C, A3C, Pathwise Derivative Policy Gradient

Submitted by 时光怂恿深爱的人放手 on 2020-01-28 17:28:19
[Hung-yi Lee Deep Reinforcement Learning Notes] 1. Deep reinforcement learning algorithms: Policy Gradient methods https://blog.csdn.net/ACL_lihan/article/details/104020259
[Hung-yi Lee Deep Reinforcement Learning Notes] 2. Deep reinforcement learning: the Proximal Policy Optimization (PPO) algorithm https://blog.csdn.net/ACL_lihan/article/details/103989581
[Hung-yi Lee Deep Reinforcement Learning Notes] 3. Deep reinforcement learning algorithms: Q-learning (Basic Idea) https://blog.csdn.net/ACL_lihan/article/details/104041905
[Hung-yi Lee Deep Reinforcement Learning Notes] 4. More advanced Q-learning algorithms https://blog.csdn.net/ACL_lihan/article/details/104056542
[Hung-yi Lee Deep Reinforcement Learning Notes] 5. Q-learning for continuous actions (the NAF algorithm) https://blog.csdn.net/ACL_lihan/article/details/104076938
[Hung-yi Lee Deep Reinforcement Learning Notes] 6. Actor-Critic, A2C, A3C, Pathwise Derivative Policy Gradient (this post) https:/

Epsilon and learning rate decay in epsilon greedy q learning

Submitted by 浪子不回头ぞ on 2020-01-23 00:24:03
Question: I understand that epsilon marks the trade-off between exploration and exploitation. At the beginning, you want epsilon to be high so that you take big leaps and learn things. As you learn about future rewards, epsilon should decay so that you can exploit the higher Q-values you've found. However, does our learning rate also decay with time in a stochastic environment? The posts on SO that I've seen only discuss epsilon decay. How do we set our epsilon and alpha such that the values converge? Answer 1:
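Not from the truncated answer, but the pattern the question is asking about usually looks like the sketch below: epsilon is decayed so exploration fades, and in a stochastic environment alpha is also decayed, slowly enough that the stochastic-approximation conditions (the sum of alphas diverges, the sum of squared alphas converges) hold. All schedule constants here are illustrative.

    n_episodes = 10000
    eps_min, eps_decay = 0.01, 0.999   # exploration schedule (illustrative values)
    alpha0 = 0.5                       # initial learning rate (illustrative value)

    epsilon = 1.0
    for episode in range(n_episodes):
        # epsilon: exponential decay down to a floor, so some exploration always remains
        epsilon = max(eps_min, epsilon * eps_decay)

        # alpha: harmonic-style decay, satisfying the usual convergence conditions
        alpha = alpha0 / (1.0 + episode / 1000.0)

        # ... run one episode of epsilon-greedy Q-learning using `epsilon` and `alpha` ...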

DDPG: a reinforcement learning algorithm

Submitted by 给你一囗甜甜゛ on 2020-01-22 16:55:29
Contents: Background / Quick Facts / Key Equations / The Q-learning side of DDPG / The policy-learning side of DDPG / Exploration vs. Exploitation / Documentation / References / Why These Papers?

Background: DDPG is an algorithm that learns a Q-function and a policy at the same time. It uses off-policy data and the Bellman equation to learn the Q-function, and then uses the Q-function to learn the policy. The approach is closely related to Q-learning and stems from the following idea: if you know the optimal action-value function $Q^*(s,a)$, then for a given state the optimal action $a^*(s)$ can be found by solving

$$a^*(s) = \arg \max_a Q^*(s,a).$$

DDPG interleaves learning an approximation of $Q^*(s,a)$ with learning an approximation of $a^*(s)$, and it does so in a way that is particularly well suited to environments with continuous action spaces. However
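To make the continuous-action point concrete (this restates the standard DDPG formulation rather than anything in the truncated excerpt): when $a$ is continuous, $\max_a Q(s,a)$ cannot be enumerated, so DDPG learns a deterministic policy $\mu_\theta(s)$ and improves it by gradient ascent on the learned Q-function over states drawn from the replay buffer $\mathcal{D}$:

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \big[ Q_\phi(s, \mu_\theta(s)) \big]$$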

What is the difference between Q-learning and SARSA?

Submitted by 只谈情不闲聊 on 2019-12-29 02:26:23
Question: Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (for me) to see any difference between these two algorithms. According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s and action a, at timestep t), i.e. $Q(s_t, a_t)$, can be updated as follows:

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \big( r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$$
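The excerpt breaks off before the second formula; for comparison, the Q-learning update from the same book replaces the sampled next action with a maximum over actions:

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big)$$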

TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist

Submitted by 一世执手 on 2019-12-24 08:47:43
Question: From this file: https://github.com/llSourcell/pong_neural_network_live/blob/master/RL.py I've updated the lines

    # first convolutional layer: weight and bias tensors
    # creates tensors with all elements set to zero with the given shape
    W_conv1 = tf.Variable(tf.zeros([8, 8, 4, 32]), name='W_conv1')
    b_conv1 = tf.Variable(tf.zeros([32]), name='b_conv1')
    W_conv2 = tf.Variable(tf.zeros([4, 4, 32, 64]), name='W_conv2')
    b_conv2 = tf.Variable(tf.zeros([64]), name='b_conv2')
    W_conv3 = tf.Variable(tf.zeros([3, 3, 64,
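The thread's actual resolution is not shown in this excerpt. As general background (an assumption about the usual cause, not the confirmed fix): this error typically means the feed_dict key or the tf.train.Saver belongs to a different graph than the one the session is running, since the Saver adds its own ops, including the 'save/Const' filename tensor, to whichever graph it is built in. A minimal TF1-style save/restore sketch, with a hypothetical checkpoint path:

    import tensorflow as tf  # TF1-style API (use tf.compat.v1 under TensorFlow 2)

    graph = tf.Graph()
    with graph.as_default():
        # Named variables, as in the updated RL.py lines above
        W_conv1 = tf.Variable(tf.zeros([8, 8, 4, 32]), name='W_conv1')
        b_conv1 = tf.Variable(tf.zeros([32]), name='b_conv1')

        # Build the Saver in the SAME graph as the variables it handles;
        # it adds its own ops (including the 'save/Const' tensor) to that graph.
        saver = tf.train.Saver()
        init = tf.global_variables_initializer()

    with tf.Session(graph=graph) as sess:
        sess.run(init)
        saver.save(sess, './model.ckpt')     # hypothetical path
        saver.restore(sess, './model.ckpt')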