q-learning

Learning rate of a Q learning agent

喜欢而已 submitted on 2019-12-22 06:27:53
Question: How does the learning rate influence the convergence rate and convergence itself? If the learning rate is constant, will the Q function converge to the optimal one, or does the learning rate necessarily have to decay to guarantee convergence?

Answer 1: The learning rate sets the magnitude of the step taken towards the solution. It should not be too big a number, as the estimate may then keep oscillating around the minimum, and it should not be too small a number, or it will take a lot of time and iterations to
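A common way to reconcile these two points is to decay the learning rate per state-action pair so that the steps satisfy the Robbins-Monro conditions (they sum to infinity while their squares sum to a finite value), which is the classical requirement under which tabular Q-learning converges; with a constant rate the estimates keep fluctuating around the target. The sketch below is only an illustration of such a schedule; the `visit_count` bookkeeping and the 1/(1+n) decay are assumptions, not something taken from the question.

```python
from collections import defaultdict

# Tabular Q-learning with a per-(state, action) decaying learning rate.
# alpha_n = 1 / (1 + n) satisfies the Robbins-Monro conditions:
# the alphas sum to infinity, their squares sum to a finite value.
Q = defaultdict(float)            # Q[(state, action)] -> current estimate
visit_count = defaultdict(int)    # updates seen per (state, action)
gamma = 0.9                       # discount factor

def update(state, action, reward, next_state, next_actions):
    visit_count[(state, action)] += 1
    alpha = 1.0 / (1.0 + visit_count[(state, action)])   # decaying step size
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```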

Training only one output of a network in Keras

扶醉桌前 submitted on 2019-12-21 02:30:17
Question: I have a network in Keras with many outputs; however, my training data only provides information for a single output at a time. At the moment my method for training has been to run a prediction on the input in question, change the value of the particular output that I am training, and then do a single batch update. If I'm right, this is the same as setting the loss for all outputs to zero except the one that I'm trying to train. Is there a better way? I've tried class weights where I set a
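One way to get exactly the "zero loss on every output except the one I have labels for" behavior without a prediction step is to pass per-output sample weights of zero to train_on_batch for the outputs you are not training. The snippet below is a sketch under assumed names (a toy two-output model, `x_batch`, and random targets), not the asker's actual network, and it assumes your Keras version accepts a list of per-output sample weights.

```python
import numpy as np
from tensorflow import keras

# Toy two-output model purely for illustration.
inputs = keras.Input(shape=(16,))
hidden = keras.layers.Dense(32, activation="relu")(inputs)
out_a = keras.layers.Dense(1, name="out_a")(hidden)
out_b = keras.layers.Dense(1, name="out_b")(hidden)
model = keras.Model(inputs, [out_a, out_b])
model.compile(optimizer="adam", loss=["mse", "mse"])

x_batch = np.random.rand(8, 16)
y_a = np.random.rand(8, 1)      # real targets for "out_a"
y_b = np.zeros((8, 1))          # dummy targets for the output we skip

# Weight 1 for the output with labels, 0 for the other, so the second
# output contributes nothing to this batch's loss or gradients.
model.train_on_batch(x_batch, [y_a, y_b],
                     sample_weight=[np.ones(8), np.zeros(8)])
```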

Q-Learning values get too high

依然范特西╮ submitted on 2019-12-19 04:04:48
Question: I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine. Here's how I implemented the solution to an m,n,k-game environment: at each given time t, the agent holds the last state-action pair (s, a) and the reward acquired for it; the agent selects a move a' based on an epsilon-greedy policy and calculates the reward r, then proceeds to update the value of Q(s, a) for time t-1
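For reference, the "update the previous step once the next reward is known" bookkeeping described above typically looks like the sketch below. This is a generic Python stand-in with made-up names (`Agent`, `act`, `learn`), not the asker's Go implementation.

```python
import random
from collections import defaultdict

class Agent:
    """Generic sketch of the bookkeeping described above (not the Go code)."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.prev = None                      # last (state, action) pair

    def act(self, state, moves):
        # Epsilon-greedy selection of the next move a'.
        if random.random() < self.epsilon:
            action = random.choice(moves)
        else:
            action = max(moves, key=lambda a: self.Q[(state, a)])
        self.prev = (state, action)           # remembered for the next update
        return action

    def learn(self, reward, state, moves):
        # Update Q(s, a) for time t-1 now that the reward and new state are known.
        if self.prev is None:
            return
        best_next = max((self.Q[(state, a)] for a in moves), default=0.0)
        s, a = self.prev
        self.Q[(s, a)] += self.alpha * (reward + self.gamma * best_next - self.Q[(s, a)])
```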

Deep Reinforcement Learning Series (4): Q-Learning Algorithm Principles and Implementation

风格不统一 submitted on 2019-12-17 08:14:51
Paper: http://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf

Q-Learning, published in 1989, is a particularly classic value-based, model-free, off-policy algorithm; recent algorithms such as DQN build on it by adding neural networks.

1. Overview
During reinforcement learning, the sequence data encountered is usually stored in a table; the method learns by reading values from that table and maximizing the Q-value function with a greedy policy.

2. Principle and derivation
Q-Learning estimates the expected return of taking action a in a given state at a given moment, with the environment feeding back a reward in response to the agent's action. The core idea is to build a Q_table indexed by state and action to store the Q-values, and then select the action with the largest expected return according to those Q-values, as shown in the table (a minimal code sketch of such a table follows below):

Q-Table    a_1            a_2
s_1        Q(s_1, a_1)    Q(s_1, a_2)
s_2        Q(s_2, a_1)    Q(s_2, a_2)
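As a concrete illustration of the table above, a Q_table can be stored as a nested dictionary keyed first by state and then by action; the states s1, s2 and actions a1, a2 below simply mirror the table and are not part of the original post.

```python
# Q_table[state][action] -> stored Q value, mirroring the 2x2 table above.
Q_table = {
    "s1": {"a1": 0.0, "a2": 0.0},
    "s2": {"a1": 0.0, "a2": 0.0},
}

def greedy_action(state):
    """Select the action with the largest stored Q value in this state."""
    actions = Q_table[state]
    return max(actions, key=actions.get)

print(greedy_action("s1"))  # arbitrary while all entries are still 0.0
```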

State dependent action set in reinforcement learning

廉价感情. submitted on 2019-12-12 13:15:15
Question: How do people deal with problems where the legal actions in different states are different? In my case I have about 10 actions total and the legal actions are not overlapping, meaning that in certain states the same 3 actions are always legal, and those actions are never legal in other types of states. I'm also interested in seeing whether the solutions would be different if the legal actions were overlapping. For Q-learning (where my network gives me the values for state/action pairs), I was thinking
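One common tabular workaround is to restrict both the epsilon-greedy choice and the bootstrap max to the legal set of the state at hand, which works whether or not the legal sets of different states overlap. The sketch below is illustrative only; `legal` and `next_legal` stand in for a hypothetical legal_actions(state) lookup that the environment would provide. For a network that outputs one value per action, the analogous trick is to take the argmax only over the legal indices and ignore the rest.

```python
import random
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def epsilon_greedy(state, legal):
    # Explore and exploit only over actions that are legal in this state.
    if random.random() < epsilon:
        return random.choice(legal)
    return max(legal, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_legal):
    # The bootstrap max ranges only over the next state's legal actions as well.
    best_next = max((Q[(next_state, a)] for a in next_legal), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```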

Training only one output of a network in Keras

痞子三分冷 submitted on 2019-12-03 08:06:47
I have a network in Keras with many outputs; however, my training data only provides information for a single output at a time. At the moment my method for training has been to run a prediction on the input in question, change the value of the particular output that I am training, and then do a single batch update. If I'm right, this is the same as setting the loss for all outputs to zero except the one that I'm trying to train. Is there a better way? I've tried class weights, where I set a zero weight for all but the output I'm training, but it doesn't give me the results I expect. I'm using
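Another way to make the untrained outputs contribute zero loss, complementary to per-output sample weights, is a custom loss that ignores entries whose target is NaN, so each batch can carry real labels for one output and NaN placeholders for the rest. This is a sketch of one possible masked loss, not the asker's code and not an official Keras recipe.

```python
import tensorflow as tf

def masked_mse(y_true, y_pred):
    # Entries whose target is NaN are treated as "no label here": they are
    # zeroed out of the squared error and therefore produce no gradient.
    has_label = tf.math.logical_not(tf.math.is_nan(y_true))
    mask = tf.cast(has_label, y_pred.dtype)
    y_clean = tf.where(has_label, y_true, tf.zeros_like(y_true))
    squared = tf.square(y_clean - y_pred) * mask
    return tf.reduce_sum(squared) / tf.maximum(tf.reduce_sum(mask), 1.0)

# Hypothetical usage: compile every output with the masked loss, then fill
# the targets of the outputs you are not training with np.nan for that batch.
# model.compile(optimizer="adam", loss=masked_mse)
```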

What is the difference between Q-learning and Value Iteration?

我怕爱的太早我们不能终老 submitted on 2019-12-03 04:56:27
Question: How is Q-learning different from value iteration in reinforcement learning? I know Q-learning is model-free and its training samples are transitions (s, a, s', r). But since we know the transitions and the reward for every transition in Q-learning, is it not the same as model-based learning, where we know the reward for a state-action pair and the transitions for every action from a state (be it stochastic or deterministic)? I do not understand the difference.

Answer 1: You are 100% right that if

What is the difference between Q-learning and Value Iteration?

▼魔方 西西 submitted on 2019-12-02 18:15:44
How is Q-learning different from value iteration in reinforcement learning? I know Q-learning is model-free and its training samples are transitions (s, a, s', r). But since we know the transitions and the reward for every transition in Q-learning, is it not the same as model-based learning, where we know the reward for a state-action pair and the transitions for every action from a state (be it stochastic or deterministic)? I do not understand the difference.

You are 100% right that if we knew the transition probabilities and reward for every transition in Q-learning, it would be pretty
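The contrast can be made concrete by writing the two backups side by side: value iteration needs the model (transition probabilities and rewards) and sweeps every state-action pair, while Q-learning touches only the single transition it just sampled. The snippet below is a schematic comparison with made-up data structures, not code from the answer; the value-iteration form shown is the action-value (Q) variant.

```python
from collections import defaultdict

# Model-based backup: P[s][a] is a list of (prob, next_state, reward) triples;
# Q can be a defaultdict(float) so unseen pairs default to zero.
def value_iteration_backup(Q, P, gamma):
    for s in P:
        for a in P[s]:
            Q[(s, a)] = sum(
                prob * (r + gamma * max(Q[(s2, a2)] for a2 in P[s2]))
                for prob, s2, r in P[s][a])

# Model-free backup: only the sampled transition (s, a, r, s2) is needed.
def q_learning_backup(Q, s, a, r, s2, actions, alpha, gamma):
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```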

Unbounded increase in Q-Value, consequence of recurrent reward after repeating the same action in Q-Learning

元气小坏坏 submitted on 2019-12-01 02:56:54
Question: I'm in the process of developing a simple Q-Learning implementation for a trivial application, but there's something that keeps puzzling me. Let's consider the standard formulation of Q-Learning:

Q(S, A) = Q(S, A) + alpha * [R + MaxQ(S', A') - Q(S, A)]

Let's assume there's this state K that has two possible actions, both awarding our agent rewards R and R' for A and A'. If we follow an almost-totally-greedy approach (let's say we assume a 0.1 epsilon), I'll at first randomly choose one of
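For comparison with the formula quoted above, the usual discounted form multiplies the bootstrap term by a discount factor gamma; with rewards bounded by r_max and 0 <= gamma < 1, the Q values stay within r_max / (1 - gamma) as long as the table starts in that range, so a missing or unit discount factor is one standard thing to check when values grow without bound. The snippet is a generic illustration, not a diagnosis of the asker's application.

```python
# Discounted update:
#   Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a' Q(S', a') - Q(S, A)]
# With |R| <= r_max and gamma < 1, each step is a convex combination, so the
# values stay within r_max / (1 - gamma) whenever they start inside that bound.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```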

Q-Learning values get too high

♀尐吖头ヾ submitted on 2019-12-01 00:18:48
I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine. Here's how I implemented the solution to an m,n,k-game environment: at each given time t, the agent holds the last state-action pair (s, a) and the reward acquired for it; the agent selects a move a' based on an epsilon-greedy policy and calculates the reward r, then proceeds to update the value of Q(s, a) for time t-1:

func (agent *RLAgent) learn(reward float64) {
    var mState = marshallState(agent.prevState, agent.id)