Something wrong with Keras code Q-learning OpenAI gym FrozenLake

Submitted by 空扰寡人 on 2020-08-02 07:49:11

Question


Maybe my question will seem stupid.

I'm studying the Q-learning algorithm. To understand it better, I'm trying to port the TensorFlow code from this FrozenLake example to Keras.

My code:

import gym
import numpy as np
import random

from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K    

import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

# Network: one-hot encoded state (16 inputs) -> one Q-value per action (4 outputs)
model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

# Sum-of-squared-errors loss, as in the TensorFlow example
def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99   # discount factor (gamma)
e = 0.1   # epsilon for the epsilon-greedy policy
#create lists to contain total rewards and steps per episode
jList = []
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j+=1

        # Predict Q-values for the current state (states are one-hot encoded)
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))  # greedy action

        if np.random.rand(1) < e:
            action[0] = env.action_space.sample() #random action

        new_state, reward, d, _ = env.step(action[0])

        rAll += reward
        jList.append(j)
        rList.append(rAll)

        # Bellman target: Q(s,a) <- r + gamma * max_a' Q(s',a')
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)

        # Update only the Q-value of the action taken; leave the rest unchanged
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state

        if d:
            #Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")

When I run it, it doesn't perform well: Percent of successful episodes: 0.052%

plt.plot(rList)

The original TensorFlow code does much better: Percent of successful episodes: 0.352%

plt.plot(rList)

What have I done wrong?


Answer 1:


Besides setting use_bias=False, as @Maldus mentioned in the comments, another thing you can try is to start with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to decrease the epsilon value only if you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your player can keep exploring the map randomly until it starts to converge on a good route, and only then does it pay to reduce the epsilon parameter.
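A minimal sketch of that success-gated decay, reusing the variable names from the question's code (e, d, reward); the starting value, decay factor, and floor below are illustrative choices, not values from a reference implementation:

def update_epsilon(e, reached_goal, decay=0.9, e_min=0.05):
    # Decay epsilon only after successful episodes, so exploration stays
    # high until the agent has actually found a route to the goal
    if reached_goal:
        return max(e_min, e * decay)
    return e

e = 0.75  # start with more exploration than the original 0.1

# In the question's episode loop, replace the unconditional schedule
#   e = 1./((i/50) + 10)
# with a success-gated update (on FrozenLake, reward > 0 means the goal
# was reached):
#   if d:
#       e = update_epsilon(e, reward > 0)
#       break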

I've actually implemented a similar model in Keras in this gist, using convolutional layers instead of Dense layers. I managed to get it to work in under 2000 episodes. It might be of some help to others :)
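For reference, a hypothetical sketch of that idea (this is not the gist's actual code): treat the 16-dimensional one-hot state as a 4x4 single-channel grid and learn the Q-values with a small convolutional network. The layer sizes, output activation, and optimizer here are assumptions for illustration:

import numpy as np
from keras.layers import Conv2D, Dense, Flatten
from keras.models import Sequential

conv_model = Sequential()
conv_model.add(Conv2D(16, (2, 2), activation='relu', input_shape=(4, 4, 1)))
conv_model.add(Flatten())
conv_model.add(Dense(4, activation='linear'))  # one Q-value per action
conv_model.compile(loss='mse', optimizer='adam')

# A state s in [0, 16) becomes a 4x4 one-hot grid:
state_input = np.identity(16)[3].reshape(1, 4, 4, 1)
q_values = conv_model.predict(state_input, batch_size=1)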



Source: https://stackoverflow.com/questions/45869939/something-wrong-with-keras-code-q-learning-openai-gym-frozenlake
