I'm looking at the policy gradients sample in this notebook: https://github.com/ageron/handson-ml/blob/master/16_reinforcement_learning.ipynb
The relevant code is h
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 2-D tensor of shape [None, 1], with one row per instance.
That happens by design: the gradient terms for each tensor are automatically aggregated. Gradient computation operations such as optimizer.compute_gradients and the low-level primitive tf.gradients sum the gradients contributed by every element of the loss tensor, combining the terms with the default AddN aggregation method. This is fine for most cases of stochastic gradient descent.
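To make the effect of that aggregation concrete, here is a minimal framework-free sketch in plain Python (not TensorFlow). The one-weight logistic model and the helper names are made up for illustration. It suggests that differentiating a whole vector of per-instance losses gives the sum of the per-instance gradients, so the individual terms cannot be recovered from the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def instance_gradient(w, x, y):
    """d/dw of the sigmoid cross-entropy loss for one (x, y) pair, logit = w * x."""
    return (sigmoid(w * x) - y) * x

def summed_loss(w, xs, ys):
    """Sum of the per-instance sigmoid cross-entropy losses."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total

# Toy data (assumed values, for illustration only).
w = 0.3
xs = [0.5, -1.2, 2.0]
ys = [1.0, 0.0, 1.0]

# What we would like to get back: one gradient per instance.
per_instance = [instance_gradient(w, x, y) for x, y in zip(xs, ys)]

# What an aggregated gradient corresponds to: the sum of those terms.
aggregated = sum(per_instance)

# Central finite difference of the summed loss, as an independent check.
eps = 1e-6
numeric = (summed_loss(w + eps, xs, ys) - summed_loss(w - eps, xs, ys)) / (2 * eps)
```

Here `aggregated` and `numeric` agree up to numerical error, while `per_instance` holds the three separate values that the aggregated result no longer exposes.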
Unfortunately, in the end the gradient computation will have to be made over a single batch, unless a custom gradient function is built or the TensorFlow API is extended to provide gradient computation without full aggregation. Changing the implementation of tf.gradients to do this does not look trivial.
One trick that you might wish to employ for your reinforcement learning model is to perform multiple session runs in parallel. According to the FAQ, the Session API supports multiple concurrent steps, and will take advantage of the existing resources for parallel computation. The question "Asynchronous computation in TensorFlow" shows how to do this.
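As a rough illustration of the concurrent-runs idea, the sketch below issues several steps from a thread pool. It is only a pattern, not code from the linked question: `run_step` is a hypothetical stand-in for a call such as sess.run(...) on a single observation, and it returns fake values so the example runs without TensorFlow.

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(observation):
    """Hypothetical stand-in for one session run on a single observation.

    In a real model this would wrap something like
    sess.run([action, gradients], feed_dict={X: [observation]}).
    """
    fake_action = observation % 2          # placeholder "action"
    fake_gradient = [0.1 * observation]    # placeholder "gradient"
    return fake_action, fake_gradient

observations = list(range(8))

# Issue the steps from several threads; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_step, observations))

actions = [a for a, _ in results]
gradients = [g for _, g in results]
```

Whether this gives a real speedup depends on how much of each step can actually run in parallel on the available hardware, so it is worth measuring before relying on it.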
One weak solution I came up with is to create an array of gradient operations, one per instance in the batch, which I can then run all at the same time:
X = tf.placeholder(tf.float32, shape=[minibatch_size, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
hidden2 = tf.layers.dense(hidden, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden2, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
# Calculate gradients per batch instance - for minibatch training
batch_gradients = []
for instance_cross_entropy in tf.unstack(cross_entropy):
    instance_grads_and_vars = optimizer.compute_gradients(instance_cross_entropy)
    instance_gradients = [grad for grad, variable in instance_grads_and_vars]
    batch_gradients.append(instance_gradients)
# Calculate gradients for just one instance - for single instance training
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
# Create gradient placeholders
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
# In the end we only apply a single set of averaged gradients
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
while step < len(obs_array) - minibatch_size:
    action_array, batch_gradients_array = sess.run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]})
    for action_val, gradient in zip(action_array, batch_gradients_array):
        action_vals.append(action_val)
        current_gradients.append(gradient)
    step += minibatch_size
The main points: I need to specify the batch size for the placeholder X rather than leave it open-ended, otherwise tf.unstack has no idea how many elements to unstack. I unstack cross_entropy to get one cross-entropy value per instance, then call compute_gradients on each one. During training I call sess.run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]}), which gives me a separate set of gradients for each instance in the batch.
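For completeness, here is a small sketch of how the per-instance gradients collected this way might be reduced to the single averaged set that gets fed to the gradient placeholders. It uses plain Python lists instead of NumPy arrays so that it runs on its own. The function `mean_gradients` and the per-instance `weights` (for example a score attached to each action) are assumptions for illustration, not part of the code above.

```python
def mean_gradients(batch_gradients, weights):
    """Average per-instance gradients, variable by variable.

    batch_gradients[i][v] is the gradient of instance i with respect to
    variable v, flattened to a list of floats.
    weights[i] is a hypothetical per-instance scale factor.
    """
    n_instances = len(batch_gradients)
    n_variables = len(batch_gradients[0])
    averaged = []
    for v in range(n_variables):
        size = len(batch_gradients[0][v])
        acc = [0.0] * size
        for grads, w in zip(batch_gradients, weights):
            for k in range(size):
                acc[k] += w * grads[v][k]
        averaged.append([a / n_instances for a in acc])
    return averaged

# Two instances, two variables (toy values, for illustration only).
batch_gradients = [
    [[1.0, 2.0], [0.5]],
    [[3.0, 4.0], [1.5]],
]
weights = [1.0, -1.0]

averaged = mean_gradients(batch_gradients, weights)
```

Each entry of `averaged` would then be reshaped to the shape of the matching variable and fed through the corresponding entry of gradient_placeholders when running training_op.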
This is all well and good, but it doesn't give me much of a performance boost. I only get a max speedup of 2x. Increasing the batch size past 5 just scales the runtime of run() linearly, and gives no gain.
It's sad that TensorFlow can calculate and aggregate gradients over hundreds of instances blazingly fast, but requesting the gradients one by one is so much slower. Might need to dig into the source next...