I'm looking at the policy gradients sample in this notebook: https://github.com/ageron/handson-ml/blob/master/16_reinforcement_learning.ipynb
The relevant code is h
The problem is that
optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a tensor of shape [None, 1].
That happens by design: the gradient terms for each tensor are aggregated automatically. Gradient operations such as optimizer.compute_gradients and the lower-level primitive tf.gradients sum all the gradient terms, using the default AddN aggregation method. This is fine for most uses of stochastic gradient descent, where only the gradient of the total batch loss is needed.
Unfortunately, in the end the gradient will have to be computed over a single batch at a time, unless a custom gradient function is built or the TensorFlow API is extended to provide gradient computation without full aggregation. Changing the implementation of tf.gradients to do this does not look trivial.
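To make the aggregation concrete, here is a minimal sketch in plain Python rather than TensorFlow, so the numbers can be checked by hand. It assumes a hypothetical one-weight model with per-example loss (w * x - y) ** 2; the function names and the data are made up for illustration and are not part of any TensorFlow API. A single gradient call over the whole loss vector corresponds to the summed value, while the individual terms are only available by evaluating one example at a time.

```python
# Hypothetical toy model: prediction = w * x, per-example loss = (w * x - y) ** 2.
def loss_grad(w, x, y):
    """Analytic derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x

def per_example_grads(w, xs, ys):
    """One gradient per example, i.e. one evaluation per single-example batch."""
    return [loss_grad(w, x, y) for x, y in zip(xs, ys)]

def aggregated_grad(w, xs, ys):
    """What differentiating the whole loss vector at once amounts to: the sum."""
    return sum(per_example_grads(w, xs, ys))

w = 0.5
xs = [1.0, 2.0, 3.0]
ys = [1.0, 1.0, 1.0]

grads = per_example_grads(w, xs, ys)   # three separate gradient values
total = aggregated_grad(w, xs, ys)     # the single aggregated value
print(grads)
print(total)
```

The per-example values differ in sign and size, which is exactly the information that is lost once they are summed into one number.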
One trick you might want to use for your reinforcement learning model is to perform multiple session runs in parallel. According to the TensorFlow FAQ, the Session API supports multiple concurrent steps and will take advantage of the available resources for parallel computation. The question "Asynchronous computation in TensorFlow" shows how to do this.
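The general pattern of issuing several steps concurrently can be sketched with a thread pool. This is only a sketch under stated assumptions, not the approach from the linked question: run_step is a hypothetical stand-in for a call such as sess.run(...) on a shared session, and it returns a made-up value so that the example runs without TensorFlow. In real code the body of run_step would call the session instead.

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(episode_index):
    """Hypothetical stand-in for one session step, e.g. sess.run(gradients, feed_dict=...).

    It returns a fake 'gradient' so the sketch runs without TensorFlow.
    """
    return episode_index * 0.5

def run_steps_concurrently(episode_indices, max_workers=4):
    """Run one step per episode on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_step, episode_indices))

results = run_steps_concurrently([0, 1, 2, 3])
print(results)
```

Because pool.map preserves input order, each result can be matched back to the episode that produced it, which matters if the per-episode gradients are later weighted differently.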