Is it possible to compute per-example gradients efficiently in TensorFlow, in just one graph run?


Question


TL;DR: is there a way to evaluate f'(x1), f'(x2), ..., f'(xn) in just one graph run, in vectorized form, where f'(x) is the derivative of f(x)?

Something like:

x = tf.placeholder(tf.float32, shape=[100])
f = tf.square(x)
f_grad = tf.multiple_gradients(f, x)  # hypothetical op: f_grad would contain f'(x[0]), f'(x[1]), ...

More specifically, I'm trying to implement Black Box Stochastic Variational Inference (BBSVI) by hand (I know I could use a library like Edward, but I want to implement it myself). At one point, I need to compute the mean of f'(x)g(x) across many different values of x (x1, x2, ..., xn), where f(x) and g(x) are two functions and f'(x) is the derivative of f(x).

Using TensorFlow's autodiff feature, I can compute f'(x1), f'(x2), ..., f'(xn) by calling f_prime.eval(feed_dict={x: xi}) once for each value xi in (x1, x2, ..., xn). This is not efficient at all: I would like to use a vectorized form instead, but I'm not sure how to do this.
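For concreteness, here is a minimal sketch of that naive loop (TF1-style graph mode, with a toy f(x) = x²; the scalar shape and the example values are made up):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[])
f = tf.square(x)                      # toy f(x) = x^2
f_prime = tf.gradients(f, x)[0]       # f'(x) = 2x

with tf.Session() as sess:
    # One full graph run per example -- this is the inefficiency in question.
    grads = [f_prime.eval(feed_dict={x: xi}, session=sess)
             for xi in (1.0, 2.0, 3.0)]  # [2.0, 4.0, 6.0]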

Perhaps using tf.stop_gradient() somehow? Or using the grad_ys argument in tf.gradients()?


Answer 1:


After a bit of digging, it seems that it is not trivial to compute per-example gradients in TensorFlow, because the library performs standard backpropagation to compute gradients (as do other deep-learning libraries such as PyTorch and Theano), and backpropagation never actually materializes the per-example gradients: it directly obtains their sum. Check out this discussion for more details.

However, there are some techniques to work around this issue, at least for some use cases. For example, the paper Efficient per-example gradient computation by Ian Goodfellow explains how to efficiently compute per-example vectors containing the sums of squared derivatives. The key observation is that, for a fully connected layer, each example's weight gradient is the outer product of that example's input and its backpropagated delta, so its squared norm factorizes into a product of two vector norms that are cheap to compute in a batch (I highly encourage you to read the paper, it is very short). A sketch of the computation follows the next paragraph.

This algorithm is O(mnp) instead of O(mnp²), where m is the number of examples, n is the number of layers in the neural net, and p is the number of neurons per layer. So it is much faster than the naive approach (i.e., performing back-prop once per example), especially when p is large, and even more so when using a GPU (which speeds up vectorized approaches by a large factor).
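Here is a minimal sketch of that factorization for a single dense layer (TF2 eager execution, with made-up shapes and a squared-error loss; the paper covers the general multi-layer case):

import tensorflow as tf

# Toy setup (assumed shapes): one dense layer, per-example squared-error losses.
m, p_in, p_out = 8, 16, 16
x = tf.random.normal([m, p_in])                  # one example per row
W = tf.Variable(tf.random.normal([p_in, p_out]))
y = tf.random.normal([m, p_out])

with tf.GradientTape() as tape:
    pre_act = x @ W                                       # shape [m, p_out]
    losses = tf.reduce_sum((pre_act - y) ** 2, axis=1)    # per-example losses
    total_loss = tf.reduce_sum(losses)

# Each row of `deltas` is one example's gradient w.r.t. its pre-activations,
# because example i's loss only depends on row i of pre_act.
deltas = tape.gradient(total_loss, pre_act)

# Example i's weight gradient is the outer product of x[i] and deltas[i], so
# its squared Frobenius norm factorizes: ||x[i]||^2 * ||deltas[i]||^2.
# No need to materialize m separate [p_in, p_out] gradient matrices.
sq_grad_norms = tf.reduce_sum(x ** 2, axis=1) * tf.reduce_sum(deltas ** 2, axis=1)

The two batched norm computations per layer are what make this O(mnp) rather than O(mnp²).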




Answer 2:


You can use tf.vectorized_map(forward_and_backward_fn, batch_of_inputs) to compute per-example gradients efficiently.
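A minimal sketch (TF2, with a toy f(x) = x²; the function name forward_and_backward_fn is just illustrative):

import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])   # batch of scalar inputs

def forward_and_backward_fn(xi):
    # Forward and backward pass for a single example.
    with tf.GradientTape() as tape:
        tape.watch(xi)             # xi is a plain tensor, so watch it explicitly
        fi = tf.square(xi)
    return tape.gradient(fi, xi)

# vectorized_map fuses the per-example passes into one batched computation,
# so all the gradients come out of a single run.
per_example_grads = tf.vectorized_map(forward_and_backward_fn, x)  # [2., 4., 6.]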



Source: https://stackoverflow.com/questions/50080929/is-it-possible-to-compute-per-example-gradients-efficiently-in-tensorflow-in-ju
