DDPG (Deep Deterministic Policy Gradients), how is the actor updated?


Question


I'm currently trying to implement DDPG in Keras. I know how to update the critic network (the normal DQN algorithm), but I'm currently stuck on updating the actor network, which uses the equation:

dJ/dtheta = dQ/da * da/dtheta

So, in order to reduce the loss of the actor network with respect to its weights, dJ/dtheta, it uses the chain rule to get dQ/da (from the critic network) * da/dtheta (from the actor network).

This looks fine, but I'm having trouble understanding how to derive the gradients from those 2 networks. Could someone perhaps explain this part to me?


Answer 1:


So the main intuition is that here, J is something you want to maximize instead of minimize. Therefore, we can call it an objective function instead of a loss function. The equation simplifies down to:

dJ/dTheta = dQ / da * da / dTheta = dQ/dTheta

Meaning you want to change the parameters Theta in order to change Q. Since in RL we want to maximize Q, for this part we want gradient ascent instead. In practice, you just run ordinary gradient descent but feed it the negated gradients.
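As a minimal TensorFlow 1.x illustration of that trick (objective and params below are hypothetical stand-ins for the actor's objective value and trainable weights):

    import tensorflow as tf

    # gradient ascent on the objective J is just gradient descent on -J
    negated_grads = tf.gradients(-objective, params)
    train_op = tf.train.AdamOptimizer(1e-4).apply_gradients(zip(negated_grads, params))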

To derive the gradients, do the following (a minimal TensorFlow sketch of these steps appears after the list):

  1. Using the online actor network, send in a batch of states that was sampled from your replay memory. (The same batch used to train the critic)
  2. Calculate the deterministic action for each of those states
  3. Send those states, together with the actions from step 2, to the online critic network to map those exact (state, action) pairs to Q values.
  4. Calculate the gradient of the Q values with respect to the actions calculated in step 2. We can use tf.gradients(Q value, actions) to do this. Now, we have dQ/dA.
  5. Send the states to the online actor network again and map them to actions.
  6. Calculate the gradient of the actions with respect to the online actor network weights, again using tf.gradients(a, network_weights). This will give you dA/dTheta
  7. Multiply dQ/dA by -dA/dTheta so that the gradient-descent step below performs GRADIENT ASCENT. We are left with the negated gradient of the objective function, i.e., gradient J.
  8. Divide every element of gradient J by the batch size, i.e., for each j in J, compute j / batch_size.
  9. Apply a variant of gradient descent by first zipping gradient J with the network parameters. This is done with an optimizer's apply_gradients method, e.g. tf.train.AdamOptimizer(lr).apply_gradients(zip(J, network_params)).
  10. And bam, your actor is training its parameters with respect to maximizing Q.
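
Putting the steps together, here is a minimal TensorFlow 1.x sketch of the actor update. The names actor, critic, state_ph, actor_params, batch_size and actor_lr are assumptions standing in for whatever your own implementation defines; treat it as an illustration of the recipe above, not a drop-in implementation.

    import tensorflow as tf

    # Steps 1-2: actions the online actor proposes for the sampled batch of states.
    # `actor` is assumed to be a callable (e.g. a Keras model) and `state_ph` the
    # placeholder holding the batch of states sampled from the replay memory.
    actions = actor(state_ph)                      # shape [batch_size, action_dim]

    # Step 3: Q values the online critic assigns to those (state, action) pairs.
    q_values = critic(state_ph, actions)           # shape [batch_size, 1]

    # Step 4: dQ/dA, the gradient of the Q values with respect to the actions.
    dq_da = tf.gradients(q_values, actions)        # list with one [batch_size, action_dim] tensor

    # Steps 5-7: chain dQ/dA through the actor via grad_ys; the minus sign means the
    # descent step below actually performs gradient ascent on Q.
    # `actor_params` is assumed to be the actor's trainable weights.
    actor_grads = tf.gradients(actions, actor_params,
                               grad_ys=[-g for g in dq_da])

    # Step 8: average the gradients over the batch.
    actor_grads = [g / batch_size for g in actor_grads]

    # Step 9: zip with the parameters and apply via the optimizer.
    train_actor_op = tf.train.AdamOptimizer(actor_lr).apply_gradients(
        zip(actor_grads, actor_params))

Running train_actor_op in a session, with the sampled states fed into state_ph, is what carries out step 10.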

I hope this makes sense! I also had a hard time understanding this concept, and am still a little fuzzy on some parts to be completely honest. Let me know if I can clarify anything!



Source: https://stackoverflow.com/questions/51496159/ddpg-deep-deterministic-policy-gradients-how-is-the-actor-updated
