Question
I have been using the training method from the cifar10_multi_gpu_train example for (local) multi-GPU training, i.e., creating several towers and then averaging the gradients. However, I was wondering: what happens if I just take the losses coming from the different GPUs, sum them up, and then apply gradient descent to that new loss?
Would that work? Probably this is a silly question, and there must be a limitation somewhere. So I would be happy if you could comment on this.
Thanks and best regards, G.
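Answer 1:
It would not work with a plain sum, at least not without changing anything else. With N towers, the summed loss is N times the average loss, so you would get a bigger loss and consequently gradients that are N times bigger; with an unchanged learning rate, those updates can overshoot and give erroneous results. When you average the gradients instead, you get the mean of the directions in which the weights have to move in order to minimize the loss, and each single direction is the one computed from that tower's own loss value.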
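To make the scale difference concrete, here is a minimal sketch in plain Python rather than TensorFlow. The scalar model, the `tower_grad` helper, and the two batches are made up for illustration and are not part of the cifar10 example. It only shows that, for plain gradient descent, the gradient of the summed loss is the number of towers times the averaged gradient.

```python
# Toy setup (assumption): one scalar weight w, and each "tower" computes a
# mean squared error loss on its own batch of (x, y) pairs.

def tower_grad(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w for one tower."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical per-tower batches, generated from y = 3 * x.
batches = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

w = 0.0
lr = 0.01
n_towers = len(batches)

grads = [tower_grad(w, b) for b in batches]

# The gradient of (loss_0 + loss_1) is the sum of the per-tower gradients.
grad_of_summed_loss = sum(grads)
# What the tower-averaging approach applies instead.
averaged_grad = sum(grads) / n_towers

# One plain gradient descent step with each variant.
w_after_sum = w - lr * grad_of_summed_loss
w_after_avg = w - lr * averaged_grad
# Dividing the learning rate by the number of towers undoes the scaling.
w_after_sum_rescaled = w - (lr / n_towers) * grad_of_summed_loss
```

In this toy case the summed-loss step moves the weight twice as far as the averaged step, and the rescaled step lands on the same value as the averaged one. Optimizers other than plain gradient descent may react differently to the loss scale.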
One thing that you can try is to run the towers independently and then average the weights from time to time. This gives a slower convergence rate but faster processing on each node.
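The periodic weight averaging idea can be sketched as follows, again in plain Python on a made-up scalar problem. The function name `train_with_periodic_averaging`, the `sync_every` parameter, and the data are assumptions for illustration only; a real multi-GPU setup would average full weight tensors across devices.

```python
def tower_grad(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w for one tower."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train_with_periodic_averaging(batches, steps=200, lr=0.01, sync_every=10):
    """Each tower trains its own copy of w; copies are averaged every sync_every steps."""
    weights = [0.0] * len(batches)  # one independent weight copy per tower
    for step in range(1, steps + 1):
        # Independent local update on each tower, no communication.
        weights = [w - lr * tower_grad(w, b) for w, b in zip(weights, batches)]
        if step % sync_every == 0:
            # Occasional synchronization: replace every copy with the mean.
            mean_w = sum(weights) / len(weights)
            weights = [mean_w] * len(weights)
    return weights

# Hypothetical per-tower batches, generated from y = 3 * x.
batches = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

final_weights = train_with_periodic_averaging(batches)
```

Between synchronization points the towers drift apart, which is where the slower convergence comes from. A larger `sync_every` means less communication but more drift.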
Source: https://stackoverflow.com/questions/41029037/training-multi-gpu-on-tensorflow-a-simpler-way