How does asynchronous training work in distributed Tensorflow?

Asked 2020-12-12 19:43 by 孤独总比滥情好

I've read the Distributed TensorFlow doc, and it mentions that in asynchronous training,

each replica of the graph has an independent training loop that executes without coordination.

How does this work in practice? In particular, how are the workers' updates applied to the shared model parameters?

3 Answers
  •  伪装坚强ぢ
    2020-12-12 20:17

    When you train asynchronously in Distributed TensorFlow, a particular worker does the following:

    1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).

    2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.

    3. The worker sends the gradients for each variable to the appropriate PS task, which applies them to the respective variable using an update rule determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently to the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates it has received. (A minimal code sketch of this flow follows the list.)
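
    Here is a minimal between-graph replication sketch of that flow, using the TF 1.x tf.train API this answer refers to. The cluster addresses, the task_index value, and the toy linear model are placeholders for illustration, not details from the question:

        import numpy as np
        import tensorflow as tf  # TensorFlow 1.x API, matching the answer

        # Hypothetical cluster layout; the addresses and task_index are placeholders.
        cluster = tf.train.ClusterSpec({
            "ps": ["ps0.example.com:2222"],
            "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        })
        job_name, task_index = "worker", 0   # normally parsed from command-line flags
        server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

        if job_name == "ps":
            server.join()  # PS tasks host the shared variables and apply updates
        else:
            # replica_device_setter places variables on the PS task(s) and keeps
            # the compute ops on this worker (between-graph replication).
            with tf.device(tf.train.replica_device_setter(
                    worker_device="/job:worker/task:%d" % task_index,
                    cluster=cluster)):
                x = tf.placeholder(tf.float32, [None, 10])
                y = tf.placeholder(tf.float32, [None, 1])
                w = tf.get_variable("w", [10, 1])   # shared parameter, lives on the PS
                loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
                global_step = tf.train.get_or_create_global_step()

                # Steps 1-3: read parameters, compute gradients locally, and send
                # them to the PS, which applies them without waiting for other workers.
                train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
                    loss, global_step=global_step)

            hooks = [tf.train.StopAtStepHook(last_step=1000)]
            with tf.train.MonitoredTrainingSession(master=server.target,
                                                   is_chief=(task_index == 0),
                                                   hooks=hooks) as sess:
                while not sess.should_stop():
                    # Each worker runs its own loop, uncoordinated with the others.
                    batch_x = np.random.rand(32, 10).astype(np.float32)
                    batch_y = np.random.rand(32, 1).astype(np.float32)
                    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})

    Every worker runs this same program with its own task_index; only their uncoordinated calls to train_op touch the shared variables on the PS.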

    In asynchronous training, each update from the worker is applied concurrently, and the updates may be somewhat coordinated if the optional use_locking=True flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer) was initialized. Note however that the locking here only provides mutual exclusion for two concurrent updates, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
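
    For instance, a small sketch of setting that flag (use_locking is a constructor argument of the tf.train optimizers; loss stands for whatever loss tensor the worker builds, e.g. the one in the sketch above):

        import tensorflow as tf  # TensorFlow 1.x API

        # With use_locking=True, the apply-gradients op on each variable takes a
        # lock, so two workers' concurrent updates to the same variable cannot
        # interleave element-wise. Reads remain lock-free, as noted above.
        opt = tf.train.GradientDescentOptimizer(learning_rate=0.01, use_locking=True)
        train_op = opt.minimize(loss)  # 'loss' as built in the earlier sketch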

    (By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer will ensure that all of the workers read the same, up-to-date values for each model parameter; and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update, and leave after the aggregated update has been applied to all variables.)
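
    A rough sketch of that synchronous variant, reusing the names (loss, global_step, server, task_index, x, y, and the batches) from the first sketch and assuming a two-worker cluster:

        import tensorflow as tf  # TensorFlow 1.x API

        num_workers = 2                       # assumed cluster size
        is_chief = (task_index == 0)          # task_index as in the first sketch

        base_opt = tf.train.GradientDescentOptimizer(0.01)
        sync_opt = tf.train.SyncReplicasOptimizer(
            base_opt,
            replicas_to_aggregate=num_workers,  # gradients aggregated per step
            total_num_replicas=num_workers)
        train_op = sync_opt.minimize(loss, global_step=global_step)

        # The hook implements the barrier described above: workers block until the
        # aggregated update has been applied, and the chief initializes the
        # synchronization queues.
        sync_hook = sync_opt.make_session_run_hook(is_chief)
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=is_chief,
                                               hooks=[sync_hook]) as sess:
            while not sess.should_stop():
                sess.run(train_op, feed_dict={x: batch_x, y: batch_y})  # batches as before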
