Question
In /model/inception/inception/inception_distributed_training.py, apply_gradients is called for each worker:
```
apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
```
and the call goes into SyncReplicasOptimizer.py:
```
285   # sync_op will be assigned to the same device as the global step.
286   with ops.device(global_step.device), ops.name_scope(""):
287     update_op = self._opt.apply_gradients(aggregated_grads_and_vars,
288                                           global_step)
289
```
Line 287 will be executed by each worker process on the ps device.
I think that even if the job aggregating all replicas' gradients runs only once, then once aggregation finishes, every replica will make an RPC call to the remote apply_gradients operation group to produce the next variable values. If that is the case, the duplicated apply_gradients could be eliminated by checking the is_chief flag.
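For context, here is a minimal sketch of the call site in question, assuming TF1-style graph APIs; `num_workers`, `x`, and `loss` are stand-ins, not names from the inception code:

```python
import tensorflow as tf  # TF1-style graph APIs (tf.compat.v1 in TF2)

num_workers = 4                      # placeholder worker count
x = tf.get_variable("x", shape=[])   # stand-in model parameter
loss = tf.square(x - 1.0)            # stand-in loss

# Every worker process builds this same graph, which is why the
# apply_gradients call appears once per worker.
global_step = tf.train.get_or_create_global_step()
opt = tf.train.GradientDescentOptimizer(0.01)
opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=num_workers, total_num_replicas=num_workers)
grads = opt.compute_gradients(loss)
apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
```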
By the way, two more questions:
How is exclusive access to a variable's buffer controlled when multiple update operations arrive?
Can we use the "caching_device" flag to eliminate repeated remote variable reads (i.e. multiple network round trips)? If so, how is a cached variable invalidated/refreshed once the variable on the ps is updated?
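On the caching_device question, a minimal sketch of how I understand the flag is set, assuming TF1-style variable scopes; the device strings are examples only:

```python
import tensorflow as tf  # TF1-style graph APIs (tf.compat.v1 in TF2)

# Sketch only: the variable lives on the ps, but reads of it are cached
# on the local worker, so a step fetches it over the network once rather
# than once per consuming op.
with tf.device("/job:ps/task:0"):
    with tf.variable_scope("model", caching_device="/job:worker/task:0"):
        w = tf.get_variable("w", shape=[1024, 1024])
```

As I understand it, the cached copy is a snapshot taken when the variable is read within a step, and it is refreshed on the next read rather than explicitly invalidated, but that is exactly the part I would like officially confirmed.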
I have read a lot of documentation carefully and run many experiments to verify this, but an official answer would still be highly appreciated.
Answer 1:
I would like to answer it myself, after a careful review of these code snippets.
```
with ops.device(global_step.device), ops.name_scope(""):
  # Replicas have to wait until they can get a token from the token queue.
  with ops.control_dependencies(train_ops):
    token = sync_token_queue.dequeue()
  train_op = state_ops.assign(self._local_step, token)

  with ops.control_dependencies([update_op]):
    # Sync_op needs to insert tokens to the token queue at the end of the
    # step so the replicas can fetch them to start the next step.
    tokens = array_ops.fill([self._tokens_per_step], global_step)
    sync_op = sync_token_queue.enqueue_many((tokens,))

  if self._variable_averages is not None:
    with ops.control_dependencies([sync_op]), ops.name_scope(""):
      sync_op = self._variable_averages.apply(
          self._variables_to_average)

  self._chief_queue_runner = queue_runner.QueueRunner(dummy_queue,
                                                      [sync_op])
```
There are two op sets here: train_ops and update_op. update_op ends in sync_op, and sync_op is executed by the QueueRunner returned as self._chief_queue_runner. train_ops end in train_op, which is run in the context of each worker.
As a brief conclusion: sync_op is handed to the chief worker to drive the parameter update (in reality the update itself is applied by the ps tasks; the chief worker only runs the synchronisation mechanism), while train_op is called by each worker.
So the update is applied only once per step; there is no duplicated updating.
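To make that concrete, here is a sketch of how the two op sets are consumed at runtime; the names `sv` (a tf.train.Supervisor), `sess`, `is_chief`, `opt`, and `apply_gradients_op` come from a typical replicated setup, not from the quoted source:

```python
# Sketch only, assuming a Supervisor-based training loop.
if is_chief:
    # Only the chief starts the queue runner that drives sync_op, so the
    # aggregated update is applied exactly once per step.
    chief_queue_runner = opt.get_chief_queue_runner()
    init_tokens_op = opt.get_init_tokens_op()
    sv.start_queue_runners(sess, [chief_queue_runner])
    sess.run(init_tokens_op)

# Every worker (chief included) runs the returned train op; internally it
# blocks on sync_token_queue.dequeue() until the chief's sync_op refills
# the token queue for the next step.
sess.run(apply_gradients_op)
```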
That's all.
Source: https://stackoverflow.com/questions/44826477/does-tf-train-syncreplicasoptimizer-do-complete-parameter-update-from-aggregated