How to make data generated on a remote worker span iterations with in-graph replication in distributed TensorFlow?

若如初见. Submitted on 2019-12-11 06:44:45

Question


I use the in-graph replication of TensorFlow to do distributed training. To reduce communication cost, I need to hold some generated data (such as the cell states of an LSTM) on a remote worker from one training iteration to the next, but I have not found a way to achieve this.

If I use the fetch mechanism of the 'session.run' interface to retrieve the data generated on a remote worker, and then feed that data back to the same worker in the next training iteration, unnecessary network traffic is produced, as the code below shows:

cluster = tf.train.ClusterSpec({"worker": ["remoteIP0:port", "remoteIP1:port"]})
...

for i in xrange(2):
  with tf.device("/job:worker/task:%d" % i):
    with tf.name_scope('%s_%d' % (TOWER_NAME, i)) as scope:
      # Code that builds the model replica and one training step.
      ...
      initial_state[i] = ...
      ...
      weight[i] = ...
      bias[i] = ...
      cost[i] = ...
      ...
      gradient[i] = ...
      final_state[i] = ...
      ...
grad = aggregate_func(gradient[0], gradient[1])
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(grad)

...
with tf.Session("grpc://localhost:port") as session:
  ...
  for k in xrange(max_step):
    # Fetching final_state to the master and feeding it back in as
    # initial_state on the next step is the unnecessary network round trip.
    cost_val, finalstate, _ = session.run(
        [cost, final_state, train_op],
        feed_dict={initial_state[i]: finalstate[i] for i in xrange(2)})
  ...

The 'final_state[i]' generated in iteration k needs to be assigned to 'initial_state[i]' in iteration k+1 on every remote worker. How can this assignment be done on the remote worker machine itself, without fetching the data to the master (grpc://localhost:port) and feeding it back to the remote workers?


Answer 1:


As Yaroslav proposed, Variable objects or persistent tensors can be used instead of feed_dict, so that the generated state stays on the remote worker between iterations. Thanks Yaroslav.
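For illustration, here is a minimal sketch of the Variable-based approach, assuming TensorFlow 1.x: the per-tower recurrent state is held in a non-trainable Variable pinned to the worker's device, and a tf.assign op updates it in place on that machine. The names some_model_fn and some_loss_fn, as well as the example sizes, are hypothetical stand-ins for the real replica code, and the gradient aggregation and train_op from the question are omitted for brevity.

import tensorflow as tf

cluster = tf.train.ClusterSpec({"worker": ["remoteIP0:port", "remoteIP1:port"]})

batch_size, state_size, max_step = 20, 200, 1000   # example values only

costs = []
update_ops = []
for i in xrange(2):
  with tf.device("/job:worker/task:%d" % i):
    # Hypothetical per-tower state held in a non-trainable Variable that
    # lives on the remote worker instead of being fed from the master.
    state_var = tf.get_variable("state_%d" % i,
                                shape=[batch_size, state_size],
                                initializer=tf.zeros_initializer(),
                                trainable=False)

    # some_model_fn / some_loss_fn stand in for the real model-building code.
    new_state = some_model_fn(state_var)
    cost_i = some_loss_fn(new_state)

    # The assignment executes on the worker; nothing is fetched or fed back.
    update_ops.append(tf.assign(state_var, new_state))
    costs.append(cost_i)

with tf.Session("grpc://localhost:port") as session:
  session.run(tf.global_variables_initializer())
  for k in xrange(max_step):
    # Only the scalar costs cross the network; the states stay in place.
    cost_vals, _ = session.run([costs, update_ops])

The same idea can be expressed with persistent tensors (tf.get_session_handle / tf.get_session_tensor), but a Variable plus tf.assign is usually the simpler way to keep recurrent state resident on each worker across steps.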



Source: https://stackoverflow.com/questions/42317636/how-to-make-the-generated-data-in-remote-worker-span-iterations-in-in-graph-repl
