How does TensorFlow cluster distribute load across machines if not specified explicitly?


Question


I took the "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (the PC has 8 cores) and ran the training script, replacing this line:

sess = tf.InteractiveSession()

with the following:

sess = tf.InteractiveSession("grpc://localhost:12345")

where 12345 is the port on which node 0 is listening (i.e., the master session is created on node 0). Note that I did not specify explicitly where computations should be performed.
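For context, here is a minimal sketch of how the three local worker nodes might have been started (TF 1.x API; the launcher script and the two extra ports are assumptions, chosen to be consistent with node 0 listening on 12345):

import sys
import tensorflow as tf

# Hypothetical launcher: run once per worker,
# e.g. `python worker.py 0`, `python worker.py 1`, `python worker.py 2`.
cluster = tf.train.ClusterSpec({
    "worker": ["localhost:12345", "localhost:12346", "localhost:12347"],
})
task_index = int(sys.argv[1])
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)
server.join()  # block forever, serving graph-execution requests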

Looking at htop's output, I can see that the job is indeed performed by the cluster: it consumes some CPU. However, the only consumer is node 0; the remaining nodes do not perform any work. If I instead create the master session on node 1, the picture changes: only ~2/3 of the work is performed on node 0 (judging by CPU load), while the remaining 1/3 is performed on node 1. If I select node 2 as the master, that 1/3 of the work is performed on node 2. If I run two processes in parallel, one using node 1 as master and the other using node 2, both nodes 1 and 2 get some load, but node 0 is loaded much more (roughly 200% vs. 60% vs. 60% CPU).

So far it looks like the "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the intended behavior is, and whether distributed TensorFlow is meant for data parallelization at all (as opposed to manual model parallelization)?


Answer 1:


TF is great for data parallelization, e.g. when you need to sift through tons of data, which is then distributed to multiple GPUs.

It's also great for weights parallelization: with tf.train.replica_device_setter, the weights (variables) are distributed among multiple parameter-server tasks for better I/O.
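A minimal sketch of that pattern, assuming the worker cluster from the question plus one parameter-server task on a hypothetical port 12344. Variables created under the device function land on /job:ps, while the remaining ops stay on the chosen worker:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:12344"],
    "worker": ["localhost:12345", "localhost:12346", "localhost:12347"],
})

# Variables are placed on the ps task(s), round-robin if there are several;
# everything else runs on the worker named in worker_device.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    W = tf.Variable(tf.zeros([784, 10]))  # MNIST-sized weights
    b = tf.Variable(tf.zeros([10]))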

Now, it seems you are asking for parallelization within a single model. That's difficult to do automatically, since TF does not know the best way to distribute the computation of a single model across multiple devices. It would depend on too many factors, e.g. how fast the connection between your devices is.
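If you do want the work spread across the workers, you have to place it yourself with tf.device. Below is a hedged sketch of manual model parallelism for an MNIST-style graph; the layer sizes and task numbers are picked purely for illustration:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])

# Illustrative split: first layer pinned to worker 1, second to worker 2.
with tf.device("/job:worker/task:1"):
    w1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
    h = tf.nn.relu(tf.matmul(x, w1))

with tf.device("/job:worker/task:2"):
    w2 = tf.Variable(tf.truncated_normal([256, 10], stddev=0.1))
    y = tf.matmul(h, w2)

sess = tf.InteractiveSession("grpc://localhost:12345")
sess.run(tf.global_variables_initializer())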



Source: https://stackoverflow.com/questions/41844800/how-does-tensorflow-cluster-distribute-load-across-machines-if-not-specified-exp
