When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

Question

When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.

What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?

Answer 1:

MultiWorkerMirroredStrategy: Intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs.
ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.

One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy, a copy of every variable in the model is kept on each device across all workers, and a communication method is needed to keep all those copies in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
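To make the placement difference concrete, here is a minimal sketch in plain Python. It is not TensorFlow code: the function names (`mirrored_step`, `place_on_servers`, `push_gradient`), the round-robin placement, and the toy three-variable model are all assumptions made for illustration.

```python
# Toy model of where variables live; plain Python, not TensorFlow code.
# All names and the round-robin placement are illustrative assumptions.

def mirrored_step(replicas, grads_per_worker, lr=0.1):
    """Average the gradients of all workers, then apply the same update to every copy."""
    n = len(grads_per_worker)
    names = list(replicas[0].keys())
    avg = {name: sum(g[name] for g in grads_per_worker) / n for name in names}
    for replica in replicas:
        for name in names:
            replica[name] -= lr * avg[name]


def place_on_servers(variables, num_servers):
    """Assign each variable to exactly one parameter server (round-robin)."""
    servers = [dict() for _ in range(num_servers)]
    for i, (name, value) in enumerate(sorted(variables.items())):
        servers[i % num_servers][name] = value
    return servers


def push_gradient(servers, grads, lr=0.1):
    """One worker pushes its gradients; each server updates only the variables it owns."""
    for server in servers:
        for name in server:
            if name in grads:
                server[name] -= lr * grads[name]


model = {"w1": 1.0, "w2": 2.0, "b": 0.5}

# Mirrored: two workers, each holding a full copy of the model.
replicas = [dict(model), dict(model)]
mirrored_step(replicas, [{"w1": 1.0, "w2": 0.0, "b": 2.0},
                         {"w1": 3.0, "w2": 0.0, "b": 0.0}])

# Parameter servers: the same three variables spread over two servers.
servers = place_on_servers(model, num_servers=2)
push_gradient(servers, {"w1": 1.0, "w2": 0.0, "b": 2.0})
```

In the mirrored case every replica ends up with identical values because all of them apply the same averaged gradient, which is why the workers have to exchange gradients on every step. In the parameter-server case there is a single copy of each variable, and a worker can push an update without coordinating with the other workers.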

This matters because:

In synchronous training, all the workers are kept in sync in terms of training epochs and steps, so if one worker fails or is preempted, the other workers have to wait for it to restart before they can continue. If the failed or preempted worker does not restart for some reason, your workers will keep waiting.
In contrast, in ParameterServerStrategy each worker runs the same code independently, while the parameter servers run a standard server. This means that while each worker synchronously computes a single gradient update across all of its GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) will occur on the first replica of every worker. Hence, unlike in MultiWorkerMirroredStrategy, different workers are not waiting on each other.
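The effect of waiting versus not waiting can be sketched with a small back-of-the-envelope model. This is a simplification under stated assumptions, not a benchmark: the per-batch times are hypothetical, communication cost is ignored, and `sync_batches` / `async_batches` are illustrative helper names.

```python
import math


def sync_batches(step_times, wall_clock):
    """Synchronous training: every global step lasts as long as the slowest worker."""
    steps = int(wall_clock // max(step_times))
    return steps * len(step_times)  # each step consumes one batch per worker


def async_batches(step_times, wall_clock):
    """Asynchronous training: each worker pushes updates at its own pace."""
    return sum(int(wall_clock // t) for t in step_times)


# Hypothetical per-batch times in seconds; math.inf models a worker that never restarts.
scenarios = {
    "all healthy": [1.0, 1.0, 1.0, 1.0],
    "one straggler": [1.0, 1.0, 1.0, 4.0],
    "one worker down": [1.0, 1.0, 1.0, math.inf],
}

results = {
    name: (sync_batches(times, 100.0), async_batches(times, 100.0))
    for name, times in scenarios.items()
}
for name, (sync_count, async_count) in results.items():
    print(f"{name}: sync={sync_count} batches, async={async_count} batches")
```

With identical workers the two modes process the same number of batches in this model. A single slow or missing worker drags the synchronous count down for everyone, while the asynchronous count only loses that worker's share. Note that more asynchronous updates do not automatically mean faster convergence, because asynchronous workers compute gradients on parameters that may already be stale.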

I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.

EDIT: Answers to questions in comments:

So is the only benefit of PSS the fact that it resists better to failing workers than MWMS?

Not exactly. Even if no workers fail under MWMS, the workers still need to stay in sync, so there could be network bottlenecks.

If so, then I imagine it would only be useful when training on many workers, say 20 or more, or else the probability that a worker will fail during training is low (and it can be avoided by saving regular snapshots).

Maybe not; it depends on the situation. Perhaps in your scenario the probability of failure is low. In someone else's scenario it may be higher. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs in the middle of it. To illustrate further (with an oversimplified example): if I have the same number of nodes but they're simply slower, they could take much longer to do a job, and hence there is a greater likelihood of some kind of interruption or failure occurring during the job.
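As a rough worked example of how job length and worker count interact, assume (purely for illustration) that each worker fails independently with the same small probability in every hour. The chance of at least one failure during the job is then 1 - (1 - p)^(workers × hours). The rate below is a made-up number, not a measurement.

```python
def p_any_failure(p_per_worker_hour, num_workers, hours):
    """Probability that at least one worker fails at some point during the job."""
    p_no_failure = (1.0 - p_per_worker_hour) ** (num_workers * hours)
    return 1.0 - p_no_failure


RATE = 0.001  # hypothetical failure probability per worker per hour

for workers, hours in [(4, 10), (4, 100), (20, 10), (20, 100)]:
    print(f"{workers} workers, {hours} h: {p_any_failure(RATE, workers, hours):.1%}")
```

Under this model a small cluster running a long job can be just as exposed as a large cluster running a short one, since only the product of workers and hours matters.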

(and it can be avoided by saving regular snapshots).

Not sure I understand what you mean. If a worker fails and you've saved a snapshot, then you haven't lost data, but the worker still needs to restart. In the interim between failure and restart, the other workers may be waiting.

Isn't there a possible benefit with I/O saturation? If the updates are asynchronous, I/O would be more spread out in time, right? But maybe this benefit is cancelled by the fact that it uses more I/O? Could you please detail this a bit?

I will first try to answer it from a conceptual point of view.

I would say try looking at it from a different angle: in a synchronous operation, you're waiting for something else to finish, and you may be idle until that something gives you what you need. In contrast, in an asynchronous operation you do your own work, and when you need more you ask for it.
There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.

I will now try to answer it from an optimization point of view:

Isn't there a possible benefit with I/O saturation? If the updates are asynchronous, I/O would be more spread out in time, right? But maybe this benefit is cancelled by the fact that it uses more I/O? Could you please detail this a bit?

In a distributed system, your bottleneck could be CPU / GPU, disk, or network. Nowadays networks are really fast, and in some cases faster than disk. Depending on your workers' configuration, CPU / GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.

Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.

EDIT: Additional follow-up questions:

One last thing: in your experience, in what use cases is PSS used? I mean, both PSS and MWMS are obviously for use with large datasets (or else a single machine would suffice), but what about the model? Would PSS be better for larger models? And in your experience, is MWMS more frequently used?

I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer “spot instances” / “preemptible instances”, which are heavily discounted servers that can be taken away at any moment. In such a scenario it may make sense to use PSS: even though machine failure is unlikely, an instance may simply be taken away without notice because it is a “spot instance”. If you use PSS, the performance impact of servers disappearing may not be as large as with MWMS. If you’re using dedicated instances, they will not be taken away, and the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of performance optimisations or plugin architecture.

Source: https://stackoverflow.com/questions/63374495/when-is-tensorflows-parameterserverstrategy-preferable-to-its-multiworkermirror

Tags

tensorflow

tensorflow2.0

distributed-computing