Using Multiple GPUs outside of training in PyTorch


Question


I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such a calculation across two GPUs. The code follows:

def ac_distance(layer):
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total

Here layer is an instance of nn.Conv2d, and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all the calculations in ac_distance are done on only one GPU, even though training uses both.
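For context, the distance helper is not shown in the question. A minimal sketch of one plausible reading of "sum of the differences between p and q", kept differentiable so the result can stay in the graph, might look like this:

import torch

def distance(p, q):
    # Hypothetical helper, not part of the question: element-wise
    # difference between two kernels, summed over all entries.
    # Uses only differentiable ops, so the accumulated total keeps its graph.
    return (p - q).abs().sum()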


Answer 1:


Parallelism while training neural networks can be achieved in two ways.

  1. Data Parallelism - Split a large batch into two halves and perform the same set of operations on each half, one half per GPU
  2. Model Parallelism - Split the computations and run them on different GPUs

As you have asked in the question, you would like to split the calculation, which falls into the second category. There are no out-of-the-box ways to achieve model parallelism, but PyTorch provides primitives for parallel processing through the torch.distributed package. Its tutorial goes through the details of the package comprehensively, and you can cook up an approach to achieve the model parallelism you need; a rough sketch for your specific computation is given below.
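For example, one way to split ac_distance itself across two GPUs is to evaluate half of the pairwise terms on each device and move the partial sum back at the end. This is only a minimal sketch under assumptions not stated in the question: layer.weight already lives on cuda:0, a second GPU (cuda:1) can hold a copy of the weights, and distance is the same helper used above and contains only differentiable tensor ops, so gradients flow back through the device-to-device copy.

import torch

def ac_distance_two_gpus(layer):
    # Hypothetical model-parallel variant of ac_distance: half of the
    # pairwise terms are computed on cuda:0 and half on cuda:1, so each
    # GPU holds roughly half of the intermediate autograd graph.
    w0 = layer.weight                  # original parameters on cuda:0
    w1 = w0.to("cuda:1")               # differentiable copy on cuda:1
    n = w0.shape[0]
    half = n // 2

    total0 = torch.zeros((), device="cuda:0")
    total1 = torch.zeros((), device="cuda:1")
    for i in range(n):
        for j in range(n):
            if i < half:
                total0 = total0 + distance(w0[i], w0[j])
            else:
                total1 = total1 + distance(w1[i], w1[j])

    # Moving the partial sum back is also tracked by autograd, so
    # gradients still reach layer.weight.
    return total0 + total1.to("cuda:0")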

However, model parallelism can be very complex to achieve. The more common approach is data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you run the same model on two different GPUs, and each large batch is split into two smaller chunks. With DataParallel the gradients are gathered onto a single GPU and the optimization step happens there, whereas DistributedDataParallel uses multiprocessing so that each GPU has its own process and the optimization step runs in parallel across GPUs.
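As a concrete illustration of the data-parallel route, here is a minimal, self-contained sketch of nn.DataParallel; the model, tensor shapes, and loss below are placeholders, not taken from the question. Note that only work done inside forward() is replicated across GPUs, which is why a helper such as ac_distance called outside forward() stays on one device.

import torch
import torch.nn as nn

# Placeholder model; assumes at least two visible GPUs.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

inputs = torch.randn(32, 3, 224, 224).cuda()
targets = torch.randint(0, 10, (32,)).cuda()

outputs = model(inputs)                  # each GPU processes a 16-sample chunk
loss = nn.functional.cross_entropy(outputs, targets)
loss.backward()                          # gradients are gathered on cuda:0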

In your case, if you use DataParallel the training computation would still take place on two different GPUs. If you notice imbalanced GPU usage, it could be because of the way DataParallel is designed. You can try DistributedDataParallel, which according to the docs is the fastest way to train on multiple GPUs.
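For completeness, here is a minimal DistributedDataParallel sketch with one process per GPU, assuming two local GPUs, the NCCL backend, and a free port on localhost; the model and data are placeholders.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; each process builds its own replica and
    # gradients are all-reduced across processes during backward().
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(128, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(16, 128).cuda(rank)          # this rank's share of the batch
    y = torch.randint(0, 10, (16,)).cuda(rank)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                              # all-reduce happens here
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)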

There are other ways to handle very large batches too. This article goes through them in detail, and I'm sure it would be helpful. A few important points:

  • Do gradient accumulation for larger batches (see the sketch after this list)
  • Use DataParallel
  • If that doesn't suffice, go with DistributedDataParallel
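
A minimal sketch of the gradient-accumulation point above, with a placeholder model and micro-batch size: the effective batch size is the micro-batch size times accum_steps, while peak memory stays at the micro-batch level.

import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                                  # 4 micro-batches per update

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(8, 128).cuda()               # one micro-batch
    y = torch.randint(0, 10, (8,)).cuda()
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()              # scale so gradients average
optimizer.step()                                 # one update for the large batch
optimizer.zero_grad()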


Source: https://stackoverflow.com/questions/55624102/using-multiple-gpus-outside-of-training-in-pytorch
