When we run the NVIDIA code for training on ImageNet with mixed precision, using the following command:
$ python -m torch.distributed.launch --nproc_per_node=n main_amp