I had a quick look through the forums and I don't think this question has been asked already.
I am currently working with a hybrid MPI/CUDA code written by somebody else.
Apparently, since 2015 it has been possible to annotate MPI calls automatically with NVTX, using the mpi_interceptions.so library, when profiling with nvprof:
https://devblogs.nvidia.com/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/
http://on-demand.gputechconf.com/gtc/2017/presentation/s7495-jain-optimizing-application-performance-cuda-profiling-tools.pdf
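As far as I understand it, such an interception library relies on the standard MPI profiling interface (PMPI): it redefines each MPI function, opens an NVTX range, forwards to the PMPI_ version and closes the range. Below is a minimal sketch of that pattern for MPI_Send only. The USE_REAL_MPI macro and the stand-in definitions in the #else branch are my own additions so that the sketch compiles without MPI or CUDA installed; they are not part of any NVIDIA library.

```c
#ifdef USE_REAL_MPI
#include <mpi.h>          /* real MPI_Send / PMPI_Send declarations */
#include <nvToolsExt.h>   /* real nvtxRangePushA / nvtxRangePop */
#else
/* Hypothetical stand-ins (assumptions, for illustration only). */
typedef int MPI_Datatype;
typedef int MPI_Comm;
#define MPI_SUCCESS 0

static int range_depth = 0;   /* number of currently open ranges */
static int ranges_opened = 0; /* total number of ranges ever opened */

/* Mimics nvtxRangePushA: opens a range, returns its zero-based level. */
static int nvtxRangePushA(const char *message)
{
    (void)message;
    ranges_opened++;
    return range_depth++;
}

/* Mimics nvtxRangePop: closes the innermost range, returns its level. */
static int nvtxRangePop(void)
{
    return --range_depth;
}

/* Mimics PMPI_Send: does nothing and reports success. */
static int PMPI_Send(const void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm)
{
    (void)buf; (void)count; (void)datatype;
    (void)dest; (void)tag; (void)comm;
    return MPI_SUCCESS;
}
#endif

/* The wrapper: same signature as MPI_Send, so it shadows the library's
 * version when this file is linked or preloaded ahead of the MPI library. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    nvtxRangePushA("MPI_Send");                 /* open a named range */
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    nvtxRangePop();                             /* close it again */
    return rc;
}
```

With -DUSE_REAL_MPI, I would expect this to be compiled as a shared library linked against libnvToolsExt and loaded through LD_PRELOAD when launching the application under nvprof, with one such wrapper per MPI function of interest. I have not checked the exact build flags.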
TAU still does not support distributed deep learning, according to this presentation:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7684-allen-malony-performance-analysis-of-cuda-deep-learning-networks-using-tau.pdf