How does MPI AllReduce work?

Communication among GPUs is one of the many challenges when training distributed deep learning models in a large-scale environment. The latency of exchanging gradients over all GPUs is a severe bottleneck in data-parallel synchronized distributed deep learning.

How is the communication performed in distributed deep learning, and why is it so time-consuming? NVIDIA's NCCL, the collective communication library commonly used for this purpose, achieves far better communication performance than MPI, the de-facto standard communication library in the HPC community; without it, the ImageNet-in-minutes training feat could not have been achieved [2]. Since NCCL is not an open source library, we tried to understand its high performance by developing and optimizing an experimental AllReduce library. AllReduce is an operation that reduces the target arrays of all processes into a single array and returns the resulting array to every process.
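To make those semantics concrete, here is a minimal sketch using mpi4py and NumPy (the script name and variable names are illustrative, not from any particular library): each process contributes its own array, and every process receives the element-wise sum.

```python
# Minimal AllReduce sketch with mpi4py and NumPy.
# Run with, e.g., `mpiexec -n 4 python allreduce_demo.py` (file name is illustrative).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process owns a local array, e.g. the gradients it just computed.
local = np.arange(4, dtype=np.float64) + rank

# AllReduce: combine the arrays of all processes element-wise (here with SUM)
# and hand the same resulting array back to every process.
result = np.empty_like(local)
comm.Allreduce(local, result, op=MPI.SUM)

print(f"rank {rank}: {result}")  # identical output on every rank
```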

Let P be the total number of processes. In distributed deep learning, the SUM operation is used to compute the mean of the gradients, so in the rest of this blog post we assume that the reduction operation is SUM. There are several algorithms that implement the operation. A straightforward one is to select one process as a master, gather all arrays into the master, perform the reduction locally on the master, and then distribute the resulting array to the rest of the processes.
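A rough sketch of this master-based approach, again assuming mpi4py and NumPy (the Gather/Bcast collectives here stand in for whatever point-to-point scheme an actual implementation might use):

```python
# Naive "master" AllReduce: gather everything on one process, reduce locally,
# then broadcast the result back. A sketch only -- not how real libraries do it.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()

local = np.arange(4, dtype=np.float64) + rank
result = np.empty_like(local)

# The master (rank 0) receives P arrays, so its memory use and incoming
# traffic grow linearly with the number of processes.
gathered = np.empty((P, local.size), dtype=local.dtype) if rank == 0 else None
comm.Gather(local, gathered, root=0)

# The master reduces locally, then distributes the result to everyone.
if rank == 0:
    result[:] = gathered.sum(axis=0)
comm.Bcast(result, root=0)
```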

Although this algorithm is simple and easy to implement, it is not scalable. The master process is a performance bottleneck because its communication and reduction costs increase in proportion to the total number of processes.

Faster and more scalable algorithms have been proposed. They eliminate the bottleneck by carefully distributing the computation and communication over the participating processes. We will focus on the Ring-AllReduce algorithm in this blog post. This algorithm is also employed by NCCL [5] and baidu-allreduce [6].

Let us assume that P is the total number of processes and that each process is uniquely identified by a number between 1 and P. The processes are arranged in a logical ring, and each process divides its own array into P chunks; let chunk[p] be the p-th chunk. Now focus on process p. It sends chunk[p] to the next process in the ring while simultaneously receiving chunk[p-1] from the previous process; it then reduces the received chunk into its own chunk[p-1] and sends that reduced chunk onward in the next step. By repeating these receive-reduce-send steps P-1 times, each process ends up holding a different, fully reduced portion of the resulting array. A final all-gather-like phase of another P-1 steps circulates the reduced chunks around the ring so that every process obtains the complete result.
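Below is an illustrative mpi4py sketch of the ring algorithm, not the NCCL or baidu-allreduce implementation: ranks are 0-based (unlike the 1..P numbering above), the array length is assumed to be divisible by P, and communication is not overlapped with computation as a tuned library would do.

```python
# Ring-AllReduce sketch (SUM) with mpi4py and NumPy.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % P, (rank - 1) % P

data = np.arange(4 * P, dtype=np.float64) + rank   # local array to be reduced
chunks = np.split(data, P)                         # P equal chunks (views into data)
recv = np.empty_like(chunks[0])

# Phase 1: reduce-scatter. After P-1 receive-reduce-send steps,
# rank r holds the fully reduced chunk (r + 1) % P.
for step in range(P - 1):
    send_idx = (rank - step) % P
    recv_idx = (rank - step - 1) % P
    comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
    chunks[recv_idx] += recv

# Phase 2: all-gather. Circulate the reduced chunks so every rank gets them all.
for step in range(P - 1):
    send_idx = (rank + 1 - step) % P
    recv_idx = (rank - step) % P
    comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
    chunks[recv_idx][:] = recv

result = np.concatenate(chunks)   # identical on every rank
```

One reason this ring scheme scales well is that each process sends roughly 2 × (P-1)/P times the array size in total over both phases, which is almost independent of the number of processes.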

Reductions are useful beyond averaging gradients; as a concrete example, consider computing the standard deviation of numbers spread across processes. For those who may have forgotten, standard deviation is a measure of the dispersion of numbers from their mean: a lower standard deviation means the numbers are closer together, and vice versa for higher standard deviations.

To find the standard deviation, one must first compute the mean of all the numbers. Once the mean is known, the squared differences from the mean are summed, and the square root of the average of those squared differences is the final result. From this description we already know there will be at least two sums over all of the numbers, which translates into two reductions.

The root process can then compute the standard deviation by taking the square root of the mean of the global squared differences.
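A sketch of this two-reduction recipe in mpi4py (the 100-numbers-per-process setup is only an illustrative assumption): an allreduce gives every process the global mean, and a reduce collects the summed squared differences on the root.

```python
# Standard deviation with two reductions: an allreduce for the global mean,
# then a reduce of the squared differences onto the root process.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()

local = np.random.rand(100)      # each process holds 100 numbers (illustrative)
n_total = 100 * P

# Reduction 1: every process needs the global mean to form its squared differences.
mean = comm.allreduce(local.sum(), op=MPI.SUM) / n_total

# Reduction 2: sum the squared differences from the mean onto the root.
local_sq_diff = ((local - mean) ** 2).sum()
global_sq_diff = comm.reduce(local_sq_diff, op=MPI.SUM, root=0)

if rank == 0:
    stddev = np.sqrt(global_sq_diff / n_total)
    print(f"standard deviation: {stddev:.6f}")
```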

Reduce itself is a classic concept from functional programming: a set of values is collapsed into a smaller set (here, a single value or array) by repeatedly applying a function such as SUM.

A related point of confusion in mpi4py is the difference between the lowercase allreduce and the uppercase Allreduce. One reader wondered whether using allreduce (lowercase), where the Python tutorial uses Allreduce (uppercase), was the source of their problem. It was not: using Allreduce led to the exact same problem, and allreduce had been chosen only to make the example shorter.
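For reference, the two interfaces can be used as sketched below: the lowercase allreduce handles generic (picklable) Python objects and returns the result, while the uppercase Allreduce operates on buffer-like objects such as NumPy arrays and fills a receive buffer in place. For a simple sum they produce the same values; the buffer-based form is typically faster for large arrays.

```python
# Lowercase vs. uppercase in mpi4py: allreduce works on generic (picklable)
# Python objects and returns the result; Allreduce works on buffer-like objects
# such as NumPy arrays and fills a receive buffer in place.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Lowercase: pass any picklable object, receive the reduced value.
total = comm.allreduce(rank, op=MPI.SUM)      # sum of all ranks

# Uppercase: pass explicit send/receive buffers.
send = np.array([float(rank)])
recv = np.empty_like(send)
comm.Allreduce(send, recv, op=MPI.SUM)        # recv[0] equals total on every rank
```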


