Collective operations are building blocks for interaction patterns that are often used in SPMD (single program, multiple data) algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations. A realization of the collective operations is provided by the Message Passing Interface (MPI) (''Intercommunicator Collective Operations'', The Message Passing Interface (MPI) standard, chapter 7.3.1, Mathematics and Computer Science Division, Argonne National Laboratory).


Definitions

In all asymptotic runtime functions, we denote the latency \alpha, the communication cost per word \beta, the number of processing units p and the input size per node n. Sending a single message of size n between two processing units thus costs \alpha + \beta n in this model. In cases where we have initial messages on more than one node, we assume that all local messages are of the same size. To address individual processing units we use p_i \in \{p_0, p_1, \dots, p_{p-1}\}. If we do not have an equal distribution, i.e. node p_i has a message of size n_i, we get an upper bound for the runtime by setting n = \max(n_0, n_1, \dots, n_{p-1}). A distributed memory model is assumed. The concepts are similar for the shared memory model. However, shared memory systems can provide hardware support for some operations like broadcast (§ Broadcast), which allows convenient concurrent read (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 395). Thus, new algorithmic possibilities can become available.


Broadcast

The broadcast pattern (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 396-401) is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values. Broadcast can be interpreted as an inverse version of the reduce pattern (§ Reduce). Initially only the root r with id 0 stores message m. During broadcast, m is sent to the remaining processing units, so that eventually m is available to all processing units. Since an implementation by means of a sequential for-loop with p-1 iterations becomes a bottleneck, divide-and-conquer approaches are common. One possibility is to utilize a binomial tree structure with the requirement that p has to be a power of two. When a processing unit is responsible for sending m to processing units i..j, it sends m to processing unit \left\lceil (i+j)/2 \right\rceil and delegates responsibility for the processing units \left\lceil (i+j)/2 \right\rceil .. j to it, while its own responsibility is cut down to i..\left\lceil (i+j)/2 \right\rceil - 1. Binomial trees have a problem with long messages m: the receiving unit can only propagate the message to other units after it has received the whole message. In the meantime, the communication network is not utilized. Therefore, pipelining on binary trees is used, where m is split into an array of k packets of size \left\lceil n/k \right\rceil. The packets are then broadcast one after another, so that data is distributed quickly in the communication network. Pipelined broadcast on a balanced binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
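
The MPI standard realizes this pattern as MPI_Bcast. A minimal sketch in C (the message value and size are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int m = 0;
        if (rank == 0)
            m = 42;                 /* root with id 0 initially holds m */
        /* distribute m from rank 0; afterwards every rank holds m */
        MPI_Bcast(&m, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d now holds m = %d\n", rank, m);
        MPI_Finalize();
        return 0;
    }

Which tree the library uses internally (binomial, pipelined binary tree, or something else) is implementation-dependent; the call only specifies the pattern.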


Reduce

The reduce pattern (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 402-403) is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator. Given p processing units, message m_i is on processing unit p_i initially. All m_i are aggregated by \otimes and the result is eventually stored on p_0. The reduction operator \otimes must be at least associative. Some algorithms require a commutative operator with a neutral element. Operators like sum, min and max are common. Implementation considerations are similar to broadcast (§ Broadcast). For pipelining on binary trees, the message must be representable as a vector of smaller objects for component-wise reduction. Pipelined reduce on a balanced binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
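
MPI exposes this pattern as MPI_Reduce. A minimal sketch in C, summing one integer per rank onto rank 0 (the local values are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int m_i = rank + 1;   /* local message m_i */
        int result = 0;       /* valid only on the root after the call */
        /* aggregate all m_i with the associative operator + onto rank 0 */
        MPI_Reduce(&m_i, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d ranks: %d\n", p, result);
        MPI_Finalize();
        return 0;
    }

For component-wise reduction of a longer message, the count argument is simply set to the vector length, matching the representability requirement described above.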


All-Reduce

The all-reduce pattern (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 403-404), also called allreduce, is used if the result of a reduce operation (§ Reduce) must be distributed to all processing units. Given p processing units, message m_i is on processing unit p_i initially. All m_i are aggregated by an operator \otimes and the result is eventually stored on all p_i. Analogous to the reduce operation, the operator \otimes must be at least associative. All-reduce can be interpreted as a reduce operation with a subsequent broadcast (§ Broadcast). For long messages a corresponding implementation is suitable, whereas for short messages the latency can be reduced by using a hypercube topology, if p is a power of two. All-reduce is possible in \mathcal{O}(\alpha \log p + \beta n), since reduce and broadcast are possible in \mathcal{O}(\alpha \log p + \beta n) with pipelining on balanced binary trees.
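
In MPI this corresponds to MPI_Allreduce, which behaves like MPI_Reduce followed by a broadcast but takes no root argument, since every rank receives the result. A minimal sketch in C (local values are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int m_i = rank + 1;   /* local message m_i */
        int result = 0;
        /* aggregate all m_i by +; every rank obtains the global sum */
        MPI_Allreduce(&m_i, &result, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d sees global sum %d\n", rank, result);
        MPI_Finalize();
        return 0;
    }

Whether the library internally uses reduce-plus-broadcast or a hypercube-style exchange is implementation-dependent.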


Prefix-Sum/Scan

The prefix-sum or scan operation (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 404-406) is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units. It can be seen as a generalization of the reduce operation (§ Reduce). Given p processing units, message m_i is on processing unit p_i. The operator \otimes must be at least associative, whereas some algorithms also require a commutative operator and a neutral element. Common operators are sum, min and max. Eventually processing unit p_i stores the prefix sum \bigotimes_{i' \le i} m_{i'}. In the case of the so-called exclusive prefix sum, processing unit p_i stores the prefix sum \bigotimes_{i' < i} m_{i'}. Some algorithms require storing the overall sum at each processing unit in addition to the prefix sums. For short messages, this can be achieved with a hypercube topology if p is a power of two. For long messages, the hypercube topology is not suitable, since all processing units are active in every step and therefore pipelining can't be used. A binary tree topology is better suited for arbitrary p and long messages. Prefix-sum on a binary tree can be implemented with an upward and a downward phase. In the upward phase reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children. With this approach pipelining is possible, because the operations are equal to reduction (§ Reduce) and broadcast (§ Broadcast). Pipelined prefix sum on a binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
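
MPI provides both variants: MPI_Scan computes the inclusive prefix sum and MPI_Exscan the exclusive one. A minimal sketch in C (local values are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int m_i = rank + 1;
        int incl = 0, excl = 0;
        /* inclusive: rank i receives m_0 + ... + m_i */
        MPI_Scan(&m_i, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        /* exclusive: rank i receives m_0 + ... + m_{i-1};
           the receive buffer on rank 0 is left undefined by the standard */
        MPI_Exscan(&m_i, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: inclusive %d, exclusive %d\n", rank, incl, excl);
        MPI_Finalize();
        return 0;
    }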


Barrier

The barrier (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 408) as a collective operation is a generalization of the concept of a barrier that can be used in distributed computing. When a processing unit calls barrier, it waits until all other processing units have called barrier as well. Barrier is thus used to achieve global synchronization in distributed computing. One way to implement barrier is to call all-reduce (§ All-Reduce) with an empty or dummy operand. We know the runtime of all-reduce is \mathcal{O}(\alpha \log p + \beta n). Using a dummy operand reduces the size n to a constant factor and leads to a runtime of \mathcal{O}(\alpha \log p).
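
In MPI the operation is MPI_Barrier. A minimal sketch in C:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("rank %d: before the barrier\n", rank);
        /* no rank continues until all ranks have reached this call */
        MPI_Barrier(MPI_COMM_WORLD);
        printf("rank %d: after the barrier\n", rank);

        MPI_Finalize();
        return 0;
    }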


Gather

The gather communication pattern (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 412-413) is used to store data from all processing units on a single processing unit. Given p processing units, message m_i is on processing unit p_i. For a fixed processing unit p_j, we want to store the message m_1 \cdot m_2 \cdot \ldots \cdot m_p on p_j. Gather can be thought of as a reduce operation (§ Reduce) that uses the concatenation operator. This works because concatenation is associative. By using the same binomial tree reduction algorithm we get a runtime of \mathcal{O}(\alpha \log p + \beta p n). We see that the asymptotic runtime is similar to the asymptotic runtime of reduce, \mathcal{O}(\alpha \log p + \beta n), but with an additional factor p in the term \beta n. This additional factor is due to the message size increasing in each step as messages get concatenated. Compare this to reduce, where the message size is constant for operators like min.
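
In MPI the pattern is MPI_Gather. A minimal sketch in C, concatenating one integer per rank onto rank 0 (values are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int m_i = 10 * rank;           /* local message m_i */
        int *all = NULL;
        if (rank == 0)                 /* only the root needs a receive buffer */
            all = malloc(p * sizeof(int));
        /* store the concatenation of all m_i on rank 0, ordered by rank */
        MPI_Gather(&m_i, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < p; i++)
                printf("m_%d = %d\n", i, all[i]);
            free(all);
        }
        MPI_Finalize();
        return 0;
    }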


All-Gather

The all-gather communication pattern is used to collect data from all processing units and to store the collected data on all processing units. Given p processing units p_i, with message m_i initially stored on p_i, we want to store the message m_1 \cdot m_2 \cdot \ldots \cdot m_p on each p_j. It can be thought of in multiple ways. The first is as an all-reduce operation (§ All-Reduce) with concatenation as the operator, in the same way that gather can be represented by reduce. The second is as a gather operation followed by a broadcast of the new message of size pn. With this we see that all-gather is possible in \mathcal{O}(\alpha \log p + \beta p n).
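
In MPI this is MPI_Allgather; the call has the same shape as MPI_Gather but takes no root, since every rank receives the concatenation. A minimal sketch in C:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int m_i = 10 * rank;                  /* local message m_i */
        int *all = malloc(p * sizeof(int));   /* every rank receives pn data */
        MPI_Allgather(&m_i, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

        printf("rank %d sees m_0 = %d ... m_%d = %d\n",
               rank, all[0], p - 1, all[p - 1]);
        free(all);
        MPI_Finalize();
        return 0;
    }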


Scatter

The scatter communication pattern (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 413) is used to distribute data from one processing unit to all the processing units. It differs from broadcast in that it does not send the same message to all processing units. Instead it splits the message and delivers one part of it to each processing unit. Given p processing units p_i and a fixed processing unit p_j that holds the message m = m_1 \cdot m_2 \cdot \ldots \cdot m_p, we want to transport the message m_i onto p_i. The same implementation concerns as for gather (§ Gather) apply. This leads to an optimal runtime in \mathcal{O}(\alpha \log p + \beta p n).
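
In MPI the pattern is MPI_Scatter, the converse of MPI_Gather. A minimal sketch in C (the message parts are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int *m = NULL;
        if (rank == 0) {          /* the root holds m = m_1 . m_2 . ... . m_p */
            m = malloc(p * sizeof(int));
            for (int i = 0; i < p; i++)
                m[i] = 10 * i;
        }
        int m_i;
        /* deliver part i of the root's message to rank i */
        MPI_Scatter(m, 1, MPI_INT, &m_i, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d received its part %d\n", rank, m_i);
        free(m);                  /* free(NULL) is a no-op on non-root ranks */
        MPI_Finalize();
        return 0;
    }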


All-to-all

All-to-all (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 413-418) is the most general communication pattern. For 0 \leq i, j < p, message m_{i,j} is the message that is initially stored on node i and has to be delivered to node j. We can express all communication primitives that do not use operators through all-to-all. For example, broadcast of message m from node p_k is emulated by setting m_{i,j} = m for i = k and setting m_{i,j} empty for i \neq k. Assuming we have a fully connected network, the best possible runtime for all-to-all is in \mathcal{O}(p (\alpha + \beta n)). This is achieved through p rounds of direct message exchange. For p a power of two, in communication round k, node p_i exchanges messages with node p_j where j = i \oplus k. If the message size is small and latency dominates the communication, a hypercube algorithm can be used to distribute the messages in time \mathcal{O}(\log p (\alpha + \beta p n)).
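
In MPI the pattern is MPI_Alltoall. A minimal sketch in C, where each rank sends one distinct integer to every rank (values are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int *send = malloc(p * sizeof(int));
        int *recv = malloc(p * sizeof(int));
        for (int j = 0; j < p; j++)
            send[j] = 100 * rank + j;    /* send[j] is m_{rank,j} */
        /* after the call, recv[i] on rank j holds m_{i,j} */
        MPI_Alltoall(send, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);

        printf("rank %d received m_{0,%d} = %d\n", rank, rank, recv[0]);
        free(send);
        free(recv);
        MPI_Finalize();
        return 0;
    }

The direct-exchange schedule from the text, where round k pairs rank i with rank i \oplus k, is one way an implementation may realize this call on a fully connected network.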


Runtime Overview

This table (Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 394) gives an overview of the best known asymptotic runtimes, assuming we have free choice of network topology. Example topologies we want for optimal runtime are binary tree, binomial tree and hypercube. In practice, we have to adjust to the available physical topologies, e.g. dragonfly, fat tree or grid network; more information can be found under network topology. For each operation, the optimal algorithm can depend on the input size n. For example, broadcast for short messages is best implemented using a binomial tree, whereas for long messages a pipelined communication on a balanced binary tree is optimal. The complexities stated in the table depend on the latency \alpha and the communication cost per word \beta in addition to the number of processing units p and the input message size per node n. The ''# senders'' and ''# receivers'' columns represent the number of senders and receivers that are involved in the operation respectively. The ''# messages'' column lists the number of input messages and the ''Computations?'' column indicates if any computations are done on the messages or if the messages are just delivered without processing. ''Complexity'' gives the asymptotic runtime complexity of an optimal implementation under free choice of topology.

Operation   | # senders | # receivers | # messages | Computations? | Complexity
Broadcast   | 1         | p           | 1          | no            | \mathcal{O}(\alpha \log p + \beta n)
Reduce      | p         | 1           | p          | yes           | \mathcal{O}(\alpha \log p + \beta n)
All-reduce  | p         | p           | p          | yes           | \mathcal{O}(\alpha \log p + \beta n)
Prefix sum  | p         | p           | p          | yes           | \mathcal{O}(\alpha \log p + \beta n)
Barrier     | p         | p           | 0          | no            | \mathcal{O}(\alpha \log p)
Gather      | p         | 1           | p          | no            | \mathcal{O}(\alpha \log p + \beta p n)
All-gather  | p         | p           | p          | no            | \mathcal{O}(\alpha \log p + \beta p n)
Scatter     | 1         | p           | p          | no            | \mathcal{O}(\alpha \log p + \beta p n)
All-to-all  | p         | p           | p^2        | no            | \mathcal{O}(\log p (\alpha + \beta p n)) or \mathcal{O}(p (\alpha + \beta n))


References

Sanders, Peter; Mehlhorn, Kurt; Dietzfelbinger, Martin; Dementiev, Roman (2019). ''Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox''. Springer Nature Switzerland AG. ISBN 978-3-030-25208-3.