In computer science, the reduction operator is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result. Reduction operators are associative and often (but not necessarily) commutative [Solihin; Chandra, p. 59]. The reduction of sets of elements is an integral part of programming models such as MapReduce, where a reduction operator is applied (mapped) to all elements before they are reduced. Other parallel algorithms use reduction operators as primary operations to solve more complex problems. Many reduction operators can be used for broadcasting to distribute data to all processors.

Theory

A reduction operator can help break down a task into various partial tasks by calculating partial results which can be used to obtain a final result. It allows certain serial operations to be performed in parallel and the number of steps required for those operations to be reduced. A reduction operator stores the result of the partial tasks into a private copy of the variable. These private copies are then merged into a shared copy at the end.

An operator is a reduction operator if:
* It can reduce an array to a single scalar value.
* The final result is obtainable from the results of the partial tasks that were created.

These two requirements are satisfied for commutative and associative operators that are applied to all array elements. Some operators which satisfy these requirements are addition, multiplication, and some logical operators (and, or, etc.).

A reduction operator $\oplus$ can be applied in constant time on an input set $V = \left\{ v_0, v_1, \dots, v_{p-1} \right\}$ of $p$ vectors with $m$ elements each, where $v_i = \begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1} \end{pmatrix}$. The result $r$ of the operation is the combination of the elements

$$r = \begin{pmatrix} e_0^0 \oplus e_1^0 \oplus \dots \oplus e_{p-1}^0 \\ \vdots \\ e_0^{m-1} \oplus e_1^{m-1} \oplus \dots \oplus e_{p-1}^{m-1} \end{pmatrix} = \begin{pmatrix} \bigoplus_{i=0}^{p-1} e_i^0 \\ \vdots \\ \bigoplus_{i=0}^{p-1} e_i^{m-1} \end{pmatrix}$$

and has to be stored at a specified root processor at the end of the execution. If the result $r$ has to be available at every processor after the computation has finished, the operation is often called Allreduce. An optimal sequential linear-time algorithm for reduction can apply the operator successively from front to back, always replacing two vectors with the result of the operation applied to all their elements, thus creating an instance that has one vector less. It needs $(p-1)\cdot m$ steps until only $r$ is left. Sequential algorithms cannot perform better than linear time, but parallel algorithms leave some room for optimization.
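The front-to-back sequential reduction described above can be sketched in a few lines of Python. This is only an illustration: the function name `sequential_reduce` and the sample vectors are assumptions, not part of the original text. The step counter confirms the $(p-1)\cdot m$ operator applications.

```python
from operator import add

def sequential_reduce(vectors, op):
    """Reduce p vectors of m elements each, front to back, element-wise.

    Hypothetical helper for illustration; returns the result vector and
    the number of operator applications performed.
    """
    result = list(vectors[0])
    steps = 0
    for vec in vectors[1:]:
        # Fold the next vector into the running result, one element at a time.
        for j, element in enumerate(vec):
            result[j] = op(result[j], element)
            steps += 1
    return result, steps

vectors = [[1, 2], [3, 4], [5, 6]]  # p = 3 vectors, m = 2 elements each
r, steps = sequential_reduce(vectors, add)
print(r, steps)  # [9, 12] 4, i.e. (p - 1) * m = 4 steps
```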

Example

Suppose we have an array $[2, 3, 5, 1, 7, 6, 8, 4]$. The sum of this array can be computed serially by sequentially reducing the array into a single sum using the '+' operator. Starting the summation from the beginning of the array yields:

$$\Bigg( \bigg( \Big( \big(\, (\, (2 + 3) + 5 ) + 1 \big) + 7\Big) + 6 \bigg) + 8\Bigg) + 4 = 36$$

Since '+' is both commutative and associative, it is a reduction operator. Therefore this reduction can be performed in parallel using several cores, where each core computes the sum of a subset of the array, and the reduction operator merges the results. Using a binary tree reduction would allow 4 cores to compute $(2 + 3)$, $(5 + 1)$, $(7 + 6)$, and $(8 + 4)$. Then two cores can compute $(5 + 6)$ and $(13 + 12)$, and lastly a single core computes $(11 + 25) = 36$. So a total of 4 cores can be used to compute the sum in $\log_2 8 = 3$ steps instead of the $7$ steps required for the serial version. This parallel binary tree technique computes $\big(\,(2 + 3) + (5 + 1)\,\big) + \big(\,(7 + 6) + (8 + 4)\,\big)$. The result is the same, but only because of the associativity of the reduction operator. The commutativity of the reduction operator would be important if there were a master core distributing work to several processors, since then the results could arrive back at the master processor in any order. The property of commutativity guarantees that the result will be the same.
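The binary tree scheme from this example can be simulated level by level. The sketch below is a sequential stand-in for the parallel cores, and the name `tree_reduce` is an assumption made for illustration. Each pass of the loop corresponds to one parallel step, and neighbours are combined in their original order, so only associativity is relied upon.

```python
from operator import add

def tree_reduce(values, op):
    """Pairwise (binary tree) reduction.

    Returns the reduced value and the number of parallel steps (tree levels).
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Each pair could be handled by a separate core in the same step.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired element moves up unchanged
        level = nxt
        steps += 1
    return level[0], steps

total, steps = tree_reduce([2, 3, 5, 1, 7, 6, 8, 4], add)
print(total, steps)  # 36 3
```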

Nonexample

Matrix multiplication is not a reduction operator in this sense, since the operation is not commutative. If processes were allowed to return their matrix multiplication results in any order to the master process, the final result that the master computes would likely be incorrect if the results arrived out of order. However, matrix multiplication is associative, so the result would be correct as long as the proper ordering were enforced, as in the binary tree reduction technique.
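A small check with 2x2 matrices makes the distinction concrete. The helper `matmul` and the matrices `A` to `D` below are arbitrary choices for illustration. Regrouping the product as a tree leaves the result unchanged, while swapping two operands does not.

```python
from functools import reduce

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

A = ((1, 1), (0, 1))
B = ((1, 0), (1, 1))
C = ((2, 0), (0, 3))
D = ((0, 1), (1, 0))

in_order = reduce(matmul, [A, B, C, D])          # ((A B) C) D
tree_order = matmul(matmul(A, B), matmul(C, D))  # (A B) (C D), same operand order
shuffled = reduce(matmul, [B, A, C, D])          # operands arrive out of order

print(in_order == tree_order)  # True: associativity allows regrouping
print(in_order == shuffled)    # False: reordering changes the product
```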

Algorithms

Binomial tree algorithms

Regarding parallel algorithms, there are two main models of parallel computation: the parallel random access machine (PRAM), an extension of the RAM with shared memory between processing units, and the bulk synchronous parallel (BSP) computer, which takes communication and synchronization into account. The two models have different implications for the time complexity, so two algorithms are shown.

PRAM-algorithm

This algorithm represents a widespread method to handle inputs where $p$ is a power of two. The reverse procedure is often used for broadcasting elements.

: for $k \gets 0$ to $\lceil\log_2 p\rceil - 1$ do
:: for $i \gets 0$ to $p - 1$ do in parallel
::: if $p_i$ is active then
:::: if bit $k$ of $i$ is set then
::::: set $p_i$ to inactive
:::: else if $i + 2^k < p$
::::: $x_i \gets x_i \oplus^\star x_{i+2^k}$

The binary operator for vectors is defined element-wise such that

$$\begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1} \end{pmatrix} \oplus^\star \begin{pmatrix} e_j^0 \\ \vdots \\ e_j^{m-1} \end{pmatrix} = \begin{pmatrix} e_i^0 \oplus e_j^0 \\ \vdots \\ e_i^{m-1} \oplus e_j^{m-1} \end{pmatrix}.$$

The algorithm further assumes that in the beginning $x_i = v_i$ for all $i$, that $p$ is a power of two, and that the processing units $p_0, p_1, \dots, p_{p-1}$ are used. In every iteration, half of the active processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line takes place. The eight input elements are located at the bottom, and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor $p_i$ evaluates the given operator on the element $x_i$ it is currently holding and $x_j$, where $j$ is the minimal index fulfilling $j > i$ such that $p_j$ becomes an inactive processor in the current step. $x_i$ and $x_j$ are not necessarily elements of the input set $V$, as the fields are overwritten and reused for previously evaluated expressions.

To coordinate the roles of the processing units in each step without causing additional communication between them, the algorithm uses the fact that the processing units are indexed with numbers from $0$ to $p-1$. Each processor looks at the $k$-th least significant bit of its index: if the bit is set, the processor becomes inactive; otherwise it applies the operator to its own element and the element at index $i + 2^k$, which differs from $i$ only in that bit. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.

Only $p_0$ holds the result in the end, therefore it is the root processor. For an Allreduce operation the result has to be distributed, which can be done by appending a broadcast from $p_0$. Furthermore, the number $p$ of processors is restricted to be a power of two. This restriction can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored for this use case.
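The PRAM procedure can be mimicked with a sequential simulation in which the inner loop stands in for the parallel "for all $i$" step. This is a sketch under that assumption, and the name `binomial_tree_reduce` is hypothetical. The simulation is safe because a unit that is read in round $k$ has bit $k$ set and therefore never writes in that round.

```python
from operator import add

def binomial_tree_reduce(vectors, op):
    """Sequentially simulate the binomial-tree reduction; x[0] holds the result."""
    p = len(vectors)
    x = [list(v) for v in vectors]       # x_i = v_i at the start
    active = [True] * p
    rounds = (p - 1).bit_length()        # equals ceil(log2(p)) for p >= 1
    for k in range(rounds):
        for i in range(p):               # stands in for "do in parallel"
            if not active[i]:
                continue
            if (i >> k) & 1:             # bit k of i is set: retire
                active[i] = False
            elif i + 2**k < p:           # combine element-wise with the partner
                x[i] = [op(a, b) for a, b in zip(x[i], x[i + 2**k])]
    return x[0], rounds

inputs = [[2], [3], [5], [1], [7], [6], [8], [4]]   # p = 8, m = 1
result, rounds = binomial_tree_reduce(inputs, add)
print(result, rounds)  # [36] 3
```

Because of the `i + 2**k < p` guard, this simulation also returns the correct sum when `p` is not a power of two, although the analysis in the text assumes a power of two.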

Runtime analysis

The main loop is executed $\lceil\log_2 p\rceil$ times, and the time needed for the part done in parallel is in $\mathcal{O}(m)$, as a processing unit either combines two vectors or becomes inactive. Thus the parallel time $T(p, m)$ for the PRAM is $T(p, m) = \mathcal{O}(\log(p) \cdot m)$. The strategy for handling read and write conflicts can be chosen as restrictively as exclusive read and exclusive write (EREW). The speedup $S(p, m)$ of the algorithm is $S(p, m) \in \mathcal{O}\left(\frac{T_\text{seq}}{T(p, m)}\right) = \mathcal{O}\left(\frac{p}{\log(p)}\right)$, where $T_\text{seq} \in \mathcal{O}(p \cdot m)$ is the sequential runtime, and therefore the efficiency is $E(p, m) \in \mathcal{O}\left(\frac{S(p, m)}{p}\right) = \mathcal{O}\left(\frac{1}{\log(p)}\right)$. The efficiency suffers because half of the active processing units become inactive after each step, so $\frac{p}{2^i}$ units are active in step $i$.

Distributed memory algorithm

In contrast to the PRAM algorithm, in the distributed memory model memory is not shared between processing units, so data has to be exchanged explicitly between them, as can be seen in the following algorithm.

: for $k \gets 0$ to $\lceil\log_2 p\rceil - 1$ do
:: for $i \gets 0$ to $p - 1$ do in parallel
::: if $p_i$ is active then
:::: if bit $k$ of $i$ is set then
::::: send $x_i$ to $p_{i-2^k}$
::::: set $p_i$ to inactive
:::: else if $i + 2^k < p$
::::: receive $x_{i+2^k}$
::::: $x_i \gets x_i \oplus^\star x_{i+2^k}$

The only difference between the distributed algorithm and the PRAM version is the inclusion of explicit communication primitives; the operating principle stays the same.
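To illustrate the explicit communication, the following sketch replaces shared-memory reads with a per-round mailbox. The names `distributed_reduce` and `inbox` are assumptions, and real message passing is only imitated by a dictionary. Each unit touches only its own vector and whatever has been delivered to it.

```python
from operator import add

def distributed_reduce(vectors, op):
    """Simulate the message-passing variant of the binomial-tree reduction.

    Returns the result held by unit 0 and the total number of messages sent.
    """
    p = len(vectors)
    x = [list(v) for v in vectors]
    active = [True] * p
    messages_sent = 0
    for k in range((p - 1).bit_length()):
        inbox = {}
        # Send phase: units whose bit k is set ship their vector and retire.
        for i in range(p):
            if active[i] and (i >> k) & 1:
                inbox[i - 2**k] = x[i]
                active[i] = False
                messages_sent += 1
        # Receive phase: remaining units combine with the delivered vector.
        for i in range(p):
            if active[i] and i in inbox:
                x[i] = [op(a, b) for a, b in zip(x[i], inbox[i])]
    return x[0], messages_sent

result, messages = distributed_reduce([[2], [3], [5], [1], [7], [6], [8], [4]], add)
print(result, messages)  # [36] 7
```

In this simulation every unit except the root sends exactly one message, which gives $p - 1$ messages in total.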

Runtime analysis

The communication between units leads to some overhead. A simple analysis for the algorithm uses the BSP model and incorporates the time $T_\text{start}$ needed to initiate communication and the time $T_\text{byte}$ needed to send a byte. Then the resulting runtime is $\Theta((T_\text{start} + n \cdot T_\text{byte})\cdot \log(p))$, as the $m$ elements of a vector are sent in each iteration and have size $n$ in total.
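As a rough numerical illustration of this cost model, the sketch below evaluates the estimate for one parameter set. The function name, the parameter names `t_start` and `t_byte` (startup time and per-byte time), and the sample values are all hypothetical, and constant factors hidden by the $\Theta$ notation are ignored.

```python
def tree_reduce_time(p, n, t_start, t_byte):
    """Estimate: ceil(log2 p) rounds, each costing t_start + n * t_byte."""
    rounds = (p - 1).bit_length()  # equals ceil(log2(p)) for p >= 1
    return (t_start + n * t_byte) * rounds

# Hypothetical values: 8 units, 1024-byte vectors, startup cost 16, 1 per byte.
estimate = tree_reduce_time(p=8, n=1024, t_start=16, t_byte=1)
print(estimate)  # 3120
```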

Pipeline-algorithm

For distributed memory models, it can make sense to use pipelined communication. This is especially the case when $T_\text{start}$ is small in comparison to $T_\text{byte}$. Usually, linear pipelines split data or tasks into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors are not inseparable: the operator can be evaluated for single elements.

: for $k \gets 0$ to $p+m-3$ do
:: for $i \gets 0$ to $p - 1$ do in parallel
::: if $i \leq k < i+m \land i \neq p-1$
:::: send $x_i^{k-i}$ to $p_{i+1}$
::: if $i-1 \leq k < i-1+m \land i \neq 0$
:::: receive $x_{i-1}^{k-i+1}$ from $p_{i-1}$
:::: $x_i^{k-i+1} \gets x_i^{k-i+1} \oplus x_{i-1}^{k-i+1}$

It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at $p_{p-1}$ at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step.
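The pipelined scheme can be simulated by collecting all sends of a step before applying any receive, which imitates the concurrent send and receive. The name `pipeline_reduce` and the `in_flight` dictionary are assumptions made for this sketch. The sample input uses five units and vectors of size four, matching the configuration mentioned above.

```python
from operator import add

def pipeline_reduce(vectors, op):
    """Simulate the linear pipeline; the last unit ends up with the result."""
    p, m = len(vectors), len(vectors[0])
    x = [list(v) for v in vectors]
    steps = 0
    for k in range(p + m - 2):                 # k = 0 .. p + m - 3
        # Snapshot outgoing elements first so sends and receives are concurrent.
        in_flight = {}
        for i in range(p - 1):                 # the last unit never sends
            if i <= k < i + m:
                in_flight[i + 1] = (k - i, x[i][k - i])
        # Each receiver folds the incoming element into its own copy.
        for receiver, (j, value) in in_flight.items():
            x[receiver][j] = op(x[receiver][j], value)
        steps += 1
    return x[p - 1], steps

vectors = [[1, 2, 3, 4] for _ in range(5)]     # p = 5, m = 4
result, steps = pipeline_reduce(vectors, add)
print(result, steps)  # [5, 10, 15, 20] 7
```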

Runtime analysis

The number of steps in the parallel execution is $p + m - 2$: it takes $p-1$ steps until the last processing unit receives its first element and an additional $m-1$ steps until all elements are received. Therefore, the runtime in the BSP model is $T(n, p, m) = \left(T_\text{start} + \frac{n}{m}\cdot T_\text{byte}\right)(p+m-2)$, assuming that $n$ is the total byte size of a vector.

Although $m$ has a fixed value, it is possible to logically group elements of a vector together and reduce $m$. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and last two elements, which are always transmitted and computed together. In this case, double the volume is sent in each step, but the number of steps is roughly halved. This means that the parameter $m$ is halved, while the total byte size $n$ stays the same. The runtime $T(p)$ for this approach depends on the value of $m$, which can be optimized if $T_\text{start}$ and $T_\text{byte}$ are known. It is optimal for $m = \sqrt{\frac{n \cdot (p-2) \cdot T_\text{byte}}{T_\text{start}}}$, assuming that this results in a smaller $m$ that divides the original one.
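The optimum follows from setting the derivative of the runtime with respect to $m$ to zero. A small numerical check can illustrate it. The function names and parameter values below are hypothetical and were chosen so that the optimum is a whole number.

```python
import math

def pipeline_time(n, p, m, t_start, t_byte):
    """T(n, p, m) = (t_start + n / m * t_byte) * (p + m - 2)."""
    return (t_start + n / m * t_byte) * (p + m - 2)

def optimal_m(n, p, t_start, t_byte):
    """Continuous minimiser m = sqrt(n * (p - 2) * t_byte / t_start)."""
    return math.sqrt(n * (p - 2) * t_byte / t_start)

# Hypothetical values: 1024-byte vectors, 6 units, startup 16, 1 per byte.
n, p, t_start, t_byte = 1024, 6, 16.0, 1.0
m_star = optimal_m(n, p, t_start, t_byte)

print(m_star)                                     # 16.0
print(pipeline_time(n, p, 16, t_start, t_byte))   # 1600.0
print(pipeline_time(n, p, 8, t_start, t_byte))    # 1728.0
print(pipeline_time(n, p, 32, t_start, t_byte))   # 1728.0
```

With these values, halving or doubling the number of groups relative to the optimum raises the estimated runtime from 1600 to 1728 in both directions.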

Applications

Reduction is one of the main collective operations implemented in the Message Passing Interface, where the performance of the algorithm used is important and is evaluated constantly for different use cases. Operators can be used as parameters for MPI_Reduce and MPI_Allreduce, with the difference that the result is available at one (root) processing unit or at all of them. MapReduce relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters. Some parallel sorting algorithms use reductions to be able to handle very big data sets.

References

Books

* Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. CRC Press. p. 75. ISBN 978-1-4822-1118-4.
