In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory

abstract machine An abstract machine is a computer science theoretical model that allows for a detailed and precise analysis of how a computer system functions. It is analogous to a mathematical function in that it receives inputs and produces outputs based on p ...

. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the

parallel random-access machine In computer science, a parallel random-access machine (parallel RAM or PRAM) is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM) (not to be confuse ...

(PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory. __TOC__

Model

Definition

The PEM model is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of

P

processors and a two-level

memory hierarchy In computer architecture, the memory hierarchy separates computer storage into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlli ...

. This memory hierarchy consists of a large external memory (main memory) of size

N

and

P

small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size

M

which is partitioned in blocks of size

B

. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size

B

I/O complexity

The complexity measure of the PEM model is the I/O complexity, which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if

P

processors load parallelly a data block of size

B

form the main memory into their caches, it is considered as an I/O complexity of

O(1)

not

O(P)

. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts occur. Like in the PRAM model, three different variations of this problem are considered: * Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently. * Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time. * Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time. The following two algorithms solve the CREW and EREW problem if

P \leq B

processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of

P

parallel block transfers. A second approach needs

O(\log(P))

parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round

P

processors combine their blocks into

P/2

blocks. Then

P/2

processors combine the

P/2

blocks into

P/4

. This procedure is continued until all the data is combined in one block.

Comparison to other models

Examples

Multiway partitioning

Let

M=\

be a vector of d-1 pivots sorted in increasing order. Let

A

be an unordered set of N elements. A d-way partition of

A

is a set

\Pi=\

, where

\cup_^d A_i = A

and

A_i\cap A_j=\emptyset

for

1\leq i . A_i is called the i-th bucket. The number of elements in A_i is greater than m_and smaller than m_^2 . In the following algorithm the input is partitioned into N/P-sized contiguous segments S_1,...,S_P in main memory. The processor i primarily works on the segment S_i . The multiway partitioning algorithm (

PEM_DIST_SORT) uses a PEM

prefix sum In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers is a second sequence of numbers , the sums of prefixes (running totals) of the input sequence: : : : :... For instance, the prefix sums of ...

algorithm to calculate the prefix sum with the optimal

O\left(\frac + \log(P)\right)

I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm. // Compute parallelly a d-way partition on the data segments

S_i

for each processor i in parallel do Read the vector of pivots

M

into the cache. Partition

S_i

into d buckets and let vector

M_i=\

be the number of items in each bucket. end for Run PEM prefix sum on the set of vectors

\

simultaneously. // Use the prefix sum vector to compute the final partition for each processor i in parallel do Write elements

S_i

into memory locations offset appropriately by

M_

and

M_

. end for Using the prefix sums stored in

M_P

the last processor P calculates the vector

B

of bucket sizes and returns it. If the vector of

d=O\left(\frac\right)

pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with

O\left(\frac + \left\lceil \frac \right\rceil>\log(P)+d\log(B)\right)

I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection

The

selection problem In computer science, a selection algorithm is an algorithm for finding the ''k''th smallest number in a list or array; such a number is called the ''k''th '' order statistic''. This includes the cases of finding the minimum, maximum, and me ...

is about finding the k-th smallest item in an unordered list

A

of size

N

. The following code makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in

O(\log N)

, and SELECT, which is a cache optimal single-processor selection algorithm. if

N \leq P

then

\texttt(A,P)

return

A /math>
 end if 
 
 //Find median of each S_i for each processor i in parallel do m_i = \texttt(S_i, \frac) end for 
 
 // Sort medians \texttt(\lbrace m_1, \dots, m_2 \rbrace, P) // Partition around median of medians t = \texttt(A, m_,P) if k \leq t then 
     return \texttt(A :t P, k) else 
     return \texttt(A +1:N P, k-t) end if

Under the assumption that the input is stored in contiguous memory,

PEMSELECT has an I/O complexity of:

O(\frac + \log (PB) \cdot \log(\frac))

Distribution sort

Distribution sort In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is importan ...

partitions an input list

A

of size

N

into

d

disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list. If

P = 1

the task is delegated to a cache-optimal single-processor sorting algorithm. Otherwise the following algorithm is used: // Sample

\tfrac

elements from

A

for each processor

i

in parallel do if

M < , S_i,

then

d = M/B

Load

S_i

M

-sized pages and sort pages individually else

d = , S_i,

Load and sort

S_i

as single page end if Pick every

\sqrt/4

'th element from each sorted memory page into contiguous vector

R^i

of samples end for in parallel do Combine vectors

R^1 \dots R^P

into a single contiguous vector

\mathcal

Make

\sqrt

copies of

\mathcal

\mathcal_1 \dots \mathcal_

end do // Find

\sqrt

pivots

\mathcal /math>
 for j = 1 to \sqrt in parallel do \mathcal = \texttt(\mathcal_i, \tfrac, \tfrac) end for
 
 Pack pivots in contiguous array \mathcal // Partition A around pivots into buckets \mathcal \mathcal = \texttt(A :N \mathcal,\sqrt,P) // Recursively sort buckets
 for j = 1 to \sqrt + 1 in parallel do
     recursively call \texttt on bucket j of size \mathcal /math>
     using O \left( \left \lceil \tfrac \right \rceil \right) processors responsible for elements in bucket j end for

The I/O complexity of

PEMDISTSORT is:

O \left( \left \lceil \frac \right \rceil \left ( \log_d P + \log_ \frac \right ) + f(N,P,d) \cdot \log_d P \right)

where

f(N,P,d) = O \left ( \log \frac \log \frac + \left \lceil \frac \log P + \sqrt \log B \right \rceil \right )

If the number of processors is chosen that

f(N,P,d) = O\left ( \left \lceil \tfrac \right \rceil \right )

and

M < B^

the I/O complexity is then:

O \left ( \frac \log_ \frac \right )

Other PEM algorithms

Where

\textrm_P(N)

is the time it takes to sort

N

items with

P

processors in the PEM model.

References

{{Parallel Computing Algorithms Models of computation Analysis of parallel algorithms External memory algorithms Cache (computing)