For several years parallel hardware was available only for distributed computing, but recently it has become available for low-end computers as well. Hence it has become inevitable for software programmers to start writing parallel applications. It is natural for programmers to think sequentially, so they are less acquainted with writing multi-threaded or parallel processing applications. Parallel programming requires handling issues such as synchronization and deadlock avoidance, and programmers need added expertise for writing such applications beyond their expertise in the application domain. Hence programmers prefer to write sequential code, which most popular programming languages support; this allows them to concentrate on the application. There is therefore a need to convert such sequential applications to parallel applications with the help of automated tools. The need is also non-trivial because a large amount of legacy code written over the past few decades needs to be reused and parallelized.


Need for automatic parallelization

Past techniques provided solutions for languages like FORTRAN and C; however, these are not enough. These techniques dealt with parallelizing specific parts of a program, such as a loop or a particular section of code. Identifying opportunities for parallelization is a critical step in generating a multithreaded application. The need to parallelize applications is partially addressed by tools that analyze code to exploit parallelism. These tools use either compile-time or run-time techniques. Some parallelizing compilers have these techniques built in, but the user needs to identify the code to parallelize and mark it with special language constructs; the compiler then identifies these constructs and analyzes the marked code for parallelization. Some tools parallelize only a special form of code, such as loops. Hence a fully automatic tool for converting sequential code to parallel code is required.


General procedure of parallelization

1. The process starts with identifying code sections that the programmer feels have parallelism possibilities. This task is often difficult, since the programmer who wants to parallelize the code has usually not originally written the code under consideration, or may be new to the application domain. Thus, though this first stage in the parallelization process seems easy at first, it may not be so.

2. The next stage is to shortlist the code sections that are actually parallelizable. This stage is again most important and difficult, since it involves a lot of analysis. Generally, code in C/C++ that involves pointers is difficult to analyze. Many special techniques, such as pointer alias analysis and function side-effect analysis, are required to conclude whether a section of code is dependent on any other code. The more dependencies there are in the identified code sections, the lower the possibility of parallelization.

3. Sometimes the dependencies can be removed by changing the code, and this is the next stage in parallelization. The code is transformed such that the functionality, and hence the output, is not changed but the dependency, if any, on other code sections or instructions is removed.

4. The last stage in parallelization is generating the parallel code. This code is always functionally similar to the original sequential code but has additional constructs or code sections which, when executed, create multiple threads or processes. A minimal sketch of stages 3 and 4 follows this list.
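
The sketch below illustrates stages 3 and 4 in C, assuming OpenMP as the target construct (the stages themselves are construct-agnostic): a loop-carried dependency is removed by rewriting a recurrence in closed form, after which a single inserted pragma yields the parallel code.

    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N];

        /* Stage 2 finding: the original loop carries a dependency,
         *   a[0] = 1.0;
         *   for (int i = 1; i < N; i++) a[i] = a[i-1] + 2.0;
         * because each iteration reads the previous one's result. */

        /* Stage 3: rewrite the recurrence in closed form so the
         * iterations become independent. Stage 4: the inserted
         * pragma lets the compiler create multiple threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 1.0 + 2.0 * i;

        printf("a[%d] = %f\n", N - 1, a[N - 1]); /* prints 1999.0 */
        return 0;
    }

Built with an OpenMP-capable compiler (e.g. cc -fopenmp), the loop runs across the available cores; without the flag the pragma is ignored and the program remains valid sequential C.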


Automatic parallelization technique

See also the main article: automatic parallelization.


Scan

This is the first stage, in which the scanner reads the input source files to identify all static and extern usages. Each line in the file is checked against pre-defined patterns to segregate it into tokens. These tokens are stored in a file which will be used later by the grammar engine. The grammar engine checks patterns of tokens that match pre-defined rules to identify variables, loops, control statements, functions, etc. in the code.
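
A minimal sketch of this pattern-based token segregation in C; the token classes and patterns here are hypothetical stand-ins for the tool's pre-defined rules:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Classify a whitespace-separated word against a few pre-defined
     * patterns, mimicking the scanner's token segregation. */
    static const char *classify(const char *w) {
        static const char *keywords[] =
            { "for", "while", "if", "int", "static", "extern" };
        for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
            if (strcmp(w, keywords[i]) == 0)
                return "KEYWORD";
        if (isdigit((unsigned char)w[0])) return "NUMBER";
        if (isalpha((unsigned char)w[0]) || w[0] == '_') return "IDENTIFIER";
        return "SYMBOL";
    }

    int main(void) {
        char buf[128];
        strcpy(buf, "static int counter = 42 ;");
        /* Emit one (token, class) pair per word, as the scanner
         * would store them for the grammar engine. */
        for (char *w = strtok(buf, " "); w; w = strtok(NULL, " "))
            printf("%-10s -> %s\n", w, classify(w));
        return 0;
    }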


Analyze

The analyzer is used to identify sections of code that can be executed concurrently. The analyzer uses the static data information provided by the scanner-parser. It first finds all the functions that are totally independent of each other and marks them as individual tasks; then it finds which tasks have dependencies on one another.
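
As an illustration of the result the analyzer aims for, the C sketch below marks two functions that touch disjoint data as independent tasks, here realized with OpenMP sections (one possible realization; the source does not tie the analyzer to any particular construct):

    #include <stdio.h>

    /* Two functions with no shared data: an analyzer would mark
     * these as independent tasks. */
    static long sum_up_to(long n) {
        long s = 0;
        for (long i = 1; i <= n; i++) s += i;
        return s;
    }

    static long product_up_to(long n) {
        long p = 1;
        for (long i = 1; i <= n; i++) p *= i;
        return p;
    }

    int main(void) {
        long s, p;
        /* Each section may run on its own thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            s = sum_up_to(100);        /* no dependency on p */
            #pragma omp section
            p = product_up_to(10);     /* no dependency on s */
        }
        printf("sum=%ld product=%ld\n", s, p);
        return 0;
    }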


Schedule

The scheduler lists all the tasks and their dependencies on each other in terms of execution and start times. The scheduler produces a schedule that is optimal in terms of the number of processors to be used or the total execution time of the application.
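
A toy greedy list scheduler in C conveys the idea; the task durations and dependencies are invented for the example, and a real scheduler would optimize rather than schedule greedily in index order:

    #include <stdio.h>

    #define NTASKS 4
    #define NCORES 2

    /* Illustrative task set: dur[i] is task i's duration; dep[i]
     * is a task that must finish first, or -1 if none. */
    static const int dur[NTASKS] = { 3, 2, 4, 1 };
    static const int dep[NTASKS] = { -1, -1, 0, 1 };

    int main(void) {
        int core_free[NCORES] = { 0 }; /* time each core becomes free */
        int finish[NTASKS];

        /* Start each task as soon as its dependency is done and
         * the earliest-free core is available. */
        for (int t = 0; t < NTASKS; t++) {
            int best = 0;
            for (int c = 1; c < NCORES; c++)
                if (core_free[c] < core_free[best]) best = c;
            int ready = dep[t] < 0 ? 0 : finish[dep[t]];
            int start = ready > core_free[best] ? ready : core_free[best];
            finish[t] = start + dur[t];
            core_free[best] = finish[t];
            printf("task %d: core %d, start %d, end %d\n",
                   t, best, start, finish[t]);
        }
        return 0;
    }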


Code Generation

The scheduler generates a list of all the tasks and the details of the cores on which they will execute, along with the time for which they will execute. The code generator then inserts special constructs in the code that will be read during execution by the scheduler. These constructs instruct the scheduler as to the core on which a particular task will execute, along with its start and end times.


Parallelization tools

There are a number of automatic parallelization tools for Fortran, C, C++, and several other languages.


YUCCA

YUCCA is a sequential-to-parallel automatic code conversion tool developed by KPIT Technologies Ltd., Pune. It takes as input C source code, which may comprise multiple source and header files, and produces transformed multi-threaded parallel code using pthreads functions and OpenMP constructs. The YUCCA tool performs task- and loop-level parallelization.
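
For illustration only (this is not actual YUCCA output), the two output flavors for a simple loop might look like the following C sketch, which assumes a fixed two-way split for the pthreads variant:

    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    static int out[N];

    /* Worker for the pthreads variant: each thread handles half
     * the iterations of the original loop. */
    static void *half(void *arg) {
        int lo = *(int *)arg;
        for (int i = lo; i < lo + N / 2; i++)
            out[i] = i * i;
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        int lo0 = 0, lo1 = N / 2;

        /* Flavor 1: explicit pthreads. */
        pthread_create(&tid, NULL, half, &lo1);
        half(&lo0);              /* main thread does the first half */
        pthread_join(tid, NULL);

        /* Flavor 2: the same loop with an OpenMP construct. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = i * i;

        for (int i = 0; i < N; i++) printf("%d ", out[i]);
        printf("\n");
        return 0;
    }

Link with -pthread (and -fopenmp for the OpenMP flavor).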


Par4All

Par4All is an automatic parallelizing and optimizing compiler (workbench) for C and Fortran sequential programs. The purpose of this source-to-source compiler is to adapt existing applications to various hardware targets such as multicore systems, high-performance computers and GPUs. It creates a new source code and thus allows the original source code of the application to remain unchanged.


Cetus

Cetus is a compiler infrastructure for the source-to-source transformation of software programs, developed at Purdue University. Cetus is written in Java. It provides basic infrastructure for writing automatic parallelization tools or compilers. The basic parallelizing techniques Cetus currently implements are privatization, reduction variable recognition and induction variable substitution. A new graphical user interface (GUI) was added in February 2013. Speedup calculations and graph display were added in May 2013, along with a Cetus remote server in a client–server model, which lets users optionally transform C code through the server; this is especially useful when running Cetus on a non-Linux platform. An experimental Hubzero version of Cetus was also implemented in May 2013, allowing users to run Cetus through a web browser.
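
A minimal before/after sketch of one of these techniques, induction variable substitution, in C (the loop is illustrative, not Cetus output):

    #include <stdio.h>

    #define N 10

    int main(void) {
        int a[N];

        /* Before: j carries a dependency between iterations.
         *   int j = 0;
         *   for (int i = 0; i < N; i++) { j = j + 2; a[i] = j; }
         */

        /* After substitution: j equals 2*(i+1), so iterations
         * are independent and the loop can be parallelized. */
        for (int i = 0; i < N; i++)
            a[i] = 2 * (i + 1);

        for (int i = 0; i < N; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }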


PLUTO

PLUTO is an automatic parallelization tool based on the polyhedral model. The polyhedral model for compiler optimization is a representation of programs that makes it convenient to perform high-level transformations such as loop nest optimizations and loop parallelization. Pluto transforms C programs from source to source for coarse-grained parallelism and data locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling and fusion, but is not limited to those. OpenMP parallel code for multicores can be automatically generated from sequential C program sections.
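
The C sketch below shows the kind of tiled, OpenMP-annotated loop nest such a framework can produce; the tile size, bounds and loop body are invented for the example (real polyhedral output is more general and handles imperfect bounds):

    #include <stdio.h>

    #define N 64
    #define B 16   /* tile size; N is a multiple of B here */

    static double x[N][N];

    int main(void) {
        /* Tiled form of:  for (i) for (j) x[i][j] = i + j;
         * Tiles along i are independent, so the outer tile loop
         * can be parallelized. */
        #pragma omp parallel for
        for (int it = 0; it < N; it += B)
            for (int jt = 0; jt < N; jt += B)
                for (int i = it; i < it + B; i++)
                    for (int j = jt; j < jt + B; j++)
                        x[i][j] = i + j;

        printf("x[%d][%d] = %f\n", N - 1, N - 1, x[N - 1][N - 1]);
        return 0;
    }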


Polaris compiler

The Polaris compiler takes a Fortran77 program as input, transforms it so that it runs efficiently on a parallel computer, and outputs this program version in one of several possible parallel FORTRAN dialects. Polaris performs its transformations in several "compilation passes". In addition to many commonly known passes, Polaris includes advanced capabilities for array privatization, data dependence testing, induction variable recognition, interprocedural analysis, and symbolic program analysis.


Intel C++ compiler

The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into semantically equivalent multi-threaded code. Automatic parallelization determines the loops that are good work-sharing candidates, performs the data-flow analysis to verify correct parallel execution, and partitions the data for threaded code generation as is needed in programming with OpenMP directives. Both the OpenMP and auto-parallelization features provide performance gains on shared-memory multiprocessor systems.
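
The sort of loop this analysis accepts is one whose iterations are provably independent, as in the C sketch below (the compile command in the comment uses the classic compiler's auto-parallelization option; consult the documentation of your compiler version for the exact flag):

    /* Built with e.g. `icc -parallel prog.c` (classic compiler);
     * the auto-parallelizer threads the second loop because no
     * iteration reads data written by another iteration. */
    #include <stdio.h>

    #define N 100000

    static double a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            b[i] = i * 0.5;

        /* Work-sharing candidate: a[i] depends only on b[i]. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;

        printf("a[%d] = %f\n", N - 1, a[N - 1]);
        return 0;
    }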


Intel Advisor

The Intel Advisor 2017 is a vectorization optimization and thread prototyping tool. It integrates several steps into its workflow to search for parallel sites, enable users to mark loops for vectorization and threading, check loop-carried dependencies and memory access patterns for marked loops, and insert pragmas for vectorization and threading.


AutoPar

AutoPar is a tool which can automatically insert OpenMP pragmas into input serial C/C++ code. For input programs with existing OpenMP directives, the tool will double-check their correctness when the right option is turned on. Compared to conventional tools, AutoPar can incorporate user knowledge (semantics) to discover more parallelization opportunities.


iPat/OMP

This tool provides users with the assistance needed for OpenMP parallelization of a sequential program. It is implemented as a set of functions within the Emacs editor. All activities related to program parallelization, such as selecting a target portion of the program, invoking an assistance command, and modifying the program based on the assistance information shown by the tool, can be handled in the source-program editor environment.


Vienna Fortran compiler (VFC)

VFC is a source-to-source parallelization system for HPF+ (an optimized version of HPF), which addresses the requirements of irregular applications.


SUIF compiler

SUIF (Stanford University Intermediate Format) is a free infrastructure designed to support collaborative research in optimizing and parallelizing compilers. SUIF is a fully functional compiler that takes both Fortran and C as input languages. The parallelized code is output as an SPMD (Single Program Multiple Data) parallel C version of the program that can be compiled by native C compilers on a variety of architectures.


Omni OpenMP compiler

It translates C and Fortran programs with OpenMP pragmas into C code suitable for compiling with a native compiler linked with the Omni OpenMP runtime library. It parallelizes for loops.


Timing-Architects Optimizer

It uses a simulation-based approach to improve task allocation and task parallelization across multiple cores. Using simulation-based performance and real-time analysis, different task-allocation alternatives are benchmarked against each other. Dependencies as well as processor-platform-specific effects are considered. The TA Optimizer is used in embedded-system engineering.


TRACO

It uses the Iteration Space Slicing and Free Schedule frameworks. Its core is based on Presburger arithmetic and the transitive closure operation. Loop dependencies are represented with relations. TRACO uses the Omega Calculator, CLooG and ISL libraries, and the Petit dependence analyser. The compiler extracts better locality with fine- and coarse-grained parallelism for C/C++ applications. The tool is developed by a West Pomeranian University of Technology team (Bielecki, Palkowski, Klimek and other authors); see http://traco.sourceforge.net.


SequenceL

SequenceL is a general-purpose functional programming language and auto-parallelizing tool set whose primary design objectives are performance on multi-core processor hardware, ease of programming, platform portability/optimization, and code clarity and readability. Its main advantage is that it can be used to write straightforward code that automatically takes full advantage of all the processing power available, without programmers needing to be concerned with identifying parallelism, specifying vectorization, avoiding race conditions, and other challenges of manual directive-based programming approaches such as OpenMP. Programs written in SequenceL can be compiled to multithreaded code that runs in parallel, with no explicit indications from a programmer of how or what to parallelize. As of 2015, versions of the SequenceL compiler generate parallel code in C++ and OpenCL, which allows it to work with most popular programming languages, including C, C++, C#, Fortran, Java, and Python. A platform-specific runtime manages the threads safely, automatically providing parallel performance according to the number of cores available.


OMP2MPI

OMP2MPI automatically generates MPI source code from OpenMP code, allowing the program to exploit non-shared-memory architectures such as clusters or Network-on-Chip-based (NoC-based) multiprocessor systems-on-chip (MPSoC). OMP2MPI produces a solution that allows further optimization by an expert who wants to achieve better results.
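
A hand-written sketch of the kind of translation involved (not the tool's actual output): the body of an OpenMP parallel-for becomes a block-distributed MPI computation, assuming here that N is divisible by the number of ranks.

    #include <mpi.h>
    #include <stdio.h>

    #define N 8

    /* OpenMP original:
     *   #pragma omp parallel for
     *   for (int i = 0; i < N; i++) y[i] = 2 * i;
     */
    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        int local[N], y[N];

        /* Each rank computes its block of iterations. */
        for (int k = 0; k < chunk; k++) {
            int i = rank * chunk + k;
            local[k] = 2 * i;
        }

        /* Collect all blocks on every rank. */
        MPI_Allgather(local, chunk, MPI_INT, y, chunk, MPI_INT,
                      MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < N; i++) printf("%d ", y[i]);
            printf("\n");
        }
        MPI_Finalize();
        return 0;
    }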


OMP2HMPP

OMP2HMPP is a tool that automatically translates high-level C source code with OpenMP pragmas into HMPP. The generated version rarely differs from a hand-coded HMPP version, and provides an important speedup, near 113%, that can later be improved by hand-coded CUDA.


emmtrix Parallel Studio


emmtrix Parallel Studio is a source-to-source parallelization tool combined with an interactive GUI, developed by emmtrix Technologies GmbH. It takes C, MATLAB, Simulink, Scilab or Xcos source code as input and generates parallel C code as output. It relies on a static schedule and a message-passing API for the parallel program. The whole parallelization process is controlled and visualized in an interactive GUI, enabling parallelization decisions by the end user. It targets embedded multicore architectures combined with GPU and FPGA accelerators.


CLAW Compiler

The CLAW Compiler translates Fortran programs with CLAW pragmas into Fortran code suitable for a specific supercomputer target, augmented with OpenMP or OpenACC pragmas.


PaSH

PaSH is a parallelizing compiler for Unix shell scripts.


See also

* Automatic parallelization
* Intel C++ Compiler
* Scalable parallelism
* SequenceL
* BMDFM




External links


* AutoPar
* Pluto-compiler
* par4all
* Cetus
* Traco
* OMP2MPI
* OMP2HMPP
* emmtrix Parallel Studio