Very long instruction word (VLIW) refers to

instruction set architecture In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, ...

s that are designed to exploit

instruction-level parallelism Instruction-level parallelism (ILP) is the Parallel computing, parallel or simultaneous execution of a sequence of Instruction set, instructions in a computer program. More specifically, ILP refers to the average number of instructions run per st ...

(ILP). A VLIW processor allows programs to explicitly specify instructions to execute in parallel, whereas conventional

central processing unit A central processing unit (CPU), also called a central processor, main processor, or just processor, is the primary Processor (computing), processor in a given computer. Its electronic circuitry executes Instruction (computing), instructions ...

s (CPUs) mostly allow programs to specify instructions to execute in sequence only. VLIW is intended to allow higher performance without the complexity inherent in some other designs. The traditional means to improve performance in processors include dividing instructions into sub steps so the instructions can be executed partly at the same time (termed ''pipelining''), dispatching individual instructions to be executed independently, in different parts of the processor (''

superscalar A superscalar processor (or multiple-issue processor) is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single in ...

architectures''), and even executing instructions in an order different from the program (''

out-of-order execution In computer engineering, out-of-order execution (or more formally dynamic execution) is an instruction scheduling paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In t ...

''). These methods all complicate hardware (larger circuits, higher cost and energy use) because the processor must make all of the decisions internally for these methods to work. In contrast, the VLIW method depends on the programs providing all the decisions regarding which instructions to execute simultaneously and how to resolve conflicts. As a practical matter, this means that the

compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...

(software used to create the final programs) becomes more complex, but the hardware is simpler than in many other means of parallelism.

History

The concept of VLIW architecture, and the term ''VLIW'', were invented by Josh Fisher in his research group at

Yale University Yale University is a Private university, private Ivy League research university in New Haven, Connecticut, United States. Founded in 1701, Yale is the List of Colonial Colleges, third-oldest institution of higher education in the United Stat ...

in the early 1980s. His original development of trace scheduling as a compiling method for VLIW was developed when he was a graduate student at

New York University New York University (NYU) is a private university, private research university in New York City, New York, United States. Chartered in 1831 by the New York State Legislature, NYU was founded in 1832 by Albert Gallatin as a Nondenominational ...

. Before VLIW, the notion of prescheduling

execution unit In computer engineering, an execution unit (E-unit or EU) is a part of a processing unit that performs the operations and calculations forwarded from the instruction unit. It may have its own internal control sequence unit (not to be confused w ...

s and instruction-level parallelism in software was well established in the practice of developing horizontal microcode. Before Fisher the theoretical aspects of what would be later called VLIW were developed by the Soviet computer scientist Mikhail Kartsev based on his Sixties work on military-oriented M-9 and M-10 computers. His ideas were later developed and published as a part of a textbook two years before Fisher's seminal paper, but because of the

Iron Curtain The Iron Curtain was the political and physical boundary dividing Europe into two separate areas from the end of World War II in 1945 until the end of the Cold War in 1991. On the east side of the Iron Curtain were countries connected to the So ...

and because Kartsev's work was mostly military-related it remained largely unknown in the West. Fisher's innovations involved developing a compiler that could target horizontal microcode from programs written in an ordinary

programming language A programming language is a system of notation for writing computer programs. Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...

. He realized that to get good performance and target a wide-issue machine, it would be necessary to find parallelism beyond that generally within a

basic block In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit. This restricted form makes a basic block highly amenable to analysis. Compilers usually decom ...

. He also developed region scheduling methods to identify parallelism beyond basic blocks. Trace scheduling is such a method, and involves scheduling the most likely path of basic blocks first, inserting compensating code to deal with speculative motions, scheduling the second most likely trace, and so on, until the schedule is complete. Fisher's second innovation was the notion that the target CPU architecture should be designed to be a reasonable target for a compiler; that the compiler and the architecture for a VLIW processor must be codesigned. This was inspired partly by the difficulty Fisher observed at Yale of compiling for architectures like Floating Point Systems' FPS164, which had a

complex instruction set computing A complex instruction set computer (CISC ) is a computer architecture in which single instructions can execute several low-level operations (such as a load from memory, an arithmetic operation, and a memory store) or are capable of multi-step ...

(CISC) architecture that separated instruction initiation from the instructions that saved the result, needing very complex scheduling algorithms. Fisher developed a set of principles characterizing a proper VLIW design, such as self-draining pipelines, wide multi-port

register file A register file is an array of processor registers in a central processing unit (CPU). The instruction set architecture of a CPU will almost always define a set of registers which are used to stage data between memory and the functional units on ...

s, and memory architectures. These principles made it easier for compilers to emit fast code. The first VLIW compiler was described in a Ph.D. thesis by John Ellis, supervised by Fisher. The compiler was named Bulldog, after Yale's mascot. Fisher left Yale in 1984 to found a startup company, Multiflow, along with cofounders John O'Donnell and John Ruttenberg. Multiflow produced the TRACE series of VLIW

minisupercomputer Minisupercomputers constituted a short-lived class of computers that emerged in the mid-1980s, characterized by the combination of vector processing and small-scale multiprocessing. As scientific computing using vector processors became more popul ...

s, shipping their first machines in 1987. Multiflow's VLIW could issue 28 operations in parallel per instruction. The TRACE system was implemented in a mix of medium-scale integration (MSI), large-scale integration (LSI), and very large-scale integration (VLSI), packaged in cabinets, a technology obsoleted as it grew more cost-effective to integrate all of the components of a processor (excluding memory) on one chip. Multiflow was too early to catch the following wave, when chip architectures began to allow multiple-issue CPUs. The major semiconductor companies recognized the value of Multiflow technology in this context, so the compiler and architecture were subsequently licensed to most of these firms.

Motivation

A processor that executes every instruction one after the other (i.e., a non- pipelined scalar architecture) may use processor resources inefficiently, yielding potential poor performance. The performance can be improved by executing different substeps of sequential instructions simultaneously (termed ''pipelining''), or even executing multiple instructions entirely simultaneously as in

architectures. Further improvement can be achieved by executing instructions in an order different from that in which they occur in a program, termed

. These three methods all raise hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions have no interdependencies. For example, if a first instruction's result is used as a second instruction's input, then they cannot execute at the same time and the second instruction cannot execute before the first. Modern out-of-order processors have increased the hardware resources which schedule instructions and determine interdependencies. In contrast, VLIW executes operations in parallel, based on a fixed schedule, determined when programs are compiled. Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that the three methods described above require. Thus, VLIW CPUs offer more computing with less hardware complexity (but greater compiler complexity) than do most superscalar CPUs. This is also complementary to the idea that as many computations as possible should be done before the program is executed, at compile time.

Design

In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes one operation only. For most superscalar designs, the instruction width is 32 bits or fewer. In contrast, one VLIW instruction encodes multiple operations, at least one operation for each execution unit of a device. For example, if a VLIW device has five execution units, then a VLIW instruction for the device has five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and far wider on some architectures. For example, the following is an instruction for the Super Harvard Architecture Single-Chip Computer (SHARC). In one cycle, it does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits in one 48-bit instruction: f12 = f0 * f4, f8 = f8 + f12, f0 = dm(i0, m3), f4 = pm(i8, m9); Since the earliest days of computer architecture, some CPUs have added several

arithmetic logic unit In computing, an arithmetic logic unit (ALU) is a Combinational logic, combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on ...

s (ALUs) to run in parallel.

Superscalar A superscalar processor (or multiple-issue processor) is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single in ...

CPUs use hardware to decide which operations can run in parallel at runtime, while VLIW CPUs use software (the compiler) to decide which operations can run in parallel in advance. Because the complexity of instruction scheduling is moved into the compiler, complexity of hardware can be reduced substantially. A similar problem occurs when the result of a parallelizable instruction is used as input for a branch. Most modern CPUs ''guess'' which branch will be taken even before the calculation is complete, so that they can load the instructions for the branch, or (in some architectures) even start to compute them speculatively. If the CPU guesses wrong, all of these instructions and their context need to be '' flushed'' and the correct ones loaded, which takes time. This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original

reduced instruction set computing In electronics and computer science, a reduced instruction set computer (RISC) is a computer architecture designed to simplify the individual instructions given to the computer to accomplish tasks. Compared to the instructions given to a com ...

(RISC) designs has been eroded. VLIW lacks this logic, and thus lacks its energy use, possible design defects, and other negative aspects. In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch takes an unexpected way, the compiler has already generated compensating code to discard speculative results to preserve program semantics.

Vector processor In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called ...

cores (designed for large one-dimensional arrays of data called ''vectors'') can be combined with the VLIW architecture such as in the Fujitsu

FR-V The Fujitsu FR-V (Fujitsu RISC- VLIW) is one of the very few processors ever able to process both a very long instruction word (VLIW) and vector processor instructions at the same time, increasing throughput with high parallel computing while ...

microprocessor, further increasing

throughput Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel in a communication network, such as Ethernet or packet radio. The data that these messages contain may be delivered ov ...

and

speed In kinematics, the speed (commonly referred to as ''v'') of an object is the magnitude of the change of its position over time or the magnitude of the change of its position per unit of time; it is thus a non-negative scalar quantity. Intro ...

Implementations

Cydrome was a company producing VLIW numeric processors using

emitter-coupled logic In electronics, emitter-coupled logic (ECL) is a high-speed integrated circuit bipolar transistor logic family. ECL uses a bipolar junction transistor (BJT) differential amplifier with single-ended input and limited emitter current to avoid th ...

(ECL) integrated circuits in the same timeframe (late 1980s). This company, like Multiflow, failed after a few years. One of the licensees of the Multiflow technology is

Hewlett-Packard The Hewlett-Packard Company, commonly shortened to Hewlett-Packard ( ) or HP, was an American multinational information technology company. It was founded by Bill Hewlett and David Packard in 1939 in a one-car garage in Palo Alto, California ...

, which Josh Fisher joined after Multiflow's demise.

Bob Rau Bantwal Ramakrishna "Bob" Rau (1951 – December 10, 2002) was a computer engineer and HP Fellow. Rau was a founder and chief architect of Cydrome, where he helped develop the Very long instruction word technology that is now common in modern comp ...

, founder of Cydrome, also joined HP after Cydrome failed. These two would lead computer architecture research at Hewlett-Packard during the 1990s. Along with the above systems, during the same time (1989–1990), Intel implemented VLIW in the Intel i860, their first 64-bit microprocessor, and the first processor to implement VLIW on one chip. This processor could operate in both simple RISC mode and VLIW mode:

In the early 1990s, Intel introduced the i860 RISC microprocessor. This simple chip had two modes of operation: a scalar mode and a VLIW mode. In the VLIW mode, the processor always fetched two instructions and assumed that one was an integer instruction and the other floating-point.

The i860's VLIW mode was used extensively in embedded

digital signal processor A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on metal–oxide–semiconductor (MOS) integrated circuit chips. ...

(DSP) applications since the application execution and datasets were simple, well ordered and predictable, allowing designers to fully exploit the parallel execution advantages enabled by VLIW. In VLIW mode, the i860 could maintain floating-point performance in the range of 20-40 double-precision MFLOPS; a very high value for its time and for a processor running at 25-50Mhz. In the 1990s, Hewlett-Packard researched this problem as a side effect of ongoing work on their

PA-RISC Precision Architecture reduced instruction set computer, RISC (PA-RISC) or Hewlett Packard Precision Architecture (HP/PA or simply HPPA), is a computer, general purpose computer instruction set architecture (ISA) developed by Hewlett-Packard f ...

processor family. They found that the CPU could be greatly simplified by removing the complex dispatch logic from the CPU and placing it in the compiler. Compilers of the day were far more complex than those of the 1980s, so the added complexity in the compiler was considered to be a small cost. VLIW CPUs are usually made of multiple RISC-like

s that operate independently. Contemporary VLIWs usually have four to eight main execution units. Compilers generate initial instruction sequences for the VLIW CPU in roughly the same manner as for traditional CPUs, generating a sequence of RISC-like instructions. The compiler analyzes this code for dependence relationships and resource requirements. It then schedules the instructions according to those constraints. In this process, independent instructions can be scheduled in parallel. Because VLIWs typically represent instructions scheduled in parallel with a longer instruction word that incorporates the individual instructions, this results in a much longer

opcode In computing, an opcode (abbreviated from operation code) is an enumerated value that specifies the operation to be performed. Opcodes are employed in hardware devices such as arithmetic logic units (ALUs), central processing units (CPUs), and ...

(termed ''very long'') to specify what executes on a given cycle. Examples of contemporary VLIW CPUs include the TriMedia media processors by NXP (formerly Philips Semiconductors), the Super Harvard Architecture Single-Chip Computer (SHARC) DSP by Analog Devices, the ST200 family by STMicroelectronics based on the Lx architecture (designed in Josh Fisher's HP lab by Paolo Faraboschi), the

from Fujitsu, the BSP15/16 from Pixelworks, the CEVA-X DSP from CEVA, the Jazz DSP from Improv Systems, the HiveFlex series from Silicon Hive, and the MPPA Manycore family by Kalray. The Texas Instruments TMS320 DSP line has evolved, in its C6000 family, to look more like a VLIW, in contrast to the earlier C5000 family. One or more

Qualcomm Hexagon Hexagon is the brand name for a family of digital signal processor (DSP) and later neural processing unit (NPU) products by Qualcomm. Hexagon is also known as QDSP6, standing for “sixth generation digital signal processor.” According to Qua ...

s can be found in most cell phones today. These contemporary VLIW CPUs are mainly successful as embedded media processors for consumer electronic devices. VLIW features have also been added to configurable processor cores for

system-on-a-chip A system on a chip (SoC) is an integrated circuit that combines most or all key components of a computer or electronic system onto a single microchip. Typically, an SoC includes a central processing unit (CPU) with memory, input/output, and dat ...

(SoC) designs. For example, Tensilica's Xtensa LX2 processor incorporates a technology named Flexible Length Instruction eXtensions (FLIX) that allows multi-operation instructions. The Xtensa C/C++ compiler can freely intermix 32- or 64-bit FLIX instructions with the Xtensa processor's one-operation RISC instructions, which are 16 or 24 bits wide. By packing multiple operations into a wide 32- or 64-bit instruction word and allowing these multi-operation instructions to intermix with shorter RISC instructions, FLIX allows SoC designers to realize VLIW's performance advantages while eliminating the code bloat of early VLIW architectures. The Infineon Carmel DSP is another VLIW processor core intended for SoC. It uses a similar code density improvement method called ''configurable long instruction word'' (CLIW). Outside embedded processing markets, Intel's

Itanium Itanium (; ) is a discontinued family of 64-bit computing, 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). The Itanium architecture originated at Hewlett-Packard (HP), and was later jointly dev ...

IA-64

explicitly parallel instruction computing Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the Itanium, HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called ''Independe ...

(EPIC) and Elbrus 2000 appear as the only examples of a widely used VLIW CPU architectures. However, EPIC architecture is sometimes distinguished from a pure VLIW architecture, since EPIC advocates full instruction predication, rotating register files, and a very long instruction word that can encode non-parallel instruction groups. VLIWs also gained significant consumer penetration in the

graphics processing unit A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...

(GPU) market, though both

Nvidia Nvidia Corporation ( ) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curti ...

and

AMD Advanced Micro Devices, Inc. (AMD) is an American multinational corporation and technology company headquartered in Santa Clara, California and maintains significant operations in Austin, Texas. AMD is a hardware and fabless company that de ...

have since moved to RISC architectures to improve performance on non-graphics workloads.

ATI Technologies ATI Technologies Inc. was a Canadian semiconductor industry, semiconductor technology corporation based in Markham, Ontario, that specialized in the development of graphics processing units and chipsets. Founded in 1985, the company listed pub ...

' (ATI) and

Advanced Micro Devices Advanced Micro Devices, Inc. (AMD) is an American multinational corporation and technology company headquartered in Santa Clara, California and maintains significant operations in Austin, Texas. AMD is a Information technology, hardware and F ...

' (AMD) TeraScale microarchitecture for

s (GPUs) is a VLIW microarchitecture. In December 2015, the first shipment of PCs based on VLIW CPU Elbrus-4s was made in Russia. The Neo by REX Computing is a processor consisting of a 2D mesh of VLIW cores aimed at power efficiency. The Elbrus 2000 () and its successors are Russian 512-bit wide VLIW microprocessors developed by Moscow Center of SPARC Technologies (MCST) and fabricated by

TSMC Taiwan Semiconductor Manufacturing Company Limited (TSMC or Taiwan Semiconductor) is a Taiwanese multinational semiconductor contract manufacturing and design company. It is one of the world's most valuable semiconductor companies, the world' ...

Backward compatibility

When silicon technology allowed for wider implementations (with more execution units) to be built, the compiled programs for the earlier generation would not run on the wider implementations, as the encoding of binary instructions depended on the number of execution units of the machine.

Transmeta Transmeta Corporation was an American fabless semiconductor company based in Santa Clara, California. It developed low power x86 compatible microprocessors based on a VLIW core and a software layer called Code Morphing Software. Code Morphing ...

addressed this issue by including a binary-to-binary software compiler layer (termed '' code morphing'') in their Crusoe implementation of the

x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel, based on the 8086 microprocessor and its 8-bit-external-bus variant, the 8088. Th ...

architecture. This mechanism was advertised to basically recompile, optimize, and translate x86 opcodes at runtime into the CPU's internal machine code. Thus, the Transmeta chip is ''internally'' a VLIW processor, effectively decoupled from the x86 CISC

instruction set In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, s ...

that it executes. Intel's

architecture (among others) solved the backward-compatibility problem with a more general mechanism. Within each of the multiple-opcode instructions, a bit field is allocated to denote dependency on the prior VLIW instruction within the program instruction stream. These bits are set at

compile time In computer science, compile time (or compile-time) describes the time window during which a language's statements are converted into binary instructions for the processor to execute. The term is used as an adjective to describe concepts relat ...

, thus relieving the hardware from calculating this dependency information. Having this dependency information encoded in the instruction stream allows wider implementations to issue multiple non-dependent VLIW instructions in parallel per cycle, while narrower implementations would issue a smaller number of VLIW instructions per cycle. Another perceived deficiency of VLIW designs is the code bloat that occurs when one or more execution unit(s) have no useful work to do and thus must execute ''No Operation'' NOP instructions. This occurs when there are dependencies in the code and the instruction pipelines must be allowed to drain before later operations can proceed. Since the number of transistors on a chip has grown, the perceived disadvantages of the VLIW have diminished in importance. VLIW architectures are growing in popularity, especially in the

embedded system An embedded system is a specialized computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is e ...

market, where it is possible to customize a processor for an application in a

References

External links

Paper That Introduced VLIWs

Book on the history of Multiflow Computer, VLIW pioneering company

ISCA "Best Papers" Retrospective On Paper That Introduced VLIWs

VLIW and Embedded Processing

FR500 VLIW-architecture High-performance Embedded Microprocessor

DIS: an Architecture for fast LISP execution.
A similar VLIW architecture, with a parallelizing compiler directed toward LISP. {{CPU technologies Digital signal processing Instruction processing Parallel computing