computer architecture In computer engineering, computer architecture is a description of the structure of a computer system made from component parts. It can sometimes be a high-level description that ignores details of the implementation. At a more detailed level, the ...

, a transport triggered architecture (TTA) is a kind of

processor Processor may refer to: Computing Hardware * Processor (computing) **Central processing unit (CPU), the hardware within a computer that executes a program *** Microprocessor, a central processing unit contained on a single integrated circuit (I ...

design in which programs directly control the internal transport

buses A bus (contracted from omnibus, with variants multibus, motorbus, autobus, etc.) is a road vehicle that carries significantly more passengers than an average car or van. It is most commonly used in public transport, but is also in use for cha ...

of a processor. Computation happens as a side effect of data transports: writing data into a ''triggering port'' of a

functional unit In computer engineering, an execution unit (E-unit or EU) is a part of the central processing unit (CPU) that performs the operations and calculations as instructed by the computer program. It may have its own internal control sequence unit (not ...

triggers the functional unit to start a computation. This is similar to what happens in a

systolic array In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from i ...

. Due to its modular structure, TTA is an ideal processor template for

application-specific instruction set processor An application-specific instruction set processor (ASIP) is a component used in system on a chip design. The instruction set architecture of an ASIP is tailored to benefit a specific application. This specialization of the core provides a tradeo ...

s (''ASIP'') with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators. Typically a transport triggered processor has multiple transport buses and multiple functional units connected to the buses, which provides opportunities for

instruction level parallelism Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution. Disc ...

. The parallelism is statically defined by the programmer. In this respect (and obviously due to the large instruction word width), the TTA architecture resembles the

very long instruction word Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units (CPU, processor) mostly allow programs to specify instructions to exe ...

(VLIW) architecture. A TTA instruction word is composed of multiple slots, one slot per bus, and each slot determines the data transport that takes place on the corresponding bus. The fine-grained control allows some optimizations that are not possible in a conventional processor. For example, software can transfer data directly between functional units without using registers. Transport triggering exposes some microarchitectural details that are normally hidden from programmers. This greatly simplifies the control logic of a processor, because many decisions normally done at run time are fixed at

compile time In computer science, compile time (or compile-time) describes the time window during which a computer program is compiled. The term is used as an adjective to describe concepts related to the context of program compilation, as opposed to concep ...

. However, it also means that a binary compiled for one TTA processor will not run on another one without recompilation if there is even a small difference in the architecture between the two. The binary incompatibility problem, in addition to the complexity of implementing a full context switch, makes TTAs more suitable for

embedded system An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is ''embedded ...

s than for general purpose computing. Of all the

one-instruction set computer A one-instruction set computer (OISC), sometimes called an ultimate reduced instruction set computer (URISC), is an abstract machine that uses only one instructionobviating the need for a machine language opcode. With a judicious choice for the si ...

architectures, the TTA architecture is one of the few that has had processors based on it built, and the only one that has processors based on it sold commercially.

Benefits in comparison to VLIW architectures

TTAs can be seen as "exposed datapath" VLIW architectures. While VLIW is programmed using operations, TTA splits the operation execution to multiple ''move'' operations. The low level programming model enables several benefits in comparison to the standard VLIW. For example, a TTA architecture can provide more parallelism with simpler register files than with VLIW. As the programmer is in control of the timing of the operand and result data transports, the complexity (the number of input and output ports) of the

register file A register file is an array of processor registers in a central processing unit (CPU). Register banking is the method of using a single name to access multiple different physical registers depending on the operating mode. Modern integrated circuit- ...

(RF) need not be scaled according to the worst case issue/completion scenario of the multiple parallel instructions. An important unique software optimization enabled by the transport programming is called ''software bypassing''. In case of software bypassing, the programmer bypasses the register file write back by moving data directly to the next functional unit's operand ports. When this optimization is applied aggressively, the original move that transports the result to the register file can be eliminated completely, thus reducing both the register file port pressure and freeing a

general purpose register A processor register is a quickly accessible location available to a computer's processor. Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. ...

for other temporary variables. The reduced register pressure, in addition to simplifying the required complexity of the RF hardware, can lead to significant CPU energy savings, an important benefit especially in mobile embedded systems.V. Guzma, P. Jääskeläinen, P. Kellomäki, and J. Takala, “Impact of Software Bypassing on Instruction Level Parallelism and Register File Traffic” Johan Janssen
"Compiler Strategies for Transport Triggered Architectures"
2001. p. 168.

Structure

TTA processors are built of independent ''function units'' and

s, which are connected with ''transport buses'' and ''sockets''.

Function unit

Each function unit implements one or more operations, which implement functionality ranging from a simple addition of integers to a complex and arbitrary user-defined application-specific computation. Operands for operations are transferred through function unit ''ports''. Each function unit may have an independent pipeline. In case a function unit is fully pipelined, a new operation that takes multiple

clock cycle In electronics and especially synchronous digital circuits, a clock signal (historically also known as ''logic beat'') oscillates between a high and a low state and is used like a metronome to coordinate actions of digital circuits. A clock sig ...

s to finish can be started in every clock cycle. On the other hand, a pipeline can be such that it does not always accept new operation start requests while an old one is still executing. Data memory access and communication to outside of the processor is handled by using special function units. Function units that implement memory accessing operations and connect to a memory module are often called load/store units.

Control unit

''Control unit'' is a special case of function units which controls execution of programs. Control unit has access to the instruction memory in order to fetch the instructions to be executed. In order to allow the executed programs to transfer the execution (jump) to an arbitrary position in the executed program, control unit provides control flow operations. A control unit usually has an

instruction pipeline In computer engineering, instruction pipelining or ILP is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing inco ...

, which consists of stages for fetching, decoding and executing program instructions.

Register files

s, which are used to store variables in programs. Like function units, also register files have input and output ports. The number of read and write ports, that is, the capability of being able to read and write multiple registers in a same clock cycle, can vary in each register file.

Transport buses and sockets

Interconnect architecture In telecommunications, interconnection is the physical linking of a carrier's network with equipment or facilities not belonging to that network. The term may refer to a connection between a carrier's facilities and the equipment belonging to ...

consists of transport buses which are connected to function unit ports by means of ''sockets''. Due to expense of connectivity, it is usual to reduce the number of connections between units (function units and register files). A TTA is said to be ''fully connected'' in case there is a path from each unit output port to every unit's input ports. Sockets provide means for programming TTA processors by allowing to select which bus-to-port connections of the socket are enabled at any time instant. Thus, data transports taking place in a clock cycle can be programmed by defining the source and destination socket/port connection to be enabled for each bus.

Conditional execution

Some TTA implementations support

conditional execution Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions i ...

Conditional execution Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions i ...

is implemented with the aid of ''guards''. Each data transport can be conditionalized by a guard, which is connected to a register (often a 1-bit

conditional register A processor register is a quickly accessible location available to a computer's processor. Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. ...

) and to a bus. In case the value of the guarded register evaluates to false (zero), the data transport programmed for the bus the guard is connected to is ''squashed'', that is, not written to its destination. ''Unconditional'' data transports are not connected to any guard and are always executed.

Branches

All processors, including TTA processors, include

control flow In computer science, control flow (or flow of control) is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. The emphasis on explicit control flow distinguishes an '' ...

instructions that alter the program counter, which are used to implement

subroutine In computer programming, a function or subroutine is a sequence of program instructions that performs a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed. Functions may ...

if-then-else In computer science, conditionals (that is, conditional statements, conditional expressions and conditional constructs,) are programming language commands for handling decisions. Specifically, conditionals perform different computations or actio ...

, for-loop, etc. The assembly language for TTA processors typically includes control flow instructions such as unconditional branches (JUMP), conditional relative branches (BNZ), subroutine call (CALL), conditional return (RETNZ), etc. that look the same as the corresponding assembly language instructions for other processors. Like all other operations on a TTA machine, these instructions are implemented as "move" instructions to a special function unit. TTA implementations that support conditional execution, such as the sTTAck and the first MOVE prototype, can implement most of these control flow instructions as a conditional move to the program counter. TTA implementations that only support unconditional data transports, such as the

Maxim Integrated Maxim Integrated, a subsidiary of Analog Devices, designs, manufactures, and sells analog and mixed-signal integrated circuits for the automotive, industrial, communications, consumer, and computing markets. Maxim's product portfolio includes p ...

MAXQ, typically have a special function unit tightly connected to the program counter that responds to a variety of destination addresses. Each such address, when used as the destination of a "move", has a different effect on the program counter—each "relative branch <condition>" instruction has a different destination address for each condition; and other destination addresses are used CALL, RETNZ, etc.

Programming

In more traditional processor architectures, a processor is usually programmed by defining the executed operations and their operands. For example, an addition instruction in a RISC architecture could look like the following.

add r3, r1, r2

This example operation adds the values of general-purpose registers r1 and r2 and stores the result in register r3. Coarsely, the execution of the instruction in the processor probably results in translating the instruction to control signals which control the interconnection network connections and function units. The interconnection network is used to transfer the current values of registers r1 and r2 to the function unit that is capable of executing the add operation, often called ALU as in Arithmetic-Logic Unit. Finally, a control signal selects and triggers the addition operation in ALU, of which result is transferred back to the register r3. TTA programs do not define the operations, but only the data transports needed to write and read the operand values. Operation itself is triggered by writing data to a ''triggering operand'' of an operation. Thus, an operation is executed as a side effect of the triggering data transport. Therefore, executing an addition operation in TTA requires three data transport definitions, also called ''moves''. A move defines endpoints for a data transport taking place in a transport bus. For instance, a move can state that a data transport from function unit F, port 1, to register file R, register index 2, should take place in bus B1. In case there are multiple buses in the target processor, each bus can be utilized in parallel in the same clock cycle. Thus, it is possible to exploit data transport level parallelism by scheduling several data transports in the same instruction. An addition operation can be executed in a TTA processor as follows:

r1 -> ALU.operand1
r2 -> ALU.add.trigger
ALU.result -> r3

The second move, a write to the second operand of the function unit called ALU, triggers the addition operation. This makes the result of addition available in the output port 'result' after the execution latency of the 'add'. The ports associated with the ALU may act as an accumulator, allowing creation of

macro instruction In computer programming, a macro (short for "macro instruction"; ) is a rule or pattern that specifies how a certain input should be mapped to a replacement output. Applying a macro to an input is known as macro expansion. The input and outpu ...

s that abstract away the underlying TTA: lda r1 ; "load ALU": move value to ALU operand 1 add r2 ; add: move value to add trigger sta r3 ; "store ALU": move value from ALU result

Programmer visible operation latency

The leading philosophy of TTAs is to move complexity from hardware to software. Due to this, several additional hazards are introduced to the programmer. One of them is

delay slot In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP ...

s, the programmer visible operation latency of the function units. Timing is completely the responsibility of the programmer. The programmer has to schedule the instructions such that the result is neither read too early nor too late. There is no hardware detection to lock up the processor in case a result is read too early. Consider, for example, an architecture that has an operation ''add'' with latency of 1, and operation ''mul'' with latency of 3. When triggering the ''add'' operation, it is possible to read the result in the next instruction (next clock cycle), but in case of ''mul'', one has to wait for two instructions before the result can be read. The result is ready for the 3rd instruction after the triggering instruction. Reading a result too early results in reading the result of a previously triggered operation, or in case no operation was triggered previously in the function unit, the read value is undefined. On the other hand, result must be read early enough to make sure the next operation result does not overwrite the yet unread result in the output port. Due to the abundance of programmer-visible processor context which practically includes, in addition to register file contents, also function unit pipeline register contents and/or function unit input and output ports, context saves required for external interrupt support can become complex and expensive to implement in a TTA processor. Therefore, interrupts are usually not supported by TTA processors, but their task is delegated to an external hardware (e.g., an I/O processor) or their need is avoided by using an alternative synchronization/communication mechanism such as polling.

Implementations

* MAXQ from

, the only commercially available microcontroller built upon transport triggered architecture, is an OISC or "

". It offers a single though flexible MOVE instruction, which can then function as various virtual instructions by moving values directly to the

program counter The program counter (PC), commonly called the instruction pointer (IP) in Intel x86 and Itanium microprocessors, and sometimes called the instruction address register (IAR), the instruction counter, or just part of the instruction sequencer, i ...

. * Th
"move project"
has designed and fabricated several experimental TTA microprocessors. * OpenASIP is an open source application-specific instruction set toolset utilizing TTA as the processor template. * The architecture of the Amiga Copper has all the basic features of a transport triggered architecture. * Th
Able
processor developed by

New England Digital New England Digital Corporation (1976–1993) was founded in Norwich, Vermont, and relocated to White River Junction, Vermont. It was best known for its signature product, the Synclavier Synthesizer System, which evolved into the Synclavier Digit ...

. * The

WireWorld Wireworld, alternatively WireWorld, is a cellular automaton first proposed by Brian Silverman in 1987, as part of his program Phantom Fish Tank. It subsequently became more widely known as a result of an article in the "Computer Recreations" colu ...

base
computer

Dr. Dobb's
published One-Der, a 32-bit TTA in Verilog with a matching cross assembler and Forth compiler.Web site with more details on the Dr. Dobb's CPU
* Mali (200/400) vertex processor, uses a 128-bit instruction word single precision floating-point scalar TTA.

References

External links

MOVE project: Automatic Synthesis of Application Specific Processors (accessible through the wayback machine)

{{DEFAULTSORT:Transport Triggered Architecture Computer architecture Instruction processing