In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a

branch A branch, also called a ramus in botany, is a stem that grows off from another stem, or when structures like veins in leaves are divided into smaller veins. History and etymology In Old English, there are numerous words for branch, includ ...

instruction on a

RISC In electronics and computer science, a reduced instruction set computer (RISC) is a computer architecture designed to simplify the individual instructions given to the computer to accomplish tasks. Compared to the instructions given to a comp ...

or DSP architecture; this instruction will execute even if the preceding branch is taken. This makes the instruction execute out-of-order compared to its location in the original

assembler language In computing, assembly language (alternatively assembler language or symbolic machine code), often referred to simply as assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence bet ...

code. Modern processor designs generally do not use delay slots, and instead perform ever more complex forms of

branch prediction In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow ...

. In these systems, the CPU immediately moves on to what it believes will be the correct side of the branch and thereby eliminates the need for the code to specify some unrelated instruction, which may not always be obvious at compile-time. If the assumption is wrong, and the other side of the branch has to be called, this can introduce a lengthy delay. This occurs rarely enough that the speed up of avoiding the delay slot is easily made up by the smaller number of wrong decisions.

Pipelining

central processing unit A central processing unit (CPU), also called a central processor, main processor, or just processor, is the primary Processor (computing), processor in a given computer. Its electronic circuitry executes Instruction (computing), instructions ...

generally performs instructions from the

machine code In computer programming, machine code is computer code consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). For conventional binary computers, machine code is the binaryOn nonb ...

using a four-step process; the instruction is first read from memory, then decoded to understand what needs to be performed, those actions are then executed, and finally, any results are written back to memory. In early designs, each of these stages was performed in series, so that instructions took some multiple of the machine's

clock cycle In electronics and especially synchronous digital circuits, a clock signal (historically also known as ''logic beat'') is an electronic logic signal (voltage or current) which oscillates between a high and a low state at a constant frequency and ...

to complete. For instance, in the

Zilog Z80 The Zilog Z80 is an 8-bit computing, 8-bit microprocessor designed by Zilog that played an important role in the evolution of early personal computing. Launched in 1976, it was designed to be Backward compatibility, software-compatible with the ...

, the minimum number of clocks needed to complete an instruction was four, but could be as many as 23 clocks for some (rare) instructions. At any given stage of the instruction's processing, only one part of the chip is involved. For instance, during the execution stage, typically only the

arithmetic logic unit In computing, an arithmetic logic unit (ALU) is a Combinational logic, combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on ...

(ALU) is active, while other units, like those that interact with

main memory Computer data storage or digital data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processin ...

or decode the instruction, are idle. One way to improve the overall performance of a computer is through the use of an

instruction pipeline In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming Mac ...

. This adds some additional circuitry to hold the intermediate states of the instruction as it flows through the units. While this does not improve the cycle timing of any single instruction, the idea is to allow a second instruction to use the other CPU sub-units when the previous instruction has moved on. For instance, while one instruction is using the ALU, the next instruction from the program can be in the decoder, and a third can be fetched from memory. In this

assembly line An assembly line, often called ''progressive assembly'', is a manufacturing process where the unfinished product moves in a direct line from workstation to workstation, with parts added in sequence until the final product is completed. By mechan ...

type arrangement, the total number of instructions processed at any time can be improved by up to the number of pipeline stages. In the Z80, for example, a four-stage pipeline could improve overall throughput by four times. However, due to the complexity of the instruction timing, this would not be easy to implement. The much simpler

instruction set architecture In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, ...

(ISA) of the

MOS 6502 The MOS Technology 6502 (typically pronounced "sixty-five-oh-two" or "six-five-oh-two") William Mensch and the moderator both pronounce the 6502 microprocessor as ''"sixty-five-oh-two"''. is an 8-bit microprocessor that was designed by a small ...

allowed a two-stage pipeline to be included, which gave it performance that was about double that of the Z80 at any given clock speed.

Branching problems

A major issue with the implementation of pipelines in early systems was that instructions had widely varying cycle counts. For instance, the instruction to add two values would often be offered in multiple versions, or ''

opcode In computing, an opcode (abbreviated from operation code) is an enumerated value that specifies the operation to be performed. Opcodes are employed in hardware devices such as arithmetic logic units (ALUs), central processing units (CPUs), and ...

s'', which varied on where they read in the data. One version of might take the value found in one

processor register A processor register is a quickly accessible location available to a computer's processor. Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-onl ...

and add it to the value in another, another version might add the value found in memory to a register, while another might add the value in one memory location to another memory location. Each of these instructions takes a different amount of bytes to represent it in memory, meaning they take different amounts of time to fetch, may require multiple trips through the memory interface to gather values, etc. This greatly complicates the pipeline logic. One of the goals of the

chip design concept was to remove these variants so that the pipeline logic was simplified, which leads to the

classic RISC pipeline In the history of computing hardware, history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: ...

which completes one instruction every cycle. However, there is one problem that comes up in pipeline systems that can slow the performance down. This occurs when the next instruction may change depending on the results of the last. In most systems, this happens when a

occurs. For instance, consider the following pseudo-code: top: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register write the result to memory read a number from memory and store it in another register ... In this case, the program is linear and can be easily pipelined. As soon as the first instruction has been read and is being decoded, the second instruction can be read from memory. When the first moves to execute, the is being read from memory while the second is decoding, and so forth. Although it still takes the same number of cycles to complete the first , by the time it is complete the value from the second is ready and the CPU can immediately add them. In a non-pipelined processor the first four instructions will take 16 cycles to complete, in a pipelined one, it takes only five. Now consider what occurs when a branch is added: top: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register if the result in the 3rd register is greater than 1000, then go back to top: (if it is not) write the result to memory read a number from memory and store it in another register ... In this example the outcome of the comparison on line four will cause the "next instruction" to change; sometimes it will be the following to memory, and sometimes it will be the from memory at the top. The processor's pipeline will normally have already read the next instruction, the , by the time the ALU has calculated which path it will take. This is known as a branch hazard. If it has to return to the top, the instruction has to be discarded and the instruction read from memory instead. That takes one full instruction cycle, at a minimum, and results in the pipeline being empty for at least one instruction's time. This is known as a "pipeline stall" or "bubble", and, depending on the number of branches in the code, can have a noticeable impact on overall performance.

Branch delay slots

One strategy for dealing with this problem is to use a delay slot, which refers to the instruction slot after any instruction that needs more time to complete. In the examples above, the instruction that requires more time is the branch, which is by far the most common type of delay slot, and these are more commonly referred to as a branch delay slot. In early implementations, the instruction following the branch would be filled with a no-operation, or , simply to fill out the pipeline to ensure the timing was right such that by the time the had been loaded from memory the branch was complete and the

program counter The program counter (PC), commonly called the instruction pointer (IP) in Intel x86 and Itanium microprocessors, and sometimes called the instruction address register (IAR), the instruction counter, or just part of the instruction sequencer, ...

could be updated with the correct value. This simple solution wastes the processing time available. More advanced solutions would instead try to identify another instruction, typically nearby in the code, to place in the delay slot so that useful work would be accomplished. In the examples above, the instruction at the end is completely independent, it does not rely on any other information and can be performed at any time. This makes it suitable for placement in the branch delay slot. Normally this would be handled automatically by the assembler program or

compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...

, which would re-order the instructions: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register if the result in the 3rd register is greater than 1000, then go back to the top read a number from memory and store it in another register (if it is not) write the result to memory ... Now when the branch is executing, it goes ahead and performs the next instruction. By the time that instruction is read into the processor and starts to decode, the result of the comparison is ready and the processor can now decide which instruction to read next, the at the top or the at the bottom. This prevents any wasted time and keeps the pipeline full at all times. Finding an instruction to fill the slot can be difficult. The compilers generally have a limited "window" to examine and may not find a suitable instruction in that range of code. Moreover, the instruction cannot rely on any of the data within the branch; if an instruction takes a previous calculation as one of its inputs, that input cannot be part of the code in a branch that might be taken. Deciding if this is true can be very complex in the presence of

register renaming In computer architecture, register renaming is a technique that abstracts logical processor register, registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instructio ...

, in which the processor may place data in registers other than what the code specifies without the compiler being aware of this. Another side effect is that special handling is needed when managing

breakpoint In software development, a breakpoint is an intentional stopping or pausing place in a computer program, program, put in place for debugging purposes. It is also sometimes simply referred to as a pause. More generally, a breakpoint is a means o ...

s on instructions as well as stepping while

debugging In engineering, debugging is the process of finding the Root cause analysis, root cause, workarounds, and possible fixes for bug (engineering), bugs. For software, debugging tactics can involve interactive debugging, control flow analysis, Logf ...

within the branch delay slot. An interrupt is unable to occur during a branch delay slot and is deferred until after the branch delay slot. Placing branch instruction in the branch delay slot is prohibited or deprecated. The ideal number of branch delay slots in a particular pipeline implementation is dictated by the number of pipeline stages, the presence of register forwarding, what stage of the pipeline the branch conditions are computed, whether or not a

branch target buffer In computer architecture, a branch target predictor is the part of a processor that predicts the target, i.e., the address of the instruction that is executed next, of a taken conditional branch or unconditional branch instruction before the ta ...

(BTB) is used and many other factors. Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next. This inevitably requires that newer hardware implementations contain extra hardware to ensure that the architectural behaviour is followed despite no longer being relevant.

Implementations

Branch delay slots are found mainly in DSP architectures and older

architectures. MIPS,

PA-RISC Precision Architecture reduced instruction set computer, RISC (PA-RISC) or Hewlett Packard Precision Architecture (HP/PA or simply HPPA), is a computer, general purpose computer instruction set architecture (ISA) developed by Hewlett-Packard f ...

(delayed or non-delayed branch can be specified),

ETRAX CRIS The ETRAX CRIS is a RISC Instruction set architecture, ISA and series of Central processing unit, CPUs designed and manufactured by Axis Communications for use in embedded systems since 1993. The name is an acronym of the chip's features: ''Ethernet ...

SuperH SuperH (or SH) is a 32-bit reduced instruction set computing (RISC) instruction set architecture (ISA) developed by Hitachi and currently produced by Renesas. It is implemented by microcontrollers and microprocessors for embedded systems. At the ...

(unconditional branch instructions have one delay slot), Am29000, Intel i860 (unconditional branch instructions have one delay slot), MC88000 (delayed or non-delayed branch can be specified), and SPARC are RISC architectures that each have a single branch delay slot;

PowerPC PowerPC (with the backronym Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a reduced instruction set computer (RISC) instruction set architecture (ISA) created by the 1991 Apple Inc., App ...

, ARM,

Alpha Alpha (uppercase , lowercase ) is the first letter of the Greek alphabet. In the system of Greek numerals, it has a value of one. Alpha is derived from the Phoenician letter ''aleph'' , whose name comes from the West Semitic word for ' ...

, V850, and

RISC-V RISC-V (pronounced "risk-five") is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. The project commenced in 2010 at the University of California, Berkeley. It transfer ...

do not have any. DSP architectures that each have a single branch delay slot include μPD77230 and the VS DSP. The SHARC DSP and

MIPS-X MIPS-X is a reduced instruction set computer (RISC) microprocessor and instruction set architecture (ISA) developed as a follow-on project to the Stanford MIPS, MIPS project at Stanford University by the same team that developed MIPS. The project ...

use a double branch delay slot; such a processor will execute a pair of instructions following a branch instruction before the branch takes effect. Both TMS320C3x and TMS320C4x use a triple branch delay slot. The TMS320C4x has both non-delayed and delayed branches. The following example shows delayed branches in assembly language for the SHARC DSP including a pair after the RTS instruction. Registers R0 through R9 are cleared to zero in order by number (the register cleared after R6 is R7, not R9). No instruction executes more than once.

     R0 = 0;
     CALL fn (DB);      /* call a function, below at label "fn" */
     R1 = 0;            /* first delay slot */
     R2 = 0;            /* second delay slot */
     /***** discontinuity here (the CALL takes effect) *****/

     R6 = 0;            /* the CALL/RTS comes back here, not at "R1 = 0" */
     JUMP end (DB);
     R7 = 0;            /* first delay slot */
     R8 = 0;            /* second delay slot */
     /***** discontinuity here (the JUMP takes effect) *****/

     /* next 4 instructions are called from above, as function "fn" */
fn:  R3 = 0;
     RTS (DB);          /* return to caller, past the caller's delay slots */
     R4 = 0;            /* first delay slot */
     R5 = 0;            /* second delay slot */
     /***** discontinuity here (the RTS takes effect) *****/

end: R9 = 0;

Load delay slot

A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem. The following example is MIPS I assembly code, showing both a load delay slot and a branch delay slot. lw v0,4(v1) # load word from address v1+4 into v0 nop # wasted load delay slot jr v0 # jump to the address specified by v0 nop # wasted branch delay slot

References

External links

* * {{refend Instruction processing