SIMD within a register (SWAR), also known by the name "packed SIMD" is a technique for performing parallel operations on data contained in a

processor register A processor register is a quickly accessible location available to a computer's processor. Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-onl ...

SIMD Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...

stands for ''single instruction, multiple data''. Flynn's 1972 taxonomy categorises SWAR as "pipelined processing". Many modern general-purpose computer processors have some provisions for

, in the form of a group of registers and instructions to make use of them. SWAR refers to the use of those registers and instructions, as opposed to using specialized processing engines designed to be better at SIMD operations. It also refers to the use of SIMD with general-purpose registers and instructions that were not meant to do it at the time, by way of various novel software tricks.

SWAR architectures

A SWAR architecture is one that includes instructions explicitly intended to perform parallel operations across data that is stored in the independent subwords or fields of a register. A SWAR-capable architecture is one that includes a set of instructions that is sufficient to allow data stored in these fields to be treated independently even though the architecture does not include instructions that are explicitly intended for that purpose. An early example of a SWAR architecture was the Intel Pentium with MMX, which implemented the MMX extension set. The

Intel Pentium Pentium is a series of x86 architecture-compatible microprocessors produced by Intel from 1993 to 2023. The original Pentium was Intel's fifth generation processor, succeeding the i486; Pentium was Intel's flagship processor line for over ...

, by contrast, did not include such instructions, but could still act as a SWAR architecture through careful hand-coding or compiler techniques. Early SWAR architectures include

DEC Alpha Alpha (original name Alpha AXP) is a 64-bit reduced instruction set computer (RISC) instruction set architecture (ISA) developed by Digital Equipment Corporation (DEC). Alpha was designed to replace 32-bit VAX complex instruction set computers ( ...

, Hewlett-Packard's

PA-RISC Precision Architecture reduced instruction set computer, RISC (PA-RISC) or Hewlett Packard Precision Architecture (HP/PA or simply HPPA), is a computer, general purpose computer instruction set architecture (ISA) developed by Hewlett-Packard f ...

MAX Max or MAX may refer to: Animals * Max (American dog) (1983–2013), at one time purported to be the world's oldest living dog * Max (British dog), the first pet dog to win the PDSA Order of Merit (animal equivalent of the OBE) * Max (gorilla) ...

, Silicon Graphics Incorporated's MIPS MDMX, and Sun's SPARC V9

VIS Vis, ViS, VIS, and other capitalizations may refer to: Places * Vis (island), a Croatian island in the Adriatic sea ** Vis (town), on the island of Vis * Vis (river), in south-central France * Vis, Bulgaria, a village in Haskovo Province * Visa ...

. Like MMX, many of the SWAR instruction sets are intended for faster video coding.

History of the SWAR programming model

Wesley A. Clark Wesley Allison Clark (April 10, 1927 – February 22, 2016) was an American physicist who is credited for designing the first modern personal computer. He was also a computer designer and the main participant, along with Charles Molnar, in the ...

introduced partitioned subword data operations in the 1950s. This can be seen as a very early predecessor to SWAR.

Leslie Lamport Leslie B. Lamport (born February 7, 1941) is an American computer scientist and mathematician. Lamport is best known for his seminal work in distributed systems, and as the initial developer of the document preparation system LaTeX and the author ...

presented SWAR techniques in his paper titled "Multiple byte processing with full-word instructions" in 1975. With the introduction of Intel's MMX multimedia instruction set extensions in 1996, desktop processors with SIMD parallel processing capabilities became common. Early on, these instructions could only be used via hand-written assembly code. In the fall of 1996, Professor Hank Dietz was the instructor for the undergraduate Compiler Construction course at Purdue University's School of Electrical and Computer Engineering. For this course, he assigned a series of projects in which the students would build a simple compiler targeting MMX. The input language was a subset dialect of

MasPar MasPar Computer Corporation (later NeoVista Software, Inc.) was a minisupercomputer vendor that was founded in 1987 by Jeff Kalb. The company was based in Sunnyvale, California. History While Kalb was the vice-president of the division of Digita ...

's MPL called NEMPL (Not Exactly MPL). During the course of the semester, it became clear to the course teaching assistant, Randall (Randy) Fisher, that there were a number of issues with MMX that would make it difficult to build the back-end of the NEMPL compiler. For example, MMX has an instruction for multiplying 16-bit data but not multiplying 8-bit data. The NEMPL language did not account for this problem, allowing the programmer to write programs that required 8-bit multiplies. Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's

, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called ''new media'' applications. These extensions had significant differences in the precision of data and types of instructions supported. Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programming to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Hank's office in the MSEE building at Purdue University. It refers to this form of parallel processing, architectures that are designed to natively perform this type of processing, and the general-purpose programming model that is Fisher's dissertation. The problem of compiling for these widely varying architectures was discussed in a paper presented at LCPC98.

Some applications of SWAR

SWAR processing has been used in image processing, cryptographic pairings, raster processing, computational fluid dynamics, and communications.

Examples

SWAR techniques can be used even on systems without special hardware support.

Logical operation In logic, a logical connective (also called a logical operator, sentential connective, or sentential operator) is a logical constant. Connectives can be used to connect logical formulas. For instance in the syntax of propositional logic, th ...

s act bitwise, so act on each bit of a register independently. Using addition and subtraction is more difficult, but can be useful if care is taken to avoid unwanted

carry propagation An adder, or summer, is a digital circuit that performs addition of numbers. In many computers and other kinds of processors, adders are used in the arithmetic logic units (ALUs). They are also used in other parts of the processor, where they ar ...

between lanes. Except for this carry propagation, one 64-bit addition or subtraction is the same as performing eight 8-bit additions or subtractions.

Counting bits set

Probably the archetypical example of SWAR techniques is finding the population count of (number of bits set in) a register. The register is treated successively as a series of 1-bit, 2-bit, 4-bit, etc. fields. To begin with, note that the population count of a 1-bit field is simply the field itself. To find the population count of a 2-bit field, sum the population counts of its two constituent 1-bit fields. This can be done in parallel for 32 2-bit fields in a 64-bit value : x2 := (x & 0x5555555555555555) + (x >> 1 & 0x5555555555555555); The

hexadecimal Hexadecimal (also known as base-16 or simply hex) is a Numeral system#Positional systems in detail, positional numeral system that represents numbers using a radix (base) of sixteen. Unlike the decimal system representing numbers using ten symbo ...

constant is binary 0101₂, which isolates even-numbered bits. The addition cannot overflow each 2-bit field, as the maximum possible sum is 2. This can be repeated to combine 2-bit fields into 4-bit fields. Here, we use a mask of binary 0011₂, or hexadecimal , to isolate pairs of bits: x4 := (x2 & 0x3333333333333333) + (x2 >> 2 & 0x3333333333333333); Now each 4-bit field contains a count from 0 to 4. Because a 4-bit field can contain a value up to 15, no overflow is possible when adding two 4-bit population counts, allowing the masking to be done ''after'' the addition, rather than once per addend: x8 := x4 + (x4 >> 4) & 0x0f0f0f0f0f0f0f0f; At this point, the 8-bit fields can hold values up to 255, so there is no need for further masking until the very end: x16 = x8 + (x8 >> 8); x32 = x16 + (x16 >> 16); x64 = x32 + (x32 >> 32); population_count = x64 & 0xff;

Further refinements

There are several well-known variants on this. In particular, the last three shift-and-add steps can be combined into population_count = x8 * 0x0101010101010101 >> 56; Three stages of shifting and adding require 6 instructions, each with a

data dependency A data dependency in computer science is a situation in which a program statement (instruction) refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements (or instructions) i ...

on the previous one, so take at least 6 clock cycles. A multiplication can usually be performed faster. When operating on 32-bit words, it's less clear, as a 3-cycle multiply is common. A second variant is a change to the first step. Rather than combining the two bits ''b''₁ and ''b''₀ in each 2-bit field by adding them, consider the initial value of the 2-bit field as 2''b''₁ + ''b''₀. Subtracting ''b''₁ from this will produce the desired sum, with only one masking operation: x2 := x − (x >> 1 & 0x5555555555555555);

Finding zero bytes

It is common to search a

character string In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...

for a null terminator. Doing this one byte at a time is inefficient when a 64-bit processor can operate on 8 bytes at a time. The same technique can be used to search for pathname separators or other delimiters, by

exclusive-or Exclusive or, exclusive disjunction, exclusive alternation, logical non-equivalence, or logical inequality is a logical operator whose negation is the logical biconditional. With two inputs, XOR is true if and only if the inputs differ (one ...

ing with the target byte value first. Some architectures include special instructions for performing 8 byte comparisons at once. For example, the

included a instruction for performing 8 byte compares at once. However, searching for a zero byte can be done without any special support. One way would be to OR together 8 bits in a manner much like the bit-counting example above: x2 = x , x<<1; x4 = x2 , x2<<2; x8 = x4 , x4<<4; byte_map = ~x8 & 0x8080808080808080; This results in a with a 1 bit in the most significant bit of any byte that was originally zero. However, this can be done more quickly by taking advantage of carry propagation using arithmetic operations. Adding (binary 01111111₂) to each byte causes a carry into bit 7 if the low 7 bits are non-zero. The challenge is ensure the carry propagation ''stops'' at bit 7 and does not affect other bytes. This can be done by working on the low 7 bits and high bit of each byte separately. First, extract the low 7 bits of each byte by

AND And or AND may refer to: Logic, grammar and computing * Conjunction, connecting two words, phrases, or clauses * Logical conjunction in mathematical logic, notated as "∧", "⋅", "&", or simple juxtaposition * Bitwise AND, a Boolean oper ...

ing with before adding : x7 = (x & 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f; Then combine with the most-significant bits: x8 = x7 , x; This value will have the msbit of each 8-bit field set to 1 if that byte is non-zero. Finally: byte_map = ~(x8 , 0x7f7f7f7f7f7f7f7f); will set all the unwanted low bits in each byte, then complement everything, leaving only 1 bits anywhere the corresponding input byte ''is'' zero. (This is equivalent to ~x8 & 0x80...80, but uses the same constant value.) If there are no 1 bits, the search can continue with the following word. If there are any 1 bits, the length of the string can be computed from their positions.

Further refinements

If the goal is limited to finding the ''first'' zero byte on a

little-endian '' Jonathan_Swift.html" ;"title="Gulliver's Travels'' by Jonathan Swift">Gulliver's Travels'' by Jonathan Swift, the novel from which the term was coined In computing, endianness is the order in which bytes within a word (data type), word of d ...

processor, it is possible to find the ''least significant'' zero byte in fewer operations, using two different constants: x7 = x − 0x0101010101010101; byte_map = x7 & ~x & 0x8080808080808080; For each byte ''b'', this sets its msbit in byte_map if the msbit of ''b'' − 1 is set and the msbit of ''b'' is clear, something that only happens if ''b'' = 0. The preceding statement is only true ''if there is no borrow in''; if there ''is'' a borrow, the condition will also be true if ''b'' = 1. However, such a borrow can only be generated by a less significant zero byte, so the ''least significant'' zero byte will be correctly identified, as desired. Not only does this save one binary operation, but they are not all sequentially dependent, so it can be performed in two cycles assuming the existence of an "and not" (bit clear) instruction

Small table lookups

As a generalization of a

bitmap In computing, a bitmap (also called raster) graphic is an image formed from rows of different colored pixels. A GIF is an example of a graphics image file that uses a bitmap. As a noun, the term "bitmap" is very often used to refer to a partic ...

, it is possible to store very small

lookup table In computer science, a lookup table (LUT) is an array data structure, array that replaces runtime (program lifecycle phase), runtime computation of a mathematical function (mathematics), function with a simpler array indexing operation, in a proc ...

s in a single register. For example, the number of days in a month varies from 28 to 31, a range of 4 values. This can be stored in 12×2 = 24 bits: days_table = 0xeefbb3 + (is_leap_year << 2); days_in_month = 28 + (days_table >> 2*month & 3); (This is assuming a 0-based month number. A 1-based month number can be accommodated by shifting the {{code, days_table.) The fact that the table fits neatly into one register makes it easy to modify for

leap year A leap year (also known as an intercalary year or bissextile year) is a calendar year that contains an additional day (or, in the case of a lunisolar calendar, a month) compared to a common year. The 366th day (or 13th month) is added to keep t ...

References

External links

The Aggregate - SWAR: SIMD Within A Register

Parallel computing SIMD computing

SWAR architectures

History of the SWAR programming model

Some applications of SWAR

Examples

Counting bits set

Further refinements

Finding zero bytes

Further refinements

Small table lookups

See also

References

External links