In CPU design, a sum-addressed decoder (SAD) or sum-addressed memory
(SAM) decoder is a method of reducing the latency of the cache access.
Overview
The L1 data cache is usually on the most critical timing path in a
CPU, because few things improve instructions per cycle (IPC) as
directly as a larger data cache, yet a larger data cache takes longer
to access, and pipelining the data cache makes IPC worse. One way of
reducing the
latency of the L1 data cache access is by fusing the address
generation sum operation with the decode operation in the cache SRAM.
The address generation sum operation still must be performed, because
other units in the memory pipe will use the resulting virtual address.
That sum will be performed in parallel with the fused add/decode
described here.
The most profitable recurrence to accelerate is a load, followed by a
use of that load in a chain of integer operations leading to another
load. If load results are bypassed with the same priority as integer
results, then this recurrence can be summarized as a load followed by
another load, as if the program were following a linked list.
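This load-to-load recurrence is the pattern of a linked-list walk. The following sketch is illustrative only (the class and function names are made up, not from any cited design); it shows why the cache's load-use latency is paid once per element:

```python
# Illustrative only: each iteration's "load" (node.next) produces the
# address consumed by the next iteration's load, so consecutive loads
# cannot overlap and L1 load-use latency accumulates once per element.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

def walk(head):
    """Sum the values along a linked list by pointer chasing."""
    total, node = 0, head
    while node is not None:
        total += node.value
        node = node.next  # dependent load: its result is the next address
    return total

head = Node(1, Node(2, Node(3)))
```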
The rest of this page assumes an instruction set architecture (ISA)
with a single addressing mode (register+offset), a virtually indexed
data cache, and sign-extending loads that may be variable-width. Most
RISC ISAs fit this description. In the examples that follow, the SRAM
is indexed by virtual address bits 13:3, giving 2048 word lines of
8 bytes each.

Sum-addressed cache: Collapse the adder and decoder

A traditional SRAM decoder takes the 11-bit index A[13:3] and drives
one of the 2048 word lines. For example, line 1026 (binary
10000000010) is driven when:

wordline[1026] = A[13] & ~A[12] & ~A[11] & ~A[10] & ~A[9] & ~A[8] & ~A[7] & ~A[6] & ~A[5] & A[4] & ~A[3]

Both the carry chain of the adder and the decoder combine information
from the entire width of the index portion of the address. Combining
information across the entire width twice is redundant. A
sum-addressed SRAM combines the information just once by implementing
the adder and decoder together in one structure.

Recall that the SRAM is indexed with the result of an add. Call the
summands R (for register) and O (for the offset to that register). The
sum-addressed decoder is going to decode R+O. For each decoder line,
call the line number L. Suppose that our decoder drove both R and O
over each decoder line, and each decoder line implemented:

wordline[L] = (R+O)==L

(R+O)==L <=> R+O-L==0 <=> R+O+~L+1==0 <=> R+O+~L == -1 == 11..1

A set of full adders can be used to reduce R+O+~L to S+C (this is
carry-save addition). S+C==11..1 <=> S==~C, so there will be no
carries in the final add. Note that since C is a row of carries, it is
shifted up one bit, so that

R[13:3] + O[13:3] + ~L[13:3] == {0,S[13:3]} + {C[14:4],0}
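This equivalence can be checked exhaustively at a small width. The following is an illustrative sketch (the 6-bit width and the function name are assumptions, not from any cited design):

```python
# Verify: reducing R + O + ~L with one layer of full adders to a sum
# word S and a carry word C, then testing S == ~C, decodes (R+O) == L.
N = 6                        # index width; the example in the text uses 11
MASK = (1 << N) - 1

def sum_addressed_match(r, o, l):
    nl = ~l & MASK                              # ~L, truncated to the index width
    s = r ^ o ^ nl                              # per-bit sums (no carry propagation)
    c = ((r & o) | (r & nl) | (o & nl)) << 1    # per-bit carries, shifted up one bit
    # S == ~C on every index bit <=> S+C == 11..1 <=> R+O+~L == -1.
    # The shifted-out carry (bit N) is discarded by the mask.
    return (s ^ c) & MASK == MASK

for r in range(1 << N):
    for o in range(1 << N):
        for l in range(1 << N):
            assert sum_addressed_match(r, o, l) == (((r + o) & MASK) == l)
print("match <=> (R+O)==L verified for all 6-bit inputs")
```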
With this formulation, each row in the decoder is a set of full adders
which reduce the base register, the offset, and the row number to a
carry-save format, and a comparator. Most of this hardware will be
proven redundant below, but for now it's simpler to think of it all
existing in each row.
Ignoring the LSBs: Late select on carry

The formulation above checks the entire result of an add. However, in
a CPU cache the low-order bits of the sum arrive late, because the
carry out of those bits must propagate into the index. Rather than
wait for that carry, the SRAM can be split into two banks, one holding
even-numbered lines and one holding odd-numbered lines. Each bank
matches its line numbers against the index sum while ignoring the
least-significant bit of the compare, and the late-arriving
least-significant index bit (bit 3 of the full sum) then selects
between the two fetched lines:

I[13:3]   even bank fetches line   odd bank fetches line
100       100                      101
101       110                      101
110       110                      111

Referring to the table above, the even bank will fetch line 110 when
I[13:3]==101 or I[13:3]==110. The odd bank will fetch line 101 when
I[13:3]==100 or I[13:3]==101.

Match generation

In general, the odd SRAM bank should fetch line Lo==2N+1 when either
I[13:3]==2N or I[13:3]==2N+1. The two conditions can be written as:

I[13:3] = Lo-1  =>  R[13:3] + O[13:3] + ~Lo + 1 = 11..11
                =>  R[13:3] + O[13:3] + ~Lo = 11..10
I[13:3] = Lo    =>  R[13:3] + O[13:3] + ~Lo = 11..11

Ignore the last digit of the compare: (S+C)[13:4] == 11..1

Similarly, the even SRAM bank fetches line Le==2N when either
I[13:3]==2N or I[13:3]==2N-1. The conditions are written as follows,
and once again ignore the last digit of the compare:

I[13:3] = Le-1  =>  R[13:3] + O[13:3] + ~Le = 11..10
I[13:3] = Le    =>  R[13:3] + O[13:3] + ~Le = 11..11

Gate-level implementation

    R13 ... R6 R5 R4 R3
    O13 ... O6 O5 O4 O3
    L13 ... L6 L5 L4 L3
    -------------------
    S13 ... S6 S5 S4 S3
C14 C13 ... C6 C5 C4

Before collapsing redundancy between rows, review: each row of each
decoder for each of the two banks implements a set of full adders
which reduce the three numbers to be added (R[13:3], O[13:3], and L)
to two numbers (S[13:3] and C[14:4]). The LSB (== S[3]) is discarded.
The carry out (== C[14]) is also discarded. The row matches if
S[13:4] == ~C[13:4], which is &(xor(S[13:4], C[13:4])).

It is possible to partially specialize the full adders to 2-input AND,
OR, XOR, and XNOR gates because the L input is constant. The resulting
expressions are common to all lines of the decoder and can be
collected at the bottom:

S0;i   = S(Ri, Oi, 0) = Ri xor Oi
S1;i   = S(Ri, Oi, 1) = Ri xnor Oi
C0;i+1 = C(Ri, Oi, 0) = Ri and Oi
C1;i+1 = C(Ri, Oi, 1) = Ri or Oi
At each digit position, there are only two possible Si, two possible
Ci, and four possible XORs between them:

Li=0 and Li-1=0: X0;0;i = S0;i xor C0;i = Ri xor Oi xor (Ri-1 and Oi-1)
Li=0 and Li-1=1: X0;1;i = S0;i xor C1;i = Ri xor Oi xor (Ri-1 or Oi-1)
Li=1 and Li-1=0: X1;0;i = S1;i xor C0;i = Ri xnor Oi xor (Ri-1 and Oi-1) = !X0;0;i
Li=1 and Li-1=1: X1;1;i = S1;i xor C1;i = Ri xnor Oi xor (Ri-1 or Oi-1) = !X0;1;i

One possible decoder for the example might calculate these four
expressions for each of the bits 4..13, and drive all 40 wires up the
decoder. Each line of the decoder would select one of the four wires
for each bit, and consist of a 10-input AND.
What has been saved?
A simpler data cache path would have an adder followed by a
traditional decoder. For our example cache subsystem, the critical
path would be a 14-bit adder, producing true and complement values,
followed by an 11-input AND gate for each row of the decoder. In the
sum-addressed design, the adder's carry chain is replaced by a
constant-depth layer of per-bit gates feeding a 10-input AND per row,
removing the carry-propagation delay from the critical path of the
cache access.
References

Paul Demone has an explanation of sum-addressed caches in a
realworldtech article.
Heald et al.[1] have a paper in ISSCC 1998 that explains what may be
the original sum-addressed cache in the
United States Patent 5,754,819, May 19, 1998, "Low-latency memory
indexing method and structure". Inventors: Lynch, William L. (Palo
Alto, CA) and Lauterbach, Gary R. (Los Altos, CA); Assignee: Sun
Microsystems, Inc. (Mountain View, CA). Filed: July 28, 1994.

At least one of the inventors named on a patent related to carry-free
address decoding credits the following publication:

Cortadella, Jordi; Llaberia, Jose M. (1992). "Evaluation of A + B = K
Conditions without Carry Propagation". IEEE Transactions on Computers.

The following patent extends this work to use redundant form
arithmetic throughout the processor, and so avoid carry propagation
overhead even in ALU operations, or when an ALU operation is bypassed
into a memory address:

United States Patent 5,619,664, "Processor with architecture for
improved pipelining of arithmetic instructions by forwarding redundant
intermediate data forms", awarded April 18, 1997. Inventor: Glew,
Andrew F. (Hillsboro, OR); Assignee:
^ Heald, R.; et al. (1998). "64 kB Sum-Addressed-Memory Cache with 1.6 ns Cycle and 2.6 ns Latency". ISSCC Digest of Technical Papers. pp. 350–351. doi:10.1109/ISSCC.