9.3 Function Units

Function units follow the generic organization shown in Figure 9.4. An operand may be the output of the unit's own final stage or the output of its left or right neighbor. Feeding a unit's output back to its input allows efficient execution of reduction operators like $\sum$ and $\prod$ and of polynomial terms like $Ax^{n}$. Nearest-neighbor connections capitalize on the short delay of local wires to implement chained operations in a manner similar to vector chaining. An operand may also arrive over the interconnect, in which case the transferred value is first latched in a register. This interconnect register can also hold semistatic operands, such as constants used to scale an operand stream. Several types of function units are used in this study.
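The operand selection described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the hardware described in this study: the names `Src`, `FunctionUnit`, `reduce_stream` and `poly_term` are hypothetical, and every operation is modeled as completing in one step, which ignores the real pipeline depth.

```python
import operator
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Src(Enum):
    """Operand sources named in the text (identifiers are hypothetical)."""
    SELF = "self"        # output of this unit's own final stage
    LEFT = "left"        # output of the left neighbor
    RIGHT = "right"      # output of the right neighbor
    ICN_REG = "icn_reg"  # value latched in the interconnect register


@dataclass
class FunctionUnit:
    op: Callable[[float, float], float]
    out: float = 0.0      # final-stage output
    icn_reg: float = 0.0  # interconnect register

    def step(self, a_src: Src, b_src: Src,
             left: float = 0.0, right: float = 0.0) -> float:
        # Assumption: one step per operation, no pipeline stages modeled.
        pick = {Src.SELF: self.out, Src.LEFT: left,
                Src.RIGHT: right, Src.ICN_REG: self.icn_reg}
        self.out = self.op(pick[a_src], pick[b_src])
        return self.out


def reduce_stream(fu: FunctionUnit, values: Iterable[float], init: float) -> float:
    """Reduction: each value arrives over the interconnect and is combined
    with the unit's own previous output."""
    fu.out = init
    for v in values:
        fu.icn_reg = v
        fu.step(Src.SELF, Src.ICN_REG)
    return fu.out


def poly_term(a: float, x: float, n: int) -> float:
    """A * x**n: x is held as a semistatic operand in the interconnect
    register while the output is fed back n times."""
    fu = FunctionUnit(op=operator.mul, out=a, icn_reg=x)
    for _ in range(n):
        fu.step(Src.SELF, Src.ICN_REG)
    return fu.out
```

For instance, `reduce_stream(FunctionUnit(operator.add), [1, 2, 3, 4], 0.0)` accumulates the stream through the feedback path, and `poly_term(3.0, 2.0, 4)` evaluates $3 \cdot 2^{4}$ with the multiplier's output forwarded to its own input.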

Figure 9.4: Function Unit Architecture

Integer ALUs perform common operations such as add, subtract and xor. ALUs also have compare instructions, which not only return a value but also set condition codes local to the particular ALU. Conditional move operations may be predicated on the condition codes set by a previous compare instruction to route one of the two ALU inputs to the output. This makes if-conversion and conditional data flows possible. All ALU operations have single-cycle latency.
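As a rough illustration of how a compare followed by a conditional move replaces a branch, consider the sketch below. The class `IntALU`, the condition-code names `eq` and `lt`, and the value returned by `cmp` are assumptions made for the example; the text does not specify the actual encoding.

```python
class IntALU:
    """Behavioral sketch of an ALU with local condition codes (hypothetical API)."""

    def __init__(self) -> None:
        # Condition codes are local to this ALU instance.
        self.cc = {"eq": False, "lt": False}

    def cmp(self, a: int, b: int) -> int:
        # Sets the local condition codes and also returns a value.
        # Assumption: the returned value is 1 when a < b, else 0.
        self.cc = {"eq": a == b, "lt": a < b}
        return int(a < b)

    def cmov(self, cond: str, a: int, b: int) -> int:
        # Routes one of the two inputs to the output, predicated on a
        # condition code set by an earlier compare.
        return a if self.cc[cond] else b


def branchless_max(alu: IntALU, a: int, b: int) -> int:
    """If-converted form of: if a < b: r = b else: r = a."""
    alu.cmp(a, b)
    return alu.cmov("lt", b, a)
```

The `if` statement in the original control flow becomes a data dependence on the condition codes, so the selection can proceed without any change in control flow.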

FPUs support floating-point add, subtract, multiply and compare operations, as well as integer to floating-point conversion. While the FPU is IEEE 754 compatible at its interfaces, for multiply operations it internally uses a reduced precision of 13 bits of mantissa, since the target applications work well with this precision [66]. Reduced precision in the multiplier yields significant area and energy savings. All FPU operations have a 7-cycle latency.
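The effect of a 13-bit mantissa can be approximated in software. The sketch below is an assumption-laden emulation rather than the multiplier's actual datapath: it takes "13 bits of mantissa" to mean 13 explicit fraction bits of an IEEE 754 single-precision value, it truncates rather than rounds, and it applies the reduction to both operands and to the result. The function names are hypothetical.

```python
import struct


def truncate_mantissa(x: float, bits: int = 13) -> float:
    """Convert x to IEEE 754 single precision and zero all but the top
    `bits` of its 23 fraction bits (truncation, an assumption)."""
    (word,) = struct.unpack("<I", struct.pack("<f", x))
    drop = 23 - bits
    word &= ~((1 << drop) - 1) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", word))
    return y


def reduced_mul(a: float, b: float, bits: int = 13) -> float:
    """Multiply with operands and result limited to `bits` fraction bits."""
    prod = truncate_mantissa(a, bits) * truncate_mantissa(b, bits)
    return truncate_mantissa(prod, bits)
```

Values whose significands fit in 13 fraction bits, such as 2.0 and 3.0, multiply exactly; for other values, such as `reduced_mul(1.1, 1.1)`, the relative error under these assumptions stays on the order of $2^{-12}$.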

Multiply units support 32-bit integer multiply operations with a 3-cycle latency.
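The latencies quoted for the three unit types can be collected into a small table to estimate the delay of a chain of dependent operations. This is a minimal sketch that assumes each operation waits for the previous result and that forwarding between units adds no extra cycles; the names `LATENCY` and `chain_latency` are hypothetical.

```python
# Latencies in cycles, as stated in the text for each unit type.
LATENCY = {"alu": 1, "fpu": 7, "mul": 3}


def chain_latency(units: list[str]) -> int:
    """Cycles for a chain of dependent operations, one per listed unit.
    Assumption: no interconnect or forwarding delay between units."""
    return sum(LATENCY[u] for u in units)
```

Under these assumptions an integer multiply followed by a dependent add costs `chain_latency(["mul", "alu"])` cycles.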

To illustrate the advantages of fine-grain pipeline control and modulo support, and to demonstrate the generality claims, no application-specific instructions have been added to the function units, with two exceptions: the reduced precision of floating-point multiplies, and byte select/merge instructions, which select an individual byte from a word. The latter are similar to the pack/unpack instructions in Intel's IA-64 architecture or the AL/AH register fields in the IA-32 architecture. These instructions significantly ease dealing with RGB images.
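A byte select/merge pair might behave as in the sketch below. The function names, the byte numbering (byte 0 is the least significant) and the `0x00RRGGBB` pixel layout are assumptions for illustration; the text states only that an individual byte is selected from a word, and the merge semantics shown here (writing one byte back into a word) are inferred from the name.

```python
def byte_select(word: int, idx: int) -> int:
    """Return byte `idx` (0 = least significant) of a 32-bit word."""
    return (word >> (8 * idx)) & 0xFF


def byte_merge(word: int, idx: int, byte: int) -> int:
    """Return `word` with byte `idx` replaced by `byte` (assumed semantics)."""
    shift = 8 * idx
    cleared = word & ~(0xFF << shift) & 0xFFFFFFFF
    return cleared | ((byte & 0xFF) << shift)


def split_rgb(pixel: int) -> tuple[int, int, int]:
    """Unpack a pixel stored as 0x00RRGGBB (assumed layout) into (r, g, b)."""
    return byte_select(pixel, 2), byte_select(pixel, 1), byte_select(pixel, 0)
```

With this layout, `split_rgb(0x00112233)` separates the three color channels so that each can be processed as an independent operand stream, and `byte_merge` reassembles a pixel from processed channels.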

Binu Mathew