9.4 Compiler Controlled Dataflow

As CMOS technology scales, wire delays grow relative to logic delays. The cluster interconnect reflects the belief that future architectures will need to address communication explicitly at the ISA level. Traditional architectures are based on implicit communication. For example, the MIPS instruction $addi\,\,r1,\,r2,\,10$ depends on the hardware to track the location where the operand $r2$ was last present and to transfer it to where it is consumed. That location could be a renamed register or a pipeline stage. In a wide-issue clustered processor, it is advantageous to source the operands of a function unit from nearby function units to hide the effects of long wire delays. This is possible only if the compiler explicitly orchestrates communication, and in the perception processor the compiler orchestrates all of it. In the example above, the compiler would pick a function unit to execute the $addi$ instruction, transfer the output of the function unit that last produced the value corresponding to the $r2$ operand to the $A$ input of the chosen function unit, transfer the constant $10$ to the $B$ input, and schedule the addition for the cycle in which both inputs are available.

In the perception processor, pipeline registers at the interfaces of every unit, including function units and SRAM ports, are named and accessible to software. Data is explicitly transferred from the output pipeline register of a producer to the input registers of its consumers. Unlike traditional architectures, where pipelines shift under hardware control, a compiler for the perception processor can use clock gating to control pipeline shifting and thereby control the lifetime of values held in pipeline registers. This ensures that a result stays live until all its consumers have received a copy. This explicit management of result lifetime and communication is called compiler controlled dataflow.
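The $addi$ example above can be sketched as a small simulation. This is an illustration only, not the perception processor's actual microcode format: the `Unit` class, the `run` function, the schedule layout, and the unit names `fu0`, `fu1`, `fu2` are all hypothetical. Each cycle lists explicit routes between named pipeline registers and the units that are clocked; a unit that is not listed under `execute` is treated as clock gated, so its output register keeps its value for later consumers.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical function unit with software-visible pipeline registers."""
    a: int = 0    # input pipeline register A
    b: int = 0    # input pipeline register B
    out: int = 0  # output pipeline register


OPS = {"add": lambda a, b: a + b}


def run(units, schedule):
    """Apply a compiler-generated schedule, one dict per cycle."""
    for cycle in schedule:
        # Read phase: sample all registers as they stand at the start of the cycle.
        results = {name: OPS[op](units[name].a, units[name].b)
                   for name, op in cycle.get("execute", {}).items()}
        moves = []
        for src, dst, port in cycle.get("route", []):
            # An integer source models a constant from the microinstruction.
            value = src if isinstance(src, int) else units[src].out
            moves.append((dst, port, value))
        # Write phase: only clocked units update their output register;
        # every other unit is gated and holds its previous result.
        for name, value in results.items():
            units[name].out = value
        for dst, port, value in moves:
            setattr(units[dst], port, value)
    return units


# Assume fu0 last produced the value of r2 (32 here, chosen arbitrarily).
units = {"fu0": Unit(out=32), "fu1": Unit(), "fu2": Unit()}
schedule = [
    # Cycle 0: route r2 to input A of fu1 and the constant 10 to input B.
    {"route": [("fu0", "fu1", "a"), (10, "fu1", "b")]},
    # Cycle 1: fu1 adds; fu0 stays gated so a second consumer can still read r2.
    {"execute": {"fu1": "add"}, "route": [("fu0", "fu2", "a")]},
]
run(units, schedule)
print(units["fu1"].out, units["fu2"].a, units["fu0"].out)
```

In this sketch the value in `fu0.out` survives both cycles because nothing clocks `fu0`, which mirrors how gating lets the compiler keep a result live until every consumer has a copy.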

Explicit communication makes it possible to overlap communication with computation with almost no hardware overhead. A significant number of bits in the wide microinstruction word are devoted to controlling the interconnect. While the interconnect can be controlled on a cycle-by-cycle basis, the compiler may elect to dedicate certain interconnect muxes to flows on a longer-term basis. For example, while adding two vectors, it is possible to dedicate a separate interconnect mux to each of the two operands for the duration of the vector addition. The compiler also attempts operand isolation, i.e., it tries to set unused muxes to states that reduce the amount of activity visible to the rest of the circuitry, which lowers power consumption.
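The two ideas in this paragraph, dedicated muxes and operand isolation, can be illustrated with a toy model. Everything here is an assumption made for the sketch: the source names `sram0`, `sram1`, and `zero`, the helper functions, and the use of bit flips at a mux output as a stand-in for switching activity. The real interconnect and its power behavior are not modeled.

```python
def mux_trace(sources, selects):
    """Value seen at a mux output each cycle, given one select per cycle."""
    return [sources[sel][t] for t, sel in enumerate(selects)]


def toggles(trace):
    """Bit flips between successive values: a rough proxy for activity."""
    return sum(bin(x ^ y).count("1") for x, y in zip(trace, trace[1:]))


n = 4
sources = {
    "sram0": [1, 2, 3, 4],      # first vector, one element per cycle
    "sram1": [10, 20, 30, 40],  # second vector
    "zero": [0] * n,            # assumed quiet input that never changes
}

# Dedicated flows: the select fields for the adder's A and B muxes hold the
# same value in every microinstruction of the vector addition.
a_trace = mux_trace(sources, ["sram0"] * n)
b_trace = mux_trace(sources, ["sram1"] * n)
sums = [a + b for a, b in zip(a_trace, b_trace)]

# An unused mux feeding an idle unit: left pointing at a busy source, it
# forwards every transition; pointed at the quiet input, it forwards none.
idle_naive = mux_trace(sources, ["sram0"] * n)
idle_isolated = mux_trace(sources, ["zero"] * n)

print(sums)
print(toggles(idle_naive), toggles(idle_isolated))
```

The comparison of the two idle traces shows why the choice of select state for an unused mux matters: the isolated setting presents no transitions to the downstream logic.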

Binu Mathew