3.2 Power Reduction Strategies

Equations 3.1 and 3.3 point to several power reduction strategies. For instance, power consumption can be reduced by increasing IPC. However, modern dynamically scheduled processors also increase the value of $C$ when they increase IPC, because they introduce large reorder buffers, complex cache structures, register renaming, and support for speculative execution. Architectures that provide high IPC without an inordinate rise in the value of $C$ will therefore consume less power. This can be achieved, at the cost of generality, by using simple application-domain-specific ILP-enhancing mechanisms and by taking advantage of compiler-driven static ILP improvements. Increasing the issue width does raise power consumption somewhat because of the wider structures needed to support multiple issue. However, since most of the ILP extraction is done at compile time, and because the additional logic can be tailored to take advantage of domain-specific optimizations, the strategy yields a net power saving.

Another architectural means of reducing power consumption is to decrease the activity factor $A$. Clock gating provides one method of reducing the activity factor [96]. Designing structures that prevent activity in one part from being visible in other parts is another useful technique. A typical example is the forwarding paths of a superscalar microprocessor. A forwarding mux connected to the output of a function unit makes the value changes occurring in the final stage of that unit visible at the inputs of other function units, even when the receiving units do not need the forwarded value. This leads to unnecessary switching activity and power dissipation at the receiving side. When the forwarding path is not needed, the mux select signals can be manipulated so that unnecessary value changes are not visible at the receiving side. This strategy, called operand isolation, was used in the IBM PowerPC 4xx embedded controllers [27]. Operand isolation under compiler control is used as a power saving strategy for the perception processor described in Chapter 9.
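The effect of operand isolation on switching activity can be illustrated with a small toggle-counting sketch. It is not a model of any particular processor: the names `toggles` and `receiver_inputs` are hypothetical, and the isolated mux is assumed to hold its previous input value, whereas real hardware might instead select a constant or another stable source.

```python
def toggles(values):
    """Total number of bit flips between consecutive values on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def receiver_inputs(producer_outputs, needed, isolate):
    """Values seen at a receiving unit's input in each cycle.

    needed[i] is True when the receiver actually consumes the forwarded
    value in cycle i. With isolate=True the input is assumed to hold its
    previous value whenever the forwarded value is not needed.
    """
    seen = []
    held = 0  # assumed reset value of the receiver's input
    for value, use in zip(producer_outputs, needed):
        if use or not isolate:
            held = value
        seen.append(held)
    return seen


# Hypothetical trace: the producer's output flips all eight bits every
# cycle, but the receiver only needs the first and last values.
outputs = [0x0F, 0xF0, 0x0F, 0xF0]
needed = [True, False, False, True]

without_isolation = toggles(receiver_inputs(outputs, needed, isolate=False))
with_isolation = toggles(receiver_inputs(outputs, needed, isolate=True))
```

In this trace the receiver's input sees three full 8-bit transitions without isolation and only one with it, which is the reduction in the activity factor $A$ that the technique targets.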

Lowering the ideal operating frequency also permits the use of a lower supply voltage, which results in power savings. If frequency is directly proportional to supply voltage, Equation 3.1 predicts cubic power reduction. However, in reality, $f\propto\frac{(V-V_{t})^{K_{ds}}}{V}$ where $K_{ds}$ is a device saturation constant whose value ranges from zero to two when velocity saturation is not explicitly modeled [12]. Considering this relationship, quadratic or linear power savings may be obtained by lowering the supply voltage and operating frequency. This strategy capitalizes on the results produced by researchers exploring ideal voltage selection and voltage scaling [76]. Equation 3.1 applies only within a narrow, process-specific supply voltage range.
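The dependence of the savings on $K_{ds}$ can be checked numerically. The sketch below assumes that Equation 3.1 has the usual dynamic power form $P \propto A C V^{2} f$ with $A$ and $C$ held fixed; that form, the function names, and the sample voltages are assumptions made for illustration only.

```python
def relative_frequency(v, v_t, k_ds):
    """Frequency up to a constant factor: f ~ (V - V_t)^K_ds / V."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (v - v_t) ** k_ds / v


def relative_power(v, v_t, k_ds):
    """Dynamic power up to a constant factor: P ~ V^2 * f (assumed form)."""
    return v ** 2 * relative_frequency(v, v_t, k_ds)


def power_ratio(v_low, v_high, v_t, k_ds):
    """Power at v_low relative to power at v_high, at the matching frequencies."""
    return relative_power(v_low, v_t, k_ds) / relative_power(v_high, v_t, k_ds)


# Halving the supply voltage with a negligible threshold voltage (V_t = 0)
# isolates the effect of the exponent K_ds.
cubic = power_ratio(0.5, 1.0, v_t=0.0, k_ds=2)      # 0.125
quadratic = power_ratio(0.5, 1.0, v_t=0.0, k_ds=1)  # 0.25
linear = power_ratio(0.5, 1.0, v_t=0.0, k_ds=0)     # 0.5
```

Under these assumptions $P \propto V(V-V_{t})^{K_{ds}}$, so with $V_{t}$ negligible the savings are cubic for $K_{ds}=2$, quadratic for $K_{ds}=1$, and linear for $K_{ds}=0$. A nonzero $V_{t}$ shifts these ratios, which the same functions can be used to explore.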

Ultimately, the average IPC available in an application is limited by the dependences between instructions. Further improvements may be obtained by multithreading the application, in which case $IPC_{avg}$ in Equation 3.3 corresponds to the aggregate IPC of the individual threads. Traditional high performance multiprocessors exact a high energy price because of the complexities of memory system coherence and interthread communication. By tailoring a multiprocessor system to the information flow and synchronization patterns found in perception applications, it is possible to design simple architectures that provide sufficient generality for the perception domain.

Perception applications are usually stream oriented. They consist of a pipeline of algorithms, most of which are compute and memory intensive. Each phase typically touches and discards a large data set in a block-oriented manner, i.e., several input blocks and a few blocks of local state are consulted to compute a block of output. There is little or no reuse of the high bandwidth input data, which consists of both input signals and massive knowledge bases that are too large to cache on-chip. One or more phases may be executed on a processor, and multiple processors may be connected in a pipelined fashion for efficient interphase communication while harvesting thread level parallelism.
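The block-oriented, stream-style structure described above can be sketched with chained generators. The two phases here (`windowed_sum` and `running_max`) are hypothetical placeholders rather than actual perception algorithms. They only illustrate how each phase consults a few recent input blocks plus a small amount of local state, emits one output block, and never revisits older input.

```python
from collections import deque


def blocks(signal, block_size):
    """Split an input stream into fixed-size blocks."""
    for i in range(0, len(signal), block_size):
        yield signal[i:i + block_size]


def windowed_sum(block_stream, window):
    """Phase 1: combine the last `window` input blocks into one output block."""
    recent = deque(maxlen=window)  # older blocks are dropped, never revisited
    for block in block_stream:
        recent.append(block)
        if len(recent) == window:
            yield [sum(column) for column in zip(*recent)]


def running_max(block_stream):
    """Phase 2: keep one block of local state and emit an updated block per input."""
    state = None
    for block in block_stream:
        if state is None:
            state = list(block)
        else:
            state = [max(s, b) for s, b in zip(state, block)]
        yield list(state)


# Phases are chained so that each block flows from one phase to the next.
signal = list(range(8))
pipeline = running_max(windowed_sum(blocks(signal, block_size=2), window=2))
result = list(pipeline)
```

Each generator could in principle be mapped to its own processor, with the block stream between generators standing in for the interphase communication channel.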

Binu Mathew