6.5.1 Energy Savings

The Spice simulation results for the fully synthesized coprocessor architecture were compared against an actual 2.4 GHz Pentium 4 system that was modified to allow accurate measurement of processor power. Excluding the power consumed by main memory, the GAU accelerator consumed 1.8 W while the Pentium 4 consumed 52.3 W during Gaussian computation, a 29-fold improvement. The Pentium 4 system exceeded real-time demands by a factor of 1.6, while the coprocessor exceeded them by a factor of 1.55. However, the Pentium 4 is implemented in a highly tuned $0.13\mu$ process, whereas the GAU accelerator was automatically synthesized for a generally available TSMC $0.25\mu$ process. Normalizing for this process difference increases the advantage of the GAU coprocessor significantly: its throughput is 187% higher than that of the Pentium 4, and it consumes 271 times less energy. Because energy consumption can commonly be traded off against performance, the energy-delay product is a more valid basis for comparison. The GAU coprocessor improves upon the energy-delay product of the Pentium 4 processor by a factor of 507.
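The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the procedure used to produce the measurements: the function and constant names are hypothetical, and treating the normalized throughput figure as a ratio of roughly 1.87 is an assumption, chosen because it is the value that reproduces the quoted energy-delay factor of 507.

```python
# Measured processor-only power during Gaussian computation (from the text).
P4_POWER_W = 52.3
GAU_POWER_W = 1.8

# Process-normalized factors quoted in the text.
ENERGY_FACTOR = 271.0      # GAU consumes 271 times less energy
THROUGHPUT_RATIO = 1.87    # ASSUMPTION: throughput figure read as a ~1.87x ratio


def power_improvement(baseline_w: float, improved_w: float) -> float:
    """Ratio of baseline power to improved power."""
    return baseline_w / improved_w


def energy_delay_improvement(energy_factor: float, speed_factor: float) -> float:
    """Energy-delay product is energy * delay, so its improvement factor is
    the energy improvement multiplied by the delay (speed) improvement."""
    return energy_factor * speed_factor


power_ratio = power_improvement(P4_POWER_W, GAU_POWER_W)
edp_factor = energy_delay_improvement(ENERGY_FACTOR, THROUGHPUT_RATIO)

print(f"power improvement:        {power_ratio:.1f}x")
print(f"energy-delay improvement: {edp_factor:.0f}x")
```

Under these assumptions the unnormalized power ratio comes out near 29 and the energy-delay factor near 507, matching the values stated above.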

However, the processor is only part of any system. Main memory is an important consideration as well, and its contribution includes the power dissipated by the memory controller, the DRAM chips and the memory bus, which is difficult to estimate accurately. Since the XScale processor has an on-chip memory controller, the power consumed by an XScale system accessing DRAM at peak bandwidth was measured. The main memory component of the power consumed by Gaussian computation was then calculated from that measurement at the rate of 0.47 W per 64 MB/s of DRAM bandwidth. When memory is included, the GAU coprocessor improves upon the Pentium 4's energy-delay product by a factor of 196 and has an energy advantage of a factor of 104, while the throughput remains the same as in the processor-only results.
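The memory power estimate described above amounts to a linear scaling by DRAM bandwidth. The sketch below illustrates that scaling and checks that the quoted factors are mutually consistent. The function names are hypothetical, and the 128 MB/s bandwidth in the usage line is an arbitrary illustrative value, not a figure from the measurements.

```python
# Rate derived from the XScale measurement (from the text).
WATTS_PER_UNIT = 0.47      # W
UNIT_BANDWIDTH_MB_S = 64.0  # MB/s of DRAM bandwidth per 0.47 W


def memory_power_w(bandwidth_mb_s: float) -> float:
    """Estimated main memory power, scaled linearly with DRAM bandwidth."""
    return WATTS_PER_UNIT * bandwidth_mb_s / UNIT_BANDWIDTH_MB_S


def implied_speed_factor(edp_factor: float, energy_factor: float) -> float:
    """Energy-delay improvement divided by energy improvement leaves the
    delay (throughput) improvement."""
    return edp_factor / energy_factor


# Hypothetical bandwidth, for illustration only.
example_power = memory_power_w(128.0)

# Throughput factor implied by the quoted numbers, with and without memory.
speed_processor_only = implied_speed_factor(507.0, 271.0)
speed_with_memory = implied_speed_factor(196.0, 104.0)

print(f"memory power at 128 MB/s: {example_power:.2f} W")
print(f"implied speed factor (processor only): {speed_processor_only:.2f}")
print(f"implied speed factor (with memory):    {speed_with_memory:.2f}")
```

Both implied speed factors come out close to 1.9, which agrees with the statement that throughput is unchanged when memory is included; the small difference between them is attributable to rounding in the quoted factors.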

A Pentium 4 was used as the comparison because embedded processors like the XScale have neither the floating-point instructions nor the performance required for the benchmarks. Software-emulated floating point could inflate the energy-delay product of the XScale and make a meaningful comparison impossible. Another reason for the choice was simply the technical feasibility of measuring processor power. For example, the Intel XScale development platform used in this research had a processor module board with an FPGA, Flash memory, etc., integrated on it, and isolating the processor power was difficult. The particular Pentium 4 system was chosen because the layout of its printed circuit board allowed modifications that made it possible to measure the energy consumption of the processor core alone.

Binu Mathew