SiliconIntelligence

10.4.8 The Cost of Generality

It could be argued that the perception processor achieves impressive power savings only because it lacks the generality of the Pentium or the XScale. The perception processor is believed to be Turing complete, since it has instructions for integer arithmetic, comparisons, conditional moves, loads, stores, and direct and indirect branches. However, Turing completeness is no measure of the ability to execute arbitrary programs efficiently. While the perception processor could be modified for efficiency in the general case by traditional means such as adding caches and branch prediction, consider the simpler alternative of using a perception processor to augment a general-purpose processor. The generic sections of perception applications run on a host processor, and the perception-specific algorithms run on a perception processor attached to that host. How efficient could such an organization be?

Consider the case where the host processor is an XScale. This scenario represents a complete system, since the XScale contains its own memory controller. Additional interface circuits will be required between the XScale processor core, the memory controller and the perception processor. However, such circuitry is likely to be a very small portion of the hardware of the complete system and should not significantly affect the results presented here. The XScale is also ill-suited to this application, since it consumes too much power for its performance level and possesses more generality than is needed. A low-power DSP might be a better choice of host processor. Choosing an inefficient host processor, however, makes the results presented in this section very conservative.

Figure 10.8: Energy Consumption of PP+
\includegraphics[width=1\columnwidth]{graphs/cluster_results/cluster_energy_flat}

Figure 10.9: Energy Delay Product of PP+
\includegraphics[width=1\columnwidth]{graphs/cluster_results/cluster_edp_flat}

Figure 10.2 shows that the process-normalized peak power consumptions of the XScale and the perception processor are 0.675 W and 0.757 W respectively. Consider a chip multiprocessor called PP+ consisting of an XScale core and a perception processor on the same die. PP+ will then have a peak power consumption of approximately 1.4 W. To make the results conservative, assume that PP+ consumes 1.4 W for all the benchmarks, even though in reality the application-specific power savings will be significant. Figure 10.8 shows the energy consumed by PP+ to process each input packet. In spite of the addition of a host processor, PP+ has a significantly lower energy consumption than the XScale and the Pentium. This is because energy is the integral of power over time: even though PP+ has a higher power consumption than the XScale, its superior performance lets it complete tasks faster and thus consume less energy. In particular, PP+ consumes 5.5 and 53.6 times less energy per packet than the XScale and the Pentium respectively. It is only a factor of 12.4 worse than the ASIC implementations.
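The arithmetic behind this conservative estimate can be sketched as follows. The two peak power figures are taken from the text above; the function names and any per-packet processing times passed to them are hypothetical placeholders for illustration, not measured values from the benchmarks.

```python
# Peak power figures quoted in the text (process normalized, in watts).
XSCALE_PEAK_W = 0.675
PP_PEAK_W = 0.757


def pp_plus_peak_power():
    """Conservative PP+ power: both cores assumed to run at peak all the time."""
    return XSCALE_PEAK_W + PP_PEAK_W


def energy_per_packet(power_w, seconds_per_packet):
    """Energy in joules, assuming constant power over the processing time."""
    return power_w * seconds_per_packet


def energy_ratio(base_power_w, base_time_s, new_power_w, new_time_s):
    """How many times less energy the new design uses than the baseline."""
    base = energy_per_packet(base_power_w, base_time_s)
    new = energy_per_packet(new_power_w, new_time_s)
    return base / new
```

With constant power, the integral of power over time reduces to a product, so a design that draws about twice the power still wins on energy whenever it finishes each packet more than about twice as fast.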

Figure 10.9 shows the energy delay product of PP+. Since the power consumption of PP+ is roughly twice the power consumed by the perception processor alone, the energy delay product is expected to be a scaled version of Figure 10.5. This is indeed the case: PP+ outperforms the XScale and the Pentium by factors of 64.1 and 93.6 respectively, and under-performs the ASIC implementations by a factor of 30. The results clearly demonstrate the benefit of using perception processors as coprocessors to general-purpose processors.
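The scaling argument can be illustrated with a minimal sketch, assuming constant power and assuming that the host processor adds no processing time of its own. Under those assumptions the energy delay product is power times delay squared, so at equal delay the ratio of two energy delay products is simply the ratio of the two power figures. The function names and arguments are hypothetical.

```python
def energy_delay_product(power_w, delay_s):
    """Energy delay product in joule-seconds, assuming constant power."""
    energy_j = power_w * delay_s
    return energy_j * delay_s


def edp_scale_factor(combined_power_w, coprocessor_power_w, delay_s=1.0):
    """Ratio of the combined system's EDP to the coprocessor's EDP.

    Both are evaluated at the same delay, so the result equals the
    ratio of the two power figures regardless of the delay chosen.
    """
    combined = energy_delay_product(combined_power_w, delay_s)
    alone = energy_delay_product(coprocessor_power_w, delay_s)
    return combined / alone
```

Because the delay cancels out, every bar in the energy delay plot is multiplied by the same constant, which is why the shape of the earlier figure is preserved.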



Binu Mathew