1.1 The Problem

By their very nature, perception applications are likely to be most useful in mobile embedded systems. A fundamental problem that plagues these applications is that they require significantly more performance than current embedded processors can deliver. Most embedded and low-power processors, such as the Intel XScale, do not have the hardware resources and performance necessary to support a full-featured speech recognizer. Even modern high-performance microprocessors are barely able to keep up with the real-time requirements of sophisticated perception applications. The energy consumption that accompanies the required performance level is often orders of magnitude beyond typical embedded power budgets. This dissertation attempts to develop a specialized processor architecture that can provide high performance for perception applications in an energy-efficient manner.

Figure 1.1 shows the measured performance of two perception applications: CMU Sphinx 3, a speech recognition system, and FaceRec, a face recognition application. The applications were run on Intel Pentium III and later processors with clock speeds varying from 900 MHz to 3 GHz. Details of these applications are presented in Chapters 4 and 7. The horizontal lines show the performance level required to achieve real-time targets. For the speech recognizer, this involves recognizing a 29.2-second speech recording in no more than 29.2 seconds. The workload for the face recognizer consists of processing 25 image frames in 5 seconds, corresponding to the real-time target of handling 5 frames of $320\times200$ pixel images every second.

Figure 1.1: Perception Performance

Each of the smooth curves in the figure corresponds to the hyperbola obtained by assuming ideal scaling of performance with frequency. The curves are derived by starting with the data point corresponding to 900 MHz and assuming that run time varies inversely with frequency. It is evident that for speech recognition, the performance of the processor does not scale ideally. In theory, a 2.4 GHz processor should achieve real-time performance. In practice, a processor frequency of approximately 2.9 GHz is required to satisfy real-time requirements. This performance gap suggests that when moving to more complex future speech recognition workloads, higher frequencies alone are not the solution. Fundamental architectural improvements are called for. The face recognizer demands a higher level of performance than is currently available: its real-time requirements demand a 4.8 GHz or faster processor. The complexity of both workloads is likely to increase significantly in the future. The results clearly show that perception applications stress the performance limits of high-end processors, and that low-power embedded processors may never have the compute power required for perception applications.
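The ideal-scaling curves described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the 900 MHz run time is not a measured value quoted in the text but is back-derived here from the statement that a 2.4 GHz processor would ideally meet the 29.2-second speech target.

```python
def ideal_runtime(base_runtime_s, base_freq_mhz, freq_mhz):
    """Run time predicted at freq_mhz if run time varies inversely with frequency."""
    return base_runtime_s * base_freq_mhz / freq_mhz


def realtime_frequency(base_runtime_s, base_freq_mhz, deadline_s):
    """Frequency at which the ideal-scaling hyperbola crosses the real-time deadline."""
    return base_freq_mhz * base_runtime_s / deadline_s


BASE_FREQ_MHZ = 900.0      # starting data point for the curves (from the text)
SPEECH_DEADLINE_S = 29.2   # real-time target for the speech workload (from the text)

# Assumption: implied 900 MHz run time, back-derived from the 2.4 GHz ideal
# crossover stated in the text. It is NOT a measurement reported in the text.
implied_base_runtime_s = SPEECH_DEADLINE_S * 2400.0 / BASE_FREQ_MHZ

# Ideal crossover frequency versus the roughly 2.9 GHz observed in practice.
ideal_mhz = realtime_frequency(implied_base_runtime_s, BASE_FREQ_MHZ, SPEECH_DEADLINE_S)
shortfall = 2900.0 / ideal_mhz  # how much extra frequency non-ideal scaling costs

print(f"implied 900 MHz run time: {implied_base_runtime_s:.1f} s")
print(f"ideal real-time frequency: {ideal_mhz:.0f} MHz")
print(f"observed / ideal frequency: {shortfall:.2f}")
```

Under these assumptions, the ratio of the observed to the ideal crossover frequency gives a rough measure of how far the speech recognizer departs from ideal frequency scaling.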

Given Moore's law performance scaling, the performance issue is not by itself a critical problem. However, two significant problems remain. First, the energy expended in high-performance processors is intractable in the embedded space. Furthermore, the power requirements of new processors are increasing. The conclusion is that technology scaling alone cannot solve the problems faced by perception applications. Second, perception and security interfaces are by nature always operational. This limits the processor's availability for other compute tasks, such as understanding what was perceived.

The usual solution for reducing power consumption while increasing performance is to use an Application Specific Integrated Circuit (ASIC). Given the complexity and the always-on nature of perception tasks, a more relevant approach would be to use the ASIC as a coprocessor in conjunction with a low-power host processor. As a part of this research, an ASIC coprocessor for one of the dominant phases of the CMU Sphinx speech recognition system was investigated. Details may be found in Chapter 6. This effort led to the usual realization that ASICs are costly and inflexible. Their high fabrication cost, coupled with the costs associated with a lengthy design cycle, is difficult to amortize. The inherent level of specialization in an ASIC makes it extremely difficult to support multiple applications, new methods, or even evolutionary algorithmic improvements. Given that embedded applications evolve rapidly and that embedded systems are extremely cost sensitive, these problems provide significant motivation to explore a more general-purpose approach. The use of reconfigurable logic and FPGA devices is another common approach [31]. The inherent reconfigurability of FPGAs provides a level of specialization while retaining significant generality. However, the reconfiguration time is relatively long, and FPGAs have a significant disadvantage in both performance and power when compared to either ASIC or CPU logic functions.

Binu Mathew