2.1 Optimization and Characterization of
Perception Applications

Perception processing, which encompasses a wide range of topics like computer vision, speech recognition and gesture recognition, is currently the focus of vigorous research. While it is common in the literature to see the relative merits and performance of algorithms compared, architecture level analysis of whole perception applications is extremely rare. Traditional research in perception has been geared towards improving accuracy. Performance is a secondary goal, and power efficiency has been largely ignored. For instance, the yearly Hub speech recognition evaluation reports typically emphasize improvements in recognition accuracy and mention improvements in performance as a multiple of ``slow down over real time'' [30,92].

Ravishankar improved the performance of the Sphinx speech recognition system by trading accuracy in a computationally intensive phase for faster run time, then recovering the lost accuracy through additional processing in a computationally cheaper phase of the application [74]. The same work also reduced the memory footprint of speech recognition by using a disk based language model cached in memory by the software.

Agaram, Burger and Keckler characterized the Sphinx II speech recognition system in a manner useful for computer architects [6]. They focused on ILP as well as memory system characteristics such as cache hit rates and block sizes, and concluded that the available ILP is low. They compared the characteristics of the Sphinx II system with those of the SPEC benchmarks and also hinted at the opportunities and problems involved in exploiting thread level parallelism.

Researchers at the Intel ICRC labs published a performance analysis of a speech recognition system for Mandarin Chinese [59]. This study focused on the run time and working set size of the Intel speech recognition system on several different versions of the x86 processor. They reported a decrease in IPC with increased clock rate: IPC fell from between 1 and 1.2 at 500 MHz to approximately 0.4 at 1.5 GHz, a clear indication that increasing the clock rate is not the solution to improving speech recognition performance. The decrease in IPC was attributed to memory system behavior, but a detailed explanation was not provided. The ICRC speech system is not publicly available, but the underlying semicontinuous HMM technique is the same as that used by Sphinx. An experiment reported by the Intel researchers claimed faster than real time recognition: 1.14 times faster than real time on a 1 GHz processor and 1.33 times faster than real time on a 1.5 GHz Pentium 4 processor. In contrast, the results from Figure 1.1 show that Sphinx 2.1 is 2.5 times and 1.5 times slower than real time on 1 GHz and 1.8 GHz Intel Pentium processors respectively. It is possible that the workload and vocabulary used by the Intel researchers were considerably simpler than those used with Sphinx. Ravishankar reported that for Sphinx II, the language model search consumed about 40% of the recognition time [74], whereas for the Intel researchers the language model search is a very small fraction of the execution time. Details of the ICRC speech model are not available, and the large performance gap between Sphinx and the numbers published by ICRC is possibly because the ICRC speech model is simpler than the Hub-4 speech model used to evaluate Sphinx.
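The ``times real time'' metric used in these comparisons is simply the ratio of processing time to audio duration. A minimal sketch of the arithmetic (the function name and sample numbers below are illustrative assumptions, not figures from the cited studies):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return how many times slower than real time a recognizer runs.

    A value below 1.0 means the recognizer is faster than real time;
    a value above 1.0 means it cannot keep up with live audio.
    """
    return processing_seconds / audio_seconds

# Example: a decoder that needs 25 s to recognize a 10 s utterance
# runs 2.5 times slower than real time.
print(real_time_factor(25.0, 10.0))  # 2.5
```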

Rabiner and Huang provide data on historical trends in the compute requirements of continuous speech recognition. They predict that in the post-2000 time frame it will require the compute power of 20 to 60 DSP processors, each delivering 1000 MIPS [79]. No published work on the power consumption characteristics of speech recognition is known to exist at this time.

Compared to speech recognition, the algorithms used for perceptual computer vision are far more diverse, and workload characterization results are almost nonexistent. The problem is exacerbated by the fact that research is split into image understanding applications like automatic navigation and nonunderstanding applications like face recognition and detection. A large volume of existing research emphasizes the parallelization and hardware acceleration of early vision primitives like convolution, thresholding, segmentation and connected component labeling [9,105,107]. Toolkits like XVision and the Intel computer vision library provide optimized versions of such vision primitives [43,52]. While there is some consensus on early vision primitives for image understanding, there is very little agreement and commonality in the higher level aspects of computer vision. Specialized systems exist for the inspection of manufacturing defects and for robot and vehicle navigation, but they tend to be highly domain specific. Representative examples are commercial offerings by companies such as Cognex and Coreco, which provide application specific software for industrial applications such as visual inspection, security monitoring, motion detection, etc. [1,2]. In contrast, nonunderstanding computer vision applications have more in common with each other, and complete applications are more readily available. They are also synergistic: face detection and lip tracking can augment speech recognition and improve recognition accuracy [102].
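The early vision primitives mentioned above are small, regular kernels, which is what makes them attractive targets for parallelization and hardware acceleration. A minimal sketch of two of them, thresholding and a $3\times3$ convolution, in plain Python (illustrative only; toolkits like the Intel computer vision library use heavily optimized implementations):

```python
def threshold(image, t):
    """Binarize a grayscale image: 1 where the pixel is >= t, else 0."""
    return [[1 if p >= t else 0 for p in row] for row in image]

def convolve3x3(image, kernel):
    """Valid-mode 3x3 filtering (no padding) over a 2-D list image.

    As in many vision libraries, the kernel is applied without flipping
    (strictly a cross-correlation), which is identical to convolution
    for the symmetric kernels common in early vision.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky][x + kx]
            row.append(acc)
        out.append(row)
    return out

# A 3x3 box filter over a 3x3 image yields the sum of all nine pixels.
print(convolve3x3([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1] * 3] * 3))  # [[45]]
```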

Rowley described an optimization for his neural network based face detector that can process a single $320\times200$ image in 7.2 seconds on a 200 MHz R4400 [83]. He reported that, combined with flesh tone detection, it might be possible to reduce this time to two to four seconds. Viola and Jones published a method of detecting faces at a rate of 15 frames per second on a 700 MHz Pentium [103]. Their rapid rate of detection depends on three fundamental advances. First, they propose a new image representation called the integral image that allows the features used by their detector to be computed rapidly. Second, this representation is coupled with a learning algorithm that selects a small number of critical features from a large set, thus reducing computation. Third, they describe a cascade of increasingly complex classifiers that prunes away uninteresting background regions so that the algorithm can spend more time on the promising parts of an image. Together, these optimizations are claimed to yield a factor of 15 speedup over the Rowley detector. Connell of the ECVG research group at IBM reported performing face detection at 90 frames per second on a 400 MHz Pentium II by correlating the output of a variety of inaccurate but computationally cheap face detectors [25]. Details of this system are currently not available.
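The integral image idea can be sketched compactly. In the sketch below (an illustration of the published technique, not Viola and Jones's code; the function names are hypothetical), each table entry stores the sum of all pixels above and to the left, so the sum of any rectangular region, and hence any Haar-like box feature, costs four table lookups regardless of the rectangle's size:

```python
def integral_image(img):
    """Build a summed-area table in one pass over the image.

    ii[y][x] holds the sum of img[0..y-1][0..x-1]; the extra row and
    column of zeros removes boundary checks from later lookups.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top..bottom-1][left..right-1] in O(1): four lookups,
    independent of the box size. A two-rectangle Haar-like feature is
    just the difference of two such box sums."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [[1, 2], [3, 4]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 2, 2))  # 10, the sum of the whole image
```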

There is a serious dearth of research characterizing the performance of face detectors. This lack of published analysis can be attributed mainly to two factors. First, there is a wide variety of face detection techniques not based on neural networks; prominent examples are support vector based methods, naive Bayesian classifiers, template matching and eigenvector based techniques [110,75]. Though each of these techniques has its ardent proponents, the field as a whole is fractured, and it is difficult for an architecture study to decide which method to target. Second, most neural net face detectors are based on multilayer perceptrons (MLPs). Because of their regular structure, it is simple to estimate the number of operations, bandwidth requirements, etc. of an MLP network; but while performance is easy to estimate, the required numerical precision, power consumption, die area, etc. are much more difficult to quantify. Face recognition shares the same problem as face detection in that no performance and power analysis studies are known to exist.
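The observation that MLP costs are easy to estimate follows directly from the network's regular structure: a fully connected layer with $n_{in}$ inputs and $n_{out}$ outputs performs $n_{in} \times n_{out}$ multiply-accumulates and fetches one weight per multiply. A back-of-the-envelope sketch (the layer sizes in the example are hypothetical, not taken from any cited detector):

```python
def mlp_cost(layer_sizes):
    """Estimate the multiply-accumulate count for one forward pass of a
    fully connected MLP, ignoring biases and activation functions.

    Each pair of adjacent layers (a inputs, b outputs) contributes a*b
    multiply-accumulates; weight traffic is roughly one weight fetch
    per multiply-accumulate.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# A hypothetical detector window with 400 inputs, 30 hidden units and
# one output: 400*30 + 30*1 multiply-accumulates per window evaluated.
print(mlp_cost([400, 30, 1]))  # 12030
```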

Binu Mathew