10.1 Benchmarks

The first two algorithms called GAU and HMM described in Chapter 4 are dominant components of the Sphinx 3.2 speech recognizer. The next five algorithms named Rowley, Fleshtone, Erode, Dilate and Viola are components of the visual feature recognition system described in Chapter 7. The last three algorithms are FFT, FIR and Rijndael and these are taken from the DSP and encryption domains. The DSP algorithms were added to test the generality of our approach. DSP functions like FFT and FIR are important components of speech recognition front ends and image processing algorithms. Encryption is of increasing importance to secure embedded systems. Rowley, GAU, FFT and Fleshtone are floating point intensive. The remaining benchmarks are integer only computations. Some components of GAU, Rowley and Fleshtone may be vectorized while the rest of the algorithms cannot. HMM is intensive in data dependent branches which may be if-converted.

Several source level optimizations have been made to the software versions that run on the Pentium and XScale to boost their performance as much as possible [66]. The optimizations included hand unrolled loops, partial specialization of functions when some arguments are known statically, replacing expensive functions with table lookups, reshaping data structures for better cache locality and a variety of algorithm optimizations discussed in Chapters 5 and 7. No SIMD optimizations were made in order to keep the comparison fair. The perception processor could use SIMD floating point units, just like SSE on the Pentium, but widening datapaths makes isolating the impact of architectural options like compiler controlled dataflow impossible. A brief description of the benchmarks follow.

GAU and HMM represent Gaussian probability density evaluation and hidden Markov model evaluation respectively. GAU occupies 57.5% and HMM consumes 41.5% of the execution time of the Sphinx 3.2 speech recognition system. Both Gaussian distributions and hidden Markov models are components of most mature speech recognizers [59,111,91]. GAU computes how closely a 10 ms frame of speech matches a known Gaussian probability distribution. One input packet corresponds to evaluating a single acoustic model state over 10 frames of a speech signal. A real-time recognizer needs to process 600,000 invocations of the GAU algorithm every second. The HMM algorithm performs a Viterbi search over a hidden Markov model corresponding to one model state. One input packet to the HMM implementation consists of 32 five-state hidden Markov models. While the GAU algorithm is entirely floating point, the HMM algorithm is dominated by integer compare and select operations. Its average rate of invocation varies significantly with context, but to guarantee real-time performance it is assumed in this research that all HMM models are evaluated thereby brute forcing a large component of speech processing.

Rowley represents a neural network based visual feature detector [83]. In the face recognizer a multilayer neural network is swept over $30\times30$ rectangular regions of an image. Each individual neuron is evaluated by the function: $tanh(\Sigma_{i=1}^{n}Weight_{i}\times Image[Connection_{i}])$. Neurons have multiple sizes for their fan-in ($n$), and each layer depends on the preceding layer's output. The software implementation developed for this dissertation used hand unrolled, specialized versions of neuron evaluation functions for each input size. Also, $\tanh()$ was implemented via table lookup whereas Rowley's original implementation used the $\tanh()$ function in the C library. This optimization boosted the Pentium's performance by a factor of 2.5. A $30\times30$ image as well as the outputs of all the neurons are maintained within the perception processor. Depending on the sizes of the neurons an input packet consisting of the weights and connections of 7 to 64 neurons is streamed through the perception processor. All computations involve single precision floating point numbers.

Fleshtone represents a skin toning algorithm typically used as a preprocessing step to find skin colored regions of an image so that a more sophisticated object detector like the Rowley detector may be applied to it. The benchmark converts RGB pixels to another color space and checks if the projected pixel falls in between two parabolic curves [90]. This algorithm represents a case that is difficult to vectorize since there are far more floating point operators per pixel than the number of FPUs present in the cluster. This necessitates multiple passes and saving of intermediate results. It also contains multiple if statements in the body. Each input packet consists of a single raster line of a $320\times200$ 24-bit color image. The output is a 320-entry bitmap whose elements are set where flesh color is found.

Erode and Dilate represent two operators from mathematical morphology that help in image segmentation. Erode sweeps a $3\times3$ pixel filter over the bitmap produced by Fleshtone and cuts away weakly connected regions, i.e., it blacks out pixels if all pixels within the filter are not set. Dilate does the opposite, it sweeps a $5\times5$ pixel filter over a bitmap and fills in pixels if any of the pixels are set. Fleshtone, Erode and Dilate are used for image segmentation in a visual feature recognition system [65]. Erode works on three raster lines and dilate works on five raster lines of a $320\times200$ image.

Viola is a reimplementation of the Viola and Jones' method of object detection based on a well known machine learning algorithm known as AdaBoost [103]. The algorithm relies on computing features or wavelets which are the weighted sum or difference of rectangular regions within a $30\times30$ window into an image. The coordinate and weight information for 100 features are maintained within the perception processor. Each input packet contains a $30\times30$ pixel image. The output contains the evaluation of all 100 features over the $30\times30$ image.

FFT implements a 128 point complex to complex Fourier transform on floating point data. The Fourier coefficients are maintained within the perception processor. Input and output packets consist of 128 complex numbers where each complex number consists of two single-precision floating point numbers. FFT represents a common algorithm for which many DSP processors implement ISA extensions. FFT also represents a case that causes bad interconnect conflicts on our architecture. Good performance depends on the interconnect borrowing technique described in Section 9.5. The software version on the Pentium is based on FFTW, a highly tuned FFT implementation which used dynamic programming techniques to adapt itself to the processor architecture [38]. The microcode implementation on the other hand uses a simple radix-2 algorithm and no ISA extensions. Since FFTW cannot be used on the XScale, the simple radix-2 algorithm is used instead.

FIR is a 32 tap finite impulse response filter, a common primitive in DSP applications. Impulse response coefficients are maintained inside the perception processor. Input packets of various sizes may be applied to the filter, which successively evaluates each input and outputs one integer corresponding to every input word.

Rijndael is the advanced encryption standard. The particular version implemented here uses 128 bit keys and works on 16 byte blocks [29]. Input blocks are 576 bytes long to simulate network level encryption. The default maximum size of Internet packets is 576 bytes. The key as well as the encryption S-boxes are maintained within the perception processor.

Binu Mathew