8.1 Application Characteristics

Figures 8.1 and 8.2 show the relative execution times of each algorithm when the application is run using the Rowley detector and the Viola/Jones detector. In both cases, the face detector is the dominant component. Since the detectors are heuristic in nature, the face regions identified by them may differ. This in turn leads to differences in the runtime of other algorithms that depend on the detector's output. Figures 8.3 and 8.4 show the L1 Dcache miss rate and the L2 cache hit rates for all five application configurations. Since the caches are inclusive, the L2 hit rate is defined as the L1 misses that hit in the L2 cache divided by the total number of accesses made by the application. Since this application achieves 99.8% Icache hit rate with a 16 KB Icache, no other Icache configurations were studied. Figure 8.5 shows IPC for a variety of execution unit configurations, and Figure 8.6 shows the run times normalized to real time.

Figure 8.1: Execution Time Break Down of Viola/Jones Detector Based Face

Figure 8.2: Execution Time Break Down of Rowley Detector Based Face Recognizer

Figure 8.3: L1 Dcache Miss Rate

Figure 8.4: L2 Cache Hit Rate

Figure 8.5: IPC

Figure 8.6: Speedup or Slow Down Over Real Time

For the entire application there is consistently greater than 92% L1 cache hit rate for Dcaches of 16 KB and above. This indicates that the streaming pipelined model for composing the algorithms is a good fit for the problem. Each $320\times200$ pixel color image is 187.5 KB long and the corresponding gray scale versions are about 64 KB. The images clearly will not fit in the L1 cache. The explanation is that the color image is accessed in streaming mode, i.e., each pixel is touched exactly once for flesh toning. Image segmentation works on the flesh tone bitmap (approximately 64 KB) making at most two passes over it. Since these accesses touch at most two image rows at a time, good cache utilization is insured. Subsequently, only small windows into the image are used. Since objects in these images are typically smaller than $50\times50$ pixels, each object is only about 2.5 KB in size. The downstream algorithms make several passes over each object, but only a small part of each object needs to be cache resident at each time. For example, the integral image computation in the Viola/Jones algorithm is based on a recurrence that involves two adjacent image rows and an additional row for intermediate storage and has an L1 cache footprint of about 4.4 KB. The Rowley algorithm touches at most 30 rows of the object at the same time. However, as it sweeps across the image left to right and top to bottom only a $30\times30$ pixel window needs to be cache resident at a time. Since it shifts its position one pixel at a time, a $29\times29$ region of this window will be reused by the next iteration contributing to high L1 cache hit rate. A similar pattern occurs in the later phase of the Viola/Jones algorithm on a $30\times30$ region. The Eigenfaces algorithm uses a projected image of the object to be recognized as well as basis, mean and projected image matrices corresponding to each reference object. The target object is reused while it is compared against each candidate. Each candidate, however, is accessed only once per target object.

The objects and their attributes from each stage are typically touched again by the next stage. The auxiliary information used by the algorithms is somewhat small. Both detector algorithms use fixed size data structures. The worst case is the Viola/Jones algorithm, which needs a weight and a type for each feature corresponding to $100\times2\times4=800$ bytes of L1 cache. The data set for the Eigenfaces algorithm on the other hand is linear in the number of the reference faces. Since these could potentially be streamed into the L1 Dcache once per target object (or once per frame), the footprint is small. Only the projected target object and a small part of the basis/mean/projected reference images need to be resident in the L1 Dcache. From Figure 8.4 it is seen that the L2 cache is largely ineffective since it is accessed infrequently due to the low L1 miss rate.

From a cache footprint perspective, both the detector algorithms and the entire application appear to be a good match for embedded processors with limited cache resources. Since images are accessed left to right, multiple rows at a time, sequential prefetch (or strided prefetch) would hide memory access latencies even when the L1 Dcache is small. Quite a different view unfolds on examination of the IPC and speedup graphs. Figure 8.5 shows IPC for a variety of execution unit configurations. IPC is seen to saturate early on for two main reasons. The first is caused by dependences in the loop bodies. For example, neural net evaluation involves computing $\Sigma_{i=0..n}Weight[i]*Image[Connection[i]]$. In addition to the loop carried dependence on the sum, each of the inputs is accessed indirectly via a pointer since an input to one neuron could be the output of another neuron. Second, the high ratio of array variable accesses to arithmetic operations causes saturation of the Dcache ports.

Figure 8.6 shows the run times normalized to real time. Here, 1.0 represents minimum real-time performance corresponding to 5 frames per second. For example, in Figure 8.6 in the 1 ALU + 1 FPU configuration, the Rowley algorithm is 1.13 times slower than real time while the Eigenfaces algorithm processes 5 frames in 0.69 seconds. The graph clearly shows that embedded processors are inadequate to handle the work load in real time. In this case instruction throughput is the culprit. Even when function units are available, dependences and contention for the Dcache ports causes low IPC. The power budgets required for real-time performance are beyond what is available on normal low power embedded platforms. Thermal dissipation is a problem even on high performance processors and energy saving solutions are important for real-time workloads like visual feature recognition. Hardware accelerators that use specialized data paths and stream array operands out of multiple SRAM buffers stand a good chance of accelerating these algorithms at embedded power budgets.

Binu Mathew