8. Characterization of Visual Feature Recognition

This chapter provides a detailed characterization of the visual feature recognition system described in Chapter 7. Native execution, profiling using processor performance counters, and simulation were used to characterize the application. The native execution results were obtained using SGI SpeedShop on a 666 MHz R14K processor. Unlike the results presented in Chapter 5, which used the SimpleScalar 3.0 simulator, results in this chapter are based on ML-RSIM, an out-of-order processor simulator derived from the Rice University RSIM simulator. This change was made for two reasons. First, the visual feature recognition application is implemented in C++, but the compiler used by SimpleScalar does not support C++. Since ML-RSIM accepts binaries compiled for SunOS, it was possible to generate the application binary on a Sun workstation. Second, a stable version of ML-RSIM was not available at the time the experiments in Chapter 5 were conducted.

A derivative of the NetBSD operating system was run within the simulator. An application binary compiled for SunOS was used without any modification since the OS emulates the SunOS system call interface. Two different configurations were simulated. The first is a multi-GHz processor whose parameters, such as L1 cache hit time, memory access time, and floating point latencies, were measured on a 1.7 GHz AMD Athlon processor using the lmbench hardware performance analysis benchmark. The second is an embedded configuration modeled after a 400 MHz Intel XScale processor, except that it uses a Sparc ISA and has a floating point unit [68]. Since ML-RSIM could not be configured without an L2 cache, an inclusive L2 cache equal in size to the combined L1 instruction and data caches was added to the embedded configuration. Because the cache is inclusive and the same size as the sum of the L1 caches, this configuration behaves similarly to a machine with no L2 cache. Numbers that could not be directly measured were obtained from vendor microarchitecture references. ML-RSIM was configured to reflect the parameters shown in Table 8.1. Unless mentioned otherwise, the remainder of this chapter uses the default configuration.

Table 8.1: Experiment Parameters
Native Execution:
  SGI Onyx3, R14K processors at 666 MHz
  32 KB 2-way IL1, 32 KB 2-way DL1, 8 MB L2
  Software: IRIX 64, MIPSpro compiler, Perfex, SpeedShop
Simulator (default configuration):
  Sparc V8 ISA, out-of-order CPU model, 2 GHz
  16 KB 2-way IL1, 2 cycle latency, 16 KB 2-way DL1, 2 cycle latency
  2 MB 2-way L2, 20 cycle latency, 228 cycle DRAM latency
  L1 line size 64 bytes, L2 line size 128 bytes
  Issue width: 4 integer + 4 floating point, Max 4 graduations/cycle
  DRAM interface: 600 MHz, 64 bits wide
  Software: gcc 2.6.3
Simulator (embedded configuration):
  Sparc V8 ISA, 400 MHz
  32 KB 32-way IL1, 1 cycle latency, 32 KB 32-way DL1, 1 cycle latency
  64 KB inclusive L2 cache
  L1 line size 64 bytes, L2 line size 128 bytes
  Issue width: 1 integer or 1 floating point, Max 1 graduation/cycle
  DRAM interface: 100 MHz, 32 bits wide
  Software: gcc 2.6.3
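
A few derived quantities help when comparing the two simulated configurations in Table 8.1. The following Python sketch is illustrative only and is not part of the experimental setup. The class and field names are hypothetical. It assumes that the DRAM interface performs one transfer per interface clock and that the 228 cycle DRAM latency is counted in CPU cycles; neither assumption is stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Subset of the Table 8.1 parameters (field names are hypothetical)."""
    name: str
    cpu_mhz: int
    l1_size_kb: int      # size of each L1 cache (IL1 and DL1 are equal)
    l1_ways: int
    l1_line_bytes: int
    dram_mhz: int
    dram_bits: int

    def l1_sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return (self.l1_size_kb * 1024) // (self.l1_ways * self.l1_line_bytes)

    def combined_l1_kb(self) -> int:
        # IL1 + DL1 capacity
        return 2 * self.l1_size_kb

    def peak_dram_mb_per_s(self) -> int:
        # Assumption: one transfer per interface clock.
        return self.dram_mhz * self.dram_bits // 8

    def cycles_to_ns(self, cycles: int) -> float:
        # Assumption: latencies are expressed in CPU cycles.
        return cycles * 1000 / self.cpu_mhz


DEFAULT = SimConfig("default", 2000, 16, 2, 64, 600, 64)
EMBEDDED = SimConfig("embedded", 400, 32, 32, 64, 100, 32)

for cfg in (DEFAULT, EMBEDDED):
    print(cfg.name, cfg.l1_sets(), "L1 sets,",
          cfg.peak_dram_mb_per_s(), "MB/s peak DRAM bandwidth")
print("default DRAM latency:", DEFAULT.cycles_to_ns(228), "ns")
```

Under these assumptions the default configuration has 128 sets per L1 cache, a peak DRAM bandwidth of 4800 MB/s, and a DRAM latency of 114 ns, while the embedded configuration has 16 sets per L1 cache and a peak DRAM bandwidth of 400 MB/s. The combined L1 capacity of the embedded configuration is 64 KB, which matches the size of its inclusive L2 cache.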

The application is studied in five configurations: a) full pipeline using the Rowley face detector, b) full pipeline using the Viola/Jones face detector, c) only the Rowley face detector with flesh toning and image segmentation, d) only the Viola/Jones face detector with flesh toning and image segmentation, and e) only the Eigenfaces recognizer. The last three configurations matter from an energy perspective, since running the individual algorithms on separate low-frequency processors or hardware accelerators can yield significant energy savings.
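
The five configurations can be written down as a small data structure, which makes the distinction between the full pipeline runs and the standalone runs explicit. This is a minimal Python sketch with hypothetical names; it records only what the paragraph above states and does not enumerate the remaining stages of the full pipeline described in Chapter 7.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Detector(Enum):
    ROWLEY = "Rowley"
    VIOLA_JONES = "Viola/Jones"


class Variant(NamedTuple):
    label: str
    full_pipeline: bool            # True for the complete application
    detector: Optional[Detector]   # None when no face detector is run
    preprocessing: bool            # flesh toning + image segmentation
    recognizer: bool               # Eigenfaces recognizer


VARIANTS = [
    Variant("a", True, Detector.ROWLEY, True, True),
    Variant("b", True, Detector.VIOLA_JONES, True, True),
    Variant("c", False, Detector.ROWLEY, True, False),
    Variant("d", False, Detector.VIOLA_JONES, True, False),
    Variant("e", False, None, False, True),
]

# The standalone variants are the ones relevant to mapping individual
# algorithms onto separate low-frequency processors or accelerators.
standalone = [v for v in VARIANTS if not v.full_pipeline]
print([v.label for v in standalone])
```

Filtering on `full_pipeline` selects configurations c, d, and e, the three singled out for their energy implications.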


Binu Mathew