
8.2 Optimization Opportunities

One recurring theme in image processing is computing a kernel that operates on an $M\times N$ subwindow of a larger $W\times H$ image. The kernel is recomputed for every possible pixel location within the larger image, which amounts to sliding the $M\times N$ subwindow over the $W\times H$ image. There is significant scope for compiler-based reordering of computations in such kernels. The naive pattern is sketched below, followed by two concrete examples.
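As a point of reference, a minimal sketch of the naive pattern, before any reordering, might look like the following. The image and window dimensions, the per-window operation, and the function name are illustrative assumptions rather than code from the application.

/* Naive sliding-window kernel: the O(M*N) window computation is redone
 * from scratch at every pixel location.  Dimensions are placeholders. */
#define W 320   /* image width  (assumed) */
#define H 240   /* image height (assumed) */
#define M 30    /* window height (assumed) */
#define N 30    /* window width  (assumed) */

void naive_window_sum(const unsigned char img[H][W],
                      int out[H - M + 1][W - N + 1])
{
    for (int y = 0; y <= H - M; y++) {
        for (int x = 0; x <= W - N; x++) {
            int sum = 0;
            for (int dy = 0; dy < M; dy++)
                for (int dx = 0; dx < N; dx++)
                    sum += img[y + dy][x + dx];
            out[y][x] = sum;    /* recomputed for every (x, y) */
        }
    }
}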

As described in Section 7.4, the heuristics used by the Viola/Jones algorithm are based on the sums and differences of pixels in adjacent rectangular regions. Recomputing these sums at each pixel location is very expensive. A major contribution of their approach is an intermediate image representation called the integral image, from which the sum of the pixels in any rectangular window can be computed easily. The integral image value at pixel location (x,y) is defined as the sum of all pixels above and to the left of (x,y). Evaluating this sum directly at every location is computationally prohibitive. By expressing the same relationship as a pair of recurrences, it is possible to compute the integral image in just one pass over the image. This transformation required careful study and insight from the originators of the algorithm. Given that the sums of rectangular subwindows of the larger image are recomputed at each pixel location, a compiler-based tool aware of the access pattern and the rules of arithmetic could be designed to deduce the recurrences automatically.
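A minimal sketch of the one-pass recurrence appears below, assuming a row-major 8-bit source image; the rectangle-sum helper illustrates how a window sum reduces to four table lookups. The function names and array layout are assumptions, not the original implementation.

/* One-pass integral image using the pair of recurrences
 *   s(x, y)  = s(x-1, y) + i(x, y)      (running row sum, s(-1, y) = 0)
 *   ii(x, y) = ii(x, y-1) + s(x, y)     (ii(x, -1) = 0)                 */
void integral_image(const unsigned char *img, long *ii, int w, int h)
{
    for (int y = 0; y < h; y++) {
        long s = 0;                              /* running row sum */
        for (int x = 0; x < w; x++) {
            s += img[y * w + x];
            long above = (y > 0) ? ii[(y - 1) * w + x] : 0;
            ii[y * w + x] = above + s;
        }
    }
}

/* Sum of the pixels in the rectangle with inclusive corners
 * (x0, y0) and (x1, y1), computed from four integral-image lookups. */
long rect_sum(const long *ii, int w, int x0, int y0, int x1, int y1)
{
    long d = ii[y1 * w + x1];
    long b = (y0 > 0) ? ii[(y0 - 1) * w + x1] : 0;
    long c = (x0 > 0) ? ii[y1 * w + (x0 - 1)] : 0;
    long a = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0;
    return d - b - c + a;
}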

The computation of the Viola/Jones heuristics also requires the standard deviation of the pixel values within a $30\times30$ pixel window starting at each pixel location within the probable face rectangle. An initial implementation that simply recomputed the standard deviation at each pixel location accounted for 10-15% of the compute time of the whole application. When moving from one pixel to the next, the windows overlap by $29\times30$ pixels, so the mean and sum of squares at one position can be calculated easily from the predecessor's values by adjusting for the nonoverlapping pixels alone. By defining a set of recurrences for the mean and mean square of $30\times30$ subwindows over a wider region, it is possible to compute the standard deviations in one pass over the image, reducing the execution time of this component to less than 1%. Currently, such transformations require considerable attention from the programmer, insight into the algorithm, and care with corner cases that make them error prone. This bolsters the argument in favor of compiler-based loop restructuring that can apply the axioms of algebra to deduce the right set of recurrences.
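A sketch of the incremental update for one row of window positions is shown below, again assuming a row-major 8-bit image. The $30\times30$ window size matches the text, while the function name and output layout are illustrative.

#include <math.h>

/* Sliding-window standard deviation via recurrences on the running sum
 * and sum of squares: when the 30x30 window slides one pixel to the
 * right, only the leaving and entering 30x1 columns need adjustment. */
#define WIN 30

void window_stddev_row(const unsigned char *img, float *out,
                       int w, int y0, int out_len)
{
    const double n = (double)(WIN * WIN);
    double sum = 0.0, sumsq = 0.0;

    /* Full O(WIN*WIN) computation for the first window position only. */
    for (int dy = 0; dy < WIN; dy++)
        for (int dx = 0; dx < WIN; dx++) {
            double p = img[(y0 + dy) * w + dx];
            sum   += p;
            sumsq += p * p;
        }

    for (int x = 0; x + WIN <= w; x++) {
        double mean = sum / n;
        if (x < out_len)
            out[x] = (float)sqrt(sumsq / n - mean * mean);
        if (x + WIN >= w)
            break;
        /* Recurrence: drop the leaving column, add the entering one. */
        for (int dy = 0; dy < WIN; dy++) {
            double leave = img[(y0 + dy) * w + x];
            double enter = img[(y0 + dy) * w + x + WIN];
            sum   += enter - leave;
            sumsq += enter * enter - leave * leave;
        }
    }
}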

Another possible optimization is to reorder the computation so that data can be streamed through a set of execution units and the results computed in the minimum number of passes while observing limits on the amount of intermediate storage used. Compiler-based tools that consider parameters such as the image and subwindow sizes and the size of the intermediate storage, and that automatically transform algorithmic kernels for optimal stream performance, would be desirable.

As seen in Figures 8.5 and 8.6, wide issue clearly helps performance. In traditional architectures, wide issue usually comes at the cost of increased hardware complexity and may limit the clock frequency or exceed a limited energy budget. This application is embarrassingly parallel in most sections because of the intrinsic data parallelism of pixel processing. One way of achieving good performance at low power is to use a cluster of function units operating in parallel with a very small quantity of SRAM for local storage and no cache. This approach is investigated in the next chapter.


