SiliconIntelligence


7.4 Viola and Jones' Detector

Viola and Jones present a new and radically faster approach to face detection based on the AdaBoost algorithm from machine learning [103]. They claim a factor of 15 speedup over the Rowley detector for their implementation, but they run their detector directly on entire images without using flesh-toning to cut down the search area. Since their source code is proprietary, their algorithm was reimplemented by the author and Robert Evans based on example code obtained from Peter Carbonetto at the University of British Columbia. The visual feature recognizer used for this dissertation research applies flesh-toning for both the Rowley and the Viola/Jones detectors, which cuts down the region of the image that the detector needs to process. Under these circumstances a visual feature recognition system based on Rowley's method performs as well as one based on Viola and Jones' method. To understand the Viola/Jones detector, the concept of boosting needs to be explained first.

A random guess at a yes-or-no question has a 50% chance of being correct. A heuristic that can improve these odds by even a very small amount is called a weak learner. Weak learners can be generated for many tasks in a semiautomated manner by enumerating a huge set of heuristics built from combinations of simple rules and evaluating their performance on a set of samples. A heuristic that can improve the odds of a guess by a significant amount is called a strong learner. Boosting is a method of combining several weak learners to form a strong learner. AdaBoost is a well-known algorithm for generating strong learners from weak learners while providing statistical bounds on the training and generalization error of the algorithm [86].
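The reweighting loop at the heart of AdaBoost can be sketched on a toy problem. The one-dimensional data, the threshold "stumps" used as weak learners, and the five-round budget below are illustrative assumptions, not part of the Viola/Jones system; the sketch only shows how reweighting forces each new weak learner to concentrate on the samples its predecessors got wrong.

```python
import math

# Toy data: scalar inputs with labels in {-1, +1}. Not separable by any
# single threshold, so no lone stump can be perfect.
X = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.85, 0.9]
y = [-1, -1, -1, 1, -1, 1, 1, 1]

def stump(theta, sign):
    # weak learner: predict +1 when sign * (x - theta) > 0
    return lambda x: 1 if sign * (x - theta) > 0 else -1

candidates = [stump(t, s) for t in X for s in (1, -1)]

w = [1.0 / len(X)] * len(X)        # sample weights, initially uniform
ensemble = []                      # (alpha, weak learner) pairs

for _ in range(5):
    # pick the stump with the lowest weighted error under the current weights
    h, err = min(((c, sum(wi for wi, xi, yi in zip(w, X, y) if c(xi) != yi))
                  for c in candidates), key=lambda p: p[1])
    if err >= 0.5:                 # no better than chance: stop
        break
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    ensemble.append((alpha, h))
    # increase the weight of misclassified samples, then renormalize
    w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
    z = sum(w)
    w = [wi / z for wi in w]

def strong(x):
    # strong learner: sign of the weighted vote of the weak learners
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

print([strong(x) for x in X])
```

After a few rounds the weighted vote classifies the toy set correctly even though every individual stump makes mistakes, which is exactly the weak-to-strong conversion that boosting promises.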

The weak learners in the Viola/Jones algorithm are based on features of three kinds. A two-rectangle feature is the difference between the sums of the pixel values in two adjacent rectangular windows. A three-rectangle feature considers three adjacent rectangles and computes the difference between the sum of the pixels in the two outer rectangles and the sum of the pixels in the middle rectangle. A four-rectangle feature considers a $2\times2$ set of rectangles and computes the difference between the sums of the pixels in the rectangles that constitute the main and off diagonals. For a $24\times24$ subwindow there can be more than 180,000 such features. The task of the AdaBoost algorithm is to pick a few hundred features and assign a weight to each using a set of training images. Face detection is then reduced to computing the weighted sum of the chosen rectangle features and applying a threshold. As in the case of the Rowley algorithm, a $30\times30$ detector is swept over every pixel location in the image, and the image is repeatedly rescaled. Rowley's voting algorithm is used to decide the final detection locations.
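The size of the feature pool can be made concrete by counting template placements. The five base shapes below (the horizontal and vertical orientations of the two- and three-rectangle features, plus the four-rectangle feature) are an assumed enumeration, and the exact total depends on which shapes and scalings are admitted; Viola and Jones count over 180,000 for their set. The point is simply that the pool is enormous, so AdaBoost's selection of a few hundred features matters.

```python
def placements(W, H, w, h):
    # number of placements of a w x h template, stretched by integer
    # factors, inside a W x H detection window
    total = 0
    for i in range(1, W // w + 1):        # horizontal scale factor
        for j in range(1, H // h + 1):    # vertical scale factor
            total += (W - i * w + 1) * (H - j * h + 1)
    return total

W = H = 24                                # base detection window
base_shapes = [(2, 1), (1, 2),            # two-rectangle features
               (3, 1), (1, 3),            # three-rectangle features
               (2, 2)]                    # four-rectangle feature
total = sum(placements(W, H, w, h) for w, h in base_shapes)
print(total)                              # → 162336
```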

Computing rectangle features directly is simple but slow: each feature requires the sums of pixels over adjacent rectangular regions, and recomputing these sums at every pixel location is very expensive. A major contribution of the Viola/Jones approach is an intermediate image representation called the integral image, from which the sum of the pixels in any rectangular window can be computed cheaply. The integral image value at pixel location $(x,y)$ is defined as the sum of all pixels above and to the left of $(x,y)$. Computing this sum independently for every pixel would itself be computationally prohibitive, but by expressing the definition as a pair of recurrences, the integral image can be computed with just one pass over the image. Given the integral image, computing a feature $F$ reduces to:

\begin{displaymath}
S=\sum_{i=1}^{9}W[i]\times IntegralImage[F.Index[i]]
\end{displaymath}


\begin{displaymath}
F.score=abs(S-F.mean\_face)<abs(S-F.mean\_nonface)
\end{displaymath}
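The one-pass construction mentioned above can be sketched as follows; the variable names and the demo image are illustrative, not the dissertation's code. With a zero-padded border, each entry holds the sum of all pixels above and to the left of its location, and any rectangle sum then needs just four lookups.

```python
def integral_image(img):
    H, W = len(img), len(img[0])
    ii = [[0] * (W + 1) for _ in range(H + 1)]  # extra zero row and column
    for y in range(H):
        s = 0                                   # recurrence 1: running row sum
        for x in range(W):
            s += img[y][x]
            # recurrence 2: add the row sum to the entry one row up
            ii[y + 1][x + 1] = ii[y][x + 1] + s
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    # sum of img[y0:y1][x0:x1] (half-open ranges) from four corner lookups
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3))  # bottom-right 2x2 block: 5 + 6 + 8 + 9 = 28
```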

$W[]$ is a set of weights that depends only on the type of the feature being computed; unlike the trained weights of a neural network, these are known constants. As with the connections of a neural network, $F.Index[]$ is a set of indices denoting specific locations within the integral image; these are trained for each selected feature. $F.mean\_face$ is the mean value of feature $F$ over a set of training rectangles known to contain faces, and $F.mean\_nonface$ is the mean value over a set of rectangles known to be devoid of faces. The score for the feature thus depends on whether its response is closer to the population of faces or of nonfaces. The decision of whether an image window contains a face is based on the computation:

\begin{displaymath}
IsFace=\sum_{i=1}^{N}F[i].score>threshold
\end{displaymath}

$N$ is the number of features used for recognition and $threshold$ is determined by the AdaBoost algorithm.
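The scoring path described by the equations above can be sketched in code. The class layout, weights, indices, and population means below are made-up illustrations; in the real detector they come from the feature type and from AdaBoost training, and a feature sums up to nine integral-image lookups rather than the four used in this toy example.

```python
class Feature:
    def __init__(self, weights, index, mean_face, mean_nonface):
        self.W = weights                  # W[]: constants fixed by feature type
        self.Index = index                # F.Index[]: integral-image locations
        self.mean_face = mean_face        # mean feature value over face samples
        self.mean_nonface = mean_nonface  # mean value over nonface samples

    def score(self, integral_image):
        # S = sum_i W[i] * IntegralImage[F.Index[i]]
        s = sum(w * integral_image[k] for w, k in zip(self.W, self.Index))
        # score 1 when S lies closer to the face mean than the nonface mean
        return 1 if abs(s - self.mean_face) < abs(s - self.mean_nonface) else 0

def is_face(features, integral_image, threshold):
    # IsFace = sum_i F[i].score > threshold
    return sum(f.score(integral_image) for f in features) > threshold

# toy usage over a flattened integral image
f = Feature(weights=[1, -1, -1, 1], index=[0, 1, 2, 3],
            mean_face=10.0, mean_nonface=0.0)
ii = [12, 1, 1, 2]                        # S = 12 - 1 - 1 + 2 = 12
print(f.score(ii), is_face([f], ii, threshold=0))  # → 1 True
```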

The original slow approach described in the Viola/Jones paper uses 200 features. They then describe a faster approach in which many such detectors are cascaded, with more complex detectors following simpler ones. A window is passed to a detector only if it has not been rejected by the preceding detector. Since training this cascade is a laborious process, the workload characteristics of the algorithm are modeled here with a 100-feature detector.
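The control flow of the cascade can be sketched as below. The "windows" are plain numbers and the stages are toy predicates, purely to make the early-exit behavior concrete: cheap stages run on every window, while later, costlier stages see only the windows accepted so far.

```python
def cascade(window, stages):
    # stages are ordered from simplest to most complex
    for stage in stages:
        if not stage(window):
            return False   # rejected: later (costlier) stages never run
    return True            # accepted by every stage

calls = [0, 0]             # count how often each stage actually runs

def cheap(w):
    calls[0] += 1
    return w > 10          # crude first filter

def expensive(w):
    calls[1] += 1
    return w > 20          # stricter, costlier test

windows = [5, 15, 25, 8, 30]
hits = [w for w in windows if cascade(w, [cheap, expensive])]
print(hits, calls)         # → [25, 30] [5, 3]
```

The counters show the economy of the cascade: the cheap stage ran on all five windows, but the expensive stage ran on only the three that survived it.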



Binu Mathew