7.3 Rowley Face Detector

Henry Rowley's neural network based face detector is well known as a pioneering contribution [83]. Its implementation was provided by the Robotics Institute at CMU. The detector decides whether a $30\times30$ pixel window contains a face. Face detection is performed by sweeping the detector over the image and computing a decision at each pixel location. The image is then reduced in size by a factor of 0.8 and the procedure is repeated. The resulting series of scaled images, together with their detection locations, is called an image pyramid. For a real face, detections are reported at several nearby pixel locations at one scale and at corresponding locations at neighboring scales. False positives rarely occur with this regularity, so a voting algorithm can be applied to the image pyramid to decide the locations of any true detections.
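To make the scan order concrete, the following sketch shows one way the sweep over window positions and pyramid scales could be organized. It is not the CMU code: the $30\times30$ window size and 0.8 scale step come from the text, but the Image type and the helper functions downscale_image(), classify_window() and record_detection() are assumptions introduced purely for illustration.

\begin{verbatim}
#include <stdlib.h>

#define WINDOW     30      /* detector input window size (30x30)            */
#define SCALE_STEP 0.8     /* each pyramid level is 0.8x the previous size  */

typedef struct { unsigned char *pix; int w, h; } Image;

/* Assumed helpers, provided elsewhere: downscale_image() shrinks an image
 * by the given factor, classify_window() runs the neural network on one
 * 30x30 window and returns nonzero for a face, and record_detection()
 * stores (level, x, y) in the pyramid for the later voting stage. */
Image *downscale_image(const Image *src, double factor);
int    classify_window(const Image *img, int x, int y);
void   record_detection(int level, int x, int y);

void scan_pyramid(Image *img)
{
    int level = 0;
    /* Keep shrinking until the image is smaller than the detector window. */
    while (img->w >= WINDOW && img->h >= WINDOW) {
        for (int y = 0; y + WINDOW <= img->h; y++)
            for (int x = 0; x + WINDOW <= img->w; x++)
                if (classify_window(img, x, y))
                    record_detection(level, x, y);
        img = downscale_image(img, SCALE_STEP);
        level++;
    }
}
\end{verbatim}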

In each window the detector first applies a correction for varying lighting conditions, followed by histogram equalization to expand the range of intensity values. The preprocessed window is then fed to a multilayer neural network whose input layer has retinal connections to the image window. The evaluation of a single neuron can be represented as:

\begin{displaymath}
Y=\tanh(\sum_{i=1}^{N}W[i]\times Image[Connection[i]])
\end{displaymath}

$W[]$ is the set of weights associated with the neuron's connections, and $Connection[]$ holds the image locations to which the neuron is connected. In practice, $Image$ contains additional storage following the actual image data, and the outputs of neurons are written to these extra locations. A multilayer network can therefore be evaluated as if it were a flat, retinally connected array of neurons, provided that neurons in deeper layers are stored after neurons closer to the retinal layer. The $\tanh$ function acts as a sigmoid-shaped nonlinearity, and it is expensive to compute. Rowley's original implementation uses the $\tanh()$ function provided by the C library. In the version developed for this dissertation, it was replaced with an 800-entry lookup table, which produced output identical to the original on the test images. This simple optimization improved the performance of the algorithm by a factor of 2.5 on a 2.4 GHz Pentium processor.
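A minimal sketch of this flat evaluation scheme, with the table-based $\tanh$, is shown below. The 800-entry table matches the text, but the input range of the table, the shared layout of $W[]$ and $Connection[]$, and the per-neuron index arrays first[] and count[] are illustrative assumptions rather than the actual CMU data structures.

\begin{verbatim}
#include <math.h>

#define LUT_SIZE   800     /* 800-entry table as described above           */
#define LUT_RANGE  8.0     /* assumed input range [-LUT_RANGE, +LUT_RANGE) */

static double tanh_lut[LUT_SIZE];

/* Fill the lookup table once; outside the range tanh() saturates to +/-1. */
static void init_tanh_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        double x = -LUT_RANGE + (2.0 * LUT_RANGE * i) / LUT_SIZE;
        tanh_lut[i] = tanh(x);
    }
}

static double fast_tanh(double x)
{
    if (x <= -LUT_RANGE) return -1.0;
    if (x >=  LUT_RANGE) return  1.0;
    int i = (int)((x + LUT_RANGE) * LUT_SIZE / (2.0 * LUT_RANGE));
    return tanh_lut[i];
}

/* Evaluate all neurons over a flat array.  Image[] holds the 30x30 window
 * followed by extra slots that receive each neuron's output, so neurons in
 * deeper layers simply read the outputs of earlier ones.  first[n] and
 * count[n] give neuron n's slice of the shared W[] and Connection[]
 * arrays; this layout is an assumption made for illustration. */
void eval_network(double *Image, const double *W, const int *Connection,
                  const int *first, const int *count, int num_neurons,
                  int window_pixels)
{
    for (int n = 0; n < num_neurons; n++) {
        double sum = 0.0;
        for (int i = 0; i < count[n]; i++) {
            int k = first[n] + i;
            sum += W[k] * Image[Connection[k]];
        }
        /* Store the output after the image so later neurons can read it. */
        Image[window_pixels + n] = fast_tanh(sum);
    }
}
\end{verbatim}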

The retinal layer is followed by a hidden layer composed of three classes of units: four units look at $10\times10$ subwindows, 16 units look at $5\times5$ subwindows, and six units look at overlapping $30\times5$ horizontal stripes. The final output of the network indicates whether the $30\times30$ window contains a face.

The voting algorithm notes the location and scale of each detection in the image pyramid. The next step, called spreading, replaces each location in the pyramid with the number of detections in its neighborhood, where the neighborhood extends an equal number of pixels along the position and scale axes. The resulting counts are thresholded, and the centroids of the surviving locations are computed. Centroids are examined in descending order of detection count, and any other centroid representing a face that overlaps the current face is eliminated. The remaining centroids give the locations of the faces found in the image. To further reduce false positives, multiple neural networks, each trained separately, may be applied to the image, and their consensus provides a more accurate detection.
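The spreading and thresholding steps can be sketched as follows. The dense three-dimensional pyramid layout, the neighborhood radius, and the threshold value are assumptions chosen for clarity, and the centroid and overlap-elimination stages are omitted.

\begin{verbatim}
/* Detections are stored as a 3-D array of counts indexed by
 * (scale level, y, x); this fixed-size layout is an assumption. */
#define LEVELS  16
#define MAX_H   256
#define MAX_W   256
#define RADIUS  2      /* neighborhood extent along x, y, and scale */
#define THRESH  3      /* minimum spread count to keep a location   */

static int hits[LEVELS][MAX_H][MAX_W];    /* raw detections (0 or 1) */
static int spread[LEVELS][MAX_H][MAX_W];  /* neighborhood counts     */

void spread_and_threshold(void)
{
    for (int s = 0; s < LEVELS; s++)
        for (int y = 0; y < MAX_H; y++)
            for (int x = 0; x < MAX_W; x++) {
                int count = 0;
                /* Sum detections in a cube of side 2*RADIUS+1 around
                 * (s, y, x), clipped to the pyramid bounds. */
                for (int ds = -RADIUS; ds <= RADIUS; ds++)
                    for (int dy = -RADIUS; dy <= RADIUS; dy++)
                        for (int dx = -RADIUS; dx <= RADIUS; dx++) {
                            int ss = s + ds, yy = y + dy, xx = x + dx;
                            if (ss >= 0 && ss < LEVELS &&
                                yy >= 0 && yy < MAX_H &&
                                xx >= 0 && xx < MAX_W)
                                count += hits[ss][yy][xx];
                        }
                spread[s][y][x] = (count >= THRESH) ? count : 0;
            }
}
\end{verbatim}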


