
7. Visual Feature Recognition Algorithms

Visual feature recognition systems vary significantly with the type of feature being recognized. Relatively simple recognizers are regularly employed in industrial visual inspection systems. Human face recognition, on the other hand, is an extremely complex task given the huge possibility space of facial features and skin tones. Facial recognition systems clearly have utility in security and surveillance domains, and other visual recognizers play key roles in gesture interfaces, lip reading to support speech recognition, and robotics. Interest in face recognition is motivated by the difficulty of the problem, which current embedded systems cannot support. This is evident from Figure 1.1, which showed that a high performance 4.8 GHz processor was required to satisfy the real-time requirements of the FaceRec application. Furthermore, the face detection algorithms used in this study, the neural network based Rowley detector and the rectangle feature based Viola/Jones detector, are generic approaches to object detection [83,103]. They appear to be easily adapted to other visual feature recognition tasks; the main differences are the training regimen and the frame rate requirements. For example, the Rowley method of face detection described in Section 7.3 has been applied to license plate detection [83]. Thus, research in accelerating face detection and recognition also helps the detection and recognition of other objects.

The FaceRec application studied here can be viewed as a pipeline of three major functional components, sketched below. A flesh tone detector isolates areas of a frame where a face is likely to be present. The next stage is a face detector that determines whether a face is actually present in each area of interest. The final stage is a face recognizer. Each of these components is based on well-known algorithms that have been adapted or reimplemented to fit into a unified framework. Some algorithmic optimization and restructuring has been done to suit benchmarking purposes, but the basic approaches were developed by other researchers.
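To make the structure concrete, the following C sketch traces one frame through the three stages. All types, function names, and stub bodies here are illustrative stand-ins, not the FaceRec source.

\begin{verbatim}
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative types; the real FaceRec data structures differ. */
typedef struct { const unsigned char *pixels; int w, h; } Frame;
typedef struct { int x, y, w, h; } Region;

/* Stage 1: cheap flesh-tone filter over the whole frame (stub). */
static size_t flesh_tone_detect(const Frame *f, Region *out, size_t max)
{
    if (max == 0) return 0;
    out[0] = (Region){ 0, 0, f->w, f->h };   /* pretend one candidate */
    return 1;
}

/* Stage 2: face/non-face decision for one candidate region (stub). */
static bool face_detect(const Frame *f, const Region *r, Region *face)
{
    (void)f;
    *face = *r;
    return true;
}

/* Stage 3: match a detected face against the known-face set (stub). */
static int face_identify(const Frame *f, const Region *face)
{
    (void)f; (void)face;
    return 0;                                /* index of best match */
}

/* One frame flows through the three increasingly selective stages. */
static void process_frame(const Frame *f)
{
    Region cand[64];
    size_t n = flesh_tone_detect(f, cand, sizeof cand / sizeof cand[0]);
    for (size_t i = 0; i < n; i++) {
        Region face;
        if (face_detect(f, &cand[i], &face))
            printf("face id %d at (%d,%d)\n",
                   face_identify(f, &face), face.x, face.y);
    }
}

int main(void)
{
    static unsigned char pixels[320 * 200 * 3];
    Frame f = { pixels, 320, 200 };
    process_frame(&f);
    return 0;
}
\end{verbatim}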

Interestingly, the face recognition system, viewed from a structural perspective, comprises a series of increasingly discriminating filters. Early stages of the sequence must inherently filter the entire image. As the process proceeds downstream, each stage needs to examine less image data, since previous stages have eliminated areas from the candidate list. The result is an interesting balance of simple algorithms that analyze large amounts of data early in the sequence and more sophisticated algorithms that analyze only limited amounts of data late in the process. This structure is amenable to implementation as an embedded system.

Figure 7.1 shows the major steps in face recognition. The input is a low-resolution video stream, such as $320\times200$ pixel images at 10 frames per second. The stream is processed one frame at a time, and sufficient state is maintained to perform history-sensitive tasks like motion tracking. The process is essentially a pipeline of filters that reduce the data and attach attributes to frames for the use of downstream components. Typically each filter is invoked at the frame rate, which underlines the soft real-time nature of this application. Filters may also access large databases or internal tables; these additional data streams add to the aggregate bandwidth requirement of the system. The periodic nature of the application domain often makes it possible to estimate worst case requirements easily, as the bound below illustrates.
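As a rough worked example, the raw input bandwidth of the video stream alone is simple to bound. The 24-bit RGB pixel format is an assumption for illustration; the text does not state the format here.

\begin{verbatim}
#include <stdio.h>

int main(void)
{
    /* Worst-case bandwidth of the raw input stream alone, assuming
       24-bit RGB pixels (an assumption, not stated in the text). */
    const double pixels_per_frame = 320.0 * 200.0;
    const double bytes_per_pixel  = 3.0;
    const double frames_per_sec   = 10.0;
    double bytes_per_sec =
        pixels_per_frame * bytes_per_pixel * frames_per_sec;
    printf("input stream: %.2f MB/s\n", bytes_per_sec / 1e6); /* 1.92 */
    return 0;
}
\end{verbatim}

Database lookups and per-filter tables add to this figure, but the periodicity means each term can be bounded the same way.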

Figure 7.1: Algorithmic Stages of a Face Recognizer
\includegraphics[width=0.75\columnwidth,keepaspectratio]{figures/vision_algo/vision_algo_pipeline}

Object recognition typically proceeds in two steps: object detection followed by object identification. Most approaches to object identification require a clearly marked area, normalized to a particular size, and the location of key features. Object detectors find the area where the desired feature is likely to reside, scale the area to meet the normalization requirement, and then create a location and boundary description for that area, as in the sketch below. False positives and negatives occur, but the algorithms try to minimize their occurrence.
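A detector's output might be summarized in a record like the following. The field names are hypothetical, not the FaceRec API, but they capture the location, normalization, and key-feature information just described.

\begin{verbatim}
/* Hypothetical summary record a detector might hand downstream. */
typedef struct {
    int   x, y, w, h;       /* boundary of the detected area          */
    float scale;            /* factor applied to meet the size norm   */
    int   left_eye_x,  left_eye_y;   /* key feature locations         */
    int   right_eye_x, right_eye_y;
    float confidence;       /* detector's estimate, used to rank hits */
} Detection;
\end{verbatim}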

Object detectors also often work at a fixed scale. The detector is swept across the image, recording all positions at which a detection is reported. The image is then subsampled or scaled down by a small factor (typically 0.8), and the process is repeated until the frame is smaller than the detector, as sketched below. A decision procedure is then applied to all the predicted hits to decide which are most likely. Detectors often have a much lower compute cost per subwindow than their corresponding identifying routines, but since they are swept across the entire image, a significant portion of the application's execution time may be spent in the detector. In contrast, even though identifying filters are more compute intensive, they are applied only to the high probability regions of the frame, so their contribution to the overall execution time may be low. Though object detectors are less compute intensive, they are much more difficult to design because of their generality: a face identifier chooses from one of N known faces, but a face detector has to distinguish between the effectively infinite sets of faces and non-faces.
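The sweep-and-subsample loop can be sketched in C as follows. The detector window size, the stand-in classifier, and the helper names are all assumptions for illustration, not the detectors studied here.

\begin{verbatim}
#include <stdbool.h>
#include <stdio.h>

typedef struct { int w, h; float scale; } Image;

enum { DETECTOR = 20 };          /* e.g., a 20x20 detector window    */
static const float STEP = 0.8f;  /* shrink factor per pyramid level  */

static bool detect_at(const Image *im, int x, int y)
{   (void)im; (void)x; (void)y; return false; /* stand-in classifier */ }

static Image downscale(Image im)
{
    im.w = (int)(im.w * STEP);
    im.h = (int)(im.h * STEP);
    im.scale /= STEP;            /* track cumulative scale of hits   */
    return im;
}

int main(void)
{
    Image im = { 320, 200, 1.0f };
    int levels = 0;
    /* Repeat until the frame is smaller than the detector itself. */
    while (im.w >= DETECTOR && im.h >= DETECTOR) {
        for (int y = 0; y + DETECTOR <= im.h; y++)
            for (int x = 0; x + DETECTOR <= im.w; x++)
                if (detect_at(&im, x, y))
                    printf("hit (%d,%d) scale %.2f\n", x, y, im.scale);
        im = downscale(im);
        levels++;
    }
    /* A decision procedure would then merge the recorded hits. */
    printf("%d pyramid levels swept\n", levels);  /* 11 for 320x200 */
    return 0;
}
\end{verbatim}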

Since detection is time consuming, it is common to structure an object detector as a cascade of filters, with cheaper heuristics upstream identifying potential regions for more expensive heuristics downstream. An extreme case is the Viola/Jones method, which trains a sequence of about 200 increasingly discriminating filters [103]. A more common approach when dealing with faces and gestures is to identify the flesh colored regions of an image and apply a more sophisticated detector only to those regions, as the sketch below suggests.
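A minimal sketch of such a cascade, assuming each stage is a boolean filter ordered cheapest first; the stage signature is an assumption, not the Viola/Jones code.

\begin{verbatim}
#include <stdbool.h>

/* Each stage may reject a subwindow early, so costlier stages
   downstream never see it. */
typedef bool (*Stage)(const unsigned char *window);

static bool run_cascade(const unsigned char *window,
                        const Stage *stages, int n_stages)
{
    for (int i = 0; i < n_stages; i++)
        if (!stages[i](window))
            return false;   /* early rejection by a cheap filter   */
    return true;            /* survived every filter: report a hit */
}
\end{verbatim}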

The identifier receives candidate regions from the detector along with other information such as probability, scale, and feature locations. It typically employs some type of distance metric from known references to provide a positive identification, as sketched below. In the face recognizer, the first level of detection is provided by flesh toning, which is followed by an image segmentation algorithm. These are followed in turn by a more complex detector, voting for high probability regions, an eye locator, and finally a face identifier.
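One simple realization of the distance-metric step is a nearest-neighbor search over feature vectors. The squared Euclidean metric and the dimensions below are assumptions for illustration; the text does not specify them here.

\begin{verbatim}
#include <float.h>

enum { DIM = 64, N_KNOWN = 100 };  /* assumed vector size and set size */

/* Returns the index of the closest known face; the caller compares
   *best_dist against a threshold to reject unknown faces. */
static int identify(const float probe[DIM],
                    const float known[N_KNOWN][DIM],
                    float *best_dist)
{
    int best = -1;
    float min_d = FLT_MAX;
    for (int i = 0; i < N_KNOWN; i++) {
        float d = 0.0f;
        for (int j = 0; j < DIM; j++) {
            float diff = probe[j] - known[i][j];
            d += diff * diff;      /* squared Euclidean distance */
        }
        if (d < min_d) { min_d = d; best = i; }
    }
    *best_dist = min_d;
    return best;
}
\end{verbatim}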


