A number of groups have developed face finders for controlled domains
such as mug shots. Some researchers have also created reasonably
competent face finders for general purpose domains (e.g. CMU's neural
networks). However, these tend to be computationally expensive and
therefore, given today's processor speeds, are run only on key
frames extracted from video. The approach we took is different: we use
a number of fast but weak methods and combine their results. Because
the processing requirements for all the methods are modest, we can
afford to run the system on all the frames in a video. This dense
temporal sampling lets us apply additional continuity constraints to
further improve the system's performance and boost its resistance to
noise. The overall idea is that if you do enough "stupid" things fast
enough, the final result looks "smart".
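As a rough illustration of how dense sampling allows a continuity constraint, the sketch below drops any detection that has no overlapping detection in an adjacent frame. The overlap measure (intersection over union), the 0.5 threshold, and the names `iou` and `drop_flicker` are assumptions made for this example, not a description of the actual system.

```python
# Sketch only: the overlap test and threshold are hypothetical choices.

def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes,
    treated as continuous coordinates."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def drop_flicker(frames, min_iou=0.5):
    """Keep a box only if a similar box exists in the previous or next frame.

    `frames` is a list (one entry per video frame) of lists of boxes.
    """
    kept = []
    for t, boxes in enumerate(frames):
        neighbors = []
        if t > 0:
            neighbors += frames[t - 1]
        if t + 1 < len(frames):
            neighbors += frames[t + 1]
        kept.append([b for b in boxes
                     if any(iou(b, n) >= min_iou for n in neighbors)])
    return kept
```

A check like this only makes sense when every frame is processed: on sparse key frames a real face would rarely overlap itself from one sample to the next.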
The face finder has three distinct phases: pixel-level processing, region pruning, and constrained tracking. The first phase is the most computationally expensive. Even so, we do not push any single method to its limits in order to find face regions; rather, we combine the partial results of several methods in a filtering paradigm. The first filter is a flesh-tone finder whose acceptance band for skin color is very broad. By itself this leads to poor segmentation, yet it discards a good fraction of most images without wiping out real faces. The color filter is followed by a chroma filter that looks for regions that are significantly brighter or more colorful than the background. Again, this is a weak constraint by itself, but in conjunction with the color filter it removes more of the non-face regions. Finally, we look for regions with dark horizontal bars. Typically there are many of these, but not many that are also skin colored and brighter than their backgrounds.
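The filtering paradigm above can be pictured as a per-pixel AND of three weak masks. The sketch below is only an illustration: the RGB skin rule, the use of the image mean as "the background", the dark-bar test, every threshold, and all function names are assumptions chosen to keep the example small, not the filters the system actually uses.

```python
# Sketch only: all rules, thresholds, and names are hypothetical stand-ins.
# An image is a list of rows; each pixel is an (r, g, b) tuple.

def brightness(p):
    return sum(p) / 3.0


def colorfulness(p):
    return max(p) - min(p)


def fleshtone_mask(img):
    """Deliberately broad skin band: reddish and not too dark."""
    return [[r > 80 and r > g and r > b for (r, g, b) in row] for row in img]


def salient_mask(img, margin=15):
    """Brighter OR more colorful than the background (here: the image mean)."""
    pixels = [p for row in img for p in row]
    bg_bright = sum(brightness(p) for p in pixels) / len(pixels)
    bg_color = sum(colorfulness(p) for p in pixels) / len(pixels)
    return [[brightness(p) > bg_bright + margin
             or colorfulness(p) > bg_color + margin
             for p in row] for row in img]


def dark_bar_mask(img, margin=30, reach=2):
    """True for pixels within `reach` rows of a dark horizontal bar."""
    h, w = len(img), len(img[0])
    lum = [[brightness(p) for p in row] for row in img]

    def dark(y, x):
        # Darker than the pixels directly above and below.
        return (0 < y < h - 1 and 0 <= x < w
                and lum[y][x] + margin < lum[y - 1][x]
                and lum[y][x] + margin < lum[y + 1][x])

    # A bar pixel needs at least one horizontal neighbor that is also dark.
    bar = [[dark(y, x) and (dark(y, x - 1) or dark(y, x + 1))
            for x in range(w)] for y in range(h)]
    return [[any(bar[yy][x]
                 for yy in range(max(0, y - reach), min(h, y + reach + 1)))
             for x in range(w)] for y in range(h)]


def candidate_mask(img):
    """A pixel survives only if it passes all three weak tests."""
    masks = [fleshtone_mask(img), salient_mask(img), dark_bar_mask(img)]
    h, w = len(img), len(img[0])
    return [[all(m[y][x] for m in masks) for x in range(w)] for y in range(h)]
```

Each mask on its own accepts far too much; the point of the design is that their intersection is small even though none of the individual tests is tuned carefully.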
In the second phase, pixels that pass all three of these tests are grouped into potential face objects using binary connected components. Another round of constraints is applied to the "blobs" using their SRI parameters: area, elongation, orientation, etc. Faces in professionally produced videos are roughly round, slightly elongated in the vertical direction, and not too close to any of the borders of the image (cameramen do not typically clip people's faces). Furthermore, there are regularities to the positioning of people in produced video. For instance, size and position are linked: small heads do not usually appear along the bottom edge of the picture. All the blob tests are very fast. Regions that fail any one of them are removed as potential faces.
The output of the system consists of a set of face bounding boxes for each frame. The bounding boxes have unique IDs that are preserved across tracking steps so they can be aggregated into individual "tracks". These tracks can then be rendered into database tables as shown below for applications such as video browsing.
This output is useful for a number of purposes. First, it can be used as a prelude to face recognition: such systems typically expect a cropped "mugshot", which can easily be generated from the bounding box. Second, even just the count and size of faces can be useful for selecting particular types of video segments such as action sequences versus talking heads. Also, when the camera shifts to a new location there is often an "establishing shot" showing the reporter's full body (see video). Looking for a single small head can potentially locate these semantic story breaks. Finally, sometimes thematic content can be inferred from filming style. For instance, intimate dialogs often consist of many close-up shots (i.e. big faces).
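The region-pruning phase can be sketched as connected-component labeling followed by a few cheap shape and position tests. In the sketch below, the 4-connectivity, the numeric thresholds, and the names `connected_components`, `describe`, `looks_like_face`, and `face_boxes` are all assumptions for illustration; the source does not give the actual rules or values.

```python
# Sketch only: thresholds and helper names are hypothetical.
from collections import deque


def connected_components(mask):
    """Group True pixels of a boolean mask into 4-connected blobs."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def describe(pixels):
    """Area, inclusive bounding box, and height/width ratio of a blob."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    left, top, right, bottom = min(xs), min(ys), max(xs), max(ys)
    return {"area": len(pixels),
            "box": (left, top, right, bottom),
            "elongation": (bottom - top + 1) / (right - left + 1)}


def looks_like_face(blob, img_h, img_w, min_area=6, border=1):
    left, top, right, bottom = blob["box"]
    if blob["area"] < min_area:
        return False
    # Roughly round, at most somewhat taller than wide.
    if not 0.8 <= blob["elongation"] <= 2.0:
        return False
    # Faces are rarely clipped by the image border.
    if (left < border or top < border
            or right >= img_w - border or bottom >= img_h - border):
        return False
    # Size and position are linked: no small heads along the bottom edge.
    if blob["area"] < 0.05 * img_h * img_w and bottom >= 0.8 * img_h:
        return False
    return True


def face_boxes(mask):
    """Bounding boxes of the blobs that survive every test."""
    h, w = len(mask), len(mask[0])
    blobs = [describe(p) for p in connected_components(mask)]
    return [b["box"] for b in blobs if looks_like_face(b, h, w)]
```

Because every test reads only a handful of precomputed numbers per blob, a region can be rejected by the first rule it fails without any further image processing.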
Contact: Jon Connell | Last updated: 6/12/02