
Face Finding

A number of groups have developed face finders for controlled domains such as mug shots. Some researchers have also created reasonably competent face finders for general purpose domains (e.g. CMU's neural networks). However, these tend to be computationally expensive and therefore, given today's processor speeds, they are just run on key frames extracted from video. The approach we took is different: we use a number of fast but weak methods and combine their results. Because the processing requirements for all the methods are modest, we can afford to run the system on all the frames in a video. This dense temporal sampling lets us apply additional continuity constraints to further improve the system's performance and boost its resistance to noise. The overall idea is that if you do enough "stupid" things fast enough, the final result looks "smart".

Figure panels: Flesh Tone; High Chroma; Horizontal Texture; Candidate Regions; Bounding Boxes; Shape, Size, Tracking

There are three distinct phases to the face finder: pixel-level processing, region pruning, and constrained tracking. The first phase is the most computationally expensive. However, we do not try to push a single method to its limits in order to find face regions. Rather, we combine the partial results of a number of methods in more of a filtering paradigm. There is an initial flesh-tone finding routine, but the acceptance band for skin color is very broad. By itself this leads to poor segmentation, yet it does discard a good fraction of most images without wiping out real faces. The color filter is followed by a chroma filter that looks for regions that are significantly brighter or more colorful than the background. Again, this is a weak constraint by itself, but in conjunction with the color filter it removes more of the non-face regions. Finally, we look for regions with dark horizontal bars. Typically there are lots of these, but not many that are also skin colored and brighter than their backgrounds.
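The filtering idea can be illustrated with a minimal Python sketch. It is not the original implementation: the function names, the thresholds, the use of the image mean as the "background", and the way a dark horizontal bar is detected are all assumptions chosen only to show how three weak per-pixel tests can be ANDed together.

```python
# Illustrative sketch only; every threshold below is an assumption.
# An image is a list of rows, each row a list of (r, g, b) tuples in 0..255.

def brightness(p):
    r, g, b = p
    return (r + g + b) / 3.0

def chroma(p):
    # Crude colorfulness measure: spread between strongest and weakest channel.
    return max(p) - min(p)

def is_flesh_tone(p):
    # Deliberately broad acceptance band: reddish and not too dark.
    r, g, b = p
    return r > 80 and r > g and g >= b and (r - b) > 15

def is_high_chroma(p, bg_brightness, bg_chroma, margin=20):
    # Brighter OR more colorful than the (assumed) background statistics.
    return (brightness(p) > bg_brightness + margin
            or chroma(p) > bg_chroma + margin)

def near_dark_bar(img, y, x, reach=2, drop=40):
    # True if, within `reach` rows above or below, there is a short
    # horizontal run of pixels much darker than the pixel itself.
    h, w = len(img), len(img[0])
    here = brightness(img[y][x])
    for dy in range(-reach, reach + 1):
        yy = y + dy
        if dy == 0 or not (0 <= yy < h):
            continue
        run = [brightness(img[yy][xx])
               for xx in range(max(0, x - 1), min(w, x + 2))]
        if all(v < here - drop for v in run):
            return True
    return False

def face_pixel_mask(img):
    # A pixel survives only if it passes all three weak tests.
    h, w = len(img), len(img[0])
    pixels = [p for row in img for p in row]
    bg_b = sum(brightness(p) for p in pixels) / len(pixels)
    bg_c = sum(chroma(p) for p in pixels) / len(pixels)
    return [[is_flesh_tone(img[y][x])
             and is_high_chroma(img[y][x], bg_b, bg_c)
             and near_dark_bar(img, y, x)
             for x in range(w)]
            for y in range(h)]
```

Each test is cheap and individually unreliable; the point of the sketch is only that the conjunction rejects far more pixels than any single test does.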

Candidate Pruning Steps: roughly vertical ellipse; not touching borders; sufficiently high in frame; similar size to biggest; tracked for several frames

In the second phase, pixels that pass all three of these tests are then grouped into potential face objects using binary connected components. Another round of constraints is applied to the "blobs" using their SRI parameters: area, elongation, orientation, etc. Faces in professionally produced videos are roughly round, slightly elongated in the vertical direction, and not too close to any of the borders of the image (cameramen do not typically clip people's faces). Furthermore, there are regularities to the positioning of people in produced video. For instance, size and position are linked: small heads do not usually appear along the bottom edge of the picture. All the blob tests are very fast. Regions that fail any one of them are removed as potential faces.
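A rough Python sketch of this phase follows, assuming the input is a binary mask (a list of rows of booleans). The 4-connectivity, the bounding-box shape tests, and all numeric thresholds are illustrative assumptions, not the parameters of the actual system.

```python
from collections import deque

def connected_components(mask):
    # 4-connected components of True pixels; each blob is a list of (y, x).
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, blob = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs

def bounding_box(blob):
    ys = [y for y, _ in blob]
    xs = [x for _, x in blob]
    return min(xs), min(ys), max(xs), max(ys)  # (x0, y0, x1, y1)

def is_face_like(blob, h, w, min_area=12, min_fill=0.6, max_elong=2.0):
    # All thresholds are assumed values for illustration.
    x0, y0, x1, y1 = bounding_box(blob)
    bw, bh = x1 - x0 + 1, y1 - y0 + 1
    if len(blob) < min_area:
        return False                      # too small
    if x0 == 0 or y0 == 0 or x1 == w - 1 or y1 == h - 1:
        return False                      # touches an image border
    if not (1.0 <= bh / bw <= max_elong):
        return False                      # not a roughly vertical ellipse
    if len(blob) / (bw * bh) < min_fill:
        return False                      # too ragged to be a solid region
    if (y0 + y1) / 2 > 0.75 * h:
        return False                      # too low in the frame
    return True

def prune_blobs(mask, size_ratio=0.5):
    # Keep face-like blobs whose area is comparable to the biggest survivor.
    h, w = len(mask), len(mask[0])
    keep = [b for b in connected_components(mask) if is_face_like(b, h, w)]
    if not keep:
        return []
    biggest = max(len(b) for b in keep)
    return [bounding_box(b) for b in keep if len(b) >= size_ratio * biggest]
```

Because a blob is rejected as soon as it fails any single test, the cheapest checks can be placed first without changing the result.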

Video: news face (AVI, 2.9MB)

In the final phase, the candidate face regions from temporally adjacent frames are combined. Two regions are linked if they have roughly the same size and position in the two frames. We have not found it necessary to correct for velocity (i.e. test for overlap at a projected position) although this could be added. New face candidates are given provisional status until they have been consistently detected for a number of consecutive frames. This prevents a few frames of spurious detections from incorrectly instantiating a face object. Similarly, once a face reaches the acceptance threshold it is given some additional persistence. That is, the system maintains the face track for a number of frames even in the absence of new candidate matches. This allows it to tolerate occasional drop-outs due to head rotations or objects passing in front of the person. Matching and tracking in this manner gives the system additional robustness to image variations and can "cover up" some of the mistakes made at the lower levels. It is also very fast since all the processing is done on symbolic lists, not images.
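The provisional-then-persistent bookkeeping described above might look like the following Python sketch. The class and parameter names (`FaceTracker`, `confirm_after`, `persist_for`), the similarity tolerance, and the greedy first-match linking are hypothetical choices made for illustration; the text does not give the actual values or matching rule.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Track:
    track_id: int
    box: tuple            # (x0, y0, x1, y1)
    hits: int = 1         # frames in which this track was matched
    misses: int = 0       # consecutive frames without a match
    confirmed: bool = False

def similar(a, b, tol=0.5):
    # "Roughly the same size and position": center offset and size change
    # must both be small relative to the larger box dimension (assumed rule).
    aw, ah = a[2] - a[0], a[3] - a[1]
    bw, bh = b[2] - b[0], b[3] - b[1]
    scale = max(aw, ah, bw, bh, 1)
    dcx = abs((a[0] + a[2]) - (b[0] + b[2])) / 2
    dcy = abs((a[1] + a[3]) - (b[1] + b[3])) / 2
    return max(dcx, dcy, abs(aw - bw), abs(ah - bh)) <= tol * scale

class FaceTracker:
    def __init__(self, confirm_after=3, persist_for=5):
        self.confirm_after = confirm_after  # matches needed for acceptance
        self.persist_for = persist_for      # misses tolerated once accepted
        self.tracks = []
        self._ids = count(1)

    def update(self, boxes):
        # Works on lists of boxes only; no image data is touched here.
        unmatched = list(boxes)
        survivors = []
        for t in self.tracks:
            match = next((b for b in unmatched if similar(t.box, b)), None)
            if match is not None:
                unmatched.remove(match)
                t.box, t.hits, t.misses = match, t.hits + 1, 0
                if t.hits >= self.confirm_after:
                    t.confirmed = True
                survivors.append(t)
            else:
                t.misses += 1
                # Provisional tracks are dropped at once; confirmed tracks
                # coast through a limited number of missed frames.
                if t.confirmed and t.misses <= self.persist_for:
                    survivors.append(t)
        survivors.extend(Track(next(self._ids), b) for b in unmatched)
        self.tracks = survivors
        return [(t.track_id, t.box) for t in self.tracks if t.confirmed]
```

Calling `update` once per frame with that frame's candidate boxes returns the confirmed faces together with IDs that stay stable for the life of each track.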

The output of the system consists of a set of face bounding boxes for each frame. The bounding boxes have unique IDs that are preserved across tracking steps so they can be aggregated into individual "tracks". These tracks can then be rendered into database tables as shown below for applications such as video browsing. This is useful for a number of purposes. First, it can be used as a prelude to face recognition: such systems typically expect a cropped "mugshot", which can easily be generated from the bounding box. Second, even just the count and size of faces can be useful for selecting particular types of video segments such as action sequences versus talking heads. Also, when the camera shifts to a new location there is often an "establishing shot" showing the reporter's full body (see video). Looking for a single small head can potentially locate these semantic story breaks. Finally, sometimes thematic content can be inferred from filming style. For instance, intimate dialogs often consist of many close-up shots (i.e. big faces).


VIDEO        START        STOP         ATTRIBUTE     VALUE
---------    ---------    ---------    ----------    ----------
"abc3-30"     2.066667     4.200000    Face Count    2
"abc3-30"     6.000000     8.200000    Face Count    3
"abc3-30"     8.200000    12.000000    Face Count    0
"abc3-30"    12.000000    25.933333    Face Size     Chest Shot
"abc3-30"    12.000000    26.733333    Face Size     Waist Shot
"abc3-30"    12.000000    28.066667    Face Size     Chest Shot
"abc3-30"    12.000000    28.733333    Face Size     Waist Shot
"abc3-30"    12.000000    29.466667    Face Count    1

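As a hedged illustration of how "Face Count" rows like those in the table above could be produced, the Python sketch below collapses a per-frame face count into start/stop intervals. The function name `face_count_rows`, the 30 fps frame rate, and the treatment of times as seconds are assumptions; the source does not specify how its table was generated.

```python
def face_count_rows(video, counts, fps=30.0):
    # counts[i] is the number of confirmed faces in frame i.
    # Each run of equal counts becomes one row:
    # (video, start_seconds, stop_seconds, attribute, value).
    rows = []
    start = 0
    for i in range(1, len(counts) + 1):
        if i == len(counts) or counts[i] != counts[start]:
            rows.append((video,
                         round(start / fps, 6),
                         round(i / fps, 6),
                         "Face Count",
                         counts[start]))
            start = i
    return rows
```

For example, `face_count_rows("abc3-30", [2, 2, 2, 0, 0])` yields one row for the three frames containing two faces and a second row for the two empty frames.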
In summary, we have patched up the inadequacies of fast, simple pixel-processing routines in three ways. First, we combine a number of different methods. Second, we post-filter the candidates using cheap blob calculations. Third, we further prune the candidate list using a rapid calculation which ensures limited continuity over time. The face finder code currently runs at a blazing 90 fps on a 400 MHz Pentium II, although in real applications additional time is required for live image capture or MPEG decoding.

 
Contact: Jon Connell Last updated: 6/12/02
 