
Audio-Visual Speech

Although significant progress has been made in large vocabulary continuous speech recognition (LVCSR) over the last few years, the technology to date is most effective only under controlled conditions such as low noise, speaker-dependent recognition, and read (as opposed to conversational) speech. Also, the interface is not as intuitive as it could be. In this project we are looking at ways in which visual interpretation of the user's face can enhance these capabilities.

The potential for improved recognition rates using visual features is well established in the literature on the basis of psychophysical experiments. Canonical mouth shapes that accompany speech utterances have been categorized, and are known as visual phonemes or "visemes". Visemes provide information that complements the phonetic stream from the point of view of confusability. For example, "m" and "n", which are confusable acoustically, especially in noisy conditions, are easy to distinguish visually: in "m" the lips close at onset, whereas in "n" they do not. The unvoiced fricatives "f" and "s", which are difficult to recognize acoustically, belong to two different viseme groups.


Candidate visual features include gray-scale parameters of the mouth region; geometric model-based parameters such as the area, height, and width of the mouth region; lip contours obtained by curve fitting, and spline parameters of the inner and outer contours; and motion parameters obtained by 3-D tracking. Gray-scale parameters are sensitive to lighting conditions. Lip contour information, although invariant to lighting conditions, may not provide enough information about the inner articulators such as the teeth and tongue. In our work we consider gray-scale parameters associated with the mouth region of the image. Given the location of the lip corners, a rectangular region of normalized scale and rotation, centered on the mouth center, is sub-sampled from the original video frame. Principal Component Analysis (PCA) is then used to extract a vector of smaller dimension from this vector of gray-scale image values. For speech corrupted by babble noise (15 dB SNR), adding visual information improved the phoneme recognition rate from 40% (audio only) to 49%.
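The mouth-region pipeline described above can be sketched roughly as follows. This is only an illustrative sketch in Python with NumPy, not the project's implementation: the function names, the 32x16 patch size, the width_scale margin, and the nearest-neighbor sampling are all assumptions made for the example.

```python
import numpy as np

def extract_mouth_patch(frame, left_corner, right_corner,
                        out_w=32, out_h=16, width_scale=1.5):
    """Sample a scale- and rotation-normalized rectangle centered on the mouth.

    frame is a 2-D gray-scale array; the lip corners are (x, y) pixel
    coordinates. Patch size and width_scale are hypothetical choices.
    """
    left = np.asarray(left_corner, dtype=float)
    right = np.asarray(right_corner, dtype=float)
    center = (left + right) / 2.0
    axis = right - left
    mouth_width = np.linalg.norm(axis)
    u = axis / mouth_width           # unit vector along the mouth
    v = np.array([-u[1], u[0]])      # unit vector perpendicular to it

    # Sampling grid in mouth-centered coordinates, scaled by the mouth width.
    half_w = 0.5 * width_scale * mouth_width
    half_h = half_w * out_h / out_w
    gx, gy = np.meshgrid(np.linspace(-half_w, half_w, out_w),
                         np.linspace(-half_h, half_h, out_h))

    # Map the grid back into image coordinates (undoing rotation and scale).
    px = center[0] + gx * u[0] + gy * v[0]
    py = center[1] + gx * u[1] + gy * v[1]

    # Nearest-neighbor sub-sampling, clamped to the image bounds.
    cols = np.clip(np.rint(px).astype(int), 0, frame.shape[1] - 1)
    rows = np.clip(np.rint(py).astype(int), 0, frame.shape[0] - 1)
    return frame[rows, cols].astype(float)

def fit_pca(patches, n_components):
    """Estimate a PCA basis from a list of equally sized mouth patches."""
    X = np.stack([p.ravel() for p in patches])
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(patch, mean, basis):
    """Reduce one patch to a low-dimensional visual feature vector."""
    return basis @ (patch.ravel() - mean)
```

In such a setup the PCA basis would be estimated once from training frames, and each incoming video frame would then be reduced to a short feature vector with project before being passed to the recognizer.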

Other aspects of speech-based interfaces can also benefit from video interpretation. For instance, current HCI systems using speech recognition require the user to explicitly indicate an intent to speak by turning on a microphone using the keyboard or mouse. But a key aspect of the naturalness of speech communication is the ability of humans to detect an intent to speak. Humans do this through a combination of visual and auditory cues. Visual cues include physical proximity, eye contact, and lip movement. Automatic detection of speech onset for open-microphone solutions can be carried out using silence/speech detection. However, purely audio-based techniques are sensitive to background noise.

We are exploring the combination of visual and audio cues to provide robust indicators of speech intent and speech onset/offset. Our current approach uses the user's proximity to the computer and the frontality of pose. We also use a measure of visual speech activity based on detecting particular visemes from the extracted mouth region of the image. While these visual cues are not as effective as audio cues based on speech/silence detection, in combination they should improve overall system performance.
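One simple way to picture the cue combination described above is a weighted score per video frame, followed by a two-threshold (hysteresis) rule for onset and offset. The sketch below is purely illustrative: the cue names, the assumption that every cue is already scaled to the range 0 to 1, the weights, and the thresholds are all hypothetical, and the actual fusion method used in the project is not specified here.

```python
from dataclasses import dataclass

@dataclass
class FrameCues:
    """Per-frame cues, each assumed to be pre-scaled to the range 0..1."""
    proximity: float      # how close the user is to the computer
    frontality: float     # how directly the face is turned to the camera
    visual_speech: float  # viseme-based mouth activity
    audio_speech: float   # audio speech/silence detector score

def speech_intent_score(cues, weights=(0.2, 0.2, 0.2, 0.4)):
    """Weighted sum of the cues; the weights are illustrative only."""
    values = (cues.proximity, cues.frontality,
              cues.visual_speech, cues.audio_speech)
    return sum(w * v for w, v in zip(weights, values))

def detect_onsets_offsets(scores, on_threshold=0.6, off_threshold=0.4):
    """Mark speech onset/offset frames using two thresholds (hysteresis)."""
    events = []
    speaking = False
    for i, score in enumerate(scores):
        if not speaking and score >= on_threshold:
            speaking = True
            events.append((i, "onset"))
        elif speaking and score <= off_threshold:
            speaking = False
            events.append((i, "offset"))
    return events
```

With these example weights, a loud background noise that drives only the audio score high (for instance 0.9 with all visual cues at 0) yields a combined score of 0.36, which stays below the onset threshold; that is the kind of robustness to background noise the combination is meant to provide.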

 

Selected publications:

Late Integration in Audio-visual Continuous Speech Recognition
A. Verma, T. Faruquie, C. Neti, S. Basu, A. Senior
In Proceedings of Automatic Speech Recognition and Understanding,
Colorado, 12-15 December 1999.

 
Contact: Andrew Senior
Last updated: 6/7/02
 