Although significant progress has been made in machine transcription
of large-vocabulary continuous speech (LVCSR) over the last few years,
the technology to date is most effective only under controlled conditions
such as low noise, speaker-dependent recognition, and read (as opposed
to conversational) speech. The interface is also not as intuitive as it
could be. In this project we are looking at ways in which visual
interpretation of the user's face can enhance these capabilities.
The potential for improved recognition rates using visual features is well established in the literature on the basis of psychophysical experiments. Canonical mouth shapes that accompany speech utterances have been categorized, and are known as visual phonemes or "visemes". Visemes complement the phonetic stream because sounds that are easily confused acoustically are often easy to tell apart visually. For example, "m" and "n", which are confusable acoustically, especially in noisy conditions, are easy to distinguish visually: in "m" the lips close at onset, whereas in "n" they do not. Similarly, the unvoiced fricatives "f" and "s", which are difficult to recognize acoustically, belong to two different viseme groups.
Candidate visual features are grayscale parameters of the mouth region; geometric model-based parameters such as the area, height, and width of the mouth region; lip contours obtained by curve fitting, such as spline parameters of the inner and outer contours; and motion parameters obtained by 3D tracking. Grayscale parameters are sensitive to lighting conditions. Lip contour information, although invariant to lighting conditions, may not provide enough information about the inner articulators such as the teeth and tongue. In our work we consider grayscale parameters associated with the mouth region of the image. Given the location of the lip corners, a rectangular region of normalized scale and rotation, centered on the mouth center, is sub-sampled from the original video frame. Principal Component Analysis (PCA) is then used to extract a vector of smaller dimension from this vector of grayscale image values. For speech corrupted by babble noise (15 dB SNR), adding visual information improved the phoneme recognition rate from 40% (audio only) to 49%.
Other aspects of speech-based interfaces can also benefit from video interpretation. For instance, current HCI systems using speech recognition require the user to explicitly indicate an intent to speak by turning on a microphone using the keyboard or mouse. But one of the key aspects of the naturalness of speech communication is the ability of humans to detect an intent to speak. Humans do this using a combination of visual and auditory cues. Visual cues include physical proximity, eye contact, and lip movement. Automatic detection of speech onset for open-microphone solutions can be carried out using silence/speech detection. However, purely audio-based techniques are sensitive to background noise. We are exploring the combination of visual and audio cues to provide robust indicators of speech intent and speech onset/offset.
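The mouth-region feature extraction described above (a scale- and rotation-normalized rectangle sub-sampled around the lip corners, then reduced with PCA) can be sketched roughly as follows. This is a minimal illustration using NumPy, not the project's actual implementation: the function names, the 16x32 patch size, the `width_scale` margin, the nearest-neighbour sampling, and the number of retained components are all assumptions chosen for the example, and the frames are random stand-ins for real video.

```python
import numpy as np

def extract_mouth_patch(frame, left_corner, right_corner,
                        out_h=16, out_w=32, width_scale=1.5):
    """Sample a rotation- and scale-normalized grayscale patch around the mouth.

    Corners are (x, y) pixel coordinates. Patch size and width_scale are
    illustrative assumptions, not values from the project.
    """
    left = np.asarray(left_corner, dtype=float)
    right = np.asarray(right_corner, dtype=float)
    center = (left + right) / 2.0
    axis = right - left
    mouth_width = np.linalg.norm(axis)
    u = axis / mouth_width             # unit vector along the mouth
    v = np.array([-u[1], u[0]])        # unit vector perpendicular to it
    half_w = width_scale * mouth_width / 2.0
    half_h = half_w * out_h / out_w
    gx, gy = np.meshgrid(np.linspace(-half_w, half_w, out_w),
                         np.linspace(-half_h, half_h, out_h))
    pts = center + gx[..., None] * u + gy[..., None] * v   # (out_h, out_w, 2)
    # Nearest-neighbour sampling, clipped to the frame borders.
    cols = np.clip(np.rint(pts[..., 0]).astype(int), 0, frame.shape[1] - 1)
    rows = np.clip(np.rint(pts[..., 1]).astype(int), 0, frame.shape[0] - 1)
    return frame[rows, cols]

def fit_pca(patches, n_components):
    """Fit PCA on flattened patches; returns the mean and principal directions."""
    X = patches.reshape(len(patches), -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(patch, mean, components):
    """Reduce one patch to a low-dimensional visual feature vector."""
    return components @ (patch.reshape(-1).astype(np.float64) - mean)

# Synthetic stand-in for video frames and tracked lip corners.
rng = np.random.default_rng(0)
frames = rng.random((50, 120, 160))
patches = np.stack([extract_mouth_patch(f, (60, 80), (100, 80)) for f in frames])
mean, components = fit_pca(patches, n_components=10)
features = project(patches[0], mean, components)
```

In a real system the lip corners would come from a face and mouth tracker, and the PCA basis would be estimated once on training video and then reused for every incoming frame.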
Our current approach uses the user's proximity to the computer and the frontality of the user's pose. We also use a measure of visual speech activity based on detecting particular visemes in the extracted mouth region of the image. While these visual cues are not as effective as audio cues based on speech/silence detection, in combination they should improve overall system performance.
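One simple way to picture how such cues might be combined is a weighted score per video frame that is then thresholded. The text does not say how the cues are actually fused, so everything in the sketch below is hypothetical: the `FrameCues` fields, the assumption that each cue has already been normalized to the range 0 to 1, the weights (which favour audio, since it is described as the more effective cue), and the threshold.

```python
from dataclasses import dataclass

@dataclass
class FrameCues:
    """Per-frame cue scores, each assumed to be normalized to [0, 1]."""
    proximity: float      # 1.0 = user close to the computer
    frontality: float     # 1.0 = face fully frontal to the camera
    visual_speech: float  # viseme-based mouth activity
    audio_speech: float   # audio speech/silence detector output

# Hypothetical weights; audio is weighted highest.
WEIGHTS = {
    "proximity": 0.15,
    "frontality": 0.15,
    "visual_speech": 0.20,
    "audio_speech": 0.50,
}

def speech_score(cues: FrameCues) -> float:
    """Weighted combination of the visual and audio cues."""
    return sum(weight * getattr(cues, name) for name, weight in WEIGHTS.items())

def is_speaking(cues: FrameCues, threshold: float = 0.5) -> bool:
    """Declare speech when the combined score reaches the (assumed) threshold."""
    return speech_score(cues) >= threshold

# Loud background noise while nobody is at the computer: audio alone is misleading.
background_noise = FrameCues(proximity=0.0, frontality=0.0,
                             visual_speech=0.0, audio_speech=0.8)
# User facing the screen and talking, with only a moderate audio score.
user_talking = FrameCues(proximity=1.0, frontality=1.0,
                         visual_speech=0.8, audio_speech=0.6)

print(is_speaking(background_noise), is_speaking(user_talking))
```

With these illustrative weights, a high audio score on its own is not enough to trigger detection, which reflects the motivation for adding visual cues to a purely audio-based detector.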
Contact: Andrew Senior | Last updated: 6/7/02