S. Basu, C. Neti, N. Rajput*, A. Senior, L. Subramaniam*, A. Verma *
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
*IBM Solutions Research Center, New Delhi, India.
Visual feature extraction:
Although previous work has been conducted to define the viseme units
derived from human lip-reading experiments [2]
and other psychophysical data, more research is necessary to identify the
mouth features that are relevant for large vocabulary, speaker independent
visual speech recognition.
Candidate features are grey-scale parameters of the mouth region; geometric or model-based parameters such as the area, height and width of the mouth region; lip contours arrived at by curve fitting, spline parameters of the inner and outer contours; and motion parameters obtained by 3-D tracking. Grey-scale parameters are sensitive to lighting conditions. Lip contour information, although invariant to lighting conditions, may not provide enough information about the inner articulators such as the teeth and tongue. Another feature set, suggested recently to take the above factors into account, is the Active Shape model [8].
In this report, we consider grey-scale parameters associated with the mouth region of the image. Given the location of the lip corners, a rectangular region of normalized scale and rotation, centred on the mouth centre, is subsampled from the original video frame. Principal Component Analysis (PCA) is then used to extract a vector of smaller dimension from this vector of grey-scale values.
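The PCA step can be illustrated with a minimal sketch. The patch size (16 x 32 pixels), the number of frames and the random data are assumptions made purely for illustration; the output dimension of 100 is taken from the video feature dimension reported in the tables. The function names `fit_pca` and `project` are hypothetical.

```python
import numpy as np

def fit_pca(patches, n_components):
    """Fit PCA on flattened grey-scale mouth patches.

    patches: array of shape (n_frames, height, width).
    Returns the mean vector and the leading principal directions.
    """
    X = patches.reshape(len(patches), -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(patches, mean, components):
    """Map each patch to its coordinates along the principal directions."""
    X = patches.reshape(len(patches), -1).astype(np.float64)
    return (X - mean) @ components.T

# Synthetic stand-in for normalized mouth regions (assumed 16 x 32 pixels).
rng = np.random.default_rng(0)
patches = rng.random((200, 16, 32))
mean, components = fit_pca(patches, n_components=100)
features = project(patches, mean, components)
print(features.shape)  # (200, 100)
```

In practice the principal directions would be estimated once on training frames and then reused unchanged for test frames.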
Audio feature extraction:
Digitized speech sampled at a rate of 16 kHz was considered. A frame
consists of a segment of speech of duration 25 ms, and produces a 24-dimensional
acoustic cepstral vector via the following process, which is standard in
speech recognition literature. Frames are advanced every 10 ms to obtain
succeeding acoustic vectors.
First, the magnitudes of the discrete Fourier transform of the speech samples in a frame are computed on a logarithmically warped frequency scale. Next, these amplitude values are themselves transformed to a logarithmic scale (these two steps are motivated by the logarithmic sensitivity of human hearing to frequency and amplitude), and subsequently a rotation in the form of a discrete cosine transform is applied. One way to capture the dynamics is to use the delta (first-difference) and delta-delta (second-order difference) information. An alternative is to append a set of (say, four) preceding and succeeding vectors to the vector under consideration and then project the resulting vector to a lower-dimensional space, chosen to give the most discrimination. The latter procedure is known as Linear Discriminant Analysis (LDA) in the standard literature.
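The front end described above can be sketched roughly as follows. The frame length, frame advance, sampling rate and cepstral dimension come from the text; the mel scale as the warped frequency axis, the triangular filters, the Hamming window and the 512-point FFT are assumptions, since the text does not specify them. The helper `splice` shows the appending of four preceding and four succeeding vectors; the subsequent LDA projection is not shown here.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = 400    # 25 ms at 16 kHz
FRAME_STEP = 160   # 10 ms at 16 kHz
N_CEPS = 24

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the (assumed) mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, len(bin_hz)))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def cepstra(signal, n_fft=512):
    """One 24-dimensional cepstral vector per 10 ms frame advance."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_STEP
    idx = np.arange(FRAME_LEN) + FRAME_STEP * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(FRAME_LEN)
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    warped = magnitude @ mel_filterbank(N_CEPS, n_fft, SAMPLE_RATE).T
    log_energy = np.log(warped + 1e-10)  # small floor avoids log(0)
    # Discrete cosine transform (type II) written out as a matrix.
    n = np.arange(N_CEPS)
    dct = np.cos(np.pi * np.outer(n, n + 0.5) / N_CEPS)
    return log_energy @ dct.T

def splice(features, context=4):
    """Append `context` preceding and succeeding vectors to each frame."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)] for i in range(2 * context + 1)])

signal = np.random.default_rng(0).standard_normal(SAMPLE_RATE)  # 1 s of noise
feats = cepstra(signal)
spliced = splice(feats)
print(feats.shape, spliced.shape)  # (98, 24) (98, 216)
```

Repeating the edge frames at the start and end of the utterance is one simple way to handle the boundaries; other padding choices are equally plausible.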
Using visual information to augment the audio signal for speech recognition
involves the ability to fuse different representations of the same underlying
production process. Such mode fusion, or multi-modal integration, falls into
the following categories of sensory data fusion [10]:
(1) feature fusion -- features are extracted from the
raw data and subsequently combined, for example by fusing
speech features with lip and facial features; (2) decision fusion -- fusion
at the most advanced stage of processing, after independent
classification of each modality, which can happen at the sub-word level, word level,
utterance level or action level. In the following section, we describe
some preliminary experiments using feature fusion. In feature fusion, we
first extract audio features and video features and then concatenate the
two to generate a single audio-video feature vector. We use LDA (as described
earlier) on the combined vector to generate a lower dimensional discriminant
feature representation.
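A minimal sketch of feature fusion followed by LDA is given here. The dimensions (24 audio, 50 video, 35 after projection) follow the tables in this report; the synthetic data, the 52 class labels, the omission of frame splicing and the particular way of solving the LDA eigenproblem (whitening the within-class scatter, then diagonalising the between-class scatter) are assumptions for illustration. The function name `fit_lda` is hypothetical.

```python
import numpy as np

def fit_lda(X, labels, n_components):
    """Projection that maximises between-class relative to within-class scatter."""
    mean = X.mean(axis=0)
    dim = X.shape[1]
    Sw = np.zeros((dim, dim))  # within-class scatter
    Sb = np.zeros((dim, dim))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Whiten Sw, then diagonalise Sb in the whitened space.
    evals, evecs = np.linalg.eigh(Sw)
    whiten = evecs / np.sqrt(np.maximum(evals, 1e-10))
    b_evals, b_evecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(b_evals)[::-1][:n_components]
    return whiten @ b_evecs[:, order]

# Synthetic, time-aligned audio and video features with class-dependent means.
rng = np.random.default_rng(0)
n_frames, n_classes = 2000, 52
labels = rng.integers(0, n_classes, size=n_frames)
audio = rng.standard_normal((n_frames, 24)) + 0.1 * labels[:, None]
video = rng.standard_normal((n_frames, 50)) + 0.05 * labels[:, None]

fused = np.hstack([audio, video])  # single 74-dimensional audio-video vector
projection = fit_lda(fused, labels, n_components=35)
projected = fused @ projection
print(fused.shape, projected.shape)  # (2000, 74) (2000, 35)
```

Because the number of useful discriminant directions is at most the number of classes minus one, a 35-dimensional projection presupposes more than 35 classes.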
The audio sampling rate of 16 kHz is chosen so as to be able to compare the joint audio-visual recognition results with the audio-only HUB4 evaluation experiments. While this is an ongoing data collection effort, at the present time we have about 700 video clips of approximately 10-15 seconds duration each (the entire HUB4 database is approximately 200 hrs. of speech data, not all of which is usable for our purpose).
In summary, we use a database of large vocabulary continuous visual speech transmitted over a broadcast channel. The fact that it is real-life data (as opposed to data collected in controlled environments) distinguishes this from existing databases. While making the system applicable to real problem domains, this does make visual feature extraction and subsequent processing a more challenging task.
The need for controlled data is not to be underplayed and, in our view, may indeed have an important role to play in this general area of research. For purposes of validation of results we also collected "read" large vocabulary continuous visual speech. This data was collected in acoustically quiet, controlled conditions and the resolution of the lip region in the video image was much higher than in the LDC data mentioned above -- thus making video based recognition a more tractable task. For the purpose of fair comparison with the LDC data, the video digitization parameters and audio sampling frequency were kept the same. We label this data the 'ViaVoice Audio-Visual' (VVAV) data.
We report specific results on joint audio-video phonetic classification and its comparison with audio-only and video-only classification. For video we use a 'viseme' based approach as described above. One approach to labelling the video feature vectors is to label the speech data from a Viterbi alignment and subsequently to use a phoneme-to-viseme mapping. To produce phonetic alignments of the audio data we use the acoustic models trained using the DARPA HUB4 speech recognition data. The video frame rate is typically lower than the audio frame rate; this is circumvented by inter-frame interpolation. In all experiments the HUB4-video database of continuous large vocabulary speech mentioned in Section [11] is used.
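The inter-frame interpolation can be sketched as follows. The audio frame rate of 100 vectors per second follows from the 10 ms frame advance; the video rate of 30 frames per second and the use of linear interpolation are assumptions, as the interpolation scheme is not specified here. The function name `upsample_video_features` is hypothetical.

```python
import numpy as np

def upsample_video_features(video_feats, video_fps, audio_frame_rate, n_audio_frames):
    """Linearly interpolate each video feature dimension at the audio frame times."""
    video_times = np.arange(len(video_feats)) / video_fps
    audio_times = np.arange(n_audio_frames) / audio_frame_rate
    # np.interp holds the last video value for audio frames past the final video frame.
    return np.column_stack([
        np.interp(audio_times, video_times, video_feats[:, d])
        for d in range(video_feats.shape[1])
    ])

# One second of synthetic video features at an assumed 30 frames per second.
rng = np.random.default_rng(0)
video_feats = rng.standard_normal((30, 100))
upsampled = upsample_video_features(video_feats, video_fps=30.0,
                                    audio_frame_rate=100.0, n_audio_frames=100)
print(upsampled.shape)  # (100, 100)
```

After this step each audio vector has a video vector at the same time index, so the phonetic labels from the Viterbi alignment can be applied to both streams.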
In the following experiments 672 audio-video clips of VVAV data were
used as a training set.
The test set consisted of 36 different clips taken from the same database.
All the experiments use LDA features. In the phonetic/visemic classification
each phone/viseme is modelled as a mixture of 5 Gaussians.
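To make the classification rule concrete, a sketch of scoring a feature vector against per-class Gaussian mixtures follows. Diagonal covariances, equal class priors and the toy model parameters are assumptions; mixture training (for example by EM) is omitted. The names `gmm_log_likelihood` and `classify` are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance Gaussian mixture.

    x: (dim,), weights: (k,), means: (k, dim), variances: (k, dim).
    """
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp = np.log(weights) + log_norm + log_exp
    peak = comp.max()  # log-sum-exp for numerical stability
    return peak + np.log(np.sum(np.exp(comp - peak)))

def classify(x, models):
    """Pick the class whose mixture gives the highest log-likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(x, *models[label]))

# Two toy classes, each a mixture of 5 Gaussians in a 35-dimensional space.
rng = np.random.default_rng(0)
dim, k = 35, 5
models = {
    "AA": (np.full(k, 1.0 / k), rng.normal(0.0, 0.1, (k, dim)), np.ones((k, dim))),
    "IY": (np.full(k, 1.0 / k), rng.normal(3.0, 0.1, (k, dim)), np.ones((k, dim))),
}
print(classify(np.full(dim, 3.0), models))  # IY
```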
| Data Type | Dimension | Splice Param. | Reco. Rate |
| Audio Only (Training Data) | 24 | 60 dim | 53.66% |
| Video Only (Training Data) | 100 | 35 dim | 22.21% |
| Audio Only (Test Data) | 24 | 60 dim | 48.08% |
| Video Only (Test Data) | 100 | 35 dim | 20.15% |
| Audio-Video (Training Data) | 24+50 | 35 dim | 53.58% |
| Audio-Video (Test Data) | 24+50 | 35 dim | 48.71% |

| Data Type | Dim | Splice dim | Phonetic | Visemic |
| Audio Only (Test) | 24 | 60 dim | 28.05% | 40.40% |
| Video Only (Test) | 100 | 35 dim | 20.15% | 27.76% |
| Audio-Video (Test) | 24+50 | 35 dim | 32.02% | 44.81% |
A comparison of Tables 1 and 2 shows that, in acoustically degraded conditions, audio-visual recognition is better than either of the two streams processed independently. Phonetic classification accuracy rises from 28.05% (audio only) to 32.02% (audio-video), a relative improvement of approximately 14% over the audio-only classification scheme.
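The relative improvement can be checked directly from the test-set figures in the second table. The function name `relative_improvement` is hypothetical; the accuracies are those reported for the audio-only and audio-video rows.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

phonetic = relative_improvement(28.05, 32.02)  # audio only vs audio-video, phonetic
visemic = relative_improvement(40.40, 44.81)   # audio only vs audio-video, visemic
print(f"{phonetic:.1f}% {visemic:.1f}%")  # 14.2% 10.9%
```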
We used the following grouping of phonemes into viseme classes. For
a detailed explanation of the symbols used for phoneme classes we refer
to [13].
(AA, AH, AX), (AE), (AO), (AW), (AXR, ER), (AY), (CH),(EH), (EY),(HH),
(IH,IX), (IY), (JH), (L), (OW), (OY), (R), (UH, UW), (W), (X, D$),
(B,BD,M,P,PD),(D,DD,DX,,G,GD,K,KD,N,NG,T,TD,Y), (TS), (F,V), (S,Z),
(SH,ZH), (TH,DH).
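The grouping above can be expressed as a lookup table for relabelling a phonetic alignment with viseme classes. The phoneme symbols are copied as listed; the integer class indices and the names `VISEME_GROUPS`, `PHONE_TO_VISEME` and `visemes_from_alignment` are hypothetical choices for this sketch.

```python
# Each tuple is one viseme class; symbols are copied from the grouping above.
VISEME_GROUPS = [
    ("AA", "AH", "AX"), ("AE",), ("AO",), ("AW",), ("AXR", "ER"), ("AY",),
    ("CH",), ("EH",), ("EY",), ("HH",), ("IH", "IX"), ("IY",), ("JH",),
    ("L",), ("OW",), ("OY",), ("R",), ("UH", "UW"), ("W",), ("X", "D$"),
    ("B", "BD", "M", "P", "PD"),
    ("D", "DD", "DX", "G", "GD", "K", "KD", "N", "NG", "T", "TD", "Y"),
    ("TS",), ("F", "V"), ("S", "Z"), ("SH", "ZH"), ("TH", "DH"),
]

# Map every phoneme symbol to the index of its viseme class.
PHONE_TO_VISEME = {
    phone: index for index, group in enumerate(VISEME_GROUPS) for phone in group
}

def visemes_from_alignment(phone_labels):
    """Convert per-frame phoneme labels into viseme class indices."""
    return [PHONE_TO_VISEME[p] for p in phone_labels]

print(len(VISEME_GROUPS), len(PHONE_TO_VISEME))   # 27 52
print(visemes_from_alignment(["B", "AE", "T"]))  # [20, 1, 21]
```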
When visemes are used as classes, the video classification improves by about 37.5% relative. However, the improvement in noisy conditions is about the same for visemic classes.
| Data Type | Dim | Splice Param. | Reco. Rate |
| Audio Only (Training Data) | 24 | 60 dim | 62.17% |
| Audio Only (Test Data) | 24 | 60 dim | 60.52% |
| Video Only (Training Data) | 100 | 35 dim | 28.14% |
| Video Only (Test Data) | 100 | 35 dim | 27.76% |
In our very preliminary experiments with HUB4 broadcast news data, we obtain the following results. Audio-only phonetic classification accuracy is 33.98%. Video-only phonetic classification accuracy using 35-dimensional LDA features is 9.48%.
These results are relatively poor compared to those for the VVAV data. First, the
resolution of the mouth region in the HUB4 data is much lower than in
the VVAV data, and so may provide very little discriminative
information between phones. Secondly, the tracking of the lip region is
a harder problem and hence may result in loss of crucial information for
discrimination. We are investigating techniques to better track and represent
the lower resolution images in the HUB4 data.
In addition to speech recognition, the same problems of channel and environment dependence arise in speaker identification. Again, the problem can be alleviated by combining visual signatures of the speaker both in terms of characteristics of visual speech and other facial features to perform speaker identification. Combined use of audio and visual information is beginning to show improvements in such problems as well. One such example is [13] in which computer vision-based face recognition techniques have been shown to benefit significantly when augmented with speech-based authentication methods. See [14] for some application contexts for combined use of speech and vision.