IEEE
International Conference on Multimedia & Expo
New York, New York
Electronic Proceedings
© 2000 IEEE
Translingual Visual Speech Synthesis
Tanveer A. Faruquie, Chalapathy Neti*, Nitendra Rajput, L.V. Subramaniam, Ashish Verma
IBM India Research Lab
Indian Institute of Technology
Hauz Khas, New Delhi 110016, India
91-11-6861100
{ftanveer,rnitendr,lvsubram,vashish}@in.ibm.com
http://www.research.ibm.com/irl
*IBM T.J. Watson Research Center
Yorktown Heights
NY 10598, USA
1-914-945-2921
Chalapathy_Neti@us.ibm.com
http://www.research.ibm.com/watson
Abstract
Audio driven facial animation is an interesting
and evolving technique for human-computer interaction. Based on an
incoming audio stream, a face image is animated with
full lip synchronization. This requires a speech recognition system
in the language in which audio is provided to get the
time alignment for the phonetic sequence of the audio signal. However,
building a speech recognition system is data intensive and is a very
tedious and time consuming task. We present a novel
scheme to implement a language independent
system for audio driven facial animation given a speech recognition
system for just one language, in our case, English.
The method presented here can also be used for text to audio-visual
speech synthesis.
Introduction
A natural and friendly interface is very important in
human-computer interaction. Speech recognition
and computer lip-reading have been developed
as means of inputting information
to the computer. It
is equally important to provide a natural
and friendly means for the computer to render information.
Interpersonal communications, human-computer interaction, telework,
teleducation, multimedia telephones, animation and various other
multimedia approaches in communication offer the motivation
to design realistic facial animators. Such systems
represent a means for simplifying, enhancing
and in many cases completely changing the
current paradigms of interpersonal and human-computer communication.
A talking face, with lip movements in synchronization with
the spoken words and sentences, greatly enhances communication. Many
methods have been presented to animate the face in sync with the
audio [1][2][3].
These methods rely on a viseme-based alignment
generated from the incoming audio, where visemes are the visually distinguishable
lip shapes. For this, a speech recognition system is used
to generate the phonetic alignment from the incoming
audio. Phonetic alignment refers to the time duration and the transition
times between phonemes in an audio sequence [4]. A
phoneme to viseme mapping then generates the visemic
alignment from the phonetic alignment.
Techniques exist for synthesizing speech given text as input
to the system. These text to speech synthesizers work by producing a phonetic
alignment of the text to be pronounced and then
by generating the smooth transitions in between adjacent phones to
get the desired sentence [6]. Using
a phoneme-viseme mapping and text-to-speech synthesis,
a text-to-video synthesizer can be built. In the
audio driven animation case, the phonetic alignment is generated
from the audio representing the spoken sentence. Thus facial animation
can be driven by text or audio, depending on the needs of the application.
Audio driven facial animation requires training of
a speech recognition system which is used for generating
alignments from the input speech. Once the phonetic alignment is
generated, the mapping and the animation have hardly any language dependency.
Translingual visual speech synthesis can be achieved if the first
step of alignment generation can
be made language independent. In this paper, we present
a method for translingual visual speech synthesis; that
is, given a speech recognition system for one language,
we describe a method of synthesizing video with speech
of any other language.
In Section 2 we describe a general visual speech synthesis
module. In Section 3 we describe the main idea of this paper, the
translingual visual speech synthesis system. We describe in
detail the method used to adapt
the speech recognition system of one
language to generate phonetic alignments
in a new language. We present the specific case
of adding Hindi words to an English speech
recognition system. In Section 4 we present the modifications required
to build a translingual visual speech synthesis system in block diagram
form. Finally the conclusions are presented in Section 5.
Visual Speech Synthesis
The visual speech synthesis module is shown in Figure 1. From
an incoming audio stream, timing information and phoneme transitions
are extracted. This constitutes the phonetic alignment. The phonemes
are mapped to the corresponding visemes. This in turn
gives the viseme transitions and timings called the visemic
alignment. Now from a viseme database the animator picks out the
frames containing the corresponding visemes given by the visemic
alignment and animates the frames to give smooth transitions
between visemes. This results in visual speech synthesis corresponding
to the incoming audio.
Figure 1. Visual speech synthesis system model
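The conversion from phonetic to visemic alignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the phoneme labels, the viseme names and the PHONEME_TO_VISEME table are hypothetical, and merging adjacent segments that share a viseme is an assumption made here for compactness.

```python
from typing import List, Tuple

# One alignment segment: (label, start_seconds, end_seconds).
Segment = Tuple[str, float, float]

# Hypothetical many-to-one phoneme-to-viseme table (illustrative only,
# not the table used in the paper).
PHONEME_TO_VISEME = {
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
    "SIL": "neutral",
}


def phonetic_to_visemic(alignment: List[Segment]) -> List[Segment]:
    """Map each phoneme segment to its viseme and merge adjacent repeats."""
    visemic: List[Segment] = []
    for phoneme, start, end in alignment:
        viseme = PHONEME_TO_VISEME[phoneme]
        if visemic and visemic[-1][0] == viseme:
            # Same lip shape as the previous segment: extend that segment.
            prev_label, prev_start, _ = visemic[-1]
            visemic[-1] = (prev_label, prev_start, end)
        else:
            visemic.append((viseme, start, end))
    return visemic


phonetic_alignment = [
    ("SIL", 0.00, 0.10),
    ("M", 0.10, 0.18),
    ("AA", 0.18, 0.30),
    ("P", 0.30, 0.36),
    ("B", 0.36, 0.42),
    ("UW", 0.42, 0.60),
]
visemic_alignment = phonetic_to_visemic(phonetic_alignment)
for segment in visemic_alignment:
    print(segment)
```

Because several phonemes share one viseme, the visemic alignment is never longer than the phonetic alignment; here "P" and "B" collapse into a single closed-lips segment.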
The speech recognition engine is a critical part of this
system. The building of such a system for continuous
speech requires extensive training over a large database
of typical sentences [5]. As the number of words that such
a system can recognize increases, the
system complexity increases greatly. Because of this effort
and complexity, training and building such a system is a one-time
affair. However, a facial animation system does not need to know the
exact word being spoken, or even the exact phoneme from
a word recognition point of view; it is sufficient to know
the viseme. Typically the viseme set used in visual speech
synthesis varies from 8 to 30 visemes. There are
about 45-55 phonemes in the English language. This results
in a many to one mapping from phonemes to visemes. In
the following sections we exploit these facts to build
a translingual visual speech synthesis system.
In the case of text based visual synthesis, we give text as input to
the system in Figure 1 and the speech recognition unit is replaced by the
text to speech synthesis unit. The rest of the system remains unchanged.
Translingual Visual Speech Synthesis
In order to understand the proposed translingual visual speech synthesis system,
we enumerate the crucial points in translingual visual speech
synthesis below:
-
From the given input audio and the transcribed truth, we generate the
phonetic alignment. This requires a speech recognition engine
which could understand the phonetic baseforms of the
text. This would work fine if the input
audio is in the same language as the language used for
training the recognition system.
-
If the language in which the video is to be synthesized is
a new language, then the phoneme set of the new language may
be different from that of the training language. But the alignment
generation system gives the alignments based
on the best phone boundaries using its own
set of phonemes (corresponding to the language
used in the training). Therefore, a mapping
is required to convert the phonemes from one language
to the phonemes of the other language so as to get an
effective alignment in the phoneme set of the new language.
-
A phoneme to viseme mapping can then be used
to get the corresponding visemic alignment which generates
the sequence of visemes and their time durations which
are to be animated to get the desired video.
-
Animating the sequence of viseme images to get the desired
video output aligned with the input audio signals can
now be done as described in Section 2.
In this paper, a new approach to synthesizing visual speech from a given
audio signal in any language, with the help of
a speech recognition system in another language,
is presented. From here onwards, we refer
to the language used in training the
speech recognition system as the base language and the language in
which the video is to be synthesized as the novel language.
In the illustrations, Hindi has been chosen as the novel
language and
English as the base language. If a word in the
novel language is presented to the alignment
generator, then the alignment generator will
not be able to generate the alignments for such a word as
the word is not in the phonetic vocabulary
of the training system. Moreover the phonetic spelling
of a word in the novel language may not be represented completely
by the phonetic set of the base language. We
present below a technique to overcome these problems
resulting in a language independent alignment generation system.
We build a system that will have the trained alignment generator
and the viseme images for the base language which can
be used to generate the animation for
audio input in any language.
Phonetic Vocabulary Adaptation Layer
The base language vocabulary does not include words from the novel language.
Hence when a word from the novel language is presented to the speech recognition
system that has been trained in the base language it will fail to give
the phonetic baseforms of the word. In order to generate alignments for
words in the novel language, first a phonetic vocabulary of this
language is created wherein words are represented in the phonetic
baseforms using the phoneme set of the novel language.
Since the recognition system is trained on the phoneme set
of the base language, the vocabulary needs to be
modified so that the words from the novel language
now represent the baseforms in the base language phoneme set.
Such a modification is made possible by the Phonetic Vocabulary Adaptation
Layer. This layer works by using a mapping from the phoneme set of one
language to the other language. For illustration, a mapping
from the Hindi phones to the English phones is as shown in Figure
2. There are three possible cases:
-
The word in the novel language can be represented
by the phonemes in the base language; for such words, the baseforms
can be simply written using the base language phoneme set
-
The word in the novel language cannot be represented by the
base language phoneme set; then the word is written using the
novel language phoneme set and the mapping as in Figure 2
is used to obtain the baseforms in the base language.
-
A phoneme in the base language never appears in the novel language; in
such a case that particular phoneme in the
base language is redundant and is left as it is.
Figure 2. Phoneme mapping from Hindi to English
Since the aim of mapping the phoneme set is to generate the best
phoneme boundaries through acoustic alignment, the mapping
is based on similar-sounding phonemes; that is, if no
base language phoneme exactly matches
a phoneme in the novel language, then
the acoustically closest base language phoneme is
chosen. The two phonemes may, however, map to different visemes.
This problem is addressed in the next subsection.
The phonetic vocabulary modification layer helps in generating
the base language alignments for the novel language audio. In Figure
2 an example of mapping the phones of Hindi language to the English
language phoneme set is presented. As is seen, not all the English phonemes
are used by the novel language. Also there exists an exact
mapping for a large number of phones. These are shown by a
*** sign on that row. A ** in the row implies that the mapping
is not exact but that it is the acoustically closest map. A * in
the mapping implies that the novel language phoneme has
been approximated by a string of more than
one phoneme from the English language for acoustic similarity.
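A minimal sketch of this vocabulary rewriting step is given below. The phone symbols, the two example words and every entry of NOVEL_TO_BASE are invented for illustration and are not the contents of Figure 2; only the three kinds of mapping (exact, acoustically closest, and approximation by a string of phones) follow the description above.

```python
from typing import Dict, List, Tuple

# novel-language phone -> (base-language phone string, kind of match).
# All entries are illustrative placeholders, NOT taken from Figure 2.
NOVEL_TO_BASE: Dict[str, Tuple[List[str], str]] = {
    "k": (["K"], "exact"),
    "m": (["M"], "exact"),
    "a": (["AA"], "exact"),
    "l": (["L"], "exact"),
    "T": (["T"], "closest"),        # assumed: retroflex stop -> nearest stop
    "kh": (["K", "HH"], "string"),  # assumed: aspirated stop -> two phones
}


def adapt_baseform(novel_baseform: List[str]) -> List[str]:
    """Rewrite one baseform from the novel phone set into the base phone set."""
    base: List[str] = []
    for phone in novel_baseform:
        if phone not in NOVEL_TO_BASE:
            raise KeyError(f"no base-language mapping for phone {phone!r}")
        base_phones, _kind = NOVEL_TO_BASE[phone]
        base.extend(base_phones)
    return base


def adapt_vocabulary(novel_vocab: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Apply the phone mapping to every word of the novel-language vocabulary."""
    return {word: adapt_baseform(form) for word, form in novel_vocab.items()}


# Hypothetical novel-language vocabulary written in the novel phone set.
novel_vocab = {
    "kamal": ["k", "a", "m", "a", "l"],
    "khaT": ["kh", "a", "T"],
}
adapted = adapt_vocabulary(novel_vocab)
print(adapted)
```

The adapted vocabulary can then be handed to the base-language alignment generator, which only ever sees phones from its own set.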
Next we show how to extract the base language visemic alignments for
animation in the novel language.
Generation of Visemic alignments
Since the system has to work for any novel language using
the trained alignment generator and the viseme set in
the base language, visemic alignment cannot be simply
generated from the phonetic alignment using
direct phoneme to viseme mapping as described in Section 2. As was
shown above, the phonetic vocabulary modification layer was built on the
mapping based on acoustically similar phonemes. However this mapping
may distort the visemic alignment as it does
not take into consideration the visemes corresponding
to each such phoneme. So an additional vocabulary, which
represents the words of the novel language in the phoneme set
of the base language, is created. It does not use the mapping
in Figure 2; instead it uses a mapping based on the visemic similarity of the two
phonemes. We call this mapping based on visemic similarity the visemic
vocabulary modification layer. Using this additional vocabulary, the base
language alignments and the base language phoneme-to-viseme
mapping, we get the visemic alignments. This
visemic alignment is used to generate the
animated video sequence. As can be seen in Figure 2, the phoneme mapping
between the two languages is not one-to-one. So a single phone in
the base language may represent more than one phone in the
novel language. This however creates no confusion as the Phonetic
Vocabulary Modification Layer outputs the alignment in the novel
language after taking into account the many-to-one mapping.
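One way to combine the two vocabularies is sketched below, under stated assumptions. It assumes that each novel-language phone corresponds to a known run of base-language phones in the acoustic baseform, so that the timings of that run can be transferred to the visemically chosen base phone. The tables ACOUSTIC_MAP, VISEMIC_MAP and BASE_PHONEME_TO_VISEME, and all phone and viseme names, are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]  # (label, start_seconds, end_seconds)

# Hypothetical acoustic mapping (used to obtain the alignment).
ACOUSTIC_MAP: Dict[str, List[str]] = {"bh": ["B", "HH"], "a": ["AA"], "m": ["M"]}
# Hypothetical visemic mapping (base phone with the most similar lip shape).
VISEMIC_MAP: Dict[str, str] = {"bh": "B", "a": "AA", "m": "M"}
# Hypothetical base-language phoneme-to-viseme table.
BASE_PHONEME_TO_VISEME: Dict[str, str] = {
    "B": "lips_closed", "M": "lips_closed", "AA": "open", "HH": "neutral",
}


def visemic_alignment(novel_baseform: List[str],
                      base_alignment: List[Segment]) -> List[Segment]:
    """Transfer base-language timings onto visemes chosen by visemic similarity."""
    result: List[Segment] = []
    pos = 0
    for phone in novel_baseform:
        expected = ACOUSTIC_MAP[phone]
        group = base_alignment[pos:pos + len(expected)]
        if [label for label, _, _ in group] != expected:
            raise ValueError(f"alignment does not match baseform at {phone!r}")
        # The whole run of acoustic phones is shown as one viseme.
        viseme = BASE_PHONEME_TO_VISEME[VISEMIC_MAP[phone]]
        result.append((viseme, group[0][1], group[-1][2]))
        pos += len(expected)
    return result


base_alignment = [
    ("B", 0.00, 0.05), ("HH", 0.05, 0.10), ("AA", 0.10, 0.25), ("M", 0.25, 0.35),
]
visemes = visemic_alignment(["bh", "a", "m"], base_alignment)
print(visemes)
```

In this toy example a direct phoneme-to-viseme lookup on the acoustic alignment would insert a "neutral" viseme for "HH", whereas the visemic vocabulary keeps the lips closed for the whole of "bh".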
Alternatively, if the viseme set is available for the novel language,
then the visemic vocabulary modification layer can be modified
to directly give the visemic alignment
using the phoneme-to-viseme mapping in the novel language.
Here the phonetic alignment generated in the base
language is converted to the novel language by using the corresponding
vocabulary entries in the two languages. Then the phoneme
to viseme mapping of the novel language is applied. Note
that the visemic alignment so generated is in the novel language
and this was desired as the visemes are available
in that language and not in the base language.
If the viseme set of the novel language is very different from
the viseme set of the base language then this modified system would be
especially useful.
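The alternative route, in which the base-language alignment is first converted back to the novel language, might look like the sketch below. The grouping rule (each novel phone owns a fixed run of base phones, as given by the vocabulary entries) is an assumption, and NOVEL_TO_BASE, NOVEL_PHONEME_TO_VISEME and all labels are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]  # (label, start_seconds, end_seconds)

# Hypothetical novel-to-base phone mapping used when building the vocabulary.
NOVEL_TO_BASE: Dict[str, List[str]] = {"bh": ["B", "HH"], "a": ["AA"], "m": ["M"]}
# Hypothetical phoneme-to-viseme table of the NOVEL language.
NOVEL_PHONEME_TO_VISEME: Dict[str, str] = {
    "bh": "novel_lips_closed", "a": "novel_open", "m": "novel_lips_closed",
}


def to_novel_alignment(novel_baseform: List[str],
                       base_alignment: List[Segment]) -> List[Segment]:
    """Collapse runs of base-language phones back into novel-language phones."""
    novel: List[Segment] = []
    pos = 0
    for phone in novel_baseform:
        expected = NOVEL_TO_BASE[phone]
        group = base_alignment[pos:pos + len(expected)]
        if [label for label, _, _ in group] != expected:
            raise ValueError(f"alignment does not match baseform at {phone!r}")
        novel.append((phone, group[0][1], group[-1][2]))
        pos += len(expected)
    return novel


base_alignment = [
    ("B", 0.00, 0.05), ("HH", 0.05, 0.10), ("AA", 0.10, 0.25), ("M", 0.25, 0.35),
]
novel_alignment = to_novel_alignment(["bh", "a", "m"], base_alignment)
# Apply the novel-language phoneme-to-viseme mapping to the regrouped alignment.
novel_visemes = [
    (NOVEL_PHONEME_TO_VISEME[phone], start, end)
    for phone, start, end in novel_alignment
]
print(novel_alignment)
print(novel_visemes)
```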
System Model
Figure 3 shows the block diagram of the modification layers described above
to achieve translingual visual speech synthesis. In the figure the subscripts
B and N refer to the base language and the novel language respectively.
The superscripts P and V refer to phonemes and visemes respectively. The
speech recognition system is modified to generate visemic alignments corresponding
to the novel language using the phonetic and visemic vocabulary modifiers.
In case the visemes for the novel language are available the visemic
vocabulary modifier is not required and a direct phoneme to viseme mapping
in the novel language may be used to give visemic alignments.
Figure 3. Block diagram showing the modification layers
The system uses the generated visemic alignment for the purpose of animation.
For animation, morphing is done from one viseme to another as given
by the visemic alignments. Because the phoneme mapping is
inexact, the alignment may not represent
the exact phone boundaries. This, however, is not noticeable in
the animated video, as the viseme is always in
transition at these boundaries. A smooth
and continuous video is thus generated which does not
reflect any inaccurate phoneme boundaries.
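The following sketch shows one possible way to turn a visemic alignment into a per-frame morphing schedule. It is an assumption-laden illustration rather than the morphing method used by the authors: each viseme is assumed to be fully formed at the midpoint of its segment, the blend between neighbouring visemes is linear, and the function name blend_schedule, the frame rate and the viseme labels are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]       # (viseme, start_seconds, end_seconds)
Frame = Tuple[float, str, str, float]    # (time, from_viseme, to_viseme, weight)


def blend_schedule(alignment: List[Segment], fps: float) -> List[Frame]:
    """For each video frame, give the two visemes to morph and the blend weight.

    weight 0.0 means the frame shows from_viseme only; weight 1.0 would show
    to_viseme only. Keyframes sit at segment midpoints (an assumption).
    """
    if not alignment:
        return []
    keys = [(label, (start + end) / 2.0) for label, start, end in alignment]
    total = alignment[-1][2]
    n_frames = int(round(total * fps)) + 1
    frames: List[Frame] = []
    k = 0
    for f in range(n_frames):
        t = f / fps
        # Advance to the last keyframe that is not after time t.
        while k + 1 < len(keys) and t >= keys[k + 1][1]:
            k += 1
        label_a, t_a = keys[k]
        if t <= t_a or k + 1 == len(keys):
            # Before the first keyframe, exactly on one, or after the last.
            frames.append((t, label_a, label_a, 0.0))
        else:
            label_b, t_b = keys[k + 1]
            frames.append((t, label_a, label_b, (t - t_a) / (t_b - t_a)))
    return frames


frames = blend_schedule([("closed", 0.0, 1.0), ("open", 1.0, 2.0)], fps=4.0)
for frame in frames:
    print(frame)
```

With keyframes at segment midpoints, the segment boundary at 1.0 s falls in the middle of the transition from "closed" to "open", which is consistent with the observation that small errors in the phone boundaries are not visible in the animation.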
Conclusions
A system for translingual visual speech synthesis using a speech recognition
unit for any one language is presented. The advantage of using the approach
presented in this paper is that one doesn't need to build a speech recognition
engine for the same language in which the visual speech
is to be synthesized. Given a
speech recognition system for a language, one can easily
and quickly customize the phonetic and visemic alignment generation
layers to get a synthesized video in any other language. Moreover,
the viseme images can be those of the base language, in
which the alignment generation system is built, thus obviating the
need for generating new viseme images for each language. The
system also works if the novel language has visemes that
are totally different from the visemes of the base language.
Similarly, for text to audiovisual speech synthesis one doesn't need a
text to speech synthesizer in the same language in which the
visual speech synthesis has to be performed.
An evaluation of the quality of the generated video was performed using
the English visemes of one subject, with Hindi as the novel
language. The generated video was demonstrated to an audience, who judged
the quality and the realism of the animation to be impressive. Telugu (a
regional language of India) was also tried as the novel language, and the
viewer response was encouraging.
Bibliography
[1]
Parke, F. I., Waters, K.,
Computer facial animation, Wellesley
MA: A K Peters, 1996.
[2]
Lavagetto F., Arzarello, Caranzano M., "Lipreadable
frame animation driven by speech
parameters," Proceedings International Symposium on Speech, Image
Processing and Neural Networks, 1994.
[3]
Ezzat T., Poggio T., "MikeTalk: A talking facial display based on morphing
visemes," Proceedings of IEEE Computer Animation, Philadelphia
PA, USA, 8-10 June, 1998, pp. 96-102.
[4]
Bahl, L. R., Brown, P. F., de Souza P. V., Mercer, R. L., "Speech
recognition with continuous parameter hidden Markov models," Proceedings
ICASSP-88, New York, May 1988, pp. 40-43.
[5]
Bahl, L. R., et al., "Large vocabulary natural language continuous
speech recognition," Proceedings ICASSP-89, Glasgow, Scotland, May
1989, pp. 465-467.
[6]
Donovan R. E., Eide E. M., "The IBM Trainable Speech Synthesis
System," Proceedings International Conference on Speech
and Language Processing, 1998.