IEEE International Conference on Multimedia & Expo
New York, New York
Electronic Proceedings
© 2000 IEEE


Translingual Visual Speech Synthesis

Tanveer A. Faruquie, Chalapathy Neti*, Nitendra Rajput, L.V. Subramaniam, Ashish Verma
IBM India Research Lab
Indian Institute of Technology
Hauz Khas, New Delhi 110016, India
91-11-6861100
{ftanveer,rnitendr,lvsubram,vashish}@in.ibm.com
http://www.research.ibm.com/irl 

*IBM T.J. Watson Research Center
Yorktown Heights
NY 10598, USA
1-914-945-2921
Chalapathy_Neti@us.ibm.com
http://www.research.ibm.com/watson

Abstract

Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language of the audio to obtain the time alignment of the phonetic sequence of the audio signal. However, building a speech recognition system is data intensive, tedious, and time consuming. We present a novel scheme to implement a language-independent system for audio-driven facial animation given a speech recognition system for just one language, in our case English. The method presented here can also be used for text to audio-visual speech synthesis.



Introduction

A natural and friendly interface is very important in human-computer interaction. Speech recognition and computer lip-reading have been developed as means of inputting information to the computer. It is equally important to provide a natural and friendly means for the computer to render information. Interpersonal communication, human-computer interaction, telework, tele-education, multimedia telephones, animation, and various other multimedia approaches to communication offer the motivation to design realistic facial animators. Such systems represent a means of simplifying, enhancing, and in many cases completely changing the current paradigms of interpersonal and human-computer communication.

A talking face, with lip movements synchronized with the spoken words and sentences, greatly enhances communication. Many methods have been presented to animate the face in sync with the audio [1][2][3]. These methods rely on a viseme-based alignment generated from the incoming audio, where visemes are the different, distinguishable lip shapes. For this, a speech recognition system is used to generate the phonetic alignment from the incoming audio. Phonetic alignment refers to the durations of, and the transition times between, the phonemes in an audio sequence [4]. A phoneme-to-viseme mapping then generates the visemic alignment from the phonetic alignment.

Techniques exist for synthesizing speech given text as input to the system. These text-to-speech synthesizers work by producing a phonetic alignment of the text to be pronounced and then generating smooth transitions between adjacent phones to produce the desired sentence [6]. Using a phoneme-to-viseme mapping and text-to-speech synthesis, a text-to-video synthesizer can be built. In the audio-driven animation case, the phonetic alignment is generated from the audio representing the spoken sentence. Thus facial animation can be driven by text or audio, depending on the needs of the application.

Audio-driven facial animation requires training a speech recognition system, which is used to generate alignments from the input speech. Once the phonetic alignment is generated, the mapping and the animation have hardly any language dependency. Translingual visual speech synthesis can be achieved if the first step, alignment generation, can be made language independent. In this paper, we present a method for translingual visual speech synthesis: given a speech recognition system for one language, we describe a method of synthesizing video with speech of any other language.

In Section 2 we describe a general visual speech synthesis module. In Section 3 we describe the main idea of this paper, the translingual visual speech synthesis system: we describe in detail the method used to adapt the speech recognition system of one language to generate phonetic alignments in a new language, and present the specific case of adding Hindi words to an English speech recognition system. In Section 4 we present, in block diagram form, the modifications required to build a translingual visual speech synthesis system. Finally, conclusions are presented in Section 5.


Visual Speech Synthesis

The visual speech synthesis module is shown in Figure 1. Timing information and phoneme transitions are extracted from an incoming audio stream; together these constitute the phonetic alignment. The phonemes are mapped to the corresponding visemes, which gives the viseme transitions and timings, called the visemic alignment. The animator then picks from a viseme database the frames containing the visemes given by the visemic alignment and animates them to give smooth transitions between visemes. The result is visual speech synthesis corresponding to the incoming audio.

Figure 1. Visual speech synthesis system model

The speech recognition engine is a critical part of this system. Building such a system for continuous speech requires extensive training over a large database of typical sentences [5], and system complexity increases greatly with the number of words the system can recognize. Because of the effort and complexity involved, training and building such a system is a one-time affair. However, a facial animation system does not need to know the exact word being spoken, or even the exact phoneme from a word recognition point of view; it is sufficient to know the viseme. The viseme set used in visual speech synthesis typically contains 8 to 30 visemes, whereas there are about 45-55 phonemes in the English language. This results in a many-to-one mapping from phonemes to visemes. In the following sections we exploit these facts to build a translingual visual speech synthesis system.
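The step from phonetic alignment to visemic alignment can be illustrated with a minimal sketch. The phoneme and viseme labels, the table contents, and the function name below are assumptions made for illustration; they are not the viseme set or mapping used in this paper. Because the mapping is many-to-one, adjacent phonemes that share a viseme collapse into a single viseme segment.

```python
from typing import Dict, List, Tuple

# (label, start time in seconds, end time in seconds)
Segment = Tuple[str, float, float]

# Hypothetical, heavily abbreviated phoneme-to-viseme table (illustrative only).
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
    "SIL": "neutral",
}


def visemic_alignment(phonetic: List[Segment]) -> List[Segment]:
    """Map each phoneme segment to its viseme and merge adjacent repeats."""
    out: List[Segment] = []
    for phone, start, end in phonetic:
        # Unknown phonemes fall back to a neutral lip shape (an assumption).
        viseme = PHONEME_TO_VISEME.get(phone, "neutral")
        if out and out[-1][0] == viseme:
            # Same lip shape as the previous segment: extend it.
            out[-1] = (viseme, out[-1][1], end)
        else:
            out.append((viseme, start, end))
    return out


example = [("SIL", 0.0, 0.1), ("M", 0.1, 0.2), ("B", 0.2, 0.3), ("AA", 0.3, 0.5)]
result = visemic_alignment(example)
```

In this example the two bilabial phonemes M and B produce one viseme segment spanning 0.1 s to 0.3 s, which is why the recognizer does not need to distinguish them for animation purposes.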

In the case of text based visual synthesis, we give text as input to the system in Figure 1 and the speech recognition unit is replaced by the text to speech synthesis unit. The rest of the system remains unchanged.


Translingual Visual Speech Synthesis

To understand the proposed translingual visual speech synthesis system, we first outline the crucial points. In this paper, a new approach to synthesizing visual speech from a given audio signal in any language, with the help of a speech recognition system in another language, is presented. From here onwards, we refer to the language used in training the speech recognition system as the base language and the language in which the video is to be synthesized as the novel language. In the illustrations, Hindi has been chosen as the novel language and English as the base language.

If a word in the novel language is presented to the alignment generator, the generator will not be able to produce an alignment for it, as the word is not in the phonetic vocabulary of the trained system. Moreover, the phonetic spelling of a word in the novel language may not be represented completely by the phonetic set of the base language. We present below a technique to overcome these problems, resulting in a language-independent alignment generation system. We build a system that has the trained alignment generator and the viseme images for the base language, and which can be used to generate the animation for audio input in any language.
 

Phonetic Vocabulary Adaptation Layer

The base language vocabulary does not include words from the novel language. Hence, when a word from the novel language is presented to a speech recognition system that has been trained in the base language, the system will fail to give the phonetic baseforms of the word. In order to generate alignments for words in the novel language, a phonetic vocabulary of this language is first created, wherein words are represented by phonetic baseforms using the phoneme set of the novel language. Since the recognition system is trained on the phoneme set of the base language, the vocabulary needs to be modified so that words from the novel language are represented by baseforms in the base language phoneme set. Such a modification is made possible by the Phonetic Vocabulary Adaptation Layer (also referred to below as the phonetic vocabulary modification layer). This layer works by using a mapping from the phoneme set of one language to that of the other. For illustration, a mapping from the Hindi phones to the English phones is shown in Figure 2. There are three possible cases, which are marked in the figure and explained below it.
 


Figure 2. Phoneme mapping from English to Hindi

Since the aim of mapping the phoneme set is to generate the best phoneme boundaries through acoustic alignment, the mapping is based on similar-sounding phonemes: if there is no exactly similar phoneme in the base language that can be associated with a phoneme in the novel language, the acoustically most similar base language phoneme is chosen. The two phonemes may, however, map to different visemes. This problem is addressed in the next subsection.

The phonetic vocabulary modification layer helps in generating the base language alignments for the novel language audio. Figure 2 presents an example of mapping the phones of Hindi to the English phoneme set. As can be seen, not all the English phonemes are used by the novel language. An exact mapping exists for a large number of phones; these are shown by a *** sign on that row. A ** in the row implies that the mapping is not exact but is the acoustically closest one. A * in the mapping implies that the novel language phoneme has been approximated, for acoustic similarity, by a string of more than one phoneme from the English language.
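The vocabulary rewriting performed by this layer can be sketched as follows. The phone symbols and map entries are invented for illustration and are not taken from Figure 2; the function names are likewise hypothetical. Each novel language phone maps to a list of base language phones, so all three cases (exact, closest, and multi-phone string) are handled uniformly.

```python
from typing import Dict, List

# Hypothetical novel-to-base phone map (illustrative entries, not Figure 2).
NOVEL_TO_BASE: Dict[str, List[str]] = {
    "k": ["K"],            # exact match (the *** case)
    "aa": ["AA"],          # exact match
    "n": ["N"],            # exact match
    "t_retro": ["T"],      # acoustically closest phone (the ** case)
    "kh": ["K", "HH"],     # string of more than one base phone (the * case)
}


def adapt_baseform(novel_phones: List[str]) -> List[str]:
    """Rewrite one baseform from the novel phone set into the base phone set."""
    base: List[str] = []
    for phone in novel_phones:
        if phone not in NOVEL_TO_BASE:
            raise KeyError(f"no base language mapping for phone {phone!r}")
        base.extend(NOVEL_TO_BASE[phone])
    return base


def adapt_vocabulary(vocab: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Apply the phone map to every word in a novel language vocabulary."""
    return {word: adapt_baseform(phones) for word, phones in vocab.items()}


novel_vocab = {"khaanaa": ["kh", "aa", "n", "aa"]}
base_vocab = adapt_vocabulary(novel_vocab)
```

The adapted vocabulary can then be handed to the base language alignment generator, which sees only phones it was trained on.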

Next we show how to extract the base language visemic alignments for animation in the novel language.


Generation of Visemic alignments

Since the system has to work for any novel language using the trained alignment generator and the viseme set of the base language, the visemic alignment cannot simply be generated from the phonetic alignment using a direct phoneme-to-viseme mapping as described in Section 2. As shown above, the phonetic vocabulary modification layer is built on a mapping between acoustically similar phonemes. This mapping may distort the visemic alignment, as it does not take into consideration the visemes corresponding to each such phoneme. So an additional vocabulary is created which also represents the words of the novel language in the phoneme set of the base language, but instead of the mapping in Figure 2 it uses a mapping based on the visemic similarity of the two phonemes. We call this mapping based on visemic similarity the visemic vocabulary modification layer. Using this additional vocabulary, the base language alignments, and the base language phoneme-to-viseme mapping, we get the visemic alignments, which are used to generate the animated video sequence. As can be seen in Figure 2, the phoneme mapping between the two languages is not one-to-one, so a single phone in the base language may represent more than one phone in the novel language. This creates no confusion, however, as the phonetic vocabulary modification layer outputs the alignment in the novel language after taking the many-to-one mapping into account.
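One way to combine the two vocabularies is sketched below, under simplifying assumptions: the mappings are applied per phone rather than per vocabulary word, each novel phone has exactly one visually closest base phoneme, and all table entries and names are invented for illustration. Timings come from the alignment of the acoustically mapped baseform, while the lip shape comes from the visemically mapped phoneme.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]  # (label, start seconds, end seconds)

# Hypothetical tables; none of these entries come from the paper.
ACOUSTIC_MAP: Dict[str, List[str]] = {   # novel phone -> acoustically similar base phones
    "m": ["M"], "aa": ["AA"], "v": ["W"], "kh": ["K", "HH"],
}
VISEMIC_MAP: Dict[str, str] = {          # novel phone -> visually similar base phoneme
    "m": "M", "aa": "AA", "v": "V", "kh": "K",
}
BASE_PHONEME_TO_VISEME: Dict[str, str] = {
    "M": "closed", "AA": "open", "W": "rounded", "V": "lip-teeth", "K": "back",
}


def translingual_visemes(novel_phones: List[str],
                         base_alignment: List[Segment]) -> List[Segment]:
    """Attach base language visemes to timings from the acoustic alignment."""
    expected = sum(len(ACOUSTIC_MAP[p]) for p in novel_phones)
    if expected != len(base_alignment):
        raise ValueError("alignment does not match the acoustic baseform")
    out: List[Segment] = []
    i = 0
    for phone in novel_phones:
        n = len(ACOUSTIC_MAP[phone])
        span = base_alignment[i:i + n]   # base phones aligned for this novel phone
        i += n
        viseme = BASE_PHONEME_TO_VISEME[VISEMIC_MAP[phone]]
        out.append((viseme, span[0][1], span[-1][2]))
    return out


aligned = [("M", 0.0, 0.1), ("AA", 0.1, 0.3), ("W", 0.3, 0.4)]
visemes = translingual_visemes(["m", "aa", "v"], aligned)
```

In this invented example the novel phone "v" is aligned acoustically as W, yet it is rendered with the lip shape of V, which is the distinction between the two modification layers.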

Alternatively, if the viseme set is available for the novel language, the visemic vocabulary modification layer can be modified to give the visemic alignment directly using the phoneme-to-viseme mapping of the novel language. Here the phonetic alignment generated in the base language is converted to the novel language by using the corresponding vocabulary entries in the two languages. Then the phoneme-to-viseme mapping of the novel language is applied. Note that the visemic alignment so generated is in the novel language, as desired, since the visemes are available in that language and not in the base language. This modified system is especially useful if the viseme set of the novel language is very different from that of the base language.


System Model

Figure 3 shows the block diagram of the modification layers described above to achieve translingual visual speech synthesis. In the figure, the subscripts B and N refer to the base language and the novel language respectively. The superscripts P and V refer to phonemes and visemes respectively. The speech recognition system is modified to generate visemic alignments corresponding to the novel language using the phonetic and visemic vocabulary modifiers. If the visemes for the novel language are available, the visemic vocabulary modifier is not required, and a direct phoneme-to-viseme mapping in the novel language may be used to give the visemic alignments.


Figure 3. Block diagram showing the modification layers

The system uses the generated visemic alignment for animation. Morphing is done from one viseme to the next as given by the visemic alignment. Because the phoneme mapping is inexact, the alignment may not represent the exact phone boundaries. This is not noticeable in the animated video, however, as the viseme is always in transition at these boundaries. A smooth and continuous video is thus generated which does not reflect any inaccurate phoneme boundaries.
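The reason boundary errors are masked can be illustrated with a small sketch. It assumes, purely for illustration, that each viseme is fully formed at the midpoint of its segment and that the morph between neighbouring visemes is linear in time; the paper does not specify the morphing schedule, and the function name is hypothetical. Under this assumption, every segment boundary falls in the middle of a transition.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (viseme, start seconds, end seconds)


def morph_weights(visemes: List[Segment], t: float) -> Tuple[str, str, float]:
    """Return (from_viseme, to_viseme, weight) at time t, with weight in [0, 1].

    Assumes segments are in time order with positive durations.
    """
    if not visemes:
        raise ValueError("empty visemic alignment")
    # Assumed keyframe times: the midpoint of each viseme segment.
    mids = [(start + end) / 2.0 for _, start, end in visemes]
    if t <= mids[0]:
        return visemes[0][0], visemes[0][0], 0.0
    for i in range(len(visemes) - 1):
        if t <= mids[i + 1]:
            weight = (t - mids[i]) / (mids[i + 1] - mids[i])
            return visemes[i][0], visemes[i + 1][0], weight
    return visemes[-1][0], visemes[-1][0], 0.0


track = [("closed", 0.0, 0.5), ("open", 0.5, 1.5)]
at_boundary = morph_weights(track, 0.5)
halfway = morph_weights(track, 0.625)
```

At the segment boundary (t = 0.5) the returned weight lies strictly between 0 and 1, so a small error in the boundary position only shifts the blend slightly instead of producing a visible jump.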


Conclusions

A system for translingual visual speech synthesis using a speech recognition unit for any one language is presented. The advantage of this approach is that one does not need to build a speech recognition engine for the language in which the visual speech is to be synthesized. Given a speech recognition system for one language, one can easily and quickly customize the phonetic and visemic alignment generation layers to get a synthesized video in any other language. Moreover, the viseme images can be those of the language in which the alignment generation system is built, obviating the need to generate new viseme images for each language. The system also works if the novel language has visemes that are totally different from those of the base language. Similarly, for text to audio-visual speech synthesis, one does not need a text-to-speech synthesizer in the language in which the visual speech synthesis is to be performed.

The quality of the generated video was evaluated using the English visemes of one subject, with Hindi as the novel language. The generated video was demonstrated to an audience, who judged the quality and the realism of the animation to be impressive. Telugu (a regional language of India) was also tried as the novel language, and the viewer response was encouraging.


Bibliography

[1]
Parke, F. I., Waters, K., Computer facial animation, Wellesley MA: A K Peters, 1996.
[2]
Lavagetto F., Arzarello, Caranzano M., "Lipreadable frame animation driven by speech parameters," Proceedings International Symposium on Speech, Image Processing and Neural Networks, 1994.
[3]
Ezzat T., Poggio T., "Miketalk: A talking facial display based on morphing visemes," Proceedings of IEEE Computer Animation, Philadelphia PA, USA, 8-10 June 1998, pp. 96-102.
[4]
Bahl, L. R., Brown, P. F., de Souza P. V., Mercer, R. L., "Speech recognition with continuous parameter hidden Markov models," Proceedings ICASSP-88, New York, May 1988, pp. 40-43.
[5]
Bahl, L. R., et al., "Large vocabulary natural language continuous speech recognition," Proceedings ICASSP-89, Glasgow, Scotland, May 1989, pp. 465-467.
[6]
Donovan R. E., Eide E. M., "The IBM Trainable Speech Synthesis System," Proceedings International Conference on Speech and Language Processing, 1998.