Morphing based animation has been considered in the past [5]. In this paper we seek to
extend this to include animation with expression. For a richer scope of animation it is
necessary to be able to animate the face with appropriate expressions. In [3] it is assumed
that there exists a video database of the head to be synthesized, in which the subject
appears in the expression to be synthesized at least once. In our system, given visemes
with two or more different facial expressions, a method is presented that can generate
the remaining visemes with these facial expressions. The seven basic expressions
considered are neutral, surprise, fear, disgust, anger, happiness and sadness.
In Section 2 we present the audio-driven facial animation system model. In Section 3 the
method of generating the animation is discussed. In Section 4 we discuss some applications
and present an evaluation of the system in the context of creating audio-visual reality.
Finally, conclusions are presented in Section 5.
Figure 1. Extraction Module
Figure 1 shows the extraction module. For an incoming stream of synchronized audio and video, we first recognize the phoneme, map it to its corresponding viseme, and take the corresponding video frame to represent this viseme. The expression recognition unit can be either audio based [2][9] or video based [4]. A short sentence like "The sharp quick brown fox jumped over the lazy dog." captures all 12 visemes.
Figure 2. Background Processing Module
In the background processing module, shown in Figure 2, the extracted images are first corrected for small pose differences. It is possible that not every viseme has been extracted in every expression, so this module generates the complete set of viseme+expression combinations (e_n, v_m), where n = 1,...,7 and m = 1,...,12. Finally, optical flows between different visemes within an expression and between the expressions are computed and stored.
Figure 3. Synthesis Module
Figure 3 shows the synthesis module. From an incoming audio stream, timing information, phoneme transitions and expressions are extracted. The phonemes are then mapped to the corresponding visemes. This mapping is shown in Table 1. The timing information and phoneme transitions can also be extracted for a novel language for which no speech recognition engine is available [6]. The audio-based expression recognition unit supplies the expression; in our case, however, the expression maps have been provided explicitly. Together, the viseme+expression combination determines the frame to be used from the database, the timing information tells how long this viseme+expression lasts, and the phoneme transitions in turn give the viseme transitions. These viseme transitions are brought about using precomputed optical flows.
| Phoneme | Viseme Number |
|---|---|
| a, h | Viseme 1 |
| e, i | Viseme 2 |
| l | Viseme 3 |
| r | Viseme 4 |
| o, u, w | Viseme 5 |
| p, b, m | Viseme 6 |
| g, k, d, n, t, y | Viseme 7 |
| f, v | Viseme 8 |
| h, j, s, z | Viseme 9 |
| sh, ch | Viseme 10 |
| th | Viseme 11 |
| silence | Viseme 12 |
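The lookup in Table 1 can be sketched as a small dictionary. This is only an illustration: the names `VISEME_TABLE`, `build_phoneme_to_viseme` and `phonemes_to_visemes` are hypothetical, and because the table as printed lists "h" under both viseme 1 and viseme 9, the sketch assumes the first listing takes precedence.

```python
# Table 1 as printed in the text, keyed by viseme number.
VISEME_TABLE = {
    1: ["a", "h"],
    2: ["e", "i"],
    3: ["l"],
    4: ["r"],
    5: ["o", "u", "w"],
    6: ["p", "b", "m"],
    7: ["g", "k", "d", "n", "t", "y"],
    8: ["f", "v"],
    9: ["h", "j", "s", "z"],
    10: ["sh", "ch"],
    11: ["th"],
    12: ["silence"],
}


def build_phoneme_to_viseme(table):
    """Invert the table into a phoneme -> viseme number lookup."""
    mapping = {}
    for viseme in sorted(table):
        for phoneme in table[viseme]:
            # Assumption: when a phoneme is listed twice ("h"), keep the
            # first listing. The source does not say which one is intended.
            mapping.setdefault(phoneme, viseme)
    return mapping


PHONEME_TO_VISEME = build_phoneme_to_viseme(VISEME_TABLE)


def phonemes_to_visemes(phonemes):
    """Map a recognized phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]
```

For example, `phonemes_to_visemes(["sh", "a", "p"])` yields the viseme numbers for "sh", "a" and "p" in order.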
$$u(x, y) = a_0 + a_1 x + a_2 y + p_0 x^2 + p_1 x y \tag{1}$$
$$v(x, y) = a_3 + a_4 x + a_5 y + p_0 x y + p_1 y^2 \tag{2}$$
Since non-rigid motions of facial features are not captured well by this model, we can use it to extract the 3D rigid body component of motion and to align the images. To estimate the parameters we use the approach suggested by Tsai and Huang [10], with modifications. Tsai and Huang's method is based on a perspective displacement field model, which differs from the model used here. The method is essentially a least-squares fit over the image gradients, and we use Singular Value Decomposition to calculate the above parameters.
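A minimal sketch of such a fit is given below. It is not the modified Tsai and Huang procedure itself: it assumes the standard brightness-constancy constraint Ix·u + Iy·v + It = 0 at each pixel, substitutes equations (1) and (2), and solves the resulting linear system with NumPy's SVD-based `numpy.linalg.lstsq`. The function names `planar_flow` and `fit_planar_flow` are hypothetical.

```python
import numpy as np


def planar_flow(params, x, y):
    """Evaluate equations (1) and (2) at coordinates x, y."""
    a0, a1, a2, a3, a4, a5, p0, p1 = params
    u = a0 + a1 * x + a2 * y + p0 * x**2 + p1 * x * y
    v = a3 + a4 * x + a5 * y + p0 * x * y + p1 * y**2
    return u, v


def fit_planar_flow(Ix, Iy, It, x, y):
    """Least-squares estimate of (a0..a5, p0, p1) from image gradients.

    Assumes Ix*u + Iy*v = -It holds at every sample (brightness constancy).
    """
    Ix, Iy, It, x, y = (
        np.asarray(a, dtype=float).ravel() for a in (Ix, Iy, It, x, y)
    )
    # One row per pixel; columns follow the parameter order a0..a5, p0, p1.
    A = np.stack(
        [
            Ix, Ix * x, Ix * y,
            Iy, Iy * x, Iy * y,
            Ix * x**2 + Iy * x * y,
            Ix * x * y + Iy * y**2,
        ],
        axis=1,
    )
    # lstsq solves the overdetermined system through an SVD-based routine.
    params, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return params
```

In practice the coordinates would usually be normalized (for example to [-1, 1]) before fitting, to keep the quadratic columns well conditioned.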
Given facial images I1 and I2, we first estimate the 3D rigid body motion component from I2 to I1. Next, we warp image I2 using this model so that it is aligned with I1 while retaining the viseme shape and expression of I2. Some images may show slight facial deformation because a planar model of the face under perspective projection is assumed. Given a set of images, we can align them with respect to a single image and repeat the whole process iteratively.
We restrict the extent of the morph depending upon the viseme and the duration of the viseme transition. Figure 4 shows the rules used by our system. Consider a viseme transition from va to vb in duration Tc. If Tc < Th, where Th is a heuristically set threshold, we generate the morph only until t = Tc/Th. There is one exception. Consider the following transition from viseme vb to vc in duration Tn. If Tn > Th, then viseme vb needs to be emphasized and hence the morph to vb should be complete. In this case we extend the duration of transition va - vb and reduce the duration of transition vb - vc by Q, where Q = Min(Th - Tc, Tn - Th). If the transition vb - vc is long enough, viseme vb is then morphed completely from va. Further, the visemes that represent p, b, m and v, f have to be morphed completely because they involve lip closure or near closure. So if a transition occurs to any of these visemes, the morph is completed irrespective of the duration.
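The timing rules above can be sketched as a small function. This is one reading of the rules, not the system's code: the name `plan_transition` is hypothetical, durations are in milliseconds, the closure visemes are taken to be numbers 6 and 8 from Table 1, and the morph extent after borrowing Q is assumed to be the adjusted duration divided by Th, capped at 1.

```python
# Visemes for p/b/m and f/v in Table 1 involve lip closure or near closure.
CLOSURE_VISEMES = {6, 8}


def plan_transition(tc, tn, th=100.0, target_viseme=None):
    """Return (tc_adjusted, tn_adjusted, extent) for a transition va -> vb.

    tc: duration of the current transition va -> vb (ms)
    tn: duration of the next transition vb -> vc (ms)
    th: heuristic threshold Th (ms)
    extent: fraction of the morph to vb that is generated, in [0, 1]
    """
    # Closure visemes are always morphed completely.
    if target_viseme in CLOSURE_VISEMES:
        return tc, tn, 1.0
    # Long enough transitions are morphed completely as well.
    if tc >= th:
        return tc, tn, 1.0
    # If the next transition has time to spare, borrow Q from it.
    if tn > th:
        q = min(th - tc, tn - th)
        tc, tn = tc + q, tn - q
    return tc, tn, min(1.0, tc / th)
```

With Th = 100 ms, a 60 ms transition followed by a 200 ms one borrows Q = 40 ms and is completed, whereas the same transition followed by an 80 ms one stops at 60% of the morph.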
Suppose vb was not completely morphed. Then, to generate the morph to viseme vc, we cannot use the optical flow between vb and vc computed from the images in our database. We need the optical flow between the generated (and incomplete) viseme vb and vc. Since optical flow computation is too costly to perform in real time, we use the transitivity between the optical flows va - vb and vb - vc to calculate an approximate optical flow, which is used to generate the morph. Our system uses a threshold Th = 100 ms at 30 fps.
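The text does not spell out how the two flows are combined, so the following is only one plausible sketch of the transitivity idea. It assumes dense flow fields stored as arrays of shape (H, W, 2) holding (dx, dy), a morph that stopped at fraction t of the way from va to vb, and nearest-pixel lookup; the name `remaining_flow` is hypothetical. A pixel p of va sits at p + t·F_ab(p) in the partial morph and ends at p + F_ab(p) + F_bc(p + F_ab(p)) in vc, so its remaining displacement is (1 - t)·F_ab(p) + F_bc(p + F_ab(p)).

```python
import numpy as np


def remaining_flow(flow_ab, flow_bc, t):
    """Approximate displacement from a partial morph of vb to vc.

    flow_ab, flow_bc: arrays of shape (H, W, 2) with (dx, dy) per pixel.
    t: fraction of the va -> vb morph that was actually generated.
    The result is indexed by pixel positions in va.
    """
    h, w, _ = flow_ab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each va pixel lands in vb (nearest pixel, clipped to the image).
    xb = np.clip(np.rint(xs + flow_ab[..., 0]).astype(int), 0, w - 1)
    yb = np.clip(np.rint(ys + flow_ab[..., 1]).astype(int), 0, h - 1)
    # Unfinished part of va -> vb, plus vb -> vc sampled at the landing point.
    return (1.0 - t) * flow_ab + flow_bc[yb, xb]
```

Only array lookups and additions are involved, so this is far cheaper than recomputing optical flow at synthesis time.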
Figure 4. Audio Synchronization
We accomplish this as follows (see Figure 5 below).
Figure 5. New Viseme-Expression Pair Generation
Find the correspondence of pixels in (e1 , v1) going to (e1 , v2), call it flow1 and from (e1 , v1) to (e2,v1), call it flow2. Now put the velocity of every pixel in (e1 , v1) given by flow1 on the corresponding pixel of (e2 , v1) (found according to flow2). Call the optical flow of (e2 , v1) thus obtained as flownew. Generate (e2 , v2) from (e2 , v1) using flownew.
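The transfer of flow1 onto (e2, v1) can be sketched as a forward mapping. This is an illustration under stated assumptions: flows are dense arrays of shape (H, W, 2) holding (dx, dy), correspondences are rounded to the nearest pixel, and the name `transfer_flow` is hypothetical. Pixels of (e2, v1) that receive no velocity are reported in a mask so that a later step can fill them.

```python
import numpy as np


def transfer_flow(flow1, flow2):
    """Carry flow1, defined on (e1, v1), onto the pixels of (e2, v1).

    flow1: (e1, v1) -> (e1, v2), shape (H, W, 2)
    flow2: (e1, v1) -> (e2, v1), shape (H, W, 2)
    Returns (flow_new, filled): flow_new is defined on (e2, v1) and filled
    marks the pixels that received a velocity.
    """
    h, w, _ = flow1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Corresponding pixel in (e2, v1) for every pixel of (e1, v1).
    xq = np.rint(xs + flow2[..., 0]).astype(int)
    yq = np.rint(ys + flow2[..., 1]).astype(int)
    inside = (xq >= 0) & (xq < w) & (yq >= 0) & (yq < h)

    flow_new = np.zeros_like(flow1, dtype=float)
    filled = np.zeros((h, w), dtype=bool)
    # Put the velocity of each (e1, v1) pixel on its corresponding pixel.
    flow_new[yq[inside], xq[inside]] = flow1[inside]
    filled[yq[inside], xq[inside]] = True
    return flow_new, filled
```

Warping (e2, v1) with `flow_new` then gives the approximation of (e2, v2).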
Figure 6. Introducing New Features
To introduce the new features that appear in viseme v2 (see Figure 6), detect the facial features that appear in (e1 , v2) but were not present in (e1 , v1) using flow1. The pixels in (e1 , v2) which do not correspond to any pixel in (e1 , v1) stand for the new features. Find the correspondence of pixels in (e1 , v2) going to (e1 , v1), and call this flow3. Carry the pixels (new features) found using flow1 to (e2 , v2) in the same way as the nearby corresponding pixels in (e1 , v1) go to (e2 , v1) according to flow2. These nearby corresponding pixels in (e1 , v1) are determined by the correspondence of pixels given by flow3 on the nearby pixels in (e1 , v2).
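Detecting pixels that have no correspondence can be sketched as follows, assuming a dense flow of shape (H, W, 2) holding (dx, dy) and nearest-pixel rounding; the name `uncovered_pixels` is hypothetical. Applied to flow1, the mask marks pixels of (e1, v2) that no pixel of (e1, v1) maps to, which are the candidates for new features.

```python
import numpy as np


def uncovered_pixels(flow):
    """Mask of target-image pixels that no source pixel maps to.

    flow: array of shape (H, W, 2) with (dx, dy) from source to target.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)

    hit = np.zeros((h, w), dtype=bool)
    hit[yt[inside], xt[inside]] = True
    # Pixels never hit by the forward mapping have no correspondence.
    return ~hit
```

With real flows, rounding also leaves isolated holes, so some cleanup (for example keeping only connected regions above a minimum size) would probably be needed before treating the mask as features.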
Figure 7. Suppressing Disappearing Features
To suppress the facial features disappearing in viseme v2 (see Figure 7), detect the features that are present in (e1 , v1) but which disappear in (e1 , v2) using flow3. The pixels in (e1 , v1) which do not correspond to any pixel in (e1 , v2) stand for the disappearing features. Find where these pixels go in (e2 , v1) using flow2. While constructing the new image from (e2 , v1), suppress these pixels so that these features do not appear in the new image. Figure 8 and Figure 9 are examples of new viseme+expression combinations generated from the existing ones.
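The suppression step can be sketched in two parts: carrying the mask of disappearing pixels from (e1, v1) to (e2, v1) with flow2, and skipping those pixels when (e2, v1) is warped forward. The text does not say how suppressed pixels are filled, so this sketch simply leaves them empty and reports which output pixels were written. The names `carry_mask` and `warp_with_suppression` are hypothetical, and flows are assumed to be arrays of shape (H, W, 2) holding (dx, dy).

```python
import numpy as np


def _targets(flow):
    """Nearest-pixel target coordinates and an in-bounds mask."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    return yt, xt, inside


def carry_mask(mask, flow):
    """Move a boolean pixel mask along a flow, e.g. flow2."""
    yt, xt, inside = _targets(flow)
    keep = mask & inside
    out = np.zeros_like(mask, dtype=bool)
    out[yt[keep], xt[keep]] = True
    return out


def warp_with_suppression(image, flow, suppress):
    """Forward-warp image along flow, skipping pixels marked in suppress."""
    yt, xt, inside = _targets(flow)
    keep = inside & ~suppress
    out = np.zeros_like(image)
    filled = np.zeros(image.shape[:2], dtype=bool)
    out[yt[keep], xt[keep]] = image[keep]
    filled[yt[keep], xt[keep]] = True
    return out, filled
```

Output pixels left unfilled would then be taken from neighbouring content or from the new-feature step.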
Figure 8. Existing Images and the Constructed Image with New Features Appearing
Figure 9. Existing Images and the Constructed Image with Disappearing Features
This system is valuable where video has to be generated. Examples of such scenarios include:
Visual e-mail: At the receiving end the e-mail is "read out" by the sender. The receiver's mailbox selects the correct person to read out the mail by matching the sender's address.
Newscast: In many cases involving a field reporter, the audio is available but due to various reasons, the corresponding video is not available. Usually a photograph of the person is shown on the TV screen along with the audio. Using the system presented here, a video of the person speaking can be generated and shown along with the audio. Vision directs the listener's attention and sustains interest.
Entertainment: Making people say things they normally would not. For example, popular actors can be made to say different things and "interact" with people.
Many other uses of this system can be thought of. A talking face has the advantage of directing the listener's attention and sustaining interest. An audio-visual reality is created if the animated face is able to hold human attention and successfully engage the person in useful conversation or task. To obtain feedback on the quality of the animation, clips were made and shown to a number of people. The feedback was very positive and in many cases, unless specifically mentioned, the animated clip passed off as an original. However, when many synthesized expression visemes are used in the animation, noticeable artifacts at the teeth and lips start appearing.