Audio Visual Speech Technologies

Guillaume Gravier

I am a Research Staff Member at IBM's T. J. Watson Research Center in Yorktown Heights, New York, working on joint audio-visual statistical models for speech recognition.

I carry out my research in the field of Automatic Speech Processing (ASP), with an emphasis on the following areas:

  • speech signal statistical modeling
  • speaker recognition
  • large vocabulary continuous speech recognition (LVCSR) systems
  • robustness in automatic speech processing
Among these broad classes of problems, I am particularly interested in the following topics:
  • segmentation and indexing of audio(-visual) documents: separation into broad classes (e.g. speech/noise/music), speaker segmentation, sentence segmentation, word-level segmentation.
  • new paradigms in (statistical) speech modeling
  • speaker adaptation of speech recognition systems and LVCSR-based speaker recognition systems
  • pronunciation modeling in LVCSR systems
  • representation of the speech signal (feature extraction)
Here is a (non-exhaustive) list of my activities and achievements so far:
  • multi-band modeling with Markov random fields (Ph. D. thesis)
  • implementation and testing of a large vocabulary decoder (project ARC INRIA Sirocco)
  • family name recognition based on a phonetic recognizer
  • speaker recognition with GMM (NIST evaluations, 1997-2000)
  • customized passwords for speaker verification (European project PICASSO)
  • improvement of a grapheme-phoneme converter and diphone dictionary compression in TTS (ELAN Informatique)
  • speech coding by recognition and indexing of segments (Sympatex, a project of the French agency RNRT)
For more detail, you can have a look at my personal website and/or see below some details concerning my Ph. D. thesis and the projects I am involved in.

Ph. D. dissertation

Presentation

Title: Bidimensional statistical analysis for segmental modeling of speech signals - application to recognition
Date defended: January 10, 2000
Supervisors: Gérard CHOLLET and Marc SIGELLE
Laboratory: ENST, Dpt. Traitement du Signal et des Images

Summary

Statistical modeling of speech is now used in most speech and speaker recognition applications, since the stochastic approach provides an elegant framework for modeling the variability of speech in the time and frequency domains. The most commonly used models are hidden Markov models, which can be seen as the superposition of two stochastic processes that model the two axes of variability. Hidden Markov models are generally used with a cepstral representation of the signal. One advantage of such a representation is that it is less variable than a time/frequency one. On the other hand, denoising is more difficult to implement in the cepstral domain, and some information is lost when projecting the spectral representation onto the cepstral domain.

In this work, segmental modeling of speech in the time/frequency domain using a Markov random field based approach is studied. Starting from the formulation of a Markov chain in terms of a Gibbs distribution, we propose a parametric model that can be seen as a multi-band model to which a model of the synchrony between the bands is added. A maximum likelihood parameter estimation procedure as well as decoding strategies for the random field approach are proposed. The parameter estimation procedure is based on a stochastic generalisation of the EM algorithm and is valid for any Gibbs distribution whose potentials are linear with respect to the parameters. This algorithm is applied to the proposed random field model and validated on simulated data. Finally, the random field model is applied to isolated word recognition. In the mono-band case, the performance of the proposed approach is similar to that obtained with hidden Markov modeling. In the multi-band case, the experiments showed that a good model of the prior process is needed when the observations become more variable. The prior model is used for regularisation in the segmentation process. Modeling the inter-band synchrony, as proposed in this first approach to random field based speech modeling, turned out to be insufficient as a regularisation prior. The main interest of this work lies in the formulation of a new theoretical framework and of the associated algorithms for the segmental modeling of speech.
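
As a small illustration of the starting point mentioned in the summary, the sketch below writes the probability of a state sequence of a Markov chain in two equivalent ways: as the usual product of transition probabilities, and as a Gibbs distribution exp(-U(x)) / Z whose potentials are negative log-probabilities (so that Z = 1). The two-state chain, its numbers and all function names are assumptions made for the example; this is not the model of the thesis, which adds several bands and inter-band synchrony terms.

```python
import math

# Hypothetical two-state chain; the numbers are illustrative only.
initial = [0.6, 0.4]
transition = [[0.7, 0.3],
              [0.2, 0.8]]


def chain_probability(states):
    """Usual product form: pi(x_1) * prod_t a(x_{t-1}, x_t)."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p


def energy(states):
    """Gibbs energy U(x): one potential on the first site plus one
    potential per pair of neighbouring sites (x_{t-1}, x_t)."""
    u = -math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        u += -math.log(transition[prev][cur])
    return u


def gibbs_probability(states):
    """P(x) = exp(-U(x)) / Z, with Z = 1 because the potentials
    are negative log-probabilities of a normalized chain."""
    return math.exp(-energy(states))


path = [0, 0, 1, 1]
print(chain_probability(path), gibbs_probability(path))
```

Both printed values agree up to floating-point rounding, which is the sense in which a Markov chain is a particular Gibbs distribution with potentials on single sites and on pairs of neighbours.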

Former participation in research projects

In the last couple of years, I have been involved in the following projects:

SIROCCO : a large vocabulary recognition system

The SIROCCO project brings together several French laboratories on the topic of dictation. The goal of the project is to develop a software platform common to all the project partners. The software developed during this project will be used in the ARC-B1 evaluations of the AUF-UREF on dictation (20 and 64 kWords) and on transcription of broadcast news. Each partner of the project may also use the platform for its own research purposes. The laboratories involved in the project are: IRISA/INRIA Rennes, LIA, LORIA/INRIA Lorraine, ENST and IRIT.

For this project, I work on the implementation and improvement of the search algorithm as well as on the development of a common reference system. For the first task, my work focuses on the implementation of a beam search using a trie representation of the lexicon. The originality of the proposed decoder comes from the integration of contextual transcription rules in the decoder. An example of a contextual rule may be "the transcription of the word is /lEz/ with probability 1 if it is followed by a plural word beginning with a voiced sound". I work on integrating such rules in a trie-based decoder. For the development of a common reference system, I work on the creation of the acoustic models (phones and triphones) and of the lexicon.
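
The following sketch illustrates only the trie representation of the lexicon mentioned above, not the SIROCCO decoder itself. Words that share the same initial phones share the same arcs, so a beam search walking the trie evaluates a common prefix once for all the words that start with it. The toy lexicon, its phone symbols, the "#words" marker and the function names are assumptions made for the example.

```python
def build_trie(lexicon):
    """Build a prefix tree over phone sequences.

    Each node is a dict mapping a phone to a child node. The reserved key
    "#words" (an assumed marker, not a phone) lists the words that end at
    that node.
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)
    return root


def count_arcs(node):
    """Count the phone arcs in the trie (shared prefixes are counted once)."""
    return sum(1 + count_arcs(child)
               for key, child in node.items() if key != "#words")


def lookup(trie, phones):
    """Return the words whose transcription is exactly `phones`."""
    node = trie
    for phone in phones:
        node = node.get(phone)
        if node is None:
            return []
    return node.get("#words", [])


# Hypothetical toy lexicon with made-up phone symbols.
toy_lexicon = {
    "cat": ["k", "a", "t"],
    "cab": ["k", "a", "b"],
    "can": ["k", "a", "n"],
    "dog": ["d", "o", "g"],
}
trie = build_trie(toy_lexicon)
flat_size = sum(len(phones) for phones in toy_lexicon.values())
print(count_arcs(trie), "arcs in the trie versus", flat_size, "phones in a flat lexicon")
```

A contextual transcription rule of the kind quoted above would have to be checked when a word end is reached in the trie, since the identity of the following word is not yet known at that point; the sketch does not attempt this.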

ELISA : speaker verification, detection and tracking

ELISA is not exactly a project but rather a consortium of labs that teamed up to work on speaker recognition and to participate in the NIST evaluations on this topic. The consortium was created following my participation with ENST in the 1997 evaluation. Since then, the ELISA consortium has participated regularly in the NIST evaluations, where some of our systems were ranked first for some evaluation conditions in 1998. The consortium currently brings together the following labs: IRISA/INRIA Rennes, Laboratoire d'Informatique d'Avignon and ENST, but other labs such as EPFL, IDIAP or FPMS have participated in the work done within the consortium.

In this framework, I carried out work on signal representation and on score normalization techniques. I also help coordinate the consortium, together with F. Bimbot (IRISA) and J.-F. Bonastre (LIA).

SYMPATEX : very low bit rate speech coding

SYMPATEX is a project financed by the French Réseau National de la Recherche en Télécommunications (RNRT) that aims at developing a very low bit rate speech coder. The coder is based on the thesis work of Jan Cernocký. The main idea is to index a reference database shared between the coder and the decoder and to code the incoming speech signal by sending the sequence of indexes that best corresponds to the input signal. Indexing is based on segmental ALISP units that are automatically determined from the speech signal and is carried out either by statistical modeling of the units or by an exhaustive search of those units. The original message is then reconstructed by concatenating segments taken from the reference database. The partners of the project are: ENST, ESIEE, Thomson-CSF Communication, ELAN Informatique and INFO Telecom.
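
The principle of coding by indexing can be sketched in a few lines. In this toy version, the coder and the decoder share a small reference database, the coder sends only the index of the closest reference segment for each input segment, and the decoder concatenates the corresponding reference segments. The fixed-length segments, the squared Euclidean distance, the exhaustive search and all names are simplifying assumptions made for the example; the actual ALISP units are variable-length and determined automatically from the speech signal.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length segments."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(segments, reference):
    """Coder side: for each input segment, keep only the index of the
    closest reference segment (exhaustive search)."""
    return [
        min(range(len(reference)),
            key=lambda i: squared_distance(segment, reference[i]))
        for segment in segments
    ]


def decode(indexes, reference):
    """Decoder side: concatenate the reference segments named by the indexes."""
    signal = []
    for i in indexes:
        signal.extend(reference[i])
    return signal


# Hypothetical reference database shared by the coder and the decoder.
reference = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
# Hypothetical input, already cut into segments.
segments = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

indexes = encode(segments, reference)
reconstruction = decode(indexes, reference)
print(indexes, reconstruction)
```

Only the list of indexes crosses the channel, which is where the very low bit rate comes from; the price is that the reconstruction is made of reference material rather than of the original signal.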

In this project, I work on a generalization of the original coder so that it works in speaker-independent mode. For this, I work on speaker normalization (VTLN, speaker typology) and on HMM adaptation when statistical modeling of the units is used. I also participate in the research concerning the indexing of the reference database and the alignment of units between the reference and the segment to be coded.

PICASSO : Pioneering Caller Authentication for Secure Service Operation

The PICASSO project (EU Telematics Application Program) focuses on speaker recognition for telephone services that combine secure transactions with an easy-to-use interface able to understand natural spoken language.

My work on this project concerns the use of customized passwords for speaker verification using HMM adaptation techniques. One aspect related to that topic is the recognition of the password from the input utterance.