Audio Visual Speech Technologies
Guillaume Gravier
I am a Research Staff Member at IBM's T.J. Watson Research Center in Yorktown Heights, New York, working on joint audio-visual statistical models for speech recognition. I carry out my research in the field of Automatic Speech Processing (ASP) with emphasis on the following areas:
Ph.D. dissertation
Presentation
Summary
Statistical modeling of speech is now used in most speech and speaker recognition applications, since the stochastic approach provides an elegant framework for modeling the variability of speech in the time and frequency domains. The most commonly used models are hidden Markov models, which can be seen as the superposition of two stochastic processes that model the two axes of variability. Hidden Markov models are generally used with a cepstral representation of the signal. One advantage of such a representation is that it is less variable than a time/frequency one. On the other hand, denoising is more difficult to implement in the cepstral domain, and some information is lost when projecting the spectral representation onto the cepstral domain.

In this work, segmental modeling of speech in the time/frequency domain using a Markov random field based approach is studied. Starting from the formulation of a Markov chain in terms of a Gibbs distribution, we propose a parametric model that can be seen as a multi-band model to which a model of the synchrony between the bands is added. A maximum likelihood parameter estimation procedure and decoding strategies for the random field approach are proposed. The parameter estimation procedure is based on a stochastic generalisation of the EM algorithm and is valid for any Gibbs distribution whose potentials are linear with respect to the parameters. This algorithm is applied to the proposed random field model, and validation is performed on simulated data.

Finally, the random field model is applied to isolated word recognition. In the mono-band case, the performance of the proposed approach is similar to that obtained with hidden Markov modeling. In the multi-band case, the experiments showed that a good model of the prior process is needed when the observations become more variable. The prior model is used for regularisation in the segmentation process.
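As a reminder of what the Markov chain to Gibbs distribution step mentioned in the summary looks like, here is the standard identity for a first-order chain. The notation (states x_1..x_T, initial distribution \pi, strictly positive transition probabilities a_{ij}, potentials V) is assumed for illustration and is not taken from the dissertation.

```latex
% Assumed notation (not from the source): states x_1..x_T, initial
% distribution \pi, transition probabilities a_{ij} > 0.
\begin{align*}
P(x_1,\dots,x_T) &= \pi_{x_1} \prod_{t=2}^{T} a_{x_{t-1} x_t}
  = \frac{1}{Z} \exp\{-U(x)\}, \qquad Z = 1, \\
U(x) &= V_1(x_1) + \sum_{t=2}^{T} V_{t-1,t}(x_{t-1}, x_t), \\
V_1(x_1) &= -\log \pi_{x_1}, \qquad
V_{t-1,t}(x_{t-1}, x_t) = -\log a_{x_{t-1} x_t}.
\end{align*}
```

In this form the cliques are single sites and pairs of consecutive time indices. How the proposed model adds inter-band terms to the energy is not detailed in this summary.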
Modeling the inter-band synchrony, as proposed in this first approach to random field based speech modeling, turned out to be insufficient as a regularisation prior. The main interest of this work lies in the formulation of a new theoretical framework and of the associated algorithms for the segmental modeling of speech.

Former participation in research projects
In the last couple of years, I have been involved in the following projects:

SIROCCO: a large vocabulary recognition system
The SIROCCO project brings together several French laboratories on the topic of dictation. The goal of the project is to develop a software platform common to all the project partners. The software developed during this project will be used in the ARC-B1 evaluations of the AUF-UREF on dictation (20 and 64 kWords) and on transcription of broadcast news. Each partner of the project may also use the platform for its own research purposes. The laboratories involved in the project are IRISA/INRIA Rennes, LIA, LORIA/INRIA Lorraine, ENST and IRIT. For this project, I work on the implementation and the improvement of
the search algorithm as well as on the development of a common reference
system. For the first task, my work focuses on the implementation of a
beam-search using a trie representation of the lexicon. The originality
of the proposed decoder comes from the integration of contextual transcription
rules in the decoder. An example of a contextual rule may be "the transcription
of word

ELISA
is not exactly a project but rather a consortium of labs that teamed up
to work on speaker recognition and to participate in the NIST
evaluations on this topic. The consortium was created following my
participation with ENST in the 1997 evaluation. Since then, the ELISA
consortium has participated regularly in the NIST evaluations, where some of
our systems ranked first for some evaluation conditions in 1998.
The consortium currently brings together the following labs: IRISA/INRIA
Rennes, Laboratoire d'Informatique
d'Avignon and ENST, but other labs
such as EPFL, IDIAP
or FPMS have contributed to the work done
within the consortium.
In this framework, I carried out work on signal representation and on
score normalization techniques. I also help coordinate
the consortium in conjunction with F. Bimbot (IRISA) and J.-F. Bonastre
(LIA).

SYMPATEX is a project financed by the French Réseau National de
la Recherche en Télécommunications (RNRT), and aims at developing
a very low bit rate speech coder. The coder is based on the
thesis work of Jan Cernocký. The main idea is to index a reference
database shared between the coder and the decoder and to code the incoming
speech signal by sending the sequence of indexes that best correspond
to the input signal. Indexing is based on segmental ALISP units that
are automatically determined from the speech signal and is carried out
either by statistical modeling of the units or by an exhaustive search
of those units. The original message is then reconstructed by concatenating
segments taken from the reference database. The partners of the project
are: ENST, ESIEE,
Thomson-CSF Communication, ELAN Informatique
and INFO Telecom.
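The coding principle described above (send only the indexes of the best-matching reference segments, then rebuild the signal by concatenation) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `encode` and `decode`, the fixed segment length, and the squared Euclidean distance stand in for the automatically derived, variable-length ALISP units and the statistical or exhaustive matching used in the actual coder.

```python
from typing import List, Sequence

Segment = Sequence[float]


def distance(a: Segment, b: Segment) -> float:
    """Squared Euclidean distance between two equal-length segments."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(signal: Sequence[float], reference: List[Segment], seg_len: int) -> List[int]:
    """Cut `signal` into fixed-length chunks (a simplifying assumption) and
    return, for each chunk, the index of the closest reference segment.
    Trailing samples that do not fill a whole chunk are dropped."""
    indexes = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        chunk = signal[start:start + seg_len]
        # Exhaustive search over the shared reference database.
        best = min(range(len(reference)), key=lambda i: distance(chunk, reference[i]))
        indexes.append(best)
    return indexes


def decode(indexes: Sequence[int], reference: List[Segment]) -> List[float]:
    """Rebuild an approximation of the signal by concatenating the
    reference segments designated by the received indexes."""
    out: List[float] = []
    for i in indexes:
        out.extend(reference[i])
    return out


if __name__ == "__main__":
    # Hypothetical reference database shared by coder and decoder.
    reference = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    signal = [0.9, 1.1, 0.1, -0.1, 0.0, 0.8]
    sent = encode(signal, reference, seg_len=2)
    print(sent)
    print(decode(sent, reference))
```

Only the index sequence is transmitted, which is where the very low bit rate comes from; the quality of the reconstruction depends entirely on how well the reference database covers the input.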
In this project, I work on a generalization of the original coder so that
it works in speaker-independent mode. For this, I work on speaker
normalization (VTLN, speaker typology) and on HMM adaptation when statistical
modeling of the units is used. I also participate in the research concerning
the indexing of the reference database and the alignment of units between
the reference and the segment to be coded.

The PICASSO project (EU
Telematics Application Program) focuses on speaker recognition for telephone
services that combine security for transactions with an easy-to-use interface
able to understand natural spoken language.
My work on this project is on the use of customized passwords for speaker
verification using HMM adaptation techniques. One aspect related to that
topic is password recognition from the input utterance.