Context-dependent acoustic models for medium and large vocabulary speech recognition with eigenvoice training
Abstract
A reduced-dimensionality eigenvoice analytical technique is used during training to develop context-dependent acoustic models for allophones, and again at run time on the speech of a new speaker. The technique removes individual speaker idiosyncrasies to produce more universally applicable and robust allophone models. In one embodiment, the eigenvoice technique identifies the centroid of each speaker, which may then be “subtracted out” of the recognition equation. In another embodiment, maximum-likelihood estimation techniques are used to develop common decision-tree frameworks that may be shared across all speakers when constructing the eigenvoice representation of speaker space.
10 Claims
1. A method for developing context-dependent models for automatic speech recognition, comprising:
generating an eigenspace to represent a training speaker population;
providing a set of acoustic data for at least one training speaker and representing said acoustic data in said eigenspace to determine at least one allophone centroid for said training speaker;
subtracting said centroid from said acoustic data to generate speaker-adjusted acoustic data for said training speaker;
using said speaker-adjusted acoustic data to grow at least one decision tree having leaf nodes containing context-dependent models for different allophones. - View Dependent Claims (2, 3, 4, 5, 6)
providing speech data from a new speaker;
using said eigenspace to determine at least one new speaker centroid of a new speaker and subtracting said new speaker centroid from said speech data from said new speaker to generate speaker-adjusted data; and
applying said speaker-adjusted data to a speech recognizer employing said context-dependent models.
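The eigenspace construction, centroid lookup, and centroid subtraction of claim 1 (and the recognition-time route of the dependent claims) can be sketched as follows. This is a minimal illustration, not the patented implementation: PCA via SVD stands in for the unspecified eigenspace construction, and the function names (`build_eigenspace`, `speaker_centroid`, `speaker_adjust`) are assumptions.

```python
import numpy as np

def build_eigenspace(speaker_supervectors, k=2):
    """PCA over per-speaker supervectors (one concrete choice of
    dimensionality reduction) -> mean voice plus k 'eigenvoice' axes."""
    mu = speaker_supervectors.mean(axis=0)
    # Rows of vt are orthonormal principal directions of speaker space.
    _, _, vt = np.linalg.svd(speaker_supervectors - mu, full_matrices=False)
    return mu, vt[:k]

def speaker_centroid(frames, mu, basis):
    """Represent a speaker's acoustic data in the eigenspace and read
    off that speaker's centroid."""
    w = basis @ (frames.mean(axis=0) - mu)   # eigenspace coordinates
    return mu + basis.T @ w                  # centroid in feature space

def speaker_adjust(frames, mu, basis):
    """Subtract the centroid to obtain speaker-adjusted acoustic data,
    the input used for growing the context-dependent decision trees."""
    return frames - speaker_centroid(frames, mu, basis)
```

At recognition time the same projection is applied to the new speaker's speech before it is passed to the recognizer, mirroring the training-side adjustment.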
5. A method of performing speech recognition using said context-dependent models developed as recited in claim 1, comprising:
providing speech data from a new speaker;
using said eigenspace to determine at least one new speaker centroid of a new speaker and adding said new speaker centroid to said context-dependent models to generate new speaker-adjusted context-dependent models; and
applying said speech data to a speech recognizer employing said new speaker-adjusted context-dependent models.
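Claim 5 is the mirror image of the data-adjustment route: rather than subtracting the centroid from the incoming speech, the new speaker's centroid is added to the centroid-free models. A minimal sketch, assuming models are stored as mean vectors keyed by allophone name (an illustrative representation):

```python
import numpy as np

def adapt_models(leaf_means, centroid):
    """Add the new speaker's centroid to each speaker-adjusted
    context-dependent model, so raw (unadjusted) speech can be scored
    directly against the adapted models."""
    return {name: mean + centroid for name, mean in leaf_means.items()}
```

Under a shared covariance, scoring `frames - centroid` against `mean` and scoring `frames` against `mean + centroid` give the same result, which is why the two adaptation embodiments are interchangeable.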
6. The method of claim 1, wherein the decision tree has at least one non-leaf node containing an eigen dimension question.
7. A method of training context-dependent models for automatic speech recognition, comprising:
constructing a decision tree framework of yes-no questions having leaf nodes for storing context-dependent allophone models;
training a set of speaker-dependent acoustic models for a plurality of training speakers and using said decision tree framework to construct a plurality of decision trees for said training speakers, storing the speaker-dependent acoustic models for each training speaker in the leaf nodes of the respective decision tree;
constructing an eigenspace by using said set of decision trees to generate supervectors that are subsequently transformed through dimensionality reduction. - View Dependent Claims (8)
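The supervector step of claim 7 can be sketched as follows, assuming every speaker's tree uses the same framework (the same leaves, visited in a fixed order) and using PCA via SVD as one concrete stand-in for the unspecified dimensionality-reduction step; the function names are illustrative:

```python
import numpy as np

def tree_to_supervector(leaf_models):
    """Concatenate one speaker's leaf-node model means in a fixed leaf
    order shared by all speakers (the common decision-tree framework)."""
    return np.concatenate([leaf_models[leaf] for leaf in sorted(leaf_models)])

def eigenspace_from_trees(per_speaker_trees, k=2):
    """Stack the supervectors and reduce dimensionality to obtain the
    eigenspace (here: PCA via SVD, one concrete choice)."""
    sv = np.stack([tree_to_supervector(t) for t in per_speaker_trees])
    mu = sv.mean(axis=0)
    _, _, vt = np.linalg.svd(sv - mu, full_matrices=False)
    return mu, vt[:k]
```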
9. A method of constructing a decision tree for storing context-dependent models for automatic speech recognition, comprising:
providing a pool of yes-no questions to identify different contexts of sound units;
providing a corpus of test speaker data;
for a plurality of test speakers represented by said corpus and for a plurality of questions in said pool, iteratively performing the following steps (a) through (f) inclusive:
(a) selecting a question from said pool;
(b) constructing a first yes model and a first no model for said selected question using speaker data from a first one of said test speakers;
(c) computing a first product of the probability scores for said first yes model and said first no model;
(d) constructing a second yes model and a second no model for said selected question using speaker data from a second one of said test speakers;
(e) computing a second product of the probability scores for said second yes model and said second no model;
(f) computing an overall score for said selected question by computing an overall product that includes the product of said first and second products;
growing a decision tree having nodes populated with different questions selected from the pool such that at each node the question with the highest overall score is used.
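The per-question scoring loop of claim 9 can be sketched like this, assuming a caller-supplied `score_split(question, speaker_data)` (a hypothetical name) that fits the yes and no models for one speaker and returns their probability scores. Log-probabilities are summed rather than multiplied only to avoid numerical underflow; this is equivalent to the overall product in the claim.

```python
import math

def question_score(question, speakers_data, score_split):
    """For each speaker, multiply the yes-model and no-model
    probability scores; then multiply those products across all
    speakers.  Returned as a log for numerical stability."""
    log_overall = 0.0
    for data in speakers_data:
        p_yes, p_no = score_split(question, data)
        log_overall += math.log(p_yes) + math.log(p_no)
    return log_overall

def best_question(pool, speakers_data, score_split):
    """At each node of the growing tree, keep the question from the
    pool with the highest overall score."""
    return max(pool, key=lambda q: question_score(q, speakers_data, score_split))
```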
10. A memory for storing data for access by an application program being executed on a data processing system, wherein a decision tree for storing speech models is stored, the decision tree comprising:
a root node containing a question about a context of a phoneme;
a plurality of non-leaf child nodes containing additional questions, wherein the additional questions include at least one eigen dimension question; and
a plurality of leaf child nodes containing speech models.
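The claim-10 structure amounts to a binary tree whose internal nodes ask yes/no questions (phonetic-context questions, plus at least one eigen-dimension question, i.e. a threshold on a speaker's coordinate along an eigenvoice axis) and whose leaves hold speech models. A minimal sketch; the field names, example questions, and threshold are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    question: Optional[Callable] = None  # non-leaf: yes/no question
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    model: Optional[str] = None          # leaf: context-dependent model

def lookup(node, ctx):
    """Descend from the root, answering each question, until a leaf
    holding a context-dependent speech model is reached."""
    while node.model is None:
        node = node.yes if node.question(ctx) else node.no
    return node.model

# Root asks a phonetic-context question; one child asks an
# eigen-dimension question (a threshold on eigenvoice coordinate 0).
tree = Node(
    question=lambda ctx: ctx["left"] == "s",
    yes=Node(question=lambda ctx: ctx["eigen"][0] > 0.1,
             yes=Node(model="model_A"), no=Node(model="model_B")),
    no=Node(model="model_C"),
)
```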
Specification