Method and apparatus for speech recognition using optimized partial mixture tying of HMM state functions
First Claim
1. In a speech recognition system using a method for recognizing human speech, the method being of the type comprising the steps of:
- selecting a model to represent a selected subunit of speech, the model having associated with it a plurality of states and each state having associated with it a probability function, the probability function having undetermined parameters, the probability functions being represented by a mixture of simple probability functions, the simple probability functions being stored in a master codebook;
extracting features from a set of speech training data;
using the features to determine parameters for the probability functions in the model, an improvement comprising the steps of:
identifying states that are mostly represented by a related set of simple probability functions;
clustering said states that are mostly represented by a related set of simple probability functions into a plurality of clusters;
splitting up the master codebook into a plurality of cluster codebooks, one cluster codebook associated with each one of said clusters;
pruning the cluster codebooks to reduce the number of entries in each said codebook by retaining the simple probability functions that are most used by the states in the cluster and deleting remaining functions; and
re-estimating the simple probability functions in each cluster codebook to better fit the states in that cluster and re-estimating the parameters for each state in the cluster.
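The pruning step recited above can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes that "most used" is measured by the summed mixture weight that a cluster's states place on each master-codebook entry, and the names `prune_cluster_codebook`, `state_weights`, and `keep` are hypothetical. Re-estimation of the retained functions is not shown.

```python
def prune_cluster_codebook(state_weights, cluster_states, keep):
    """Keep the `keep` master-codebook entries most used by one cluster.

    state_weights: dict mapping state name -> list of mixture weights,
                   one weight per master-codebook entry (assumed layout).
    cluster_states: names of the states that belong to this cluster.
    Returns (kept_indices, pruned_weights), where pruned_weights holds each
    state's weights over the kept entries, renormalized to sum to 1.
    """
    n_entries = len(state_weights[cluster_states[0]])

    # Assumption: usage of an entry = total weight the cluster's states give it.
    usage = [0.0] * n_entries
    for s in cluster_states:
        for k, w in enumerate(state_weights[s]):
            usage[k] += w

    ranked = sorted(range(n_entries), key=lambda k: usage[k], reverse=True)
    kept = sorted(ranked[:keep])

    pruned = {}
    for s in cluster_states:
        w = [state_weights[s][k] for k in kept]
        total = sum(w)
        # Fall back to uniform weights if a state used none of the kept entries.
        pruned[s] = [x / total for x in w] if total > 0 else [1.0 / len(kept)] * len(kept)
    return kept, pruned
```

For example, with two states whose weights concentrate on the first two of four entries, `keep=2` retains entries 0 and 1 and rescales each state's remaining weights to sum to one.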
Abstract
In accordance with the invention, a speech recognizer is provided which uses a computationally-feasible method for constructing a set of Hidden Markov Models (HMMs) for speech recognition that utilize a partial and optimal degree of mixture tying. With partially-tied HMMs, improved recognition accuracy of a large vocabulary word corpus as compared to systems that use fully-tied HMMs is achieved with less computational overhead than with a fully untied system. The computationally-feasible technique comprises the steps of determining a cluster of HMM states that share Gaussian components which are close together, developing a subset codebook for those clusters, and recalculating the Gaussians in the codebook to best estimate the clustered states.
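As a rough illustration of grouping states that share nearby Gaussian components, the sketch below merges states bottom-up until a target number of clusters remains. The distance used here (the L1 difference between the clusters' average mixture-weight vectors) is a stand-in chosen for simplicity; the abstract does not state which similarity measure the invention uses, and `cluster_states` and its parameters are hypothetical names.

```python
def cluster_states(state_weights, num_clusters):
    """Greedy agglomerative clustering of HMM states (illustrative only).

    state_weights: dict mapping state name -> list of mixture weights over
                   the master codebook (assumed layout).
    Returns a list of clusters, each a list of state names.
    """
    clusters = [[s] for s in sorted(state_weights)]

    def centroid(members):
        # Average mixture-weight vector of the states in one cluster.
        n = len(state_weights[members[0]])
        return [sum(state_weights[s][k] for s in members) / len(members)
                for k in range(n)]

    def distance(a, b):
        # Assumption: L1 distance between centroids stands in for the
        # (unspecified) measure of how much two clusters share components.
        return sum(abs(x - y) for x, y in zip(centroid(a), centroid(b)))

    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

States whose weight mass sits on the same codebook entries end up in the same cluster, which is the precondition for giving that cluster its own smaller codebook.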
62 Citations
17 Claims
1. In a speech recognition system using a method for recognizing human speech, the method being of the type comprising the steps of:
selecting a model to represent a selected subunit of speech, the model having associated with it a plurality of states and each state having associated with it a probability function, the probability function having undetermined parameters, the probability functions being represented by a mixture of simple probability functions, the simple probability functions being stored in a master codebook;
extracting features from a set of speech training data;
using the features to determine parameters for the probability functions in the model, an improvement comprising the steps of:
identifying states that are mostly represented by a related set of simple probability functions;
clustering said states that are mostly represented by a related set of simple probability functions into a plurality of clusters;
splitting up the master codebook into a plurality of cluster codebooks, one cluster codebook associated with each one of said clusters;
pruning the cluster codebooks to reduce the number of entries in each said codebook by retaining the simple probability functions that are most used by the states in the cluster and deleting remaining functions; and
re-estimating the simple probability functions in each cluster codebook to better fit the states in that cluster and re-estimating the parameters for each state in the cluster.
Dependent claims: 2, 3, 4, 5, 6, 7
8. In a speech recognition system for responding to signals representative of digital speech, a method for developing models for subsets of speech comprising the steps of:
selecting a multi-state model with state probability functions, said probability functions being of a general form with initially undetermined parameters;
creating an individual instance of a model for each subunit of speech to be processed;
clustering states based on their acoustic similarity;
creating a plurality of cluster codebooks, one codebook for each cluster;
said cluster codebooks consisting of probability density functions that are shared by each cluster's states;
estimating the probability densities of each cluster codebook and the parameters of the probability equations in each cluster.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A method for recognizing speech using a computer comprising the steps of:
selecting a multi-state model with state probability functions, said probability functions being of a general form with initially undetermined parameters;
creating an individual instance of a model for each subunit of speech to be processed;
training the models with a training algorithm that determines parameters for the models that best fit a set of features extracted from known speech samples;
clustering states into a predetermined number of clusters of states wherein the states in each said cluster have probability functions that can be well represented by a shared group of simple probability functions;
developing a cluster codebook of simple probability functions for each cluster and storing said cluster codebooks;
storing for each state an identifier for a cluster codebook and an array of weighting factors;
extracting features from a speech sample to be recognized; and
using said state probability functions and said cluster codebooks to determine a most probable state sequence for said speech sample.
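The per-state storage recited in claim 15 (a cluster codebook identifier plus an array of weighting factors) can be sketched as below. This is a minimal illustration under stated assumptions: the simple probability functions are taken to be diagonal-covariance Gaussians, which the claim does not require, and the dictionary layout and the names `log_gaussian` and `state_log_likelihood` are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed function type)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def state_log_likelihood(x, state, codebooks):
    """Log of sum_k weight_k * N_k(x) using the state's cluster codebook.

    state: {"codebook_id": ..., "weights": [...]} (hypothetical layout).
    codebooks: dict mapping codebook id -> list of {"mean": [...], "var": [...]}.
    """
    codebook = codebooks[state["codebook_id"]]
    terms = [math.log(w) + log_gaussian(x, g["mean"], g["var"])
             for w, g in zip(state["weights"], codebook) if w > 0]
    # Log-sum-exp keeps the mixture sum numerically stable.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Because each state holds only an identifier and a weight array, many states can point at the same small codebook, so each Gaussian in that codebook needs to be evaluated only once per frame for the whole cluster.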
16. In a speech recognition system for responding to signals representative of digital speech, a method for developing models for subsets of speech comprising the steps of:
selecting a multi-state model with state probability functions, said probability functions being of a general form with initially undetermined parameters;
creating an individual instance of a model for each subunit of speech to be processed;
clustering states based on their acoustic similarities;
creating a plurality of cluster codebooks, one cluster codebook for each cluster;
each codebook comprising a group of shared probability functions; and
re-estimating the probability densities of each cluster codebook and the parameters of the probability equations for each state in each cluster.
17. A speech recognizer, comprising:
a computer;
storage means;
a set of models for subunits of speech stored in the storage means;
a feature extractor in the computer for extracting feature data capable of being processed by said computer from a speech signal;
training means in the computer for training the models using features from identified samples of speech data and for producing a master codebook of probability density functions for use by the models;
clustering means in the computer for identifying clusters of states that share subsets of the probability density functions in the codebooks;
splitting and pruning means in the computer for producing cluster codebooks by splitting the master codebook into subsets of probability densities shared by clustered states;
re-estimating means for retraining the models for the states in the clusters and for recalculating the probability densities in each cluster codebook;
recognizing means for matching features from unidentified speech data to the models to produce a most likely path through the models where the path defines the most likely subunits and words in the speech data.
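The "most likely path through the models" that the recognizing means produces is commonly computed with the Viterbi algorithm. The claims do not name the algorithm, so the sketch below is an assumption offered for illustration; the function name `viterbi` and its argument layout (log-domain initial, transition, and per-frame emission scores) are hypothetical.

```python
def viterbi(log_init, log_trans, log_emit):
    """Most probable state sequence for one utterance (illustrative sketch).

    log_init[s]: log probability of starting in state s.
    log_trans[r][s]: log probability of moving from state r to state s.
    log_emit[t][s]: log likelihood of frame t under state s, e.g. from a
                    state's weighted cluster-codebook mixture.
    """
    n = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t

    for t in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for s in range(n):
            r = max(range(n), key=lambda q: prev[q] + log_trans[q][s])
            pointers.append(r)
            score.append(prev[r] + log_trans[r][s] + log_emit[t][s])
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Working in the log domain avoids underflow on long utterances; forbidden transitions can be represented with `float("-inf")`.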
Specification