Senone tree representation and evaluation
Abstract
A speech recognition method provides improved modeling and recognition accuracy using hidden Markov models. During training, the method creates a senone tree for each state of each phoneme encountered in a data set of training words. All output distributions received for a selected state of a selected phoneme in the set of training words are clustered together in a root node of a senone tree. Each node of the tree, beginning with the root node, is divided into two nodes by asking linguistic questions regarding the phonemes immediately to the left and right of a central phoneme of a triphone. At a predetermined point, the tree creation stops, resulting in leaves representing clustered output distributions known as senones. The senone trees allow all possible triphones to be mapped into a sequence of senones simply by traversing the senone trees associated with the central phoneme of the triphone. As a result, unseen triphones not encountered in the training data can be modeled with senones created using the triphones actually found in the training data.
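The tree-growing procedure summarized in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the abstract does not name a splitting criterion or a stop condition, so the sketch assumes discrete, count-weighted output distributions, picks the question that most reduces count-weighted entropy, and stops when the best gain falls below a threshold. The names `Question`, `Item`, `Node`, `build_tree`, and `min_gain` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Question:
    """A linguistic question about the left or right phoneme (hypothetical form)."""
    side: str            # "left" or "right"
    phones: frozenset    # phonemes for which the answer is "yes"

    def ask(self, left: str, right: str) -> bool:
        return (left if self.side == "left" else right) in self.phones


@dataclass
class Item:
    """One training output distribution, tagged with its triphone context."""
    left: str
    right: str
    counts: List[float]  # assumed: count-weighted discrete distribution


@dataclass
class Node:
    items: List[Item]
    question: Optional[Question] = None   # None means leaf, i.e. a senone
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def weighted_entropy(items: List[Item]) -> float:
    """Total count times the entropy of the pooled distribution."""
    if not items:
        return 0.0
    pooled = [sum(col) for col in zip(*(it.counts for it in items))]
    total = sum(pooled)
    return -sum(c * math.log(c / total) for c in pooled if c > 0)


def grow(node: Node, questions: List[Question], min_gain: float) -> None:
    """Split the node in two with the best question, then recurse."""
    base = weighted_entropy(node.items)
    best = None
    for q in questions:
        yes = [it for it in node.items if q.ask(it.left, it.right)]
        no = [it for it in node.items if not q.ask(it.left, it.right)]
        if not yes or not no:
            continue  # question does not separate anything at this node
        gain = base - weighted_entropy(yes) - weighted_entropy(no)
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is None or best[0] < min_gain:
        return  # assumed stop condition: node stays a leaf
    _, node.question, yes, no = best
    node.yes, node.no = Node(yes), Node(no)
    grow(node.yes, questions, min_gain)
    grow(node.no, questions, min_gain)


def build_tree(items: List[Item], questions: List[Question],
               min_gain: float = 1e-3) -> Node:
    """Root node groups every distribution for one state of one phoneme."""
    root = Node(list(items))
    grow(root, questions, min_gain)
    return root


def leaves(node: Node) -> List[Node]:
    """Collect the leaf nodes (senones) of a tree."""
    if node.question is None:
        return [node]
    return leaves(node.yes) + leaves(node.no)
```

One tree would be built per state of each phoneme by calling `build_tree` with the distributions of all triphones that share that central phoneme. A larger `min_gain` yields fewer, broader senones; a smaller one yields more, narrower senones.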
30 Claims
1. A method of performing speech recognition using a vocabulary of words characterized by one or more triphones for each word, each triphone including a central phoneme, a left phoneme immediately preceding the central phoneme, and a right phoneme immediately following the central phoneme, the method comprising:
receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a triphone encountered in one of the training words; and
creating a plurality of senone trees for each successive phoneme of the vocabulary by:
selecting the phoneme;
for each successive state of the selected phoneme:
selecting the state;
creating a senone tree for the selected state of the selected phoneme, the tree having a plurality of levels with one or more nodes at each level, the senone tree being created by:
grouping together in a root node all received output distributions associated with the selected state of triphones that include the selected phoneme as their central phoneme; and
dividing each node into a plurality of nodes according to linguistic questions regarding the left and right phonemes of the triphones associated with the output distributions that are grouped in the root node such that each node represents a group of similar output distributions, and continuing the dividing of each node until a stop condition is met at which the nodes are leaf nodes associated with one or more output distributions.
Dependent claims: 2-9
10. A computer-implemented method of performing speech recognition for a vocabulary of words characterized by one or more phonemes for each word, comprising:
receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a phoneme of one of the training words; and
creating a separate senone tree for each state of each phoneme of the training words, each senone tree having a root node and a plurality of leaf nodes, the creating step including:
selecting successively each one of the phonemes;
selecting successively each one of the states of the selected phoneme;
grouping together in the root node of the senone tree of the selected state all received output distributions associated with the selected state; and
distinguishing the leaf nodes from each other according to linguistic questions regarding phonemes adjacent the selected phoneme in one or more training words such that each leaf node represents one or more of the output distributions grouped in the root node.
Dependent claims: 11-16
17. A computer-implemented method of performing speech recognition using a vocabulary of words having one or more phonemes for each word, comprising:
receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a phoneme of one of the training words; and
creating a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of nodes including leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, each leaf node indicating a senone representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
Dependent claims: 18-20
21. A computer system for performing speech recognition using a vocabulary of words having one or more phonemes for each word, comprising:
means for receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a phoneme of one of the training words; and
a trainer that creates a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of nodes including leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, each leaf node indicating a senone representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
Dependent claims: 22-23
24. A computer-readable storage medium including a data structure for use in speech recognition based on a vocabulary of words having one or more triphones for each word, each triphone including a central phoneme and phonemes positioned immediately adjacent the central phoneme, the data structure including a plurality of senone trees representing a data set of output distributions of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a triphone central phoneme of one of the training words, the plurality of senone trees including a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of nodes including leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, each leaf node indicating a senone representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
27. A computer-readable storage medium including executable computer instructions for causing a computer to perform speech recognition, the storage medium comprising:
a plurality of senone trees created using a data set of output distributions based on a set of training words spoken by training users, each training word including one or more phonemes, each phoneme including a predetermined plural number of states, the plurality of senone trees including a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of leaf nodes, each leaf node indicating a senone representing one or more output distributions of the data set;
computer instructions for causing the computer to detect an unseen triphone in a target word received by the computer, the unseen triphone being a triphone not encountered in one of the training words and including a central phoneme and left and right phonemes positioned immediately adjacent the central phoneme;
computer instructions for causing the computer to traverse the senone trees for the central phoneme of the unseen triphone and thereby obtain a senone for each state of the central phoneme; and
computer instructions for causing the computer to use the senones obtained for the central phoneme to create a triphone model for the unseen triphone, such that the triphone model can be used for future recognition of spoken words that include the unseen triphone.
Dependent claims: 28-30
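The tree structure and the unseen-triphone lookup recited in the claims above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the claimed implementation: each non-leaf node is assumed to hold a yes/no question of the form "is the left (or right) phoneme in this set?", three states per phoneme are assumed (the claims say only "a predetermined plural number"), and the names `TreeNode`, `find_senone`, `model_unseen_triphone`, `trees`, and `known` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class TreeNode:
    """Non-leaf nodes carry a question; leaf nodes carry a senone id."""
    side: Optional[str] = None             # "left" or "right" (non-leaf only)
    phones: FrozenSet[str] = frozenset()   # contexts that answer "yes"
    yes: Optional["TreeNode"] = None
    no: Optional["TreeNode"] = None
    senone: Optional[int] = None           # set only on leaf nodes


def find_senone(node: TreeNode, left: str, right: str) -> int:
    """Follow the branch matching each answer from the root down to a leaf."""
    while node.senone is None:
        context = left if node.side == "left" else right
        node = node.yes if context in node.phones else node.no
    return node.senone


def model_unseen_triphone(
    trees: Dict[Tuple[str, int], TreeNode],          # keyed by (phoneme, state)
    known: Dict[Tuple[str, str, str], List[int]],    # triphone -> senone sequence
    left: str,
    center: str,
    right: str,
    n_states: int = 3,                               # assumed number of states
) -> List[int]:
    """Return the senone sequence for a triphone, creating it if unseen."""
    key = (left, center, right)
    if key not in known:
        # Unseen triphone: traverse one tree per state of the central phoneme.
        known[key] = [
            find_senone(trees[(center, state)], left, right)
            for state in range(n_states)
        ]
    return known[key]
```

Because every triphone with a given central phoneme reaches some leaf in each of that phoneme's trees, the lookup succeeds for any left and right context, including contexts absent from the training words. Storing the result in `known` mirrors the claim language that the new model is available for future recognition.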
Specification