Senone tree representation and evaluation

US 5,794,197 A
Filed: 05/02/1997
Issued: 08/11/1998
Est. Priority Date: 01/21/1994
Status: Expired due to Term

Abstract

A speech recognition method improves modeling and recognition accuracy using hidden Markov models. During training, the method creates a senone tree for each state of each phoneme encountered in a data set of training words. All output distributions received for a selected state of a selected phoneme in the set of training words are clustered together in the root node of a senone tree. Each node of the tree, beginning with the root node, is divided into two nodes by asking linguistic questions about the phonemes immediately to the left and right of the central phoneme of a triphone. At a predetermined point, tree creation stops, leaving leaves that represent clustered output distributions known as senones. The senone trees allow every possible triphone to be mapped into a sequence of senones simply by traversing the senone trees associated with the central phoneme of the triphone. As a result, unseen triphones not encountered in the training data can be modeled with senones created from the triphones actually found in the training data.
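
The traversal described in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Node` and `Leaf` classes, the phoneme sets, the two example questions, the toy trees for the phoneme "k", and the choice of three states per phoneme are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# A linguistic question looks only at the left and right context phonemes.
Question = Callable[[str, str], bool]

@dataclass
class Leaf:
    senone_id: int  # identifier of a clustered output distribution (senone)

@dataclass
class Node:
    question: Question
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]

def find_senone(tree: Union[Node, Leaf], left: str, right: str) -> int:
    """Walk one senone tree from the root to a leaf for a given context."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(left, right) else node.no
    return node.senone_id

def senonic_mapping(trees: List[Union[Node, Leaf]], left: str, right: str) -> List[int]:
    """One senone per state of the central phoneme (one tree per state)."""
    return [find_senone(tree, left, right) for tree in trees]

# Hypothetical phoneme classes and questions, for illustration only.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "iy", "uw"}

def left_is_nasal(left: str, right: str) -> bool:
    return left in NASALS

def right_is_vowel(left: str, right: str) -> bool:
    return right in VOWELS

# Toy trees for an assumed three states of the central phoneme "k".
trees_k = [
    Node(left_is_nasal, Leaf(0), Node(right_is_vowel, Leaf(1), Leaf(2))),
    Node(right_is_vowel, Leaf(3), Leaf(4)),
    Node(right_is_vowel, Node(left_is_nasal, Leaf(5), Leaf(6)), Leaf(7)),
]

# The triphone (n, k, iy) need not have occurred in training:
print(senonic_mapping(trees_k, "n", "iy"))  # [0, 3, 5]
```

Because every question refers only to the neighboring phonemes, any left and right context reaches some leaf, which is why a triphone absent from the training data still receives a senone for each state.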

30 Claims

1. A method of performing speech recognition using a vocabulary of words characterized by one or more triphones for each word, each triphone including a central phoneme, a left phoneme immediately preceding the central phoneme, and a right phoneme immediately following the central phoneme, the method comprising:
- receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a triphone encountered in one of the training words; and
  
  creating a plurality of senone trees for each successive phoneme of the vocabulary by:
  
  selecting the phoneme;
  
  for each successive state of the selected phoneme:
  
  selecting the state;
  
  creating a senone tree for the selected state of the selected phoneme, the tree having a plurality of levels with one or more nodes at each level, the senone tree being created by:
  
  grouping together in a root node all received output distributions associated with the selected state of triphones that include the selected phoneme as their central phoneme; and
  
  dividing each node into a plurality of nodes according to linguistic questions regarding the left and right phonemes of the triphones associated with the output distributions that are grouped in the root node such that each node represents a group of similar output distributions, and continuing the dividing of each node until a stop condition is met at which the nodes are leaf nodes associated with one or more output distributions.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method according to claim 1 wherein each leaf node is associated with a plurality of output distributions, the method further comprising:
    - creating a senone for each leaf node by combining the output distributions for the leaf node;
      
      receiving a phonetic transcription of an unseen triphone not found in the set of training words, the unseen triphone including a central phoneme together with phonemes positioned immediately adjacent the central phoneme;
      
      traversing the senone trees of the states of the central phoneme in the unseen triphone; and
      
      determining which senone of each senone tree traversed is appropriate for the unseen triphone based on the phonemes positioned immediately adjacent the identified phoneme, the senones forming a senonic mapping of the unseen triphone.
  - 3. The method according to claim 2, further comprising:
    - storing a separate acoustic monophone model for each phoneme of the vocabulary, the monophone model for each phoneme being created without regard for any information regarding adjacent phonemes;
      
      receiving an acoustic representation of a target word to be recognized, the acoustic representation including a sequence of codewords each representing an output distribution;
      
      comparing each of the phoneme monophone models with a target codeword of the sequence of codewords of the target word;
      
      computing a monophone probability score for each phoneme monophone model based on the comparing step; and
      
      determining which of the phoneme monophone models most closely match the target codeword, the phonemes associated with the most closely matching phoneme monophone models being best matching phonemes.
  - 4. The method according to claim 3, further comprising:
    - comparing the target codeword of the target word with the corresponding senones of the senone trees of the best matching phonemes; and
      
      computing a triphone probability score for each best matching phoneme based on the step of comparing the codeword with corresponding senones.
  - 5. The method according to claim 4, further comprising:
    - updating a word probability score for each word in the vocabulary by using the monophone probability score for the central phoneme of a triphone if the central phoneme is not one of the best matching phonemes and using the triphone probability score for the central phoneme if the central phoneme is one of the best matching phonemes;
      
      repeating the updating step for each codeword of the target word sequence of codewords;
      
      selecting as a best matching word, the vocabulary word having the highest word probability score; and
      
      outputting the best matching word.
  - 6. The method according to claim 1 wherein the creating a senone tree step includes forming a composite question for the root node from a plurality of linguistic questions.
  - 7. The method according to claim 6 wherein the forming a composite question step includes:
    - combining the leaf nodes of the senone tree into two clusters;
      
      selecting one of the two clusters;
      
      determining paths in the senone tree from the selected node to the leaf nodes of the selected cluster;
      
      conjoining the questions for each path from the root node to the selected cluster; and
      
      disjoining the conjoined questions.
  - 8. The method according to claim 1 wherein each node has an entropy that reflects the randomness of the output distributions of the node and wherein the step of dividing each node into a plurality of nodes includes:
    - calculating an entropy reduction value for each linguistic question of a set of linguistic questions, the entropy reduction value reflecting how much the linguistic question reduces the entropy of the output distributions of the node;
      
      determining which of the linguistic questions produces the largest entropy reduction value; and
      
      dividing the node into a plurality of nodes based on the linguistic question determined to produce the largest entropy reduction value.
  - 9. The method according to claim 8 wherein the step of calculating an entropy reduction value for each linguistic question includes:
    - calculating a weighted state entropy reduction value for each state of the selected phoneme; and
      
      summing the weighted state entropy reduction values to obtain the entropy reduction value for the selected state.
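
The question selection in claim 8 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: each output distribution is represented as a `Counter` of codeword counts, a node's entropy is weighted by its total count, only a single state is considered (claim 9 sums weighted reductions over states), and the helper names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a discrete distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)

def weighted_entropy(items):
    """Entropy of the pooled distribution of a node, weighted by its count."""
    pooled = Counter()
    for _left, _right, counts in items:
        pooled.update(counts)  # Counter.update adds counts together
    n = sum(pooled.values())
    return n * entropy(pooled) if n else 0.0

def entropy_reduction(items, question):
    """How much a yes/no question lowers the weighted entropy of a node."""
    yes = [it for it in items if question(it[0], it[1])]
    no = [it for it in items if not question(it[0], it[1])]
    if not yes or not no:
        return 0.0  # the question does not split this node
    return weighted_entropy(items) - weighted_entropy(yes) - weighted_entropy(no)

def best_question(items, questions):
    """Name of the question with the largest entropy reduction."""
    return max(questions, key=lambda name: entropy_reduction(items, questions[name]))

# Toy node: (left phoneme, right phoneme, codeword counts) per triphone state.
items = [
    ("n", "iy", Counter({"A": 8})),
    ("m", "aa", Counter({"A": 8})),
    ("s", "iy", Counter({"B": 8})),
    ("t", "aa", Counter({"B": 8})),
]
questions = {
    "left_is_nasal": lambda left, right: left in {"m", "n", "ng"},
    "right_is_iy": lambda left, right: right == "iy",
}
print(best_question(items, questions))  # left_is_nasal
```

In the toy data, asking about the left context separates the two codeword patterns completely, while asking about the right context leaves both children as mixed as the parent, so the left-context question is chosen for the split.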

10. A computer-implemented method of performing speech recognition for a vocabulary of words characterized by one or more phonemes for each word, comprising:
- receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a phoneme of one of the training words; and
  
  creating a separate senone tree for each state of each phoneme of the training words, each senone tree having a root node and a plurality of leaf nodes, the creating step including:
  
  selecting successively each one of the phonemes;
  
  selecting successively each one of the states of the selected phoneme;
  
  grouping together in the root node of the senone tree of the selected state all received output distributions associated with the selected state; and
  
  distinguishing the leaf nodes from each other according to linguistic questions regarding phonemes adjacent the selected phoneme in one or more training words such that each leaf node represents one or more of the output distributions grouped in the root node.
- Dependent claims (11, 12, 13, 14, 15, 16)
- - 11. The method according to claim 10 wherein each leaf node of each senone tree represents a plurality of output distributions, the method further comprising creating a senone for each leaf node by combining the output distributions for the leaf node.
  - 12. The method according to claim 11, further including:
    - receiving an acoustic representation of a target word to be recognized, the acoustic representation including a sequence of codewords each representing an output distribution;
      
      creating a triphone model for each of a plurality of triphones by traversing the senone trees corresponding to states of the central phonemes of the plurality of triphones, the traversing step identifying a senone for each state of each central phoneme of the plurality of triphones;
      
      comparing the codewords of the target word with corresponding senones of the plurality of triphone models; and
      
      identifying the triphone model whose senones most closely match the codewords of the target word.
  - 13. The method according to claim 11, further comprising:
    - storing a separate acoustic monophone model for each phoneme of the vocabulary, the monophone model for a phoneme being created without regard for any information regarding other phonemes;
      
      receiving an acoustic representation of a target word to be recognized, the acoustic representation including a sequence of codewords each representing an output distribution;
      
      comparing each of the phoneme monophone models with the acoustic representation of the target word;
      
      computing a monophone probability score for each phoneme monophone model based on the comparing step; and
      
      determining which of the phoneme monophone models most closely matches the acoustic representations of the target word, the phonemes associated with the most closely matching phoneme monophone models being identified as being most closely matching phonemes.
  - 14. The method according to claim 13, further comprising:
    - traversing the senone trees created for the states of each of the most closely matching phonemes, the traversing step identifying a senone for each state of the most closely matching phoneme;
      
      comparing the codewords of the target word with corresponding senones of the senone trees of the most closely matching phoneme; and
      
      computing a triphone probability score for each of the most closely matching phonemes based on the step of comparing the codewords with corresponding senones.
  - 15. The method according to claim 14, further comprising:
    - computing a word probability score for each of a plurality of words in the vocabulary by using the monophone probability score for each phoneme of the word if the phoneme is not one of the best matching phonemes and the triphone probability score for the phoneme if the phoneme is one of the best matching phonemes;
      
      selecting, as a best matching word, the training word having the highest word probability score; and
      
      outputting the best matching word.
  - 16. The method according to claim 10 wherein the distinguishing step includes distinguishing the leaf nodes from each other according to linguistic questions regarding either a phoneme immediately preceding the selected phoneme or a phoneme immediately following the selected phoneme.
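
The two-pass scoring of claims 13 to 15 can be sketched roughly as below. The sketch makes strong simplifying assumptions that are not in the claims: each codeword is aligned to exactly one phoneme position of each word (a real recognizer would align states with a hidden Markov model), each triphone has a single senone distribution rather than one per state, word boundaries use a hypothetical "sil" context, and all probabilities, names, and the `top_n` shortlist size are invented for illustration.

```python
import math

def triphone_at(phones, t):
    """Triphone (left, center, right) at position t, padded with 'sil'."""
    left = phones[t - 1] if t > 0 else "sil"
    right = phones[t + 1] if t + 1 < len(phones) else "sil"
    return (left, phones[t], right)

def recognize(words, codewords, monophone_probs, senone_probs, top_n=2):
    """Score every word, using triphone scores only for shortlisted phonemes."""
    totals = {word: 0.0 for word in words}
    for t, codeword in enumerate(codewords):
        # Pass 1: cheap context-independent score for every phoneme.
        mono = {ph: math.log(dist[codeword]) for ph, dist in monophone_probs.items()}
        best = set(sorted(mono, key=mono.get, reverse=True)[:top_n])
        # Pass 2: context-dependent score only for the best matching phonemes.
        for word, phones in words.items():
            phoneme = phones[t]
            if phoneme in best:
                totals[word] += math.log(senone_probs[triphone_at(phones, t)][codeword])
            else:
                totals[word] += mono[phoneme]
    return max(totals, key=totals.get), totals

# Toy models over three codewords A, B, C (all values are made up).
monophone_probs = {
    "k": {"A": 0.7, "B": 0.2, "C": 0.1},
    "ae": {"A": 0.1, "B": 0.7, "C": 0.2},
    "t": {"A": 0.2, "B": 0.1, "C": 0.7},
    "p": {"A": 0.55, "B": 0.3, "C": 0.15},
}
senone_probs = {
    ("sil", "k", "ae"): {"A": 0.8, "B": 0.1, "C": 0.1},
    ("k", "ae", "t"): {"A": 0.1, "B": 0.8, "C": 0.1},
    ("k", "ae", "p"): {"A": 0.1, "B": 0.8, "C": 0.1},
    ("ae", "t", "sil"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("ae", "p", "sil"): {"A": 0.6, "B": 0.3, "C": 0.1},
}
words = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]}

best_word, totals = recognize(words, ["A", "B", "C"], monophone_probs, senone_probs)
print(best_word)  # cat
```

For the last codeword, "p" falls outside the shortlist, so "cap" is updated with the monophone score and the senone lookup for that triphone is skipped. Avoiding that lookup for poorly matching phonemes is the saving the shortlist is meant to provide.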

17. A computer-implemented method of performing speech recognition using a vocabulary of words having one or more phonemes for each word, comprising:
- receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a phoneme of one of the training words; and
  
  creating a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of nodes including leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, each leaf node indicating a senone representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
- Dependent claims (18, 19, 20)
- - 18. The method according to claim 17 wherein each leaf node of each senone tree represents a plurality of output distributions, the method further comprising:
    - creating a senone for each leaf node by combining the output distributions for the leaf node;
      
      creating a triphone model for each of a plurality of triphones by traversing the senone trees corresponding to states of the central phonemes of the plurality of triphones, the traversing step identifying a senone for each state of each central phoneme of the plurality of triphones;
      
      receiving an acoustic representation of a target word to be recognized, the acoustic representation including a sequence of codewords each representing an output distribution;
      
      comparing the codewords of the target word with corresponding senones of the plurality of triphone models; and
      
      identifying the triphone model whose senones most closely match the codewords of the target word.
  - 19. The method according to claim 17 wherein each leaf node of each senone tree represents a plurality of output distributions, the method further comprising:
    - creating a senone for each leaf node by combining the output distributions for the leaf node;
      
      storing a separate acoustic monophone model for each phoneme of the vocabulary, the monophone model for a phoneme being created without regard for any information regarding other phonemes;
      
      receiving an acoustic representation of a target word to be recognized, the acoustic representation including a sequence of codewords each representing an output distribution;
      
      comparing each of the phoneme monophone models with the acoustic representation of the target word;
      
      computing a monophone probability score for each phoneme monophone model based on the comparing step; and
      
      determining which of the phoneme monophone models most closely matches the acoustic representations of the target word, the phonemes associated with the most closely matching phoneme monophone models being identified as being most closely matching phonemes.
  - 20. The method according to claim 17 wherein the creating step includes creating a selected senone tree for a selected state of a selected phoneme, each non-leaf node of the selected senone tree corresponding to a linguistic question regarding either a phoneme immediately preceding the selected phoneme or a phoneme immediately following the selected phoneme.

21. A computer system for performing speech recognition using a vocabulary of words having one or more phonemes for each word, comprising:
- means for receiving a data set of output distributions based on a set of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a phoneme of one of the training words; and
  
  a trainer that creates a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of nodes including leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, each leaf node indicating a senone representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
- Dependent claims (22, 23)
- - 22. The computer system of claim 21 wherein the trainer creates a senone for each leaf node by combining the output distributions for the leaf node and creates a triphone model for each of a plurality of triphones by traversing the senone trees corresponding to states of the central phonemes of the plurality of triphones, the trainer identifying a senone for each state of each central phoneme of the plurality of triphones based on the traversal, the system further comprising:
    - means for receiving an acoustic representation of a target word to be recognized, the acoustic representation including a sequence of codewords each representing an output distribution;
      
      a recognizer that compares the codewords of the target word with corresponding senones of the plurality of triphone models and identifies the triphone model whose senones most closely match the codewords of the target word.
  - 23. The computer system of claim 21 wherein the trainer is structured to create a selected senone tree for a selected state of a selected phoneme, each non-leaf node of the selected senone tree corresponding to a linguistic question regarding either a phoneme immediately preceding the selected phoneme or a phoneme immediately following the selected phoneme.

24. A computer-readable storage medium including a data structure for use in speech recognition based on a vocabulary of words having one or more triphones for each word, each triphone including a central phoneme and phonemes positioned immediately adjacent the central phoneme, the data structure including a plurality of senone trees representing a data set of output distributions of training words spoken by training users, each output distribution being associated with one of a predetermined number of states of a triphone central phoneme of one of the training words, the plurality of senone trees including a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of nodes including leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, each leaf node indicating a senone representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
- Dependent claims (25, 26)
- - 25. The storage medium of claim 24, further including:
    - a triphone model for each triphone encountered in the training words, each triphone model including a senonic mapping that includes a senone for each state of the central phoneme of the triphone, the senones of the senonic mapping being obtained by traversing the senone trees for the central phoneme of the triphone, the triphone models being structured to enable a computer to recognize a spoken target word using the triphone models.
  - 26. The storage medium of claim 24 wherein a selected non-leaf node of a selected one of the senone trees for a selected phoneme corresponds to a linguistic question regarding either a phoneme immediately preceding the selected phoneme or a phoneme immediately following the selected phoneme.

27. A computer-readable storage medium including executable computer instructions for causing a computer to perform speech recognition, the storage medium comprising:
- a plurality of senone trees created using a data set of output distributions based on a set of training words spoken by training users, each training word including one or more phonemes, each phoneme including a predetermined plural number of states, the plurality of senone trees including a separate senone tree for each state of each phoneme of the training words, each senone tree having a plurality of leaf nodes, each leaf node indicating a senone representing one or more output distributions of the data set;
  
  computer instructions for causing the computer to detect an unseen triphone in a target word received by the computer, the unseen triphone being a triphone not encountered in one of the training words and including a central phoneme and left and right phonemes positioned immediately adjacent the central phoneme;
  
  computer instructions for causing the computer to traverse the senone trees for the central phoneme of the unseen triphone and thereby obtain a senone for each state of the central phoneme; and
  
  computer instructions for causing the computer to use the senones obtained for the central phoneme to create a triphone model for the unseen triphone, such that the triphone model can be used for future recognition of spoken words that include the unseen triphone.
- Dependent claims (28, 29, 30)
- - 28. The storage medium of claim 27 wherein each training word has one or more triphones each including a central phoneme and phonemes positioned immediately adjacent the central phoneme, each senone tree having a plurality of nodes including the leaf nodes and non-leaf nodes, the non-leaf nodes including a root node, each non-leaf node corresponding to a linguistic question regarding phoneme context of the phoneme and having branches that correspond to answers to the linguistic question, the senone for each leaf node representing output distributions corresponding to the answers represented by the branches taken from the root node to the leaf node.
  - 29. The storage medium of claim 28 wherein the computer instructions for causing the computer to traverse the senone trees include computer instructions for causing the computer to traverse each senone tree by determining answers to the linguistic questions corresponding to non-leaf nodes of the senone tree based on the left and right phonemes of the unseen triphone and wherein the triphone model for the unseen triphone includes a sequence of the senones obtained by traversing the senone trees corresponding to the central phoneme of the unseen triphone.
  - 30. The storage medium of claim 27, further including:
    - a set of phoneme monophone models, each monophone model representing a phoneme of the vocabulary without regard for any contextual information concerning phonemes adjacent the phoneme in any of the words of the vocabulary;
      
      computer instructions for causing the computer to receive an acoustic representation of the target word;
      
      computer instructions for causing the computer to compare the acoustic representation of the target word with each of the monophone models and determine a subset of the set of monophone models by determining which of the monophone models most closely match the acoustic representation, the phonemes corresponding to the monophone models in the subset being best matching phonemes;
      
      computer instructions for causing the computer to compare the acoustic representation only with the triphone models that represent the best matching phonemes;
      
      computer instructions for causing the computer to update a word probability score for each word in the vocabulary, the word probability scores for words that include one of the best matching phonemes being updated based on the comparison of the acoustic representation with triphone models, and the word probability scores for words that do not include one of the best matching phonemes being updated based on the comparison of the acoustic representation with the monophone models; and
      
      computer instructions for causing the computer to select, as a best matching word, the vocabulary word with the best word probability score.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Hwang, Mei-Yuh; Huang, Xuedong; Alleva, Fileno A.
Primary Examiner(s): Dorvil, Richemond
Application Number: US08/850,061
Time in Patent Office: 466 Days
Field of Search: 704/7, 704/9, 704/10, 704/242, 704/249, 704/250, 704/257, 704/256, 704/255
US Class Current: 704/255
CPC Class Codes:
G10L 15/146   with insufficient amount of...
G10L 15/187   Phonemic context, e.g. pron...
G10L 2015/022   Demisyllables, biphones or ...
G10L 2015/025   Phonemes, fenemes or fenone...
G10L 2015/0631   Creating reference template...
