Method and apparatus for generating decision tree questions for speech processing
Abstract
The present invention automatically builds question sets for a decision tree. Under the invention, mutual information is used to cluster tokens, which represent either phones or letters. Each cluster is formed so as to limit the loss of mutual information, over a set of training data, that forming the cluster causes. The resulting sets of clusters represent questions that can be used at the nodes of the decision tree.
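The mutual-information measure the abstract refers to can be illustrated with a short sketch. The following is one plausible reading, not the patent's actual implementation: it computes the average mutual information between adjacent tokens from bigram counts over a training sequence (all function and variable names are assumptions):

```python
from collections import Counter
from math import log2

def mutual_information(tokens):
    """Average mutual information of adjacent-token pairs in a sequence.

    tokens: a training sequence of tokens (phones or letters).
    Illustrative only; this is an assumed reading of the abstract.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)                     # bigram counts
    left_counts = Counter(p[0] for p in pairs)       # first-position counts
    right_counts = Counter(p[1] for p in pairs)      # second-position counts
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi
```

When tokens are grouped into clusters, the same computation runs over cluster labels instead of raw tokens; a merge that groups tokens with similar neighbor distributions loses little of this quantity.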
18 Citations
19 Claims
1. A computer-readable storage medium encoded with computer-executable instructions for causing a computer to perform steps comprising:

- forming a separate cluster of tokens for each possible token that can appear in training data;
- determining whether to combine a first cluster of tokens and a second cluster of tokens to form a new cluster of tokens using mutual information, wherein the mutual information is based on the number of times tokens from the new cluster of tokens appear next to tokens from another cluster of tokens in the training data;
- building a decision tree by utilizing at least one of the clusters of tokens to form a question for a node in the decision tree, the question asking whether a token in an input is found within the at least one cluster; and
- using the decision tree to identify a leaf node of the tree based on an input.

Dependent claims: 2, 3, 4, 5, 6, 7.
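The clustering steps above can be read as a greedy agglomerative procedure: start with one cluster per distinct token, then repeatedly merge the pair of clusters whose merge preserves the most mutual information in the adjacency (bigram) distribution. The sketch below illustrates that reading; it is not the patented implementation, and every name in it is an assumption:

```python
from collections import Counter
from itertools import combinations
from math import log2

def cluster_mi(sequence, cluster_of):
    """Mutual information of the cluster-label bigram distribution
    induced by mapping each token in the sequence to its cluster."""
    pairs = [(cluster_of[a], cluster_of[b]) for a, b in zip(sequence, sequence[1:])]
    n = len(pairs)
    pc = Counter(pairs)
    lc = Counter(p[0] for p in pairs)
    rc = Counter(p[1] for p in pairs)
    return sum((c / n) * log2((c / n) / ((lc[a] / n) * (rc[b] / n)))
               for (a, b), c in pc.items())

def greedy_merge(sequence, target_clusters):
    """Start with one cluster per distinct token; repeatedly merge the pair
    of clusters whose merge keeps the mutual information highest."""
    clusters = {t: {t} for t in set(sequence)}
    while len(clusters) > target_clusters:
        best = None
        for x, y in combinations(list(clusters), 2):
            # Evaluate MI if clusters x and y were merged.
            merged = {k: v for k, v in clusters.items() if k not in (x, y)}
            merged[x] = clusters[x] | clusters[y]
            label = {t: k for k, s in merged.items() for t in s}
            mi = cluster_mi(sequence, label)
            if best is None or mi > best[0]:
                best = (mi, x, y)
        _, x, y = best
        clusters[x] = clusters[x] | clusters.pop(y)
    return sorted(sorted(c) for c in clusters.values())
```

In the toy sequence `"axbxaxbx"`, the tokens `a` and `b` occur in identical neighbor contexts, so merging them costs no mutual information and the greedy step picks that merge first. The exhaustive pairwise search is O(k²) per merge; it is kept simple here for clarity.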
8. A method of forming a decision tree used in speech processing, the method comprising:
- grouping at least two tokens to form a first possible cluster;
- a processing unit determining a mutual information score based on the first possible cluster through steps comprising determining the number of times tokens from the first possible cluster appear next to tokens from a second cluster, the number of times tokens from the first possible cluster appear individually, and the number of times tokens from the second cluster appear individually;
- grouping at least two tokens to form a third possible cluster;
- the processing unit determining a mutual information score based on the third possible cluster through steps comprising determining the number of times tokens from the third possible cluster appear next to tokens from a fourth cluster, the number of times tokens from the third possible cluster appear individually, and the number of times tokens from the fourth cluster appear individually;
- the processing unit selecting one of the first cluster and the third cluster based on the mutual information scores associated with the first cluster and the third cluster;
- using the selected cluster to form a question in the decision tree used in speech processing; and
- storing the decision tree on a computer-readable storage medium for later use in speech processing.

Dependent claims: 9, 10, 11, 12, 13, 14.
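The scoring step in claim 8 combines exactly three quantities: how often tokens from the two clusters appear next to each other, and how often each cluster's tokens appear individually. A pointwise-MI-style score built from those counts, and a selection step that compares two candidate clusters by it, might look like this (an assumed formula for illustration, not the patent's exact computation):

```python
from math import log2

def pair_mi_score(pair_count, count_a, count_b, total):
    """Pointwise mutual-information-style score for a candidate cluster pair,
    computed from the adjacency count, each cluster's individual count, and
    the total number of adjacent positions. Illustrative formula only."""
    return log2((pair_count / total) / ((count_a / total) * (count_b / total)))

def select_cluster(candidates):
    """Pick the candidate cluster with the highest score.
    candidates: list of (cluster, pair_count, count_a, count_b, total)."""
    return max(candidates, key=lambda c: pair_mi_score(*c[1:]))[0]
```

A positive score means the two clusters co-occur more often than their individual frequencies would predict; comparing scores for the first and third possible clusters selects the grouping that best predicts its neighbors.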
15. A method of forming a decision tree for speech processing, the method comprising:
- identifying at least two possible clusters of tokens in a set of training data;
- a processing unit using co-occurrence frequency counts of clusters to select one of the at least two possible clusters, wherein the co-occurrence frequency counts comprise the number of times tokens from two clusters appear next to each other in the training data; and
- the processing unit storing the selected cluster on a computer-readable storage medium as a question for a node in the decision tree for speech processing, wherein the question asks whether an input token is found in the selected cluster.

Dependent claims: 16, 17, 18, 19.
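The question stored at each node is a set-membership test: is the input token in the selected cluster? A minimal sketch of such a tree and its traversal follows; the tree contents and leaf labels below are invented for illustration and are not from the patent:

```python
class Node:
    """A decision-tree node whose question is cluster membership:
    'is the input token in this node's cluster?'"""
    def __init__(self, cluster=None, yes=None, no=None, leaf=None):
        self.cluster = cluster  # set of tokens forming the question
        self.yes = yes          # child followed when the token is in the cluster
        self.no = no            # child followed otherwise
        self.leaf = leaf        # non-None at leaf nodes

def classify(node, token):
    """Walk the tree to a leaf by answering each node's membership question."""
    while node.leaf is None:
        node = node.yes if token in node.cluster else node.no
    return node.leaf

# Hypothetical toy tree: split vowels from consonants, then isolate stops.
tree = Node(cluster={"a", "e", "i", "o", "u"},
            yes=Node(leaf="vowel-model"),
            no=Node(cluster={"p", "t", "k"},
                    yes=Node(leaf="stop-model"),
                    no=Node(leaf="other-model")))
```

Each leaf would then index whatever the speech-processing system associates with that context class (for example, an acoustic model or a pronunciation).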
Specification