Method and apparatus for estimating phone class probabilities a-posteriori using a decision tree
First Claim
1. A method of recognizing speech, comprising:
(a) inputting a set of training data comprising a plurality of records, each record of the training data comprising a sequence of 2K+1 feature vectors and a member of the class, each feature vector being represented by a label;
(b) forming a binary decision tree, the tree comprising a root node and a plurality of child nodes each associated with a binary question, the tree terminating in a plurality of terminal nodes, wherein the step of forming the tree comprises:
(i) for each index t in the sequence of feature vectors, wherein the index t refers to the tth label in the sequence of 2K+1 labels in a training record, dividing the labels at each of indexes t-K, . . . , t, . . . , t+K, into pairs of sets, respectively, wherein the labels at each of the indexes are divided so as to minimize entropy of the classes associated with the pairs of sets;
(ii) selecting from the pairs of sets a lowest entropy pair;
(iii) generating a binary question and assigning it to the node, wherein the question asks whether a label to be classified, occurring at index T corresponding to the index of the lowest entropy pair, is a member of the first set or the second set;
(c) partitioning the data at the current node into two child nodes in accordance with this question;
(d) repeating steps (b)(i)-(b)(iii) for each child node;
(e) for each child node, computing a probability distribution of the occurrence of the class members, given the members of the set of labels at that node;
inputting a sequence of speech to be recognized;
traversing the binary decision tree for every time frame of an input sequence of speech to determine a distribution of most likely phones for each time frame, the most likely phones for each time frame collectively forming a phone sequence;
outputting a recognition result based upon the distribution of most likely phones.
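Steps (b) through (e) above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's exact procedure: the patent's iterative binary-partitioning algorithm is replaced here by a simple greedy flip heuristic, and all names and data are hypothetical.

```python
from collections import Counter
from math import log2

def class_entropy(counts):
    """Entropy (bits) of a class-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def split_entropy(records, pos, left_set):
    """Average class entropy after asking: 'is the label at pos in left_set?'"""
    sides = [Counter(), Counter()]
    for labels, phone in records:
        sides[0 if labels[pos] in left_set else 1][phone] += 1
    n = len(records)
    return sum(sum(s.values()) / n * class_entropy(s) for s in sides)

def best_split_at_pos(records, pos):
    """Greedy stand-in for the iterative partitioning: move one label value
    at a time between the two sets while average entropy keeps dropping."""
    values = {labels[pos] for labels, _ in records}
    left = set(values)                  # degenerate start: everything on one side
    best = split_entropy(records, pos, left)
    improved = True
    while improved:
        improved = False
        for v in list(values):
            trial = left ^ {v}          # move v to the other side
            if trial and trial != values:
                h = split_entropy(records, pos, trial)
                if h < best:
                    left, best, improved = trial, h, True
    return left, best

def grow_tree(records, min_records=10):
    """Steps (b)-(e): pick the lowest-entropy (position, partition) pair,
    split the data, recurse; leaves store a phone probability distribution."""
    counts = Counter(p for _, p in records)
    if len(records) < min_records or len(counts) == 1:
        total = sum(counts.values())
        return {'leaf': {ph: c / total for ph, c in counts.items()}}
    n_pos = len(records[0][0])
    pos, (left, h) = min(((t, best_split_at_pos(records, t)) for t in range(n_pos)),
                         key=lambda x: x[1][1])
    if h >= class_entropy(counts):      # no reduction in uncertainty: stop
        total = sum(counts.values())
        return {'leaf': {ph: c / total for ph, c in counts.items()}}
    yes = [r for r in records if r[0][pos] in left]
    no = [r for r in records if r[0][pos] not in left]
    return {'pos': pos, 'left_set': left,
            'yes': grow_tree(yes, min_records),
            'no': grow_tree(no, min_records)}

# Toy data (K = 1): the middle label fully determines the phone.
records = [((x, m, y), 'AA' if m == 0 else 'IY')
           for x in (0, 1) for y in (0, 1) for m in (0, 1) for _ in range(5)]
tree = grow_tree(records, min_records=2)    # splits on position 1
```

With the toy data above, the tree asks one question about the middle label and reaches pure leaves, mirroring how each node's question is chosen to minimize class entropy.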
Abstract
A method and apparatus for estimating the probability of phones, a-posteriori, in the context of not only the acoustic feature at that time, but also the acoustic features in the vicinity of the current time, and its use in cutting down the search-space in a speech recognition system. The method constructs and uses a decision tree, with the predictors of the decision tree being the vector-quantized acoustic feature vectors at the current time, and in the vicinity of the current time. The process starts with an enumeration of all (predictor, class) events in the training data at the root node, and successively partitions the data at a node according to the most informative split at that node. An iterative algorithm is used to design the binary partitioning. After the construction of the tree is completed, the probability distribution of the predicted class is stored at all of its terminal leaves. The decision tree is used during the decoding process by tracing a path down to one of its leaves, based on the answers to binary questions about the vector-quantized acoustic feature vector at the current time and its vicinity.
11 Claims
1. A method of recognizing speech (recited in full as the First Claim above). - View Dependent Claims (2, 3, 4)
5. A method for recognizing speech, comprising:
(a) providing a set of records, the records comprising 2K+1 labels lk, for k=-K . . . K, representing feature vectors at time t+k, where t is the current time, and a size of the label alphabet is i;
(b) mapping each of the labels lk to a class value at time t;
(c) generating a confusion matrix for each lk, each confusion matrix plotting a distribution of the class values for each value of the label lk;
(d) computing, based on the confusion matrices, a class distribution at a present node of a binary decision tree by summing probabilities of each class member pj for the i values that lk can take, and dividing the probabilities by the sum;
(e) finding a minimum entropy split of the i values for each lk based on the computed class distribution, comprising the steps of:
for each lk at times -K . . . K, splitting the set of values that lk can take into two sets, so as to minimize entropy of the class distribution conditioned on the split;
calculating the class distribution for each of the child nodes resulting from the split;
selecting the label lk that provides the largest reduction in uncertainty at that node, and assigning the split computed for lk as the binary question at that node;
partitioning the training data at the current node based on the minimum entropy split of the selected label lk at the current node;
repeating step (e) for each child node;
inputting a sequence of speech to be recognized;
traversing the binary decision tree for every time frame of an input sequence of speech to determine a distribution of most likely phones for each time frame, the most likely phones for each time frame collectively forming a phone sequence;
outputting a recognition result based upon the distribution of most likely phones.
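Steps (c) through (e) of claim 5 can be worked through numerically. The sketch below uses hypothetical co-occurrence counts for a single context label with three values and two phone classes; the variable names and the exhaustive list of candidate splits are illustrative, not the patent's notation.

```python
from collections import Counter
from math import log2

# Hypothetical confusion matrix (step (c)): rows are values of one context
# label lk, columns are phone classes, entries are co-occurrence counts.
confusion = {
    'v0': {'AA': 8, 'IY': 2},
    'v1': {'AA': 1, 'IY': 9},
    'v2': {'AA': 7, 'IY': 3},
}

def class_distribution(matrix):
    """Step (d): sum class counts over all label values, then normalize."""
    totals = Counter()
    for row in matrix.values():
        totals.update(row)
    n = sum(totals.values())
    return {p: c / n for p, c in totals.items()}

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def avg_entropy(matrix, left):
    """Average class entropy if the label values in `left` go to one child."""
    sides = [Counter(), Counter()]
    for v, row in matrix.items():
        sides[0 if v in left else 1].update(row)
    n = sum(sum(s.values()) for s in sides)
    h = 0.0
    for s in sides:
        m = sum(s.values())
        if m:
            h += m / n * entropy({p: c / m for p, c in s.items()})
    return h

# Step (e): for a 3-value alphabet we can simply enumerate the two-way splits.
splits = [{'v0'}, {'v1'}, {'v2'}, {'v0', 'v1'}, {'v0', 'v2'}, {'v1', 'v2'}]
best = min(splits, key=lambda s: avg_entropy(confusion, s))
```

Here the minimum-entropy split isolates 'v1' (the value most associated with 'IY'), and the conditional entropy after the split is strictly below the entropy of the unconditioned class distribution, i.e. the question reduces uncertainty.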
6. A method of recognizing speech, comprising:
inputting a plurality of words of training data;
training one or more binary decision trees to ask a maximally informative question at each node based upon preceding and subsequent contextual information in the training data, wherein each binary decision tree corresponds to a different time in a sequence of the training data;
traversing one of the decision trees for every time frame of an input sequence of speech to determine a distribution of most likely phones for every time frame, the most likely phones for each time frame collectively forming a phone sequence;
outputting a recognition result based upon the distribution of most likely phones. - View Dependent Claims (7, 8, 9)
10. A method for recognizing speech, comprising:
(a) entering a string of utterances constituting training data;
(b) converting the utterances of the training data to electrical signals;
(c) representing the electrical signals of the training data as prototype quantized feature vectors, one feature vector representing a given time frame;
(d) assigning to each prototype feature vector a class label associated with the prototype quantized feature vector;
(e) forming one or more binary decision trees for different times in the training data, each tree having a root node and a plurality of child nodes, comprising the steps of:
creating a set of training records comprising 2K+1 predictors, lk, and one predicted class, p, where the 2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, . . . , t, . . . , t+K, and the predicted class is a phone at time t in the training data;
computing an estimated joint distribution of predictors lk and phone p for 2K+1 predictors using the training data, wherein the predictors are feature vector labels at times t-K, . . . , t, . . . , t+K and p is the phone at time t;
storing the estimated joint distribution of lk and p and a corresponding phone distribution for each predictor lk at the root node;
computing the best partitioning of the distribution of predictor lk for each lk to minimize phone uncertainty at each node;
choosing the partitioned distribution of predictors lk having the lowest uncertainty and partitioning the training data into two child nodes based on the partitioning, each child node being assigned a phone distribution based on the training data at the child node;
(f) repeating step (e) for each child node if the amount of training data at the child node is greater than a threshold;
(g) inputting an utterance to be recognized;
(h) converting the utterance into an electrical signal;
(i) representing the electrical signal as a series of quantized feature vectors;
(j) matching the series of quantized feature vectors against the stored prototype feature vectors to determine a closest match and assigning an input label to each of the series of feature vectors corresponding to the label of the closest matching prototype feature vector;
(k) traversing the decision trees for each input label to determine a distribution of phone probabilities for each label.
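The decoding side, steps (i) through (k), can be sketched as nearest-prototype vector quantization followed by a walk down the tree of binary questions. The prototype vectors, the tree layout, and the phone distributions below are all hypothetical placeholders for illustration.

```python
import math

# Hypothetical 2-D prototype feature vectors with their labels (steps (c)-(d)).
prototypes = {'L0': (0.0, 0.0), 'L1': (1.0, 0.0), 'L2': (0.0, 1.0)}

def quantize(vec):
    """Step (j): assign the label of the closest-matching prototype
    (Euclidean distance)."""
    return min(prototypes, key=lambda lab: math.dist(vec, prototypes[lab]))

def traverse(node, context_labels):
    """Step (k): answer the binary question at each node ('is the label at
    this context position in the node's set?') down to a leaf, which holds
    a phone probability distribution. Node layout is illustrative."""
    while 'leaf' not in node:
        if context_labels[node['pos']] in node['left_set']:
            node = node['yes']
        else:
            node = node['no']
    return node['leaf']

# Usage: quantize each frame, then ask the tree about the 2K+1 labels
# centred on the current frame (K = 1 here, so a 3-label context window).
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 1.1)]
labels = [quantize(f) for f in frames]
tree = {'pos': 1, 'left_set': {'L1'},
        'yes': {'leaf': {'AA': 0.9, 'IY': 0.1}},
        'no':  {'leaf': {'AA': 0.2, 'IY': 0.8}}}
dist = traverse(tree, labels)
```

Each input frame thus yields a phone distribution rather than a hard decision, which is what lets the distributions cut down the search space during decoding.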
11. A method of recognizing speech using a decision tree for predicting a distribution of phone classes based on 2K+1 predictors, comprising:
creating a set of training records comprising 2K+1 predictors, lk, and one predicted class, p, where the 2K+1 predictors are feature vector labels at 2K+1 consecutive times t-K, . . . , t, . . . , t+K, and the predicted class is a phone at time t in the training data;
inputting a set of training labels l-N, . . . , l, . . . , l+N occurring at times t-N, . . . , t, . . . , t+N, and a phone class occurring at time t;
computing the joint probabilities of events (l_i^k, p), where the value of predictor l^k equals l_i and the phone class value is p;
computing the class distribution Pr(p = p_j) for the current node by

    \Pr(p = p_j) = \frac{\sum_i \Pr(l^k = l_i,\, p = p_j)}{\sum_{j'} \sum_i \Pr(l^k = l_i,\, p = p_{j'})};

computing an entropy H(p) of the class distribution by

    H(p) = -\sum_j \Pr(p = p_j) \log_2 \Pr(p = p_j);

partitioning each predictor l^k into two complementary sets SL_opt^k and \overline{SL}_opt^k so as to minimize average class entropy;
splitting the training data at a current node of the tree based on the partitioning SL_opt^k, \overline{SL}_opt^k, and computing the probabilities of the child nodes by

    \Pr(SL_{opt}^k) = \sum_{l_i \in SL_{opt}^k} \Pr(l^k = l_i), \qquad \Pr(\overline{SL}_{opt}^k) = 1 - \Pr(SL_{opt}^k);

computing the class distribution at the child nodes conditional on the splitting by

    \Pr(p = p_j \mid SL_{opt}^k) = \frac{\sum_{l_i \in SL_{opt}^k} \Pr(l^k = l_i,\, p = p_j)}{\Pr(SL_{opt}^k)};

computing the average entropy for the child nodes by

    H_{avg}^k = \Pr(SL_{opt}^k)\, H(p \mid SL_{opt}^k) + \Pr(\overline{SL}_{opt}^k)\, H(p \mid \overline{SL}_{opt}^k);

determining the reduction in class uncertainty for each predictor l^k by H(p) - H_{avg}^k;
selecting the predictor and partition leading to the maximum reduction in uncertainty;
partitioning the set of training labels based on the selected partition;
partitioning the labels at a child node in accordance with the foregoing steps;
inputting a sequence of speech to be recognized;
traversing the decision tree for every time frame of an input sequence of speech to determine a distribution of most likely phones for each time frame, the most likely phones for each time frame collectively forming a phone sequence;
outputting a recognition result based upon the distribution of most likely phones.
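The entropy quantities in claim 11 can be checked numerically. The sketch below uses hypothetical joint counts for a single predictor l^k over three label values and two phone classes, and a fixed split SL = {0, 2}; it simply evaluates H(p), H_avg^k, and the uncertainty reduction H(p) - H_avg^k for that split.

```python
from math import log2

# Hypothetical joint counts for one predictor l^k: value -> {phone: count}.
joint = {
    0: {'AA': 6, 'IY': 1},
    1: {'AA': 1, 'IY': 5},
    2: {'AA': 5, 'IY': 2},
}
N = sum(c for row in joint.values() for c in row.values())  # total events

def H(dist):
    """Entropy of a class distribution, in bits."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# Class distribution at the node: Pr(p = p_j), summed over label values.
totals = {}
for row in joint.values():
    for p, c in row.items():
        totals[p] = totals.get(p, 0) + c
node_dist = {p: c / N for p, c in totals.items()}

def side_counts(values):
    """Pool the joint counts over a set of label values."""
    out = {}
    for v in values:
        for p, c in joint[v].items():
            out[p] = out.get(p, 0) + c
    return out

# Candidate split SL vs its complement.
SL = {0, 2}
left, right = side_counts(SL), side_counts(set(joint) - SL)
pr_left = sum(left.values()) / N          # Pr(SL)
pr_right = 1 - pr_left                    # Pr(complement)
h_left = H({p: c / sum(left.values()) for p, c in left.items()})
h_right = H({p: c / sum(right.values()) for p, c in right.items()})

# H_avg = Pr(SL) H(p|SL) + Pr(~SL) H(p|~SL); reduction = H(p) - H_avg.
h_avg = pr_left * h_left + pr_right * h_right
reduction = H(node_dist) - h_avg
```

With these counts the split sends 70% of the data to one child and yields a positive reduction in class uncertainty, which is the quantity maximized when selecting the predictor and partition.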
Specification