Speech recognition method and apparatus utilizing multiple feature streams
Abstract
A method and apparatus is provided for using multiple feature streams in speech recognition. In the method and apparatus, a feature extractor generates at least two feature vectors for a segment of an input signal. A decoder then generates a path score that is indicative of the probability that a word is represented by the input signal. The path score is generated by selecting the best feature vector to use for each segment. For each segment, the corresponding part in the path score for that segment is based in part on a chosen segment score that is selected from a group of at least two segment scores. The segment scores each represent a separate probability that a particular segment unit (e.g. senone, phoneme, diphone, triphone, or word) appears in that segment of the input signal. Although each segment score in the group relates to the same segment unit, the scores are based on different feature vectors for the segment.
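The per-segment selection described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the patent. Segment scores are treated as log-probabilities, so combining segments means adding them. The feature types are given the hypothetical names "mfcc" and "lpc". A single segment unit is assumed to be already hypothesized for each segment. Language-model contributions to the path score are left out.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each dict maps a feature type (e.g. "mfcc", "lpc") to the
# log-probability that the same segment unit appears in that segment, computed
# from the feature vector of that type.
SegmentScores = Dict[str, float]


def choose_segment_score(scores: SegmentScores) -> Tuple[str, float]:
    """Pick the single best segment score from the group for one segment."""
    best_feature = max(scores, key=scores.get)
    return best_feature, scores[best_feature]


def path_score(segments: List[SegmentScores]) -> Tuple[float, List[str]]:
    """Add chosen log scores across segments (a product of probabilities)."""
    total = 0.0
    chosen_features = []
    for scores in segments:
        feature, score = choose_segment_score(scores)
        total += score
        chosen_features.append(feature)
    return total, chosen_features


segments = [
    {"mfcc": -2.0, "lpc": -3.5},  # segment 1: the MFCC-based score wins
    {"mfcc": -4.0, "lpc": -1.5},  # segment 2: the LPC-based score wins
]
print(path_score(segments))  # (-3.5, ['mfcc', 'lpc'])
```

The chosen feature type can differ from one segment to the next. That difference is what separates this approach from concatenating both feature vectors into one large vector.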
114 Citations
33 Claims
1. A speech recognition system for identifying words from a series of digital values representing speech, the system comprising:
a first feature extractor for generating a first feature vector for a segment using a first type of feature of a portion of the series of digital values;
a second feature extractor for generating a second feature vector for the same segment as the first feature extractor using a second type of feature of a portion of the series of digital values;
a decoder capable of generating a path score that is indicative of the probability that a sequence of words is represented by the series of digital values, the path score being based in part on a single chosen segment score selected from a group of at least two segment scores wherein each segment score in the group represents a separate probability of a same segment unit appearing within a segment but wherein each segment score in the group is based on a different feature vector formed using a different type of feature for the same segment.
identifying a possible segment unit for a segment using a feature vector that is based on a first-pass feature type;
determining a group of segment scores for the possible segment unit using a plurality of feature vectors that are based on separate feature types;
selecting the best segment score and designating the segment score's associated feature type as the segment's feature type;
determining a revised segment unit for the segment using a feature vector that is based on the segment's feature type;
determining a group of segment scores for the revised segment unit using a plurality of feature vectors that are based on separate feature types; and
selecting the best segment score for the revised segment unit as the chosen segment score.
13. The speech recognition system of claim 12 wherein the decoder is further capable of designating the feature associated with the best segment score as the segment's feature and of again determining a revised segment unit, determining a group of segment scores, and selecting a best segment score.
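The identify, score, select, and revise steps above, repeated as claim 13 describes, can be sketched for a single segment as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The `unit_scores` table, the unit labels, and the feature names are hypothetical. Every unit is assumed to have a score under every feature type. The stopping rule is also an assumption, since the claims do not specify one: the loop stops when the designated feature no longer changes, or after `max_iters` rounds.

```python
from typing import Dict, Tuple

# Hypothetical table: unit_scores[feature_type][unit] is the log-probability of
# a segment unit (e.g. a senone) given this segment's feature vector of that type.
UnitScores = Dict[str, Dict[str, float]]


def best_unit(unit_scores: UnitScores, feature: str) -> str:
    """Identify the most likely segment unit using one feature type."""
    scores = unit_scores[feature]
    return max(scores, key=scores.get)


def best_feature_for(unit_scores: UnitScores, unit: str) -> Tuple[str, float]:
    """Score one unit with every feature type; return the best (feature, score)."""
    group = {f: table[unit] for f, table in unit_scores.items()}
    feature = max(group, key=group.get)
    return feature, group[feature]


def refine_segment(unit_scores: UnitScores, first_pass: str,
                   max_iters: int = 5) -> Tuple[str, str, float]:
    """Alternate unit identification and feature selection for one segment."""
    unit = best_unit(unit_scores, first_pass)             # possible segment unit
    feature, score = best_feature_for(unit_scores, unit)  # segment's feature
    for _ in range(max_iters):
        unit = best_unit(unit_scores, feature)            # revised segment unit
        new_feature, score = best_feature_for(unit_scores, unit)
        if new_feature == feature:  # assumed stopping rule: feature is stable
            break
        feature = new_feature       # designate the new feature and repeat
    return unit, feature, score


unit_scores = {
    "mfcc": {"aa": -1.0, "ae": -2.0},
    "plp": {"aa": -0.5, "ae": -0.2},
}
print(refine_segment(unit_scores, first_pass="mfcc"))  # ('ae', 'plp', -0.2)
```

In this example the first-pass feature proposes "aa". The "plp" stream scores "aa" higher than "mfcc" does, so "plp" becomes the segment's feature. Re-identifying the unit with "plp" then revises it to "ae".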
14. A method of speech recognition comprising:
extracting at least two feature vectors for a segment from a series of digital values, each feature vector being associated with a different type of feature; and
determining a path score that is indicative of the probability that a word is represented by the series of digital values through a method comprising:
using different feature vectors, each produced from different types of features for a same segment, to determine a group of segment scores that each represent a separate probability of a same segment unit appearing within a segment;
selecting one of the segment scores from the group as a chosen segment score; and
combining chosen segment scores from multiple segments to produce a path score for a word.
grouping path scores into feature groups based on the type of feature associated with the last segment score used to produce the respective path scores;
determining a highest path score within each group; and
pruning path scores in each group that are more than a beam width different from the highest path score in each group, each beam width being set independently for each group.
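The grouped pruning steps above can be sketched as follows. The `Hypothesis` record and its field names are hypothetical, and scores are assumed to be log path scores where higher is better. A path survives when its score lies within its own group's beam width of that group's best score.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hypothesis(NamedTuple):
    words: str         # partial word sequence (illustrative label only)
    score: float       # log path score; higher is better
    last_feature: str  # feature type behind the last segment score used


def prune_by_feature(hyps: List[Hypothesis],
                     beams: Dict[str, float]) -> List[Hypothesis]:
    """Group paths by last feature type and beam-prune each group separately."""
    groups: Dict[str, List[Hypothesis]] = defaultdict(list)
    for h in hyps:
        groups[h.last_feature].append(h)
    survivors: List[Hypothesis] = []
    for feature, group in groups.items():
        best = max(h.score for h in group)
        beam = beams[feature]  # beam width set independently per group
        survivors.extend(h for h in group if best - h.score <= beam)
    return survivors


hyps = [
    Hypothesis("a", -1.0, "mfcc"), Hypothesis("b", -4.0, "mfcc"),
    Hypothesis("c", -2.0, "lpc"), Hypothesis("d", -6.0, "lpc"),
]
kept = prune_by_feature(hyps, {"mfcc": 2.0, "lpc": 5.0})
print([h.words for h in kept])  # ['a', 'c', 'd']
```

One plausible reason for separate beams is that scores from different feature streams need not share a scale. A single global beam could then discard every path from one stream.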
18. The method of claim 15 wherein determining multiple path scores comprises:
selecting a first segment score from the group as a chosen segment score for a first path score; and
selecting a second segment score from the group as a chosen segment score for a second path score.
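Choosing different segment scores from the same group for different path scores amounts to branching each path once per feature stream. Here is a minimal sketch, assuming log scores and a hypothetical (score, features-so-far) tuple for each path.

```python
from typing import Dict, List, Tuple

# A path is (log score, feature types chosen so far); a hypothetical format.
Path = Tuple[float, Tuple[str, ...]]


def extend_paths(paths: List[Path], group: Dict[str, float]) -> List[Path]:
    """Extend every path once per segment score in the group."""
    extended: List[Path] = []
    for score, features in paths:
        for feature, seg_score in group.items():
            extended.append((score + seg_score, features + (feature,)))
    return extended


paths: List[Path] = [(0.0, ())]
paths = extend_paths(paths, {"mfcc": -1.0, "lpc": -2.0})
paths = extend_paths(paths, {"mfcc": -0.5, "lpc": -0.25})
for p in sorted(paths, reverse=True):
    print(p)
```

The number of paths grows by a factor equal to the number of feature streams at every segment. In practice, this kind of branching would be combined with pruning.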
19. The method of claim 14 wherein combining segment scores from multiple segments comprises combining chosen segment scores from two adjacent segments and wherein one of the two chosen segment scores is based on a feature vector that is extracted using a first feature and the other of the two chosen segment scores is based on a feature vector that is extracted using a second feature.
20. The method of claim 19 wherein combining the segment scores comprises reducing the path score if the path score includes two chosen segment scores from two adjacent segments that are based on feature vectors extracted using different features.
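Reducing the path score when adjacent segments use different feature streams can be folded into a Viterbi-style search over feature types. The sketch below does this. The fixed `switch_penalty` value and the log-score convention are assumptions; the claim only states that the path score is reduced.

```python
from typing import Dict, List, Tuple


def best_feature_sequence(segments: List[Dict[str, float]],
                          switch_penalty: float) -> Tuple[float, List[str]]:
    """Viterbi search over feature types with a penalty for switching streams."""
    # best[f] = (score, sequence) of the best path ending with feature type f
    best = {f: (s, [f]) for f, s in segments[0].items()}
    for scores in segments[1:]:
        new_best = {}
        for f, seg_score in scores.items():
            candidates = []
            for prev_f, (prev_score, seq) in best.items():
                penalty = switch_penalty if prev_f != f else 0.0
                candidates.append((prev_score - penalty + seg_score, seq + [f]))
            new_best[f] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


segments = [{"mfcc": -1.0, "lpc": -1.75}, {"mfcc": -2.0, "lpc": -1.5}]
print(best_feature_sequence(segments, switch_penalty=0.0))  # (-2.5, ['mfcc', 'lpc'])
print(best_feature_sequence(segments, switch_penalty=1.0))  # (-3.0, ['mfcc', 'mfcc'])
```

Without a penalty, the best path switches streams at the second segment. A penalty of 1.0 makes staying with "mfcc" the better choice.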
21. The method of claim 20 wherein the segment represents a phoneme.
22. The method of claim 20 wherein the segment represents a word.
23. The method of claim 14 wherein extracting at least two feature vectors comprises extracting a first feature vector using a first feature extraction technique with a first set of parameters and extracting a second feature vector using the first feature extraction technique with a second set of parameters.
24. The method of claim 23 wherein the first set of parameters and the second set of parameters comprise a sampling window size.
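Claims 23 and 24 describe one extraction technique run with two parameter sets that differ in sampling window size. The sketch below illustrates this. Log frame energy is a deliberately trivial stand-in for a real technique such as MFCC, so each "feature vector" here is one-dimensional. The 16 kHz rate, the 25 ms and 50 ms windows, and the 10 ms frame grid are illustrative assumptions.

```python
import math
from typing import List


def log_energy_features(samples: List[float], centers: List[int],
                        window: int) -> List[float]:
    """Log frame energy, using a window of `window` samples centred on each frame."""
    half = window // 2
    features = []
    for c in centers:
        frame = samples[max(0, c - half):min(len(samples), c + half)]
        energy = sum(x * x for x in frame) / max(len(frame), 1)
        features.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return features


# One technique, two parameter sets: 25 ms and 50 ms windows at 16 kHz, both
# centred on the same 10 ms frame grid so each pair describes the same frame.
rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
centers = list(range(0, len(samples), rate // 100))
short_window = log_energy_features(samples, centers, window=rate * 25 // 1000)
long_window = log_energy_features(samples, centers, window=rate * 50 // 1000)
```

Both windows are centred on the same frame positions. This ensures that the two feature vectors for a frame describe the same stretch of speech, which is what allows their segment scores to be compared.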
25. The method of claim 14 wherein using different feature vectors of a segment to determine a group of segment scores and selecting one of the segment scores as a chosen segment score comprise:
identifying a possible segment unit for a segment using a feature vector that is based on a first-pass feature;
determining a group of segment scores for the possible segment unit using a plurality of feature vectors that are based on separate features;
selecting the best segment score and designating the segment score's associated feature as the segment's feature;
determining a revised segment unit for the segment using a feature vector that is based on the segment's feature;
determining a group of segment scores for the revised segment unit using a plurality of feature vectors that are based on separate features; and
selecting the best segment score for the revised segment unit as the chosen segment score.
26. The method of claim 25 further comprising designating the feature associated with the best segment score as the segment's feature and again determining a revised segment unit, determining a group of segment scores, and selecting a best segment score.
27. A computer-readable medium having computer-executable instructions for performing steps comprising:
receiving a digital signal representative of an input speech signal;
extracting at least two feature vectors for a frame of the digital signal; and
determining a path score that is indicative of the probability that a word is represented by the digital signal through steps comprising:
using different feature vectors of a frame to determine a group of segment scores that each represent a separate probability of a same segment unit appearing within a segment;
selecting one of the segment scores from the group as a chosen segment score; and
combining chosen segment scores from multiple segments to produce a path score for a word.
28. An apparatus for converting a speech signal into text, the apparatus comprising:
a first speech recognition system having a feature extractor, an acoustic model for each linguistic unit of a language, and a language model and being capable of generating a first word score from a first portion of the speech signal;
a second speech recognition system having a feature extractor, an acoustic model for each linguistic unit of a language, and a language model and being capable of generating a second word score from a second portion of the speech signal, the second speech recognition system being different from the first speech recognition system; and
a decoder capable of combining the first word score and the second word score to form a hypothesis path score and further capable of selecting a single path score from a group of hypothesis path scores to identify the text.
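The decoder in claim 28 combines word scores from two different recognition systems into hypothesis path scores and keeps one. The following sketch illustrates that step under stated assumptions. The recognizer labels "A" and "B", the word outputs, and the scores are hypothetical. Each system is assumed to return exactly one word and one log word score per portion, and both systems' scores are assumed to be directly comparable, which in practice may require normalization.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical: outputs[i][system] = (word, log word score) that a recognizer
# produces for portion i of the speech signal.
Outputs = List[Dict[str, Tuple[str, float]]]


def best_hypothesis(outputs: Outputs) -> Tuple[float, List[str], Tuple[str, ...]]:
    """Form a path score for every assignment of systems to portions; keep the best."""
    if not outputs:
        raise ValueError("no portions to decode")
    systems = list(outputs[0])
    best = None
    for assignment in product(systems, repeat=len(outputs)):
        words = [outputs[i][s][0] for i, s in enumerate(assignment)]
        score = sum(outputs[i][s][1] for i, s in enumerate(assignment))
        if best is None or score > best[0]:
            best = (score, words, assignment)
    return best


outputs = [
    {"A": ("recognize", -3.0), "B": ("wreck a nice", -5.0)},
    {"A": ("peach", -4.0), "B": ("speech", -2.0)},
]
print(best_hypothesis(outputs))  # (-5.0, ['recognize', 'speech'], ('A', 'B'))
```

With independent portions, as here, the same result could be obtained by taking the best system per portion. The explicit enumeration only makes the "group of hypothesis path scores" visible. Its cost grows exponentially with the number of portions.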
Specification