Apparatus for reducing spurious insertions in speech recognition
Abstract
Techniques for improving an automatic baseform generation system. More particularly, the invention provides techniques for reducing insertion of spurious speech events in a word or phone sequence generated by an automatic baseform generation system. Such automatic baseform generation techniques may be accomplished by enhancing the scores of long-lasting speech events with respect to the scores of short-lasting events. For example, this may be achieved by merging competing candidates that relate to the same speech event (e.g., phone or word) and that overlap in time into a single candidate, the score of which may be equal to the sum of the scores of the merged candidates.
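The merging step described above can be illustrated with a minimal sketch. It is not the patented implementation: the `Candidate` type, the frame-index representation of time, and the use of non-negative linear-domain scores (so that adding them accumulates evidence) are all assumptions made for illustration. Candidates that carry the same label and overlap in time are folded into one candidate whose score is the sum of the merged scores.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate record; field names are illustrative only."""
    label: str    # speech event, e.g. a subphone unit label
    start: int    # first frame covered (inclusive)
    end: int      # last frame covered (inclusive)
    score: float  # assumed non-negative, linear-domain score


def merge_overlapping(candidates):
    """Merge same-label candidates that overlap in time, summing their scores."""
    merged = []
    current = {}  # label -> most recent merged candidate for that label
    for c in sorted(candidates, key=lambda c: (c.label, c.start)):
        last = current.get(c.label)
        if last is not None and c.start <= last.end:
            # Same speech event and overlapping span: extend and accumulate.
            last.end = max(last.end, c.end)
            last.score += c.score
        else:
            # Start a new merged candidate (copy, so inputs are not mutated).
            last = Candidate(c.label, c.start, c.end, c.score)
            current[c.label] = last
            merged.append(last)
    return sorted(merged, key=lambda c: (c.start, c.label))


candidates = [
    Candidate("AA_1", 0, 9, 0.20),
    Candidate("AA_1", 2, 11, 0.25),
    Candidate("AA_1", 4, 12, 0.15),
    Candidate("T_1", 5, 6, 0.40),  # short-lived competitor
]
merged = merge_overlapping(candidates)
best = max(merged, key=lambda c: c.score)
print(best.label, best.start, best.end)
```

In this toy input, the short-lived `T_1` candidate has the highest individual score, but after merging the three overlapping `AA_1` candidates the longer-lasting event scores higher. That is the effect the abstract describes: long-lasting events are favored over short ones that would otherwise appear as spurious insertions.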
13 Citations
11 Claims
1. Apparatus for automatically generating a phonetic baseform from a spoken utterance, the apparatus comprising:
at least one storage medium; and
at least one processor coupled to the at least one storage medium and operative to (i) obtain a stream of acoustic observations representing the spoken utterance;
(ii) generate a sequence of subphone units representing candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
(iii) convert the sequence of subphone units into a phonetic baseform;
wherein the generating operation comprises merging candidate subphone units that relate to a same speech event and that overlap in time into a single candidate subphone unit, and associating with the single candidate subphone unit a score equal to a sum of scores associated with the merged candidate subphone units such that a score associated with a longer-lasting speech event is enhanced as compared with a score associated with a shorter-lasting speech event.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. An article of manufacture comprising at least one computer-readable storage medium encoded with one or more computer-executable programs which, when executed by at least one computer system, implement steps of:
obtaining a stream of acoustic observations representing a spoken utterance;
generating a sequence of subphone units representing candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
converting the sequence of subphone units into a phonetic baseform;
wherein the generating step comprises merging candidate subphone units that relate to a same speech event and that overlap in time into a single candidate subphone unit, and associating with the single candidate subphone unit a score equal to a sum of scores associated with the merged candidate subphone units such that a score associated with a longer-lasting speech event is enhanced as compared with a score associated with a shorter-lasting speech event.
11. A speech recognition system, comprising:
a speech recognition engine; and
a recognition lexicon associated with the speech recognition engine, the recognition lexicon including at least one phonetic baseform automatically generated by:
(i) obtaining a stream of acoustic observations representing a spoken utterance;
(ii) generating a sequence of subphone units representing candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations, wherein the generating comprises merging candidate subphone units that relate to a same speech event and that overlap in time into a single candidate subphone unit, and associating with the single candidate subphone unit a score equal to a sum of scores associated with the merged candidate subphone units such that a score associated with a longer-lasting speech event is enhanced as compared with a score associated with a shorter-lasting speech event; and
(iii) converting the sequence of subphone units into the at least one phonetic baseform.
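Step (iii) of the claims, converting the subphone sequence into a phonetic baseform, is not detailed in the text above. The sketch below shows one plausible reading under stated assumptions: subphone labels are assumed to follow a hypothetical `<PHONE>_<position>` naming convention (for example `K_1`, `K_2`, `K_3`), and consecutive subphones of the same phone are assumed to collapse into a single phone. The function name is illustrative.

```python
from itertools import groupby


def subphones_to_baseform(subphones):
    """Collapse a subphone label sequence into a phone sequence.

    Assumes labels look like "<PHONE>_<position>"; a label without an
    underscore is treated as a phone name as-is.
    """
    phones = [label.rsplit("_", 1)[0] for label in subphones]
    # Keep one phone per run of identical consecutive phone names.
    return [phone for phone, _ in groupby(phones)]


sequence = ["K_1", "K_2", "K_3", "AE_1", "AE_2", "AE_3", "T_1", "T_2", "T_3"]
print(subphones_to_baseform(sequence))  # ['K', 'AE', 'T']
```

A limitation of this simple collapse is that two genuinely adjacent occurrences of the same phone would be folded into one; a real system would need position information to tell them apart.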
Specification