System and method for automatic speech to text conversion
Abstract
Speech recognition is performed in near real time and improved by exploiting events and event sequences, by employing machine learning techniques including boosted classifiers, ensembles, detectors, and cascades, and by using perceptual clusters. Speech recognition is also improved using tandem processing. An automatic punctuator injects punctuation into recognized text streams.
Claims (19)
1. A system for recognizing speech that corresponds to a digital speech signal, the system comprising:
- a speech recognition engine that has access to a training corpus of known-class digitized speech utterances, a plurality of weak classifiers, wherein each weak classifier comprises a decision function for determining the presence of an event within the training corpus, and an ensemble detector comprising a plurality of the weak classifiers that together are better at determining the presence of a speech signal event than any of the constituent weak classifiers;
- wherein the speech recognition engine comprises an event extractor for extracting speech signal events and patterns of the speech signal events from the digital speech signal, wherein the speech signal events and patterns of the speech signal events are relevant in speech recognition;
- wherein the speech recognition engine comprises at least one processor that is configured to perform a plurality of operations, wherein the plurality of operations comprise:
  - detecting locations of relevant speech signal events in the digital speech signal, wherein each of the speech signal events comprises spectral information and temporal information,
  - capturing spectral features of and temporal relationships between all of the speech signal events,
  - segmenting the digital speech signal based on the detected locations of the detected speech signal events,
  - analyzing the segmented digital speech signal, wherein the analysis is synchronized with the speech signal events,
  - detecting patterns in the digital speech signal with the captured spectral information, the temporal relationships, and the analyzed digital speech signal,
  - providing a list of perceptual alternatives for recognized speech data that corresponds to the detected patterns in the digital speech signal, and
  - disambiguating between the perceptual alternatives for the recognized speech data based on the analysis of one or more of the speech signal events to improve the recognized speech data;
- wherein the at least one processor is configured to perform one or more of the operations using the ensemble detector; and
- a module coupled to the speech recognition engine, wherein the module is configured to output the improved recognized speech data.

Dependent claims: 2-14.
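Claim 1's ensemble detector combines weak classifiers, each a decision function for the presence of an event, into a detector that outperforms any constituent classifier. A minimal sketch, assuming decision-stump weak classifiers and a weighted majority vote in the style of boosting (the class names, stump form, and weights are illustrative assumptions; the claim does not fix a particular combination rule):

```python
class WeakClassifier:
    """A decision stump: thresholds one spectral/temporal feature value."""

    def __init__(self, feature_index, threshold, polarity=1):
        self.feature_index = feature_index
        self.threshold = threshold
        self.polarity = polarity

    def predict(self, features):
        # Return +1 if this stump judges the event present, else -1.
        value = features[self.feature_index]
        return self.polarity if value > self.threshold else -self.polarity


class EnsembleDetector:
    """Weighted vote over weak classifiers (AdaBoost-style combination)."""

    def __init__(self, classifiers, weights):
        self.classifiers = classifiers
        self.weights = weights

    def detect(self, features):
        # The event is declared present when the weighted votes sum positive.
        score = sum(w * c.predict(features)
                    for c, w in zip(self.classifiers, self.weights))
        return score > 0.0
```

In a trained boosted ensemble the weights would come from each stump's error on the training corpus; here they are fixed constants for illustration.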
15. A method of speech recognition comprising the steps of:
- accessing a plurality of weak classifiers, wherein each weak classifier comprises a decision function for determining the presence of an event within a training corpus of known-class digitized speech utterances, and an ensemble detector comprising a plurality of the weak classifiers that together are better at determining the presence of a speech signal event than any of the constituent weak classifiers;
- receiving a speech signal;
- digitizing the received speech signal;
- detecting locations of relevant speech signal events in the received and digitized speech signal, wherein each of the relevant speech signal events comprises spectral information and temporal information;
- capturing spectral features of and temporal relationships between all of the speech signal events;
- segmenting the received and digitized speech signal based on the detected locations of the speech signal events;
- analyzing the segmented, received and digitized speech signal, wherein the analysis is synchronized with the speech signal events;
- detecting patterns in the digitized speech signal with the captured spectral information, the temporal relationships, and the analyzed speech signal;
- recognizing speech data that corresponds with the analyzed digitized speech signal, wherein the step of recognizing the speech data comprises the steps of:
  - providing a list of perceptual alternatives for the recognized speech data that corresponds to the detected patterns in the digitized speech signal, and
  - disambiguating between the perceptual alternatives for the recognized speech data based on the analysis of one or more of the speech signal events to improve the recognized speech data; and
- outputting the improved speech data.

Dependent claims: 16-18.
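The detecting and segmenting steps of claim 15 can be sketched as follows, assuming a simple upward threshold crossing stands in for the event detector (the claim does not specify a detection function) and that segments are cut at the detected event locations so that later analysis stays synchronized with the events:

```python
def detect_event_locations(signal, threshold):
    """Return sample indices where the signal crosses the threshold upward.

    A crude stand-in for the claimed event detector; real events would
    carry spectral as well as temporal information.
    """
    locations = []
    for i in range(1, len(signal)):
        if signal[i - 1] <= threshold < signal[i]:
            locations.append(i)
    return locations


def segment_at_events(signal, locations):
    """Split the signal into event-synchronized segments."""
    bounds = [0] + locations + [len(signal)]
    return [signal[a:b] for a, b in zip(bounds, bounds[1:])]
```

For example, an energy envelope `[0, 0, 2, 3, 0, 0, 4, 5, 0]` with threshold 1 yields event locations at samples 2 and 6, and three segments cut at those points.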
19. A system for recognizing speech that corresponds to a digital speech signal, the system comprising:
- a speech recognition engine that has access to a training corpus of known-class digitized speech utterances, a plurality of weak classifiers, wherein each weak classifier comprises a decision function for determining the presence of an event within the training corpus, and an ensemble detector comprising a plurality of the weak classifiers that together are better at determining the presence of a speech signal event than any of the constituent weak classifiers;
- wherein the speech recognition engine comprises an event extractor for extracting speech signal events and patterns of the speech signal events from the digital speech signal, wherein the speech signal events and patterns of the speech signal events are relevant in speech recognition;
- wherein the speech recognition engine comprises at least one processor that is configured to perform a plurality of operations, wherein the plurality of operations comprise:
  - detecting locations of relevant speech signal events in the digital speech signal, wherein each of the speech signal events comprises spectral information and temporal information,
  - capturing spectral features of and temporal relationships between all of the speech signal events,
  - segmenting the digital speech signal based on the detected locations of the detected speech signal events,
  - analyzing the segmented digital speech signal, wherein the analysis is synchronized with the speech signal events,
  - detecting patterns in the digital speech signal with the captured spectral information, the temporal relationships, and the analyzed digital speech signal,
  - providing a list of perceptual alternatives for recognized speech data that corresponds to the detected patterns in the digital speech signal, and
  - disambiguating between the perceptual alternatives for the recognized speech data based on the analysis of one or more of the speech signal events to improve the recognized speech data;
- wherein the at least one processor is configured to perform one or more of the operations using the ensemble detector;
- a module coupled to the speech recognition engine, wherein the module is configured to output the improved recognized speech data;
- an automatic punctuation engine coupled with a database containing training data, wherein the automatic punctuation engine comprises at least one statistical processor for adding punctuation to the improved recognized speech data using the training data, in the form of statistical-based punctuated text;
- a rule-based punctuator coupled with a lexical rule database, wherein the rule-based punctuator adds punctuation to the improved recognized speech data using rules from the lexical rule database, in the form of rule-based punctuated text;
- a decision module for determining whether the rule-based punctuated text or the statistical-based punctuated text produces a better punctuated result; and
- a mechanism that is configured to output the better punctuated result, based upon the determination.
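Claim 19's tandem punctuation stage runs a statistical punctuator and a rule-based punctuator in parallel and lets a decision module keep the better result. A hedged sketch, in which the lexical rule format, the per-token boundary probabilities, and the scoring function are all hypothetical stand-ins for the claimed training database, statistical processor, and decision criterion:

```python
def rule_based_punctuate(tokens, lexical_rules):
    """Apply lexical rules (hypothetical format: token -> trailing mark)."""
    return " ".join(tok + lexical_rules.get(tok, "") for tok in tokens)


def statistical_punctuate(tokens, boundary_probs, threshold=0.5):
    """Insert a period after token i when the model's sentence-boundary
    probability exceeds the threshold.

    The probabilities stand in for a statistical processor trained on the
    punctuated-text database recited in the claim.
    """
    out = []
    for i, tok in enumerate(tokens):
        mark = "." if boundary_probs.get(i, 0.0) > threshold else ""
        out.append(tok + mark)
    return " ".join(out)


def decide(rule_text, stat_text, score_fn):
    """Decision module: keep whichever punctuated text scores higher."""
    return rule_text if score_fn(rule_text) >= score_fn(stat_text) else stat_text
```

The scoring function is left as a parameter because the claim does not specify how the decision module judges which result is "better"; a language-model perplexity score would be one plausible choice.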
Specification