System and method for automatic speech to text conversion
Abstract
Speech recognition is performed in near real time and improved by exploiting events and event sequences, by employing machine learning techniques including boosted classifiers, ensembles, detectors, and cascades, and by using perceptual clusters. Speech recognition is also improved using tandem processing. An automatic punctuator injects punctuation into recognized text streams.
Claims (19)
1. A system for recognizing speech that corresponds to a digital speech signal, the system comprising:
- a speech recognition engine that has access to a training corpus of known-class digitized speech utterances, a plurality of weak classifiers, wherein each weak classifier comprises a decision function for determining the presence of an event within the training corpus, and an ensemble detector comprising a plurality of the weak classifiers that together are better at determining the presence of a speech signal event than any of the constituent weak classifiers;
- wherein the speech recognition engine comprises an event extractor for extracting speech signal events and patterns of the speech signal events from the digital speech signal, wherein the speech signal events and patterns of the speech signal events are relevant in speech recognition;
- wherein the speech recognition engine comprises at least one processor that is configured to perform a plurality of operations, wherein the plurality of operations comprise:
  - detecting locations of relevant speech signal events in the digital speech signal, wherein each of the speech signal events comprises spectral information and temporal information,
  - capturing spectral features of and temporal relationships between all of the speech signal events,
  - segmenting the digital speech signal based on the detected locations of the detected speech signal events,
  - analyzing the segmented digital speech signal, wherein the analysis is synchronized with the speech signal events,
  - detecting patterns in the digital speech signal with the captured spectral information, the temporal relationships, and the analyzed digital speech signal,
  - providing a list of perceptual alternatives for recognized speech data that corresponds to the detected patterns in the digital speech signal, and
  - disambiguating between the perceptual alternatives for the recognized speech data based on the analysis of one or more of the speech signal events to improve the recognized speech data;
- wherein the at least one processor is configured to perform one or more of the operations using the ensemble detector; and
- a module coupled to the speech recognition engine, wherein the module is configured to output the improved recognized speech data.

Dependent claims: 2-14.
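Claim 1's ensemble detector combines weak classifiers, each a decision function for the presence of an event, into a detector that outperforms any constituent classifier. A minimal sketch, assuming decision-stump weak classifiers and a weighted majority vote in the style of boosting (the class names, stump form, and weights are illustrative assumptions; the claim does not fix a particular combination rule):

```python
class WeakClassifier:
    """A decision stump: thresholds one spectral/temporal feature value."""

    def __init__(self, feature_index, threshold, polarity=1):
        self.feature_index = feature_index
        self.threshold = threshold
        self.polarity = polarity

    def predict(self, features):
        # Return +1 if this stump judges the event present, else -1.
        value = features[self.feature_index]
        return self.polarity if value > self.threshold else -self.polarity


class EnsembleDetector:
    """Weighted vote over weak classifiers (AdaBoost-style combination)."""

    def __init__(self, classifiers, weights):
        self.classifiers = classifiers
        self.weights = weights

    def detect(self, features):
        # The event is declared present when the weighted votes sum positive.
        score = sum(w * c.predict(features)
                    for c, w in zip(self.classifiers, self.weights))
        return score > 0.0
```

In a trained boosted ensemble the weights would come from each stump's error on the training corpus; here they are fixed constants for illustration.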
15. A method of speech recognition comprising the steps of:
- accessing a plurality of weak classifiers, wherein each weak classifier comprises a decision function for determining the presence of an event within a training corpus of known-class digitized speech utterances, and an ensemble detector comprising a plurality of the weak classifiers that together are better at determining the presence of a speech signal event than any of the constituent weak classifiers;
- receiving a speech signal;
- digitizing the received speech signal;
- detecting locations of relevant speech signal events in the received and digitized speech signal, wherein each of the relevant speech signal events comprises spectral information and temporal information;
- capturing spectral features of and temporal relationships between all of the speech signal events;
- segmenting the received and digitized speech signal based on the detected locations of the speech signal events;
- analyzing the segmented, received and digitized speech signal, wherein the analysis is synchronized with the speech signal events;
- detecting patterns in the digitized speech signal with the captured spectral information, the temporal relationships, and the analyzed speech signal;
- recognizing speech data that corresponds with the analyzed digitized speech signal, wherein the step of recognizing the speech data comprises the steps of:
  - providing a list of perceptual alternatives for the recognized speech data that corresponds to the detected patterns in the digitized speech signal, and
  - disambiguating between the perceptual alternatives for the recognized speech data based on the analysis of one or more of the speech signal events to improve the recognized speech data; and
- outputting the improved speech data.

Dependent claims: 16-18.
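The detecting and segmenting steps of claim 15 can be sketched as follows, assuming a simple upward threshold crossing stands in for the event detector (the claim does not specify a detection function) and that segments are cut at the detected event locations so that later analysis stays synchronized with the events:

```python
def detect_event_locations(signal, threshold):
    """Return sample indices where the signal crosses the threshold upward.

    A crude stand-in for the claimed event detector; real events would
    carry spectral as well as temporal information.
    """
    locations = []
    for i in range(1, len(signal)):
        if signal[i - 1] <= threshold < signal[i]:
            locations.append(i)
    return locations


def segment_at_events(signal, locations):
    """Split the signal into event-synchronized segments."""
    bounds = [0] + locations + [len(signal)]
    return [signal[a:b] for a, b in zip(bounds, bounds[1:])]
```

For example, an energy envelope `[0, 0, 2, 3, 0, 0, 4, 5, 0]` with threshold 1 yields event locations at samples 2 and 6, and three segments cut at those points.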
19. A system for recognizing speech that corresponds to a digital speech signal, the system comprising:
- a speech recognition engine that has access to a training corpus of known-class digitized speech utterances, a plurality of weak classifiers, wherein each weak classifier comprises a decision function for determining the presence of an event within the training corpus, and an ensemble detector comprising a plurality of the weak classifiers that together are better at determining the presence of a speech signal event than any of the constituent weak classifiers;
- wherein the speech recognition engine comprises an event extractor for extracting speech signal events and patterns of the speech signal events from the digital speech signal, wherein the speech signal events and patterns of the speech signal events are relevant in speech recognition;
- wherein the speech recognition engine comprises at least one processor that is configured to perform a plurality of operations, wherein the plurality of operations comprise:
  - detecting locations of relevant speech signal events in the digital speech signal, wherein each of the speech signal events comprises spectral information and temporal information,
  - capturing spectral features of and temporal relationships between all of the speech signal events,
  - segmenting the digital speech signal based on the detected locations of the detected speech signal events,
  - analyzing the segmented digital speech signal, wherein the analysis is synchronized with the speech signal events,
  - detecting patterns in the digital speech signal with the captured spectral information, the temporal relationships, and the analyzed digital speech signal,
  - providing a list of perceptual alternatives for recognized speech data that corresponds to the detected patterns in the digital speech signal, and
  - disambiguating between the perceptual alternatives for the recognized speech data based on the analysis of one or more of the speech signal events to improve the recognized speech data;
- wherein the at least one processor is configured to perform one or more of the operations using the ensemble detector;
- a module coupled to the speech recognition engine, wherein the module is configured to output the improved recognized speech data;
- an automatic punctuation engine coupled with a database containing training data, wherein the automatic punctuation engine comprises at least one statistical processor for adding punctuation to the improved recognized speech data using the training data, in the form of statistical-based punctuated text;
- a rule-based punctuator coupled with a lexical rule database, wherein the rule-based punctuator adds punctuation to the improved recognized speech data using rules from the lexical rule database, in the form of rule-based punctuated text;
- a decision module for determining whether the rule-based punctuated text or the statistical-based punctuated text produces a better punctuated result; and
- a mechanism that is configured to output the better punctuated result, based upon the determination.
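Claim 19's tandem punctuation stage runs a statistical punctuator and a rule-based punctuator in parallel and lets a decision module keep the better result. A hedged sketch, in which the lexical rule format, the per-token boundary probabilities, and the scoring function are all hypothetical stand-ins for the claimed training database, statistical processor, and decision criterion:

```python
def rule_based_punctuate(tokens, lexical_rules):
    """Apply lexical rules (hypothetical format: token -> trailing mark)."""
    return " ".join(tok + lexical_rules.get(tok, "") for tok in tokens)


def statistical_punctuate(tokens, boundary_probs, threshold=0.5):
    """Insert a period after token i when the model's sentence-boundary
    probability exceeds the threshold.

    The probabilities stand in for a statistical processor trained on the
    punctuated-text database recited in the claim.
    """
    out = []
    for i, tok in enumerate(tokens):
        mark = "." if boundary_probs.get(i, 0.0) > threshold else ""
        out.append(tok + mark)
    return " ".join(out)


def decide(rule_text, stat_text, score_fn):
    """Decision module: keep whichever punctuated text scores higher."""
    return rule_text if score_fn(rule_text) >= score_fn(stat_text) else stat_text
```

The scoring function is left as a parameter because the claim does not specify how the decision module judges which result is "better"; a language-model perplexity score would be one plausible choice.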
Specification