SYSTEM AND METHOD FOR AUTOMATIC SPEECH TO TEXT CONVERSION
Abstract
Speech recognition is performed in near-real-time and improved by exploiting events and event sequences, by employing machine learning techniques including boosted classifiers, ensembles, detectors, and cascades, and by using perceptual clusters. Speech recognition is also improved using tandem processing. An automatic punctuator injects punctuation into recognized text streams.
156 Citations
20 Claims
1. A speech recognition engine comprising:

an acoustical analyzer for receiving and digitizing a speech encoding signal;
an event extractor for extracting events from said speech signal, wherein said events comprise patterns in said speech signal which are highly relevant in speech recognition; and
a speech recognition module coupled to said event extractor, wherein said speech recognition module uses said events to initiate at least one action in response to detected content.

(Dependent claims: 2-14)
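The three claimed elements form a pipeline: digitize the signal, extract events, act on detected content. As a minimal illustration only (the patent specifies no implementation; every class name and the amplitude-threshold "event" below are hypothetical), the arrangement can be sketched in Python:

```python
# Illustrative sketch of the claim-1 pipeline; all names are hypothetical.

class AcousticalAnalyzer:
    """Receives a speech encoding signal and digitizes it."""
    def digitize(self, signal):
        # Stand-in for an ADC: quantize floats in [-1, 1] to 16-bit ints.
        return [max(-32768, min(32767, int(x * 32767))) for x in signal]

class EventExtractor:
    """Extracts 'events': patterns in the signal relevant to recognition."""
    def extract(self, samples, threshold=8000):
        # Toy event: sample indices whose amplitude crosses a threshold.
        return [i for i, s in enumerate(samples) if abs(s) > threshold]

class SpeechRecognitionModule:
    """Uses extracted events to initiate an action on detected content."""
    def recognize(self, events):
        return "content-detected" if events else "silence"

analyzer, extractor, recognizer = AcousticalAnalyzer(), EventExtractor(), SpeechRecognitionModule()
samples = analyzer.digitize([0.0, 0.9, -0.8, 0.1])
events = extractor.extract(samples)
action = recognizer.recognize(events)
```

A real event extractor would detect perceptually salient patterns (bursts, voicing onsets, formant transitions) rather than a bare amplitude threshold.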
15. A method of speech recognition comprising:

training weak classifiers based on training examples;
constructing an ensemble of weak detectors;
receiving a speech signal;
digitizing said speech signal;
processing said speech signal using said ensemble of weak detectors, thereby recognizing the presence of at least one event, wherein said event comprises a pattern in said speech signal which is highly relevant in speech recognition; and
processing said events to recognize speech.

(Dependent claims: 16-18)
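The training and detection steps of claim 15 resemble boosting. As a hypothetical sketch only (AdaBoost-style weak threshold detectors over a toy scalar feature such as frame energy; the patent does not disclose this particular algorithm):

```python
import math

# Hypothetical sketch: train weak threshold classifiers on labeled
# examples, combine them into a weighted ensemble of detectors, then
# use the ensemble's vote to flag an event in a feature value.

def train_ensemble(examples, labels, rounds=3):
    """Each weak detector is a threshold test (feature >= thr -> +1)."""
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Pick the threshold with the lowest weighted error.
        best = None
        for thr in sorted(set(examples)):
            err = sum(w for x, y, w in zip(examples, labels, weights)
                      if (1 if x >= thr else -1) != y)
            if best is None or err < best[1]:
                best = (thr, err)
        thr, err = best
        err = max(err, 1e-10)                      # avoid log(0) on a perfect round
        alpha = 0.5 * math.log((1 - err) / err)    # detector's vote weight
        ensemble.append((thr, alpha))
        # Reweight: emphasize examples this detector got wrong.
        weights = [w * math.exp(-alpha * y * (1 if x >= thr else -1))
                   for x, y, w in zip(examples, labels, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def detect_event(ensemble, feature):
    """Weighted vote of the weak detectors: True = event present."""
    score = sum(alpha * (1 if feature >= thr else -1) for thr, alpha in ensemble)
    return score > 0

# Toy features labeled +1 = event, -1 = background.
ens = train_ensemble([0.1, 0.2, 0.7, 0.9], [-1, -1, 1, 1])
hit = detect_event(ens, 0.8)   # high-energy frame, classified by the vote
```

In the claimed method, such an ensemble would run over frames of the digitized speech signal, and the detected events would then feed the downstream recognition step.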
19. A method for operating two or more speech recognition systems in tandem, wherein said two or more speech recognition systems perform detection and analysis of a speech signal at overlapping time intervals, said method comprising:

configuring the time intervals to be used in each speech recognition engine, wherein said intervals are reconfigurable;
configuring the overlap of the intervals, wherein the overlap is reconfigurable, and wherein the overlap is set to reflect the most information-rich portions of said speech signal;
routing said detection and analysis between said speech recognition engines;
weighting the results of said speech recognition engines, wherein a higher weight is given to results taken from the middle of the interval, and generating at least two opinions as to the identity of words within a single time interval; and
determining which opinion from the at least two opinions better estimates a textual representation of said speech signal.
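The middle-of-interval weighting and opinion selection of claim 19 can be illustrated with a toy sketch (all numbers, names, and the triangular weighting function below are hypothetical assumptions, not the patented implementation):

```python
# Hypothetical sketch of claim 19: two engines analyze overlapping time
# intervals; each per-word "opinion" is weighted higher the closer the
# word sits to the middle of that engine's interval, and the opinion
# with the highest weighted confidence wins.

def middle_weight(position, interval_start, interval_end):
    """Triangular weight: peaks at the interval midpoint, zero at the edges."""
    mid = (interval_start + interval_end) / 2
    half = (interval_end - interval_start) / 2
    return max(0.0, 1.0 - abs(position - mid) / half)

def tandem_decide(opinions):
    """opinions: (word, word_time, interval_start, interval_end, confidence)."""
    def score(op):
        word, t, start, end, conf = op
        return conf * middle_weight(t, start, end)
    return max(opinions, key=score)[0]

# Engine A's interval is 0-2 s; engine B's overlapping interval is 1-3 s.
# Both emit an opinion about the word spoken around t = 1.9 s.
opinion_a = ("ship",  1.9, 0.0, 2.0, 0.80)   # t = 1.9 is near A's interval edge
opinion_b = ("sheep", 1.9, 1.0, 3.0, 0.80)   # t = 1.9 is near B's interval middle
word = tandem_decide([opinion_a, opinion_b])
```

Because the word falls near the middle of engine B's interval but at the edge of engine A's, B's opinion receives the higher weight and is selected.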
20. A speech recognition engine comprising:

an acoustic analyzer for receiving and digitizing a speech signal in the form of a digital utterance;
a speech recognition module coupled to said acoustic analyzer, wherein said speech recognition module converts said digital utterance into at least one text stream;
an automatic punctuation engine coupled with a database containing training data, wherein said automatic punctuation engine includes at least one statistical processor for adding punctuation to said text stream using said training data, in the form of statistical-based punctuated text;
a rule-based punctuator coupled with a lexical rule database, wherein said rule-based punctuator adds punctuation to said text stream using rules from said lexical rule database, in the form of rule-based punctuated text; and
a decision module for determining whether said rule-based punctuated text or said statistical-based punctuated text produces a better punctuated result.
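The two punctuators and the decision module of claim 20 can be sketched as follows; the statistical model, the lexical rules, and the decision module's scoring are all toy stand-ins, since the patent does not specify them:

```python
# Hypothetical sketch of claim 20: a statistical punctuator and a
# rule-based punctuator each punctuate the same text stream; a decision
# module scores both candidates and keeps the better result.

RULE_LEXICON = {"however", "therefore"}   # toy lexical rule database

def statistical_punctuate(words, boundary_probs):
    """Insert a period wherever the trained model's boundary probability is high."""
    return " ".join(w + ("." if p > 0.5 else "")
                    for w, p in zip(words, boundary_probs))

def rule_based_punctuate(words):
    """Apply lexical rules: comma after connectives, period at the end."""
    out = [w + ("," if w in RULE_LEXICON else "") for w in words]
    out[-1] += "."
    return " ".join(out)

def decide(candidates):
    """Decision module: prefer the candidate whose sentences are
    reasonably sized (a crude well-formedness proxy)."""
    def score(text):
        sentences = [s for s in text.split(".") if s.strip()]
        return -sum(abs(len(s.split()) - 6) for s in sentences)
    return max(candidates, key=score)

words = ["we", "tried", "it", "however", "it", "failed"]
stat = statistical_punctuate(words, [0, 0, 0.9, 0, 0, 0.9])
rule = rule_based_punctuate(words)
best = decide([stat, rule])
```

In the claimed engine the decision module would compare the two punctuated streams with a learned or heuristic quality measure; the sentence-length proxy here merely makes the selection step concrete.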
Specification