Multi-pass speech activity detection strategy to improve automatic speech recognition

US 9,959,887 B2
Filed: 03/08/2016
Issued: 05/01/2018
Est. Priority Date: 03/08/2016
Status: Active Grant



Abstract

An automatic speech recognition system and a method performed by an automatic speech recognition system are provided. The method includes performing at least two passes of speech activity detection on an acoustic utterance uttered by a speaker. The at least two passes include an initial pass and a subsequent pass. The method further includes estimating at least one of feature statistics and transforms for acoustic feature extraction and acoustic modeling based on an output of an initial pass. The method further includes performing automatic speech recognition using an output of the subsequent pass while bypassing an output of the initial pass to recognize the acoustic utterance.
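The flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each frame is reduced to a single scalar score (for example a log energy), that speech activity detection is a plain threshold on that score, and that the "feature statistics" are a mean and standard deviation used for normalization. The names `detect_speech`, `two_pass_normalize`, `strict_threshold`, and `lenient_threshold` are hypothetical.

```python
from statistics import fmean, pstdev


def detect_speech(frame_scores, threshold):
    """Return one boolean per frame: True where the score exceeds the threshold."""
    return [score > threshold for score in frame_scores]


def two_pass_normalize(frame_scores, strict_threshold, lenient_threshold):
    """Estimate statistics from the initial pass, then normalize the
    frames selected by the subsequent pass (assumed scalar features)."""
    # Initial pass: a strict threshold keeps only frames that are very likely speech.
    initial_mask = detect_speech(frame_scores, strict_threshold)
    # Subsequent pass: a lenient threshold keeps more frames for decoding.
    subsequent_mask = detect_speech(frame_scores, lenient_threshold)

    # Statistics are estimated only from frames marked by the initial pass.
    stats_frames = [s for s, keep in zip(frame_scores, initial_mask) if keep]
    if not stats_frames:
        raise ValueError("initial pass selected no frames")
    mean = fmean(stats_frames)
    std = pstdev(stats_frames) or 1.0  # avoid division by zero

    # The initial-pass segmentation is bypassed here: only the subsequent
    # pass decides which frames would be handed to a decoder.
    decode_frames = [
        (index, (score - mean) / std)
        for index, (score, keep) in enumerate(zip(frame_scores, subsequent_mask))
        if keep
    ]
    return mean, std, decode_frames


scores = [0.1, 0.2, 2.0, 5.0, 6.0, 7.0, 2.5, 0.1]
mean, std, decode_frames = two_pass_normalize(scores, 4.0, 1.0)
```

In this toy input the initial pass selects only the three highest-scoring frames for the statistics, while the subsequent pass passes five frames on for decoding. The decoder itself is out of scope for the sketch.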


19 Claims

1. A method performed by an automatic speech recognition system having a processor, comprising:
- performing, by the processor, at least two passes of speech activity detection on an acoustic utterance uttered by a speaker, the at least two passes including an initial pass and a subsequent pass;
  
  estimating, by the processor, at least one of feature statistics and transforms for acoustic feature extraction and acoustic modeling based on an output of an initial pass; and
  
  performing, by the processor, automatic speech recognition using an output of the subsequent pass and the at least one of the feature statistics and transforms estimated from the initial pass while bypassing an output of the initial pass to recognize the acoustic utterance and output a textual representation of the acoustic utterance, wherein the at least two passes are performed simultaneously or separately, based on a user selection.
  - 2. The method of claim 1, wherein the initial pass is performed by configuring a speech activity detector to operate using a speech detection operating point having a miss rate higher than, and a false alarm rate lower than, the subsequent pass.
  - 3. The method of claim 2, wherein the speech detection operation point is configured, using the miss rate and the false alarm rate, to discard more non-speech segments of the acoustic utterance than the subsequent pass.
  - 4. The method of claim 1, where the speech activity detection is performed using variants of decision rules based on acoustic features to differentiate between speech classes and non-speech classes.
  - 5. The method of claim 1, where the speech activity detection is performed using variants of model based techniques that, in turn, use multi-layer perceptrons, hidden Markov models, and support vector machines to differentiate between speech classes and non-speech classes.
  - 6. The method of claim 1, where the automatic speech recognition is Neural Network based, Hidden Markov Model/Gaussian Mixture Model based, Support Vector Machine based, template based, or a combination thereof.
  - 7. The method of claim 1, where the automatic speech recognition is performed offline or in an online streaming mode.
  - 8. The method of claim 7, wherein the online streaming mode involves a server side and a client side, the speech activity detection is selectively performed at the client side or the server side.
  - 9. The method of claim 1, wherein in the initial pass, a subset of audio frames is selectively marked, from among a set of input audio frames corresponding to the acoustic utterance, as being designated for use in estimating the at least one of the feature statistics and the transforms, and wherein the at least one of the feature statistics and the transforms are estimated using only the marked audio frames in the subset while excluding remaining audio frames in the set of audio frames.
  - 10. The method of claim 9, wherein in the subsequent pass, another subset of audio frames is selectively marked, from among the set of input audio frames, as being designated for use in decoding, and wherein the acoustic utterance is decoded using the at least one of the feature statistics and the transforms.
  - 11. The method of claim 1, where the feature statistics are estimated for variants of normalization utilized in the acoustic feature extraction.
  - 12. The method of claim 1, where the transforms comprise speaker transforms and channel compensation transforms.
  - 13. The method of claim 1, where the at least one of the feature statistics and the transforms estimated from the output of the initial pass are used to decode regions of speech detected as the output of the subsequent pass.
  - 14. The method of claim 1, where the at least two passes comprise an initial set of passes and a subsequent set of passes with respect to the initial set of passes, wherein outputs of the initial set of passes are constrained for use in estimating the least one of the feature statistics and the transforms for acoustic feature extraction and acoustic modeling, and wherein outputs of the subsequent set of passes are used, while the outputs of the initial set of passes are bypassed, to recognize the acoustic utterance uttered by the speaker.
  - 15. The method of claim 1, wherein the at least two passes comprises a plurality of initial passes and at least one subsequent pass with respect to the plurality of initial passes, and wherein said estimating step estimates at least one of a single feature statistic and a single transform based on an output of a respective one of the plurality of initial passes.
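The operating-point relationship in claims 2 and 3 can be illustrated with a small calculation. This is a sketch under stated assumptions: the detector is taken to be a simple threshold on a per-frame speech score, reference speech/non-speech labels are assumed to be available, and the function name `operating_point`, the scores, and both thresholds are hypothetical.

```python
def operating_point(scores, is_speech, threshold):
    """Return (miss_rate, false_alarm_rate) for a threshold detector.

    A miss is a reference speech frame scored at or below the threshold.
    A false alarm is a reference non-speech frame scored above it.
    """
    speech_total = sum(1 for ref in is_speech if ref)
    nonspeech_total = len(is_speech) - speech_total
    if speech_total == 0 or nonspeech_total == 0:
        raise ValueError("need both speech and non-speech reference frames")
    misses = sum(
        1 for s, ref in zip(scores, is_speech) if ref and s <= threshold
    )
    false_alarms = sum(
        1 for s, ref in zip(scores, is_speech) if not ref and s > threshold
    )
    return misses / speech_total, false_alarms / nonspeech_total


scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9, 0.8, 0.2]
is_speech = [False, False, False, True, True, True, True, False]

# Stricter threshold for the initial pass, looser one for the subsequent pass.
initial_miss, initial_fa = operating_point(scores, is_speech, 0.65)
subsequent_miss, subsequent_fa = operating_point(scores, is_speech, 0.35)
```

With these made-up values the initial pass has a miss rate of 0.25 and a false alarm rate of 0.0, while the subsequent pass has a miss rate of 0.0 and a false alarm rate of 0.25. The initial pass therefore discards more non-speech frames at the cost of dropping some speech, which matches the ordering the claims describe.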

16. A computer program product for automatic speech recognition, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising:
- performing, by a processor of an automatic speech recognition system, at least two passes of speech activity detection on an acoustic utterance uttered by a speaker, the at least two passes including an initial pass and a subsequent pass;
  
  estimating, by the processor, at least one of feature statistics and transforms for acoustic feature extraction and acoustic modeling based on an output of an initial pass; and
  
  performing, by the processor, automatic speech recognition using an output of the subsequent pass and the at least one of the feature statistics and transforms estimated from the initial pass while bypassing an output of the initial pass to recognize the acoustic utterance and output a textual representation of the acoustic utterance, wherein the at least two passes are performed simultaneously or separately, based on a user selection.
  - 17. The computer program product of claim 16, wherein in the initial pass, a subset of audio frames is selectively marked, from among a set of input audio frames corresponding to the acoustic utterance, as being designated for use in estimating the at least one of the feature statistics and the transforms, and wherein the at least one of the feature statistics and the transforms are estimated using only the marked audio frames in the subset while excluding remaining audio frames in the set of audio frames.
  - 18. The computer program product of claim 16, where the at least one of the feature statistics and the transforms estimated from the output of the initial pass are used to decode regions of speech detected as the output of the subsequent pass.

19. An automatic speech recognition system having a processor, comprising:
- a speech activity detector, implemented by the processor, for performing at least two passes of speech activity detection on an acoustic utterance uttered by a speaker, the at least two passes including an initial pass and a subsequent pass, and for estimating at least one of feature statistics and transforms for acoustic feature extraction and acoustic modeling based on an output of an initial pass; and
  
  a speech decoder, implemented by the processor, for performing automatic speech recognition using an output of the subsequent pass and the at least one of the feature statistics and transforms estimated from the initial pass while bypassing an output of the initial pass to recognize the acoustic utterance and output a textual representation of the acoustic utterance, wherein the at least two passes are performed simultaneously or separately, based on a user selection.
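The closing limitation of the independent claims, that the passes run "simultaneously or separately, based on a user selection", could be wired up roughly as below. This is only a sketch: the claims do not specify a concurrency mechanism, so a thread pool is an assumption, as are the names `run_pass`, `run_passes`, and the `simultaneous` flag. Each pass is again modeled as a threshold on a scalar per-frame score.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pass(frame_scores, threshold):
    """One speech activity detection pass: True where a frame is kept."""
    return [score > threshold for score in frame_scores]


def run_passes(frame_scores, initial_threshold, subsequent_threshold, simultaneous):
    """Run both passes, concurrently or one after the other.

    `simultaneous` stands in for the user selection named in the claims.
    """
    if simultaneous:
        # Both passes read the same input and do not depend on each other,
        # so they can be submitted together.
        with ThreadPoolExecutor(max_workers=2) as pool:
            initial_future = pool.submit(run_pass, frame_scores, initial_threshold)
            subsequent_future = pool.submit(
                run_pass, frame_scores, subsequent_threshold
            )
            return initial_future.result(), subsequent_future.result()
    # Separate mode: run the initial pass first, then the subsequent pass.
    initial_mask = run_pass(frame_scores, initial_threshold)
    subsequent_mask = run_pass(frame_scores, subsequent_threshold)
    return initial_mask, subsequent_mask


scores = [0.1, 0.2, 2.0, 5.0, 6.0, 7.0, 2.5, 0.1]
together = run_passes(scores, 4.0, 1.0, simultaneous=True)
in_sequence = run_passes(scores, 4.0, 1.0, simultaneous=False)
```

Both modes return the same pair of masks, since neither pass consumes the other's output. In CPython, threads would not speed up a CPU-bound detector like this one; a real system would more likely use separate processes or native code for the concurrent case.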


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kuo, Hong-Kwang J.; Mangu, Lidia L.; Thomas, Samuel
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US15/064,441
Publication Number: US 20170263269A1
Time in Patent Office: 784 Days
Field of Search: 704232, 704257, 704251, 704246, 704273, 704233, 704235, 704203
CPC Class Codes

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/22   Procedures used during a sp...

G10L 15/30   Distributed recognition, e....

G10L 15/32   Multiple recognisers used i...

G10L 2015/225   Feedback of the input speech

G10L 25/30   using neural networks

G10L 25/78   Detection of presence or ab...

G10L 25/87   Detection of discrete point...
