Using codec parameters for endpoint detection in speech recognition
Abstract
Systems, methods and apparatus for determining an estimated endpoint of human speech in a sound wave received by a mobile device having a speech encoder for encoding the sound wave to produce an encoded representation of the sound wave. The estimated endpoint may be determined by analyzing information available from the speech encoder, without analyzing the sound wave directly and without producing a decoded representation of the sound wave. The encoded representation of the sound wave may be transmitted to a remote server for speech recognition processing, along with an indication of the estimated endpoint.
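The abstract describes estimating the endpoint from information the speech encoder already produces, without decoding the frames or examining the raw waveform. As a minimal illustrative sketch only, assuming the encoder exposes a per-frame energy value (the function name, threshold, and frame count below are hypothetical, not taken from the patent), such a detector might look for a sustained run of low-energy frames:

```python
# Hypothetical sketch: estimate a speech endpoint from per-frame energy
# values a speech encoder already computes (e.g., frame gain parameters),
# without decoding the frames or analyzing the raw sound wave.

def detect_endpoint(frame_energies, silence_threshold=0.05, min_silence_frames=10):
    """Return the index of the frame where speech is estimated to end,
    or None if no endpoint is found.

    frame_energies: per-frame energy values reported by the encoder.
    silence_threshold: energy below which a frame counts as silence
        (an assumed tunable, not specified by the patent).
    min_silence_frames: consecutive low-energy frames required before
        declaring an endpoint (also an assumed tunable).
    """
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy < silence_threshold:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Endpoint is where the silence run began.
                return i - min_silence_frames + 1
        else:
            silent_run = 0
    return None

# Example: 20 frames of speech followed by 15 near-silent frames.
energies = [0.8] * 20 + [0.01] * 15
print(detect_endpoint(energies))  # → 20
```

The key property the claims emphasize is that `frame_energies` comes from the encoder's own internal parameters, so no decoded representation of the sound wave is ever produced.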
Citations: 47

Claims (47)
1. A method for use in a system comprising a first device that receives human speech and a second device that comprises a speech recognizer, wherein the first device receives at least one sound wave comprising the human speech and encodes the received at least one sound wave, via at least one speech encoder in the first device, to produce at least one encoded representation of the at least one sound wave, wherein the first device transmits the at least one encoded representation to the second device, wherein the second device decodes the at least one encoded representation and performs, via the speech recognizer, speech recognition on the human speech, the method, performed by the first device, comprising acts of:

- determining an estimated endpoint of the human speech in the at least one sound wave by analyzing information available from the at least one speech encoder, without analyzing the at least one sound wave and without producing a decoded representation of the at least one sound wave, wherein the encoded representation of the at least one sound wave comprises a plurality of speech frames, and wherein the act of determining an estimated endpoint of the human speech comprises analyzing information indicative of a change in energy level between two speech frames; and
- in addition to transmitting the at least one encoded representation to the second device, providing to the second device a separate indication of the estimated endpoint.

(Dependent claims 2–15 not shown.)
16. At least one non-transitory computer readable medium having encoded thereon instructions that, when executed by at least one processor, perform a method for use in a system comprising a first device that receives human speech and a second device that comprises a speech recognizer, wherein the first device receives at least one sound wave comprising the human speech and encodes the received at least one sound wave, via at least one speech encoder in the first device, to produce at least one encoded representation of the at least one sound wave, wherein the first device transmits the at least one encoded representation to the second device, wherein the second device decodes the at least one encoded representation and performs, via the speech recognizer, speech recognition on the human speech, the method, performed by the first device, comprising acts of:

- determining an estimated endpoint of the human speech in the at least one sound wave by analyzing information available from the at least one speech encoder, without analyzing the at least one sound wave and without producing a decoded representation of the at least one sound wave, wherein the encoded representation of the at least one sound wave comprises a plurality of speech frames, and wherein the act of determining an estimated endpoint of the human speech comprises analyzing information indicative of a change in energy level between two speech frames; and
- in addition to transmitting the at least one encoded representation to the second device, providing to the second device a separate indication of the estimated endpoint.

(Dependent claims 17–30 not shown.)
31. A first device for use in a system comprising at least one second device configured to decode at least one encoded representation of at least one sound wave comprising human speech, the at least one second device comprising a speech recognizer to perform speech recognition on the human speech, the first device comprising:

- at least one speech encoder to encode the at least one sound wave to produce the at least one encoded representation, the at least one sound wave being received at the first device;
- at least one endpoint detection circuit to determine an estimated endpoint of the human speech in the at least one sound wave by analyzing information available from the at least one speech encoder, without analyzing the at least one sound wave and without producing a decoded representation of the at least one sound wave, wherein the encoded representation of the at least one sound wave comprises a plurality of speech frames, and wherein the at least one endpoint detection circuit is configured to determine an estimated endpoint of the human speech at least in part by analyzing information indicative of a change in energy level between two speech frames; and
- at least one transmitter to transmit the at least one encoded representation of the at least one sound wave and a separate indication of the estimated endpoint to the at least one second device.

(Dependent claims 32–47 not shown.)
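Claim 31 recites three device elements: a speech encoder, an endpoint detection circuit keyed to a change in energy level between two frames, and a transmitter that sends both the encoded frames and a separate endpoint indication. As an assumption-laden software sketch of that flow (every name, signature, and ratio here is hypothetical, not the patented implementation):

```python
# Hypothetical sketch of the first-device flow in claim 31: encode each
# frame, watch for a sharp energy drop between consecutive frames using
# the encoder's own energy value, and transmit the encoded frames plus a
# separate endpoint indication to the second device.

def detect_energy_drop(prev_energy, curr_energy, drop_ratio=0.1):
    """Flag a candidate endpoint when a frame's energy falls below a
    fraction of the previous frame's energy (drop_ratio is an assumed
    tunable, not specified by the claims)."""
    return prev_energy > 0 and curr_energy < prev_energy * drop_ratio

def run_first_device(frames, encoder, send):
    """Encode and transmit frames; send one separate endpoint indication.

    encoder(frame) -> (payload_bytes, energy): energy is information
        already available from the speech encoder, so no decoding and no
        direct analysis of the sound wave is needed here.
    send(message): stands in for the transmitter to the second device.
    """
    prev_energy = None
    endpoint_sent = False
    for i, frame in enumerate(frames):
        payload, energy = encoder(frame)
        send(("frame", i, payload))  # the encoded representation
        if (not endpoint_sent and prev_energy is not None
                and detect_energy_drop(prev_energy, energy)):
            send(("endpoint", i))    # the separate endpoint indication
            endpoint_sent = True
        prev_energy = energy

# Demo with a toy "encoder" whose energy is the frame's summed magnitude.
sent = []
frames = [[0.9] * 160] * 5 + [[0.0] * 160] * 3  # 5 speech frames, 3 silent
run_first_device(frames, lambda f: (b"payload", sum(abs(s) for s in f)), sent.append)
print([m for m in sent if m[0] == "endpoint"])  # → [('endpoint', 5)]
```

Note that the endpoint indication travels alongside, not inside, the frame stream, mirroring the claim's requirement that it be provided "in addition to" the encoded representation.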
Specification