Methods and apparatus for speech segmentation using multiple metadata

US 10,229,686 B2
Filed: 08/18/2014
Issued: 03/12/2019
Est. Priority Date: 08/18/2014
Status: Active Grant

Abstract

Methods and apparatus to process microphone signals by a speech enhancement module to generate an audio stream signal including first and second metadata for use by a speech recognition module. In an embodiment, speech recognition is performed using endpointing information including transitioning from a silence state to a maybe speech state, in which data is buffered, based on the first metadata and transitioning to a speech state, in which speech recognition is performed, based upon the second metadata.
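
The state transitions described in the abstract can be sketched as a small state machine. This is a minimal illustration, not the patented implementation: the names `State`, `Endpointer`, `step`, and `run` are hypothetical, the metadata are reduced to one boolean flag per frame, and the rules for falling back to silence are assumptions, since the abstract only describes the forward transitions.

```python
from enum import Enum, auto


class State(Enum):
    SILENCE = auto()
    MAYBE_SPEECH = auto()
    SPEECH = auto()


class Endpointer:
    """Toy endpointing state machine driven by two metadata flags per frame."""

    def __init__(self):
        self.state = State.SILENCE
        self.buffer = []      # frames held since the candidate endpoint
        self.endpoint = None  # frame index where buffering began

    def step(self, index, frame, fast_flag, slow_flag):
        """Consume one frame and return the frames released for recognition."""
        if self.state is State.SILENCE:
            if fast_flag:
                # First metadata: start buffering from this endpoint.
                self.state = State.MAYBE_SPEECH
                self.endpoint = index
                self.buffer = [frame]
            return []
        if self.state is State.MAYBE_SPEECH:
            self.buffer.append(frame)
            if slow_flag:
                # Second metadata confirms speech: release the whole buffer.
                self.state = State.SPEECH
                released, self.buffer = self.buffer, []
                return released
            if not fast_flag:
                # Assumption: discard the candidate if the fast detector retracts.
                self.state = State.SILENCE
                self.buffer = []
                self.endpoint = None
            return []
        # SPEECH state. Assumption: return to silence once both flags drop.
        if not fast_flag and not slow_flag:
            self.state = State.SILENCE
            return []
        return [frame]


def run(frames, fast_flags, slow_flags):
    """Feed a whole sequence through the endpointer."""
    ep = Endpointer()
    recognized = []
    for i, (frame, fast, slow) in enumerate(zip(frames, fast_flags, slow_flags)):
        recognized.extend(ep.step(i, frame, fast, slow))
    return recognized, ep.endpoint
```

In this sketch the frames buffered during the maybe speech state are not lost: once the slower detector confirms speech, recognition starts from the earlier endpoint rather than from the confirmation frame.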

21 Claims

1. A method of performing automated speech recognition (ASR) in a system having a speech enhancement module for generating an audio stream signal and metadata, coupled to an ASR module for performing speech recognition on the audio stream signal using the metadata, the method comprising:
- by the speech enhancement module, processing microphone signals to generate the audio stream signal;
  
  by a first speech detector having a first response latency, generating first metadata that indicate the possible presence of speech in the audio stream signal with a first confidence level;
  
  by a second speech detector having a second response latency that is higher than the first response latency, generating second metadata that indicate the possible presence of speech in the audio stream signal with a second confidence level that is higher than the first confidence level;
  
  by the ASR module based on the first metadata, initiating buffering of the audio stream signal from an endpoint; and
  
  by the ASR module based on the second metadata, initiating speech recognition on the buffered audio stream signal from the endpoint.
- Dependent claims:
  - 2. The method according to claim 1, wherein the first metadata has a frame-by-frame time scale.
  - 3. The method according to claim 1, wherein the second metadata has a sequence of frames time scale.
  - 4. The method according to claim 1, further including performing one or more of barge-in, beamforming, and/or echo cancellation for generating the first and/or second metadata.
  - 5. The method according to claim 1, further including tuning a speech detection threshold for a given latency for the first metadata.
  - 6. The method according to claim 1, further including adjusting latency for a given confidence level of voice activity detection for the second metadata.
  - 7. The method according to claim 1, further including controlling computation of the second metadata using the first metadata or computation of the first metadata using the second metadata.
  - 8. The method according to claim 1, further including performing one or more of barge-in, beamforming, and/or echo cancellation for generating further metadata.
  - 9. The method according to claim 1, wherein at least one of the first and second metadata is encoded into the audio signal.
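
The two time scales in claims 2 and 3, and the option in claim 7 of computing the second metadata from the first, can be illustrated with a pair of toy detectors. This is only a sketch under stated assumptions: the claims do not specify energy-based detection, and the function names `fast_detector` and `slow_detector`, the per-frame energy input, and the `threshold`, `window`, and `min_active` parameters are all hypothetical.

```python
from collections import deque


def fast_detector(energies, threshold):
    """Frame-by-frame decision: one flag per frame, with no look-back."""
    return [e > threshold for e in energies]


def slow_detector(fast_flags, window, min_active):
    """Sequence-of-frames decision.

    The flag is set only when at least `min_active` of the last `window`
    frame-level flags are set, which trades latency for confidence.
    """
    recent = deque(maxlen=window)
    out = []
    for flag in fast_flags:
        recent.append(flag)
        out.append(sum(recent) >= min_active)
    return out


energies = [0.1, 0.9, 0.8, 0.2, 0.9, 0.9, 0.1]
fast = fast_detector(energies, threshold=0.5)
slow = slow_detector(fast, window=3, min_active=2)
```

Here the fast detector first fires at frame 1 and the windowed detector at frame 2. Lowering `threshold` makes the fast detector more sensitive at a fixed latency, while enlarging `window` or `min_active` raises the evidence required by the slow detector at the cost of a later decision.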

10. An article, comprising a non-transitory computer readable medium having stored instructions that when executed perform a method of automated speech recognition (ASR) in a system having a speech enhancement module for generating an audio stream signal and metadata, coupled to an ASR module for performing speech recognition on the audio stream signal using the metadata, the method comprising:
- by the speech enhancement module, processing microphone signals to generate the audio stream signal;
  
  by a first speech detector having a first response latency, generating first metadata that indicate the possible presence of speech in the audio stream signal with a first confidence level;
  
  by a second speech detector having a second response latency that is higher than the first response latency, generating second metadata that indicate the possible presence of speech in the audio stream signal with a second confidence level that is higher than the first confidence level;
  
  by the ASR module based on the first metadata, initiating buffering of the audio stream signal from an endpoint; and
  
  by the ASR module based on the second metadata, initiating speech recognition on the buffered audio stream signal from the endpoint.
- Dependent claims:
  - 11. The article according to claim 10, wherein the first metadata has a frame-by-frame time scale.
  - 12. The article according to claim 10, wherein the second metadata has a sequence of frames time scale.
  - 13. The article according to claim 10, further including instructions to perform one or more of barge-in, beamforming, and/or echo cancellation for generating the first and second metadata.
  - 14. The article according to claim 10, further including instructions to tune speech detector parameters for a given latency for the first metadata.
  - 15. The article according to claim 10, further including instructions to adjust latency for a given confidence level of voice activity detection for the second metadata.
  - 16. The article according to claim 10, further including instructions to control computation of the second metadata using the first metadata or computation of the first metadata using the second metadata.
  - 17. The article according to claim 10, further including instructions to perform one or more of barge-in, beamforming, and/or echo cancellation for generating further metadata.

18. A system for performing automated speech recognition (ASR) comprising a speech enhancement module for generating an audio stream signal and metadata, coupled to an ASR module for performing speech recognition on the audio stream signal using the metadata, the system further comprising:
- in the speech enhancement module, electronic circuitry configured to provide;
  
  a first speech detector having a first response latency for generating first metadata that indicate the possible presence of speech in the audio stream signal with a first confidence level; and
  
  a second speech detector having a second response latency that is higher than the first response latency for generating second metadata that indicate the possible presence of speech in the audio stream signal with a second confidence level that is higher than the first confidence level; and
  
  in the ASR module, electronic circuitry configured to provide;
  
  an endpointing module for initiating, based on the first metadata, buffering of the audio stream signal from an endpoint, and for initiating, based on the second metadata, speech recognition on the buffered audio stream signal from the endpoint.
- Dependent claims:
  - 19. The system according to claim 18, further including a further speech detector to perform one or more of barge-in, beamforming, and/or echo cancellation for generating further metadata for use by the endpointing module.
  - 20. The system according to claim 18, wherein the first speech detector is further configured to tune detector parameters for a given latency for the first metadata.
  - 21. The system according to claim 18, wherein the second speech detector is further configured to adjust latency for a given confidence level of voice activity detection using the second metadata.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Buck, Markus; Herbig, Tobias; Graf, Simon; Ris, Christophe
Primary Examiner(s): Shah, Paras D
Assistant Examiner(s): Ogunbiyi, Oluwadamilola M

Application Number: US15/329,354
Publication Number: US 20170213556A1
Time in Patent Office: 1,667 Days
Field of Search: 704208
CPC Class Codes:
- G10L 15/04   Segmentation; Word boundary...
- G10L 15/20   Speech recognition techniqu...
- G10L 15/22   Procedures used during a sp...
- G10L 15/28   Constructional details of s...
- G10L 21/02   Speech enhancement, e.g. no...
- G10L 25/78   Detection of presence or ab...
