Audio output masking for improved automatic speech recognition

US 9,704,478 B1
Filed: 12/02/2013
Issued: 07/11/2017
Est. Priority Date: 12/02/2013
Status: Active Grant

Abstract

Features are disclosed for filtering portions of an output audio signal in order to improve automatic speech recognition on an input signal which may include a representation of the output signal. A signal that includes audio content can be received, and a frequency or band of frequencies can be selected to be filtered from the signal. The frequency band may correspond to a desired frequency band for speech recognition. An input signal can be obtained comprising audio data corresponding to a user utterance and presentation of the output signal. Automatic speech recognition can be performed on the input signal. In some cases, an acoustic model trained for use with such frequency band filtering may be used to perform speech recognition.
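As a rough illustration of the filtering step described above, the following sketch removes a band of frequencies from an output signal before playback. It is not taken from the patent: the function names (`dft`, `idft`, `mask_band`, `band_energy`), the 8 kHz sample rate, the two test tones, and the 1000 to 3000 Hz band are all assumptions chosen for the example, and a naive DFT is used only to keep the code dependency-free.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def bin_frequency(k, n, sample_rate):
    """Frequency in Hz of bin k, folding mirrored (negative-frequency) bins."""
    return min(k, n - k) * sample_rate / n

def mask_band(signal, sample_rate, low_hz, high_hz, gain=0.0):
    """Scale every spectral bin inside [low_hz, high_hz] by `gain`.

    gain=0.0 removes the band entirely; a value between 0 and 1 only
    attenuates it.
    """
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        if low_hz <= bin_frequency(k, n, sample_rate) <= high_hz:
            spectrum[k] *= gain
    return idft(spectrum)

def band_energy(signal, sample_rate, low_hz, high_hz):
    """Energy of the signal restricted to [low_hz, high_hz]."""
    n = len(signal)
    spectrum = dft(signal)
    return sum(abs(spectrum[k]) ** 2 for k in range(n)
               if low_hz <= bin_frequency(k, n, sample_rate) <= high_hz) / n

# Hypothetical "audio content": one tone outside the band, one inside it.
RATE = 8000   # assumed sample rate in Hz
N = 256       # number of samples; both tones fall exactly on DFT bins
music = [math.sin(2 * math.pi * 500 * t / RATE) +
         math.sin(2 * math.pi * 2000 * t / RATE) for t in range(N)]

# Assumed band where the user's speech is expected; filtered before playback.
filtered = mask_band(music, RATE, 1000, 3000)
```

In this sketch the 2000 Hz tone is removed from `filtered` while the 500 Hz tone is left in place, so a later microphone signal containing both the playback and an utterance would carry less playback energy in the masked band.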

187 Citations

25 Claims

1. A system comprising:
- one or more computing devices configured to at least:
  
  receive an audio signal;
  
  determine a frequency band based at least partly on a likelihood that audio input data regarding a user utterance will be present within the frequency band of a subsequent input signal;
  
  generate a filtered output signal by filtering a portion of audio output data within the frequency band from the audio signal, wherein the filtered output signal is generated prior to receiving an input signal comprising audio data regarding the user utterance and presentation of the filtered output signal, and wherein filtering the portion of audio output data from the audio signal reduces energy of the audio signal in the frequency band;
  
  generate audio using the filtered output signal;
  
  receive the input signal, wherein the input signal comprises audio data regarding both the user utterance and presentation of the filtered output signal;
  
  select an acoustic model of a plurality of acoustic models based at least partly on the acoustic model being associated with the frequency band; and
  
  perform speech recognition using the input signal and the acoustic model to generate speech recognition results.
- Dependent claims (2, 3, 4, 5, 6, 15, 16):
  - 2. The system of claim 1, wherein the audio signal comprises music, and wherein a genre of music is used to determine the frequency band.
  - 3. The system of claim 1, wherein the portion of the audio output data within the frequency band is filtered from the audio signal using a weighted filter.
  - 4. The system of claim 1, wherein the one or more computing devices are further configured to determine a plurality of different frequency bands, wherein audio output data in each of the plurality of different frequency bands is to be filtered from the audio signal.
  - 5. The device of claim 1, wherein the one or more computing devices are further configured to determine the frequency band to filter based at least partly on information regarding acoustic echo cancellation.
  - 6. The system of claim 1, wherein the acoustic model is selected based on the acoustic model being trained to recognize subword units using audio data within the frequency band.
  - 15. The computer-implemented method of claim 6, further comprising selecting an acoustic model of a plurality of acoustic models based at least partly on the frequency band.
  - 16. The computer-implemented method of claim 15, wherein the acoustic model was trained using training data, and wherein at least a portion of audio data in the frequency band of the training data was filtered from the training data.
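The model-selection step recited in claims 1, 6, and 15 can be pictured with a small sketch. Everything here is hypothetical: the `AcousticModel` record, the model names, the band edges, and the overlap-based rule are assumptions made for illustration, since the claims only require that the selected model be associated with the frequency band.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Band = Tuple[float, float]  # (low_hz, high_hz)

@dataclass(frozen=True)
class AcousticModel:
    """Hypothetical stand-in for a trained acoustic model."""
    name: str
    band_hz: Band  # band the model is associated with (assumed metadata)

def overlap_hz(a: Band, b: Band) -> float:
    """Width in Hz of the intersection of two bands (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def select_model(models: Sequence[AcousticModel], masked_band: Band) -> AcousticModel:
    """Pick the model whose associated band overlaps the masked band most."""
    best = max(models, key=lambda m: overlap_hz(m.band_hz, masked_band))
    if overlap_hz(best.band_hz, masked_band) == 0.0:
        raise LookupError("no acoustic model is associated with this band")
    return best

# Placeholder registry; names and bands are invented for the example.
MODELS = [
    AcousticModel("am_low", (300.0, 1500.0)),
    AcousticModel("am_mid", (1000.0, 3000.0)),
    AcousticModel("am_high", (2500.0, 4000.0)),
]

chosen = select_model(MODELS, (1200.0, 2800.0))
```

Here `chosen` is the `am_mid` entry, because its band covers the whole masked band. Other selection rules, such as an exact band match, would satisfy the same claim language.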

7. A computer-implemented method comprising:
- as implemented by a computing device comprising one or more processors configured to execute specific instructions,
  
  receiving a first signal comprising data regarding audio content;
  
  determining a frequency band within which audio data regarding a user utterance is expected to be present in an input signal;
  
  generating an output signal comprising a portion of the first signal, wherein the output signal is generated prior to receiving the input signal, wherein the input signal comprises audio data corresponding to the user utterance and presentation of the output signal, and wherein the output signal excludes a portion of the first signal having a frequency within the frequency band;
  
  receiving the input signal, wherein the input signal comprises audio data corresponding to the user utterance and presentation of the output signal, and wherein a portion of the input signal comprising audio data corresponding to the user utterance has a frequency within the frequency band; and
  
  providing the input signal to a speech recognizer.
- Dependent claims (8, 9, 10, 11, 12, 13, 14):
  - 8. The computer-implemented method of claim 7, wherein the portion of the first signal having the frequency within the frequency band is excluded from the output signal using a weighted filter.
  - 9. The computer-implemented method of claim 7, further comprising determining the frequency band based at least partly on a characteristic of the first signal.
  - 10. The computer-implemented method of claim 9, wherein the characteristic is determined using one of metadata associated with the first signal or an analysis of at least a portion of the first signal.
  - 11. The computer-implemented method of claim 7, further comprising determining the frequency band based at least partly on a vocal characteristic of a user.
  - 12. The computer-implemented method of claim 11, wherein the vocal characteristic is associated with one of:
    - gender, age, language, dialect, or accent.
  - 13. The computer-implemented method of claim 7, further comprising determining the frequency band based at least partly on a desired signal-to-noise ratio of the output signal for the frequency band.
  - 14. The computer-implemented method of claim 7, wherein generating the output signal comprises excluding a second portion of the first signal, wherein the second portion is excluded based on a frequency of the second portion being within a second frequency band, wherein the second frequency band is different than the first frequency band, and wherein the second portion is different than the portion.
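Claims 8 and 14 mention a weighted filter and a second, different frequency band. One possible reading, sketched below, treats each band as carrying a weight between 0 and 1 that says how strongly it is suppressed. The function name `band_gain`, the weights, and the band edges are assumptions for illustration only; the claims do not define how the weighting is computed.

```python
def band_gain(freq_hz, weighted_bands):
    """Return the playback gain to apply at freq_hz.

    weighted_bands is a list of (low_hz, high_hz, weight) tuples, where
    weight 1.0 removes the band completely and 0.0 leaves it untouched.
    Where bands overlap, the strongest suppression wins.
    """
    gain = 1.0
    for low_hz, high_hz, weight in weighted_bands:
        if low_hz <= freq_hz <= high_hz:
            gain = min(gain, 1.0 - weight)
    return gain

# Two hypothetical bands: the first is attenuated, the second fully removed.
BANDS = [
    (1000.0, 3000.0, 0.75),
    (2500.0, 3500.0, 1.0),
]

gains = {f: band_gain(f, BANDS) for f in (500.0, 2000.0, 2750.0, 3200.0)}
# gains == {500.0: 1.0, 2000.0: 0.25, 2750.0: 0.0, 3200.0: 0.0}
```

A per-frequency gain like this could be multiplied into the spectrum of the output signal before playback, so that some bands are only partly suppressed while others are excluded outright.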

17. A device comprising:
- means for receiving a first signal comprising data regarding audio content;
  
  means for determining a frequency band within which audio data regarding a user utterance is expected to be present in an input signal;
  
  means for generating an output signal comprising a portion of the first signal, wherein the output signal is generated prior to receiving the input signal, wherein the input signal comprises audio data corresponding to the user utterance and presentation of the output signal, and wherein the output signal excludes a portion of the first signal having a frequency within the frequency band;
  
  means for receiving the input signal, wherein the input signal comprises audio data corresponding to the user utterance and presentation of the output signal, and wherein a portion of the input signal comprising audio data corresponding to the user utterance has a frequency within the frequency band; and
  
  means for providing the input signal to a speech recognizer.
- Dependent claims (18, 19, 20, 21, 22, 23, 24, 25):
  - 18. The device of claim 17, wherein the portion of the first signal having the frequency within the frequency band is filtered using a weighted filter.
  - 19. The device of claim 17, further comprising means for determining the frequency band based at least partly on a characteristic of the first signal.
  - 20. The device of claim 19, wherein the characteristic is determined using one of metadata associated with the first signal or an analysis of at least a portion of the first signal.
  - 21. The device of claim 17, wherein the means for generating the output signal generates the output signal by excluding a second portion of the first signal, wherein the second portion is excluded based on a frequency of the second portion being within a second frequency band, wherein the second frequency band is different than the first frequency band, and wherein the second portion is different than the portion.
  - 22. The device of claim 17, wherein the portion of the first signal having the frequency in the frequency band is filtered in anticipation of receiving a user utterance.
  - 23. The device of claim 17, further comprising means for selecting an acoustic model of a plurality of acoustic models based at least partly on the frequency band.
  - 24. The device of claim 23, wherein the acoustic model is trained to prioritize the frequency band in performing speech recognition.
  - 25. The device of claim 17, further comprising means for determining the frequency band based at least partly on information regarding acoustic echo cancellation.

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Prasad, Rohit; Vitaladevuni, Shiv Naga Prasad; Chhetri, Amit Singh; Hilmes, Phillip Ryan
Primary Examiner(s): Colucci, Michael
Application Number: US14/094,591
Time in Patent Office: 1,317 Days
Field of Search: 704254, 704500, 704278, 704275, 704235, 704233, 704229, 704226, 704208, 704207, 704205, 7042001, 707769, 700 94, 381 943, 381 92, 381 66, 381 57, 381317

CPC Class Codes

G10L 15/00   Speech recognition G10L17/0...

G10L 2021/02082   the noise being echo, rever...

G10L 21/0232   Processing in the frequency...
