Speech and noise models for speech recognition

US 8,249,868 B2
Filed: 09/30/2011
Issued: 08/21/2012
Est. Priority Date: 06/14/2010
Status: Active Grant

Abstract

An audio signal generated by a device based on audio input from a user may be received. The audio signal may include at least a user audio portion that corresponds to one or more user utterances recorded by the device. A user speech model associated with the user may be accessed, and a determination may be made that background audio in the audio signal is below a defined threshold. In response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model may be adapted based on the audio signal to generate an adapted user speech model that models speech characteristics of the user. Noise compensation may be performed on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
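
The threshold gate described in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not the patented implementation: `UserSpeechModel` is a toy model (a running mean of feature vectors), and `background_level` and `threshold` are assumed to be plain numbers on the same scale. The sketch only shows the decision to adapt or not adapt; it does not attempt noise compensation.

```python
from dataclasses import dataclass


@dataclass
class UserSpeechModel:
    """Toy stand-in for a user speech model (assumption): a running mean
    of feature vectors. `count` starts at 1 so the initial (surrogate)
    mean carries the weight of one observation."""
    mean: list[float]
    count: int = 1

    def adapted_with(self, features: list[float]) -> "UserSpeechModel":
        # Incremental mean update; returns a new model and leaves self unchanged.
        n = self.count + 1
        new_mean = [m + (x - m) / n for m, x in zip(self.mean, features)]
        return UserSpeechModel(mean=new_mean, count=n)


def maybe_adapt(model: UserSpeechModel, features: list[float],
                background_level: float, threshold: float) -> UserSpeechModel:
    """Adapt the model only when background audio is below the threshold."""
    if background_level < threshold:
        return model.adapted_with(features)
    # Background audio is not below the threshold: keep the model as it is.
    return model


surrogate = UserSpeechModel(mean=[0.0, 0.0])
# Quiet recording: the model is adapted.
after_quiet = maybe_adapt(surrogate, [2.0, 4.0], background_level=0.1, threshold=0.5)
# Noisy recording: the model is returned unchanged.
after_noisy = maybe_adapt(after_quiet, [10.0, 10.0], background_level=0.9, threshold=0.5)
```

Returning a new model rather than mutating in place keeps the accessed model and the adapted model distinct, which mirrors the wording of the abstract.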

24 Claims

1. A system comprising:
- one or more processing devices; and
  
  one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
  
  receive a first audio signal generated by a device based on audio input from a user, the first audio signal including at least a first user audio portion that corresponds to both first background audio and one or more first user utterances recorded by the device;
  
  access a user speech model associated with the user;
  
  determine that the first background audio in the first user audio portion is below a defined threshold;
  
  in response to determining that the first background audio in the first user audio portion is below the defined threshold, adapt the accessed user speech model based on the first audio signal to generate an adapted user speech model that models speech characteristics of the user;
  
  receive a second audio signal generated by the device based on second audio input from a user, the second audio signal including at least a second user audio portion that corresponds to both second background audio and one or more second user utterances recorded by the device;
  
  determine that the second background audio in the second user audio portion is not below the defined threshold;
  
  in response to determining that the second background audio in the second user audio portion is not below the defined threshold, not adapt the accessed user speech model based on the second audio signal; and
  
  perform noise compensation on a third audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the third audio signal.
  - 2. The system of claim 1 wherein the first audio signal includes an environmental audio portion that corresponds only to the first background audio and, to determine that the first background audio in the first user audio portion is below a defined threshold, the instructions include instructions that, when executed, cause the one or more processing devices to:
    - determine an amount of energy in the environmental audio portion; and
      
      determine that the amount of energy in the environmental audio portion is below a threshold energy.
  - 3. The system of claim 1 wherein, to determine that the first background audio in the first user audio portion is below a defined threshold, the instructions include instructions that, when executed, cause the one or more processing devices to:
    - determine a signal-to-noise ratio of the first audio signal; and
      
      determine that the signal-to-noise ratio is below a threshold signal-to-noise ratio.
  - 4. The system of claim 3 wherein the first audio signal includes an environmental audio portion that corresponds only to the first background audio and, to determine the signal-to-noise ratio of the first audio signal, the instructions include instructions that, when executed, cause the one or more processing devices to:
    - determine an amount of energy in the first user audio portion of the first audio signal;
      
      determine an amount of energy in the environmental audio portion of the first audio signal; and
      
      determine the signal-to-noise ratio by determining a ratio between the amount of energy in the first user audio portion and the amount of energy in the environmental audio portion.
  - 5. The system of claim 1 wherein the accessed user speech model comprises a surrogate user speech model that has not been adapted to model the speech characteristics of the user.
  - 6. The system of claim 5 wherein the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - select the surrogate user speech model; and
      
      associate the surrogate speech model with the user.
  - 7. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a gender of the user; and
      
      select the surrogate user speech model, from among multiple surrogate user speech models, based on the gender of the user.
  - 8. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a location of the user when the one or more utterances are recorded; and
      
      select the surrogate user speech model, from among multiple surrogate user speech models, based on the location of the user when the one or more utterances are recorded.
  - 9. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a language or accent of the user; and
      
      select the surrogate user speech model, from among multiple surrogate user speech models, based on the language or accent.
  - 10. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - receive an initial audio signal that includes at least an initial user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine similarity metrics between multiple surrogate user speech models and an expected user speech model for the user determined based on the initial audio signal; and
      
      select the surrogate user speech model, from among the multiple surrogate user speech models, based on the similarity metrics.
  - 11. The system of claim 1 wherein the instructions comprise instructions that, when executed, cause the one or more processing devices to:
    - access a noise model associated with the user; and
      
      wherein, to perform noise compensation, the instructions further comprise instructions that cause the one or more processing devices to perform noise compensation on the third audio signal using the adapted user speech model and the accessed noise model.
  - 12. The system of claim 11 wherein, to perform noise compensation, the instructions further comprise instructions that cause the one or more processing devices to:
    - adapt the accessed noise model based on the received first audio signal to generate an adapted noise model that models characteristics of background audio surrounding the user; and
      
      perform noise compensation on the third audio signal using the adapted user speech model and the adapted noise model.
  - 13. The system of claim 11 wherein the instructions comprise instructions that, when executed, cause the one or more processing devices to:
    - receive a fourth audio signal that includes at least a fourth user audio portion that corresponds to one or more fourth user utterances recorded by the device;
      
      determine that background audio in the fourth audio signal is above the defined threshold;
      
      in response to determining that the background audio in the fourth audio signal is above the defined threshold, adapt the noise model associated with the user based on the fourth audio signal to generate an adapted noise model that models characteristics of background audio surrounding the user.
  - 14. The system of claim 11 wherein the accessed noise model comprises a surrogate noise model that has not been adapted to model characteristics of background audio surrounding the user.
  - 15. The system of claim 14 wherein the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - select the surrogate noise model; and
      
      associate the surrogate noise model with the user.
  - 16. The system of claim 15 wherein, to select the surrogate noise model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - receive an initial audio signal that includes at least an initial user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine a location of the user when the one or more utterances corresponding to the initial user audio portion are recorded; and
      
      select the surrogate noise model, from among multiple surrogate noise models, based on the location of the user when the one or more utterances corresponding to the initial user audio portion are recorded.
  - 17. The system of claim 15 wherein, to select the surrogate noise model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - receive an initial audio signal that includes at least an initial user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine similarity metrics between multiple surrogate noise models and an expected noise model for the user determined based on the initial audio signal; and
      
      select the surrogate noise model, from among the multiple surrogate noise models, based on the similarity metrics.
  - 18. The system of claim 17 wherein each of the multiple surrogate noise models model characteristics of background audio in a particular location.
  - 19. The system of claim 17 wherein each of the multiple surrogate noise models model characteristics of background audio in a particular kind of environmental condition.
  - 20. The system of claim 11 wherein, to access the noise model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a location of the user when the one or more first utterances are recorded; and
      
      select the noise model, from among multiple noise models, based on the location of the user.
  - 21. The system of claim 1 wherein the third audio signal corresponds to a voice search query and the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances;
      
      execute a search query using the one or more candidate transcriptions to generate search results; and
      
      send the search results to the device.
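
The measurements named in claims 2 and 4, and the similarity-based selection in claim 10, can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the patent: "amount of energy" is taken to be mean squared amplitude, the similarity metric is taken to be negative Euclidean distance between mean feature vectors, and the names `energy`, `energy_ratio`, `is_below_energy_threshold`, and `select_surrogate` are all hypothetical.

```python
import math


def energy(samples: list[float]) -> float:
    """Mean squared amplitude (assumed definition of 'amount of energy')."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def is_below_energy_threshold(environmental: list[float],
                              threshold_energy: float) -> bool:
    """Claim 2 style check: energy of the background-only portion
    compared against a threshold energy."""
    return energy(environmental) < threshold_energy


def energy_ratio(user_portion: list[float], environmental: list[float]) -> float:
    """Claim 4 style ratio: energy in the user audio portion divided by
    energy in the environmental (background-only) portion."""
    e_env = energy(environmental)
    if e_env == 0.0:
        return math.inf  # no measurable background audio
    return energy(user_portion) / e_env


def select_surrogate(surrogates: dict[str, list[float]],
                     expected: list[float]) -> str:
    """Claim 10 style selection: pick the surrogate whose mean vector is
    most similar to the expected model. Similarity here is negative
    Euclidean distance, which is an assumption of this sketch."""
    return max(surrogates, key=lambda name: -math.dist(surrogates[name], expected))


ratio = energy_ratio([1.0, -1.0], [0.5, -0.5])
choice = select_surrogate({"a": [0.0, 0.0], "b": [1.0, 1.0]}, [0.9, 1.2])
```

Because the user audio portion contains both speech and background audio, the ratio computed this way is closer to (speech + noise) / noise than to a pure signal-to-noise ratio.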

22. A system comprising:
- a client device configured to send, to an automated speech recognition system, a first audio signal that includes at least a first user audio portion that corresponds to both first background audio and one or more first user utterances recorded by the device, a second audio signal that includes at least a second user audio portion that corresponds to both second background audio and one or more second user utterances recorded by the device, and a third audio signal;
  
  an automated speech recognition system configured to:
  
  receive the first audio signal and the second audio signal from the client device;
  
  access a user speech model associated with the user;
  
  determine that the first background audio in the first user audio portion is below a defined threshold;
  
  in response to determining that the first background audio in the first user audio portion is below the defined threshold, adapt the accessed user speech model based on the first audio signal to generate an adapted user speech model that models speech characteristics of the user;
  
  determine that the second background audio in the second user audio portion is not below the defined threshold;
  
  in response to determining that the second background audio in the second user audio portion is not below the defined threshold, not adapt the accessed user speech model based on the second audio signal; and
  
  perform noise compensation on the received third audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received third audio signal.
  - 23. The system of claim 22 wherein the automated speech recognition system is further configured to perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances, the system further comprising:
    - a search engine system configured to:
      
      execute a search query using the one or more candidate transcriptions to generate search results; and
      
      send the search results to the client device.
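
The hand-off between the speech recognition system and the search engine system in claims 21 and 23 could be wired together roughly as below. This is only a sketch of the ordering of the steps: `handle_voice_query` and its three callables (`recognize`, `search`, `send`) are hypothetical names, and the stages themselves are supplied by the caller and not implemented here.

```python
from typing import Callable, Sequence


def handle_voice_query(
    filtered_audio: Sequence[float],
    recognize: Callable[[Sequence[float]], list[str]],
    search: Callable[[list[str]], list[str]],
    send: Callable[[list[str]], None],
) -> list[str]:
    """Run the three steps in order: recognition on the filtered audio,
    a search using the candidate transcriptions, then delivery of results."""
    candidates = recognize(filtered_audio)  # candidate transcriptions
    results = search(candidates)            # search results for the candidates
    send(results)                           # deliver results to the client device
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular recognizer or search backend.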

24. A method comprising:
- receiving, by one or more processing devices, a first audio signal generated by a device based on audio input from a user, the first audio signal including at least a first user audio portion that corresponds to both first background audio and one or more first user utterances recorded by the device;
  
  accessing, by the one or more processing devices, a user speech model associated with the user;
  
  determining, by the one or more processing devices, that the first background audio in the first user audio portion is below a defined threshold;
  
  in response to determining that the first background audio in the first user audio portion is below the defined threshold, adapting, by the one or more processing devices, the accessed user speech model based on the first audio signal to generate an adapted user speech model that models speech characteristics of the user;
  
  receiving, by the one or more processing devices, a second audio signal generated by the device based on second audio input from a user, the second audio signal including at least a second user audio portion that corresponds to both second background audio and one or more second user utterances recorded by the device;
  
  determining, by the one or more processing devices, that the second background audio in the second user audio portion is not below the defined threshold;
  
  in response to determining that the second background audio in the second user audio portion is not below the defined threshold, not adapting, by the one or more processing devices, the accessed user speech model based on the second audio signal; and
  
  performing, by the one or more processing devices, noise compensation on a third audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the third audio signal.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Lloyd, Matthew I.; Kristjansson, Trausti
Primary Examiner(s): Han, Qi
Application Number: US13/250,777
Publication Number: US 20120022860A1
Time in Patent Office: 326 Days
Field of Search: 704/233, 704/231, 704/235, 704/244, 704/251
US Class Current: 704/233
CPC Class Codes:
G10L 15/20 Speech recognition techniqu...
G10L 21/0208 Noise filtering
