Speech and Noise Models for Speech Recognition

US 20110307253A1
Filed: 06/14/2010
Published: 12/15/2011
Est. Priority Date: 06/14/2010
Status: Active Grant

Abstract

An audio signal generated by a device based on audio input from a user may be received. The audio signal may include at least a user audio portion that corresponds to one or more user utterances recorded by the device. A user speech model associated with the user may be accessed, and a determination may be made that background audio in the audio signal is below a defined threshold. In response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model may be adapted based on the audio signal to generate an adapted user speech model that models speech characteristics of the user. Noise compensation may be performed on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
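The control flow in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: `process_utterance`, `background_level`, `adapt`, and `compensate` are hypothetical names, and the abstract does not say how the background level is measured or how the models are represented. The sketch also assumes that noise compensation still runs with the current, unadapted model when the background is at or above the threshold, a case the abstract leaves open.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Signal = List[float]


def process_utterance(
    signal: Signal,
    speech_model: Model,
    threshold: float,
    background_level: Callable[[Signal], float],
    adapt: Callable[[Model, Signal], Model],
    compensate: Callable[[Signal, Model], Signal],
) -> Tuple[Signal, Model]:
    """Adapt the speech model only on quiet input, then filter the signal.

    All callables are hypothetical placeholders supplied by the caller.
    """
    if background_level(signal) < threshold:
        # Quiet background: treat the signal as clean enough to learn from.
        speech_model = adapt(speech_model, signal)
    # Assumption: compensation uses whichever model is current, adapted or not.
    filtered = compensate(signal, speech_model)
    return filtered, speech_model
```

Passing the measurement, adaptation, and compensation steps in as functions keeps the sketch neutral about the parts the abstract does not specify.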

24 Claims

1. A system comprising:
- one or more processing devices; and
  
  one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
  
  receive an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device;
  
  access a user speech model associated with the user;
  
  determine that background audio in the audio signal is below a defined threshold;
  
  in response to determining that the background audio in the audio signal is below the defined threshold, adapt the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
  
  perform noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
- - 2. The system of claim 1 wherein the audio signal includes an environmental audio portion that corresponds only to background audio surrounding the user and, to determine that the background audio in the audio signal is below a defined threshold, the instructions include instructions that, when executed, cause the one or more processing devices to:
    - determine an amount of energy in the environmental audio portion; and
      
      determine that the amount of energy in the environmental audio portion is below a threshold energy.
  - 3. The system of claim 2 wherein, to determine that the background audio in the audio signal is below a defined threshold, the instructions include instructions that, when executed, cause the one or more processing devices to:
    - determine a signal-to-noise ratio of the audio signal; and
      
      determine that the signal-to-noise ratio is below a threshold signal-to-noise ratio.
  - 5. The system of claim 1 wherein the accessed user speech model comprises a surrogate user speech model that has not been adapted to model the speech characteristics of the user.
  - 6. The system of claim 5 wherein the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - select the surrogate user speech model; and
      
      associate the surrogate speech model with the user.
  - 7. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a gender of the user; and
      
      select the surrogate user speech model, from among multiple surrogate user speech models, based on the gender of the user.
  - 8. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a location of the user when the one or more utterances are recorded; and
      
      select the surrogate user speech model, from among multiple surrogate user speech models, based on the location of the user when the one or more utterances are recorded.
  - 9. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a language or accent of the user; and
      
      select the surrogate user speech model, from among multiple surrogate user speech models, based on the language or accent.
  - 10. The system of claim 6 wherein, to select the surrogate user speech model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - receive an initial audio signal that includes at least an initial user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine similarity metrics between multiple surrogate user speech models and an expected user speech model for the user determined based on the initial audio signal; and
      
      select the surrogate user speech model, from among the multiple surrogate user speech models, based on the similarity metrics.
  - 11. The system of claim 1 wherein the instructions comprise instructions that, when executed, cause the one or more processing devices to:
    - access a noise model associated with the user; and
      
      wherein, to perform noise compensation, the instructions further comprise instructions that cause the one or more processing devices to perform noise compensation on the received audio signal using the adapted user speech model and the accessed noise model.
  - 12. The system of claim 11 wherein, to perform noise compensation, the instructions further comprise instructions that cause the one or more processing devices to:
    - adapt the accessed noise model based on the received audio signal to generate an adapted noise model that models characteristics of background audio surrounding the user; and
      
      perform noise compensation on the received audio signal using the adapted user speech model and the adapted noise model.
  - 13. The system of claim 11 wherein the instructions comprise instructions that, when executed, cause the one or more processing devices to:
    - receive a second audio signal that includes at least a second user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine that background audio in the second audio signal is above the defined threshold;
      
      in response to determining that the background audio in the second audio signal is above the defined threshold, adapt the noise model associated with the user based on the second audio signal to generate an adapted noise model that models characteristics of background audio surrounding the user.
  - 14. The system of claim 11 wherein the accessed noise model comprises a surrogate noise model that has not been adapted to model characteristics of background audio surrounding the user.
  - 15. The system of claim 14 wherein the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - select the surrogate noise model; and
      
      associate the surrogate noise model with the user.
  - 16. The system of claim 15 wherein, to select the surrogate noise model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - receive an initial audio signal that includes at least an initial user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine a location of the user when the one or more utterances corresponding to the initial user audio portion are recorded; and
      
      select the surrogate noise model, from among multiple surrogate noise models, based on the location of the user when the one or more utterances corresponding to the initial user audio portion are recorded.
  - 17. The system of claim 15 wherein, to select the surrogate noise model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - receive an initial audio signal that includes at least an initial user audio portion that corresponds to one or more user utterances recorded by the device;
      
      determine similarity metrics between multiple surrogate noise models and an expected noise model for the user determined based on the initial audio signal; and
      
      select the surrogate noise model, from among the multiple surrogate noise models, based on the similarity metrics.
  - 18. The system of claim 17 wherein each of the multiple surrogate noise models models characteristics of background audio in a particular location.
  - 19. The system of claim 17 wherein each of the multiple surrogate noise models models characteristics of background audio in a particular kind of environmental condition.
  - 20. The system of claim 11 wherein, to access the noise model, the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - determine a location of the user when the one or more utterances are recorded; and
      
      select the noise model, from among multiple noise models, based on the location of the user.
  - 21. The system of claim 1 wherein the audio signal corresponds to a voice search query and the instructions include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to:
    - perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances;
      
      execute a search query using the one or more candidate transcriptions to generate search results; and
      
      send the search results to the device.
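Claims 10 and 17 select a surrogate model by comparing each candidate against an expected model derived from an initial audio signal. The claims do not name a similarity metric or a model representation, so the sketch below assumes both for illustration: each model is a plain feature vector, and cosine similarity is the metric. `select_surrogate`, `cosine_similarity`, and the model names in the usage note are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_surrogate(
    surrogates: Dict[str, Sequence[float]],
    expected: Sequence[float],
) -> str:
    """Return the name of the surrogate model most similar to the expected model.

    Assumption: models are feature vectors; the claims leave this open.
    """
    if not surrogates:
        raise ValueError("at least one surrogate model is required")
    return max(
        surrogates,
        key=lambda name: cosine_similarity(surrogates[name], expected),
    )
```

For example, `select_surrogate({"model_a": [1.0, 0.0], "model_b": [0.0, 1.0]}, [0.9, 0.1])` picks `"model_a"`, since the expected vector points almost entirely along the first axis.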

4. The system of claim 3 wherein the audio signal includes an environmental audio portion that corresponds only to background audio surrounding the user and, to determine the signal-to-noise ratio of the audio signal, the instructions include instructions that, when executed, cause the one or more processing devices to:
- determine an amount of energy in the user audio portion of the audio signal;
  
  determine an amount of energy in the environmental audio portion of the audio signal; and
  
  determine the signal-to-noise ratio by determining the ratio between the amount of energy in the user audio portion and the environmental audio portion.
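Claims 2 through 4 rest on two measurements: the energy of the environmental portion, and the ratio of user-portion energy to environmental-portion energy. A minimal sketch, assuming that energy means the sum of squared samples (the claims do not define it) and using hypothetical function names:

```python
import math
from typing import Sequence


def energy(samples: Sequence[float]) -> float:
    """Total energy of a portion, taken here as the sum of squared samples."""
    return sum(s * s for s in samples)


def background_is_quiet(environmental: Sequence[float], threshold_energy: float) -> bool:
    """Claim 2 style check: environmental energy is below a threshold energy."""
    return energy(environmental) < threshold_energy


def signal_to_noise_ratio(user: Sequence[float], environmental: Sequence[float]) -> float:
    """Claim 4 style ratio: user-portion energy over environmental-portion energy."""
    noise = energy(environmental)
    if noise == 0.0:
        # No measurable background audio; report an unbounded ratio.
        return math.inf
    return energy(user) / noise
```

Because total energy grows with the number of samples, portions of different lengths may need normalizing (for example, by sample count) before they are compared; the claims do not address this.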

22. A system comprising:
- a client device configured to send, to an automated speech recognition system, an audio signal that includes at least a user audio portion that corresponds to one or more user utterances recorded by the device;
  
  an automated speech recognition system configured to:
  
  receive the audio signal from the client device;
  
  access a user speech model associated with the user;
  
  determine that background audio in the audio signal is below a defined threshold;
  
  in response to determining that the background audio in the audio signal is below the defined threshold, adapt the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
  
  perform noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
- - 23. The system of claim 22 wherein the automated speech recognition system is further configured to perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances, the system further comprising:
    - a search engine system configured to:
      
      execute a search query using the one or more candidate transcriptions to generate search results; and
      
      send the search results to the client device.

24. A method comprising:
- receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device;
  
  accessing a user speech model associated with the user;
  
  determining that background audio in the audio signal is below a defined threshold;
  
  in response to determining that the background audio in the audio signal is below the defined threshold, adapting the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
  
  performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Lloyd, Matthew I., Kristjansson, Trausti

Granted Patent

US 8,234,111 B2
US Class Current

704/233
CPC Class Codes

G10L 15/20 Speech recognition techniqu...

G10L 21/0208 Noise filtering
