Speech and noise models for speech recognition
First Claim
1. A system comprising:
one or more processing devices; and
one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the system to:
receive an audio signal generated by a device based on audio input from a user, the audio signal including background audio and one or more user utterances recorded by the device, the audio signal comprising a portion that includes background audio without utterances of the user;
identify the user or the device based on an identifier for the user or the device;
determine a location of the user when the one or more utterances are recorded;
determine that a set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
select a surrogate noise model in response to determining that the set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
generate a filtered audio signal with reduced background audio compared to the received audio signal using the selected surrogate noise model;
adapt the surrogate noise model based on the received audio signal to generate a first adapted noise model that models characteristics of background audio surrounding the user at the location; and
store the first adapted noise model as one of a plurality of adapted noise models that are specific to the identified user or device, each of the plurality of adapted noise models being associated with a different corresponding location, each of the plurality of adapted noise models having been adapted based on audio recorded by the device at the corresponding location with which the adapted noise model is associated, each of the plurality of adapted noise models being stored in association with the identifier for the user or the device and with its corresponding location such that when utterances of the user are recorded by the device at the different corresponding locations, the adapted noise model associated with the location where the device recorded the utterances is used to recognize the utterances of the user.
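The flow recited above (look up an adapted noise model by identifier and location, fall back to a surrogate, filter, adapt, store) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not recited in the claim: the noise model is represented as a per-bin magnitude spectrum, filtering is done by simple spectral subtraction, adaptation is a linear interpolation toward the average of the noise-only frames, and the names `NoiseModelStore`, `spectral_subtract`, `adapt`, and `process` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumption: a noise model is a magnitude value per frequency bin.
Spectrum = List[float]


@dataclass
class NoiseModelStore:
    """Adapted noise models keyed by (identifier, location), plus a surrogate."""
    surrogate: Spectrum
    adapted: Dict[Tuple[str, str], Spectrum] = field(default_factory=dict)

    def select(self, identifier: str, location: str) -> Tuple[Spectrum, bool]:
        """Return (model, is_surrogate) for this user/device at this location."""
        key = (identifier, location)
        if key in self.adapted:
            return self.adapted[key], False
        # No adapted model for both the identifier and the location.
        return list(self.surrogate), True


def spectral_subtract(frame: Spectrum, noise: Spectrum) -> Spectrum:
    """Assumed filtering step: subtract the noise estimate, floor at zero."""
    return [max(s - n, 0.0) for s, n in zip(frame, noise)]


def adapt(model: Spectrum, noise_only_frames: List[Spectrum],
          rate: float = 0.5) -> Spectrum:
    """Assumed adaptation step: move the model toward the observed noise."""
    if not noise_only_frames:
        return list(model)
    n = len(noise_only_frames)
    observed = [sum(f[i] for f in noise_only_frames) / n
                for i in range(len(model))]
    return [(1 - rate) * m + rate * o for m, o in zip(model, observed)]


def process(store: NoiseModelStore, identifier: str, location: str,
            frames: List[Spectrum],
            noise_only_frames: List[Spectrum]) -> List[Spectrum]:
    """Filter with the selected model, then store the adapted model."""
    model, _ = store.select(identifier, location)
    filtered = [spectral_subtract(f, model) for f in frames]
    store.adapted[(identifier, location)] = adapt(model, noise_only_frames)
    return filtered


store = NoiseModelStore(surrogate=[1.0, 1.0])
filtered = process(store, "user-1", "home", [[3.0, 0.5]], [[2.0, 2.0]])
```

After the call, the store holds a model for ("user-1", "home"), so a later recording at that location selects the adapted model, while a recording at a new location still falls back to the surrogate. When an adapted model already exists, this sketch reuses it and adapts it further; that is a choice made here, not something the claim specifies.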
Abstract
An audio signal generated by a device based on audio input from a user may be received. The audio signal may include at least a user audio portion that corresponds to one or more user utterances recorded by the device. A user speech model associated with the user may be accessed, and a determination may be made that background audio in the audio signal is below a defined threshold. In response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model may be adapted based on the audio signal to generate an adapted user speech model that models speech characteristics of the user. Noise compensation may be performed on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
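The threshold-gated adaptation described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how background audio is measured or how the speech model is represented, so the sketch assumes mean frame energy as the background measure, a simple mean feature vector as the user speech model, and linear interpolation as the adaptation rule. The function names `background_level` and `maybe_adapt_speech_model` are hypothetical, and the subsequent noise compensation step is not shown.

```python
from typing import List, Sequence, Tuple


def background_level(noise_frames: Sequence[Sequence[float]]) -> float:
    """Assumed measure: mean energy over frames that contain no speech."""
    energies = [sum(x * x for x in f) / len(f) for f in noise_frames if len(f)]
    return sum(energies) / len(energies) if energies else 0.0


def maybe_adapt_speech_model(
    speech_model: List[float],
    speech_frames: Sequence[Sequence[float]],
    noise_frames: Sequence[Sequence[float]],
    threshold: float,
    rate: float = 0.1,
) -> Tuple[List[float], bool]:
    """Adapt the speech model only when background audio is below threshold.

    Returns (model, adapted_flag). The model is returned unchanged when the
    background level is at or above the threshold or there is no speech.
    """
    if background_level(noise_frames) >= threshold or not speech_frames:
        return list(speech_model), False
    n = len(speech_frames)
    observed = [sum(f[i] for f in speech_frames) / n
                for i in range(len(speech_model))]
    adapted = [(1 - rate) * m + rate * o
               for m, o in zip(speech_model, observed)]
    return adapted, True


quiet = maybe_adapt_speech_model(
    [0.0, 0.0], [[1.0, 2.0]], [[0.1, 0.1]], threshold=0.05, rate=0.5)
noisy = maybe_adapt_speech_model(
    [0.0, 0.0], [[1.0, 2.0]], [[1.0, 1.0]], threshold=0.05, rate=0.5)
```

Gating on the background level keeps noisy recordings from contaminating the user speech model; in the noisy case above the model is returned unchanged.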
19 Claims
1. A system comprising:
one or more processing devices; and
one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the system to:
receive an audio signal generated by a device based on audio input from a user, the audio signal including background audio and one or more user utterances recorded by the device, the audio signal comprising a portion that includes background audio without utterances of the user;
identify the user or the device based on an identifier for the user or the device;
determine a location of the user when the one or more utterances are recorded;
determine that a set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
select a surrogate noise model in response to determining that the set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
generate a filtered audio signal with reduced background audio compared to the received audio signal using the selected surrogate noise model;
adapt the surrogate noise model based on the received audio signal to generate a first adapted noise model that models characteristics of background audio surrounding the user at the location; and
store the first adapted noise model as one of a plurality of adapted noise models that are specific to the identified user or device, each of the plurality of adapted noise models being associated with a different corresponding location, each of the plurality of adapted noise models having been adapted based on audio recorded by the device at the corresponding location with which the adapted noise model is associated, each of the plurality of adapted noise models being stored in association with the identifier for the user or the device and with its corresponding location such that when utterances of the user are recorded by the device at the different corresponding locations, the adapted noise model associated with the location where the device recorded the utterances is used to recognize the utterances of the user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. A computer-implemented method comprising:
receiving an audio signal generated by a device based on audio input from a user, the audio signal including background audio and one or more user utterances recorded by the device, the audio signal comprising a portion that includes background audio without utterances of the user;
identifying the user or the device based on an identifier for the user or the device;
determining a location of the user when the one or more utterances are recorded;
determining that a set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
selecting a surrogate noise model in response to determining that the set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
generating a filtered audio signal with reduced background audio compared to the received audio signal using the selected surrogate noise model;
adapting the surrogate noise model based on the received audio signal to generate a first adapted noise model that models characteristics of background audio surrounding the user at the location; and
storing the first adapted noise model as one of a plurality of adapted noise models that are specific to the identified user or device, each of the plurality of adapted noise models being associated with a different corresponding location, each of the plurality of adapted noise models having been adapted based on audio recorded by the device at the corresponding location with which the adapted noise model is associated, each of the plurality of adapted noise models being stored in association with the identifier for the user or the device and with its corresponding location such that when utterances of the user are recorded by the device at the different corresponding locations, the adapted noise model associated with the location where the device recorded the utterances is used to recognize the utterances of the user.
19. A computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving an audio signal generated by a device based on audio input from a user, the audio signal including background audio and one or more user utterances recorded by the device, the audio signal comprising a portion that includes background audio without utterances of the user;
identifying the user or the device based on an identifier for the user or the device;
determining a location of the user when the one or more utterances are recorded;
determining that a set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
selecting a surrogate noise model in response to determining that the set of stored noise models does not include an adapted noise model that is adapted for the user or the device and that is associated with both (i) the determined location and (ii) the identifier for the user or the device;
generating a filtered audio signal with reduced background audio compared to the received audio signal using the selected surrogate noise model;
adapting the surrogate noise model based on the received audio signal to generate a first adapted noise model that models characteristics of background audio surrounding the user at the location; and
storing the first adapted noise model as one of a plurality of adapted noise models that are specific to the identified user or device, each of the plurality of adapted noise models being associated with a different corresponding location, each of the plurality of adapted noise models having been adapted based on audio recorded by the device at the corresponding location with which the adapted noise model is associated, each of the plurality of adapted noise models being stored in association with the identifier for the user or the device and with its corresponding location such that when utterances of the user are recorded by the device at the different corresponding locations, the adapted noise model associated with the location where the device recorded the utterances is used to recognize the utterances of the user.
Specification