Improving speaker verification across locations, languages, and/or dialects
First Claim
1. A system comprising:
a user device comprising one or more processors and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the user device to perform operations comprising:
receiving, by the user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.
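The "providing, as input to a neural network" step of the claim pairs data derived from the audio with a language or location identifier. The following is a minimal sketch of one way such an input could be assembled. The one-hot encoding, the concatenation, the identifier list `LANGUAGE_IDS`, and the function names are all illustrative assumptions; the claim does not specify how the identifier is encoded or combined with the audio-derived data.

```python
from typing import List, Sequence

# Hypothetical set of supported language/location identifiers (an assumption,
# not taken from the claim).
LANGUAGE_IDS = ["en-US", "en-GB", "de-DE", "ja-JP"]


def one_hot_language(language_id: str,
                     known: Sequence[str] = LANGUAGE_IDS) -> List[float]:
    """Encode a language identifier as a one-hot vector (assumed encoding)."""
    if language_id not in known:
        raise ValueError(f"unknown language identifier: {language_id}")
    return [1.0 if lang == language_id else 0.0 for lang in known]


def build_network_input(acoustic_features: Sequence[float],
                        language_id: str) -> List[float]:
    """Concatenate audio-derived features with the encoded identifier.

    The result stands in for the claim's "set of input data derived from the
    audio data and a language identifier or location identifier".
    """
    return list(acoustic_features) + one_hot_language(language_id)


if __name__ == "__main__":
    # Two made-up feature values followed by the one-hot code for "de-DE".
    print(build_network_input([0.5, -1.0], "de-DE"))
```

Under the same assumptions, a training example as described in the claim could reuse this layout, with the identifier taken from the language or location of the speaker of each training utterance.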
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, to facilitate language-independent speaker verification. In one aspect, a method includes actions of receiving, by a user device, audio data representing an utterance of a user. Other actions may include providing, to a neural network stored on the user device, input data derived from the audio data and a language identifier. The neural network may be trained using speech data representing speech in different languages or dialects. The method may include additional actions of generating, based on output of the neural network, a speaker representation and determining, based on the speaker representation and a second representation, that the utterance is an utterance of the user. The method may provide the user with access to the user device based on determining that the utterance is an utterance of the user.
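The determination described in the abstract compares a speaker representation generated from the new utterance with a second representation. A minimal sketch of such a comparison follows. Cosine similarity, the threshold value of 0.8, and the function names are assumptions chosen for illustration; the abstract states only that the determination is based on the two representations.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("representations must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("representations must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def verify_speaker(speaker_representation: Sequence[float],
                   second_representation: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Accept the utterance if the two representations are similar enough.

    The threshold is a hypothetical tuning parameter, not a value from the
    source document.
    """
    similarity = cosine_similarity(speaker_representation,
                                   second_representation)
    return similarity >= threshold


if __name__ == "__main__":
    enrolled = [0.2, 0.9, 0.1]   # made-up second (enrolled) representation
    candidate = [0.25, 0.85, 0.05]  # made-up representation of a new utterance
    # Access would be granted only when verification succeeds.
    print("access granted" if verify_speaker(candidate, enrolled) else "access denied")
```

A device following this sketch would grant access only when `verify_speaker` returns `True`; a real system would choose the similarity measure and threshold empirically.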
20 Claims
1. A system comprising:
a user device comprising one or more processors and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the user device to perform operations comprising:
receiving, by the user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method comprising:
receiving, by a user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.
Dependent claims: 14, 15, 16, 17, 18, 19
20. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors of a user device, cause the user device to perform operations comprising:
receiving, by a user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.
Specification