Improving speaker verification across locations, languages, and/or dialects

US 10,403,291 B2
Filed: 06/01/2018
Issued: 09/03/2019
Est. Priority Date: 07/15/2016
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, to facilitate language-independent speaker verification. In one aspect, a method includes actions of receiving, by a user device, audio data representing an utterance of a user. Other actions may include providing, to a neural network stored on the user device, input data derived from the audio data and a language identifier. The neural network may be trained using speech data representing speech in different languages or dialects. The method may include additional actions of generating, based on output of the neural network, a speaker representation and determining, based on the speaker representation and a second representation, that the utterance is an utterance of the user. The method may provide the user with access to the user device based on determining that the utterance is an utterance of the user.

166 Citations

20 Claims

1. A system comprising:
- a user device comprising one or more processors and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the user device to perform operations comprising:
  
  receiving, by the user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
  
  providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
  
  generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
  
  determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
  
  providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.
  - 2. The system of claim 1, wherein the set of input data derived from the audio data and the determined language identifier or location identifier includes:
    - a first vector that is derived from the audio data, and
      
      a second vector that is derived from a language identifier associated with the user device.
  - 3. The system of claim 2, the operations further comprising generating an input vector by concatenating the first vector and the second vector into a single concatenated vector;
    - wherein providing the set of input data comprises providing, to the neural network, the generated input vector; and
      
      wherein generating the first speaker representation comprises generating, based on output of the neural network produced in response to receiving the input vector, the first speaker representation indicative of characteristics of the voice of the user.
  - 4. The system of claim 2, the operations further comprising generating an input vector by concatenating the outputs of at least two other neural networks that respectively generate outputs based on (i) the first vector, (ii) the second vector, or (iii) both the first vector and the second vector;
    - wherein providing the set of input data comprises providing, to the neural network, the generated input vector; and
      
      wherein generating the first speaker representation comprises generating, based on output of the neural network produced in response to receiving the input vector, the first speaker representation indicative of characteristics of the voice of the user.
  - 5. The system of claim 2, the operations further comprising generating an input vector based on a weighted sum of the first vector and the second vector;
    - wherein providing the set of input data comprises providing, to the neural network, the generated input vector; and
      
      wherein generating the first speaker representation comprises generating, based on output of the neural network produced in response to receiving the input vector, the first speaker representation indicative of characteristics of the voice of the user.
  - 6. The system of claim 1, wherein the output of the neural network produced in response to receiving the set of input data includes a set of activations generated at a layer of the neural network that was used as a hidden layer of the neural network during training.
  - 7. The system of claim 1, wherein the second speaker representation is a speaker representation that was generated using output of the neural network generated in response to receiving a set of input data derived from audio data of the previous utterance and the language identifier or location identifier associated with the user device.
  - 8. The system of claim 1, wherein the second speaker representation is stored by the user device prior to receiving the audio data representing the utterance.
  - 9. The system of claim 1, wherein generating the first speaker representation indicative of characteristics of the voice of the user comprises:
    - obtaining a voiceprint from the neural network as the first speaker representation indicative of characteristics of the voice of the user.
  - 10. The system of claim 1, wherein the parameters of the neural network have been trained using training examples including training utterances of a particular word or phrase designated as the hotword for multiple different languages or locations, wherein the particular word or phrase has a different pronunciation in at least some of multiple different languages or locations.
  - 11. The system of claim 1, wherein the parameters of the neural network have been trained using training examples including training utterances of different words or phrases designated as hotwords for different languages or locations.
  - 12. The system of claim 1, wherein the operations comprise:
    - determining a language of the utterance based on the audio data representing the utterance; and
      
      determining the language identifier or location identifier based on determining the language of the utterance based on the audio data representing the utterance.
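
The input-vector options recited in claims 2, 3, 5, and 6 can be illustrated with a minimal Python sketch. This is not the patented implementation: the one-hot encoding of the language identifier, the `LANGUAGES` list, the function names, the ReLU layer, and the random weights are all assumptions made for illustration. The claims only require a first vector derived from the audio data, a second vector derived from a language identifier, and a neural network whose hidden-layer activations may serve as the speaker representation. The claim 4 variant, which concatenates the outputs of other neural networks, is not shown.

```python
import random

# Hypothetical set of supported languages; the claims do not enumerate any.
LANGUAGES = ["en-US", "en-GB", "fr-FR", "de-DE"]


def one_hot_language(lang_id, languages=LANGUAGES):
    """Second vector: one-hot encoding of a language identifier (assumed encoding)."""
    if lang_id not in languages:
        raise ValueError(f"unknown language identifier: {lang_id}")
    return [1.0 if lang == lang_id else 0.0 for lang in languages]


def concatenate(first_vector, second_vector):
    """Claim 3 style: join both vectors into a single concatenated input vector."""
    return list(first_vector) + list(second_vector)


def weighted_sum(first_vector, second_vector, w_first=0.5, w_second=0.5):
    """Claim 5 style: element-wise weighted sum (assumes equal-length vectors)."""
    if len(first_vector) != len(second_vector):
        raise ValueError("weighted sum needs vectors of equal length")
    return [w_first * a + w_second * b for a, b in zip(first_vector, second_vector)]


def dense_relu(inputs, weights, biases):
    """One fully connected layer with ReLU; its activations stand in for the
    hidden-layer output that claim 6 uses as the speaker representation."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    acoustic = [0.2, -0.1, 0.7]  # stand-in for features derived from the audio data
    language = one_hot_language("fr-FR")
    input_vector = concatenate(acoustic, language)
    # Untrained random weights, purely to make the sketch executable.
    hidden_weights = [[rng.uniform(-1, 1) for _ in input_vector] for _ in range(5)]
    hidden_biases = [0.0] * 5
    representation = dense_relu(input_vector, hidden_weights, hidden_biases)
    print(len(input_vector), len(representation))
```

Concatenation keeps the acoustic and language information in separate coordinates, whereas the weighted sum mixes them and therefore needs both vectors to share a dimension; how a real system would satisfy that constraint is not stated in the claims.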

13. A method comprising:
- receiving, by a user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
  
  providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
  
  generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
  
  determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
  
  providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.
  - 14. The method of claim 13, wherein the set of input data derived from the audio data and the determined language identifier or location identifier includes:
    - a first vector that is derived from the audio data, and
      
      a second vector that is derived from a language identifier associated with the user device.
  - 15. The method of claim 14, further comprising generating an input vector by concatenating the first vector and the second vector into a single concatenated vector;
    - wherein providing the set of input data comprises providing, to the neural network, the generated input vector; and
      
      wherein generating the first speaker representation comprises generating, based on output of the neural network produced in response to receiving the input vector, the first speaker representation indicative of characteristics of the voice of the user.
  - 16. The method of claim 14, further comprising generating an input vector by concatenating the outputs of at least two other neural networks that respectively generate outputs based on (i) the first vector, (ii) the second vector, or (iii) both the first vector and the second vector;
    - wherein providing the set of input data comprises providing, to the neural network, the generated input vector; and
      
      wherein generating the first speaker representation comprises generating, based on output of the neural network produced in response to receiving the input vector, the first speaker representation indicative of characteristics of the voice of the user.
  - 17. The method of claim 14, further comprising generating an input vector based on a weighted sum of the first vector and the second vector;
    - wherein providing the set of input data comprises providing, to the neural network, the generated input vector; and
      
      wherein generating the first speaker representation comprises generating, based on output of the neural network produced in response to receiving the input vector, the first speaker representation indicative of characteristics of the voice of the user.
  - 18. The method of claim 13, wherein the output of the neural network produced in response to receiving the set of input data includes a set of activations generated at a layer of the neural network that was used as a hidden layer of the neural network during training.
  - 19. The method of claim 18, wherein determining, based on the first speaker representation and a second representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user comprises:
    - determining a similarity measure indicating a similarity between the first representation and the second speaker representation; and
      
      determining that the similarity measure satisfies a threshold.
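
The comparison step in claim 19 (a similarity measure between the two speaker representations, checked against a threshold) might look like the following sketch. Cosine similarity, the function names, and the threshold value of 0.8 are assumptions chosen for illustration; the claim does not name a specific similarity measure or threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("representations must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("representations must be non-zero")
    return dot / (norm_a * norm_b)


def is_same_speaker(first_representation, second_representation, threshold=0.8):
    """Accept the utterance when the similarity satisfies the (assumed) threshold."""
    return cosine_similarity(first_representation, second_representation) >= threshold


if __name__ == "__main__":
    enrolled = [0.9, 0.1, 0.4]     # hypothetical stored (second) representation
    candidate = [0.8, 0.2, 0.5]    # hypothetical representation of the new utterance
    print(is_same_speaker(candidate, enrolled))
```

In practice the threshold trades false accepts against false rejects, so any deployed value would be tuned on held-out data rather than fixed as it is here.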

20. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors of a user device, cause the user device to perform operations comprising:
- receiving, by a user device, audio data representing an utterance, spoken by a user, of a predetermined word or phrase designated as a hotword for a language or location associated with the user, wherein the user device is configured to perform an action or change a state of the user device in response to detecting an utterance of the hotword;
  
  providing, as input to a neural network stored on the user device, a set of input data derived from the audio data and a language identifier or location identifier associated with the user device, the neural network having parameters trained using speech data representing speech in different languages or different dialects, wherein the parameters of the neural network have been trained using training examples including (i) training utterances of words or phrases designated as hotwords for different languages or locations, and (ii) language identifiers or location identifiers for languages or locations of the respective speakers of the training utterances;
  
  generating, based on output of the neural network produced in response to receiving the set of input data, a first speaker representation indicative of characteristics of the voice of the user;
  
  determining, based on the first speaker representation and a second speaker representation, that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user, wherein the second speaker representation is derived from a previous utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user; and
  
  providing the user access to the user device based on determining that the utterance is an utterance, spoken by the user, of the predetermined word or phrase designated as a hotword for the language or location associated with the user.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Moreno, Ignacio Lopez; Wan, Li; Wang, Quan
Primary Examiner(s): Roberts, Shaun
Application Number: US15/995,480
Publication Number: US 20180277124A1
Time in Patent Office: 459 Days
Field of Search: 704246, 704232

CPC Class Codes:
G10L 17/02   Preprocessing operations, e...
G10L 17/08   Use of distortion metrics o...
G10L 17/14   Use of phonemic categorisat...
G10L 17/18   Artificial neural networks;...
G10L 17/22   Interactive procedures; Man...
G10L 17/24   the user being prompted to ...
