Speaker verification using neural networks

US 9,401,148 B2
Filed: 03/28/2014
Issued: 07/26/2016
Est. Priority Date: 11/04/2013
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for inputting speech data that corresponds to a particular utterance to a neural network; determining an evaluation vector based on output at a hidden layer of the neural network; comparing the evaluation vector with a reference vector that corresponds to a past utterance of a particular speaker; and based on comparing the evaluation vector and the reference vector, determining whether the particular utterance was likely spoken by the particular speaker.
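
The evaluation-vector step in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the layer sizes, the ReLU activation, the random weights, and the names `hidden_activations` and `evaluation_vector` are all hypothetical. Averaging over frames follows the frame-averaging variant recited in claim 7.

```python
import random

def relu(x):
    # ReLU is an assumption; the source does not name an activation function.
    return x if x > 0.0 else 0.0

def hidden_activations(frame, hidden_layers):
    """Propagate one frame through the hidden layers only (no output layer)."""
    h = frame
    for weights, bias in hidden_layers:
        # Each row of `weights` holds the incoming weights of one unit.
        h = [relu(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weights, bias)]
    return h

def evaluation_vector(frames, hidden_layers):
    """Average the per-frame activations at the last hidden layer."""
    acts = [hidden_activations(f, hidden_layers) for f in frames]
    return [sum(col) / len(acts) for col in zip(*acts)]

def random_layer(n_in, n_out):
    # Random weights stand in for a network trained on other speakers.
    weights = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

random.seed(0)
hidden_layers = [random_layer(4, 6), random_layer(6, 3)]
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]
vec = evaluation_vector(frames, hidden_layers)  # one 3-dimensional vector
```

Because the output layer is never evaluated, the vector depends only on the hidden layers, which is consistent with the claims' emphasis on activations at a hidden layer.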

18 Claims

1. A method comprising:
- inputting, by a computing device, speech data that corresponds to a particular utterance of a particular speaker to a neural network having parameters trained based on propagation between an input layer and an output layer through one or more hidden layers located between the input layer and the output layer, wherein the one or more hidden layers were trained using utterances of multiple speakers, and wherein the multiple speakers do not include the particular speaker;
  
  generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, a representation of activations occurring at a particular layer of the neural network that was trained as one of the hidden layers located between the input layer and the output layer;
  
  comparing, by the computing device, the generated representation of activations occurring at the particular layer of the neural network in response to the speech data that corresponds to the particular utterance with a reference representation of activations occurring at the particular layer of the neural network in response to speech data that corresponds to one or more past utterances of the particular speaker;
  
  based on comparing the generated representation and the reference representation, determining, by the computing device, that the particular utterance was likely spoken by the particular speaker; and
  
  providing, by the computing device, access to the computing device based on determining that the particular utterance was likely spoken by the particular speaker.
  - 2. The method of claim 1, wherein comparing, by the computing device, the generated representation with the reference representation comprises determining, by the computing device, a distance between the generated representation and the reference representation, and wherein determining, by the computing device, that the particular utterance was spoken by the particular speaker comprises determining, by the computing device, that the distance between the generated representation and the reference representation satisfies a threshold.
  - 3. The method of claim 2, wherein determining, by the computing device, a distance between the generated representation and the reference representation comprises computing, by the computing device, a cosine distance between the generated representation and the reference representation.
  - 4. The method of claim 1, wherein generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, the representation of activations occurring at the particular layer of the neural network that was trained as one of the hidden layers located between the input layer and the output layer comprises generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, a representation of activations occurring at a particular layer of the neural network that was trained as one of the hidden layers located adjacent to the output layer.
  - 5. The method of claim 1, wherein generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, the representation of activations occurring at the particular layer of the neural network that was trained as one of the hidden layers located between the input layer and the output layer comprises generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, the representation of activations occurring at a particular layer of the neural network that was trained as a predetermined one of the hidden layers located between the input layer and the output layer.
  - 6. The method of claim 1, comprising:
    - obtaining, by the computing device, access to the neural network;
      
      for each of multiple utterances of the particular speaker:
      
      inputting, by the computing device, speech data corresponding to the respective utterance to the neural network; and
      
      generating, by the computing device, a representation of activations occurring at the particular layer of the neural network in response to the speech data corresponding to the respective utterance;
      
      combining, by the computing device, the generated representations of activations occurring at the particular layer of the neural network in response to speech data corresponding to each of the multiple utterances of the particular speaker; and
      
      using, by the computing device, the combination of generated representations of activations occurring at the particular layer of the neural network in response to speech data corresponding to each of the multiple utterances of the particular speaker as the reference representation.
  - 7. The method of claim 1, further comprising dividing, by the computing device, the speech data corresponding to the particular utterance into frames; and
    - wherein generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, the representation of activations occurring at the particular layer of the neural network comprises:
      
      determining, by the computing device and for each of multiple different frames of the speech data, a corresponding set of activations occurring at the particular layer of the neural network based on the frame; and
      
      generating, by the computing device, the representation of the activations occurring at the particular layer by averaging the sets of activations that respectively correspond to the multiple different frames.
  - 8. The method of claim 1, wherein generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, the representation of activations occurring at the particular layer of the neural network comprises:
    - generating, by the computing device, the representation of activations occurring at the particular layer of the neural network (i) in response to inputting the speech data that corresponds to the particular utterance of the neural network, and (ii) irrespective of any activations occurring downstream from the particular layer in response to inputting the speech data that corresponds to the particular utterance of the neural network.
  - 9. The method of claim 8, wherein inputting, by the computing device, speech data that corresponds to the particular utterance to the neural network having parameters trained based on propagation between the input layer and the output layer through one or more hidden layers located between the input layer and the output layer comprises:
    - inputting, by the computing device, speech data that corresponds to the particular utterance to a neural network whose layers have been trained based on activations occurring at the output layer.
  - 10. The method of claim 1, wherein the representation of the activations at the particular layer is a vector that indicates the activations at the particular layer.
  - 11. The method of claim 1, wherein the input layer, the output layer, and the one or more hidden layers are included in a trained neural network;
    - wherein inputting the speech data comprises inputting the speech data to a neural network that includes a subset of the layers of the trained neural network and excludes the output layer of the trained neural network used during training of the trained neural network; and
      
      wherein generating the representation comprises generating the representation of activations of a particular layer of the neural network that includes the subset of the layers of the trained neural network and excludes the output layer of the trained neural network.
  - 12. The method of claim 1, wherein inputting the speech data comprises inputting the speech data to a neural network having parameters determined through supervised training of a first neural network including the input layer, the output layer, and the one or more hidden layers; and
    - wherein generating the representation of activations comprises generating the representation of activations occurring at a particular layer of the neural network having parameters determined through supervised training of the first neural network.
  - 13. The method of claim 1, wherein inputting the speech data comprises inputting the speech data to a neural network having parameters determined through training of a first neural network including the input layer, the output layer, and the one or more hidden layers using predetermined output targets for outputs at the output layer that correspond to different training inputs; and
    - wherein generating the representation of activations comprises generating the representation of activations occurring at a particular layer of the neural network having parameters determined through training of the first neural network using predetermined output targets for outputs at the output layer that correspond to different training inputs.
  - 14. The method of claim 1, wherein the computing device is a mobile phone on which data for the neural network is stored, and wherein the method comprises processing, by the mobile phone, propagation of data through the neural network to determine the activations at the particular layer in response to inputting the speech data to the neural network.
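
Claims 2 and 3 describe the comparison as a distance, specifically a cosine distance, checked against a threshold. The sketch below illustrates one possible reading and is not taken from the patent: cosine distance is computed as 1 minus cosine similarity, "satisfies a threshold" is interpreted as "distance at or below the threshold", and the threshold value 0.3 and all function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def is_likely_same_speaker(evaluation, reference, threshold=0.3):
    """Accept when the distance is at or below the (hypothetical) threshold."""
    return cosine_distance(evaluation, reference) <= threshold

# Toy vectors standing in for hidden-layer activation representations.
reference = [0.9, 0.1, 0.4]
accepted = is_likely_same_speaker([0.8, 0.2, 0.5], reference)    # similar direction
rejected = is_likely_same_speaker([-0.9, 0.3, -0.4], reference)  # opposite direction
```

In a real system the threshold would be tuned on held-out data to trade off false accepts against false rejects; the claims leave that choice open.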

15. A non-transitory computer-readable medium storing software having stored thereon instructions, which, when executed by one or more computers, cause the one or more computers to perform operations of:
- inputting, by a computing device, speech data that corresponds to a particular utterance of a particular speaker to a neural network having parameters trained based on propagation between an input layer and an output layer through one or more hidden layers located between the input layer and the output layer, wherein the one or more hidden layers were trained using utterances of multiple speakers, and wherein the multiple speakers do not include the particular speaker;
  
  generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, a representation of activations occurring at a particular layer of the neural network that was trained as one of the hidden layers located between the input layer and the output layer;
  
  comparing, by the computing device, the generated representation of activations occurring at the particular layer of the neural network in response to the speech data that corresponds to the particular utterance with a reference representation of activations occurring at the particular layer of the neural network in response to speech data that corresponds to one or more past utterances of the particular speaker;
  
  based on comparing the generated representation and the reference representation, determining, by the computing device, that the particular utterance was likely spoken by the particular speaker; and
  
  providing, by the computing device, access to the computing device based on determining that the particular utterance was likely spoken by the particular speaker.
  - 16. The non-transitory computer-readable medium of claim 15, wherein comparing, by the computing device, the generated representation with the reference representation comprises determining, by the computing device, a distance between the generated representation and the reference representation, and wherein determining, by the computing device, that the particular utterance was spoken by the particular speaker comprises determining, by the computing device, that the distance between the generated representation and the reference representation satisfies a threshold.
  - 17. The non-transitory computer-readable medium of claim 15, wherein the operations comprise:
    - obtaining, by the computing device, access to the neural network;
      
      for each of multiple utterances of the particular speaker:
      
      inputting, by the computing device, speech data corresponding to the respective utterance to the neural network; and
      
      generating, by the computing device, a representation of activations occurring at the particular layer of the neural network in response to the speech data corresponding to the respective utterance;
      
      combining, by the computing device, the generated representations of activations occurring at the particular layer of the neural network in response to speech data corresponding to each of the multiple utterances of the particular speaker; and
      
      using, by the computing device, the combination of generated representations of activations occurring at the particular layer of the neural network in response to speech data corresponding to each of the multiple utterances of the particular speaker as the reference representation.
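
Claim 17, like claim 6, builds the reference representation by combining the representations generated for several utterances of the same speaker. The claims do not say how the representations are combined; the sketch below assumes an element-wise mean, and `build_reference` is a hypothetical name. Each input is assumed to be an already-computed activation vector for one enrollment utterance.

```python
def build_reference(utterance_vectors):
    """Combine per-utterance activation vectors into one reference vector.

    Element-wise averaging is an assumption; the claims only say "combining".
    """
    if not utterance_vectors:
        raise ValueError("need at least one enrollment utterance")
    dim = len(utterance_vectors[0])
    if any(len(v) != dim for v in utterance_vectors):
        raise ValueError("all vectors must have the same length")
    count = len(utterance_vectors)
    return [sum(component) / count for component in zip(*utterance_vectors)]

# Three toy enrollment vectors from the same (hypothetical) speaker.
enrollment = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0], [2.0, 4.0, 2.0]]
reference = build_reference(enrollment)
```

Averaging keeps the reference in the same space as a single-utterance representation, so the same distance measure can be applied at verification time.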

18. A system comprising:
- one or more processors and one or more computer storage media storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising:
  
  inputting, by a computing device, speech data that corresponds to a particular utterance of a particular speaker to a neural network having parameters trained based on propagation between an input layer and an output layer through one or more hidden layers located between the input layer and the output layer, wherein the one or more hidden layers were trained using utterances of multiple speakers, and wherein the multiple speakers do not include the particular speaker;
  
  generating, by the computing device and in response to inputting the speech data that corresponds to the particular utterance to the neural network, a representation of activations occurring at a particular layer of the neural network that was trained as one of the hidden layers located between the input layer and the output layer;
  
  comparing, by the computing device, the generated representation of activations occurring at the particular layer of the neural network in response to the speech data that corresponds to the particular utterance with a reference representation of activations occurring at the particular layer of the neural network in response to speech data that corresponds to one or more past utterances of the particular speaker;
  
  based on comparing the generated representation and the reference representation, determining, by the computing device, that the particular utterance was likely spoken by the particular speaker; and
  
  providing, by the computing device, access to the computing device based on determining that the particular utterance was likely spoken by the particular speaker.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Lei, Xin; McDermott, Erik; Variani, Ehsan; Moreno, Ignacio L.
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Kim, Jonathan C
Application Number: US14/228,469
Publication Number: US 20150127336A1
Time in Patent Office: 851 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes: G10L 17/18 Artificial neural networks;...
