Joint speaker authentication and key phrase identification

US 10,476,872 B2
Filed: 02/02/2016
Issued: 11/12/2019
Est. Priority Date: 02/20/2015
Status: Active Grant



Abstract

A spoken command analyzer computing system includes technologies configured to analyze information extracted from a speech sample and, using a joint speaker and phonetic content model, both determine whether the analyzed speech includes certain content (e.g., a command) and to identify the identity of the human speaker of the speech. In response to determining that the identity matches the authorized user's identity and determining that the analyzed speech includes the modeled content (e.g., command), an action corresponding to the verified content (e.g., command) is performed by an associated device.
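The two-part condition in the abstract (the speaker matches the authorized user and the speech contains modeled content) can be illustrated with a minimal sketch. The names `AnalysisResult`, `action_for`, and the `actions` table are hypothetical and do not come from the patent; the sketch assumes the analyzer has already produced a speaker hypothesis and a detected command.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AnalysisResult:
    """Hypothetical analyzer output: who spoke and which modeled command was heard."""
    speaker_id: Optional[str]
    command: Optional[str]

def action_for(result, authorized_user, actions):
    """Return the action for the verified command, or None if either check fails."""
    if result.speaker_id is None or result.speaker_id != authorized_user:
        return None  # identity does not match the authorized user
    if result.command is None:
        return None  # speech did not contain modeled content
    return actions.get(result.command)

actions = {"unlock": "UNLOCK_DOOR"}  # hypothetical command-to-action table
granted = action_for(AnalysisResult("alice", "unlock"), "alice", actions)
denied = action_for(AnalysisResult("bob", "unlock"), "alice", actions)
```

In this sketch both checks must pass before any action is returned, so a correct phrase from the wrong speaker and a wrong phrase from the right speaker are both rejected.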


33 Claims

1. A spoken command analyzer module comprising instructions embodied in one or more non-transitory machine accessible storage media, the spoken command analyzer module configured to cause a computing system comprising one or more computing devices to:
- extract acoustic features from a speech sample;
  
  in response to input of the acoustic features to a neural network, receive, from the neural network, a temporal sequence of bottleneck features;
  
  wherein the neural network is trained to discriminate between classes of phonetic units;
  
  compute statistics using a combination of the acoustic features and the temporal sequence of bottleneck features;
  
  using the statistics, identify a command contained in the speech sample;
  
  using the statistics, identify a speaker of the command;
  
  in response to a comparison of the command and the speaker to a stored model, output, to a device, data that is used by the device to execute an action.
  - 2. The spoken command analyzer module of claim 1, wherein the neural network is language-independent.
  - 3. The spoken command analyzer module of claim 1, wherein the command comprises any combination of speech-based phonetics.
  - 4. The spoken command analyzer module of claim 1, wherein the command is one of a plurality of commands that correspond to different actions that may be taken by the device.
  - 5. The spoken command analyzer module of claim 1, wherein the data comprises an instruction to execute the action.
  - 6. The spoken command analyzer module of claim 1, wherein the speech sample comprises non-recorded live speech of a human speaker.
  - 7. The spoken command analyzer module of claim 1, wherein the speech sample comprises speech of multiple different speakers.
  - 8. The spoken command analyzer module of claim 1, wherein the acoustic features comprise any one or more of the following:
    - cepstral features, Mel frequency cepstral coefficient (MFCC) features, pcaDCT features.
  - 9. The spoken command analyzer module of claim 1, wherein the neural network comprises an input layer and an output layer, and the bottleneck features are extracted from a hidden layer of the neural network that is closer to the input layer than the output layer.
  - 10. The spoken command analyzer module of claim 1, wherein the neural network comprises an input layer and an output layer, and the bottleneck features are extracted from a hidden layer of the neural network that is closer to the output layer than the input layer.
  - 11. The spoken command analyzer module of claim 1, wherein the stored model is created using speech samples obtained during an enrollment process.
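The feature steps recited in claim 1 (acoustic features in, a temporal sequence of bottleneck features out, then statistics over the combination) can be sketched roughly as follows. This is an illustration under loud assumptions: the network uses random untrained weights rather than one trained to discriminate classes of phonetic units, the acoustic features are toy vectors rather than MFCCs, and the statistics are a simple per-dimension mean and standard deviation, which the claims do not specify. All function names are hypothetical.

```python
import math
import random

def make_layer(rng, n_in, n_out):
    """Random weight matrix standing in for a trained layer (illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward(layer, x):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in layer]

def bottleneck_sequence(frames, layers, bottleneck_index):
    """Feed each frame forward and keep the activations of one hidden layer."""
    sequence = []
    for frame in frames:
        hidden = frame
        for layer in layers[: bottleneck_index + 1]:
            hidden = forward(layer, hidden)
        sequence.append(hidden)
    return sequence

def utterance_statistics(acoustic, bottleneck):
    """Concatenate both feature streams per frame, then take mean and std over time."""
    combined = [a + b for a, b in zip(acoustic, bottleneck)]
    n, dim = len(combined), len(combined[0])
    means = [sum(f[d] for f in combined) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in combined) / n) for d in range(dim)]
    return means + stds

rng = random.Random(0)
acoustic = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]  # 5 toy frames, 4 dims
layers = [make_layer(rng, 4, 8), make_layer(rng, 8, 3), make_layer(rng, 3, 8), make_layer(rng, 8, 6)]
bottleneck = bottleneck_sequence(acoustic, layers, bottleneck_index=1)  # 3-dim hidden layer
stats = utterance_statistics(acoustic, bottleneck)
print(len(bottleneck), len(stats))
```

The `bottleneck_index` parameter selects which hidden layer supplies the bottleneck features, so the same sketch covers a layer nearer the input (claim 9) or nearer the output (claim 10). With 4 acoustic and 3 bottleneck dimensions, the statistics vector has 2 × (4 + 3) = 14 entries.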

12. A method, comprising:
- extracting acoustic features from a speech sample;
  
  in response to inputting of the acoustic features to a neural network, receiving, from the neural network, bottleneck features;
  
  wherein the neural network is trained to discriminate between different classes of phonetic units;
  
  computing statistics using a combination of the acoustic features and the bottleneck features;
  
  using the statistics, identifying a command contained in the speech sample;
  
  using the statistics, identifying a speaker of the command;
  
  in response to a comparison of the command and the speaker to a stored model, outputting, to a device, data that is used by the device to execute an action;
  
  wherein the method is performed by one or more computing devices.
  - 13. The method of claim 12, wherein the neural network is language-independent.
  - 14. The method of claim 12, wherein the command comprises any combination of speech-based phonetics.
  - 15. The method of claim 12, wherein the command is one of a plurality of commands that correspond to different actions that may be taken by the device.
  - 16. The method of claim 12, wherein the data comprises an instruction to execute the action.
  - 17. The method of claim 12, wherein the speech sample comprises non-recorded live speech of a human speaker.
  - 18. The method of claim 12, wherein the speech sample comprises speech of multiple different speakers.
  - 19. The method of claim 12, wherein the acoustic features comprise any one or more of the following:
    - cepstral features, Mel frequency cepstral coefficient (MFCC) features, pcaDCT features.
  - 20. The method of claim 12, wherein the neural network comprises an input layer and an output layer, and the bottleneck features are extracted from a hidden layer of the neural network that is closer to the input layer than the output layer.
  - 21. The method of claim 12, wherein the neural network comprises an input layer and an output layer, and the bottleneck features are extracted from a hidden layer of the neural network that is closer to the output layer than the input layer.
  - 22. The method of claim 12, wherein the stored model is created using speech samples obtained during an enrollment process.
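The comparison against a stored model built during enrollment (claims 12 and 22) might look like the sketch below. It assumes, purely for illustration, that each stored model is the average of a few enrollment statistics vectors keyed by a (speaker, command) pair, and that matching uses cosine similarity with a fixed threshold. Neither the scoring function nor the threshold comes from the patent, and `enroll`, `cosine`, and `best_match` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def enroll(samples):
    """Average several enrollment statistics vectors into one stored model vector."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def best_match(stats, stored, threshold=0.9):
    """Return the (speaker, command) key of the closest model, or None below threshold."""
    best_key, best_score = None, threshold
    for key, model in stored.items():
        score = cosine(stats, model)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key

stored = {
    ("alice", "unlock"): enroll([[1.0, 0.0, 0.2], [0.9, 0.1, 0.2]]),
    ("bob", "unlock"): enroll([[0.0, 1.0, 0.2], [0.1, 0.9, 0.2]]),
}
accepted = best_match([0.95, 0.05, 0.2], stored)
rejected = best_match([0.5, 0.5, 0.2], stored)
```

Keying the stored model on the speaker and command together means a single comparison decides both questions at once, which is one way to read the joint comparison in the claim. A test vector that sits between two enrolled speakers falls below the threshold and is rejected.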

23. An apparatus, comprising:
- at least one computing device;
  
  wherein the at least one computing device is coupled to a sound capture device;
  
  wherein the at least one computing device is configured to:
  
  extract time-aligned acoustic features from a speech sample captured by the sound capture device;
  
  in response to input of the time-aligned acoustic features to a neural network, receive from the neural network, bottleneck features;
  
  wherein the neural network is trained to discriminate between classes of phonetic units;
  
  compute statistics using a combination of the acoustic features and the bottleneck features;
  
  using the statistics, identify a command contained in the speech sample;
  
  using the statistics, identify a speaker of the command;
  
  in response to a comparison of the command and the speaker to a stored model, output, to a device, data that is used by the at least one computing device to execute an action.
  - 24. The apparatus of claim 23, wherein the neural network is language-independent.
  - 25. The apparatus of claim 23, wherein the command comprises any combination of speech-based phonetics.
  - 26. The apparatus of claim 23, wherein the command is one of a plurality of commands that correspond to different actions that may be taken by the device.
  - 27. The apparatus of claim 23, wherein the data comprises an instruction to execute the action.
  - 28. The apparatus of claim 23, wherein the speech sample comprises non-recorded live speech of a human speaker.
  - 29. The apparatus of claim 23, wherein the speech sample comprises speech of multiple different speakers.
  - 30. The apparatus of claim 23, wherein the acoustic features comprise any one or more of the following:
    - cepstral features, Mel frequency cepstral coefficient (MFCC) features, pcaDCT features.
  - 31. The apparatus of claim 23, wherein the neural network comprises an input layer and an output layer, and the bottleneck features are extracted from a hidden layer of the neural network that is closer to the input layer than the output layer.
  - 32. The apparatus of claim 23, wherein the neural network comprises an input layer and an output layer, and the bottleneck features are extracted from a hidden layer of the neural network that is closer to the output layer than the input layer.
  - 33. The apparatus of claim 23, wherein the stored model is created using speech samples obtained during an enrollment process.
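Claim 23 recites time-aligned acoustic features extracted from audio captured by a sound capture device. One common way to obtain frame-level features that share a time index is to slice the waveform into fixed-length overlapping frames. The sketch below assumes a 16 kHz signal, 25 ms frames, and a 10 ms hop, and uses log energy as a stand-in feature. These values and the helper names are assumptions, not details from the patent, and the patent may mean something different by time alignment.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a waveform into overlapping frames; frame i starts at sample i * hop."""
    last_start = len(samples) - frame_len
    return [samples[s : s + frame_len] for s in range(0, last_start + 1, hop)]

def log_energy(frame):
    """Log of the frame energy, with a small floor to avoid log(0)."""
    return math.log(sum(x * x for x in frame) + 1e-10)

sample_rate = 16000                      # assumed capture rate in Hz
frame_len = int(0.025 * sample_rate)     # 25 ms -> 400 samples
hop = int(0.010 * sample_rate)           # 10 ms -> 160 samples

# 0.1 s synthetic 440 Hz tone standing in for captured speech
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
frames = frame_signal(signal, frame_len, hop)
features = [[log_energy(f)] for f in frames]  # one feature vector per frame index
```

Because every feature vector is tied to a frame index, any other per-frame stream computed from the same frames lines up with it one to one.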


Current Assignee: SRI International, Inc.
Original Assignee: SRI International, Inc.
Inventors: McLaren, Mitchell Leigh; Lawson, Aaron Dennis
Primary Examiner(s): Patel, Ashokkumar B
Assistant Examiner(s): Ahsan, Syed M
Application Number: US15/013,580
Publication Number: US 20160248768A1
Time in Patent Office: 1,379 Days
CPC Class Codes

G10L 15/16   using artificial neural net...

G10L 15/183   using context dependencies,...

G10L 15/22   Procedures used during a sp...

G10L 17/18   Artificial neural networks;...

G10L 17/22   Interactive procedures; Man...

G10L 2015/223   Execution procedure of a sp...

H04L 63/0861   using biometrical features,...

H04L 63/10   for controlling access to d...

H04L 63/102   Entity profiles
