Securely executing voice actions with speaker identification and authentication input types
First Claim
1. A method performed by a voice action server, the method comprising:
receiving, by the voice action server, (i) audio data representing a voice command spoken by a speaker and (ii) contextual data from a client device of the speaker, the contextual data indicating a status of the client device and comprising data values representing contextual signals that can authenticate the speaker without requiring the speaker to provide explicit authentication information;
identifying, by the voice action server, the speaker based on the audio data representing the voice command;
selecting, by the voice action server, a voice action based at least on a transcription of the audio data;
selecting, by the voice action server, a third-party service provider from among a plurality of different third-party service providers, wherein the third-party service provider is selected by obtaining a mapping of voice actions to the plurality of third-party service providers, the mapping indicating that the selected third-party service provider can perform the selected voice action, wherein the selected third-party service provider is configured to perform multiple voice actions, and wherein the selected third-party service provider requires different combinations of input data to perform authentication for at least some of the multiple voice actions;
identifying, by the voice action server, one or more input authentication data types that the selected third-party service provider uses to perform authentication for the selected voice action, wherein the identified one or more input authentication data types for the selected action are different from one or more input authentication data types that the selected third-party service provider uses to perform authentication for at least one other voice action;
obtaining, by the voice action server without requiring the speaker to provide explicit authentication information, one or more authentication data values representing contextual signals from the received contextual data that correspond to the identified one or more input authentication data types; and
providing, to the third-party service provider by the voice action server over a network, (i) a request to perform the selected voice action, (ii) a speaker identification result determined based on the audio data representing the voice command, and (iii) the obtained one or more authentication data values from the received contextual data, wherein the speaker identification result and the one or more obtained authentication data values enable the selected third-party service provider to authenticate the speaker and perform the selected voice action.
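The selection steps above turn on a mapping from voice actions to providers, plus a per-action list of the authentication data types each provider requires. A minimal sketch of those lookups follows; all names and data (`VOICE_ACTION_PROVIDERS`, `AUTH_DATA_TYPES`, the example actions) are hypothetical illustrations, not taken from the patent.

```python
# Hypothetical mapping of voice actions to the third-party service
# providers that can perform them (the claim's "mapping of voice actions
# to the plurality of third-party service providers").
VOICE_ACTION_PROVIDERS = {
    "unlock_door": "home_provider",
    "set_thermostat": "home_provider",
    "order_ride": "ride_provider",
}

# A single provider may require different combinations of input
# authentication data types for different voice actions.
AUTH_DATA_TYPES = {
    ("home_provider", "unlock_door"): ["device_location", "screen_lock_state"],
    ("home_provider", "set_thermostat"): ["device_location"],
    ("ride_provider", "order_ride"): ["device_location"],
}

def select_provider(voice_action: str) -> str:
    """Select the provider the mapping indicates can perform the action."""
    return VOICE_ACTION_PROVIDERS[voice_action]

def required_auth_types(provider: str, voice_action: str) -> list:
    """Identify the input authentication data types the provider uses to
    authenticate this particular voice action."""
    return AUTH_DATA_TYPES[(provider, voice_action)]

provider = select_provider("unlock_door")
print(provider)  # home_provider
print(required_auth_types(provider, "unlock_door"))
# ['device_location', 'screen_lock_state']
```

Note how the same hypothetical provider ("home_provider") requires different authentication inputs for "unlock_door" than for "set_thermostat", matching the claim's requirement that auth data types differ across at least some actions.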
Abstract
In some implementations, (i) audio data representing a voice command spoken by a speaker and (ii) a speaker identification result indicating that the voice command was spoken by the speaker are obtained. A voice action is selected based at least on a transcription of the audio data. A service provider corresponding to the selected voice action is selected from among a plurality of different service providers. One or more input data types that the selected service provider uses to perform authentication for the selected voice action are identified. (i) A request to perform the selected voice action and (ii) one or more values that correspond to the identified one or more input data types are provided to the service provider.
23 Claims
1. A method performed by a voice action server, the method comprising:
receiving, by the voice action server, (i) audio data representing a voice command spoken by a speaker and (ii) contextual data from a client device of the speaker, the contextual data indicating a status of the client device and comprising data values representing contextual signals that can authenticate the speaker without requiring the speaker to provide explicit authentication information;
identifying, by the voice action server, the speaker based on the audio data representing the voice command;
selecting, by the voice action server, a voice action based at least on a transcription of the audio data;
selecting, by the voice action server, a third-party service provider from among a plurality of different third-party service providers, wherein the third-party service provider is selected by obtaining a mapping of voice actions to the plurality of third-party service providers, the mapping indicating that the selected third-party service provider can perform the selected voice action, wherein the selected third-party service provider is configured to perform multiple voice actions, and wherein the selected third-party service provider requires different combinations of input data to perform authentication for at least some of the multiple voice actions;
identifying, by the voice action server, one or more input authentication data types that the selected third-party service provider uses to perform authentication for the selected voice action, wherein the identified one or more input authentication data types for the selected action are different from one or more input authentication data types that the selected third-party service provider uses to perform authentication for at least one other voice action;
obtaining, by the voice action server without requiring the speaker to provide explicit authentication information, one or more authentication data values representing contextual signals from the received contextual data that correspond to the identified one or more input authentication data types; and
providing, to the third-party service provider by the voice action server over a network, (i) a request to perform the selected voice action, (ii) a speaker identification result determined based on the audio data representing the voice command, and (iii) the obtained one or more authentication data values from the received contextual data, wherein the speaker identification result and the one or more obtained authentication data values enable the selected third-party service provider to authenticate the speaker and perform the selected voice action.
Dependent claims: 2-12.
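The claim's "obtaining" and "providing" steps pull only the contextual signals the provider needs and package them with the speaker identification result into the request sent over the network. A sketch under assumed, hypothetical field names (`device_location`, `screen_lock_state`, `speaker:alice`) follows; none of these names come from the patent.

```python
def obtain_auth_values(contextual_data: dict, auth_types: list) -> dict:
    """Select only the contextual signals matching the provider's required
    input authentication data types; the speaker is never asked to supply
    explicit credentials."""
    return {t: contextual_data[t] for t in auth_types if t in contextual_data}

def build_provider_request(voice_action: str, speaker_id_result: str,
                           auth_values: dict) -> dict:
    """Assemble the request to the third-party provider: the action, the
    speaker identification result, and the authentication values drawn
    from contextual data."""
    return {
        "action": voice_action,
        "speaker_identification": speaker_id_result,
        "authentication_values": auth_values,
    }

# Example: the client device reported more context than this action needs;
# only the required signals are forwarded.
contextual_data = {
    "device_location": "home",
    "screen_lock_state": "unlocked",
    "battery_level": 83,
}
auth_values = obtain_auth_values(
    contextual_data, ["device_location", "screen_lock_state"])
request = build_provider_request("unlock_door", "speaker:alice", auth_values)
print(request["authentication_values"])
# {'device_location': 'home', 'screen_lock_state': 'unlocked'}
```

Filtering to the required types keeps extraneous device state (here, battery level) out of the request, so the provider receives exactly the combination of inputs it declared for this action.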
13. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause a voice action server to perform operations comprising:
receiving (i) audio data representing a voice command spoken by a speaker and (ii) contextual data from a client device of the speaker, the contextual data indicating a status of the client device and providing data values representing contextual signals that can authenticate the speaker without requiring the speaker to provide explicit authentication information;
identifying the speaker from the audio data representing the voice command;
selecting a voice action based at least on a transcription of the audio data;
selecting a third-party service provider from among a plurality of different third-party service providers, wherein the third-party service provider is selected by obtaining a mapping of voice actions to the plurality of third-party service providers, the mapping indicating that the selected third-party service provider can perform the selected voice action, wherein the selected third-party service provider is configured to perform multiple voice actions, and wherein the selected third-party service provider requires different combinations of input data to perform authentication for at least some of the multiple voice actions;
identifying one or more input authentication data types that the selected third-party service provider uses to perform authentication for the selected voice action, wherein the identified one or more input authentication data types for the selected action are different from one or more input data types that the selected third-party service provider uses to perform authentication for at least one other voice action;
obtaining, without requiring the speaker to provide explicit authentication information, one or more authentication data values representing contextual signals from the received contextual data that correspond to the identified one or more input authentication data types; and
providing, to the third-party service provider over a network, (i) a request to perform the selected voice action, (ii) a speaker identification result determined based on the audio data representing the voice command, and (iii) the obtained one or more authentication data values from the received contextual data, wherein the speaker identification result and the one or more obtained authentication data values enable the selected third-party service provider to authenticate the speaker and perform the selected voice action.
Dependent claims: 14-21.
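On the receiving end, the claims state that the speaker identification result and the contextual authentication values together enable the provider to authenticate the speaker. A minimal provider-side sketch of that check, with hypothetical names throughout:

```python
def provider_authenticate(request: dict, enrolled_speaker: str,
                          required_types: list) -> bool:
    """Provider-side sketch: accept the voice action only if the speaker
    identification result matches an enrolled speaker and every required
    input authentication data type was supplied with the request."""
    if request.get("speaker_identification") != enrolled_speaker:
        return False
    supplied = request.get("authentication_values", {})
    return all(t in supplied for t in required_types)

ok = provider_authenticate(
    {"action": "unlock_door",
     "speaker_identification": "speaker:alice",
     "authentication_values": {"device_location": "home",
                               "screen_lock_state": "unlocked"}},
    enrolled_speaker="speaker:alice",
    required_types=["device_location", "screen_lock_state"],
)
print(ok)  # True
```

If either signal the provider declared for this action is missing, or the speaker identification result does not match, the sketch rejects the request, reflecting the claims' requirement that both inputs are needed to authenticate and perform the action.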
22. A non-transitory computer-readable storage medium storing a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving (i) audio data representing a voice command spoken by a speaker and (ii) contextual data from a client device of the speaker, the contextual data indicating a status of the client device and providing data values representing contextual signals that can authenticate the speaker without requiring the speaker to provide explicit authentication information;
identifying the speaker from the audio data representing the voice command;
selecting a voice action based at least on a transcription of the audio data;
selecting a third-party service provider from among a plurality of different third-party service providers, wherein the third-party service provider is selected by obtaining a mapping of voice actions to the plurality of third-party service providers, the mapping indicating that the selected third-party service provider can perform the selected voice action, wherein the selected third-party service provider is configured to perform multiple voice actions, and wherein the selected third-party service provider requires different combinations of input data to perform authentication for at least some of the multiple voice actions;
identifying one or more input authentication data types that the selected third-party service provider uses to perform authentication for the selected voice action, wherein the identified one or more input authentication data types for the selected action are different from one or more input authentication data types that the selected third-party service provider uses to perform authentication for at least one other voice action;
obtaining, without requiring the speaker to provide explicit authentication information, one or more data values from the received contextual data that correspond to the identified one or more input authentication data types; and
providing, to the third-party service provider over a network, (i) a request to perform the selected voice action, (ii) a speaker identification result determined based on the audio data representing the voice command, and (iii) the obtained one or more authentication data values from the received contextual data, wherein the speaker identification result and the one or more obtained authentication data values enable the selected third-party service provider to authenticate the speaker and perform the selected voice action.
Dependent claims: 23.
Specification