SPEAKER IDENTIFICATION AND UNSUPERVISED SPEAKER ADAPTATION TECHNIQUES

US 20160093304A1
Filed: 08/25/2015
Published: 03/31/2016
Est. Priority Date: 09/30/2014
Status: Active Grant

Abstract

Systems and processes for generating a speaker profile for use in performing speaker identification for a virtual assistant are provided. One example process can include receiving an audio input including user speech and determining whether a speaker of the user speech is a predetermined user based on a speaker profile for the predetermined user. In response to determining that the speaker of the user speech is the predetermined user, the user speech can be added to the speaker profile and operation of the virtual assistant can be triggered. In response to determining that the speaker of the user speech is not the predetermined user, the user speech can be added to an alternate speaker profile and operation of the virtual assistant may not be triggered. In some examples, contextual information can be used to verify results produced by the speaker identification process.
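
The flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SpeakerProfile` class, the `handle_audio_input` function, the use of cosine similarity, and the 0.8 similarity cutoff are all assumptions introduced here, and a voice print is assumed to be a plain feature vector.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

VoicePrint = Sequence[float]  # assumption: a voice print is a feature vector


def cosine_similarity(a: VoicePrint, b: VoicePrint) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SpeakerProfile:
    """Hypothetical profile: a growing collection of voice prints."""
    voice_prints: List[VoicePrint] = field(default_factory=list)

    def matches(self, voice_print: VoicePrint, min_similarity: float = 0.8) -> bool:
        # Assumption: one sufficiently similar stored print counts as a match.
        return any(
            cosine_similarity(voice_print, stored) >= min_similarity
            for stored in self.voice_prints
        )


def handle_audio_input(
    voice_print: VoicePrint,
    user_profile: SpeakerProfile,
    alternate_profile: SpeakerProfile,
) -> bool:
    """Return True when the virtual assistant should be triggered."""
    if user_profile.matches(voice_print):
        # Recognized speech is added to the predetermined user's profile.
        user_profile.voice_prints.append(voice_print)
        return True
    # Unrecognized speech is added to the alternate profile instead.
    alternate_profile.voice_prints.append(voice_print)
    return False
```

Because each accepted input is appended to the profile, the profile adapts without an explicit enrollment step, which is the "unsupervised" aspect named in the title.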

431 Citations

27 Claims

1. A method for operating a virtual assistant, the method comprising:
- at an electronic device:
  
  receiving, at the electronic device, an audio input comprising user speech;
  
  determining whether a speaker of the user speech is a predetermined user based at least in part on a speaker profile for the predetermined user; and
  
  in accordance with a determination that the speaker of the user speech is the predetermined user, adding the audio input comprising user speech to the speaker profile for the predetermined user.
  - 2. The method of claim 1, wherein the speaker profile for the predetermined user comprises a plurality of voice prints.
  - 3. The method of claim 2, wherein each of the plurality of voice prints of the speaker profile for the predetermined user was generated from previously received audio inputs comprising user speech.
  - 4. The method of claim 2, wherein determining whether the speaker of the user speech is the predetermined user based at least in part on the speaker profile for the predetermined user comprises:
    - determining whether the audio input comprising user speech matches at least a threshold number of the plurality of voice prints;
      
      in accordance with a determination that the audio input comprising user speech matches at least the threshold number of the plurality of voice prints, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match at least the threshold number of the plurality of voice prints, determining that the speaker of the user speech is not the predetermined user.
  - 5. The method of claim 2, wherein determining whether the speaker of the user speech is the predetermined user based at least in part on the speaker profile for the predetermined user comprises:
    - determining whether the audio input comprising user speech matches at least a threshold number of the plurality of voice prints;
      
      in accordance with a determination that the audio input comprising user speech matches at least the threshold number of the plurality of voice prints:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match at least the threshold number of the plurality of voice prints:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is the predetermined user.
  - 6. The method of claim 1, wherein adding the audio input comprising user speech to the speaker profile for the predetermined user comprises:
    - generating a voice print from the audio input comprising user speech; and
      
      storing the voice print in association with the speaker profile for the predetermined user.
  - 7. The method of claim 1, wherein the method further comprises:
    - in accordance with a determination that the speaker of the user speech is not the predetermined user, adding the audio input comprising user speech to a speaker profile for an alternate user.
  - 8. The method of claim 7, wherein the speaker profile for the alternate user comprises a plurality of voice prints.
  - 9. The method of claim 8, wherein each of the plurality of voice prints of the speaker profile for the alternate user was generated from previously received audio inputs comprising user speech.
  - 10. The method of claim 7, wherein determining whether the speaker of the user speech is the predetermined user is further based at least in part on the speaker profile for the alternate user.
  - 11. The method of claim 7, wherein determining whether the speaker of the user speech is the predetermined user comprises:
    - determining whether the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user;
      
      in accordance with a determination that the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user, determining that the speaker of the user speech is not the predetermined user.
  - 12. The method of claim 7, wherein determining whether the speaker of the user speech is the predetermined user comprises:
    - determining whether the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user;
      
      in accordance with a determination that the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is the predetermined user.
  - 13. The method of claim 1, wherein the method further comprises:
    - in accordance with a determination that the speaker of the user speech is the predetermined user:
      
      performing speech-to-text conversion on a second audio input comprising a second user speech, wherein the second audio input is received after receiving the audio input comprising user speech;
      
      determining a user intent based on the second user speech;
      
      determining a task to be performed based on the second user speech;
      
      determining a parameter for the task to be performed based on the second user speech; and
      
      performing the task to be performed in accordance with the determined parameter.
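
The decision rule of claims 4 and 5 can be sketched as below. This is an illustration under stated assumptions rather than the claimed implementation: the names `count_matches`, `is_predetermined_user`, `is_match`, and `was_erroneous` are hypothetical, the pairwise matcher is supplied by the caller, and the contextual-data check is modeled as a callback that receives the initial determination and reports whether it looks erroneous.

```python
from typing import Callable, Optional, Sequence, TypeVar

V = TypeVar("V")  # hypothetical voice print type


def count_matches(
    sample: V, voice_prints: Sequence[V], is_match: Callable[[V, V], bool]
) -> int:
    """Count the stored voice prints that the incoming sample matches."""
    return sum(1 for stored in voice_prints if is_match(sample, stored))


def is_predetermined_user(
    sample: V,
    voice_prints: Sequence[V],
    is_match: Callable[[V, V], bool],
    threshold_count: int,
    was_erroneous: Optional[Callable[[bool], bool]] = None,
) -> bool:
    # Claim 4: accept the speaker when at least `threshold_count` prints match.
    decision = count_matches(sample, voice_prints, is_match) >= threshold_count
    # Claim 5: if contextual data indicates the determination was erroneous,
    # the opposite determination is made.
    if was_erroneous is not None and was_erroneous(decision):
        return not decision
    return decision


# Toy usage: scalar "voice prints" compared by absolute difference.
close = lambda a, b: abs(a - b) < 0.1
stored_prints = [1.0, 1.05, 2.0]
accepted = is_predetermined_user(1.02, stored_prints, close, threshold_count=2)
```

Passing the initial determination into the callback lets one contextual check cover both branches of claim 5, the false accept and the false reject.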

14. A non-transitory computer-readable storage medium comprising instructions for:
- receiving an audio input comprising user speech;
  
  determining whether a speaker of the user speech is a predetermined user based at least in part on a speaker profile for the predetermined user; and
  
  in accordance with a determination that the speaker of the user speech is the predetermined user, adding the audio input comprising user speech to the speaker profile for the predetermined user.

15. A system comprising:
- one or more processors;
  
  memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving an audio input comprising user speech;
  
  determining whether a speaker of the user speech is a predetermined user based at least in part on a speaker profile for the predetermined user; and
  
  in accordance with a determination that the speaker of the user speech is the predetermined user, adding the audio input comprising user speech to the speaker profile for the predetermined user.
  - 16. The system of claim 15, wherein the speaker profile for the predetermined user comprises a plurality of voice prints.
  - 17. The system of claim 16, wherein each of the plurality of voice prints of the speaker profile for the predetermined user was generated from previously received audio inputs comprising user speech.
  - 18. The system of claim 16, wherein determining whether the speaker of the user speech is the predetermined user based at least in part on the speaker profile for the predetermined user comprises:
    - determining whether the audio input comprising user speech matches at least a threshold number of the plurality of voice prints;
      
      in accordance with a determination that the audio input comprising user speech matches at least the threshold number of the plurality of voice prints, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match at least the threshold number of the plurality of voice prints, determining that the speaker of the user speech is not the predetermined user.
  - 19. The system of claim 16, wherein determining whether the speaker of the user speech is the predetermined user based at least in part on the speaker profile for the predetermined user comprises:
    - determining whether the audio input comprising user speech matches at least a threshold number of the plurality of voice prints;
      
      in accordance with a determination that the audio input comprising user speech matches at least the threshold number of the plurality of voice prints:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match at least the threshold number of the plurality of voice prints:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is the predetermined user.
  - 20. The system of claim 15, wherein adding the audio input comprising user speech to the speaker profile for the predetermined user comprises:
    - generating a voice print from the audio input comprising user speech; and
      
      storing the voice print in association with the speaker profile for the predetermined user.
  - 21. The system of claim 15, wherein the one or more programs further include instructions for:
    - in accordance with a determination that the speaker of the user speech is not the predetermined user, adding the audio input comprising user speech to a speaker profile for an alternate user.
  - 22. The system of claim 21, wherein the speaker profile for the alternate user comprises a plurality of voice prints.
  - 23. The system of claim 22, wherein each of the plurality of voice prints of the speaker profile for the alternate user was generated from previously received audio inputs comprising user speech.
  - 24. The system of claim 21, wherein determining whether the speaker of the user speech is the predetermined user is further based at least in part on the speaker profile for the alternate user.
  - 25. The system of claim 21, wherein determining whether the speaker of the user speech is the predetermined user comprises:
    - determining whether the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user;
      
      in accordance with a determination that the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user, determining that the speaker of the user speech is not the predetermined user.
  - 26. The system of claim 21, wherein determining whether the speaker of the user speech is the predetermined user comprises:
    - determining whether the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user;
      
      in accordance with a determination that the audio input comprising user speech matches a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that the audio input comprising user speech does not match a greater number of voice prints of the speaker profile for the predetermined user than a number of voice prints of the speaker profile for the alternate user:
      
      determining whether an erroneous speaker determination was made based on contextual data;
      
      in accordance with a determination that an erroneous speaker determination was not made based on contextual data, determining that the speaker of the user speech is not the predetermined user; and
      
      in accordance with a determination that an erroneous speaker determination was made based on contextual data, determining that the speaker of the user speech is the predetermined user.
  - 27. The system of claim 15, wherein the one or more programs further include instructions for:
    - in accordance with a determination that the speaker of the user speech is the predetermined user:
      
      performing speech-to-text conversion on a second audio input comprising a second user speech, wherein the second audio input is received after receiving the audio input comprising user speech;
      
      determining a user intent based on the second user speech;
      
      determining a task to be performed based on the second user speech;
      
      determining a parameter for the task to be performed based on the second user speech; and
      
      performing the task to be performed in accordance with the determined parameter.
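
The two-profile comparison of claims 25 and 26 (mirroring claims 11 and 12), combined with the profile updates of claims 15 and 21, can be sketched as follows. The function names, the caller-supplied `is_match` matcher, and the `was_erroneous` callback are assumptions made for illustration. The sample is also assumed to already be a voice print, so the generation step of claim 20 is not shown.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

V = TypeVar("V")  # hypothetical voice print type


def matches_more_user_prints(
    sample: V,
    user_prints: Sequence[V],
    alternate_prints: Sequence[V],
    is_match: Callable[[V, V], bool],
) -> bool:
    """True when the sample matches more user prints than alternate prints."""
    user_hits = sum(1 for stored in user_prints if is_match(sample, stored))
    alternate_hits = sum(1 for stored in alternate_prints if is_match(sample, stored))
    # A tie is not "a greater number", so it resolves to "not the user".
    return user_hits > alternate_hits


def classify_and_store(
    sample: V,
    user_prints: List[V],
    alternate_prints: List[V],
    is_match: Callable[[V, V], bool],
    was_erroneous: Optional[Callable[[bool], bool]] = None,
) -> bool:
    """Decide who spoke, then add the sample to the corresponding profile."""
    decision = matches_more_user_prints(sample, user_prints, alternate_prints, is_match)
    # Claim 26: contextual data may reverse the initial determination.
    if was_erroneous is not None and was_erroneous(decision):
        decision = not decision
    # Claims 15 and 21: store the sample in whichever profile was selected.
    (user_prints if decision else alternate_prints).append(sample)
    return decision
```

Keeping an alternate profile gives the comparison a relative baseline, so no fixed threshold count has to be chosen.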


Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: KIM, Yoon; KAJAREKAR, Sachin S.
Granted Patent: US 10,127,911 B2
US Class Current: 1/1

CPC Class Codes
- G10L 15/1822   Parsing for meaning underst...
- G10L 15/26   Speech to text systems G10L...
- G10L 17/04   Training, enrolment or mode...
- G10L 17/06   Decision making techniques;...
- G10L 17/26   Recognition of special voic...
